rl20:potential-open-problems

A weekly open problem discussion group meets every Wednesday, 11 am to 12 noon Pacific Time.

Etiquette: We are a collaborative group with a common goal of making progress on the theory of RL. Therefore, to avoid competition and redundancy, if you are interested in working on one of these problems, please coordinate with the problem proposer.

Prove or disprove the RH (no, not the Riemann Hypothesis!)

Proposer: Ambuj Tewari

In their book Reinforcement Learning: An Introduction (2nd ed.), Sutton and Barto state the "Reward Hypothesis" (RH) in Section 3.2 as follows: "That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." First, can we formalize what it means for an agent to have a goal/purpose? Second, for a given formalization, can we prove or disprove the RH?
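One possible formalization (my own reading, not the book's): model a goal as a preference relation $\succeq$ over policies (or over trajectory distributions), and read the RH as a representation claim, namely that there exists a scalar reward $r$ and discount $\gamma$ with

```latex
\pi \succeq \pi'
\iff
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(S_t, A_t)\right]
\;\ge\;
\mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(S_t, A_t)\right].
```

Proving or disproving the RH then becomes a question about which preference relations admit such a representation.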

Interestingly, they credit Michael Littman for suggesting the RH. So perhaps we can pester him about clarifications on the exact meaning of RH.

Known results:

  • Comment by Csaba: I did look into the validity of the RH in a specific case. The specific question I asked is whether multicriteria MDPs are a strictly richer class than single-criterion MDPs. The answer is yes: multicriteria MDPs cannot simply be reduced to single-criterion ones (even though people often have the naive idea that they can; the reason I wrote this up is that I have heard many times, from various people, that this reduction works).
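As a toy illustration of the flavor of such an irreducibility claim (this is my own example, not Csaba's actual construction): in a two-armed bandit with vector rewards, the multicriteria goal "maximize the minimum expected component" is not induced by any fixed linear scalarization of the rewards, since a linear scalarization always makes some deterministic arm optimal.

```python
import itertools

# Two-armed bandit with vector rewards: arm 0 gives (1, 0), arm 1 gives (0, 1).
# Multicriteria goal: maximize the minimum of the two expected reward components.
arms = [(1.0, 0.0), (0.0, 1.0)]

def value(p):
    """Expected reward vector when playing arm 0 with probability p."""
    return tuple(p * a + (1 - p) * b for a, b in zip(arms[0], arms[1]))

# Under the max-min goal, the best policy mixes 50/50: value (0.5, 0.5), min = 0.5.
best_p = max((p / 100 for p in range(101)), key=lambda p: min(value(p)))
assert abs(best_p - 0.5) < 1e-9

# Under any scalarization r(arm) = w . reward(arm), the expected scalar reward
# is linear in p, so some deterministic arm (p in {0, 1}) is always optimal,
# and the 50/50 mixture is never strictly better.
for w in itertools.product([0.1, 0.5, 1.0, 2.0], repeat=2):
    scores = {p: w[0] * value(p)[0] + w[1] * value(p)[1] for p in (0.0, 0.5, 1.0)}
    assert max(scores[0.0], scores[1.0]) >= scores[0.5] - 1e-12
```

Of course, this only shows that one particular multicriteria objective escapes one particular reduction; the write-up referenced above treats the MDP question properly.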

Model Selection for Contextual Bandits

Proposers: Dylan Foster and Akshay Krishnamurthy

With Haipeng Luo, we have been working on getting "structural risk minimization" type model-selection guarantees for contextual bandits, which is perhaps the simplest RL setting with many of the core ingredients (online learning, partial feedback). We had some positive results on this (https://arxiv.org/abs/1906.00531), as did Niladri, Vidya, and Peter (https://arxiv.org/abs/1905.10040) focusing on the linear setting with fairly strong assumptions. For the more general setting, we had an open problem at this year's COLT (https://arxiv.org/abs/2006.10940) which contains some precise problem formulations.

The more conceptual question here is: how do we do cross-validation in online settings that require exploration, such as active learning, contextual bandits, and reinforcement learning? Can we provably adapt to the complexity of these problems?

Questions around finite-time instance optimality of TD learning

Proposer: Csaba Szepesvari

There is this paper: https://arxiv.org/abs/2003.07337 that came out in March: Koulik Khamaru, Ashwin Pananjady, Feng Ruan, Martin J. Wainwright, Michael I. Jordan: "Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis"

The paper takes the tabular case, defines a notion of finite-time instance optimality, shows that standard TD is not finite-time instance optimal (though only in the iid data case), and then applies a fix (standard variance reduction) to address this. One interesting bit is that this is done for the max-norm error.
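For concreteness, here is a minimal sketch of the setting being analyzed: tabular TD(0) for policy evaluation with iid state samples, measured in max-norm error. The 3-state instance, step-size schedule, and horizon are all invented for illustration; this is plain TD(0), not the paper's variance-reduced algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy Markov reward process with n states and discount gamma.
n, gamma = 3, 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, -1.0])
V_star = np.linalg.solve(np.eye(n) - gamma * P, r)  # true value function

# TD(0) with iid state samples (generative-model data regime):
# draw s uniformly, then s' ~ P(. | s), and do a stochastic fixed-point update.
V = np.zeros(n)
for t in range(1, 200_000):
    s = rng.integers(n)
    s_next = rng.choice(n, p=P[s])
    alpha = 1.0 / (1.0 + (1 - gamma) * t)  # one common polynomial step size
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])

print(np.max(np.abs(V - V_star)))  # max-norm error, the metric studied in the paper
```

The instance-optimality question is, roughly, whether the error printed above shrinks at the best rate achievable on this particular (P, r) instance, rather than only at the worst-case rate over all instances.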

This is nice, but I have some complaints, and there are interesting questions about extending these arguments to other settings.

Complaints and the resulting problems: The notion of local optimality borrowed from statisticians seems completely ad hoc. In the bandit literature, we have the Lai-Robbins style instance optimality, which one can also modify to handle the finite case. I have a strong preference for this type of instance optimality (which starts from just the problem class and the algorithms, with no arbitrary neighborhoods brought in). So problem #1 is to redo the analysis with this notion. Problem #2 asks whether the two notions agree or disagree, and when (this is somewhat independent of RL).
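For reference, the Lai-Robbins bound being alluded to (stated here as I recall it, for a one-parameter family $\{p_\theta\}$): any uniformly consistent algorithm must pull every suboptimal arm $a$ at least logarithmically often, with an instance-dependent constant,

```latex
\liminf_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T}
\;\ge\;
\frac{1}{D_{\mathrm{KL}}\!\left(p_{\theta_a} \,\middle\|\, p_{\theta^*}\right)},
```

where $N_a(T)$ is the number of pulls of arm $a$ up to time $T$ and $\theta^*$ parametrizes the optimal arm. Note that the bound involves only the problem class and the class of (consistent) algorithms, with no neighborhoods chosen by hand.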

Extensions/complaints: Problem #3 asks what happens when we don't have iid data, but only a single trajectory from a nice Markov chain. Problem #4 asks for lower and upper bounds for the linear function approximation setting, and then for efficient algorithms that match them (very open-ended).

Questions around Eluder dimension

Proposer: Alekh Agarwal

Eluder dimension is an interesting concept that has been around for almost a decade and seems compelling in that it enables computationally and statistically efficient learning, yet we know of no interesting "non-linear" examples beyond generalized linear models. I think it would be great to understand whether the quantity captures interesting structures beyond GLMs or not, given the various results it can enable.
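To make the quantity concrete, here is a brute-force computation of a fixed-ε variant of the eluder dimension for a tiny finite function class (the standard definition takes a supremum over ε' ≥ ε; the exhaustive search below is only feasible for toy classes). The threshold-function example is the classic one showing eluder dimension can be large even when VC dimension is 1.

```python
import itertools, math

def eps_independent(F, seq, x, eps):
    """x is eps-independent of seq if some pair f, g in F agrees on seq
    (within eps in ell_2) yet differs by more than eps at x."""
    for f, g in itertools.combinations(F, 2):
        gap_seq = math.sqrt(sum((f[i] - g[i]) ** 2 for i in seq))
        if gap_seq <= eps and abs(f[x] - g[x]) > eps:
            return True
    return False

def eluder_dim(F, X, eps):
    """Longest sequence in which every point is eps-independent of its
    predecessors, via exhaustive depth-first search (tiny classes only)."""
    best = 0
    stack = [()]
    while stack:
        seq = stack.pop()
        best = max(best, len(seq))
        for x in X:
            if eps_independent(F, seq, x, eps):
                stack.append(seq + (x,))
    return best

# Thresholds f_k(x) = 1[x >= k] on X = {0, ..., 4}: VC dimension 1,
# but every point can be made eps-independent of the ones before it.
X = range(5)
F = [tuple(1.0 if x >= k else 0.0 for x in X) for k in range(6)]
print(eluder_dim(F, X, eps=0.5))
```

The open question above is precisely whether any structure beyond GLMs keeps this quantity small in interesting ways; the code is just a way to poke at small candidate classes.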

Assumption-free offline RL

Proposer: Zhaoran Wang

Suppose we only have an offline dataset and cannot interact with the environment. Without any assumptions (e.g., a lower-bounded density, an upper-bounded concentrability coefficient…), what is the best optimality gap we can achieve information-theoretically? What is the "best-effort" algorithm that achieves such an optimality gap?

From Zhuoran: Perhaps it might also be interesting to ask what the most sensible notion of optimality is in the "assumption-free" offline setting.

Analysis of competitive multi-agent RL

Proposer: Vidya Muthukumar

On the competitive side, there is a rich literature on the convergence of online learning algorithms to equilibrium concepts when the equilibrium concept is static (e.g., one-shot Nash or correlated equilibrium). There is also a rich literature on dynamic non-cooperative game theory in control. This is perhaps more on the formulation side, but one could explore whether there are analogs of the classic theory of learning in games in RL.

Survey article: https://arxiv.org/pdf/1911.10635.pdf

From Akshay: Maybe we could have Chi Jin give us an overview of his recent results?

Open problems in multi-player multi-armed bandits

From Vidya: On the cooperative side, Sebastien Bubeck mentioned interesting open problems in the multi-player multi-armed bandit setting, in both stochastic and adversarial environments.

Akshay: I will see if Seb wants to present at some point.

RL in misspecified linear models

Proposer: Gergely Neu

What happens if your features do not exactly match the true features of the MDP? What sorts of mismatch models are reasonable? (Modeling mismatch as missing features, or in terms of norms like $\mathbb{E}_x\left[\left\|\varphi(x) - \varphi^*(x)\right\|\right]$ under some distribution…?) Do the existing algorithms work, or do we need something fancier?

Primal-dual analysis for dynamic programming algorithms

Proposer: Gergely Neu

We know that some DP algorithms (PI, VI, regularized PI) can be formulated in a primal-dual framework, but is there a way to analyze them in a reasonably unified way? There is a bunch of newer regularized DP methods (e.g., by Matthieu Geist) that are very cool, but analyzing them is a bit of a pain. A primal-dual analysis should be more accessible to people with an optimization background, so it should enable some interesting progress.
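The primal-dual view starts from the classical LP formulation of dynamic programming, sketched below on a tiny invented 2-state, 2-action MDP: the primal minimizes $\sum_s V(s)$ subject to $V \ge TV$, and its dual is the LP over discounted occupancy measures. This is the textbook formulation, not any of the newer regularized methods.

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP of DP:  min sum_s V(s)  s.t.  V(s) >= r(s,a) + gamma * P(.|s,a) . V
gamma = 0.9
P = {  # P[a][s] = next-state distribution under action a from state s
    0: np.array([[0.8, 0.2], [0.3, 0.7]]),
    1: np.array([[0.1, 0.9], [0.6, 0.4]]),
}
R = {0: np.array([1.0, 0.0]), 1: np.array([0.5, 0.8])}

A_ub, b_ub = [], []
for a in (0, 1):
    for s in (0, 1):
        row = gamma * P[a][s].copy()
        row[s] -= 1.0                    # (gamma * P_sa - e_s) . V <= -r(s,a)
        A_ub.append(row)
        b_ub.append(-R[a][s])

res = linprog(c=np.ones(2), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * 2, method="highs")
V_lp = res.x

# Sanity check against value iteration: both should recover V*.
V = np.zeros(2)
for _ in range(2000):
    V = np.max([R[a] + gamma * P[a] @ V for a in (0, 1)], axis=0)
print(V_lp, V)
```

The unified-analysis question is whether PI, VI, and their regularized cousins can all be read as primal-dual schemes for (regularized versions of) this LP pair, with convergence following from a single optimization argument.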

Can we learn to play Zork (a text adventure game)?

Proposer: Wouter Koolen

An interesting sub-challenge is to figure out the state transition graph automatically, without interacting with objects/characters. The agent starts in the start state, can sequentially go north/south/east/west, and the environment provides fixed feedback (some text description) in each state. One interesting aspect here is that Zork is partially observable by design: there is a maze where movement is deterministic, but the feedback at different states may be identical.

What the agent can do, basically, is approximate the state transition graph from traces, for example by identifying bisimilar states. But there is always the possibility that we have not explored everything yet. How would one explore as fast as possible? How do we even formulate that objective precisely? For example, how would one grow the set of newly discriminated states as fast as possible? How do we do the computations efficiently? Going beyond that, what if there is noise? And how do we connect exploration with some notion of reward?
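The "identify bisimilar states" step, given a fully known deterministic transition map, is classical partition refinement: start from the partition induced by observations and split blocks until the partition is stable. The world below (state names, layout, feedback strings) is invented for illustration; the open problem is doing this from traces, under exploration, not from a known map.

```python
# trans[state][action] -> next state; obs[state] -> feedback text.
# Two maze rooms give identical feedback but behave differently, so
# refinement to the coarsest bisimulation separates them.
trans = {
    "foyer": {"n": "hall",  "s": "foyer"},
    "hall":  {"n": "maze1", "s": "foyer"},
    "maze1": {"n": "maze2", "s": "hall"},
    "maze2": {"n": "maze2", "s": "maze1"},
}
obs = {"foyer": "A foyer.", "hall": "A hall.",
       "maze1": "Twisty passages, all alike.",
       "maze2": "Twisty passages, all alike."}

def refine(trans, obs):
    """Coarsest partition consistent with observations and dynamics."""
    block = {s: obs[s] for s in trans}  # start from the observation partition
    while True:
        # A state's signature: its block plus the blocks its actions lead to.
        signature = {s: (block[s], tuple(block[trans[s][a]] for a in sorted(trans[s])))
                     for s in trans}
        if len(set(signature.values())) == len(set(block.values())):
            return block  # no block was split: stable
        block = signature

b = refine(trans, obs)
print(b["maze1"] == b["maze2"])  # the two "alike" rooms end up distinguished
```

Since signatures only ever split blocks, the loop terminates in at most |states| rounds; the questions above ask how to drive this process with as few environment steps as possible, and what happens once feedback is noisy.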

Akshay: One potentially useful observation is that planning in deterministic (small-state) POMDPs is computationally tractable, since you can plan open-loop.

Fully resolve linear $Q^\star$

Proposer: Lin Yang

Akshay (inferring what Lin wants here): We have seen very nice progress on RL with linear function approximation in the last couple of years. But all of the results require stronger function-approximation assumptions than perhaps the bare minimum: that the $Q^\star$ function is linear in the features. What can we say under this assumption alone?
