Summary: Batch Reinforcement Learning - Theoretical Comparison of Q Approximation Schemes (arxiv.org)
13,510 words - PDF document
One Line
The study examines two algorithms used in batch reinforcement learning and analyzes their guarantees and error propagation when approximating Q*.
Key Points
- Two algorithms are compared for approximating Q* in batch reinforcement learning.
- The algorithms aim to overcome the limitations of classical iterative methods.
- The analyzed algorithms have linear-in-horizon error propagation and estimate the Bellman error.
- One of the algorithms uses explicit importance-weighting correction and plain average objectives.
- The study provides improved characterizations of distribution shift effects and has provable guarantees under weaker assumptions.
Summaries
18 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning, investigating their guarantees and error propagation.
80 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning. One algorithm uses explicit importance-weighting correction and does not use squared losses. The study seeks algorithms with provable guarantees under weaker conditions and investigates batch algorithms whose loss functions are more directly connected to the expected return. The study presents novel analyses of two algorithms, MSBO and MABO, which have linear-in-horizon error propagation. MABO has a similar sample complexity to MSBO when all expressivity assumptions are met exactly.
139 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning. One algorithm uses explicit importance-weighting correction and does not use squared losses, offering distinct characteristics and potential advantages. Linear-in-horizon algorithms exist but often require interactive access to the environment or knowledge of transition probabilities, limiting their applicability to the batch learning setting. The study seeks algorithms with provable guarantees under weaker conditions and investigates batch algorithms whose loss functions are more directly connected to the expected return. The study presents novel analyses of two algorithms, MSBO and MABO, which have linear-in-horizon error propagation. MABO has an error bound that depends on error terms related to approximation, expressivity, and statistics. The study compares the sample complexity of MSBO and MABO and shows that MABO has a similar sample complexity to MSBO when all expressivity assumptions are met exactly.
415 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning. The algorithms aim to overcome the limitations of classical iterative methods, such as Fitted Q-Iteration, which incur performance loss with a quadratic dependence on the effective horizon. One of the algorithms uses explicit importance-weighting correction and does not use squared losses, offering distinct characteristics and potential advantages.
The study focuses on value-function approximation for batch-mode reinforcement learning. Linear-in-horizon algorithms exist but often require interactive access to the environment or knowledge of transition probabilities, limiting their applicability to the batch learning setting.
The study seeks algorithms with provable guarantees under somewhat weaker conditions and investigates batch algorithms whose loss functions are more directly connected to the expected return.
The study presents novel analyses of two algorithms, MSBO and MABO. Both algorithms enjoy linear-in-horizon error propagation, and their distribution shift effects can be characterized by simple notions of concentrability coefficients. MABO, which uses explicit importance-weighting correction and plain average objectives, offers augmented expressivity for its importance-weight class.
The study provides preliminary information about Markov Decision Processes (MDPs) and the batch value-function approximation setup. It introduces the data generation protocol for generating the dataset used in batch RL. The study also explains the concept of importance weights and their role in RL algorithms.
The study presents the performance difference bounds and telescoping performance difference lemma, which are crucial for analyzing the algorithms. The telescoping lemma shows that the difference between the performance of an algorithm and the optimal policy is controlled by the average Bellman errors under certain distributions.
The study analyzes MSBO and provides an improved error bound compared to previous analyses. The study also introduces MABO, which directly estimates average Bellman errors using explicit importance-weighting correction. MABO has an error bound that depends on error terms related to approximation, expressivity, and statistics.
The study discusses the horizon dependence of both algorithms and provides sample complexity bounds. It compares the sample complexity of MSBO and MABO and shows that MABO has a similar sample complexity to MSBO when all expressivity assumptions are met exactly.
In summary, the study presents two algorithms for approximating Q* in batch reinforcement learning and analyzes their performance guarantees. Both algorithms overcome limitations of classical iterative methods and have linear-in-horizon error propagation. The study compares the algorithms and highlights the advantages of MABO, which uses explicit importance-weighting correction and plain average objectives. The study provides insights into the theoretical aspects of batch reinforcement learning and offers potential directions for future research.
645 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning. The algorithms aim to overcome the limitations of classical iterative methods, such as Fitted Q-Iteration, which incur performance loss with a quadratic dependence on the effective horizon. The analyzed algorithms estimate the Bellman error and have linear-in-horizon error propagation. One of the algorithms uses explicit importance-weighting correction and does not use squared losses, offering distinct characteristics and potential advantages.
The study focuses on value-function approximation for batch-mode reinforcement learning. Most iterative methods have a quadratic dependence on the effective horizon, which is significantly worse than the ideal linear dependence. Linear-in-horizon algorithms exist but often require interactive access to the environment or knowledge of transition probabilities, limiting their applicability to the batch learning setting.
One of the challenges in RL is distribution shift. The study aims for algorithms whose distribution-shift effects can be characterized elegantly and tightly. Existing analyses also require strong expressivity assumptions on the function classes; the study seeks algorithms with provable guarantees under somewhat weaker conditions.
Most batch RL algorithms heavily rely on squared loss, but bounding the performance loss with squared-loss objectives often requires multiple relaxations and leads to a significant gap between the objective and the surrogate squared loss. The study investigates batch algorithms whose loss functions are more directly connected to the expected return.
The study presents novel analyses of two algorithms, MSBO and MABO. Both algorithms enjoy linear-in-horizon error propagation, and their distribution shift effects can be characterized by simple notions of concentrability coefficients. MABO, which uses explicit importance-weighting correction and plain average objectives, does not suffer from the looseness of squared-to-average conversion and offers augmented expressivity for its importance-weight class.
The study provides preliminary information about Markov Decision Processes (MDPs) and the batch value-function approximation setup. It introduces the data generation protocol for generating the dataset used in batch RL. The study also explains the concept of importance weights and their role in RL algorithms.
The study presents the performance difference bounds and telescoping performance difference lemma, which are crucial for analyzing the algorithms. The telescoping lemma shows that the difference between the performance of an algorithm and the optimal policy is controlled by the average Bellman errors under certain distributions.
The study analyzes MSBO and provides an improved error bound compared to previous analyses. The bound depends on error terms related to approximation and optimality. The study also introduces MABO, which directly estimates average Bellman errors using explicit importance-weighting correction. MABO has an error bound that depends on error terms related to approximation, expressivity, and statistics.
The study discusses the horizon dependence of both algorithms and provides sample complexity bounds. It compares the sample complexity of MSBO and MABO and shows that MABO has a similar sample complexity to MSBO when all expressivity assumptions are met exactly.
In summary, the study presents two algorithms for approximating Q* in batch reinforcement learning and analyzes their performance guarantees. Both algorithms overcome limitations of classical iterative methods and have linear-in-horizon error propagation. They also provide improved characterizations of distribution shift effects and have provable guarantees under weaker assumptions. The study compares the algorithms and highlights the advantages of MABO, which uses explicit importance-weighting correction and plain average objectives. The study provides insights into the theoretical aspects of batch reinforcement learning and offers potential directions for future research.
1388 word summary
This study compares two algorithms for approximating Q* in batch reinforcement learning. The algorithms aim to overcome the limitations of classical iterative methods, such as Fitted Q-Iteration, which incur performance loss with a quadratic dependence on the effective horizon. The analyzed algorithms estimate the Bellman error and have linear-in-horizon error propagation, which is a desirable property for algorithms that rely solely on batch data and output stationary policies. One of the algorithms uses explicit importance-weighting correction and does not use squared losses, offering distinct characteristics and potential advantages compared to classical algorithms.
The study focuses on value-function approximation for batch-mode reinforcement learning, which is crucial for the success of modern RL algorithms. The iterative nature of these algorithms can lead to instability and theoretical issues. Most iterative methods have a quadratic dependence on the effective horizon, which is significantly worse than the ideal linear dependence. Linear-in-horizon algorithms exist but often require interactive access to the environment or knowledge of transition probabilities, limiting their applicability to the batch learning setting.
One of the challenges in RL is distribution shift, where the computed policy may induce a state distribution different from the one it was trained on. Existing analyses characterize this effect using concentrability coefficients, which can be loose and complicated. The study aims for algorithms whose distribution-shift effects can be characterized by simple, tightly defined quantities.
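As a reference point, an occupancy-based concentrability coefficient of the following form bounds how far any relevant occupancy measure can drift from the data distribution (a standard definition, stated here for illustration; the paper's exact coefficients may differ):

```latex
% Occupancy-based concentrability coefficient (standard form, assumed for illustration):
% \mu is the data distribution over (s,a); d^{\pi} is the discounted occupancy of policy \pi.
\[
C \;=\; \sup_{\pi}\, \Big\| \tfrac{d^{\pi}}{\mu} \Big\|_{\infty}
\;=\; \sup_{\pi}\, \max_{s,a} \frac{d^{\pi}(s,a)}{\mu(s,a)},
\qquad\text{so that}\qquad
\mathbb{E}_{d^{\pi}}\big[\,|g|\,\big] \;\le\; C\,\mathbb{E}_{\mu}\big[\,|g|\,\big]
\ \text{ for any function } g.
\]
```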
Existing analyses also require strong expressivity assumptions on the function classes, such as approximate closedness under Bellman update. The study seeks algorithms with provable guarantees under somewhat weaker conditions.
Most batch RL algorithms heavily rely on squared loss, but bounding the performance loss with squared-loss objectives often requires multiple relaxations and leads to a significant gap between the objective and the surrogate squared loss. The study investigates batch algorithms whose loss functions are more directly connected to the expected return.
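One way to see this gap, via a standard inequality rather than anything specific to this paper: a squared Bellman loss of size ε only certifies an average Bellman error of size √ε.

```latex
% Jensen / Cauchy-Schwarz: the average Bellman error is only controlled by the
% square root of the squared Bellman loss under the same data distribution \mu.
\[
\big|\mathbb{E}_{\mu}\big[\,Q - \mathcal{T}Q\,\big]\big|
\;\le\; \mathbb{E}_{\mu}\big[\,|Q - \mathcal{T}Q|\,\big]
\;\le\; \sqrt{\mathbb{E}_{\mu}\big[\,(Q - \mathcal{T}Q)^{2}\,\big]}\,.
\]
```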
The study presents novel analyses of two algorithms, MSBO and MABO, and provides positive answers to the challenges mentioned above. Both algorithms enjoy linear-in-horizon error propagation, and their distribution shift effects can be characterized by simple notions of concentrability coefficients. The study compares the two algorithms and shows that MABO, which uses explicit importance-weighting correction and plain average objectives, does not suffer from the looseness of squared-to-average conversion and offers augmented expressivity for its importance-weight class.
The study provides preliminary information about Markov Decision Processes (MDPs) and the batch value-function approximation setup. It introduces the data generation protocol for generating the dataset used in batch RL. The study also explains the concept of importance weights and their role in RL algorithms.
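For concreteness, the importance weights in question are density ratios between an occupancy measure of interest and the data distribution μ; reweighting batch data by such a ratio turns expectations under μ into expectations under the target occupancy (a standard change-of-measure identity):

```latex
% Importance weight as a density ratio (standard definition, stated for reference):
\[
w_{\pi}(s,a) \;=\; \frac{d^{\pi}(s,a)}{\mu(s,a)},
\qquad
\mathbb{E}_{(s,a)\sim\mu}\big[\, w_{\pi}(s,a)\, g(s,a) \,\big]
\;=\; \mathbb{E}_{(s,a)\sim d^{\pi}}\big[\, g(s,a) \,\big]
\quad\text{for any bounded } g.
\]
```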
The study presents the performance difference bounds and telescoping performance difference lemma, which are crucial for analyzing the algorithms. The telescoping lemma shows that the difference between the performance of an algorithm and the optimal policy is controlled by the average Bellman errors under certain distributions.
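A hedged sketch of the kind of statement such a telescoping lemma yields (the exact constants and conditions are in the paper): for any candidate Q-function f with greedy policy π_f,

```latex
% Telescoping performance difference (sketch): the gap to the optimal policy is
% controlled by average Bellman errors under d^{\pi^*} and d^{\pi_f}, with a single
% 1/(1-\gamma) factor, i.e., linear-in-horizon error propagation.
\[
J(\pi^{*}) - J(\pi_{f})
\;\le\; \frac{1}{1-\gamma}
\Big(
\mathbb{E}_{(s,a)\sim d^{\pi^{*}}}\big[\,(\mathcal{T}f)(s,a) - f(s,a)\,\big]
\;+\;
\mathbb{E}_{(s,a)\sim d^{\pi_{f}}}\big[\, f(s,a) - (\mathcal{T}f)(s,a)\,\big]
\Big).
\]
```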
The study analyzes MSBO and provides an improved error bound compared to previous analyses. The bound depends on error terms related to approximation and optimality. The study also introduces MABO, which directly estimates average Bellman errors using explicit importance-weighting correction. MABO has an error bound that depends on error terms related to approximation, expressivity, and statistics.
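To make the contrast concrete, the two estimators plausibly take the following minimax forms (a sketch inferred from the description above, not a verbatim transcription of the paper; 𝒬 is the Q-class, ℱ a helper class for the squared loss, 𝒲 the importance-weight class):

```latex
% MSBO-style estimator (sketch): minimax squared Bellman loss, with a helper class
% \mathcal{F} subtracted off to correct for the variance of the regression target.
\[
\hat{Q}_{\mathrm{MSBO}} \in \arg\min_{Q\in\mathcal{Q}} \max_{f\in\mathcal{F}}
\Big\{
\mathbb{E}_{\mu}\big[\big(Q(s,a) - r - \gamma \max_{a'} Q(s',a')\big)^{2}\big]
-
\mathbb{E}_{\mu}\big[\big(f(s,a) - r - \gamma \max_{a'} Q(s',a')\big)^{2}\big]
\Big\}.
\]
% MABO-style estimator (sketch): minimax importance-weighted *average* Bellman error,
% with weights w \in \mathcal{W} standing in for density ratios of the form d^{\pi}/\mu.
\[
\hat{Q}_{\mathrm{MABO}} \in \arg\min_{Q\in\mathcal{Q}} \max_{w\in\mathcal{W}}
\Big|
\mathbb{E}_{\mu}\big[\, w(s,a)\,\big(Q(s,a) - r - \gamma \max_{a'} Q(s',a')\big) \big]
\Big|.
\]
```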
The study discusses the horizon dependence of both algorithms and provides sample complexity bounds. It compares the sample complexity of MSBO and MABO and shows that MABO has a similar sample complexity to MSBO when all expressivity assumptions are met exactly.
In summary, the study presents two algorithms for approximating Q* in batch reinforcement learning and analyzes their performance guarantees. Both algorithms overcome limitations of classical iterative methods and have linear-in-horizon error propagation. They also provide improved characterizations of distribution shift effects and have provable guarantees under weaker assumptions. The study compares the algorithms and highlights the advantages of MABO, which uses explicit importance-weighting correction and plain average objectives. The study provides insights into the theoretical aspects of batch reinforcement learning and offers potential directions for future research.
The paper discusses the theoretical comparison of Q approximation schemes in batch reinforcement learning. The main focus is on the two algorithms, MSBO and MABO, which the authors analyze in terms of error propagation, concentrability coefficients, robustness against a misspecified Q-class, and statistical rates.
In the analysis of error propagation, the authors show that both MSBO and MABO enjoy linear-in-horizon error propagation, which is a desirable property for batch algorithms. They also demonstrate that MSBO bears similarities to classical AVI/API algorithms, while MABO uses a novel importance-weight correction to handle the difficulty of Bellman error estimation.
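A minimal sketch of how the two inner losses could be evaluated on a batch of transitions, assuming a tabular representation (the function names and the tabular setting are ours, for illustration only):

```python
import numpy as np

def squared_bellman_loss(Q, f_helper, batch, gamma):
    """MSBO-style inner loss for a fixed Q and a fixed helper function f.

    Q, f_helper: arrays of shape [num_states, num_actions] (tabular sketch).
    batch: iterable of (s, a, r, s_next) transitions from the data distribution.
    """
    q_res, f_res = [], []
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(Q[s_next])        # r + gamma * max_a' Q(s', a')
        q_res.append((Q[s, a] - target) ** 2)         # squared residual of Q
        f_res.append((f_helper[s, a] - target) ** 2)  # squared residual of the helper f
    # Subtracting the helper's loss corrects for the variance of the random target.
    return float(np.mean(q_res) - np.mean(f_res))

def avg_bellman_loss(Q, w, batch, gamma):
    """MABO-style inner loss: importance-weighted average Bellman error.

    w: array of shape [num_states, num_actions], a candidate importance weight.
    """
    terms = [w[s, a] * (Q[s, a] - (r + gamma * np.max(Q[s_next])))
             for s, a, r, s_next in batch]
    return abs(float(np.mean(terms)))

# Usage sketch: MSBO would minimize over Q the maximum of squared_bellman_loss over
# the helper class F; MABO would minimize over Q the maximum of avg_bellman_loss
# over the weight class W.
```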
The authors compare the robustness of MSBO and MABO against a misspecified Q-class. They show that when the weight class is specified correctly, MABO's guarantee under a misspecified Q never suffers more than that of MSBO. However, they also note that this advantage may be weakened if W contains additional functions that do not correspond to real importance weights.
The statistical rates of the algorithms are also compared. The authors find that the terms in the theorems for MSBO and MABO match each other when considering certain conditions. However, MABO suffers from an additional term due to explicit importance weighting and concentration inequalities. This term fades away quickly with sufficient data.
The authors discuss the assumptions on the helper classes used in MSBO and MABO. While both helper classes are required to capture certain aspects of the problem, W enjoys a superior property compared to F: the linearity of the average Bellman error loss allows even a simple W to be highly expressive.
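One way to read this property (our reading, stated as an assumption about what the paper means): because the weighted average Bellman error is linear in the weight, maximizing it over W is the same as maximizing it over the convex hull of W, so a small weight class effectively behaves like a much richer one.

```latex
% For fixed Q, the map w \mapsto \mathbb{E}_{\mu}[\, w\,(Q - \mathcal{T}Q)\,] is linear, so
\[
\max_{w \in \mathrm{conv}(\mathcal{W})}
\big|\,\mathbb{E}_{\mu}\big[\, w\,(Q - \mathcal{T}Q)\,\big]\big|
\;=\;
\max_{w \in \mathcal{W}}
\big|\,\mathbb{E}_{\mu}\big[\, w\,(Q - \mathcal{T}Q)\,\big]\big| .
\]
% Consequently, it suffices for the true density ratio to be representable by the
% convex hull of \mathcal{W} rather than by \mathcal{W} itself.
```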
In conclusion, the authors analyze MSBO and MABO in terms of error propagation, robustness against misspecified Q, statistical rates, and assumptions on helper classes. They highlight the advantages and potential limitations of MABO compared to classical algorithms. The authors provide detailed proofs for the theorems and acknowledge the contributions of related works.
The final part of the document covers the technical development behind this comparison. The main focus is on bounding the Bellman error and controlling the approximation errors of the analyzed algorithms.
The authors start from the Q-function and the Bellman optimality operator T, and they measure the quality of an approximate Q-function Q̂ through the performance loss J(π*) - J(π_Q̂), where π_Q̂ is the greedy policy of Q̂. Their aim is to bound this loss through the Bellman error of Q̂ and to obtain a low-dimensional approximation of the Q-function.
To achieve this, they introduce concentrability coefficients together with a weighted loss L(Q, w), where the weight w plays the role of a density ratio such as d^π/μ; this loss measures the average Bellman error of Q under the reweighted data distribution, and they show that it can be bounded in terms of the relevant approximation and estimation errors.
Next, they introduce a helper lemma that bounds the maximum of a linear function. Using this lemma, they bound the loss L(Q, w) in terms of the maximum norm of Q and the approximation errors, and they show that the loss can be minimized by choosing appropriate weights w.
They then discuss the difference between per-step and occupancy-based concentrability coefficients. They provide an example to illustrate the limitations of per-step concentrability coefficients and show that occupancy-based concentrability coefficients are always 1 in certain cases.
Furthermore, they demonstrate that iterative methods fail to directly control the Bellman error on the data distribution, giving a counterexample based on a two-state deterministic MDP in which Fitted Q-Iteration (FQI) exhibits exactly this failure.
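For intuition only, here is a minimal tabular Fitted Q-Iteration sketch (a generic illustration in the same spirit, not the paper's two-state construction): each round regresses toward the previous iterate's backup targets, and nothing in the loop ever measures or shrinks the Bellman error of the final Q on the data distribution.

```python
import numpy as np

def fitted_q_iteration(batch, num_states, num_actions, gamma, num_iters=50):
    """Tabular FQI sketch: repeatedly regress Q_k onto the backup of Q_{k-1}.

    batch: list of (s, a, r, s_next) transitions from the data distribution mu.
    Note: each round only fits the previous iterate's targets; nothing in this
    loop evaluates the Bellman error Q - TQ of the returned Q under mu.
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        target_sum = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next in batch:
            target_sum[s, a] += r + gamma * np.max(Q[s_next])
            counts[s, a] += 1.0
        # Least-squares fitting in the tabular case reduces to averaging the
        # targets per (s, a); unvisited pairs keep their previous values.
        Q = np.divide(target_sum, counts, out=Q.copy(), where=counts > 0)
    return Q
```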
They also discuss the existence of simple weight matrices in low-rank MDPs. They claim that for MDPs with low-rank transition matrices, there exist weight matrices that can achieve low-dimensional approximation of the Q function. They provide a proof for this claim using a determinant maximization argument.
Finally, they discuss the restricted case of knowing the left factorization matrix as features. They consider the case where the transition matrix has a low-rank structure and show that it is possible to achieve a low-dimensional approximation of the Q function using the left factorization matrix as features.
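The mechanism behind the low-rank discussion is plausibly the standard one, stated here as an assumption about the factorization: when the transition kernel factorizes through d-dimensional features φ, every Bellman backup lands in the span of those features.

```latex
% Assume P(s' \mid s,a) = \phi(s,a)^{\top}\psi(s') with \phi(s,a)\in\mathbb{R}^{d}. Then for any Q,
\[
(\mathcal{T}Q)(s,a)
= r(s,a) + \gamma \sum_{s'} \phi(s,a)^{\top}\psi(s')\max_{a'}Q(s',a')
= r(s,a) + \phi(s,a)^{\top}\theta_{Q},
\qquad
\theta_{Q} := \gamma \sum_{s'} \psi(s')\max_{a'}Q(s',a') \in \mathbb{R}^{d},
\]
% so \mathcal{T}Q - r is always linear in the d-dimensional left factor \phi, which is
% what makes a low-dimensional approximation of the Q-function possible.
```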
In conclusion, this document provides a theoretical comparison of Q approximation schemes in batch reinforcement learning. It discusses the bounds on the Bellman error and the weighted loss, and highlights the limitations of per-step concentrability coefficients. It also explores the existence of simple weight matrices in low-rank MDPs and the use of the left factorization matrix as features in certain cases.