Reinforcement Learning with Function 
Approximation Converges to a Region 
Geoffrey J. Gordon 
ggordoncs. cmu. edu 
Abstract 
Many algorithms for approximate reinforcement learning are not 
known to converge. In fact, there are counterexamples showing 
that the adjustable weights in some algorithms may oscillate within 
a region rather than converging to a point. This paper shows that, 
for two popular algorithms, such oscillation is the worst that can 
happen: the weights cannot diverge, but instead must converge 
to a bounded region. The algorithms are SARSA(0) and V(0); the 
latter algorithm was used in the well-known TD-Gammon program. 
1 Introduction 
Although there are convergent online algorithms (such as TD($\lambda$) [1]) for learning
the parameters of a linear approximation to the value function of a Markov process, 
no way is known to extend these convergence proofs to the task of online approxi- 
mation of either the state-value (V*) or the action-value (Q*) function of a general 
Markov decision process. In fact, there are known counterexamples to many pro- 
posed algorithms. For example, fitted value iteration can diverge even for Markov 
processes [2]; Q-learning with linear function approximators can diverge, even when 
the states are updated according to a fixed update policy [3]; and SARSA(0) can 
oscillate between multiple policies with different value functions [4]. 
Given the similarities between SARSA(0) and Q-learning, and between V(0) and 
value iteration, one might suppose that their convergence properties would be identi- 
cal. That is not the case: while Q-learning can diverge for some exploration strate- 
gies, this paper proves that the iterates for trajectory-based SARSA(0) converge 
with probability 1 to a fixed region. Similarly, while value iteration can diverge for 
some exploration strategies, this paper proves that the iterates for trajectory-based 
V(0) converge with probability 1 to a fixed region.¹
¹ In a "trajectory-based" algorithm, the exploration policy may not change within a
single episode of learning. The policy may change between episodes, and the value function
may change within a single episode. (Episodes end when the agent enters a terminal state.
This paper considers only episodic tasks, but since any discounted task can be transformed
into an equivalent episodic task, the algorithms apply to non-episodic tasks as well.)
The question of the convergence behavior of SARSA($\lambda$) is one of the four open
theoretical questions of reinforcement learning that Sutton [5] identifies as "particularly
important, pressing, or opportune." This paper covers SARSA(0), and together
with an earlier paper [4] describes its convergence behavior: it is stable in the sense
that there exist bounded regions which with probability 1 it eventually enters and
never leaves, but for some Markov decision processes it may not converge to a single
point. The proofs extend easily to SARSA($\lambda$) for $\lambda > 0$.
Unfortunately the bound given here is not of much use as a practical guarantee: it 
is loose enough that it provides little reason to believe that SARSA(0) and V(0) 
produce useful approximations to the state- and action-value functions. However, 
it is important for several reasons. First, it is the best result available for these two 
algorithms. Second, such a bound is often the first step towards proving stronger 
results. Finally, in practice it often happens that after some initial exploration 
period, only a few different policies are ever greedy; if this is the case, the strategy 
of this paper could be used to prove much tighter bounds. 
Results similar to the ones presented here were developed independently in [6]. 
2 The algorithms 
The SARSA(0) algorithm was first suggested in [7]. The V(0) algorithm was popularized
by its use in the TD-Gammon backgammon playing program [8].²
Fix a Markov decision process $M$, with a finite set $S$ of states, a finite set $A$ of
actions, a terminal state $T$, an initial distribution $S_0$ over $S$, a one-step reward
function $r : S \times A \to \mathbb{R}$, and a transition function $\delta : S \times A \to S \cup \{T\}$. ($M$
may also have a discount factor $\gamma$ specifying how to trade future rewards against
present ones. Here we fix $\gamma = 1$, but our results carry through to $\gamma < 1$.) Both the
transition and reward functions may be stochastic, so long as successive samples are
independent (the Markov property) and the reward has bounded expectation and
variance. We assume that all states in $S$ are reachable with positive probability.
We define a policy $\pi$ to be a function mapping states to probability distributions
over actions. Given a policy we can sample a trajectory (a sequence of states,
actions, and one-step rewards) by the following rule: begin by selecting a state $s_0$
according to $S_0$. Now choose an action $a_0$ according to $\pi(s_0)$. Now choose a one-step
reward $r_0$ according to $r(s_0, a_0)$. Finally choose a new state $s_1$ according to
$\delta(s_0, a_0)$. If $s_1 = T$, stop; otherwise repeat. We assume that all policies are proper,
that is, that the agent reaches $T$ with probability 1 no matter what policy it follows.
(This assumption is satisfied trivially if $\gamma < 1$.)
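The sampling rule above can be sketched in a few lines of code. This is only an illustration: the two-state MDP, the dictionary layout, and the helper names (`draw`, `sample_trajectory`, `uniform_policy`) are assumptions made for the example and do not come from the paper.

```python
import random

TERMINAL = "T"

# Hypothetical MDP. delta[(s, a)] lists (next_state, probability) pairs and
# reward[(s, a)] is a deterministic one-step reward.
delta = {
    ("s0", "left"): [("s1", 1.0)],
    ("s0", "right"): [(TERMINAL, 1.0)],
    ("s1", "left"): [("s0", 0.5), (TERMINAL, 0.5)],
    ("s1", "right"): [(TERMINAL, 1.0)],
}
reward = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
          ("s1", "left"): -1.0, ("s1", "right"): 2.0}
S0 = [("s0", 1.0)]  # initial distribution over states


def draw(dist, rng):
    """Sample an outcome from a list of (outcome, probability) pairs."""
    u, acc = rng.random(), 0.0
    for outcome, p in dist:
        acc += p
        if u < acc:
            return outcome
    return dist[-1][0]  # guard against floating-point round-off


def sample_trajectory(policy, rng):
    """Draw s0 from S0, then repeat (action, reward, next state) until T."""
    s, steps = draw(S0, rng), []
    while s != TERMINAL:
        a = draw(policy(s), rng)
        r = reward[(s, a)]
        steps.append((s, a, r))
        s = draw(delta[(s, a)], rng)
    return steps


def uniform_policy(s):
    """A proper policy: every action has positive probability in every state."""
    return [("left", 0.5), ("right", 0.5)]


rng = random.Random(0)
trajectory = sample_trajectory(uniform_policy, rng)
total_reward = sum(r for _, _, r in trajectory)
```

The reward of the trajectory is then the sum of its one-step rewards, as in the definition that follows.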
The reward for a trajectory is the sum of all of its one-step rewards. Our goal is 
to find an optimal policy, that is, a policy which on average generates trajectories 
with the highest possible reward. Define Q*(s,a) to be the best total expected 
reward that we can achieve by starting in state s, performing action a, and acting 
optimally afterwards. Define $V^*(s) = \max_a Q^*(s,a)$. Knowledge of either $Q^*$ or the
combination of $V^*$, $\delta$, and $r$ is enough to determine an optimal policy.
² The proof given here does not cover the TD-Gammon program, since TD-Gammon
uses a nonlinear function approximator to represent its value function. Interestingly,
though, the proof extends easily to cover games such as backgammon in addition to MDPs.
It also extends to cover SARSA($\lambda$) and V($\lambda$) for $\lambda > 0$.
The SARSA(0) algorithm maintains an approximation to $Q^*$. We will write $Q(s,a)$
for $s \in S$ and $a \in A$ to refer to this approximation. We will assume that $Q$ is a
full-rank linear function of some parameters $w$. For convenience of notation, we
will write $Q(T, a) = 0$ for all $a \in A$, and tack an arbitrary action onto the end of
all trajectories (which would otherwise end with the terminal state). After seeing
a trajectory fragment $s, a, r, s', a'$, the SARSA(0) algorithm updates
$$Q(s,a) \leftarrow r + Q(s',a')$$
The notation $Q(s,a) \leftarrow V$ means that the parameters $w$ which represent $Q(s,a)$
should be adjusted by gradient descent to reduce the error $(Q(s,a) - V)^2$; that is,
for some preselected learning rate $\alpha \ge 0$,
$$w_{\text{new}} = w_{\text{old}} + \alpha\,(V - Q(s,a))\,\nabla_w Q(s,a)$$
For convenience, we assume that $\alpha$ remains constant within a single trajectory.
We also make the standard assumption that the sequence of learning rates is fixed
before the start of learning and satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
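As a minimal sketch of the SARSA(0) update for a linear approximator, assume $Q(s,a) = w \cdot \phi(s,a)$ for some feature vector $\phi(s,a)$. The feature vectors and the function names `q_value` and `sarsa0_update` are hypothetical and chosen only for this illustration.

```python
def q_value(w, features):
    """Linear approximation Q(s, a) = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def sarsa0_update(w, phi_sa, r, phi_next, alpha):
    """One SARSA(0) step for the fragment (s, a, r, s', a').

    phi_sa and phi_next are the feature vectors of (s, a) and (s', a').
    Pass an all-zero phi_next when s' is terminal, so that Q(T, a) = 0.
    """
    target = r + q_value(w, phi_next)      # the value V in the text
    error = target - q_value(w, phi_sa)    # V - Q(s, a)
    # For a linear Q, the gradient with respect to w is phi(s, a).
    return [wi + alpha * error * xi for wi, xi in zip(w, phi_sa)]


w = [0.0, 0.0]
w = sarsa0_update(w, phi_sa=[1.0, 0.0], r=1.0, phi_next=[0.0, 0.0], alpha=0.5)
w_second = sarsa0_update(w, phi_sa=[0.0, 1.0], r=0.0, phi_next=[1.0, 0.0], alpha=1.0)
```

The first call moves the weight for the visited feature halfway toward the target of 1; the second bootstraps from the value just learned.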
We will consider only the trajectory-based version of SARSA(0). This version 
changes policies only between trajectories. At the beginning of each trajectory, 
it selects the $\epsilon$-greedy policy for its current $Q$ function. From state $s$, the $\epsilon$-greedy
policy chooses the action $\arg\max_a Q(s, a)$ with probability $1 - \epsilon$, and otherwise
selects uniformly at random among all actions. This rule ensures that, no matter
the sequence of learned $Q$ functions, each state-action pair will be visited infinitely
often. (The use of $\epsilon$-greedy policies is not essential. We just need to be able to
find a region that contains all of the approximate value functions for every policy
considered, and a bound on the convergence rate of TD(0).)
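The action-selection rule can be sketched as follows. The function names are hypothetical, and ties in the argmax are broken toward the lowest action index, which is an assumption the text does not specify.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """With probability 1 - epsilon take the greedy action; otherwise pick
    uniformly among all actions (the greedy one included)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties: lowest index
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def choose_action(q_values, epsilon, rng):
    """Sample an action index from the epsilon-greedy distribution."""
    weights = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=weights)[0]


probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.3)
action = choose_action([1.0, 3.0, 2.0], 0.3, random.Random(0))
```

Because every action receives probability at least $\epsilon / |A| > 0$, every state-action pair keeps a positive chance of being visited whatever the current $Q$ function is.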
We can compare the SARSA(0) update rule to the one for Q-learning: 
Q(s,a) - r + maxQ(s,b) 
b 
Often a  in the SARSA(0) update rule will be the same as the maximizing b in 
the Q-learning update rule; the difference only appears when the agent takes an 
exploring action, i.e., one which is not greedy for the current Q function. 
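A small sketch makes the comparison concrete. The values in `q_next` are made up for the example, and the function names are hypothetical.

```python
def sarsa_target(r, q_next, a_next):
    """SARSA(0) target: bootstrap from the action a' actually taken in s'."""
    return r + q_next[a_next]


def q_learning_target(r, q_next):
    """Q-learning target: bootstrap from the maximizing action b in s'."""
    return r + max(q_next)


q_next = [0.2, 1.5, 0.7]  # hypothetical values Q(s', b) for three actions

# a' = 1 is greedy for q_next, so both rules use the same target.
greedy_case = (sarsa_target(1.0, q_next, a_next=1), q_learning_target(1.0, q_next))

# a' = 0 is an exploring action, so the two targets differ.
exploring_case = (sarsa_target(1.0, q_next, a_next=0), q_learning_target(1.0, q_next))
```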
The V(0) algorithm maintains an approximation to $V^*$ which we will write $V(s)$ for
all $s \in S$. Again, we will assume $V$ is a full-rank linear function of parameters $w$,
and $V(T)$ is held fixed at 0. After seeing a trajectory fragment $s, a, r, s'$, V(0) sets
$$V(s) \leftarrow r + V(s')$$
This update ignores $a$. Often $a$ is chosen according to a greedy or $\epsilon$-greedy policy
for a recent $V$. However, for our analysis we only need to assume that we consider
finitely many policies and that the policy remains fixed during each trajectory.
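To illustrate the V(0) update under a fixed policy, the sketch below runs it on a hypothetical deterministic chain A, B, T with rewards 1 and 2, using one indicator feature per state. The chain, the feature map `PHI`, and the schedule $\alpha_t = 1/t$ are assumptions chosen so that the outcome is easy to check by hand; here the true values are $V(A) = 3$ and $V(B) = 2$.

```python
def v_value(w, features):
    """Linear approximation V(s) = w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def v0_update(w, phi_s, r, phi_next, alpha):
    """One V(0) step for the fragment (s, a, r, s'); the action is unused."""
    error = r + v_value(w, phi_next) - v_value(w, phi_s)
    return [wi + alpha * error * xi for wi, xi in zip(w, phi_s)]


# Hypothetical chain A -> B -> T. The terminal state has all-zero features,
# so V(T) is held at 0.
PHI = {"A": [1.0, 0.0], "B": [0.0, 1.0], "T": [0.0, 0.0]}
episode = [("A", 1.0, "B"), ("B", 2.0, "T")]

w = [0.0, 0.0]
for t in range(1, 1001):
    alpha = 1.0 / t  # held constant within a trajectory
    for s, r, s_next in episode:
        w = v0_update(w, PHI[s], r, PHI[s_next], alpha)
```

With this schedule the error in $V(A)$ after episode $t$ is $2/t$, so the weights approach the true values as the learning rate decays.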
We leave open the question of whether updates to $w$ happen immediately after each
transition or only at the end of each trajectory. As pointed out in [9], this difference
will not affect convergence: the updates within a single trajectory are $O(\alpha)$, so they
cause a change in $Q(s,a)$ or $V(s)$ of $O(\alpha)$, which means subsequent updates are
affected by at most $O(\alpha^2)$. Since $\alpha$ is decaying to zero, the $O(\alpha^2)$ terms can be
neglected. (If we were to change policies during the trajectory, this argument would
no longer hold, since small changes in Q or V can cause large changes in the policy.)
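The $O(\alpha^2)$ claim can be checked numerically on a toy case. The sketch below applies tabular TD(0) updates to one hypothetical trajectory that revisits state A, once updating after every transition and once only at the end of the trajectory. The trajectory and function names are assumptions for illustration.

```python
def run_fragments(w, fragments, alpha, online):
    """Apply tabular TD(0) updates for one trajectory.

    online=True updates the values after every transition. online=False
    computes every update from the values held at the start of the
    trajectory and applies them together at the end.
    """
    frozen = dict(w)
    current = dict(w)
    for s, r, s_next in fragments:
        source = current if online else frozen
        error = r + source.get(s_next, 0.0) - source[s]  # terminal value is 0
        current[s] += alpha * error
    return current


fragments = [("A", 1.0, "B"), ("B", 0.0, "A"), ("A", 2.0, "T")]
start = {"A": 0.0, "B": 0.0}


def gap(alpha):
    """Largest difference between the online and end-of-trajectory results."""
    on = run_fragments(start, fragments, alpha, online=True)
    off = run_fragments(start, fragments, alpha, online=False)
    return max(abs(on[s] - off[s]) for s in start)


ratio = gap(0.1) / gap(0.05)
```

Halving the learning rate divides the gap between the two schemes by about four, which is the quadratic behavior the argument relies on.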
3 The result 
Our result is that the weights w in either SARSA(0) or V(0) converge with proba- 
bility 1 to a fixed region. The proof of the result is based on the following intuition: 
while SARSA(0) and V(0) might consider many different policies over time, on any 
given trajectory they always follow the TD(0) update rule for some policy. The 
TD(0) update is, under general conditions, a 2-norm contraction, and so would 
converge to its fixed point if it were applied repeatedly; what causes SARSA(0) and 
V(0) not to converge to a point is just that they consider different policies (and 
so take steps towards different fixed points) during different trajectories. Crucially, 
under general conditions, all of these fixed points are within some bounded region. 
So, we can view the SARSA(0) and V(0) update rules as contraction mappings plus 
a bounded amount of "slop." With this observation, standard convergence theorems 
show that the weight vectors generated by SARSA(0) and V(0) cannot diverge. 
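The "contraction plus slop" picture can be illustrated with a one-dimensional toy iteration. This sketch is not the paper's algorithm: it assumes a scalar weight, $A = 1$, two hypothetical policies with fixed points $-1$ and $+1$ that are followed on alternate steps, and a constant step size of 0.5, whereas the paper uses decaying learning rates and greedy policy selection.

```python
def step(w, fixed_point, alpha):
    """Expected update with A = 1: move w a fraction alpha toward fixed_point."""
    return w - alpha * (w - fixed_point)


FIXED_POINTS = [-1.0, 1.0]  # hypothetical fixed points of two policies

w, history = 100.0, []
for t in range(200):
    w = step(w, FIXED_POINTS[t % 2], alpha=0.5)
    history.append(w)

tail = history[-20:]
```

Each step contracts toward one of the two fixed points, so the iterate is drawn into the interval between them and stays there. It does not settle on a single point: in this toy case it ends up alternating between values near $-1/3$ and $+1/3$.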
Theorem 1 For any Markov decision process $M$ satisfying our assumptions, there
is a bounded region $R$ such that the SARSA(0) algorithm, when acting on $M$, produces
a series of weight vectors which with probability 1 converges to $R$. Similarly,
there is another bounded region $R'$ such that the V(0) algorithm acting on $M$ produces
a series of weight vectors converging with probability 1 to $R'$.
PROOF: Lemma 2, below, shows that both the SARSA(0) and V(0) updates can
be written in the form
$$w_{t+1} = w_t - \alpha_t (A_t w_t - r_t + \epsilon_t)$$
where $A_t$ is positive definite, $\alpha_t$ is the current learning rate, $E(\epsilon_t) = 0$,
$\mathrm{Var}(\epsilon_t) \le K(1 + \|w_t\|^2)$, and $A_t$ and $r_t$ depend only on the currently greedy policy. ($A_t$ and
$r_t$ represent, in a manner described in the lemma, the transition probabilities and
one-step costs which result from following the current policy. Of course, $w_t$, $A_t$,
and $r_t$ will be different depending on whether we are following SARSA(0) or V(0).)
Since $A_t$ is positive definite, the SARSA(0) and V(0) updates are 2-norm contractions
for small enough $\alpha_t$. So, if we kept the policy fixed rather than changing it
at the beginning of each trajectory, standard results such as Lemma 1 below would
guarantee convergence. The intuition is that we can define a nonnegative potential
function $J(w)$ and show that, on average, the updates tend to decrease $J(w)$ as
long as $\alpha_t$ is small enough and $J(w)$ starts out large enough compared to $\alpha_t$.
To apply Lemma 1 under the assumption that we keep the policy constant rather
than changing it every trajectory, write $A_t = A$ and $r_t = r$ for all $t$, and write
$w_\pi = A^{-1} r$. Let $\rho$ be the smallest eigenvalue of $A$ (which must be real and positive
since $A$ is positive definite). Write $s_t = A w_t - r + \epsilon_t$ for the update direction at
step $t$. Then if we take $J(w) = \|w - w_\pi\|^2$,
$$\begin{aligned}
E(s_t \mid w_t)^T \nabla J(w_t) &= 2 (w_t - w_\pi)^T (A w_t - r + E(\epsilon_t)) \\
&= 2 (w_t - w_\pi)^T (A w_t - A w_\pi) \\
&\ge 2 \rho \|w_t - w_\pi\|^2 \\
&= 2 \rho J(w_t)
\end{aligned}$$
so that $-s_t$ is a descent direction in the sense required by the lemma. It is easy
to check the lemma's variance condition. So, Lemma 1 shows that $J(w_t)$ converges
with probability 1 to 0, which means $w_t$ must converge with probability 1 to $w_\pi$.
If we pick an arbitrary vector $u$ and define $H(w) = \max(0, \|w - u\| - C)^2$ for some
sufficiently large constant $C$, then the same argument reaches the weaker conclusion
that $w_t$ must converge with probability 1 to a sphere of radius $C$ centered at $u$. To
see why, note that $-s_t$ is also a descent direction for $H(w)$: inside the sphere, $H = 0$
and $\nabla H = 0$, so the descent condition is satisfied trivially. Outside the sphere,
$$\begin{aligned}
\nabla H(w) &= 2 (w - u) \frac{\|w - u\| - C}{\|w - u\|} = d(w)(w - u) \\
E(s_t \mid w_t)^T \nabla H(w_t) &= d(w_t)(w_t - u)^T (A w_t - r) \\
&= d(w_t)(w_t - w_\pi + w_\pi - u)^T A (w_t - w_\pi) \\
&\ge d(w_t)\left(\rho \|w_t - w_\pi\|^2 - \|w_\pi - u\| \, \|A\| \, \|w_t - w_\pi\|\right)
\end{aligned}$$
The positive term will be larger than the negative one if $\|w_t - w_\pi\|$ is large enough.
So, if we choose $C$ large enough, the descent condition will be satisfied. The variance
condition is again easy to check. Lemma 3 shows that $\nabla H$ is Lipschitz. So, Lemma 1
shows that $H(w_t)$ converges with probability 1 to 0, which means that $w_t$ must
converge with probability 1 to the sphere of radius $C$ centered at $u$.
But now we are done: since there are finitely many policies that SARSA(0) or V(0)
can consider, we can pick any $u$ and then choose a $C$ large enough that the above
argument holds for all policies simultaneously. With this choice of $C$ the update for
any policy decreases $H(w_t)$ on average as long as $\alpha_t$ is small enough, so the update
for SARSA(0) or V(0) does too, and Lemma 1 applies. □
The following lemma is Corollary 1 of [10]. In the statement of the lemma, a
Lipschitz continuous function $F$ is one for which there exists a constant $L$ so that
$\|F(u) - F(w)\| \le L \|u - w\|$ for all $u$ and $w$. The Lipschitz condition is essentially
a uniform bound on the derivative of $F$.
Lemma 1 Let $J$ be a differentiable function, bounded below by $J^*$, and let $\nabla J$ be
Lipschitz continuous. Suppose the sequence $w_t$ satisfies
$$w_{t+1} = w_t - \alpha_t s_t$$
for random vectors $s_t$ independent of $w_{t+1}, w_{t+2}, \ldots$. Suppose $-s_t$ is a descent
direction for $J$ in the sense that $E(s_t \mid w_t)^T \nabla J(w_t) > \delta(\epsilon) > 0$ whenever
$J(w_t) > J^* + \epsilon$. Suppose also that
$$E(\|s_t\|^2 \mid w_t) \le K_1 J(w_t) + K_2 E(s_t \mid w_t)^T \nabla J(w_t) + K_3$$
and finally that the constants $\alpha_t$ satisfy
$$\alpha_t > 0, \qquad \sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$$
Then $J(w_t) \to J^*$ with probability 1.
Most of the work in proving the next lemma is already present in [1]. The transfor- 
mation from an MDP under a fixed policy to a Markov chain is standard. 
Lemma 2 The update made by SARSA(0) or V(0) during a single trajectory can
be written in the form
$$w_{\text{new}} = w_{\text{old}} - \alpha (A_\pi w_{\text{old}} - r_\pi + \epsilon)$$
where the constant matrix $A_\pi$ and constant vector $r_\pi$ depend on the currently greedy
policy $\pi$, $\alpha$ is the current learning rate, and $E(\epsilon) = 0$. Furthermore, $A_\pi$ is positive
definite, and there is a constant $K$ such that $\mathrm{Var}(\epsilon) \le K(1 + \|w\|^2)$.
PROOF: Consider the following Markov process $M_\pi$: $M_\pi$ has one state for each
state-action pair in $M$. If $M$ has a transition which goes from state $s$ under action
$a$ with reward $r$ to state $s'$ with probability $p$, then $M_\pi$ has a transition from state
$\langle s, a \rangle$ with reward $r$ to state $\langle s', a' \rangle$ for every $a'$; the probability of this transition
is $p\,\pi(a' \mid s')$. We will represent the value function for $M_\pi$ in the same way that we
represented the $Q$ function for $M$; in other words, the representation for $V(\langle s, a \rangle)$ is
the same as the representation for $Q(s, a)$. With these definitions, it is easy to see
that TD(0) acting on $M_\pi$ produces exactly the same sequence of parameter changes
as SARSA(0) acting on $M$ under the fixed policy $\pi$. (And since $\pi(a \mid s) > 0$, every
state of $M_\pi$ will be visited infinitely often.)
Write T for the transition probability matrix of the above Markov process. That 
is, the entry of T in row {s, a) and column {s', a') will be equal to the probability of 
taking a step to {s',a') given that we start in {s,a). By definition, T is substochas- 
tic. That is, it has nonnegative entries, and its row sums are less than or equal to 1. 
Write s for the vector whose {s, a)th element is So(s)r(als), that is, the probability 
that we start in state s and take action a. Write d (I T -1 
= -T) s, where l is the 
identity matrix. As demonstrated in, e.g., [11], d is the vector of expected visita- 
tion frequencies under r; that is, the element of d corresponding to state s and 
action a is the expected number of times that the agent will visit state s and select 
action a during a single trajectory following policy r. Write D for the diagonal 
matrix with d on its diagonal. Write r for the vector of expected rewards; that 
is, the component of r corresponding to state s and action a is E(r(s,a)). Finally 
write X for the Jacobian matrix 
With this notation, Sutton [1] showed that the expected TD(0) update is
$$E(w_{\text{new}} \mid w_{\text{old}}) = w_{\text{old}} - \alpha X^T D_\pi (I - T_\pi) X w_{\text{old}} + \alpha X^T D_\pi r$$
(Actually, he only considered the case where all rewards are zero except on transitions
from nonterminal to terminal states, but his argument works equally well for
the more general case where nonzero rewards are allowed everywhere.) So, we can
take $A_\pi = X^T D_\pi (I - T_\pi) X$ and $r_\pi = X^T D_\pi r$ to make $E(\epsilon) = 0$.
Furthermore, Sutton showed that, as long as the agent reaches the terminal state
with probability 1 (in other words, as long as $\pi$ is proper) and as long as every
state is visited with positive probability (which is true since all states are reachable
and $\pi$ has a nonzero probability of choosing every action), the matrix $D_\pi (I - T_\pi)$
is strictly positive definite. Therefore, so is $A_\pi$.
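These quantities can be computed directly for a small chain. The sketch below assumes a hypothetical two-state Markov process in which each state moves to the other with probability 0.5 and terminates otherwise. It uses plain-Python matrix helpers written for this example and evaluates $d = (I - T^T)^{-1} s$, $A = X^T D (I - T) X$, $X^T D r$, and the fixed point $A^{-1} X^T D r$ for two choices of feature matrix $X$.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a x = b (b is a flat list) by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(m[i][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for i in range(n):
            if i != c:
                f = m[i][c] / m[c][c]
                m[i] = [x - f * y for x, y in zip(m[i], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def td_fixed_point(T, start, rewards, X):
    """Return (d, A, r_pi, w_pi) for transition matrix T, start distribution,
    expected rewards, and feature matrix X (one row per state)."""
    n = len(T)
    i_minus_t = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(n)]
                 for i in range(n)]
    d = solve(transpose(i_minus_t), start)         # d = (I - T^T)^{-1} s
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    xtd = matmul(transpose(X), D)                  # X^T D
    A = matmul(xtd, matmul(i_minus_t, X))          # X^T D (I - T) X
    r_pi = [row[0] for row in matmul(xtd, [[r] for r in rewards])]
    return d, A, r_pi, solve(A, r_pi)


T = [[0.0, 0.5], [0.5, 0.0]]  # the missing 0.5 per row leads to the terminal state
start, rewards = [1.0, 0.0], [1.0, 2.0]

# One indicator feature per state (a lookup table).
d, A_tab, _, w_tab = td_fixed_point(T, start, rewards, [[1.0, 0.0], [0.0, 1.0]])
# A single shared weight with features 1 and 2.
_, A_agg, _, w_agg = td_fixed_point(T, start, rewards, [[1.0], [2.0]])
```

With indicator features the fixed point coincides with the true values $(8/3, 10/3)$ of this chain, and `A_tab` is not symmetric although its quadratic form is positive. With the single shared weight, `A_agg` is the 1-by-1 matrix $[2]$ and the fixed point is $w = 2$.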
Finally, as can be seen from Sutton's equations on p. 25, there are two sources of 
variance in the update direction: variation in the number of times each transition 
is visited, and variation in the one-step rewards. The visitation frequencies and the 
one-step rewards both have bounded variance, and are independent of one another. 
They enter into the overall update in two ways: there is one set of terms which is 
bilinear in the one-step rewards and the visitation frequencies, and there is another 
set of terms which is bilinear in the visitation frequencies and the weights w. The 
former set of terms has constant variance. Because the policy is fixed, $w$ is independent
of the visitation frequencies, and so the latter set of terms has variance
proportional to $\|w\|^2$. So, there is a constant $K$ such that the total variance in $\epsilon$
can be bounded by $K(1 + \|w\|^2)$.
A similar but simpler argument applies to V(0). In this case we define $M_\pi$ to have
the same states as $M$, and to have the transition matrix $T_\pi$ whose element $s, s'$ is
the probability of landing in $s'$ in $M$ on step $t + 1$, given that we start in $s$ at step
$t$ and follow $\pi$. Write $s_\pi$ for the vector of starting probabilities, that is, $(s_\pi)_x = S_0(x)$.
Now define $X = \partial V / \partial w$ and $d_\pi = (I - T_\pi^T)^{-1} s_\pi$. Since we have assumed that all policies
are proper and that every policy considered has a positive probability of reaching
any state, the update matrix $A_\pi = X^T D_\pi (I - T_\pi) X$ is strictly positive definite. □
Lemma 3 The gradient of the function $H(w) = \max(0, \|w\| - 1)^2$ is Lipschitz
continuous.
PROOF: Inside the unit sphere, $H$ and all of its derivatives are uniformly zero.
Outside, we have
$$\nabla H = 2 w \, d(w)$$
where $d(w) = \frac{\|w\| - 1}{\|w\|}$, and
$$\begin{aligned}
\tfrac{1}{2} \nabla^2 H &= d(w) I + \nabla d(w) \, w^T \\
&= \frac{\|w\| - 1}{\|w\|} I + \frac{w w^T}{\|w\|^3} \\
&= d(w) I + \frac{w w^T}{\|w\|^2} (1 - d(w))
\end{aligned}$$
The norm of the first term is $d(w)$, the norm of the second is $1 - d(w)$, and since one
of the terms is a multiple of $I$ the norms add. So, the norm of $\nabla^2 H$ is 0 inside the
unit sphere and 2 outside. At the boundary of the unit sphere, $\nabla H$ is continuous,
and its directional derivatives from every direction are bounded by the argument
above. So, $\nabla H$ is Lipschitz continuous. □
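The Lipschitz property can also be probed numerically. The sketch below takes $H(w) = \max(0, \|w\| - 1)^2$, so that $\nabla H(w) = 2 \max(0, \|w\| - 1)\, w / \|w\|$, and estimates the largest ratio $\|\nabla H(u) - \nabla H(v)\| / \|u - v\|$ over random pairs of points. The sampling range, dimension, and helper names are arbitrary choices for the illustration.

```python
import math
import random


def grad_H(w):
    """Gradient of H(w) = max(0, ||w|| - 1)^2."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= 1.0:
        return [0.0] * len(w)  # H is identically zero inside the unit sphere
    scale = 2.0 * (norm - 1.0) / norm
    return [scale * x for x in w]


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)
worst_ratio = 0.0
for _ in range(10000):
    u = [rng.uniform(-3.0, 3.0) for _ in range(3)]
    v = [rng.uniform(-3.0, 3.0) for _ in range(3)]
    if dist(u, v) > 1e-12:
        worst_ratio = max(worst_ratio, dist(grad_H(u), grad_H(v)) / dist(u, v))
```

The observed ratio never exceeds 2. The reason is that $\nabla H(w)$ is twice the displacement of $w$ from its projection onto the unit ball, and that displacement map is nonexpansive.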
Acknowledgements 
Thanks to Andrew Moore and to the anonymous reviewers for helpful comments. 
This work was supported in part by DARPA contract number F30602-97-1-0215, 
and in part by NSF KDI award number DMS-9873442. The opinions and conclu- 
sions are the author's and do not reflect those of the US government or its agencies. 
References 
[1] R. S. Sutton. Learning to predict by the methods of temporal differences. 
Machine Learning, 3(1):9-44, 1988. 
[2] Geoffrey J. Gordon. Stable function approximation in dynamic programming. 
Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. 
[3] L. C. Baird. Residual algorithms: Reinforcement learning with function ap- 
proximation. In Machine Learning: proceedings of the twelfth international 
conference, San Francisco, CA, 1995. Morgan Kaufmann. 
[4] Geoffrey J. Gordon. Chattering in SARSA($\lambda$). Internal report, 1996. CMU
Learning Lab. Available from www.cs.cmu.edu/~ggordon.
[5] R. S. Sutton. Open theoretical questions in reinforcement learning. In P. Fischer
and H. U. Simon, editors, Computational Learning Theory (Proceedings
of EuroCOLT'99), pages 11-17, 1999.
[6] D. P. de Farias and B. Van Roy. On the existence of fixed points for approxi- 
mate value iteration and temporal-difference learning. Journal of Optimization 
Theory and Applications, 105(3), 2000. 
[7] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist
systems. Technical Report 166, Cambridge University Engineering
Department, 1994.
[8] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves 
master-level play. Neural Computation, 6:215-219, 1994. 
[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic 
iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 
1994. 
[10] B. T. Polyak and Ya. Z. Tsypkin. Pseudogradient adaptation and training 
algorithms. Automation and Remote Control, 34(3):377-397, 1973. Translated 
from Avtomatika i Telemekhanika. 
[11] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand--Reinhold, 
New York, 1960. 
