Kernel-Based Reinforcement Learning in 
Average-Cost Problems: An Application 
to Optimal Portfolio Choice 
Dirk Ormoneit 
Department of Computer Science 
Stanford University 
Stanford, CA 94305-9010 
ormoneit@cs.stanford.edu
Peter Glynn 
EESOR 
Stanford University 
Stanford, CA 94305-4023 
Abstract 
Many approaches to reinforcement learning combine neural net- 
works or other parametric function approximators with a form of 
temporal-difference learning to estimate the value function of a 
Markov Decision Process. A significant disadvantage of those pro- 
cedures is that the resulting learning algorithms are frequently un- 
stable. In this work, we present a new, kernel-based approach to 
reinforcement learning which overcomes this difficulty and provably 
converges to a unique solution. By contrast to existing algorithms, 
our method can also be shown to be consistent in the sense that 
its costs converge to the optimal costs asymptotically. Our focus 
is on learning in an average-cost framework and on a practical ap- 
plication to the optimal portfolio choice problem. 
1 Introduction 
Temporal-difference (TD) learning has been applied successfully to many real-world 
applications that can be formulated as discrete state Markov Decision Processes 
(MDPs) with unknown transition probabilities. If the state variables are continuous 
or high-dimensional, the TD learning rule is typically combined with some sort of 
function approximator - e.g. a linear combination of feature vectors or a neural 
network - which may well lead to numerical instabilities (see, for example, [BM95, 
TR96]). Specifically, the algorithm may fail to converge under several circumstances 
which, in the authors' opinion, is one of the main obstacles to a more wide-spread 
use of reinforcement learning (RL) in industrial applications. As a remedy, we 
adopt a non-parametric perspective on reinforcement learning in this work and we 
suggest a new algorithm that always converges to a unique solution in a finite 
number of steps. In detail, we assign value function estimates to the states in a 
sample trajectory and we update these estimates in an iterative procedure. The 
updates are based on local averaging using a so-called "weighting kernel". Besides 
numerical stability, a second crucial advantage of this algorithm is that additional 
training data always improve the quality of the approximation and eventually lead 
to optimal performance - that is, our algorithm is consistent in a statistical sense. 
To the authors' best knowledge, this is the first reinforcement learning algorithm 
for which consistency has been demonstrated in a continuous space framework. 
Specifically, the recently advocated "direct" policy search or perturbation methods 
can by construction at most be optimal in a local sense [SMSM00, VRK00]. 
Relevant earlier work on local averaging in the context of reinforcement learning 
includes [Rus97] and [Gor99]. While these papers pursue related ideas, their
approaches differ fundamentally from ours in the assumption that the transition
probabilities of the MDP are known and can be used for learning. By contrast,
kernel-based reinforcement learning only relies on sample trajectories of the MDP and it
is therefore much more widely applicable in practice. While our method addresses 
both discounted- and average-cost problems, we focus on average-costs here and 
refer the reader interested in discounted-costs to [OS00]. For brevity, we also defer
technical details and proofs to an accompanying paper [OG00]. Note that average-cost
reinforcement learning has been discussed by several authors (e.g. [TR99]).
The remainder of this work is organized as follows. In Section 2 we provide basic
definitions and we describe the kernel-based reinforcement learning algorithm. Sec- 
tion 3 focuses on the practical implementation of the algorithm and on theoretical 
issues. Sections 4 and 5 present our experimental results and conclusions. 
2 Kernel-Based Reinforcement Learning 
Consider a MDP defined by a sequence of states $X_t$ taking values in $\mathbb{R}^d$, a sequence
of actions $a_t$ taking values in $A = \{1, 2, \ldots, M\}$, and a family of transition kernels
$\{P_a(x, B) \mid a \in A\}$ characterizing the conditional probability of the event $X_t \in B$
given $X_{t-1} = x$ and $a_{t-1} = a$. The cost function $c(x, a)$ represents an immediate
penalty for applying action $a$ in state $x$. Strategies, policies, or controls are understood
as mappings of the form $\mu : \mathbb{R}^d \to A$, and we let $P_{x,\mu}$ denote the probability
distribution governing the Markov chain starting from $X_0 = x$ associated with the
policy $\mu$. Several regularity conditions are listed in detail in [OG00].
Our goal is to identify policies that are optimal in that they minimize the long-run
average-cost $\eta_\mu = \lim_{T \to \infty} E_{x,\mu}\left[\frac{1}{T} \sum_{t=0}^{T-1} c(X_t, \mu(X_t))\right]$. An optimal policy, $\mu^*$, can
be characterized as a solution to the Average-Cost Optimality Equation
$$\eta^* + h^*(x) = \min_a \{c(x, a) + (\Gamma_a h^*)(x)\}, \tag{1}$$
$$\mu^*(x) = \arg\min_a \{c(x, a) + (\Gamma_a h^*)(x)\}, \tag{2}$$
where $\eta^*$ is the minimum average-cost and $h^*(x)$ has an interpretation as the differential
value of starting in $x$ as opposed to drawing a random starting position from
the stationary distribution under $\mu^*$. $\Gamma_a$ denotes the conditional expectation operator
$(\Gamma_a h)(x) = E_{x,a}[h(X_1)]$, which is assumed to be unknown so that (1) cannot
be solved explicitly. Instead, we simulate the MDP using a fixed proposal strategy
$\bar{\mu}$ in reinforcement learning to generate a sample trajectory as training data.
Formally, let $\mathcal{S} = \{z_0, \ldots, z_m\}$ denote such an $m$-step sample trajectory and let
$\mathcal{A} = \{a_0, \ldots, a_{m-1}\}$ and $\mathcal{C} = \{c(z_s, a_s) \mid 0 \le s < m\}$ be the sequences
of actions and costs associated with $\mathcal{S}$. Then our objective can be reformulated as
the approximation of $\mu^*$ based on information in $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{C}$. In detail, we will
construct an approximate expectation operator, $\hat{\Gamma}_{m,a}$, based on the training data,
$\mathcal{S}$, and use this approximation in place of the true operator $\Gamma_a$ in this work. Formally
substituting $\hat{\Gamma}_{m,a}$ for $\Gamma_a$ in (1) and (2) gives the Approximate Average-Cost
Optimality Equation (AACOE):
$$\hat{\eta}_m + \hat{h}_m(x) = \min_a \{c(x, a) + (\hat{\Gamma}_{m,a} \hat{h}_m)(x)\}, \tag{3}$$
$$\hat{\mu}_m(x) = \arg\min_a \{c(x, a) + (\hat{\Gamma}_{m,a} \hat{h}_m)(x)\}. \tag{4}$$
Note that, if the solutions $\hat{\eta}_m$ and $\hat{h}_m$ to (3) are well-defined, they can be interpreted
as statistical estimates of $\eta^*$ and $h^*$ in equation (1). However, $\hat{\eta}_m$ and $\hat{h}_m$ need not
exist unless $\hat{\Gamma}_{m,a}$ is defined appropriately. We therefore employ local averaging in
this work to construct $\hat{\Gamma}_{m,a}$ in a way that guarantees the existence of a unique
fixed point of (3). For the derivation of the local averaging operator, note that
the task of approximating $(\Gamma_a h)(x) = E_{x,a}[h(X_1)]$ can be interpreted alternatively
as a regression of the "target" variable $h(X_1)$ onto the "input" $X_0 = x$. So-called
kernel-smoothers address regression tasks of this sort by locally averaging the target
values in a small neighborhood of $x$. This gives the following approximation:
rr -- 1 
: (5) 
x) : . (6) 
In detail, we employ the weighting function or weighting kernel k.,,a (zs, x) in (6) to 
determine the weights that are used for averaging in equation (5). Here k.,,a(zs, x) is 
a multivariate Gaussian, normalized so as to satisfy the constraints k.,,(zs, x) > 0 
rr, - 1 
if as : a, k.,,a(zs,x) : 0 if as  a, and s=0 k.,,(zs,x) : 1. Intuitively, (5) 
assesses the future differential cost of applying action a in state x by looking at all 
times in the training data where a has been applied previously in a state similar 
to x, and by averaging the current differential value estimates at the outcomes of 
these previous transitions. Because the weights $k_{m,a}(z_s, x)$ are related inversely
to the distance $\|z_s - x\|$, transitions originating in the neighborhood of $x$ are most
influential in this averaging procedure. A more statistical interpretation of (5) would
suggest that ideally we could simply generate a large number of independent samples
from the conditional distribution $P_{x,a}$ and estimate $E_{x,a}[h(X_1)]$ using Monte-Carlo
approximation. Practically speaking, this approach is clearly infeasible because in
order to assess the value of the simulated successor states we would need to sample
recursively, thereby incurring exponentially increasing computational complexity. A
more realistic alternative is to estimate $(\hat{\Gamma}_{m,a} h)(x)$ as a local average of the rewards
that were generated in previous transitions originating in the neighborhood of $x$,
where the membership of an observation $z_s$ in the neighborhood of $x$ is quantified
using $k_{m,a}(z_s, x)$. Here the regularization parameter $b$ determines the width of the
Gaussian kernel and thereby also the size of the neighborhood used for averaging. 
Depending on the application, it may be advisable to choose b either fixed or as a 
location-dependent function of the training data. 
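The local averaging step can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: the function names `kernel_weights` and `approx_expectation`, the array layout, and the specific Gaussian form $\exp(-\|z_s - x\|^2 / (2b^2))$ are our illustrative choices, not taken from the paper, and a fixed bandwidth $b$ is used.

```python
import numpy as np

def kernel_weights(states, actions, x, a, b):
    """Normalized Gaussian weights k_{m,a}(z_s, x) for s = 0..m-1.

    states  : array (m+1, d), sample trajectory z_0..z_m
    actions : array (m,), actions a_0..a_{m-1}
    The weights are zero wherever a_s != a and sum to one.
    The Gaussian form exp(-dist^2 / (2 b^2)) is an assumed choice.
    """
    z = states[:-1]                              # transition origins z_0..z_{m-1}
    sq_dist = np.sum((z - x) ** 2, axis=1)
    mask = actions == a
    if not mask.any():
        raise ValueError("action never taken in the training data")
    w = np.zeros(len(z))
    d = sq_dist[mask]
    # Shifting by the smallest distance leaves the normalized weights
    # unchanged and avoids a 0/0 when all exponentials underflow.
    w[mask] = np.exp(-(d - d.min()) / (2.0 * b ** 2))
    return w / w.sum()

def approx_expectation(h_next, states, actions, x, a, b):
    """Local average: sum_s k_{m,a}(z_s, x) * h(z_{s+1}).

    h_next : array (m,), values of h at the successor states z_1..z_m
    """
    return kernel_weights(states, actions, x, a, b) @ h_next

# Usage on made-up data: 200 transitions in one dimension, two actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(201, 1))
actions = rng.integers(0, 2, size=200)
h_next = states[1:, 0] ** 2
estimate = approx_expectation(h_next, states, actions, np.array([0.0]), 1, b=0.5)
```

Because the weights are non-negative and sum to one, the estimate is always a convex combination of the observed successor values, which is the property the convergence argument relies on.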
3 "Self-Approximating Property" 
As we illustrated above, kernel-based reinforcement learning formally amounts to
substituting the approximate expectation operator $\hat{\Gamma}_{m,a}$ for $\Gamma_a$ and then applying
dynamic programming to derive solutions to the approximate optimality equation
(3). In this section, we outline a practical implementation of this approach and
we present some of our theoretical results. In particular, we consider the relative
value iteration algorithm for average-cost MDPs that is described, for example, in
[Ber95]. This procedure iterates a variant of equation (1) to generate a sequence of
value function estimates, $h^k$, that eventually converge to a solution of (1) (or (3),
respectively). An important practical problem in continuous state MDPs is that the
intermediate functions $h^k$ need to be represented explicitly on a computer. This requires
some form of function approximation which may be numerically undesirable
and computationally burdensome in practice. In the case of kernel-based reinforcement
learning, the so-called "self-approximating" property allows for a much more
efficient implementation in vector format (see also [Rus97]). Specifically, because
our definition of $\hat{\Gamma}_{m,a} h$ in (5) only depends on the values of $h$ at the states in $\mathcal{S}$,
the AACOE (3) can be solved in two steps:
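$$\hat{\eta}_m + \hat{h}_m(z_i) = \min_a \{c(z_i, a) + (\hat{\Gamma}_{m,a} \hat{h}_m)(z_i)\}, \quad i = 1, \ldots, m, \tag{7}$$
$$\hat{h}_m(x) = \min_a \{c(x, a) + (\hat{\Gamma}_{m,a} \hat{h}_m)(x)\} - \hat{\eta}_m. \tag{8}$$
In other words, we first determine the values of $\hat{h}_m$ at the points in $\mathcal{S}$ using (7)
and then compute the values at new locations $x$ in a second step using (8). Note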
that (7) is a finite equation system by contrast to (3). By introducing the vectors 
and matrices h(i) -- h,(zi), c(i) -- c(zi), 42(i,j) -- k,(zj,zi) for i - 
1,..., m and j - 1,..., m, the relative value iteration algorithm can thus be written 
conveniently as (for details, see [Ber95, OG00])' 
h min(ca + (ah), h +   
:: :: - 
a 
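A minimal NumPy sketch of relative value iteration on the sample states may make the vector form concrete. It follows the standard scheme in [Ber95] of subtracting the value at a fixed reference state after each backup. The function name, the dense array layout, the stopping tolerance, and the choice of reference index are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relative_value_iteration(c, Phi, ref=0, tol=1e-10, max_iter=10_000):
    """Iterate h <- min_a(c_a + Phi_a h), then subtract the reference entry.

    c   : array (M, m), c[a, i] = cost of action a at sample state i
    Phi : array (M, m, m), each Phi[a] a row-stochastic weighting matrix
    Returns (eta, h, policy): the average-cost estimate, differential
    values normalized so that h[ref] == 0, and the greedy action at each
    sample state.
    """
    h = np.zeros(c.shape[1])
    for _ in range(max_iter):
        q = c + Phi @ h              # shape (M, m): q[a] = c_a + Phi_a h
        h_new = q.min(axis=0)        # one dynamic programming backup
        eta = h_new[ref]             # offset measured at the reference state
        h_new = h_new - eta
        if np.max(np.abs(h_new - h)) < tol:
            return eta, h_new, q.argmin(axis=0)
        h = h_new
    raise RuntimeError("no convergence within max_iter")

# Usage on made-up inputs: two actions, six sample states.
rng = np.random.default_rng(1)
Phi = rng.uniform(0.1, 1.0, size=(2, 6, 6))
Phi /= Phi.sum(axis=2, keepdims=True)    # make every row sum to one
c = rng.uniform(0.0, 1.0, size=(2, 6))
eta, h, policy = relative_value_iteration(c, Phi)
```

At the returned fixed point, `eta + h` agrees with `(c + Phi @ h).min(axis=0)` up to the tolerance, which is the finite system of optimality equations restricted to the sample states.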
Hence we end up with an algorithm that is analogous to value iteration except that
we use the weighting matrix $\Phi_a$ in place of the usual transition probabilities and $\hat{h}$
and $c_a$ are vectors of points in the training set $\mathcal{S}$ as opposed to vectors of states.
Intuitively, (9) assigns value estimates to the states in the sample trajectory and
updates these estimates in an iterative fashion. Here the update of each state is
based on a local average over the costs and values of the samples in its neighborhood.
Since $\Phi_a(i, j) \geq 0$ and $\sum_{j=1}^{m} \Phi_a(i, j) = 1$ we can further exploit the analogy between
(9) and the usual value iteration in an "artificial" MDP with transition probabilities
$\Phi_a$ to prove the following theorem:
Theorem 1 The relative value iteration (9) converges to a unique fixed point. 
For details, the reader is referred to [OS00, OG00]. Note that Theorem 1 illustrates 
a rather unique property of kernel-based reinforcement learning by comparison to 
alternative approaches. In addition, we can show that - under suitable regularity 
conditions - kernel-based reinforcement learning is consistent in the following sense: 
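Theorem 2 The approximate optimal cost $\hat{\eta}_m$ converges to the true optimal cost
$\eta^*$ in the sense that
$$|\hat{\eta}_m - \eta^*| \to 0 \quad \text{as } m \to \infty.$$
Also the true cost of the approximate strategy $\hat{\mu}_m$ converges to the optimal cost:
$$|\eta_{\hat{\mu}_m} - \eta^*| \to 0 \quad \text{as } m \to \infty.$$
Hence $\hat{\mu}_m$ performs as well as $\mu^*$ asymptotically and we can also predict the optimal
cost $\eta^*$ using $\hat{\eta}_m$. From a practical standpoint, Theorem 2 asserts that the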
performance of approximate dynamic programming can be improved by increasing 
the amount of training data. Note, however, that the computational complexity 
of approximate dynamic programming depends on the sample size m. In detail, 
the complexity of a single application of (9) is $O(m^2)$ in a naive implementation
and $O(m \log m)$ in a more elaborate nearest neighbor approach. This complexity
issue prevents the use of very large data sets using the "exact" algorithm described
above. As in the case of parametric reinforcement learning, we can of course restrict
ourselves to a fixed amount of computational resources simply by discarding observations
from the training data or by summarizing clusters of data using "sufficient
statistics". Note that the convergence property in Theorem 1 remains unaffected
by such an approximation. 
4 Optimal Portfolio Choice 
In this section, we describe the practical application of kernel-based reinforcement 
learning to an investment problem where an agent in a financial market decides 
whether to buy or sell stocks depending on the market situation. In the finance 
and economics literature, this task is known as "optimal portfolio choice" and has 
generated an enormous literature over the past decades. Formally, let $S_t$ symbolize
the value of the stock at time $t$ and let the investor choose her portfolio $a_t$ from the
set $A = \{0, 0.1, 0.2, \ldots, 1\}$, corresponding to the relative amount of wealth invested
in stocks as opposed to an alternative riskless asset. At time $t + 1$, the stock price
changes from $S_t$ to $S_{t+1}$, and the portfolio of the investor participates in the price
movement depending on her investment choice. Formally, if her wealth at time $t$ is
$W_t$, it becomes $W_{t+1} = \left(1 + a_t \frac{S_{t+1} - S_t}{S_t}\right) W_t$ at time $t + 1$. To render this simulation
as realistic as possible, our investor is assumed to be risk-averse in that her fear of
losses dominates her appreciation of gains of equal magnitude. A standard way to
express these preferences formally is to aim at maximizing the expectation of a concave
"utility function", $U(z)$, of the final wealth $W_T$. Using the choice $U(z) = \log(z)$,
the investor's utility can be written, up to the additive constant $\log W_0$, as
$U(W_T) = \sum_{t=0}^{T-1} \log\left(1 + a_t \frac{S_{t+1} - S_t}{S_t}\right)$. Hence
utilities are additive over time, and the objective of maximizing $E[U(W_T)]$ can be
stated in an average-cost framework where $c(x, a) = -E_{x,a}\left[\log\left(1 + a \frac{S_{t+1} - S_t}{S_t}\right)\right]$.
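The additivity of log utility over time can be checked numerically with a small sketch. The function names and the price and portfolio values below are made up purely for illustration; only the wealth recursion and the log utility are taken from the text.

```python
import numpy as np

def wealth_path(prices, fractions, w0=1.0):
    """Wealth recursion W_{t+1} = (1 + a_t (S_{t+1} - S_t) / S_t) W_t."""
    simple_returns = np.diff(prices) / prices[:-1]
    growth = 1.0 + fractions * simple_returns
    return w0 * np.concatenate(([1.0], np.cumprod(growth)))

def period_utilities(prices, fractions):
    """Per-period log utilities log(1 + a_t (S_{t+1} - S_t) / S_t)."""
    simple_returns = np.diff(prices) / prices[:-1]
    return np.log1p(fractions * simple_returns)

prices = np.array([100.0, 102.0, 99.0, 101.0])   # made-up stock prices
fractions = np.array([0.5, 1.0, 0.0])            # made-up portfolio choices a_t
w = wealth_path(prices, fractions, w0=100.0)

# The per-period utilities sum to log(W_T / W_0), so maximizing expected
# log wealth is the same as minimizing the average of the per-period costs.
assert np.isclose(period_utilities(prices, fractions).sum(),
                  np.log(w[-1] / w[0]))
```

The negated per-period utility plays the role of the immediate cost in the average-cost formulation.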
We present results using simulated and real stock prices. With regard to the simulated
data, we adopt the common assumption in finance literature that stock prices
are driven by an Ito process with stochastic, mean-reverting volatility:
$$dS_t = \mu S_t\,dt + \sqrt{v_t}\,S_t\,dB_t,$$
$$dv_t = \phi(\bar{v} - v_t)\,dt + \rho \sqrt{v_t}\,d\tilde{B}_t.$$
Here $v_t$ is the time-varying volatility, and $B_t$ and $\tilde{B}_t$ are independent Brownian motions.
The parameters of the model are $\mu = 1.03$, $\bar{v} = 0.3$, $\phi = 10.0$, and $\rho = 5.0$. We
simulated daily data for the period of 13 years using the usual Euler approximation 
of these equations. The resulting stock prices, volatilities, and returns are shown in 
Figure 1. Next, we grouped the simulated time series into 10 sets of training and 
Figure 1: The simulated time-series of stock prices (left), volatility (middle), and
daily returns (right; $r_t = \log(S_t / S_{t-1})$) over a period of one year.
test data such that the last 10 years are used as 10 test sets and the three years 
preceding each test year are used as training data. Table 1 reports the training and 
test performances on each of these experiments using kernel-based reinforcement 
learning and a benchmark buy & hold strategy. Performance is measured using

Year    Reinforcement Learning       Buy & Hold
        Training     Test            Training     Test
4       0.129753     0.096555        0.058819     0.052533
5       0.125742     0.107905        0.043107     0.081395
6       0.100265    -0.074588        0.053755    -0.064981
7       0.059405     0.201186        0.018023     0.172968
8       0.082622     0.227161        0.041410     0.197319
9       0.077856     0.098172        0.074632     0.092312
10      0.136525     0.199804        0.137416     0.194993
11      0.145992     0.121507        0.147065     0.118656
12      0.126052    -0.018110        0.125978    -0.017869
13      0.127900    -0.022748        0.077196    -0.029886

Table 1: Investment performance on the simulated data (initial wealth $W_0 = 100$).

the Sharpe-ratio which is a standard measure of risk-adjusted investment performance.
In detail, the Sharpe-ratio is defined as $SR = \log(W_T / W_0) / \hat{\sigma}$ where $\hat{\sigma}$ is
the standard deviation of $\log(W_t / W_{t-1})$ over time. Note that large values indicate
good risk-adjusted performance in years of positive growth, whereas negative values
cannot readily be interpreted. We used the root of the volatility (standardized
to zero mean and unit variance) as input information and determined a suitable
choice for the bandwidth parameter ($b = 1$) experimentally. Our results in Table 1
demonstrate that reinforcement learning dominates buy & hold in eight out of ten
years on the training set and in all seven years with positive growth on the test set.
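For concreteness, the performance measure described above, total log growth divided by the standard deviation of the per-period log growth, can be sketched as follows. The function name and the use of the sample standard deviation (`ddof=1`) are our assumptions; the text does not say which estimator was used, and the measure is not annualized here.

```python
import numpy as np

def sharpe_ratio(wealth):
    """log(W_T / W_0) divided by the std of log(W_t / W_{t-1}).

    wealth : array (T+1,), wealth path W_0..W_T with positive entries.
    The sample standard deviation (ddof=1) is an assumed choice.
    """
    log_growth = np.diff(np.log(wealth))
    sigma = log_growth.std(ddof=1)
    if sigma == 0.0:
        raise ValueError("wealth path has no variation")
    return log_growth.sum() / sigma

# Usage on a made-up wealth path with positive overall growth.
wealth = np.array([100.0, 110.0, 99.0, 108.9])
sr = sharpe_ratio(wealth)
```

As the text notes, the ratio is only easy to interpret when the numerator is positive: for a loss-making year, a larger standard deviation moves the ratio closer to zero, which would wrongly look like an improvement.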
Table 2 shows the results of an experiment where we replaced the artificial time 
series with eight years of daily German stock index data (DAX index, 1993-2000). 
We used the years 1996-2000 as test data and the three years preceding each test 
year for training. As the model input, we computed an approximation of the (root-)
volatility using a geometric average of historical returns. Note that the training
performance of reinforcement learning always dominates the buy & hold strategy,
and the test results are also superior to the benchmark except in the year 2000. 

Year    Reinforcement Learning       Buy & Hold
        Training     Test            Training     Test
1996    0.083925     0.173373        0.038818     0.120107
1997    0.119875     0.121583        0.119875     0.096369
1998    0.123927     0.079584        0.096183     0.035204
1999    0.141242     0.094807        0.035137     0.090541
2000    0.085236    -0.007878        0.081271     0.148203

Table 2: Investment performance on the DAX data.

5 Conclusions 
We presented a new, kernel-based reinforcement learning method that overcomes 
several important shortcomings of temporal-difference learning in continuous-state 
domains. In particular, we demonstrated that the new approach always converges 
to a unique approximation of the optimal policy and that the quality of this approx- 
imation improves with the amount of training data. Also, we described a financial 
application where our method consistently outperformed a benchmark model in an 
artificial and a real market scenario. While the optimal portfolio choice problem is 
relatively simple, it provides an impressive proof of concept by demonstrating the 
practical feasibility of our method. Efficient implementations of local averaging for 
large-scale problems have been discussed in the data mining community. Our work 
makes these methods applicable to reinforcement learning, which should be valuable 
to meet the real-time and dimensionality constraints of real-world problems. 
Acknowledgements. The work of Dirk Ormoneit was partly supported by the Deutsche 
Forschungsgemeinschaft. Saunak Sen helped with valuable discussions and suggestions.
References 
[Ber95] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1 and
2. Athena Scientific, 1995.
[BM95] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely
approximating the value function. In NIPS 7, 1995.
[Gor99] G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis,
Computer Science Department, Carnegie Mellon University, 1999.
[OG00] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost
problems. Working paper, Stanford University. In preparation.
[OS00] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning,
2001. To appear.
[Rus97] J. Rust. Using randomization to break the curse of dimensionality. Econometrica,
65(3):487-516, 1997.
[SMSM00] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods
for reinforcement learning with function approximation. In NIPS 12, 2000.
[TR96] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic
programming. Machine Learning, 22:59-94, 1996.
[TR99] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning.
Automatica, 35(11):1799-1808, 1999.
[VRK00] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In NIPS 12, 2000.
