Position Variance, Recurrence and Perceptual 
Learning 
Zhaoping Li Peter Dayan 
Gatsby Computational Neuroscience Unit 
17 Queen Square, London, England, WC1N 3AR. 
zhaoping@gatsby.ucl. ac.uk dayan@ gatsby.ucl. ac.uk 
Abstract 
Stimulus arrays are inevitably presented at different positions on the 
retina in visual tasks, even those that nominally require fixation. In par- 
ticular, this applies to many perceptual learning tasks. We show that per- 
ceptual inference or discrimination in the face of positional variance has a 
structurally different quality from inference about fixed position stimuli, 
involving a particular, quadratic, non-linearity rather than a purely lin- 
ear discrimination. We show the advantage taking this non-linearity into 
account has for discrimination, and suggest it as a role for recurrent con- 
nections in area V1, by demonstrating the superior discrimination perfor- 
mance of a recurrent network. We propose that learning the feedforward 
and recurrent neural connections for these tasks corresponds to the fast 
and slow components of learning observed in perceptual learning tasks. 
1 Introduction 
The field of perceptual learning in simple, but high precision, visual tasks (such as vernier 
acuity tasks) has produced many surprising results whose import for models has yet to be 
fully felt. A core of results is that there are two stages of learning, one fast, which happens 
over the first few trials, and another slow, which happens over multiple sessions, may in- 
volve REM sleep, and can last for months or even years (Fahle, 1994; Karni & Sagi, 1993; 
Fahle, Edelman, & Poggio 1995). Learning is surprisingly specific, in some cases being 
tied to the eye of origin of the input and rarely admitting generalisation across wide areas of 
space or between tasks that appear extremely similar, even involving the same early-stage 
detectors (eg Fahle, Edelman, & Poggio 1995; Fahle, 1994). For instance, improvement 
through learning on an orientation discrimination task does not lead to improvement on a 
vernier acuity task (Fahle 1997), even though both tasks presumably use the same orienta- 
tion selective striate cortical cells to process inputs. 
Of course, learning in human psychophysics is likely to involve plasticity in a large num- 
ber of different pans of the brain over various timescales. Previous studies (Poggio, Fahle, 
& Edelman 1992, Weiss, Edelman, & Fahle 1993) proposed phenomenological models of 
learning in a feedforward network architecture. In these models, the first stage units in 
the network receive the sensory inputs through the medium of basis functions relevant for 
the perceptual task. Over learning, a set of feedforward weights is acquired such that the 
weighted sum of the activities from the input units can be used to make an appropriate bi- 
nary decision, eg using a threshold. These models can account for some, but not all, obser- 
vations on perceptual learning (Fahle et al 1995). Since the activity of V1 units seems not 
to relate directly to behavioral decisions on these visual tasks, the feedforward connections 
A 
B 
30 
20 
x_ x 0 x+  
c 
-l+y y+ l+y ' x -02 - 0  2 
x 
Figure 1: Mid-point discrimination. A) Three bars are presented at x_, xo and x+. The task is to 
report which of the outer bars is closer to the central bar. y represents the variable placement of the 
stimulus array. B) Population activities in cortical cells evoked by the stimulus bars -- the activities 
ai is plotted against the preferred location xi of the cells. This comes from Gaussian tuning curves 
(k = 20; r = 0.1) and Poisson noise. There are 81 units whose preferred values are placed at regular 
intervals of Az = 0.05 between z = -2 and z = 2. 
must model processing beyond V1. The lack of generalisation between tasks that involve 
the same visual feature samplers suggests that the basis functions, eg the orientation selec- 
tive primary cortical cells that sample the inputs, do not change their sensitivity and shapes, 
eg their orientation selectivity or tuning widths. However, evidence such as the specificity 
of learning to the eye of origin and spatial location strongly suggest that lower visual ar- 
eas such as V1 are directly involved in learning. Indeed, V1 is a visual processor of quite 
some computational power (performing tasks such as segmentation, contour-integration, 
pop-out, noise removal) rather than being just a feedforward, linear, processing stage (eg 
Li, 1999; Pouget et al 1998). 
Here, we study a paradigmatic perceptual task from a statistical perspective. Rather than 
suggest particular learning rules, we seek to understand what it is about the structure of 
the task that might lead to two phases of learning (fast and slow), and thus what compu- 
tational job might be ascribed to V1 processing, in particular, the role of lateral recurrent 
connections. We agree with the general consensus that fast learning involves the feedfor- 
ward connections. However, by considering positional invariance for discrimination, we 
show that there is an inherently non-linear component to the overall task, which defeats 
feedforward algorithms. 
2 The bisection task 
Figure 1A shows the bisection task. Three bars are presented at horizontal positions xo = 
y+e,x_ = -l+y andx+ = l+y, where-1 << e << 1. Here y is a nuisance 
random number with zero mean, reflecting the variability in the position of stimulus array 
due to eye movements or other uncontrolled factors. The task for the subject is to report 
which of the outer bars is closer to the central bar, ie to report whether e is greater than 
or less than 0. The bars create a population-coded representation in V1 cells preferring 
vertical orientation. In figure lB, we show the activity of cells ai as a function of preferred 
topographic location xi of the cell; and, for simplicity, we ignore activities from other V1 
cells which prefer orientations other than vertical. 
We assume that the cortical response to the bars is additive, with mean 
ai(e,y) = f(xi - Xo) + f(xi - x_) + f(xi - x+) (1) 
(we often drop the dependence on e, y and write ai, or, for all the components, fi) where 
f is, say, a Gaussian, tuning curve with height k and tuning width r, f(x) = ke -x2/2-2, 
usually with - << 1. The net activity is ai = ai q- hi, where ni is a noise term. We assume 
that ni comes from a Poisson distribution and is independent across the units, and e and y 
have mean zero and are uniformly distributed in their respective ranges. 
The subject must report whether e is greater or less than 0 on the basis of the activities a. 
A normative way to do this is to calculate the probability P[ela] of e given a, and report by 
maximum likelihood (ME) that e > 0 if f>o de P[ela ] > 0.5. Without prior information 
about e, y, and with Poisson noise ni = ai -- i, we have 
P[e, yla ] = P[ale, y]P[e,y]/P(a ) or P[ale, y ] or H e-a(e'Y)(i(e'y))a 
i 
(2) 
3 Fixed position stimulus array 
When the stimulus array is in a fixed position y = 0, analysis is easy, and is very similar to 
that carried out by Seung & Sompolinsky (1993). Dropping y, we calculate log P[el a] and 
approximate it by Taylor expansion about e = 0 to second order in e: 
02log P[ale] :o (3) 
log P[ale ] ,- constant + e  log P[ale ] =o + 2o 2 ignoring higher order terms. Provided that the last term is negative (which it indeed is, 
almost surely), we derive an approximately Gaussian distribution 
P[ela ] cr exp[-(e- a)2/(2cr)] (4) 
cr 2 o log P[ale]l=o. Thus the 
2 02 logP[ale]l:o] and mean --   
with variance cr -- [- 3-/- 
subject should report that e > 0 or e < 0 if the test t(a) =  log P[ale]l_-o is greater or 
less than zero respectively. For the Poisson noise case we consider, log P[ale ] = constant+ 
Y4 ai log ai(e) since Y4 ai(e) is a constant, independent of e. Thus, 
t(a) = y ai  log i(e) =o (5) 
i 
Therefore, maximum likelihood discrimination can be implemented by a linear feedforward 
network mapping inputs ai through feedforward weights wi =  log ai to calculate as the 
output t(a) = Y4 wia. A threshold of 0 on t(a) provides the discrimination e > 0 if 
t(a) > 0 and e < 0 for t(a) < 0. The task therefore has an essentially linear character. 
Note that if the noise corrupting the activities is Gaussian, the weights should instead be 
Wi ---- Oe 
Figure 2A shows the optimal discrimination weights for the case of independent Poisson 
noise. The lower solid line in figure 2C shows optimal performance as a function of e. The 
error rate drops precipitately from 50% for very small (and thus difficult) e to almost 0, 
long before e approaches the tuning width r. 
It is also possible to learn weights in a variety of ways (eg Poggio, Fahle & Edelman, 1992; 
Weiss, Edelman & Fahle, 1993; Fable, Edelman & Poggio 1995;) Figure 2B shows dis- 
crimination weights learned using a simple error-correcting learning procedure, which are 
almost the same as the optimal weights and lead to performance that is essentially optimal 
(the lower dashed line in figure 2C) . We use error-correcting learning as a comparison 
technique below. 
4 Moveable stimulus array 
If the stimulus array can move around, ie if y is not necessarily 0, then the discrimination 
task gets considerably harder. The upper dotted line in figure 2C shows the (rather unfair) 
test of using the learned weights in figure 2B when y E [-.2, .2] varies uniformly. Clearly 
this has a highly detrimental effect on the quality of discrimination. Looking at the weight 
structure in figure 2A;B suggests an obvious reason for this - the weights associated with 
the outer bars are zero since they provide no information about e when y = 0, and the 
M L weights 
-2 0 2 2 
learned weights 
-2 0 o 
performance 
0.05 
Figure 2: A) The ML optimal discrimination weights w =  log gt (plotted as wi vs. xi) for 
deciding if e > 0 when y = 0. B) The learned discrimination weights w for the same deci- 
sion. During on line learning, random examples were selected with e C -2[-r, r] uniformly, 
r = 0.1, and the weights were adjusted online to maximise the log probability of generating 
the correct discrimination under a model in which the probability of declaring that e > 0 is 
o'(i wiai) = 1/(1 + exp(- i wiai)). C) Performance of the networks with ML (lower solid 
line) and learned (lower dashed line) weights as a function of e. Performance is measured by drawing 
a randomly given e and y, and assessing the %'age of trials the answer is incorrect. The upper dotted 
line shows the effect of drawing y C [-0.2, 0.2] uniformly, yet using the ML weights in (B) that 
assume y = 0. 
weights are finely balanced about 0, the mid-point of the outer bars, giving an unbiased or 
balanced discrimination on e. If the whole array can move, this balance will be destroyed, 
and all the above conclusions change. 
The equivalent of equation (3) when y  0 is 
( 2 o2 v2 o2 o2 ) logP[ale, Y]l,v=  
logP[e, Yla ]  constant + e +Yffv + T& + 2 ov 2 +eYo--b-7 
Thus, to second-order, a Gaussian distribution can approximate P[e, y[a]. Figure 3A shows 
the high quality of this approximation. Here, e and y are anti-correlated given activities a, 
because the information from the center stimulus bar only constrains their sum e + y. Of 
interest is the probability P[ela] = f dy log P[e, yla], which is approximately Gaussian 
with mean fip and variance p, where, under Poisson noise ni = ai - ai, 
02 log t'2/( 02 logit 02 log alle,y=o 
P7 = 0y 2 )--a' 2 ' 
0 2log a (which is the inverse variance of the Gaussian distribution of y that we 
Since-a. or2 integrated out) is positive, the appropriate test for the sign of e is 
(6) 
If t(a) > 0 then we should report e > 0, and conversely. Interestingly, t(a) is a very simple 
quadratic form 
r( 02 log i log  0 log i ' ( 02 log  ] 
t(a)=a.q.a--Y.ijaiaj[, Oy& )(Oy )-( & /' Oy 2 ) I,y--o 
(7) 
Therefore, the discrimination problem in the face of positional variance has a precisely 
quantifiable non-linear character. The quadratic test t(a) cannot be implemented by a 
linear feedforward architecture only, since the optimal boundary t(a) = 0 to separate the 
state space a for a decision is now curved. Writing t(a) = a. Q  a where the symmetric 
A 
0.2 
0.18 
0.14 
-0.02 
B 
exact approx  
0.02 e 0.08 -0.02 0.02 e 0.08 
c 
0.05 -0.05 
-2 0 2 -2 0 
Figure 3: Varying y. A) Posterior distribution P[e, yla]. Exact (left) P[e, yl a] for a particular a 
with true values e = 0.2r, y = 1.5r (with r = 0.1) and its bivariate Gaussian approximation (right). 
Only the relevant region of (e, y) space is shown - outside this, the probability mass is essentially 0 
(and the contour values are the same). B) The quadratic form Q, Qij vs. Zi and xj. C) The four 
eigenvectors of Q with non-zero eigenvalues (shown above). The eigenvalues come in q- pairs; the 
associated eigenvectors come in antisymmetric pairs. The absolute scale of Q and its eigenvalues is 
arbitrary. 
A ML errors B C linear/ML errors 
o 
50- 
y 
-3 -2 2 3 0 
0.05 0.2 
-1 0 1 
IOO0 
   100 
 10 
 
o.1 0.2 
Figure 4: y  0. A) Performance of the approximate inference based on the quadratic form of 
figure 3B in terms of %'age error as a function of lyl and (r -- 0.1). B) Feedforward weights, wi 
vs. zi, learned using the same procedure as in figure 2B, but with y E [--.2, .2] chosen uniformly 
at random. C) Ratio of error rates for the linear (weights from B) to the quadratic discrimination. 
Values that would be infinite are pegged at 20. 
form Oij = (O'ij + O'ji)/2, we find Q only has four non-zero eigenvalues, for the 4- 
o" log a o log a o log a and 
dimensional sub-space spanned by 4 vectors Oy& Oy o 
0  log fi 
or2 ,v=o. Q and its eigenvectors and eigenvalues are shown in Figure 3B;C. Note that 
if Gaussian rather than Poisson noise is used for ni = ai - i, the test t(a) is still quadratic. 
Using t(a) to infer  is sound for y up to two standard deviations (r) of the tuning curve 
f(x) away from 0, as shown in Figure 4A. By comparison, a feedforward network, of 
weights shown in figure 4B and learned using the same error-correcting learning procedure 
as above, gives substantially worse performance, even though it is better than the feedfor- 
ward net of Figure 2A;B. Figure 4C shows the ratio of the error rates for the linear to the 
quadratic decisions. The linear network is often dramatically worse, because it fails to take 
proper account of y. 
We originally suggested that recurrent interactions in the form of horizontal intra-cortical 
connections within V1 might be the site of the longer term improvement in behavior. Fig- 
ure 5 demonstrates the plausibility of this idea. Input activity (as in figure lB) is used 
to initialise the state u at time t = 0 of a recurrent network. The recurrent weights are 
A 
Decision 
Feedforward 
ght s % 
Input a 
Recurrent 
weights 
B recurrent weights 
-0 05 
-015 
C recurrent error 
D lin/rec error 
1 oo 
0.1 
Figure 5: Threshold linear recurrent network, its weights, and performance. See text. 
symmetric and shown in figure 5B. The network activities evolve according to 
dui/dt = -ui + Yj Jijg(uj) + ai (8) 
where Jij are the recurrent weight from unit j to i, g(u) = u if u > 0 and g(u) = 0 
for u _< 0. The network activities finally settle to an equilibrium u(t --> ec) (note that 
ui(t - oc) = ai when J = 0). The activity values u(t --> ec) of this equilibrium are fed 
through feedforward weights w, that are trained for this recurrent network just as for the 
pure feedforward case, to reach a decision Yi Witti(t -- Cx3). Figure 5C shows that using 
this network gives results that are almost invariant to y (as for the quadratic discriminator); 
and figure 5D shows that it generally outperforms the optimal linear discriminator by a large 
margin, albeit performing slightly worse than the quadratic form. The recurrent weight 
matrix is subject to three influences: (1) a short range interaction Jij for Ixi - xjl  r to 
stablize activities ai induced by a single bar in the input; (2) a longer range interaction Jij 
for I xi - xjl  i to mediate interaction between neighboring stimulus bars, amplifying the 
effects of the displacement signal e, and (3) a slight local interaction Jij for Ixl, Ixjl < 
r. The first two interaction components are translation invariant in the spatial range of 
xi, xj E [-2, 2] where the stimulus array appears, in order to accommodate the positional 
variance in y. The last component is not translation invariant and counters variations in y. 
5 Discussion 
The problem of position invariant discrimination is common to many perceptual learning 
tasks, including hyper-acuity tasks such as the standard line vernier, three-dot vernier, cur- 
vature vernier, and orientation vernier tasks (Fahle et al 1995, Fahle 1997). Hence, the 
issues we address and analyze here are of general relevance. In particular, our mathemat- 
ical formulation, derivations, and thus conclusions, are general and do not depend on any 
particular aspect of the bisection task. One essential problem in many of these tasks is 
to discriminate a stimulus variable e that depends only on the relative positions between 
the stimulus features, while the absolute position y of the whole stimulus array can vary 
between trials by an amount that is much larger than the discrimination threshold (or acu- 
ity) on e. The positional variable y may not have to correspond to the absolute position 
of the stimulus array, but merely to the error in the estimation of the absolute position of 
the stimulus by other neural areas. Our study suggests that although when y = 0 is fixed, 
the discrimination is easy and soluble by a linear, feedforward network, whose weights 
can be learnt in a straight-forward manner, when y is not fixed, optimal discrimination of 
e is based on an approximately quadratic function of the input activities, which cannot be 
implemented using a linear feedforward net. 
We also showed that a non-linear recurrent network, which is a close relative of a line at- 
tractor network, can perform much better than a pure feedforward network on the bisection 
task in the face of position variance. There is experimental evidence that lateral connections 
within V1 change after learning the bisection task (Gilbert 2000), although we have yet to 
construct an appropriate learning rule. We suggest that learning the recurrent weights for 
the nonlinear transform corresponds to the slow component in perceptual learning, while 
learning the feedforward weights corresponds to the fast component. The desired recurrent 
weights are expected to be much more difficult to learn, in the face of nonlinear transforms 
and (the easily unstable) recurrent dynamics. Further, the feedforward weights need to be 
adjusted further as the recurrent weights change the activities on which they work. 
The precise recurrent interactions in our network are very specific to the task and its pa- 
rameters. In particular, the range of the interactions is completely determined by the scale 
of spacing between stimulus bars; and the distance-dependent excitation and inhibition in 
the recurrent weights is determined by the nature of the bisection task. This may be why 
there is little transfer of learning between tasks, when the nature and the spatial scale of the 
task change, even if the same input units are involved. However, our recurrent interaction 
model does predict that transfer is likely when the spacing between the two outer bars (here 
at Aa: = 2) changes by a small fraction. Further, since the signs of the recurrent synapses 
change drastically with the distance between the interacting cells, negative transfer is likely 
between two bisection tasks of slightly different spatial scales. We are planning to test this 
prediction. 
Achieving selectivity at the same time as translation invariance is a very basic requirement 
for position-invariant object recognition (see Riesenhuber & Poggio 1999 for a recent dis- 
cussion), and arises in a pure form in this bisection task. Note, for instance, that trying 
to cope with different values of t by averaging spatially shifted versions of the optimal 
weights for t = 0 (figure 2A) would be hopeless, since this would erase (or at very least 
blur) the precise spatial positioning of the peaks and troughs which underlies the discrim- 
ination power. It would be possible to scan the input for the value of t that fits the best 
and then apply the discriminator centered about that value, and, indeed, this is conceptu- 
ally what the neocognitron (Fukushima 1980) and the MAX-model (Riesenhuber & Poggio 
1999) do using layers of linear and non-linear combination. In our case, we have shown, at 
least for fairly small t, that the optimal non-linearity for the task is a simple quadratic. 
Acknowledgements 
Funding is from the Gatsby Charitable Foundation. We are very grateful to Shimon Edel- 
man, Manfred Fahle and Maneesh Sahani for discussions. 
References 
[1] Kami A and Sagi D. Nature 365 250-252, 1993. 
[2] Fahle M. Edelman S. and Poggio T. Vision Res. 35 3003-3013, 1995. 
[3] Fahle M. Perception 23 411-427, (1994). And also Fahle M. Vis. Res. 37(14) 1885-1895, (1997). 
[4] Poggio T. Fahle M. and Edelman S. Science 256 1018-1021, 1992. 
[5] Weiss Y. Edelman S. and Fahle M. Neural Computation 5 695-718, 1993. 
[6] Li, Zhaoping Network: Computation in Neural Systems 10(2) 187-212, 1999. 
[7] Pouget A, Zhang K, Deneve S, Latham PE. Neural Cornput. 10(2):373-401, 1998. 
[8] Seung HS, Sompolinsky H. Proc Natl Acad Sci U S A. 90(22): 10749-53, 1993 
[9] Koch C. Biophysics of computation. Oxford University Press, 1999. 
[10] Gilbert C. Presentation at the Neural Dynamics Workshop, Gatsby Unit, 2/2000. 
[11] Riesenhuber M, Poggio T. NatNeurosci. 2(11):1019-25, 1999. 
[12] Fukushima, K. Biol. Cybern. 36 193-202, 1980. 
