Emergence of movement-sensitive neurons' properties by learning a sparse code for natural moving images

Rafal Bogacz
Dept. of Computer Science
University of Bristol
Bristol BS8 1UB, U.K.
R.Bogacz@bristol.ac.uk

Malcolm W. Brown
Dept. of Anatomy
University of Bristol
Bristol BS8 1TD, U.K.
M.W.Brown@bristol.ac.uk

Christophe Giraud-Carrier
Dept. of Computer Science
University of Bristol
Bristol BS8 1UB, U.K.
cgc@cs.bris.ac.uk
Abstract 
Olshausen & Field demonstrated that a learning algorithm that
attempts to generate a sparse code for natural scenes develops a
complete family of localised, oriented, bandpass receptive fields,
similar to those of 'simple cells' in V1. This paper describes an
algorithm which finds a sparse code for sequences of images that
preserves information about the input. When trained on natural
video sequences, this algorithm develops bases representing
movement in particular directions at particular speeds, similar to
the receptive fields of the movement-sensitive cells observed in
cortical visual areas. Furthermore, in contrast to previous
approaches to learning direction selectivity, the timing of neuronal
activity encodes the phase of the movement, so the precise timing
of spikes is crucially important to the information encoding.
1 Introduction 
It was suggested by Barlow [3] that the goal of early sensory processing is to reduce
redundancy in sensory information, and that the activity of sensory neurons encodes
independent features. Neural modelling can give some insight into how such neural
networks may learn and operate. Atick & Redlich [1] showed that training a neural
network on patches of natural images, aiming to remove pair-wise correlation
between neuronal responses, results in neurons having centre-surround receptive
fields resembling those of retinal ganglion neurons. Olshausen & Field [11,12]
demonstrated that a learning algorithm that attempts to generate a sparse code for
natural scenes, while preserving information about the visual input, develops a
complete family of localised, oriented, bandpass receptive fields, similar to those of
simple cells in V1. The activities of the neurons implementing this coding signal the
presence of edges, which are basic components of natural images. Olshausen &
Field chose their algorithm to create a sparse representation because such a
representation possesses a higher degree of statistical independence among its
outputs [11]. Similar receptive fields were also obtained by training a neural net so
as to make the responses of neurons as independent as possible [4]. Other authors
[14,16,5] have shown that direction selectivity of simple cells may also emerge from
unsupervised learning. However, there is no agreement on how the receptive fields
of neurons that encode movements are created.
This paper describes an algorithm which finds a sparse code for sequences of 
images that preserves the critical information about the input. This algorithm, 
trained on natural video images, develops bases representing movements in 
particular directions at particular speeds, similar to the receptive fields of the 
movement-sensitive cells observed in early visual areas [9,2]. The activities of the 
neurons implementing this encoding signal the presence of edges moving with 
certain speeds in certain directions, with each neuron having its preferred speed and 
direction. Furthermore, in contrast to all the previous approaches, the timing of 
neural activity encodes the movement's phase, so the precise timing of spikes is 
crucially important for information coding. 
The proposed algorithm is an extension of the one proposed by Olshausen & Field.
Hence it is a high-level algorithm which cannot be directly implemented in a
biologically plausible neural network. However, a plausible neural network
performing a similar task can be developed. The proposed algorithm is described in
Section 2. Sections 3 and 4 show the methods and the results of simulations. Finally, 
Section 5 discusses how the algorithm differs from the previous approaches, and the 
implications of the presented results. 
2 Description of the algorithm 
Since the proposed algorithm is an extension of the one described by Olshausen &
Field [11,12], this section starts with a brief introduction of the main ideas of their
algorithm. They assume that an image x can be represented in terms of a linear
superposition of basis functions A_i. For clarity of notation, let us represent both
images and bases as vectors created by concatenating rows of pixels as shown in
Figure 1, and let each number in the vector describe the brightness of the
corresponding pixel. Let the basis functions A_i form the columns of a matrix A. Let
the weightings of the above-mentioned linear superposition (which change from one
image to the next) be given by a vector s:

x = As    (1)

The image x may be encoded, for example, using the inverse transformation where
it exists. Hence, the image code s is determined by the choice of basis functions A_i.
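As a minimal sketch of Equation 1 (the patch size and the random, square basis matrix are illustrative assumptions, not the paper's settings), an image vector can be built from a code and the code recovered by the inverse transformation:

import numpy as np

rng = np.random.default_rng(0)

n_pixels = 9                               # a 3x3 patch, rows concatenated into a vector
A = rng.normal(size=(n_pixels, n_pixels))  # basis functions A_i as the columns of A
s = rng.normal(size=n_pixels)              # weightings of the superposition

x = A @ s                                  # Equation 1: x = A s

# with a square, invertible A, the code is recovered by the inverse transformation
s_recovered = np.linalg.solve(A, x)
assert np.allclose(s, s_recovered)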
Olshausen & Field [11,12] try to find bases that result in a code s that preserves
information about the original image x and that is sparse. Therefore, they minimise
the following cost function with respect to A, where \lambda denotes a constant
determining the importance of sparseness [11]:

E = -[preserved information in s about x] - \lambda[sparseness of s]    (2)
The algorithm proposed in this paper is similar, but it takes into consideration the
temporal order of images. Let us divide time into intervals (to be able to treat it as
discrete) and denote the image observed at time t and the code generated by x^t and
s^t, respectively. The Olshausen & Field algorithm assumes that the image x is a linear
superposition (mixture) of s. By contrast, our algorithm assumes that images are
convolved mixtures of s, i.e., s^t depends not only on x^t but also on x^{t-1}, x^{t-2}, ..., x^{t-(T-1)}
(i.e., s^t depends on T consecutive images). Therefore, each basis function may also be
Figure 1: Representing images as vectors.
Figure 2: Encoding of an image sequence. In the example, there are two basis
functions, each described by T = 3 vectors. The first basis encodes movement to
the right, the second encodes movement down. A sequence x of 6 images is
shown on the top and the corresponding code s below. A "spike" over a
coefficient s_i^t denotes that s_i^t = 1; the absence of a "spike" denotes s_i^t = 0.
represented as a sequence of vectors A_i^0, A_i^1, ..., A_i^{T-1} (corresponding to a sequence
of images). These vectors create the columns of the mixing matrices A^0, A^1, ..., A^{T-1}.
Each coefficient s_i^t describes how strongly the basis function A_i is present in the last
T images. This relationship is illustrated in Figure 2 and is expressed by Equation 3.
x^t = \sum_{\tau=0}^{T-1} A^\tau s^{t+\tau}    (3)
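The convolved mixing of Equation 3 can be sketched in Python as follows; the shapes, the random bases and codes, and the helper name mix are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)

T, n_pixels, n_bases, n_frames = 3, 9, 4, 6
A = rng.normal(size=(T, n_pixels, n_bases))       # A[tau] holds the matrix A^tau
s = rng.normal(size=(n_frames + T - 1, n_bases))  # codes s^t, one row per time step

def mix(A, s, n_frames):
    """x^t = sum over tau of A^tau s^(t+tau) (Equation 3), 0-based in t."""
    T = A.shape[0]
    return np.stack([sum(A[tau] @ s[t + tau] for tau in range(T))
                     for t in range(n_frames)])

x = mix(A, s, n_frames)    # a sequence of n_frames image vectors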
In the proposed algorithm, the basis functions A are also found by optimising the
cost function of Equation 2. The detailed method of this minimisation is described
below; this paragraph gives an overview. In each optimisation step, a sequence
x of P image patches is selected from a random position in the video sequence (P ≥
2T). Each optimisation step consists of two operations. First, the sequence
of coefficient vectors s which minimises the cost function E for the images x is
found. Second, the basis matrices A are modified in the direction opposite to the
gradient of E over A, thus decreasing the cost function. These two operations are
repeated for different sequences of image patches.
In Equation 2, the term "preserved information in s about x" expresses how well x 
may be reconstructed on the basis of s. In particular, it is defined as the negative of 
the square of the reconstruction error. The reconstruction error is the difference 
between the original image sequence x and the sequence of images r reconstructed 
from s. The sequence r may be reconstructed from s in the following way: 
r^t = \sum_{\tau=0}^{T-1} A^\tau s^{t+\tau}    (4)
The precise definition of the cost function is then given by: 
E = \sum_{t=T}^{P-T+1} \|x^t - r^t\|^2 + \lambda \sum_{t=T}^{P} \sum_j C\left(\frac{s_j^t}{\sigma}\right)    (5)
In Equation 5, C is a nonlinear function, and \sigma is a scaling constant. Images at the
start and end of the sequence (e.g., x^1, x^P) may share some bases with images not in
the sequence (e.g., x^0, x^{-1}, x^{P+1}). To avoid this problem, only the middle images are
reconstructed and only for them is the reconstruction error computed in the cost
function. In particular, only images from T to P-T+1 are reconstructed: since the
assumed length of the bases is T, those images contain only the bases whose other
parts are also contained in the sequence. Since only images from T to P-T+1 are
reconstructed, it is clear from Equation 4 that only the coefficients s^T to s^P need to be
found. These considerations explain the limits of the outer summations in both
terms of Equation 5.
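Equation 5 can be sketched in Python as below, using the C(x) = ln(1 + x^2) given later in Section 3; the 0-based indexing and the convention of passing in only the reconstructed middle frames are assumptions of the sketch:

import numpy as np

def C(u):
    # the nonlinear sparseness function of Section 3: C(x) = ln(1 + x^2)
    return np.log1p(u ** 2)

def cost(A, s, x_mid, lam, sigma):
    """Equation 5. A: (T, n_pixels, n_bases); s: (n_mid + T - 1, n_bases);
    x_mid: (n_mid, n_pixels), the middle frames that are reconstructed."""
    T = A.shape[0]
    r = np.stack([sum(A[tau] @ s[t + tau] for tau in range(T))
                  for t in range(len(x_mid))])                  # Equation 4
    return np.sum((x_mid - r) ** 2) + lam * np.sum(C(s / sigma))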
For each image sequence, in the first operation, the coefficients s^T, s^{T+1}, ..., s^P
minimising E are found using an optimisation method. Minus the gradient of E over
s is given by:

-\frac{\partial E}{\partial s_i^t} = 2 \sum_{\tau=0}^{T-1} (A_i^\tau)^T (x^{t-\tau} - r^{t-\tau}) - \frac{\lambda}{\sigma} C'\left(\frac{s_i^t}{\sigma}\right)    (6)

In the second operation, the bases A are modified so as to minimise E:

\Delta A_i^\tau = 2\eta \sum_{t=T}^{P-T+1} (x^t - r^t) s_i^{t+\tau}    (7)

In Equation 7, \eta denotes the learning rate. The vector length of each basis function
A_i is adapted over time so as to maintain equal variance on each coefficient s_i, in
exactly the same way as described in [12].
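Under the same conventions as the cost sketch above, Equations 6 and 7 can be written out as follows; the derivative C'(u) = 2u/(1 + u^2) follows from C(x) = ln(1 + x^2), and the shapes remain illustrative assumptions:

import numpy as np

def gradients(A, s, x_mid, lam, sigma, eta):
    """Returns dE/ds (Equation 6, negated) and the basis update (Equation 7)."""
    T = A.shape[0]
    n_mid = len(x_mid)
    r = np.stack([sum(A[tau] @ s[t + tau] for tau in range(T))
                  for t in range(n_mid)])
    err = x_mid - r                                   # the terms x^t - r^t

    u = s / sigma
    dE_ds = (lam / sigma) * (2 * u / (1 + u ** 2))    # C'(u) = 2u / (1 + u^2)
    for tau in range(T):
        # code row t + tau enters the reconstruction of frame t
        dE_ds[tau:tau + n_mid] -= 2 * err @ A[tau]

    # Equation 7: the step added to the bases, eta being the learning rate
    dA = np.stack([2 * eta * err.T @ s[tau:tau + n_mid] for tau in range(T)])
    return dE_ds, dA

In each optimisation step one would minimise E over s (the paper uses conjugate gradient, see Section 3) and then add the averaged dA to the bases.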
3 Methods of simulations 
The proposed algorithm was implemented in Matlab, except for finding the s minimising
E, which was implemented in C++ using the conjugate gradient method for the sake
of speed. The implementation used and modified the original code of Olshausen &
Field (downloaded from http://redwood.ucdavis.edu/bruno/sparsenet.html).
Many parameters of the proposed algorithm were taken from [11]. In particular,
C(x) = ln(1 + x^2), \sigma is the standard deviation of the pixels' colours in the images, \lambda is set
such that \lambda/\sigma = 0.14, and \eta = 1. \Delta A is averaged over 100 image sequences, and
hence the bases A are updated with the average of \Delta A every 100 optimisation steps.
The length of an image sequence P is set such that P = 3T.
The proposed algorithm was tested on two types of video sequences: 'toy' problems
and natural video sequences. Each of the toy sequences consisted of 10 frames of
100x100 pixels. Each sequence contained 20 moving lines. Each line was either
horizontal or vertical and one pixel thick, and was either black or white, which
corresponded to positive or negative values of the elements of the x vectors (the grey
background corresponded to zero). Each horizontal line moved up or down, and each
vertical line moved left or right, at a speed of one pixel per frame.
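A generator of such toy sequences might look as follows in Python; the wrap-around at the frame borders and the random seed are assumptions of the sketch:

import numpy as np

rng = np.random.default_rng(2)

def toy_sequence(n_frames=10, size=100, n_lines=20):
    frames = np.zeros((n_frames, size, size))
    for _ in range(n_lines):
        horizontal = rng.random() < 0.5        # horizontal or vertical line
        colour = rng.choice([-1.0, 1.0])       # black or white on a grey (zero) background
        velocity = rng.choice([-1, 1])         # one pixel per frame, either direction
        start = rng.integers(size)
        for t in range(n_frames):
            p = (start + velocity * t) % size  # assumed wrap-around at the borders
            if horizontal:
                frames[t, p, :] = colour       # horizontal lines move up or down
            else:
                frames[t, :, p] = colour       # vertical lines move left or right
    return frames

x = toy_sequence()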
Then the algorithm was tested on five natural video sequences showing moving 
people or animals. In each optimisation step, a sequence of image patches was 
selected from a randomly chosen video. The video sequences were preprocessed. 
First, to remove the static aspect of the images, the previous frame was subtracted
from each frame, i.e., each image encoded the difference between two successive
frames of the video. This simple operation reduces redundancy in the data, since
corresponding pixels in successive frames tend to have similar colours. An
analogous operation may be performed by the retina, since the ganglion cells 
typically respond to the changes in light intensity [10]. 
Then, to remove the pair-wise correlation between pixels of the same frame, Zero-
phase Component Analysis (ZCA) [4] was applied to each of the patches from the
selected sequence, i.e., x^t := W x^t, where W = <x^t (x^t)^T>^{-1/2}, i.e., W is equal to the
inverted square root of the covariance matrix of x. The filters in W have centre-
surround receptive fields resembling those of retinal ganglion neurons [4].
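The two preprocessing steps can be sketched together as follows; in practice W would be estimated from a large collection of patches, and the small regularisation constant added to the eigenvalues is an assumption of the sketch:

import numpy as np

def preprocess(frames):
    # Step 1: temporal differencing, so each image encodes the change
    # between two successive frames
    diff = frames[1:] - frames[:-1]
    patches = diff.reshape(len(diff), -1)     # flatten each frame into a vector

    # Step 2: ZCA whitening, x^t := W x^t with W = <x (x)^T>^(-1/2)
    cov = patches.T @ patches / len(patches)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-5)) @ eigvecs.T
    return patches @ W.T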
Figure 3: Bases found by the algorithm trained on the toy problem. a) optimal
solution, b) local-minimum solution, c) not converged (see Discussion).
4 Results of simulations 
First, the algorithm was tested on the moving lines ('toy' problem). The size of the
bases was chosen as 3 patches of 3x3 pixels each, and the number of bases as 4 (to
match the number of possible movements). Sample bases found during optimisation
are shown in Figure 3a,b. 20 simulations were performed; in 35% of cases the
algorithm found the optimal solution shown in Figure 3a, and in 65% of cases it
found a local-minimum solution similar to the one shown in Figure 3b. In the case
of Figure 3a, each basis encoded one of the 4 types of movement present in the
video. Hence, the elements of the generated code vector s are mutually independent
and the code is very sparse, resembling the one in the example of Figure 2.
Then, the algorithm was tested on natural video sequences. The size of the bases was
chosen as 4 frames of 10x10 pixels each, and the number of bases as 100 (to allow
adequate coverage of a large number of different movements). Figure 4a shows
sample resulting bases transformed through inverted pre-whitening, i.e., each square
in the sequence visualises W^{-1} A^\tau, where W is the ZCA matrix. The bases represent
movements in different directions at different speeds. Figure 4b shows the bases
in frequency space. They cover different spatial and temporal frequencies and hence
represent a wide range of movements. There is a weak negative correlation
between the maximum spatial and temporal frequencies of the bases (r = -0.20). Such a
weak negative correlation was also observed for the preferred spatial and temporal
frequencies of neurons (simple cells) in cat striate cortex [2] and is consistent with
the results of [16]. Figure 5 shows the code generated for a sample video sequence and
its reconstruction based on the largest s_i^t.
Figure 4: Bases found by the algorithm trained on natural video sequences.
a) sample bases, b) positions of the peaks in frequency space of the 100 bases
(temporal frequency plotted against summed spatial frequency).
Frequency resolution was enhanced by performing a 128x128x128-point FFT
with zero-padding of the 10x10x4 data. Summed spatial frequency is formed by
adding the maximum spatial frequencies in the x and y directions (as in [15]).
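The peak-finding behind Figure 4b can be sketched as follows; taking the argmax of the zero-padded spectrum's magnitude as the peak is my reading of the caption:

import numpy as np

def spectral_peak(basis, n=128):
    """Locate the spectral peak of a 10x10x4 basis via a zero-padded 3D FFT."""
    spectrum = np.abs(np.fft.fftn(basis, s=(n, n, n)))
    ix, iy, it = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    freqs = np.abs(np.fft.fftfreq(n))       # cycles per sample, sign discarded
    summed_spatial = freqs[ix] + freqs[iy]  # summed spatial frequency, as in the caption
    return summed_spatial, freqs[it]        # and the temporal frequency

basis = np.random.default_rng(4).normal(size=(10, 10, 4))
print(spectral_peak(basis))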
Figure 5: Representation of a sample video. A sequence x is shown at the top and
the 8 most commonly active bases A_i in a column on the left. Beneath the
sequence is shown the code s representing the activity of the 8 bases. The sequence
in the bottom right corner shows the reconstruction based on the thresholded code
shown above, in which coefficients s_i^t < (1/3) max s_i^t are set to 0.
5 Discussion 
The results of Section 4 show that learning sparse representations of natural moving 
images yields bases encoding movements in different directions with different 
speeds, thus resembling the receptive fields of cortical movement-sensitive neurons. 
The code of the video sequence is generated here not directly by a neural net, but by 
an optimisation algorithm. However, it has already been shown [15] that once the 
bases such as A have been found, the neural network may be easily constructed. The 
network needs to contain connections between neurons that respond at different (or 
adaptive) delays. Such a network would effect the separation of convolved mixtures. 
Algorithms for training such networks have already been proposed [15]. 
In the generated code, the order of the vectors s t is crucially important, hence, when 
a neural net performs a similar task, the precise timing of spike activity will carry a 
lot of information. Experimental observations indicate that a lot of information 
about moving stimuli is carried by the timing of neural activity in primate visual 
cortex [6]. Furthermore, as illustrated in Figure 5, this activity would be typically 
irregular, which is consistent with the lack of regularity in neuronal firing found in 
V5 during the processing of information about complex moving stimuli [6]. 
Other authors have shown that independent component analysis [16,5] or learning 
an efficient code [14] of natural image sequences yields receptive fields encoding 
movement. However, they treated time as no more than an additional dimension,
handled in the same way as the spatial ones. Thus they computed only the separation
of independent components but did not perform the deconvolution. Hence, in their
approach, the generated code for the sequence of images was only a vector of mean 
neuronal responses, with no information concerning how the neuronal activity 
changed with time. In contrast, in our approach a spatio-temporal code is generated. 
In their approach receptive fields are developed which encode different phases of 
the same movement. In contrast, in our approach the timing of activity encodes the 
phase of the movement. For example, Figure 3c shows the bases obtained for the 'toy'
problem for N = 4 using their approach. No reasonable bases were found because,
when time is treated as an extra 'spatial' dimension, there exist 20 independent
components (5 phases of each of the 4 movements) but only 4 bases were used. Hence,
our approach allows feature extraction using many fewer neurons. Furthermore, in
their approach, where neurons encode a particular phase of a movement, two
sequences shifted in time, and therefore shifted in phase, will usually activate
completely different neurons (a strategy unlikely to be used by the brain), while
in our approach they result in the same patterns of neuronal activity shifted in time.
The neurons in a network performing encoding similar to that of the proposed
algorithm would respond after the presentation of stimuli containing lines moving in
their preferred direction. Hypothetically, by lowering the threshold of such
neurons, they could be made to respond to static lines of the preferred orientation,
independently of the position of the line, so behaving like the complex cells in V1 [9].
This finding is consistent with the suggestion of Foldiak [7] that invariance (e.g., to
position) may be learnt from sequences in which stimuli undergo transformation.
Just as Independent Component Analysis has application in image de-noising [8], 
the proposed algorithm may find application in video de-noising. 
This paper shows that learning a sparse code for natural moving images yields
spatio-temporal basis functions similar to the receptive fields of movement-sensitive
cells, and a coding in which the precise timing of spikes is crucially important. A
similar algorithm was recently proposed independently by Olshausen [13].
Acknowledgement: This work was supported by an ORS grant and an MRC grant. 
References 
[1] Atick, J.J. 8: Redlich, A.N. (1993) Convergent algorithm for sensory receptive field 
development. Neural Computation 5:45-60. 
[2] Baker, C.L. (1990) Spatial- and temporal-frequency selectivity as a basis for velocity 
preference in cat striate cortex neurons. Visual Neuroscience 2:101-113. 
[3] Barlow, H.B. (1989) Unsupervised learning. Neural Computation 1:295-311. 
[4] Bell, A.J. 8: Sejnowski, T.J. (1996) Edges are the 'Independent Components' of Natural 
Scenes. Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press. 
[5] Blais, B., Cooper, L.N. 8: Shouval H. (2000) Formation of direction selectivity in natural 
scene environment. Neural Computation 12:1057-1066. 
[6] Buracas, G.T., Zador, A., DeWesse, M. 8: Albright, T.D. (1998) Efficient discrimination 
of temporal patterns by motion sensitive neurons in primate visual cortex. Neuron 20:959-69. 
[7] Foldiak, P. (1991) Learning invariance from transformation sequences. Neural 
Computation 6:194-200. 
[8] Hyvfirinen, A. (1999) Sparse Code Shrinkage: Denoising of Nongaussian Data by 
Maximum Likelihood Estimation. Neural Computation 11:1739-1768. 
[9] Livingstone, M.S. 8: Hubel, D.H. (1987) Psychophysical evidence for separate channels 
for the perception of form, colour, movement and depth. Journal of Neuroscience 7:3416-68. 
[10] Meister, M. 8: Berry II, M.J. (1999) The neural code of the retina. Neuron 22:435-450. 
[11] Olshausen, B.A. 8: Field, D.J. (1996) Emergence of simple-cell receptive field 
properties by learning a sparse code for natural images. Nature 381:607-609. 
[12] Olshausen, B.A. 8: Field, D.J. (1997) Sparse coding with an overcomplete basis set: A 
strategy employed by V1 ? Vision Research 37:3311-3325. 
[13] Olshausen, B.A. (2000) Sparse coding of time-varying natural images. ICA'2000 
Proceedings, June 19-22, 2000, Helsinki, Finland, 603-608. 
[14] Rao, R.P.N. 8: Ballard, D.H. (1997) Efficient encoding of natural time varying images 
produces oriented space-time receptive fields. Technical Report, University of Rochester. 
[15] Torkkola, K. (1996) Blind separation of convolved sources based on information 
maximisation. IEEE Workshop on Neural Networks for Signal Processing, Kyoto, 423-432. 
[16] van Hateten, J.H. 8: Ruderman, D.L. (1998) Independent component analysis of natural 
image sequences yields spatio-temporal filters similar to simple cells in primary visual 
cortex. Proceeding of the Royal Society of London, B. 265:2315-2320. 
