Facial Expression Recognition using neural networks

By Ben Wong

December 5th, 2000

1. Introduction

Human-computer interaction will be much more effective if a computer know the emotional state of human. Facial expression contains much information about emotion. So if we can recognize facial expressions, we will know something about the emotion. However, it is difficult to categorize facial expressions from static images.
Neural network may be suitable in this problem because it can improve its performance given more examples. Moreover, we do not need to know much about the features of the facial expressions to build the system. The system will generalize the features itself, given enough examples.
In this paper, we investigate the performance of neural network on this problem, at the same time compare different ways of training the network.
I found that the best performance system is a 2 hidden layers specially designed system. It has a hit rate of about 65.38%. The best training method used is back-propagation with a learning rate of 0.1 and momentum factor of 0 to 0.1. In section 2, I will talk about the problem and the algorithms used to obtain the result. The evaluation methodology, the data and the result will be presented in section 3. Section 4 is about future work. Finally, I will talk about the conclusion in section 5.


2. Problem Definition and Algorithm

2.1 Task Definition
The input to the system is a 96x72 pixels 8-bit gray-scale image containing a human face.
The outputs are 4 numbers. Each number represents a facial expression (neutral, smile, angry, scream). The number is 1 if that facial expression is present and 0 otherwise.
There are many different facial expressions, but I only choose four of them. The facial expressions chosen are representative because they can tell the emotions without much ambiguity. Besides, the expressions are quite different from each other, which should make the network training easier. Although the resulting system can only recognize 4 expressions, it provides a basis for more complex system that can recognize more different expressions.

2.2 Algorithm Definition
I use multiple layers feed-forward neural network with backpropagation training.
More specifically, the network has an input layer, some hidden layers and an output layer.
After the input layer received its input, for each successive layers, each node will compute it output based on the following equation:
Oj = f((? xi*wij) - tj),
where Oj is the output of node j, wij is the weight of the link from node i to j, xi is the output from node i (node i is in the previous layer linking to node j), tj is the threshold of node j, and f is the activation function. The activation used is a sigmoid function f(x) = 1/(1+exp(-x)).
The output from each node is the input to nodes in the next layer. The output are computed and passed layer by layer until the output layer, where we get the results. This is the feed-forward algorithm.

The backpropagation algorithm is as follow:
1. Use feed-forward algorithm to compute the output
2. Update the weights layer by layer, start from the output layer back to the input layer.
For each weight wkj:
Dj = Errj*f'(inj) for weights linking the output layer
= f'(inj) ?wjiDi for other layers (node i is in layer next to node j)
dwkj (new) = a dwkj (old) - m*ak*Dj
wkj = wkj + dwkj(new),
where Errj is the error at output node j, which is target output - output from feed-forward,
f'(x) is the derivative of f(x), inj is the total input to node j - threshold of node j,
wkj is the weight of the link from node k to j, ak is the output from node k,
dwkj (old) is the amount of the last update of weight, initially 0,
a is the momentum factor, and m is the learning rate.

Beside the parameters mentioned in the above algorithm (w, t, a, m), there are still many other parameters. Networks with different parameters are implemented. They are explained below:

A. Initial weight and threshold
The threshold can be treated as a weight with input -1, so that the algorithm will update it just like other weight. We can set the initial weights to 0 or randomly generate them. We may delete links that have little contribution (close to 0), so if we choose zero initial weight, those "useless" will converge to their small value faster. Randomly generated weights also have their advantage. Back-propagation is a kind of gradient descent search, so it may be trapped in a local minimum. Randomizing the weights helps to avoid it. The sigmoid function changes very little when the magnitude of input is too large. Therefore, we choose the range of the random weights to be [-1,1].

B. Connections between layers
The layers can be fully connected or partially connected. Fully connected network has more links and is slower to run and to train. However, it should be able to do what a partially connected network can do, since it can set the "useless" links to 0. So we can first build a fully connected network, train it and then remove the close to zero links and the nodes that has no path to the output nodes.
For partially connected network, we can link the layers randomly. That is, for each possible pair of nodes between two adjacent layers, a link is present with probability p. We can also build a partially connected network using some knowledge of the problem, which is described in part E.

C. Relevant Input
The input size is 96*72=6912 pixels, which is large since there will be many links with even a small hidden layer. In our example images, the faces only occupy small regions of the images. Moreover, the important features of facial expressions are located around the eyebrows and the mouth. So we can get our input from these region only. Taking into account that the face positions vary from different images, the relevant input size becomes 1011 pixels.


a sample image relevant input

D. Number of hidden layers and hidden layer size
The number of hidden layers may affect the performance, but in general, one should be enough. For the hidden layer size, if it is too large, the network tends to adapt to the particular details of the training set. The performance will not be good. If it is too small, there will be overfitting and the generalization ability will be poor. So we have to experiment different hidden layer size.
E. Specially designed network
I have also build a network using some knowledge of the problem. As mentioned above, the important features of facial expressions are located around the eyebrows and the mouth. We can build feature detectors around these regions. Specially, I divide the input into 3 regions: two eyebrows and the mouth (see figure below). The first hidden layer has 3 groups that correspond to the 3 regions. Each group has 4 subgroups that corresponds to the 4 features (neutral, smile, angry and scream) Suppose an eyebrow of size ew x eh and the mouth is of size about mw x mh. Each node in the first group is connected to an ew x eh region in the first eyebrow region. Each node in the second group is connected to an ew x eh region in the second eyebrow region. Each node in the third group is connected to an mw x mh region in the mouth region (see figure below). Moreover, each node in the same subgroup has the same initial set of weights. (We would like to make each node in the same subgroup has the same initial set of weights during training, but we will have to find a way to impose the constraint, so the set of weights are not the same during training.) Now, each subgroup is like a feature detector. We can build one or two more hidden layer to finish the network.

an eyebrow area the other eyebrow area the mouth area
|||||||….||||| ||||||…|||||| ||||||…||||||
subgroup 1, 2, 3, 4 subgroup 5, 6, 7, 8 subgroup 9, 10, 11, 12
||| … ||| … |||
the next layer
|||
output layer

One guess of the network is to assume that a feature detector can be trained to output a 1 if the feature is present at that position and 0 otherwise. The second hidden layer has 12 nodes, each links to a subgroup in the previous layer. The input to these nodes will be 1 if the feature is present. The output nodes can link to the second hidden layer to select the facial expression in a way analogy to testing if (f1^f2^f3^f4^f5^...^ f12) is true, where fi are the features. This can be done in 1 layer.

F. Learning rate and momentum factor
The learning should be small enough to allow the network to converge to the minimum and large enough so that the training is not too slow. The momentum term can smooth the update of weight if the successive weight update is oscillatory. This should lead to faster convergence.

G. Input and output normalization
The input is in range [0,255]. However, the sigmoid function is very close to one when the input is >10. So we have to normalize the input. The first way is to divide the input by 255 so that the resulting range is [0,1]. The second way is to to compute xk = (Ik - mean(I))/s, where xk is the normalized input, Ik is the original input, mean(I) is the mean of all input, and s is the root mean square of all input. The resulting input will have 0 mean and 1 variance.
For the output, we can use the output from the feed-forward algorithm. We can also use yi=exp(vi)/ ? exp(vi) as the output, where vi are the output from feed-forward algorithm.
This way, the sum of output is 1 and we can interpret the output as the probability that the input is having a certain facial expression.

Since back-propagation is a gradient descent search, it may be trapped in a local minimum. Therefore simulated annealing may be use to find the weights of the network, as it is not easily trapped in local minimum. We do not use genetic algorithm since the memory requirement for storing a number of networks is too large. In the simulated annealing, we randomly change the weights in the range [-1,1] as the neighbor of the current network. The range is [-1,1] because from results of back-propagation, the weights are within that range. We do not change the network structure as the neighbor since it is too time consuming to change the structure. For the specially designed network, the set of weights in a subgroup is kept the same is the weights are changed.
We cool the temperature at each iteration. The schedule is determined by the average uphill move u at high very temperature. If we want the initial probability of uphill move be 0.95 and the final probability be 10^-10, the parameters are estimated from the following equations:
Exp(-u/Tinitial) = 0.95, exp(-u/Tfinal) = 10^ -10, Tinitial*alpha^(n-1) = Tfinal,
where alpha is the cooling factor and n is the maximum number of iterations.
The cost function is the error when we present some data from the training to the system. So we have to minimize the cost.


3. Experimental Evaluation

3.1 Methodology
The image set used is taken from the Aleix face database. I have to shrink the images and convert them to gray-scale before putting them into our image set, which contains pictures of 58 different females and 74 different males. There are 4 different images (different expressions) for each person. Altogether there are 528 examples.
First we have to prepare a training set and a testing set. In choosing the training set, first separate the data set in to male and female. Then in each set, randomly pick some people. I use all the images of these people as the training set. The remaining is the testing set.
Since the image set is small with respect to the number of links in the network, we want as many data for training as possible. I choose about 10% for the data for testing, i.e. 7 males and 6 females in the test set.
Call all the images of a person 1 unit data. I train the system with 1 unit data from the training set each time. For simulated annealing, I only use some unit data from the training set for the cost function. The number of units varies since it affects the speed of training.
Since the training set is not large, an image may be used to train the system multiple times.
To get a performance measure, I use all the data from the testing set to get a performance measure. I do not train the network while I am testing it.

The target values are integer and the algorithm's outputs are not, since the sigmoid function can never produce 0 or 1 exactly. So it is not fair to use the sum of square errors as a performance measure.
We have a number of other choices:
i. Round up the output and use the percentage of hit for the testing set as a performance measure.
ii. Set the output with the maximum amount to 1 and the others to 0. Then use the percentage of hit for the testing set as a performance measure.
Set the normalized output with the maximum amount to 1 and the others to 0. Then use the percentage of hit for the testing set as a performance measure.
The more the percentage of hit is, the better the performance is.
Since it can be the case that the range of output from the algorithm is always < 0.5, choice i is not preferred. Choice ii and iii gives the same measures so we can choose any of them.
We can use the time the network takes to evaluate the testing set as another dimension of performance measure.

To compare the performance of different network, I vary the parameters mentioned in section 2.2. They are the independent variables. For each parameter setting, I train the system, record the best performance it can achieve. Then we know what network structure gives the best performance.
To compare the performance of different training method, I also record the amount of time needed to train a system to reach a certain performance. The training method that gives that best system and need the shortest time to train is the best.
After some training, I observe some over-training (more detail in the next section). So I reserve about ¼ of the training set to detect over-training. I train the system 10 unit data from the ¾ of the training set each time. Then I record a performance measure using all data from the ¼ training set (29 unit data). I keep track of the system that gives the best performance. I repeat the training and performance evaluation until the performance start getting worse. Then the system with the best performance is returned and it is evaluated with the true testing set.

3.2 Results and Discussion
Due to the large number of different parameter settings, I only compare a small subset of it.
The results for system that use all the input (6912 pixels) is poor it is slow and its performance cannot beat chance (25%). The result is also bad (cannot beat chance) if we use the first way to normalize the input (divide by 255). This is because the input to the first layer will be positive and very possibly a large number. The output from the first hidden layer will be almost 1. The results without output normalization are very similar to the results that use it. Therefore, I will concentrate on training systems that use relevant input (1011 pixels), with second type of input normalization and output normalization.
There is almost not improvement for using simulated annealing to train the systems. An exception is to use a back-propagation trained network as the initial network for simulated annealing. However, the improvement is only a few percent of hit rate. So I only present result that use back-propagation training.

Overtraining is observed in many runs. The performance oscillates when we increase the number of training. The amplitude is about 10%. Usually, overtraining will result in poorer performance. In our case, the performance oscillates may be because I reuse the training data so the performance re-improve. The training time below include the overtraining detection time, since the detection time is proportional to the training time without the detection, it will not affect our comparison.

Due to the large amount of results, I only present those that are representative.
The results:
0 hidden layer: performance is 25% for fully or partially connected network.
1 hidden layer: The performance for using randomize initial weight is much better than using 0 initial weight (>10% higher hit rate). So only results for random initial weights are presented.

Meaning of each column:
(1) is the size of hidden layer,
(2) the probability of having links between the hidden layers and the input layer. Links to the output layers are fully connected. (See 2.2 B), 1 means fully connected.
(3) is the learning rate,
(4) is the momentum factor,
(5) is best performance (hit rate) on the overtraining detection set (1/4 of the training set) in 2000*4 examples presentation, or before the performance decreases 20% from the best performance,
(6) is performance on the testing set
(7) is the execution time of the feed forward algorithm on the testing set, and
(8) is the training time required to reach best performance network.
(1) (2) (3) (4) (5) (%) (6) (%) (7) (ms) (8) (s)
10 1 0.1 0 56.89655 57.69231 495 225
0.1 0.1 50.86207 36.53846 524 264
0.3 0.1 43.96552 42.30769 628 123
0.3 0.3 47.41379 36.53846 519 131
0.3 0.5 44.82759 42.30769 570 292
300 1 0.1 0 69.82759 53.84615 4534 2557
0.1 0.1 52.58621 53.84615 4468 2889
0.3 0.1 31.89655 28.84615 4542 33
0.3 0.3 28.44827 19.23077 4554 287
0.7 0.1 25.00000 25.00000 4531 16
0.7 0.5 25.00000 25.00000 4489 16
Table 1

(1) (2) (3) (4) (5) (%) (6) (%) (7) (ms) (8) (s)
10 1 0.1 0 56.89655 57.69231 495 225
0.5 0.1 0.1 58.62069 57.69231 399 237
50 1 0.1 0.1 60.34483 53.84615 1031 387
0.5 0.1 0 62.06897 61.53846 778 384
0.2 0.1 0 62.06897 63.46154 523 267
100 0.5 0.1 0 66.37931 61.53846 1011 663
300 1 0.1 0 69.82759 53.84615 4534 2557
0.5 0.1 0 61.20690 59.61538 2784 1686
600* * 0.1 0 25.86207 25.00000 2391 342
Table 2, showing the best result for each hidden layer size
(* means specially designed network with no other hidden layer other than the feature detectors, the feature detector layers has 600 nodes.)

2 hidden layers: I have only trained the specially designed network.
Now column 1 is the size of hidden layer below the feature detectors, which have 600 nodes.
(1) (2) (3) (4) (5) (%) (6) (%) (7) (ms) (8) (s)
10 0.5 0.1 0 57.75862 55.76923 2366 265
25 0.5 0.1 0.1 53.44828 55.76923 2386 185
50 0.2 0.1 0 66.37931 59.61538 2273 1038
100 0.2 0.1 0 60.34483 61.53846 2250 320
12* 1 0.1 0.1 67.24138 65.38462 2048 897
Table 3, showing the best result for each second hidden layer size.
(* means that each nodes links to a subgroup in the previous layer.)

Shrinking function is not good because after shrinking, the system performance badly. After retraining, it performs similar to a randomly generated, partially connected trained network of similar size. Therefore, result for shrinking is not presented.

From table 1, we can see that as learning rate increase, the training time is generally shorter, but the performance of the system decreases. It also appears that (from data not shown) when the momentum factor is equal or greater than the learning rate, the performance will be poorer. The best training parameters found is learning rate = 0.1 (or lower) and momentum factor = 0 or 0.1
From table 2, we can see that the best systems have links that are partially connected, usually with probability of linking about 0.2 - 0.5. Also notice that large system required longer training and the time for evaluation (feed-forward algorithm runtime) is longer. The best system with one hidden layer found has 50 nodes in the hidden layer with probability 0.2 of linking the nodes to input layer.
From table 3, we found that specially designed systems using 2 hidden layers perform similar to systems with 1 hidden layer in terms of hit rate. In terms of runtime, 2 hidden layers systems with different number of second hidden layer nodes are very similar. The runtime is quite constant. They are comparable to 1 hidden layer systems with about 100 fully connected hidden units.
The best system found is a specially designed system with 12 second hidden layer nodes each linking to a subgroup in the previous layer. 6its hit rate is 65.38462%, which beats the best system with 1 hidden layer (61.53846%). This shows that knowledge about the problem may be incorporated into a neural network to make it performs better.
Note that the best system runs 2 times slower than the best system with 1 hidden layer. So they may be a tradeoff between hit rate and runtime.

The best system found has a hit rate of 65.38462%. Compared to chance (25%), the result is not bad, and can also be considered as good given that our data set is not very good. Our images has not been normalized. To normalize is to translate, rotate and expand/shrink the face so that all the faces are at almost the same position and have the same size. As seen in the figures below, the face in (a) is too small, (c) should be rotated, and the face position in (f) is too low. So our data set is not very good.
Sometimes the facial expressions are hard to distinguish even for human. Can you tell the facial expressions in the figures? (Answers are on next page)


Figures a to f

Therefore, the performance of the network is quite good. The performance may also be further improved if we try different size of the feature detectors and more different parameters.

Finally, note that the performance on the overtraining detection set can be quite different from the performance on the test set (see table 1) This might indicate that our data are not general enough. Although there are more images in the overtraining detection set, we stick to using the testing set to get the performance measure to reduce bias.


4. Related Work

From the NEuroNet (see reference), people are doing a similar task. They use a backpropagation network of size (1295x5x4) to recognize the four facial expressions: neutral, angry, happy, and sad. Their input size is 35x37 pixels. They achieve a hit rate of 78%. However, their input images are normalized. So their system may not be better than our system. Turn the other way round, our system may perform better if we normalize our input first.


5. Future Work

There is much room for improvement for our system. We can try to normalize the images before training the system and try more different parameters to train the system. We could also find ways to shrink and expand the system automatically so that the training program can determine the best size and structure of the network. The simulated annealing method used has a neighborhood function that is randomly change the weights in the range [-1,1], we can try out different neighborhood type, such as changing the weights by a percentage of the original weights, to see if it really works bad on neural network.

6. Conclusion

The best neural network we found for facial expression recognition is a 2 hidden layers network that is built using some knowledge of the problem. It achieves a hit rate of about 65.38%, which is good given that the data set is not very good. Knowledge of a problem may be useful in building a good neural network. Simulated annealing does not work for training a neural network, backpropagation gives much better results. During backpropagation training, lower learning rate gives better performance but the training time is longer. The momentum factor should also be no larger than the learning rate.


7. Reference

A.M. Martinez and R. Benavente. The AR Face Database. CVC Technical Report #24, June 1998.

"NEuroNet Roadmap" July 18 2000.
Network of Excellence in Neural Networks.
Available: http://www.kcl.ac.uk/neuronet/about/roadmap/emotrec.html

Russell, Stuart and Norvig, Peter. Artificial Intelligence: A Modern Approach. New Jersey:
Prentice Hall, 1995.

"The AR Face Database" June 1998.
Aleix Face Database.
Available: http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html

Theodoridis, Sergios and Koutroumbas, Konstantinos. Pattern Recognition. New York:
Academic Press, 1999.

Answer: a. neutral, b. neutral, c. angry, d. neutral, e. angry, f. angry.