Color-based Face (and skin) Segmentation for Robotic Perceptual Control

 

Abstract

 

The purpose of this project is to extend the Intel CAMSHIFT face-segmentation and face-tracking algorithm to operate on still images and on image sequences from a non-stationary camera. The software developed also extends the original to use spatial information in addition to the hue value of human flesh alone, and to roughly estimate the skin pixels and their boundary so as to derive precise coordinates of the center of mass in three-dimensional space, including its orientation. The main goal of the project is to give a mobile robot the capability to locate humans in its close proximity and possibly to extract simple instructions by analyzing the motion trajectories of human flesh. For these reasons, the algorithm must be robust within its well-controlled working environment and must require minimal training to operate on a still image or a low-speed image sequence from a non-stationary camera. The program is designed to work reliably at a range of one to two meters and, unlike the original, should be able to segment more than one flesh area at a time. Within this operating range, most of the flesh visible will be facial area; hence, that is the area on which the experiments are primarily conducted.

 

Review of Previous Work

 

CAMSHIFT was designed to handle precise tracking of facial location with a stationary camera at a range of less than one meter. The algorithm relies heavily on the hue value of human flesh: human skin forms a very tight cluster in color space even when different races are considered [1]. CAMSHIFT operates on a color probability image and applies a non-parametric gradient density climbing method called the mean shift algorithm to re-center its operating window. By windowing the image, the algorithm tracks the location of the face and effectively reduces color segmentation noise by ignoring values outside its search window. Using a color model greatly cuts down false tracks in noisy environments, since color noise has a low probability of being flesh-colored [2]. The results show that the algorithm can tolerate up to 30% added Gaussian noise with reasonable performance degradation; thus, CAMSHIFT is able to handle noisy images without the need for extra filtering or adaptive smoothing [3].
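For reference, the core loop can be sketched with OpenCV's built-in CamShift, which implements the same mean shift window re-centering on a hue backprojection. This is a minimal illustration, not the original Intel code; the video path and initial face window below are assumptions.

    # Minimal sketch of the CAMSHIFT loop with OpenCV (not the original
    # Intel code). The video path and initial face window are assumptions.
    import cv2

    cap = cv2.VideoCapture("face.avi")          # hypothetical input video
    ok, frame = cap.read()
    track_window = (200, 150, 80, 100)          # assumed initial face box

    # Train a hue histogram on the initial face region.
    x, y, w, h = track_window
    roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    # Ignore dark or desaturated pixels, where hue is unreliable.
    mask = cv2.inRange(roi, (0, 60, 32), (180, 255, 255))
    hist = cv2.calcHist([roi], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Color probability image via backprojection of the hue histogram.
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # Mean shift climbs the probability gradient to re-center the window.
        rot_box, track_window = cv2.CamShift(prob, track_window, term)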

 

CAMSHIFT is considered a very good algorithm that suits its perceptual-interface purpose; it has serious problems, however, when it loses track of the face, which may occur when the tracked flesh moves too fast or the camera changes its aiming angle. Moreover, the algorithm relies on data obtained from a sequence of motion images and on initial training information, which means it cannot recover from such a fault. Fast face motions that can cause CAMSHIFT to fail include the head rolling down, where significant flesh pixels are quickly replaced by insignificant hair pixels. CAMSHIFT reports the following features: the spatial coordinates of the center of mass of the tracked face in three dimensions, and its orientation with respect to the x-y plane. Within its operating range, the algorithm is very accurate. The paper shows almost perfect accuracy in x and y translation when measured against a Polhemus device, and it also does a reasonably good job on z-axis translation and rotation.

 

The algorithm is designed to track a face within its search window. Distraction from other skin can also cause large distortion of the reported feature values, or, more seriously, loss of track of the object if the distracting object is large and causes a major drift in window location as mean shift climbs the heavily distorted probability gradient.

 

CAMSHIFT can segment its tracked object when the hue information is unique enough and the pixel intensities are not too low. When a pixel's intensity is low, its hue value cannot be read reliably because of a larger fluctuation range; in general, hue may read as any value when the intensity is zero. The algorithm still has to operate on image sequences, however, which does not suit the application this project aims for. The paper also strongly recommends discarding any pixels that are too dark, since attempting to correct and include them may cause large distortion in the color probability image. The segmentation performance of hue as a means of segmenting faces depends greatly on the illumination of the scene. The paper suggests that the performance of color-based segmentation can be improved by considering the hue value along with other spatial features, which is what this project attempts to do.
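The dark-pixel recommendation translates directly into a brightness mask applied before any hue is read. A minimal sketch follows; the 15% cutoff matches the hard-coded threshold mentioned later in this report, and the file name is a placeholder.

    # Sketch: discard too-dark pixels before reading hue. The 15% cutoff
    # matches the threshold this report mentions later; the file name is
    # a placeholder.
    import cv2
    import numpy as np

    bgr = cv2.imread("scene.png")
    hls = cv2.cvtColor(bgr, cv2.COLOR_BGR2HLS)      # OpenCV's HSL ordering
    hue, lum = hls[:, :, 0], hls[:, :, 1]
    reliable = lum > int(0.15 * 255)                # hue swings when too dark
    hue_masked = np.where(reliable, hue, 0).astype(np.uint8)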

 

References

 

[1][2][3] G. R. Bradski, "Computer Vision Face Tracking for Use in a Perceptual User Interface," Intel Technology Journal, Q2 1998. Microcomputer Research Lab, Intel Corporation, Santa Clara, CA.

 

D. Comaniciu and P. Meer, “Robust Analysis of Feature Spaces: Color Image Segmentation,” CVPR’97, pp. 750-755.

 

K. Sobottka and I. Pitas, "Segmentation and tracking of faces in color images," Proc. of the Second Intl. Conf. on Automatic Face and Gesture Recognition, pp. 236-241, 1996.

 

Needs Being Served

 

The need is for a vision algorithm that segments and approximates human facial and skin areas and their boundaries, along with some basic features of each segmented area such as its spatial coordinates, from a single image captured by a slow, non-stationary camera in a large but experimentally well-controlled area, for mobile-robot applications. Images captured from a moving camera cause major problems for most segmentation algorithms that rely on motion and inter-frame information. Moreover, the system must require only minimal instance-based training, since such training would cause major downtime for the robot being trained. The algorithm developed in this project serves this purpose by extending CAMSHIFT's color-probability-image segmentation with additional spatial features. A detailed discussion of the inner workings is given in the following sections.

 

Overview of the document

 

The document consists of a detailed explanation of the implemented algorithm, a description of the experiments, test results measuring the accuracy of the reported feature values along with additional analysis of color segmentation performance, and the conclusion, including a listing of the software code developed as part of the project. Where applicable, the data is tested against the original CAMSHIFT algorithm. For capabilities that do not exist in the original and therefore cannot be compared directly, a large set of test data that attempts to cover as many cases as possible is explored.

 

Description of the Designed Experiment

 

The accuracy of the x- and y-axis coordinates is computed via the following method and compared against the CAMSHIFT algorithm.

 

·        Set up a pink-ball calibration mat at various distances from the camera and identify the pixel values at multiple specific locations

·        Center the face at each specific location and obtain the center-of-gravity reading from the software

·        Plot the actual versus measured values at multiple camera distances on a chart and calculate summary statistics of the errors

·        Approximate the system's measurement accuracy as a function of distance

 

The value of the z-axis is computed from the expected facial size at a given distance. The system is trained beforehand with the expected facial size at several distances; by measuring the size of the segmented facial area, it approximates the distance. The setup is similar to the experiment above, with a grid measured on the floor. The system can only estimate the distance by assuming that the flesh area found is facial area, since the algorithm has no capability to differentiate skin parts. We test the system by centering the face at various depths; the reported segmented area is then mapped to a distance from the camera, which is compared against the actual distance.
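The size-to-distance mapping can be sketched as follows. The calibration pairs are illustrative placeholders, not the values measured in the experiment; under a pinhole camera model the projected area falls off roughly as 1/z², so the interpolation is done in 1/√area.

    # Sketch of the size-to-distance mapping. Calibration pairs are
    # illustrative, not the experiment's measured values.
    import numpy as np

    calib_dist = np.array([1.0, 1.5, 2.0])           # meters
    calib_area = np.array([5200.0, 2400.0, 1300.0])  # face area in pixels

    def estimate_distance(area_px):
        # Under a pinhole model, area ~ 1/z^2, so 1/sqrt(area) ~ z;
        # interpolate linearly in that variable.
        inv = 1.0 / np.sqrt(calib_area)
        return float(np.interp(1.0 / np.sqrt(area_px), inv, calib_dist))

    print(estimate_distance(3000.0))   # ~1.3 m under these assumed pairs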

 

Orientation measurement is conducted by human observation of a hand placed on an elevated platform at known angles. The experiment is as follows:

 

·        Set up an elevated table at various angles

·        Measure the angle read by the system

·        Compare it against the actual angle

·        Repeat the measurement at other locations in the expected type of operating environment

 

The system reports the angle in the x-y plane only, not about the z-axis. Measurement of orientation about the z-axis is not implemented because the z-axis measurement, which rests heavily on the facial size-to-distance mapping model, is expected to be unreliable.
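For clarity, the x-y plane orientation is the standard moment-based angle used by CAMSHIFT. A sketch of the computation, where "prob.png" stands in for a region's probability image:

    # Sketch of the moment-based x-y orientation, as in CAMSHIFT;
    # "prob.png" stands in for a region's probability image.
    import cv2
    import numpy as np

    prob = cv2.imread("prob.png", cv2.IMREAD_GRAYSCALE)
    m = cv2.moments(prob)
    xc, yc = m["m10"] / m["m00"], m["m01"] / m["m00"]   # center of mass
    mu20 = m["m20"] / m["m00"] - xc * xc                # central 2nd moments
    mu02 = m["m02"] / m["m00"] - yc * yc
    mu11 = m["m11"] / m["m00"] - xc * yc
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)   # radians, x-y plane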

 

The segmentation performance test is based on a human-approximated count of misplaced pixels at various locations, performed on different racial skin types, and on the approximate dissimilarity between the segmented boundary and the actual boundary as observed by a human. The dissimilarity is calculated with a mean-square-error measurement over sample values randomly selected from significant points; measurement points are chosen where the error visibly peaks. The algorithm should be able to identify the number of flesh areas present in the scene at its operating range, and this is measured as well. The minimum distance, on average, at which two flesh areas are merged into a single area is also observed.
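The dissimilarity score amounts to a mean square error over corresponding boundary sample points. A sketch with placeholder point lists:

    # Sketch of the boundary dissimilarity score: mean square error over
    # corresponding sample points. Point lists are placeholders.
    import numpy as np

    segmented = np.array([[120, 80], [140, 85], [160, 95]])  # (x, y)
    reference = np.array([[118, 82], [143, 83], [155, 99]])  # human-marked
    mse = float(np.mean(np.sum((segmented - reference) ** 2, axis=1)))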

 

Description of the Developed Algorithm

 

First, the developed algorithm is extended to consider three extreme types of color that can potentially be found at different locations on the human face: a typical skin color found on the cheek, an unusually red spot found on the nose, and the extra yellow found around the neck area. The captured RGB image is converted to the HSL color space, and its hue channel is used to generate three color probability images based on three color probability distributions sampled at those local areas. Each pdf is derived from a histogram of the training data captured in the scene.
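A sketch of this stage using OpenCV histogram backprojection; the patch coordinates below are placeholders for the cheek, nose, and neck training samples, and OpenCV's HLS channel ordering is used for the HSL conversion.

    # Sketch of the three color probability images via histogram
    # backprojection. Patch coordinates are placeholders for the cheek,
    # nose, and neck training samples.
    import cv2

    def hue_backproject(hls_img, patch):
        hist = cv2.calcHist([patch], [0], None, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return cv2.calcBackProject([hls_img], [0], hist, [0, 180], 1)

    bgr = cv2.imread("scene.png")                 # hypothetical input
    hls = cv2.cvtColor(bgr, cv2.COLOR_BGR2HLS)    # OpenCV's HSL ordering
    cheek = hls[50:70, 50:70]                     # assumed sample patches
    nose = hls[80:95, 60:75]
    neck = hls[120:140, 55:80]
    probs = [hue_backproject(hls, p) for p in (cheek, nose, neck)]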

 

Second, each resulting image is clipped to zero wherever the probability is below a certain threshold. This is done to guard against a bad training data set containing many insignificant pixels. Each color probability image is then morphologically transformed with a "grayscale open" operation to break clustered background noise (areas that are not skin) into smaller pieces before being passed to a region analysis module.
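Both steps map onto standard OpenCV operations. A sketch, with the clipping threshold and kernel size chosen arbitrarily:

    # Sketch of the clipping and grayscale-open steps; the threshold and
    # kernel size are assumptions.
    import cv2

    prob = cv2.imread("prob.png", cv2.IMREAD_GRAYSCALE)  # one probability image
    prob[prob < 40] = 0                                  # clip weak probabilities
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    opened = cv2.morphologyEx(prob, cv2.MORPH_OPEN, kernel)  # break up noise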

 

The region analysis module performs region growing on each probability image and filters out regions whose size is too small or whose grayscale sum within the region is not large enough. The zeroth moments of both the grayscale and binary areas are used to reject large objects whose color is close to skin but not close enough. The center of gravity of each region in each probability image is then computed and used across the three images to link regions together when they are close enough, measured as a proportion of the maximum Feret size. If at least two regions located in different probability images are linked, the algorithm merges them into a single region.
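A sketch of the filtering part of this stage on one probability image, using connected components for the region growing; the two thresholds are assumptions.

    # Sketch of the region filtering: grow regions as connected components,
    # reject by binary area and grayscale sum (the two zeroth moments),
    # keep centers of gravity. Both thresholds are assumptions.
    import cv2
    import numpy as np

    prob = cv2.imread("prob.png", cv2.IMREAD_GRAYSCALE)
    binary = (prob > 0).astype(np.uint8)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

    regions = []
    for i in range(1, n):                          # label 0 is background
        area = stats[i, cv2.CC_STAT_AREA]          # binary zeroth moment
        gray_sum = int(prob[labels == i].sum())    # grayscale zeroth moment
        if area >= 100 and gray_sum >= 8000:
            regions.append(tuple(centroids[i]))    # center of gravity (x, y)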

 

Four best-fit corner coordinates, normalized to the region orientation, are then computed, and a rectangular masking image for each combined region is produced and used to filter out insignificant regions a second time, with a different set of minimum-size and minimum-intensity-sum requirements. A different way to mask the linked regions was also considered: a minimal polyline fit that would yield a complete best-fit convex hull representation of the linked region. Due to time limitations it was not completed, so only the approximated best-fit rectangle is used to describe the linked regions. It is approximated because the maximum Feret diameter, and the Feret diameter measured perpendicular to it, are themselves approximated.
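OpenCV's rotated bounding rectangle gives a comparable best-fit rectangle mask; the sketch below uses it in place of the report's Feret-based approximation, on an assumed binary mask of one linked region.

    # Sketch of the rectangle mask using OpenCV's rotated bounding box in
    # place of the report's Feret-based approximation; input is an assumed
    # binary mask of one linked region.
    import cv2
    import numpy as np

    binary = cv2.imread("region_mask.png", cv2.IMREAD_GRAYSCALE)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    rect = cv2.minAreaRect(contours[0])            # (center, (w, h), angle)
    corners = cv2.boxPoints(rect).astype(np.int32) # four best-fit corners
    mask = np.zeros_like(binary)
    cv2.fillConvexPoly(mask, corners, 255)         # rectangular masking image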

 

The final stage is a demonstration program, a simple motion analyzer, that uses some of the feature values reported by the algorithm. The analyzer observes whether there are two skin areas in the image sequence and whether the one on the right side contracts and expands significantly. Only size (z-axis information) and x and y spatial information are used by this motion analyzer.
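A sketch of the analyzer's decision logic; the 30% size swing that counts as "significant" is an assumed threshold.

    # Sketch of the analyzer's decision logic; the 30% size swing that
    # counts as "significant" is an assumed threshold.
    def right_region_gesture(frames_regions):
        # frames_regions: per frame, a list of (x_center, y_center, area).
        sizes = []
        for regions in frames_regions:
            if len(regions) != 2:            # only watch two-skin-area frames
                continue
            right = max(regions, key=lambda r: r[0])  # larger x = right side
            sizes.append(right[2])
        return bool(sizes) and max(sizes) > 1.3 * min(sizes)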

Results & Observations

 

The tracking of the x-axis and y-axis center-of-mass coordinates is very accurate. The results are comparable to those of the CAMSHIFT algorithm, with the advantage that the system finds the feature values in various scenes within the controlled environment.

 

As seen, 36 sample points are taken from a camera with the face centered at various x and y locations. The system provides accurate results at a distance of one meter. Another set of sample points is taken at different locations at a distance of 1.5 meters within the environment; the system again provides accurate results.

 

 

Sample Result Reading from Location 1:

 

 

 

Orientation is, however, less accurate. The system still provides good results, but with a larger standard deviation and mean error. Orientation works reliably in various scenes; z-axis estimation, however, does not. On moving to a new scene, the z-axis measurement varies considerably from the old setting. This is expected, since at a different scene location a different amount of illumination may cause the blob size to change significantly and distort the estimated distance.

 

About 13 distance readings and 17 orientation readings are taken at location 1. The following is the same type of measurement taken at location 3, where there is less illumination.

 

The following is another set of readings at location 3, which is the darkest spot in the room.

 

Statistical summary of the z-axis measurement

               Location 3      Location 2       Location 1
    STD DEV    32.3092329      34.42108016      7.049126344
    MEAN       4.986257524     -27.15387682     5.746882841

 

Statistical summary of the orientation measurement

               Location 3      Location 2       Location 1
    STD DEV    6.6719345       3.1989           2.79574646
    MEAN       2.5291284       2.82357634       2.9411575

 

Segmentation performance measurement is calculated based on the following criteria:

 

·        Estimated number of misplaced pixels

·        Region boundary comparison (sampling the top three to four maximum-error points) with the mean-square measurement method

 

This section is provided as a means of estimating the segmentation performance. The experiment does not focus on this part, since the boundary region is not the main measurement of the project. Only about 10 samples of different persons (of various races), taken in an operational scene, are analyzed. The samples are taken at ranges between 1 meter and 3 meters.

 

 

One main conclusion we can derive is that the number of misplaced pixels, as a ratio of the region size, decreases when the region is large. This is to be expected, because a large segmented object also indicates good segmentation performance to a certain extent. The boundary measurement is, however, arbitrary. This may be caused by the fact that only 5 sample points are taken for the comparison. A dedicated measurement program would have to be developed to do this task properly.

 

Algorithm under an Uncontrolled Environment

 

This section attempts to provide some measurement of how the system operates outside its controlled environment. As the results indicate, when the environment changes significantly, the system should be re-trained. The following shows how the system operates in a new scene with significant illumination changes.

 

 

 

The graph above shows the estimated result quality, measured as the percentage of readings that report X-axis and Y-axis coordinates within a 10% error margin, at operating illumination between 0% and 100%. The algorithm is hard-coded to cut off when illumination is lower than 15%. Once the image becomes too bright, a reliable hue value can no longer be obtained, causing the system to mis-segment the skin and yielding lower-quality results. As shown, quality starts degrading seriously at the 60-70% illumination point.

 

The performance of the motion analyzer is not analyzed, since it is used solely to demonstrate the use of the extracted features.

 

The following is an example of the program at work.

 

    

 

The image on the left: the square box is the best-fit rectangle normalized to the region angle.

On the right: the segmented color probability image used for the masking.

 

The above is the segmented image produced by applying the mask to the actual picture.

 

 

 

With the calibration map for the X-Y axis accuracy check, visible as the red and blue square boxes in the background.

 

 

 

The above two pictures are the segmentation mask and the segmented image.

 

Conclusion

 

Hue is a very important feature for segmenting facial and skin areas. It works reasonably well when used along with other features to classify human flesh; alone, however, it is not enough for precise skin segmentation. The algorithm identifies moment-based features accurately, since a precise boundary does not have to be determined to produce good data. The fact that color noise has a low probability of being a human flesh color is a very important property that keeps the algorithm robust to image noise in a correct setting, as long as the hue reads reliably and does not swing. The performance of the algorithm is comparable to that of CAMSHIFT; only insignificant degradation in feature quality is observed, owing to the extra capabilities added to make the system usable for more applications. Possible future work is to implement a Fourier descriptor or minimum-polyline fitting to better approximate the boundary of the color probability image for use as the mask, instead of the color probability image itself. This would solve the problems of non-convex curves and holes.