Facial expression recognition based on compressive sensing and pyramid processing

Abstract In this paper, a new approach is proposed for improved facial expression recognition. The approach is inspired by compressive sensing theory and by multi-resolution approaches to facial expression problems. Initially, each image sample is decomposed into the desired number of pyramid levels at different sizes and resolutions. At each level of the pyramid, features are extracted using a measurement matrix based on compressive sensing theory. These measurements are concatenated to form a feature vector for the original image. The results obtained with three distance measurement classifiers (Manhattan, Euclidean, Cosine) and a support vector machine are impressive and outperform most counterpart algorithms in the literature using the same databases and settings.


Introduction
Facial Expression Recognition (FER) is a branch of Pattern Recognition (PR) that has attracted growing interest from many walks of life in recent times. This can be attributed to developments in technology and to the human need for information and intelligence gathering. Emerging applications of FER include marketing, security, psychology, medical diagnosis, human-machine interaction and entertainment [1]. The algorithm flow for FER is not much different from that of its counterpart algorithms in PR. The steps are pre-processing, feature extraction, classification and decision. In FER, two major approaches are generally adopted for feature extraction: the component-based (holistic) approach and the feature-based (local) approach. In the former, the entire face image is used as input to extract features, while in the latter only some key points within the face image (e.g. eyes, nose, mouth, etc.) are used to take geometrical measurements and to capture localized information around them [2], [3].
Due to the demanding nature of FER, researchers have used complex techniques to obtain more robust and distinctive features for recognizing the various emotions. Multi-resolution algorithms such as the Gabor Wavelet Transform (GWT) and the Discrete Wavelet Transform (DWT), to mention but a few, are very common in FER and appear to have an edge over other feature extractors such as Local Binary Patterns (LBP), Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) [2]. The authors of [2] used a multi-resolution transform called the Curvelet Transform (CT) at different orientations and scales to form curvelet products, which were wrapped around their origin. The products were then used to extract curvelet coefficients using the inverse CT, and the coefficients were subsequently used as feature vectors. Although improved performance was reported, intensive computation is required to achieve it. Similarly, the authors of [3]-[8] used the GWT in one form or another to encode features for FER. For instance, [3] subjected the face images to local, multi-scale Gabor filter operations, and the resulting Gabor decompositions were then encoded using radial grids, imitating the topographical map structure of the Human Visual Cortex (HVC). Owing to the similarity of the Gabor filter response to that of the HVC, and to its performance, many local variants of these algorithms have been used recently [4], [7], [8]. In general, these algorithms come with additional computational cost and extensive memory usage, and in most cases dimensionality reduction of the feature vector becomes a necessity.
In this paper, a feature extraction algorithm based on Compressive Sensing (CS) theory and image pyramid processing is proposed. Initially, image pyramids are computed at different layers with a Gaussian approximation filter. The pyramid layers are concatenated. Final features are extracted from the concatenated vectors using CS theory. The approach combines multi-resolution capability, simplicity and higher performance.
The rest of the paper is organized as follows: Section II briefly discusses image pyramid processing and compressive sensing theory. Section III outlines the proposed approach and Section IV presents the experimental results. Section V concludes the paper.

Feature extraction
Feature extraction is a crucial stage in any pattern recognition problem. The techniques used to extract features from the original image samples are briefly explained below.

Image pyramid
The image pyramid is a powerful yet conceptually simple structure for representing an image at more than one resolution. The motivation for the image pyramid, and indeed for any other multi-resolution algorithm, is to view the salient features of an image at different resolutions, so that a feature which is difficult to detect at one resolution might easily be found at another [9]. For instance, small or low-contrast objects reveal more detail at higher resolutions, whereas for large or high-contrast objects within an image a coarse (lower) resolution is all that is needed [9]. Where both low- and high-contrast objects coexist (which is usually the case), viewing the image at different resolutions increases the chance of revealing more features.
An image pyramid is a collection of images of decreasing resolution arranged in a pyramidal shape. Moving up from the base of the pyramid, both the resolution and the size of the image decrease. Each level of the pyramid is obtained from the preceding level by applying an approximation (low-pass) filter and downsampling the result along both rows and columns. For applications such as image compression, the original image can be reconstructed from its pyramid using level residuals and interpolation filters [9]. Figures 1 and 2 show a schematic of image pyramid decomposition and an example of a sample image decomposed into a four-level pyramid. In the pyramid decomposition stage, a low-pass two-dimensional Gaussian filter can be used. The Gaussian kernel's weights can be constructed using (1). The mean of the filter is centered at the middle of the kernel, i.e. $(x, y) = (0, 0)$, where $x$ and $y$ are the kernel's horizontal and vertical coordinates, respectively, and $\sigma$ is the standard deviation of the filter.
$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \qquad (1)$$
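The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`gaussian_kernel`, `blur`, `gaussian_pyramid`), the edge-replicating border handling and the choice to count the original image as level 0 are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel centred at (0, 0) and normalised to sum to 1.

    `size` is assumed to be odd so that the kernel has a well-defined middle.
    """
    half = size // 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    kernel = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def blur(image, kernel):
    """Low-pass filter a 2-D image; borders are handled by edge replication (an assumption)."""
    half = kernel.shape[0] // 2
    rows, cols = image.shape
    padded = np.pad(image, half, mode="edge")
    out = np.zeros((rows, cols))
    # Accumulate shifted copies of the image weighted by the kernel entries.
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return out

def gaussian_pyramid(image, levels, kernel):
    """Return [level 0 (original), level 1, ..., level `levels`].

    Each new level is the previous one filtered and downsampled by 2 along rows and columns.
    """
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        smoothed = blur(pyramid[-1], kernel)
        pyramid.append(smoothed[::2, ::2])
    return pyramid
```

For example, `gaussian_pyramid(image, 2, gaussian_kernel(9, 2.0))` on a 256×256 image yields arrays of size 256×256, 128×128 and 64×64; the kernel size and standard deviation here are illustrative values only.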

Compressive sensing
The Nyquist-Shannon sampling theorem [10] is considered one of the cornerstone theorems of communication and digital signal processing. It shows how a continuous-time signal can be flawlessly recovered from its samples. The theorem states that a continuous-time signal can be reconstructed from its regularly spaced samples when the sampling frequency is at least twice its bandwidth [10]. Sampling at the Nyquist rate is easy to implement; however, efficiency becomes a major concern in terms of the amount of data to be collected, since sampling at the Nyquist rate usually produces a large number of data samples [11]. Related work by Shannon [12] in information theory used the idea of entropy to prove that the sampled data can be compressed to fewer samples, since substantial parts of it are redundant and do not contain the needed information.
Following this realization, a number of data compression algorithms evolved over time to trim down data redundancy. For example, in JPEG compression, the signal, which is the digitized 2-dimensional image sampled from the 2-dimensional image sensor grid, is transformed into the Discrete Cosine Transform (DCT) domain. In the DCT domain, most of the samples with small or negligible coefficients (i.e. scales or amplitudes) are thrown away [11], [13], [14].
Compressive sensing (CS) is a data compression technique with an even more radical approach to data compression [13]. In the transform domain, a signal of length $N$ is represented in terms of its scaling vector and its basis functions. The signal is $K$-sparse in this domain if its scaling vector has at most $K$ nonzero and $(N - K)$ zero coefficients. The crucial case in CS theory is $K \ll N$, i.e. the signal is compressible in the transform domain [11], [13], [14].
In CS, the aim is to reduce the number of regularly spaced samples taken from the original signal. This is achieved by taking values of the signal at certain places instead of all the samples. The new samples are referred to as compressed measurements $y$ and are calculated as $y = \Phi x = \Phi \Psi s$, where $\Phi$ is the $M \times N$ measurement matrix, $M \ll N$, and $s$ is the vector of scaling coefficients of the signal $x$ in the transform domain $\Psi$. Reconstruction of the original signal $x$ from its compressed measurements $y$ is achieved by solving an optimization problem, not simply by inverse transformation techniques [11].
An important property of the measurement matrix is that it does not need to have a specific structure, unlike transformation matrices or sampling matrices. In fact, the authors of [11] state that the measurement matrix only needs to satisfy the Restricted Isometry Property (RIP) for a given number of measurements. They also show that a random matrix whose elements are independent and identically distributed (i.i.d.) Gaussian random variables satisfies the RIP with high probability [11].
Recovering the original signal from these compressed measurements is a separate topic. Within the scope of this paper, the relevant part is the decomposition or sampling part of the CS framework. Since the compressed measurements are guaranteed to preserve the salient information of the original signal, they can be used as features. Therefore, reconstruction is not discussed here.
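The behaviour of a random Gaussian measurement matrix can be checked numerically. The sketch below is illustrative only: the sizes, the seed, the helper names (`sparse_signal`, `gaussian_measurement_matrix`) and the scaling of the entries by the square root of the number of measurements are assumptions for the example, not values taken from the paper. It builds a sparse signal, takes far fewer measurements than the signal length, and compares the energy of the measurements with the energy of the signal.

```python
import numpy as np

def sparse_signal(n, k, rng):
    """Return a length-n vector with exactly k nonzero (Gaussian) entries."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    return x

def gaussian_measurement_matrix(m, n, rng):
    """Return an m x n matrix with i.i.d. zero-mean Gaussian entries of variance 1/m."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
n, k, m = 1024, 10, 200             # signal length, sparsity, number of measurements

x = sparse_signal(n, k, rng)
phi = gaussian_measurement_matrix(m, n, rng)
y = phi @ x                         # compressed measurements, length m << n

# For a near-isometry this ratio stays close to 1.
energy_ratio = np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2
print(f"{m} measurements of a length-{n} signal, energy ratio = {energy_ratio:.3f}")
```

A single draw like this does not prove the RIP, which is a statement about all sparse vectors at once; it only illustrates why such measurements tend to preserve the geometry of sparse signals.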

Proposed approach
The proposed approach blends the simplicity and strength of multi-resolution algorithms with the data compression (dimensionality reduction) capabilities of CS. The general flow of the algorithm is shown in Table 1.
1. Given a training set of $M$ images, form the matrix $[I_1, I_2, \ldots, I_M]$, where $I_i$ is the column-vector version of the $i$-th image.
2. For each image in the matrix, generate its pyramid to a predefined $J$ levels using the Gaussian approximation filter.
3. For each level of the image pyramid, perform the following:
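The flow in Table 1 can be sketched as follows. Because the sub-steps of step 3 are not reproduced here, the per-level processing is filled in from the description in the abstract (measure each level with a CS measurement matrix, then concatenate), and should be read as an assumption. Other assumptions: only the reduced levels 1 to J are measured, every level contributes the same number of measurements, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalised size x size Gaussian weights centred on the kernel middle (size odd)."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    kernel = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def reduce_level(image, kernel):
    """Low-pass filter the image, then keep every second row and column."""
    half = kernel.shape[0] // 2
    rows, cols = image.shape
    padded = np.pad(image, half, mode="edge")
    smoothed = np.zeros((rows, cols))
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            smoothed += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return smoothed[::2, ::2]

def make_measurement_matrices(image_shape, n_levels, m_per_level, rng):
    """Draw one random Gaussian measurement matrix per pyramid level (drawn once, then reused)."""
    rows, cols = image_shape
    matrices = []
    for _ in range(n_levels):
        rows, cols = (rows + 1) // 2, (cols + 1) // 2   # size after downsampling by 2
        phi = rng.standard_normal((m_per_level, rows * cols)) / np.sqrt(m_per_level)
        matrices.append(phi)
    return matrices

def extract_features(image, kernel, matrices):
    """Measure each pyramid level with its matrix and concatenate the measurements."""
    level = np.asarray(image, dtype=float)
    parts = []
    for phi in matrices:
        level = reduce_level(level, kernel)
        parts.append(phi @ level.ravel())   # compressed measurements of this level
    return np.concatenate(parts)
```

The measurement matrices are generated once and shared by every training and test image; drawing a fresh random matrix per image would make the resulting feature vectors incomparable.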

Simulation results
Experiments were carried out on two different facial expression databases using the proposed approach, and the results were tabulated. Two standard methods of training and cross-validation were followed. Leave-One-Pose-Out (LOPO) cross-validation was used on the Japanese Female Facial Expression (JAFFE) database [5] and n-fold cross-validation on the Cohn-Kanade (CK) database [15]. Both methods are frequently used in the literature [1].

n-fold cross validation on CK.
LOPO cross-validation can be very exhaustive and computationally intensive [1] for a database with many samples, such as CK. Since the CK database is very large, containing 97 subjects and a total of 8795 sample images [16], 10-fold cross-validation was used. In 10-fold cross-validation, all the images were first grouped into seven classes based on expression (person-independent); then equal numbers of samples were randomly drawn (without duplication) from each expression class to form 10 equally sized groups, or folds. During training, one fold was used as the testing set and the remaining nine folds were used as the training set. The procedure was repeated 10 times, each time with a different fold. The recognition rate was taken as the average over the 10 runs.
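The fold construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, `evaluate` stands for any user-supplied routine that trains on one index list and returns the recognition rate on another, and samples left over after equal division of a class are simply dropped (an assumption made to keep the folds equally sized).

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds with the same number of samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)                    # random draw without duplication
        per_fold = len(indices) // n_folds      # leftover samples are dropped
        for f in range(n_folds):
            folds[f].extend(indices[f * per_fold:(f + 1) * per_fold])
    return folds

def cross_validate(folds, evaluate):
    """Use each fold once as the test set; return the average recognition rate."""
    rates = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rates.append(evaluate(train_idx, test_idx))
    return sum(rates) / len(rates)
```

With seven expression classes, each fold then holds the same number of samples from every class, so each test fold has the same class balance as the training folds.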
It is worth noting that the original CK database contains 640×490 samples with non-facial background information. In this experiment, all face samples were cropped to remove irrelevant background information and resized to 256×256. Since every expression sequence in the CK database runs from neutral to peak level, neutral samples were not included in any non-neutral expression class. Examples of preprocessed images from the CK database are shown in Figure 4. Tables 2 and 3 contain the simulation results of the proposed approach using LOPO on the JAFFE database. The results were obtained using levels 1 and 2 of the pyramid at different feature vector lengths and with four different classifiers, namely Manhattan, Euclidean, Cosine and support vector machine (SVM). Tables 4 and 5 present similar results on the CK database using 10-fold cross-validation. Results from other approaches are compared in Tables 6 and 7 with the results obtained in this work under the same scenario.
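The three distance measurement classifiers can be illustrated with a short sketch. It assumes that each test feature vector is assigned the label of its nearest training feature vector under the chosen distance (a nearest-neighbour rule); that rule, the function names, and the definition of cosine distance as one minus cosine similarity are assumptions for the example rather than details given in the text.

```python
import numpy as np

def manhattan(a, b):
    """Sum of absolute differences along the last axis."""
    return np.abs(a - b).sum(axis=-1)

def euclidean(a, b):
    """Square root of the sum of squared differences along the last axis."""
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

def cosine(a, b):
    """One minus the cosine similarity; assumes no all-zero feature vectors."""
    dot = (a * b).sum(axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - dot / norms

def nearest_neighbour(train_features, train_labels, query, distance):
    """Return the label of the training vector closest to `query`."""
    distances = distance(train_features, query)   # one distance per training row
    return train_labels[int(np.argmin(distances))]
```

For example, `nearest_neighbour(train_features, train_labels, query, cosine)` classifies one feature vector against a training matrix with one feature vector per row. Unlike the SVM, none of these rules has a training stage beyond storing the feature vectors.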

Discussion
The combination of CS and pyramid processing achieves better performance than its counterparts, with great simplicity. In Tables 2 to 5, for both the JAFFE and CK databases, increasing the number of pyramid layers up to a certain level improves performance. However, for the distance measurement classifiers the improvement is small, and increasing the number of pyramid layers beyond a certain point does not necessarily improve performance. In fact, experiments showed that 2 layers of the pyramid are adequate; going beyond that produced a slight degradation in performance. This may be because, for the sample images used, no discernible information could be extracted beyond the second layer of the pyramid. As a result, using many pyramid layers for those images may be more of a curse of dimensionality than an advantage. Moreover, while this is true for both types of classifier, the SVM responded to the increase in pyramid layers differently from the distance measurement classifiers. The results of the distance measurement classifiers improve only slightly when both the feature vector length and the number of pyramid layers are increased. This might be connected to the fact that they involve no learning that would allow them to adequately encode any additional information available at lower resolutions. In sharp contrast, the SVM classifier responds rapidly to increases in both the number of pyramid layers and the feature vector length. For instance, on the JAFFE database performance increases up to a feature vector length of 700, after which it begins to decline, whereas on the CK database shorter feature vectors appear to be more suitable.

Comparison
In Tables 6 and 7, the proposed approach is compared with other approaches in the literature under the same settings. The best result obtained with the proposed approach is used for the comparison.

Conclusion
A new approach for facial expression recognition has been developed. The proposed approach is simple and performs better than other approaches in terms of recognition rate and computational complexity. It edges out most of its counterparts in the literature, which further supports its validity.

Figure 2: A sample decomposed into 4-level pyramids using Gaussian approximation filter.
The constant experimental parameters used are listed below: filter size: 9 × 9; Gaussian filter mean: $\mu = 0$; Gaussian filter standard deviation: $\sigma = 20$.

Leave-One-Pose-Out cross-validation on JAFFE
The JAFFE database has 213 images from 10 subjects, each having 3 to 4 sample images per expression. 210 images were used in this work. Before training, all the samples were grouped into 7 classes based on expression content (i.e. person-independent training). One sample from each expression group was used as test data while the remaining samples were used as the training set. The process was repeated with different samples until each sample had been used once as testing data and otherwise as training data. The recognition rate was taken as the average over all the runs. Figure 3 shows a cross-section of the JAFFE database. The vertical columns represent person-independent expression classes, whereas the rows represent person-dependent expression classes for a particular subject.
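The Leave-One-Pose-Out rounds described above can be sketched as a generator of train/test index splits. This is one possible reading of the procedure: the name `lopo_rounds` is hypothetical, samples are taken in their stored order rather than at random, and the number of rounds is set by the smallest class, so that with 210 images in 7 equal classes every sample is tested exactly once.

```python
from collections import defaultdict

def lopo_rounds(labels):
    """Yield (train_indices, test_indices) pairs, one test sample per class in each round."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # With unequal class sizes, samples beyond the smallest class size are never tested.
    n_rounds = min(len(indices) for indices in by_class.values())
    for r in range(n_rounds):
        test = [indices[r] for indices in by_class.values()]
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, test
```

The recognition rate is then the average, over all rounds, of the fraction of test samples classified correctly.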

Table 6: Comparison with different approaches on JAFFE database.

Table 7: Comparison with different approaches on CK database.