Predicting and visualizing psychological attributions with a deep neural network
Edward Grant, Stephan Sahm, Mariam Zabihi, Marcel van Gerven
Radboud University, Nijmegen, The Netherlands
Email: [email protected], [email protected], [email protected], [email protected]
Equal contribution: Edward Grant, Stephan Sahm, Mariam Zabihi
Abstract—Judgments about personality based on facial appearance are strong effectors in social decision making, and are known to have impact in areas from presidential elections to jury decisions. Recent work has shown that it is possible to predict perception of memorability, trustworthiness, intelligence and other attributes in human face images. The most successful of these approaches require face images expertly annotated with key facial landmarks. We demonstrate a Convolutional Neural Network (CNN) model that is able to perform the same task without the need for landmark features, thereby greatly increasing efficiency. The model has high accuracy, surpassing human-level performance in some cases. Furthermore, we use a deconvolutional approach to visualize important features for perception of 22 attributes and demonstrate a new method for separately visualizing positive and negative features.
I. INTRODUCTION
Facial attributions for intelligence, attractiveness, dominance and trustworthiness have been shown to exhibit a strong effect on social decision making, with far-reaching consequences from choosing between presidential candidates to jury decisions in criminal legal cases [1], [2]. Despite this, reliably predicting how a face will be perceived has proven to be difficult. Some of the best current methods require images hand-annotated with facial landmark features, which is time consuming [3].

Given the success of CNNs in image recognition tasks [4], the CNN model is a natural choice for visual attribution of psychological characteristics. In addition to superior performance in vision tasks, CNNs are able to learn visual features from image data and do not require additional human input or hand-crafted features.

Previous work on modelling attribute perception was conducted by Khosla et al., who introduced a method for characterizing face images using key facial points, histogram information, SIFT features and hand-annotated landmark facial features [3]. Using these features, it was possible to accurately predict attributions for many psychological and demographic attributes. In contrast, we use a CNN to learn features directly from RGB images without the need for facial landmark annotations, which are time consuming and can introduce bias. Using these learned features, the CNN is able to predict attribution labels with high accuracy, surpassing human-level performance in some cases. In addition, we demonstrate a method to visualize general attribution features learned by the CNN.

One important distinction is between the perception of psychological attributes (attributions) and other tests for an attribute. Visual perception has important social consequences, but is not always a good indicator of more robust measurements for an attribute. For example, Rezlescu et al. showed that perceived trustworthiness is not significantly correlated with measured trustworthiness [5].
In contrast, Kleisner et al. showed that perceived intelligence is associated with measured intelligence in men but not women [6]. In this experiment we focus solely on perception of attributes.

Several methods exist for visualizing the features learned by a neural network. These methods can broadly be divided into approaches that require a target image to be forward propagated through the network before the activity of a target feature detector can be projected back into image space [7], [8], and image-free approaches that generate an image that maximizes a class score [9], [10]. The first type of approach has the benefit of visualizing features from a real example image; the second approach can be more general because it does not rely on a single image example. We use the first approach, but visualize the mean of many examples, thus retaining both the benefit of visualizing features from real images and the generality of an image-free approach. This is only possible because the images we used contain faces that have roughly the same pose. If this were not the case, the features could become obscured by each other.

To accomplish feature visualization we use the deconvnet proposed by Zeiler and Fergus [7]. In this approach, a target image is forward propagated through the network, and all activations except the targets are set to zero. The target activation is projected back into image space by passing the activation back through the network using deconvolution and a special kind of up-sampling. The projected image contains the image features most responsible for the target activation.

The resulting visualization represents all of the features important for an attribute. However, by manipulating the final-layer network weights, we are able to show that these features can be decomposed into their positive and negative components. Using this method we separately visualize both the positive and negative features for attributions.

Fig. 1. CNN schematic for attribution prediction. The network takes as input an image and outputs a binary class prediction. Within the network, each convolutional layer transforms the output of the previous layer using learned filters that are convolved across overlapping sub-regions. The output of each layer is subject to one or more of the following nonlinear transformations: rectified linearity, max pooling and softmax.

II. METHODS
A. Dataset
Our example set comprised the annotated subset of the 10K face database collected by Bainbridge et al. [11]. This set consists of 2222 face photographs as RGB images annotated with psychological and demographic labels. These data were collected in the form of ratings from 15 participants for psychological features and 13 participants for demographic features.
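The rating-to-label conversion and image normalization described in the Preprocessing subsection below can be sketched as follows. This is a minimal sketch: the function names and the median-split criterion are our assumptions, since the paper states only that each attribute was split into two equally sized classes.

```python
import numpy as np

def binarize_ratings(ratings):
    """Split continuous attribute ratings into two equally sized classes
    (0 = low, 1 = high). A median split over the sorted ratings is
    assumed here; the paper states only that the classes were balanced."""
    order = np.argsort(ratings)
    labels = np.zeros(len(ratings), dtype=int)
    labels[order[len(ratings) // 2:]] = 1  # upper half of ratings -> class 1
    return labels

def zero_center(images):
    """Zero-center a stack of images (N, H, W, C) by subtracting the
    mean pixel value computed over all images, as described in II-B."""
    return images - images.mean()
```

For the gender attribute, whose ratings are already binary, only the class balancing step (randomly dropping images from the larger class) would apply.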
B. Preprocessing
Each image was labelled as belonging to one of two equally sized classes for each attribute, based on its ratings for a target attribute. For gender, the ratings were already binary and the classes were balanced by removing images randomly from the larger class. Images were square cropped and resized to × pixels. Images were zero-centered by subtracting the mean pixel value from all images.

C. Training the Network
For each attribution, one CNN model was trained per fold of a cross-validation setting, such that each example was represented exactly once in the test set. The network comprised three convolutional layers, two fully connected layers and a softmax layer.

Training was performed using stochastic gradient descent with momentum, with a fixed batch size, learning rate and weight decay. The models were trained using MatConvNet [12] on a Tesla K80 GPU. See Figure 1 for a more detailed description of the network structure.

D. Performance Measures
Two performance measures were computed. The outputs from the CNN are the probabilities of an image belonging to the positive or the negative class. These values were thresholded and afterwards compared to the true binarization of the image set (individually for each attribution). The fraction of correct predictions was used to determine accuracy. As a baseline, a linear support vector machine (SVM) was trained on the same data, and corresponding accuracy measurements were obtained. Furthermore, single-human accuracy was computed using a leave-one-out strategy over the given (binarized) dataset.

Correlation values refer to standard correlation coefficients. They were computed between the CNN output probabilities and the continuous human assessments from the dataset. Again, single-human correlation was obtained by a leave-one-out procedure over the (continuous) dataset for the respective attribution. While accuracy values show the correctness of the predictions, the correlations show how the variation within the predictions mimics the true variation in the data.
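The two measures, together with the random accuracy baseline used in the Statistics subsection, can be sketched as below. The 0.5 decision threshold, the use of Pearson correlation and the simulation sizes are our assumptions; the exact values used in the study are not given here.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of correct predictions after thresholding the
    positive-class probabilities (a 0.5 threshold is assumed)."""
    return float(np.mean((probs >= 0.5).astype(int) == labels))

def correlation(probs, ratings):
    """Pearson correlation between CNN output probabilities and the
    continuous human assessments."""
    return float(np.corrcoef(probs, ratings)[0, 1])

def random_accuracy_baseline(n_images, n_samples=10000, seed=0):
    """Null distribution of accuracy for a random classifier: the mean
    of n_images independent fair coin flips, simulated n_samples times."""
    rng = np.random.default_rng(seed)
    flips = rng.integers(0, 2, size=(n_samples, n_images))
    return flips.mean(axis=1)
```

The empirical quantiles of `random_accuracy_baseline` then provide the critical values for the one-sided test against chance described in Section II-E.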
E. Statistics
CNN performance was tested for significance in two respects: whether it was significantly different from human performance, and whether it was significantly different from a random baseline.

The CNN performance values were consistent enough over trials to reasonably approximate them as constant. We therefore assume that training the CNN twice results in exactly the same prediction output, i.e. that no randomness is involved. Hence, to test the null hypothesis that the CNN performance could be produced by the baseline or by humans, we only need to estimate how probable it is that the same or higher performance is generated by the baseline or humans, respectively. For this, the respective distributions were approximated. As we are interested in significant improvements, the hypothesis test was one-sided. For performance beyond the critical value at the chosen alpha level, we rejected the null hypothesis and concluded that CNN performance is significantly better than baseline or human performance.

The accuracy measure as defined above counts the fraction of correct binary predictions. Hence, the random baseline is the mean of a corresponding number of independent coin flips: given the number of images in the test set, the baseline distribution is the mean of that many independent fair Bernoulli distributed random variables.

The distribution of the correlations is less simple to derive. A completely random CNN would output an arbitrary probability prediction, independently for each image. This corresponds to a vector of independent standard uniform random variables. The correlation measurement then demands computing the correlation between the CNN response and each attribution dataset individually. Instead of deriving an analytic expression for such a correlation of a random vector with a given data vector, it is easier to simulate these correlation values. For this study, random samples were generated from a uniform random vector, and each was correlated with all attribution datasets.

For comparison with human performance, the distribution of human performance was approximated. For both accuracy and correlations, a bootstrap estimation procedure was used. A short explanation of the bootstrap approach follows. Our dataset was generated by a small number of participants. Estimating the distribution of human performance over the whole population would mean taking a new (random) subset of participants and computing the same performance measure again and again, thereby simulating its distribution. As this is obviously not feasible, the method known as bootstrap approximation regards the given subset as the population itself. Instead of taking a new random subset of the whole world population, a random sub-subset of the given subset is chosen and the human performance measure computed. For this study, sampling with replacement was used and the same number of samples was generated.

TABLE I: Accuracy and correlations for psychological and demographic attributions. Abbreviations: acc - accuracies; corr - correlations. Values in bold stand for significant improvement compared to the respective human performance at the chosen alpha level. Attributions in bold denote significant improvement in both accuracy and correlation (again compared to human performance).

attributions  acc CNN  acc Human  acc SVM  corr CNN  corr Human
age           0.75     0.98       0.65     0.58      0.86
attractive    0.74     0.72       0.59
emotional     0.73
friendly      0.80
happy         0.84
kind          0.78
sociable      0.79

F. Visualizing Features Using Deconvolution
Attribution features were visualized using a deconvnet [7]. Using this method, the target image was first forward propagated through the network and the location of maximum activation for each pool stored. The activation for the target attribute was then passed back through the network using deconvolution. At each pooling layer, an approximate inverse to the pooling operation was performed by up-sampling the image and placing the feature map pixel values at the location of maximum activation previously stored during pooling.

The target activation is caused by positive and negative predictions from the previous layer, and so deconvolution visualizes an image with both positive and negative features. In order to separately visualize features that positively and negatively contribute to an attribution prediction, we set the negative weights in the final layer to zero to visualize the positive features, and set the positive weights to zero to visualize the negative features. This is possible because the final-layer nodes' inputs are the weighted sum of the previous layer's activations, which are all positive or zero because of the rectified linear activation function. By setting the positive or negative weights to zero in the final layer, the effect of the positive or negative features on the prediction is isolated. Using this method, we create three visualizations for each attribution: one contains all the features responsible for a prediction, one contains only negative features, and one contains only positive features.

The deconvolution approach is typically used to visualize the features of a single example image. Because most images contain faces in roughly the same pose, we visualized the mean of all feature representations for each attribute. This results in an image with the mean features for an attribution. This process was repeated for positive and negative features.

III. RESULTS
A. Performance of the CNN, Human and SVM Classifiers
The CNN accuracy was higher than SVM accuracy in all cases. CNN accuracy, as well as correlations between CNN predictions and human assessments, was significantly better than chance for all attributions. Compared to human correlations, a significant improvement was found for fourteen attributions: attractive, caring, common, confident, egotistic, emotional, emotional stability, friendly, happy, interesting, kind, responsible, sociable and trustworthy. Compared to human accuracies, the CNN was significantly better in predicting five attributions, namely emotional, friendly, happy, kind and sociable. See Table I for a summary.

B. Visualization of CNN Features Involved in Attribution
Table II visualizes the CNN features involved in attribution. Here we identify a number of salient properties of the visualizations.

As expected, the CNN positive features for a happy attribution look very much like a smile. In contrast, the negative features focus around the eyes and a down-turned mouth. These two feature sets compete with each other to produce a prediction for happy.

Some of the attribution features we visualize have been studied before, and we find a general coherence between previous findings and the CNN feature visualizations. The CNN features important for gender discrimination centered around the eyes and lips. This coheres with existing evidence that gender perception can be modulated by subtly changing the color of the lips and eyes in a gender-neutral image [13], [14].

Todorov et al. show that there is a significant correlation between human facial features and perceived trustworthiness. These features are mainly located at the center of the face, including the eyebrows, cheekbones, nose and chin. The shape of these features generates a spectrum of trustworthiness impressions: a longer, narrower nose indicated increased perceived trustworthiness, and the shape of the cheekbones was also found to be important [15]. The features for trustworthiness shown in Table II show that the nose and cheekbones are also important features for CNN attribution of trustworthiness.

We further observed striking differences in color hue and saturation between features for many attributes. Important features for age, gender and attractiveness are found mainly in the red and blue channels of the target image. The green channel contains important features for friendly and happy, and other attributes have features with a combination of colors. Features for sociable are mainly found in the red channel in the nose area and in the green channel around the eyes and mouth.
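The positive/negative decomposition underlying these visualizations relies on the final layer receiving only non-negative (rectified) activations, so the pre-softmax score splits exactly into a positive-weight part and a negative-weight part. A minimal numeric sketch (the layer size and the random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
activations = np.abs(rng.normal(size=8))  # ReLU outputs: always >= 0
weights = rng.normal(size=8)              # final-layer weights for one class

full = activations @ weights                                    # full score
pos = activations @ np.where(weights > 0, weights, 0.0)         # positive features only
neg = activations @ np.where(weights < 0, weights, 0.0)         # negative features only

# Because the activations are non-negative, the two parts sum exactly
# to the full score, and each part's sign is fixed:
assert np.isclose(full, pos + neg)
```

In the actual visualization procedure, the zeroed-weight variants of the network are each passed through the deconvnet, so the projected images isolate the features driving `pos` and `neg` respectively.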
Although it is interesting to see the color of the features learned by the network, it is advisable not to over-interpret these findings. Just because the network learns features in a specific color channel does not necessarily mean that similar features cannot be found in other channels. Further work is needed to determine the relevance of color in attribute perception, for example by comparing the predictive performance of the present model with that of a model trained only on gray-scale images.

IV. DISCUSSION
This work shows that deep neural networks can be used to accurately predict rated personality traits from face images, even surpassing human-level performance in some cases.

The coherence between CNN features for attribution and features found to be used by humans is interesting but not unexpected. CNNs and humans both exploit common structures in natural images to perform vision. In addition, CNNs are loosely inspired by real neural structures and, similar to CNNs, there is strong evidence that the human visual system is organized in a hierarchy of feature detectors of increasing complexity [16].

By visualizing positive and negative features separately, we showed that for some attributions all features are important, whereas for others, the positive or negative features are better indicators of the existence of an attribution. For example, the negative features for responsible are much more pronounced, whereas for memorable the positive features are more important (see Table II).

Considering the performance of the CNN compared to human performance, it is interesting that the correlation values are far more often significantly outperformed by the CNN than the accuracy values are. The CNN can replicate the variability within the assessments better than the overall classification. We used classification rather than regression to allow for binary feature representations using deconvolution, but a similar network trained using regression may yield superior performance for attribution accuracy.

Many of the positive and negative features for attributions discriminate between innate features such as face shape or distance between the eyes. Gender and age are good examples of such attributions. In contrast, for some attributions, positive and negative features appear to be discriminated by expressions. Good examples of these attributions are confident, friendly, happy, kind and emotional.
Smiling is a positive feature for each of these attributions, suggesting that perception of these attributions can be changed through facial expression, unlike gender and age. Other attributes contain a mixture of fixed and expressive features. Attractiveness appears to be influenced by both face shape and the orientation of the lips. A smile is a feature of attractiveness, suggesting that perception of attractiveness can be modulated through expression.

In conclusion, we have shown that CNNs can be used to predict rated psychological and demographic attributions and to analyze the visual features that contribute to the prediction of these attributions. This can have practical applications
as well as providing new insights into the psychological underpinnings of personality ratings.

TABLE II: Deconvolved features. Per attribution, the dataset was binarized into low versus high values and a deep neural network was trained for classification. Following training, two additional network variants were created, retaining only the positive (+) or negative (-) final-layer weights. This allows for separate visualization of positive and negative features for each attribution. The images displayed are the averaged deconvolution results over the whole test dataset. For instance, for the attribution gender, a pronounced mouth region is common for the class female, while the nose and eyebrows are important features for the class male. Attributions shown, each with combined, negative (-) and positive (+) feature visualizations: age (young/old), attractive, calm, caring, common, confident, egotistic, emotional, emotional stability, familiar, friendly, gender (female/male), happy, intelligent, interesting, kind, memorable, responsible, sociable, trustworthy, typical and weird.

ACKNOWLEDGMENT
We would like to thank Wilma Bainbridge for the Human Faces 10k dataset and Umut Güçlü for providing the deconvnet code.

REFERENCES

[1] Charles C. Ballew and Alexander Todorov. Predicting political elections from rapid and unreflective face judgments. Proceedings of the National Academy of Sciences, 104(46):17948–17953, 2007.
[2] Daniel E. Re, David W. Hunter, Vinet Coetzee, Bernard P. Tiddeman, Dengke Xiao, Lisa M. DeBruine, Benedict C. Jones, and David I. Perrett. Looking like a leader: facial shape predicts perceived height and leadership ability. PLoS ONE, 8(12):e80957, 2013.
[3] Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, and Aude Oliva. Modifying the memorability of face photographs. In International Conference on Computer Vision (ICCV), pages 3200–3207, 2013.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[5] Constantin Rezlescu, Brad Duchaine, Christopher Y. Olivola, and Nick Chater. Unfakeable facial configurations affect strategic choices in trust games with or without information about past behavior. PLoS ONE, 7(3):e34293, 2012.
[6] Karel Kleisner, Veronika Chvátalová, and Jaroslav Flegr. Perceived intelligence is associated with measured intelligence in men but not women. PLoS ONE, 9(3):e81237, 2014.
[7] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer, 2014.
[8] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[9] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[10] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436, 2015.
[11] Wilma A. Bainbridge, Phillip Isola, and Aude Oliva. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General, 142(4):1323–1334, 2013.
[12] Andrea Vedaldi and Karel Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the International Conference on Multimedia (ACM), pages 689–692, 2015.
[13] Richard Russell. A sex difference in facial contrast and its exaggeration by cosmetics. Perception, 38(8):1211–1219, 2009.
[14] Ian D. Stephen and Angela M. McKeegan. Lip colour affects perceived sex typicality and attractiveness of human faces. Perception, 39(8):1104–1110, 2010.
[15] Alexander Todorov, Sean G. Baron, and Nikolaas N. Oosterhof. Evaluating face trustworthiness: a model based approach. Social Cognitive and Affective Neuroscience, 3(2):119–127, 2008.
[16] Umut Güçlü and Marcel A. J. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.