Predicting and visualizing psychological attributions with a deep neural network
Edward Grant, Stephan Sahm, Mariam Zabihi, Marcel van Gerven
Radboud University, Nijmegen, The Netherlands
Email: [email protected], [email protected], [email protected], [email protected]
Equal contribution: Edward Grant, Stephan Sahm, Mariam Zabihi
Abstract—Judgments about personality based on facial appearance are strong effectors in social decision making, and are known to have impact in areas from presidential elections to jury decisions. Recent work has shown that it is possible to predict perception of memorability, trustworthiness, intelligence and other attributes in human face images. The most successful of these approaches require face images expertly annotated with key facial landmarks. We demonstrate a Convolutional Neural Network (CNN) model that is able to perform the same task without the need for landmark features, thereby greatly increasing efficiency. The model has high accuracy, surpassing human-level performance in some cases. Furthermore, we use a deconvolutional approach to visualize important features for perception of 22 attributes and demonstrate a new method for separately visualizing positive and negative features.
I. INTRODUCTION
Facial attributions for intelligence, attractiveness, dominance and trustworthiness have been shown to exhibit a strong effect on social decision making, with far-reaching consequences from choosing between presidential candidates to jury decisions in criminal legal cases [1], [2]. Despite this, reliably predicting how a face will be perceived has proven to be difficult. Some of the best current methods require images hand-annotated with facial landmark features, which is time consuming [3].

Given the success of CNNs in image recognition tasks [4], the CNN model is a natural choice for visual attribution of psychological characteristics. In addition to superior performance in vision tasks, CNNs are able to learn visual features from image data and do not require additional human input or hand-crafted features.

Previous work on modelling attribute perception was conducted by Khosla et al., who introduced a method for characterizing face images using key facial points, histogram information, SIFT features and hand-annotated landmark facial features [3]. Using these features, it was possible to accurately predict attributions for many psychological and demographic attributes. In contrast, we use a CNN to learn features directly from RGB images without the need for facial landmark annotations, which are time consuming and can introduce bias. Using these learned features, the CNN is able to predict attribution labels with high accuracy, surpassing human-level performance in some cases. In addition, we demonstrate a method to visualize general attribution features learned by the CNN.

One important distinction is between the perception of psychological attributes (attributions) and other tests for an attribute. Visual perception has important social consequences, but is not always a good indicator of more robust measurements for an attribute. For example, Rezlescu et al. showed that perceived trustworthiness is not significantly correlated with measured trustworthiness [5].
In contrast, Kleisner et al. showed that perceived intelligence is associated with measured intelligence in men but not women [6]. In this experiment we focus solely on perception of attributes.

Several methods exist for visualizing the features learned by a neural network. These methods can broadly be divided into approaches that require a target image to be forward propagated through the network before the activity of a target feature detector can be projected back into image space [7], [8], and image-free approaches that generate an image that maximizes a class score [9], [10]. The first type of approach has the benefit of visualizing features from a real example image; the second approach can be more general because it does not rely on a single image example. We use the first approach, but visualize the mean of many examples, thus retaining both the benefit of visualizing features from real images and the generality of an image-free approach. This is only possible because the images we used contain faces that have roughly the same pose. If this were not the case, the features could become obscured by each other.

To accomplish feature visualization we use the deconvnet proposed by Zeiler and Fergus [7]. In this approach, a target image is forward propagated through the network, and all activations except the targets are set to zero. The target activation is projected back into image space by passing the activation back through the network using deconvolution and a special kind of up-sampling. The projected image contains the image features most responsible for the target activation.

The resulting visualization represents all of the features important for an attribute. However, by manipulating the final-layer network weights, we are able to show that these features can be decomposed into their positive and negative components. Using this method we separately visualize both the positive and negative features for attributions.

Fig. 1. CNN schematic for attribution prediction. The network takes as input an image and outputs a binary class prediction. Within the network, each convolutional layer transforms the output of the previous layer using learned filters that are convolved across overlapping sub-regions. The output of each layer is subject to one or more of the following nonlinear transformations: rectified linearity, max pooling and softmax.

II. METHODS
A. Dataset
Our example set comprised the annotated subset of the 10K face database collected by Bainbridge et al. [11]. This set consists of 2222 face photographs as RGB images annotated with psychological and demographic labels. These data were collected in the form of ratings from 15 participants for psychological features and 13 participants for demographic features.
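The rating-to-label conversion and image normalization described in the Preprocessing subsection below can be sketched as follows. This is a minimal sketch: the function names and the median-split criterion are our assumptions, since the paper states only that each attribute was split into two equally sized classes.

```python
import numpy as np

def binarize_ratings(ratings):
    """Split continuous attribute ratings into two equally sized classes
    (0 = low, 1 = high). A median split over the sorted ratings is
    assumed here; the paper states only that the classes were balanced."""
    order = np.argsort(ratings)
    labels = np.zeros(len(ratings), dtype=int)
    labels[order[len(ratings) // 2:]] = 1  # upper half of ratings -> class 1
    return labels

def zero_center(images):
    """Zero-center a stack of images (N, H, W, C) by subtracting the
    mean pixel value computed over all images, as described in II-B."""
    return images - images.mean()
```

For the gender attribute, whose ratings are already binary, only the class balancing step (randomly dropping images from the larger class) would apply.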
B. Preprocessing
Each image was labelled as belonging to one of two equally sized classes for each attribute, based on its ratings for a target attribute. For gender, the ratings were already binary and the classes were balanced by removing images randomly from the larger class. Images were square cropped and resized to × pixels. Images were zero-centered by subtracting the mean pixel value from all images.

C. Training the Network
For each attribution, one CNN model was trained per fold of a cross-validation setting, such that each example was represented exactly once in the test set. The network comprised three convolutional layers, two fully connected layers and a softmax layer.

Training was performed using stochastic gradient descent with momentum, with a fixed batch size, learning rate and weight decay. The models were trained using MatConvNet [12] on a Tesla K80 GPU. See Figure 1 for a more detailed description of the network structure.

D. Performance Measures
Two performance measures were computed. The outputs from the CNN are the probabilities of an image belonging to the positive or the negative class. These values were thresholded and afterwards compared to the true binarization of the image set (individually for each attribution). The fraction of correct predictions was used to determine accuracy. As a baseline, a linear support vector machine (SVM) was trained on the same data, and corresponding accuracy measurements were obtained. Furthermore, single-human accuracy was computed using a leave-one-out strategy over the given (binarized) dataset.

Correlation values refer to standard correlation coefficients. They were computed between the CNN output probabilities and the continuous human assessments from the dataset. Again, single-human correlation was obtained by a leave-one-out procedure over the (continuous) dataset for the respective attribution. While accuracy values show the correctness of the predictions, the correlations show how the variation within the predictions mimics the true variation in the data.
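The two measures, together with the random accuracy baseline used in the Statistics subsection, can be sketched as below. The 0.5 decision threshold, the use of Pearson correlation and the simulation sizes are our assumptions; the exact values used in the study are not given here.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of correct predictions after thresholding the
    positive-class probabilities (a 0.5 threshold is assumed)."""
    return float(np.mean((probs >= 0.5).astype(int) == labels))

def correlation(probs, ratings):
    """Pearson correlation between CNN output probabilities and the
    continuous human assessments."""
    return float(np.corrcoef(probs, ratings)[0, 1])

def random_accuracy_baseline(n_images, n_samples=10000, seed=0):
    """Null distribution of accuracy for a random classifier: the mean
    of n_images independent fair coin flips, simulated n_samples times."""
    rng = np.random.default_rng(seed)
    flips = rng.integers(0, 2, size=(n_samples, n_images))
    return flips.mean(axis=1)
```

The empirical quantiles of `random_accuracy_baseline` then provide the critical values for the one-sided test against chance described in Section II-E.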
E. Statistics
CNN performance was tested for significance in two respects: whether it was significantly different from human performance, and whether it was significantly different from a random baseline.

The CNN performance values were consistent enough over trials to reasonably approximate them as constant. We therefore assume that training the CNN twice results in exactly the same prediction output, i.e. that no randomness is involved. Hence, to test the null hypothesis that the CNN performance could be produced by the baseline or by humans, we only need to estimate how probable it is that the same or higher performance is generated by the baseline or humans, respectively. For this, the respective distributions were approximated. As we are interested in significant improvements, the hypothesis test was one-sided. For performance beyond the critical value at the chosen alpha level, we rejected the null hypothesis and concluded that CNN performance is significantly better than baseline or human performance.

The accuracy measure as defined above counts the fraction of correct binary predictions. Hence, the random baseline is the mean of a corresponding number of independent coin flips: given the number of images in the test set, the baseline distribution is the mean of that many independent fair Bernoulli distributed random variables.

The distribution of the correlations is less simple to derive. A completely random CNN would output an arbitrary probability prediction, independently for each image. This corresponds to a vector of independent standard uniform random variables. The correlation measurement then demands computing the correlation between the CNN response and each attribution dataset individually. Instead of deriving an analytic expression for such a correlation of a random vector with a given data vector, it is easier to simulate these correlation values. For this study, random samples were generated from a uniform random vector, and each was correlated with all attribution datasets.

For comparison with human performance, the distribution of human performance was approximated. For both accuracy and correlations, a bootstrap estimation procedure was used. A short explanation of the bootstrap approach follows. Our dataset was generated by a small number of participants. Estimating the distribution of human performance over the whole population would mean taking a new (random) subset of participants and computing the same performance measure again and again, thereby simulating its distribution. As this is obviously not feasible, the method known as bootstrap approximation regards the given subset as the population itself. Instead of taking a new random subset of the whole world population, a random sub-subset of the given subset is chosen and the human performance measure computed. For this study, sampling with replacement was used and the same number of samples was generated.

TABLE I: Accuracy and correlations for psychological and demographic attributions. Abbreviations: acc - accuracies; corr - correlations. Values in bold stand for significant improvement compared to the respective human performance at the chosen alpha level. Attributions in bold denote significant improvement in both accuracy and correlation (again compared to human performance).

attributions  acc CNN  acc Human  acc SVM  corr CNN  corr Human
age           0.75     0.98       0.65     0.58      0.86
attractive    0.74     0.72       0.59
emotional     0.73
friendly      0.80
happy         0.84
kind          0.78
sociable      0.79

F. Visualizing Features Using Deconvolution
Attribution features were visualized using a deconvnet [7]. Using this method, the target image was first forward propagated through the network and the location of maximum activation for each pool stored. The activation for the target attribute was then passed back through the network using deconvolution. At each pooling layer, an approximate inverse to the pooling operation was performed by up-sampling the image and placing the feature map pixel values at the location of maximum activation previously stored during pooling.

The target activation is caused by positive and negative predictions from the previous layer, and so deconvolution visualizes an image with both positive and negative features. In order to separately visualize features that positively and negatively contribute to an attribution prediction, we set the negative weights in the final layer to zero to visualize the positive features, and set the positive weights to zero to visualize the negative features. This is possible because the final-layer nodes' inputs are the weighted sum of the previous layer's activations, which are all positive or zero because of the rectified linear activation function. By setting the positive or negative weights to zero in the final layer, the effect of the positive or negative features on the prediction is isolated. Using this method, we create three visualizations for each attribution: one contains all the features responsible for a prediction, one contains only negative features, and one contains only positive features.

The deconvolution approach is typically used to visualize the features of a single example image. Because most images contain faces in roughly the same pose, we visualized the mean of all feature representations for each attribute. This results in an image with the mean features for an attribution. This process was repeated for positive and negative features.

III. RESULTS
A. Performance of the CNN, Human and SVM Classifiers
The CNN accuracy was higher than SVM accuracy in all cases. CNN accuracy, as well as correlations between CNN predictions and human assessments, was significantly better than chance for all attributions. Compared to human correlations, a significant improvement was found for fourteen attributions: attractive, caring, common, confident, egotistic, emotional, emotional stability, friendly, happy, interesting, kind, responsible, sociable and trustworthy. Compared to human accuracies, the CNN was significantly better in predicting five attributions, namely emotional, friendly, happy, kind and sociable. See Table I for a summary.

B. Visualization of CNN Features Involved in Attribution
Table II visualizes the CNN features involved in attribution. Here we identify a number of salient properties of the visualizations.

As expected, the CNN positive features for a happy attribution look very much like a smile. In contrast, the negative features focus around the eyes and a down-turned mouth. These two feature sets compete with each other to produce a prediction for happy.

Some of the attribution features we visualize have been studied before, and we find a general coherence between previous findings and the CNN feature visualizations. The CNN features important for gender discrimination centered around the eyes and lips. This coheres with existing evidence that gender perception can be modulated by subtly changing the color of the lips and eyes in a gender-neutral image [13], [14].

Todorov et al. show that there is a significant correlation between human facial features and perceived trustworthiness. These features are mainly located at the center of the face, including the eyebrows, cheekbones, nose and chin. The shape of these features generates a spectrum of trustworthiness impressions: a longer, narrower nose indicated increased perceived trustworthiness, and the shape of the cheekbones was also found to be important [15]. The features for trustworthiness shown in Table II show that the nose and cheekbones are also important features for CNN attribution of trustworthiness.

We further observed striking differences in color hue and saturation between features for many attributes. Important features for age, gender and attractiveness are found mainly in the red and blue channels of the target image. The green channel contains important features for friendly and happy, and other attributes have features with a combination of colors. Features for sociable are mainly found in the red channel in the nose area and in the green channel around the eyes and mouth.
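The positive/negative decomposition underlying these visualizations relies on the final layer receiving only non-negative (rectified) activations, so the pre-softmax score splits exactly into a positive-weight part and a negative-weight part. A minimal numeric sketch (the layer size and the random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
activations = np.abs(rng.normal(size=8))  # ReLU outputs: always >= 0
weights = rng.normal(size=8)              # final-layer weights for one class

full = activations @ weights                                    # full score
pos = activations @ np.where(weights > 0, weights, 0.0)         # positive features only
neg = activations @ np.where(weights < 0, weights, 0.0)         # negative features only

# Because the activations are non-negative, the two parts sum exactly
# to the full score, and each part's sign is fixed:
assert np.isclose(full, pos + neg)
```

In the actual visualization procedure, the zeroed-weight variants of the network are each passed through the deconvnet, so the projected images isolate the features driving `pos` and `neg` respectively.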
Although it is interesting to see the color of the features learned by the network, it is advisable not to over-interpret these findings. Just because the network learns features in a specific color channel does not necessarily mean that similar features cannot be found in other channels. Further work is needed to determine the relevance of color in attribute perception, for example by comparing the predictive performance of the present model with that of a model trained only on gray-scale images.

IV. DISCUSSION
This work shows that deep neural networks can be used to accurately predict rated personality traits from face images, even surpassing human-level performance in some cases.

The coherence between CNN features for attribution and features found to be used by humans is interesting but not unexpected. CNNs and humans both exploit common structures in natural images to perform vision. In addition, CNNs are loosely inspired by real neural structures and, similar to CNNs, there is strong evidence that the human visual system is organized in a hierarchy of feature detectors of increasing complexity [16].

By visualizing positive and negative features separately, we showed that for some attributions all features are important, whereas for others, the positive or negative features are better indicators of the existence of an attribution. For example, the negative features for responsible are much more pronounced, whereas for memorable the positive features are more important (see Table II).

Considering the performance of the CNN compared to human performance, it is interesting that the correlation values are far more often significantly outperformed by the CNN than the accuracy values are. The CNN can replicate the variability within the assessments better than the overall classification. We used classification rather than regression to allow for binary feature representations using deconvolution, but a similar network trained using regression may yield superior performance for attribution accuracy.

Many of the positive and negative features for attributions discriminate between innate features such as face shape or distance between the eyes. Gender and age are good examples of such attributions. In contrast, for some attributions, positive and negative features appear to be discriminated by expressions. Good examples of these attributions are confident, friendly, happy, kind and emotional.
Smiling is a positive feature for each of these attributions, suggesting that perception of these attributions can be changed through facial expression, unlike gender and age. Other attributes contain a mixture of fixed and expressive features. Attractiveness appears to be influenced by both face shape and the orientation of the lips. A smile is a feature of attractiveness, suggesting that perception of attractiveness can be modulated through expression.

In conclusion, we have shown that CNNs can be used to predict rated psychological and demographic attributions and to analyze the visual features that contribute to the prediction of these attributions. This can have practical applications
as well as providing new insights into the psychological underpinnings of personality ratings.

TABLE II: Deconvolved features. Per attribution, the dataset was binarized into low versus high values and a deep neural network was trained for classification. Following training, two additional network variants were created, retaining only the positive (+) or negative (-) final-layer weights. This allows for separate visualization of positive and negative features for each attribution. The images displayed are the averaged deconvolution results over the whole test dataset. For instance, for the attribution gender, a pronounced mouth region is common for the class female, while the nose and eyebrows are important features for the class male. Attributions shown, each with combined, negative (-) and positive (+) feature visualizations: age (young/old), attractive, calm, caring, common, confident, egotistic, emotional, emotional stability, familiar, friendly, gender (female/male), happy, intelligent, interesting, kind, memorable, responsible, sociable, trustworthy, typical and weird.

ACKNOWLEDGMENT
We would like to thank Wilma Bainbridge for the Human Faces 10k dataset and Umut Güçlü for providing the deconvnet code.

REFERENCES

[1] Charles C. Ballew and Alexander Todorov. Predicting political elections from rapid and unreflective face judgments. Proceedings of the National Academy of Sciences, 104(46):17948–17953, 2007.
[2] Daniel E. Re, David W. Hunter, Vinet Coetzee, Bernard P. Tiddeman, Dengke Xiao, Lisa M. DeBruine, Benedict C. Jones, and David I. Perrett. Looking like a leader: facial shape predicts perceived height and leadership ability. PLoS ONE, 8(12):e80957, 2013.
[3] Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, and Aude Oliva. Modifying the memorability of face photographs. In International Conference on Computer Vision (ICCV), pages 3200–3207, 2013.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[5] Constantin Rezlescu, Brad Duchaine, Christopher Y. Olivola, and Nick Chater. Unfakeable facial configurations affect strategic choices in trust games with or without information about past behavior. PLoS ONE, 7(3):e34293, 2012.
[6] Karel Kleisner, Veronika Chvátalová, and Jaroslav Flegr. Perceived intelligence is associated with measured intelligence in men but not women. PLoS ONE, 9(3):e81237, 2014.
[7] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer, 2014.
[8] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[9] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[10] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436, 2015.
[11] Wilma A. Bainbridge, Phillip Isola, and Aude Oliva. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General, 142(4):1323–1334, 2013.
[12] Andrea Vedaldi and Karel Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the International Conference on Multimedia (ACM), pages 689–692, 2015.
[13] Richard Russell. A sex difference in facial contrast and its exaggeration by cosmetics. Perception, 38(8):1211–1219, 2009.
[14] Ian D. Stephen and Angela M. McKeegan. Lip colour affects perceived sex typicality and attractiveness of human faces. Perception, 39(8):1104–1110, 2010.
[15] Alexander Todorov, Sean G. Baron, and Nikolaas N. Oosterhof. Evaluating face trustworthiness: a model based approach. Social Cognitive and Affective Neuroscience, 3(2):119–127, 2008.
[16] Umut Güçlü and Marcel A. J. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.