Applying Deep Machine Learning for psycho-demographic profiling of Internet users using O.C.E.A.N. model of personality
Iaroslav Omelianenko
Research Director, NewGround
[email protected]
Abstract
In the modern era, each Internet user leaves enormous amounts of auxiliary digital residuals (footprints) by using a variety of on-line services. All this data is already collected and has been stored for many years. In recent works, it was demonstrated that it is possible to apply simple machine learning methods to analyze the collected digital footprints and to create psycho-demographic profiles of individuals. However, while these works clearly demonstrated the applicability of machine learning methods to such analysis, the simple prediction models they created still lack the accuracy necessary to be successfully applied to practical needs. We assumed that using advanced deep machine learning methods may considerably increase the accuracy of predictions. We started with simple machine learning methods to estimate basic prediction performance and moved further by applying advanced methods based on shallow and deep neural networks. We then compared the prediction power of the studied models and drew conclusions about their performance. Finally, we formulated hypotheses on how prediction accuracy can be further improved. As a result of this work, we provide the full source code used in the experiments for all interested researchers and practitioners in the corresponding GitHub repository. We believe that applying deep machine learning for psycho-demographic profiling may have an enormous impact on society (for good or worse) and provides means for Artificial Intelligence (AI) systems to better understand humans by creating their psychological profiles. Thus AI agents may achieve the human-like ability to participate in conversation (communication) flow by anticipating human opponents' reactions, expectations, and behavior. By providing the full source code of our research we hope to intensify further research in the area by a wider circle of scholars.
1. Introduction
By using various on-line services, a modern Internet user leaves an enormous amount of digital tracks in the form of server logs, user-generated content, etc. All these information bits, meticulously saved by on-line service providers, create a vast amount of digital footprints for almost every Internet user. In recent research [Lambiotte, R., and Kosinski, M., 2014], it was demonstrated that by applying simple machine learning methods it is possible to find statistical correlations between the digital footprints and the psycho-demographic profiles of individuals. The considered psycho-demographic profile comprises psychometric scores based on the five-factor O.C.E.A.N. model of personality [Goldberg et. al, 2006] and demographic scores such as Age, Gender, and Political Views. O.C.E.A.N. is an abbreviation for Openness (Conservative and Traditional - Liberal and Artistic), Conscientiousness (Impulsive and Spontaneous - Organized and Hard Working), Extroversion (Contemplative - Engaged with the outside world), Agreeableness (Competitive - Team Working and Trusting), and Neuroticism (Laid Back and Relaxed - Easily Stressed and Emotional).

In this work we decided to test whether applying advanced machine learning methods to analyze the digital footprints of Internet users can outperform the results of previous research conducted by M. Kosinski:
Mining Big Data to Extract Patterns and Predict Real-Life Outcomes [Kosinski et. al, 2016]. For our experiments we used a data corpus comprising the psycho-demographic scores of individuals and their digital footprints in the form of Facebook likes. The data corpus used in the experiments was kindly provided by M. Kosinski through the corresponding web site: http://dataminingtutorial.com

We started our experiments by building simple machine learning models based on linear/logistic regression methods, as proposed by M. Kosinski in [Kosinski et. al, 2016]. By training and executing the simple models we estimated the basic predictive performance of machine learning methods against the available data set. We then continued our experiments with advanced machine learning methods based on shallow and deep neural network architectures. The full source code of our experiments is provided in the GitHub repository: https://github.com/NewGround-LLC/psistats

The source code is written in the R programming language [R Core Team, 2015], which is highly optimized for statistical data processing and allows applying advanced deep machine learning algorithms by bridging with Google Brain's TensorFlow framework [Google Brain Team, 2015].

This paper is organized as follows: In Section 2, we describe the data corpus structure and the necessary data preprocessing steps. Section 3 details how to build and run simple prediction models based on linear and logistic regression, with the results of their execution. In Section 4, we provide details on how to create and execute advanced prediction models based on artificial neural networks. Section 5 outlines directions for future work. Finally, in Section 6 we compare the performance of the different machine learning methods studied in this work and draw conclusions about their predictive power.
2. Data Corpus Preparation

In this section, we consider the creation of the input data corpus from the publicly available data set, and its preprocessing to allow further analysis by the selected machine learning algorithms.
The data set kindly provided by M. Kosinski and used in this work contains the psycho-demographic profiles of n_u = 110 728 Facebook users together with the n_L Facebook Likes they made. It is distributed through http://dataminingtutorial.com as three comma-separated files:

1. users.csv: contains the psycho-demographic user profiles. It has n_u = 110 728 rows (excluding the row holding the column names) and nine columns: anonymised user ID, gender ("0" for male and "1" for female), age, political views ("0" for Democrat and "1" for Republican), and the scores of the five-factor model of personality [Goldberg et. al, 2006].

2. likes.csv: contains the anonymized IDs and names of the n_L Likes.

3. users-likes.csv: contains the associations between users and their Facebook Likes, stored as user-Like pairs. It has n_uL = 10 612 326 rows and two columns: user ID and Like ID. The existence of a user-Like pair implies that the given user had the corresponding Like on their profile.
Raw data preprocessing is an important step in machine learning analysis which significantly reduces the time needed for the analysis and results in better prediction power of the created machine learning models. A detailed description of the data corpus preprocessing steps applied during this research is given hereafter.
Construction of sparse users-likes matrix and matrix trimming
To use the provided data corpus in machine learning analysis, it should first be transformed into an optimal format. Taking into account the properties of the provided data corpus (a user can like a specific topic only once, and most users have generated a small number of likes), it is natural to present it as a sparse matrix where most of the data points are zero (the resulting matrix density is about 0.006% - see Table 1). The sparse matrix data structure is optimized for performing numeric operations on sparse data and considerably reduces computational costs compared to a dense matrix holding the same data set.

After creation, the users-likes sparse matrix was trimmed by removing rare data points. As a result, a significantly reduced data corpus was created, imposing even lower demands on computational resources and better suited to manual analysis for extracting specific patterns. The descriptive statistics of the users-likes matrix before and after trimming are presented in Table 1.

Descriptive statistics | Raw Matrix | Trimmed Matrix
Table 1:
The descriptive statistics of the raw and trimmed users-likes matrix, with the minimum-users-per-like threshold set to u_L = 150 and the minimum likes per user to L_u = 50.

The users-likes matrix can be constructed from the three provided comma-separated files with the help of the accompanying script written in the R language: src/preprocessing.R. To use this script, make sure that the input_data_dir variable in src/config.R points to the root directory where the sample data corpus in the form of .CSV files was unpacked. To start preprocessing and trimming, run the following command from a terminal in the project's root directory:

$ Rscript ./src/preprocessing.R -u 150 -l 50

where: -u is the minimum number of users per like u_L, and -l is the minimum number of likes per user L_u to keep in the resulting matrix. The values for u_L and L_u were selected based on the recommendations given in [Kosinski et. al, 2016]. We have experimented with another set of parameters as well (u_L =
20 and L_u = …).
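The construction and trimming step can be sketched in a few lines of R with the Matrix package; this is a hedged illustration with assumed variable names (users_likes as the data frame read from users-likes.csv), not a copy of src/preprocessing.R:

# Build the sparse users-likes matrix and iteratively trim rare entries.
library(Matrix)

build_trimmed_matrix <- function(users_likes, min_users_per_like = 150,
                                 min_likes_per_user = 50) {
  users <- factor(users_likes$userid)
  likes <- factor(users_likes$likeid)
  ul <- sparseMatrix(i = as.integer(users), j = as.integer(likes), x = 1,
                     dimnames = list(levels(users), levels(likes)))

  # Drop rare likes and inactive users until both thresholds hold at once.
  repeat {
    keep_likes <- colSums(ul) >= min_users_per_like
    keep_users <- rowSums(ul) >= min_likes_per_user
    if (all(keep_likes) && all(keep_users)) break
    ul <- ul[keep_users, keep_likes, drop = FALSE]
  }
  ul
}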
Data imputation of missing values

The raw data corpus has missing values in the column with the 'Political' dependent variable. Before building the prediction model for this dependent variable, it is advisable to impute the missing values. In this work, we applied multivariate imputation using the
LDA method, with the number of multiple imputations equal to m = …, implemented in the src/preprocessing.R script as part of the users-likes matrix creation routine. The summary statistics for the data imputation applied to the political variable are presented in Table 2.

            est     se       t        df  Pr(>|t|)  lo 95  hi 95  nmis   fmi  lambda
(Intercept)  1.39  0.01  102.29   1240.53     0.00   1.36   1.41    NA  0.06    0.05
gender      -0.02  0.01   -2.80     25.27     0.01  -0.04  -0.01     0  0.44    0.40
age          0.00  0.00   -0.73    577.61     0.47   0.00   0.00     0  0.09    0.08
ope         -0.23  0.00  -68.10   2446.16     0.00  -0.23  -0.22     0  0.04    0.04
con          0.05  0.00   10.92     20.28     0.00   0.04   0.06     0  0.49    0.44
ext          0.03  0.00    6.44     14.40     0.00   0.02   0.04     0  0.58    0.53
agr          0.02  0.00    6.72    189.32     0.00   0.02   0.03     0  0.15    0.14
neu         -0.01  0.00   -2.07     95.30     0.04  -0.02   0.00     0  0.22    0.20

Table 2:
The descriptive statistics for the data imputation applied to the political variable using the LDA method, with the number of multiple imputations equal to m = …. The plausibility of the applied multivariate imputation can be confirmed by the low values in the columns fmi and lambda. The column fmi contains the fraction of missing information as defined in [Rubin DB, 1987], and the column lambda is the proportion of the total variance that is attributable to the missing data, λ = (B + B/m)/T, where B is the between-imputation variance and T is the total variance.
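The imputation step can be sketched with the mice package [van Buuren, Groothuis-Oudshoorn, 2011]; the value m = 5 and the variable names below are assumptions for illustration, since the exact settings live in src/preprocessing.R:

# Multivariate imputation of the missing 'political' values via LDA.
library(mice)

users$userid <- NULL                    # drop the ID column before imputing
users$political <- factor(users$political)

imp <- mice(users, method = "lda",      # LDA imputation for the factor
            m = 5,                      # number of imputations (assumed)
            seed = 42, printFlag = FALSE)

# Pool a diagnostic linear model over the m completed data sets,
# analogous to the summary shown in Table 2.
fit <- with(imp, lm(as.numeric(as.character(political)) ~
                      gender + age + ope + con + ext + agr + neu))
summary(pool(fit))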
Dimensionality reduction with SVD

After the two previous steps, the resulting users-likes sparse matrix still has a considerable number of features per data sample: 8 523 feature columns. To make it even more manageable, we considered applying singular value decomposition [Golub, G. H., and Reinsch, C. 1970], representing eigendecomposition-based methods projecting a set of data points onto a set of dimensions. As mentioned in [Kosinski et. al, 2016], reducing the dimensionality of the data corpus has a number of advantages:

• With a reduced feature space we can use a smaller number of data samples, as most machine learning analysis algorithms require the number of data samples to exceed the number of features (input variables)
• It reduces the risk of overfitting and increases the statistical power of the results
• It removes multicollinearity and redundancy in the data corpus by grouping related features (variables) in a single dimension
• It significantly reduces the required computational power and memory requirements
• Finally, it makes it easier to analyze the data by hand over a small set of dimensions, as opposed to hundreds or thousands of separate features
To run the SVD analysis against the generated users-likes matrix, execute the following command from the project's root directory:

$ Rscript ./src/svd_varimax.R --svd_dimensions 50 --apply_varimax true

where: --svd_dimensions is the number of SVD dimensions for the projection, and --apply_varimax is the flag indicating whether the varimax rotation should be applied afterwards.
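For sparse matrices of this size, a truncated SVD is the practical option; the following hedged sketch uses the irlba package (the actual src/svd_varimax.R may differ in details):

# Truncated SVD of the sparse users-likes matrix 'ul'.
library(irlba)

k <- 50                                  # number of SVD dimensions
svd_ul <- irlba(ul, nv = k)

# Project users onto the k dimensions, scaled by the singular values.
user_dims <- svd_ul$u %*% diag(svd_ul$d)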
Factor rotation analysis
The factor rotation analysis methodology can be used to further simplify the SVD dimensions and increase their interpretability by mapping the original multidimensional space into a new, rotated space. Rotation approaches can be orthogonal (i.e., producing uncorrelated dimensions) or oblique (i.e., allowing for correlations between the rotated dimensions).

In this work, during data preprocessing we applied one of the most popular orthogonal rotations - varimax. It minimizes both the number of dimensions related to each variable and the number of variables related to each dimension, thus improving the interpretability of the data for human analysts. For more details on rotation techniques, see [Abdi, H., 2003].
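A minimal sketch of the rotation step, using the base-R stats::varimax function and assuming the svd_ul object from the SVD sketch above:

# Varimax-rotate the like-space dimensions and keep the user scores aligned.
rot <- varimax(svd_ul$v, normalize = FALSE)

# Because the rotation matrix is orthogonal, applying it to the user scores
# preserves the approximation of the original matrix.
user_dims_rot <- (svd_ul$u %*% diag(svd_ul$d)) %*% rot$rotmat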
3. Regression analysis
There is an abundance of methods developed for building machine learning prediction models suitable for the analysis of large data sets. They range from sophisticated methods such as deep machine learning [Goodfellow et al., 2016], probabilistic graphical models [Daphne Koller, 2012], and support vector machines [Cortes & Vapnik, 1995], to much simpler ones, such as linear and logistic regression [Yan, Su, 2009]. Starting with simple methods is a common practice allowing the creation of a good baseline prediction model with minimal computational effort. The results obtained from these models can later be used to debug and estimate the quality of the results obtained from the advanced models.
In our data corpus, we have eight dependent variables with psycho-demographic scores of individuals to be predicted. Among those variables, six have continuous values, and two have categorical values. To build the prediction models for the variables with continuous values we applied linear regression analysis, and for the variables with categorical values - logistic regression analysis. Hereafter we describe the rationale for the selection of the appropriate regression analysis methods, as well as the model-specific data preprocessing needed.
Linear regression analysis
Linear regression is an approach for modeling the relationship between a continuous scalar dependent variable y and one or more explanatory (or independent) variables denoted X. The case of one explanatory variable is called simple linear regression; for more than one explanatory variable, the process is called multiple linear regression [David A. Freedman, 2009]. In linear regression, the relationships are modeled using the linear predictor function y = Θ^T X, whose unknown model parameters Θ are estimated from the input data. Such models are called linear models [Hilary L. Seal, 1967].

We used linear regression to build the prediction models for the analysis of the six continuous dependent variables in the given data corpus: Age, Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism.
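As a hedged illustration (the variable names are assumed; the repository scripts are the reference), fitting one such linear model on the rotated SVD user scores takes a single lm call:

# Linear model for a continuous trait (here Openness) on the SVD scores.
df <- data.frame(ope = users$ope, user_dims_rot)
fit <- lm(ope ~ ., data = df)

# Scores for held-out users are then obtained with predict(fit, newdata = ...)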
Logistic regression analysis
Logistic regression is a regression model where the dependent variable is categorical [David A. Freedman, 2009]. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using the logistic function σ(x) = 1 / (1 + e^(-x)), which is the cumulative logistic distribution [Rodriguez, G., 2007].

We considered only the specialized binary logistic regression, because the categorical dependent variables found in our data corpus (Gender and Political Views) are binomial, i.e. have only two possible values, "0" and "1".
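In base R this corresponds to glm with the binomial family; a minimal sketch with assumed variable names:

# Binary logistic regression for Gender on the rotated SVD user scores.
df <- data.frame(gender = users$gender, user_dims_rot)
fit <- glm(gender ~ ., data = df, family = binomial(link = "logit"))

# Predicted probabilities of class "1" (female):
p_hat <- predict(fit, type = "response")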
Cross-Validation
We applied k-fold cross-validation to help avoid model overfitting when evaluating the accuracy scores of the prediction models. In k-fold cross-validation, the original sample is randomly partitioned into k equally sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. The 10-fold cross-validation is commonly used, but in general, k remains an unfixed parameter [Kohavi, Ron, 1995].
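A minimal sketch of 10-fold cross-validation for a continuous trait, assuming a predictor matrix X (e.g. the rotated SVD scores) and a response vector y; the repository scripts implement their own variant:

# k-fold cross-validation with Pearson correlation as the accuracy metric.
k <- 10
df <- data.frame(y = y, X)
folds <- sample(rep(1:k, length.out = nrow(df)))

fold_scores <- sapply(1:k, function(fold) {
  train <- folds != fold
  fit <- lm(y ~ ., data = df[train, ])
  pred <- predict(fit, newdata = df[!train, ])
  cor(pred, df$y[!train])
})
mean(fold_scores)           # averaged estimate over the k folds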
Dimensionality reduction

To reduce the number of features (input variables) in the data corpus, we applied singular value decomposition (SVD) with a subsequent varimax factor rotation analysis. The number of varimax-rotated singular value decomposition dimensions (K) has a considerable impact on the accuracy of the model predictions. To find the optimal number of SVD dimensions, we analyzed the relationship between K and the accuracy of the model predictions by creating a series of regression models for different values of K. Then we plotted the prediction accuracy of the regression models against the chosen number of K SVD dimensions. Typically, the prediction accuracy grows rapidly within the lower ranges of K and may start decreasing once the number of dimensions becomes large. Selecting a value of K that marks the end of the rapid growth of prediction accuracy usually offers decent interpretability of the input data topics. In general, larger K values often result in better predictive power when the preprocessed data corpus is further analyzed with a specific machine learning algorithm [Zhang, Marron, Shen,& Zhu, 2007]. See Figure 1 for the results of our experiments.

To start this analysis, run the following command from a terminal in the project's root directory:

$ Rscript ./src/analysis.R

The resulting plots will be saved in the "Rplots.pdf" file in the project root and include two plots:

• the plot of the relationship between the accuracy of the prediction models for each dependent variable and the number of varimax-rotated SVD dimensions used for dimensionality reduction (Figure 1). With this plot, one can select the number of K SVD dimensions that maximizes the predicting power of the regression model for a particular dependent variable.
• the heat map of correlations between the scores of digital footprints of individuals, projected onto a specific number of varimax-rotated SVD dimensions, and each dependent variable (Figure 2). This plot can be used to visually find the most correlated dependent variables. Later, it will be shown that the predictive models for dependent variables with higher correlation have better prediction accuracy.

Figure 1: The relationship between the accuracy of predicting psycho-demographic traits and the number of varimax-rotated singular value decomposition dimensions used for dimensionality reduction. The results suggest that selecting K = 50 SVD dimensions might be a good choice for building models predicting almost all dependent variables, as it offers accuracy close to what seems to be the upper asymptote for this data. But for the Openness, Extroversion, and Agreeableness dependent variables, the prediction results can be slightly improved by selecting higher numbers of K SVD dimensions.

The given data corpus has eight dependent variables for which to build prediction models. Simple machine learning methods such as regression analysis are mostly applied to estimate a single dependent variable; when multiple dependent variables need to be estimated, specialized methods of multivariate regression analysis can be used. Taking into account that our dependent variables have different types (continuous and nominal), which require different regression analysis methods, we decided to build a separate regression model per dependent variable. The metric used to evaluate the accuracy of a prediction model is related to the regression method used in the model. In this research we considered the following metrics, both computed in the short sketch below:

• the accuracy of a prediction model applied to a continuous dependent variable is measured as the Pearson product-moment correlation [Gain, 1951]
• the accuracy of a prediction model applied to a binomial dependent variable is measured as the area under the receiver-operating characteristic curve (AUC) [Sing, Sander, Beerenwinkel, Lengauer, 2005]

Before executing the models, make sure that the data corpus is already preprocessed as described in the subsection "Construction of sparse users-likes matrix and matrix trimming". When the data corpus is ready, the following command can be executed to start building the linear/logistic regression models and evaluating their predictive performance (run the command from a terminal in the project's root directory):

$ Rscript ./src/regression_analysis.R

The results will be saved into the file "out/pred_accuracy_regr.txt".
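Both metrics take a vector of predictions and a vector of observed values; a hedged sketch, with assumed vector names, using base cor() and the ROCR package [Sing, Sander, Beerenwinkel, Lengauer, 2005]:

# Accuracy metrics for the two variable types.
library(ROCR)

# Continuous trait (e.g. Openness): Pearson product-moment correlation.
acc_continuous <- cor(pred_ope, obs_ope, method = "pearson")

# Binomial trait (e.g. Gender): area under the ROC curve (AUC).
rocr_pred <- prediction(predictions = prob_gender, labels = obs_gender)
acc_binomial <- performance(rocr_pred, measure = "auc")@y.values[[1]]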
The prediction accuracy of the regression models for the data corpus trimmed to contain 150 users-per-like and 50 likes-per-user and varimax-rotated against K = 50 SVD dimensions is presented in Table 3.

Trait              Variable   Pred. accuracy
Gender             gender     93.65%
Age                age        61.17%
Political view     political  68.36%
Openness           ope        44.02%
Conscientiousness  con        25.72%
Extroversion       ext        30.26%
Agreeableness      agr        23.97%
Neuroticism        neu        29.11%
Mean                          47.03%
Table 3:
The predictive accuracy of the linear and logistic regression models per dependent variable (for u_L = 150, L_u = 50, and K = 50 SVD dimensions).
Figure 2:
The heat map presenting the correlations between the scores of digital footprints of individuals, projected onto K = 50 varimax-rotated singular value decomposition dimensions, and the scores of psycho-demographic traits of individuals. The heat map suggests that the Age, Gender, and Political view dependent variables have the maximum correlation with the maximal number of SVD dimensions. A higher correlation results in a higher prediction power of the regression model for the particular dependent variable (as will be shown later).
The best prediction accuracy was achieved for Gender, Age, and Political view, with Openness following after. That correlates well with our previous analysis of the SVD correlations heat map (see Figure 2). In general, only the prediction model for Gender is accurate enough to be useful in real-life applications. Thus, simple linear/logistic regression models cannot be used to accurately estimate the psycho-demographic profiles of Internet users based only on their Facebook likes. In the following sections, we will test whether applying advanced deep machine learning methods can improve the prediction accuracy any further.
4. Fully Connected Feed-Forward Artificial Neural Networks

In this work, we considered multilayer fully connected feed-forward neural networks (NN) for building simple (shallow) and deep machine learning NN models. A fully connected NN is characterized by the interconnectedness of all units of one layer with all units of the layer before it in the graph. A feed-forward NN is not allowed to have cycles from latter layers back to earlier ones. Hereafter we describe the artificial NN architectures evaluated and the prediction accuracy results obtained.

A Shallow Neural Network (SNN) is an artificial neural network with one hidden layer of units (neurons) between the input and output layers. Its hidden units take inputs from the input units (the columns of the input data matrix) and feed into the output units, where the linear or categorical analysis is performed. To mimic a biological neuron, the hidden units of the neural network apply a specific non-linear activation function. One of the popular activation functions is the ReLU non-linearity, which we used as the activation function for the units of the hidden layers in the studied network architectures [Nair, Hinton, 2010]. It improves information disentangling and linear separability, producing an efficient variable-size representation of the model's data. Furthermore, ReLU activation is computationally cheaper: there is no need to compute the exponential function as in the case of sigmoid activation [Glorot, Bordes, Bengio, 2011].

To reduce overfitting, dropout regularization was applied with a drop probability of 0.5, which means that each hidden unit is randomly omitted from the network with the specified probability. This helps to break the rare dependencies that can occur in the training data [Hinton, G. et al., 2012].

The NN architecture was built using Google Brain's TensorFlow library - an open source software library for numerical computation using data flow graphs [Google Brain Team, 2015]. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The resulting two-layer (one hidden layer) ANN architecture graph is depicted in Figure 5.

As the loss function to be optimized we selected the Mean Squared Error (MSE), with the Adam optimizer (Adaptive Moment Estimation) used to estimate its minimum. The Adam optimizer was selected for its proven advantages, some of which are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its step sizes are approximately bounded by the step size hyper-parameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing [Kingma, Ba, 2014]. The batch size was selected to be 100. We also tested training with batch size 10 but found no statistically relevant difference in prediction accuracy between runs with either batch size, while reducing the batch size considerably increased the training time of the NN models.

It was found that the optimal number of SVD dimensions for the shallow ANN is K = 128. The predictive accuracy of the SNN per number of K SVD dimensions is presented in Table 4.

Table 4:
The predictive accuracy results of the SNN per number of K SVD dimensions, with a fixed learning rate γ and 512 units in the hidden layer.

The optimal learning rate was selected by comparing the ratio of surviving hidden units with non-zero ReLU activation and by monitoring the plot of the loss function over iterations (Figure 3). With the presented hyper-parameters, a maximum ratio of 0.57 of zero ReLU activations was reached for the largest learning rate value, which is acceptable taking into account the tendency of ReLU to saturate at zero during the gradient back-propagation stage when strong gradients are applied due to high learning rates [Nair, Hinton, 2010]. The maximal number of iterations (50 000), and correspondingly the number of training epochs, was selected based on the loss function plot (Figure 3). Through a series of experiments the optimal learning rate value γ was selected; the predictive accuracy per learning rate is presented in Table 5.

Table 5: The predictive accuracy results of the SNN per learning rate, with K = 128 SVD dimensions and 512 units in the hidden layer.
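As a compact illustration of the architecture just described (one ReLU hidden layer with dropout, a linear output, MSE loss, and the Adam optimizer), the following hedged sketch uses the TensorFlow 1.x API through the R tensorflow bridge; the learning rate value is an assumption, and src/mlp.R remains the reference implementation:

# Shallow feed-forward network: input -> ReLU hidden layer -> linear output.
library(tensorflow)

k <- 128L; n_hidden <- 512L

x <- tf$placeholder(tf$float32, shape(NULL, k))      # SVD user scores
y <- tf$placeholder(tf$float32, shape(NULL, 1L))     # trait to predict

w1 <- tf$Variable(tf$truncated_normal(shape(k, n_hidden), stddev = 0.1))
b1 <- tf$Variable(tf$zeros(shape(n_hidden)))
hidden <- tf$nn$relu(tf$matmul(x, w1) + b1)
hidden <- tf$nn$dropout(hidden, keep_prob = 0.5)     # dropout regularization

w2 <- tf$Variable(tf$truncated_normal(shape(n_hidden, 1L), stddev = 0.1))
b2 <- tf$Variable(tf$zeros(shape(1L)))
y_hat <- tf$matmul(hidden, w2) + b2

loss <- tf$reduce_mean(tf$square(y_hat - y))         # MSE loss
train_step <- tf$train$AdamOptimizer(1e-3)$minimize(loss)  # rate assumed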
The accompanying launch script is provided to conduct the experiments under Unix:

$ ./eval_mlp_1.sh ul_svd_matrix_file

where: ul_svd_matrix_file is the path to the preprocessed users-likes matrix with the dimension of the feature columns reduced as described in "Construction of sparse users-likes matrix and matrix trimming". The source code of the shallow ANN implementation used for the experiment can be found in src/mlp.R of the accompanying GitHub repository.

A Deep Neural Network (DNN) is an artificial neural network with multiple hidden layers of units between the input and the output layers. The first hidden layer takes inputs from each of the input units, and each subsequent hidden layer takes inputs from the outputs of the previous hidden layer's units [Christopher M. Bishop, 1995]. Similar to a shallow network, a deep neural network can model complex non-linear relationships. But the added extra layers enable the composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network [Bengio, Yoshua, 2009].
Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer [LeCun, Bengio, Hinton, 2015]. As with shallow ANNs, many issues can arise in the training of deep neural networks, the two most common problems being overfitting and computation time [Tetko, Livingstone, Luik, 1995]. Hereafter we consider the DNN architectures studied in this research.
The three-layer Deep Learning Network Architecture Evaluation
We started our experiments with deep learning networks with a simple DNN architecture comprising two hidden layers with ReLU activation and dropout after each hidden layer with a keep probability of 0.5. The experimental network graph is depicted in Figure 7. The prediction accuracy per number of K SVD dimensions and hidden layer configuration is presented in Table 6.

Table 6:
The predictive accuracy results of the three-layer DNN per number of K SVD dimensions, with a fixed learning rate γ and with the sizes of the hidden layers given as [hidden1, hidden2].

We started with a fixed learning rate γ and varied the number of K SVD dimensions. The optimal prediction accuracy of the DNN model was achieved with K = 256 SVD dimensions and two hidden layers comprising [512, 256] units correspondingly. Similar prediction accuracy can be achieved with another combination of K and γ, but we have not considered the latter set of hyper-parameters due to its extra computational overhead while giving statistically the same results as the former set. The results of the experiments are presented in Table 6.

After finding the optimal values of K SVD dimensions and the number of units per hidden layer, we conducted a series of experiments to determine the optimal initial learning rate for the found hyper-parameters. The optimal initial learning rate value γ was confirmed experimentally; the predictive accuracy per learning rate is presented in Table 7.

Table 7: The predictive accuracy results of the three-layer DNN per learning rate, with K = 256 SVD dimensions and with hidden layers of 512 and 256 units correspondingly.
In our experiments, we applied exponential learning rate decay with 10 000 steps before decay and a decay rate of 0.96. Such a scheme has a positive effect on the network convergence speed due to the learning rate annealing effect, which gives a system the ability to escape from poor local minima to which it might have been initialized [Kirkpatrick et al., 1983]. We selected batch size 100 as optimal for this experiment.

The accompanying launch script is provided to conduct the experiments under Unix:

$ ./eval_dnn.sh ul_svd_matrix_file

where: ul_svd_matrix_file is the path to the preprocessed users-likes matrix with the dimension of the feature columns reduced as described in "Construction of sparse users-likes matrix and matrix trimming". The source code of the DNN implementation with two hidden layers used for the experiment can be found in src/dnn.R of the accompanying GitHub repository.
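The decay schedule described above can be sketched in the same TensorFlow 1.x API; the initial rate gamma0 and the staircase behaviour are assumptions, while the 10 000 decay steps and the 0.96 decay rate come from the text:

# Exponential learning rate decay, reusing the 'loss' tensor defined earlier.
global_step <- tf$Variable(0L, trainable = FALSE)
gamma0 <- 1e-3                                   # assumed initial rate

learning_rate <- tf$train$exponential_decay(
  learning_rate = gamma0,
  global_step   = global_step,
  decay_steps   = 10000L,
  decay_rate    = 0.96,
  staircase     = TRUE)                          # assumed

train_step <- tf$train$AdamOptimizer(learning_rate)$minimize(
  loss, global_step = global_step)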
The four-layer Deep Learning Network Architecture Evaluation
This architecture comprises three hidden layers with ReLU activation and one output linear layer. All network layers are fully connected, and the network architecture is feed-forward as in all previous NN experiments. The experimental network graph is depicted in Figure 8.

We tested two dropout regularization schemes: (a) dropout applied after each hidden layer with a keep probability of 0.5; (b) dropout applied after every second hidden layer with the keep probability calculated by the formula p_d = i/n (where n is the number of dropouts and i is the current dropout index). It was found that the former scheme gives better results than the latter. Thus, for the final evaluation run, we applied dropout regularization after each hidden layer.

Based on our previous experiments with more shallow networks, we decided to start with the following hyper-parameters: a fixed learning rate γ, K = 128 SVD dimensions, and the hidden layers configuration [256, 128, 128].

To find the optimal number of K SVD dimensions and hidden layer configurations, we conducted a series of experiments trying various combinations. The heuristic applied to select the number of units per hidden layer is rather naive and assumes that, with a dropout probability of 0.5, half of the units will be saturated to zero at the ReLU activation. Thus we decided to make the number of units in the first hidden layer twice the number of features in the input data (K). The results of the experiments are presented in Table 8.

Table 8: The predictive accuracy results of the four-layer DNN per number of K SVD dimensions, with a fixed learning rate γ. The number of units in the hidden layers differs per configuration and is given as [hidden1, hidden2, hidden3].

Despite the best accuracy being achieved at the upper end of the tested configurations, we decided not to increase K or the number of hidden units further, as this would give no further gain in prediction accuracy against the validation data set and may even worsen the validation accuracy through a greater level of overfitting (see Figure 6).

The accompanying launch script is provided to conduct the experiments under Unix:

$ ./eval_3dnn.sh ul_svd_matrix_file

where: ul_svd_matrix_file is the path to the preprocessed users-likes matrix with the dimension of the feature columns reduced as described in "Construction of sparse users-likes matrix and matrix trimming". The source code of the DNN implementation with three hidden layers used for the experiments can be found in src/3dnn.R of the accompanying GitHub repository.
5. Future Work

Through the conducted experiments, we found that the prediction accuracy differs considerably among the machine learning methods studied, and the best results were achieved by using advanced methods based on neural network architectures. At the same time, the prediction accuracy per individual dependent variable also differs per particular prediction model and selected set of hyper-parameters. From the experimental results it can be seen that a specific combination of NN architecture with a given set of hyper-parameters may be best suited for one dependent variable but worsen the predictive power for some of the others.

In future studies, it would be interesting to investigate this dependency and build separate NN models per dependent variable, as was done in the case of the simple machine learning methods (see Section: 'Regression analysis').

Also, it seems promising to apply the methodology described in [Ba, Caruana, 2014], which provides evidence that shallow networks are capable of learning the same functions as deep learning networks, and often with the same number of parameters. In [Ba, Caruana, 2014] it was shown that with wide shallow networks it is possible to reach the state-of-the-art performance of deep models and reduce training time by a factor of 10 using parallel computational resources (GPU).
6. Conclusion
From our experiments we found that only a weak correlation exists between most of the O.C.E.A.N. psychometric scores of individuals and the collected Facebook likes associated with them. Both the simple and the advanced machine learning algorithms that we tested provided poor prediction accuracy for almost all of the O.C.E.A.N. personality traits. It seems not yet feasible to use machine learning models to accurately estimate the psychometric profile of an individual based only on Facebook likes. But we believe that by complementing the Facebook likes of a user with additional data points, it is possible to greatly improve the accuracy of machine learning prediction models for psychometric profile estimation.

At the same time, we found a strong correlation of demographic traits of individuals, such as Age, Gender, and Political Views, with their Facebook activity (likes). Our experiments confirmed that it is possible to use advanced machine learning methods to build a correct demographic profile of an individual based only on the collected Facebook likes.

Trait Var. | Regression | SNN | DNN [512, 256] | DNN [2048, 1024, 1024]

Table 9:
The comparison of prediction accuracy for the best prediction models found. The best prediction accuracy was demonstrated by the shallow neural network (SNN), followed by the three-layer deep neural network (DNN [512, 256]). A further increase in the number of K SVD dimensions and the addition of extra hidden layers lead to model overfitting and degradation of accuracy on the validation data set.
Among all the studied machine learning prediction models, the best overall accuracy was achieved with the Shallow Neural Network architecture. We hypothesize that this may be the result of its ability to learn the best parameter space function within an optimal number of SVD dimensions applied to the input data set (the users-likes sparse matrix). Adding extra hidden layers either leads to model overfitting when the number of SVD dimensions is too high, or to underfitting when the number of SVD dimensions is too low. Also, it is interesting to notice that the performance of the shallow networks and the deep learning networks with two hidden layers is comparable, while with the introduction of a third and more hidden layers it drops significantly. Thus we can conclude that no further improvements can be gained with extra hidden layers. See Table 9.

The gathered experimental data confirms that advanced machine learning methods based on the variety of studied artificial neural network architectures outperform the simple machine learning methods (linear and logistic regression) described in the previous research conducted by M. Kosinski [Kosinski et. al, 2016]. We believe that further prediction accuracy improvements can be achieved by building separate advanced machine learning models per dependent variable, which is the subject of our future research activities.
A. The SNN evaluation plots
The following pages provide plots and diagrams related to the evaluation of the two-layer (shallow) feed-forward artificial neural network with one fully connected hidden layer and a linear output layer.
Figure 3:
The training process evaluation based on loss values and ReLU zero activations per number of iterations. With the higher learning rate (γ = …, orange) we have fast convergence, but the ratio of ReLU-zero activations is high and quickly rising, with relatively low evaluated prediction accuracy, which implies that the optimum was missed. With the medium learning rate (γ = …, violet) we have a smooth loss function plot with a lower ratio of ReLU-zero activations, giving the best prediction scores among all three runs. With the lowest learning rate (γ = …, purple) we can see that learning struggled to find the global minimum, with a reduced speed of convergence and, despite the lowest ReLU-zero activation rate, the worst prediction accuracy among all runs due to high loss values.

Figure 4:
The histograms of various tensors collected during the three runs within the hidden layer (left: γ = …, middle: γ = …, right: γ = …). By examining the weights histograms it may be noticed that the middle one has the widest base with a sharp peak, which means that the layer converged, but the search space was the widest among all runs. The left one has a narrower base and a sharp peak, which means that the layer converged within a narrower search space and as a result has better prediction power. The right one has a narrow base but a wide plateau at the top, which means that the search space is narrow but the algorithm still failed to converge. The left and middle histograms have sharp peaks compared to the right one, which may be a signal that their learning rate values are more relevant for algorithm convergence, and as a result we have better predictions for those learning rates.

Figure 5:
The tensor network graph for the multilayer perceptron with one hidden layer (fully_connected) and one linear output layer (fully_connected1). The input layer is presented as the input tensor placeholder. The hidden layer has a ReLU activation non-linearity. The loss function is MSE (mean squared error). The training optimizer is Adam (Adaptive Moment Estimation).
B. Deep NN evaluation plots

The following pages provide plots and diagrams related to the evaluation of the studied deep neural networks.
Figure 6:
The loss function plot against the train and validation data sets, for K = … input features and three hidden layers of [2048, 1024, 1024] units correspondingly. It can be seen that the model is slightly overfitted against the training data - the validation plot lies above the train plot and doesn't improve with more training steps.

Figure 7:
The tensor network graph for the DNN with two hidden layers. The input layer is presented as the input tensor placeholder. The hidden layers have ReLU activation non-linearity. The loss function is MSE (mean squared error). The training optimizer is Adam (Adaptive Moment Estimation).
Figure 8:
The tensor network graph for the DNN with three hidden layers. The input layer is presented as the input tensor placeholder. The hidden layers have ReLU activation non-linearity. The loss function is MSE (mean squared error). The training optimizer is Adam (Adaptive Moment Estimation).

References

[Kosinski et. al, 2016] Michal Kosinski, Yilun Wang, Himabindu Lakkaraju, and Jure Leskovec (2016). Mining Big Data to Extract Patterns and Predict Real-Life Outcomes. Psychological Methods, 21(4), 493-506. DOI: 10.1037/met0000105

[Lambiotte, R., and Kosinski, M., 2014] Lambiotte, R., and Kosinski, M. (2014). Tracking the digital footprints of personality. Proceedings of the Institute of Electrical and Electronics Engineers, 102, 1934-1939. DOI: 10.1109/JPROC.2014.2359054

[Goldberg et. al, 2006] Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., and Gough, H. G. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96. DOI: 10.1016/j.jrp.2005.08.007

[Golub, G. H., and Reinsch, C. 1970] Golub, G. H., and Reinsch, C. (1970). Singular value decomposition and least squares solutions. Numerische Mathematik, 14, 403-420. DOI: 10.1007/BF02163027

[Abdi, H., 2003] Abdi, H. (2003). Factor rotations in factor analyses. In M. Lewis-Beck, A. E. Bryman, & T. F. Liao (Eds.), The SAGE encyclopedia of social science research methods (pp. 792-795). Thousand Oaks, CA: SAGE.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. Manuscript in preparation.

[Daphne Koller, 2012] Daphne Koller (2010-2012). Probabilistic Graphical Models. Stanford University. Retrieved from http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=ProbabilisticGraphicalModels

[Cortes & Vapnik, 1995] Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. DOI: 10.1007/BF00994018

[Yan, Su, 2009] Xin Yan, Xiao Gang Su (2009). Linear Regression Analysis: Theory and Computing. World Scientific, pp. 1-2. ISBN 9789812834119

[David A. Freedman, 2009] David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press, p. 26.

[Hilary L. Seal, 1967] Hilary L. Seal (1967). The historical development of the Gauss linear model. Biometrika.

[Rodriguez, G., 2007] Rodriguez, G. (2007). Lecture Notes on Generalized Linear Models. Chapter 3, p. 45. Retrieved from http://data.princeton.edu/wws509/notes/

[Sing, Sander, Beerenwinkel, Lengauer, 2005] Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: Visualizing classifier performance in R. Bioinformatics, 21, 3940-3941. DOI: 10.1093/bioinformatics/bti623

[Gain, 1951] Gain, A. K. (1951). The frequency distribution of the product moment correlation coefficient in random samples of any size drawn from non-normal universes. Biometrika, 38, 219-247. DOI: 10.1093/biomet/38.1-2.219

[Kohavi, Ron, 1995] Kohavi, Ron (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 2(12), 1137-1143. CiteSeerX 10.1.1.48.529

[Zhang, Marron, Shen,& Zhu, 2007] Zhang, L., Marron, J., Shen, H., & Zhu, Z. (2007). Singular value decomposition and its visualization. Journal of Computational and Graphical Statistics, 16, 833-854.

[van Buuren, Groothuis-Oudshoorn, 2011] Stef van Buuren and Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3). Retrieved from http://doc.utwente.nl/78938/

[Rubin DB, 1987] Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

[Christopher M. Bishop, 1995] Christopher M. Bishop (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, NY, USA.

[Bengio, Yoshua, 2009] Bengio, Yoshua (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.

[LeCun, Bengio, Hinton, 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521, 436-444.

[Tetko, Livingstone, Luik, 1995] Tetko, I. V., Livingstone, D. J., and Luik, A. I. (1995). Neural network studies. 1. Comparison of overfitting and overtraining. J. Chem. Inf. Comput. Sci., 35(5), 826-833. DOI: 10.1021/ci00027a006

[Hinton, G. et al., 2012] Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580v1

[Nair, Hinton, 2010] Vinod Nair and Geoffrey Hinton (2010). Rectified linear units improve restricted Boltzmann machines. ICML. Retrieved from http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf

[Glorot, Bordes, Bengio, 2011] Xavier Glorot, Antoine Bordes, Yoshua Bengio (2011). Deep Sparse Rectifier Neural Networks. JMLR W&CP. http://jmlr.org/proceedings/papers/v15/glorot11a.html

[Kingma, Ba, 2014] Diederik P. Kingma; Lei Jimmy Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

[Ba, Caruana, 2014] Lei Jimmy Ba, Rich Caruana (2014). Do Deep Nets Really Need to be Deep? arXiv preprint arXiv:1312.6184

[Kirkpatrick et al., 1983] S. Kirkpatrick, C. D. Gelatt Jr., M. P. Vecchi (1983). Optimization by Simulated Annealing. Science, 220(4598), 671-680. DOI: 10.1126/science.220.4598.671

[R Core Team, 2015] R Core Team (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

[Google Brain Team, 2015] Google Brain Team (2015). TensorFlow: an open source software library for numerical computation using data flow graphs.