FM²: Field-matrixed Factorization Machines for Recommender Systems
Yang Sun
Yahoo Research, Sunnyvale, CA, USA
[email protected]

Junwei Pan
Yahoo Research, Sunnyvale, CA, USA
[email protected]

Alex Zhang
Yahoo Research, Sunnyvale, CA, USA
[email protected]

Aaron Flores
Yahoo Research, Sunnyvale, CA, USA
[email protected]
ABSTRACT
Click-through rate (CTR) prediction plays a critical role in recommender systems and online advertising. The data used in these applications are multi-field categorical data, where each feature belongs to one field. Field information has proved to be important, and several models take fields into account. In this paper, we propose a novel approach to model field information effectively and efficiently. The proposed approach is a direct improvement of FwFM and is named Field-matrixed Factorization Machines (FmFM, or FM²). We also propose a new explanation of FM and FwFM within the FmFM framework, and compare it with FFM. Besides pruning the cross terms, our model supports field-specific dimensions of the embedding vectors, which acts as a soft pruning. We also propose an efficient way to minimize the dimensions while keeping the model performance. The FmFM model can be optimized further by caching the intermediate vectors, so that it takes only thousands of floating-point operations (FLOPs) to make a prediction. Our experiment results show that it can outperform FFM, which is more complex. The FmFM model's performance is also comparable to DNN models, which require many more FLOPs at runtime.

CCS CONCEPTS
• Recommender systems; Computational advertising

KEYWORDS
Recommender Systems, Factorization Machines, CTR prediction
ACM Reference Format:
Yang Sun, Junwei Pan, Alex Zhang, and Aaron Flores. 2021. FM²: Field-matrixed Factorization Machines for Recommender Systems. In Proceedings of the Web Conference 2021 (WWW '21), April 19-23, 2021, Ljubljana, Slovenia.
ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3449930
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19-23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3449930
1 INTRODUCTION
Click-through rate (CTR) prediction plays a key role in recommender systems and online advertising, and it has attracted much research attention in the past decade [3, 13, 19]. The data involved in CTR prediction are typically multi-field categorical data [15, 23]. Such data possess the following properties. First, all the features are categorical and very sparse, since many of them are identifiers; the total number of features can easily reach millions to billions. Second, every feature belongs to one and only one field, and there can be tens to hundreds of fields.
A prominent model for these prediction problems is logistic regression with cross features [3]. When all cross features are considered, the resulting model is equivalent to a polynomial kernel of degree 2 [2]. However, it takes too many parameters to consider all possible cross features. To resolve this issue, matrix factorization [1, 10] and factorization machines (FM) [17, 18] were proposed to learn the effects of cross features through dot products of two feature embedding vectors. Based on FM, Field-aware Factorization Machines (FFM) [8, 9] were proposed to use field information to model the different interaction effects of features from different field pairs. Recently, the Field-weighted Factorization Machine (FwFM) [14, 15] was proposed to consider field information in a more parameter-efficient way.
Existing models that consider field information either have too many parameters, such as FFM [8, 9], or are not very effective, such as FwFM [15]. We propose to use a field matrix between two feature vectors to model their interactions, where the matrix is learned separately for each field pair. We will show that our field-pair matrix approach achieves good accuracy while maintaining computational space and time efficiency.
2 RELATED WORK
Logistic Regression (LR) is the most widely used model on multi-field categorical data for CTR prediction [3, 19]. Suppose there are $m$ unique features $\{f_1, \cdots, f_m\}$ and $n$ different fields $\{F_1, \cdots, F_n\}$. Each field may contain multiple features, while each feature belongs to only one field. To simplify the notation, we use index $i$ to represent feature $f_i$, and $F(i)$ to represent the field that $f_i$ belongs to. Given a data set $S = \{y^{(i)}, x^{(i)}\}$, where $y^{(i)} \in \{1, -1\}$ is the label (clicked or not) and $x^{(i)} \in \{0, 1\}^m$ is the feature vector in which $x_j^{(i)} = 1$ if feature $j$ is active for this instance and $x_j^{(i)} = 0$ otherwise,
the LR model parameters $w$ are estimated by minimizing the following loss function:
$$\min_{w} \sum_{i=1}^{|S|} \log\left(1 + \exp\left(-y^{(i)} \Phi_{LR}(w, x^{(i)})\right)\right) + \frac{\lambda}{2}\|w\|_2^2 \qquad (1)$$
The first term is the log loss, and the second term is the L2 regularization term, where $\lambda$ is the regularization weight, and
$$\Phi_{LR}(w, x) = w_0 + \sum_{i=1}^{m} x_i w_i \qquad (2)$$
is a linear combination of individual features.
However, linear models lack the capability to represent feature interactions [3]. As cross features may be more informative than single features, many improvements have been proposed in the past decades.
Degree-2 Polynomial (Poly2) models address this problem in a general way by adding feature conjunctions. It has been shown that Poly2 models can effectively capture the effect of feature interactions [2]. Mathematically, in the loss function of Equation (1), Poly2 models replace $\Phi_{LR}$ with
$$\Phi_{Poly2}(w, x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j w_{h(i,j)} \qquad (3)$$
where $h(i, j)$ is a function that hashes the feature conjunction $(i, j)$ into a natural number in a hashing space of size $H$ to reduce the number of parameters. Otherwise the number of parameters in the model would be on the order of $O(m^2)$, which is too many to be learned.
Factorization Machines (FM) learn an embedding vector $v_i \in \mathbb{R}^K$ for each feature, where $K$ is a hyper-parameter and is usually a small integer, e.g., 10. FM models the interaction between two features $i$ and $j$ as the dot product of their corresponding embedding vectors $v_i, v_j$:
$$\Phi_{FM}((w, v), x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j \langle v_i, v_j \rangle \qquad (4)$$
FM usually outperforms Poly2 models in applications involving sparse data such as CTR prediction, because it models the interaction between two features by a dot product of their corresponding embedding vectors. The embedding vector of a feature is meaningful as long as the feature appears enough times during model training. However, FM neglects the fact that a feature might behave differently when it interacts with features from different fields.
Field-aware Factorization Machines (FFM) model such differences explicitly by learning $n - 1$ embedding vectors for each feature $i$, and only using the corresponding one, $v_{i,F(j)}$, to interact with another feature $j$ from field $F(j)$:
$$\Phi_{FFM}((w, v), x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j \langle v_{i,F(j)}, v_{j,F(i)} \rangle \qquad (5)$$
Although FFM has obtained significant performance improvements over FM, its number of parameters is on the order of $O(mnK)$. Such a huge number of parameters is undesirable in real-world production systems [8]. Therefore, it is appealing to design alternative approaches that are competitive and more memory-efficient.
Field-weighted Factorization Machines (FwFM) were proposed in [15] to model the different field interaction strengths explicitly. More specifically, the interaction of a feature pair $i$ and $j$ is modeled as $x_i x_j \langle v_i, v_j \rangle r_{F(i),F(j)}$, where $v_i, v_j$ are the embedding vectors of $i$ and $j$, $F(i), F(j)$ are the fields of features $i$ and $j$, respectively, and $r_{F(i),F(j)} \in \mathbb{R}$ is a weight that models the interaction strength between fields $F(i)$ and $F(j)$.
The formulation of FwFM is:
$$\Phi_{FwFM}((w, v), x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j \langle v_i, v_j \rangle r_{F(i),F(j)} \qquad (6)$$
FwFM extends FM in the sense that it uses additional weights $r_{F(i),F(j)}$ to explicitly capture the different interaction strengths of different field pairs. FFM can model this implicitly, since it learns several embedding vectors for each feature $i$, each one $v_{i,F_k}$ corresponding to one of the other fields $F_k \neq F(i)$, to model its different interactions with features from different fields. However, the model complexity of FFM is significantly higher than that of FM and FwFM.
Recently, there has also been much work on deep-learning-based click prediction models [4, 6, 7, 12, 16, 20-23, 25]. These models capture both low-order and high-order interactions and achieve significant performance improvements. However, the online inference complexity of these models is much higher than that of the shallow models [5]. Model compression techniques such as pruning [5], distillation [24], or quantization are usually needed to accelerate these models for online inference. In this paper, we focus on improving the low-order interactions, and the proposed model can easily be used as a shallow component in these deep learning models.

3 FIELD-MATRIXED FACTORIZATION MACHINES (FmFM)
We propose a new model that represents the interaction of each field pair as a matrix. Similar to FM and FwFM, we learn an embedding vector for each feature. We define a matrix $M_{F(i),F(j)}$ to represent the interaction between field $F(i)$ and field $F(j)$:
$$x_i x_j \langle v_i M_{F(i),F(j)}, v_j \rangle$$
where $v_i, v_j$ are the embedding vectors of features $i$ and $j$, $F(i), F(j)$ are the fields of features $i$ and $j$, respectively, and $M_{F(i),F(j)} \in \mathbb{R}^{K \times K}$ is a matrix that models the interaction between field $F(i)$ and field $F(j)$. We name this model Field-matrixed Factorization Machines (FmFM):
$$\Phi_{FmFM}((w, v), x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j \langle v_i M_{F(i),F(j)}, v_j \rangle \qquad (7)$$
FmFM extends FwFM in that it uses a two-dimensional matrix $M_{F(i),F(j)}$ for each field pair, instead of a scalar weight $r$ as in FwFM. With these matrices, a feature embedding can be transformed into the spaces of the $n - 1$ other fields. Figure 1 shows an example of how FmFM calculates the interaction terms of feature pairs $(v_i, v_j)$ and $(v_i, v_k)$, where features $i$, $j$, and $k$ are from 3 different fields.
Figure 1: An example of FmFM interaction terms calculation (embedding lookup, transformation, dot product; $v_{i,F(j)} = v_i \times M_{F(i)F(j)}$ and $v_{i,F(k)} = v_i \times M_{F(i)F(k)}$)
The calculation can be decomposed into 3 steps:
(1) Embedding Lookup: the feature embedding vectors $v_i$, $v_j$, and $v_k$ are looked up from the embedding table; $v_i$ is shared between the two pairs.
(2) Transformation: $v_i$ is multiplied by the matrices $M_{F(i)F(j)}$ and $M_{F(i)F(k)}$ respectively, yielding the intermediate vectors $v_{i,F(j)} = v_i \times M_{F(i)F(j)}$ for field $F(j)$ and $v_{i,F(k)} = v_i \times M_{F(i)F(k)}$ for field $F(k)$.
(3) Dot Product: the final interaction terms are simple dot products between $v_j$ and $v_{i,F(j)}$, and between $v_k$ and $v_{i,F(k)}$ (the black dots shown in Figure 1).
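To make the three steps concrete, here is a minimal NumPy sketch of one FmFM interaction-term calculation. The embedding size, the random values, and the variable names are illustrative assumptions; the paper's actual implementation is in TensorFlow.

```python
import numpy as np

K = 4  # embedding dimension (illustrative; real settings use larger K)
rng = np.random.default_rng(0)

# Step 1 - Embedding Lookup: one shared embedding per feature.
v_i = rng.normal(size=K)  # feature i, shared across both pairs
v_j = rng.normal(size=K)  # feature j
v_k = rng.normal(size=K)  # feature k

# One trainable matrix per field pair.
M_Fi_Fj = rng.normal(size=(K, K))
M_Fi_Fk = rng.normal(size=(K, K))

# Step 2 - Transformation: project v_i into the target fields' spaces.
v_i_Fj = v_i @ M_Fi_Fj  # intermediate vector for field F(j)
v_i_Fk = v_i @ M_Fi_Fk  # intermediate vector for field F(k)

# Step 3 - Dot Product: the interaction terms of Eq. (7).
term_ij = v_i_Fj @ v_j  # <v_i M_{F(i)F(j)}, v_j>
term_ik = v_i_Fk @ v_k  # <v_i M_{F(i)F(k)}, v_k>
print(term_ij, term_ik)
```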
FmFM shares a similar design with FM and FwFM while extending them. In this section, we dive into their designs, explain their structures within the 3-step FmFM framework above, and work out the fundamental relationships among these factorization machine models.
Figure 2 shows the calculation of feature interactions in FM. The difference from FmFM is that FM skips step 2 and uses the shared embedding $v_i$ to do the final dot products with $v_j$ and $v_k$ directly. Since $v_i = v_i I_K$, we can construct an identity matrix $I_K$ and let all field matrices equal $I_K$. As the identity matrix in Figure 2 shows, FM is actually a special case of FmFM in which all field matrices are $I_K$. Since these matrices $I_K$ are fixed and non-trainable, we define the degree of freedom of FM to be 0.
Figure 2: An explanation of FM within the FmFM framework
Figure 3 shows the calculation of feature interactions in FwFM. There is a change from the original definition in Eq. (6); however, it is easy to see that
$$\langle v_i, v_j \rangle r_{F(i),F(j)} = \langle v_i r_{F(i),F(j)}, v_j \rangle$$
Thus, in Figure 3 we calculate the term $v_i r_{F(i),F(j)}$ first, instead of $\langle v_i, v_j \rangle$ as in the original definition of Eq. (6). It is now clear that the intermediate vector in step 2 is actually a scaled embedding vector:
$$v_{i,F(j)} = v_i r_{F(i)F(j)} = v_i \left( r_{F(i)F(j)} I_K \right)$$
Thus, we can construct the field matrix of FwFM as the scalar matrix $r_{F(i)F(j)} I_K$, which is a diagonal matrix with all its main diagonal entries equal to $r$. Its effect on the embedding vector $v_i$ is a scalar multiplication by $r$. We show this matrix at the corner of Figure 3 (left one). Since the scalar $r$ is trainable, FwFM has one more degree of freedom than FM; we define its degree of freedom to be 1.
Figure 3: An explanation of FwFM and FvFM within the FmFM framework
We follow the clue above and give one more degree of freedom to the field matrix of FwFM. Let the field matrix become a diagonal matrix whose main diagonal entries are trainable variables, instead of one single variable as in FwFM. We can then rewrite the intermediate vector as
$$v_{i,F(j)} = v_i D_{F(i)F(j)} = v_i \circ d_{F(i)F(j)}$$
where $D_{F(i)F(j)} = \mathrm{diag}(d_1, d_2, \ldots, d_K)$. This can be expressed more compactly by using a vector $d_{F(i)F(j)} \in \mathbb{R}^K$ instead of the diagonal matrix and taking the Hadamard product ($\circ$) with the vector $v_i$. Figure 3 demonstrates this case with the right matrix at the corner. We name this method Field-vectorized Factorization Machines (FvFM). FvFM has one more degree of freedom than FwFM: the trainable parameter becomes a vector instead of a scalar; thus, we define its degree of freedom to be 2.
Let us revisit FmFM in Figure 1. It has all the degrees of freedom of a matrix, which is 3. All the variables in those matrices are trainable, and we expect FmFM to have a greater predictive capacity than the other factorization machine models. We will evaluate this hypothesis in the next section.
Overall, we have found that FM, FwFM, and FvFM are all special cases of FmFM; the only differences are the restrictions on their field matrices. We summarize them by flexibility in Table 1.

Model | Field Interaction | Degrees of Freedom
FM    | Constant          | 0
FwFM  | Scalar            | 1
FvFM  | Vector            | 2
FmFM  | Matrix            | 3
Table 1: Degrees of freedom in different FM models
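The four variants differ only in how the field-pair parameter is constrained. The following sketch of a unified interaction function is ours, not the paper's code; the function name and argument conventions are illustrative assumptions.

```python
import numpy as np

def interaction(v_i, v_j, field_param=None, kind="FmFM"):
    """Pairwise interaction under the FmFM framework.

    kind="FM":   field_param is ignored (fixed identity matrix, 0 degrees of freedom)
    kind="FwFM": field_param is a scalar r (scalar matrix r*I_K, 1 degree)
    kind="FvFM": field_param is a length-K vector d (diagonal matrix, 2 degrees)
    kind="FmFM": field_param is a K-by-K matrix M (full matrix, 3 degrees)
    """
    if kind == "FM":
        return v_i @ v_j                  # <v_i, v_j>
    if kind in ("FwFM", "FvFM"):
        return (v_i * field_param) @ v_j  # <v_i r, v_j> or <v_i o d, v_j>
    return (v_i @ field_param) @ v_j      # <v_i M, v_j>

K = 4
rng = np.random.default_rng(1)
v_i, v_j = rng.normal(size=K), rng.normal(size=K)
print(interaction(v_i, v_j, kind="FM"))
print(interaction(v_i, v_j, 0.5, kind="FwFM"))
print(interaction(v_i, v_j, rng.normal(size=K), kind="FvFM"))
print(interaction(v_i, v_j, rng.normal(size=(K, K)), kind="FmFM"))
```

Note that the scalar and vector cases share one code path: a scalar broadcasts over the embedding exactly like a constant vector, mirroring the fact that FwFM's scalar matrix is a special diagonal matrix.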
FmFM can also be viewed as modeling the interaction of two feature embedding vectors by a weighted outer product:
$$\Phi_{FmFM}((w, v), x) = w_0 + \sum_{i=1}^{m} x_i w_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} x_i x_j f(v_i, v_j, \Gamma_{F(i),F(j)}) \qquad (8)$$
where $\Gamma_{F(i),F(j)} \in \mathbb{R}^{K \times K}$, and
$$f(v_i, v_j, \Gamma_{F(i),F(j)}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} v_{ik} \, v_{jk'} \, \gamma_{k,k'}^{F(i),F(j)} \qquad (9)$$
OPNN [16] also proposed to model the feature interactions via outer products. However, FmFM differs from OPNN in the following two aspects. First, FmFM is a simple shallow model without the fully connected layers of [16]; we can use FmFM as a shallow component or a building block in any deep CTR model, like DeepFM [6]. Second, we support variable embedding dimensions for features from different fields, which will be discussed in Section 4.1.
Unlike the other factorization machines above, FFM cannot be reformulated within the FmFM framework, since it looks up its feature embeddings differently; we illustrate its interaction-term calculation in Figure 4. FFM never shares a feature embedding; it always looks up field-specific embeddings from the embedding tables. Thus, there are $n - 1$ independent embedding vectors for each feature.
Figure 4: An example of FFM
This FFM mechanism gives the model maximal flexibility to fit the data, and the huge number of embedding parameters also provides incredible memorization capacity. Meanwhile, there is always a risk of over-fitting, even with billions of training instances. Feature frequencies follow a long-tail distribution rather than a uniform one, which makes the distribution of feature pairs even more imbalanced.
Consider the example in Figure 4, and assume that the feature pair $(v_i, v_j)$ is a high-frequency combination while $(v_i, v_k)$ is a low-frequency one (possibly frequency 0, i.e., never appearing). Since $v_{i,F(j)}$ and $v_{i,F(k)}$ are two independent embeddings in the FFM setting, the embedding $v_{i,F(j)}$ may be trained well but $v_{i,F(k)}$ may not. Due to the long-tail distribution, the high-frequency feature pairs dominate the training data, while the low-frequency features, which dominate the feature space, cannot be trained well.
FmFM uses shared embedding vectors: there is only one copy for each single feature. It relies on the transformation process to project this single embedding vector into the other $n - 1$ field spaces, so all intermediate vectors $v_{i,F(\cdot)}$ are tied to the original embedding vector $v_i$. With the field matrices, the vectors are transformable forward and backward. That is the fundamental difference between FFM and FmFM; those tied intermediate vectors of the same feature help the model learn the low-frequency feature pairs well.
Back to the example in Figure 1: even if the feature pair $(v_i, v_k)$ has low frequency, the feature embedding $v_i$ can still be trained well through other high-frequency feature pairs like $(v_i, v_j)$, and the field matrix $M_{F(i),F(k)}$ can be trained well through other feature interactions between field $F(i)$ and field $F(k)$. Thus, if the low-frequency feature pair $(v_i, v_k)$ occurs during evaluation or testing, the intermediate vector $v_{i,F(k)}$ can still be inferred through $v_i M_{F(i),F(k)}$.
Despite this difference between FFM and FmFM, they have more in common. An interesting observation from Figures 4 and 1 is that, once all transformations are done, the FmFM model becomes an FFM model. We can cache those intermediate vectors and avoid matrix operations at runtime; the details will be discussed in the next section. In contrast, an FFM model cannot be reduced to an FmFM model, as mentioned above, because its $n - 1$ embedding vectors per feature are independent of each other.
The number of parameters in FM is $m + mK$, where $m$ accounts for the weights of the features in the linear part $\{w_i \mid i = 1, \ldots, m\}$ and $mK$ accounts for the embedding vectors of all the features $\{v_i \mid i = 1, \ldots, m\}$. FwFM uses $n(n-1)/2$ additional parameters $\{r_{F_i,F_j} \mid i, j = 1, \ldots, n\}$, one for each field pair, so the total number of parameters of FwFM is $m + mK + n(n-1)/2$.
The additional matrices in FmFM consist of $n(n-1)/2$ field-pair matrices, which add $n(n-1)K^2/2$ parameters. For FFM, the number of parameters is $m + m(n-1)K$, since each feature has $n - 1$ embedding vectors. Since $n \ll m$, the numbers of parameters of the other factorization machines are comparable to that of FM and significantly smaller than that of FFM. In Table 2 we compare the model complexity of all the models mentioned so far. In the setting of Section 5 on the public Criteo data set, FM, FwFM, and FmFM have similar sizes, while FFM is more than a dozen times larger than the others.

Model | Number of Parameters
LR    | $m$
Poly2 | $m + H$
FM    | $m + mK$
FwFM  | $m + mK + n(n-1)/2$
FvFM  | $m + mK + n(n-1)K/2$
FmFM  | $m + mK + n(n-1)K^2/2$
FFM   | $m + m(n-1)K$

Table 2: A summary of model complexities (ignoring the bias term $w_0$)

4 MODEL OPTIMIZATION
In this section we present our methodologies for optimizing the FmFM model further. There are 3 tactics we can devise to reduce the complexity of FmFM. In Section 4.1 we introduce variable embedding dimensions, a property unique to FmFM; it allows us to use field-specific dimensions in the embedding table, instead of a fixed length $K$ globally. In Section 4.2 we introduce a method to cache the intermediate vectors and avoid the matrix operations, which reduces the FmFM model's computational complexity at runtime. In Section 4.5 we introduce a method to remove the linear terms and replace them with field-specific weights.

4.1 Variable Embedding Dimensions
The main improvement of FM over the LR model is that FM uses an embedding vector to represent each feature. To compute the dot product, it requires the vector dimension $K$ of all feature embeddings to be the same, even though features come from different fields. The improved models FwFM and FvFM also inherit this property. The vector dimension matters both for model complexity and model performance; [15] discussed this trade-off between performance and time cost, but there the vector dimension can only be optimized globally.
When we use matrix multiplication in FmFM, the field matrices are not actually required to be square; we can adjust the output vector length by changing the shape of the field matrix. This property gives us the flexibility to set field-specific lengths on demand in the embedding table, as we show in Figure 5.
Figure 5: An example of variable vector lengths of embeddings
The dimension of an embedding vector determines the amount of information it can carry; this property lets us accommodate the needs of each field. For the example $(i, j)$ in Figure 5, the field user_gender may contain only 3 values (male, female, other), while another field top_level_domain may contain more than 1 million features. Thus the embedding table of field user_gender may need only 5 dimensions (5D), while field top_level_domain may need 7D. The field matrix $M$ is then set up with shape $(5, 7)$. When we cross a feature from user_gender with one from top_level_domain, the matrix transforms the 5D feature vector into a 7D vector, making it ready for a dot product with the feature vector from field top_level_domain.
To optimize the field-specific embedding dimensions without loss of model performance, we propose a 2-pass method. In the first pass, we use a larger fixed embedding dimension for all fields, e.g., $K = 16$, and train FmFM as a full model. From the full model, we learn how much information (variance) each field carries, and then apply a standard PCA dimensionality reduction to the embedding table of each field. From the experiment in Section 5.4, we found that choosing the new dimension that retains 95% of the original variance is a good trade-off. With this new field-specific dimension setting, we train the model in a second pass from scratch; it should match the first full model without any significant performance loss.
Table 3: The optimized dimensions for all fields in the Criteo data set, keeping 95% of the variance
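The per-field dimension choice in the second pass can be derived as follows; this is a minimal sketch under our own assumptions (the use of scikit-learn's PCA and the function name are ours, not the paper's code).

```python
import numpy as np
from sklearn.decomposition import PCA

def field_dim_for_variance(field_embeddings, keep=0.95):
    """Smallest dimension whose principal components retain `keep` of the variance.

    field_embeddings: (num_features_in_field, K) slice of a trained
    full-size embedding table, for one field.
    """
    pca = PCA().fit(field_embeddings)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, keep) + 1)

# Illustrative: a field whose trained embeddings mostly live in a
# low-dimensional subspace of the full K = 16 space.
rng = np.random.default_rng(2)
emb = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 16))  # rank ~3
print(field_dim_for_variance(emb))  # -> around 3
```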
Table 3 shows the optimized dimension for each categorical field in the Criteo data set using the PCA method. The range of these dimensions is wide, from 2 to 14, and most of the dimensions are less than 10. The average $K$ is only 7.72, which is less than the optimal setting in FwFM. While keeping most of the variance in the data set, the lower average dimension means the model has fewer parameters and requires less memory.

4.2 Caching Intermediate Vectors
FmFM has lower complexity than FFM in the number of parameters; however, it requires expensive matrix operations in the transformation step. In Table 4, we list the number of floating-point operations (FLOPs) each model takes per prediction, estimated with typical settings. Among the factorization machine models, FmFM needs the most operations, about $K$ times as many as FwFM and FFM, but it is still faster than most DNN models. In Section 3.2 we have shown that an FmFM model can be transformed into an FFM model by caching all intermediate vectors. In this way, we can reduce its number of operations to the same magnitude as FwFM and FFM, which is almost 20 times faster.
To estimate the FLOPs we use the settings of Section 5, where $H$ denotes the number of nodes in the hidden layers of the DNN, $L$ the number of hidden layers, $K'$ the dimension of the embedding vectors in the new space in AutoInt, and $s_{FwFM}$, $s_{DNN}$ the sparsity rates of the FwFM and DNN components in DeepLight.

Model | Floating-Point Operations | Estimated
LR | $O(n)$ | -
Poly2 | $O(n^2)$ | -
FM | $O(nK)$ | -
FFM | $O(n^2 K)$ | -
FwFM | $O(n^2 K)$ | -
FvFM | $O(n^2 K)$ | -
FmFM | $O(n^2 K^2)$ | -
FmFM (cached & 95% variance) | $O(n^2 K)$ | ~8,960
Wide & Deep | $O(n^2 + nKH + LH^2)$ | ~500,000
Deep & Cross | $O(n^2 K + nKH + LH^2)$ | ~510,000
DeepFM | $O(nKH + LH^2 + n^2)$ | ~246,000
xDeepFM | $O(n^2 H^2 K L)$ | ~150,000,000
AutoInt | $O(nHK'(n + K))$ | ~28,000,000
FiBiNET | $O(n^2 K + n^2 KH + LH^2)$ | ~10,000,000
DeepLight | $O(n^2 K (1 - s_{FwFM}) + (nKH + LH^2)(1 - s_{DNN}))$ | ~102,000

Table 4: A summary of floating-point operations by model
4.3 Combining Both Optimizations
When we combine the variable-dimension optimization with the cache optimization, inference can be made much faster and the required memory can be reduced significantly. This benefits from another property of FmFM: the interaction matrices are symmetric, in the sense that
$$\langle v_i M_{F(i)F(j)}, v_j \rangle = \langle v_j M^{T}_{F(i)F(j)}, v_i \rangle \qquad (10)$$
We give a proof of this lemma in the Appendix. Thus, we can choose to cache whichever intermediate vector has the lower field dimension, and take the dot product with the other feature vector during inference. For example, in the setting of Table 3, consider two features $v_i$ and $v_j$ from a 2-dimensional field and a 14-dimensional field, respectively; we can cache either $v_i M_{F(i)F(j)}$ or $v_j M^{T}_{F(i)F(j)}$. Since the field matrix $M_{F(i)F(j)}$ has shape $[2, 14]$, the former increases the dimension from 2 to 14; it costs 14 units of memory for the intermediate-vector cache and takes about $2 \times 14$ FLOPs per dot product during inference. By contrast, the latter reduces the dimension from 14 to 2; it costs only 2 units of memory for the cache and takes about $2 \times 2$ FLOPs during inference.
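A sketch of this choice follows, reusing the 2-dimensional and 14-dimensional fields of the example above; the names and random values are illustrative. The final assertion checks the symmetry of Eq. (10) numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 2, 14                  # dimensions of the two fields
v_i = rng.normal(size=p)      # feature from the 2D field
v_j = rng.normal(size=q)      # feature from the 14D field
M = rng.normal(size=(p, q))   # rectangular field matrix, shape (2, 14)

# Option 1: cache v_i M   -> 14 floats cached, ~2*14 FLOPs per dot product.
cache_hi = v_i @ M
# Option 2: cache v_j M^T ->  2 floats cached, ~2*2 FLOPs per dot product.
cache_lo = v_j @ M.T

# Both give the same interaction term (symmetry lemma, Eq. (10)),
# so caching the lower-dimensional side is strictly cheaper.
assert np.isclose(cache_hi @ v_j, cache_lo @ v_i)
print(cache_lo @ v_i)
```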
With this scheme, the cache memory and inference FLOPs are only about 1/3 of those of FFM. In Section 5.4, we will show that this optimized model achieves the same performance as the full model.

4.4 Soft Pruning
The variable embedding dimensions also act in a role similar to pruning. While traditional pruning, such as DeepLight [5], makes a binary decision to keep or drop a field or a field pair, the variable embedding dimensions give us a new way to determine the importance of each field and field pair on demand, assigning each field a factor that represents its importance. For example, in the FmFM model of Table 3, a high-strength field pair is assigned higher dimensions and costs more FLOPs during inference; in contrast, a low-strength field pair is assigned lower dimensions and costs only a few FLOPs.
Figure 6: An example of mutual information scores between field pairs and the label in the Criteo data set
Figure 6 shows a heat map of mutual information scores between field pairs and labels in the Criteo data set, which represents the predictive strength of each field pair. Figure 7 shows the cross-field dimensions, i.e., the lower dimension of the two fields in each pair (see Section 4.3); it represents the parameter and computational cost of each field pair. These two heat maps are clearly highly correlated, which means the optimized FmFM model allocates more parameters and more computation to the high-strength field pairs, and fewer parameters and less computation to the low-strength field pairs.
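Scores like those in Figure 6 can be computed by treating each field pair as a single joint categorical variable and measuring its mutual information with the label. The following is our own formulation of that computation, not the paper's code; the sample data is hypothetical.

```python
import numpy as np
from collections import Counter

def mutual_information(pair_values, labels):
    """I(field_pair; label) for a joint categorical feature and a binary label."""
    n = len(labels)
    p_xy = Counter(zip(pair_values, labels))  # joint counts
    p_x = Counter(pair_values)                # marginal counts of the pair
    p_y = Counter(labels)                     # marginal counts of the label
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * np.log(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

# Illustrative joint values from two fields, e.g. (user_gender, domain_id).
pairs = [("m", 1), ("m", 1), ("f", 2), ("f", 2), ("m", 2), ("f", 1)]
labels = [1, 1, 0, 0, 1, 0]
print(mutual_information(pairs, labels))
```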
4.5 Linear Terms Optimization
There is a linear part in Eq. (8):
Figure 7: An example of cross-field dimensions, $\min(D_i, D_j)$, in the Criteo data set
$$\sum_{i=1}^{m} x_i w_i \qquad (11)$$
which requires an extra scalar $w_i$ to be learned for each feature. However, the learned embedding vector $v_i$ already contains richer information, and the weight $w_i$ can be derived from $v_i$ by a simple dot product. Another benefit of using the learned $v_i$ is that the linear part, in turn, helps to train the embedding vector. We follow the method of [15] and learn a field-specific vector $w_{F(i)}$, so that all features from the same field $F(i)$ share the same linear-weight vector. The linear terms can then be rewritten as:
$$\sum_{i=1}^{m} x_i \langle v_i, w_{F(i)} \rangle \qquad (12)$$
We apply this linear-term optimization to FwFM, FvFM, and FmFM by default in the rest of the paper.
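To make Eq. (12) concrete, here is a minimal sketch of the shared field-level linear term; the sizes, random values, and single-field setup are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4
v = rng.normal(size=(3, K))    # embeddings of 3 features, all from one field F
w_F = rng.normal(size=K)       # one shared linear-weight vector for field F
x = np.array([1.0, 0.0, 1.0])  # active-feature indicators

# Eq. (11) would need a separate scalar w_i per feature; Eq. (12) instead
# derives the per-feature weight from its embedding: w_i = <v_i, w_F>.
linear_term = float(x @ (v @ w_F))
print(linear_term)
```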
5 EXPERIMENTS
In this section we present our experimental evaluation results. We first describe the data sets and implementation details in Sections 5.1 and 5.2, respectively. In Section 5.3 we compare FmFM with baseline models such as LR, FM, FwFM, and FFM, as well as state-of-the-art methods such as Wide & Deep, Deep & Cross Network, xDeepFM, AutoInt, FiBiNET, and DeepLight. In Section 5.4, we run a few experiments on the Criteo data set and observe how the model performance changes when we apply variable embedding dimensions.

5.1 Data Sets
We use 2 public data sets to evaluate our model performance:
(1) The first is the Criteo CTR data set, a well-known benchmark data set used in the Criteo Display Advertising Challenge [11]. The Criteo data set is already label balanced; the positive-to-negative ratio is about 1:3. There are 45 million samples, and each sample has 13 numerical fields and 26 categorical fields.
(2) The second is the Avazu CTR data set, used in the Avazu CTR prediction competition, whose task is to predict whether a mobile ad will be clicked. The positive-to-negative ratio in the Avazu data set is about 1:5. There are 40 million samples, and each sample has 23 categorical fields.
Following existing works [5, 6, 12, 20-23, 25], we split each data set into 3 parts randomly: 80% for training, 10% for validation, and 10% for testing.
For the numerical features in the Criteo data set, we adopt the log transformation $\lfloor \log^2(x) \rfloor$ for $x > 2$, proposed by the winner of the Criteo competition, to normalize the numerical features. This method was also used by [5] and [21]. For the date/hour feature in the Avazu data set, we transform it into 2 features, day_of_week (0-6) and hour (0-23), to consume the feature better.
We also remove the infrequent features whose counts are below a threshold in both data sets, and replace their values with a default "unknown" feature in that field. We set the threshold to 8 for the Criteo data set and 5 for the Avazu data set.
The statistics of the normalized data sets are shown in Table 5.

Data set | Split | Samples | Fields | Features | Pos:Neg
Criteo | Train | 36,672,493 | 39 | 1,327,180 | ~1:3
Criteo | Validation | 4,584,062 | | |
Criteo | Test | 4,584,062 | | |
Avazu | Train | 32,343,173 | 23 | 1,544,257 | ~1:5
Avazu | Validation | 4,042,897 | | |
Avazu | Test | 4,042,897 | | |

Table 5: Statistics of the training, validation, and test sets of the Criteo and Avazu data sets
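A sketch of the two preprocessing steps described above follows. The log rule is the commonly cited form of the Criteo winner's normalization and is an assumption on our part; the helper names are illustrative.

```python
import math
from collections import Counter

def transform_numeric(x):
    """Criteo-winner normalization: floor(log(x)^2) for x > 2 (assumed form)."""
    return math.floor(math.log(x) ** 2) if x > 2 else x

def replace_infrequent(column_values, threshold):
    """Map features seen fewer than `threshold` times to a shared 'unknown'."""
    counts = Counter(column_values)
    return [v if counts[v] >= threshold else "unknown" for v in column_values]

print(transform_numeric(100))                   # -> 21
print(replace_infrequent(["a", "a", "b"], 2))   # -> ['a', 'a', 'unknown']
```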
5.2 Implementation Details
We have implemented LR (logistic regression) and all the factorization machine models (FM, FwFM, FFM, FvFM, and FmFM) with TensorFlow. We follow the implementation of LR and FM in [16], and implement FFM and FwFM following [15]. We have open-sourced the training code and the feature extraction code at https://github.com/VerizonMedia/FmFM.
We evaluate the performance of all models by AUC (Area Under the ROC Curve) and log loss on the test set. Note that a slightly higher AUC or lower log loss at the 0.001 level is regarded as a significant improvement for the CTR prediction task, as has also been pointed out in existing works [4, 21, 22].
The state-of-the-art baselines are all DNN models and may need more hyper-parameter tuning, so we take their performance numbers (AUC and log loss) from their original papers, in order to keep their results optimal. This comparison is fair to them, since we use a stricter data split: some of those works use 90% for training and 10% for testing, while we hold out a validation set and train on less data. Their implementations may differ slightly in, e.g., feature processing or optimizer (Adam or Adagrad), so we list their results for reference. The Deep & Cross Network is an exception, since its paper reports only log loss but not AUC; we therefore implemented the model ourselves and obtained similar performance.
5.3 Performance Comparison
In this section we conduct the performance evaluation of FmFM, comparing it with LR, FM, FwFM, and FFM on the two data sets mentioned above. We always compare the full-size models; that is, we do not use any of the optimization methods of Section 4. For hyper-parameters such as the regularization coefficient $\lambda$ and the learning rate $\eta$, we select, for every model, the values that lead to the best performance on the validation set and then use them in the evaluation on the test set. Experiment results can be found in Table 6 for the Criteo data set and Table 7 for the Avazu data set.

Models | AUC (Training) | AUC (Validation) | AUC (Test) | Log Loss (Test)
LR | 0.7930 | 0.7918 | 0.7917 | 0.4582
FM | 0.8142 | 0.8075 | 0.8075 | 0.4441
FFM | - | - | - | -
FwFM | - | - | - | -
FvFM | - | - | - | -
FmFM | - | - | - | -
Deep & Cross | 0.8244 | 0.8118 | 0.8118 | 0.4413
Wide & Deep | - | - | 0.7981 | 0.4677
DeepFM | - | - | 0.8007 | 0.4508
xDeepFM | - | - | 0.8052 | 0.4418
AutoInt | - | - | 0.8061 | 0.4454
FiBiNET | - | - | 0.8103 | 0.4423
DeepLight | - | - | - | -

Table 6: Comparison among models on the Criteo data set
We observe that FvFM and FmFM achieve better performance than LR, FM, and FwFM on both data sets, which matches our expectation. Surprisingly, FmFM also consistently beats FFM on both test sets. As mentioned before, even though FFM is a model dozens of times larger than FmFM, our FmFM model still obtains the best test AUC among all shallow models. Additionally, comparing the differences in AUC between training and test, we found that the train-test gap $\Delta AUC$ of FmFM is smaller than that of FFM, which suggests FmFM is less prone to over-fitting.
Models | AUC (Training) | AUC (Validation) | AUC (Test) | Log Loss (Test)
LR | 0.7526 | 0.7521 | 0.7517 | 0.3953
FM | 0.7744 | 0.7696 | 0.7695 | 0.3857
FFM | - | - | - | -
Deep & Cross | 0.8109 | 0.7825 | 0.7826 | 0.3791
AutoInt | - | - | 0.7752 | 0.3823
Fi-GNN | - | - | 0.7762 | 0.3825
FGCNN+IPNN | - | - | 0.7883 | 0.3746
DeepLight | - | - | - | -

Table 7: Comparison among models on the Avazu data set

5.4 Embedding Dimension Optimization
In this part we implement the method described in Section 4.1: from a trained full-size model, we extract the embedding table of each field and then apply a standard PCA dimensionality reduction. We run several experiments to measure how the dimensionality reduction impacts the model performance, looking for a trade-off between model size, speed, and performance.
We use the full-size FmFM model from the experiment of Section 5.3 on the Criteo data set to evaluate our metrics. We keep 99%, 97%, 95%, 90%, 85%, and 80% of the variance in the PCA dimensionality reduction, respectively, and estimate the average embedding dimensions and floating-point operations (FLOPs, with cached intermediate vectors). With each new dimension setting, we train the FmFM model in a second pass and observe the change in AUC and log loss on the test set.
Table 8 summarizes the experiments. The average embedding dimension drops significantly as we keep less variance in PCA: keeping 95% of the variance needs less than 1/2 of the embedding dimensions and 1/3 of the computation cost, with no significant change in the model's performance compared to the full-size model. Thus, 95% variance is a good trade-off when we optimize the embedding dimensions in FmFM.

Variance% | Emb Dim (Average) | FLOPs (Estimated) | AUC | Log Loss
95% | 7.72 (48.2%) | 8,960 (36.5%) | 0.8108 | 0.4411
90% | 6.26 (39.1%) | 7,202 (29.4%) | 0.8103 | 0.4415
85% | 3.82 (23.9%) | 4,716 (19.2%) | 0.8084 | 0.4432
80% | 3.36 (21.0%) | 4,392 (17.9%) | 0.8080 | 0.4436
Table 8: Comparison among FmFM models with embedding-dimension optimization on the Criteo data set
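The "average dimension" and FLOPs columns of Table 8 can be approximated directly from the per-field dimensions. The sketch below is ours, under the simplifying assumption that a cached interaction between fields of dimensions $D_i$ and $D_j$ costs about $2 \cdot \min(D_i, D_j)$ FLOPs, as in Section 4.3.

```python
from itertools import combinations

def estimate(dims):
    """Average embedding dimension and rough cached-inference FLOPs.

    dims: optimized embedding dimension per field (e.g. PCA-derived).
    Assumes each field pair costs ~2*min(D_i, D_j) FLOPs after caching.
    """
    avg_dim = sum(dims) / len(dims)
    flops = sum(2 * min(a, b) for a, b in combinations(dims, 2))
    return avg_dim, flops

# Illustrative dimensions for a handful of fields
# (hypothetical values, not the paper's Table 3).
print(estimate([2, 5, 7, 14, 10]))
```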
Figure 8 shows these models' performance (AUC) and computational complexity (FLOPs). As a shallow model, the optimized FmFM achieves higher AUC as well as lower FLOPs than all the baseline models except Deep & Cross and DeepLight. Its computational cost is much lower than these two complex models, which ensemble a DNN module and a shallow module: its FLOPs are only 1.76% and 8.78% of theirs, respectively. The low FLOPs count makes it preferable when computation latency is strictly limited, which is the common scenario in real-time online ad CTR prediction and recommender systems.
Figure 8: AUC and FLOPs comparison among all models on the Criteo data set
6 CONCLUSION AND FUTURE WORK
In conclusion, we propose FmFM, a novel approach that models the interactions of field pairs as matrices. We show that FmFM is a unified framework for the factorization machine model family, in which both FM and FwFM can be treated as special cases. We devise several optimizations for FmFM, including variable embedding dimensions and caching of intermediate vectors. These optimizations make FmFM lightweight and fast during inference, taking only thousands of floating-point operations to make a prediction. We have run comprehensive experiments to verify the effectiveness and efficiency of the proposed model. It achieves state-of-the-art performance among all shallow models, including FM, FFM, and FwFM, and its performance is even comparable to complex DNN models.
With regard to future work, there are a few potential research directions:
• FmFM is still a linear model, since the field interactions are matrices and the embedding vectors are transformed linearly. We can introduce non-linear layers into the field interaction to make the model non-linear and more flexible.
• All the factorization machine models above are degree-2 models, which allow interactions of up to 2 fields. This restriction is mainly due to the dot product. In the future, we can introduce 3-D tensors to allow 3-field interactions, or even higher orders. This will require more model optimization, since there are many more degree-3 interactions.
• We can combine FmFM with DNN models like Wide & Deep [4], DeepFM [6], and DeepLight [5], using FmFM as a building block in DNN models to further improve their performance. We believe this approach will be competitive with the deep-learning-based models discussed in Section 2.
REFERENCES
[1] Michal Aharon, Natalie Aizenberg, Edward Bortnikov, Ronny Lempel, Roi Adadi, Tomer Benyamini, Liron Levin, Ran Roth, and Ohad Serfaty. 2013. OFF-set: one-pass factorization of feature sets for online recommendation in persistent cold start settings. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 375-378.
[2] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. 2010. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research 11, Apr (2010), 1471-1490.
[3] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. 2015. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 4 (2015), 61.
[4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7-10.
[5] Wei Deng, Junwei Pan, Tian Zhou, Aaron Flores, and Guang Lin. 2020. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2020).
[6] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. arXiv preprint arXiv:1703.04247 (2017).
[7] Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. (2017).
[8] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. 2017. Field-aware factorization machines in a real-world online advertising system. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 680-688.
[9] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 43-50.
[10] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[11] Criteo Labs. 2014. Display Advertising Challenge.
[13] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222-1230.
[14] Junwei Pan, Yizhi Mao, Alfonso Lobos Ruiz, Yu Sun, and Aaron Flores. 2019. Predicting different types of conversions with multi-task learning in online advertising. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2689-2697.
[15] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW '18 (2018). https://doi.org/10.1145/3178876.3186040
[16] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149-1154.
[17] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995-1000.
[18] Steffen Rendle. 2012. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 3 (2012), 57.
[19] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web. ACM, 521-530.
[20] Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 255-262.
[21] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2018. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2018).
[22] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. arXiv preprint arXiv:1708.05123 (2017).
[23] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In European Conference on Information Retrieval. Springer, 45-57.
[24] Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, and Kun Gai. 2018. Rocket launching: A universal and efficient framework for training well-performing light net. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[25] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059-1068.
A MATH PROOF
Lemma A.1. Given two row vectors $v_i$ and $v_j$ whose lengths are $p$ and $q$ respectively, and a matrix $M \in \mathbb{R}^{p,q}$, we have
$$v_i \times M \cdot v_j = v_j \times M^T \cdot v_i \qquad (13)$$
where $\times$ denotes matrix multiplication and $\cdot$ denotes the dot product.

Proof. Since $v_i \times M \cdot v_j$ is a scalar, we denote it as $s$. A dot product can be rewritten as a matrix multiplication, so the left-hand side becomes:
$$s = v_i \times M \cdot v_j = v_i \times M \times v_j^T \qquad (14)$$
Since the transpose of a scalar equals itself:
$$s = s^T = \left( v_i \times M \times v_j^T \right)^T = v_j \times M^T \times v_i^T = v_j \times M^T \cdot v_i \qquad (15)$$
Hence, the left-hand side equals the right-hand side.