Fair Kernel Regression via Fair Feature Embedding in Kernel Space
Austin Okray
Department of Computer Science, University of Wyoming, Laramie, Wyoming
Email: [email protected]

Hui Hu
Department of Computer Science, University of Wyoming, Laramie, Wyoming
Email: [email protected]

Chao Lan
Department of Computer Science, University of Wyoming, Laramie, Wyoming
Email: [email protected]
Abstract—In recent years, there have been significant efforts on mitigating unethical demographic biases in machine learning methods. However, very little work has been done for kernel methods. In this paper, we propose a novel fair kernel regression method via fair feature embedding (FKR-FE) in kernel space. Motivated by prior work on feature processing for fair learning and feature selection for kernel methods, we propose to learn fair feature embeddings in kernel space, where the demographic discrepancy of feature distributions is minimized. Through experiments on three public real-world data sets, we show the proposed FKR-FE achieves significantly lower prediction disparity compared with the state-of-the-art fair kernel regression method and several other baseline methods.
I. INTRODUCTION
In recent years, we have witnessed a tremendous growth of machine learning applications in real-world problems that have immediate impacts on people's lives. However, standardly learned models can have unethical predictive biases against minority groups; e.g., in recidivism prediction, a commercialized model has significant bias against innocent black defendants [1]; other biases are found in hiring [2], facial verification [3], violence risk assessment in prison [4], etc.

How to learn fair models has become a significant research topic [5], and many methods have been proposed [6]–[12]. They typically sacrifice a certain amount of prediction accuracy to improve prediction fairness, subject to the accuracy-fairness trade-off. A promising direction is fair kernel learning [13], [14]. By constructing sufficiently complex hypothesis spaces, kernel methods are more likely to contain a model that achieves an efficient accuracy-fairness trade-off. However, this direction is sparsely explored so far. A notable work is fair kernel regression [13], which penalizes a model's predictive bias in kernel space.

In this paper, we propose a novel fair kernel regression method that learns fair feature embeddings (FKR-FE) in the kernel space. It is motivated by the work of Feldman et al. [7], which shows that in a properly transformed data space where different demographic groups have similar feature distributions, a standardly learned prediction model will be naturally fair. We thus seek such a fair transformation in the kernel space. A major challenge is that the kernel space is often implicit, making it hard to find explicit fair transformations therein. To tackle this problem, we borrow ideas from Cao et al. [15], who learn feature embeddings in the kernel space for feature selection. Specifically, we propose to learn fair feature embeddings in the kernel space, such that different demographic groups have similar embedded feature distributions.
We propose to measure similarity using mean discrepancy [16]. Through experiments on three real-world data sets, we show the proposed FKR-FE achieves significantly lower prediction bias than the existing fair kernel regression method as well as several non-kernel fair learning methods, without sacrificing a significant amount of prediction accuracy.

The rest of the paper is organized as follows: in Section II, we revisit related work; in Section III, we present the proposed method; in Section IV, experimental results are presented and discussed; our conclusion is in Section V.
A. Notations and Assumptions
To facilitate discussions of related work, we introduce some notations here. We describe an instance using a triple $(x, s, y)$, where $x$ is a feature vector, $s$ is a protected demographic feature (e.g., gender, race) and $y$ is a label. Assume $s$ is contained in $x$. Similar to prior studies, we assume $s$ is binary. Let there be $n$ instances in the training set, among which $n_u$ belong to the unprotected group ($s = 0$) and $n_p$ belong to the protected group ($s = 1$). Without loss of generality, we assume the instances are ordered such that the first $n_u$ instances $x_1, \ldots, x_{n_u}$ are unprotected and the rest $x_{n_u+1}, \ldots, x_n$ are protected.

For kernel methods, let $\phi(\cdot)$ be the feature mapping function, and $f$ be a prediction model mapping from $\phi(x)$ to $y$.

II. RELATED WORK
A. Fair Kernel Regression
Perez-Suay et al. [13] propose a fair kernel regression method, which directly extends the linear fair learning method [9] to kernel space. Specifically, it minimizes the prediction loss while additionally penalizing the correlation between model prediction and the demographic feature in the kernel space:

$$\min_f \frac{1}{n}\sum_{i=1}^{n} [f(\phi(x_i)) - y_i]^2 + \frac{\mu}{n}\sum_{i=1}^{n} \left(\bar{f}(x_i) \cdot \bar{s}_i\right)^2 + \lambda\,\Omega(f), \qquad (1)$$

where the second term measures predictive bias as the correlation between model prediction and the demographic feature, and $\bar{f}(x)$ and $\bar{s}$ are centered variables; the last term measures model complexity; $\mu$ and $\lambda$ are hyper-parameters. Based on the Representer Theorem, $f$ is a linear combination of the $\phi(x_i)$'s, and task (1) admits an analytic solution for the linear coefficients.

Perez-Suay et al.'s method adopts the regularization approach in fair learning, which penalizes predictive bias during learning (e.g., [8], [9]). In this paper, we adopt another popular approach, which first constructs a fair feature space and then builds a standard model in it (e.g., [6], [7], [12]). In experiments, we show our method can achieve higher prediction fairness.

B. Fair Feature Learning and Mean Discrepancy
An effective approach to learning fair models is to first construct a fair feature space and then learn a standard model in it (e.g., [6], [7]). A fair feature space is one where the feature distributions of different demographic groups are similar, e.g., different groups have similar CDFs of the new features [7], or the statistical dependence between the new features and the demographic feature is low [6].

In this paper, we develop a new fair kernel regression method based on the idea of fair feature learning. Unlike previous studies, we measure feature similarity using mean discrepancy (MD) [16]. MD measures the distance between distributions and is widely used in machine learning [17]–[19]. Let $x_1, \ldots, x_n$ and $z_1, \ldots, z_m$ be two sets generated from distributions $P_x$ and $P_z$ respectively. MD estimates the distance between $P_x$ and $P_z$ as

$$MD(P_x, P_z) = \left\| \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) - \frac{1}{m}\sum_{j=1}^{m} \phi(z_j) \right\|^2. \qquad (2)$$

A technical challenge is that previous fair feature learning approaches assume the feature space is explicit and then modify it to obtain a fairer space. In kernel methods, however, the feature space of $\phi(x)$ is implicit. To tackle this issue, we propose to construct an explicit fair feature space for $\phi(x)$ by learning fair feature embedding functions in the kernel space. This approach is motivated by the literature on feature selection in kernel methods (e.g., [15], [20]).

C. Feature Selection in Kernel Space
Feature selection is a common practice for improving the robustness and interpretability of machine learning models [21]. However, it is not easy in kernel methods, since there is no explicit feature representation in kernel space. Only a few approaches have been proposed, e.g., [15], [20], [22].

Our study is motivated by Cao et al. [15]. They propose to learn an explicit feature representation in the kernel space by learning feature embedding functions $\eta$. They show the optimal function is a linear combination of training instances, and learn such functions by standard methods such as KPCA [23]. After that, an instance $\phi(x)$ is projected onto $\eta$ to obtain an explicit feature representation on which feature selection is performed.

Motivated by Cao et al.'s approach, we propose to learn feature embedding functions that are fair, i.e., different demographic groups have similar distributions in the embedded space. As explained in the previous subsection, similarity is measured by mean discrepancy.

III. FAIR KERNEL REGRESSION VIA LEARNING FAIR FEATURE EMBEDDINGS IN KERNEL SPACE (FKR-FE)

In this section, we present the proposed fair kernel regression via learning fair feature embeddings in kernel space (FKR-FE). Recall an individual is $(x, s, y)$, where $x$ is a feature vector, $s$ is a binary demographic feature and $y$ is a label. There are $n$ training instances, where $x_1, \ldots, x_{n_u}$ are from the unprotected group and $x_{n_u+1}, \ldots, x_n$ are from the protected group.

Our proposed method works in two steps: (i) learn fair feature embeddings in kernel space; (ii) build a standard regression model based on the embedded features.

Step 1. Learn Fair Feature Embeddings in Kernel Space
Our goal is to learn an explicit and fair feature representation for $\phi(x)$. To that end, we propose to learn a fair feature embedding function $\eta$ such that, in the embedded space, the mean discrepancy between the protected group and the unprotected group is minimized:

$$\min_\eta \left| \frac{1}{n_u}\sum_{i=1}^{n_u} \langle \phi(x_i), \eta \rangle - \frac{1}{n_p}\sum_{i=n_u+1}^{n} \langle \phi(x_i), \eta \rangle \right|^2. \qquad (3)$$

Problem (3) cannot be directly solved since there is no explicit representation of $\phi(x)$. Motivated by Cao et al. [15], we assume the optimal $\eta$ is a linear combination of training instances:

$$\eta = \sum_{i=1}^{n} \alpha_i \phi(x_i). \qquad (4)$$

To avoid overfitting, we further assume $\eta$ has a unit norm:

$$\|\eta\|^2 = 1. \qquad (5)$$

Solving (3) under constraints (4) and (5) (detailed arguments are in Appendix A), we have

$$\left( \frac{1}{n_u^2} K_u^T 1_u 1_u^T K_u - \frac{1}{n_u n_p}\left( K_u^T 1_u 1_p^T K_p + K_p^T 1_p 1_u^T K_u \right) + \frac{1}{n_p^2} K_p^T 1_p 1_p^T K_p \right) \alpha = \lambda K \alpha, \qquad (6)$$

where $\alpha = [\alpha_1, \ldots, \alpha_n]^T$ is the vector of unknown parameters and $\lambda$ is an eigenvalue; $K$ is the standard $n$-by-$n$ Gram matrix of all instances; $1_u$ and $1_p$ are vectors of ones of lengths $n_u$ and $n_p$; $K_u$ is $n_u$-by-$n$ and $K_p$ is $n_p$-by-$n$, satisfying

$$K = \begin{bmatrix} K_u \\ K_p \end{bmatrix}. \qquad (7)$$

Formula (6) is a generalized eigenproblem, and $\alpha$ is the generalized eigenvector with the smallest eigenvalue. After $\alpha$ is solved, we obtain the first explicit and fair feature of $\phi(x)$ in the kernel space as

$$\langle \phi(x), \eta \rangle = \sum_{i=1}^{n} \alpha_i\, k(x, x_i). \qquad (8)$$

The above analysis gives the first fair feature embedding function $\eta$ in the kernel space.
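To make this step concrete, the generalized eigenproblem (6) can be solved with standard numerical routines. Below is a minimal sketch of our own (not the authors' released implementation; the function names, the RBF kernel choice, and the jitter term for numerical stability are our assumptions). It uses the fact that the left-hand matrix of (6) is the outer product of the difference of group-mean kernel vectors with itself:

```python
import numpy as np
from scipy.linalg import eigh

def rbf_gram(X, Z, gamma=0.5):
    # Gram matrix of the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

def fair_embeddings(K, n_u, k, jitter=1e-6):
    """Return A (n-by-k): coefficient vectors of the k fair embeddings.

    Solves the generalized eigenproblem M alpha = lambda K alpha of (6),
    where alpha^T M alpha equals the squared mean discrepancy in (3).
    """
    n = K.shape[0]
    v_u = K[:n_u].T @ np.ones(n_u) / n_u              # (1/n_u) K_u^T 1_u
    v_p = K[n_u:].T @ np.ones(n - n_u) / (n - n_u)    # (1/n_p) K_p^T 1_p
    d = v_u - v_p
    M = np.outer(d, d)   # rank-one symmetric matrix of (6)
    # eigh solves M a = w K a with eigenvalues in ascending order,
    # so the first k columns are the fairest embeddings
    w, V = eigh(M, K + jitter * np.eye(n))
    return V[:, :k]
```

By construction of `eigh`, each returned column satisfies $\alpha^T K \alpha \approx 1$, which matches the unit-norm constraint (5) through (19) in Appendix A.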
Now we present how to obtain the second embedding $\eta'$; the rest can be derived in a similar fashion. The second optimal embedding $\eta'$ is obtained similarly to $\eta$, with an additional constraint that it should be orthogonal to the previously obtained embedding:

$$\min_{\eta'} \left| \frac{1}{n_u}\sum_{i=1}^{n_u} \langle \phi(x_i), \eta' \rangle - \frac{1}{n_p}\sum_{i=n_u+1}^{n} \langle \phi(x_i), \eta' \rangle \right|^2 \quad \text{s.t.} \quad \eta' = \sum_{i=1}^{n} \alpha'_i \phi(x_i), \;\; \|\eta'\|^2 = 1, \;\; \eta^T \eta' = 0. \qquad (9)$$

Solving (9) shows that $\alpha' = [\alpha'_1, \ldots, \alpha'_n]^T$ is the generalized eigenvector of the same eigenproblem (6) with the second smallest eigenvalue (detailed arguments are in Appendix B). By similar arguments, we can show the linear coefficients of the $k$ optimal fair embeddings $\eta_1, \ldots, \eta_k$ are the generalized eigenvectors of eigenproblem (6) with the $k$ smallest eigenvalues. After that, we obtain a $k$-dimensional explicit fair feature representation of $\phi$ in the kernel space, i.e.,

$$\phi_{FS}(x) = [\langle \phi(x), \eta_1 \rangle, \ldots, \langle \phi(x), \eta_k \rangle]^T. \qquad (10)$$

Step 2. Learn a Standard Regression Model on $\phi_{FS}(x)$

Given the explicit fair feature representation $\phi_{FS}(x)$, we learn a standard regression model based on it. Let $x_1, \ldots, x_n$ be the $n$ training instances. One can easily verify that

$$\phi_{FS}(x_i) = [\langle \phi(x_i), \eta_1 \rangle, \ldots, \langle \phi(x_i), \eta_k \rangle]^T = (K_{:i}^T A)^T, \qquad (11)$$

where $K_{:i}$ is the $i$th column of the Gram matrix $K$, and $A$ is an $n$-by-$k$ matrix whose column $j$ is the linear coefficient vector of $\eta_j$ (e.g., the first column is $\alpha$ and the second column is $\alpha'$). Then, one can obtain an $n$-by-$k$ training sample matrix

$$X_{FS} = \begin{bmatrix} (\phi_{FS}(x_1))^T \\ \vdots \\ (\phi_{FS}(x_n))^T \end{bmatrix} = \begin{bmatrix} K_{:1}^T A \\ \vdots \\ K_{:n}^T A \end{bmatrix} = K^T A = K A. \qquad (12)$$

Now, we learn a regression model $\beta \in \mathbb{R}^k$ on $X_{FS}$ by

$$\min_\beta \|X_{FS}\,\beta - Y\|^2 + \gamma \|\beta\|^2, \qquad (13)$$

where $\gamma$ is a regularization coefficient. For any testing instance $z$, we first compute its explicit fair feature representation

$$\phi_{FS}(z) = [\langle \phi(z), \eta_1 \rangle, \ldots, \langle \phi(z), \eta_k \rangle]^T, \qquad (14)$$

and then compute its prediction as

$$\hat{y} = \phi_{FS}(z)^T \beta. \qquad (15)$$

For classification tasks, one can simply threshold $\hat{y}$.

IV. EXPERIMENT
A. Data Sets
We experimented on three public data sets, namely, the Credit Default data set (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients), the Communities Crime data set (http://archive.ics.uci.edu/ml/datasets/communities+and+crime), and the COMPAS data set (https://github.com/propublica/compas-analysis).

The original Credit Default data set contains 30,000 individuals described by 23 attributes. We treated 'education level' as the sensitive variable and binarized it into higher education and lower education as in [12]; 'default payment' is treated as the binary label. We removed individuals with missing values and down-sampled the data set from 30,000 to 20,000. Our preprocessed data sets are published at https://uwyomachinelearning.github.io/.

The Communities Crime data set contains 1,993 communities described by 101 informative attributes. We treated the 'fraction of African-American residents' as the sensitive feature and binarized it so that a community is 'minority' if the fraction is above 0.5 and 'majority' otherwise. The label is the 'community crime rate', which we binarized into high if the rate is above 0.5 and low otherwise.

The COMPAS data set contains 18,317 individuals with 40 features (e.g., name, sex, race). We down-sampled the data set to 16,000 instances and 15 numerical features (e.g., name is removed). Similar to [24], we treated 'race' as the sensitive feature and 'risk of recidivism' as the binary label.
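For illustration, the binarization described above for the Communities Crime data set amounts to thresholding two columns; a small sketch (the function name is ours, and the 0.5 thresholds are the values stated above):

```python
import numpy as np

def binarize_communities(frac_black, crime_rate):
    """Binarize the sensitive feature and label as described above:
    s = 1 ('minority') if the fraction of African-American residents
    is above 0.5; y = 1 ('high') if the community crime rate is above 0.5."""
    s = (np.asarray(frac_black) > 0.5).astype(int)
    y = (np.asarray(crime_rate) > 0.5).astype(int)
    return s, y
```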
B. Experiment Design
On each data set, we randomly chose 75% of the instances for training and used the rest for testing. We evaluated each method over 50 random trials and report its average performance and standard deviation.

We compared the proposed FKR-FE with the existing fair kernel regression [13] and several other non-kernel methods. For each compared method, we set its hyper-parameters as described in the original paper.

For FKR-FE, we used a polynomial kernel on both the Credit and Communities Crime data sets and a sigmoid kernel on the COMPAS data set. For the polynomial kernel, we grid-searched its optimal degree over four candidate values and its optimal additive coefficient over five candidate values. For the sigmoid kernel, we grid-searched its optimal c among 5 values in a logarithmic range, and we used the default γ in Scikit-Learn [25] (i.e., the inverse of the feature dimension). For the ridge regression regularization coefficient λ, we grid-searched an optimal value among 6 values in a logarithmic range.

Finally, an important hyper-parameter is the number of feature embeddings k. We experimented with 4 values proportional to the training size n. In experiments, these values yielded good generalization performance on all data sets.

We evaluated model accuracy using the standard classification error (Error), and evaluated model fairness using a popular measure called statistical disparity (SD) [11], defined as

$$SD(f, s) = \left| p(f(x) = 1 \mid s = 0) - p(f(x) = 1 \mid s = 1) \right|. \qquad (16)$$

Finally, all experiments were run on the Teton Computing Environment at the University of Wyoming's Advanced Research Computing Center (https://doi.org/10.15786/M2FY47), and our FFE implementation is at https://github.com/aokray/FFE.

C. Classification Results and Discussions
Our classification results are summarized in Table I. Our first observation is that FKR-FE consistently achieves lower statistical disparity than the existing fair kernel regression method (and other baselines) across the three data sets. This implies that fair feature embedding is an effective approach for learning fair models in kernel space.

We notice the superior fairness of FKR-FE is not achieved without any cost. In general, it has slightly higher prediction error than the existing fair kernel regression and other baselines.

            |     Credit Default        |    Communities Crime      |         COMPAS
Method      |   SD          Error       |   SD          Error       |   SD          Error
FKR-FE      | .0021±.0017  .2277±.0050  | .0392±.0267  .1384±.0125  | .0025±.0018  .2307±.0057
FKRR [13]   | .0079±.0011  .2001±.0054  | .0968±.0722  .1208±.0054  | .0041±.0013  .2190±.0089
FLR [9]     | .0779±.0571  .2412±.0469  | .0898±.0971  .1166±.0189  | .0408±.0162  .2428±.0917
FRR [11]    | .0186±.0016  .2914±.0186  | .3062±.0452  .1102±.0128  | .0182±.0042  .2276±.0040
FPCA [12]   | .1716±.0149  .4025±.0382  | .0859±.0479  .1731±.0089  | .2806±.0182  .3204±.1032

TABLE I: Classification performance of different methods across different data sets. For the polynomial kernel, we set the degree to 4 and the additive coefficient to 0.1. For the sigmoid kernel, we used the grid-searched c and set γ to the inverse of the feature dimension.

However, we argue the loss of accuracy is small compared with the gain in fairness. For example, on the Credit Default data set, FKR-FE lowers prediction disparity by about 73% = (0.0079−0.0021)/0.0079 but only increases prediction error by about 14% = (0.2277−0.2001)/0.2001. We thus argue this method has a more efficient accuracy-fairness trade-off.

Finally, we see fair kernel methods generally achieve lower statistical disparity than other fair learning methods, suggesting their promise for fair machine learning.
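The SD values reported in Table I follow definition (16) and can be computed directly from binary predictions; a minimal sketch (the function name is ours):

```python
import numpy as np

def statistical_disparity(y_pred, s):
    """|p(f(x)=1 | s=0) - p(f(x)=1 | s=1)| for binary predictions
    y_pred and a binary protected attribute s, as in (16)."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())
```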
D. Sensitivity Analysis
In this section, we examine the performance of FKR-FE on the Communities Crime data set under different configurations. We first examined its performance with different choices of kernel. Results on testing samples, averaged over 50 random trials, are reported in Figure 1. We see that the polynomial kernel achieves the highest prediction fairness, with slightly higher prediction error. The sigmoid kernel is the second best, and the linear kernel does not give low disparity. This supports our hypothesis on why fair kernel methods are promising: they construct a complex hypothesis space that is more likely to include models with an efficient fairness-accuracy trade-off.

Next, we examined performance with the polynomial kernel under different k (number of feature embeddings). Results are shown in Figure 2. We see that smaller k generally leads to higher prediction fairness and slightly higher prediction error. The former phenomenon implies that only the least eigenvectors of problem (6) can effectively minimize the mean discrepancy between the two groups. The latter is easy to understand: a higher feature dimension provides more information for building an accurate prediction model. However, the variation versus k seems quite limited, suggesting our method has robust classification performance.

V. CONCLUSION
In this paper, we propose a novel fair kernel regression method, FKR-FE. It first learns a set of fair feature embeddings in the kernel space, and then standardly learns a prediction model in the embedded space. Through experiments on three real-world data sets, we show it achieves significantly lower prediction bias compared with the state-of-the-art fair kernel regression method as well as several non-kernel fair learning methods, while sacrificing only a small amount of prediction accuracy.

Fig. 1: Sensitivity analysis for varying kernels with their approximately optimal number of features selected.

Fig. 2: Performance versus k.

REFERENCES

[1] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks," ProPublica, 2016.
[2] M. Hoffman, L. B. Kahn, and D. Li, "Discretion in hiring," The Quarterly Journal of Economics, 2017.
[3] B. F. Klare, M. J. Burge, J. C. Klontz, R. W. V. Bruegge, and A. K. Jain, "Face recognition performance: Role of demographic information," IEEE Transactions on Information Forensics and Security, 2012.
[4] M. D. Cunningham and J. R. Sorensen, "Actuarial models for assessing prison violence risk: Revisions and extensions of the risk assessment scale for prison (RASP)," Assessment, 2006.
[5] Press et al., "Preparing for the future of artificial intelligence," 2016.
[6] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, "Learning fair representations," in ICML, 2013.
[7] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, "Certifying and removing disparate impact," in KDD, 2015.
[8] T. Calders, A. Karim, F. Kamiran, W. Ali, and X. Zhang, "Controlling attribute effect in linear regression," in ICDM, 2013.
[9] T. Kamishima, S. Akaho, and J. Sakuma, "Fairness-aware classifier with prejudice remover regularizer," in ECML-PKDD, 2012.
[10] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, "Fairness through awareness," in Innovations in Theoretical Computer Science Conference. ACM, 2012.
[11] D. McNamara, C. S. Ong, and R. C. Williamson, "Provably fair representations," CoRR, vol. abs/1710.04394, 2017.
[12] S. Samadi, U. Tantipongpipat, J. Morgenstern, M. Singh, and S. Vempala, "The price of fair PCA: One extra dimension," in NIPS, 2018.
[13] A. Pérez-Suay, V. Laparra, G. Mateo-García, J. Muñoz-Marí, L. Gómez-Chova, and G. Camps-Valls, "Fair kernel learning," in Joint European Conf. Machine Learning and Knowledge Discovery in Databases, 2017.
[14] M. Olfat and A. Aswani, "Convex formulations for fair principal component analysis," CoRR, vol. abs/1802.03765, 2019.
[15] B. Cao, D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, "Feature selection in a kernel space," in ICML, 2007.
[16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," JMLR, 2012.
[17] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. J. Smola, "Correcting sample selection bias by unlabeled data," in NIPS, 2007.
[18] S. J. Pan, J. T. Kwok, Q. Yang et al., "Transfer learning via dimensionality reduction," in AAAI, 2008.
[19] A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf, "Covariate shift by kernel mean matching," 2009.
[20] Y. Grandvalet and S. Canu, "Adaptive scaling for feature selection in SVMs," in NIPS, 2002.
[21] J. Tang, S. Alelyani, and H. Liu, "Feature selection for classification: A review," in Data Classification: Algorithms and Applications, 2014.
[22] L. Yang, S. Lv, and J. Wang, "Model-free variable selection in reproducing kernel Hilbert space," JMLR, 2016.
[23] B. Schölkopf, A. Smola, and K.-R. Müller, "Kernel principal component analysis," in International Conf. Artificial Neural Networks, 1997.
[24] A. Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments," Big Data, 2017.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," JMLR, 2011.
VI. APPENDIX
A. Derivation of Eigen-Problem (6)
We show how to derive (6) by solving (3) under constraints (4) and (5). Recall that $\eta = \sum_{i=1}^{n} \alpha_i \phi(x_i)$, where the $\alpha_i$'s are unknown parameters. Rewrite the objective in (3) as

$$J(\eta) = \left| \frac{1}{n_u}\sum_{i=1}^{n_u} \langle \phi(x_i), \eta \rangle - \frac{1}{n_p}\sum_{i=n_u+1}^{n} \langle \phi(x_i), \eta \rangle \right|^2$$
$$= \frac{1}{n_u^2} \left( \sum_{i=1}^{n_u} \langle \phi(x_i), \eta \rangle \right)^2 + \frac{1}{n_p^2} \left( \sum_{i=n_u+1}^{n} \langle \phi(x_i), \eta \rangle \right)^2 - \frac{2}{n_u n_p} \left( \sum_{i=1}^{n_u} \langle \phi(x_i), \eta \rangle \right) \left( \sum_{i=n_u+1}^{n} \langle \phi(x_i), \eta \rangle \right)$$
$$= \alpha^T M \alpha = J(\alpha), \qquad (17)$$

where $M$ is the symmetric matrix defined as

$$M = \frac{1}{n_u^2} K_u^T 1_u 1_u^T K_u - \frac{1}{n_u n_p}\left( K_u^T 1_u 1_p^T K_p + K_p^T 1_p 1_u^T K_u \right) + \frac{1}{n_p^2} K_p^T 1_p 1_p^T K_p, \qquad (18)$$

matrix $K_u \in \mathbb{R}^{n_u \times n}$ has $k(x_i, x_j)$ at row $i$, column $j$, and matrix $K_p \in \mathbb{R}^{n_p \times n}$ has $k(x_{n_u+i}, x_j)$ at row $i$, column $j$; $1_u \in \mathbb{R}^{n_u}$ and $1_p \in \mathbb{R}^{n_p}$ are vectors of ones, and $\alpha = [\alpha_1, \ldots, \alpha_n]^T$.

Next, it is easy to verify that under constraint (4), the unit-norm constraint (5) becomes

$$\eta^T \eta = \alpha^T K \alpha = 1, \qquad (19)$$

where $K \in \mathbb{R}^{n \times n}$ is the standard Gram matrix. Thus we need to solve

$$\min_\alpha J(\alpha) \quad \text{s.t.} \quad \alpha^T K \alpha = 1. \qquad (20)$$

The Lagrange function is

$$L(\alpha, \lambda) = J(\alpha) - \lambda (\alpha^T K \alpha - 1). \qquad (21)$$

Setting $\frac{\partial L(\alpha, \lambda)}{\partial \alpha} = 0$ gives $2M\alpha - 2\lambda K\alpha = 0$, i.e., (6).

B. Derivation of the Solution to (9)
Here, we show why the solution to (9) is also a solution to the eigenproblem (6). Let $\alpha$ be the coefficient vector of the first feature embedding $\eta$ (known), and $\alpha'$ be the coefficient vector of the second embedding $\eta'$ (unknown). The new constraint when learning $\eta'$ can be written as

$$\eta^T \eta' = \left( \sum_{i=1}^{n} \alpha_i \phi(x_i) \right)^T \left( \sum_{i=1}^{n} \alpha'_i \phi(x_i) \right) = \alpha^T K \alpha' = 0. \qquad (22)$$

Thus we need to solve

$$\min_{\alpha'} J(\alpha') \quad \text{s.t.} \quad (\alpha')^T K \alpha' = 1, \qquad (23)$$

and

$$(\alpha')^T K \alpha = 0. \qquad (24)$$

The Lagrange function is

$$L(\alpha', \lambda_1, \lambda_2) = J(\alpha') - \lambda_1 \left( (\alpha')^T K \alpha' - 1 \right) - \lambda_2\, \alpha^T K \alpha'. \qquad (25)$$

Setting $\frac{\partial L(\alpha', \lambda_1, \lambda_2)}{\partial \alpha'} = 0$ and left-multiplying both sides by $\alpha^T$ gives

$$2\,\alpha^T M \alpha' - 2\lambda_1\, \alpha^T K \alpha' - \lambda_2\, \alpha^T K \alpha = 0. \qquad (26)$$

Since $\alpha^T K \alpha' = 0$ and $\alpha^T K \alpha = 1$, we have

$$\lambda_2 = 2\,\alpha^T M \alpha' = 2\,(\alpha')^T M \alpha, \qquad (27)$$

where the second equality holds because $M$ is symmetric. Further, from (6) we know $\alpha$ is a generalized eigenvector of $M$ satisfying $M\alpha = \lambda K\alpha$. Thus (27) becomes

$$(\alpha')^T M \alpha = (\alpha')^T \lambda K \alpha = \lambda\, (\alpha')^T K \alpha = 0, \qquad (28)$$

where the last equality is due to the new constraint (22), and hence $\lambda_2 = 0$. Comparing (21) and (25) with $\lambda_2 = 0$, we see $\alpha$ and $\alpha'$ solve the same generalized eigenproblem (6); since $\alpha'$ also minimizes $J$ while being $K$-orthogonal to $\alpha$, it is the generalized eigenvector with the second smallest eigenvalue.
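As a sanity check of this derivation, one can verify numerically that the generalized eigenvectors of a rank-one symmetric $M$ against a positive definite $K$ are $K$-orthonormal and each satisfy (6), which is exactly the property used above. A small sketch with synthetic matrices (our own construction, not the paper's data):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 8
B = rng.normal(size=(n, n))
K = B @ B.T + n * np.eye(n)   # a positive definite, Gram-like matrix
d = rng.normal(size=n)
M = np.outer(d, d)            # rank-one symmetric M, as in (17)-(18)

w, V = eigh(M, K)             # solves M a = lambda K a, ascending lambda
a1, a2 = V[:, 0], V[:, 1]     # the two smallest generalized eigenvectors

# constraints (19)/(23): a^T K a = 1; constraint (24): K-orthogonality
assert abs(a1 @ K @ a1 - 1) < 1e-8
assert abs(a1 @ K @ a2) < 1e-8
# both satisfy the same eigenproblem (6): M a = lambda K a
assert np.allclose(M @ a1, w[0] * (K @ a1), atol=1e-8)
assert np.allclose(M @ a2, w[1] * (K @ a2), atol=1e-8)
```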