Cross-Lingual Adaptation using Structural Correspondence Learning
PETER PRETTENHOFER, Bauhaus-Universität Weimar, and BENNO STEIN, Bauhaus-Universität Weimar
Cross-lingual adaptation, a special case of domain adaptation, refers to the transfer of classification knowledge between two languages. In this article we describe an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation. The proposed method uses unlabeled documents from both languages, along with a word translation oracle, to induce cross-lingual feature correspondences. From these correspondences a cross-lingual representation is created that enables the transfer of classification knowledge from the source to the target language. The main advantages of this approach over other approaches are its resource efficiency and task specificity.

We conduct experiments in the area of cross-language topic and sentiment classification involving English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed method over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 30% (topic classification) and 59% (sentiment classification). We further report on empirical analyses that reveal insights into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and the nature of the induced cross-lingual correspondences.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—information filtering; I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis

General Terms: Cross-language text classification, cross-lingual adaptation

Additional Key Words and Phrases: Structural Correspondence Learning, cross-language sentiment analysis
1. INTRODUCTION
Over the past two decades supervised machine learning methods have been successfully applied to many problems in natural language processing (e.g., named entity recognition, relation extraction, sentiment analysis) and information retrieval (e.g., text classification, information filtering). These methods, however, rely on large, annotated training corpora, whose acquisition is time-consuming, costly, and inherently language-specific. As a consequence most of the available training corpora are in English only. Since an ever increasing fraction of the textual content available in digital form is written in languages other than English (this is especially the case for the World Wide Web, where from 2000 to 2009 the content available in Chinese grew more than four times as much as the content available in English), this limits the widespread application of state-of-the-art techniques from natural language processing (NLP) and information retrieval (IR). Technology for cross-lingual adaptation aims to overcome this problem by transferring the knowledge encoded within annotated (= labeled) data written in a source language to create a classifier for a different target language. Cross-lingual adaptation can thus be viewed as a special case of domain adaptation, where each language acts as a separate domain.

In contrast to "classical" domain adaptation, cross-lingual adaptation is characterized by the fact that the two domains, i.e., the languages, have non-overlapping feature spaces, which has both theoretical and practical implications for domain adaptation.

arXiv preprint [cs.IR]. Author's address: P. Prettenhofer, Bauhaus-Universität Weimar, 99421 Weimar, Germany.
In classical domain adaptation—as well as in related problems such as covariate shift—the factor of overlapping feature spaces is implicitly presumed by the following or similar assumptions: (1) generalizable features, i.e., features which behave similarly in both domains, exist [Jiang and Zhai 2007; Blitzer et al. 2006; Daume 2007], or (2) the support of the test data distribution is contained in the support of the training data distribution [Bickel et al. 2009]. If, on the other hand, the feature sets are non-overlapping, one needs external knowledge to link features of the source domain and the target domain [Dai et al. 2008].

This article extends the work of Prettenhofer and Stein [2010] and presents an approach for cross-lingual adaptation in the context of text classification: Cross-Language Structural Correspondence Learning (CL-SCL). CL-SCL uses unlabeled data from both languages along with external domain knowledge in the form of a word translation oracle to induce cross-lingual word correspondences. The approach is based on Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation in natural language processing.

Similar to SCL, CL-SCL induces correspondences among the words from both languages using a small number of so-called pivots. In CL-SCL, a pivot is a pair of words, {w_S, w_T}, from the source language S and the target language T, which possess a similar semantics. Testing the occurrence of w_S or w_T in a set of unlabeled documents from S and T yields two equivalence classes across these languages: one class contains the documents where either w_S or w_T occurs, the other class contains the documents where neither w_S nor w_T occurs.
Ideally, a pivot splits the set of unlabeled documents with respect to the semantics that is associated with {w_S, w_T}. The correlation between w_S or w_T and other words w ∉ {w_S, w_T} is modeled by a linear classifier, which then is used as a language-independent predictor for the two equivalence classes. A small number of pivots can capture a sufficiently large part of the correspondences between S and T in order to (1) construct a cross-lingual representation and (2) learn a classifier that operates on this representation. Several advantages follow from this approach:

—Task specificity. The approach exploits the words' pragmatics since it considers—during the pivot selection step—task-specific characteristics of language use.

—Efficiency in terms of linguistic resources. The approach uses unlabeled documents from both languages along with a small number (100 - 500) of translated words, instead of employing a parallel corpus or an extensive bilingual dictionary.

—Efficiency in terms of computing resources. The approach solves the classification problem directly, instead of resorting to a more general and potentially much harder problem such as machine translation.

[Figure 1: taxonomy tree of transfer learning, branching by "different tasks" vs. "different domains", with problem classes Sample Selection Bias, Covariate Shift, Domain Adaptation (different distributions: P_S(x) ≠ P_T(x), M_S = M_T) and Cross-lingual Adaptation (different feature sets: M_S ∩ M_T = ∅).] Fig. 1. A taxonomy of transfer learning settings, organized along the dimensions "domain" and "task". The domain adaptation branch is unfolded.

The article is organized as follows: Section 2 discusses cross-lingual adaptation in the context of related work including domain adaptation and dataset shift. Section 3 introduces the problem of cross-language text classification, a special case of domain adaptation. Section 4 describes Cross-Language Structural Correspondence Learning and proposes a new regularization schema for the pivot predictors. Section 5 reports on the design and the results of experiments in the area of cross-language sentiment and topic classification. Finally, Section 6 concludes our work.
2. RELATED WORK
The idea to transfer knowledge from a source learning setting S to a different target learning setting T is an active field of research [Pan and Yang 2009], and Figure 1 organizes well-known problems within a taxonomy. The taxonomy combines the two most important determinants within a learning setting, namely, the domain and the task. A domain is defined by (1) a set of features M, (2) a space of possible feature vector realizations x, which typically is R^|M|, and (3) a probability distribution P(x) over the space of possible feature vector realizations. A task specifies a set of labels corresponding to classes, typically {+1, −1}, along with a conditional distribution P(y | x), with y ∈ {+1, −1}. Alternatively, a task can be specified by a sample {(x, y) | x ∈ R^|M|, y ∈ {+1, −1}}. In Figure 1 the domain adaptation branch is unfolded since it is the focus of this article. The branch "different distributions" addresses problems where the feature sets are unchanged; without loss of generality P_S(x) ≠ P_T(x) can also be presumed for problems in the branch "different feature sets". If clear without ambiguity we use x or y to denote both a realization and a random variable.
Domain adaptation refers to the problem of adapting a statistical classifier trained on data from one (or more) source domains to a different target domain. In the basic domain adaptation setting we are given labeled data from a source domain S and unlabeled data from the target domain T, and the goal is to train a classifier for the target domain. Beyond this setting one can further distinguish whether a small amount of labeled data from the target domain is available [Daume 2007; Finkel and Manning 2009] or not [Blitzer et al. 2006; Jiang and Zhai 2007]. The latter setting is referred to as unsupervised domain adaptation.

Blitzer et al. [2006] propose an effective algorithm for unsupervised domain adaptation, called Structural Correspondence Learning. Within a first step SCL identifies features that generalize across domains, which the authors call pivots. SCL then models the correlation between the pivots and all other features by training linear classifiers on the unlabeled data from both domains. This information is used to induce correspondences among features from the different domains and to learn a shared representation that is meaningful across both domains. SCL is related to the structural learning paradigm introduced by Ando and Zhang [2005a]. The basic idea of structural learning is to constrain the hypothesis space of a learning task by considering multiple different but related tasks on the same input space. Ando and Zhang [2005b] present a semi-supervised learning method, Alternating Structural Optimization (ASO), based on this paradigm, which generates related tasks from unlabeled data. They show that ASO delivers state-of-the-art performance for a variety of natural language processing tasks including named entity recognition and syntactic chunking. Quattoni et al. [2007] apply structural learning to image classification in settings where little labeled data is given.
Traditional machine learning assumes that both training and test examples are drawn from identical distributions. In practice this assumption is often violated, for instance due to the irreproducibility of the test conditions within the training phase. Dataset shift refers to the general problem when the joint distribution of inputs and outputs differs between training phase and test phase. The difference between dataset shift and domain adaptation is subtle; in fact, both refer to the same underlying problem but emerge from the viewpoints of different research communities. Dataset shift is coined by the machine learning community and builds on prior work in statistics, in particular the work on covariate shift [Shimodaira 2000] and sample selection bias [Cortes et al. 2008]. In contrast, domain adaptation originates from the natural language processing community. Most of the early work on domain adaptation focuses on the question of how to leverage "out-domain data" (= data associated with S) effectively to learn a classifier when only little or no labeled "in-domain data" (= data associated with T) is available. The latter emphasizes the relationship to semi-supervised learning—with the crucial difference that labeled and unlabeled data stem from different distributions. Covariate shift can be considered as a certain case of dataset shift which is closely related to unsupervised domain adaptation. It is characterized by the fact that the class conditional distribution is equal in training phase and test phase, i.e., P_S(y | x) = P_T(y | x), while the marginal distribution of the inputs (covariates) differs, i.e., P_S(x) ≠ P_T(x). A broad discussion of dataset shift is beyond the scope of this article; the interested reader is referred to [Quionero-Candela et al. 2009].
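To make the covariate-shift condition concrete, the following toy simulation (our illustration, not from the article; all distributions are made up) draws source and target samples that share the conditional P(y | x) while differing in the marginal P(x):

```python
import random

random.seed(0)

def p_y_given_x(x):
    """Shared class conditional: P(y = +1 | x), identical in S and T."""
    return 0.9 if x > 0 else 0.1

def sample(p_x_pos, n):
    """Draw n pairs (x, y); p_x_pos controls the marginal P(x = +1)."""
    data = []
    for _ in range(n):
        x = 1 if random.random() < p_x_pos else -1
        y = 1 if random.random() < p_y_given_x(x) else -1
        data.append((x, y))
    return data

source = sample(0.8, 20000)   # P_S(x = +1) = 0.8
target = sample(0.2, 20000)   # P_T(x = +1) = 0.2, so P_S(x) != P_T(x)

def empirical_p_pos(data):
    """Empirical estimate of P(y = +1 | x = +1)."""
    ys = [y for x, y in data if x == 1]
    return sum(1 for y in ys if y == 1) / len(ys)
```

Both empirical estimates of P(y = +1 | x = +1) come out near 0.9, while the input marginals clearly differ: exactly the covariate-shift situation described above.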
Analogous to domain adaptation, cross-lingual adaptation refers to the problem of adapting a statistical classifier trained on data from a source language S to a different target language T. Examples include the adaptation of a named-entity recognizer, a syntactic parser, or a relation extractor. The major characteristic of cross-lingual adaptation is the fact that the two "domains" have non-overlapping feature sets, i.e., M_S ∩ M_T = ∅. While cross-lingual adaptation has not received a lot of attention in the natural language processing community, a special case of cross-lingual adaptation has received a lot of attention recently: cross-language text classification, which is also the focus of this article.

Bel et al. [2003] belong to the first who explicitly considered the problem of cross-language text classification. Their research, however, is predated by work in cross-language information retrieval, CLIR, where similar problems are addressed [Oard 1998]. Traditional approaches to cross-language text classification and CLIR use linguistic resources such as bilingual dictionaries or parallel corpora to induce correspondences between two languages [Lavrenko et al. 2002; Olsson et al. 2005]. Dumais et al. [1997] is considered as seminal work in CLIR: they propose a method which induces semantic correspondences between two languages by performing latent semantic analysis, LSA, on a parallel corpus. Li and Taylor [2007] improve upon this method by employing kernel canonical correlation analysis, CCA, instead of LSA. The major limitation of these approaches is their computational complexity and, in particular, the dependence on a parallel corpus, which is hard to obtain—especially for less resource-rich languages. Gliozzo and Strapparava [2005] circumvent the dependence on a parallel corpus by using so-called multilingual domain models, which can be acquired from comparable corpora in an unsupervised manner.
In [Gliozzo and Strapparava 2006] they show for particular tasks that their approach can achieve a performance close to that of monolingual text classification.

Recent work in cross-language text classification focuses on the use of automatic machine translation technology. Most of these methods involve two steps: (1) translation of the documents into the source or the target language, and (2) dimensionality reduction or semi-supervised learning to reduce the noise introduced by the machine translation. Methods which follow this two-step approach include the EM-based approach by Rigutini et al. [2005], the CCA approach by Fortuna and Shawe-Taylor [2005], the information bottleneck approach by Ling et al. [2008], and the co-training approach by Wan [2009].
3. CROSS-LANGUAGE TEXT CLASSIFICATION
In standard text classification, a document d is represented under the bag-of-words model as a |V|-dimensional feature vector x ∈ X, where V, the vocabulary, denotes an ordered set of words, x_i ∈ x denotes the normalized frequency of word i in d, and X is an inner product space. D_S denotes the training set and comprises tuples of the form (x, y), which associate a feature vector x ∈ X with a class label y ∈ Y. For simplicity but without loss of generality we assume binary classification problems, Y = {+1, −1}. The goal is to find a classifier f : X → Y that predicts the labels of new, previously unseen documents. In the following, we restrict ourselves to linear classifiers:

    f(x) = sign(w^T x),    (1)

where w is a weight vector that parameterizes the classifier and [·]^T denotes the matrix transpose. The computation of w from D_S is referred to as model estimation or training. A common choice for w is given by a vector w* that minimizes the regularized training error:

    w* = argmin_{w ∈ R^|V|}  Σ_{(x,y) ∈ D_S} L(y, w^T x) + λR(w).    (2)

L is a loss function that measures the quality of the classifier, R is a regularization term that penalizes model complexity, and λ is a non-negative hyperparameter that models the tradeoff between classification performance and model complexity. A common choice for R is L2 regularization, which imposes an L2-norm penalty on w, R(w) = ‖w‖² = w^T w. Different choices for L entail different classifier types; e.g., when choosing the hinge loss function one obtains the popular Support Vector Machine classifier [Zhang 2004].

Standard text classification distinguishes between labeled (training) documents and unlabeled (test) documents. Cross-language text classification poses an extra constraint in that training documents and test documents are written in different languages.
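As an illustration of Equations (1) and (2), the following sketch (our addition; the two-dimensional toy data and all hyperparameters are made up) minimizes the hinge-loss variant of Equation (2) with stochastic subgradient steps and a PEGASOS-style learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated classes in R^2 (stand-in for bag-of-words vectors)
X = np.vstack([rng.normal(+1.0, 0.3, (50, 2)),
               rng.normal(-1.0, 0.3, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

lam = 0.01                       # regularization parameter lambda
w = np.zeros(2)                  # weight vector of Eq. (2)
for t in range(1, 2001):
    i = rng.integers(len(y))
    eta = 1.0 / (lam * t)        # PEGASOS-style decreasing learning rate
    # Subgradient step for hinge loss max(0, 1 - y w^T x) plus the L2 term
    if y[i] * (w @ X[i]) < 1.0:
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
    else:
        w = (1 - eta * lam) * w

pred = np.sign(X @ w)            # f(x) = sign(w^T x), Eq. (1)
accuracy = (pred == y).mean()
```

On this separable toy problem the learned w classifies the training points almost perfectly; the same objective, with a different loss, is what the experiments in Section 5 optimize.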
Here, the language of the training documents is referred to as source language S, and the language of the test documents is referred to as target language T. The vocabulary V divides into V_S and V_T, called vocabulary of the source language and vocabulary of the target language, with V_S ∩ V_T = ∅. I.e., documents from the training set and the test set map onto non-overlapping regions of the feature space. Thus, a linear classifier trained on D_S associates non-zero weights only with words from V_S, which in turn means that it cannot be used to classify documents written in T.

One way to overcome this "feature barrier" is to find a cross-lingual representation for documents written in S and T, which enables the transfer of classification knowledge between the two languages. Intuitively, one can understand such a cross-lingual representation as a concept space that underlies both languages. In the following, we will use θ to denote a map that associates the original |V|-dimensional representation of a document d written in S or T with its cross-lingual representation. Once such a mapping is found, the cross-language text classification problem reduces to a standard classification problem in the cross-lingual space. Note that the existing methods for cross-language text classification can be characterized by the way θ is constructed. For instance, cross-language latent semantic indexing [Dumais et al. 1997] and cross-language explicit semantic analysis [Potthast et al. 2008] estimate θ using a parallel corpus. Other methods use linguistic resources such as a bilingual dictionary to obtain θ [Bel et al. 2003; Olsson et al. 2005; Wu et al. 2008].
4. CROSS-LANGUAGE STRUCTURAL CORRESPONDENCE LEARNING
We now present a method for learning a map θ by exploiting relations from unlabeled documents written in S and T. The proposed method, which we call cross-language structural correspondence learning, CL-SCL, addresses the following learning setup (see also Figure 2):

(1) Given a set of labeled training documents D_S written in language S, the goal is to create a text classifier for documents written in a different language T. We refer to this classification task as the target task. An example for the target task is the determination of sentiment polarity, either positive or negative, of book reviews written in German (T) given a set of training reviews written in English (S).

(2) In addition to the labeled training documents D_S we have access to unlabeled documents D_{S,u} and D_{T,u} from both languages S and T. Let D_u denote D_{S,u} ∪ D_{T,u}.

(3) Finally, we are given a budget of calls to a word translation oracle (e.g., a domain expert) to map words in the source vocabulary V_S to their corresponding translations in the target vocabulary V_T. For simplicity and without loss of applicability we assume here that the word translation oracle maps each word in V_S to exactly one word in V_T.

CL-SCL comprises three steps: In the first step, CL-SCL selects word pairs {w_S, w_T}, called pivots, where w_S ∈ V_S and w_T ∈ V_T. Pivots have to satisfy the following conditions:

Confidence. Both words, w_S and w_T, are predictive for the target task.

Support. Both words, w_S and w_T, occur frequently in D_{S,u} and D_{T,u}, respectively.

The confidence condition ensures that, in the second step of CL-SCL, only those correlations are modeled that are useful for discriminative learning. The support condition, on the other hand, ensures that these correlations can be estimated accurately. Considering our sentiment classification example, the word pair {excellent_S, exzellent_T} satisfies both conditions: (1) the words are strong indicators of positive sentiment, and (2) the words occur frequently in book reviews from both languages. Note that the support of w_S and w_T can be determined from the unlabeled data D_u. The confidence, however, can only be determined for w_S since the setting gives us access to labeled data from S only.

We use the following heuristic to form an ordered set P of pivots: First, we choose a subset V_P from the source vocabulary V_S, |V_P| ≪ |V_S|, which contains those words with the highest mutual information with respect to the class label of the target task in D_S. Second, for each word w_S ∈ V_P we find its translation in the target vocabulary V_T by querying the translation oracle; we refer to the resulting set of word pairs as the candidate pivots, P′:

    P′ = {{w_S, translate(w_S)} | w_S ∈ V_P}.

[Figure 2: document-word matrix over words in V_S and V_T with class labels and feature vectors x = (x_1, ..., x_|V|); legend: term frequencies, positive class label, negative class label, no value.] Fig. 2. The document sets underlying CL-SCL. The subscripts S, T, and u designate "source language", "target language", and "unlabeled".

We then enforce the support condition by eliminating in P′ all candidate pivots {w_S, w_T} where the document frequency of w_S in D_{S,u} or of w_T in D_{T,u} is smaller than some threshold φ:

    P = candidateElimination(P′, φ).

Let m denote |P|, the number of pivots.

In the second step, CL-SCL models the correlations between each pivot {w_S, w_T} ∈ P and all other words w ∈ V \ {w_S, w_T}. This is done by training linear classifiers that predict whether or not w_S or w_T occur in a document, based on the other words. For this purpose a training set D_l is created for each pivot p_l ∈ P:

    D_l = {(mask(x, p_l), in(x, p_l)) | x ∈ D_u}.

mask(x, p_l) is a function that returns a copy of x where the components associated with the two words in p_l are set to zero—which is equivalent to removing these words from the feature space. in(x, p_l) returns +1 if one of the components of x associated with the words in p_l is non-zero and −1 otherwise. For each D_l a linear classifier, characterized by the parameter vector w_l, is trained by minimizing Equation (2) on D_l. Note that each training set D_l contains documents from both languages. Thus, for a pivot p_l = {w_S, w_T} the vector w_l captures both the correlation between w_S and V_S \ {w_S} and the correlation between w_T and V_T \ {w_T}.

In the third step, CL-SCL identifies correlations across pivots by computing the singular value decomposition of the |V| × m-dimensional parameter matrix W, W = [w_1 ... w_m]:

    UΣV^T = SVD(W).

Recall that W encodes the correlation structure between pivot and non-pivot words in the form of multiple linear classifiers. Thus, the columns of U identify common substructures among these classifiers.
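Returning to the first step, the pivot selection heuristic can be sketched as follows (our illustration: the mini-corpus, the oracle dictionary, and all helper names are made up; the mutual-information ranking of V_P is taken as given):

```python
# Hypothetical one-to-one translation oracle (budget-limited in the article)
ORACLE = {"excellent": "exzellent", "boring": "langweilig"}

def document_frequency(word, docs):
    """Number of documents (given as token sets) containing the word."""
    return sum(1 for d in docs if word in d)

def select_pivots(ranked_vp, docs_su, docs_tu, phi):
    """Build candidate pivots P' via the oracle, then enforce the support
    condition: keep {w_S, w_T} only if both document frequencies >= phi."""
    candidates = [(w, ORACLE[w]) for w in ranked_vp if w in ORACLE]
    return [(ws, wt) for ws, wt in candidates
            if document_frequency(ws, docs_su) >= phi
            and document_frequency(wt, docs_tu) >= phi]

docs_su = [{"excellent", "book"}, {"boring", "plot"}, {"excellent", "story"}]
docs_tu = [{"exzellent", "buch"}, {"exzellent", "handlung"}]
pivots = select_pivots(["excellent", "boring"], docs_su, docs_tu, phi=2)
# "boring"/"langweilig" is eliminated: its source document frequency is below phi
```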
Choosing the columns of U associated with the largest singular values yields those substructures that capture most of the correlation in W. We define θ as those columns of U that are associated with the k largest singular values:

    θ = U^T[1:k, 1:|V|].

Algorithm 1. CL-SCL

Input: Labeled source data D_S; unlabeled data D_u = D_{S,u} ∪ D_{T,u}
Parameters: m, k, λ, and φ
Output: k × |V|-dimensional matrix θ

selectPivots(D_S, m):
    V_P = mutualInformation(D_S)
    P′ = {{w_S, translate(w_S)} | w_S ∈ V_P}
    P = candidateElimination(P′, φ)

trainPivotPredictors(D_u, P):
    for l = 1 to m do
        D_l = {(mask(x, p_l), in(x, p_l)) | x ∈ D_u}
        w_l = argmin_{w ∈ R^|V|} Σ_{(x,y) ∈ D_l} L(y, w^T x) + λR(w)
    end for
    W = [w_1 ... w_m]

computeSVD(W, k):
    UΣV^T = SVD(W)
    θ = U^T[1:k, 1:|V|]

output {θ}

Algorithm 1 summarizes the three steps of CL-SCL. At training and test time, we apply the projection θ to each input instance x. The vector v* that minimizes the regularized training error for D_S in the projected space is defined as follows:

    v* = argmin_{v ∈ R^k}  Σ_{(x,y) ∈ D_S} L(y, v^T θx) + λR(v).    (3)

The resulting classifier, which will operate in the cross-lingual setting, is defined as follows:

    f(x) = sign(v*^T θx).

An alternative view of cross-language structural correspondence learning is provided by the framework of structural learning [Ando and Zhang 2005a]. The basic idea of structural learning is to constrain the hypothesis space, i.e., the space of possible weight vectors, of the target task by considering multiple different but related prediction tasks. In our context these auxiliary tasks are represented by the pivot predictors, i.e., the columns of W. Each column vector w_l can be considered as a linear classifier which performs well in both languages. Thus, we can regard the column space of W as an approximation to the subspace of bilingual classifiers.

[Figure 3 here.] Fig. 3. Illustration of the subspace constraint for |V| = 3 and k = 2. The plane spanned by θ_1 and θ_2 shows the subspace induced by the two left singular vectors of W = [w_1 w_2 w_3] associated with the largest singular values.
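The core of Algorithm 1 (steps 2 and 3, followed by training in the projected space per Equation (3)) can be sketched as follows. This is our simplification: ridge regression stands in for the regularized linear classifiers of Equation (2), and the term matrices and labels are random toys:

```python
import numpy as np

rng = np.random.default_rng(0)

def cl_scl_theta(X, pivots, k, lam=1.0):
    """Steps 2-3 of Algorithm 1. X: |D_u| x |V| term matrix of unlabeled
    documents from both languages; pivots: (i, j) vocabulary index pairs.
    Ridge regression replaces the classifiers of Eq. (2) for brevity."""
    n, V = X.shape
    W = np.zeros((V, len(pivots)))
    for l, (i, j) in enumerate(pivots):
        labels = np.where((X[:, i] != 0) | (X[:, j] != 0), 1.0, -1.0)  # in(x, p_l)
        Xm = X.copy()
        Xm[:, [i, j]] = 0.0                                            # mask(x, p_l)
        W[:, l] = np.linalg.solve(Xm.T @ Xm + lam * np.eye(V), Xm.T @ labels)
    U, _, _ = np.linalg.svd(W, full_matrices=False)    # U Sigma V^T = SVD(W)
    return U[:, :k].T                                  # theta: k x |V|

# Toy sparse "term frequencies" for unlabeled documents from both languages
X_u = (rng.random((200, 30)) < 0.1).astype(float)
theta = cl_scl_theta(X_u, pivots=[(0, 15), (1, 16), (2, 17)], k=2)

# Eq. (3): train v* in the projected space (ridge stand-in, toy labels)
X_S = (rng.random((100, 30)) < 0.1).astype(float)
y_S = np.where(X_S[:, 0] > 0, 1.0, -1.0)         # toy source labels
Z = X_S @ theta.T                                 # theta x for each labeled x
v = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(2), Z.T @ y_S)

def f(x):
    """Final cross-lingual classifier f(x) = sign(v*^T theta x)."""
    return np.sign(v @ (theta @ x))
```

Because θ consists of left singular vectors, its rows are orthonormal; any document from either language is projected into the same k-dimensional space before classification.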
By computing SVD(W) one obtains a compact representation of this column space in the form of an orthonormal basis θ^T. The subspace is used to constrain the learning of the target task by restricting the weight vector w to lie in the subspace defined by θ^T, w = θ^T v. Following Ando and Zhang [2005a] and Quattoni et al. [2007] we choose w for the target task to be w* = θ^T v*, where v* is defined as follows:

    v* = argmin_{v ∈ R^k}  Σ_{(x,y) ∈ D_S} L(y, (θ^T v)^T x) + λR(v).    (4)

Since (θ^T v)^T = v^T θ it follows that this view of CL-SCL corresponds to the induction of a new feature space given by Equation (3). Figure 3 illustrates the basic idea of the subspace constraint for |V| = 3 and k = 2.

While the second step of CL-SCL involves the training of a fairly large number of linear classifiers, these classifiers can be learned very efficiently due to (1) efficient learning algorithms for linear classifiers [Shwartz et al. 2007] and (2) the fact that learning the pivot classifiers is an embarrassingly parallel problem. The computational bottleneck of the CL-SCL procedure is the SVD of the dense parameter matrix W. In order to make the computation tractable, Ando and Zhang [2005a] as well as Blitzer et al. [2007] propose to set negative entries in W to zero, in order to obtain a sparse matrix for which the SVD can be computed more efficiently [Berry 1992]. As a rationale for this step the authors claim that the involved features "yield much less specific information" on the target concept, while "positive weights are usually directly related to the target concept" [Ando and Zhang 2005a].

We propose a different strategy to obtain a sparse parameter matrix W, namely to enforce sparse pivot classifiers w_l by employing a proper regularization term R in the second step of CL-SCL.
A straightforward solution is to use L1 regularization [Tibshirani 1996], which imposes an L1-norm penalty on w, R(w) = ‖w‖_1 = Σ_{j=1}^{|V|} |w_j|. This strategy recently gained much attention in the natural language processing community; Gao et al. [2007] show that L1 regularized models have similar predictive power to L2 regularized models while being much smaller at the same time—i.e., fewer parameters are non-zero.

L1 regularization, however, has properties which are inadequate in the context of SCL, in particular its handling of highly correlated features. Zou and Hastie [2005] show that if there is a subset of features among which the pairwise correlations are high, L1 regularization tends to select only one feature while pushing the other feature weights to zero. This is certainly not desirable for SCL since it relies on the proper modeling of correlations in order to induce correspondences among features. L2 regularization, by contrast, exhibits such a grouping behavior, resulting in equal weights for correlated features. The Elastic Net combines both properties, the grouping behavior of L2 regularization and the sparsity property of L1 regularization [Zou and Hastie 2005]. It is given by the convex combination of both norms:

    R(w) = α ‖w‖_1 + (1 − α) ‖w‖²,    (5)

where α ∈ [0, 1] models the trade-off between grouping and sparsity. The Elastic Net is widely used in bioinformatics, in particular the study of gene expression; however, its use for applications in natural language processing or information retrieval has not been studied yet.
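The Elastic Net penalty of Equation (5) is easy to write down directly; we also include its proximal (soft-thresholding) step, a common ingredient when optimizing such penalties (the proximal operator is our illustrative addition; the article itself uses the cumulative-penalty method of Tsuruoka et al. [2009]):

```python
import numpy as np

def elastic_net_penalty(w, alpha):
    """R(w) = alpha * ||w||_1 + (1 - alpha) * ||w||_2^2, Eq. (5)."""
    return alpha * np.abs(w).sum() + (1.0 - alpha) * (w @ w)

def prox_elastic_net(w, eta, alpha):
    """Proximal step for eta * R: soft-threshold (L1 part), then shrink
    (L2 part). With alpha = 1 this reduces to plain L1 soft-thresholding;
    with alpha = 0 to plain L2 shrinkage. Correlated features keep
    similar (shrunken) weights instead of one being zeroed arbitrarily."""
    soft = np.sign(w) * np.maximum(np.abs(w) - eta * alpha, 0.0)
    return soft / (1.0 + 2.0 * eta * (1.0 - alpha))
```

The soft-thresholding part produces exact zeros (sparsity), while the shrinkage part scales surviving weights uniformly (grouping), mirroring the two terms of Equation (5).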
5. EXPERIMENTS
We evaluate CL-SCL for the task of cross-language sentiment and topic classification using English as source language and German, French, and Japanese as target languages. We first describe the experimental design and give implementation details; we then present the evaluation results and, finally, we report on detailed analyses with respect to the nature of the induced cross-lingual correspondences, the use of unlabeled data, and important hyperparameters including the impact of different regularization methods.
We use the cross-lingual sentiment dataset provided by Prettenhofer and Stein [2010]. The dataset contains Amazon product reviews for the three product categories books, dvds, and music in the languages English, German, French, and Japanese. Each document is labeled according to its sentiment polarity as either positive or negative. The documents in the dataset are organized by language and product category. For each language-category pair there are three balanced disjoint sets of training, test, and unlabeled documents; the respective set sizes are 2,000, 2,000, and 9,000-50,000. Similar to Prettenhofer and Stein [2010], each document d is represented as a normalized (unit length) feature vector x under a unigram bag-of-words model. Based on this dataset we create two tasks (see Table I for summary statistics):

Sentiment Classification Task.
For the task of cross-language sentiment classification the original partitioning of the cross-lingual sentiment dataset is used. Analogous to Prettenhofer and Stein [2010], English is employed as source language, and German, French, and Japanese as target languages. For each of the nine target-language-category combinations a sentiment classification task is created by taking the training set and the unlabeled set for some product category from S and the test set and the unlabeled set for the same product category from T.

Topic Classification Task (Product Categories).
For the task of cross-language topic classification we discard the original sentiment labels and use the product category, i.e., books, dvd, or music, as the document label. Again we use English as the source language and German, French, and Japanese as target languages. Note that, in contrast to the sentiment classification tasks, classifying reviews according to product categories is a multi-class classification problem with three mutually exclusive classes. Hence for each of the three target languages a cross-language topic classification task is created by combining the training set and the unlabeled set of each product category from S with the test set and the unlabeled set of each product category from T. For each of the three tasks we have 6,000 training and 6,000 test documents, each containing a balanced number of examples.

Within all experiments we employ linear classifiers, which are trained by minimizing Equation (2) using a stochastic gradient descent (SGD) algorithm. In particular, we use the plain SGD algorithm as described by Zhang [2004] while adopting the learning rate schedule from PEGASOS [Shwartz et al. 2007]. Analogous to Blitzer et al. [2007] and Ando and Zhang [2005a] we employ as loss function L the modified Huber loss [Zhang 2004], a smoothed version of the hinge loss:

    L(y, p) = { max(0, 1 - py)^2   if py >= -1
              { -4py               otherwise.          (6)

SGD and related methods based on stochastic approximation have been successfully applied to solve large-scale linear prediction problems in natural language

Table I. Dataset statistics.
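The loss in Equation (6) is straightforward to implement; the following sketch (plain Python, function name ours) makes the two branches and their smooth join at py = -1 explicit:

```python
def modified_huber_loss(y, p):
    """Modified Huber loss (Zhang 2004): a smoothed version of the hinge loss.

    y is the true label in {-1, +1}, p the raw prediction score.
    The loss is quadratic in the margin py on [-1, inf) and linear
    below -1; both branches meet with value 4 at py = -1.
    """
    z = p * y
    if z >= -1.0:
        return max(0.0, 1.0 - z) ** 2
    return -4.0 * z
```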
    T         Category   |D_S,u|   |D_T,u|   |D_S|   |D_T|   |V_S|    |V_T|
    German    books       50,000    50,000   2,000   2,000   64,682   108,573
    German    dvd         30,000    50,000   2,000   2,000   52,822   103,862
    German    music       25,000    50,000   2,000   2,000   41,306    99,287
    French    books       50,000    32,000   2,000   2,000   64,682    55,016
    French    dvd         30,000     9,000   2,000   2,000   52,822    29,519
    French    music       25,000    16,000   2,000   2,000   41,306    42,097
    Japanese  books       50,000    50,000   2,000   2,000   64,682    52,311
    Japanese  dvd         30,000    50,000   2,000   2,000   52,822    54,533
    Japanese  music       25,000    50,000   2,000   2,000   41,306    54,463
    German    -           60,000    60,000   6,000   6,000   76,629   124,529
    French    -           60,000    45,000   6,000   6,000   76,629    74,807
    Japanese  -           60,000    60,000   6,000   6,000   76,629    64,050

Summary statistics for the nine cross-language sentiment classification tasks (first nine rows) and the three cross-language topic classification tasks (last three rows). |D_S,u| and |D_T,u| give the number of unlabeled documents from S and T; |D_S| and |D_T| give the number of training and test documents. All document sets are balanced.

processing and information retrieval [Zhang 2004; Shwartz et al. 2007]. Their major advantages are efficiency and ease of implementation.

SGD, however, cannot be applied directly in connection with L1 regularization (and thus the Elastic Net) due to the fluctuations of the approximated gradients. To overcome this problem different solutions have been proposed, in particular methods based on truncated gradients [Langford et al. 2009; Tsuruoka et al. 2009] and projected gradients [Duchi et al. 2008]. In our experiments we employ the truncated stochastic gradient algorithm proposed by Tsuruoka et al. [2009], which uses the cumulative L1 penalty to smooth out fluctuations in the approximated gradients. Note that Elastic Net regularization is applied to the pivot classifiers only; all other classifiers are trained using L2 regularization.

SGD receives two hyperparameters as input: the number of iterations T, and the regularization parameter λ.
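The cumulative L1 penalty of Tsuruoka et al. [2009] can be sketched as follows (squared loss and all hyperparameter values are illustrative, not the paper's settings): after each plain gradient step, every weight is clipped toward zero by at most the total L1 penalty it could have received so far minus the penalty it has already absorbed, which smooths out the gradient fluctuations.

```python
import numpy as np

def sgd_cumulative_l1(X, y, lam=1e-4, eta0=0.1, epochs=5):
    """SGD with the cumulative L1 penalty (truncated gradient variant).

    u tracks the total L1 penalty each weight could have received;
    q[j] tracks the penalty weight j has actually received. Each weight
    is clipped toward zero by at most the remaining budget, never
    crossing zero. Squared loss is used here for brevity.
    """
    n, d = X.shape
    w = np.zeros(d)
    q = np.zeros(d)   # cumulative penalty actually applied per weight
    u = 0.0           # cumulative penalty each weight could receive
    for epoch in range(epochs):
        eta = eta0 / (1.0 + epoch)   # simple decaying learning rate
        for i in range(n):
            grad = (np.dot(w, X[i]) - y[i]) * X[i]   # squared-loss gradient
            w -= eta * grad
            u += eta * lam
            for j in range(d):
                z = w[j]
                if w[j] > 0.0:
                    w[j] = max(0.0, w[j] - (u + q[j]))
                elif w[j] < 0.0:
                    w[j] = min(0.0, w[j] + (u - q[j]))
                q[j] += w[j] - z
    return w
```

With a tiny λ the fit is essentially unregularized; larger values drive irrelevant weights exactly to zero between updates.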
In our experiments T is always set to 10 , which is about the number of iterations required for SGD to converge. For the target task, λ is determined by 3-fold cross-validation, testing for λ all values 10^-i, i ∈ [0; 6]. For the pivot prediction task, λ is set to the small value of 10^- , in order to favor model accuracy over generalizability.

Since SGD is sensitive to feature scaling, the projection θx is post-processed as follows: (1) Each feature of the cross-lingual representation is standardized to zero mean and unit variance, where mean and variance are estimated on D_S ∪ D_u. (2) The cross-lingual document representations are scaled by a constant α such that |D_S|^-1 Σ_{x ∈ D_S} ||αθx|| = 1.

For multi-class classification the one-against-all strategy is applied. For multi-class problems, the set of pivot candidates V_P is formed as follows: (1) rank, for each class, the words according to their mutual information with respect to all other classes, and (2) select from each ranking those words with the highest mutual information.

Our implementation is available at http://github.com/pprett/bolt
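The two post-processing steps above can be sketched as follows (numpy; array names are ours, with the rows of Z holding the projected documents θx):

```python
import numpy as np

def postprocess_projection(Z_train, Z_all):
    """Post-process the projected documents theta*x.

    (1) Standardize each cross-lingual feature to zero mean and unit
        variance, with statistics estimated on Z_all (the union of the
        labeled source documents and the unlabeled documents).
    (2) Rescale by a constant alpha so that the average norm of the
        training projections equals 1.
    """
    mu = Z_all.mean(axis=0)
    sigma = Z_all.std(axis=0)
    sigma[sigma == 0.0] = 1.0          # guard against constant features
    Z_std = (Z_train - mu) / sigma
    norms = np.sqrt((Z_std ** 2).sum(axis=1))
    alpha = 1.0 / norms.mean()         # makes the mean norm exactly 1
    return alpha * Z_std
```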
We use the bilingual dictionary provided by Prettenhofer and Stein [2010] as word translation oracle. If a source word is not contained in the dictionary we resort to Google Translate, which returns a single translation for each query word. Note that the word translation oracle operates context-free, which is suboptimal; however, we do not sanitize the translations in order to demonstrate the robustness of CL-SCL with respect to translation noise.
To get an upper bound on the performance of a cross-language method we first consider the monolingual setting. For each task a linear classifier is learned on the training set of the target language and tested on the test set. The resulting accuracy scores are referred to as upper bound; this bound informs us about the expected performance on the target task if training data in the target language were available.

We choose a machine translation baseline to compare CL-SCL to another cross-language method. Statistical machine translation technology offers a straightforward solution to the problem of cross-language text classification and has been used in a number of cross-language sentiment classification studies [Hiroshi et al. 2004; Bautin et al. 2008; Wan 2009]. Our baseline CL-MT is determined as follows: (1) learn a linear classifier on the training data, (2) translate the test documents into the source language, and (3) predict the sentiment polarity of the translated test documents. Translations of the test documents into the source language via Google Translate are provided by Prettenhofer and Stein [2010]. Note that the baseline CL-MT does not make use of unlabeled documents.
Table II contrasts the classification performance of CL-SCL with the upper bound and the baseline. Due to the inherent randomness of the training algorithm, we report the accuracy scores as mean µ and standard deviation σ of ten repetitions of SGD. We use McNemar's test to analyze whether or not the differences between CL-SCL and CL-MT are statistically significant [Dietterich 1998]. Again, due to the randomness of the training algorithm, statistical significance is analyzed for each of the ten repetitions, and significance at a specific level is reported only if it applies to all repetitions.

Observe that the upper bound does not exhibit high variability across the three languages. For sentiment classification the average accuracy is about 82%, which is consistent with prior work on monolingual sentiment analysis [Pang et al. 2002; Blitzer et al. 2007]. For product category classification the average accuracy is in the low 90's, which is also consistent with prior work on monolingual product category classification [Crammer et al. 2009].

The performance of CL-MT, however, differs considerably between the two European languages and Japanese: for Japanese, the averaged differences between the upper bound and CL-MT (9.5%, 7.3%) are much larger than for German and French (5.3%, 1.7%). This can be explained by the fact that machine translation (we use Google Translate, http://translate.google.com) works better for European than for Asian languages such as Japanese.

Table II. Cross-language sentiment and topic classification results. [Table body with per-task accuracy values not recoverable from the extraction.] Classification accuracies (mean µ and standard deviation σ of 10 repetitions of SGD) on the test set of the target language T are reported. ∆ gives the difference in accuracy to the upper bound; RR[%] gives the relative error reduction. Statistical significance (McNemar) of CL-SCL is measured against CL-MT († indicates the 0.05 and †† the 0.001 level). For sentiment classification, CL-SCL uses m = 450, k = 100, φ = 30, and α = 0.85; for topic classification, CL-SCL uses m = 250, k = 50, φ = 50, and α = 0.85.

Recall that CL-SCL receives four hyperparameters as input: the number of pivots m, the dimensionality of the cross-lingual representation k, the minimum support φ of a pivot in D_S,u and D_T,u, and the Elastic Net coefficient α. For cross-language sentiment classification we use fixed values of m = 450, k = 100, φ = 30, and α = 0.85. For cross-language topic classification we found that smaller values of m and k work significantly better; the results are obtained using fixed values of m = 250, k = 50, φ = 50, and α = 0.85. The parameter settings have been optimized using the German book review task (sentiment) and the German task (topic).

The results show that CL-SCL either outperforms CL-MT or is at least competitive across all tasks. For German and Japanese sentiment classification we observe significant differences at the 0.05 and 0.001 confidence levels. For product category classification we observe significant differences only for Japanese (0.001 confidence level). Interestingly, for German music reviews the accuracy of CL-SCL even surpasses the upper bound; this can be interpreted as a semi-supervised learning effect that stems from the massive use of unlabeled data. The rightmost column of Table II shows the relative reduction in error due to cross-lingual adaptation of CL-SCL over CL-MT. CL-SCL reduces the relative error by an average of 59% (sentiment classification) and 30% (topic classification) over CL-MT.
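McNemar's test as used above compares the paired predictions of two classifiers on the same test set; only the discordant documents (right under one classifier, wrong under the other) matter. A minimal sketch (plain Python, the continuity-corrected chi-square form described by Dietterich [1998]):

```python
def mcnemar(correct_a, correct_b):
    """McNemar's test statistic for two classifiers on the same test set.

    correct_a / correct_b are boolean sequences: whether classifier A
    (resp. B) classified each test document correctly. Returns the
    chi-square statistic with continuity correction; values above 3.84
    indicate a difference significant at the 0.05 level (1 d.o.f.).
    """
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```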
CL-SCL receives a number of hyperparameters as input; the purpose of this section is to elaborate on each of them. In the following, we analyze the sensitivity of each hyperparameter in isolation while keeping the others fixed. If not specified otherwise, we use the same hyperparameter settings as in Table II.
Fig. 4. Influence of unlabeled data and hyperparameters on the performance of CL-SCL. The rows show the performance of CL-SCL (against the upper bound and CL-MT, for German, French, and Japanese book reviews) as a function of (1) the ratio between labeled and unlabeled documents, (2) the number of pivots m, and (3) the dimensionality of the cross-lingual representation k. Unlabeled Data.
The first row of Figure 4 shows the performance of CL-SCL as a function of the ratio of labeled and unlabeled documents for sentiment classification of book reviews. A ratio of 1 means that |D_S,u| = |D_T,u| = 2,000. Number of Pivots.
The second row shows the influence of the number of pivots m on the performance of CL-SCL. Compared to the size of the vocabularies V_S and V_T, which is on the order of 10^5, the number of pivots is very small. The plots show that even a small number of pivots captures a significant amount of the correspondence between S and T. Dimensionality of the Cross-Lingual Representation.
The third row shows the influence of the dimensionality of the cross-lingual representation k on the performance of CL-SCL. Obviously the SVD is crucial to the success of CL-SCL if m is sufficiently large. Observe that the value of k is task-insensitive: a value between 50 and 150 works equally well across all tasks.

Table III. Effect of regularization.

    T   Category   L2+ (µ, d[%])   L1 (µ, d[%])   Elastic Net (µ, d[%])
    [per-task accuracy and density values not recoverable from the extraction]

d[%] gives the density of W, i.e., the number of non-zero entries divided by the total number of entries. W is 450 × |V|, where |V| is on the order of 10^5 (see Table I for details). Elastic Net uses α = 0.85.

Effect of Regularization. Table III compares the effect of three different regularization terms on the performance of CL-SCL. The third column, L2+, refers to the strategy used in [Blitzer et al. 2006] and [Prettenhofer and Stein 2010], with ordinary L2 regularization and negative weights set to zero. The fifth column shows the performance of L1 regularization. Observe that L1 regularization drastically reduces the number of non-zero features, from 16% to 2% on average. We argued in Section 4.2 that L1 regularization is not adequate due to its improper handling of highly correlated features, and we proposed the Elastic Net penalty as an alternative. The empirical evidence supports this claim: Elastic Net regularization consistently outperforms both L2+ and L1 regularization while keeping the number of non-zero features low (15% on average). Note that Elastic Net regularization adds an additional hyperparameter α that trades off the relative importance of L2 and L1 regularization. In the above experiments the value of α is chosen such that the obtained density roughly equals the density of L2+. A convenient property of the Elastic Net is that it encompasses L2 and L1 regularization as special cases (α = 1 or α = 0, respectively). Thus, if m and |V| are sufficiently small and a dense SVD is computationally feasible, α = 1 is optimal.
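The density measure d[%] reported in Table III is simply the fraction of non-zero entries of the pivot weight matrix; as a sketch (numpy, names ours):

```python
import numpy as np

def density(W, tol=0.0):
    """Fraction of entries of the weight matrix W whose absolute value
    exceeds tol, i.e., non-zero entries divided by total entries."""
    return float(np.sum(np.abs(W) > tol)) / W.size
```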
Otherwise, the optimal choice of α is governed by the available computing resources.

The use of Elastic Net regularization to obtain sparse pivot classifiers has implications beyond CL-SCL, in particular for the application of Alternating Structural Optimization [Ando and Zhang 2005b] and Structural Correspondence Learning [Blitzer et al. 2006] in high-dimensional feature spaces.

Primarily responsible for the effectiveness of CL-SCL is its task specificity, i.e., the way in which context contributes to meaning (pragmatics). Due to the use of task-specific, unlabeled data, relevant characteristics are captured by the pivot

Table IV. Semantic and pragmatic correlations.

    Pivot                      English                                          German
                               Semantics        Pragmatics                      Semantics                   Pragmatics
    {beautiful_S, schön_T}     amazing, beauty, picture, pattern, poetry,       schöner (more beautiful),   bilder (pictures),
                               lovely           photographs, paintings          traurig (sad)               illustriert (illustrated)
    {boring_S, langweilig_T}   plain, asleep,   characters, pages, story        langatmig (lengthy),        charaktere (characters),
                               dry, long                                        einfach (plain),            handlung (plot),
                                                                                enttäuscht (disappointed)   seiten (pages)

Semantic and pragmatic correlations identified for the two pivots {beautiful_S, schön_T} and {boring_S, langweilig_T} in English and German book reviews.

Table IV exemplifies this claim with two pivots for German book reviews. The rows of the table show a selection of words which have the highest correlation with the pivots {beautiful_S, schön_T} and {boring_S, langweilig_T}. We can distinguish between (1) correlations that reflect similar meaning, such as "amazing", "lovely", or "plain", and (2) correlations that reflect the pivot pragmatics with respect to the task, such as "picture", "poetry", or "pages". Note in this connection that the authors of book reviews tend to use the word "beautiful" to refer to illustrations or to poetry, and that they use the word "pages" to indicate lengthy or boring books.
While the first type of word correlations can be obtained by methods that operate on parallel corpora, the second correlation type requires an understanding of the task-specific language use.

6. CONCLUSIONS

We have presented Cross-Language Structural Correspondence Learning, CL-SCL, as an effective technology for cross-lingual adaptation. CL-SCL builds on Structural Correspondence Learning, a recently proposed algorithm for domain adaptation in natural language processing. CL-SCL uses unlabeled documents along with a feature translation oracle to automatically induce task-specific, cross-lingual feature correspondences.

We evaluated the approach for cross-language text classification, a special case of cross-lingual adaptation. The analysis covers performance and sensitivity issues in the context of sentiment and topic classification with English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed approach over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 59% (sentiment classification) and 30% (topic classification) over the baseline.

Furthermore, the Elastic Net is proposed as an effective means to obtain a sparse parameter matrix, again leading to a significant improvement upon previously reported results. Note that the Elastic Net has implications beyond CL-SCL, in particular for Structural Correspondence Learning [Blitzer et al. 2006] and Alternating Structural Optimization [Ando and Zhang 2005a].

REFERENCES

Ando, R. K. and Zhang, T. J. Mach. Learn. Res. 6, 1817–1853.
Ando, R. K. and Zhang, T. In ACL '05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 1–9.
Bautin, M., Vijayarenu, L., and Skiena, S. In Proceedings of ICWSM.
Bel, N., Koster, C. H. A., and Villegas, M.
Berry, M. W.
International Journal of Supercomputer Applications 6, 1, 13–49.
Bickel, S., Brückner, M., and Scheffer, T. J. Mach. Learn. Res. 10, 2137–2155.
Blitzer, J., Dredze, M., and Pereira, F. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, 440–447.
Blitzer, J., McDonald, R., and Pereira, F. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Sydney, Australia, 120–128.
Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. In Algorithmic Learning Theory, Y. Freund, L. Györfi, G. Turán, and T. Zeugmann, Eds. Lecture Notes in Computer Science, vol. 5254. Springer, Berlin, Heidelberg, Chapter 8, 38–53.
Crammer, K., Dredze, M., and Kulesza, A. In EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Morristown, NJ, USA, 496–504.
Dai, W., Chen, Y., Xue, G.-R., Yang, Q., and Yu, Y. In NIPS. 353–360.
Daumé III, H. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, 256–263.
Dietterich, T. G. Neural Computation 10, 1895–1923.
Duchi, J., Shwartz, S. S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning. ACM, New York, NY, USA, 272–279.
Dumais, S. T., Letsche, T. A., Littman, M. L., and Landauer, T. K. In AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.
Finkel, J. R. and Manning, C. D. In NAACL '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 602–610.
Fortuna, B. and Shawe-Taylor, J. In Workshop on Learning with Multiple Views, ICML, 2005.
Gao, J., Andrew, G., Johnson, M., and Toutanova, K. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, 824–831.
Gliozzo, A. and Strapparava, C. In ParaText '05: Proceedings of the ACL Workshop on Building and Using Parallel Texts. Association for Computational Linguistics, Morristown, NJ, USA, 9–16.
Gliozzo, A. and Strapparava, C. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 553–560.
Hiroshi, K., Tetsuya, N., and Hideo, W. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 494+.
Jiang, J. and Zhai, C. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. ACM, New York, NY, USA, 401–410.
Langford, J., Li, L., and Zhang, T. J. Mach. Learn. Res. 10, 777–801.
Lavrenko, V., Choquette, M., and Croft, W. B. In SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 175–182.
Li, Y. and Taylor, J. S. Inf. Process. Manage. 43, 5, 1183–1199.
Ling, X., Xue, G. R., Dai, W., Jiang, Y., Yang, Q., and Yu, Y. In WWW '08: Proceedings of the 17th International Conference on World Wide Web. ACM, New York, NY, USA, 969–978.
Oard, D. W. In AMTA, D. Farwell, L. Gerber, and E. H. Hovy, Eds. Lecture Notes in Computer Science, vol. 1529. Springer, 472–483.
Olsson, J. S., Oard, D. W., and Hajič, J. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
ACM, New York, NY, USA, 645–646.
Pan, S. J. and Yang, Q. IEEE Transactions on Knowledge and Data Engineering 99.
Pang, B., Lee, L., and Vaithyanathan, S. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Morristown, NJ, USA, 79–86.
Potthast, M., Stein, B., and Anderka, M. In Advances in Information Retrieval. Lecture Notes in Computer Science. Chapter 51, 522–530.
Prettenhofer, P. and Stein, B. In Proceedings of the 48th Annual Meeting of the Association of Computational Linguistics (to appear). Association for Computational Linguistics, Uppsala, Sweden.
Quattoni, A., Collins, M., and Darrell, T. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1–8.
Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press.
Rigutini, L., Maggini, M., and Liu, B. In Web Intelligence, IEEE/WIC/ACM International Conference on, 529–535.
Shimodaira, H. Journal of Statistical Planning and Inference 90.
Shwartz, S. S., Singer, Y., and Srebro, N. In ICML '07: Proceedings of the 24th International Conference on Machine Learning. ACM, New York, NY, USA, 807–814.
Tibshirani, R. Journal of the Royal Statistical Society. Series B (Methodological) 58, 1, 267–288.
Tsuruoka, Y., Tsujii, J., and Ananiadou, S. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, 477–485.
Wan, X. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, 235–243.
Wu, K., Wang, X., and Lu, B.-L.
In Proceedings of the Third International Joint Conference on Natural Language Processing.
Zhang, T. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA, 116+.
Zou, H. and Hastie, T.