On the Replicability of Combining Word Embeddings and Retrieval Models
Luca Papariello, Alexandros Bampoulidis, and Mihai Lupu
Research Studio Data Science, RSA FG, Vienna, Austria
{name.surname}@researchstudio.at
Abstract.
We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use of a mixture model of von Mises-Fisher (VMF) distributions instead of Gaussian distributions would be beneficial because of the focus on cosine distances of both VMF and the vector space model traditionally used in information retrieval. Previous experiments had validated this hypothesis. Our replication was not able to validate it, despite a large parameter scan space.
Over the last five years, neural network-based word embedding models have been shown to provide term representations that are a useful information source for a variety of tasks in natural language processing. In information retrieval (IR), “traditional” models remain a high baseline to beat, particularly when considering efficiency in addition to effectiveness [6]. Combining the word embedding models with the traditional IR models is therefore very attractive, and several papers have attempted to improve the baseline by adding in, in a more or less ad-hoc fashion, word-embedding information. Onal et al. [10] summarized the various developments of the last half-decade in the field of neural IR and grouped the methods in two categories: aggregate and learn. The first one, also known as compositional distributional semantics, starts from term representations and uses some function to combine them into a document representation (a simple example is a weighted sum). The second method uses the word embedding as a first layer of another neural network to output a document representation.
The advantage of the first type of methods is that they often distill down to a linear combination (perhaps via a kernel), from which an explanation about the representation of the document is easier to induce than from the neural network layers built on top of a word embedding. Recently, the issue of explainability in IR and recommendation has been attracting renewed interest [15].
In this sense, Zhang et al. [14] introduced a new model for combining high-dimensional vectors, using a mixture model of von Mises-Fisher (VMF) distributions instead of the Gaussian distributions previously suggested by Clinchant and Perronnin [3]. This is an attractive hypothesis because the Gaussian Mixture Model (GMM) works on Euclidean distance, while the mixture of von Mises-Fisher (moVMF) model works on cosine distances, the typical distance function in IR.
In the following sections, we set out to replicate the experiments described by Zhang et al. [14]. They are grouped in three sets (classification, clustering, and information retrieval) and compare “standard” embedding methods with the novel moVMF representation.
In general, we follow the experimental setup of the original paper and, for lack of space, do not repeat details that are clearly explained there.
All experiments are conducted on publicly available datasets, which are briefly described below.
Classification.
Two subsets of the movie review dataset: (i) the subjectivity dataset (subj) [11]; and (ii) the sentence polarity dataset (sent) [12].
Clustering.
The 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) was used in the original paper, but the concrete version was not specified. We selected the “bydate” version because, according to its creators, it is the most commonly used in the literature. It is also the version directly loadable in scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html), which makes it more likely that the authors used this version.
Retrieval.
The TREC Robust04 collection [13].
The methods used to generate vectors for terms and documents are:
TF-IDF.
The basic term frequency - inverse document frequency method [5].
Implemented in the scikit-learn library (https://scikit-learn.org/stable/).
LSI.
Latent Semantic Indexing [4].
LDA.
Latent Dirichlet Allocation [2].
cBoW.
Word2vec [9] in the Continuous Bag-of-Words (cBoW) architecture.
PV-DBOW/DM.
Paragraph vector (PV) is a document embedding algorithm that builds on Word2vec. We use both of its variants: Distributed Bag-of-Words (PV-DBOW) and Distributed Memory (PV-DM) [7].
The LSI, LDA, cBoW, and PV implementations are available in the gensim library.
Fisher Kernel (FK).
The FK framework offers the option to aggregate word embeddings to obtain fixed-length representations of documents. We use Fisher vectors (FV) based on (i) a Gaussian mixture model (FV-GMM) and (ii) a mixture of von Mises-Fisher distributions (FV-moVMF) [1].
We first fit (i) a GMM and (ii) a moVMF model on previously learnt continuous word embeddings. The fixed-length representation of a document $X$ containing $T$ words $w_i$, expressed as $X = \{E_{w_1}, \dots, E_{w_T}\}$, where $E_{w_i}$ is the word vector representation of word $w_i$, is then given by $G^X = [G^X_1, \dots, G^X_K]$, where $K$ is the number of mixture components. The vectors $G^X_i$, having the dimension ($d$) of the word vectors $E_{w_i}$, are explicitly given by [3, 14]:

(i) $G^X_i = \frac{1}{\sqrt{\omega_i}} \sum_{t=1}^{T} \gamma_t(i)\, \frac{x_t - \mu_i}{\sigma_i}$, and (ii) $G^X_i = \sum_{t=1}^{T} \gamma_t(i)\, \frac{x_t\, d}{\omega_i \kappa_i}$, (1)

where $\omega_i$ are the mixture weights, $\gamma_t(i) = p(i \mid x_t)$ is the soft assignment of $x_t$ to (i) Gaussian and (ii) VMF distribution $i$, and $\sigma_i = \mathrm{diag}(\Sigma_i)$, with $\Sigma_i$ the covariance matrix of Gaussian $i$. In (i), $\mu_i$ refers to the mean vector; in (ii) it indicates the mean direction and $\kappa_i$ is the concentration parameter.
We implemented the FK-based algorithms ourselves, using the scikit-learn library to fit the Gaussian mixture model and the Spherecluster package (https://github.com/jasonlaska/spherecluster) to fit the mixture of von Mises-Fisher distributions to our data.
The implementation details of each algorithm are described in what follows.
Each of the following experiments is conceptually divided in three phases: first, text processing (e.g. tokenisation); second, creating a fixed-length vector representation for every document; third, a phase determined by the goal to be achieved, i.e. classification, clustering, or retrieval.
For the first phase, the same pre-processing is applied to all datasets. In the original paper, this phase was only briefly described as tokenisation and stop-word removal. It does not specify which tokeniser, linguistic filters (stemming, lemmatisation, etc.), or stop-word list were used. Knowing that the gensim library (https://radimrehurek.com/gensim/) was used, we kept all of its standard parameters (see the provided code, https://rsagit.researchstudio.at/lpapariello/ecir_2020.git). Gensim however does not come with a pre-defined stopword list, and therefore, based on our own experience, we used the one provided in the NLTK library for English.
For the second phase, transforming terms and documents to vectors, Zhang et al. [14] specify that all trained models are 50 dimensional. We have additionally experimented with dimensionality 20 (used by Clinchant and Perronnin [3] for clustering) and 100, as we hypothesized that 50 might be too low. The TF-IDF model is 5000 dimensional (i.e. only the top 5000 terms based on their tf-idf value are used), while the Fisher kernel models are 15 × d dimensional, where $d \in \{20, 50, 100\}$, as just explained. In what follows, d refers to the dimensionality of the LSI, LDA, cBoW, and PV models.
The cBoW and PV models are trained using a default window size of 5, keeping both low and high-frequency terms, again following the setup of the original experiment. The LDA model is trained using a chunk size of 1000 documents and for a number of iterations over the corpus ranging from 20 to 100. For the FK methods, both fitting procedures (GMM and moVMF) are independently initialised 10 times and the best fitting model is kept.
For the third phase, parameters are explained in the following sections.
Logistic regression is used for classification in Zhang et al., and is therefore also used here. The results of our experiments, for d = 50 and 100-dimensional feature vectors, are summarised in Table 1. For all the methods, we perform a parameter scan of the (inverse) regularisation strength of the logistic regression classifier, as shown in Fig. 1(a) and (b). Additionally, the learning algorithms are trained for a different number of epochs and the resulting classification accuracy is assessed, cf. Fig. 1(c) and (d).

Table 1: Results of classification experiments on the subj and sent datasets. Shown are the mean accuracy and standard deviation, under 10-fold cross-validation, for optimally chosen hyperparameters (i.e. top values in Figure 1).
Model | 50-dim. Subj | 50-dim. Sent | 100-dim. Subj | 100-dim. Sent
TF-IDF | 89.3 ± … | … | … | …
(remaining entries lost in extraction)

Fig. 1(a) indicates that cBoW, FV-GMM, FV-moVMF, and the simple TF-IDF, when properly tuned, exhibit very similar accuracy on subj; the given confidence intervals do not allow us to identify a single best model. Surprisingly, TF-IDF outperforms all the others on the sent dataset [Fig. 1(b)]. Increasing the dimensionality of the feature vectors from d = 50 to 100 reduces the gap between TF-IDF and the rest of the models on the sent dataset (see Table 1).

Fig. 1: Results of classification experiments, for 50-dimensional feature vectors, on the subj dataset [top panels, (a) and (c)] and sent dataset [bottom panels, (b) and (d)]. LSI and LDA achieve low accuracy (see Table 1) and are omitted here for visibility. The left panels [(a) and (b)] show the effect of (inverse) regularisation of the logistic regression classifier on the accuracy, while the right panels [(c) and (d)] display the effect of training for the learning algorithms. The two symbols on the right axis in panels (a) and (b) indicate the best (FV-moVMF) results reported in [14].
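The Fisher-vector features used in these experiments follow Eq. (1). As a minimal sketch of that aggregation step, assuming a mixture model has already been fitted and its soft assignments γ_t(i) are available, the two variants could be computed as below. The function and argument names are hypothetical and chosen for illustration only; they are not taken from the released code, and `sigmas` simply stands for the σ_i appearing in Eq. (1)(i).

```python
import math


def fisher_vector_gmm(X, gamma, weights, means, sigmas):
    """FV-GMM, Eq. (1)(i): one d-dimensional block per component, concatenated.

    X       : list of T word vectors (each a list of d floats)
    gamma   : list of T soft-assignment rows, gamma[t][i] = p(i | x_t)
    weights : K mixture weights (omega_i)
    means   : K mean vectors (mu_i)
    sigmas  : K per-dimension scale vectors (sigma_i of Eq. (1))
    """
    K, d = len(weights), len(X[0])
    fv = []
    for i in range(K):
        block = [0.0] * d
        for x_t, g_t in zip(X, gamma):
            for j in range(d):
                block[j] += g_t[i] * (x_t[j] - means[i][j]) / sigmas[i][j]
        fv.extend(v / math.sqrt(weights[i]) for v in block)
    return fv


def fisher_vector_movmf(X, gamma, weights, kappas):
    """FV-moVMF, Eq. (1)(ii): sum of gamma_t(i) * x_t * d / (omega_i * kappa_i)."""
    K, d = len(weights), len(X[0])
    fv = []
    for i in range(K):
        scale = d / (weights[i] * kappas[i])
        block = [0.0] * d
        for x_t, g_t in zip(X, gamma):
            for j in range(d):
                block[j] += g_t[i] * x_t[j] * scale
        fv.extend(block)
    return fv
```

Either function returns a vector with K × d entries, which matches the 15 × d dimensionality quoted above when K = 15.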
For clustering experiments, the obtained feature vectors are passed to the k-means algorithm. The results of our experiments, measured in terms of Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), are summarised in Table 2. We used both d = 20 and 50-dimensional feature vectors. Note that the evaluation of the clustering algorithms is based on the knowledge of the ground truth class assignments, available in the 20 Newsgroups dataset.
As opposed to classification, clustering experiments show a large spread in performance and firmly speak in favour of PV-DBOW. Interestingly, TF-IDF, FV-GMM, and FV-moVMF, all providing high-dimensional document representations, have a low clustering effectiveness.

Table 2: Results of clustering experiments (mean performance and standard deviation over 10 runs) in terms of Adjusted Rand Index (ARI) and Normalised Mutual Information (NMI).
Model | 20-dim. ARI | 20-dim. NMI | 50-dim. ARI | 50-dim. NMI
TF-IDF | 0.4 ± … | … | … | …
(remaining entries lost in extraction)

For these experiments, we extracted all the raw text from every document of the test collection and preprocessed it as described at the beginning of this section. The documents were indexed and retrieved for BM25 with the Lucene 8.2 search engine. We experimented with three ways of processing topics: (1) title only, (2) description only, and (3) title and description. The third produces the best results, which are also the closest to those reported by Zhang et al. [14], and is hence the only one reported here.
An important aspect of BM25 is that varying its parameters $k_1$ and $b$ can bring significant improvement in performance, as reported by Lipani et al. [8]. Therefore, we performed a parameter scan for $k_1 \in [0, 3]$ and $b \in [0, 1]$ with a step size of 0.05 for both parameters. For every TREC topic, the scores of the top 1000 documents retrieved with BM25 were normalised to [0, 1] with the min-max normalisation method, and were used in calculating the scores of the documents for the combined models [14].
The original results and those of our replication experiments with standard ($k_1 = 1.2$, $b = 0.75$) and best BM25 parameter values, measured in terms of Mean Average Precision (MAP) and Precision at 20 (P@20), are outlined in Table 3.
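To make the size of the BM25 parameter scan concrete, the following sketch enumerates the grid described above and shows where the two parameters enter a textbook BM25 term score. It is an illustration under stated assumptions: the scoring formula is the classic form and not necessarily the exact variant implemented in Lucene 8.2, and `evaluate` is a hypothetical callback that would run retrieval and return a metric such as MAP.

```python
def parameter_grid(step=0.05, k1_max=3.0, b_max=1.0):
    """All (k1, b) pairs on a regular grid; integer stepping avoids float drift."""
    n_k1 = round(k1_max / step)
    n_b = round(b_max / step)
    return [(round(i * step, 2), round(j * step, 2))
            for i in range(n_k1 + 1)
            for j in range(n_b + 1)]


def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term (assumes tf > 0).

    k1 controls term-frequency saturation, b the strength of
    document-length normalisation.
    """
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def scan(evaluate):
    """Return the (k1, b) pair that maximises a caller-supplied metric."""
    return max(parameter_grid(), key=lambda p: evaluate(*p))
```

With a step of 0.05 the grid has 61 values of k1 and 21 values of b, i.e. 1281 retrieval runs per configuration.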
Table 3: Results of retrieval experiments with 95% confidence intervals.
Model | Zhang et al. MAP | Zhang et al. P@20 | Replicated MAP | Replicated P@20 | Replicated, best BM25 MAP | Replicated, best BM25 P@20
BM25 | 24.10 | 33.70 | 22.80 ± … | … | … | …
(remaining entries lost in extraction)
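The per-topic min-max normalisation of BM25 scores mentioned above can be sketched as follows. The `combine` function is only a hypothetical linear interpolation with an illustrative weight `lam`; the combination rule actually used is the one defined in [14], which is not restated here.

```python
def min_max_normalise(scores):
    """Rescale the retrieval scores of one topic linearly to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores are equal, so no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine(bm25_norm, embedding_sim, lam=0.5):
    """Hypothetical interpolation of normalised BM25 and embedding similarity."""
    return [lam * s + (1.0 - lam) * e
            for s, e in zip(bm25_norm, embedding_sim)]
```

Normalising per topic keeps the BM25 scores on the same [0, 1] scale as a bounded similarity score, so that neither component dominates the combination purely because of its range.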
We replicated previously reported experiments that presented evidence that a new mixture model, based on von Mises-Fisher distributions, outperformed a series of other models in three tasks (classification, clustering, and retrieval, the latter when combined with standard retrieval models).
Because the source code was not released with the original paper, important implementation and formulation details were omitted, and the authors never replied to our request for information, a significant effort was devoted to reverse-engineering the experiments. In general, for none of the tasks were we able to confirm the conclusions of the previous experiments: we do not have enough evidence to conclude that FV-moVMF outperforms the other methods. The situation is rather different when considering the effectiveness of these document representations for clustering purposes: we find that FV-moVMF significantly underperforms, contradicting previous conclusions. In the case of retrieval, although the method proposed by Zhang et al. (FV-moVMF) indeed boosts BM25, it does not outperform most of the other models it was compared to.
Acknowledgments
The authors are partially supported by the H2020 Safe-DEED project (GA 825225).
References
1. A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res., 6:1345–1382, December 2005.
2. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
3. Stéphane Clinchant and Florent Perronnin. Aggregating continuous word embeddings for information retrieval. In Proceedings of the workshop on continuous vector space models and their compositionality, pages 100–109, 2013.
4. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
5. Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
6. S. Hofstätter and A. Hanbury. Let’s measure run time! Extending the IR replicability infrastructure to include performance aspects. In Proceedings of the Open-Source IR Replicability Challenge co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, pages 12–16, 2019.
7. Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196, 2014.
8. Aldo Lipani, Mihai Lupu, Allan Hanbury, and Akiko Aizawa. Verboseness fission for BM25 document length normalization. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pages 385–388. ACM, 2015.
9. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013.
10. Kezban Dilek Onal, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke, and Matthew Lease. Neural information retrieval: at the end of the early years. Information Retrieval Journal, 21(2):111–182, Jun 2018.
11. B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278, 2004.
12. B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124. Association for Computational Linguistics, 2005.
13. Ellen M. Voorhees. The TREC robust retrieval track. SIGIR Forum, 39(1):11–20, June 2005.
14. R. Zhang, J. Guo, Y. Lan, J. Xu, and X. Cheng. Aggregating neural word embeddings for document representation. In Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury, editors, Advances in Information Retrieval, pages 303–315. Springer International Publishing, 2018.
15. Yongfeng Zhang, Yi Zhang, Min Zhang, and Chirag Shah. EARS 2019: The 2nd international workshop on explainable recommendation and search. In