DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections
Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein
Yury Zemlyanskiy∗ (University of Southern California); Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha†, Ilya Eckstein (Google Research)
Abstract
This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision.

We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities—strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by the results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can scale to very large corpora.

Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens, mapping the ∼1B-word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions with natural language queries and corresponding community recommendations.
∗ Work partially done while at Google.
† On leave from USC ([email protected]).
See http://goo.gle/research-docent for Reviews2Movielens and models. Scripts and Reddit Suggestions can be found at https://urikz.github.io/docent.
Review 1: "This movie develops its power best if you don't try to look out for the "real" and "true" events behind the four versions of the narration... shown in a very intelligent and artistic way, no silly plot-twists, no explanation in the end — it is open to your fantasy... "$MOVIE" is an important piece of cinematic storytelling and a really interesting way to reflect on the origin of tales... Some scenes even remind me of Andrej Tarkovskij's intensive style..."

Review 2: "Just rented this, and at first I didn't like it very much, but then it starts to sink in for how good it is, the acting is great especially Toshiro Mifune, it was shot very good for an older movie... it's..."

Review 3: "Saw this movie at my local video store... was placed on a waiting list, but when I returned to check it out the video store had closed down overnight. Actually went out of business" ... More reviews ...

Summary tags: [nonlinear] [multiple storylines] [japan] [black and white] [surreal] [cerebral] [imdb top 250], ...
Table 1: Reviews2Movielens task, illustrated. Here are sample review snippets for a certain classic film, which is summarized using MovieLens tags. Notice that the tags may not appear in the input verbatim and can be thought of as boolean questions about the film. Note also that Review 3 has zero relevant signal—a common challenge of low SNR in this dataset. Bonus teaser: can you guess the $MOVIE from these snippets? This little quiz alludes to a key learning task in our approach.
1 Introduction

Much of the online information describing entities in domains such as music, movies, venues or consumer products is only available as unstructured text—a format that is human-readable but not (yet) machine-understandable. Consider online reviews—a rich source of mostly user-generated content about a vast number of entities. Our key research question is: Can we learn strong models for entity understanding tasks such as vertical search and question answering, solely from text? In other words, given a large and noisy collection of documents about an entity, can we distill all the useful information therein into a dense entity representation, so as to benefit multiple downstream tasks?

Traditionally, learning entity representations required supervised signals such as clicks, "likes" and consumption behavior (Agichtein et al., 2006; Huang et al., 2013; Koren et al., 2009; Vig et al., 2012a), which are generally expensive and time-consuming to obtain at scale. To leapfrog these limitations, we draw inspiration from the recent progress in unsupervised learning of text, particularly contextualized representations via techniques such as ELMo (Peters et al., 2018), CoVe (McCann et al., 2017) and BERT (Devlin et al., 2019). Many of these representations are learned by predicting a missing word from its context. More recently, Sun et al. (2019) showed that extending word masking strategies to entities can lead to superior language models. Even more recent entity linking methods such as RELIC (Ling et al., 2020) and others, detailed in Section 6, were shown to produce explicit encodings applicable to entity understanding tasks.

We start with RELIC-like approaches and generalize them into a family of models, collectively called DOCENT, that jointly embed text and entities (Section 2) via self-supervised tasks. The first one, DOCENT-DUAL, is essentially RELIC, but trained with a much broader context that includes any and all sentences potentially related to an entity. Importantly, DOCENT-DUAL/RELIC only optimizes a single task, namely entity prediction given an associated sentence, effectively modeling P(Entity | Sentence).

Another natural way of jointly modelling entities and text is by directly tapping the cross-attention mechanism in BERT, simply by extending the BERT vocabulary to include entity tokens V_E. Each entity-related sentence can then be augmented with a corresponding token from V_E. We call this method DOCENT-FULL and, despite (or perhaps because of) its conceptual simplicity, it proves surprisingly effective in semi-supervised tasks.

Finally, DOCENT-HYBRID aims to capture the best of both models by extending DOCENT-DUAL with an additional task of predicting words in a sentence, conditioned on its associated entity. This task encourages the latter to "remember" salient phrases in its sentences.

We empirically evaluate these methods by learning entity representations for movies from the TV-Movies portion of the Amazon Reviews Corpus (He and McAuley, 2016). To this end, we consider several movie-oriented tasks for downstream evaluation, namely Reddit Movie Suggestions and MovieLens Tag Prediction (Harper and Konstan, 2016), which we study in zero-shot, supervised and few-shot settings. We join the MovieLens dataset with the reviews corpus (He and McAuley, 2016), obtaining a mapping from movie reviews to user-generated tags. On the supervised tag prediction task, our text-based model demonstrates state-of-the-art performance, despite not using powerful user signals (Vig et al., 2012a). In fact, we are able to match or outperform baselines on all tasks where they are available.
In summary, our contributions are as follows:

1. First, we propose a family of methods to train deep self-supervised entity representations purely from related text documents, with strong zero-shot results on ranked retrieval with natural language queries.
2. Second, we show that these pre-trained representations are amenable to fine-tuning on new tasks such as MovieLens tag prediction, where we show state-of-the-art results. They are also effective few-shot learners, which we demonstrate on a harder open-vocabulary task akin to Boolean Question Answering (Clark et al., 2019), where an open vocabulary allows any phrase to be a label.
3. Next, we propose Reviews2Movielens—a new Text-Based Entity Understanding task. The requisite dataset, which we release publicly, effectively joins the Amazon Movie Reviews Corpus and MovieLens into a large, sparsely supervised set with approximately 1B words and 470K movie-tag pairs.
4. Finally, we also release a dataset of user-generated Reddit Movie Suggestions, a benchmark for natural language search and recommendation scenarios.
2 The DOCENT Model Family

Inspired by the success of self-supervised language models, we seek to extend them to jointly compute text and entity representations. Recall that our input is a set of entities E where, for every entity e ∈ E, we have a collection of sentences, denoted by S_e, drawn from all documents related to e. Intuitively, we want the representation of e to be influenced by each associated sentence s ∈ S_e, and vice versa. To that end, we explore two (self-)supervision signals: P(e | s) and P(s | e).

Figure 1: Models in the DOCENT family. Left: a baseline dual encoder model called DOCENT-DUAL, a.k.a. RELIC, maximizing P(e | s) but not P(s | e). Center: DOCENT-FULL, a model maximizing the joint sentence-entity probability using full cross-attention. Right: DOCENT-HYBRID, designed to capture the best of both worlds.

2.1 DOCENT-DUAL, a.k.a. RELIC
At the core of DOCENT-DUAL is a RELIC model that co-encodes an entity $e$ and an associated sentence $s \in S_e$ so as to maximize their compatibility score, defined as the cosine similarity between the two encodings:

$$s(e, s) = \frac{g(e)^{\top} f_{\mathrm{CLS}}(s)}{\lVert g(e) \rVert \, \lVert f_{\mathrm{CLS}}(s) \rVert},$$

where $g(e)$ is an embedding of $e$ and $f(s)$ is a BERT-based encoding of $s$, with its special [CLS] token whose output representation is denoted by $f_{\mathrm{CLS}}$. Then, the conditional probability of $e$ given $s$ is given by a softmax over the set $E$:

$$P(e \mid s) = \frac{\exp(s(e, s))}{\sum_{e' \in E} \exp(s(e', s))}.$$

(In practice, only a subset of the entities in $E$ is used in the denominator: the so-called "in-batch negatives".) Finally, RELIC is trained by maximizing $\log P(e \mid s)$ over all associated pairs $e$, $s \in S_e$:

$$\mathcal{L}_E(e, s) = \log P(e \mid s).$$

Note that both $g$ and $f$ (initialized with a common BERT) are learned during training.

Our sole difference to the original RELIC is in training data: while RELIC only uses sentences containing entity mentions, we allow a radically broader context—all sentences associated with an entity—with the goal of remembering all of its attributes. Crucially, no human labeling is required.

Despite its effectiveness (as demonstrated in Section 5), RELIC has one obvious limitation: it ignores P(s | e), leaving a useful signal "on the table". We therefore propose another way of co-encoding sentences and entities by tapping the full cross-attention power of Transformers.
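For concreteness, here is a minimal PyTorch sketch of this in-batch softmax objective. It is our own illustration, not the released implementation; the function name and the assumption of one associated sentence per entity in a batch are ours.

```python
import torch
import torch.nn.functional as F

def relic_loss(entity_emb: torch.Tensor, sentence_emb: torch.Tensor) -> torch.Tensor:
    """entity_emb: [B, D] embeddings g(e); sentence_emb: [B, D] encodings
    f_CLS(s). Row i of each tensor holds one associated (e, s) pair."""
    e = F.normalize(entity_emb, dim=-1)
    s = F.normalize(sentence_emb, dim=-1)
    # logits[i, j] = cosine similarity between sentence i and entity j:
    # each sentence is scored against every entity in the batch, so the
    # off-diagonal entries act as "in-batch negatives" in the softmax.
    logits = s @ e.t()                       # [B, B]
    targets = torch.arange(logits.size(0))   # true entity index per sentence
    return F.cross_entropy(logits, targets)  # mean of -log P(e | s)
```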
2.2 DOCENT-FULL

Before we proceed, let us revisit BERT's Masked Language Model (MLM) training objective. Given a sequence of input tokens $s = [s_1, \ldots, s_n]$, a fraction of tokens $s_J$ at randomly selected positions $J$ is replaced with a special [MASK] token. We denote this new sequence by $s_{-J}$. Then, BERT predicts the masked tokens based on their contextualized representations $f(s_{-J})$. The MLM training objective to maximize is

$$\mathcal{L}_{\mathrm{MLM}} = \log P(s_J \mid s_{-J}).$$

Enter DOCENT-FULL. It follows the standard BERT architecture, with a twist. First, we expand the input vocabulary to include all entity tokens in $E$. Then, during input sequence construction, each sentence $s \in S_e$ is prepended with the corresponding entity token $e$, as shown in Figure 1. (Technically, we replace BERT's standard $(s_A, s_B)$ two-segment input structure with $(e, s)$, for $s \in S_e$.) This way, masking and predicting this token (via softmax) effectively adds our new objective $\mathcal{L}_E$ to BERT. Further, the new $e$ token is now part of the sentence context, augmenting the original $\mathcal{L}_{\mathrm{MLM}}$ to

$$\mathcal{L}_{\mathrm{MLM+E}}(s, e) = \log P(s_J \mid s_{-J}, e),$$

and

$$\mathcal{L}_{\mathrm{FULL}} = \mathcal{L}_E + \lambda \, \mathcal{L}_{\mathrm{MLM+E}}$$

becomes the combined loss function, optimized using nothing but BERT's standard MLM training, with a hyperparameter $\lambda$ to balance the two terms. (The relative masking frequency of entity tokens is another hyperparameter available to balance the two objectives.)

This conceptual simplicity and full cross-attention power come with a cost: bundling wordpieces and entities together forces the model to allocate an equal capacity to both types of tokens (e.g., 768D for BERT-base), regardless of the size of $E$. As a result, a relatively small-sized $E$ may be prone to overfitting in zero-shot scenarios, as we observe in Section 5.4.2. Conversely, a very large $E$ may require an optimized implementation of the softmax to maintain scalability.
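A minimal sketch of this input construction, assuming the HuggingFace transformers library rather than the paper's original codebase; the [ENT_i] token naming is illustrative.

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One new vocabulary token per entity in E; naming is illustrative.
num_entities = 81_057  # |E| after the filtering described in Section 5
tokenizer.add_tokens([f"[ENT_{i}]" for i in range(num_entities)])
model.resize_token_embeddings(len(tokenizer))

# Replace BERT's (s_A, s_B) two-segment input with (e, s): prepend the
# entity token to each associated sentence. Standard MLM masking then
# covers wordpieces and entity tokens alike, yielding L_E + lambda * L_MLM+E.
inputs = tokenizer("[ENT_42] the acting is great, especially Toshiro Mifune",
                   return_tensors="pt")
```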
2.3 DOCENT-HYBRID

Recall that RELIC avoids the above limitation by decoupling the text and entity encoders. To get the best of both worlds, we introduce DOCENT-HYBRID—a third model that sticks with the modular dual encoder architecture while also modeling $P(s \mid e)$. This is achieved by implementing a different variant of $\mathcal{L}_{\mathrm{MLM+E}}$ where, for every masked wordpiece token, the output of the Transformer layers $f(s_{-J})$ is first concatenated with the associated entity embedding $g(e)$ before feeding into the final MLM prediction layer. By including entity embeddings in the prediction of related text tokens, we get them to "remember" important aspects from the text without sacrificing modularity.
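A minimal sketch of this modified MLM prediction layer; the class and layer names are our own illustration.

```python
import torch
import torch.nn as nn

class HybridMLMHead(nn.Module):
    """Predicts masked wordpieces from the concatenation [f(s_-J); g(e)]:
    each token's contextualized output plus the entity embedding."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, token_outputs: torch.Tensor,   # [B, T, D] from BERT
                entity_emb: torch.Tensor             # [B, D], g(e)
                ) -> torch.Tensor:
        # Broadcast the entity embedding across all token positions.
        ent = entity_emb.unsqueeze(1).expand(-1, token_outputs.size(1), -1)
        return self.decoder(torch.cat([token_outputs, ent], dim=-1))  # [B, T, V]
```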
3 Entity Understanding Tasks

In this section, we define the three tasks used to evaluate pre-trained entity representations.

3.1 Closed-Vocabulary Tag Prediction

The original MovieLens Tag Prediction task is to produce movie-tag scores for a set of movies and a canonical vocabulary of tags (see examples in Table 1), based on a collection of crowdsourced (movie, tag, user) votes, as well as (user, movie) star ratings. These tags are often not factual; they may refer to plot elements and qualitative aspects, or reflect subjective opinions. Since the same can be said about user reviews, we observe a non-trivial amount of textual entailment between the two sources. We therefore intentionally exclude user ratings from the input. The new challenge is to complete the movie-tag relevance matrix by leveraging movie reviews, hereafter referred to as the closed-vocabulary tag prediction task. (One can also view this as a two-dimensional knowledge base (KB) completion problem, where relation types are not available and the KB is reduced to a 2D matrix.) This is a supervised setup where models are fine-tuned with tag labels and evaluated on a held-out subset of movies, as elaborated in Section 5.
3.2 Open-Vocabulary Tag Prediction

In reality, the space of tags is not static. Rather, tags are a useful kind of user-generated content that evolves to reflect the zeitgeist, much like human language. Many online platforms (e.g., Twitter and Instagram, to name a few) have vibrant online communities that keep inventing new tags. We therefore propose a new open-vocabulary formulation of the tag prediction problem, where any phrase is allowed to be a tag.

This requires a small change in evaluation. Instead of held-out movies, we hold out a subset of tags and fine-tune on the rest (and on all the movies). Note that this is no longer a classic multi-label classification task, as we never get to see the test labels during training. Rather, this open-vocabulary setup is akin to answering boolean questions (about a movie) based on a text document (Clark et al., 2019).
3.3 Zero-Shot Movie Ranking

The purpose of this task is to evaluate pre-trained entity representations in the context of vertical search. The classic entity ranking problem is, given a text query and a finite set of entities, to rank them according to their relevance to the query. Recall that DOCENT models are naturally designed to make such relevance predictions via P(Entity | Sentence)—without any fine-tuning, if necessary. We therefore leverage the Reddit Movie Suggestions Dataset (detailed in Section 4.3) as a source of both queries and ground truth to define a zero-shot movie ranking task. To clarify, the notion of zero-shot implies a pre-trained but not fine-tuned model in our context. This dataset is particularly interesting for its challenging queries, with their distinctly natural, often conversational language (e.g., "Last week I watched the British cold war movie Threads. I am scarred, but intrigued as well. Any similar deeply disturbing yet realistic movies you can recommend?"; see Table 2 for more examples). Another challenge is an explicit recommendation intent present in many of the queries (e.g., "Movies like ..."), making this task a mixture of Search and Recommendation. The latter typically requires specialized recommendation models of entity-to-entity similarity and cannot generally be solved with keyword-based search.
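Under the dual encoder variants, such zero-shot ranking reduces to scoring the query against pre-computed entity embeddings with the same compatibility function used in pre-training. A minimal sketch, with names and the top-k interface as our own illustration:

```python
import torch
import torch.nn.functional as F

def rank_movies(query_emb: torch.Tensor,    # [D]: f_CLS of the query text
                entity_embs: torch.Tensor,  # [N, D]: g(e) for all movies
                k: int = 5) -> torch.Tensor:
    # Cosine-similarity compatibility between the query and every movie,
    # as in pre-training; return the indices of the k best matches.
    scores = F.normalize(entity_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return torch.topk(scores, k).indices
```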
Query: "Movies like [Whiplash] about an artist or a musician chasing an almost impossible dream and nearly or does ruin his life because of it"
Top 5 results: Inside Llewyn Davis, Whiplash, A Young Man with a Horn, Hustle & Flow, Born to Be Blue

Query: "Really dark, slow paced movies with minimal story, but incredible atmosphere, kinda like [Drive] or [The Rover]"
Top 5 results: The Rover, Valhalla Rising, Only God Forgives, Blade Runner, Sicario

Query: "Films like [Mission Impossible] or [The Italian Job] that have big scenes where the characters must break in or infiltrate some place"
Top results: National Treasure: Book of Secrets, Mission: Impossible – Rogue Nation, Ant-Man, The Italian Job

Table 2: Qualitative examples illustrating zero-shot movie ranking by DOCENT-FULL, with natural language queries crawled from Reddit. The bracketed, greyed-out movie mentions are users' examples of desired recommendations, removed from the queries to probe the model in what resembles a movie guessing game. Those obfuscated entities were correctly guessed by the model based on the remaining query terms, making it into the Top 5 in most cases. Other top matches appear to be equally relevant.
4 Datasets

4.1 Amazon Movie Reviews

All our models are pretrained on Amazon Product Reviews (He and McAuley, 2016) in the "Movies and TV" category, comprising 4,607,047 reviews for 208,321 movies collected during 1996–2014. We use the 2016 version of the dataset from http://jmcauley.ucsd.edu/data/amazon.

4.2 Reviews2Movielens

One of this paper's contributions is
Reviews2Movielens—a new multi-document, multi-label dataset created by joining Amazon Movie Reviews (He and McAuley, 2016; Ni et al., 2019) and MovieLens (Harper and Konstan, 2016), a rich source of crowdsourced movie tags. The key challenge in joining the two datasets is establishing correspondences between their respective movie IDs, which turns out to be a many-to-one mapping: each Amazon ID (ASIN) matches a canonical product URL but corresponds to a specific product edition (typically a DVD) rather than a unique title, causing duplication issues, and some ASINs are collections of several titles. We have identified a subset of high-precision many-to-one correspondences by applying Named Entity Recognition techniques (via the public Google Cloud Natural Language API, https://cloud.google.com/natural-language/docs/basics) to both Amazon product titles (incl. release years) and their product pages. The resulting mapping consists of 71,077 unique Amazon IDs and 28,918 unique MovieLens IDs. The mapping accuracy was manually verified to be 97% based on 200 random samples. Ultimately, the joined dataset contains nearly 2 million reviews and close to 1B words, significantly more than its IMDB counterpart (Maas et al., 2011). Since both datasets are widely used as a source of data and academic benchmarks (Miller et al., 2003; Jung, 2012; Anand and Naorem, 2016; He and McAuley, 2016; Ni et al., 2019), we hope that this new mapping will be useful to the community.

4.3 Reddit Movie Suggestions

This user-generated dataset contains a collection of 4765 movie-seeking queries and corresponding recommendations, collectively curated and voted on by the Reddit Movie Suggestions community (see http://goo.gle/research-docent). Worth noting are (a) the conversational, human-to-human language of the queries, and (b) the community-recommended movies, which, while sparse and possibly biased, can be used as a source of ground truth. While modest in size, the dataset is well suited to evaluate zero-shot performance on the movie ranking task defined in Section 3.3.

5 Experiments

5.1 Experimental Setup

All our experiments start with pre-training models on the Amazon Movie Reviews corpus, followed by optional task-dependent fine-tuning. First, we apply some simple filtering to the input, removing reviews shorter than 5 words and movies with fewer than 5 reviews (this low-count filtering is applied after de-duplication and aggregation). This results in 81,057 Amazon movies, of which 17,131 have MovieLens correspondences, and 4,181,727 reviews in total. Further, we split reviews into individual sentences (or short paragraphs) so as to circumvent the BERT sequence length limit. Finally, since our goal is to learn non-obvious entity attributes, we remove movie names from their reviews.

All our models use the standard BERT-base configuration with 12 layers, 12 attention heads and a hidden size of 768, and are initialized with a publicly available BERT-base checkpoint (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip).
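For concreteness, a minimal sketch of the review filtering described above, assuming reviews arrive as (movie_id, text) pairs; the data structures are our own illustration.

```python
from collections import defaultdict

def filter_corpus(reviews):
    """reviews: iterable of (movie_id, review_text) pairs, assumed already
    de-duplicated and aggregated (the paper filters after that step)."""
    by_movie = defaultdict(list)
    for movie_id, text in reviews:
        if len(text.split()) >= 5:           # drop reviews shorter than 5 words
            by_movie[movie_id].append(text)
    # Drop movies left with fewer than 5 reviews.
    return {m: r for m, r in by_movie.items() if len(r) >= 5}
```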
5.2 Fine-tuning for Tag Prediction

We will now describe the fine-tuning strategies used to transfer pre-trained DOCENT models to downstream tag prediction tasks.

DOCENT-FULL. To generate movie-tag relevance scores, we need to predict P(Tag | Movie), which we cast as binary classification. Recall that BERT has a built-in binary classifier (for next-sentence prediction), implemented as a single-layer feed-forward neural network (FFN) on top of its [CLS] output, with logistic loss. We simply repurpose that layer for our task.
DOCENT-DUAL and DOCENT-HYBRID. Recall that, during pre-training, DOCENT-DUAL and DOCENT-HYBRID use a softmax cross-entropy loss to predict P(Entity | Sentence). However, tag prediction poses the inverse problem: predicting tags based on a movie entity. In our dual encoder framework, that can be done simply by computing the softmax over all of the encoded tags rather than entities, without any changes to the architecture.
Shared Strategies. For fine-tuning, all of the models share the following choices. First, we treat every existing movie-tag pair in the training set as a positive example, weighted proportionally to the number of user votes for that pair (or to the logarithm thereof). Next, for a given movie, a fraction of all vocabulary tags is sampled as negative examples, excluding the known true positives for that movie. To prevent overfitting, we fix entity embedding weights for all models during fine-tuning.
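A minimal sketch of this shared example construction; the log-weighting variant is shown, and the negative-sampling fraction is a stand-in parameter, since the exact rate is not specified here.

```python
import math
import random

def tag_examples(movie_id, votes, tag_vocab, neg_fraction=0.1):
    """votes: dict tag -> v(movie, tag) for this movie's voted tags.
    Returns (movie, tag, label, weight) tuples; neg_fraction is illustrative."""
    # Positive examples, weighted by the (log-) vote count.
    examples = [(movie_id, t, 1, 1.0 + math.log(v)) for t, v in votes.items()]
    # Sample negatives from the rest of the vocabulary, excluding positives.
    candidates = [t for t in tag_vocab if t not in votes]
    for t in random.sample(candidates, int(neg_fraction * len(candidates))):
        examples.append((movie_id, t, 0, 1.0))
    return examples
```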
5.3 Baselines

To corroborate the utility of explicit entity representations, we set out to evaluate a few baselines that circumvent them by representing each entity as a Bag-of-Sentences (BoS), computed over its related reviews with a sentence encoder of choice. Such a BoS encoder can replace entity embeddings in our architecture, yielding a naïve variant of DOCENT-DUAL. We call these baselines BoS-GloVe, BoS-BERT and BoS-SentenceBERT, reflecting their underlying sentence encoders. SentenceBERT (Reimers and Gurevych, 2019) fine-tunes BERT on NLI to provide off-the-shelf semantic sentence representations.
Task           Movies   Tags   M-T Pairs
Closed (test)    1000   1128       46359
Closed (dev)      380   1128       17943
Open (test)      6392    500      141618
Open (dev)       3362    100       25274

Table 3: Evaluation dataset sizes for the tag prediction tasks. Closed / Open stand for the closed- and open-vocabulary tasks, respectively; M-T Pairs shows the number of corresponding movie-tag pairs. The top two rows describe the movie holdout sets used in our closed-vocabulary experiments; the bottom two rows show the tag holdouts used in our open-vocabulary experiments.
5.4 Results

5.4.1 Tag Prediction

The main challenge with evaluating tag prediction is the sparse and noisy nature of user-generated ground truth. For instance, a certain movie tag having zero votes may still be relevant in reality. On the other hand, some entities may have votes for contradictory tags (e.g., both "funny" and "not funny"). The original Tag Genome baseline (Vig et al., 2012b) mitigated this by collecting an additional dataset of unbiased movie-tag relevance scores. Alas, that data has not been released. Instead, we propose two complementary metrics that cast tag prediction either as binary classification or as a ranking problem.

For classification, we binarize labels as follows. Let $v(m, t)$ be the number of users who assigned a tag $t$ to a movie $m$. Then its binary counterpart $l(m, t)$ is set to 1 iff $v(m, t) > T$, a threshold; we use $T = 2$ to filter out noisy tags. For the tag ranking formulation, we make the assumption that true movie-tag relevance is correlated with the number of movie-tag votes, and define our movie-tag relevance score as $r(m, t) = v(m, t)$. Equipped with this score, we use Precision@k and NDCG metrics (Järvelin and Kekäläinen, 2002) to measure performance.
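A minimal sketch of these two evaluation views; the function and variable names are our own illustration.

```python
def binarize(votes: int, threshold: int = 2) -> int:
    # l(m, t) = 1 iff v(m, t) > T; the paper uses T = 2.
    return int(votes > threshold)

def precision_at_k(ranked_tags, relevant_tags, k: int) -> float:
    # Fraction of the top-k predicted tags that are relevant, where
    # relevance r(m, t) is derived from the vote counts v(m, t).
    return sum(t in relevant_tags for t in ranked_tags[:k]) / k
```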
Tag prediction baselines include:
- MovielensTopTags: a fixed ordering of tags.
- TF-IDF: scores for movie-tag pairs, based on tag frequencies in movie reviews.
- BoS-BERT: as defined in Sec. 5.3, fine-tuned to estimate sentence-to-tag relevance directly (we found it best to encode a review sentence using BERT's [CLS] output, while tags are encoded by averaging the individual tokens' output vectors). This setup is applicable to both open- and closed-vocabulary scenarios. During inference, a movie-tag prediction is obtained by averaging over sentence-wise predictions for the movie's reviews.
- TagGenome: the original baseline from the MovieLens team (Vig et al., 2012b). The comparison is not entirely apt, as that model was trained on additional movie-tag relevance data and user ratings, albeit with a smaller corpus of unsupervised reviews. Also, TagGenome was trained on all of MovieLens (no holdouts).
- Humans: to simulate human performance, we apply cross-validation to ground-truth user votes, treating one of the folds as a quasi-model.

All models were evaluated on the same holdout sets, with averaging.

Model               MAP    AUC
MovielensTopTags    6.2    0.80
TF-IDF              32.3   0.86
BoS-BERT            39.3   0.91
TagGenome           –      –
DOCENT-FULL         –      –
DOCENT-DUAL         –      –
DOCENT-HYBRID       –      –
Human               76.6   0.99
Table 4: Mean Average Precision and ROC-AUC results on the closed-vocabulary tag prediction task. TagGenome is the original baseline from the MovieLens creators (Vig et al., 2012b), trained on multiple additional features and considered SOTA. Despite using fewer features, DOCENT matches TagGenome performance on AUC and outperforms it on precision (MAP).
Closed Vocabulary Tag Prediction. In this scenario, evaluation is done on a holdout set of movies (with a smaller development set used for hyperparameter tuning; see Table 3 for details). Results for the ranking (MAP) and binary classification (AUC) metrics are shown in Table 4. Collectively, DOCENT models outperform the strong TagGenome baseline on tag ranking (see also Fig. 2 (a) and (b)) and match (or slightly outperform) it in binary classification. This is a strong result, considering that DOCENT had no access to the additional features used by TagGenome and employed no feature engineering. Of the three models, DOCENT-DUAL scores the lowest on all metrics, likely due to not optimizing for P(Text | Entity) in pre-training. Finally, note that all models still score way below humans on the (harder) tag ranking task, indicating considerable headroom.
Model               MRR    Recall@50, %   Recall@100, %
Lucene (TF-IDF)     0.14   15.3           20.7
BoS-GloVe           –      –              –
BoS-BERT∗           –      –              –
BoS-SentenceBERT    –      –              –
DOCENT-FULL         –      –              –
DOCENT-DUAL         –      –              –
DOCENT-HYBRID       –      –              –
Table 5: Zero-shot results for DOCENT models vs. several baselines on Reddit Movie Suggestions. MRR stands for Mean Reciprocal Rank.
Open Vocabulary Tag Prediction. This task is evaluated by withholding parts of the tag vocabulary so that those tags are never seen in training (consult Table 3 for details). Fig. 2 (c) shows our models' performance on the binary classification task as a function of the fraction of the vocabulary seen by a model in fine-tuning. The graph shows that training with only 100 of the 1124 tags results in reasonable performance. Of our three models, DOCENT-FULL starts below the others but adapts the fastest, reaching near-closed-vocabulary performance with less than 50% of the full tag vocabulary.
5.4.2 Zero-Shot Movie Ranking

Since this is a search task, we compare our models to an Apache Lucene baseline (https://lucene.apache.org/), arguably the world's most widely used open-source search engine. For completeness, we also compare to BoS-BERT∗, BoS-GloVe and BoS-SentenceBERT, the neural baselines defined in Sec. 5.3, whose query-movie relevance score is given by the maximum cosine similarity among the movie's review sentences. (In the absence of a fine-tuned [CLS] output, the BoS-BERT∗ variant encodes sentences by averaging their individual tokens' output vectors; in this case, we found that aggregating sentence-wise predictions with the L∞ norm is superior to averaging.)

Table 5 shows the Mean Reciprocal Rank (MRR) as well as recall, metrics that suit the noisy ground truth (for completeness, see also the qualitative results in Table 2). DOCENT models outperform the Lucene baseline on all metrics, with DOCENT-HYBRID leading by a large margin. Compared to DOCENT-DUAL, its strong performance is not surprising, since DOCENT-HYBRID optimizes both P(Entity | Text) and P(Text | Entity)—a combination of tasks that helps avoid overfitting. Also expected is the relatively weak performance of DOCENT-FULL. As discussed in Sec. 2, its high-capacity entity representations are prone to overfitting when the number of entities is relatively small. Still, this shortcoming can be remedied by fine-tuning, as evidenced by this model's superior results on tag prediction in Sec. 5.4.1. These results suggest that DOCENT-FULL may be a good choice in semi-supervised scenarios.
Figure 2: Performance on tag prediction tasks. Left and center: Precision and NDCG@k, with a closed vocabulary. DOCENT-FULL dominates the strong TagGenome baseline for smaller values of k, a concentration of gains typical for binary classification models. For perspective, human Precision@k ranges 80-95% for this task. Right: AUC for open-vocabulary experiments, with models trained using a variable fraction of the tag vocabulary. DOCENT approaches closed-vocabulary AUC after training with only 10-50% of the vocabulary (showing all baselines that were available to us in this setting).
6 Related Work

Much of the prior art in text-based entity understanding is motivated by the Entity Linking (EL) problem: predict a unique entity from its mention in text, assuming a single right answer. By contrast, tasks like entity retrieval and tag prediction imply multiple valid matches and emphasize understanding entities through the prism of their attributes, expressed in natural language. Still, recent EL works propose dual encoder approaches similar to ours (Yamada et al., 2017; Ling et al., 2020; Cheng and Roth, 2013; Sun et al., 2015; Yamada et al., 2016; Chang et al., 2020; Kobayashi et al., 2016; He et al., 2013; Gupta et al., 2017), with Ling et al. (2020) already discussed in Section 2.1. Dual encoders have also been explored in zero-shot scenarios (Gillick et al., 2019; Logeswaran et al., 2019; Wu et al., 2019; Gupta et al., 2017), with entity embeddings computed dynamically based on metadata such as dictionary definitions, entity name and/or category. Others incorporate entity representations directly in the transformer by retrieving from an external memory (Févry et al., 2020; Peters et al., 2019). While clearly useful for EL, e.g., in sentences with multiple entity mentions, the benefits to our applications are unclear. Finally, there is ERNIE (Sun et al., 2019), a language model trained with awareness of entity mentions. Alas, the lack of an explicit entity representation limits its use in our tasks.
7 Conclusion

This paper proposes a family of models to learn self-supervised entity representations from large document collections. We motivate these dedicated representations by contrasting them with naive text-as-a-proxy approaches, with clear gains on entity-centric tasks such as natural language search and movie tag prediction. We then show that achieving superior performance requires optimizing both P(Entity | Text) and P(Text | Entity)—in contrast to the baseline RELIC model (and similar prior dual encoders) having only a single objective. To that end, we propose two novel models and study them in zero-shot, few-shot and supervised settings. We match or outperform competitive baselines, where available, with little or no fine-tuning.

Future Work. As shown qualitatively in Sec. 3.3, DOCENT has the potential to be a hybrid approach bridging entity retrieval and recommendation, an application worth exploring in depth (e.g., on the MovieLens Recommendation task, which can be readily integrated with DOCENT thanks to Reviews2Movielens). A larger entity retrieval study with heterogeneous entity types is another useful direction. Lastly, extending DOCENT to additional entity understanding tasks such as QA and summarization is yet another promising avenue.

Acknowledgements
We appreciate the feedback from the reviewers. This work is partially supported by NSF Awards IIS-1513966/1632803/1833137, CCF-1139148, and DARPA Awards.
References
Eugene Agichtein, Eric Brill, and Susan Dumais. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–26. ACM.
Deepa Anand and Deepan Naorem. 2016. Semi-supervised aspect based sentiment analysis for movies using review filtering. Procedia Computer Science, 84:86–93.
Jacqueline Bourdeau, Jim Hendler, Roger Nkambou, Ian Horrocks, and Ben Y. Zhao, editors. 2016. Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016. ACM.
Jill Burstein, Christy Doran, and Thamar Solorio, editors. 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval.
Mingda Chen, Zewei Chu, Yang Chen, Karl Stratos, and Kevin Gimpel. 2019. EntEval: A holistic evaluation benchmark for entity representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 421–433. Association for Computational Linguistics.
Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA. Association for Computational Linguistics.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In (Burstein et al., 2019), pages 2924–2936.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In (Burstein et al., 2019), pages 4171–4186.
Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. 2020. Entities as experts: Sparse memory access with entity supervision.
Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego García-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pages 528–537. Association for Computational Linguistics.
Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2681–2690. Association for Computational Linguistics.
Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors. 2017. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA.
F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens datasets: History and context. TiiS, 5(4):19:1–19:19.
Ruining He and Julian J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In (Bourdeau et al., 2016), pages 507–517.
Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, pages 30–34. The Association for Computer Linguistics.
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446.
Jason J. Jung. 2012. Attribute selection-based recommendation framework for short-head user group: An empirical study by MovieLens and IMDB. Expert Systems with Applications, 39(4):4049–4054.
Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2016. Dynamic entity representation with max-pooling improves machine reading. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 850–855. The Association for Computational Linguistics.
Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37.
Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors. 2011. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA. The Association for Computer Linguistics.
Jeffrey Ling, Nicholas FitzGerald, Zifei Shan, Livio Baldini Soares, Thibault Févry, David Weiss, and Tom Kwiatkowski. 2020. Learning cross-context entity representations from text. CoRR, abs/2001.03765.
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 3449–3460.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In (Lin et al., 2011), pages 142–150.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In (Guyon et al., 2017), pages 6294–6305.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Bradley N. Miller, Istvan Albert, Shyong K. Lam, Joseph A. Konstan, and John Riedl. 2003. MovieLens unplugged: Experiences with an occasionally connected recommender system. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 263–266. ACM.
Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In (Walker et al., 2018), pages 2227–2237.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3980–3990. Association for Computational Linguistics.
Walid Shalaby, Wlodek Zadrozny, and Hongxia Jin. 2019. Beyond word embeddings: Learning entity and concept representations from large scale knowledge bases. Inf. Retr. J., 22(6):525–542.
Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling mention, context and entity with neural networks for entity disambiguation. In (Yang and Wooldridge, 2015), pages 1333–1339.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. CoRR, abs/1904.09223.
Jesse Vig, Shilad Sen, and John Riedl. 2012a. The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):13.
Jesse Vig, Shilad Sen, and John Riedl. 2012b. The tag genome: Encoding community knowledge to support novel interaction. TiiS, 2(3):13:1–13:44.
Marilyn A. Walker, Heng Ji, and Amanda Stent, editors. 2018. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). Association for Computational Linguistics.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Zero-shot entity linking with dense entity retrieval.
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. Trans. Assoc. Comput. Linguistics, 5:397–411.
Qiang Yang and Michael J. Wooldridge, editors. 2015. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. AAAI Press.