A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution
Hongming Zhang, Xinran Zhao, Yangqiu Song
Department of Computer Science, HKUST
[email protected], [email protected], [email protected]
September 29, 2020

Abstract
Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to. Compared with the general coreference resolution task, the main challenge of PCR is coreference relation prediction rather than mention detection. As an important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks and still challenging for existing models, which motivates us to survey existing approaches and think about how to do better. In this survey, we first introduce representative datasets and models for the ordinary pronoun coreference resolution task. Then we focus on recent progress on hard pronoun coreference resolution problems (e.g., the Winograd Schema Challenge) to analyze how well current models understand commonsense. We conduct extensive experiments to show that even though current models achieve good performance on the standard evaluation set, they are still not ready to be used in real applications (e.g., all SOTA models struggle to correctly resolve pronouns to infrequent objects). All experiment code is available at: https://github.com/HKUST-KnowComp/PCR.
1 Introduction

The question of how human beings resolve pronouns has long been of interest to both the linguistics and natural language processing (NLP) communities, because a pronoun itself carries only weak semantic meaning, which brings challenges to natural language understanding. To explore solutions to that question, pronoun coreference resolution (PCR) [2] was proposed. As a challenging yet vital natural language understanding task, pronoun coreference resolution is to find the correct reference for a given pronominal anaphor in context, and it has been shown to be crucial for a series of downstream tasks, such as machine translation [6], summarization [7], and dialog systems [8].

Footnote: Some pronouns may refer to non-nominal antecedents. For example, the pronoun "it" in "It is too cold in the Winter here" does not refer to any real object [1]. In this survey, we only focus on pronouns that refer to nominal antecedents.

Footnote: Previous studies [3, 4] mainly focus on three kinds of pronouns: third personal pronouns (e.g., she, her, he, him, them, they, it), possessive pronouns (e.g., his, hers, its, their, theirs), and demonstrative pronouns (e.g., this, that, these, those). First and second personal pronouns are typically not considered, as they often refer to the current speakers, who are normally outside the conversation or document. Besides that, conventional PCR works [3, 4, 5] mostly focus on identifying coreference relations between pronouns and noun phrases rather than coreference relations between pronouns.

To investigate the difference between PCR and the general coreference resolution task, which tries to identify not only the coreference relations between noun phrases (NP) and pronouns (P) but also potential coreference relations between noun phrases or between pronouns, we conduct experiments with one recent breakthrough model (i.e., the end-to-end model [9]) on the CoNLL-2012 shared task [10] under two settings: one without gold mentions and one with gold mentions. In the 'without gold mentions' setting, models are required to first identify spans from the documents as mentions and then predict the coreference relations among these mentions. As a comparison, if gold mentions are provided, models only need to predict the coreference relations. From the results in Table 1 we can see that, without gold mentions, the model performs well on P-P coreference relations but not as well on the other two kinds of relations. However, if gold mentions are provided, the model can achieve very good performance on the NP-NP coreference relations. Compared with the other kinds of coreference relations, no matter whether gold mentions are provided or not, resolving pronouns to noun phrases is always the most challenging.

[Table 1 here: performance of the end-to-end model on NP-NP, NP-P, and P-P coreference relation types, with and without gold mentions.]

The correct resolution of pronouns typically requires reasoning over both linguistic knowledge (e.g., 'they' typically can only refer to plural objects) and commonsense knowledge (e.g., in the sentence "The fish ate the worm, it was hungry", 'it' refers to 'fish' because hungry things tend to eat rather than be eaten). Considering that the ordinary PCR task evaluates inference over both types of knowledge at the same time, performance on ordinary PCR tasks cannot clearly reflect models' performance regarding different knowledge types. To address this problem, the Winograd Schema Challenge (WSC) [12] task was proposed. The influence of all commonly used linguistic knowledge is avoided during the creation of WSC, such that WSC can be used to reflect how well current PCR models understand commonsense knowledge. In Sections 2 and 3, we introduce the progress and remaining challenges on the ordinary PCR and WSC tasks respectively. After that, we introduce other PCR tasks that were developed for different research purposes in Section 4. In the end, we conclude this survey in Section 5. The contribution of this survey is three-fold: (1) we broadly introduce available PCR tasks, datasets, and models; (2) we summarize the main contributions of recent models; (3) we conduct experiments to analyze the limitations of current models, which can help the community think about how to better solve PCR in the future.

2 Ordinary Pronoun Coreference Resolution

Ordinary pronoun coreference resolution tasks are often defined over formal textual corpora (e.g., newspapers), and the annotation is usually conducted by domain experts or linguists. The PCR task can be formally defined as follows. Given a text D, which contains a pronoun p, the goal is to identify all the mentions that p refers to.
We denote the correct mentions p refers to as c ∈ C, where C is the correct mention set. Similarly, each candidate span is denoted as s ∈ S, where S is the set of all candidate spans. Note that in the case where no gold mentions are provided, all possible spans in D are used to form S. The task is thus to identify C out of S. In the rest of this section, we introduce the widely used datasets as well as the progress and limitations of current approaches.

Throughout the years, researchers in the NLP community have devoted great effort to developing high-quality coreference resolution datasets, and we introduce representative ones as follows:

1. MUC: MUC-6 [13] and MUC-7 [14], which were developed for the 6th and 7th Message Understanding Conferences respectively, are the earliest coreference resolution datasets. They focus on English news articles and are relatively small compared with modern datasets.

2. ACE: The ACE dataset [15] was proposed as part of the Automatic Content Extraction program. Compared with the MUC datasets, ACE extends the corpus domain from news to other domains such as telephonic speech and broadcast conversations.

3. CoNLL shared tasks: The CoNLL-2011 [16] and CoNLL-2012 [10] shared tasks were proposed to evaluate models' abilities to resolve unrestricted coreference. Among the two, CoNLL-2011 only contains annotations for English, while CoNLL-2012 extends to multiple languages (e.g., Chinese and Arabic). Compared with MUC and ACE, the CoNLL shared tasks have a much larger scale. Moreover, as the CoNLL-2012 shared task provides a clear training/dev/test split as well as an official evaluation tool, it is the most widely used evaluation benchmark for the coreference resolution task.

Footnote: The only exception is organisational named entities. For example, "they" can refer to "the company" [11].

Footnote: Some datasets (e.g., the CoNLL-2012 shared task) were originally designed for the general coreference resolution task. Nonetheless, we can easily convert them into a PCR task.
[Table 2 here: precision (P), recall (R), and F1 of the deterministic model [19], the entity-centric model [25], the reinforcement-learning model [26], the end-to-end model [9], the knowledge-aware model [4], and SpanBERT [28], broken down by pronoun type: third personal (18,147), possessive (6,843), demonstrative (546), and overall (25,536).]

Table 2: Performances of different models on the CoNLL-2012 shared task. Precision (P), recall (R), and the F1 score are reported. Numbers of the different types of pronouns in the test set are shown in brackets. Best models are indicated in bold font.

4. WikiCoref: Recently, a new coreference dataset, WikiCoref [17], was proposed as a supplement to the CoNLL shared tasks. Different from CoNLL, where most of the corpus comes from newswire, WikiCoref directly annotates Wikipedia pages, which provides a new way to evaluate models' performance in the out-of-domain setting.
5. Crowd-sourced Coref: [18] leveraged a crowd-sourced game to collect 2.2 million annotations about 108,000 coreference relations, which makes it one of the largest coreference datasets. Moreover, their annotations also include ambiguous coreference relations.
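In the setting without gold mentions introduced above, the candidate set S consists of all possible spans of the text D. A minimal sketch of this enumeration follows; the maximum span width and whitespace tokenization are simplifying assumptions, not the actual preprocessing of any model in this survey:

```python
def enumerate_spans(tokens, max_width=3):
    """Enumerate all candidate spans (start, end) over a token list,
    up to max_width tokens, forming the candidate set S."""
    spans = []
    for start in range(len(tokens)):
        for width in range(1, max_width + 1):
            end = start + width
            if end > len(tokens):
                break
            spans.append((start, end))
    return spans

tokens = "The fish ate the worm".split()
S = enumerate_spans(tokens, max_width=2)
```

In practice the number of spans grows quickly with document length, which is why models prune the span set aggressively before scoring pairs.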
In this subsection, we introduce representative models for the ordinary PCR task. We first briefly introduce conventional approaches that rely on human-designed rules or features, and then introduce the end-to-end model, which is a ground-breaking model for solving coreference resolution tasks. After that, we briefly introduce a few recent improvements over the end-to-end model.
Before the deep learning era, human-designed rules [2, 19], knowledge [20, 21], or features [3, 22] dominated the general coreference resolution and PCR tasks. Some rules and features are crucial for correctly resolving pronouns [23]. For example, 'he' can only refer to males and 'she' can only refer to females; 'it' can only refer to singular objects and 'them' can only refer to plural objects. The performance of these methods heavily relies on the coverage and quality of the manually defined rules and features. Based on these designed features [24], more advanced machine learning models were applied to the coreference resolution task. For example, instead of identifying coreference relations pairwise, [25] proposes an entity-centric coreference system that can learn an effective policy for building coreference chains incrementally. Besides that, a model was proposed to predict coreference relations with a deep reinforcement learning framework [26]. Moreover, heuristic rules based on linguistic knowledge can also be incorporated as constraints for machine learning models [27].
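The agreement rules above can be sketched as a simple candidate filter; the tiny constraint lexicon here is purely illustrative and far smaller than the feature sets used by the cited systems:

```python
# Illustrative agreement constraints: each pronoun only allows
# candidates whose gender/number attributes are compatible with it.
PRONOUN_CONSTRAINTS = {
    "he":   {"gender": "male",    "number": "singular"},
    "she":  {"gender": "female",  "number": "singular"},
    "it":   {"gender": "neutral", "number": "singular"},
    "them": {"number": "plural"},
}

def filter_candidates(pronoun, candidates):
    """Keep only candidates compatible with the pronoun's constraints.
    Each candidate is a (mention string, attribute dict) pair."""
    constraints = PRONOUN_CONSTRAINTS.get(pronoun.lower(), {})
    return [m for m, attrs in candidates
            if all(attrs.get(k) == v for k, v in constraints.items())]

candidates = [("Mary", {"gender": "female", "number": "singular"}),
              ("the books", {"gender": "neutral", "number": "plural"})]
```

As the text notes, the weakness of such rules is coverage: any pronoun or candidate missing from the hand-built lexicon simply passes through unfiltered.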
Leveraging human-designed rules or features can help accurately resolve some pronouns, but it is hard to manually design rules that cover all cases. To solve this problem, an end-to-end deep model [9] was proposed. Different from other machine learning based methods, it does not use any human-defined rules, yet achieves surprisingly good performance. Specifically, the end-to-end model first leverages a combination of bi-directional LSTM and inner-attention modules to encode local context and generate representations for all potential mentions. After that, a standard feed-forward neural network is used to predict the coreference relations. Experimental results show that the proposed model is simple yet effective. Its success proves that current deep models are capable of capturing rich contextual information, which is crucial for resolving coreference relations.
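At a high level, the end-to-end model scores each (mention, antecedent) pair with a feed-forward network over their representations and picks the best-scoring antecedent. The toy sketch below replaces the learned BiLSTM/attention encoder with hand-set 2-dimensional vectors and the feed-forward scorer with a single linear layer, purely to make the pair-scoring step concrete:

```python
def score_pair(span_repr, antecedent_repr, weights):
    """Toy pairwise coreference scorer: a linear layer over the feature
    vector [span; antecedent; elementwise product], standing in for the
    feed-forward scorer of the end-to-end model."""
    features = (span_repr + antecedent_repr +
                [a * b for a, b in zip(span_repr, antecedent_repr)])
    return sum(w * f for w, f in zip(weights, features))

def best_antecedent(span_repr, antecedents, weights):
    """Pick the highest-scoring antecedent for a span."""
    scores = {name: score_pair(span_repr, repr_, weights)
              for name, repr_ in antecedents.items()}
    return max(scores, key=scores.get)

# Toy 2-d "representations" and hand-set weights (purely illustrative;
# in the real model both are learned from data).
weights = [0.1, 0.2, 0.3, 0.1, 1.0, 1.0]
pronoun = [1.0, 0.5]
antecedents = {"fish": [1.0, 0.6], "worm": [-1.0, 0.2]}
```

The elementwise-product features reward span pairs whose representations point in similar directions, which is one simple way contextual similarity can surface in a pair scorer.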
Recently, on top of the end-to-end model, a few follow-up works were proposed to address different limitations of the original end-to-end model:

Footnote: These models once achieved better performance either on the general coreference resolution task or on the PCR task.
Model        Training data    Test: CoNLL    Test: i2b2
End-to-end   CoNLL            72.1           75.2
             i2b2             20.0           92.3
+ KG         CoNLL            75.9           80.9
             i2b2             42.7           95.2
+ SpanBERT   CoNLL            79.6           40.8
             i2b2             28.5           80.5

Table 3: Models' performance (in F1 score) in the cross-domain setting with different training/test data.

1. Higher-order Information: One limitation of the original end-to-end model is that all predictions are made over pairs, which is not sufficient for capturing higher-order coreference relations. To fix this issue, a differentiable approximation module was proposed in [29] to provide higher-order coreference resolution inference ability (i.e., leveraging the coreference cluster to better predict the coreference relations). Moreover, this work first incorporates ELMo [30] as part of the word representation, which proves very effective.
2. Structured Knowledge: Another limitation of the end-to-end model is that its success heavily relies on the quality and coverage of the training data. However, in real applications, it is labor-intensive and almost impossible to annotate a large-scale dataset that contains all scenarios. To solve this problem, two works [5, 4] were proposed to inject external structured knowledge into the end-to-end model. Among the two, [5] requires converting external knowledge into features, while [4] directly uses external knowledge in the format of triples.
3. Stronger Language Representation Models: Recently, along with the fast development of language representation models, a few works [31, 28] have tried to replace the encoding layer of the original end-to-end model with more powerful language representation models. Taking SpanBERT [28] as an example, replacing ELMo with SpanBERT boosts performance by 6.6 F1 on the general coreference resolution task.
We follow the experimental setting of [4] and test the performance of representative models [19, 25, 26, 9, 4, 28] on the CoNLL-2012 dataset [10]. From the results in Table 2, we can observe that, with the help of the end-to-end model and its further modifications, the community has made great progress on the standard evaluation set. For example, the end-to-end model achieves an F1 score over 70, and adding external knowledge (either in a structured way or through representations) further boosts performance. Among all pronoun types, all models perform better on third personal and possessive pronouns and relatively poorly on demonstrative ones. This is mainly because of the imbalanced distribution of the dataset (i.e., third personal and possessive pronouns appear much more often than demonstrative ones).

To investigate whether current PCR models are good enough to be used in real applications, which could be out of the training domain, we conduct experiments in the cross-domain setting. In detail, we select two PCR datasets from different domains (i.e., CoNLL [10] from news and i2b2 [32] from the medical domain) and train the model on one dataset while testing it on the other. We conduct experiments with the three best-performing models and show the results in Table 3, from which we can see that all models perform significantly worse when used across domains. Compared with the baseline method, adding explicit knowledge helps achieve slightly better performance in the cross-domain setting because its training objective allows models to learn to selectively use suitable knowledge rather than just fitting the training data.

To further analyze the performance of existing models, we split the pronouns based on the frequency of the objects they refer to. If an object appears more than ten times in the whole dataset, we denote it as a frequent object; otherwise, we denote it as an infrequent object.

Footnote: We use the released code of the different models along with their default hyper-parameters.
Footnote: For the end-to-end model, we also include ELMo [30] as part of the representation, and it thus achieves better performance than the original one in Table 1.

Footnote: SpanBERT performs poorly on i2b2 because the medical corpus is very different from SpanBERT's pre-training corpus, and we use the default hyper-parameters, which might not be the best ones.
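The frequency split described above can be reproduced with a simple counter. The threshold of ten follows the text, while the toy object list is illustrative only:

```python
from collections import Counter

def split_by_frequency(objects, threshold=10):
    """Label each referred-to object as frequent (more than `threshold`
    occurrences in the whole dataset) or infrequent otherwise."""
    counts = Counter(objects)
    frequent = {o for o, c in counts.items() if c > threshold}
    infrequent = set(counts) - frequent
    return frequent, infrequent

# Illustrative toy data; the real split is computed over all of CoNLL.
objects = ["company"] * 12 + ["worm"] * 2
frequent, infrequent = split_by_frequency(objects)
```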
Model        Object Type    P     R     F1
End-to-End   Infrequent     66.5  73.8  70.0
             Frequent       73.0  83.3  77.8
+ KG         Infrequent     77.9  72.5  75.1
             Frequent       78.0  77.7  77.9
+ SpanBERT   Infrequent     71.3  72.4  71.9
             Frequent       83.3  85.3  84.3

Table 4: Influence of the frequency.

[Figure 1 here.] Figure 1: WSC question examples.

As a result, we collect 1,095 frequent and 470,232 infrequent objects, whose average frequencies are 36.2 and 1.46 respectively. We report the performance of the best-performing models on infrequent and frequent objects separately in Table 4. In general, all models perform better on frequent objects because those appear more often in the training data. Another interesting observation is that even though adding an external KG and using a stronger language representation model both boost performance, their improvements come from different types of objects. For example, the main contribution of adding a KG is on infrequent objects: even though they are less frequent in the training data, they can still be covered by external knowledge. As a comparison, using a strong language representation model mainly benefits frequent objects because it has a stronger ability to fit the training data. This observation is consistent with our previous observation that adding an external KG has more effect on relatively rare pronouns (i.e., demonstrative pronouns).
3 Hard Pronoun Coreference Resolution

As aforementioned, the correct resolution of pronouns requires inference over both linguistic knowledge and commonsense knowledge. To clearly reflect how well models can resolve pronouns that require inference over commonsense knowledge, the hard PCR task was proposed. As the Winograd Schema Challenge (WSC) is the most popular hard PCR task, we use the task definition of WSC to define the hard PCR task. Given a sentence s, which contains a pronoun p and two candidates n_1 and n_2, the task is to find out which of the candidates p refers to. Different from the ordinary PCR task, the influence of all commonly observed features (e.g., gender or plurality) is removed via careful expert design. In WSC, all questions are paired up such that the questions in each pair have only minor differences (mostly a one-word difference), but the answers are reversed. One pair of WSC instances is shown in Figure 1. Solving these questions typically requires the support of complex commonsense knowledge. For example, human beings know that the pronoun 'it' in the first sentence refers to 'fish' while the one in the second sentence refers to 'worm', because being 'hungry' is a common property of something eating, while being 'tasty' is a common property of something being eaten. Without the support of such commonsense knowledge, answering these questions becomes challenging because both the fish and the worm can be hungry or tasty by themselves. We introduce the datasets as follows:
1. Winograd Schema Challenge: Among all the hard pronoun coreference resolution tasks, WSC is the most popular one. In total, WSC has 273 questions. Its small size means that it cannot be used to train a good supervised model and can only be used as an evaluation set.

Footnote: The latest version of WSC has 284 questions, but as all the following works are evaluated on the 273-question version, we also use the 273-question version in this survey.
Methods                          Correct  Wrong  NA   A_p    A_o
Unsupervised
  Random Guess                   137      136    0    50.2%  50.2%
  Knowledge Hunting [36]         119      79     75   60.1%  57.3%
  SP (Human) [37]                15       0      258  –      –
  SP-10K [37]                    50       26     197  65.8%  54.4%
  ASER (String Match) [38]       63       27     183  70.0%  56.6%
  LM (Single) [39]               149      124    0    54.5%  54.5%
  LM (Ensemble) [39]             168      105    0    61.5%  61.5%
  GPT-2 [40]                     193      80     0    70.7%  70.7%
Finetuning
  BERT [41] + ASER [38]          177      96     0    64.5%  64.5%
  BERT [41] + DPR [33]           195      78     0    71.4%  71.4%
  BERT [41] + WinoGrande [34]    210      63     0    76.9%  76.9%
  RoBERTa [42] + DPR [33]        227      46     0    83.1%  83.1%
  RoBERTa [42] + WinoGrande [34] 246      27     0    90.1%  90.1%
Human Beings
  Original [12]                  252      21     0    92.1%  92.1%
  Recent [34]                    264      9      0    96.5%  96.5%

Table 5: Performances of different models on the 273-question version of WSC. NA means that the model cannot give a prediction, A_p is the accuracy on predicted examples (excluding NA examples), and A_o is the overall accuracy.

2. Definite Pronoun Resolution: Another hard pronoun coreference resolution dataset is the Definite Pronoun Resolution (DPR) dataset [33]. Different from WSC, DPR leveraged undergraduates rather than experts to create the dataset. In total, DPR collected 1,886 questions, a slightly larger scale than the official WSC. However, as DPR could not guarantee that all its questions follow the strict design guidelines of WSC, questions in DPR are relatively simpler.

3. WinoGrande: One common problem of WSC and DPR is their small scale. To create larger-scale data, WinoGrande [34] was proposed. By leveraging annotators from Amazon Mechanical Turk, WinoGrande collected 53 thousand WSC-like questions. Moreover, to ensure dataset quality, WinoGrande applied a bias-reduction algorithm to filter out examples that may contain annotation bias. Experimental results prove that WinoGrande is much more challenging than the original WSC: the SOTA models on WSC only achieve 51% accuracy on WinoGrande, which is close to random guessing.
4. KnowRef: Similar to WinoGrande, KnowRef [35] also aimed at creating a larger-scale WSC dataset, but with a different approach. Instead of using the crowd-sourcing plus adversarial-filtering framework, KnowRef extracts WSC-like questions from raw sentences. As a result, KnowRef collected eight thousand WSC-like questions.
In this subsection, we introduce existing approaches for the hard PCR task. As the majority of the methods are evaluated on WSC, all the discussion and analysis are based on their performance on WSC.
At first, people tried to leverage different commonsense knowledge resources to solve WSC questions in an explainable way. For example, [43] first leveraged the commonsense triplets in ConceptNet [44] to train word embeddings and then applied the embeddings to solve the WSC task. Knowledge Hunter [36] proposed to leverage search engines (e.g., Google) to acquire the needed commonsense knowledge: it first searches WSC questions in search engines and then uses the returned results to solve the questions. SP-10K [37] conducted experiments to show that selectional preference (SP) knowledge, such as the fact that human beings are more likely to eat 'food' than 'rock', can also be helpful for solving WSC questions. Last but not least, ASER [38] tried to use knowledge about eventualities (e.g., 'being hungry' can cause 'eat food') to solve WSC questions. In general, structured commonsense knowledge can help solve one-third of the WSC questions, but the overall performance is limited due to low coverage. There are two main reasons: (1) the coverage of existing commonsense resources is not large enough; (2) there is no principled way of using structured knowledge for NLP tasks. Current methods [36, 37, 38] mostly rely on string matching, but for many WSC questions it is hard to find supportive knowledge in this way.

Footnote: This dataset is also referred to as WSCR in some works.
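The string-matching use of structured knowledge described above can be sketched as follows. The toy knowledge base stands in for resources like SP-10K or ASER, and the (verb, role) matching scheme is a simplifying assumption of this sketch:

```python
# Toy knowledge: (verb, role) -> properties typical of that role,
# standing in for selectional preference / eventuality knowledge.
KNOWLEDGE = {
    ("eat", "subject"): {"hungry"},  # things that eat tend to be hungry
    ("eat", "object"):  {"tasty"},   # things eaten tend to be tasty
}

def resolve_by_knowledge(clue, candidates):
    """Return candidates whose (verb, role) entry contains the clue via
    exact string match; an empty result illustrates the coverage problem."""
    matched = []
    for name, (verb, role) in candidates.items():
        if clue in KNOWLEDGE.get((verb, role), set()):
            matched.append(name)
    return matched

# "The fish ate the worm, it was hungry/tasty."
candidates = {"fish": ("eat", "subject"), "worm": ("eat", "object")}
```

Note how a clue outside the knowledge base (e.g., "fast") matches nothing at all, which is exactly the low-coverage failure mode discussed above.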
L.R.   High (13,466)   Medium (13,466)   Low (13,466)   Overall (40,398)
1e-6   87.81%          85.63%            84.95%         –

Table 6: Performance of fine-tuning RoBERTa with different subsets of WinoGrande and different learning rates. L.R. means learning rate; High/Medium/Low indicate relevance to the WSC data. WinoGrande instances are grouped into three subsets; numbers of instances are shown in brackets. The best-performing subset for each learning rate is indicated in bold font.
Another approach leverages language models to solve WSC questions [39]: each WSC question is first converted into two sentences by replacing the target pronoun with the two candidates respectively, and then a language model is employed to compute the probability of both sentences. The sentence with the higher probability is selected as the final prediction. As this method does not require any string matching, it can make a prediction for all WSC questions and achieves better overall performance. Recently, the more advanced transformer-based language model GPT-2 [40] achieved better performance due to its stronger language representation ability. The success of language models demonstrates that rich commonsense knowledge can indeed be encoded within language models implicitly. Another interesting finding about these language model based approaches is that they propose two settings to compute the probability: (1) Full: use the probability of the whole sentence as the final prediction; (2) Partial: only consider the probability of the part of the sentence after the target pronoun. Experiments show that the partial model always outperforms the full model. One explanation is that the influence of the imbalanced distribution of the candidate words is relieved by only considering the sentence probability after them. This observation also explains why GPT-2 can outperform unsupervised BERT on WSC: models based on BERT, which rely on predicting the probability of the candidate words, cannot get rid of such noise.
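The full and partial scoring schemes can be made concrete with any model that assigns conditional token probabilities. The toy bigram table below is a stand-in for a real LM such as GPT-2, with probabilities chosen purely to illustrate how partial scoring removes the bias from the candidates' own corpus frequencies:

```python
import math

# Toy bigram "LM": P(next | prev); all values are illustrative only.
# "fish" is made much more frequent than "worm" after "the" to mimic
# the imbalanced distribution of candidate words.
BIGRAM = {("<s>", "the"): 0.3, ("the", "fish"): 0.2, ("the", "worm"): 0.01,
          ("fish", "was"): 0.02, ("worm", "was"): 0.2, ("was", "tasty"): 0.05}

def logprob(tokens, start=0):
    """Sum of log P(token | previous token) from position `start` on."""
    tokens = ["<s>"] + tokens
    return sum(math.log(BIGRAM.get((tokens[i - 1], tokens[i]), 1e-8))
               for i in range(start + 1, len(tokens)))

def solve_wsc(template, pron_idx, candidates, partial=True):
    """Substitute each candidate for the pronoun and return the one whose
    sentence scores higher. With partial=True, only tokens after the
    substituted position are scored (still conditioned on it), removing
    the bias from the candidates' own frequencies."""
    def score(cand):
        tokens = template[:pron_idx] + [cand] + template[pron_idx + 1:]
        return logprob(tokens, start=pron_idx + 1 if partial else 0)
    return max(candidates, key=score)

template = ["the", "[PRON]", "was", "tasty"]
```

Under full scoring the frequent candidate "fish" wins on the strength of P(fish | the) alone, while partial scoring, which only compares the continuation given each candidate, prefers "worm"; this is the mechanism behind the partial model's advantage discussed above.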
Last but not least, we would like to introduce the current best-performing models on the WSC task, which fine-tune pre-trained language representation models (e.g., BERT [41] or RoBERTa [42]) on a similar dataset (e.g., DPR [33] or WinoGrande [34]). This idea was originally proposed by [45], which first converts the original WSC task into a token prediction task and then selects the candidate with the higher probability as the final prediction. In general, the stronger the language model and the larger the fine-tuning dataset, the better the model performs on the WSC task.
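The conversion of a WSC instance into a token prediction problem can be sketched independently of the underlying model. `dummy_fill_score` below stands in for a masked LM such as BERT, and its fixed scores are arbitrary, chosen only to make the mechanics concrete:

```python
def wsc_to_token_prediction(sentence, pronoun, candidates, fill_score):
    """Convert a WSC instance into a token prediction problem: mask the
    target pronoun and return the candidate the scorer prefers as fill."""
    masked = sentence.replace(" " + pronoun + " ", " [MASK] ", 1)
    return max(candidates, key=lambda c: fill_score(masked, c))

# Stand-in scorer: a real system would query a masked LM for the
# probability of each candidate filling [MASK] (illustrative only).
def dummy_fill_score(masked_sentence, candidate):
    return {"fish": 0.8, "worm": 0.2}[candidate]

sentence = "The fish ate the worm because it was hungry."
```

Fine-tuning on DPR or WinoGrande then amounts to training the masked LM so that the correct candidate receives the higher fill probability.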
To clearly understand the progress we have made on solving hard PCR problems, we show the performance of all models on the Winograd Schema Challenge in Table 5. From the results, we can make the following observations:

1. Even though methods that leverage structured knowledge can provide explainable solutions to WSC questions, their performance is typically limited due to low coverage.

2. Different from them, language model based methods represent the knowledge contained in human language implicitly, and thus do not have the matching issue and achieve better overall performance.

3. In general, fine-tuning pre-trained language representation models (e.g., BERT and RoBERTa) on similar datasets (e.g., DPR and WinoGrande) achieves the current SOTA performance, and two observations can be made: (1) The stronger the pre-trained model, the better the performance. This shows that current language representation models can indeed capture commonsense knowledge, and as their representation ability increases (e.g., a deeper model or a larger pre-training corpus, as with RoBERTa), more commonsense knowledge can be effectively represented. (2) The larger the fine-tuning dataset, the better the performance. This is probably because the knowledge behind some WSC questions is covered by WinoGrande but not by DPR.
To investigate the reason behind WinoGrande's success, we divide WinoGrande into subsets based on the instances' relevance to WSC. Denoting the instance sets of WinoGrande and WSC as I_WG and I_WSC respectively, for each instance i ∈ I_WG, we define its relevance score as follows:

    R_WSC(i) = max_{i' ∈ I_WSC} O(i, i') / (L(i) · L(i')),    (1)

where O(i, i') is the unigram co-occurrence count of i and i', and L(·) is the instance length. We use the released code and dataset to conduct the experiments and follow all hyper-parameters of the original paper [34] except the batch size. From the results in Table 6, we can observe that: (1) The most relevant instances contribute the most to the success; in some learning rate settings, they perform similarly to or even better than the overall set. (2) Less relevant instances also help, which shows that the current fine-tuning approach is not just fitting the data but also learning some underlying knowledge about solving the task. (3) The model can be sensitive to the hyper-parameters (i.e., the learning rate): different subsets have different best hyper-parameters, and the learning process can easily fail with a bad hyper-parameter. To achieve good performance on a fixed dataset like WSC, we can tune the hyper-parameters, but to create a reliable PCR system we can rely on in real life, we probably need a more robust model.

4 Other PCR Tasks

Besides the ordinary and hard PCR tasks, PCR is also an important research topic for many special purposes (e.g., gender bias) or in special settings (e.g., visual-aware PCR). In this section, we briefly introduce these tasks:
1. PCR in the Medical Domain: i2b2 [32] is a dataset that focuses on identifying coreference relations in electronic medical records. As reported in [4], the training set of i2b2 contains 2,024 third personal pronouns, 685 possessive pronouns, and 270 demonstrative pronouns; its test set contains 1,244 third personal pronouns, 367 possessive pronouns, and 166 demonstrative pronouns. As a dataset in a relatively narrow domain, the usage of domain knowledge becomes important. As shown in [4], i2b2 can be used as an additional dataset to evaluate models' cross-domain abilities.
2. PCR for Machine Translation: ParCor [46] and ParCorFull [47] are datasets focusing on PCR in parallel multilingual corpora, which can be used in downstream machine translation tasks. Different from other PCR works, they focus on how to leverage PCR results for better translation rather than on how to solve the PCR problem itself.
3. PCR for Chatbots: CIC [48] is a dataset focusing on identifying coreference relations in multi-party conversations. Compared with ordinary PCR tasks, which are mostly annotated on formal textual data (e.g., newswire), identifying coreference relations in conversation is more challenging.
4. PCR for Studying Gender Bias: Nowadays, gender bias is a hot research topic in the NLP community [49, 50]. Among all these works, WinoGender [49] is one of the most popular. The setting of WinoGender is similar to that of WSC [12]: each sentence contains one target pronoun and two candidate noun phrases, and models are required to select the correct antecedent from the two candidates. But the purpose is different: WSC aims at evaluating models' abilities to understand commonsense knowledge, while WinoGender aims at evaluating how well models can predict without the influence of gender bias. The experiments show that some gender bias (e.g., 'he' is more likely to be predicted by the machine to be the doctor rather than the nurse) indeed exists in pre-trained language representation models. This observation is striking and motivates the community to think about how to minimize the influence of such gender bias.
5. Visual-aware PCR: Recently, a visual-aware PCR dataset [51], which evaluates how well models can ground pronouns to visual objects, was proposed. Similar to CIC [48], Visual-PCR also focuses on pronouns in daily dialogue, where language usage is informal and a lot of background knowledge can be missing. For example, if one speaker refers to something both speakers can see, they may directly use a pronoun rather than introduce the object first. In such cases, a pronoun may refer to objects not mentioned in the conversation. As analyzed in the original paper, 15% of pronouns in conversations refer to unmentioned objects, and for these, leveraging visual context information becomes crucial. As shown in [52], grounding pronouns to visual objects can significantly help a model better understand the dialog and generate better responses, which further proves that visual PCR is an important research topic worth exploring.

Footnote: The original batch size is 16 and ours is 4 due to GPU memory limitations, so the experimental results are slightly different from those reported in the original paper.
5 Conclusion

In this paper, we survey the progress on the pronoun coreference resolution (PCR) task and the limitations of existing approaches. Experiments and analysis on both the ordinary and hard PCR tasks demonstrate that even though great progress has been made on the main evaluation metrics, the PCR task is still far from being solved. For example, all best-performing ordinary PCR models struggle in the cross-domain setting as well as on infrequent objects, and even though fine-tuning pre-trained language representation models can achieve near-human performance on WSC, it can be sensitive to the hyper-parameters. All code will be released to encourage research on the PCR task.
Acknowledgements
This paper was supported by Early Career Scheme (ECS, No. 26206717), General Research Fund (GRF, No. 16211520), and Research Impact Fund (RIF, No. R6020-19) from the Research Grants Council (RGC) of Hong Kong.
References

[1] Varada Kolhatkar, Adam Roussel, Stefanie Dipper, and Heike Zinsmeister. Anaphora with non-nominal antecedents in computational linguistics: A survey. Computational Linguistics, 44(3):547–612, 2018.
[2] Jerry R. Hobbs. Resolving pronoun references. Lingua, 44(4):311–338, 1978.
[3] Vincent Ng. Supervised ranking for pronoun resolution: Some recent improvements. In Proceedings of AAAI 2005, pages 1081–1086, 2005.
[4] Hongming Zhang, Yan Song, Yangqiu Song, and Dong Yu. Knowledge-aware pronoun coreference resolution. In Proceedings of ACL 2019, pages 867–876, 2019.
[5] Hongming Zhang, Yan Song, and Yangqiu Song. Incorporating context and external knowledge for pronoun coreference resolution. In Proceedings of NAACL-HLT 2019, pages 872–881, 2019.
[6] Ruslan Mitkov et al. Anaphora resolution in machine translation. In TMMT, 1995.
[7] Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. Two uses of anaphora resolution in summarization. Information Processing & Management, 2007.
[8] Michael Strube and Christoph Müller. A machine learning approach to pronoun resolution in spoken dialogue. In Proceedings of ACL 2003, pages 168–175, 2003.
[9] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. In Proceedings of EMNLP 2017, pages 188–197, 2017.
[10] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of CoNLL 2012, pages 1–40, 2012.
[11] Christian Hardmeier, Luca Bevacqua, Sharid Loáiciga, and Hannah Rohde. Forms of anaphoric reference to organisational named entities: Hoping to widen appeal, they diversified. In Proceedings of the Seventh Named Entities Workshop, pages 36–40, 2018.
[12] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Proceedings of KRR 2012, 2012.
[13] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: A brief history. In Proceedings of COLING 1996, pages 466–471, 1996.
[14] Nancy A. Chinchor. Overview of MUC-7/MET-2. Technical report, 1998.
[15] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M. Weischedel. The Automatic Content Extraction (ACE) program - tasks, data, and evaluation. In Proceedings of LREC 2004, 2004.
[16] Sameer Pradhan, Lance A. Ramshaw, Mitchell P. Marcus, Martha Palmer, Ralph M. Weischedel, and Nianwen Xue. CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of CoNLL 2011, pages 1–27, 2011.
[17] Abbas Ghaddar and Philippe Langlais. WikiCoref: An English coreference-annotated corpus of Wikipedia articles. In Proceedings of LREC 2016, 2016.
[18] Massimo Poesio, Jon Chamberlain, Silviu Paun, Juntao Yu, Alexandra Uma, and Udo Kruschwitz. A crowdsourced corpus of multiple judgments and disagreement on anaphoric interpretation. In Proceedings of NAACL-HLT 2019, pages 1778–1789, 2019.
[19] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nate Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher D. Manning. A multi-pass sieve for coreference resolution. In Proceedings of EMNLP 2010, 2010.
[20] Simone Paolo Ponzetto and Michael Strube. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of NAACL-HLT 2006, pages 192–199, 2006.
[21] Yannick Versley, Massimo Poesio, and Simone Ponzetto. Using Lexical and Encyclopedic Knowledge, pages 393–429. Springer Berlin Heidelberg, Berlin, Heidelberg, 2016.
[22] Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. Learning global features for coreference resolution. In Proceedings of NAACL-HLT 2016, pages 994–1004, 2016.
[23] Heeyoung Lee, Angel X. Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4):885–916, 2013.
[24] Eric Bengtson and Dan Roth. Understanding the value of features for coreference resolution. In Proceedings of EMNLP 2008, pages 294–303, 2008.
[25] Kevin Clark and Christopher D. Manning. Entity-centric coreference resolution with model stacking. In Proceedings of ACL 2015, pages 1405–1415, 2015.
[26] Kevin Clark and Christopher D. Manning. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of EMNLP 2016, pages 2256–2262, 2016.
[27] Kai-Wei Chang, Rajhans Samdani, and Dan Roth. A constrained latent variable model for coreference resolution. In Proceedings of EMNLP 2013, pages 601–612, 2013.
[28] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
[29] Kenton Lee, Luheng He, and Luke Zettlemoyer. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of NAACL-HLT 2018, pages 687–692, 2018.
[30] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227–2237, 2018.
[31] Ben Kantor and Amir Globerson. Coreference resolution with entity equalization. In Proceedings of ACL 2019, pages 673–677, 2019.
[32] Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R. South. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5):786–791, 2012.
[33] Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of EMNLP-CoNLL 2012, pages 777–789, 2012.
[34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of AAAI 2020, 2020.
[35] Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. The KnowRef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution. In Proceedings of ACL 2019, pages 3952–3961, 2019.
[36] Ali Emami, Noelia De La Cruz, Adam Trischler, Kaheer Suleman, and Jackie Chi Kit Cheung. A knowledge hunting framework for common sense reasoning. In Proceedings of EMNLP 2018, pages 1949–1958, 2018.
[37] Hongming Zhang, Hantian Ding, and Yangqiu Song. SP-10K: A large-scale evaluation set for selectional preference acquisition. In Proceedings of ACL 2019, pages 722–731, 2019.
[38] Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. ASER: A large-scale eventuality knowledge graph. In Proceedings of WWW 2020, pages 201–211, 2020.
[39] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018.
[40] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[41] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, 2019.
[42] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
[43] Quan Liu, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. Commonsense knowledge enhanced embeddings for solving pronoun disambiguation problems in Winograd schema challenge. arXiv preprint arXiv:1611.04146, 2016.
[44] Hugo Liu and Push Singh. ConceptNet - a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[45] Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of ACL 2019, pages 4837–4842, 2019.
[46] Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, and Bonnie L. Webber. ParCor 1.0: A parallel pronoun-coreference corpus to support statistical MT. In Proceedings of LREC 2014, pages 3191–3198, 2014.
[47] Ekaterina Lapshinova-Koltunski, Christian Hardmeier, and Pauline Krielke. ParCorFull: A parallel corpus annotated with full coreference. In Proceedings of LREC 2018, 2018.
[48] Yu-Hsin Chen and Jinho D. Choi. Character identification on multiparty conversation: Identifying mentions of characters in TV shows. In Proceedings of SIGDIAL 2016, pages 90–100, 2016.
[49] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of NAACL-HLT 2018, pages 8–14, 2018.
[50] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of NAACL-HLT 2018, pages 15–20, 2018.
[51] Xintong Yu, Hongming Zhang, Yangqiu Song, Yan Song, and Changshui Zhang. What you see is what you get: Visual pronoun coreference resolution in dialogues. In Proceedings of EMNLP-IJCNLP 2019, pages 5122–5131, 2019.
[52] Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. Visual coreference resolution in visual dialog using neural module networks. In