Featured Research

Computation And Language

Controlling Hallucinations at Word Level in Data-to-Text Generation

Data-to-Text Generation (DTG) is a subfield of Natural Language Generation that aims at transcribing structured data into natural language descriptions. The field has recently been boosted by the use of neural generators, which on the one hand exhibit great syntactic skill without the need for hand-crafted pipelines; on the other hand, the quality of the generated text reflects the quality of the training data, which in realistic settings offers only imperfectly aligned structure-text pairs. Consequently, state-of-the-art neural models include misleading statements, usually called hallucinations, in their outputs. Controlling this phenomenon is today a major challenge for DTG and is the problem addressed in this paper. Previous work deals with the issue at the instance level, using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should rather be treated at the word level. Specifically, we propose a Multi-Branch Decoder that is able to leverage word-level labels to learn the relevant parts of each training instance. These labels are obtained with a simple and efficient scoring procedure based on co-occurrence analysis and dependency parsing. Extensive evaluations, via automated metrics and human judgment on the standard WikiBio benchmark, show the accuracy of our alignment labels and the effectiveness of the proposed Multi-Branch Decoder. Our model is able to reduce and control hallucinations while preserving fluency and coherence in the generated texts. Further experiments on a degraded version of ToTTo show that our model can be used successfully in very noisy settings.
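
The labeling step lends itself to a compact illustration. Below is a minimal, hypothetical sketch of word-level alignment scoring via co-occurrence statistics only; the paper's actual procedure also uses dependency parsing, and every name here is illustrative rather than the authors' code.

```python
from collections import Counter
from typing import Dict, List, Tuple

def cooccurrence_scores(pairs: List[Tuple[List[str], List[str]]]) -> Dict[Tuple[str, str], float]:
    """Estimate P(reference word | table token) from pair-level co-occurrence."""
    joint, table_counts = Counter(), Counter()
    for table_tokens, ref_tokens in pairs:
        for t in set(table_tokens):
            table_counts[t] += 1
            for w in set(ref_tokens):
                joint[(t, w)] += 1
    return {tw: c / table_counts[tw[0]] for tw, c in joint.items()}

def label_reference(table_tokens, ref_tokens, scores, threshold=0.1):
    """Mark each reference word as supported by the table (1) or as a
    potential hallucination (0), based on its best co-occurrence score."""
    labels = []
    for w in ref_tokens:
        best = max((scores.get((t, w), 0.0) for t in table_tokens), default=0.0)
        labels.append(1 if w in table_tokens or best >= threshold else 0)
    return labels
```

Words copied verbatim from the table or strongly associated with its values receive the "aligned" label; everything else is treated as potentially hallucinated and can be routed to a separate decoder branch.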

Read more
Computation And Language

Converse, Focus and Guess -- Towards Multi-Document Driven Dialogue

We propose a novel task, Multi-Document Driven Dialogue (MD3), in which an agent guesses the target document that the user is interested in by leading a dialogue. To benchmark progress, we introduce GuessMovie, a new dataset that contains 16,881 documents, each describing a movie, and 13,434 associated dialogues. Further, we propose the MD3 model. While continually guessing the target document, it converses with the user conditioned on both document engagement and user feedback. In order to incorporate large-scale external documents into the dialogue, it pre-trains a document representation that is sensitive to the attributes of the object it talks about. It then tracks the dialogue state by detecting the evolution of document belief and attribute belief, and finally optimizes the dialogue policy under the principles of decreasing entropy and increasing reward, which is expected to identify the user's target in a minimum number of turns. Experiments show that our method significantly outperforms several strong baselines and comes very close to human performance.
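
The "decreasing entropy" policy can be made concrete with a small sketch. The following hypothetical Python snippet greedily picks the attribute question whose answer is expected to shrink the entropy of the document belief the most; the data layout and names are assumptions for illustration, not the MD3 implementation.

```python
import math
from typing import Dict, List

def entropy(belief: List[float]) -> float:
    return -sum(p * math.log2(p) for p in belief if p > 0)

def best_attribute(belief: List[float], answers: Dict[str, List[str]]) -> str:
    """Pick the attribute with the lowest expected posterior entropy.
    `answers[attr][d]` is the (discretized) value of attr in document d."""
    best, best_h = None, float("inf")
    for attr, values in answers.items():
        groups: Dict[str, float] = {}          # probability mass per answer
        for p, v in zip(belief, values):
            groups[v] = groups.get(v, 0.0) + p
        h = 0.0                                # expected entropy after the answer
        for v, mass in groups.items():
            if mass == 0:
                continue
            posterior = [p / mass if u == v else 0.0 for p, u in zip(belief, values)]
            h += mass * entropy(posterior)
        if h < best_h:
            best, best_h = attr, h
    return best

belief = [0.25, 0.25, 0.25, 0.25]              # uniform over 4 movies
answers = {"genre": ["drama", "drama", "comedy", "comedy"],
           "decade": ["1990s", "2000s", "2000s", "2000s"]}
print(best_attribute(belief, answers))          # "genre" splits the mass evenly
```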

Read more
Computation And Language

Counting Protests in News Articles: A Dataset and Semi-Automated Data Collection Pipeline

Between January 2017 and January 2021, thousands of local news sources in the United States reported on over 42,000 protests about topics such as civil rights, immigration, guns, and the environment. Given the vast number of local journalists who report on protests daily, extracting these events as structured data to understand temporal and geographic trends can empower civic decision-making. However, the task of extracting events from news articles presents well-known challenges to the NLP community in the fields of domain detection, slot filling, and coreference resolution. To help improve the resources available for extracting structured data from news stories, our contribution is three-fold. We 1) release a manually labeled dataset of news article URLs, dates, locations, crowd size estimates, and 494 discrete descriptive tags corresponding to 42,347 reported protest events in the United States between January 2017 and January 2021; 2) describe the semi-automated data collection pipeline used to discover, sort, and review the 144,568 English articles that comprise the dataset; and 3) benchmark a long short-term memory (LSTM) low-dimensional classifier that demonstrates the utility of processing news articles based on syntactic structures, such as paragraphs and sentences, to count the number of reported protest events.
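
As a rough sketch of what an LSTM over sentence-level structure can look like, the toy PyTorch model below tags each sentence of an article as reporting a protest event or not, so a per-article count falls out of the positive predictions. It is a generic illustration under assumed feature dimensions, not the benchmarked model.

```python
import torch
import torch.nn as nn

class SentenceLSTMClassifier(nn.Module):
    """Run a BiLSTM over per-sentence feature vectors and emit one
    event/no-event logit per sentence."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, sent_feats):                  # (batch, n_sentences, feat_dim)
        out, _ = self.lstm(sent_feats)              # (batch, n_sentences, 2*hidden)
        return self.head(out).squeeze(-1)           # per-sentence logits

model = SentenceLSTMClassifier()
logits = model(torch.randn(2, 10, 64))              # 2 articles, 10 sentences each
counts = (torch.sigmoid(logits) > 0.5).sum(dim=1)   # crude per-article event count
```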

Read more
Computation And Language

Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data

This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances in Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is particularly challenging because Frisian is low-resource and because of the informal nature of the data, the code-switching, and the non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances each. After each stage, disagreements were discussed and resolved. An increase of 7.8 UAS and 10.5 LAS points was achieved between the first and third rounds. This paper focuses on the issues that arise when annotating a transcribed speech corpus and proposes several solutions to resolve them.
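
For readers unfamiliar with the agreement metrics quoted above, UAS and LAS are simple to compute: UAS is the share of tokens whose head is identical across the two annotations, and LAS additionally requires the dependency label to match. A minimal self-contained example:

```python
def uas_las(gold, pred):
    """Attachment scores between two annotations of the same utterance.
    Each annotation is a list of (head_index, dependency_label) per token."""
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100 * uas_hits / total, 100 * las_hits / total

# 4-token utterance; the annotators disagree on one head and one label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]
print(uas_las(gold, pred))   # (75.0, 50.0)
```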

Read more
Computation And Language

Creolizing the Web

The evolution of language has been a hotly debated subject with contradicting hypotheses and unreliable claims. Drawing from signalling games, dynamic population mechanics, machine learning, and algebraic topology, we present a method for detecting evolutionary patterns in a sociological model of language evolution. We develop a minimalistic model that provides a rigorous basis for any generalized evolutionary model of language based on communication between individuals. We also discuss theoretical guarantees of this model, ranging from the stability of language representations to the fast convergence of language under temporal communication and language drift in an interactive setting. Further, we present empirical results and their interpretations on a real-world dataset from Reddit to identify communities and echo chambers for opinions, thus identifying obstructions to reliable communication among communities.
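
The signalling-game foundation the authors build on can be demonstrated in a few lines. The sketch below is a classic Lewis signalling game with simple reinforcement, offered only to illustrate how a shared state-to-signal convention emerges between agents; it is not the paper's model.

```python
import random

N = 3                                         # states == signals == actions
sender = [[1.0] * N for _ in range(N)]        # sender[state][signal] weights
receiver = [[1.0] * N for _ in range(N)]      # receiver[signal][action] weights

def sample(weights):
    return random.choices(range(N), weights=weights)[0]

for _ in range(20000):
    state = random.randrange(N)
    signal = sample(sender[state])
    action = sample(receiver[signal])
    if action == state:                       # successful communication
        sender[state][signal] += 1.0          # reinforce both choices
        receiver[signal][action] += 1.0

mapping = [max(range(N), key=lambda s: sender[st][s]) for st in range(N)]
print("emergent state -> signal mapping:", mapping)
```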

Read more
Computation And Language

Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment

We propose a novel approach for zero-shot cross-lingual Named Entity Recognition (NER) transfer using parallel corpora. We build an entity alignment model on top of XLM-RoBERTa to project the entities detected in the English part of the parallel data onto the target-language sentences; its accuracy surpasses that of all previous unsupervised models. With the alignment model we can obtain a pseudo-labeled NER dataset in the target language with which to train a task-specific model. Unlike translation-based methods, this approach benefits from the natural fluency and nuances of original target-language text. We also propose a modified loss function, similar to focal loss but assigning weights in the opposite direction, to further improve model training on the noisy pseudo-labeled dataset. We evaluate the proposed approach on benchmark datasets for four target languages and obtain competitive F1 scores compared to the most recent SOTA models. We also discuss the impact of parallel corpus size and domain on the final transfer performance.
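
The modified loss is easy to sketch. Standard focal loss up-weights hard examples via a (1 - p_t)^gamma factor; on noisy pseudo-labels the opposite is desirable, down-weighting examples the model finds hard because they are likely mislabeled. The PyTorch snippet below captures that idea with a p_t^gamma factor; it is an interpretation of the description above, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reverse_focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy scaled by p_t**gamma: confident (likely clean) examples
    keep their weight, hard (likely noisy) examples are discounted."""
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_t = log_p_t.exp()
    return ((p_t ** gamma) * -log_p_t).mean()

logits = torch.randn(8, 5, requires_grad=True)   # 8 tokens, 5 NER tags
targets = torch.randint(0, 5, (8,))
reverse_focal_loss(logits, targets).backward()
```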

Read more
Computation And Language

Cross-SEAN: A Cross-Stitch Semi-Supervised Neural Attention Model for COVID-19 Fake News Detection

As the COVID-19 pandemic sweeps across the world, it has been accompanied by a tsunami of fake news and misinformation on social media. At a time when reliable information is vital for public health and safety, COVID-19-related fake news has been spreading even faster than the facts. During times such as the COVID-19 pandemic, fake news can not only cause intellectual confusion but can also place people's lives at risk. This calls for an immediate need to contain the spread of such misinformation on social media. We introduce CTF, the first COVID-19 Twitter fake news dataset with labeled genuine and fake tweets. Additionally, we propose Cross-SEAN, a cross-stitch based semi-supervised end-to-end neural attention model which leverages a large amount of unlabelled data. Cross-SEAN partially generalises to emerging fake news as it learns from relevant external knowledge. We compare Cross-SEAN with seven state-of-the-art fake news detection methods and observe that it achieves a 0.95 F1 score on CTF, outperforming the best baseline by 9%. We also develop Chrome-SEAN, a Cross-SEAN based Chrome extension for real-time detection of fake tweets.
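
The cross-stitch mechanism at the heart of the model has a compact generic form (Misra et al., 2016): a small learned matrix that mixes the hidden states of two branches, here plausibly a supervised and an unsupervised one. The snippet below shows the mechanism only and makes no claim about Cross-SEAN's actual architecture.

```python
import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    """Learn a 2x2 mixing matrix that linearly combines the hidden
    states of two task branches."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, h_a, h_b):
        mixed_a = self.alpha[0, 0] * h_a + self.alpha[0, 1] * h_b
        mixed_b = self.alpha[1, 0] * h_a + self.alpha[1, 1] * h_b
        return mixed_a, mixed_b

stitch = CrossStitch()
h_sup, h_unsup = torch.randn(4, 128), torch.randn(4, 128)
h_sup, h_unsup = stitch(h_sup, h_unsup)   # information now flows both ways
```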

Read more
Computation And Language

Cross-lingual Visual Pre-training for Multimodal Machine Translation

Pre-trained language models have been shown to substantially improve performance on many natural language tasks. Although the early focus of such models was single-language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually grounded cross-lingual representations. Specifically, we extend translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
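
Masked region classification mirrors masked language modelling on the visual side: hide some detected image regions and predict their object classes from context. The sketch below uses toy stand-ins for the encoder and classification head; dimensions and names are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_region_classification(region_feats, region_labels, encoder, head,
                                 mask_prob=0.15):
    """Zero out a random subset of region features, encode, and predict
    each masked region's object class from the encoder output."""
    mask = torch.rand(region_feats.shape[:2]) < mask_prob   # (batch, n_regions)
    masked = region_feats.masked_fill(mask.unsqueeze(-1), 0.0)
    hidden = encoder(masked)                                # (batch, n_regions, d)
    logits = head(hidden[mask])                             # masked positions only
    return F.cross_entropy(logits, region_labels[mask])

encoder = nn.Linear(2048, 768)            # toy stand-in for the real encoder
head = nn.Linear(768, 1600)               # e.g. 1600 detector object classes
feats = torch.randn(2, 36, 2048)          # 36 detected regions per image
labels = torch.randint(0, 1600, (2, 36))
loss = masked_region_classification(feats, labels, encoder, head)
```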

Read more
Computation And Language

Curriculum-Meta Learning for Order-Robust Continual Relation Extraction

Continual relation extraction is an important task that focuses on extracting new facts incrementally from unstructured text. Given the sequential arrival of relations, this task is prone to two serious challenges: catastrophic forgetting and order-sensitivity. We propose a novel curriculum-meta learning method to tackle these two challenges in continual relation extraction. We combine meta-learning and curriculum learning to quickly adapt model parameters to a new task and to reduce interference from previously seen tasks on the current task. We design a novel relation representation learning method based on the distribution of the domain and range types of relations. These representations are used to quantify the difficulty of tasks for the construction of curricula. Moreover, we present novel difficulty-based metrics to quantitatively measure the extent of a model's order-sensitivity, suggesting new ways to evaluate model robustness. Comprehensive experiments on three benchmark datasets show that our proposed method outperforms state-of-the-art techniques. The code is available at the anonymous GitHub repository: this https URL.
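
The interplay of curriculum and meta-learning can be sketched with a toy model: order the tasks easy-to-hard by a difficulty score (which the paper derives from relation domain/range type distributions) and, after adapting to each task, move the shared initialization toward the adapted weights, here with a Reptile-style update. The linear model, synthetic tasks, and difficulty scores below are purely illustrative.

```python
import torch

w = torch.zeros(8)                                   # shared meta-initialization
tasks = {f"task{i}": (torch.randn(32, 8), torch.randn(32)) for i in range(3)}
difficulty = {"task0": 0.2, "task1": 0.9, "task2": 0.5}   # assumed scores

inner_lr, meta_lr, inner_steps = 0.05, 0.3, 5
for name in sorted(tasks, key=difficulty.get):       # curriculum: easy first
    x, y = tasks[name]
    fast = w.clone().requires_grad_(True)
    for _ in range(inner_steps):                     # inner-loop adaptation
        loss = ((x @ fast - y) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, fast)
        fast = (fast - inner_lr * grad).detach().requires_grad_(True)
    w = w + meta_lr * (fast.detach() - w)            # Reptile-style outer update
```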

Read more
Computation And Language

Customizing Contextualized Language Models for Legal Document Reviews

Inspired by inductive transfer learning in computer vision, many efforts have been made to train contextualized language models that boost the performance of natural language processing tasks. These models are mostly trained on large general-domain corpora such as news, books, or Wikipedia. Although these pre-trained generic language models capture the semantic and syntactic essence of a language well, exploiting them in a real-world domain-specific scenario still requires some practical considerations, such as token distribution shifts, inference time, memory, and their simultaneous proficiency in multiple tasks. In this paper, we focus on the legal domain and present how different language models trained on general-domain corpora can be best customized for multiple legal document reviewing tasks. We compare their efficiency with respect to task performance and present practical considerations.
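
One of the listed considerations, token distribution shift, is straightforward to quantify before committing to domain customization. The sketch below measures the shift between a general and a legal corpus as a smoothed KL divergence over token frequencies; the toy corpora are illustrative only.

```python
import math
from collections import Counter

def token_shift(general_tokens, domain_tokens):
    """KL(domain || general) over the shared vocabulary with add-one
    smoothing; larger values suggest customization is worth the cost."""
    vocab = set(general_tokens) | set(domain_tokens)
    g, d = Counter(general_tokens), Counter(domain_tokens)
    g_total, d_total = sum(g.values()) + len(vocab), sum(d.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (d[w] + 1) / d_total          # domain probability
        q = (g[w] + 1) / g_total          # general probability
        kl += p * math.log(p / q)
    return kl

print(token_shift("the court may waive the filing fee".split(),
                  "the lessee shall indemnify the lessor".split()))
```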

Read more
