Yifan He
New York University
Publication
Featured research published by Yifan He.
North American Chapter of the Association for Computational Linguistics | 2015
Maria Pershina; Yifan He; Ralph Grishman
The task of Named Entity Disambiguation is to map entity mentions in a document to their correct entries in some knowledge base. We present a novel graph-based disambiguation approach based on Personalized PageRank (PPR) that combines local and global evidence for disambiguation and effectively filters out noise introduced by incorrect candidates. Experiments show that our method outperforms state-of-the-art approaches, achieving 91.7% micro- and 89.9% macro-accuracy on a dataset of 27.8K named entity mentions.
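The combination of local and global evidence via Personalized PageRank can be sketched with a toy power-iteration implementation (the graph, seed weights, and entity names below are invented for illustration and are not the paper's actual construction):

```python
# Hypothetical sketch of Personalized PageRank (PPR) for entity
# disambiguation: candidate entities are graph nodes, edges link
# coherent candidate pairs, and the teleport ("personalization")
# distribution seeds mass on high-confidence candidates.

def personalized_pagerank(edges, seeds, damping=0.85, iters=50):
    nodes = sorted({n for e in edges for n in e} | set(seeds))
    out = {n: [] for n in nodes}
    for a, b in edges:
        out[a].append(b)
        out[b].append(a)          # treat coherence links as undirected
    base = {n: seeds.get(n, 0.0) for n in nodes}
    total = sum(base.values()) or 1.0
    base = {n: v / total for n, v in base.items()}  # teleport distribution
    rank = dict(base)
    for _ in range(iters):
        nxt = {n: (1 - damping) * base[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:
                for m in nodes:   # dangling node: redistribute via teleport
                    nxt[m] += damping * rank[n] * base[m]
        rank = nxt
    return rank

# Toy example: the candidate "Michael_Jordan" is coherent with a
# confidently resolved "Chicago_Bulls" mention elsewhere in the document,
# so PPR pulls it above the unconnected "Jordan_(country)" candidate.
scores = personalized_pagerank(
    edges=[("Michael_Jordan", "Chicago_Bulls"),
           ("Jordan_(country)", "Amman")],
    seeds={"Chicago_Bulls": 1.0},
)
```

Seeding the teleport distribution on confident candidates (local evidence) lets coherence links (global evidence) raise related candidates, while candidates not connected to the seeds receive little mass.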
Machine Translation | 2010
Yifan He; Andy Way
In Minimum Error Rate Training (MERT), Bleu is often used as the error function, despite the fact that it has been shown to correlate with human judgment less well than other metrics such as Meteor and Ter. In this paper, we present empirical results showing that parameters tuned on Bleu may lead to sub-optimal Bleu scores under certain data conditions. Such scores can be improved significantly by tuning on a different metric, e.g. Meteor, by 0.0082 Bleu or 3.38% relative improvement on the WMT08 English–French data. We analyze the influence of the number of references and the choice of metric on the result of MERT and experiment on different data sets. We show the problems of tuning on a metric that is not designed for the single-reference scenario and point out some possible solutions.
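The single-reference brittleness can be made concrete with a toy sentence-level BLEU (capped at bigrams, add-one smoothed, with a brevity penalty; this is a simplification for illustration, not the paper's evaluation code): a legitimate paraphrase that shares few n-grams with the sole reference scores far below a near-copy.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Toy smoothed sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped n-gram matches, with add-one smoothing
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = sum(c_ngrams.values())
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec)

ref = "the cat sat on the mat"
close = "the cat sat on a mat"               # shares most n-grams
paraphrase = "a feline rested upon the rug"  # valid wording, few shared n-grams
```

With only one reference, the paraphrase is penalized heavily even though a human judge might accept it; additional references or a metric with synonym matching (such as Meteor) soften this effect.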
North American Chapter of the Association for Computational Linguistics | 2015
Yifan He; Ralph Grishman
We showcase ICE, an Integrated Customization Environment for Information Extraction. ICE is an easy tool for non-NLP experts to rapidly build customized IE systems for a new domain.
Empirical Methods in Natural Language Processing | 2015
Maria Pershina; Yifan He; Ralph Grishman
The goal of paraphrase identification is to decide whether two given text fragments have the same meaning. Of particular interest in this area is the identification of paraphrases among short texts, such as SMS and Twitter. In this paper, we present idiomatic expressions as a new domain for short-text paraphrase identification. We propose a technique, utilizing idiom definitions and continuous space word representations that performs competitively on a dataset of 1.4K annotated idiom paraphrase pairs, which we make publicly available for the research community.
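A minimal sketch of the continuous-space idea, assuming toy 3-dimensional word vectors (the vectors and words below are invented for illustration; the paper's actual representations and model differ): average the word vectors of each text fragment and compare by cosine similarity.

```python
import math

# Invented toy word vectors; a real system would use embeddings
# trained on a large corpus.
VECS = {
    "kick":   [0.9, 0.1, 0.0],
    "bucket": [0.8, 0.2, 0.1],
    "die":    [0.7, 0.2, 0.0],
    "pass":   [0.6, 0.3, 0.1],
    "away":   [0.7, 0.1, 0.2],
    "buy":    [0.0, 0.9, 0.1],
    "milk":   [0.1, 0.8, 0.3],
}

def sent_vec(words):
    """Average the vectors of known words (zero vector if none known)."""
    dims = len(next(iter(VECS.values())))
    acc = [0.0] * dims
    known = [VECS[w] for w in words if w in VECS]
    for v in known:
        for i, x in enumerate(v):
            acc[i] += x
    return [x / max(len(known), 1) for x in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase_score(definition, candidate):
    """Similarity between an idiom's definition and a candidate paraphrase."""
    return cosine(sent_vec(definition.split()), sent_vec(candidate.split()))
```

Here the idiom definition "die" scores higher against "pass away" than against the unrelated "buy milk", even though "die" and "pass away" share no surface tokens.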
Text, Speech and Dialogue | 2014
Ralph Grishman; Yifan He
When an information extraction system is applied to a new task or domain, we must specify the classes of entities and relations to be extracted. This is best done by a subject matter expert, who may have little training in NLP. To meet this need, we have developed a toolset which is able to analyze a corpus and aid the user in building the specifications of the entity and relation types.
International Conference on Computational Linguistics | 2014
Adam Meyers; Zachary Glass; Angus Grieve-Smith; Yifan He; Shasha Liao; Ralph Grishman
NLP definitions of terminology are usually application-dependent. IR terms are noun sequences that characterize topics. Terms can also be arguments of relations such as abbreviation, definition, or IS-A. In contrast, this paper explores techniques for extracting terms fitting a broader definition: noun sequences that are specific to topics and not well known to naive adults. We describe a chunking-based approach, an evaluation, and applications to non-topic-specific relation extraction.
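A crude illustration of the chunking idea, using a small common-word vocabulary as a stand-in for the paper's noun-group chunker with its out-of-vocabulary preferences (the vocabulary and the sentence below are invented):

```python
# Hypothetical sketch: treat maximal runs of tokens absent from a
# common-word vocabulary as candidate term chunks. A real chunker
# would use POS patterns plus features such as out-of-vocabulary
# words, nominalizations, and technical adjectives.

COMMON = {"the", "a", "of", "is", "are", "in", "we", "and", "to",
          "for", "this", "describes", "paper", "uses"}

def term_chunks(tokens):
    chunks, current = [], []
    for tok in tokens:
        if tok.lower() not in COMMON:
            current.append(tok)       # extend the current candidate chunk
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

sent = "this paper uses lexicalized tree adjoining grammar for parsing".split()
# term_chunks(sent) → ["lexicalized tree adjoining grammar", "parsing"]
```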
Frontiers in Research Metrics and Analytics | 2018
Adam Meyers; Yifan He; Zachary Glass; John Ortega; Shasha Liao; Angus Grieve-Smith; Ralph Grishman; Olga Babko-Malaya
The Termolator is an open-source, high-performing terminology extraction system, available on GitHub. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The distributional component ranks such term chunks according to several metrics including: (a) a set of metrics that favors term chunks that are relatively more frequent in a “foreground” corpus about a single topic than they are in a “background” or multi-topic corpus; (b) a well-formedness score based on linguistic features; and (c) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these components and show that all modules contribute to the system’s performance, both in terms of the number and quality of terms identified. This paper expands upon previous publications about this research and includes descriptions of some of the improvements made since its initial release. This study also includes a comparison with another terminology extraction system available online, Termostat (Drouin 2003). We found that the systems get comparable results when applied to small amounts of data: about 50% precision for a single foreground file (Einstein’s Theory of Relativity). However, when running the system with 500 patent files as foreground, The Termolator performed significantly better than Termostat. For 500 refrigeration patents, The Termolator got 70% precision vs. Termostat’s 52%. For 500 semiconductor patents, The Termolator got 79% precision vs. Termostat’s 51%.
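The foreground/background intuition behind the distributional component can be sketched as a simple smoothed log-ratio (a hypothetical simplification for illustration; The Termolator's actual metrics are more elaborate):

```python
from collections import Counter
import math

def relative_frequency_scores(foreground, background):
    """Score candidate terms by how much more frequent they are in a
    single-topic foreground corpus than in a multi-topic background
    corpus, using an add-one-smoothed log-ratio."""
    fg, bg = Counter(foreground), Counter(background)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, count in fg.items():
        p_fg = (count + 1) / (fg_total + len(fg))
        p_bg = (bg.get(term, 0) + 1) / (bg_total + len(fg))
        scores[term] = math.log(p_fg / p_bg)
    return scores

# Invented toy candidate lists standing in for chunked corpora.
fg_terms = ["compressor", "refrigerant", "compressor", "valve", "system"]
bg_terms = ["system", "method", "device", "system", "valve", "method"]
ranked = sorted(relative_frequency_scores(fg_terms, bg_terms).items(),
                key=lambda kv: -kv[1])
```

Topic-specific candidates like "compressor" rank above generic ones like "system", which appear just as often in the background corpus.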
Meeting of the Association for Computational Linguistics | 2010
Yifan He; Yanjun Ma; Josef van Genabith; Andy Way
Meeting of the Association for Computational Linguistics | 2011
Yanjun Ma; Yifan He; Andy Way; Josef van Genabith
Archive | 2009
Yifan He; Andy Way