Ryan Georgi | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ryan Georgi is active.

Explore More

Publication

Featured researches published by Ryan Georgi.

language resources and evaluation | 2014

Capturing divergence in dependency trees to improve syntactic projection

Ryan Georgi; Fei Xia; William D. Lewis

Obtaining syntactic parses is an important step in many NLP pipelines. However, most of the world’s languages do not have a large amount of syntactically annotated data available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora consisting of resource-poor and resource-rich language pairs, taking advantage of a parser for the resource-rich language and word alignment between the languages to project the parses onto the data for the resource-poor language. These projection methods can suffer, however, when syntactic structures for some sentence pairs in the two languages look quite different. In this paper, we investigate the use of small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. We then use these detected patterns to improve projection algorithms and dependency parsers, allowing for better performing NLP tools for resource-poor languages, particularly those that may not have large amounts of annotated data necessary for traditional, fully-supervised methods. While this detection process is not exhaustive, we demonstrate that common patterns of divergence can be identified automatically without prior knowledge of a given language pair, and the patterns can be used to improve performance of syntactic projection and parsing.

sighum workshop on language technology for cultural heritage social sciences and humanities | 2015

Enriching Interlinear Text using Automatically Constructed Annotators

Ryan Georgi; Fei Xia; William D. Lewis

In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more than IGT from this database and a classification-based projection approach tailored for IGT, we will show that it is feasible to train reasonably performing annotators of interlinear text using projected annotations for potentially hundreds of world’s languages. Doing so can facilitate automatic enrichment of interlinear resources to aid the field of linguistics.

language resources and evaluation | 2016

Enriching a massively multilingual database of interlinear glossed text

Fei Xia; William D. Lewis; Michael Wayne Goodman; Glenn Slayden; Ryan Georgi; Joshua Crowgey; Emily M. Bender

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human readable form. In this paper, we describe the expansion of the ODIN resource—a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.

conference on intelligent text processing and computational linguistics | 2015

Enriching, Editing, and Representing Interlinear Glossed Text

Fei Xia; Michael Wayne Goodman; Ryan Georgi; Glenn Slayden; William D. Lewis

meeting of the association for computational linguistics | 2016

A Web-framework for ODIN Annotation.

Ryan Georgi; Michael Wayne Goodman; Fei Xia

The current release of the ODIN (Online Database of Interlinear Text) database contains over 150,000 linguistic examples, from nearly 1,500 languages, extracted from PDFs found on the web, representing a significant source of data for language research, particularly for low-resource languages. Errors introduced during PDF-totext conversion or poorly formatted examples can make the task of automatically analyzing the data more difficult, so we aim to clean and normalize the examples in order to maximize accuracy during analysis. In this paper we describe a system that allows users to automatically and manually correct errors in the source data in order to get the best possible analysis of the data. We also describe a RESTful service for managing collections of linguistic examples on the web. All software is distributed under an open-source license.

international conference on computational linguistics | 2010