Publication


Featured research published by William D. Lewis.


Cognitive Psychology | 2004

Representing the Meanings of Object and Action Words: The Featural and Unitary Semantic Space Hypothesis.

Gabriella Vigliocco; David P. Vinson; William D. Lewis; Merrill F. Garrett

This paper presents the Featural and Unitary Semantic Space (FUSS) hypothesis of the meanings of object and action words. The hypothesis, implemented in a statistical model, is based on the following assumptions: First, it is assumed that the meanings of words are grounded in conceptual featural representations, some of which are organized according to modality. Second, it is assumed that conceptual featural representations are bound into lexico-semantic representations that provide an interface between conceptual knowledge and other linguistic information (syntax and phonology). Finally, the FUSS model employs the same principles and tools for objects and actions, modeling both domains in a single semantic space. We assess the plausibility of the model by showing that it can capture generalizations presented in the literature, in particular those related to category-related deficits, and show that it can predict semantic effects in behavioral experiments for object and action words better than other models such as Latent Semantic Analysis (Landauer & Dumais, 1997) and similarity metrics derived from WordNet (Miller & Fellbaum, 1991).
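As a minimal illustrative sketch (not the FUSS model itself), a featural semantic space can be approximated by representing each word as a set of conceptual features and measuring vector overlap; the words and feature names below are invented examples.

```python
# Sketch of a featural semantic space (illustrative, not the actual FUSS
# model): words are sets of conceptual features, and similarity is the
# cosine between their binary feature vectors.
from math import sqrt

# Hypothetical feature norms for a few object and action words.
features = {
    "dog":  {"animate", "has_legs", "barks", "furry"},
    "cat":  {"animate", "has_legs", "furry", "meows"},
    "run":  {"action", "uses_legs", "fast"},
    "walk": {"action", "uses_legs", "slow"},
}

def cosine(w1, w2):
    """Cosine similarity over binary feature vectors = overlap / sqrt(sizes)."""
    a, b = features[w1], features[w2]
    return len(a & b) / sqrt(len(a) * len(b))

print(cosine("dog", "cat"))   # objects sharing features are close (0.75)
print(cosine("run", "walk"))  # actions live in the same space
print(cosine("dog", "run"))   # object-action similarity is low (0.0)
```

Because objects and actions are scored in one space with one metric, the same function handles both domains, mirroring the paper's single-semantic-space design.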


Journal of Child Language | 2005

Infants can use distributional cues to form syntactic categories.

LouAnn Gerken; Rachel Wilson; William D. Lewis

Nearly all theories of language development emphasize the importance of distributional cues for segregating words and phrases into syntactic categories like noun, feminine or verb phrase. However, questions concerning whether such cues can be used to the exclusion of referential cues have been debated. Using the headturn preference procedure, American children aged 1;5 were briefly familiarized with a partial Russian gender paradigm, with a subset of the paradigm members withheld. During test, infants listened on alternate trials to previously withheld grammatical items and ungrammatical items with incorrect gender markings on previously heard stems. Across three experiments, infants discriminated new grammatical from ungrammatical items, but like adults in previous studies, were only able to do so when a subset of familiarization items was double marked for gender category. The results suggest that learners can use distributional cues to category structure, to the exclusion of referential cues, from relatively early in the language learning process.


Language Resources and Evaluation | 2007

The GOLD Community of Practice: an infrastructure for linguistic data on the Web

Scott Farrar; William D. Lewis

The GOLD Community of Practice is proposed as a model for linking on-line linguistic data to an ontology. The key components of the model include the linguistic data resources themselves and those focused on the knowledge derived from data. Data resources include the ever-increasing amount of linguistic field data and other descriptive language resources being migrated to the Web. The knowledge resources capture generalizations about the data and are anchored in the General Ontology for Linguistic Description (GOLD). It is argued that such a model is in the spirit of the vision for a Semantic Web and, thus, provides a concrete methodology for rendering highly divergent resources semantically interoperable. The focus of this work, then, is not on annotation at the syntactic level, but rather on how annotated Web resources can be linked to an ontology. Furthermore, a methodology is given for creating specific communities of practice within the overall Web infrastructure for linguistics. Finally, ontology-driven search is discussed as a key application of the proposed model.
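The linking idea can be sketched as follows; the tag strings and resource names are hypothetical, and the concept identifiers are only stand-ins for actual GOLD ontology terms. Divergent descriptive labels from different resources are mapped onto a shared ontology concept, which is what makes ontology-driven search across resources possible.

```python
# Sketch of ontology-driven interoperability: tags from different
# annotation schemes map to one shared ontology concept, so a query over
# the concept retrieves data from all resources. Labels are hypothetical.
ontology_map = {
    "NOM": "gold:NominativeCase",
    "nom": "gold:NominativeCase",
    "nominative": "gold:NominativeCase",
    "ACC": "gold:AccusativeCase",
}

# (resource, tag) pairs as they might appear in divergent Web resources.
annotations = [("resourceA", "NOM"), ("resourceB", "nominative"), ("resourceC", "ACC")]

def search(concept):
    """Return every resource whose local tag maps to the given concept."""
    return [res for res, tag in annotations if ontology_map.get(tag) == concept]

print(search("gold:NominativeCase"))  # finds both resourceA and resourceB
```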


Conference of the European Chapter of the Association for Computational Linguistics | 2009

Applying NLP Technologies to the Collection and Enrichment of Language Data on the Web to Aid Linguistic Research

Fei Xia; William D. Lewis

The field of linguistics has always been reliant on language data, since that is its principal object of study. One of the major obstacles that linguists encounter is finding data relevant to their research. In this paper, we propose a three-stage approach to help linguists find relevant data. First, language data embedded in existing linguistic scholarly discourse is collected and stored in a database. Second, the language data is automatically analyzed and enriched, and language profiles are created from the enriched data. Third, a search facility is provided to allow linguists to search the original data, the enriched data, and the language profiles in a variety of ways. This work demonstrates the benefits of using natural language processing technology to create resources and tools for linguistic research, allowing linguists to have easy access not only to language data embedded in existing linguistic papers, but also to automatically generated language profiles for hundreds of languages.


Language Resources and Evaluation | 2014

Capturing divergence in dependency trees to improve syntactic projection

Ryan Georgi; Fei Xia; William D. Lewis

Obtaining syntactic parses is an important step in many NLP pipelines. However, most of the world’s languages do not have a large amount of syntactically annotated data available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora consisting of resource-poor and resource-rich language pairs, taking advantage of a parser for the resource-rich language and word alignment between the languages to project the parses onto the data for the resource-poor language. These projection methods can suffer, however, when syntactic structures for some sentence pairs in the two languages look quite different. In this paper, we investigate the use of small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. We then use these detected patterns to improve projection algorithms and dependency parsers, allowing for better performing NLP tools for resource-poor languages, particularly those that may not have large amounts of annotated data necessary for traditional, fully-supervised methods. While this detection process is not exhaustive, we demonstrate that common patterns of divergence can be identified automatically without prior knowledge of a given language pair, and the patterns can be used to improve performance of syntactic projection and parsing.
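The core projection step can be sketched as below; this is an illustrative simplification (one-to-one alignment, toy indices), not the paper's divergence-aware algorithm. Each head-dependent edge in the resource-rich parse is copied onto the aligned words of the resource-poor sentence.

```python
# Minimal sketch of syntactic projection: given a source dependency parse
# and a 1-to-1 word alignment, copy each head-dependent edge onto the
# aligned target words. Indices are 1-based; 0 marks the root.
src_heads = {1: 0, 2: 1, 3: 1}   # toy source parse: w1 is root, w2/w3 attach to w1
alignment = {1: 2, 2: 1, 3: 3}   # source index -> target index

def project(src_heads, alignment):
    tgt_heads = {}
    for child, head in src_heads.items():
        if child in alignment:
            # Map the head through the alignment; the root (and, in this
            # simplification, any unaligned head) falls back to root (0).
            tgt_heads[alignment[child]] = alignment.get(head, 0)
    return tgt_heads

print(project(src_heads, alignment))  # → {2: 0, 1: 2, 3: 2}
```

Divergent structures break exactly this naive copy, which is why detecting divergence patterns first, as the paper proposes, improves the projected trees.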


Machine Translation | 2015

Survey of data-selection methods in statistical machine translation

Sauleh Eetemadi; William D. Lewis; Kristina Toutanova; Hayder Radha

Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. data-to-models takes an inordinate amount of time). Moreover, the training data has a wide quality spectrum. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method.
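The shared skeleton of the surveyed methods can be sketched as below; the feature function and corpus are invented examples. Each sentence pair is scored by a weighted combination of feature functions, and a search/filtering step keeps a top-scoring subset.

```python
# Sketch of feature-based data selection for MT training corpora
# (illustrative): score each sentence pair with feature functions,
# then keep the k best pairs.
def length_ratio(src, tgt):
    """Penalize pairs whose lengths diverge wildly (likely misaligned)."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b)

def select(corpus, features, weights, k):
    scored = [(sum(w * f(s, t) for f, w in zip(features, weights)), s, t)
              for s, t in corpus]
    scored.sort(reverse=True)            # highest-scoring pairs first
    return [(s, t) for _, s, t in scored[:k]]

corpus = [
    ("the cat sat", "le chat est assis"),
    ("hello", "bonjour tout le monde et merci beaucoup"),
]
print(select(corpus, [length_ratio], [1.0], k=1))  # keeps the better-matched pair
```

Swapping in different feature functions and search strategies recovers the application scenarios the survey compares, from noise filtering to domain adaptation.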


Workshop on Hybrid Approaches to Translation | 2013

Controlled Ascent: Imbuing Statistical MT with Linguistic Knowledge

William D. Lewis; Chris Quirk

We explore the intersection of rule-based and statistical approaches in machine translation, with a particular focus on past and current work at Microsoft Research. Until about 10 years ago, the only machine translation systems worth using were rule-based and linguistically-informed. Along came statistical approaches, which use large corpora to directly guide translations toward expressions people would actually say. Rather than making local decisions when writing and conditioning rules, goodness of translation was modeled numerically and free parameters were selected to optimize that goodness. This led to huge improvements in translation quality as more and more data was consumed. By necessity, the pendulum is swinging back towards the inclusion of linguistic features in MT systems. We describe some of our statistical and non-statistical attempts to incorporate linguistic insights into machine translation systems, showing what is currently working well, and what isn’t. We also look at trade-offs in using linguistic knowledge (“rules”) in pre- or post-processing by language pair, with a particular eye on the return on investment as training data increases in size.


SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities | 2015

Enriching Interlinear Text using Automatically Constructed Annotators

Ryan Georgi; Fei Xia; William D. Lewis

In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more than IGT from this database and a classification-based projection approach tailored for IGT, we will show that it is feasible to train reasonably performing annotators of interlinear text using projected annotations for potentially hundreds of the world’s languages. Doing so can facilitate automatic enrichment of interlinear resources to aid the field of linguistics.
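The basic idea of projecting tags through IGT can be sketched as below; the tiny IGT instance, the language-line words, and the stand-in tagger are all invented for demonstration and are not the paper's classifier.

```python
# Illustrative sketch of POS projection through Interlinear Glossed Text
# (IGT): tags assigned to the English gloss stems are carried back to the
# language-line words via the 1-to-1 gloss alignment.
igt = {
    "lang":  ["nekhep", "yerak"],     # hypothetical language line
    "gloss": ["fish", "swim.3SG"],    # gloss line, aligned word-for-word
}

# Stand-in for an English POS tagger applied to gloss stems.
gloss_pos = {"fish": "NOUN", "swim": "VERB"}

def project_tags(igt):
    tags = []
    for word, gloss in zip(igt["lang"], igt["gloss"]):
        stem = gloss.split(".")[0]    # strip grammatical-marker glosses (3SG etc.)
        tags.append((word, gloss_pos.get(stem, "X")))
    return tags

print(project_tags(igt))  # → [('nekhep', 'NOUN'), ('yerak', 'VERB')]
```

Training a classifier on many such projected instances, rather than trusting each projection directly, is what lets the approach tolerate noisy glosses.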


Conference of the European Chapter of the Association for Computational Linguistics | 2009

Parsing, Projecting & Prototypes: Repurposing Linguistic Data on the Web

William D. Lewis; Fei Xia

Until very recently, most NLP tasks (e.g., parsing, tagging, etc.) have been confined to a very limited number of languages, the so-called majority languages. Now, as the field moves into the era of developing tools for Resource Poor Languages (RPLs)--a vast majority of the world’s 7,000 languages are resource poor--the discipline is confronted not only with the algorithmic challenges of limited data, but also the sheer difficulty of locating data in the first place. In this demo, we present a resource which taps the large body of linguistically annotated data on the Web, data which can be repurposed for NLP tasks. Because the field of linguistics has as its mandate the study of human language--in fact, the study of all human languages--and has whole-heartedly embraced the Web as a means for disseminating linguistic knowledge, the consequence is that a large quantity of analyzed language data can be found on the Web. In many cases, the data is richly annotated and exists for many languages for which there would otherwise be very limited annotated data. The resource, the Online Database of INterlinear text (ODIN), makes this data available and provides additional annotation and structure, making the resource useful to the Computational Linguistics audience.


Language Resources and Evaluation | 2016

Enriching a massively multilingual database of interlinear glossed text

Fei Xia; William D. Lewis; Michael Wayne Goodman; Glenn Slayden; Ryan Georgi; Joshua Crowgey; Emily M. Bender

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human readable form. In this paper, we describe the expansion of the ODIN resource—a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.
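A tier-based XML layout of the kind described can be sketched as below; the tier names and attributes here are simplified approximations, not the exact Xigt schema, and the IGT instance is invented. Explicit alignment attributes between tiers are what make the data machine-consumable.

```python
# Sketch of an IGT instance in a Xigt-like tiered XML layout (simplified,
# not the actual Xigt schema): glosses reference words via an alignment
# attribute, so tools can join tiers mechanically.
import xml.etree.ElementTree as ET

xml = """
<igt id="i1" lang="xxx">
  <tier type="words"><item id="w1">nekhep</item><item id="w2">yerak</item></tier>
  <tier type="glosses"><item id="g1" alignment="w1">fish</item>
                       <item id="g2" alignment="w2">swim.3SG</item></tier>
</igt>
"""

root = ET.fromstring(xml)
# Join each word with its aligned gloss via the alignment attribute.
words = {i.get("id"): i.text for i in root.find("tier[@type='words']")}
pairs = [(words[g.get("alignment")], g.text)
         for g in root.find("tier[@type='glosses']")]
print(pairs)  # → [('nekhep', 'fish'), ('yerak', 'swim.3SG')]
```

Further tiers (POS, dependencies, translations) slot in the same way, which is how enrichment tools can add word alignment and syntactic structure without disturbing the original annotation.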

Collaboration


Dive into William D. Lewis's collaborations.

Top Co-Authors

Fei Xia (University of Washington)
Ryan Georgi (University of Washington)
Carrie Lewis (University of Washington)