
Publication


Featured research published by Sharon Goldwater.


Meeting of the Association for Computational Linguistics | 2006

Contextual Dependencies in Unsupervised Word Segmentation

Sharon Goldwater; Thomas L. Griffiths; Mark Johnson

Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on sub-optimal search procedures.
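The search side of this problem can be illustrated with a minimal sketch (hypothetical lexicon and probabilities, not the paper's Gibbs-sampling inference): given a fixed unigram lexicon, dynamic programming over split points recovers the highest-probability segmentation of an unsegmented utterance.

```python
import math

# Minimal sketch: Viterbi segmentation of an unsegmented string under a
# fixed unigram word model (hypothetical lexicon, not the paper's sampler).
def segment(text, word_logprob):
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of last word
    for i in range(1, n + 1):
        for j in range(i):
            w = text[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    # Recover the word sequence from the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Hypothetical child-directed lexicon with probabilities summing to 1.
lex = {w: math.log(p) for w, p in
       {"you": 0.2, "want": 0.2, "the": 0.2, "doggie": 0.2,
        "dog": 0.1, "gie": 0.1}.items()}

print(segment("youwantthedoggie", lex))  # ['you', 'want', 'the', 'doggie']
```

Here "doggie" as one word outscores "dog" + "gie", so the search prefers the four-word split; the papers' contribution is learning the lexicon and its probabilities jointly rather than assuming them.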


Cognition | 2009

A Bayesian framework for word segmentation: Exploring the effects of context

Sharon Goldwater; Thomas L. Griffiths; Mark Johnson

Since the experiments of Saffran et al. [Saffran, J., Aslin, R., & Newport, E. (1996). Statistical learning in 8-month-old infants. Science, 274, 1926-1928], there has been a great deal of interest in the question of how statistical regularities in the speech stream might be used by infants to begin to identify individual words. In this work, we use computational modeling to explore the effects of different assumptions the learner might make regarding the nature of words--in particular, how these assumptions affect the kinds of words that are segmented from a corpus of transcribed child-directed speech. We develop several models within a Bayesian ideal observer framework, and use them to examine the consequences of assuming either that words are independent units, or units that help to predict other units. We show through empirical and theoretical results that the assumption of independence causes the learner to undersegment the corpus, with many two- and three-word sequences (e.g. whats that, do you, in the house) misidentified as individual words. In contrast, when the learner assumes that words are predictive, the resulting segmentation is far more accurate. These results indicate that taking context into account is important for a statistical word segmentation strategy to be successful, and raise the possibility that even young infants may be able to exploit more subtle statistical patterns than have usually been considered.


Empirical Methods in Natural Language Processing | 2005

Improving Statistical MT through Morphological Analysis

Sharon Goldwater; David McClosky

In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particularly severe. In addition, much of the morphological variation seen in Czech words is not reflected in either the morphology or syntax of a language like English. In this work, we show that using morphological analysis to modify the Czech input can improve a Czech-English machine translation system. We investigate several different methods of incorporating morphological information, and show that a system that combines these methods yields the best results. Our final system achieves a BLEU score of .333, as compared to .270 for the baseline word-to-word system.


North American Chapter of the Association for Computational Linguistics | 2009

Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars

Mark Johnson; Sharon Goldwater

One of the reasons nonparametric Bayesian inference is attracting attention in computational linguistics is because it provides a principled way of learning the units of generalization together with their probabilities. Adaptor grammars are a framework for defining a variety of hierarchical nonparametric Bayesian models. This paper investigates some of the choices that arise in formulating adaptor grammars and associated inference procedures, and shows that they can have a dramatic impact on performance in an unsupervised word segmentation task. With appropriate adaptor grammars and inference procedures we achieve an 87% word token f-score on the standard Brent version of the Bernstein-Ratner corpus, which is an error reduction of over 35% over the best previously reported results for this corpus.


Automated Software Engineering | 2003

A type system for statically detecting spreadsheet errors

Yanif Ahmad; Tudor Antoniu; Sharon Goldwater; Shriram Krishnamurthi

We describe a methodology for detecting user errors in spreadsheets, using the notion of units as our basic elements of checking. We define the concept of a header and discuss two types of relationships between headers, namely is-a and has-a relationships. With these, we develop a set of rules to assign units to cells in the spreadsheet. We check for errors by ensuring that every cell has a well-formed unit. We describe an implementation of the system that allows the user to check Microsoft Excel spreadsheets. We have run our system on practical examples, and even found errors in published spreadsheets.
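The flavor of header-based unit checking can be sketched as follows. This is a deliberate simplification of the paper's is-a/has-a unit system, with a hypothetical spreadsheet layout: each cell's unit is the set of headers governing its row and column, and an aggregate is flagged unless its operands share a header.

```python
# Simplified sketch of header-based unit checking (hypothetical layout,
# not the paper's full is-a/has-a system): a cell's unit is the set of
# headers above and to the left of it.
col_headers = {"B": "Apples", "C": "Oranges"}   # column B counts Apples
row_headers = {2: "May", 3: "June"}             # row 2 holds May's data

def unit(cell):
    col, row = cell[0], int(cell[1:])
    return frozenset({col_headers[col], row_headers[row]})

def check_sum(operands):
    # A sum is treated as well-formed only if its operands share a header
    # (e.g. Apples across months); mixing Apples with Oranges is flagged.
    units = [unit(c) for c in operands]
    shared = frozenset.intersection(*units)
    return len(shared) > 0

print(check_sum(["B2", "B3"]))  # True: summing Apples over months
print(check_sum(["B2", "C3"]))  # False: mixes Apples/May with Oranges/June
```

The real system derives such units from the spreadsheet itself and handles richer header relationships, but the core check is the same: every cell and every formula result must carry a well-formed unit.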


Psychological Review | 2013

A role for the developing lexicon in phonetic category acquisition.

Naomi H. Feldman; Thomas L. Griffiths; Sharon Goldwater; James L. Morgan

Infants segment words from fluent speech during the same period when they are learning phonetic categories, yet accounts of phonetic category acquisition typically ignore information about the words in which sounds appear. We use a Bayesian model to illustrate how feedback from segmented words might constrain phonetic category learning by providing information about which sounds occur together in words. Simulations demonstrate that word-level information can successfully disambiguate overlapping English vowel categories. Learning patterns in the model are shown to parallel human behavior from artificial language learning tasks. These findings point to a central role for the developing lexicon in phonetic category acquisition and provide a framework for incorporating top-down constraints into models of category learning.


North American Chapter of the Association for Computational Linguistics | 2009

Inducing Compact but Accurate Tree-Substitution Grammars

Trevor Cohn; Sharon Goldwater; Phil Blunsom

Tree substitution grammars (TSGs) are a compelling alternative to context-free grammars for modelling syntax. However, many popular techniques for estimating weighted TSGs (under the moniker of Data Oriented Parsing) suffer from the problems of inconsistency and over-fitting. We present a theoretically principled model which solves these problems using a Bayesian non-parametric formulation. Our model learns compact and simple grammars, uncovering latent linguistic structures (e.g., verb subcategorisation), and in doing so far out-performs a standard PCFG.


International Conference on Acoustics, Speech, and Signal Processing | 2013

A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition

Aren Jansen; Emmanuel Dupoux; Sharon Goldwater; Mark Johnson; Sanjeev Khudanpur; Kenneth Church; Naomi H. Feldman; Hynek Hermansky; Florian Metze; Richard C. Rose; Michael L. Seltzer; Pascal Clark; Ian McGraw; Balakrishnan Varadarajan; Erin Bennett; Benjamin Börschinger; Justin Chiu; Ewan Dunbar; Abdellah Fourtassi; David F. Harwath; Chia-ying Lee; Keith Levin; Atta Norouzian; Vijayaditya Peddinti; Rachael Richardson; Thomas Schatz; Samuel Thomas

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Unsupervised neural network based feature extraction using weak top-down constraints

Herman Kamper; Micha Elsner; Aren Jansen; Sharon Goldwater

Deep neural networks (DNNs) have become a standard component in supervised ASR, used in both data-driven feature extraction and acoustic modelling. Supervision is typically obtained from a forced alignment that provides phone class targets, requiring transcriptions and pronunciations. We propose a novel unsupervised DNN-based feature extractor that can be trained without these resources in zero-resource settings. Using unsupervised term discovery, we find pairs of isolated word examples of the same unknown type; these provide weak top-down supervision. For each pair, dynamic programming is used to align the feature frames of the two words. Matching frames are presented as input-output pairs to a deep autoencoder (AE) neural network. Using this AE as feature extractor in a word discrimination task, we achieve 64% relative improvement over a previous state-of-the-art system, 57% improvement relative to a bottom-up trained deep AE, and come to within 23% of a supervised system.
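The frame-pairing step can be sketched as follows (hypothetical toy features, not the paper's pipeline): dynamic time warping aligns two variable-length feature sequences for the same discovered word type, and the aligned frame pairs become input-output training pairs for the autoencoder.

```python
import math

# Sketch of DTW frame pairing: align two feature sequences and return the
# (i, j) index pairs on the minimal-cost warping path.
def dtw_pairs(a, b):
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = math.dist(a[i-1], b[j-1]) + min(
                cost[i-1][j], cost[i][j-1], cost[i-1][j-1])
    # Backtrack from the end to recover the warping path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i-1][j-1], i-1, j-1),
                   (cost[i-1][j], i-1, j),
                   (cost[i][j-1], i, j-1))
        i, j = step[1], step[2]
    return list(reversed(path))

# Two hypothetical 2-D feature sequences for the same word, one spoken slower.
x = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
y = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
pairs = dtw_pairs(x, y)
print(pairs)  # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

Each pair (i, j) then supplies frame a[i] as input and b[j] as target (and vice versa) when training the correspondence autoencoder.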


International Conference of the IEEE Engineering in Medicine and Biology Society | 2006

A Non-Parametric Bayesian Approach to Spike Sorting

Frank D. Wood; Sharon Goldwater; Michael J. Black

In this work we present and apply infinite Gaussian mixture modeling, a non-parametric Bayesian method, to the problem of spike sorting. As this approach is Bayesian, it allows us to integrate prior knowledge about the problem in a principled way. Because it is non-parametric we are able to avoid model selection, a difficult problem that most current spike sorting methods do not address. We compare this approach to using penalized log likelihood to select the best from multiple finite mixture models trained by expectation maximization. We show favorable offline sorting results on real data and discuss ways to extend our model to online applications.
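The key property, that the number of clusters need not be fixed in advance, can be sketched with a toy 1-D Chinese-restaurant-process Gibbs sampler (known variance, hypothetical spike amplitudes; the paper's model is richer):

```python
import math, random

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Toy CRP Gibbs sampler for a 1-D infinite Gaussian mixture: each point is
# reassigned to an existing cluster (weight ~ size * posterior predictive)
# or to a brand-new cluster (weight ~ alpha), so the number of clusters is
# inferred rather than chosen by model selection.
def crp_gibbs(data, alpha=1.0, sigma=0.5, prior_mu=0.0, prior_sigma=5.0,
              iters=50, seed=0):
    rng = random.Random(seed)
    z = [0] * len(data)  # start with everything in one cluster
    for _ in range(iters):
        for i, x in enumerate(data):
            z[i] = -1  # remove point i from its cluster
            clusters = sorted(set(c for c in z if c >= 0))
            weights, labels = [], []
            for k in clusters:
                members = [data[j] for j in range(len(data)) if z[j] == k]
                # Posterior predictive under a normal prior on the mean.
                prec = 1 / prior_sigma**2 + len(members) / sigma**2
                mu = (prior_mu / prior_sigma**2 + sum(members) / sigma**2) / prec
                sd = math.sqrt(1 / prec + sigma**2)
                weights.append(len(members) * normal_pdf(x, mu, sd))
                labels.append(k)
            # CRP: probability proportional to alpha for a new cluster.
            weights.append(alpha * normal_pdf(
                x, prior_mu, math.sqrt(prior_sigma**2 + sigma**2)))
            labels.append((max(clusters) + 1) if clusters else 0)
            z[i] = rng.choices(labels, weights)[0]
    return z

# Hypothetical spike amplitudes from two putative units.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
labels = crp_gibbs(data)
```

Empty clusters disappear automatically when their last member is reassigned, which is how the sampler shrinks as well as grows the model.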

Collaboration


Dive into Sharon Goldwater's collaborations.

Top Co-Authors

Adam Lopez

University of Edinburgh


Frank Keller

University of Edinburgh
