Gaja Jarosz
Yale University
Publication
Featured research published by Gaja Jarosz.
Meeting of the Association for Computational Linguistics | 2002
Matthew G. Snover; Gaja Jarosz; Michael R. Brent
This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and a novel search algorithm. By examining morphologically rich subsets of an input lexicon, the search identifies highly productive paradigms. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with other recent unsupervised morphology learning algorithms, demonstrate the effectiveness of this technique.
Archive | 2006
Gaja Jarosz
Journal of Child Language | 2010
Gaja Jarosz
This study examines the interacting roles of implicational markedness and frequency from the joint perspectives of formal linguistic theory, phonological acquisition and computational modeling. It investigates the hypothesis that child grammars are rankings of universal constraints, as in Optimality Theory (Prince & Smolensky, 1993/2004), that learning involves a gradual transition from an unmarked initial state to the target grammar, and that order of acquisition is guided by frequency, along the lines of Levelt, Schiller & Levelt (2000). The study reviews empirical findings on syllable structure acquisition in Dutch, German, French and English, and presents novel findings on Polish. These comparisons reveal that, to the extent allowed by implicational markedness universals, frequency covaries with acquisition order across languages. From the computational perspective, the paper shows that the interacting roles of markedness and frequency in a class of constraint-based phonological learning models embody this hypothesis, and their predictions are illustrated via computational simulation.
Phonology | 2013
Gaja Jarosz
This paper explores the relative merits of constraint ranking versus weighting in the context of a major outstanding learnability problem in phonology: learning in the face of hidden structure. Specifically, the paper examines a well-known approach to the structural ambiguity problem, Robust Interpretive Parsing (RIP; Tesar and Smolensky 1998), focusing on its stochastic extension as first described by Boersma (2003). Two related problems with the stochastic formulation of RIP are revealed, rooted in a failure to take full advantage of probabilistic information available in the learner's grammar. To address these problems, two novel parsing strategies are introduced and applied to learning algorithms for both probabilistic ranking and weighting. The novel parsing strategies yield significant improvements in performance, asymmetrically improving performance of OT learners. Once RIP is replaced with the proposed modifications, the apparent advantage of HG over OT learners reported in previous work (Boersma and Pater to appear) disappears.
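The contrast between constraint ranking (OT) and constraint weighting (HG) mentioned in this abstract can be illustrated with a minimal sketch. The tableau, the constraint names `C1` and `C2`, the candidate names, and the weights below are all hypothetical and are not taken from the paper; the sketch shows only how a winner is chosen under each evaluation scheme, not the learning or parsing strategies the paper proposes.

```python
from typing import Dict, List

# Hypothetical toy tableau: violation counts per candidate and constraint.
Tableau = Dict[str, Dict[str, int]]

TABLEAU: Tableau = {
    "cand_a": {"C1": 0, "C2": 3},
    "cand_b": {"C1": 1, "C2": 0},
}


def ot_winner(tableau: Tableau, ranking: List[str]) -> str:
    """Strict domination: compare violation profiles lexicographically,
    highest-ranked constraint first; the smallest profile wins."""
    return min(tableau, key=lambda c: tuple(tableau[c][k] for k in ranking))


def hg_winner(tableau: Tableau, weights: Dict[str, float]) -> str:
    """Weighted evaluation: sum weight * violations per candidate;
    the candidate with the lowest total penalty wins."""
    return min(
        tableau,
        key=lambda c: sum(weights[k] * v for k, v in tableau[c].items()),
    )


# Ranking C1 >> C2: cand_a has no C1 violations, so it wins outright.
ot_choice = ot_winner(TABLEAU, ["C1", "C2"])
# Weights C1=2, C2=1: cand_a costs 3, cand_b costs 2, so cand_b wins.
hg_choice = hg_winner(TABLEAU, {"C1": 2.0, "C2": 1.0})
```

With these toy numbers the two schemes disagree: under ranking, no number of `C2` violations can outweigh a single `C1` violation, whereas under weighting three `C2` violations together cost more than one `C1` violation.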
North American Chapter of the Association for Computational Linguistics | 2006
Gaja Jarosz
This paper proposes an unsupervised learning algorithm for Optimality Theoretic grammars, which learns a complete constraint ranking and a lexicon given only unstructured surface forms and morphological relations. The learning algorithm, which is based on the Expectation-Maximization algorithm, gradually maximizes the likelihood of the observed forms by adjusting the parameters of a probabilistic constraint grammar and a probabilistic lexicon. The paper presents the algorithm's results on three constructed language systems with different types of hidden structure: voicing neutralization, stress, and abstract vowels. In all cases the algorithm learns the correct constraint ranking and lexicon. The paper argues that the algorithm's ability to identify correct, restrictive grammars is due in part to its explicit reliance on the Optimality Theoretic notion of Richness of the Base.
Cognitive Science | 2015
Shira Calamaro; Gaja Jarosz
Phonological rules create alternations in the phonetic realizations of related words. These rules must be learned by infants in order to identify the phonological inventory, the morphological structure, and the lexicon of a language. Recent work proposes a computational model for the learning of one kind of phonological alternation, allophony (Peperkamp, Le Calvez, Nadal, & Dupoux, 2006). This paper extends the model to account for learning of a broader set of phonological alternations and the formalization of these alternations as general rules. In Experiment 1, we apply the original model to new data in Dutch and demonstrate its limitations in learning nonallophonic rules. In Experiment 2, we extend the model to allow it to learn general rules for alternations that apply to a class of segments. In Experiment 3, the model is further extended to allow for generalization by context; we argue that this generalization must be constrained by linguistic principles.
Language Learning and Development | 2013
Gaja Jarosz; J. Alex Johnson
This study is a systematic analysis of the information content of a wide range of distributional cues to word boundaries, individually and in combination, in naturally occurring child-directed speech across three languages (English, Polish, and Turkish). The paper presents a series of statistical analyses examining the relative predictive strength of these cues, the overlap in the information about word boundaries they contain, and the variability in their relative strengths and interactions across the languages. We find that the information content of individual distributional cues is not constant across languages, with relative reliability of cues varying across languages and with individual cues providing much less information in Polish and Turkish than in English. However, we also find that when these cues are combined, the cumulative information content of a diverse array of distributional cues provides a significant source of information about word boundaries across all three languages.
Language Acquisition | 2017
Gaja Jarosz; Shira Calamaro; Jason Zentz
This article examines phonological development and its relationship to input statistics. Using novel data from a longitudinal corpus of spontaneous child speech in Polish, we evaluate and compare the predictions of a variety of input-based phonotactic models for syllable structure acquisition. We find that many commonly examined input statistics can make dramatically different predictions, as do different assumptions about the representational units over which statistics are calculated. We find that development is sensitive to multiple abstract units of phonological representation, supporting a crucial role for feature-based generalization. We also identify departures between the predictions of the best phonotactic models and children’s production patterns that indicate that input sensitivity alone cannot fully explain the developmental patterns. We discuss the role of universal markedness and phonetic difficulty and argue that a full explanation requires reference to these biases.
Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM | 2014
Natalie M. Schrimpf; Gaja Jarosz
Developmental research indicates that infants use low-level statistical regularities, or phonotactics, to segment words from continuous speech. In this paper, we present a segmentation framework that enables the direct comparison of different phonotactic models for segmentation. We compare a model using phoneme transitional probabilities, which have been widely used in computational models, to syllable-based bigram models, which have played a prominent role in the developmental literature. We also introduce a novel estimation method, and compare it to other strategies for estimating the parameters of the phonotactic models from unsegmented data. The results show that syllable-based models outperform the phoneme models, specifically in the context of improved unsupervised parameter estimation. The syllable-based transitional probability model achieves a word token F-score of nearly 80%, the highest reported performance for a phonotactic segmentation model with no lexicon.
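The general idea of segmenting with transitional probabilities can be sketched as follows. This is a generic illustration, not the framework or estimation method described in the paper: the toy corpus, the threshold rule, and the function names `transitional_probs` and `segment` are assumptions made for the example. Each character stands in for one unit (a phoneme or a syllable).

```python
from collections import Counter
from typing import Dict, List, Tuple


def transitional_probs(utterances: List[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(next unit | current unit) from unsegmented utterances."""
    firsts: Counter = Counter()
    pairs: Counter = Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            firsts[a] += 1          # times `a` is followed by anything
            pairs[(a, b)] += 1      # times `a` is followed by `b`
    return {pair: n / firsts[pair[0]] for pair, n in pairs.items()}


def segment(utterance: str, tp: Dict[Tuple[str, str], float],
            threshold: float) -> List[str]:
    """Posit a word boundary wherever the transitional probability between
    adjacent units falls below `threshold` (an assumed decision rule)."""
    words: List[str] = []
    start = 0
    for i in range(1, len(utterance)):
        if tp.get((utterance[i - 1], utterance[i]), 0.0) < threshold:
            words.append(utterance[start:i])
            start = i
    words.append(utterance[start:])
    return words


# Hypothetical unsegmented corpus built from the toy words "ab", "cd", "ef".
corpus = ["abcd", "abef", "cdab", "efab", "cdef", "efcd"]
tp = transitional_probs(corpus)
result = segment("abcdef", tp, threshold=0.75)
```

In this toy corpus, within-word transitions such as `a` to `b` always occur, while transitions across word boundaries such as `b` to `c` occur only half the time, so a threshold between the two values recovers the intended words.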
Annual Meeting of the Berkeley Linguistics Society | 2005
Gaja Jarosz