Jonathan May
University of Southern California
Publications
Featured research published by Jonathan May.
Computational Linguistics | 2008
Jonathan Graehl; Kevin Knight; Jonathan May
Many probabilistic models for natural language are now written in terms of hierarchical tree structure. Tree-based modeling still lacks many of the standard tools taken for granted in (finite-state) string-based modeling. The theory of tree transducer automata provides a possible framework to draw on, as it has been worked out in an extensive literature. We motivate the use of tree transducers for natural language and address the training problem for probabilistic tree-to-tree and tree-to-string transducers.
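A minimal sketch of the kind of object the training problem concerns: a toy weighted tree-to-string transducer applied to one input tree. The states, rules, weights, and output tokens below are invented for illustration and are not taken from the article.

```python
# Toy weighted tree-to-string transducer. A tree is a (label, child, ...) tuple;
# leaves are 1-tuples. Rules, weights, and output tokens are illustrative only.

# rule key: (state, root_label, arity); value: (weight, output template)
# template items are literal output strings or (state, child_index) recursion slots.
RULES = {
    ("q", "S", 2):   (1.0, [("q", 1), ("q", 0)]),   # reorder: emit VP before NP
    ("q", "NP", 1):  (0.9, [("q", 0)]),
    ("q", "VP", 1):  (0.8, [("q", 0)]),
    ("q", "boy", 0): (1.0, ["shounen"]),
    ("q", "runs", 0): (1.0, ["hashiru"]),
}

def transduce(state, tree):
    """Return (weight, output tokens) for one derivation of `tree` from `state`."""
    label, children = tree[0], tree[1:]
    weight, template = RULES[(state, label, len(children))]
    out = []
    for item in template:
        if isinstance(item, str):            # literal output token
            out.append(item)
        else:                                # recurse into the indicated child
            sub_state, child_idx = item
            w, toks = transduce(sub_state, children[child_idx])
            weight *= w
            out.extend(toks)
    return weight, out

tree = ("S", ("NP", ("boy",)), ("VP", ("runs",)))
print(transduce("q", tree))   # weight 0.72..., tokens ['hashiru', 'shounen']
```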
international conference on implementation and application of automata | 2006
Jonathan May; Kevin Knight
The availability of weighted finite-state string automata toolkits made possible great advances in natural language processing. However, recent advances in syntax-based NLP model design are unsuitable for these toolkits. To combat this problem, we introduce a weighted finite-state tree automata toolkit, which incorporates recent developments in weighted tree automata theory and is useful for natural language applications such as machine translation, sentence compression, question answering, and many more.
empirical methods in natural language processing | 2016
Barret Zoph; Deniz Yuret; Jonathan May; Kevin Knight
The encoder-decoder framework for neural machine translation (NMT) has been shown effective in large data scenarios, but is much less effective for low-resource languages. We present a transfer learning method that significantly improves Bleu scores across a range of low-resource languages. Our key idea is to first train a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training. Using our transfer learning method we improve baseline NMT models by an average of 5.6 Bleu on four low-resource language pairs. Ensembling and unknown word replacement add another 2 Bleu which brings the NMT performance on low-resource machine translation close to a strong syntax based machine translation (SBMT) system, exceeding its performance on one language pair. Additionally, using the transfer learning model for re-scoring, we can improve the SBMT system by an average of 1.3 Bleu, improving the state-of-the-art on low-resource machine translation.
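A minimal sketch of the parent-to-child transfer idea, using plain NumPy arrays as stand-in model parameters; the parameter names, shapes, and vocabulary sizes are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(src_vocab, tgt_vocab, dim=4):
    """Stand-in NMT parameters: source/target embeddings plus a shared block."""
    return {
        "src_embed": rng.normal(size=(src_vocab, dim)),
        "tgt_embed": rng.normal(size=(tgt_vocab, dim)),
        "decoder":   rng.normal(size=(dim, dim)),
    }

# 1. Train the high-resource parent pair (training itself is omitted here).
parent = init_params(src_vocab=50000, tgt_vocab=30000)

# 2. Initialize the child (low-resource source language, same target language)
#    from the parent: keep target-side and shared parameters, but reset the
#    source embeddings, since the child's source vocabulary is unrelated.
child = init_params(src_vocab=8000, tgt_vocab=30000)
child["tgt_embed"] = parent["tgt_embed"].copy()
child["decoder"] = parent["decoder"].copy()

# 3. Constrain child training, e.g. by freezing the transferred target embeddings.
frozen = {"tgt_embed"}
trainable = [name for name in child if name not in frozen]
print(trainable)   # ['src_embed', 'decoder']
```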
Computational Linguistics | 2010
Wei Wang; Jonathan May; Kevin Knight; Daniel Marcu
This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
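As one concrete illustration of re-structuring, the sketch below binarizes a flat parse constituent so that its substructures can be shared by more rules; this is a generic example, not the EM-learned re-structuring described in the article.

```python
# Left-branching binarization of an n-ary constituent: one simple kind of tree
# re-structuring that exposes reusable substructures. Illustration only.

def binarize(tree):
    """Tree = (label, child, ...); collapse >2-ary nodes into nested -BAR nodes."""
    label, children = tree[0], [binarize(c) for c in tree[1:]]
    if len(children) <= 2:
        return (label,) + tuple(children)
    left = children[0]
    for child in children[1:-1]:
        left = (label + "-BAR", left, child)   # intermediate node
    return (label, left, children[-1])

flat = ("NP", ("DT", ("the",)), ("JJ", ("quick",)), ("JJ", ("brown",)), ("NN", ("fox",)))
print(binarize(flat))
# ('NP', ('NP-BAR', ('NP-BAR', ('DT', ('the',)), ('JJ', ('quick',))), ('JJ', ('brown',))), ('NN', ('fox',)))
```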
Archive | 2009
Kevin Knight; Jonathan May
We explain why weighted automata are an attractive knowledge representation for natural language problems. We first trace the close historical ties between the two fields, then present two complex real-world problems, transliteration and translation. These problems are usefully decomposed into a pipeline of weighted transducers, and weights can be set to maximize the likelihood of a training corpus using standard algorithms. We additionally describe the representation of language models, critical data sources in natural language processing, as weighted automata. We outline the wide range of work in natural language processing that makes use of weighted string and tree automata and describe current work and challenges.
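A minimal worked example of the pipeline decomposition described here: a transliteration candidate is scored by the product of a channel-model weight and a language-model weight. The candidates and probabilities are invented purely to show the shape of the computation.

```python
# Toy noisy-channel decomposition: score(e) = P_lm(e) * P_channel(observed | e).
# Candidates and probabilities are invented for illustration.

lm = {"knight": 0.02, "night": 0.03, "nite": 0.001}    # language model P(e)
channel = {"knight": 0.3, "night": 0.3, "nite": 0.2}    # P("naito" | e), toy values

def decode(candidates):
    """Pick the candidate maximizing the product of the two weighted components."""
    return max(candidates, key=lambda e: lm[e] * channel[e])

print(decode(["knight", "night", "nite"]))   # 'night' (0.03 * 0.3 = 0.009)
```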
language and technology conference | 2006
Jonathan May; Kevin Knight
Ranked lists of output trees from syntactic statistical NLP applications frequently contain multiple repeated entries. This redundancy leads to misrepresentation of tree weight and reduced information for debugging and tuning purposes. It is chiefly due to nondeterminism in the weighted automata that produce the results. We introduce an algorithm that determinizes such automata while preserving proper weights, returning the sum of the weights of all multiply derived trees. We also demonstrate our algorithm's effectiveness on two large-scale tasks.
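A minimal sketch of the effect the algorithm has on a ranked list: identical trees produced by different derivations are merged and their weights summed before re-ranking. The trees and weights are invented, and this shows only the outcome, not the determinization construction itself.

```python
from collections import defaultdict

# Miniature n-best list with a duplicate tree (trees and weights are invented).
nbest = [
    (("S", ("NP",), ("VP",)), 0.30),
    (("S", ("NP", ("DT",)), ("VP",)), 0.25),
    (("S", ("NP",), ("VP",)), 0.20),     # duplicate derivation of the first tree
]

# Merge identical trees and sum their weights, as weight-preserving
# determinization guarantees for the underlying automaton, then re-rank.
merged = defaultdict(float)
for tree, weight in nbest:
    merged[tree] += weight

for tree, weight in sorted(merged.items(), key=lambda kv: -kv[1]):
    print(weight, tree)
# 0.5  ('S', ('NP',), ('VP',))
# 0.25 ('S', ('NP', ('DT',)), ('VP',))
```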
international conference on implementation and application of automata | 2007
Johanna Högberg; Andreas Maletti; Jonathan May
We improve an existing bisimulation minimisation algorithm for tree automata by introducing backward and forward bisimulations and developing minimisation algorithms for them. Minimisation via forward bisimulation is also effective for deterministic automata and faster than the previous algorithm. Minimisation via backward bisimulation generalises the previous algorithm and is thus more effective but just as fast. We demonstrate implementations of these algorithms on a typical task in natural language processing.
empirical methods in natural language processing | 2015
Michael Pust; Ulf Hermjakob; Kevin Knight; Daniel Marcu; Jonathan May
We present a parser for Abstract Meaning Representation (AMR). We treat English-to-AMR conversion within the framework of string-to-tree, syntax-based machine translation (SBMT). To make this work, we transform the AMR structure into a form suitable for the mechanics of SBMT and useful for modeling. We introduce an AMR-specific language model and add data and features drawn from semantic resources. Our resulting AMR parser significantly improves upon state-of-the-art results.
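One simple way to make a rooted, re-entrant AMR graph tree-shaped for tree-based machinery is to clone nodes that are reached more than once. The sketch below, with an invented toy AMR, illustrates that idea only; it is not the specific transformation used in the paper.

```python
# Unfold an AMR-like graph into a tree by duplicating re-entrant nodes.
# Assumes an acyclic graph; the example AMR is invented for illustration.

# variable -> (concept, [(role, child_variable), ...])
graph = {
    "w": ("want-01", [("ARG0", "b"), ("ARG1", "g")]),
    "b": ("boy", []),
    "g": ("go-01", [("ARG0", "b")]),     # "b" is re-entrant: the boy wants to go
}

def to_tree(var):
    """Unfold the graph below `var` into a tree, cloning shared nodes."""
    concept, edges = graph[var]
    return (concept,) + tuple((role, to_tree(child)) for role, child in edges)

print(to_tree("w"))
# ('want-01', ('ARG0', ('boy',)), ('ARG1', ('go-01', ('ARG0', ('boy',)))))
```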
Theoretical Computer Science | 2009
Johanna Högberg; Andreas Maletti; Jonathan May
We improve on an existing [P.A. Abdulla, J. Högberg, L. Kaati, Bisimulation minimization of tree automata, International Journal of Foundations of Computer Science 18(4) (2007) 699-713] bisimulation minimization algorithm for finite-state tree automata by introducing backward and forward bisimulation and developing minimization algorithms for them. Minimization via forward bisimulation is also effective on deterministic tree automata, faster than the previous algorithm, and yields the minimal equivalent deterministic tree automaton. Minimization via backward bisimulation generalizes the previous algorithm and can yield smaller automata but is just as fast. We demonstrate implementations of these algorithms on a typical task in natural language processing.
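A sketch of the partition-refinement idea that underlies bisimulation minimisation, shown here on an ordinary string DFA (Moore-style refinement) rather than a tree automaton; the automaton is invented, and the code illustrates only the refinement loop, not the paper's forward/backward tree constructions.

```python
# Partition refinement on a toy DFA: start from the accepting / non-accepting
# split, then split any block whose states disagree on which block each input
# symbol leads to. States left in the same block are behaviourally equivalent.

states = {0, 1, 2, 3}
finals = {3}
alphabet = "ab"
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 3, (1, "b"): 2,
         (2, "a"): 3, (2, "b"): 2, (3, "a"): 3, (3, "b"): 3}

def minimise():
    blocks = [finals, states - finals]
    changed = True
    while changed:
        changed = False
        new_blocks = []
        for block in blocks:
            # Group states by the tuple of destination blocks, one per symbol.
            sig = {}
            for s in block:
                key = tuple(next(i for i, b in enumerate(blocks) if delta[(s, a)] in b)
                            for a in alphabet)
                sig.setdefault(key, set()).add(s)
            new_blocks.extend(sig.values())
            if len(sig) > 1:
                changed = True
        blocks = new_blocks
    return blocks

print(minimise())   # states 1 and 2 collapse, e.g. [{3}, {0}, {1, 2}]
```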
north american chapter of the association for computational linguistics | 2016
Jonathan May
In this report we summarize the results of the SemEval 2016 Task 8: Meaning Representation Parsing. Participants were asked to generate Abstract Meaning Representation (AMR) (Banarescu et al., 2013) graphs for a set of English sentences in the news and discussion forum domains. Eleven sites submitted valid systems. The availability of state-of-the-art baseline systems was a key factor in lowering the bar to entry; many submissions relied on CAMR (Wang et al., 2015b; Wang et al., 2015a) as a baseline system and added extensions to it to improve scores. The evaluation set was quite difficult to parse, particularly due to creative approaches to word representation in the web forum portion. The top scoring systems scored 0.62 F1 according to the Smatch (Cai and Knight, 2013) evaluation heuristic. We show some sample sentences along with a comparison of system parses and perform quantitative ablative studies.
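A simplified sketch of the triple-overlap F1 that Smatch computes, assuming the variable mapping between the two graphs is already fixed; real Smatch searches over variable mappings by hill climbing, and the triples below are invented.

```python
# Simplified Smatch-style score: F1 over relation triples shared by a gold AMR
# and a system AMR, with the variable mapping taken as given. Triples invented.

gold = {("w", "instance", "want-01"), ("b", "instance", "boy"),
        ("g", "instance", "go-01"), ("w", "ARG0", "b"),
        ("w", "ARG1", "g"), ("g", "ARG0", "b")}

system = {("w", "instance", "want-01"), ("b", "instance", "boy"),
          ("g", "instance", "go-01"), ("w", "ARG0", "b"),
          ("w", "ARG1", "g")}          # misses the re-entrant ("g", "ARG0", "b")

matched = len(gold & system)
precision = matched / len(system)
recall = matched / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))   # 1.0 0.833 0.909
```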