James R. Curran
University of Sydney
Publications
Featured research published by James R. Curran.
Monthly Notices of the Royal Astronomical Society | 2003
Tom Mauch; Tara Murphy; Helen J. Buttery; James R. Curran; Richard W. Hunstead; B. Piestrzynski; James Robertson; Elaine M. Sadler
This paper is the second in a series describing the Sydney University Molonglo Sky Survey (SUMSS) being carried out at 843 MHz with the Molonglo Observatory Synthesis Telescope (MOST). The survey will consist of ∼590 4.3° × 4.3° mosaic images with 45 × 45 cosec|δ| arcsec² resolution, and a source catalogue. In this paper we describe the initial release (version 1.0) of the source catalogue consisting of 107 765 radio sources made by fitting elliptical Gaussians in 271 SUMSS 4.3° × 4.3° mosaics to a limiting peak brightness of 6 mJy beam⁻¹ at δ ≤ -50° and 10 mJy beam⁻¹ at δ > -50°. The catalogue covers approximately 3500 deg² of the southern sky with δ ≤ -30°, about 43 per cent of the total survey area. Positions in the catalogue are accurate to within 1-2 arcsec for sources with peak brightness A₈₄₃ ≥ 20 mJy beam⁻¹ and are always better than 10 arcsec. The internal flux density scale is accurate to within 3 per cent. Image artefacts have been classified using a decision tree, which correctly identifies and rejects spurious sources in over 96 per cent of cases. Analysis of the catalogue shows that it is highly uniform and is complete to 8 mJy at δ ≤ -50° and 18 mJy at δ > -50°. In this release of the catalogue about 7000 sources are found in the overlap region with the National Radio Astronomy Observatory Very Large Array Sky Survey at 1.4 GHz. We calculate a median spectral index of α = -0.83 between 1.4 GHz and 843 MHz. This version of the catalogue will be released via the World Wide Web with future updates as new mosaics are released.
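The spectral index quoted at the end of the abstract follows the usual radio-astronomy convention S ∝ ν^α, so α = log(S₁.₄ GHz/S₈₄₃ MHz) / log(1400/843). A minimal sketch of that calculation (the flux densities in the example are invented, not catalogue values):

```python
import math

def spectral_index(s_low, s_high, nu_low=843.0, nu_high=1400.0):
    """Spectral index alpha under the convention S ∝ nu^alpha, computed from
    flux densities (mJy) measured at two frequencies (MHz)."""
    return math.log(s_high / s_low) / math.log(nu_high / nu_low)

# Hypothetical source: 20.0 mJy at 843 MHz and 13.1 mJy at 1.4 GHz
print(round(spectral_index(20.0, 13.1), 2))  # -0.83, matching the catalogue's median value
```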
Computational Linguistics | 2007
Stephen Clark; James R. Curran
This article describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are full parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminative training is used to estimate the models, which requires incorrect parses for each sentence in the training data as well as the correct parse. The lexicalized grammar formalism used is Combinatory Categorial Grammar (CCG), and the grammar is automatically extracted from CCGbank, a CCG version of the Penn Treebank. The combination of discriminative training and an automatically extracted grammar leads to a significant memory requirement (up to 25 GB), which is satisfied using a parallel implementation of the BFGS optimization algorithm running on a Beowulf cluster. Dynamic programming over a packed chart, in combination with the parallel implementation, allows us to solve one of the largest-scale estimation problems in the statistical parsing literature in under three hours. A key component of the parsing system, for both training and testing, is a Maximum Entropy supertagger which assigns CCG lexical categories to words in a sentence. The supertagger makes the discriminative training feasible, and also leads to a highly efficient parser. Surprisingly, given CCG's spurious ambiguity, the parsing speeds are significantly higher than those reported for comparable parsers in the literature. We also extend the existing parsing techniques for CCG by developing a new model and efficient parsing algorithm which exploits all derivations, including CCG's nonstandard derivations. This model and parsing algorithm, when combined with normal-form constraints, give state-of-the-art accuracy for the recovery of predicate-argument dependencies from CCGbank. The parser is also evaluated on DepBank and compared against the RASP parser, outperforming RASP overall and on the majority of relation types. The evaluation on DepBank raises a number of issues regarding parser evaluation. This article provides a comprehensive blueprint for building a wide-coverage CCG parser. We demonstrate that both accurate and highly efficient parsing is possible with CCG.
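As a rough illustration of the conditional log-linear form these parsing models take, the sketch below scores a handful of explicitly enumerated candidate derivations; the paper instead computes the normalisation over a packed chart, and the feature names and weights here are invented:

```python
import math
from collections import Counter

def parse_probabilities(candidate_features, weights):
    """Log-linear model over complete parses: P(d|s) ∝ exp(sum_i w_i * f_i(d)).

    candidate_features: list of feature-count dicts, one per candidate derivation.
    weights: dict mapping feature name -> learned weight.
    """
    scores = [sum(weights.get(f, 0.0) * c for f, c in feats.items())
              for feats in candidate_features]
    z = sum(math.exp(s) for s in scores)  # partition function over this candidate set
    return [math.exp(s) / z for s in scores]

# Hypothetical feature counts for two competing derivations of one sentence
derivations = [Counter({"rule=fa": 3, "dep=subj": 1}),
               Counter({"rule=ba": 2, "dep=obj": 1})]
weights = {"rule=fa": 0.4, "dep=subj": 0.7, "rule=ba": 0.1, "dep=obj": 0.2}
print(parse_probabilities(derivations, weights))
```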
Meeting of the Association for Computational Linguistics | 2004
Stephen Clark; James R. Curran
This paper describes and evaluates log-linear parsing models for Combinatory Categorial Grammar (CCG). A parallel implementation of the L-BFGS optimisation algorithm is described, which runs on a Beowulf cluster allowing the complete Penn Treebank to be used for estimation. We also develop a new efficient parsing algorithm for CCG which maximises expected recall of dependencies. We compare models which use all CCG derivations, including non-standard derivations, with normal-form models. The performances of the two models are comparable and the results are competitive with existing wide-coverage CCG parsers.
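The maximum-expected-recall decoder can be pictured as choosing the derivation whose dependencies carry the highest total marginal probability. A toy version, assuming the marginals are already available (the paper computes them over a packed chart) and using invented dependency tuples:

```python
def max_expected_recall(candidates, dep_marginals):
    """Pick the derivation whose dependencies have the highest summed marginal
    probability -- a toy version of maximising expected dependency recall.

    candidates: list of dependency sets, one per candidate derivation.
    dep_marginals: dict mapping a dependency to its marginal probability.
    """
    return max(candidates,
               key=lambda deps: sum(dep_marginals.get(d, 0.0) for d in deps))

# Hypothetical dependencies: (head, relation, dependent)
marginals = {("ate", "subj", "cat"): 0.9,
             ("ate", "obj", "fish"): 0.8,
             ("fish", "mod", "raw"): 0.3}
candidates = [{("ate", "subj", "cat"), ("ate", "obj", "fish")},
              {("ate", "subj", "cat"), ("fish", "mod", "raw")}]
print(max_expected_recall(candidates, marginals))
```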
Meeting of the Association for Computational Linguistics | 2007
James R. Curran; Stephen Clark; Johan Bos
The statistical modelling of language, together with advances in wide-coverage grammar development, has led to high levels of robustness and efficiency in NLP systems and made linguistically motivated large-scale language processing a possibility (Matsuzaki et al., 2007; Kaplan et al., 2004). This paper describes an NLP system which is based on syntactic and semantic formalisms from theoretical linguistics, and which we have used to analyse the entire Gigaword corpus (1 billion words) in less than 5 days using only 18 processors. This combination of detail and speed of analysis represents a breakthrough in NLP technology.
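For a sense of scale, the throughput implied by those figures works out roughly as below (assuming a uniform five-day run spread across all 18 processors; back-of-envelope arithmetic, not a reported measurement):

```python
# Rough throughput implied by the abstract's figures (illustrative only).
words = 1_000_000_000
days, processors = 5, 18
seconds = days * 24 * 3600
print(f"{words / seconds:,.0f} words/sec overall")                     # ~2,300 words/sec
print(f"{words / seconds / processors:,.0f} words/sec per processor")  # ~130 words/sec
```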
North American Chapter of the Association for Computational Linguistics | 2003
James R. Curran; Stephen Clark
Named Entity Recognition (NER) systems need to integrate a wide variety of information for optimal performance. This paper demonstrates that a maximum entropy tagger can effectively encode such information and identify named entities with very high accuracy. The tagger uses features which can be obtained for a variety of languages and works effectively not only for English, but also for other languages such as German and Dutch.
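The kind of broadly language-independent information such a tagger can encode is easy to sketch as feature templates over a token and its context. The templates below are illustrative; the paper's actual feature set differs:

```python
def ner_features(tokens, i, prev_tag):
    """Illustrative maximum-entropy feature templates for NER tagging:
    surface form, word shape, affixes and local context."""
    w = tokens[i]
    shape = "".join("X" if c.isupper() else "x" if c.islower() else
                    "d" if c.isdigit() else c for c in w)
    feats = {
        f"word={w.lower()}",
        f"shape={shape}",          # e.g. "Xxxxxx" for a capitalised word
        f"prefix3={w[:3].lower()}",
        f"suffix3={w[-3:].lower()}",
        f"prev_tag={prev_tag}",
    }
    if i > 0:
        feats.add(f"prev_word={tokens[i-1].lower()}")
    return feats

print(ner_features(["Sydney", "University", "announced"], 0, "<s>"))
```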
Meeting of the Association for Computational Linguistics | 2002
James R. Curran; Marc Moens
The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the trade-off between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
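A coarse-then-fine matching scheme of the general shape described above can be sketched as follows; the attribute weights, the choice of Jaccard as the fine-grained measure, and the vocabulary entries are all invented for illustration rather than taken from the paper:

```python
def jaccard(a, b):
    """Simple set-overlap similarity between two context-attribute sets."""
    return len(a & b) / len(a | b) if a or b else 0.0

def nearest_neighbours(target, vocab, k_canonical=3):
    """Two-stage thesaurus extraction sketch.

    vocab: dict word -> dict of context attribute -> weight.
    Stage 1 (coarse): keep only candidates sharing one of the target's
    top-weighted ("canonical") attributes.
    Stage 2 (fine): rank the survivors by full-set similarity.
    """
    attrs = vocab[target]
    canonical = set(sorted(attrs, key=attrs.get, reverse=True)[:k_canonical])
    survivors = [w for w in vocab if w != target and canonical & set(vocab[w])]
    scored = [(w, jaccard(set(attrs), set(vocab[w]))) for w in survivors]
    return sorted(scored, key=lambda pair: -pair[1])

vocab = {
    "dog": {"bark": 5, "pet": 3, "tail": 2},
    "cat": {"pet": 4, "tail": 3, "purr": 5},
    "car": {"drive": 6, "wheel": 4},
}
print(nearest_neighbours("dog", vocab))  # [('cat', 0.5)]; 'car' is pruned at the coarse stage
```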
International Conference on Computational Linguistics | 2004
Johan Bos; Stephen Clark; Mark Steedman; James R. Curran; Julia Hockenmaier
This paper shows how to construct semantic representations from the derivations produced by a wide-coverage CCG parser. Unlike the dependency structures returned by the parser itself, these can be used directly for semantic interpretation. We demonstrate that well-formed semantic representations can be produced for over 97% of the sentences in unseen WSJ text. We believe this is a major step towards wide-coverage semantic interpretation, one of the key objectives of the field of NLP.
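The basic idea, building a meaning representation by combining lexical semantic terms along the derivation, can be shown in miniature; the real system produces much richer representations from full CCG derivations, and the lexicon and predicate names below are placeholders:

```python
# Toy lexical semantics: each word contributes a term, and the derivation
# tells us how to combine them.  (Placeholder predicates, not the system's output.)
lexicon = {
    "John":  "john",
    "walks": lambda subj: f"walk({subj})",
}

# The derivation  NP  (S\NP)  =>  S  corresponds to applying the verb's
# semantic term to the subject's term.
meaning = lexicon["walks"](lexicon["John"])
print(meaning)  # walk(john)
```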
Artificial Intelligence | 2013
Ben Hachey; Will Radford; Joel Nothman; Matthew Honnibal; James R. Curran
Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or nil. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.
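A skeletal search-then-disambiguate pipeline, with the acronym handling the evaluation found so helpful reduced to a single lookup (the alias table, titles and popularity scores below are invented; the reimplemented systems are considerably richer):

```python
def candidates(mention, titles, acronym_map=None):
    """Candidate search: exact or substring title match, after expanding the
    mention via a document-level acronym table (illustrative only)."""
    acronym_map = acronym_map or {}
    query = acronym_map.get(mention, mention).lower()
    return [t for t in titles if t.lower() == query or query in t.lower()]

def link(mention, titles, popularity, acronym_map=None):
    """Disambiguate by picking the most popular candidate, or NIL if search fails."""
    cands = candidates(mention, titles, acronym_map)
    return max(cands, key=lambda t: popularity.get(t, 0)) if cands else "NIL"

titles = ["Australian Broadcasting Corporation", "American Broadcasting Company"]
popularity = {"Australian Broadcasting Corporation": 120,
              "American Broadcasting Company": 300}
# With the acronym expanded from elsewhere in the document, "ABC" links correctly.
print(link("ABC", titles, popularity, {"ABC": "Australian Broadcasting Corporation"}))
```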
International Conference on Computational Linguistics | 2004
Stephen Clark; James R. Curran
This paper describes the role of supertagging in a wide-coverage CCG parser which uses a log-linear model to select an analysis. The supertagger reduces the derivation space over which model estimation is performed, reducing the space required for discriminative training. It also dramatically increases the speed of the parser. We show that large increases in speed can be obtained by tightly integrating the supertagger with the CCG grammar and parser. This is the first work we are aware of to successfully integrate a supertagger with a full parser which uses an automatically extracted grammar. We also further reduce the derivation space using constraints on category combination. The result is an accurate wide-coverage CCG parser which is an order of magnitude faster than comparable systems for other linguistically motivated formalisms.
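The tight integration can be caricatured as: give the parser only the supertagger's most confident categories first, and widen the beam only when no spanning analysis is found. A sketch, assuming a supertagger that returns per-word category probabilities and a parser exposed as a function returning a parse or None (both are stand-ins, not the system's interfaces):

```python
def assign_categories(tag_probs, beta):
    """Keep, for each word, every lexical category whose probability is within
    a factor beta of that word's best category (the usual supertagger beam)."""
    return {w: [c for c, p in probs.items() if p >= beta * max(probs.values())]
            for w, probs in tag_probs.items()}

def parse_with_adaptive_supertagging(tag_probs, try_parse, betas=(0.1, 0.05, 0.01)):
    """Start with a tight beam and relax it only when the parser fails."""
    for beta in betas:
        parse = try_parse(assign_categories(tag_probs, beta))
        if parse is not None:
            return parse
    return None

# Trivial demonstration: a stand-in parser that succeeds whenever every word
# has at least one category.
demo_probs = {"the": {"NP/N": 0.9, "N": 0.1}, "cat": {"N": 0.8, "NP": 0.2}}
demo_parse = lambda cats: "parse found" if all(cats.values()) else None
print(parse_with_adaptive_supertagging(demo_probs, demo_parse))
```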
North American Chapter of the Association for Computational Linguistics | 2003
Stephen Clark; James R. Curran; Miles Osborne
This paper investigates bootstrapping part-of-speech taggers using co-training, in which two taggers are iteratively re-trained on each other's output. Since the output of the taggers is noisy, there is a question of which newly labelled examples to add to the training set. We investigate selecting examples by directly maximising tagger agreement on unlabelled data, a method which has been theoretically and empirically motivated in the co-training literature. Our results show that agreement-based co-training can significantly improve tagging performance for small seed datasets. Further results show that this form of co-training considerably outperforms self-training. However, we find that simply re-training on all the newly labelled data can, in some cases, yield comparable results to agreement-based co-training, with only a fraction of the computational cost.
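The overall loop is straightforward to sketch. The stand-in taggers below are trivially simple unigram taggers (the paper uses two markedly different POS taggers), and the selection criterion is reduced to whole-sentence agreement rather than explicitly maximising agreement on unlabelled data:

```python
from collections import Counter, defaultdict

class UnigramTagger:
    """Minimal stand-in tagger: most frequent tag per word, with a default tag.
    Real co-training would use two taggers with genuinely different views."""
    def __init__(self, default_tag):
        self.default_tag = default_tag
        self.model = {}

    def train(self, data):
        counts = defaultdict(Counter)
        for sent, tags in data:
            for word, tag in zip(sent, tags):
                counts[word][tag] += 1
        self.model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, sent):
        return [self.model.get(w, self.default_tag) for w in sent]

def cotrain(tagger_a, tagger_b, seed, unlabelled, rounds=3, batch=2):
    """Co-training loop: each round, both taggers are retrained, a batch of
    unlabelled sentences is tagged by both, and a sentence is added to both
    training sets only when the two taggers produce identical tag sequences
    (a simplification of the paper's agreement-based selection)."""
    data_a, data_b = list(seed), list(seed)
    pool = list(unlabelled)
    for _ in range(rounds):
        tagger_a.train(data_a)
        tagger_b.train(data_b)
        sample, pool = pool[:batch], pool[batch:]
        for sent in sample:
            tags_a, tags_b = tagger_a.tag(sent), tagger_b.tag(sent)
            if tags_a == tags_b:
                data_a.append((sent, tags_b))  # each tagger learns from the other's output
                data_b.append((sent, tags_a))
    return tagger_a, tagger_b

seed = [(["the", "cat", "sat"], ["DT", "NN", "VBD"])]
unlabelled = [["the", "cat", "sat"], ["the", "dog", "sat"]]
a, b = cotrain(UnigramTagger("NN"), UnigramTagger("DT"), seed, unlabelled)
print(a.tag(["the", "cat", "sat"]))
```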