Steven Bird
University of Melbourne
Publication
Featured research published by Steven Bird.
Meeting of the Association for Computational Linguistics | 2006
Steven Bird
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.
International Conference on Data Engineering | 2006
Steven Bird; Yi Chen; Susan B. Davidson; Haejoong Lee; Yifeng Zheng
Linguistic research and natural language processing employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data and queries. However, several important expressive features required for linguistic queries are missing or hard to express in XPath. In this paper, we motivate and illustrate these features with a variety of linguistic queries. Then we propose extensions to XPath to support linguistic queries, and design an efficient query engine based on a novel labeling scheme. Experiments demonstrate that our language is not only sufficiently expressive for linguistic trees but also efficient for practical usage.
International Journal on Digital Libraries | 2005
Jerry Goldman; Steve Renals; Steven Bird; Franciska M.G. de Jong; Marcello Federico; Carl Fleischhauer; Mark Kornbluh; Lori Lamel; Douglas W. Oard; Claire Stewart; Richard Wright
Spoken-word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access, and preservation of such data is stimulated by political, economic, cultural, and educational needs. This paper outlines the major issues in the field, reviews the current state of technology, examines the rapidly changing policy issues relating to privacy and copyright, and presents issues relating to the collection and preservation of spoken audio content.
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics | 2008
Steven Bird; Ewan Klein; Edward Loper; Jason Baldridge
The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK, and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issues: getting started with a course, delivering interactive demonstrations in the classroom, and organizing assignments and projects. In each case, we report on practical experience and make recommendations on how to use NLTK to maximum effect.
Literary and Linguistic Computing | 2003
Gary F. Simons; Steven Bird
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Computers and the Humanities | 2003
Steven Bird; Gary F. Simons
As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper reports on a new digital infrastructure for discovering language resources being developed by the Open Language Archives Community (OLAC). At the core of OLAC is its metadata format, which is designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice. The paper describes OLAC metadata, its relationship to Dublin Core metadata, and its dissemination using the metadata harvesting protocol of the Open Archives Initiative.
Library Hi Tech | 2003
Gary F. Simons; Steven Bird
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community‐specific aspects of resource description than is offered by DC. Furthermore, many of the institutions and individuals who might participate in OLAC do not have the technical resources to support the OAI protocol. This paper presents our solutions to these two problems.
Meeting of the Association for Computational Linguistics | 2001
Steven Bird; Gary F. Simons
As language data and associated technologies proliferate and as the language resources community rapidly expands, it has become difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool can work with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper describes a new digital infrastructure for language resource discovery, based on the Open Archives Initiative, and called OLAC -- the Open Language Archives Community. The OLAC Metadata Set and the associated controlled vocabularies facilitate consistent description and focussed searching. We report progress on the metadata set and controlled vocabularies, describing current issues and soliciting input from the language resources community.
International Joint Conference on Natural Language Processing | 2015
Long Duong; Trevor Cohn; Steven Bird; Paul Cook
Training a high-accuracy dependency parser requires a large treebank. However, these are costly and time-consuming to build. We propose a learning method that needs less data, based on the observation that there are underlying shared structures across languages. We exploit cues from a different source language in order to guide the learning process. Our model saves at least half of the annotation effort to reach the same accuracy compared with using the purely supervised method.
North American Chapter of the Association for Computational Linguistics | 2016
Long Duong; Antonios Anastasopoulos; David Chiang; Steven Bird; Trevor Cohn
For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phone-to-word alignment and translation reranking tasks, we achieve large improvements relative to several baselines. On the more challenging speech-to-word alignment task, our model nearly matches GIZA++’s performance on gold transcriptions, but without recourse to transcriptions or to a lexicon.