Publication


Featured research published by Steven Bird.


Meeting of the Association for Computational Linguistics | 2006

NLTK: The Natural Language Toolkit

Steven Bird

The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.


International Conference on Data Engineering | 2006

Designing and Evaluating an XPath Dialect for Linguistic Queries

Steven Bird; Yi Chen; Susan B. Davidson; Haejoong Lee; Yifeng Zheng

Linguistic research and natural language processing employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data and queries. However, several important expressive features required for linguistic queries are missing or hard to express in XPath. In this paper, we motivate and illustrate these features with a variety of linguistic queries. Then we propose extensions to XPath to support linguistic queries, and design an efficient query engine based on a novel labeling scheme. Experiments demonstrate that our language is not only sufficiently expressive for linguistic trees but also efficient for practical usage.


International Journal on Digital Libraries | 2005

Accessing the spoken word

Jerry Goldman; Steve Renals; Steven Bird; Franciska M.G. de Jong; Marcello Federico; Carl Fleischhauer; Mark Kornbluh; Lori Lamel; Douglas W. Oard; Claire Stewart; Richard Wright

Spoken-word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access, and preservation of such data is stimulated by political, economic, cultural, and educational needs. This paper outlines the major issues in the field, reviews the current state of technology, examines the rapidly changing policy issues relating to privacy and copyright, and presents issues relating to the collection and preservation of spoken audio content.


Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics | 2008

Multidisciplinary Instruction with the Natural Language Toolkit

Steven Bird; Ewan Klein; Edward Loper; Jason Baldridge

The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK, and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issues: getting started with a course, delivering interactive demonstrations in the classroom, and organizing assignments and projects. In each case, we report on practical experience and make recommendations on how to use NLTK to maximum effect.


Literary and Linguistic Computing | 2003

The Open Language Archives Community: An infrastructure for distributed archiving of language resources

Gary F. Simons; Steven Bird

New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.


Computers and The Humanities | 2003

Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources

Steven Bird; Gary F. Simons

As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper reports on a new digital infrastructure for discovering language resources being developed by the Open Language Archives Community (OLAC). At the core of OLAC is its metadata format, which is designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice. The paper describes OLAC metadata, its relationship to Dublin Core metadata, and its dissemination using the metadata harvesting protocol of the Open Archives Initiative.


Library Hi Tech | 2003

Building an Open Language Archives Community on the OAI Foundation

Gary F. Simons; Steven Bird

The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community‐specific aspects of resource description than is offered by DC. Furthermore, many of the institutions and individuals who might participate in OLAC do not have the technical resources to support the OAI protocol. This paper presents our solutions to these two problems.


Meeting of the Association for Computational Linguistics | 2001

The OLAC metadata set and controlled vocabularies

Steven Bird; Gary F. Simons

As language data and associated technologies proliferate and as the language resources community rapidly expands, it has become difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool can work with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper describes a new digital infrastructure for language resource discovery, based on the Open Archives Initiative, and called OLAC -- the Open Language Archives Community. The OLAC Metadata Set and the associated controlled vocabularies facilitate consistent description and focussed searching. We report progress on the metadata set and controlled vocabularies, describing current issues and soliciting input from the language resources community.


International Joint Conference on Natural Language Processing | 2015

Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser

Long Duong; Trevor Cohn; Steven Bird; Paul Cook

Training a high-accuracy dependency parser requires a large treebank. However, these are costly and time-consuming to build. We propose a learning method that needs less data, based on the observation that there are underlying shared structures across languages. We exploit cues from a different source language in order to guide the learning process. Our model saves at least half of the annotation effort to reach the same accuracy compared with using the purely supervised method.


North American Chapter of the Association for Computational Linguistics | 2016

An Attentional Model for Speech Translation Without Transcription

Long Duong; Antonios Anastasopoulos; David Chiang; Steven Bird; Trevor Cohn

For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phone-to-word alignment and translation reranking tasks, we achieve large improvements relative to several baselines. On the more challenging speech-to-word alignment task, our model nearly matches GIZA++’s performance on gold transcriptions, but without recourse to transcriptions or to a lexicon.

Collaboration


Dive into Steven Bird's collaborations.

Top Co-Authors

Haejoong Lee

University of Pennsylvania

Trevor Cohn

University of Melbourne

Baden Hughes

University of Melbourne

Kazuaki Maeda

University of Pennsylvania

Long Duong

University of Melbourne

Xiaoyi Ma

University of Pennsylvania

Paul Cook

University of Melbourne
