Publication


Featured research published by David Graff.


Topic detection and tracking | 2002

Corpora for topic detection and tracking

Christopher Cieri; Stephanie M. Strassel; David Graff; Nii Martey; Kara Rennert; Mark Liberman

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio, reference text (including written news text and transcripts of the broadcast audio), boundary tables segmenting the broadcasts into stories, and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.


Speech Communication | 2002

An overview of Broadcast News corpora

David Graff

The LDC began its first Broadcast News (BN) speech collection in the spring of 1996, facing a host of challenges including IPR negotiations with broadcasters, establishment of new transcription conventions and tools, and a compressed schedule for creation and release of speech, transcripts and in-domain language model data. The amount of acoustic training data available for participants in the DARPA Hub4 English benchmark tests doubled from 50 h in 1996 to 100 h in 1997, and doubled again to 200 h in 1998. An additional 40 h has been made available as of the summer of 1999. The 1997 benchmark test also saw the addition of BN speech and transcripts in Spanish and Mandarin Chinese, though in lesser quantity, with 30 h of training data in each language. Supplements to the existing pronunciation lexicons in each language were also produced. More recently, the coordinated research project on topic detection and tracking (TDT) has called for a large collection of BN speech data, totaling about 1100 h in English and 300 h in Mandarin over two phases (TDT2 and TDT3), although the level of detail and quality in the TDT transcriptions is not comparable to that of the Hub4 collections.


Human Language Technology | 1994

Multilingual text resources at the Linguistic Data Consortium

David Graff; Rebecca Finch

The Linguistic Data Consortium (LDC) is currently involved in a major effort to expand its multilingual text resources, in particular for machine translation, message understanding and information retrieval research. The main sources for data acquisition are governmental and international organizations, newswire services, and diverse publishers. This paper describes some of the research that is being done to identify potential resources, discusses some of the processes involved in negotiating the broadest possible access to the material for the human language technology research community, and identifies key issues and considerations in transducing the text into common and well-documented formats.


International Symposium on Chinese Spoken Language Processing | 2006

HKUST/MTS: a very large scale Mandarin telephone speech corpus

Yi Liu; Pascale Fung; Yongsheng Yang; Christopher Cieri; Shudong Huang; David Graff

The paper describes the design, collection, transcription and analysis of the 200-hour HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), recorded from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks and manually transcribed in standard Chinese characters (GBK), with specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation test by NIST, the corpus was found to improve system performance significantly.


Archive | 1999

The TDT-3 Text and Speech Corpus

David Graff; Christopher Cieri; Stephanie M. Strassel; Nii Martey


Conference of the International Speech Communication Association | 2007

Resources for new research directions in speaker recognition: the Mixer 3, 4 and 5 corpora

Christopher Cieri; Linda Corson; David Graff; Kevin Walker


Language Resources and Evaluation | 2000

Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies

David Graff; Steven Bird


Conference of the International Speech Communication Association | 2001

Large broadcast news and read speech corpora of spoken Czech

Josef Psutka; Vlasta Radová; Ludek Müller; Jindrich Matousek; Pavel Ircing; David Graff


Language Resources and Evaluation | 2000

Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora

Stephanie M. Strassel; David Graff; Nii Martey; Christopher Cieri


Archive | 2004

Fisher English Training Speech Part 1 Transcripts

Christopher Cieri; David Graff; O. A. Kimball; David Philip Miller; Kevin Walker

Collaboration


Dive into David Graff's collaborations.

Top Co-Authors

Christopher Cieri

Massachusetts Institute of Technology

Kevin Walker

University of Pennsylvania

Mohamed Maamouri

University of Pennsylvania

Nii Martey

University of Pennsylvania

Karen Jones

University of Pennsylvania

Linda Brandschain

University of Pennsylvania

Jonathan Wright

University of Pennsylvania

Mark Liberman

University of Pennsylvania

Tim Buckwalter

University of Pennsylvania
