Publication


Featured research published by Christopher Cieri.


Topic Detection and Tracking | 2002

Corpora for topic detection and tracking

Christopher Cieri; Stephanie M. Strassel; David Graff; Nii Martey; Kara Rennert; Mark Liberman

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio, reference text including written news text and transcripts of the broadcast audio, boundary tables segmenting the broadcasts into stories and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first story and story link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.


International Symposium on Chinese Spoken Language Processing | 2006

HKUST/MTS: a very large scale Mandarin telephone speech corpus

Yi Liu; Pascale Fung; Yongsheng Yang; Christopher Cieri; Shudong Huang; David Graff

The paper describes the design, collection, transcription and analysis of 200 hours of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), collected from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks. All calls are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation test by NIST, the corpus was found to improve system performance significantly.


Molecular Autism | 2017

Linguistic camouflage in girls with autism spectrum disorder

Julia Parish-Morris; Mark Liberman; Christopher Cieri; John D. Herrington; Benjamin E. Yerys; Leila Bateman; Joseph Donaher; Emily Ferguson; Juhi Pandey; Robert T. Schultz

Background: Autism spectrum disorder (ASD) is diagnosed more frequently in boys than girls, even when girls are equally symptomatic. Cutting-edge behavioral imaging has detected “camouflaging” in girls with ASD, wherein social behaviors appear superficially typical, complicating diagnosis. The present study explores a new kind of camouflage based on language differences. Pauses during conversation can be filled with words like UM or UH, but research suggests that these two words are pragmatically distinct (e.g., UM is used to signal longer pauses, and may correlate with greater social communicative sophistication than UH). Large-scale research suggests that women and younger people produce higher rates of UM during conversational pauses than do men and older people, who produce relatively more UH. Although it has been argued that children and adolescents with ASD use UM less often than typical peers, prior research has not included sufficient numbers of girls to examine whether sex explains this effect. Here, we explore UM vs. UH in school-aged boys and girls with ASD, and ask whether filled pauses relate to dimensional measures of autism symptom severity.
Methods: Sixty-five verbal school-aged participants with ASD (49 boys, 16 girls, IQ estimates in the average range) participated, along with a small comparison group of typically developing children (8 boys, 9 girls). Speech samples from the Autism Diagnostic Observation Schedule were orthographically transcribed and time-aligned, with filled pauses marked. Parents completed the Social Communication Questionnaire and the Vineland Adaptive Behavior Scales.
Results: Girls used UH less often than boys across both diagnostic groups. UH suppression resulted in higher UM ratios for girls than boys, and overall filled pause rates were higher for typical children than for children with ASD. Higher UM ratios correlated with better socialization in boys with ASD, but this effect was driven by increased use of UH by boys with greater symptoms.
Conclusions: Pragmatic language markers distinguish girls and boys with ASD, mirroring sex differences in the general population. One implication of this finding is that typical-sounding disfluency patterns (i.e., reduced relative UH production leading to higher UM ratios) may normalize the way girls with ASD sound relative to other children, serving as “linguistic camouflage” for a naïve listener and distinguishing them from boys with ASD. This first-of-its-kind study highlights the importance of continued commitment to understanding how sex and gender change the way that ASD manifests, and illustrates the potential of natural language to contribute to objective “behavioral imaging” diagnostics for ASD.


Meeting of the Association for Computational Linguistics | 2001

Annotation graphs and servers and multi-modal resources: infrastructure for interdisciplinary education, research and development

Christopher Cieri; Steven Bird

Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each of several areas, it proposes a common framework for future tool development, data annotation and resource sharing based upon annotation graphs and servers.


Proceedings of the 7th Workshop on Asian Language Resources | 2009

Basic Language Resources for Diverse Asian Languages: A Streamlined Approach for Resource Creation

Heather Simpson; Kazuaki Maeda; Christopher Cieri

The REFLEX-LCTL (Research on English and Foreign Language Exploitation-Less Commonly Taught Languages) program, sponsored by the United States government, was an effort in simultaneous creation of basic language resources and technologies for under-resourced languages, with the aim to enrich sparse areas in language technology resources and encourage new research. We were tasked to produce basic language resources for 8 Asian languages: Bengali, Pashto, Punjabi, Tamil, Tagalog, Thai, Urdu and Uzbek, and 5 languages from Europe and Africa, and distribute them to research and development also funded by the program. This paper will discuss the streamlined approach to language resource development we designed to support simultaneous creation of multiple resources for multiple languages.


Multilingual Speech Processing | 2006

Linguistic Data Resources

Christopher Cieri; Mark Liberman; Victoria Arranz; Khalid Choukri

This chapter provides an overview of available language resources, from both U.S. and European perspectives. Multilingual data repositories as well as large ongoing and planned collection efforts are introduced, along with a description of the major challenges of collection efforts, such as transcription issues due to inconsistent writing standards, subject recruitment, recording equipment, legal aspects, and costs in terms of time and money. The overview of multilingual resources comprises multilingual audio and text data, pronunciation dictionaries, and parallel bilingual/multilingual corpora. This chapter provides an overview of existing language resources in Europe. A number of projects in Europe have been working toward the production of multilingual speech and language resources, many of which have become key databases for the human language technology (HLT) community. The SpeechDat projects are a set of speech data-collection efforts funded by the European Commission with the aim of establishing databases for the development of voice-operated teleservices and speech interfaces. The resulting databases are available via European Language Resources Association (ELRA).


International Conference on Computational Linguistics | 2014

Intellectual Property Rights Management with Web Service Grids

Christopher Cieri; Denise DiPersio

This paper enumerates the ways in which configurations of web services may complicate issues of licensing language resources, whether data or tools. It details specific licensing challenges within the context of the US Language Application (LAPPS) Grid, sketches a solution under development and highlights ways in which that approach may be extended for other web service configurations.


Language and Linguistics Compass | 2014

Challenges and Opportunities in Sociolinguistic Data and Metadata Sharing

Christopher Cieri

Advances in computing technology coupled with recent focus on big data in the social sciences have provided the motivation and some of the infrastructure necessary for sociolinguists to share data among themselves and with researchers in related fields such as human language technologies (HLT). Collaboration among sociolinguists offers the promise to extend current knowledge beyond the community studies that have dominated the field for the past 50 years and focus more on regional and national patterns of variation and change and what they indicate about linguistic theory. Collaboration with HLT developers, while relatively new and still uncommon, has led to advances both in sociolinguistic methodology and in technologies suited to sociolinguistic research. Before the field can make full use of these advances, however, sociolinguists must confront a number of challenges. Studies that were developed with the intent of describing a single speech community presumably need not ensure, and in many cases have not ensured, consistency with prior work. Given this practice, attempts to compare phenomena across studies must address mismatches at the levels of data elicitation and selection, coding practice, and the definition of underlying concepts. Adding to the confusion wrought by methodological differences, speech communities differ in ways that the field worker cannot always predict so that different and sometimes unique linguistic and non-linguistic features are found to vary with linguistic structure. This paper underscores the motivation for data sharing by identifying some limitations of comparisons based only on published papers and reviewing advances fueled by data sharing among linguists and between linguists and technology developers. It also documents some of the challenges that hinder data sharing by reviewing work that has built upon available corpora. Finally, it summarizes efforts outside of sociolinguistics that have proposed frameworks for sharing and comparing metadata and categories, setting the stage for the papers that follow in these special issues.


Archive | 2007

Linguistic Resources, Development, and Evaluation of Text and Speech Systems

Christopher Cieri

Over the past several decades, research and development of human language technology has been driven or hindered by the availability of data and a number of organizations have arisen to address the demand for greater volumes of linguistic data in a wider variety of languages with more sophisticated annotation and better quality. A great deal of the linguistic data available today results from common task technology evaluation programs that, at least as implemented in the United States, typically involve objective measures of system performance on a benchmark corpus that are compared with human performance over the same data. Data centres play an important role by distributing and archiving, sometimes collecting and annotating, and even by coordinating the efforts of other organizations in the creation of linguistic data. Data planning depends upon the purpose of the project, the linguistic resources needed, the internal and external limitations on acquiring them, availability of data, bandwidth and distribution requirements, available funding, the limits on human annotation, the timeline, the details of the processing pipeline including the ability to parallelize, or the need to serialize steps. Language resource creation includes planning, creation of a specification, collection, segmentation, annotation, quality assurance, preparation for use, distribution, adjudication, refinement, and extension. In preparation for publication, shared corpora are generally associated with metadata and documented to indicate the authors and annotators of the data, the volume and types of raw material included, the percent annotated, the annotation specification, and the quality control measures adopted. 
This chapter sketches issues involved in identifying and evaluating existing language resources and in planning, creating, validating, and distributing new language resources, especially those used for developing human language technologies with specific examples taken from the collection and annotation of conversational telephone speech and the adjudication of corpora created to support information retrieval.


North American Chapter of the Association for Computational Linguistics | 2016

Exploring Autism Spectrum Disorders Using HLT.

Julia Parish-Morris; Mark Liberman; Neville Ryant; Christopher Cieri; Leila Bateman; Emily Ferguson; Robert T. Schultz

The phenotypic complexity of Autism Spectrum Disorder motivates the application of modern computational methods to large collections of observational data, both for improved clinical diagnosis and for better scientific understanding. We have begun to create a corpus of annotated language samples relevant to this research, and we plan to join with other researchers in pooling and publishing such resources on a large scale. The goal of this paper is to present some initial explorations to illustrate the opportunities that such datasets will afford.

Collaboration


Dive into Christopher Cieri's collaborations.

Top Co-Authors

Mark Liberman
University of Pennsylvania

Denise DiPersio
University of Pennsylvania

Kevin Walker
University of Pennsylvania

David Graff
University of Pennsylvania

David Miller
University of Pennsylvania

Jonathan Wright
University of Pennsylvania

Jiahong Yuan
University of Pennsylvania

Joseph P. Campbell
Massachusetts Institute of Technology