Mark Liberman
University of Pennsylvania
Publications
Featured research published by Mark Liberman.
Human Language Technology | 1991
Steven P. Abney; S. Flickenger; Claudia Gdaniec; C. Grishman; Philip Harrison; Donald Hindle; Robert Ingria; Frederick Jelinek; Judith L. Klavans; Mark Liberman; Mitchell P. Marcus; Salim Roukos; Beatrice Santorini; Tomek Strzalkowski; Ezra Black
The problem of quantitatively comparing the performance of different broad-coverage grammars of English has to date resisted solution. Prima facie, known English grammars appear to disagree strongly with each other as to the elements of even the simplest sentences. For instance, the grammars of Steve Abney (Bellcore), Ezra Black (IBM), Dan Flickinger (Hewlett Packard), Claudia Gdaniec (Logos), Ralph Grishman and Tomek Strzalkowski (NYU), Phil Harrison (Boeing), Don Hindle (AT&T), Bob Ingria (BBN), and Mitch Marcus (U. of Pennsylvania) recognize in common only the following constituents, when each grammarian provides the single parse which he/she would ideally want his/her grammar to specify for three sample Brown Corpus sentences:

The famed Yankee Clipper, now retired, has been assisting (as (a batting coach)).
One of those capital-gains ventures, in fact, has saddled him (with Gore Court).
He said this constituted a (very serious) misuse (of the (Criminal court) processes).
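The comparison the abstract describes can be made concrete by reducing each grammarian's parse to its set of constituent word spans and intersecting the sets (the idea later formalized in PARSEVAL-style metrics). The two toy bracketings below are illustrative, not the paper's actual parses:

```python
# Hypothetical sketch: measure agreement between two bracketed parses of the
# same sentence by comparing the word spans their constituents cover.

def spans(bracketing):
    """Return the set of (start, end) word spans in a nested-list parse."""
    result = set()

    def walk(node, start):
        if isinstance(node, str):          # a single word
            return start + 1
        pos = start
        for child in node:
            pos = walk(child, pos)
        result.add((start, pos))           # span covered by this constituent
        return pos

    walk(bracketing, 0)
    return result

# Two hypothetical parses of "the coach has been assisting as a batting coach".
parse_a = [["the", "coach"],
           ["has", ["been", ["assisting", ["as", ["a", ["batting", "coach"]]]]]]]
parse_b = [["the", "coach"],
           [["has", "been"], ["assisting", ["as", [["a", "batting"], "coach"]]]]]

# Constituents the two parses agree on.
common = spans(parse_a) & spans(parse_b)
```

Here the parses agree on the subject NP and the outer clause but disagree on the internal structure of the verb group, mirroring the kind of disagreement the paper quantifies.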
Speech Communication | 2001
Claude Barras; Edouard Geoffrois; Zhibiao Wu; Mark Liberman
We present “Transcriber”, a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and better management of conversational speech. Using the recently formalized annotation graphs framework, adaptation of the tool to new tasks and support of different data formats will become easier.
Journal of the Acoustical Society of America | 2008
Jiahong Yuan; Mark Liberman
This paper reports the results of our experiments on speaker identification in the SCOTUS corpus, which includes oral arguments from the Supreme Court of the United States. Our main findings are as follows: (1) a combination of Gaussian mixture models and monophone HMM models attains near‐100% text‐independent identification accuracy on utterances that are longer than one second; (2) the sampling rate of 11025 Hz achieves the best performance (higher sampling rates are harmful), and a sampling rate as low as 2000 Hz still achieves more than 90% accuracy; (3) a distance score based on likelihood numbers was used to measure the variability of phones among speakers; we found that the most variable phone is the phone UH (as in good), and the velar nasal NG is more variable than the other two nasal sounds M and N; (4) our models achieved “perfect” forced alignment on very long speech segments (one hour). These findings and their significance are discussed.
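The maximum-likelihood identification scheme in finding (1) can be sketched in miniature. This toy version, under stated simplifying assumptions, replaces each speaker's multi-component GMM with a single diagonal-covariance Gaussian over synthetic "frames" standing in for MFCC features; the speaker names and data are invented:

```python
import numpy as np

# Toy sketch of GMM-style text-independent speaker identification, reduced to
# one diagonal Gaussian per speaker (a one-component mixture). A real system
# would fit multi-component GMMs to MFCC frames; all data here is synthetic.

rng = np.random.default_rng(0)

def fit(frames):
    """Estimate per-dimension mean and variance from training frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def log_likelihood(frames, model):
    """Sum of per-frame diagonal-Gaussian log densities."""
    mean, var = model
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return ll.sum()

# Two synthetic "speakers" with different feature distributions.
train = {
    "speaker_a": rng.normal(0.0, 1.0, size=(500, 13)),
    "speaker_b": rng.normal(2.0, 1.5, size=(500, 13)),
}
models = {name: fit(frames) for name, frames in train.items()}

# Identify a test utterance by maximum likelihood over the speaker models.
test_utt = rng.normal(2.0, 1.5, size=(100, 13))   # drawn from speaker_b
best = max(models, key=lambda name: log_likelihood(test_utt, models[name]))
```

The design choice is the standard one for text-independent identification: each speaker contributes a density model, and a test utterance is assigned to whichever model gives it the highest total log-likelihood.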
Conference on Computational Natural Language Learning | 2006
Partha Pratim Talukdar; Thorsten Brants; Mark Liberman; Fernando Pereira
We present a novel context pattern induction method for information extraction, specifically named entity extraction. Using this method, we extended several classes of seed entity lists into much larger high-precision lists. Using token membership in these extended lists as additional features, we improved the accuracy of a conditional random field-based named entity tagger. In contrast, features derived from the seed lists decreased extractor accuracy.
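The core loop of context-pattern induction can be illustrated on a toy corpus: collect token contexts around seed entities, keep patterns attested for multiple seeds, and match those patterns elsewhere to harvest new candidates. The corpus, seeds, and pattern shape below are invented for illustration; the paper's method uses much richer pattern induction and scoring:

```python
from collections import Counter

# Toy corpus and seed entity list (illustrative only).
corpus = [
    "the city of paris is large",
    "the city of berlin is large",
    "the city of tokyo is growing",
    "the river of doubt is long",
]
seeds = {"paris", "berlin"}

# 1. Induce patterns: (two left tokens, one right token) around seed mentions.
patterns = Counter()
for sent in corpus:
    toks = sent.split()
    for i, tok in enumerate(toks):
        if tok in seeds and 2 <= i < len(toks) - 1:
            patterns[(toks[i - 2], toks[i - 1], toks[i + 1])] += 1

# 2. Apply patterns attested for at least two seeds to find new candidates.
candidates = set()
for (l2, l1, r), count in patterns.items():
    if count < 2:
        continue                          # keep only reliable patterns
    for sent in corpus:
        toks = sent.split()
        for i in range(2, len(toks) - 1):
            if (toks[i - 2], toks[i - 1], toks[i + 1]) == (l2, l1, r):
                candidates.add(toks[i])

new_entities = candidates - seeds
```

Here the pattern "city of _ is" is induced from the two seeds and correctly extends the list with "tokyo" while rejecting "doubt", mimicking how the extended high-precision lists in the paper are built before being fed to the tagger as membership features.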
International Conference on Acoustics, Speech, and Signal Processing | 1984
Mark Robert Anderson; Janet B. Pierrehumbert; Mark Liberman
This paper reports work on synthesizing English F0 contours. One motivation for this work is to improve the naturalness and liveliness of the prosody in speech synthesis systems. However, our main goal is to develop a theory of the dimensions of variation controlling intonation, and of their interaction.
International Conference on Computational Linguistics | 1990
Evelyne Tzoukermann; Mark Liberman
A finite transducer that processes Spanish inflectional and derivational morphology is presented. The system handles both generation and analysis of tens of millions of inflected forms. Lexical and surface (orthographic) representations of the words are linked by a program that interprets a finite directed graph whose arcs are labelled by n-tuples of strings. Each of about 55,000 base forms requires at least one arc in the graph. Representing the inflectional and derivational possibilities for these forms imposed an overhead of only about 3000 additional arcs, of which about 2500 represent (phonologically predictable) stem allomorphy, so that we pay a storage price of about 5% for compiling these forms offline. A simple interpreter for the resulting automaton processes several hundred words per second on a Sun4.
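The architecture described above, a directed graph whose arcs carry paired lexical and surface strings, with one interpreter running it in either direction, can be sketched on a tiny Spanish verb fragment. The states, arc labels, and morpheme notation here are invented for illustration:

```python
# Toy sketch of a two-level finite transducer: a finite directed graph whose
# arcs are labelled with (lexical, surface) string pairs. Generation matches
# the lexical side and emits the surface side; analysis does the reverse.

# arcs: state -> list of (lexical label, surface label, next state)
ARCS = {
    0: [("habl", "habl", 1), ("cant", "cant", 1)],   # -ar verb stems
    1: [("+ar+1sg", "o", 2),                          # hablo, canto
        ("+ar+2sg", "as", 2),                         # hablas, cantas
        ("+ar+inf", "ar", 2)],                        # hablar, cantar
}
FINAL = {2}

def transduce(word, state=0, side=0):
    """Map lexical->surface (side=0) or surface->lexical (side=1)."""
    if state in FINAL and word == "":
        return ""
    for arc in ARCS.get(state, []):
        match, emit, nxt = arc[side], arc[1 - side], arc[2]
        if word.startswith(match):
            rest = transduce(word[len(match):], nxt, side)
            if rest is not None:
                return emit + rest
    return None                            # no path through the graph

generate = lambda lex: transduce(lex, side=0)
analyze = lambda surf: transduce(surf, side=1)
```

Because the same graph serves both directions, adding a new stem arc at state 0 extends both generation and analysis at once, which is the economy behind the paper's ~5% storage overhead for the full inflectional paradigm.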
BMC Bioinformatics | 2006
Yang Jin; Ryan T. McDonald; Kevin Lerman; Mark A. Mandel; Steven Carroll; Mark Liberman; Fernando Pereira; Raymond S Winters; Peter S. White
Background: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining.
Results: We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact on performance.
Conclusion: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain.
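The "domain-specific features" a CRF tagger of this kind consumes can be sketched as a per-token feature extractor combining orthographic cues with lexicon membership. The feature names, the tiny lexicon, and the example sentence below are all hypothetical, not MTag's actual feature set:

```python
# Illustrative per-token feature extraction of the kind fed to a CRF-based
# biomedical entity tagger: orthographic features plus a lexicon-membership
# flag. Lexicon and feature names are hypothetical.

NEOPLASM_LEXICON = {"carcinoma", "lymphoma", "melanoma"}

def token_features(tokens, i):
    """Build a feature dict for token i of a tokenized sentence."""
    tok = tokens[i]
    return {
        "word.lower": tok.lower(),
        "word.suffix3": tok[-3:],                 # e.g. "-oma" endings
        "word.is_capitalized": tok[0].isupper(),
        "word.in_lexicon": tok.lower() in NEOPLASM_LEXICON,
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

sent = "Patients with metastatic melanoma were enrolled".split()
feats = token_features(sent, 3)                   # features for "melanoma"
```

The paper's finding that adding an extensive lexicon feature had minimal impact suggests the learned orthographic and contextual features (like the suffix and neighboring-word features above) already carry most of the signal.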
Topic Detection and Tracking | 2002
Christopher Cieri; Stephanie M. Strassel; David Graff; Nii Martey; Kara Rennert; Mark Liberman
The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio, reference text including written news text and transcripts of the broadcast audio, boundary tables segmenting the broadcasts into stories, and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story, and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.
Human Language Technology | 1989
Stephen E. Levinson; Mark Liberman; Andrej Ljolje; L. G. Miller
Speaker independent phonetic transcription of fluent speech is performed using an ergodic continuously variable duration hidden Markov model (CVDHMM) to represent the acoustic, phonetic and phonotactic structure of speech. An important property of the model is that each of its fifty-one states is uniquely identified with a single phonetic unit. Thus, for any spoken utterance, a phonetic transcription is obtained from a dynamic programming (DP) procedure for finding the state sequence of maximum likelihood. A model has been constructed based on 4020 sentences from the TIMIT database. When tested on 180 different sentences from this database, phonetic accuracy was observed to be 56% with 9% insertions. A speaker dependent version of the model was also constructed. The transcription algorithm was then combined with lexical access and parsing routines to form a complete recognition system. When tested on sentences from the DARPA resource management task spoken over the local switched telephone network, phonetic accuracy of 64% with 8% insertions and word accuracy of 87% with 3% insertions was measured. This system is presently operating in an on-line mode over the local switched telephone network in less than ten times real time on an Alliant FX-80.
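The dynamic-programming search for the maximum-likelihood state sequence that the abstract describes is the familiar Viterbi recursion. The sketch below runs it on a toy two-state ergodic HMM; the states, observations, and probabilities are invented stand-ins for the paper's 51-state CVDHMM:

```python
import math

# Toy Viterbi decoder: dynamic programming over an ergodic HMM in which each
# state stands for one phonetic unit, so the best state path is itself a
# phonetic transcription. Model values below are illustrative only.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log-probability, best state path) for an observation sequence."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for t in range(1, len(obs)):
        layer = {}
        for s in states:
            # Best predecessor for state s at time t.
            lp, prev = max(
                (V[t - 1][p][0] + math.log(trans_p[p][s]), p) for p in states)
            layer[s] = (lp + math.log(emit_p[s][obs[t]]),
                        V[t - 1][prev][1] + [s])
        V.append(layer)
    return max(V[-1].values())

states = ("AA", "N")                      # two toy phone states
start_p = {"AA": 0.6, "N": 0.4}
trans_p = {"AA": {"AA": 0.7, "N": 0.3}, "N": {"AA": 0.4, "N": 0.6}}
emit_p = {"AA": {"voiced": 0.9, "nasal": 0.1},
          "N": {"voiced": 0.2, "nasal": 0.8}}

logp, path = viterbi(["voiced", "voiced", "nasal"], states, start_p, trans_p, emit_p)
```

Because every state maps to exactly one phonetic unit, reading off the best path directly yields the transcription, which is the property the paper exploits; the CVDHMM additionally models explicit state durations, which this sketch omits.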
Journal of the Acoustical Society of America | 1976
Mark Liberman; Lynn A. Streeter
The technique of nonsense‐syllable mimicry of natural utterances has many advantages in the study of prosodic phenomena, especially duration. In analytic studies, the elimination of segmental effects as a factor makes data collection much more efficient, and requires only one segmentation criterion. In perceptual studies, the technique eliminates lexical information without unnatural distortions of the signal. In a series of validation experiments, we have found that (1) the patterns of duration obtained by using this technique were stable and reproducible within and across speakers; and (2) mimicry of different natural models with identical stress patterns and constituent structures produced nearly indistinguishable nonsense‐syllable duration patterns.