Adam Janin
University of California, Berkeley
Publications
Featured research published by Adam Janin.
international conference on acoustics, speech, and signal processing | 2003
Adam Janin; Don Baron; Jane Edwards; Daniel P. W. Ellis; David Gelbart; Nelson Morgan; Barbara Peskin; Thilo Pfau; Elizabeth Shriberg; Andreas Stolcke; Chuck Wooters
We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).
international conference on human language technology research | 2001
Nelson Morgan; Don Baron; Jane Edwards; Daniel P. W. Ellis; David Gelbart; Adam Janin; Thilo Pfau; Elizabeth Shriberg; Andreas Stolcke
In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a significant challenge to current HLT capabilities, while offering the promise of a wide range of potential applications. In this paper, we give our vision of the task, the challenges it represents, and the current state of our development, with particular attention to automatic transcription.
international conference on machine learning | 2005
Andreas Stolcke; Xavier Anguera; Kofi Boakye; Özgür Çetin; Frantisek Grezl; Adam Janin; Arindam Mandal; Barbara Peskin; Chuck Wooters; Jing Zheng
We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year's system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree-based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons (MLPs). In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition, we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year's evaluation system. Results on lecture data are comparable to the best reported results for that task.
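As a rough illustration of the delay-sum idea mentioned in this abstract, the sketch below aligns several microphone channels by known integer sample delays and averages them. It is a minimal sketch under stated assumptions, not the RT-05S system's implementation: the function name `delay_and_sum` is hypothetical, the delays are assumed to be given (the abstract does not say how they are estimated), and no channel weighting is applied.

```python
def delay_and_sum(channels, delays):
    """Average channels after advancing each one by its delay in samples.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   non-negative integer delay (in samples) of each channel
              relative to the earliest-arriving one. Assumed to be known.
    """
    if not channels or len(channels) != len(delays):
        raise ValueError("need one delay per channel and at least one channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only the span covered by every shifted channel can be averaged.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        raise ValueError("delays leave no overlapping samples")
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Averaging time-aligned channels reinforces the speech that arrives coherently at every microphone, while uncorrelated noise tends to partly cancel.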
international conference on acoustics, speech, and signal processing | 2003
Nelson Morgan; Don Baron; Sonali Bhagat; Hannah Carvey; Rajdip Dhillon; Jane Edwards; David Gelbart; Adam Janin; Ashley Krupski; Barbara Peskin; Thilo Pfau; Elizabeth Shriberg; Andreas Stolcke; Chuck Wooters
In early 2001, we reported (at the Human Language Technology meeting) the early stages of an ICSI (International Computer Science Institute) project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). We report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
conference of the international speech communication association | 1999
Adam Janin; Daniel P. W. Ellis; Nelson Morgan
Multi-stream and multi-band methods can improve the accuracy of speech recognition systems without overly increasing the complexity. However, they cannot be applied blindly. In this paper, we review our experience applying multi-stream and multi-band methods to the Broadcast News corpus. We found that multi-stream systems using different acoustic front-ends provide a significant improvement over single-stream systems. However, despite the fact that they have been successful on smaller tasks, we have not yet been able to show any improvement using multi-band methods. We report various insights gained from the experience in applying these methods in a large-vocabulary task.
international conference on acoustics, speech, and signal processing | 2010
Brian Kingsbury; Hagen Soltau; George Saon; Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Suman V. Ravuri; Nelson Morgan; Adam Janin
This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.
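For readers unfamiliar with the metric quoted in this abstract, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration under that standard definition; the function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption (evaluation scoring typically applies further text normalization that is not shown here).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.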
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Gerald Friedland; Adam Janin; David Imseng; Xavier Anguera Miro; Luke R. Gottlieb; Marijn Huijbregts; Mary Tai Knox; Oriol Vinyals
The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the rich transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This paper therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The paper also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection, and beamforming.
Proceedings of the 3rd ACM SIGMM international workshop on Social media | 2011
Gerald Friedland; Jaeyoung Choi; Howard Lei; Adam Janin
The following article describes an approach to determine the geo-coordinates of the recording place of Flickr videos based on both textual metadata and visual cues. The system is tested on the MediaEval 2010 Placing Task evaluation data, which consists of 5091 unfiltered test videos. The system presented in this article is less complex, uses less training data, and is at the same time more accurate than the best system presented in the evaluation in August 2010. Performance peaks at classifying 14% of the videos to within 10 m. The article describes the realization of the system and analyzes the different uses of multimodal cues and gazetteer information.
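An accuracy figure of this kind can be understood as the fraction of videos whose predicted coordinates fall within a given radius of the true recording location. The sketch below shows one way to compute it using the haversine great-circle distance. It is a minimal sketch, not the evaluation's scoring code: the names `haversine_m` and `fraction_within` are hypothetical, and the spherical-Earth radius is an approximation.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def fraction_within(predicted, truth, radius_m=10.0):
    """Fraction of predictions lying within radius_m of the paired true location."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("need equally many predictions and true locations")
    hits = sum(haversine_m(*p, *t) <= radius_m for p, t in zip(predicted, truth))
    return hits / len(truth)
```

Calling `fraction_within` with several radii gives accuracy at each distance threshold, not only at 10 m.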
acm multimedia | 2011
Gerald Friedland; Jaeyoung Choi; Adam Janin
The following article describes our demo of an approach to determine the geo-coordinates of the recording place of Flickr videos based on both textual metadata and visual cues. The underlying system has been tested on the MediaEval 2010 Placing Task evaluation data, which consists of 5091 unfiltered test videos, and is able to classify 14% of the videos to within an accuracy of 10 m.
international symposium on multimedia | 2010
Gerald Friedland; Jike Chong; Adam Janin
The following article presents an application for browsing meeting recordings by speaker and keyword, which we call the Meeting Diarist. The goal of the system is to enable browsing of the content with rich metadata in a graphical user interface shortly after the end of a meeting, even when the application runs on a contemporary laptop. We therefore developed novel parallel methods for speaker diarization and multi-hypothesis speech recognition that are optimized to run on multicore and manycore architectures. This paper presents the underlying parallel speaker diarization and speech recognition realizations, a comparison of results based on NIST RT07 evaluation data, and a description of the final application.
