Joshua T. Goodman
Microsoft
Publication
Featured research published by Joshua T. Goodman.
Computer Speech & Language | 1999
Stanley F. Chen; Joshua T. Goodman
We survey the most widely used algorithms for smoothing n-gram language models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), Bell, Cleary and Witten (1990), Ney, Essen and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g., Brown vs. Wall Street Journal), count cutoffs, and n-gram order (bigram vs. trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. We find that these factors can significantly affect the relative performance of models, with the most significant factor being training data size. Since no previous comparisons have examined these factors systematically, this is the first thorough characterization of the relative performance of various algorithms. In addition, we introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, we present results showing that improved language model smoothing leads to improved speech recognition performance.
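The modified smoothing method itself is not spelled out in the abstract. For background, a commonly cited textbook form of interpolated Kneser-Ney for bigrams (not necessarily the exact variant the paper introduces) discounts each bigram count by D and backs off to a continuation probability:

\[
P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - D,\, 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i),
\]
\[
\lambda(w_{i-1}) = \frac{D}{c(w_{i-1})} \bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|, \qquad
P_{\mathrm{cont}}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}{\bigl|\{ (w', w) : c(w' w) > 0 \}\bigr|},
\]
where c(·) is a training-set count and D is the discount.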
Meeting of the Association for Computational Linguistics | 1996
Stanley F. Chen; Joshua T. Goodman
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
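For reference, the Jelinek-Mercer baseline that the first new technique varies recursively interpolates the maximum-likelihood n-gram estimate with the smoothed lower-order estimate; a standard statement (not the paper's new variant) is:

\[
P_{\mathrm{interp}}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1}) + \bigl(1 - \lambda_{w_{i-n+1}^{i-1}}\bigr)\, P_{\mathrm{interp}}(w_i \mid w_{i-n+2}^{i-1}),
\]
with the interpolation weights \(\lambda\) typically estimated on held-out data.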
Computer Speech & Language | 2001
Joshua T. Goodman
In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, interpolated Kneser-Ney smoothing, and clustering. We present explorations of variations on, or of the limits of, each of these techniques, including showing that sentence mixture models may have more potential. While all of these techniques have been studied separately, they have rarely been studied in combination. We compare a combination of all techniques together to a Katz smoothed trigram model with no count cutoffs. We achieve perplexity reductions between 38% and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%. Our perplexity reductions are perhaps the highest reported compared to a fair baseline.
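To make the caching component concrete, here is a minimal Python sketch that interpolates a base trigram model with a unigram cache of recently seen words. The class name, the `prob`/`observe` interface, the cache size, and the fixed cache weight are illustrative assumptions, not the paper's implementation (which tunes such parameters and combines many more components).

```python
from collections import Counter, deque

class CachedTrigramLM:
    """Toy interpolation of a base trigram model with a unigram cache
    of recently seen words (illustrative sketch only)."""

    def __init__(self, base_lm, cache_size=200, cache_weight=0.1):
        self.base_lm = base_lm            # assumed to expose prob(word, context)
        self.cache = deque(maxlen=cache_size)
        self.cache_counts = Counter()
        self.cache_weight = cache_weight  # weight given to the cache component

    def prob(self, word, context):
        p_base = self.base_lm.prob(word, context)
        total = len(self.cache)
        p_cache = self.cache_counts[word] / total if total else 0.0
        return (1 - self.cache_weight) * p_base + self.cache_weight * p_cache

    def observe(self, word):
        # Keep unigram counts in sync with the sliding window of recent words.
        if len(self.cache) == self.cache.maxlen:
            self.cache_counts[self.cache[0]] -= 1   # about to be evicted
        self.cache.append(word)
        self.cache_counts[word] += 1
```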
ACM Transactions on Asian Language Information Processing | 2002
Jianfeng Gao; Joshua T. Goodman; Mingjing Li; Kai-Fu Lee
This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
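To illustrate maximum-likelihood segmentation in isolation, the sketch below chooses the word sequence with the highest total unigram log-probability by dynamic programming. The `segment` function, the toy lexicon, and its probabilities are hypothetical and far simpler than the paper's lexicon construction and trigram-consistent training.

```python
import math

def segment(sentence, word_logprob, max_word_len=4):
    """Pick the segmentation maximizing the sum of word log-probabilities
    (Viterbi-style dynamic programming; illustrative sketch)."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best[i] = best score for the first i characters
    back = [0] * (n + 1)               # back[i] = start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in word_logprob and best[j] + word_logprob[word] > best[i]:
                best[i] = best[j] + word_logprob[word]
                back[i] = j
    words, i = [], n
    while i > 0:                       # recover the segmentation from back-pointers
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Hypothetical toy lexicon with log-probabilities.
lexicon = {"北京": math.log(0.01), "大学": math.log(0.02),
           "北": math.log(0.001), "京": math.log(0.001),
           "大": math.log(0.002), "学": math.log(0.002)}
print(segment("北京大学", lexicon))   # -> ['北京', '大学']
```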
International Conference on Acoustics, Speech, and Signal Processing | 2001
Joshua T. Goodman
Maximum entropy models are considered by many to be one of the most promising avenues of language modeling research. Unfortunately, long training times make maximum entropy research difficult. We present a speedup technique: we change the form of the model to use classes. Our speedup works by creating two maximum entropy models, the first of which predicts the class of each word, and the second of which predicts the word itself. This factoring of the model leads to fewer nonzero indicator functions and faster normalization, achieving speedups of up to a factor of 35 over one of the best previous techniques. It also typically results in slightly lower perplexities. The same trick can be used to speed up training of other machine learning techniques, e.g., neural networks, applied to any problem with a large number of outputs, such as language modeling.
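The factoring the abstract describes is the standard class-based decomposition: with each word assigned to a single class, the conditional probability splits as

\[
P(w \mid h) = P\bigl(\mathrm{class}(w) \mid h\bigr) \times P\bigl(w \mid \mathrm{class}(w),\, h\bigr),
\]
so each of the two maximum entropy models normalizes over a much smaller set (the classes, or the words within one class) than the full vocabulary.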
Intelligent User Interfaces | 2002
Joshua T. Goodman; Gina Venolia; Keith Steury; Chauncey R. Parker
Language models predict the probability of letter sequences. Soft keyboards are images of keyboards on a touch screen, used for input on Personal Digital Assistants. When a soft keyboard user hits a point near the boundary between keys, the language model and a key press model are combined to select the most probable key sequence. This leads to an overall error rate reduction by a factor of 1.67 to 1.87. An extended version of this paper [4] is available.
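The combination described here follows the usual noisy-channel form: choose the key sequence that maximizes the language model probability times the key press (touch location) model probability. In symbols (a standard formulation; the paper's exact models and notation may differ):

\[
\hat{k}_1^n = \arg\max_{k_1^n}\; P(k_1^n)\; \prod_{i=1}^{n} P(x_i \mid k_i),
\]
where \(x_i\) is the observed touch point for the \(i\)-th key press and \(k_i\) the intended key.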
Meeting of the Association for Computational Linguistics | 2002
Joshua T. Goodman
We describe a speedup for training conditional maximum entropy models. The algorithm is a simple variation on Generalized Iterative Scaling, but converges roughly an order of magnitude faster, depending on the number of constraints, and the way speed is measured. Rather than attempting to train all model parameters simultaneously, the algorithm trains them sequentially. The algorithm is easy to implement, typically uses only slightly more memory, and will lead to improvements for most maximum entropy problems.
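The abstract gives the idea but not the update rule. The sketch below illustrates the sequential, one-feature-at-a-time style of training for a conditional maximum entropy model with binary features; the data layout, function name, lack of smoothing or regularization, and the skipping of zero-count features are simplifying assumptions of this sketch, not a faithful reproduction of the paper's algorithm.

```python
import math
from collections import defaultdict

def train_sequential_maxent(instances, num_features, labels, iterations=20):
    """Sketch of sequential (one-feature-at-a-time) training of a conditional
    maximum entropy model with binary features.  `instances` is a list of
    (active, true_label) pairs, where active[y] is the set of feature ids
    firing for (x, y) and every y comes from `labels`."""
    lam = [0.0] * num_features

    # Cache the unnormalized scores s[n][y] = exp(sum of firing lambdas) and
    # the per-instance normalizers z[n], so changing one lambda only touches
    # the (instance, label) pairs where that feature fires.
    s = [{y: 1.0 for y in labels} for _ in instances]
    z = [float(len(labels))] * len(instances)

    observed = [0.0] * num_features
    fires = defaultdict(list)          # feature id -> list of (instance idx, label)
    for n, (active, true_y) in enumerate(instances):
        for y, feats in active.items():
            for j in feats:
                fires[j].append((n, y))
        for j in active.get(true_y, ()):
            observed[j] += 1.0

    for _ in range(iterations):
        for j in range(num_features):
            if observed[j] == 0.0:
                continue               # crude handling of unseen features
            expected = sum(s[n][y] / z[n] for n, y in fires[j])
            if expected == 0.0:
                continue
            delta = math.log(observed[j] / expected)   # step for a binary feature
            lam[j] += delta
            for n, y in fires[j]:      # incremental update of the cached scores
                z[n] -= s[n][y]
                s[n][y] *= math.exp(delta)
                z[n] += s[n][y]
    return lam
```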
Electronic Commerce | 2004
Joshua T. Goodman; Robert L. Rounthwaite
We analyze the problem of preventing outgoing spam. We show that some conventional techniques for limiting outgoing spam are likely to be ineffective. We show that while imposing per-message costs would work, less annoying techniques also work. In particular, it is only necessary that the average cost to the spammer over the lifetime of an account exceed his profits, meaning that not every message need be challenged. We develop three techniques, one based on additional HIP (human interactive proof) challenges, one based on computational challenges, and one based on paid subscriptions. Each system is designed to impose minimal costs on legitimate users, while being too costly for spammers. We also show that maximizing complaint rates is a key factor, and suggest new standards to encourage high complaint rates.
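To make the break-even argument concrete with illustrative numbers (not taken from the paper): if a challenge costs the spammer c to answer and each outgoing message is challenged independently with probability p, the expected cost per message is p·c, so spamming from the account becomes unprofitable as soon as

\[
p \cdot c > r \quad\Longleftrightarrow\quad p > \frac{r}{c},
\]
where r is the spammer's profit per message. For example, with hypothetical values r = \$0.01 and c = \$0.05, challenging only one message in five (p > 0.2) already wipes out the profit, while a legitimate low-volume sender rarely sees a challenge.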
International Conference on Acoustics, Speech, and Signal Processing | 2001
Xuedong Huang; Alex Acero; Ciprian Chelba; Li Deng; Jasha Droppo; Doug Duchene; Joshua T. Goodman; Hsiao-Wuen Hon; Derek Jacoby; Li Jiang; Ricky Loynd; Milind Mahajan; Peter Mau; Scott Meredith; Salman Mughal; Salvado Neto; Mike Plumpe; Keith Steury; Gina Venolia; Kuansan Wang; Ye-Yi Wang
Dr. Who is a Microsoft research project aiming at creating a speech-centric multimodal interaction framework, which serves as the foundation for the .NET natural user interface. MiPad is the application prototype that demonstrates compelling user advantages for wireless personal digital assistant (PDA) devices. MiPad fully integrates continuous speech recognition (CSR) and spoken language understanding (SLU) to enable users to accomplish many common tasks using a multimodal interface and wireless technologies. It tries to solve the problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs. Unlike a cellular phone, MiPad avoids speech-only interaction. It incorporates a built-in microphone that activates whenever a field is selected. As a user taps the screen or uses a built-in roller to navigate, the tapping action narrows the number of possible instructions for spoken language understanding. MiPad currently runs on a Windows CE Pocket PC together with a Windows 2000 machine where speech recognition is performed. The Dr. Who CSR engine uses a unified CFG and n-gram language model. The Dr. Who SLU engine is based on a robust chart parser and a plan-based dialog manager. The paper discusses MiPad's design, implementation work in progress, and a preliminary user study in comparison to the existing pen-based PDA interface.
International World Wide Web Conferences | 2004
Geoff Hulten; Joshua T. Goodman; Robert L. Rounthwaite
In this paper we analyze a very large junk e-mail corpus generated by a hundred thousand volunteer users of the Hotmail e-mail service. We describe how the corpus is being collected, and analyze the geographic origins of the e-mail, who the e-mail is targeting, and what the e-mail is selling.