Leah S. Larkey

University of Massachusetts Amherst

Publication

Featured research published by Leah S. Larkey.

international acm sigir conference on research and development in information retrieval | 1996

Combining classifiers in text categorization

Leah S. Larkey; W. Bruce Croft

Three different types of classifiers were investigated in the context of a text categorization problem in the medical domain: the automatic assignment of ICD9 codes to dictated inpatient discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian independence classifiers were applied individually and in combination. A combination of different classifiers produced better results than any single type of classifier. For this specific medical categorization problem, new query formulation and weighting methods used in the k-nearest-neighbor classifier improved performance.
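The abstract does not say how the classifier outputs were combined. As a minimal sketch of one common approach, assumed here purely for illustration, the Python below rescales each classifier's scores to [0, 1] and ranks candidate codes by a weighted sum. The function names, the weights, the code labels, and the scores are all made up for the example and do not come from the paper.

```python
def normalize(scores):
    """Min-max rescale one classifier's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: this classifier cannot discriminate.
        return {code: 0.0 for code in scores}
    return {code: (s - lo) / (hi - lo) for code, s in scores.items()}


def combine(score_sets, weights):
    """Rank codes by a weighted sum of normalized classifier scores.

    Assumption: a simple linear combination; other schemes are possible.
    """
    combined = {}
    for scores, weight in zip(score_sets, weights):
        for code, s in normalize(scores).items():
            combined[code] = combined.get(code, 0.0) + weight * s
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores from two classifiers on different scales.
knn_scores = {"428.0": 0.9, "410.9": 0.4, "486": 0.1}
bayes_scores = {"428.0": -12.0, "410.9": -3.0, "486": -20.0}
ranking = combine([knn_scores, bayes_scores], weights=[0.5, 0.5])
```

Normalizing first matters because the classifiers score on different scales (similarities versus log-probabilities in this made-up data); without it, one classifier would dominate the sum.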

international acm sigir conference on research and development in information retrieval | 2002

Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis

Leah S. Larkey; Lisa Ballesteros; Margaret E. Connell

Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.

acm international conference on digital libraries | 1999

A patent search and classification system

Leah S. Larkey

We present a system for searching and classifying U.S. patent documents, based on Inquery. Patents are distributed through hundreds of collections, divided up by general area. The system selects the best collections for the query. Users can search for patents or classify patent text. The user interface helps users search in fields without requiring the knowledge of Inquery query operators. The system includes a unique “phrase help” facility, which helps users find and add phrases and terms related to those in their query.

international acm sigir conference on research and development in information retrieval | 1998

Automatic essay grading using text categorization techniques

Leah S. Larkey

Several standard text-categorization techniques were applied to the problem of automated essay grading. Bayesian independence classifiers and k-nearest-neighbor classifiers were trained to assign scores to manually-graded essays. These scores were combined with several other summary text measures using linear regression. The classifiers and regression equations were then applied to a new set of essays. The classifiers worked very well. The agreement between the automated grader and the final manual grade was as good as the agreement between human graders.

conference on information and knowledge management | 2003

Statistical transliteration for English-Arabic cross language information retrieval

Nasreen AbdulJaleel; Leah S. Larkey

Out of vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In the present study, we present a simple statistical technique to train an English to Arabic transliteration model from pairs of names. We call this a selected n-gram model because a two-stage training procedure first learns which n-gram segments should be added to the unigram inventory for the source language, and then a second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically-trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration either of OOV named entities or of all OOV words is an effective approach for cross language IR.

acm international conference on digital libraries | 2000

Acrophile: an automated acronym extractor and server

Leah S. Larkey; Paul Ogilvie; M. Andrew Price; Brenden Tamilio

We implemented a web server for acronym and abbreviation lookup, containing a collection of acronyms and their expansions gathered from a large number of web pages by a heuristic extraction process. Several different extraction algorithms were evaluated and compared. The corpus resulting from the best algorithm is comparable to a high-quality hand-crafted site, but has the potential to be much more inclusive as data from more web pages are processed.

Archive | 2007

Light Stemming for Arabic Information Retrieval

Leah S. Larkey; Lisa Ballesteros; Margaret E. Connell

Computational Morphology is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for Arabic, and assessed their effectiveness for information retrieval using standard TREC data. We have also compared light stemming with several stemmers based on morphological analysis. The light stemmer, light10, outperformed the other approaches. It has been included in the Lemur toolkit, and is becoming widely used in Arabic information retrieval.
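The idea of light stemming described above, stripping a few frequent affixes without attempting a morphological analysis, can be sketched in a few lines of Python. This is an illustration only: the affix lists, the `min_stem_len` threshold, and the function name `light_stem` are assumptions made for the example and are not the light10 definition.

```python
# Illustrative affix lists (assumed for this sketch, not the light10 lists).
# Longer prefixes come first so "وال" is tried before "ال".
PREFIXES = ["وال", "بال", "ال"]
SUFFIXES = ["ات", "ون", "ين", "ة"]


def light_stem(word, min_stem_len=3):
    """Strip at most one prefix and one suffix from an Arabic word.

    An affix is removed only if at least `min_stem_len` characters remain,
    which keeps very short words intact. No root extraction is attempted.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break
    return word


# Usage: conflate surface forms before indexing or querying.
stems = [light_stem(w) for w in ["المكتبة", "والكتاب"]]
```

The length guard is the main design choice here: without it, affix stripping would reduce short words to fragments that match unrelated terms.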

conference on information and knowledge management | 2000

Collection selection and results merging with topically organized U.S. patents and TREC data

Leah S. Larkey; Margaret E. Connell; James P. Callan

We investigate three issues in distributed information retrieval, considering both TREC data and U.S. Patents: (1) topical organization of large text collections, (2) collection ranking and selection with topically organized collections, and (3) results merging, particularly document score normalization, with topically organized collections. We find that it is better to organize collections topically, and that topical collections can be well ranked using either INQUERY’s CORI algorithm, or the Kullback-Leibler divergence (KL), but KL is far worse than CORI for non-topically organized collections. For results merging, collections organized by topic require global idfs for the best performance. Contrary to results found elsewhere, normalized scores are not as good as global idfs for merging when the collections are topically organized.
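To make the KL-based collection ranking mentioned above concrete, here is a minimal Python sketch, under stated assumptions, that ranks collections by the KL divergence between a query's term distribution and each collection's smoothed term distribution. The smoothing scheme (linear interpolation with a background model built from all collections), the parameter `lam`, the function names, and the toy data are assumptions for illustration; the abstract does not give the exact formulation used, and CORI is not shown.

```python
import math
from collections import Counter


def kl_score(query_terms, collection_counts, background_counts, lam=0.5):
    """KL divergence from the query model to a smoothed collection model.

    Lower values mean the collection's vocabulary fits the query better.
    """
    query = Counter(query_terms)
    q_total = sum(query.values())
    c_total = sum(collection_counts.values())
    b_total = sum(background_counts.values())
    score = 0.0
    for term, n in query.items():
        p_query = n / q_total
        p_coll = collection_counts.get(term, 0) / c_total
        p_back = background_counts.get(term, 0) / b_total
        p_smoothed = lam * p_coll + (1 - lam) * p_back
        if p_smoothed == 0.0:
            # Term unseen in every collection: it cannot discriminate.
            continue
        score += p_query * math.log(p_query / p_smoothed)
    return score


def rank_collections(query_terms, collections):
    """Return collection names ordered from best to worst match."""
    background = sum(collections.values(), Counter())
    return sorted(
        collections,
        key=lambda name: kl_score(query_terms, collections[name], background),
    )


# Toy, made-up term counts for two topically organized collections.
collections = {
    "chemistry": Counter({"polymer": 5, "acid": 5}),
    "software": Counter({"compiler": 6, "cache": 4}),
}
ranking = rank_collections(["polymer", "acid"], collections)
```

Smoothing against a background model keeps the logarithm defined when a query term is missing from a particular collection, which is the usual case when collections are split by topic.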

international acm sigir conference on research and development in information retrieval | 2004

Language-specific models in multilingual topic tracking

Leah S. Larkey; Fangfang Feng; Margaret E. Connell; Victor Lavrenko

Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native language hypothesis stating that comparisons would be more effective in the original language of the story. We first test and support the hypothesis for story link detection. For topic tracking the hypothesis implies that it should be preferable to build separate language-specific topic models for each language in the stream. We compare different methods of incrementally building such native language topic models.

ACM Transactions on Asian Language Information Processing | 2003

Hindi CLIR in thirty days

Leah S. Larkey; Margaret E. Connell; Nasreen AbdulJaleel

As participants in the TIDES Surprise language exercise, researchers at the University of Massachusetts helped collect Hindi--English resources and developed a cross-language information retrieval system. Components included normalization, stop-word removal, transliteration, structured query translation, and language modeling using a probabilistic dictionary derived from a parallel corpus. Existing technology was successfully applied to Hindi. The biggest stumbling blocks were collection of parallel English and Hindi text and dealing with numerous proprietary encodings.