Rosie Jones | Researchain

Publication

Featured research published by Rosie Jones.

conference on information and knowledge management | 2001

Mining the web to create minority language corpora

Rayid Ghani; Rosie Jones; Dunja Mladenic

The Web is a valuable source of language-specific resources, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented, showing that our approach generalizes to a variety of languages.
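The odds-ratio term selection mentioned in this abstract can be illustrated with a minimal sketch. It is not the CorpusBuilder implementation: the function names, the additive smoothing constant, the whitespace tokenizer, and the `+term` / `-term` query syntax are all assumptions made for illustration.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Score each term by the log odds ratio of its document frequency.

    Assumption: documents are plain strings split on whitespace, and
    `smoothing` is a hypothetical additive constant that avoids log(0).
    """
    def doc_freq(docs):
        counts = Counter()
        for doc in docs:
            counts.update(set(doc.lower().split()))
        return counts

    rel_df, irr_df = doc_freq(relevant_docs), doc_freq(irrelevant_docs)
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
    scores = {}
    for term in set(rel_df) | set(irr_df):
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr_df[term] + smoothing) / (n_irr + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant_docs, irrelevant_docs, n_include=2, n_exclude=2):
    """Build a query from the highest- and lowest-scoring terms.

    Assumption: the search engine accepts +term (must appear) and
    -term (must not appear); ties are broken alphabetically.
    """
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    include = sorted(scores, key=lambda t: (-scores[t], t))[:n_include]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])
```

In a loop like the one the abstract describes, the relevant and irrelevant lists would be refilled from the language classifier's labels after each round of retrieval, and the query rebuilt from the updated scores.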

Knowledge and Information Systems | 2005

Building Minority Language Corpora by Learning to Generate Web Search Queries

Rayid Ghani; Rosie Jones; Dunja Mladenic

The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.

north american chapter of the association for computational linguistics | 2001

You're not from 'round here, are you?: naive Bayes detection of non-native utterance text

Laura Mayfield Tomokiyo; Rosie Jones

Native and non-native use of language differs, depending on the proficiency of the speaker, in clear and quantifiable ways. It has been shown that customizing the acoustic and language models of a natural language understanding system can significantly improve handling of non-native input; in order to make such a switch, however, the nativeness status of the user must be known. In this paper, we show that naive Bayes classification can be used to identify non-native utterances of English. The advantage of our method is that it relies on text, not on acoustic features, and can be used when the acoustic source is not available. We demonstrate that both read and spontaneous utterances can be classified with high accuracy, and that classification of errorful speech recognizer hypotheses is more accurate than classification of perfect transcriptions. We also characterize part-of-speech sequences that play a role in detecting non-native speech.
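As a rough illustration of text-only naive Bayes classification, the sketch below trains a multinomial model on word tokens. It is a generic textbook formulation, not the authors' system: the class name, the Laplace smoothing parameter `alpha`, the whitespace tokenizer, and the choice of word tokens as features are assumptions.

```python
import math
from collections import Counter


class NaiveBayesText:
    """Multinomial naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed Laplace smoothing constant

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_posteriors(self, text):
        """Return unnormalized log P(class | text) for every class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        result = {}
        for c, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[c].values())
            log_p = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    log_p += math.log(
                        (self.word_counts[c][word] + self.alpha)
                        / (total_words + self.alpha * vocab_size)
                    )
            result[c] = log_p
        return result

    def predict(self, text):
        scores = self.log_posteriors(text)
        return max(scores, key=scores.get)
```

The same class could be trained on part-of-speech tags instead of words by passing tag sequences as the input strings, which is one way to examine the part-of-speech patterns the abstract mentions.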

international acm sigir conference on research and development in information retrieval | 2001

Automatic web search query generation to create minority language corpora

Rayid Ghani; Rosie Jones; Dunja Mladenic

The Web is a valuable source of language-specific resources, but collecting, organizing and utilizing this information is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries to collect documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant and a subset of documents is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for finding documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Applying the same approach to multiple languages shows that our system generalizes to a variety of languages.

conference on information and knowledge management | 2000

Learning a monolingual language model from a multilingual text database

Rayid Ghani; Rosie Jones

Language models are of importance in speech recognition, document classification, and database selection algorithms. Traditionally language models are learned from corpora specifically acquired for the purpose. Increasingly, however, there is interest in constructing language models for specific languages from heterogeneous sources such as the web. Query-based sampling has been shown to be effective for gauging the content of monolingual heterogeneous databases. We propose evaluating an extension to this approach by considering the case of learning a monolingual language model from a multilingual database, and extensions to the query-based sampling algorithm to handle this case. We test our approach on a corpus collected from the WWW and show that our proposed methods perform accurately and efficiently for learning a language model of Tagalog, when these documents are only 2.5% of the documents in a collection.
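A minimal sketch of query-based sampling with a language filter follows, assuming a hypothetical `search(term)` callable that returns matching documents and a hypothetical `is_target_language(doc)` predicate. The parameter names, the random choice of the next query term, and the stopping rule are illustrative assumptions, not the algorithm evaluated in the paper.

```python
import random


def query_based_sample(search, is_target_language, seed_term,
                       max_queries=20, docs_per_query=4, rng=None):
    """Sample target-language documents from a database by repeated querying.

    Assumptions: `search(term)` returns a list of document strings, and
    `is_target_language(doc)` returns True for documents to keep.
    """
    rng = rng or random.Random(0)
    sample, seen, used_terms = [], set(), set()
    term = seed_term
    for _ in range(max_queries):
        used_terms.add(term)
        for doc in search(term)[:docs_per_query]:
            if doc in seen:
                continue
            seen.add(doc)
            # Only target-language documents feed the sample, and
            # therefore only they supply future query terms.
            if is_target_language(doc):
                sample.append(doc)
        candidates = sorted(
            {word for doc in sample for word in doc.split()} - used_terms
        )
        if not candidates:
            break
        term = rng.choice(candidates)
    return sample
```

Drawing new query terms only from accepted documents is the design choice that steers sampling toward the target language; a language model could then be estimated from the token counts of the returned sample.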

international acm sigir conference on research and development in information retrieval | 2018

An Information Nutritional Label for Online Documents

Norbert Fuhr; Anastasia Giachanou; Gregory Grefenstette; Iryna Gurevych; Andreas Hanselowski; Kalervo Järvelin; Rosie Jones; Yiqun Liu; Josiane Mothe; Wolfgang Nejdl; Isabella Peters; Benno Stein

With the proliferation of online information sources, it has become more and more difficult to judge the trustworthiness of news found on the Web. The beauty of the web is its openness, but this openness has led to a proliferation of false and unreliable information, whose presentation makes it difficult to detect. It may be impossible to detect what is “real news” and what is “fake news” since this discussion ultimately leads to a deep philosophical discussion of what is true and what is false. However, recent advances in natural language processing allow us to analyze information objectively according to certain objective criteria (for example, the number of spelling errors). Here we propose creating an “information nutrition label” that can be automatically generated for any online text. Among others, the label provides information on the following computable criteria: factuality, virality, opinion, controversy, authority, technicality, and topicality. With this label, we hope to help readers make more informed judgments about the items they read.

national conference on artificial intelligence | 1999