Ljiljana Dolamic
University of Neuchâtel
Publications
Featured research published by Ljiljana Dolamic.
Parallel Computing | 2009
Jacques Savoy; Ljiljana Dolamic
In multilingual countries (Canada, Hong Kong, and India, among others), in large international organizations and companies (such as the WTO or the European Parliament), and among Web users in general, accessing information written in other languages has become a real need (news, hotel or airline reservations, government information, statistics). While some users are bilingual, others can read documents written in another language but cannot formulate a query to search it, or at least cannot provide reliable search terms in a form comparable to those found in the documents being searched. There are also many monolingual users who may want to retrieve documents in another language and then have them translated into their own language, either manually or automatically. Translation services may, however, be too expensive, not readily accessible, or not available within a short timeframe. On the other hand, many documents contain non-textual information such as images, videos, and statistics that does not need translation and can be understood regardless of the language involved.

In response to these needs, and in order to make the Web universally available regardless of language barriers, Google launched a translation service in May 2007 that now provides two-way online translation mainly between English and 41 other languages, for example Arabic, simplified and traditional Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish (http://translate.google.com/). Over the last few years other free Internet translation services have become available, for example BabelFish (http://babel.altavista.com/) and Yahoo! (http://babelfish.yahoo.com/). These two systems are similar to the one used by Google, given that they are based on technology developed by Systran, one of the earliest companies to develop machine translation. Also worth mentioning is the Promt system (also known as Reverso, http://translation2.paralink.com/), developed in Russia to provide translation mainly between Russian and other languages.

The question we would like to address here is to what extent a translation service such as Google can produce adequate results in a language other than the one used to write the query. Although we will not evaluate the translations per se, we will test and analyze various systems in terms of their ability to retrieve items automatically based on a translated query. To be adequate, these tests must be done on a collection of documents written in one given language, together with a series of topics (expressing user information needs) written in other languages and a series of relevance assessments (the relevant documents for each topic).
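A minimal sketch of the translate-then-retrieve pipeline studied here, assuming a placeholder translate() function standing in for an online service such as Google Translate; the toy term-overlap ranker below is purely illustrative and is not one of the IR systems evaluated in the paper.

```python
# Translate-then-retrieve sketch: translate the query into the collection
# language, then rank documents. The ranker is a toy term-overlap scorer.

def translate(query: str, source: str, target: str) -> str:
    """Hypothetical stand-in for an online translation service."""
    raise NotImplementedError("plug in a real translation service here")

def retrieve(translated_query: str, documents: dict[str, str], k: int = 10):
    """Rank documents by simple term overlap with the translated query."""
    q_terms = set(translated_query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Usage: a French query against an English collection.
# ranking = retrieve(translate("réservation d'hôtel", "fr", "en"), corpus)
```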
Information Processing and Management | 2009
Ljiljana Dolamic; Jacques Savoy
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we have designed and evaluated two stemming approaches, a light one and a more aggressive one, and compared them with a no-stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf idf vector-space approach. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM, or tf idf models; the performance differences were, however, statistically significant only with the last two IR approaches. Ignoring stemming generally reduces the MAP by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, the differences in performance with the light stemmer are not statistically significant.
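For illustration, a generic light suffix-stripping stemmer can be sketched as below; the suffix list is hypothetical and is not the Czech rule set evaluated in the paper.

```python
# Generic light suffix-stripping stemmer: remove the longest matching suffix
# while keeping a minimal stem length. The suffix list is illustrative only.
ILLUSTRATIVE_SUFFIXES = ("ami", "ech", "ové", "ům", "em", "es", "s")

def light_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix, if the remaining stem is long enough."""
    for suffix in sorted(ILLUSTRATIVE_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print(light_stem("studentem"))  # -> "student" with this toy suffix list
```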
ACM Transactions on Asian Language Information Processing | 2010
Ljiljana Dolamic; Jacques Savoy
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world's 20 most spoken languages, and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we suggest a light and a more aggressive stemming approach based on them. In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and to broaden our comparisons we implement and evaluate two language-independent indexing methods: n-gram and trunc-n (retaining only the first n letters of each word). We evaluate these solutions by applying various IR models, including Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our tests suggest that improved retrieval effectiveness is obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to a light stemmer or to ignoring this type of word normalization altogether. Comparisons between no-stemming and stemming indexing schemes show that the performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. Based on a comparison of word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this and also demonstrates the advantages of applying a stemming or trunc-4 indexing scheme.
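The two language-independent schemes compared above can be sketched as follows; the choice of n = 4 mirrors the trunc-4 and 4-gram settings discussed, but the code is only an illustration, not the authors' implementation.

```python
# trunc-n keeps only the first n letters of each word; n-gram indexing
# decomposes each word into overlapping character n-grams.

def trunc_n(word: str, n: int = 4) -> str:
    """trunc-n indexing unit: the first n characters of the word."""
    return word[:n]

def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Overlapping character n-grams (the word itself if it is shorter than n)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(trunc_n("retrieval"))      # 'retr'
print(char_ngrams("retrieval"))  # ['retr', 'etri', 'trie', 'riev', 'ieva', 'eval']
```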
Database and Expert Systems Applications | 2009
Ljiljana Dolamic; Jacques Savoy
The main goal of this paper is to describe and evaluate different indexing and stemming strategies for the Farsi (Persian) language. For this Indo-European language we have suggested a stopword list and a light stemmer. We have compared this stemmer with an indexing strategy in which the stemming procedure is omitted (with or without stopword removal), with another publicly available stemmer for this language, and with a language-independent n-gram indexing strategy. To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and two vector-space models, the classical tf idf and the Lnu-ltc model. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM, or vector-space models; the performance differences were, however, statistically significant only with the last two IR approaches. Ignoring stemming reduces the MAP by more than 7%, with differences that are statistically significant most of the time. Finally, not removing stopwords for this language decreases the MAP by about 3%.
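As an example of one of the IR models named above, a minimal Okapi BM25 scorer is sketched below; the k1 and b values are common defaults and the IDF formulation is one standard variant, not necessarily the exact settings used in the paper.

```python
# Minimal Okapi BM25 scoring of one document against a query.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_dl, k1=1.2, b=0.75):
    """doc_terms: list of terms in the document; doc_freq: term -> document frequency."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avg_dl))
        score += idf * norm
    return score
```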
Cross-Language Evaluation Forum | 2008
Ljiljana Dolamic; Claire Fautsch; Jacques Savoy
In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora, composed of very short descriptions (library catalog records), and to evaluate the retrieval effectiveness of several IR models. As a second objective, we wanted to design and evaluate a stopword list and a light stemming strategy for Persian (Farsi), a member of the Indo-European language family whose morphology is more complex than that of English.
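A minimal sketch of stopword removal at indexing time, assuming a small placeholder word list; it is not the Persian stopword list actually built for this campaign.

```python
# Drop function words from the token stream before stemming and indexing.
# The word list below is a placeholder with a few common Persian function words.
PLACEHOLDER_STOPWORDS = {"و", "در", "به", "از", "که"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Return the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in PLACEHOLDER_STOPWORDS]
```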
Cross-Language Evaluation Forum | 2008
Claire Fautsch; Ljiljana Dolamic; Samir Abdou; Jacques Savoy
In participating in this domain-specific track, our first objective is to propose and evaluate a light stemmer for the Russian language. Our second objective is to measure the relative merit of various search engines used for the German and, to a lesser extent, the English language. To do so we evaluated the tf idf and Okapi models, IR models derived from the Divergence from Randomness (DFR) paradigm, and a language model (LM). For the Russian language, we find that word-based indexing using our light stemming procedure results in better retrieval effectiveness than the 4-gram indexing strategy (a relative difference of around 30%). Using the German corpus, we examine variations in retrieval effectiveness after applying a specialized thesaurus to automatically enlarge topic descriptions. In this case, the performance variations were relatively small and usually not significant.
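Retrieval effectiveness in these experiments is compared through mean average precision (MAP); a minimal sketch of how MAP is computed from ranked runs and relevance assessments is given below.

```python
# Mean average precision: average precision per topic, then the mean over topics.

def average_precision(ranking, relevant):
    """ranking: ordered doc ids; relevant: set of relevant doc ids for the topic."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```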
FIRE | 2013
Jacques Savoy; Ljiljana Dolamic; Mitra Akasereh
Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in the Hindi, Bengali, and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both the Marathi and Hindi languages during this second campaign. We have compared their retrieval effectiveness with both a light stemming strategy and a language-independent n-gram approach. As another language-independent indexing strategy, we have evaluated the trunc-n method, in which the indexing term is formed by considering only the first n letters of each word. To evaluate these solutions we have used various IR models, including models derived from the Divergence from Randomness (DFR) paradigm, a language model (LM), Okapi, and the classical tf idf vector-processing approach.
Cross-Language Evaluation Forum | 2008
Claire Fautsch; Ljiljana Dolamic; Jacques Savoy
Our first objective in participating in this domain-specific evaluation campaign is to propose and evaluate various indexing and search strategies for the German, English, and Russian languages, and thus obtain retrieval effectiveness superior to that of a language-independent approach (n-gram). To do so we evaluated the GIRT-4 test collection using the Okapi model, various IR models based on the Divergence from Randomness (DFR) paradigm, and a statistical language model (LM), together with the classical tf idf vector-processing scheme.
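As an illustration of the classical tf idf vector-processing scheme mentioned above, a minimal sketch is given below; it uses a generic tf idf weighting with cosine similarity, not the specific variants (such as Lnu-ltc) evaluated elsewhere in this work.

```python
# Generic tf.idf document vectors and cosine similarity.
import math
from collections import Counter

def tfidf_vector(terms, doc_freq, num_docs):
    """terms: list of terms; doc_freq: term -> number of documents containing it."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(num_docs / doc_freq[t])
            for t in tf if doc_freq.get(t, 0) > 0}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```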
International Conference on E-Technologies | 2011
Jacques Savoy; Ljiljana Dolamic; Olena Zubaryeva
In this paper we describe the search technologies currently available on the Web, explain the underlying difficulties, and show their limits, whether related to current technology or to intrinsic properties of natural languages. We then analyze the effectiveness of freely available machine translation services and demonstrate that under certain conditions these translation systems can operate at the same performance levels as manual translation. Searching for factual information with commercial search engines also retrieves facts, user comments, and opinions on the target items. In the third part we explain how the principal machine learning strategies can classify short passages of text extracted from the blogosphere as factual or opinionated, and then classify their polarity (positive, negative, or mixed).
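The paper does not prescribe a particular toolkit or learner for this classification step; as one possible realization, the sketch below uses scikit-learn with a Naive Bayes model and hypothetical toy labels, and the same setup could be repeated for the polarity stage.

```python
# Sketch of classifying short passages as factual vs. opinionated; a second,
# analogous classifier could then assign polarity (positive/negative/mixed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical labels, for illustration only).
passages = ["The hotel has 120 rooms.", "I loved the friendly staff!",
            "Opened in 1998 near the station.", "Terrible service, never again."]
labels = ["fact", "opinion", "fact", "opinion"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(passages, labels)
print(clf.predict(["The room was dirty and noisy."]))  # likely 'opinion'
```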
Cross-Language Evaluation Forum | 2009
Ljiljana Dolamic; Jacques Savoy
This paper describes our participation in the Persian ad hoc search task during the CLEF 2009 evaluation campaign. In this task we suggest using a light suffix-stripping algorithm for the Farsi (Persian) language. Evaluations based on different probabilistic models demonstrate that our stemming approach performs better than a stemmer removing only plural suffixes, and statistically better than an approach that ignores the stemming stage (around +4.5%) or an n-gram approach (around +4.7%). The use of blind query expansion may significantly improve retrieval effectiveness (between +7% and +11%). Combining different indexing and search strategies may further enhance the MAP (around +4.4%).
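Blind query expansion (pseudo-relevance feedback) can be sketched as below; the number of feedback documents and expansion terms, and the absence of term weighting, are illustrative choices rather than the settings behind the figures above.

```python
# Blind (pseudo-relevance) query expansion: assume the top-ranked documents
# are relevant and add their most frequent new terms to the original query.
from collections import Counter

def expand_query(query_terms, ranked_docs, top_docs=5, new_terms=10,
                 stopwords=frozenset()):
    """ranked_docs: ordered list of retrieved documents, each a list of terms."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in stopwords and t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(new_terms)]
```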