Seyed M. M. Tahaghoghi

Publication

Featured research published by Seyed M. M. Tahaghoghi.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006

Capturing collection size for distributed non-cooperative retrieval

Milad Shokouhi; Justin Zobel; Falk Scholer; Seyed M. M. Tahaghoghi

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in use of resources required to perform the estimation.
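The abstract does not name the estimators it proposes. One well-known ecological technique for estimating animal populations is capture-recapture, and the sketch below shows its classic two-sample form (the Lincoln-Petersen estimator) applied to document identifiers. It is an illustration of the general idea only and is not claimed to be either of the paper's two approaches; the function name and the sample data are hypothetical.

```python
def lincoln_petersen(first_sample, second_sample):
    """Two-sample capture-recapture estimate of a population (collection) size.

    Assumption: both samples are drawn independently from the same collection,
    and each item is identified by a stable document identifier.
    """
    first = set(first_sample)
    second = set(second_sample)
    recaptured = len(first & second)  # documents seen in both samples
    if recaptured == 0:
        raise ValueError("samples share no documents; the estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical document identifiers returned by two rounds of probe queries.
sample_a = range(0, 100)    # 100 documents
sample_b = range(80, 180)   # 100 documents, 20 of them already seen in sample_a
estimate = lincoln_petersen(sample_a, sample_b)
print(estimate)
```

The smaller the overlap between the two samples, the larger the estimated collection, which is why the estimate is undefined when no document is seen twice.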

Conference on Image and Video Retrieval | 2007

Detection of near-duplicate images for web search

Jun Jie Foo; Justin Zobel; Ranjan Sinha; Seyed M. M. Tahaghoghi

Among the vast numbers of images on the web are many duplicates and near-duplicates, that is, variants derived from the same original image. Such near-duplicates appear in many web image searches and may represent infringements of copyright or indicate the presence of redundancy. While methods for identifying near-duplicates have been investigated, there has been no analysis of the kinds of alterations that are common on the web or evaluation of whether real cases of near-duplication can in fact be identified. In this paper we use popular queries and a commercial image search service to collect images that we then manually analyse for instances of near-duplication. We show that such duplication is indeed significant, but that not all kinds of image alteration explored in previous literature are evident in web data. Removal of near-duplicates from a collection is impractical, but we propose that they be removed from sets of answers. We evaluate our technique for automatic identification of near-duplicates during query evaluation and show that it has promise as an effective mechanism for management of near-duplication in practice.

Information Processing and Management | 2007

Using query logs to establish vocabularies in distributed information retrieval

Milad Shokouhi; Justin Zobel; Seyed M. M. Tahaghoghi; Falk Scholer

Users of search engines express their needs as queries, typically consisting of a small number of terms. The resulting search engine query logs are valuable resources that can be used to predict how people interact with the search system. In this paper, we introduce two novel applications of query logs, in the context of distributed information retrieval. First, we use query log terms to guide sampling from uncooperative distributed collections. We show that while our sampling strategy is at least as efficient as current methods, it consistently performs better. Second, we propose and evaluate a pruning strategy that uses query log information to eliminate terms. Our experiments show that our proposed pruning method maintains the accuracy achieved by complete indexes, while decreasing the index size by up to 60%. While such pruning may not always be desirable in practice, it provides a useful benchmark against which other pruning strategies can be measured.

String Processing and Information Retrieval | 2005

Stemming Arabic conjunctions and prepositions

Abdusalam F. A. Nwesri; Seyed M. M. Tahaghoghi; Falk Scholer

Arabic is the fourth most widely spoken language in the world, and is characterised by a high rate of inflection. To cater for this, most Arabic information retrieval systems incorporate a stemming stage. Most existing Arabic stemmers are derived from English equivalents; however, unlike English, most affixes in Arabic are difficult to discriminate from the core word. Removing incorrectly identified affixes sometimes results in a valid but incorrect stem, and in most cases reduces retrieval precision. Conjunctions and prepositions form an interesting class of these affixes. In this work, we present novel approaches for dealing with these affixes. Unlike previous approaches, our approaches focus on retaining valid Arabic core words, while maintaining high retrieval performance.

IEEE Transactions on Multimedia | 2007

Modeling Human Judgment of Digital Imagery for Multimedia Retrieval

Timo Volkmer; James A. Thom; Seyed M. M. Tahaghoghi

The application of machine learning techniques to image and video search has been shown to boost the performance of multimedia retrieval systems, and promises to lead to more generalized semantic search approaches. In particular, the availability of large training collections allows model-driven search using a substantial number of semantic concepts. The training collections are obtained in a manual annotation process where human raters review images and assign predefined semantic concept labels. Besides being prone to human error, manual image annotation is biased by the view of the individual annotator because visual information almost always leaves room for ambiguity. Ideally, several independent judgments are obtained per image, and the inter-rater agreement is assessed. While disagreement between ratings bears valuable information on the annotation quality, it complicates the task of clearly classifying rated images based on multiple judgments. In the absence of a gold standard, evaluating multiple judgments and resolving disagreement between raters is not trivial. In this paper, we present an approach using latent structure analysis to solve this problem. We apply latent class modeling to the annotation data collected during the TRECVID 2005 Annotation Forum, and demonstrate how to use this statistic to clearly classify each image on the basis of varying numbers of ratings. We use latent class modeling to quantify the annotation quality and discuss the results in comparison with the well-known Kappa inter-rater agreement measure.
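For readers unfamiliar with the Kappa measure used as the comparison point, the following is a minimal sketch of Cohen's kappa for two raters. The abstract does not say which Kappa variant the paper uses, so the two-rater form is an assumption, and the function name and example labels are hypothetical. The latent class model itself is not shown.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(ratings_a)
    # Fraction of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical concept labels ("y" = concept present) for four images.
rater_1 = ["y", "y", "n", "n"]
rater_2 = ["y", "n", "n", "n"]
kappa = cohen_kappa(rater_1, rater_2)
print(kappa)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Kappa summarises agreement as a single number but does not by itself assign a final label to each image.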

International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006

Social Media Retrieval Using Image Features and Structured Text

D.N.F. Awang Iskandar; Jovan Pehcevski; James A. Thom; Seyed M. M. Tahaghoghi

Use of XML offers a structured approach for representing information while maintaining separation of form and content. XML information retrieval is different from standard text retrieval in two aspects: the XML structure may be of interest as part of the query; and the information does not have to be text. In this paper, we describe an investigation of approaches to retrieve text and images from a large collection of XML documents, performed in the course of our participation in the INEX 2006 Ad Hoc and Multimedia tracks. We evaluate three information retrieval similarity measures: Pivoted Cosine, Okapi BM25 and Dirichlet. We show that on the INEX 2006 Ad Hoc queries Okapi BM25 is the most effective among the three similarity measures used for retrieving text only, while Dirichlet is more suitable when retrieving heterogeneous (text and image) data.
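As a point of reference for one of the three measures named above, here is a minimal sketch of Okapi BM25 over tokenised documents. It is not the system used in the paper: the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here, the IDF form is the non-negative variant popularised by Lucene, and the function name and toy documents are hypothetical. Pivoted Cosine and Dirichlet smoothing are not shown.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalisation: longer documents need more occurrences.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection of tokenised text.
docs = [
    ["xml", "retrieval", "image"],
    ["xml", "xml", "structure", "query"],
    ["video", "shot", "detection"],
]
scores = bm25_scores(["xml", "image"], docs)
print(scores)
```

In this toy collection the first document ranks highest because it matches both query terms, including the rarer term "image", while the third document matches neither and scores zero.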

Computer Vision and Pattern Recognition | 2004

Gradual Transition Detection Using Average Frame Similarity

Timo Volkmer; Seyed M. M. Tahaghoghi; Hugh E. Williams

Segmenting digital video into its constituent basic semantic entities, or shots, is an important step for effective management and retrieval of video data. Recent automated techniques for detecting transitions between shots are highly effective on abrupt transitions. However, automated detection of gradual transitions, and the precise determination of the corresponding start and end frames, remains problematic. In this paper, we present a gradual transition detection approach based on average frame similarity and adaptive thresholds. We report good detection results on the TREC video track collections - particularly for dissolves and fades - and very high accuracy in identifying transition boundaries. Our technique is a valuable new tool for transition detection.

Empirical Methods in Natural Language Processing | 2006

Capturing Out-of-Vocabulary Words in Arabic Text

Abdusalam F. A. Nwesri; Seyed M. M. Tahaghoghi; Falk Scholer

The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, and so before any stemming, foreign words need to be identified. In this paper, we investigate three approaches for the identification of foreign words in Arabic text: lexicons, language patterns, and n-grams. Our results show that lexicon-based approaches outperform the other techniques.

INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval | 2005

Combining image and structured text retrieval

D.N.F. Awang Iskandar; Jovan Pehcevski; James A. Thom; Seyed M. M. Tahaghoghi

Two common approaches in retrieving images from a collection are retrieval by text keywords and retrieval by visual content. However, it is widely recognised that it is impossible for keywords alone to fully describe visual content. This paper reports on the participation of the RMIT University group in the INEX 2005 multimedia track, where we investigated our approach of combining evidence from a content-oriented XML retrieval system and a content-based image retrieval system using a linear combination of evidence. Our approach yielded the best overall result for the INEX 2005 Multimedia track using the standard evaluation measures. We have extended our work by varying the parameter for the linear combination of evidence, and we have also examined the performance of runs submitted by participants by using the newly proposed HiXEval evaluation metric. We show that using CBIR in conjunction with text search leads to better retrieval performance.
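A linear combination of evidence can be sketched as a weighted sum of the two systems' scores. The abstract does not give the exact formula, score normalisation, or parameter values, so the function below is only an assumed form: `alpha` is a hypothetical name for the combination parameter, scores are assumed to be already normalised to a comparable range, and the example scores are invented for illustration.

```python
def combine_evidence(text_scores, image_scores, alpha=0.5):
    """Rank documents by alpha * text score + (1 - alpha) * image score.

    Assumption: both inputs map document ids to scores in a comparable
    range (for example 0 to 1); a missing score is treated as 0.
    """
    doc_ids = set(text_scores) | set(image_scores)
    combined = {
        doc_id: alpha * text_scores.get(doc_id, 0.0)
        + (1 - alpha) * image_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalised scores from an XML text engine and a CBIR engine.
text_scores = {"a": 0.9, "b": 0.2}
image_scores = {"b": 0.8, "c": 0.5}
ranked = combine_evidence(text_scores, image_scores, alpha=0.5)
print(ranked)
```

Setting `alpha` to 1.0 reduces this to text-only ranking and 0.0 to image-only ranking, so varying it between the two explores the trade-off the abstract refers to.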

Genetic and Evolutionary Computation Conference | 2008

Evolving similarity functions for code plagiarism detection

Victor Ciesielski; Nelson Wu; Seyed M. M. Tahaghoghi

Detecting whether computer program code is a student's original work or has been copied from another student or some other source is a major problem for many universities. Detection methods based on the information retrieval concepts of indexing and similarity matching scale well to large collections of files, but require appropriate similarity functions for good performance. We have used particle swarm optimization and genetic programming to evolve similarity functions that are suited to computer program code. Using a training set of plagiarised and non-plagiarised programs we have evolved better parameter values for the previously published Okapi BM25 similarity function. We have then used genetic programming to evolve completely new similarity functions that do not conform to any predetermined structure. We found that the evolved similarity functions outperformed the human-developed Okapi BM25 function. We also found that a detection system using the evolved functions was more accurate than the best code plagiarism detection system in use today, and scales much better to large collections of files. The evolutionary computing techniques have been extremely useful in finding similarity functions that advance the state of the art in code plagiarism detection.
