Christopher M. De Vries

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Christopher M. De Vries is active.

Explore More

Publication

Featured researches published by Christopher M. De Vries.

conference on information and knowledge management | 2011

TOPSIG: topology preserving document signatures

Shlomo Geva; Christopher M. De Vries

Comparisons between file signatures and inverted files for text retrieval have shown the shortcomings of traditional file signatures. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures that extends recent advances in semantic hashing and dimensionality reduction. These were not so far linked to general purpose, signature file based, search engines. We demonstrate significant improvements in the performance of signature file based indexing and retrieval. Performance is comparable to the state of the art inverted file based systems, including language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and positions the file signatures model in the class of Vector Space retrieval models.

INEX'09 Proceedings of the Focused retrieval and evaluation, and 8th international conference on Initiative for the evaluation of XML retrieval | 2009

Overview of the INEX 2009 XML mining track: clustering and classification of XML documents

Richi Nayak; Christopher M. De Vries; Sangeetha Kutty; Shlomo Geva; Ludovic Denoyer; Patrick Gallinari

This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.

arXiv: Information Retrieval | 2009

Document Clustering with K-tree

Christopher M. De Vries; Shlomo Geva

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.

international acm sigir conference on research and development in information retrieval | 2009

K-tree: large scale document clustering

Christopher M. De Vries; Shlomo Geva

We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means it forms a hierarchy of clusters. It has been extended to address issues with sparse representations. We compare performance and quality to CLUTO using document collections. The K-tree has a low time complexity that is suitable for large document collections. This tree structure allows for efficient disk based implementations where space requirements exceed that of main memory.

australasian document computing symposium | 2012

Pairwise similarity of TopSig document signatures

Christopher M. De Vries; Shlomo Geva

This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances are compared to purely random signatures. It explains why TopSig is only competitive with state of the art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.

international world wide web conferences | 2015

Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications

Christopher M. De Vries; Lance De Vine; Shlomo Geva; Richi Nayak

The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine grained clustering has not been previously demonstrated. Previous approaches clustered a sample that limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show an improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and unfeasible.

INEX'09 Proceedings of the Focused retrieval and evaluation, and 8th international conference on Initiative for the evaluation of XML retrieval | 2009

Clustering with random indexing K-tree and XML structure

Christopher M. De Vries; Shlomo Geva; Lance De Vine

This paper describes the approach taken to the clustering task at INEX 2009 by a group at the Queensland University of Technology. The Random Indexing (RI) K-tree has been used with a representation that is based on the semantic markup available in the INEX 2009 Wikipedia collection. The RI K-tree is a scalable approach to clustering large document collections. This approach has produced quality clustering when evaluated using two different methodologies.

Science & Engineering Faculty | 2011