Boris Chidlovskii | Researchain

Publication

Featured research published by Boris Chidlovskii.

very large data bases | 2000

Semantic caching of Web queries

Boris Chidlovskii; Uwe M. Borghoff

In meta-searchers accessing distributed Web-based information repositories, performance is a major issue. Efficient query processing requires an appropriate caching mechanism. Unfortunately, standard page-based as well as tuple-based caching mechanisms designed for conventional databases are not efficient on the Web, where keyword-based querying is often the only way to retrieve data. In this work, we study the problem of semantic caching of Web queries and develop a caching mechanism for conjunctive Web queries based on signature files. Our algorithms cope with both relations of semantic containment and intersection between a query and the corresponding cache items. We also develop the cache replacement strategy to treat situations when cached items differ in size and contribution when providing partial query answers. We report results of experiments and show how the caching mechanism is realized in the Knowledge Broker system.
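
As a rough illustration of the signature-file idea mentioned in this abstract, the sketch below encodes each conjunctive keyword query as a fixed-width bit signature and uses a bitwise test to screen cached queries for possible containment. It is not the paper's algorithm: the signature width, the number of bits per keyword, the hash function, and the names `term_signature`, `query_signature`, and `may_contain` are all assumptions made for the example.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (assumed value)
BITS_PER_TERM = 3   # bit positions set per keyword (assumed value)

def term_signature(term: str) -> int:
    """Map one keyword to a bitmask with at most BITS_PER_TERM bits set."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term.lower()}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig

def query_signature(terms) -> int:
    """Superimpose (OR) the signatures of all keywords in a conjunctive query."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig

def may_contain(cached_sig: int, query_sig: int) -> bool:
    """Screen for 'cached keywords are a subset of query keywords'.

    If the cached query's keywords are a subset of the new query's keywords,
    every answer to the new conjunctive query also answers the cached one,
    so the cached result can be filtered locally. The bit test never misses
    a true subset but can report false positives, so a hit should be
    confirmed by comparing the keyword sets themselves.
    """
    return (cached_sig & query_sig) == cached_sig

cached_terms = {"xml"}
query_terms = {"xml", "schema"}
if may_contain(query_signature(cached_terms), query_signature(query_terms)):
    # Confirm the candidate with an exact set comparison.
    usable = cached_terms <= query_terms
```

The bitwise screen is cheap enough to run against every cache entry; only the entries that pass it need the exact keyword comparison.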

acm symposium on applied computing | 2010

Scalable indexing for layout based document retrieval and ranking

Loic Lecerf; Boris Chidlovskii

In this paper we propose a schema for querying large document collections by document layout. We develop a model of layout indexing of a collection adapted for the quick retrieval of the top k relevant documents. For the sake of scalability, we avoid a direct evaluation of the similarity between a query and each document in the collection; their similarity is instead approximated by the similarity between their projections on the set of representative blocks, which are inferred from the collection at the indexing step. The technique also proposes new functions for relevance ranking and cluster pruning that ensure scalable retrieval and ranking.

Proceedings IEEE Advances in Digital Libraries 2000 | 2000

Using regular tree automata as XML schemas

Boris Chidlovskii

We address the problem of tight XML schemas and propose regular tree automata to model XML data. We show that the tree automata model is more powerful than the XML DTDs and is closed under main algebraic operations. We introduce the XML query algebra based on the tree automata model, and discuss the query optimization and query pruning techniques. Finally we show the conversion of tree automata schema into XML DTDs.

european conference on machine learning | 2000

Wrapper Generation via Grammar Induction

Boris Chidlovskii; Jon Ragetli; Maarten de Rijke

To facilitate effective search on the World Wide Web, meta search engines have been developed which do not search the Web themselves, but use available search engines to find the required information. By means of wrappers, meta search engines retrieve information from the pages returned by search engines. We present an approach to automatically create such wrappers by means of an incremental grammar induction algorithm. The algorithm uses an adaptation of the string edit distance. Our method performs well; it is quick, can be used for several types of result pages and requires a minimal amount of user interaction.
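
The abstract above refers to an adaptation of the string edit distance. For orientation, here is a minimal sketch of the classic, unadapted Levenshtein distance with unit costs, applied to token sequences; the paper's adaptation is not described here, and the function name `edit_distance` and the sample token lists are hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical token sequences from two result-page items.
item_a = ["<li>", "<a>", "text", "</a>", "</li>"]
item_b = ["<li>", "<b>", "<a>", "text", "</a>", "</b>", "</li>"]
distance = edit_distance(item_a, item_b)
```

Working on token lists rather than raw characters lets the same routine compare HTML fragments by their tag structure.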

acm/ieee joint conference on digital libraries | 2002

Schema extraction from XML collections

Boris Chidlovskii

The XML Schema language has been proposed to replace Document Type Definitions (DTDs) as the schema mechanism for XML data. This language consistently extends grammar-based constructions with constraint- and pattern-based ones and has a higher expressive power than DTDs. As schemas remain optional for XML, we address the problem of XML Schema extraction. We model the XML schema as extended context-free grammars and develop a novel extraction algorithm inspired by methods of grammatical inference. The algorithm also copes with the schema determinism requirement imposed by XML DTDs and XML Schema languages.

international conference on machine learning and applications | 2010

Boosting Multi-Task Weak Learners with Applications to Textual and Social Data

Jean Baptiste Faddoul; Boris Chidlovskii; Fabien Torre; Rémi Gilleron

Learning multiple related tasks from data simultaneously can improve predictive performance relative to learning these tasks independently. In this paper we propose a novel multi-task learning algorithm called MT-Adaboost: it extends the Adaboost algorithm [Freund1999Short] to the multi-task setting and uses a multi-task decision stump as its multi-task weak classifier. This allows learning different dependencies between tasks for different regions of the learning space. Thus, we relax the conventional hypothesis that tasks behave similarly in the whole learning space. Moreover, MT-Adaboost can learn multiple tasks without imposing the constraint of sharing the same label set and/or examples between tasks. A theoretical analysis is derived from the analysis of the original Adaboost. Experiments for multiple tasks over large scale textual data sets with social context (Enron and Tobacco) give rise to very promising results.

european conference on machine learning | 2012

Learning multiple tasks with boosted decision trees

Jean Baptiste Faddoul; Boris Chidlovskii; Rémi Gilleron; Fabien Torre

We address the problem of multi-task learning with no label correspondence among tasks. Learning multiple related tasks simultaneously, by exploiting their shared knowledge, can improve the predictive performance on every task. We develop the multi-task Adaboost environment with Multi-Task Decision Trees as weak classifiers. We first adapt the well-known decision tree learning to the multi-task setting. We revise the information gain rule for learning decision trees in the multi-task setting. We use this feature to develop a novel criterion for learning Multi-Task Decision Trees. The criterion guides the tree construction by learning the decision rules from data of different tasks, and representing different degrees of task relatedness. We then modify MT-Adaboost to combine Multi-Task Decision Trees as weak learners. We experimentally validate the advantage of the new technique; we report results of experiments conducted on several multi-task datasets, including the Enron email set and Spam Filtering collection.

international conference on management of data | 2006

Documentum ECI self-repairing wrappers: performance analysis

Boris Chidlovskii; Bruno Roustant; Marc Brette

Documentum Enterprise Content Integration (ECI) services is a content integration middleware that provides one-query access to the Intranet and Internet content resources. The ECI Adapter technology offers an interface to any application for data and metadata extraction from unstructured Web pages. It offers a unique framework of wrapper production, automatic recovery and maintenance, developed at Xerox Research Centre Europe and based on state-of-the-art algorithms from machine learning and grammatical inference. In this presentation we analyze the performance of ECI adapters deployed in current commercial installations. We benefit from accessing reports on daily tests for all ECI commercially deployed adapters collected from June 2003 to September 2005. Using the daily reports, we analyze different aspects of the wrapper technology.

web search and data mining | 2013

Connecting comments and tags: improved modeling of social tagging systems

Dawei Yin; Shengbo Guo; Boris Chidlovskii; Brian D. Davison; Cédric Archambeau; Guillaume Bouchard

Collaborative tagging systems are now deployed extensively to help users share and organize resources. Tag prediction and recommendation can simplify and streamline the user experience, and by modeling user preferences, predictive accuracy can be significantly improved. However, previous methods typically model user behavior based only on a log of prior tags, neglecting other behaviors and information in social tagging systems, e.g., commenting on items and connecting with other users. On the other hand, little is known about the connection and correlations among these behaviors and contexts in social tagging systems. In this paper, we investigate improved modeling for predictive social tagging systems. Our explanatory analyses demonstrate three significant challenges: coupled high order interaction, data sparsity and cold start on items. We tackle these problems by using a generalized latent factor model and fully Bayesian treatment. To evaluate performance, we test on two real-world data sets from Flickr and Bibsonomy. Our experiments on these data sets show that to achieve the best predictive performance, it is necessary to employ a fully Bayesian treatment in modeling high order relations in social tagging systems. Our methods noticeably outperform state-of-the-art approaches.

european conference on machine learning | 2001

Wrapping web information providers by transducer induction

Boris Chidlovskii

Modern agent and mediator systems communicate with a multitude of Web information providers to better satisfy user requests. They use wrappers to extract relevant information from HTML responses and to annotate it with user-defined labels. A number of approaches exploit the methods of machine learning to induce instances of certain wrapper classes, by assuming the tabular structure of HTML responses and by observing the regularity of extracted fragments in the HTML structure. In this work, we propose a general approach and consider the information extraction conducted by wrappers as a special form of transduction. We make no assumption about the HTML response structure and profit from the advanced methods of transducer induction, in order to develop two powerful wrapper classes, for samples with and without ambiguous translations. We test the proposed induction methods on a set of general-purpose and bibliographic data providers and report the results of experiments.
