Periklis Andritsos
University of Toronto
Publications
Featured research published by Periklis Andritsos.
International Conference on Extending Database Technology | 2004
Periklis Andritsos; Panayiotis Tsaparas; Renée J. Miller; Kenneth C. Sevcik
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
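As a rough illustration of the Information Bottleneck machinery the abstract refers to, the sketch below computes the information lost when two categorical tuples (or clusters) are merged: a Jensen-Shannon divergence between their distributions over attribute values, weighted by relative mass. The tuple encoding and the uniform masses are simplifying assumptions for illustration, not LIMBO's actual implementation.

```python
import math
from collections import Counter

def _kl(p, q):
    """Kullback-Leibler divergence D(p || q); q must cover p's support."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def ib_merge_cost(tuple_a, tuple_b, mass_a=1.0, mass_b=1.0):
    """Information lost by merging two categorical tuples (or clusters):
    the Jensen-Shannon divergence between their distributions over
    attribute values, weighted by relative mass and scaled by the merged
    mass, as in Information Bottleneck-style agglomerative clustering."""
    pa, pb = Counter(tuple_a), Counter(tuple_b)
    na, nb = sum(pa.values()), sum(pb.values())
    pa = {k: v / na for k, v in pa.items()}   # p(value | tuple_a)
    pb = {k: v / nb for k, v in pb.items()}   # p(value | tuple_b)
    wa = mass_a / (mass_a + mass_b)
    wb = mass_b / (mass_a + mass_b)
    mix = {k: wa * pa.get(k, 0.0) + wb * pb.get(k, 0.0)
           for k in set(pa) | set(pb)}        # merged cluster's distribution
    js = wa * _kl(pa, mix) + wb * _kl(pb, mix)
    return (mass_a + mass_b) * js

# Tuples encoded as lists of attribute=value strings.
t1 = ["colour=red", "shape=circle", "size=small"]
t2 = ["colour=red", "shape=square", "size=small"]
t3 = ["colour=blue", "shape=square", "size=large"]
print(ib_merge_cost(t1, t2))  # ~0.67: two of three values shared
print(ib_merge_cost(t1, t3))  # 2.0: no values shared, maximal loss
```

An agglomerative IB clustering repeatedly merges the pair with the smallest such cost, which is what makes a hierarchy of clusterings available from a single run.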
IEEE Transactions on Software Engineering | 2005
Periklis Andritsos; Vassilios Tzerpos
The majority of the algorithms in the software clustering literature utilize structural information to decompose large software systems. Approaches using other attributes, such as file names or ownership information, have also demonstrated merit. At the same time, existing algorithms commonly deem all attributes of the software artifacts being clustered as equally important, a rather simplistic assumption. Moreover, no method that can assess the usefulness of a particular attribute for clustering purposes has been presented in the literature. In this paper, we present an approach that applies information theoretic techniques in the context of software clustering. Our approach allows for weighting schemes that reflect the importance of various attributes to be applied. We introduce LIMBO, a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system. We also present a method that can assess the usefulness of any nonstructural attribute in a software clustering context. We applied LIMBO to three large software systems in a number of experiments. The results indicate that this approach produces clusterings that come close to decompositions prepared by system experts. Experimental results were also used to validate our usefulness assessment method. Finally, we experimented with well-established weighting schemes from information retrieval, Web search, and data clustering. We report results as to which weighting schemes show merit in the decomposition of software systems.
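The abstract mentions weighting schemes from information retrieval; here is a minimal sketch of one such scheme (TF-IDF, an illustrative choice among the several the paper evaluates) applied to non-structural attributes of software artifacts:

```python
import math
from collections import Counter

def tfidf_weights(artifacts):
    """TF-IDF-style weighting of artifact attributes: attributes shared by
    almost every artifact (e.g. a ubiquitous header) carry little clustering
    signal and are down-weighted, while rare, discriminating attributes are
    up-weighted. The concrete attribute names below are made up."""
    n = len(artifacts)
    df = Counter(a for attrs in artifacts.values() for a in set(attrs))
    weighted = {}
    for name, attrs in artifacts.items():
        tf = Counter(attrs)
        weighted[name] = {a: tf[a] * math.log(n / df[a]) for a in tf}
    return weighted

# Artifacts described by non-structural attributes (developer, directory, ...).
artifacts = {
    "parser.c":  ["dev=alice", "dir=frontend", "uses=stdio.h"],
    "lexer.c":   ["dev=alice", "dir=frontend", "uses=stdio.h"],
    "codegen.c": ["dev=bob",   "dir=backend",  "uses=stdio.h"],
}
for name, w in tfidf_weights(artifacts).items():
    print(name, w)  # 'uses=stdio.h' gets weight 0: it appears in every artifact
```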
International Conference on Data Engineering | 2006
Periklis Andritsos; Ariel Fuxman; Renée J. Miller
The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
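A toy sketch of the probabilistic semantics described here, under the assumption that each cluster of duplicates corresponds to one real-world entity and exactly one tuple per cluster belongs to the clean database. The schema and helper names are illustrative, and ConQuer's actual approach rewrites SQL queries rather than evaluating them in application code:

```python
from collections import defaultdict

def query_with_probabilities(rows, predicate, answer_of):
    """Selection query over duplicated data. Each row carries its duplicate
    cluster (the real-world entity it may describe) and the probability of
    being that cluster's clean representative. Within a cluster the options
    are mutually exclusive, so a cluster contributes an answer with the
    summed probability of its matching rows; across clusters, the answer
    survives if at least one contributing cluster yields it."""
    per_cluster = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if predicate(row):
            per_cluster[answer_of(row)][row["cluster"]] += row["p"]
    answers = {}
    for ans, clusters in per_cluster.items():
        p_missing = 1.0
        for mass in clusters.values():
            p_missing *= 1.0 - mass
        answers[ans] = 1.0 - p_missing
    return answers

rows = [
    {"cluster": "c1", "name": "J. Smith",   "city": "Toronto", "p": 0.7},
    {"cluster": "c1", "name": "John Smith", "city": "Ottawa",  "p": 0.3},
    {"cluster": "c2", "name": "A. Jones",   "city": "Toronto", "p": 1.0},
]
# SELECT name WHERE city = 'Toronto', with answer probabilities:
print(query_with_probabilities(rows,
                               predicate=lambda r: r["city"] == "Toronto",
                               answer_of=lambda r: r["name"]))
# roughly {'J. Smith': 0.7, 'A. Jones': 1.0}
```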
International Conference on Management of Data | 2007
Anna Stavrianou; Periklis Andritsos; Nicolas Nicoloyannis
Text mining refers to the discovery of previously unknown knowledge in text collections. In recent years, the field has received great attention due to the abundance of textual data. A researcher in this area must cope with issues originating from the particularities of natural language. This survey discusses such semantic issues along with the approaches and methodologies proposed in the existing literature. It covers syntactic matters and tokenization concerns, and focuses on the different text representation techniques, categorisation tasks, and similarity measures that have been suggested.
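As one concrete instance of the representation and similarity techniques such surveys cover, here is a minimal bag-of-words cosine similarity between two texts; the tokenizer is deliberately naive (real pipelines add stemming, stop-word removal, n-grams, and so on):

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokenization: lowercase alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("text mining finds knowledge in text",
                        "mining text collections for knowledge"))
```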
International Conference on Management of Data | 2004
Periklis Andritsos; Renée J. Miller; Panayiotis Tsaparas
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
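A toy version of ranking functional dependencies by data redundancy, using simple entropy estimates over an instance. The scoring function below is an illustrative proxy, not the paper's actual measure: it rewards FDs that (nearly) hold and whose right-hand side would be stored redundantly many times.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def fd_redundancy(rows, lhs, rhs):
    """Toy redundancy score for a candidate FD lhs -> rhs. H(rhs | lhs)
    measures how far the FD is from holding; if it (nearly) holds, every
    repeated lhs value forces its rhs value to be stored again, so the
    repeated fraction of tuples times H(rhs) estimates the redundant bits
    a vertical decomposition could save."""
    n = len(rows)
    x = [tuple(r[a] for a in lhs) for r in rows]
    y = [tuple(r[a] for a in rhs) for r in rows]
    h_y_given_x = entropy(list(zip(x, y))) - entropy(x)
    repeated = (n - len(set(x))) / n
    return repeated * entropy(y) - h_y_given_x

rows = [
    {"city": "Toronto",  "prov": "ON", "zip": "M5S"},
    {"city": "Toronto",  "prov": "ON", "zip": "M4Y"},
    {"city": "Ottawa",   "prov": "ON", "zip": "K1A"},
    {"city": "Montreal", "prov": "QC", "zip": "H2X"},
]
print(fd_redundancy(rows, ["city"], ["prov"]))  # holds and is redundant: > 0
print(fd_redundancy(rows, ["city"], ["zip"]))   # does not hold: penalized to 0
```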
Working Conference on Reverse Engineering | 2003
Periklis Andritsos; Vassilios Tzerpos
The majority of the algorithms in the software clustering literature utilize structural information in order to decompose large software systems. Other approaches, such as using file names or ownership information, have also demonstrated merit. However, there is no intuitive way to combine information obtained from these two different types of techniques. In this paper, we present an approach that combines structural and non-structural information in an integrated fashion. LIMBO is a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system. We apply LIMBO to two large software systems in a number of experiments. The results indicate that this approach produces valid and useful clusterings of large software systems. LIMBO can also be used to evaluate the usefulness of various types of non-structural information to the software clustering process.
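The integration step can be pictured as encoding each artifact with one uniform set of categorical features drawn from both sources, which a categorical clustering algorithm such as LIMBO can then consume directly. The feature naming in this sketch is illustrative:

```python
def artifact_features(structural, nonstructural):
    """Combine structural dependencies and non-structural attributes into a
    single categorical feature set per artifact, so one clustering
    algorithm can treat both kinds of information uniformly."""
    features = {}
    for artifact, deps in structural.items():
        feats = {f"calls={d}" for d in deps}
        feats |= {f"{k}={v}" for k, v in nonstructural.get(artifact, {}).items()}
        features[artifact] = feats
    return features

structural = {"parser.c": ["lexer.c", "ast.c"], "codegen.c": ["ast.c"]}
nonstructural = {"parser.c": {"dev": "alice", "dir": "frontend"},
                 "codegen.c": {"dev": "bob", "dir": "backend"}}
print(artifact_features(structural, nonstructural))
```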
Database and Expert Systems Applications | 2008
Themis Palpanas; Junaid Ahsenali Chaudhry; Periklis Andritsos; Yannis Velegrakis
In recent years, we have been witnessing an increasing interest in the semantic Web and the relevant technologies, which can have a significant impact on the enterprise environment of information and knowledge management. An important observation is that the entity identification problem lies at the core of many semantic Web applications. In this paper, we examine the special requirements of storage and management for entities, in the context of an entity management system for the semantic Web. We study the requirements with respect to creating and modifying these entities, as well as to managing their evolution over time. Finally, we propose a conceptual model for the representation of entities, and discuss related research directions.
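One way to picture the kind of representation these requirements suggest is an entity as a bag of attribute assertions, each carrying provenance and a validity interval so that evolution over time can be tracked. This concrete schema is only a hypothetical sketch; the paper proposes a conceptual model, not an implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attribute:
    """One attribute assertion about an entity, with the provenance and
    validity interval needed to manage evolution. Field names are made up."""
    name: str
    value: object
    source: str
    valid_from: str
    valid_to: Optional[str] = None  # None = still valid

@dataclass
class Entity:
    entity_id: str
    attributes: List[Attribute] = field(default_factory=list)

    def current(self) -> List[Attribute]:
        """The attribute assertions that are valid now."""
        return [a for a in self.attributes if a.valid_to is None]

e = Entity("urn:example:e1")
e.attributes.append(Attribute("name", "ACME Corp", "crm", "2006-01-01"))
print([(a.name, a.value) for a in e.current()])
```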
International Conference on Software Maintenance | 2005
Yijun Yu; Homayoun Dayani-Fard; John Mylopoulos; Periklis Andritsos
Large-scale legacy programs take a long time to compile, thereby hampering productivity. This paper presents algorithms that reduce compilation time by analyzing syntactic dependencies in fine-grained program units, and by removing redundancies as well as false dependencies. These algorithms are combined with parallel compilation techniques (compiler farms, compiler caches) to further reduce build time. We demonstrate through experiments their effectiveness in achieving significant speedups for both fresh and incremental builds.
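A simplified sketch of false-dependency removal, assuming we already know which symbols each header declares and each source file uses: an include edge is dropped when the including file uses none of the header's symbols, shrinking the rebuild cascade after a header change. The paper works at a finer grain and also restructures program units; this only illustrates the idea.

```python
def prune_false_dependencies(includes, declares, uses):
    """Keep an include edge only if the including file actually uses at
    least one symbol the header declares; otherwise it is a false
    dependency that needlessly triggers recompilation."""
    pruned = {}
    for src, headers in includes.items():
        pruned[src] = [h for h in headers
                       if uses.get(src, set()) & declares.get(h, set())]
    return pruned

includes = {"main.c": ["util.h", "log.h"]}
declares = {"util.h": {"min", "max"}, "log.h": {"log_msg"}}
uses = {"main.c": {"max", "printf"}}
print(prune_false_dependencies(includes, declares, uses))
# {'main.c': ['util.h']} -- log.h was a false dependency
```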
Advances in Geographic Information Systems | 2014
Paolo Bolzoni; Sven Helmer; Kevin Wellenzohn; Johann Gamper; Periklis Andritsos
We propose a more realistic approach to trip planning for tourist applications by adding category information to points of interest (POIs). This makes it easier for tourists to formulate their preferences by stating constraints on categories rather than individual POIs. However, solving this problem is not just a matter of extending existing algorithms. In our approach we exploit the fact that POIs are usually not evenly distributed but tend to appear in clusters. We develop a group of efficient algorithms based on clustering with guaranteed theoretical bounds. We also evaluate our algorithms experimentally, using real-world data sets, showing that in practice the results are better than the theoretical guarantees and very close to the optimal solution.
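For intuition, here is a greedy baseline for category-constrained trip planning: repeatedly visit the nearest POI of a still-required category until the distance budget is exhausted. Unlike the clustering-based algorithms in the paper, this sketch carries no approximation guarantee, and all names and coordinates are made up.

```python
import math

def plan_trip(pois, required, budget, start=(0.0, 0.0)):
    """Greedy category-aware trip planning: while some category is still
    required and budget remains, walk to the nearest POI of a required
    category. `required` maps each category to how many POIs of it the
    tourist wants to visit."""
    pos, spent, route = start, 0.0, []
    need = dict(required)
    todo = list(pois)
    while todo and any(n > 0 for n in need.values()):
        candidates = [p for p in todo if need.get(p["cat"], 0) > 0]
        if not candidates:
            break
        nxt = min(candidates, key=lambda p: math.dist(pos, p["loc"]))
        step = math.dist(pos, nxt["loc"])
        if spent + step > budget:
            break
        route.append(nxt["name"])
        spent += step
        need[nxt["cat"]] -= 1
        pos = nxt["loc"]
        todo.remove(nxt)
    return route, spent

pois = [
    {"name": "Duomo",     "cat": "church", "loc": (0.2, 0.1)},
    {"name": "Museion",   "cat": "museum", "loc": (0.3, 0.4)},
    {"name": "Trattoria", "cat": "food",   "loc": (0.25, 0.15)},
]
print(plan_trip(pois, {"church": 1, "museum": 1, "food": 1}, budget=2.0))
```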
Very Large Data Bases | 2008
Hamid Motahari; Boualem Benatallah; Regis Saint-Paul; Fabio Casati; Periklis Andritsos
Business processes (BPs) are central to the operation of both public and private organizations. A business process is a set of coordinated tasks and activities that achieve a business objective or goal. Given the importance of BPs to overall efficiency and effectiveness, the competitiveness of organizations hinges on continuous BP improvement. In the nineties, the focus of BP improvement was on automation: workflow management systems (WfMSs) and other middleware technologies were used to reduce cost and improve efficiency by providing better system integration and automated enactment of operational business processes. Recently, the focus has expanded to the monitoring, analysis, and understanding of business processes, and such techniques are incorporated in business process management systems (BPMSs).