Natalia Vanetik
Ben-Gurion University of the Negev
Publications
Featured research published by Natalia Vanetik.
International Conference on Data Mining | 2002
Natalia Vanetik; Ehud Gudes; Solomon Eyal Shimony
Whereas data mining in structured data focuses on frequent data values, in semistructured and graph data the emphasis is on frequent labels and common topologies. Here, the structure of the data is just as important as its content. We study the problem of discovering typical patterns of graph data. The discovered patterns can be useful for many applications, including: compact representation of source information and a road-map for browsing and querying information sources. Difficulties arise in the discovery task from the complexity of some of the required sub-tasks, such as sub-graph isomorphism. This paper proposes a new algorithm for mining graph data, based on a novel definition of support. Empirical evidence shows practical, as well as theoretical, advantages of our approach.
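As a rough illustration of the mining setting only (not the paper's algorithm or its novel support definition), the sketch below computes a simple transaction-style support, the fraction of database graphs that contain a labeled pattern, using networkx's generic subgraph isomorphism matcher. The graphs, labels, and helper names are illustrative.

```python
# A minimal sketch (not the paper's algorithm): a transaction-style support
# for a labeled pattern graph over a database of labeled graphs, using
# off-the-shelf subgraph isomorphism from networkx. Toy data throughout.
import networkx as nx
from networkx.algorithms import isomorphism

def contains_pattern(graph, pattern):
    """True if `pattern` occurs as a label-preserving subgraph of `graph`."""
    matcher = isomorphism.GraphMatcher(
        graph, pattern,
        node_match=isomorphism.categorical_node_match("label", None),
    )
    return matcher.subgraph_is_isomorphic()

def support(pattern, graph_db):
    """Fraction of database graphs that contain the pattern at least once."""
    hits = sum(1 for g in graph_db if contains_pattern(g, pattern))
    return hits / len(graph_db)

def labeled_graph(edges, labels):
    g = nx.Graph()
    for node, label in labels.items():
        g.add_node(node, label=label)
    g.add_edges_from(edges)
    return g

# Toy database: two labeled triangles and one labeled path.
db = [
    labeled_graph([(0, 1), (1, 2), (2, 0)], {0: "A", 1: "B", 2: "C"}),
    labeled_graph([(0, 1), (1, 2), (2, 0)], {0: "A", 1: "B", 2: "B"}),
    labeled_graph([(0, 1), (1, 2)], {0: "A", 1: "B", 2: "C"}),
]
pattern = labeled_graph([(0, 1)], {0: "A", 1: "B"})  # a single A--B edge
print(support(pattern, db))  # 1.0: every graph contains an A--B edge
```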
IEEE Transactions on Knowledge and Data Engineering | 2006
Ehud Gudes; Solomon Eyal Shimony; Natalia Vanetik
Whereas data mining in structured data focuses on frequent data values, in semistructured and graph data mining the issue is frequent labels and common specific topologies. The structure of the data is just as important as its content. We study the problem of discovering typical patterns of graph data, a task made difficult because of the complexity of required subtasks, especially subgraph isomorphism. In this paper, we propose a new apriori-based algorithm for mining graph data, where the basic building blocks are relatively large, disjoint paths. The algorithm is proven to be sound and complete. Empirical evidence shows practical advantages of our approach for certain categories of graphs.
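The sketch below only illustrates the generic apriori principle the paper builds on: level-wise candidate generation with anti-monotone pruning. It operates on plain itemsets rather than graphs; the paper's actual building blocks, relatively large disjoint paths, and their join step are not reproduced, and the transactions and names are illustrative.

```python
# A minimal sketch of the generic apriori (level-wise) loop on itemsets.
# The paper applies the same principle to graphs, joining disjoint paths
# instead of items; that graph-specific join step is not reproduced here.
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (as frozensets) with their absolute support."""
    items = {frozenset([x]) for t in transactions for x in t}

    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-sets that differ in one element.
        cands = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step (anti-monotonicity): every (k-1)-subset must be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: s for c, s in count(cands).items() if s >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"])]
print(apriori(transactions, min_support=2))
```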
Data Mining and Knowledge Discovery | 2006
Natalia Vanetik; Solomon Eyal Shimony; Ehud Gudes
The concept of support is central to data mining. While the definition of support in transaction databases is intuitive and simple, that is not the case in graph datasets and databases. Most mining algorithms require the support of a pattern to be no greater than that of its subpatterns, a property called anti-monotonicity, or admissibility. This paper examines the requirements for admissibility of a support measure. Support measures for mining graphs are usually based on the notion of an instance graph: a graph representing all the instances of the pattern in a database and their intersection properties. Necessary and sufficient conditions for support measure admissibility, based on operations on instance graphs, are developed and proved. The sufficient conditions are used to prove admissibility of one support measure, the size of the independent set in the instance graph. Conversely, the necessary conditions are used to quickly show that some other support measures, such as weighted count of instances, are not admissible.
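Below is a minimal sketch of the independent-set support measure mentioned above, assuming it refers to a maximum independent set of the instance graph (nodes are pattern instances, edges connect overlapping instances). The brute-force search and the toy instance graph are only illustrative.

```python
# A minimal sketch of an independent-set-based support measure: the size of a
# largest independent set in the instance graph.  Brute force is used, so this
# is only suitable for tiny, illustrative instance graphs.
from itertools import combinations

def max_independent_set_size(nodes, edges):
    """Size of a maximum independent set, by brute-force enumeration."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(p) not in edge_set for p in combinations(subset, 2)):
                return size
    return 0

# Toy instance graph: instances 0 and 1 overlap, 1 and 2 overlap, 0 and 2 do not.
instances = [0, 1, 2]
overlaps = [(0, 1), (1, 2)]
print(max_independent_set_size(instances, overlaps))  # 2 -> support of the pattern
```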
Data Mining and Knowledge Discovery | 2009
Vladimir Lipets; Natalia Vanetik; Ehud Gudes
We present a novel approach to the problem of finding all subgraphs and induced subgraphs of a (target) graph which are isomorphic to another (pattern) graph. To attain efficiency we use a special representation of the pattern graph. We also combine our search algorithm with some known bisection algorithms. Experimental comparison with other algorithms was performed on several types of graphs. The comparison results suggest that the approach provided here is most effective when all instances of a subgraph need to be found.
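The sketch below shows only the task being solved, enumerating all embeddings of a pattern graph in a target graph, using networkx's generic matcher rather than the authors' algorithm (their special pattern representation and bisection steps are not reproduced); the toy graphs are illustrative.

```python
# A minimal sketch of the task itself (not the authors' algorithm): enumerate
# every subgraph of a target graph that is isomorphic to a pattern graph.
import networkx as nx
from networkx.algorithms import isomorphism

target = nx.cycle_graph(4)   # 4-cycle: 0-1-2-3-0
pattern = nx.path_graph(3)   # path on 3 nodes: 0-1-2

matcher = isomorphism.GraphMatcher(target, pattern)
# Each mapping sends target nodes to pattern nodes for one (induced) embedding.
# (Recent networkx versions also offer subgraph_monomorphisms_iter() for
#  non-induced matches.)
embeddings = list(matcher.subgraph_isomorphisms_iter())
print(len(embeddings), "embeddings found")
```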
Bioinformatics | 2014
Chen Yanover; Natalia Vanetik; Michael Levitt; Rachel Kolodny; Chen Keasar
MOTIVATION: Structural knowledge, extracted from the Protein Data Bank (PDB), underlies numerous potential functions and prediction methods. The PDB, however, is highly biased: many proteins have more than one entry, while entire protein families are represented by a single structure, or even not at all. The standard solution to this problem is to limit the studies to non-redundant subsets of the PDB. While alleviating biases, this solution hides the many-to-many relations between sequences and structures. That is, non-redundant datasets conceal the diversity of sequences that share the same fold and the existence of multiple conformations for the same protein. A particularly disturbing aspect of non-redundant subsets is that they hardly benefit from the rapid pace of protein structure determination, as most newly solved structures fall within existing families.

RESULTS: In this study we explore the concept of redundancy-weighted datasets, originally suggested by Miyazawa and Jernigan. Redundancy-weighted datasets include all available structures and associate them (or features thereof) with weights that are inversely proportional to the number of their homologs. Here, we provide the first systematic comparison of redundancy-weighted datasets with non-redundant ones. We test three weighting schemes and show that the distributions of structural features that they produce are smoother (having higher entropy) compared with the distributions inferred from non-redundant datasets. We further show that these smoothed distributions are both more robust and more correct than their non-redundant counterparts. We suggest that the better distributions, inferred using redundancy-weighting, may improve the accuracy of knowledge-based potentials and increase the power of protein structure prediction methods. Consequently, they may enhance model-driven molecular biology.
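A minimal sketch of the redundancy-weighting idea: each structure is weighted by the inverse of its homolog-cluster size, so large families no longer dominate a feature distribution. The cluster assignments and feature values below are invented placeholders, not real PDB data.

```python
# A minimal sketch of redundancy weighting: each structure gets weight
# 1 / (size of its homolog cluster), so a family with many entries contributes
# the same total weight as a singleton.  All data here are placeholders.
from collections import Counter, defaultdict

structures = {            # structure id -> homolog cluster id (illustrative)
    "1abc": "fam1", "2abc": "fam1", "3abc": "fam1",
    "1xyz": "fam2",
}
feature = {               # structure id -> some discretized structural feature
    "1abc": "helix", "2abc": "helix", "3abc": "helix",
    "1xyz": "sheet",
}

cluster_sizes = Counter(structures.values())
weights = {s: 1.0 / cluster_sizes[c] for s, c in structures.items()}

weighted = defaultdict(float)
for s, value in feature.items():
    weighted[value] += weights[s]
total = sum(weighted.values())
distribution = {v: w / total for v, w in weighted.items()}
print(distribution)   # {'helix': 0.5, 'sheet': 0.5} instead of 0.75 / 0.25
```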
Web-Age Information Management | 2010
Natalia Vanetik
The area of graph mining bears great importance when dealing with semi-structured data such as XML, text, and chemical and genetic data. One of the main challenges in this field is that it is hard to find the interesting subgraphs among the many frequent subgraphs produced. We propose a novel algorithm that finds subgraphs of limited diameter and high symmetry. These subgraphs represent the more structurally interesting patterns in the database. Our approach also decreases processing time drastically by employing the tree-decomposition structure of the database graphs during the discovery process.
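The sketch below illustrates only the filtering criterion, small diameter and non-trivial symmetry (automorphism count), not the paper's tree-decomposition-based discovery; the thresholds and toy graphs are illustrative.

```python
# A minimal sketch of the filtering criterion only: keep candidate subgraphs
# whose diameter is small and whose automorphism group is non-trivial, i.e.
# highly symmetric.  The discovery process itself is not reproduced here.
import networkx as nx
from networkx.algorithms import isomorphism

def automorphism_count(g):
    """Number of label-free automorphisms of g (fine for small graphs)."""
    return sum(1 for _ in isomorphism.GraphMatcher(g, g).isomorphisms_iter())

def is_interesting(g, max_diameter=2, min_symmetry=2):
    if not nx.is_connected(g):
        return False
    return nx.diameter(g) <= max_diameter and automorphism_count(g) >= min_symmetry

candidates = {"4-cycle": nx.cycle_graph(4),
              "4-path": nx.path_graph(4),
              "3-star": nx.star_graph(3)}
for name, g in candidates.items():
    print(name, is_interesting(g))   # 4-cycle True, 4-path False, 3-star True
```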
Empirical Methods in Natural Language Processing | 2015
Marina Litvak; Natalia Vanetik
Automated text summarization is aimed at extracting essential information from the original text and presenting it in a minimal, often predefined, number of words. In this paper, we introduce a new approach for unsupervised extractive summarization based on the Minimum Description Length (MDL) principle, using the Krimp dataset compression algorithm (Vreeken et al., 2011). Our approach represents a text as a transactional dataset, with sentences as transactions, and then describes it by itemsets that stand for frequent sequences of words. The summary is then compiled from sentences that compress (and as such, best describe) the document. The problem of summarization is thus reduced to a maximal coverage problem, following the assumption that a summary that best describes the original text should cover most of the word sequences describing the document. We solve it with a greedy algorithm and present the evaluation results.
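A minimal sketch of the final greedy coverage step only: the frequent word sequences are given directly here instead of being mined by Krimp under MDL, and the sentences, sequences, and function names are illustrative.

```python
# A minimal sketch of the greedy coverage step: repeatedly pick the sentence
# that covers the most not-yet-covered frequent word sequences.  The MDL/Krimp
# mining step that would produce `frequent_seqs` is not reproduced here.
def greedy_summary(sentences, frequent_seqs, max_sentences=2):
    covered, chosen = set(), []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        def gain(i):
            return sum(1 for seq in frequent_seqs
                       if seq not in covered and seq in sentences[i])
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break
        covered.update(seq for seq in frequent_seqs if seq in sentences[best])
        chosen.append(best)
        remaining.remove(best)
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "graph mining finds frequent patterns in graph data",
    "frequent patterns describe the data compactly",
    "the weather was pleasant",
]
frequent_seqs = ["frequent patterns", "graph data", "describe the data"]
print(greedy_summary(sentences, frequent_seqs))  # first two sentences
```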
Meeting of the Association for Computational Linguistics | 2016
Marina Litvak; Natalia Vanetik; Elena Churkin
The MUSEEC (MUltilingual SEntence Extraction and Compression) summarization tool implements several extractive summarization techniques, at the level of both complete and compressed sentences, that can be applied, with some minor adaptations, to documents in multiple languages. The current version of MUSEEC provides the following summarization methods: (1) MUSE, a supervised summarizer based on a genetic algorithm (GA) that ranks document sentences and extracts the top-ranking sentences into a summary; (2) POLY, an unsupervised summarizer based on linear programming (LP) that selects the best extract of document sentences; and (3) WECOM, an unsupervised extension of POLY that compiles a document summary from compressed sentences. In this paper, we provide an overview of the MUSEEC methods and of the tool's overall architecture.
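As a rough sketch of the extraction step shared by these methods, the code below takes precomputed per-sentence scores (in MUSEEC they would come from the GA-trained MUSE ranker or POLY's LP model, neither of which is reproduced here) and fills a word budget with the top-ranked sentences; the data and names are illustrative.

```python
# A minimal sketch of the shared extraction step: given per-sentence scores,
# take top-scoring sentences until a word budget is exhausted, then restore
# original document order.  The scoring models themselves are not shown.
def extract_summary(sentences, scores, word_budget=20):
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= word_budget:
            chosen.append(i)
            used += length
    return " ".join(sentences[i] for i in sorted(chosen))

sentences = ["First important sentence.", "Filler sentence here.", "Second key point."]
scores = [0.9, 0.1, 0.7]    # placeholder scores
print(extract_summary(sentences, scores, word_budget=7))
```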
Archive | 2018
Marina Litvak; Natalia Vanetik; Lei Li
Extractive text summarization aims at selecting a small subset of sentences so that the contents and meaning of the original document are best preserved. In this paper we describe an unsupervised approach to extractive summarization. It combines hierarchical topic modeling (TM) with the Minimal Description Length (MDL) principle and applies them to the Chinese language. Our summarizer strives to extract information that provides the best description of text topics in terms of MDL. This model is applied to the NLPCC 2015 Shared Task on Weibo-Oriented Chinese News Summarization [1], where Chinese news articles were summarized with the goal of creating short, meaningful messages for Weibo (Sina Weibo is a Chinese microblogging website and one of the most popular sites in China) [2]. The experimental results demonstrate the superiority of our approach over the other summarizers from the NLPCC 2015 competition.
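The sketch below swaps in a flat LDA model from scikit-learn in place of the paper's hierarchical topic model and MDL scoring, simply scoring sentences by their weight on the document's dominant topic; real Chinese input would additionally require word segmentation, and the toy sentences are placeholders.

```python
# A minimal sketch, not the paper's hierarchical TM + MDL scheme: score
# sentences by how strongly they express the document's dominant LDA topic
# and extract the top ones.  Toy English-like sentences stand in for
# segmented Chinese text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "the summit discussed trade policy and tariffs",
    "tariffs on imports were raised after the summit",
    "a local team won the football match yesterday",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
sent_topics = lda.fit_transform(counts)            # sentence-by-topic weights

dominant_topic = sent_topics.sum(axis=0).argmax()  # main topic of the document
scores = sent_topics[:, dominant_topic]
top = scores.argsort()[::-1][:2]                   # keep the two best sentences
print([sentences[i] for i in sorted(top)])
```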
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres | 2017
Marina Litvak; Natalia Vanetik
Query-based text summarization is aimed at extracting essential information that answers the query from the original text. The answer is presented in a minimal, often predefined, number of words. In this paper we introduce a new unsupervised approach for query-based extractive summarization, based on the minimum description length (MDL) principle and employing the Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting sentences that best cover the query-related frequent word sets. The approach is evaluated on the DUC 2005 and DUC 2006 datasets, which are specifically designed for query-based summarization (DUC, 2005 2006), and its results are competitive with the best reported ones.
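A minimal sketch of the selection idea only: frequent word sets are filtered by overlap with the query, and sentences are ranked by how many of them they cover. The Krimp/MDL compression step is not reproduced, and all data and names are illustrative.

```python
# A minimal sketch of the query-focused selection step (the Krimp/MDL mining
# that would produce `frequent_sets` is not reproduced): keep frequent word
# sets that overlap the query, then pick the sentences that cover the most.
def query_related(frequent_sets, query_terms):
    return [s for s in frequent_sets if s & query_terms]

def summarize(sentences, frequent_sets, query, max_sentences=1):
    query_terms = set(query.lower().split())
    related = query_related(frequent_sets, query_terms)

    def coverage(sentence):
        words = set(sentence.lower().split())
        return sum(1 for s in related if s <= words)

    ranked = sorted(sentences, key=coverage, reverse=True)
    return ranked[:max_sentences]

sentences = [
    "solar power capacity grew rapidly last year",
    "the committee debated unrelated budget items",
]
frequent_sets = [{"solar", "power"}, {"budget", "items"}]
print(summarize(sentences, frequent_sets, query="growth of solar power"))
```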