Marina Litvak
Ben-Gurion University of the Negev
Publications
Featured research published by Marina Litvak.
International Conference on Computational Linguistics | 2008
Marina Litvak
In this paper, we introduce and compare two novel approaches, one supervised and one unsupervised, for identifying the keywords to be used in extractive summarization of text documents. Both approaches are based on a graph-based syntactic representation of text and web documents, which enhances the traditional vector-space model by taking into account some structural document features. In the supervised approach, we train classification algorithms on a summarized collection of documents in order to induce a keyword identification model. In the unsupervised approach, we run the HITS algorithm on document graphs under the assumption that the top-ranked nodes represent the document keywords. Our experiments on a collection of benchmark summaries show that, given a set of summarized training documents, supervised classification provides the highest keyword identification accuracy, while the highest F-measure is reached with a simple degree-based ranking. In addition, it is sufficient to perform only the first iteration of HITS rather than running it to convergence.
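To make the unsupervised variant concrete, here is a minimal sketch (an illustration under simplifying assumptions, not the authors' implementation): it builds a word co-occurrence graph from whitespace-tokenized text and ranks nodes either by plain degree or by one HITS iteration. On an undirected graph, a single HITS update from uniform scores reduces to node degree up to normalization, which matches the paper's observation that the first iteration suffices.

```python
from collections import defaultdict

def build_graph(tokens, window=2):
    """Undirected co-occurrence graph: words within `window` positions are linked."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def degree_ranking(graph, k=5):
    """Degree-based ranking, which yielded the highest F-measure in the paper."""
    return sorted(graph, key=lambda w: len(graph[w]), reverse=True)[:k]

def hits_one_iteration(graph, k=5):
    """One HITS update from uniform scores; on an undirected graph this is
    proportional to node degree, so one iteration already suffices."""
    scores = {w: 1.0 for w in graph}
    scores = {w: sum(scores[u] for u in graph[w]) for w in graph}
    norm = sum(scores.values()) or 1.0
    scores = {w: s / norm for w, s in scores.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = "keyword graphs rank keyword nodes by keyword graph degree".split()
print(degree_ranking(build_graph(tokens)))
print(hits_one_iteration(build_graph(tokens)))
```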
Atlantic Web Intelligence Conference | 2011
Marina Litvak; Hen Aizenman; Inbal Gobits; Abraham Kandel
In this paper, we introduce DegExt, a graph-based language-independent keyphrase extractor, which extends the keyword extraction method described in [6]. We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx [11] and TextRank [8].
Information Retrieval | 2013
Marina Litvak
The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some automated summarization methods can be defined as "language-independent," i.e., not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined in Mani (Automatic summarization. Natural language processing. John Benjamins Publishing Company, Amsterdam, 2001) as "processing several languages, with a summary in the same language as the input," but their performance is usually unsatisfactory due to the exclusion of language-specific knowledge. Moreover, supervised machine learning approaches need training corpora in multiple languages, which are usually unavailable for rare languages and whose creation is a very expensive and labor-intensive process. In this article, we describe cross-lingual methods for training an extractive single-document text summarizer called MUSE (MUltilingual Sentence Extractor): a supervised approach based on the linear optimization of a rich set of sentence ranking measures using a Genetic Algorithm. We evaluated MUSE's performance on documents in three different languages: English, Hebrew, and Arabic, using several training scenarios. The summarization quality was measured using ROUGE-1 and ROUGE-2 recall metrics. The results of an extensive comparative analysis showed that MUSE performed better than the best known multilingual approach (TextRank) in all three languages. Moreover, our experimental results suggest that using the same sentence ranking model across languages yields reasonable summarization quality while saving considerable annotation effort for the end user. On the other hand, using parallel corpora generated by machine translation tools may improve the performance of a MUSE model trained on a foreign language. A comparative evaluation of an alternative optimization technique, Multiple Linear Regression, justifies the use of a Genetic Algorithm.
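As a hedged sketch of the training loop, the toy genetic algorithm below learns a linear combination of sentence-ranking measures. The feature list and the fitness function are illustrative stand-ins: MUSE optimizes a much richer set of language-independent measures against ROUGE scores on training summaries.

```python
import random

FEATURES = ["position", "length", "tf_sum", "title_overlap"]  # assumed toy subset

def score(sentence_feats, weights):
    return sum(w * f for w, f in zip(weights, sentence_feats))

def fitness(weights, docs):
    """Stand-in for ROUGE-based fitness: fraction of gold summary sentences
    recovered in the top-k extract, averaged over training documents."""
    total = 0.0
    for sentences, gold in docs:  # sentences: list of feature vectors
        ranked = sorted(range(len(sentences)),
                        key=lambda i: score(sentences[i], weights),
                        reverse=True)
        total += len(set(ranked[:len(gold)]) & set(gold)) / len(gold)
    return total / len(docs)

def evolve(docs, pop_size=30, generations=50):
    pop = [[random.uniform(-1, 1) for _ in FEATURES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, docs), reverse=True)
        elite = pop[:pop_size // 2]                 # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, len(FEATURES))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                 # Gaussian mutation
                child[random.randrange(len(FEATURES))] += random.gauss(0, 0.3)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, docs))
```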
Autonomous and Intelligent Systems | 2007
Marina Litvak; Slava Kisilevich
In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology whose main stages are: creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model in this domain using the constructed ontology and a classification algorithm, and classifying new documents by information agents via the induced model. We evaluated the proposed methodology in two specific domains: the chemical domain (web pages containing information about the production of certain chemicals) and a Yahoo! collection of web news documents divided into several categories. Our system receives as input the domain-specific ontology and a set of categorized web documents, and then performs concept generalization on these documents. We use a key-phrase extractor with an integrated ontology parser to create a database from the input documents and use it as a training set for the classification algorithm. The system's classification accuracy is estimated at various levels of the ontology.
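The concept-generalization step can be illustrated as follows; the toy parent-map ontology and the choice of classifier are assumptions made for the sake of a runnable example, not the paper's actual setup.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

PARENT = {  # toy domain ontology: term -> more general concept
    "benzene": "aromatic_compound", "toluene": "aromatic_compound",
    "ethanol": "alcohol", "methanol": "alcohol",
}

def generalize(terms, levels=1):
    """Replace each extracted key phrase by its ancestor `levels` steps up."""
    generalized = []
    for term in terms:
        for _ in range(levels):
            term = PARENT.get(term, term)
        generalized.append(term)
    return Counter(generalized)

# Train any off-the-shelf classifier on concept counts rather than raw terms.
train_docs = [(["benzene", "toluene"], "aromatics"),
              (["ethanol", "methanol"], "alcohols")]
vec = DictVectorizer()
X = vec.fit_transform([generalize(terms) for terms, _ in train_docs])
y = [label for _, label in train_docs]
model = MultinomialNB().fit(X, y)
print(model.predict(vec.transform([generalize(["methanol"])])))  # -> ['alcohols']
```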
Ambient Intelligence | 2013
Marina Litvak; Abraham Kandel
In this paper, we introduce DegExt, a graph-based language-independent keyphrase extractor, which extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney in Inf Retr 2:303–336, 2000) and TextRank (Mihalcea and Tarau in TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing, Barcelona, Spain, 2004). We evaluated DegExt on collections of benchmark summaries in two different languages: English and Hebrew. Our experiments on the English corpus show that DegExt significantly outperforms TextRank and GenEx in terms of precision and area under the curve for summaries of 15 or more keyphrases, at the expense of a mostly non-significant decrease in recall and F-measure, when the extracted phrases are matched against the gold standard collection. Owing to DegExt's tendency to extract longer phrases than GenEx and TextRank, when single extracted words are considered, DegExt outperforms both in terms of recall and F-measure. On the Hebrew corpus, DegExt performs on par with TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems on a text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.
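A minimal sketch of the degree-based idea, assuming pre-tokenized input (this is not the released DegExt code): rank words by degree in a co-occurrence graph, then merge runs of adjacent top-ranked words, which is the mechanism behind DegExt's tendency to return longer phrases.

```python
from collections import defaultdict

def degext_sketch(tokens, top_k=10):
    # Order-based edges between neighboring words.
    graph = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    # Keywords = highest-degree nodes.
    top = set(sorted(graph, key=lambda w: len(graph[w]), reverse=True)[:top_k])
    # Merge runs of adjacent keywords into keyphrases.
    phrases, current = [], []
    for word in tokens:
        if word in top:
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return sorted(set(phrases), key=len, reverse=True)
```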
Empirical Methods in Natural Language Processing | 2015
Marina Litvak; Natalia Vanetik
Automated text summarization aims to extract essential information from the original text and present it in a minimal, often predefined, number of words. In this paper, we introduce a new approach to unsupervised extractive summarization based on the Minimum Description Length (MDL) principle, using the Krimp dataset compression algorithm (Vreeken et al., 2011). Our approach represents a text as a transactional dataset, with sentences as transactions, and then describes it by itemsets that stand for frequent sequences of words. The summary is then compiled from the sentences that compress, and as such best describe, the document. The problem of summarization is thereby reduced to a maximum coverage problem, following the assumption that a summary that best describes the original text should cover most of the word sequences describing the document. We solve it with a greedy algorithm and present the evaluation results.
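The reduction to coverage can be sketched as follows; frequent word pairs stand in for Krimp's code table, and the MDL bookkeeping is elided, so this illustrates the greedy selection step only.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sentences, min_support=2):
    """Frequent word pairs as a crude stand-in for Krimp's frequent itemsets."""
    counts = Counter()
    for words in sentences:
        counts.update(combinations(sorted(set(words)), 2))
    return {pair for pair, c in counts.items() if c >= min_support}

def greedy_summary(sentences, budget=2, min_support=2):
    """Greedily pick sentences covering the most still-uncovered frequent sets."""
    uncovered = frequent_pairs(sentences, min_support)
    chosen = []
    while uncovered and len(chosen) < budget:
        best = max(range(len(sentences)),
                   key=lambda i: len({p for p in uncovered
                                      if set(p) <= set(sentences[i])}))
        chosen.append(best)
        uncovered -= {p for p in uncovered if set(p) <= set(sentences[best])}
    return [sentences[i] for i in chosen]
```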
ACM Transactions on Internet Technology | 2017
Jahna Otterbacher; Chee Siang Ang; Marina Litvak; David Atkins
Linguistic mimicry, the adoption of another's language patterns, is a subconscious behavior with pro-social benefits. However, some professions advocate its conscious use in empathic communication. This involves mutual mimicry; effective communicators mimic their interlocutors, who also mimic them back. Since mimicry has often been studied in face-to-face contexts, we ask whether individuals with empathic dispositions have unique communication styles and/or elicit mimicry in mediated communication on Facebook. Participants completed Davis's Interpersonal Reactivity Index and provided access to their Facebook activity. We confirm that dispositional empathy correlates with the use of particular stylistic features. In addition, we identify four empathy profiles and find correlations with writing style. When a linguistic feature is used, this often "triggers" its use by friends. However, the presence of particular features, rather than participant disposition, best predicts mimicry. This suggests that machine-human communication could be enhanced based on recently used features, without extensive user profiling.
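One way to operationalize the "trigger" analysis is sketched below; the data format (post/reply feature sets) and the lift-style statistic are assumptions for illustration, not the paper's exact statistical model.

```python
from collections import defaultdict

def mimicry_lift(exchanges, features):
    """For each stylistic feature, compare how often a friend's reply uses it
    after a post that used it vs. after a post that did not.
    `exchanges` is a list of (post_features, reply_features) sets."""
    after_use = defaultdict(lambda: [0, 0])      # feature -> [mimicked, total]
    after_absence = defaultdict(lambda: [0, 0])
    for post, reply in exchanges:
        for f in features:
            bucket = after_use if f in post else after_absence
            bucket[f][0] += int(f in reply)
            bucket[f][1] += 1
    return {f: after_use[f][0] / max(after_use[f][1], 1)
             - after_absence[f][0] / max(after_absence[f][1], 1)
            for f in features}
```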
Meeting of the Association for Computational Linguistics | 2016
Marina Litvak; Natalia Vanetik; Elena Churkin
The MUSEEC (MUltilingual SEntence Extraction and Compression) summarization tool implements several extractive summarization techniques, at the level of both complete and compressed sentences, that can be applied, with some minor adaptations, to documents in multiple languages. The current version of MUSEEC provides the following summarization methods: (1) MUSE, a supervised summarizer based on a genetic algorithm (GA) that ranks document sentences and extracts the top-ranking sentences into a summary; (2) POLY, an unsupervised summarizer based on linear programming (LP) that selects the best extract of document sentences; and (3) WECOM, an unsupervised extension of POLY that compiles a document summary from compressed sentences. In this paper, we provide an overview of the MUSEEC methods and of the tool's overall architecture.
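To illustrate the flavor of an LP-based selector like POLY, here is a generic integer-programming sketch using PuLP: maximize weighted term coverage under a length budget. This is not POLY's actual objective or formulation, only an example of posing extraction as a linear program with binary variables.

```python
import pulp

def lp_extract(sentences, term_weight, budget):
    """sentences: list of (length, set_of_terms); returns chosen indices."""
    prob = pulp.LpProblem("extract", pulp.LpMaximize)
    x = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(sentences))]
    terms = sorted({t for _, ts in sentences for t in ts})
    y = {t: pulp.LpVariable(f"t{j}", cat="Binary") for j, t in enumerate(terms)}
    # Objective: total weight of covered terms.
    prob += pulp.lpSum(term_weight.get(t, 1.0) * y[t] for t in terms)
    # Length budget on the extract.
    prob += pulp.lpSum(length * xi for (length, _), xi in zip(sentences, x)) <= budget
    # A term counts as covered only if some chosen sentence contains it.
    for t in terms:
        prob += y[t] <= pulp.lpSum(xi for (_, ts), xi in zip(sentences, x) if t in ts)
    prob.solve()
    return [i for i, xi in enumerate(x) if xi.value() and xi.value() > 0.5]
```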
Archive | 2018
Marina Litvak; Natalia Vanetik; Lei Li
Extractive text summarization aims to select a small subset of sentences such that the content and meaning of the original document are best preserved. In this paper, we describe an unsupervised approach to extractive summarization that combines hierarchical topic modeling (TM) with the Minimum Description Length (MDL) principle and applies them to the Chinese language. Our summarizer strives to extract the information that best describes the text's topics in terms of MDL. The model is applied to the NLPCC 2015 Shared Task on Weibo-Oriented Chinese News Summarization [1], in which Chinese news articles were summarized with the goal of creating short meaningful messages for Weibo (Sina Weibo is a Chinese microblogging website, one of the most popular sites in China) [2]. The experimental results demonstrate the superiority of our approach over the other summarizers from the NLPCC 2015 competition.
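A drastically simplified sketch of topic-driven selection follows: a flat LDA model replaces the paper's hierarchical TM, and plain topic weights replace the MDL description-length scoring. It also assumes whitespace-tokenized input, whereas Chinese text would first require word segmentation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_summary(sentences, n_topics=3, n_main=2, k=2):
    """Pick the k sentences that best represent the n_main dominant topics."""
    vec = CountVectorizer()
    X = vec.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(X)             # sentence-by-topic weights
    main = doc_topics.sum(axis=0).argsort()[::-1][:n_main]
    scores = doc_topics[:, main].max(axis=1)      # affinity to a dominant topic
    return [sentences[i] for i in scores.argsort()[::-1][:k]]
```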
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres | 2017
Marina Litvak; Natalia Vanetik
Query-based text summarization aims to extract from the original text the essential information that answers the query. The answer is presented in a minimal, often predefined, number of words. In this paper, we introduce a new unsupervised approach for query-based extractive summarization, based on the Minimum Description Length (MDL) principle and employing the Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting the sentences that best cover the query-related frequent word sets. The approach is evaluated on the DUC 2005 and DUC 2006 datasets, which were specifically designed for query-based summarization (DUC, 2005, 2006). It is competitive with the best reported results.
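Mirroring the coverage sketch given for the earlier MDL summarizer, the query-biased variant below keeps only frequent word sets that overlap the query before the greedy selection; frequent pairs again stand in for the Krimp code table, so this is an illustration rather than the evaluated system.

```python
from itertools import combinations
from collections import Counter

def query_summary(sentences, query_words, budget=2, min_support=2):
    """Greedy cover restricted to query-related frequent word pairs."""
    counts = Counter()
    for words in sentences:
        counts.update(combinations(sorted(set(words)), 2))
    related = {pair for pair, c in counts.items()
               if c >= min_support and set(pair) & set(query_words)}
    chosen = []
    while related and len(chosen) < budget:
        best = max(range(len(sentences)),
                   key=lambda i: len({p for p in related
                                      if set(p) <= set(sentences[i])}))
        chosen.append(best)
        related -= {p for p in related if set(p) <= set(sentences[best])}
    return [sentences[i] for i in chosen]
```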