
Publications


Featured research published by Martin Potthast.


European Conference on Information Retrieval | 2008

Automatic vandalism detection in Wikipedia

Martin Potthast; Benno Stein; Robert Gerling

We present results of a new approach to detecting destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem, where vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not yet been addressed in the information retrieval literature. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to cast vandalism detection as a machine learning task. We compiled a large number of vandalism edits into a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression, we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods currently applied in Wikipedia, our approach improves the F-measure by 49% while also being faster.
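
A minimal sketch of how such a feature-based classifier might be set up with scikit-learn; the three features and the toy training edits below are illustrative stand-ins, not the paper's actual feature set or corpus.

from sklearn.linear_model import LogisticRegression

def edit_features(old_text: str, new_text: str) -> list:
    """Map a revision (old text -> new text) to a numeric feature vector."""
    chars_inserted = max(len(new_text) - len(old_text), 0)
    upper_ratio = sum(c.isupper() for c in new_text) / max(len(new_text), 1)
    vulgarisms = sum(w.strip("!?.,") in {"stupid", "crap"}
                     for w in new_text.lower().split())
    return [chars_inserted, upper_ratio, vulgarisms]

# Toy training data: one vandalism edit (label 1) and one regular edit (label 0).
X = [edit_features("The moon orbits Earth.", "THE MOON IS STUPID!!!"),
     edit_features("The moon orbits Earth.", "The Moon orbits Earth every 27.3 days.")]
y = [1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([edit_features("Paris is in France.", "PARIS IS CRAP")]))  # -> [1]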


Language Resources and Evaluation | 2011

Cross-language plagiarism detection

Martin Potthast; Alberto Barrón-Cedeño; Benno Stein; Paolo Rosso

Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections of the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences from monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and (4) the three models CL-CNG, CL-ESA, and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice for ranking and comparing texts across syntactically related languages. CL-ESA almost matches the performance of CL-CNG, but works on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.
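
A small sketch of the idea behind CL-CNG: represent documents as bags of character n-grams and compare them with cosine similarity, which carries across syntactically related languages. The choice of n = 3 and the lowercasing are assumptions made for this example.

from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams of a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Related languages share many character n-grams, so the score is well above 0:
print(cosine(char_ngrams("plagiarism detection"),
             char_ngrams("plagiarismus-erkennung")))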


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

Strategies for retrieving plagiarized documents

Benno Stein; Sven Meyer zu Eissen; Martin Potthast

For the identification of plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index provides an analysis speed-up by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.
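
A rough sketch of a word-n-gram chunk index of the kind compared here: every chunk maps to the documents containing it, and documents that share many chunks with a suspicious text become candidate sources. The chunk length and overlap threshold are illustrative choices, not the paper's parameters.

from collections import defaultdict

def chunks(text: str, n: int = 4) -> set:
    """All word n-grams (chunks) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

index = defaultdict(set)  # chunk -> ids of documents containing it

def add_document(doc_id: str, text: str) -> None:
    for c in chunks(text):
        index[c].add(doc_id)

def candidate_sources(suspicious: str, min_overlap: int = 2) -> dict:
    """Documents sharing at least min_overlap chunks with the suspicious text."""
    hits = defaultdict(int)
    for c in chunks(suspicious):
        for doc_id in index.get(c, ()):
            hits[doc_id] += 1
    return {d: k for d, k in hits.items() if k >= min_overlap}

add_document("d1", "the quick brown fox jumps over the lazy dog")
print(candidate_sources("a quick brown fox jumps over the lazy dog today"))  # -> {'d1': 5}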


International World Wide Web Conference | 2011

Query segmentation revisited

Matthias Hagen; Martin Potthast; Benno Stein; Christof Bräutigam

We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but have been tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate, as they use expensive features and are difficult to reimplement. The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and comparable in segmentation accuracy to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles stored in a hash table. At the same time, we introduce a new evaluation corpus for query segmentation. With about 50,000 human-annotated queries, it is two orders of magnitude larger than the corpus used until now.
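
A condensed sketch of the approach as summarized above: enumerate all 2^(k-1) segmentations of a k-keyword query and score each from raw n-gram frequencies and a Wikipedia-title lookup. The frequency table, title set, and boost factor are invented for the example; the |s|^|s| weighting follows the naive scoring scheme from the authors' earlier work.

from itertools import product

# Invented resources: web n-gram frequencies and known Wikipedia titles.
NGRAM_FREQ = {"new york": 5_000_000, "new york times": 800_000,
              "times square": 1_200_000, "square dance": 90_000}
WIKI_TITLES = {"new york times", "times square"}

def segmentations(words):
    """Yield all 2**(len(words)-1) ways to split the query into segments."""
    for breaks in product((True, False), repeat=len(words) - 1):
        seg, start = [], 0
        for i, brk in enumerate(breaks, start=1):
            if brk:
                seg.append(" ".join(words[start:i]))
                start = i
        seg.append(" ".join(words[start:]))
        yield seg

def score(seg):
    """Naive |s|**|s| * freq(s) weighting over multi-word segments."""
    total = 0
    for s in seg:
        n = len(s.split())
        if n >= 2:
            total += (n ** n) * NGRAM_FREQ.get(s, 0) * (10 if s in WIKI_TITLES else 1)
    return total

print(max(segmentations("new york times square dance".split()), key=score))
# -> ['new york times', 'square dance']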


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

ChatNoir: a search engine for the ClueWeb09 corpus

Martin Potthast; Matthias Hagen; Benno Stein; Jan Graßegger; Maximilian Michel; Martin Tippmann; Clement Welsch

We present the ChatNoir search engine, which indexes the entire English part of the ClueWeb09 corpus. Besides Carnegie Mellon's Indri system, ChatNoir is the second publicly available search engine for this corpus. It implements the classic BM25F information retrieval model, incorporating PageRank and spam likelihood. The search engine is scalable and returns the first results within three seconds, which is significantly faster than Indri. A convenient API allows for implementing reproducible experiments based on retrieving documents from the ClueWeb09 corpus. The search engine has successfully passed a load test involving 100,000 queries.
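
For reference, a sketch of the plain BM25 term weighting that BM25F generalizes (BM25F additionally weights term frequencies per document field, e.g. title vs. body, before applying this formula). The parameter values are common defaults, not ChatNoir's actual configuration.

from math import log

def bm25_term(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight of one query term for one document."""
    idf = log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# A term occurring 3 times in a 100-word document, in a corpus of 1,000
# documents of which 50 contain the term, with average length 120 words:
print(bm25_term(tf=3, df=50, N=1000, doc_len=100, avg_len=120))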


ACM Transactions on Intelligent Systems and Technology | 2013

Paraphrase acquisition via crowdsourcing and machine learning

Steven Burrows; Martin Potthast; Benno Stein

To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means, the crowdsourcing paradigm is not effective, and without crowdsourcing, the creation of test corpora is unacceptably expensive at realistic orders of magnitude. The second aspect addresses the deficit that most previous work on generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria, collected using Amazon's Mechanical Turk for crowdsourcing. In this article, we review the lessons learned at PAN 2010 and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments that explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features; we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and non-paraphrased samples with 0.980 precision at 0.523 recall. This result implies that just under half of our samples must be discarded (a remaining fraction of 0.477), but our cost analysis shows that the automation we introduce yields an 18% financial saving and returns over 100 hours of time to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost-neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
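
A toy sketch of the two-class setup described above: passage pairs are mapped to similarity features and classified with k-nearest neighbors. The two features and the miniature training set are simple stand-ins for the paper's paraphrase similarity features and corpus.

from sklearn.neighbors import KNeighborsClassifier

def similarity_features(a: str, b: str) -> list:
    """Two crude similarity features for a pair of texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(wa & wb) / len(wa | wb)
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    return [jaccard, length_ratio]

pairs = [("the cat sat on the mat", "a cat was sitting on the mat", 1),
         ("the cat sat on the mat", "stock markets fell sharply today", 0),
         ("he quickly left the room", "he departed the room in a hurry", 1),
         ("he quickly left the room", "the recipe needs two eggs", 0)]

X = [similarity_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([similarity_features("dogs bark loudly",
                                       "dogs make loud barking noises")]))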


North American Chapter of the Association for Computational Linguistics | 2015

Webis: An Ensemble for Twitter Sentiment Detection

Matthias Hagen; Martin Potthast; Michel Büchner; Benno Stein

We reproduce four Twitter sentiment classification approaches with diverse feature sets that participated in previous SemEval editions. The reproduced approaches are combined in an ensemble that averages the individual classifiers' confidence scores for the three classes (positive, neutral, negative) and decides sentiment polarity based on these averages. The experimental evaluation on SemEval data shows that our re-implementations slightly outperform their respective originals. Moreover, not too surprisingly, the ensemble of the reproduced approaches serves as a strong baseline in the current edition, where it is top-ranked on the 2015 test set.
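
The combination rule is simple enough to state in a few lines; a sketch with invented confidence scores:

CLASSES = ("positive", "neutral", "negative")

def ensemble_label(confidences):
    """Average per-class confidence scores across classifiers, pick the max."""
    avg = [sum(scores) / len(confidences) for scores in zip(*confidences)]
    return CLASSES[avg.index(max(avg))]

# Four base classifiers disagree individually; averaging resolves the vote:
print(ensemble_label([(0.6, 0.3, 0.1),
                      (0.4, 0.5, 0.1),
                      (0.7, 0.2, 0.1),
                      (0.3, 0.4, 0.3)]))  # -> positive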


European Conference on Information Retrieval | 2010

Opinion summarization of web comments

Martin Potthast; Steffen Becker

All kinds of Web sites invite visitors to provide feedback on comment boards. Typically, submitted comments are published immediately on the same page, so that new visitors can get an idea of the opinions of previous visitors. Popular multimedia items, such as videos and images, frequently attract up to thousands of comments, which is too much to read in a reasonable time. That is, visitors read, if at all, only the newest comments and hence get an incomplete and possibly misleading picture of the overall opinion. To address this issue we introduce OpinionCloud, a technology to summarize and visualize opinions that are expressed in the form of Web comments.
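
A toy sketch of the summarization idea: aggregate opinion-bearing words from all comments into a cloud where weight reflects frequency and polarity comes from a sentiment lexicon. The tiny lexicon and the plain-text rendering are assumptions for the example, not the paper's actual resources.

from collections import Counter

LEXICON = {"great": "+", "awesome": "+", "boring": "-", "terrible": "-"}

def opinion_cloud(comments):
    """Opinion-bearing words with their frequency (cloud weight) and polarity."""
    words = (w.strip(".,!?") for c in comments for w in c.lower().split())
    counts = Counter(w for w in words if w in LEXICON)
    return [(w, n, LEXICON[w]) for w, n in counts.most_common()]

comments = ["Great video, awesome editing!", "boring...", "great soundtrack"]
for word, weight, polarity in opinion_cloud(comments):
    print(f"{word} (weight {weight}, {polarity})")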


Cross-Language Evaluation Forum | 2013

Recent Trends in Digital Text Forensics and Its Evaluation

Tim Gollub; Martin Potthast; Anna Beyer; Matthias Busse; Francisco Rangel; Paolo Rosso; Efstathios Stamatatos; Benno Stein

This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN'13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software itself and ran it on a computer cluster at our site. As the evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches to the mentioned tasks at our disposal for further analysis.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2010

The power of naive query segmentation

Matthias Hagen; Martin Potthast; Benno Stein; Christof Braeutigam

We address the problem of query segmentation: given a keyword query submitted to a search engine, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve good segmentation performance on a gold standard but are fairly intricate. Our method is easy to implement and achieves comparable accuracy.

Collaboration


Dive into Martin Potthast's collaborations.

Top Co-Authors

Paolo Rosso
Polytechnic University of Valencia

Alberto Barrón-Cedeño
Polytechnic University of Catalonia

Francisco Rangel
Polytechnic University of Valencia

Francisco M. Rangel Pardo
Polytechnic University of Valencia