Jonathan Schler
Bar-Ilan University
Publication
Featured research published by Jonathan Schler.
Communications of the ACM | 2009
Shlomo Argamon; Moshe Koppel; James W. Pennebaker; Jonathan Schler
Imagine that you have been given an important text of unknown authorship, and wish to know as much as possible about the unknown author (demographics, personality, cultural background, among others), just by analyzing the given text. This authorship profiling problem is of growing importance in the current global information environment: applications abound in forensics, security, and commercial settings. For example, authorship profiling can help police identify characteristics of the perpetrator of a crime when there are too few (or too many) specific suspects to consider. Similarly, large corporations may be interested in knowing what types of people like or dislike their products, based on analysis of blogs and online product reviews. The question we therefore ask is: How much can we discern about the author of a text simply by analyzing the text itself? It turns out that, with varying degrees of accuracy, we can say a great deal indeed. Unlike the problem of authorship attribution (determining the author of a text from a given candidate set), discussed recently in these pages by Li, Zheng, and Chen, authorship profiling does not begin with a set of writing samples from known candidate authors. Instead, we exploit the sociolinguistic observation that different groups of people speaking or writing in a particular genre and in a particular language use that language differently. That is, they vary in how often they use certain words or syntactic constructions (in addition to variation in pronunciation or intonation, for example). The particular profile dimensions we consider here are author gender, age [8], native language [7], and personality [10].
Language Resources and Evaluation | 2011
Moshe Koppel; Jonathan Schler; Shlomo Argamon
Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
Computational Intelligence | 2006
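As a rough illustration of the idea of combining a similarity measure with multiple randomized feature sets, the following minimal sketch repeatedly samples a random subset of features, finds the most similar candidate on that subset, and reports how often each candidate wins. The feature type (character 4-grams), the cosine measure, the parameter defaults, and the function names are all assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (an assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous, known, iterations=100, fraction=0.5,
              threshold=0.8, seed=0):
    """Return (author or None, score) for an anonymous text.

    `known` maps author name -> known text. The score is the share of
    random feature subsets on which the top author was most similar;
    below `threshold` (a hypothetical cut-off) no author is returned.
    """
    rng = random.Random(seed)
    anon_vec = char_ngrams(anonymous)
    known_vecs = {author: char_ngrams(text) for author, text in known.items()}
    vocabulary = sorted(set(anon_vec).union(*known_vecs.values()))
    if not vocabulary or not known_vecs:
        return None, 0.0
    sample_size = max(1, int(len(vocabulary) * fraction))
    wins = Counter()
    for _ in range(iterations):
        features = rng.sample(vocabulary, sample_size)
        best = max(known_vecs,
                   key=lambda a: cosine(anon_vec, known_vecs[a], features))
        wins[best] += 1
    author, count = wins.most_common(1)[0]
    score = count / iterations
    return (author if score >= threshold else None), score
```

The win share plays the role of a robustness score here: a candidate that is the closest match under many different feature subsets is a more trustworthy attribution than one that wins only occasionally, and returning `None` below the threshold covers the case where the true author may not be among the candidates.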
Moshe Koppel; Jonathan Schler
Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between positive and negative examples.
Knowledge Discovery and Data Mining | 2005
Moshe Koppel; Jonathan Schler; Kfir Zigdon
In this paper, we show that stylistic text features can be exploited to determine an anonymous author's native language with high accuracy. Specifically, we first use automatic tools to ascertain frequencies of various stylistic idiosyncrasies in a text. These frequencies then serve as features for support vector machines that learn to classify texts according to author native language.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006
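The first step described above, turning a text into a vector of stylistic frequencies, might look roughly like the sketch below. The function-word list and the repeated-word marker are hypothetical stand-ins chosen for illustration; the abstract does not list the actual features or tools used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = ("the", "a", "of", "in", "on", "to", "that", "which")


def style_vector(text):
    """Return relative frequencies of simple stylistic markers in `text`.

    The vector holds one entry per function word, followed by the rate of
    immediately repeated words (e.g. "the the"), a crude stand-in for a
    writing idiosyncrasy.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    features = [counts[word] / total for word in FUNCTION_WORDS]
    repeats = sum(1 for x, y in zip(tokens, tokens[1:]) if x == y)
    features.append(repeats / total)
    return features
```

Vectors of this kind, computed for texts whose authors' native languages are known, could then be passed to any support vector machine implementation as training data.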
Moshe Koppel; Jonathan Schler; Shlomo Argamon; Eran Messeri
In this paper, we use a blog corpus to demonstrate that we can often identify the author of an anonymous text even where there are many thousands of candidate authors. Our approach combines standard information retrieval methods with a text categorization meta-learning scheme that determines when to even venture a guess.
Intelligence and Security Informatics | 2005
Moshe Koppel; Jonathan Schler; Kfir Zigdon
Text authored by an unidentified assailant can offer valuable clues to the assailant's identity. In this paper, we show that stylistic text features can be exploited to determine an anonymous author's native language with high accuracy.
English Studies | 2012
Moshe Koppel; Jonathan Schler; Shlomo Argamon; Yaron Winter
We introduce the “fundamental problem” of authorship attribution: determining if two, possibly short, documents were written by a single author. A solution to this problem can serve as a building block for solving almost any conceivable authorship attribution problem. Our preliminary work on this problem is based on earlier work in authorship attribution with large open candidate sets.
Conference on Information and Knowledge Management | 2004
Benjamin Rosenfeld; Ronen Feldman; Moshe Fresko; Jonathan Schler; Yonatan Aumann
This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (Stochastic Context Free Grammar) based extraction language, and training them using an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or parser. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and a smaller amount of training data. The improvement in accuracy is slight for the named entity extraction task and more pronounced for relation extraction.
Knowledge and Information Systems | 2006
Yonatan Aumann; Ronen Feldman; Yair Liberzon; Benjamin Rosenfeld; Jonathan Schler
Typographic and visual information is an integral part of textual documents. Most information extraction (IE) systems ignore most of this visual information, processing the text as a linear sequence of words. Thus, much valuable information is lost. In this paper, we show how to make use of this visual information for IE. We present an algorithm that automatically extracts specific fields of the document (such as the title, author, etc.) based exclusively on the visual formatting of the document, without any reference to the semantic content. The algorithm employs a machine learning approach, whereby the system is first provided with a set of training documents in which the target fields are manually tagged and automatically learns how to extract these fields in future documents. We implemented the algorithm in a system for automatic analysis of documents in PDF format. We present experimental results of applying the system to a set of financial documents, extracting nine different target fields. Overall, the system achieved 90% accuracy.
Conference on Information and Knowledge Management | 2001
Ronen Feldman; Yonatan Aumann; Yair Liberzon; Kfir Ankori; Jonathan Schler; Benjamin Rosenfeld
Text mining is a growing area of interest within the field of data mining and knowledge discovery. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations either on external tags associated with each document, or on the set of all words within each document. Both approaches suffer from limitations. This paper focuses on an intermediate approach, one that we call text mining via information extraction, in which knowledge discovery takes place on focused, relevant terms, phrases and facts, as extracted from the documents.