Publications


Featured research published by Roy Bar-Haim.


Meeting of the Association for Computational Linguistics | 2005

Definition and Analysis of Intermediate Entailment Levels

Roy Bar-Haim; Idan Szpektor; Oren Glickman

In this paper we define two intermediate models of textual entailment, which correspond to lexical and lexical-syntactic levels of representation. We manually annotated a sample from the RTE dataset according to each model, compared the outcome for the two models, and explored how well they approximate the notion of entailment. We show that the lexical-syntactic model outperforms the lexical model, mainly due to a much lower rate of false-positives, but both models fail to achieve high recall. Our analysis also shows that paraphrases stand out as a dominant contributor to the entailment task. We suggest that our models and annotation methods can serve as an evaluation scheme for entailment at these levels.


Meeting of the Association for Computational Linguistics | 2005

Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew

Roy Bar-Haim; Khalil Sima'an; Yoad Winter

A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger for Hebrew based on Hidden Markov Models (HMMs). We start out from a morphological analyzer and a very small morphologically annotated corpus. We show that a model whose terminal symbols are word segments (=morphemes) is advantageous over a word-level model for the task of POS tagging. However, for segmentation alone, the morpheme-level model has no significant advantage over the word-level model. Error analysis shows that neither model is adequate for resolving a common type of segmentation ambiguity in Hebrew -- whether or not a word in a written text is prefixed by a definiteness marker. Hence, we propose a morpheme-level model where the definiteness morpheme is treated as a possible feature of morpheme terminals. This model exhibits the best overall performance, both in POS tagging and in segmentation. Despite the small size of the annotated corpus available for Hebrew, the results achieved using our best model are on par with recent results on Modern Standard Arabic.
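The core modeling choice the abstract describes can be illustrated with a toy sketch. This is not the paper's implementation: the tags, morphemes, and probabilities below are hypothetical, and a real model would estimate them from an annotated corpus and decode with Viterbi. The sketch only shows how a morpheme-level HMM scores a word's alternative segmentations and taggings jointly.

```python
import math

# Toy probabilities (hypothetical, for illustration only): transition P(tag2|tag1)
# and emission P(morpheme|tag). A real tagger estimates these from a corpus.
TRANS = {
    ("<s>", "DEF"): 0.3, ("<s>", "NN"): 0.7,
    ("DEF", "NN"): 0.9, ("DEF", "DEF"): 0.1,
    ("NN", "NN"): 0.5, ("NN", "DEF"): 0.5,
}
EMIT = {
    ("DEF", "h"): 0.8,
    ("NN", "bait"): 0.6,
    ("NN", "hbait"): 0.1,
}

def score(segmentation, tags):
    """Log-probability of one (segmentation, tag sequence) pair under the toy HMM."""
    logp, prev = 0.0, "<s>"
    for seg, tag in zip(segmentation, tags):
        logp += math.log(TRANS.get((prev, tag), 1e-6))
        logp += math.log(EMIT.get((tag, seg), 1e-6))
        prev = tag
    return logp

def best_analysis(analyses):
    """Pick the highest-scoring segmentation+tagging among the analyzer's candidates."""
    return max(analyses, key=lambda a: score(*a))

# A Hebrew-like word such as "hbait" is ambiguous: "h"+"bait" (definite
# article + noun) vs. the unsegmented "hbait". The morpheme-level model
# disambiguates segmentation and POS tags in a single joint decision.
candidates = [
    (["h", "bait"], ["DEF", "NN"]),
    (["hbait"], ["NN"]),
]
print(best_analysis(candidates))
```

Under a word-level model, by contrast, the whole word would be a single terminal and the tag set would have to encode entire segmentations, which is the trade-off the paper analyzes.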


Natural Language Engineering | 2008

Part-of-Speech Tagging of Modern Hebrew Text

Roy Bar-Haim; Khalil Sima'an; Yoad Winter

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that gives better performance for Hebrew over the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. 
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.


Empirical Methods in Natural Language Processing | 2009

A Compact Forest for Scalable Inference over Entailment and Paraphrase Rules

Roy Bar-Haim; Jonathan Berant; Ido Dagan

A large body of recent research has been investigating the acquisition and application of inference knowledge. Such knowledge may be typically captured as entailment rules, applied over syntactic representations. Efficient inference with such knowledge then becomes a fundamental problem. Starting out from a formalism for entailment-rule application, we present a novel packed data structure and a corresponding algorithm for its scalable implementation. We prove the validity of the new algorithm and establish its efficiency analytically and empirically.
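The idea behind a packed data structure for rule application can be sketched in a few lines. This is a simplified illustration, not the paper's actual forest representation: when an entailment or paraphrase rule rewrites one subtree, the derived tree shares all unchanged subtrees with the original instead of copying the whole tree, so the forest grows only by the new material.

```python
# Minimal sketch of subtree sharing (hypothetical structure, for illustration):
# identical subtrees are interned so they are stored exactly once.

class Forest:
    def __init__(self):
        self._interned = {}  # canonical node per (label, children) signature

    def node(self, label, *children):
        """Return a shared node; structurally identical subtrees collapse to one."""
        key = (label, children)
        if key not in self._interned:
            self._interned[key] = key
        return self._interned[key]

    def size(self):
        return len(self._interned)

f = Forest()
# Original tree, roughly: bought(John, car(red))
red = f.node("red")
car = f.node("car", red)
t1 = f.node("bought", f.node("John"), car)

# Applying a paraphrase rule "bought -> purchased" adds only one new root;
# the "John" and "car(red)" subtrees are reused rather than duplicated.
t2 = f.node("purchased", f.node("John"), car)
print(f.size())  # five distinct nodes cover both trees
```

The efficiency argument in the abstract rests on exactly this kind of sharing: applying many rules to a large tree stays tractable because each application touches only the rewritten part.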


Language, Culture, Computation (1) | 2014

Benchmarking Applied Semantic Inference: The PASCAL Recognising Textual Entailment Challenges

Roy Bar-Haim; Ido Dagan; Idan Szpektor

Identifying that the same meaning is expressed by, or can be inferred from, various language expressions is a major challenge for natural language understanding applications such as information extraction, question answering and automatic summarization. Dagan and Glickman [5] proposed Textual Entailment, the task of deciding whether a target text follows from a source text, as a unifying framework for modeling language variability, which has often been addressed in an application-specific manner. In this paper we describe the series of benchmarks developed for the textual entailment recognition task, known as the PASCAL RTE Challenges. As a concrete example, we describe in detail the second RTE challenge, in which our methodology was consolidated, and served as a basis for the subsequent RTE challenges. The impressive success of these challenges established textual entailment as an active research area in natural language processing, attracting a growing community of researchers.


Empirical Methods in Natural Language Processing | 2011

Identifying and Following Expert Investors in Stock Microblogs

Roy Bar-Haim; Elad Dinur; Ronen Feldman; Moshe Fresko; Guy Goldstein


National Conference on Artificial Intelligence | 2007

Semantic inference at the lexical-syntactic level

Roy Bar-Haim; Ido Dagan; Iddo Greental; Eyal Shnarch


Innovative Applications of Artificial Intelligence | 2011

The Stock Sonar — Sentiment Analysis of Stocks Based on a Hybrid Approach

Ronen Feldman; Benjamin Rosenfeld; Roy Bar-Haim; Moshe Fresko


Meeting of the Association for Computational Linguistics | 2008

Contextual Preferences

Idan Szpektor; Ido Dagan; Roy Bar-Haim; Jacob Goldberger


Text Analysis Conference (TAC) | 2008

Efficient Semantic Deduction and Approximate Matching over Compact Parse Forests

Roy Bar-Haim; Ido Dagan; Shachar Mirkin; Eyal Shnarch; Idan Szpektor; Jonathan Berant; Iddo Greental

Collaboration


Dive into Roy Bar-Haim's collaborations.

Top Co-Authors

Noam Slonim

Hebrew University of Jerusalem

Ronen Feldman

Hebrew University of Jerusalem
