Anna Feldman
Montclair State University
Publication
Featured research published by Anna Feldman.
conference on information and knowledge management | 2010
Amal Chaminda Kaluarachchi; Aparna S. Varde; Srikanta J. Bedathur; Gerhard Weikum; Jing Peng; Anna Feldman
Time-stamped documents such as newswire articles, blog posts and other web pages are often archived online. When these archives cover long spans of time, the terminology within them can undergo significant changes. Hence, when users pose queries pertaining to historical information over such documents, the queries need to be translated, taking these temporal changes into account, to provide accurate responses. For example, a query on Sri Lanka should automatically retrieve documents that use its former name, Ceylon. We call such concepts SITACs, i.e., Semantically Identical Temporally Altering Concepts. To discover SITACs, we propose an approach based on a novel framework that integrates natural language processing, association rule mining, and contextual similarity as a learning technique. The proposed approach has been evaluated on real data and has been found to yield good results with respect to efficiency and accuracy.
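The contextual-similarity idea in this abstract can be illustrated with a minimal sketch. It assumes each name is described by the set of terms it co-occurs with and uses Jaccard overlap as the similarity measure; the abstract does not specify the measure, and the context sets, the function name and the 0.5 threshold below are hypothetical toy values, not data from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of context terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical toy contexts: terms that co-occur with each name in an archive.
contexts = {
    "Ceylon": {"island", "colombo", "tea"},
    "Sri Lanka": {"island", "colombo", "tea", "cricket"},
    "Norway": {"fjord", "oslo", "oil"},
}

def sitac_candidates(query, contexts, threshold=0.5):
    """Return other names whose contexts overlap strongly with the query's."""
    return [
        name
        for name, ctx in contexts.items()
        if name != query and jaccard(contexts[query], ctx) >= threshold
    ]

print(sitac_candidates("Sri Lanka", contexts))
```

In this toy setting a query on Sri Lanka would also be expanded to Ceylon, while the unrelated name is filtered out.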
CrossLangInduction '06 Proceedings of the International Workshop on Cross-Language Knowledge Induction | 2006
Jirka Hana; Anna Feldman; Chris Brew; Luiz Amaral
We describe a knowledge- and resource-light system for automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor-intensive resources, in particular large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, and (iii) a description of Portuguese morphology at the level of a basic grammar book. We extend our earlier, similar work (Hana et al., 2004; Feldman et al., 2006) by proposing an alternative algorithm for cognate transfer that effectively projects the Spanish emission probabilities into Portuguese. Our experiments use minimal new human effort and show 21% error reduction over even emissions on a fine-grained tagset.
language resources and evaluation | 2014
Alexandr Rosen; Jirka Hana; Barbora Štindlová; Anna Feldman
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set of approx. 175,000 words, with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute for manual annotation.
empirical methods in natural language processing | 2014
Jing Peng; Anna Feldman; Ekaterina Vylomova
We describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be part of an idiomatic expression. Our additional hypothesis is that contexts in which idioms occur are typically more affective; we therefore incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag-of-words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. Thus, this topic representation allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.
international conference on computational linguistics | 2006
Anna Feldman; Jirka Hana; Chris Brew
Annotated corpora are valuable resources for NLP, but they are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and that a simple textbook describing the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian. However, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger ([3]), a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found that even breathtakingly naive things (such as approximating the Russian transitions by Czech and/or Polish and approximating the Russian emissions by (manually/automatically derived) Czech cognates) can lead to a significant improvement of the tagger’s performance.
international conference on computational linguistics | 2013
Anna Feldman; Jing Peng
We describe several experiments whose goal is to automatically identify idiomatic expressions in written text. We explore two approaches for the task: 1) idiom recognition as outlier detection; and 2) supervised classification of sentences. We apply principal component analysis for outlier detection. Detecting idioms as lexical outliers does not exploit class label information. So, in the following experiments, we use linear discriminant analysis to obtain a discriminant subspace and then use a three-nearest-neighbor classifier in that subspace to measure accuracy. We discuss pros and cons of each approach. All the approaches are more general than the previous algorithms for idiom detection: they neither rely on target idiom types, lexicons, or large manually annotated corpora, nor limit the search space to a particular type of linguistic construction.
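As a rough illustration of outlier detection with principal component analysis, the sketch below fits the top principal component by power iteration and scores each point by its distance from that component. This is only one common way to use PCA for outliers and is an assumption here, not necessarily the procedure used in the paper; the two-dimensional points and all function names are hypothetical.

```python
from math import sqrt

def top_component(rows, iters=100):
    """Return the mean and the leading principal direction of the rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # power iteration on the covariance matrix
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def residuals(rows):
    """Distance of each row from the line spanned by the top component."""
    mean, v = top_component(rows)
    out = []
    for r in rows:
        c = [x - m for x, m in zip(r, mean)]
        proj = sum(a * b for a, b in zip(c, v))
        out.append(sqrt(sum((a - proj * b) ** 2 for a, b in zip(c, v))))
    return out

# Hypothetical toy vectors: five points on a line plus one off-line point.
points = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [2, -1]]
scores = residuals(points)
outlier_index = max(range(len(points)), key=lambda i: scores[i])
print(outlier_index, round(scores[outlier_index], 3))
```

The point with the largest residual is the candidate outlier; in the idiom setting, such a point would correspond to a word or expression that does not fit its surrounding context.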
systems, man and cybernetics | 2010
Hiroki Yamakawa; Jing Peng; Anna Feldman
Text classification is a widely studied topic in the area of machine learning. A number of techniques have been developed to represent and classify text documents. Most of these techniques try to achieve good classification performance while representing a document only by its words (e.g. statistical analysis of word frequency and distribution patterns). One of the recent trends in text classification research is to incorporate more semantic interpretation, especially by using Wikipedia. This paper introduces a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification. The aim is to improve classification performance by transforming general terms into a set of related concepts grouped around semantic themes. To achieve this goal, this paper proposes a method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). The subsets of concepts are separately used to represent the same set of documents in a number of different ways, from which an ensemble of classifiers is built. Experimental results show that an ensemble of classifiers, each trained on a different representation of the document set, performs better, with increased accuracy and stability, than a classifier trained only on the original document set.
SIMBig (Revised Selected Papers) | 2015
Jing Peng; Anna Feldman
Expressions such as add fuel to the fire can be interpreted literally or idiomatically depending on the context they occur in. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms and their literal counterparts do not appear in the same contexts. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner product with contexts should be larger than that of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement [8, 16, 24] and apply them to our data. We provide experimental results validating the proposed techniques.
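The two measures named in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two-dimensional vectors are invented toy values rather than real word embeddings, the function names are hypothetical, and the way the scores feed into a final classifier is not taken from the paper.

```python
from math import sqrt

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_context_score(target, contexts):
    """Average inner product between a target vector and context word vectors."""
    return sum(dot(target, c) for c in contexts) / len(contexts)

def scatter_matrix(vectors):
    """Covariance (scatter) matrix of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]

def frobenius_distance(a, b):
    """Frobenius norm of the difference between two matrices."""
    return sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

# Hypothetical toy vectors standing in for word embeddings.
target = [1.0, 0.0]
literal_contexts = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
idiom_contexts = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.7]]

literal_score = mean_context_score(target, literal_contexts)
idiom_score = mean_context_score(target, idiom_contexts)
gap = frobenius_distance(scatter_matrix(literal_contexts),
                         scatter_matrix(idiom_contexts))
print(literal_score > idiom_score, gap > 0)
```

Under approach (1), the larger average inner product marks the literal usage; under approach (2), a nonzero Frobenius distance indicates that the two context distributions differ.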
international conference on agents and artificial intelligence | 2018
Jing Peng; Katsiaryna Aharodnik; Anna Feldman
This paper describes experiments in English and Russian automatic idiom detection. Our algorithm is based on the idea that literal and idiomatic expressions appear in different contexts. This difference is captured by our distributional semantics model. We evaluate our model on both languages and compare its results. We show that our model is language-independent. We also describe a new annotated resource we created for our
hellenic conference on artificial intelligence | 2018
Kei Yin Ng; Anna Feldman; Christopher S. Leberknight
This study provides preliminary insights into the linguistic features that contribute to Internet censorship in mainland China. We collected a corpus of 344 censored and uncensored microblog posts that were published on Sina Weibo and built a Naive Bayes classifier based on linguistic, topic-independent features. The classifier achieves 79.34% accuracy in predicting whether a blog post would be censored on Sina Weibo.
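For readers unfamiliar with the classifier type, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The feature tokens (f_a, f_b, ...) and the four training samples are invented placeholders, not the features or data used in the study, and the function names are hypothetical.

```python
from collections import Counter
from math import log

def train_nb(samples):
    """samples: list of (feature_tokens, label) pairs."""
    label_counts = Counter(label for _, label in samples)
    token_counts = {label: Counter() for label in label_counts}
    for tokens, label in samples:
        token_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return label_counts, token_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest smoothed log-posterior."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = log(n / total)
        for t in tokens:
            if t in vocab:  # ignore features never seen in training
                score += log((token_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented placeholder data: each post is reduced to abstract feature tokens.
training = [
    (["f_a", "f_a", "f_b"], "censored"),
    (["f_a", "f_c"], "censored"),
    (["f_b", "f_d"], "uncensored"),
    (["f_d", "f_d", "f_c"], "uncensored"),
]
model = train_nb(training)
print(predict_nb(model, ["f_a", "f_a"]), predict_nb(model, ["f_d"]))
```

In practice the tokens would be replaced by whatever topic-independent linguistic features are extracted from each post, and accuracy would be measured on held-out data.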