Rohini K. Srihari
University at Buffalo
Publications
Featured research published by Rohini K. Srihari.
SIGKDD Explorations | 2004
Zhaohui Zheng; Xiaoyun Wu; Rohini K. Srihari
A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e. positive features) and of non-membership (i.e. negative features) by ignoring the signs of the features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure an optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicit control over that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the potential and the practical merit of explicitly combining positive and negative features in a nearly optimal proportion chosen according to the degree of class imbalance in the data.
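As a rough illustration of the explicit combination idea, the sketch below ranks features with a signed, correlation-style score and then takes a tunable proportion of positive and negative features. The scoring function and the pos_ratio parameter are illustrative assumptions, not the exact feature selection framework proposed in the paper.

```python
import numpy as np

def signed_correlation(X, y):
    """Signed per-feature score: positive values indicate class membership,
    negative values indicate non-membership (a simple stand-in for one-sided
    metrics such as CC or OR)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = Xc.std(axis=0) * y.std() + 1e-12
    return (Xc * yc[:, None]).mean(axis=0) / denom

def select_features(X, y, k, pos_ratio=0.7):
    """Pick k features: a pos_ratio fraction from the strongest positive scores
    and the remainder from the strongest negative scores, making the mix of
    membership and non-membership indicators explicit and tunable."""
    scores = signed_correlation(X, y)
    n_pos = int(round(k * pos_ratio))
    n_neg = k - n_pos
    order = np.argsort(scores)            # ascending: most negative first
    neg_idx = order[:n_neg]               # strongest negative features
    pos_idx = order[::-1][:n_pos]         # strongest positive features
    return np.concatenate([pos_idx, neg_idx])

# Toy usage: feature 0 indicates membership, feature 5 indicates non-membership.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 6))
y = ((X[:, 0] == 1) | (X[:, 5] == 0)).astype(int)
print(select_features(X, y, k=4, pos_ratio=0.5))
```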
IEEE Computer | 1995
Rohini K. Srihari
The interaction of textual and photographic information in an integrated text/image database environment is being explored. Specifically, our research group has developed an automatic indexing system for captioned pictures of people; the indexing information and other textual information are subsequently used in a content-based image retrieval system. Our approach presents an alternative to traditional face identification systems; it goes beyond a superficial combination of existing text-based and image-based approaches to information retrieval. By understanding the caption accompanying a picture, we can extract information that is useful both for retrieving the picture and for identifying the faces shown. In designing a pictorial database system, two major issues are (1) the amount and type of processing required when inserting new pictures into the database and (2) efficient retrieval schemes for query processing. Our research has focused on developing a computational model for understanding pictures based on accompanying descriptive text. Understanding a picture can be informally defined as the process of identifying relevant people and objects. Several current vision systems employ the idea of top-down control in picture understanding. We carry the notion of top-down control one step further, exploiting not only general context but also picture-specific context.
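A minimal modern sketch of the caption-driven indexing idea: extract person names from a caption and record them as index terms for the photograph. spaCy is used here purely for illustration; the original system relied on its own caption-understanding component, not an off-the-shelf NER model.

```python
from collections import defaultdict
import spacy

# Illustrative stand-in for caption understanding: a generic NER model
# extracts person names, which then serve as index terms for the photo.
nlp = spacy.load("en_core_web_sm")

photo_index = defaultdict(set)   # person name -> set of photo ids

def index_captioned_photo(photo_id, caption):
    doc = nlp(caption)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            photo_index[ent.text].add(photo_id)

index_captioned_photo("img_001.jpg",
                      "Jane Doe shakes hands with John Smith at the award ceremony.")
print(dict(photo_index))
```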
knowledge discovery and data mining | 2009
Wei Jin; Hung Hay Ho; Rohini K. Srihari
Merchants selling products on the Web often ask their customers to share their opinions and hands-on experiences with products they have purchased. Unfortunately, reading through all customer reviews is difficult, especially for popular items, for which the number of reviews can reach hundreds or even thousands; this makes it hard for a potential customer to digest them and reach an informed decision. The OpinionMiner system designed in this work aims to mine customer reviews of a product and extract highly detailed product entities on which reviewers express their opinions. Opinion expressions are identified, and the opinion orientation for each recognized product entity is classified as positive or negative. Unlike previous approaches that employed rule-based or statistical techniques, we propose a novel machine learning approach built under the framework of lexicalized HMMs. The approach naturally integrates multiple important linguistic features into automatic learning. In this paper, we describe the architecture and main components of the system. The evaluation of the proposed method is presented based on processing online product reviews from Amazon and other publicly available datasets.
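As a rough illustration of the lexicalized-HMM idea, the sketch below runs Viterbi decoding over hybrid observations that pair each word with its part-of-speech tag, so lexical and linguistic features jointly shape the emission probabilities. The tag set, the toy probabilities, and the smoothing constant are illustrative assumptions, not the parameters learned by OpinionMiner.

```python
from collections import defaultdict
import math

STATES = ["PRODUCT_ENTITY", "OPINION_WORD", "BACKGROUND"]

# Toy, hand-set parameters; the real system estimates lexicalized
# emissions P(word, POS | state) from labeled reviews.
START = {"PRODUCT_ENTITY": 0.3, "OPINION_WORD": 0.2, "BACKGROUND": 0.5}
TRANS = {
    "PRODUCT_ENTITY": {"PRODUCT_ENTITY": 0.3, "OPINION_WORD": 0.4, "BACKGROUND": 0.3},
    "OPINION_WORD":   {"PRODUCT_ENTITY": 0.3, "OPINION_WORD": 0.2, "BACKGROUND": 0.5},
    "BACKGROUND":     {"PRODUCT_ENTITY": 0.3, "OPINION_WORD": 0.3, "BACKGROUND": 0.4},
}
EMIT = defaultdict(lambda: 1e-6, {
    ("PRODUCT_ENTITY", ("battery", "NN")): 0.4,
    ("PRODUCT_ENTITY", ("screen", "NN")):  0.4,
    ("OPINION_WORD",   ("great", "JJ")):   0.5,
    ("OPINION_WORD",   ("poor", "JJ")):    0.4,
    ("BACKGROUND",     ("the", "DT")):     0.3,
    ("BACKGROUND",     ("is", "VBZ")):     0.3,
})

def viterbi(observations):
    """observations: list of (word, pos) pairs; returns the most likely state sequence."""
    V = [{s: math.log(START[s]) + math.log(EMIT[(s, observations[0])]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            scores[s] = V[-1][best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[(s, obs)])
            ptr[s] = best_prev
        V.append(scores)
        back.append(ptr)
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([("the", "DT"), ("battery", "NN"), ("is", "VBZ"), ("great", "JJ")]))
```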
knowledge discovery and data mining | 2004
Xiaoyun Wu; Rohini K. Srihari
Like many purely data-driven machine learning methods, Support Vector Machine (SVM) classifiers are learned exclusively from the evidence presented in the training dataset; thus a larger training dataset is required for better performance. In some applications, there might be human knowledge available that, in principle, could compensate for the lack of data. In this paper, we propose a simple generalization of SVM: the Weighted Margin SVM (WMSVM), which permits the incorporation of prior knowledge. We show that Sequential Minimal Optimization can be used to train WMSVM. We discuss the issues of incorporating prior knowledge using this rather general formulation. The experimental results show that the proposed method of incorporating prior knowledge is effective.
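A loosely related, minimal sketch: scikit-learn's per-example sample_weight lets prior confidence in individual labels influence the SVM objective. This is not the paper's Weighted Margin SVM formulation, only an accessible approximation of the underlying idea that examples backed by prior knowledge can be given more influence than uncertain ones.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Suppose prior knowledge says the first 20 labels were assigned by an
# expert (high confidence) and the rest came from a noisy heuristic
# (low confidence).  Encode that belief as per-example weights.
confidence = np.full(len(y), 0.3)
confidence[:20] = 1.0

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=confidence)
print("training accuracy:", clf.score(X, y))
```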
international conference on tools with artificial intelligence | 1999
Aibing Rao; Rohini K. Srihari; Zhongfei Zhang
The color histogram is an important technique for color image database indexing and retrieval. In this paper, the traditional color histogram is modified to capture the spatial layout information of each color, and three types of spatial color histograms are introduced: annular, angular and hybrid color histograms. Experiments show that, with a proper trade-off between the granularity in the color and spatial dimensions, these histograms outperform both the traditional color histogram and some existing histogram refinements such as the color coherence vector.
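To make the spatial refinement concrete, here is a minimal sketch of an annular color histogram: pixels are binned jointly by quantized color and by which concentric ring around the image center they fall into. The bin counts and normalization choices are illustrative, not the exact configuration used in the paper.

```python
import numpy as np

def annular_color_histogram(image, n_colors=8, n_rings=3):
    """image: HxWx3 uint8 array.  Returns an (n_colors * n_rings) histogram in
    which each color's mass is split across concentric rings around the image
    center, capturing coarse spatial layout."""
    h, w, _ = image.shape
    # Quantize each channel to 2 levels -> 8 color bins (2*2*2).
    q = (image // 128).astype(int)
    color_bin = q[:, :, 0] * 4 + q[:, :, 1] * 2 + q[:, :, 2]

    # Normalized radial distance of every pixel from the image center.
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(ys - cy, xs - cx)
    ring = np.minimum((r / (r.max() + 1e-9) * n_rings).astype(int), n_rings - 1)

    hist = np.zeros((n_colors, n_rings))
    np.add.at(hist, (color_bin, ring), 1)
    return (hist / hist.sum()).ravel()

img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(annular_color_histogram(img).shape)   # (24,)
```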
conference on applied natural language processing | 2000
Rohini K. Srihari
This paper presents a hybrid approach to named entity (NE) tagging which combines a Maximum Entropy Model (MaxEnt), a Hidden Markov Model (HMM) and handcrafted grammatical rules. Each has innate strengths and weaknesses; the combination results in a very high precision tagger. The MaxEnt component incorporates external gazetteers into the system. Sub-category generation is also discussed.
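A highly simplified sketch of the hybrid idea: a statistical tagger proposes entity labels and a handcrafted rule layer overrides or adds labels where surface patterns are unambiguous. The specific rules and the stand-in statistical tagger below are assumptions for illustration only.

```python
import re

def statistical_tagger(tokens):
    """Stand-in for the MaxEnt/HMM component: here it simply labels
    capitalized tokens as candidate PERSON mentions."""
    return ["PERSON" if t[0].isupper() else "O" for t in tokens]

# Handcrafted rules: unambiguous surface patterns take precedence.
RULES = [
    (re.compile(r"^\d{4}$"), "DATE"),                        # four-digit years
    (re.compile(r"^[A-Z][a-z]+,?\s?(Inc|Corp)\.?$"), "ORG"),  # company suffixes
]

def hybrid_tag(tokens):
    tags = statistical_tagger(tokens)
    for i, tok in enumerate(tokens):
        for pattern, label in RULES:
            if pattern.match(tok):
                tags[i] = label            # rule overrides the statistical guess
                break
    return list(zip(tokens, tags))

print(hybrid_tag(["John", "joined", "Acme Corp.", "in", "1999"]))
```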
international acm sigir conference on research and development in information retrieval | 2002
Munirathnam Srikanth; Rohini K. Srihari
Statistical language models (LM) have been used in many natural language processing tasks, including speech recognition and machine translation [5, 2]. Recently, language models have been explored as a framework for information retrieval [9, 4, 7, 1, 6]. The basic idea is to view each document as having its own language model and to model querying as a generative process. Documents are ranked based on the probability of their language model generating the given query. Since documents are fixed entities in information retrieval, language models for documents suffer from the sparse data problem. Smoothed unigram models have been used to demonstrate better performance of language models against vector space or probabilistic retrieval models for document retrieval. Song and Croft [10] proposed a general language model that combined bigram language models with Good-Turing estimation and corpus-based smoothing of unigram probabilities. Improved performance was observed with the combined bigram language models. The language models explored for information retrieval mimic those used for speech recognition. Specifically, in the bigram model a document d, represented as the word sequence w_1, w_2, ..., w_n, is modeled as p(d) = p(w_1) ∏_{i=2}^{n} p(w_i | w_{i-1}).
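A minimal sketch of query-likelihood retrieval with a smoothed bigram document model: each document is scored by the probability its language model assigns to the query, interpolating bigram, unigram, and collection statistics. The interpolation weights and the interpolation scheme itself are illustrative assumptions, not the estimates used in the paper.

```python
import math
from collections import Counter

def bigram_query_likelihood(query, doc, collection, lam=0.6, mu=0.3):
    """log P(query | doc) under an interpolated bigram model:
       P(w_i | w_{i-1}) ≈ lam * P_doc(w_i | w_{i-1})
                        + mu  * P_doc(w_i)
                        + (1 - lam - mu) * P_collection(w_i)"""
    d_uni, d_bi = Counter(doc), Counter(zip(doc, doc[1:]))
    c_uni = Counter(collection)
    dlen, clen = len(doc), len(collection)

    def p(w, prev=None):
        p_coll = c_uni[w] / clen
        p_doc_uni = d_uni[w] / dlen if dlen else 0.0
        p_doc_bi = d_bi[(prev, w)] / d_uni[prev] if prev and d_uni[prev] else 0.0
        return lam * p_doc_bi + mu * p_doc_uni + (1 - lam - mu) * p_coll

    score, prev = 0.0, None
    for w in query:
        score += math.log(p(w, prev) + 1e-12)
        prev = w
    return score

docs = {
    "d1": "the quick brown fox jumps over the lazy dog".split(),
    "d2": "language models for information retrieval rank documents".split(),
}
collection = [w for d in docs.values() for w in d]
query = "language models retrieval".split()
ranked = sorted(docs, key=lambda d: bigram_query_likelihood(query, docs[d], collection), reverse=True)
print(ranked)   # d2 should rank above d1
```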
conference on applied natural language processing | 2000
Rohini K. Srihari; Wei Li
This paper discusses an information extraction (IE) system, Textract, in natural language (NL) question answering (QA) and examines the role of IE in the QA application. It shows that: (i) Named Entity tagging is an important component for QA, (ii) an NL shallow parser provides a structural basis for handling questions, and (iii) high-level domain-independent IE can result in a QA breakthrough.
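One way NE tagging supports QA, sketched below under simple assumptions: a question-word heuristic maps the question to an expected entity type, and candidate answers extracted by an NE tagger are filtered to mentions of that type. The mapping table and candidate format are illustrative only, not Textract's actual pipeline.

```python
# Map interrogatives to the entity type an answer is expected to carry.
EXPECTED_TYPE = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
    "how many": "NUMBER",
}

def expected_answer_type(question):
    q = question.lower()
    for cue, etype in EXPECTED_TYPE.items():
        if q.startswith(cue):
            return etype
    return None

def filter_candidates(question, tagged_candidates):
    """tagged_candidates: list of (text, entity_type) pairs extracted
    from retrieved passages by an NE tagger."""
    etype = expected_answer_type(question)
    if etype is None:
        return tagged_candidates
    return [(t, e) for t, e in tagged_candidates if e == etype]

candidates = [("1969", "DATE"), ("Neil Armstrong", "PERSON"), ("the Moon", "LOCATION")]
print(filter_candidates("Who was the first person to walk on the Moon?", candidates))
```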
north american chapter of the association for computational linguistics | 2003
Rohini K. Srihari; Wei Li; Cheng Niu; Thomas L. Cornell
Information extraction (IE) systems assist analysts in assimilating information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of documents drawn from various sources for situations that cannot be anticipated a priori, it requires IE systems with breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need for defining new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes a robust, scalable IE engine designed for such purposes. It describes new IE tasks, such as entity profiles and concept-based general events, which represent realistic near-term goals while providing useful, actionable information. These new tasks also facilitate the correlation of output from an IE engine with existing structured data. Benchmarking results for the core engine and for applications utilizing the engine are presented.
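The entity-profile task can be pictured as an aggregation data structure. The sketch below is an assumed, simplified representation (field names are illustrative, not the paper's schema) of how mentions, descriptive phrases, and associated events from many documents might be merged under one entity key.

```python
from dataclasses import dataclass, field

@dataclass
class EntityProfile:
    """Illustrative container for an entity profile aggregated across documents."""
    canonical_name: str
    aliases: set = field(default_factory=set)
    descriptors: list = field(default_factory=list)   # e.g. "spokeswoman for Acme"
    events: list = field(default_factory=list)        # concept-based events the entity takes part in
    source_docs: set = field(default_factory=set)

    def merge_mention(self, alias, descriptor, event, doc_id):
        self.aliases.add(alias)
        if descriptor:
            self.descriptors.append(descriptor)
        if event:
            self.events.append(event)
        self.source_docs.add(doc_id)

profile = EntityProfile("Jane Doe")
profile.merge_mention("J. Doe", "spokeswoman for Acme", ("announce", "merger"), "doc_17")
profile.merge_mention("Ms. Doe", None, None, "doc_42")
print(profile.aliases, len(profile.source_docs))
```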
acm multimedia | 2000
Lei Zhu; Aidong Zhang; Aibing Rao; Rohini K. Srihari
We propose a new framework, termed Keyblock, for content-based image retrieval; it generalizes text-based information retrieval technology to the image domain. In this framework, methods for extracting comprehensive image features are provided, based on the frequency of representative blocks, termed keyblocks, in the image database. Keyblocks, which are analogous to index terms in text document retrieval, can be constructed by exploiting the vector quantization (VQ) methods that have been used for image compression. Comparing the performance of our approach with existing techniques based on color features and wavelet texture features, the experimental results demonstrate the effectiveness of the framework in image retrieval.
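A compact sketch of the keyblock idea under simple assumptions: a codebook of representative blocks is learned with k-means (a basic form of vector quantization), and each image is then represented by the frequency of its nearest codewords, analogous to a term-frequency vector in text retrieval. Block size, codebook size, and the use of grayscale images are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_blocks(image, block=4):
    """Split a grayscale HxW image into non-overlapping block x block patches."""
    h, w = image.shape
    patches = [image[i:i + block, j:j + block].ravel()
               for i in range(0, h - block + 1, block)
               for j in range(0, w - block + 1, block)]
    return np.array(patches, dtype=float)

def keyblock_histogram(image, codebook):
    """Represent an image as the normalized frequency of its nearest keyblocks."""
    blocks = extract_blocks(image)
    codes = codebook.predict(blocks)
    hist = np.bincount(codes, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(32, 32)).astype(float) for _ in range(10)]

# Learn the keyblock codebook from all training blocks (toy size: 16 codewords).
all_blocks = np.vstack([extract_blocks(im) for im in images])
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_blocks)

query_hist = keyblock_histogram(images[0], codebook)
print(query_hist.shape)   # (16,)
```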