
Publication


Featured research published by Huaiyu Zhu.


International Conference on Management of Data | 2009

SystemT: a system for declarative information extraction

Rajasekar Krishnamurthy; Yunyao Li; Sriram Raghavan; Frederick R. Reiss; Shivakumar Vaithyanathan; Huaiyu Zhu

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.
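The algebraic idea can be illustrated with a toy sketch: extraction rules are built by composing span-producing operators rather than by hand-cascading regular expressions. All operator names and the person-name rule below are illustrative inventions, not SystemT's actual algebra or AQL surface language.

```python
import re

# Each operator maps a document to a list of (start, end, text) spans,
# and operators compose declaratively. Names are illustrative only.

def extract_regex(pattern):
    """Extraction operator: produce spans matching a regex."""
    def op(doc):
        return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]
    return op

def select(op, predicate):
    """Selection operator: keep spans satisfying a predicate."""
    return lambda doc: [s for s in op(doc) if predicate(s)]

def follows(op1, op2, max_gap):
    """Join operator: merge spans of op1 with spans of op2 that start
    within max_gap characters after op1's span ends."""
    def op(doc):
        return [(a[0], b[1], doc[a[0]:b[1]])
                for a in op1(doc) for b in op2(doc)
                if 0 <= b[0] - a[1] <= max_gap]
    return op

# A tiny "person name" rule: a capitalized word followed by another one.
first = extract_regex(r"\b[A-Z][a-z]+\b")
last = extract_regex(r"\b[A-Z][a-z]+\b")
person = follows(first, last, max_gap=1)

print(person("Huaiyu Zhu works at the Almaden lab."))  # → [(0, 10, 'Huaiyu Zhu')]
```

Because rules are expressed as operator trees rather than opaque regex cascades, an optimizer can reorder or rewrite them before execution, which is the source of the speedups the paper reports.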


International Conference on Data Engineering | 2008

An Algebraic Approach to Rule-Based Information Extraction

Frederick R. Reiss; Sriram Raghavan; Rajasekar Krishnamurthy; Huaiyu Zhu; Shivakumar Vaithyanathan

Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach by extensive experiments over real-world blog data.
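One kind of rewrite such an optimizer can make is shown below as a generic selection-pushdown sketch: filtering while joining instead of after joining, so far fewer span pairs are enumerated. The operators and plans are invented for illustration, not the paper's actual algebra.

```python
import re

def spans(pattern, doc):
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]

def join_then_filter(doc):
    # Naive plan: enumerate all (number, word) pairs, then filter by distance.
    nums = spans(r"\d+", doc)
    units = spans(r"[a-z]+", doc)
    pairs = [(n, u) for n in nums for u in units]          # O(|nums| * |units|)
    return [(n[2], u[2]) for n, u in pairs if 0 <= u[0] - n[1] <= 1]

def filter_while_joining(doc):
    # Rewritten plan: probe only the words that start right after a number.
    nums = spans(r"\d+", doc)
    starts = {u[0]: u for u in spans(r"[a-z]+", doc)}
    out = []
    for n in nums:
        for gap in (0, 1):
            u = starts.get(n[1] + gap)
            if u:
                out.append((n[2], u[2]))
    return out

doc = "The run took 45 ms on 3 nodes with 12 gb of memory."
print(join_then_filter(doc))  # → [('45', 'ms'), ('3', 'nodes'), ('12', 'gb')]
```

The two plans return the same result, but the second avoids the quadratic pair enumeration; choosing between such plans based on estimated cost is the role of the optimizer.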


International Conference on Management of Data | 2006

Avatar semantic search: a database approach to information retrieval

Eser Kandogan; Rajasekar Krishnamurthy; Sriram Raghavan; Shivakumar Vaithyanathan; Huaiyu Zhu

We present Avatar Semantic Search, a prototype search engine that exploits annotations in the context of classical keyword search. Annotation is performed offline, using high-precision information extraction techniques to extract facts, concepts, and relationships from text. These facts and concepts are represented and indexed in a structured data store. At runtime, keyword queries are interpreted in the context of these extracted facts and converted into one or more precise queries over the structured store. In this demonstration we describe the overall architecture of the Avatar Semantic Search engine. We also demonstrate the superiority of the AVATAR approach over traditional keyword search engines using the Enron email data set and a blog corpus.
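The runtime step can be sketched as follows: keywords that name an annotation type become type filters over the structured store, and the rest match extracted values. The record schema and interpretation rules here are invented for illustration, not Avatar's actual design.

```python
# Offline: facts extracted from text, stored as structured records.
annotations = [
    {"type": "person", "value": "Huaiyu Zhu", "doc": "email-17"},
    {"type": "person", "value": "Sriram Raghavan", "doc": "email-42"},
    {"type": "phone", "value": "555-0100", "doc": "email-17"},
]

def interpret(keywords):
    """Rewrite a keyword query into structured predicates: a keyword that
    names an annotation type becomes a type filter; any other keyword
    must match the value field."""
    types = {a["type"] for a in annotations}
    type_filters = [k for k in keywords if k in types]
    value_terms = [k for k in keywords if k not in types]
    return type_filters, value_terms

def run(keywords):
    type_filters, value_terms = interpret(keywords)
    hits = [a for a in annotations
            if (not type_filters or a["type"] in type_filters)
            and all(t.lower() in a["value"].lower() for t in value_terms)]
    return sorted({a["doc"] for a in hits})

print(run(["phone"]))  # → ['email-17']
print(run(["zhu"]))    # → ['email-17']
```

The precision gain comes from the query "phone" matching documents that contain an extracted phone number, rather than documents that merely contain the word "phone".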


International World Wide Web Conference | 2007

Navigating the intranet with high precision

Huaiyu Zhu; Sriram Raghavan; Shivakumar Vaithyanathan; Alexander Löser

Despite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work [6] that compared intranets and the Internet from the viewpoint of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of term-variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general purpose search algorithm.
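Term-variant generation can be sketched as below: from each navigational page title, derive the alternative strings a user might type. The specific variant rules (concatenation, acronym, stopword dropping) are illustrative guesses, not the rules used in the paper.

```python
def variants(title):
    """Generate plausible query variants for a navigational page title."""
    words = title.lower().split()
    out = {title.lower()}
    out.add("".join(words))                    # concatenation: "globalprintservices"
    if len(words) > 1:
        out.add("".join(w[0] for w in words))  # acronym: "gps"
    stop = {"the", "of", "for"}
    out.add(" ".join(w for w in words if w not in stop))  # stopwords dropped
    return out

v = variants("Global Print Services")
print(sorted(v))
```

Each variant is then indexed against the page in a dedicated navigational index, so a query like "gps" can resolve directly to the right home page.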


International Joint Conference on Natural Language Processing | 2015

Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling

Alan Akbik; Laura Chiticariu; Marina Danilevsky; Yunyao Li; Shivakumar Vaithyanathan; Huaiyu Zhu

Semantic role labeling (SRL) is crucial to natural language understanding as it identifies the predicate-argument structure in text with semantic labels. Unfortunately, resources required to construct SRL models are expensive to obtain and simply do not exist for most languages. In this paper, we present a two-stage method to enable the construction of SRL models for resource-poor languages by exploiting monolingual SRL and multilingual parallel data. Experimental results show that our method outperforms existing methods. We use our method to generate Proposition Banks of high to reasonable quality for 7 languages in three language families and release these resources to the research community.
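The basic building block of such approaches is annotation projection: copying SRL labels from a labeled source sentence to an unlabeled target sentence through word alignments. The sketch below uses deliberately simplified label and alignment formats; it is the general idea, not the paper's exact two-stage method.

```python
def project_labels(src_labels, alignment, tgt_len):
    """src_labels: {source token index: role}; alignment: (src, tgt) pairs.
    Returns {target token index: role} by copying a source token's role
    to every in-range target token aligned to it."""
    tgt_labels = {}
    for s, t in alignment:
        if s in src_labels and 0 <= t < tgt_len:
            tgt_labels[t] = src_labels[s]
    return tgt_labels

# English: "Zhu wrote the paper" -> roles on tokens 0 (A0), 1 (predicate), 3 (A1)
src_labels = {0: "A0", 1: "PRED", 3: "A1"}
# Target: "Zhu schrieb das Papier", aligned word-for-word
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(project_labels(src_labels, alignment, 4))  # → {0: 'A0', 1: 'PRED', 3: 'A1'}
```

Real parallel data has noisy, many-to-many alignments, which is why projected labels need the filtering and completion stages the paper describes before they yield high-quality Proposition Banks.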


IEEE Micro | 2014

Giving Text Analytics a Boost

Raphael Polig; Kubilay Atasu; Laura Chiticariu; Christoph Hagleitner; H. Peter Hofstee; Frederick R. Reiss; Huaiyu Zhu; Eva Sitaridi

The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text-analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.


Conference on Information and Knowledge Management | 2011

Facilitating pattern discovery for relation extraction with semantic-signature-based clustering

Yunyao Li; Vivian Chu; Sebastian Blohm; Huaiyu Zhu; Howard Ho

Hand-crafted textual patterns have been the mainstay of practical relation extraction for decades. However, there has been little work on reducing the manual effort involved in the discovery of effective textual patterns for relation extraction. In this paper, we propose a clustering-based approach to facilitate pattern discovery for relation extraction. Specifically, we define the notion of a semantic signature to represent the most salient features of a textual fragment. We then propose a novel clustering algorithm based on semantic signatures, S2C, and its enhancement S2C+. Experiments on two real-world data sets show that, when compared with k-means clustering, S2C and S2C+ are at least an order of magnitude faster, while generating high-quality clusters that are at least comparable to the best clusters generated by k-means, without requiring any manual tuning. Finally, a user study confirms that our clustering-based approach can indeed help users discover effective textual patterns for relation extraction with only a fraction of the manual effort required by the conventional approach.
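The speed advantage over k-means comes from the shape of the algorithm: reduce each fragment to a signature of salient features, then group identical signatures in a single pass, with no iterative refitting. The feature extraction below is a toy stand-in for the paper's signature definition.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "was", "by", "is"}

def signature(fragment):
    """Toy semantic signature: the set of non-stopword tokens."""
    return frozenset(w for w in fragment.lower().split() if w not in STOPWORDS)

def signature_cluster(fragments):
    clusters = defaultdict(list)
    for f in fragments:              # single pass: O(n), no centroid updates
        clusters[signature(f)].append(f)
    return list(clusters.values())

frags = [
    "Zhu was hired by IBM",
    "Zhu hired by IBM",          # same signature -> same cluster
    "Li joined the Almaden lab",
]
print(signature_cluster(frags))
```

Each resulting cluster groups fragments that express a relation the same way, giving a user one representative pattern candidate to inspect per cluster instead of thousands of raw fragments.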


International Conference on Management of Data | 2013

Provenance-based dictionary refinement in information extraction

Sudeepa Roy; Laura Chiticariu; Vitaly Feldman; Frederick R. Reiss; Huaiyu Zhu

Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
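The problem setup can be sketched concretely: each extractor output carries provenance (the dictionary entries it depends on), and refinement removes entries so that the surviving outputs score best against labels. A naive greedy pass is shown for illustration, with simple conjunctive provenance; the paper formalizes the objective and gives principled algorithms and complexity results.

```python
def f1(outputs, labels):
    """F1 of a result set against 0/1 labels (labels also define the gold total)."""
    tp = sum(labels[o] for o in outputs)
    if not outputs or tp == 0:
        return 0.0
    p = tp / len(outputs)
    r = tp / sum(labels.values())
    return 2 * p * r / (p + r)

def surviving(outputs, provenance, removed):
    # An output survives only if none of its provenance entries were removed.
    return [o for o in outputs if not provenance[o] & removed]

def greedy_refine(outputs, provenance, labels, entries):
    removed = set()
    improved = True
    while improved:
        improved = False
        base = f1(surviving(outputs, provenance, removed), labels)
        for e in entries - removed:
            if f1(surviving(outputs, provenance, removed | {e}), labels) > base:
                removed.add(e)
                improved = True
                break
    return removed

outputs = ["o1", "o2", "o3"]
labels = {"o1": 1, "o2": 1, "o3": 0}                 # o3 is an incorrect result
provenance = {"o1": {"acme"}, "o2": {"ibm"}, "o3": {"the"}}
print(greedy_refine(outputs, provenance, labels, {"acme", "ibm", "the"}))  # → {'the'}
```

Removing the noisy entry "the" eliminates only the false positive, raising F1 from 0.8 to 1.0; removing either good entry would cost a true positive, so the greedy pass leaves them alone.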


European Conference on Principles of Data Mining and Knowledge Discovery | 2003

Topic learning from few examples

Huaiyu Zhu; Shivakumar Vaithyanathan; Mahesh V. Joshi

This paper describes a semi-supervised algorithm for single-class learning with very few examples. The problem is formulated as a hierarchical latent variable model which is clipped to ignore classes not of interest. The model is trained using a multistage EM (msEM) algorithm. The msEM algorithm maximizes the likelihood of the joint distribution of the data and latent variables, under the constraint that the distribution of each layer is fixed in successive stages. We demonstrate that with very few positive examples, the algorithm performs better than training all layers in a single stage. We also show that the latter is equivalent to training a single layer model with corresponding parameters. The performance of the algorithm was verified on several real-world information extraction tasks.
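The stage-wise constraint can be illustrated on a toy two-component Bernoulli mixture: each stage runs EM over one layer's parameters while the other layer is held fixed. This is a generic coordinate-style EM sketch in the spirit of msEM, not the paper's actual hierarchical model.

```python
def e_step(data, pi, p):
    """Responsibility of component 1 for each binary observation x."""
    resp = []
    for x in data:
        l1 = pi * (p[0] if x else 1 - p[0])
        l2 = (1 - pi) * (p[1] if x else 1 - p[1])
        resp.append(l1 / (l1 + l2))
    return resp

def stage_mixing(data, pi, p, iters=50):
    # Stage 1: update only the mixing weight pi; component params p are frozen.
    for _ in range(iters):
        resp = e_step(data, pi, p)
        pi = sum(resp) / len(resp)
    return pi

def stage_components(data, pi, p, iters=50):
    # Stage 2: update only the component params p; pi is frozen.
    for _ in range(iters):
        resp = e_step(data, pi, p)
        n1 = sum(resp)
        p = (sum(r * x for r, x in zip(resp, data)) / n1,
             sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - n1))
    return p

data = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
pi = stage_mixing(data, 0.5, (0.9, 0.1))
p = stage_components(data, pi, (0.9, 0.1))
```

Freezing one layer per stage keeps each maximization well-conditioned when data is scarce, which is the intuition behind the paper's finding that staged training beats jointly fitting all layers on very few positive examples.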


Very Large Data Bases | 2017

Creation and interaction with large-scale domain-specific knowledge bases

Shreyas Bharadwaj; Laura Chiticariu; Marina Danilevsky; Samarth Dhingra; Samved Divekar; Arnaldo Carreno-Fuentes; Himanshu Gupta; Nitin Gupta; Sang-Don Han; Mauricio A. Hernández; Howard Ho; Parag Jain; Salil Joshi; Hima P. Karanam; Saravanan Krishnan; Rajasekar Krishnamurthy; Yunyao Li; Satishkumaar Manivannan; Ashish R. Mittal; Fatma Ozcan; Abdul Quamar; Poornima Raman; Diptikalyan Saha; Karthik Sankaranarayanan; Jaydeep Sen; Prithviraj Sen; Shivakumar Vaithyanathan; Mitesh Vasa; Hao Wang; Huaiyu Zhu

The ability to create and interact with large-scale domain-specific knowledge bases from unstructured/semi-structured data is the foundation for many industry-focused cognitive systems. We will demonstrate the Content Services system that provides cloud services for creating and querying high-quality domain-specific knowledge bases by analyzing and integrating multiple (un/semi)structured content sources. We will showcase an instantiation of the system for a financial domain. We will also demonstrate both cross-lingual natural language queries and programmatic API calls for interacting with this knowledge base.
