Haixun Wang | Researchain

Publication

Featured researches published by Haixun Wang.

international conference on data engineering | 2015

Short text understanding through lexical-semantic analysis

Wen Hua; Zhongyuan Wang; Haixun Wang; Kai Zheng; Xiaofang Zhou

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.

international conference on data engineering | 2014

How to partition a billion-node graph

Lu Wang; Yanghua Xiao; Bin Shao; Haixun Wang

Billion-node graphs pose significant challenges at all levels from storage infrastructures to programming models. It is critical to develop a general purpose platform for graph processing. A distributed memory system is considered a feasible platform supporting online query processing as well as offline graph analytics. In this paper, we study the problem of partitioning a billion-node graph on such a platform, an important consideration because it has direct impact on load balancing and communication overhead. It is challenging not just because the graph is large, but because we can no longer assume that the data can be organized in arbitrary ways to maximize the performance of the partitioning algorithm. Instead, the algorithm must adopt the same data and programming model adopted by the system and other applications. In this paper, we propose a multi-level label propagation (MLP) method for graph partitioning. Experimental results show that our solution can partition billion-node graphs within several hours on a distributed memory system consisting of merely several machines, and the quality of the partitions produced by our approach is comparable to state-of-the-art approaches applied on toy-size graphs.
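As a rough illustration of the label propagation idea named in the abstract above, the following sketch runs a single-level, size-constrained label propagation on a small in-memory graph. It is not the paper's multi-level method: the function name `label_propagation`, the `max_label_size` cap, and the tie-breaking rule are all assumptions made for this example, and a real system would run this over a distributed store rather than a Python dict.

```python
from collections import Counter

def label_propagation(adj, max_label_size, rounds=10):
    """Single-level, size-constrained label propagation (illustrative only).

    adj: dict mapping node -> list of neighbor nodes (undirected graph).
    max_label_size: hypothetical cap on how many nodes may share one label.
    """
    labels = {v: v for v in adj}          # every node starts with its own label
    sizes = Counter(labels.values())      # number of nodes per label
    for _ in range(rounds):
        changed = False
        for v in sorted(adj):             # fixed visiting order keeps the sketch deterministic
            counts = Counter(labels[u] for u in adj[v])
            # Try neighbor labels from most to least frequent (ties: smallest label first).
            for lab, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
                if lab == labels[v]:
                    break                 # the current label is the best one still available
                if sizes[lab] < max_label_size:
                    sizes[labels[v]] -= 1
                    sizes[lab] += 1
                    labels[v] = lab
                    changed = True
                    break
        if not changed:
            break                         # converged: no node changed its label
    return labels

# Two triangles joined by one edge (2-3); the cap of 3 keeps the groups balanced.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parts = label_propagation(graph, max_label_size=3)
```

The size cap stands in for the load-balancing concern raised in the abstract: without it, label propagation tends to collapse a well-connected graph into one giant group.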

IEEE Transactions on Knowledge and Data Engineering | 2016

Understanding Short Texts through Semantic Enrichment and Hashing

Zheng Yu; Haixun Wang; Xuemin Lin; Min Wang

Clustering short texts by their meaning is a challenging task. The semantic hashing approach encodes the meaning of a text into a compact binary code. Thus, to tell if two texts have similar meanings, we only need to check if they have similar codes. The encoding is created by a deep neural network, which is trained on texts represented by word-count vectors. Unfortunately, for short texts such as search queries, such representations are insufficient to capture the underlying semantics. We propose a method to add more semantic signals to enrich short texts. Furthermore, we introduce a simplified deep learning network constructed by stacked auto-encoders to do semantic hashing. Experiments show that our method significantly improves the understanding of short texts, including text retrieval, classification and other general text-related tasks.
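The comparison step described above, checking whether two texts have similar binary codes, can be sketched as a Hamming-distance lookup. The codes below are made up for illustration; in the paper they would come from the trained network, which this sketch does not attempt to reproduce. The names `hamming` and `similar_texts` are assumptions.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")

def similar_texts(query_code: int, codes: dict, radius: int) -> list:
    """Return the texts whose code lies within `radius` bits of the query code."""
    return [text for text, code in codes.items() if hamming(query_code, code) <= radius]

# Hypothetical 8-bit codes; real codes would be produced by the encoder.
codes = {
    "cheap flights to paris": 0b10110010,
    "paris airfare deals": 0b10110110,
    "python list sort": 0b01001101,
}
hits = similar_texts(0b10110010, codes, radius=1)
```

Because the comparison is a XOR followed by a bit count, it stays cheap even when the collection of short texts is very large, which is the practical appeal of compact binary codes.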

international conference on data engineering | 2014

Head, modifier, and constraint detection in short texts

Zhongyuan Wang; Haixun Wang; Zhirui Hu

Head and modifier detection is an important problem for applications that handle short texts such as search queries, ads keywords, titles, captions, etc. In many cases, short texts such as search queries do not follow grammar rules, and existing approaches for head and modifier detection are coarse-grained, domain specific, and/or require labeling of large amounts of training data. In this paper, we introduce a semantic approach for head and modifier detection. We first obtain a large number of instance level head-modifier pairs from search logs. Then, we develop a conceptualization mechanism to generalize the instance level pairs to concept level. Finally, we derive weighted concept patterns that are concise, accurate, and have strong generalization power in head and modifier detection. Furthermore, we identify a subset of modifiers that we call constraints. Constraints are usually specific and not negligible as far as the intent of the short text is concerned, while non-constraint modifiers are more subjective. The mechanism we developed has been used in production for search relevance and ads matching. We use extensive experimental results to demonstrate the effectiveness of our approach.
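A minimal sketch of how weighted concept patterns could generalize beyond the instance pairs seen in a search log is shown below. The term-to-concept table, the pattern weights, and the function name `detect_head` are all hypothetical; the abstract does not specify how the patterns are stored or scored.

```python
# Hypothetical conceptualization table: term -> concept.
CONCEPTS = {
    "iphone 5s": "device",
    "galaxy s4": "device",
    "case": "accessory",
    "charger": "accessory",
}

# Hypothetical weighted concept patterns: (head concept, modifier concept) -> weight.
PATTERNS = {
    ("accessory", "device"): 0.9,
    ("device", "accessory"): 0.1,
}

def detect_head(term_a: str, term_b: str) -> tuple:
    """Return (head, modifier) for two terms, using concept-level pattern weights."""
    concept_a, concept_b = CONCEPTS[term_a], CONCEPTS[term_b]
    score_a_head = PATTERNS.get((concept_a, concept_b), 0.0)
    score_b_head = PATTERNS.get((concept_b, concept_a), 0.0)
    if score_a_head >= score_b_head:
        return term_a, term_b
    return term_b, term_a

pair = detect_head("iphone 5s", "case")
unseen = detect_head("charger", "galaxy s4")
```

The second call illustrates the generalization claim: the pair "charger" and "galaxy s4" needs no instance-level evidence of its own, because the decision is made on the concept pair (accessory, device).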

conference on information and knowledge management | 2013

Wikification via link co-occurrence

Zhiyuan Cai; Kaiqi Zhao; Kenny Q. Zhu; Haixun Wang

Wikification, the process of linking terms in a plain-text document to the Wikipedia articles that represent their correct meanings, can be thought of as a generalized Word Sense Disambiguation problem. It disambiguates multi-word expressions (MWEs) in addition to single words. Existing Wikification techniques either model the context of a given term and the Wikipedia article as bags of words, or compute global constraints among Wikipedia concepts from the link graph or link distributions. The first method does not achieve good results because MWEs can have very different meanings from their constituent words, which are themselves ambiguous. The second method does not produce high accuracy because the link structure or link distribution is often biased or incomplete, since Wikipedia pages are often sparsely linked. In this paper, we present a simple but powerful framework of sense disambiguation using co-occurrences of Wikipedia links in the Wikipedia corpus. We propose an iterative method to enrich the sparsely-linked articles by adding more links and then use the resulting link co-occurrence matrix to disambiguate an input document by a sliding window algorithm. Our prototype system achieves 89.97% precision and 76.43% recall on average for three benchmark datasets and compares favorably against four state-of-the-art wikification techniques.
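The sliding-window disambiguation step can be sketched as follows, assuming the link co-occurrence counts have already been computed. The counts, the candidate lists, and the scoring rule (sum of co-occurrences with every candidate sense of the neighboring mentions) are assumptions for illustration, not the paper's actual matrix or algorithm; the iterative link-enrichment step is omitted.

```python
# Hypothetical co-occurrence counts between Wikipedia article links.
COOC = {
    frozenset(["Apple Inc.", "IPhone"]): 50,
    frozenset(["Apple (fruit)", "IPhone"]): 1,
    frozenset(["Apple (fruit)", "Pie"]): 40,
    frozenset(["Apple Inc.", "Pie"]): 2,
}

# Hypothetical candidate articles for each surface mention.
CANDIDATES = {
    "apple": ["Apple Inc.", "Apple (fruit)"],
    "iphone": ["IPhone"],
    "pie": ["Pie"],
}

def cooc(a: str, b: str) -> int:
    """Symmetric lookup of how often two article links co-occur."""
    return COOC.get(frozenset([a, b]), 0)

def disambiguate(mentions: list, window: int = 2) -> list:
    """Pick, for each mention, the candidate that co-occurs most with its window."""
    result = []
    for i, mention in enumerate(mentions):
        lo, hi = max(0, i - window), min(len(mentions), i + window + 1)
        # Candidate senses of all other mentions inside the window.
        context = [c for j in range(lo, hi) if j != i for c in CANDIDATES[mentions[j]]]
        best = max(CANDIDATES[mention], key=lambda s: sum(cooc(s, c) for c in context))
        result.append(best)
    return result

tech = disambiguate(["apple", "iphone"])
food = disambiguate(["apple", "pie"])
```

The same mention "apple" resolves to different articles depending on which links appear nearby, which is the behavior the co-occurrence framework is meant to capture.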

very large data bases | 2017

KBQA: learning question answering over QA corpora and knowledge bases

Wanyun Cui; Yanghua Xiao; Haixun Wang; Yangqiu Song; Seung-won Hwang; Wei Wang

Question answering (QA) has become a popular way for humans to access billion-scale knowledge bases. Unlike web search, QA over a knowledge base gives out accurate and concise results, provided that natural language questions can be understood and mapped precisely to structured queries over the knowledge base. The challenge, however, is that a human can ask one question in many different ways. Previous approaches have natural limits due to their representations: rule based approaches only understand a small set of canned questions, while keyword based or synonym based approaches cannot fully understand the questions. In this paper, we design a new kind of question representation: templates, over a billion scale knowledge base and a million scale QA corpora. For example, for questions about a city's population, we learn templates such as "What's the population of $city?" and "How many people are there in $city?". We learned 27 million templates for 2782 intents. Based on these templates, our QA system KBQA effectively supports binary factoid questions, as well as complex questions which are composed of a series of binary factoid questions. Furthermore, we expand predicates in RDF knowledge base, which boosts the coverage of knowledge base by 57 times. Our QA system beats all other state-of-art works on both effectiveness and efficiency over QALD benchmarks.

IEEE Transactions on Knowledge and Data Engineering | 2017

Understand Short Texts by Harvesting and Analyzing Semantic Knowledge

Wen Hua; Zhongyuan Wang; Haixun Wang; Kai Zheng; Xiaofang Zhou

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing tools, ranging from part-of-speech tagging to dependency parsing, cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text mining such as topic modeling. Third, short texts are more ambiguous and noisy, and are generated in an enormous volume, which further increases the difficulty to handle them. We argue that semantic knowledge is required in order to better understand short texts. In this work, we build a prototype system for short text understanding which exploits semantic knowledge provided by a well-known knowledgebase and automatically harvested from a web corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that semantic knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are both effective and efficient in discovering semantics of short texts.

IEEE Transactions on Knowledge and Data Engineering | 2015

A Large Probabilistic Semantic Network Based Approach to Compute Term Similarity

Peipei Li; Haixun Wang; Kenny Q. Zhu; Zhongyuan Wang; Xuegang Hu; Xindong Wu

Measuring semantic similarity between two terms is essential for a variety of text analytics and understanding applications. Currently, there are two main approaches for this task, namely the knowledge based and the corpus based approaches. However, existing approaches are more suitable for semantic similarity between words rather than the more general multi-word expressions (MWEs), and they do not scale very well. Contrary to these existing techniques, we propose an efficient and effective approach for semantic similarity using a large scale semantic network. This semantic network is automatically acquired from billions of web documents. It consists of millions of concepts, which explicitly model the context of semantic relationships. In this paper, we first show how to map two terms into the concept space, and compare their similarity there. Then, we introduce a clustering approach to orthogonalize the concept space in order to improve the accuracy of the similarity measure. Finally, we conduct extensive studies to demonstrate that our approach can accurately compute the semantic similarity between terms of MWEs and with ambiguity, and significantly outperforms 12 competing methods under Pearson Correlation Coefficient. Meanwhile, our approach is much more efficient than all competing algorithms, and can be used to compute semantic similarity in a large scale.

conference on information and knowledge management | 2014

Transfer Understanding from Head Queries to Tail Queries

Yangqiu Song; Haixun Wang; Weizhu Chen; Shusen Wang

conference on information and knowledge management | 2015

An Inference Approach to Basic Level of Categorization

Zhongyuan Wang; Haixun Wang; Ji-Rong Wen; Yanghua Xiao
