Wen Hua | Researchain

Publication

Featured research published by Wen Hua.

international conference on data engineering | 2015

Short text understanding through lexical-semantic analysis

Wen Hua; Zhongyuan Wang; Haixun Wang; Kai Zheng; Xiaofang Zhou

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.

web search and data mining | 2013

Identifying users' topical tasks in web search

Wen Hua; Yangqiu Song; Haixun Wang; Xiaofang Zhou

A search task represents an atomic information need of a user in web search. Tasks consist of queries and their reformulations, and identifying tasks is important for search engines since they provide valuable information for determining user satisfaction with search results, predicting user search intent, and suggesting queries to the user. Traditional approaches to identifying tasks exploit either temporal or lexical features of queries. However, many query refinements are topical, which means that a query and its refinements may not be similar on the lexical level. Furthermore, multiple tasks in the same search session may interleave, which means we cannot simply order the searches by their timestamps and divide the session into multiple tasks. Thus, in order to identify tasks correctly, we need to be able to compare two queries at the semantic level. In this paper, we use a knowledgebase known as Probase to infer the conceptual meanings of queries, and automatically identify the topical query refinements in the tasks. Experimental results on real search log data demonstrate that Probase can indeed help estimate the topical affinity between queries, and thus enable us to merge queries that are topically related but dissimilar at the lexical level.
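
The idea of comparing two queries at the concept level rather than the lexical level can be sketched as follows. This is a minimal illustration, not the paper's method: the `TERM_CONCEPTS` table is an invented stand-in for a knowledge base lookup (it is not the Probase API), and the cosine measure and the `threshold` value are assumptions.

```python
import math

# Hypothetical concept table standing in for a knowledge base lookup.
# The queries, concepts, and weights are invented for illustration.
TERM_CONCEPTS = {
    "iphone": {"smartphone": 0.8, "apple product": 0.2},
    "galaxy s4": {"smartphone": 0.9, "samsung product": 0.1},
    "paris hotels": {"accommodation": 0.7, "travel": 0.3},
}


def concept_vector(query):
    """Map a query to a sparse vector of concept weights (empty if unknown)."""
    return dict(TERM_CONCEPTS.get(query.lower(), {}))


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_task(q1, q2, threshold=0.5):
    """Treat two queries as topically related if their concept vectors are close."""
    return cosine(concept_vector(q1), concept_vector(q2)) >= threshold
```

Here "iphone" and "galaxy s4" share no words, yet they would be grouped together because both map mostly to the same concept, while "iphone" and "paris hotels" would not.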

IEEE Transactions on Knowledge and Data Engineering | 2017

Understand Short Texts by Harvesting and Analyzing Semantic Knowledge

Wen Hua; Zhongyuan Wang; Haixun Wang; Kai Zheng; Xiaofang Zhou

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing tools, ranging from part-of-speech tagging to dependency parsing, cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text mining such as topic modeling. Third, short texts are more ambiguous and noisy, and are generated in an enormous volume, which further increases the difficulty of handling them. We argue that semantic knowledge is required in order to better understand short texts. In this work, we build a prototype system for short text understanding which exploits semantic knowledge provided by a well-known knowledgebase and automatically harvested from a web corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that semantic knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are both effective and efficient in discovering semantics of short texts.

international conference on management of data | 2015

Microblog Entity Linking with Social Temporal Context

Wen Hua; Kai Zheng; Xiaofang Zhou

Nowadays microblogging sites, such as Twitter and Chinese Sina Weibo, have established themselves as an invaluable information source, which provides a huge collection of manually generated tweets with a broad range of topics from daily life to breaking news. Entity linking is indispensable for understanding and maintaining such information, which in turn facilitates many real-world applications such as tweet clustering and classification, personalized microblog search, and so forth. However, tweets are short, informal and error-prone, rendering traditional approaches for entity linking in documents largely inapplicable. Recent work addresses this problem by utilising information from other tweets and linking entities in a batch manner. Nevertheless, the high computational complexity makes this approach infeasible for real-time applications given the high arrival rate of tweets. In this paper, we propose an efficient solution to link entities in tweets by analyzing their social and temporal context. Our proposed framework takes into consideration three features, namely entity popularity, entity recency, and user interest information embedded in social interactions to assist the entity linking task. Effective indexing structures along with incremental algorithms have also been developed to reduce the computation and maintenance costs of our approach. Experimental results based on real tweet datasets verify the effectiveness and efficiency of our proposals.
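
One simple way to picture how the three features named above could rank candidate entities for a mention is a weighted sum. This is only a sketch under stated assumptions: the abstract does not say how the features are combined, so the linear form, the `weights` parameter, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate entity for a mention, with three feature scores in [0, 1]."""
    entity: str
    popularity: float     # how often the entity is referred to overall
    recency: float        # how strongly the entity appears in recent tweets
    user_interest: float  # how well the entity matches the user's interests


def score(candidate, weights=(1.0, 1.0, 1.0)):
    """Hypothetical linear combination of the three features."""
    w_pop, w_rec, w_int = weights
    return (w_pop * candidate.popularity
            + w_rec * candidate.recency
            + w_int * candidate.user_interest)


def link(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the name of the highest-scoring candidate entity."""
    return max(candidates, key=lambda c: score(c, weights)).entity


# Invented scores for the ambiguous mention "apple".
apple_candidates = [
    Candidate("Apple Inc.", popularity=0.9, recency=0.8, user_interest=0.2),
    Candidate("Apple (fruit)", popularity=0.6, recency=0.1, user_interest=0.9),
]
```

With equal weights the company wins on popularity and recency; weighting only user interest, `link(apple_candidates, weights=(0.0, 0.0, 1.0))`, selects the fruit instead.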

asia-pacific web conference | 2012

Self-supervised learning approach for extracting citation information on the web

Dat T. Huynh; Wen Hua

In this paper, we propose a framework for automatically training a model to extract citation information on the web. Constructing manually labeled training data to learn an extraction model is tedious, time-consuming, and difficult to apply to several styles of citations with different types of entities. To eliminate the requirement of manually labeled training data, we exploit a knowledge base of the citation domain and web search to derive labeled training data automatically. Our experiments show that the combination of knowledge base, heuristics and statistical methods can automate the extraction process and achieve good performance.

conference on information and knowledge management | 2017

Exploiting Spatio-Temporal User Behaviors for User Linkage

Wei Chen; Hongzhi Yin; Weiqing Wang; Lei Zhao; Wen Hua; Xiaofang Zhou

Cross-device and cross-domain user linkage have been attracting a lot of attention recently. An important branch of the study is to achieve user linkage with spatio-temporal data generated by the ubiquitous GPS-enabled devices. The main task in this problem is twofold, i.e., how to extract the representative features of a user, and how to measure the similarities between users with the extracted features. To tackle the problem, we propose a novel model STUL (Spatio-Temporal User Linkage) that consists of the following two components. 1) Extract users' spatial features with a density-based clustering method, and extract the users' temporal features with the Gaussian Mixture Model. To link user pairs more precisely, we assign different weights to the extracted features, by lightening the common features and highlighting the discriminative features. 2) Propose novel approaches to measure the similarities between users based on the extracted features, and return the pair-wise users with similarity scores higher than a predefined threshold. We have conducted extensive experiments on three real-world datasets, and the results demonstrate the superiority of our proposed STUL over the state-of-the-art methods.

very large data bases | 2017

Minimal on-road time route scheduling on time-dependent graphs

Lei Li; Wen Hua; Xingzhong Du; Xiaofang Zhou

On time-dependent graphs, fastest path query is an important problem and has been well studied. It focuses on minimizing the total travel time (waiting time + on-road time) but does not allow waiting on any intermediate vertex if the FIFO property is applied. However, in practice, waiting on a vertex can reduce the time spent on the road (for example, resuming traveling after a traffic jam). In this paper, we study how to find a path with the minimal on-road time on time-dependent graphs by allowing waiting on some predefined parking vertices. The existing works are based on the following fact: the arrival time of a vertex v is determined by the arrival time of its in-neighbor u, which does not hold in our scenario since we also consider the waiting time on u if u allows waiting. Thus, determining the waiting time on each parking vertex to achieve the minimal on-road time becomes a big challenge, which further breaks the FIFO property. To cope with this challenging problem, we propose two efficient algorithms using a minimum on-road travel cost function to answer the query. The evaluations on multiple real-world time-dependent graphs show that the proposed algorithms are more accurate and efficient than the extensions of existing algorithms. In addition, the results further indicate that, if the parking facilities are enabled in the route scheduling algorithms, the on-road time is reduced significantly compared to the fastest path algorithms.
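
The difference between total travel time and on-road time can be made concrete with a toy example. The sketch below is not either of the paper's algorithms: it brute-forces integer waiting times on a two-edge path s -> p -> t, where p is a parking vertex, and both travel-time functions are invented. The edge p -> t models a jam that clears at time 40 and satisfies FIFO, so waiting never yields an earlier arrival.

```python
def travel_time_s_to_p(depart):
    """Hypothetical edge s -> p with a constant travel time."""
    return 10


def travel_time_p_to_t(depart):
    """Hypothetical edge p -> t: a jam holds vehicles until time 40.

    depart + travel_time is non-decreasing in depart, so FIFO holds.
    """
    return max(10, 40 - depart)


def on_road_time(start, wait_at_p):
    """Time spent driving when waiting `wait_at_p` units at parking vertex p."""
    first_leg = travel_time_s_to_p(start)
    leave_p = start + first_leg + wait_at_p
    return first_leg + travel_time_p_to_t(leave_p)


def arrival_time(start, wait_at_p):
    """Arrival time at t, counting both waiting and driving."""
    return start + wait_at_p + on_road_time(start, wait_at_p)


def best_wait(start, max_wait):
    """Brute-force the integer wait that minimizes on-road time (shortest wait on ties)."""
    return min(range(max_wait + 1), key=lambda w: (on_road_time(start, w), w))
```

Starting at time 0, driving straight through gives an on-road time of 40, while parking at p for 20 units halves it to 20; both schedules arrive at t at time 40.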

very large data bases | 2018

Go slow to go fast: minimal on-road time route scheduling with parking facilities using historical trajectory

Lei Li; Kai Zheng; Sibo Wang; Wen Hua; Xiaofang Zhou

For thousands of years, people have been innovating new technologies to make their travel faster, the latest of which is GPS technology that is used by millions of drivers every day. The routes recommended by a GPS device are computed by path planning algorithms (e.g., fastest path algorithm), which aim to minimize a certain objective function (e.g., travel time) under the current traffic condition. When the objective is to arrive at the destination as early as possible, waiting during travel is not an option as it will only increase the total travel time due to the First-In-First-Out property of most road networks. However, some businesses such as logistics companies are more interested in optimizing the actual on-road time of their vehicles (i.e., while the engine is running) since it is directly related to the operational cost. At the same time, the drivers’ trajectories, which can reveal the traffic conditions on the roads, are also collected by various service providers. Compared to the existing speed profile generation methods, which mainly rely on traffic monitor systems, the trajectory-based method can cover a much larger space and is much cheaper and more flexible to obtain. This paper proposes a system, which has an online component and an offline component, to solve the minimal on-road time problem using the trajectories. The online query answering component studies how parking facilities along the route can be leveraged to avoid predicted traffic jams and eventually reduce the drivers’ on-road time, while the offline component solves how to generate speed profiles of a road network from historical trajectories. The challenging part of the routing problem of the online component lies in the computational complexity when determining if it is beneficial to wait on specific parking places and the time of waiting to maximize the benefit. To cope with this challenging problem, we propose two efficient algorithms using a minimum on-road travel cost function to answer the query. We further introduce several approximation methods to speed up the query answering, with an error bound guaranteed. The offline speed profile generation component makes use of historical trajectories to provide the traveling time for the online component. Extensive experiments show that our method is more efficient and accurate than baseline approaches extended from the existing path planning algorithms, and our speed profile is accurate and space efficient.

Neural Networks | 2018

Distant supervision for neural relation extraction integrated with word attention and property features

Jianfeng Qu; Dantong Ouyang; Wen Hua; Yuxin Ye; Ximing Li

Distant supervision for neural relation extraction is an efficient approach to extracting massive relations with reference to plain texts. However, the existing neural methods fail to capture the critical words in sentence encoding and meanwhile lack useful sentence information for some positive training instances. To address the above issues, we propose a novel neural relation extraction model. First, we develop a word-level attention mechanism to distinguish the importance of each individual word in a sentence, increasing the attention weights for those critical words. Second, we investigate the semantic information from word embeddings of target entities, which can be developed as a supplementary feature for the extractor. Experimental results show that our model outperforms previous state-of-the-art baselines.
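
A word-level attention mechanism of the general kind described above can be sketched in a few lines. This is a generic dot-product attention written in plain Python, not the paper's exact formulation: the scoring function, the `query` vector, and the toy embeddings are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vectors, query):
    """Weight each word by its dot product with `query`, then pool.

    Returns the attention weights and the weighted-sum sentence vector.
    """
    scores = [sum(w_i * q_i for w_i, q_i in zip(word, query))
              for word in word_vectors]
    weights = softmax(scores)
    sentence = [sum(a * word[i] for a, word in zip(weights, word_vectors))
                for i in range(len(query))]
    return weights, sentence


# Toy two-dimensional embeddings; the third word aligns best with the query.
words = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, sentence = attend(words, query=[1.0, 0.0])
```

The weights sum to one, and the word whose vector aligns most strongly with the query receives the largest share of the sentence representation.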

World Wide Web | 2017

HD-GDD: high dimensional graph dominance drawing approach for reachability query

Lei Li; Wen Hua; Xiaofang Zhou

Efficiently answering reachability queries, which check whether one vertex can reach another in a directed graph, has been studied extensively during recent years. However, the size of the graph that people are facing and generating nowadays is growing so rapidly that simple algorithms, such as BFS and DFS, are no longer feasible. Although Refined Online Search algorithms can scale to large graphs, they all suffer from the false positive problem. In this paper, we analyze the cause of false positives and propose an efficient High Dimensional coordinate generating method based on Graph Dominance Drawing (HD-GDD) to answer reachability queries in linear or even constant time. We conduct experiments on different graph structures and different graph sizes to fully evaluate the performance and behavior of our proposal. Empirical results demonstrate that our method outperforms state-of-the-art algorithms and can handle extensive graphs.
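
The role of dominance coordinates and the origin of false positives can be illustrated with a small sketch. It is not HD-GDD's coordinate generation: here each dimension is simply a random topological order of a DAG, which is an assumption made for illustration. If u reaches v, u precedes v in every topological order, so a single dimension where u comes after v proves unreachability. When u dominates in all dimensions the answer may still be a false positive, so the sketch falls back to DFS.

```python
import random


def topological_order(graph, rng):
    """Kahn's algorithm with random tie-breaking; graph maps vertex -> successors."""
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    ready = [v for v in graph if indegree[v] == 0]
    order = []
    while ready:
        v = ready.pop(rng.randrange(len(ready)))
        order.append(v)
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return order


def build_coordinates(graph, dims, seed=0):
    """Give every vertex one coordinate per random topological order."""
    rng = random.Random(seed)
    coords = {v: [] for v in graph}
    for _ in range(dims):
        for position, v in enumerate(topological_order(graph, rng)):
            coords[v].append(position)
    return coords


def dfs_reach(graph, source, target):
    """Plain iterative DFS, used only when the coordinates cannot decide."""
    stack, seen = [source], {source}
    while stack:
        v = stack.pop()
        if v == target:
            return True
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False


def reachable(graph, coords, u, v):
    """Reject in O(dims) when some coordinate of u exceeds v's; else verify by DFS."""
    if any(a > b for a, b in zip(coords[u], coords[v])):
        return False
    return dfs_reach(graph, u, v)


# Small DAG: every vertex must appear as a key, even with no successors.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": []}
dag_coords = build_coordinates(dag, dims=3)
```

Adding dimensions makes it less likely that an unreachable pair passes the coordinate test, which is why the DFS fallback is needed less often as `dims` grows.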
