Wei Emma Zhang | Researchain

Publication

Featured research published by Wei Emma Zhang.

international world wide web conferences | 2017

Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules

Wei Emma Zhang; Quan Z. Sheng; Jey Han Lau; Ermyas Abebe

Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have already been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and many duplicate questions remain undetected. Existing duplicate detection methodologies from traditional community-based question-answering (CQA) websites are difficult to adopt directly for PCQA, as PCQA posts often contain source code, which is linguistically very different from natural language. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features, and phrase pairs that co-occur frequently in duplicate questions, mined using machine translation systems. These features capture semantic similarities between questions and produce strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well, in some cases achieving over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and opened the dataset to the public.
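To make the idea of a word-vector feature for a question pair concrete, here is a minimal sketch. It is not the authors' feature extractor: the function names, the whitespace tokenizer, and the choice of averaging word vectors before taking cosine similarity are all assumptions made for illustration, and `embeddings` stands in for any pre-trained word-vector lookup.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (None if none do)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_feature(question_a, question_b, embeddings):
    """Hypothetical similarity feature for one question pair.

    Returns 0.0 when either question contains no word with a known vector.
    """
    va = average_vector(question_a.lower().split(), embeddings)
    vb = average_vector(question_b.lower().split(), embeddings)
    if va is None or vb is None:
        return 0.0
    return cosine_similarity(va, vb)
```

A value like this would be one column in the feature vector passed to a pair classifier, alongside topic and association features.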

database systems for advanced applications | 2015

Identifying and Caching Hot Triples for Efficient RDF Query Processing

Wei Emma Zhang; Quan Z. Sheng; Kerry Taylor; Yongrui Qin

Resource Description Framework (RDF) has been used as a general model for conceptual description and information modelling. As the number and volume of RDF datasets have grown recently, many techniques have been developed to accelerate query answering on triple stores, which handle large-scale RDF data. Caching is one popular solution. Non-RDBMS-based triple stores, which leverage the intrinsic nature of RDF graphs, have been emerging and attracting more research attention in recent years. However, as their fundamental structure differs from that of RDBMS-based triple stores, they cannot leverage the RDBMS caching mechanism. In this paper, we develop a time-aware, frequency-based caching algorithm to address this issue. Our approach retrieves the accessed triples by analyzing and expanding previous queries, and collects the most frequently accessed triples by evaluating their access frequencies using Exponential Smoothing, a forecasting method. We evaluate our approach using real-world queries from a publicly available SPARQL endpoint. Our theoretical analysis and empirical results show that the proposed approach outperforms state-of-the-art approaches, achieving higher hit rates.
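The following sketch shows how exponential smoothing can turn per-period access counts into a time-aware "hotness" score. It illustrates simple exponential smoothing in general, not the paper's algorithm: the class name, the per-period update, the `alpha` default, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


class SmoothedFrequencyCache:
    """Toy ranking of triples by exponentially smoothed access frequency."""

    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity          # maximum number of triples to keep hot
        self.alpha = alpha                # smoothing factor in (0, 1]
        self.scores = defaultdict(float)  # smoothed frequency per triple

    def end_period(self, accessed_counts):
        """Fold one period's access counts into the smoothed scores.

        Triples that were not accessed in this period observe a count of 0,
        so their score decays geometrically.
        """
        for triple in set(self.scores) | set(accessed_counts):
            observed = accessed_counts.get(triple, 0)
            self.scores[triple] = (
                self.alpha * observed + (1 - self.alpha) * self.scores[triple]
            )

    def hot_triples(self):
        """Return up to `capacity` triples, highest smoothed score first."""
        ranked = sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [triple for triple, _ in ranked[: self.capacity]]
```

With `alpha` close to 1 the ranking follows the most recent period; smaller values give older accesses more weight.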

web information systems engineering | 2016

Learning-based SPARQL Query Performance Prediction

Wei Emma Zhang; Quan Z. Sheng; Kerry Taylor; Yongrui Qin; Lina Yao

Based on predicted query performance, queries can be rewritten to reduce their time cost or rescheduled to a time when resources are not in contention. As more large RDF datasets have appeared on the Web recently, predicting the performance of SPARQL query processing has become a major challenge in managing a large RDF dataset efficiently. In this paper, we focus on representing SPARQL queries as feature vectors and using these feature vectors to train models that predict the performance of SPARQL queries. Evaluations performed on real-world SPARQL queries demonstrate that the proposed approach can effectively predict SPARQL query performance and outperforms state-of-the-art approaches.

ACM Transactions on Internet Technology | 2018

Collaborative Location Recommendation by Integrating Multi-dimensional Contextual Information

Lina Yao; Quan Z. Sheng; Xianzhi Wang; Wei Emma Zhang; Yongrui Qin

Point-of-Interest (POI) recommendation is a new type of recommendation task that comes along with the prevalence of location-based social networks and services in recent years. Compared with traditional recommendation tasks, POI recommendation focuses more on making personalized and context-aware recommendations to improve user experience. Traditionally, the most commonly used contextual information includes geographical and social context information. However, the increasing availability of check-in data makes it possible to design more effective location recommendation applications by modeling and integrating comprehensive types of contextual information, especially the temporal information. In this article, we propose a collaborative filtering method based on Tensor Factorization, a generalization of the Matrix Factorization approach, to model the multi-dimensional contextual information. Tensor Factorization naturally extends Matrix Factorization by increasing the dimensionality of concerns, within which the three-dimensional model is the one most popularly used. Our method exploits a high-order tensor to fuse heterogeneous contextual information about users’ check-ins instead of the traditional two-dimensional user-location matrix. The factorization of this tensor leads to a more compact model of the data that is naturally suitable for integrating contextual information to make POI recommendations. Based on the model, we further improve the recommendation accuracy by utilizing the internal relations within users and locations to regularize the latent factors. Experimental results on a large real-world dataset demonstrate the effectiveness of our approach.
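As a hedged illustration of what factorizing a check-in tensor means, consider the three-dimensional case with users, locations, and time slots. One standard choice is a rank-\(R\) CP model fitted on the observed entries \(\Omega\). The article may use a different factorization and regularizer; the symbols \(U\), \(L\), \(T\), \(R\), and \(\lambda\) below are introduced here for illustration only.

```latex
\hat{x}_{ijk} = \sum_{r=1}^{R} u_{ir}\, l_{jr}\, t_{kr},
\qquad
\min_{U, L, T} \sum_{(i,j,k) \in \Omega} \left( x_{ijk} - \hat{x}_{ijk} \right)^2
  + \lambda \left( \lVert U \rVert_F^2 + \lVert L \rVert_F^2 + \lVert T \rVert_F^2 \right)
```

Here \(x_{ijk}\) is the observed check-in value of user \(i\) at location \(j\) in time slot \(k\), and the rows of \(U\), \(L\), and \(T\) are the latent factors for users, locations, and time slots. Dropping the time factor \(t_{kr}\) recovers ordinary Matrix Factorization on the user-location matrix.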

advanced data mining and applications | 2016

Mining Source Code Topics Through Topic Model and Words Embedding

Wei Emma Zhang; Quan Z. Sheng; Ermyas Abebe; M. Ali Babar; Andi Zhou

Developers nowadays can leverage existing systems to build their own applications. However, a lack of documentation hinders the reuse of software systems. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate the comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), that leverages word embedding techniques to take into account word semantics, which have not previously been considered in mining topics from source code. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, an automated term selection algorithm is proposed to identify the most contributory terms from source code for the topic extraction task. Empirical studies on GitHub (https://github.com/) Java projects show that EmbTE outperforms the other methods in terms of providing more coherent topics. The results also indicate that method names, method comments, class names and class comments are the most contributory types of terms for source code topic extraction.

database systems for advanced applications | 2017

Recovering Missing Values from Corrupted Spatio-Temporal Sensory Data via Robust Low-Rank Tensor Completion

Wenjie Ruan; Peipei Xu; Quan Z. Sheng; Nickolas J. G. Falkner; Xue Li; Wei Emma Zhang

With the boom of the Internet of Things, a tremendous number of sensors have been installed in different geographic locations, generating massive sensory data with both time-stamps and geo-tags. Such data usually show complex spatio-temporal correlations and are easily lost in practice due to communication failure or data corruption. In this paper, we aim to tackle the challenge of how to accurately and efficiently recover the missing values of corrupted spatio-temporal sensory data. Specifically, we first formulate such sensor data as a high-dimensional tensor that naturally preserves both the geographical and the time information of the sensors, which we call a spatio-temporal tensor. Then we model the sensor data recovery as a robust low-rank tensor completion problem by exploiting its latent low-rank structure and sparse noise property. To solve this optimization problem, we design a highly efficient optimization method that combines the alternating direction method of multipliers and accelerated proximal gradient to minimize the tensor's convex surrogate and the noise's \(\ell_1\)-norm. In addition to testing our method on a synthetic dataset, we also use passive RFID (radio-frequency identification) sensors to build a real-world sensor-array testbed, which generates a total of 115,200 sensor readings for model evaluation. The experimental results demonstrate the accuracy and robustness of our approach.
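For orientation, one common way to write a robust low-rank tensor completion problem is shown below. This is a generic textbook-style formulation, not necessarily the exact objective of the paper: the weights \(\alpha_n\), the trade-off \(\lambda\), and the use of the sum of nuclear norms of the mode-\(n\) unfoldings as the convex surrogate of tensor rank are assumptions.

```latex
\min_{\mathcal{X},\, \mathcal{E}} \;
  \sum_{n=1}^{N} \alpha_n \lVert \mathbf{X}_{(n)} \rVert_{*}
  + \lambda \lVert \mathcal{E} \rVert_{1}
\quad \text{s.t.} \quad
P_{\Omega}(\mathcal{X} + \mathcal{E}) = P_{\Omega}(\mathcal{T})
```

Here \(\mathcal{T}\) is the observed spatio-temporal tensor, \(\Omega\) indexes the entries that were actually recorded, \(P_{\Omega}\) keeps those entries and zeroes the rest, \(\mathcal{X}\) is the low-rank tensor to recover, \(\mathcal{E}\) is the sparse noise, and \(\mathbf{X}_{(n)}\) is the mode-\(n\) unfolding of \(\mathcal{X}\).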

international congress on big data | 2016

Identification as a Service: Large-Scale Cloud Service Discovery over the World Wide Web

Abdullah Alfazi; Quan Z. Sheng; Wei Emma Zhang; Lina Yao; Talal H. Noor

Cloud computing provides highly flexible, on-demand infrastructure, platforms and software as services over the Internet. The unique characteristics of cloud services, such as dynamic and diverse service offerings at different levels and the lack of standardized descriptions, pose important challenges to discovering cloud services efficiently for customers. In this paper, we propose a cloud service search engine that can automatically identify cloud services, with the aim of improving accuracy when searching for cloud services in real environments. Our search engine can detect cloud services effectively from Web sources. Furthermore, we focus on learning cloud service features, such as similarity function, semantic ontology and cloud service components, to identify cloud services. We use a real cloud service dataset to build an identifier. Our cloud service identifier can be used to automatically determine, with high accuracy, whether a given Web source is a cloud service.

web information systems engineering | 2014

A Decremental Search Approach for Large Scale Dynamic Ridesharing

Ali Shemshadi; Quan Z. Sheng; Wei Emma Zhang

The Web of Things (WoT) paradigm introduces novel applications to improve the quality of human lives. Dynamic ridesharing is one of these applications, with the potential to bring significant economic, environmental, and social benefits, particularly in metropolitan areas. Despite recent advances in this area, many challenges remain. In particular, handling large-scale incomplete data has not been adequately addressed by previous work. Optimizing taxi and passenger schedules to gain the maximum benefit is another challenging issue. In this paper, we propose a novel system, MARS (Multi-Agent Ridesharing System), which addresses these challenges by formulating travel time estimation and enhancing the efficiency of taxi searching through a decremental search approach. Our proposed approach has been validated using a real-world dataset that consists of the trajectories of 10,357 taxis in Beijing, China.

World Wide Web | 2018

Learning-based SPARQL query performance modeling and prediction

Wei Emma Zhang; Quan Z. Sheng; Yongrui Qin; Kerry Taylor; Lina Yao

One of the challenges of managing an RDF database is predicting the performance of SPARQL queries before they are executed. Performance characteristics, such as execution time and memory usage, can help data consumers identify unexpectedly long-running queries before they start and estimate the system workload for query scheduling. Extensive work addresses this performance prediction problem for traditional SQL queries, but it is not directly applicable to SPARQL queries. In this paper, we adopt machine learning techniques to predict the performance of SPARQL queries. Our work focuses on modeling the features of a SPARQL query as a vector representation. Our feature modeling method does not depend on knowledge of the underlying systems or the structure of the underlying data, but only on the nature of SPARQL queries. We then use these features to train prediction models. We propose a two-step prediction process and consider performance in both cold and warm stages. Evaluations are performed on real-world SPARQL queries whose execution times range from milliseconds to hours. The results demonstrate that the proposed approach can effectively predict SPARQL query performance and outperforms state-of-the-art approaches.
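To illustrate what turning a SPARQL query into a feature vector can look like, here is a deliberately simple sketch that uses only the query text. The feature set (keyword counts plus the number of distinct variables) and all names are hypothetical and much cruder than the feature modeling in the paper. It only shows that such features can be computed without access to the underlying system or data.

```python
import re

# Hypothetical feature set: one count per keyword, in this fixed order.
KEYWORDS = ["SELECT", "OPTIONAL", "FILTER", "UNION", "DISTINCT", "ORDER BY", "LIMIT"]


def query_features(query):
    """Map a SPARQL query string to a fixed-length list of integer features.

    Text-based counting is naive: a keyword inside an IRI or string literal
    would also be counted.
    """
    text = query.upper()
    features = [
        # Allow any whitespace between the words of a multi-word keyword.
        len(re.findall(r"\b" + kw.replace(" ", r"\s+") + r"\b", text))
        for kw in KEYWORDS
    ]
    # Last feature: number of distinct variables such as ?s or ?name.
    features.append(len(set(re.findall(r"\?\w+", query))))
    return features
```

Vectors like these, paired with measured execution times, could be used to train a regression model.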

ACM Transactions on Internet Technology | 2018

A Learning-Based Framework for Improving Querying on Web Interfaces of Curated Knowledge Bases

Wei Emma Zhang; Quan Z. Sheng; Lina Yao; Kerry Taylor; Ali Shemshadi; Yongrui Qin

Knowledge Bases (KBs) are widely used as one of the fundamental components in Semantic Web applications, as they provide facts and relationships that can be automatically understood by machines. Curated knowledge bases usually use Resource Description Framework (RDF) as the data representation model. To query the RDF-presented knowledge in curated KBs, Web interfaces are built via SPARQL Endpoints. Currently, querying SPARQL Endpoints suffers from problems such as network instability and latency, which affect query efficiency. To address these issues, we propose a client-side caching framework, SPARQL Endpoint Caching Framework (SECF), which aims to accelerate overall querying speed over SPARQL Endpoints. SECF identifies queries that are likely to be issued by leveraging the querying patterns learned from clients' historical queries, and prefetches and caches these queries. In particular, we develop a distance function based on graph edit distance to measure the similarity of SPARQL queries. We propose a feature modelling method to transform SPARQL queries into vector representations that are fed into machine-learning algorithms. A time-aware smoothing-based method, Modified Simple Exponential Smoothing (MSES), is developed for cache replacement. Extensive experiments performed on real-world queries showcase the effectiveness of our approach, which outperforms state-of-the-art work in terms of overall querying speed.
