Shanping Li
Zhejiang University
Publications
Featured research published by Shanping Li.
automated software engineering | 2016
Bowen Xu; Deheng Ye; Zhenchang Xing; Xin Xia; Guibin Chen; Shanping Li
Consider a question and its answers on Stack Overflow as a knowledge unit. Knowledge units often contain semantically related knowledge and are thus linkable for different purposes: as duplicate questions, as directly linkable units for problem solving, or as indirectly linkable units providing related information. Recognizing these different classes of linkable knowledge would support more targeted information needs when users search or explore the knowledge base. Existing methods focus on binary relatedness (i.e., related or not) and are not robust at recognizing different classes of semantic relatedness when linkable knowledge units share few words in common (i.e., there is a lexical gap). In this paper, we formulate the prediction of semantically linkable knowledge units as a multiclass classification problem and solve it using deep learning techniques. To overcome the lexical gap, we adopt a neural language model (word embeddings) and a convolutional neural network (CNN) to capture word- and document-level semantics of knowledge units. Instead of using human-engineered classifier features, which are hard to design for informal user-generated content, we exploit large amounts of different types of user-created knowledge-unit links to train the CNN to learn the most informative word-level and document-level features for the multiclass classification task. Our evaluation shows that our deep-learning-based approach significantly and consistently outperforms traditional methods built on conventional word representations and human-engineered classifier features.
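A minimal sketch of the word-embedding + CNN classifier described above, written in PyTorch as an assumption (the paper's own implementation is not reproduced); the embedding size, filter widths, and class count are illustrative hyper-parameters, not values from the paper.

```python
import torch
import torch.nn as nn

class KnowledgeUnitCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=128,
                 kernel_sizes=(2, 3, 4), num_classes=4):
        super().__init__()
        # Word embeddings capture word-level semantics; they could be
        # initialized from pre-trained vectors (e.g., word2vec).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel width learns document-level features.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed, seq)
        # Max-pool each feature map over the sequence, then concatenate.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))  # class logits
```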
pacific asia workshop on intelligence and security informatics | 2012
Xin Xia; Xiaohu Yang; Chao Wu; Shanping Li; Linfeng Bao
Twitter has shown great power of influence owing to its fast information diffusion. Previous research has shown that most posted tweets are truthful, but when people post rumors and spam on Twitter in emergency situations, public opinion can be misled and riots can even be incited. In this paper, we focus on methods for assessing information credibility in emergency situations. More precisely, we build a novel Twitter monitor model to monitor Twitter online. Within this monitor model, an unsupervised learning algorithm is proposed to detect emergency situations. A training dataset containing the tweets of typical events is gathered through the Twitter monitor. We then dispatch the dataset to experts, who manually label each tweet into one of two classes: credible or not credible. From the labeled tweets, a number of features related to user social behavior, tweet content, tweet topic, and tweet diffusion are extracted. A supervised method that learns a Bayesian Network is used to predict tweet credibility in emergency situations. Experiments on tweets related to the UK riots show that our procedure classifies tweets well compared with other state-of-the-art algorithms.
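A minimal sketch of the supervised credibility-prediction step on synthetic data. It assumes the user, content, topic, and diffusion features have already been extracted into a numeric matrix; since scikit-learn has no general Bayesian Network learner, a Gaussian Naive Bayes classifier stands in for the paper's learned network.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))        # 12 hypothetical per-tweet features
y = rng.integers(0, 2, size=200)      # expert labels: 1 = credible, 0 = not

clf = GaussianNB()                    # stand-in for the Bayesian Network
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```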
conference on software maintenance and reengineering | 2013
Xin Xia; David Lo; Xinyu Wang; Xiaohu Yang; Shanping Li; Jianling Sun
Bug fixing is a time-consuming and costly job performed throughout the life cycle of software development and maintenance. For many systems, bugs are managed in bug management systems such as Bugzilla. Generally, the status of a typical bug report in Bugzilla changes from new to assigned, verified, and closed. However, some bugs have to be reopened. Reopened bugs increase software development and maintenance costs, increase the workload of bug fixers, and might even delay the future delivery of a software product. Only a few studies have investigated the phenomenon of reopened bug reports. In this paper, we evaluate the effectiveness of various supervised learning algorithms in predicting whether a bug report will be reopened. We choose 7 classical supervised learning algorithms from the machine learning literature, i.e., kNN, SVM, Simple Logistic, Bayesian Network, Decision Table, CART, and LWL, and 3 ensemble learning algorithms, i.e., AdaBoost, Bagging, and Random Forest, and evaluate their performance in predicting reopened bug reports. The experimental results show that among the 10 algorithms, Bagging and Decision Table (IDTM) achieve the best performance. They achieve accuracy scores of 92.91% and 92.80%, respectively, and reopened-bug-report F-measure scores of 0.735 and 0.732, respectively. These results improve the reopened-bug-report F-measure of the state-of-the-art approach proposed by Shihab et al. by up to 23.53%.
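A minimal sketch of evaluating one of the ten learners (Bagging over decision trees, which is scikit-learn's default) with accuracy and F-measure. The feature set is synthetic, and the WEKA Decision Table (IDTM) learner used in the paper has no scikit-learn equivalent and is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))        # bug-report features (illustrative)
y = rng.integers(0, 2, size=500)      # 1 = reopened, 0 = not reopened

bagging = BaggingClassifier(n_estimators=50)   # bagged decision trees
scores = cross_validate(bagging, X, y, cv=10, scoring=("accuracy", "f1"))
print(scores["test_accuracy"].mean(), scores["test_f1"].mean())
```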
asia pacific symposium on internetware | 2010
Ting Wang; Yuanjie Si; Xiao Xuan; Xinyu Wang; Xiaohu Yang; Shanping Li; Aleksander J. Kavs
Non-functional requirements (NFRs) are often regarded as a key success factor in building high-quality software. However, most requirements elicitation methods center on discovering functional requirements only. This paper presents a novel NFR elicitation approach that equips requirements analysts with a knowledge repository to aid the process of capturing precise NFRs during elicitation interviews. The knowledge repository is composed of two layers: an upper layer of feature models and a lower layer of a QoS ontology. A case study in the stock trading domain illustrates the relationships and cooperation between the two layers.
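A minimal sketch, under assumed names, of how the two-layer repository could be represented: feature-model nodes in the upper layer reference QoS-ontology concepts in the lower layer, so an analyst can drill from a domain feature down to the precise NFR vocabulary attached to it. The classes and the stock-trading example values are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class QoSConcept:                       # lower layer: QoS ontology node
    name: str
    definition: str
    related: list["QoSConcept"] = field(default_factory=list)

@dataclass
class Feature:                          # upper layer: feature-model node
    name: str
    sub_features: list["Feature"] = field(default_factory=list)
    qos_concepts: list[QoSConcept] = field(default_factory=list)

# Illustrative entry inspired by the stock-trading case study.
latency = QoSConcept("Latency", "Time between order submission and execution")
order_entry = Feature("Order entry", qos_concepts=[latency])
```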
tools and algorithms for construction and analysis of systems | 2014
Ting Wang; Jun Sun; Yang Liu; Xinyu Wang; Shanping Li
Given a timed automaton P modeling an implementation and a timed automaton S serving as a specification, language inclusion checking decides whether the language of P is a subset of that of S. It is known that this problem is undecidable, and "this result is an obstacle in using timed automata as a specification language" [2]. This undecidability result, however, does not imply that all timed automata are bad for specification. In this work, we propose a zone-based semi-algorithm for language inclusion checking which implements simulation reduction based on anti-chains and LU-simulation. Though it is not guaranteed to terminate, we show through both theoretical and empirical analysis that it does in many cases. The semi-algorithm has been incorporated into the PAT model checker and applied to multiple systems to demonstrate its usefulness and scalability.
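The zone abstraction and LU-simulation of the timed setting do not fit in a short sketch; the fragment below only illustrates the anti-chain idea on the untimed analogue (finite automata), and is not the PAT implementation. Automata are assumed to be triples of initial states, accepting-state sets, and transition dicts keyed by (state, letter); a product pair (p, S) is pruned when an already-explored (p, S') with S' ⊆ S subsumes it.

```python
from collections import deque

def includes(a, b, alphabet):
    """Semi-decision of L(a) ⊆ L(b) for NFAs via on-the-fly subset
    construction with anti-chain pruning."""
    a_init, a_acc, a_delta = a
    b_init, b_acc, b_delta = b
    start = [(p, frozenset(b_init)) for p in a_init]
    antichain = {}                       # p -> minimal B-state sets seen so far
    work = deque(start)

    def subsumed(p, s):                  # a smaller explored set covers s
        return any(t <= s for t in antichain.get(p, []))

    def insert(p, s):                    # keep only ⊆-minimal sets per p
        kept = [t for t in antichain.get(p, []) if not s <= t]
        kept.append(s)
        antichain[p] = kept

    for p, s in start:
        insert(p, s)
    while work:
        p, s = work.popleft()
        if p in a_acc and not (s & b_acc):
            return False                 # counterexample word found
        for letter in alphabet:
            for q in a_delta.get((p, letter), ()):
                t = frozenset(r2 for r in s
                              for r2 in b_delta.get((r, letter), ()))
                if not subsumed(q, t):
                    insert(q, t)
                    work.append((q, t))
    return True
```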
acm symposium on applied computing | 2013
Zhen Ye; Shanping Li; Xiaozhen Zhou
Cross-datacenter data replication has been widely used in geo-distributed cloud environments because of its ability to increase application availability and improve performance. However, given the large scale of the cloud, it is difficult to determine the locations of replicas among datacenters so as to minimize overall user access latency. Correlations among data items make the replica placement problem even more complex. To address these scale and data-correlation issues, we propose a two-step approach called GCplace. Before applying GCplace, a network coordinate system is used to predict the latency between all users and datacenter nodes. In the first step of GCplace, we introduce stream-based similarity clustering, which uses a small number of micro-clusters to represent a huge number of users and thus significantly reduces the cost of the replica placement algorithm. In the second step, an iterative algorithm is proposed to obtain an approximate solution. We evaluated our approach on a large-scale real network latency dataset. Comprehensive experiments show that GCplace can reduce average user access latency significantly.
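A minimal sketch of the placement step only, under simplifying assumptions: user micro-clusters are already given as weights, latencies are a precomputed matrix (as a network-coordinate system would provide), and replicas are placed greedily rather than by GCplace's iterative algorithm.

```python
import numpy as np

def place_replicas(latency, weights, k):
    """latency[i, j]: predicted latency from micro-cluster i to datacenter j.
    weights[i]: number of users represented by micro-cluster i.
    Greedily choose k datacenters to minimize the weighted average latency
    from each cluster to its nearest replica."""
    n_clusters, n_dcs = latency.shape
    chosen = []
    nearest = np.full(n_clusters, np.inf)     # latency to nearest replica so far
    for _ in range(k):
        # Pick the datacenter whose addition reduces weighted latency most.
        costs = [np.sum(weights * np.minimum(nearest, latency[:, j]))
                 for j in range(n_dcs)]
        j_best = int(np.argmin(costs))
        chosen.append(j_best)
        nearest = np.minimum(nearest, latency[:, j_best])
    return chosen

rng = np.random.default_rng(2)
print(place_replicas(rng.uniform(10, 200, size=(50, 8)),
                     rng.integers(1, 100, 50), k=3))
```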
conference on information and knowledge management | 2011
Xin Xia; Xiaohu Yang; Shanping Li; Chao Wu; Linlin Zhou
Multi-label classification refers to the problem of predicting, for each instance, one or more labels from a set of associated labels. It is common in many real-world applications such as text categorization, functional genomics, and semantic scene classification. The main challenge in multi-label classification is predicting the labels of a new instance given the exponential number of possible label sets. Previous work mainly focuses on transforming multi-label classification into single-label classification or on modifying existing traditional algorithms. In this paper, we propose a novel algorithm (RW.KNN) that combines the advantages of the well-known KNN and Random Walk algorithms. A KNN-based link graph is built from the k nearest neighbors of each instance. For an unseen instance, a random walk is performed on the link graph, and the final label probabilities are computed from the random walk results. Finally, we also propose a novel algorithm for selecting the classification threshold by minimizing the Hamming loss.
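A minimal sketch of the RW.KNN idea under assumed parameters: build a kNN link graph over the training instances, run a random walk with restart from the unseen instance's neighbours, and score each label by the probability mass on instances carrying it. The Hamming-loss-based threshold selection is not shown.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rw_knn_scores(X_train, Y_train, x_new, k=5, restart=0.15, steps=50):
    """Y_train: binary label matrix of shape (n_instances, n_labels)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    # Row-stochastic transition matrix of the kNN link graph.
    A = nn.kneighbors_graph(X_train, mode="connectivity").toarray()
    P = A / A.sum(axis=1, keepdims=True)
    # Restart distribution: the new instance's own k nearest neighbours.
    _, idx = nn.kneighbors(x_new.reshape(1, -1))
    r = np.zeros(len(X_train))
    r[idx[0]] = 1.0 / k
    pi = r.copy()
    for _ in range(steps):                    # power iteration with restart
        pi = (1 - restart) * P.T @ pi + restart * r
    return pi @ Y_train                       # per-label probability mass
```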
Empirical Software Engineering | 2018
Qiao Huang; Emad Shihab; Xin Xia; David Lo; Shanping Li
Technical debt is a metaphor describing situations in which long-term code quality is traded for short-term goals in software projects. Recently, the concept of self-admitted technical debt (SATD) was proposed, which considers debt that is intentionally introduced, e.g., in the form of quick or temporary fixes. Prior work on SATD has shown that source code comments can be used to successfully detect SATD; however, most current state-of-the-art approaches to classifying SATD rely on manual inspection of the source code comments. In this paper, we propose an automated approach to detect SATD in source code comments using text mining. In our approach, we utilize feature selection to select useful features for classifier training, and we combine multiple classifiers from different source projects to build a composite classifier that identifies SATD comments in a target project. We investigate the performance of our approach on 8 open source projects that contain 212,413 comments. Our experimental results show that, on every target project, our approach outperforms the state-of-the-art and the baseline approaches in terms of F1-score. The F1-score achieved by our approach ranges from 0.518 to 0.841, with an average of 0.737, which improves over the state-of-the-art approach proposed by Potdar and Shihab by 499.19%. Compared with the text-mining-based baseline approaches, our approach significantly improves the average F1-score by at least 58.49%. Compared with a natural-language-processing-based baseline, our approach also significantly improves the F1-score by 27.95%. Our proposed approach can be used by project personnel to effectively identify SATD with minimal manual effort.
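A minimal sketch of the composite-classifier idea on hypothetical data: one text classifier (TF-IDF, chi-squared feature selection, Naive Bayes) is trained per source project, and a target comment is labelled SATD by majority vote. The paper's actual feature-selection settings and sub-classifier weighting are not reproduced here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

def train_composite(source_projects):
    """source_projects: list of (comments, labels) pairs, one per project;
    labels are 1 for SATD comments and 0 otherwise."""
    models = []
    for comments, labels in source_projects:
        clf = make_pipeline(TfidfVectorizer(),
                            SelectKBest(chi2, k=100),
                            MultinomialNB())
        models.append(clf.fit(comments, labels))
    return models

def predict_composite(models, target_comments):
    # Each source-project classifier votes; majority decides SATD or not.
    votes = np.array([m.predict(target_comments) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```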
international conference on engineering of complex computer systems | 2015
Yun Zhang; David Lo; Xin Xia; Bowen Xu; Jianling Sun; Shanping Li
In recent years, to help developers reduce the time and effort required to build highly secure software, a number of prediction models built on different kinds of features have been proposed to identify vulnerable source code files. In this paper, we propose a novel approach, VULPREDICTOR, to predict vulnerable files; it analyzes software metrics and text mining features together to build a composite prediction model. VULPREDICTOR first builds 6 underlying classifiers on a training set of vulnerable and non-vulnerable files represented by their software metrics and text features, and then constructs a meta classifier to process the outputs of the 6 underlying classifiers. We evaluate our solution on datasets from three web applications, Drupal, PHPMyAdmin, and Moodle, which contain a total of 3,466 files and 223 vulnerabilities. The experimental results show that VULPREDICTOR can achieve F1 and EffectivenessRatio@20% scores of up to 0.683 and 75%, respectively. On average across the 3 projects, VULPREDICTOR improves the F1 and EffectivenessRatio@20% scores of the best-performing state-of-the-art approach proposed by Walden et al. by 46.53% and 14.93%, respectively.
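A minimal sketch of the two-level design using scikit-learn's stacking: several underlying classifiers are trained on the combined metric and text features, and a meta classifier learns from their outputs. The specific six learners and meta learner of VULPREDICTOR are not reproduced; the three base learners below are illustrative stand-ins.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

underlying = [
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100)),
]
meta = LogisticRegression(max_iter=1000)
vulpredictor_like = StackingClassifier(estimators=underlying, final_estimator=meta)
# Illustrative usage: vulpredictor_like.fit(X_train, y_train), where X_train
# holds software-metric columns concatenated with text-mining features.
```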
acm symposium on applied computing | 2014
Xiaoqiong Zhao; Xin Xia; Pavneet Singh Kochhar; David Lo; Shanping Li
The software build process translates source code into executable programs, packages the programs, generates documentation, and distributes products. In this paper, we perform an empirical study to characterize build process bugs. We analyze bugs in the build process of 5 open-source Apache systems, namely CXF, Camel, Felix, Struts, and Tuscany. We compare build process bugs with other bugs across 3 dimensions, i.e., bug severity, bug fix time, and the number of files modified to fix a bug. Our results show that the fraction of build process bugs at or above the major severity level is lower than that of other bugs. However, the time required to fix a build process bug is around 2.03 times that of a non-build process bug, and the number of source files modified to fix a build process bug is around 2.34 times that modified for a non-build bug.
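A minimal sketch of the comparison on a tiny illustrative DataFrame; the column names (is_build, is_major_or_above, fix_days, files_modified) and values are assumptions, not the study's actual schema or data.

```python
import pandas as pd

bugs = pd.DataFrame({
    "is_build":          [True, True, False, False, False],
    "is_major_or_above": [False, True, True, True, False],
    "fix_days":          [30.0, 12.0, 10.0, 8.0, 14.0],
    "files_modified":    [7, 4, 2, 3, 1],
})
# Per-group averages over the three dimensions compared in the study.
by_type = bugs.groupby("is_build")[["is_major_or_above",
                                    "fix_days", "files_modified"]].mean()
print(by_type)
# The reported ratios correspond to dividing the build-bug row by the other-bug row.
print(by_type.loc[True, ["fix_days", "files_modified"]] /
      by_type.loc[False, ["fix_days", "files_modified"]])
```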