Qingshan Jiang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Qingshan Jiang is active.

Explore More

Publication

Featured researches published by Qingshan Jiang.

Journal in Computer Virology | 2008

An intelligent PE-malware detection system based on association mining

Yanfang Ye; Dingding Wang; Tao Li; Dongyi Ye; Qingshan Jiang

The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based anti-virus systems fail to detect polymorphic/metamorphic and new, previously unseen malicious executables. Data mining methods such as Naive Bayes and Decision Tree have been studied on small collections of executables. In this paper, resting on the analysis of Windows APIs called by PE files, we develop the Intelligent Malware Detection System (IMDS) using Objective-Oriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA_Fast_FP-Growth algorithm is adapted to efficiently generate OOA rules for classification. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of KingSoft Corporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular anti-virus software such as Norton AntiVirus and McAfee VirusScan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques. Our system has already been incorporated into the scanning tool of KingSoft’s Anti-Virus software.

knowledge discovery and data mining | 2010

Automatic malware categorization using cluster ensemble

Yanfang Ye; Tao Li; Yong Chen; Qingshan Jiang

In this paper, resting on the analysis of instruction frequency and function-based instruction sequences, we develop an Automatic Malware Categorization System (AMCS) for automatically grouping malware samples into families that share some common characteristics using a cluster ensemble by aggregating the clustering solutions generated by different base clustering algorithms. We propose a principled cluster ensemble framework for combining individual clustering solutions based on the consensus partition. The domain knowledge in the form of sample-level constraints can be naturally incorporated in the ensemble framework. In addition, to account for the characteristics of feature representations, we propose a hybrid hierarchical clustering algorithm which combines the merits of hierarchical clustering and k-medoids algorithms and a weighted subspace K-medoids algorithm to generate base clusterings. The categorization results of our AMCS system can be used to generate signatures for malware families that are useful for malware detection. The case studies on large and real daily malware collection from Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system.

Knowledge Based Systems | 2013

Feature selection via maximizing global information gain for text classification

Changxing Shang; Min Li; Shengzhong Feng; Qingshan Jiang; Jianping Fan

A novel feature selection metric called global information gain (GIG) is proposed.An efficient algorithm called maximizing global information gain (MGIG) is developed.MGIG performs better than other algorithms (IG, mRMR, JMI, DISR) in most cases.MGIG runs significantly faster than mRMR, JMI and DISR, and comparable with IG. Feature selection is a vital preprocessing step for text classification task used to solve the curse of dimensionality problem. Most existing metrics (such as information gain) only evaluate features individually but completely ignore the redundancy between them. This can decrease the overall discriminative power because one features predictive power is weakened by others. On the other hand, though all higher order algorithms (such as mRMR) take redundancy into account, the high computational complexity renders them improper in the text domain. This paper proposes a novel metric called global information gain (GIG) which can avoid redundancy naturally. An efficient feature selection method called maximizing global information gain (MGIG) is also given. We compare MGIG with four other algorithms on six datasets, the experimental results show that MGIG has better results than others methods in most cases. Moreover, MGIG runs significantly faster than the traditional higher order algorithms, which makes it a proper choice for feature selection in text domain.

IEEE Transactions on Systems, Man, and Cybernetics | 2015

Segment Based Decision Tree Induction With Continuous Valued Attributes

Ran Wang; Sam Kwong; Xi-Zhao Wang; Qingshan Jiang

A key issue in decision tree (DT) induction with continuous valued attributes is to design an effective strategy for splitting nodes. The traditional approach to solving this problem is adopting the candidate cut point (CCP) with the highest discriminative ability, which is evaluated by some frequency based heuristic measures. However, such methods ignore the class permutation of examples in the node, and they cannot distinguish the CCPs with the same or similar frequency information, thus may fail to induce a better and smaller tree. In this paper, a new concept, i.e., segment of examples, is proposed to differentiate the CCPs with same frequency information. Then, a new hybrid scheme that combines the two heuristic measures, i.e., frequency and segment, is developed for splitting DT nodes. The relationship between frequency and the expected number of segments, which is regarded as a random variable, is also given. Experimental comparisons demonstrate that the proposed scheme is not only effective to improve the generalization capability, but also valid to reduce the size of the tree.

Journal in Computer Virology | 2009

SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging

Yanfang Ye; Lifei Chen; Dingding Wang; Tao Li; Qingshan Jiang; Min Zhao

Malicious executables are programs designed to infiltrate or damage a computer system without the owner’s consent, which have become a serious threat to the security of computer systems. There is an urgent need for effective techniques to detect polymorphic, metamorphic and previously unseen malicious executables of which detection fails in most of the commercial anti-virus software. In this paper, we develop interpretable string based malware detection system (SBMDS), which is based on interpretable string analysis and uses support vector machine (SVM) ensemble with Bagging to classify the file samples and predict the exact types of the malware. Interpretable strings contain both application programming interface (API) execution calls and important semantic strings reflecting an attacker’s intent and goal. Our SBMDS is carried out with four major steps: (1) first constructing the interpretable strings by developing a feature parser; (2) performing feature selection to select informative strings related to different types of malware; (3) followed by using SVM ensemble with bagging to construct the classifier; (4) and finally conducting the malware detector, which not only can detect whether a program is malicious or not, but also can predict the exact type of the malware. Our case study on the large collection of file samples collected by Kingsoft Anti-virus lab illustrate that: (1) The accuracy and efficiency of our SBMDS outperform several popular anti-virus software; (2) Based on the signatures of interpretable strings, our SBMDS outperforms data mining based detection systems which employ single SVM, Naive Bayes with bagging, Decision Trees with bagging; (3) Compared with the IMDS which utilizes the objective-oriented association (OOA) based classification on API calls, our SBMDS achieves better performance. Our SBMDS system has already been incorporated into the scanning tool of a commercial anti-virus software.

BMC Bioinformatics | 2012

A novel hierarchical clustering algorithm for gene sequences

Dan Wei; Qingshan Jiang; Yanjie Wei; Shengrui Wang

BackgroundClustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of k-tuples in DNA sequence. Afterwards, a hierarchical procedure is applied to clustering DNA sequences based on the feature vectors.ResultsThe proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences.ConclusionsWe introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.

IEEE Transactions on Knowledge and Data Engineering | 2012

Model-Based Method for Projective Clustering

Lifei Chen; Qingshan Jiang; Shengrui Wang

Clustering high-dimensional data is a major challenge due to the curse of dimensionality. To solve this problem, projective clustering has been defined as an extension to traditional clustering that attempts to find projected clusters in subsets of the dimensions of a data space. In this paper, a probability model is first proposed to describe projected clusters in high-dimensional data space. Then, we present a model-based algorithm for fuzzy projective clustering that discovers clusters with overlapping boundaries in various projected subspaces. The suitability of the proposal is demonstrated in an empirical study done with synthetic data set and some widely used real-world data set.

international conference on distributed computing systems workshops | 2012

An Intelligent Anti-phishing Strategy Model for Phishing Website Detection

Weiwei Zhuang; Qingshan Jiang; Tengke Xiong

As a new form of malicious software, phishing websites appear frequently in recent years, which cause great harm to online financial services and data security. In this paper, we design and implement an intelligent model for detecting phishing websites. In this model, we extract 10 different types of features such as title, keyword and link text information to represent the website. Heterogeneous classifiers are then built based on these different features. We propose a principled ensemble classification algorithm to combine the predicted results from different phishing detection classifiers. Hierarchical clustering technique has been employed for automatic phishing categorization. Case studies on large and real daily phishing websites collected from King soft Internet Security Lab demonstrate that our proposed model outperforms other commonly used anti-phishing methods and tools in phishing website detection.

intelligent information systems | 2010

Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list

Yanfang Ye; Tao Li; Kai Huang; Qingshan Jiang; Yong Chen

Nowadays, numerous attacks made by the malware (e.g., viruses, backdoors, spyware, trojans and worms) have presented a major security threat to computer users. Currently, the most significant line of defense against malware is anti-virus products which focus on authenticating valid software from a whitelist, blocking invalid software from a blacklist, and running any unknown software (i.e., the gray list) in a controlled manner. The gray list, containing unknown software programs which could be either normal or malicious, is usually authenticated or rejected manually by virus analysts. Unfortunately, along with the development of the malware writing techniques, the number of file samples in the gray list that need to be analyzed by virus analysts on a daily basis is constantly increasing. The gray list is not only large in size, but also has an imbalanced class distribution where malware is the minority class. In this paper, we describe our research effort on building automatic, effective, and interpretable classifiers resting on the analysis of Application Programming Interfaces (APIs) called by Windows Portable Executable (PE) files for detecting malware from the large and imbalanced gray list. Our effort is based on associative classifiers due to their high interpretability as well as their capability of discovering interesting relationships among API calls. We first adapt several different post-processing techniques of associative classification, including rule pruning and rule re-ordering, for building effective associative classifiers from large collections of training data. In order to help the virus analysts detect malware from the imbalanced gray list, we then develop the Hierarchical Associative Classifier (HAC). HAC constructs a two-level associative classifier to maximize precision and recall of the minority (malware) class: in the first level, it uses high precision rules of majority (benign file samples) class and low precision rules of minority class to achieve high recall; and in the second level, it ranks the minority class files and optimizes the precision. Finally, since our case studies are based on a large and real data collection obtained from the Anti-virus Lab of Kingsoft corporation, including 8,000,000 malware, 8,000,000 benign files, and 100,000 file samples from the gray list, we empirically examine the sampling strategy to build the classifiers for such a large data collection to avoid over-fitting and achieve great effectiveness as well as high efficiency. Promising experimental results demonstrate the effectiveness and efficiency of the HAC classifier. HAC has already been incorporated into the scanning tool of Kingsoft’s Anti-Virus software.

knowledge discovery and data mining | 2007

A new initialization method for clustering categorical data

Shu Wu; Qingshan Jiang; Joshua Zhexue Huang

Performance of partitional clustering algorithms which converges to numerous local minima highly depends on initial cluster centers. This paper presents an initialization method which can be implemented to partitional clustering algorithms for categorical data sets with minimizing the numerical objective function. Experimental results show that the new initialization method is more efficient and stabler than the traditional one and can be implemented to large data sets for its linear time complexity.

Explore More