Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jimmy Xiangji Huang is active.

Publication


Featured research published by Jimmy Xiangji Huang.


Future Generation Computer Systems | 2014

Mining network data for intrusion detection through combining SVMs with ant colony networks

Wenying Feng; Qinglei Zhang; Gongzhu Hu; Jimmy Xiangji Huang

In this paper, we introduce a new machine-learning-based data classification algorithm that is applied to network intrusion detection. The basic task is to classify network activities (recorded in the network log as connection records) as normal or abnormal while minimizing misclassification. Although different classification models have been developed for network intrusion detection, each has its strengths and weaknesses, including the most commonly applied Support Vector Machine (SVM) method and the Clustering based on Self-Organized Ant Colony Network (CSOACN). Our new approach combines the SVM method with CSOACNs to take advantage of both while avoiding their weaknesses. Our algorithm is implemented and evaluated using the standard KDD99 benchmark data set. Experiments show that CSVAC (Combining Support Vectors with Ant Colony) outperforms SVM alone or CSOACN alone in terms of both classification rate and run-time efficiency.
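The abstract does not spell out how the two stages are interleaved, but the general recipe of letting a clustering stage pick informative training samples for an SVM can be sketched as below. This is a minimal illustration only: KMeans stands in for the paper's ant-colony clustering (CSOACN), synthetic data stands in for KDD99, and the selection heuristic is an assumption, not the authors' CSVAC algorithm.

```python
# Hedged sketch: clustering selects informative samples, an SVM does the final
# classification. KMeans and the synthetic data are stand-ins, not the paper's
# CSOACN/KDD99 pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for preprocessed connection records (normal vs. abnormal).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) Cluster the training data (stand-in for the ant-colony clustering stage).
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_train)

# 2) Keep points from label-impure clusters (near the class boundary) plus a few
#    representatives per pure cluster -- a rough proxy for "informative" samples.
dist = np.linalg.norm(X_train - km.cluster_centers_[km.labels_], axis=1)
keep = np.zeros(len(X_train), dtype=bool)
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    if len(idx) == 0:
        continue
    purity = max(np.bincount(y_train[idx], minlength=2)) / len(idx)
    if purity < 0.95:                      # mixed cluster: keep everything
        keep[idx] = True
    else:                                  # pure cluster: keep a few representatives
        keep[idx[np.argsort(dist[idx])[:5]]] = True

# 3) Train the SVM on the reduced, more informative set.
svm = SVC(kernel="rbf", gamma="scale").fit(X_train[keep], y_train[keep])
print("accuracy on held-out data:", svm.score(X_test, y_test))
```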


Information Processing and Management | 2011

Combining integrated sampling with SVM ensembles for learning from imbalanced datasets

Yang Liu; Xiaohui Yu; Jimmy Xiangji Huang; Aijun An

Learning from imbalanced datasets is difficult. The insufficient information associated with the minority class impedes a clear understanding of the inherent structure of the dataset. Most existing classification methods tend not to perform well on minority-class examples when the dataset is extremely imbalanced, because they aim to optimize overall accuracy without considering the relative distribution of each class. In this paper, we study the performance of SVMs, which have achieved great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs may suffer from biased decision boundaries and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique, which incorporates both over-sampling and under-sampling, with an ensemble of SVMs to improve the prediction performance. Extensive experiments show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.
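A minimal sketch of the general recipe follows: build several balanced training sets by over-sampling the minority class and under-sampling the majority class, train one SVM per set, and combine them by voting. The resampling scheme, ensemble size, and data are illustrative assumptions, not the paper's exact integrated-sampling algorithm.

```python
# Hedged sketch: integrated over/under-sampling plus an SVM ensemble with voting.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=15, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]

def balanced_sample():
    # Over-sample the minority with replacement, under-sample the majority.
    n = 2 * len(minority)
    up = rng.choice(minority, size=n, replace=True)
    down = rng.choice(majority, size=n, replace=False)
    idx = np.concatenate([up, down])
    return X_tr[idx], y_tr[idx]

ensemble = []
for _ in range(7):                         # 7 base SVMs, an arbitrary choice
    Xb, yb = balanced_sample()
    ensemble.append(SVC(kernel="rbf", gamma="scale").fit(Xb, yb))

votes = np.mean([clf.predict(X_te) for clf in ensemble], axis=0)
pred = (votes >= 0.5).astype(int)          # majority vote
print("minority-class recall:", (pred[y_te == 1] == 1).mean())
```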


Information Sciences | 2011

Modeling term proximity for probabilistic information retrieval models

Ben He; Jimmy Xiangji Huang; Xiaofeng Zhou

Proximity among query terms has been found to be useful for improving retrieval performance. However, its application to classical probabilistic information retrieval models, such as Okapi's BM25, remains a challenging research problem. In this paper, we propose to improve the classical BM25 model by utilizing term proximity evidence. Four novel methods are proposed to model the proximity between query terms: a window-based N-gram Counting method, and Survival Analysis over three different statistics, namely the Poisson process, an exponential distribution, and an empirical function. Through extensive experiments on standard TREC collections, our proposed proximity-based BM25 model, called BM25P, is compared to strong state-of-the-art baselines, including the original unigram BM25 model, the Markov Random Field model, and the positional language model. According to the experimental results, the window-based N-gram Counting method and Survival Analysis over an exponential distribution are the most effective of the four proposed methods, and both lead to marked improvements over the baselines. This shows that the use of term proximity considerably enhances the retrieval effectiveness of classical probabilistic models. It is therefore recommended to deploy a term proximity component in retrieval systems that employ probabilistic models.
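To make the window-based idea concrete, here is a small sketch that folds a window-based co-occurrence count into a BM25-style score. The weighting formula, window size, and combination parameter alpha are assumptions for illustration; the paper's BM25P defines these components precisely.

```python
# Hedged sketch: BM25-style scoring plus a window-based proximity boost for
# query-term pairs that co-occur within a small window.
import math
from collections import Counter

def bm25_term(tf, df, doc_len, avg_len, N, k1=1.2, b=0.75):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def window_pair_count(doc_tokens, q1, q2, window=5):
    # Count occurrence pairs of the two query terms within `window` tokens.
    pos1 = [i for i, t in enumerate(doc_tokens) if t == q1]
    pos2 = [i for i, t in enumerate(doc_tokens) if t == q2]
    return sum(1 for i in pos1 for j in pos2 if 0 < abs(i - j) <= window)

def score(doc_tokens, query_terms, dfs, N, avg_len, alpha=0.5):
    tfs = Counter(doc_tokens)
    s = sum(bm25_term(tfs[q], dfs.get(q, 1), len(doc_tokens), avg_len, N)
            for q in query_terms)
    # Proximity component: one dampened pseudo-frequency per query-term pair.
    for i in range(len(query_terms)):
        for j in range(i + 1, len(query_terms)):
            c = window_pair_count(doc_tokens, query_terms[i], query_terms[j])
            s += alpha * math.log(1 + c)
    return s

doc = "probabilistic retrieval models use term proximity to improve retrieval".split()
print(score(doc, ["term", "proximity"], {"term": 20, "proximity": 5}, N=100, avg_len=12))
```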


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

Proximity-based Rocchio's model for pseudo relevance

Jun Miao; Jimmy Xiangji Huang; Zheng Ye

Rocchio's relevance feedback model is a classic query expansion method, and it has been shown to be effective in boosting information retrieval performance. The selection of expansion terms in this method, however, does not take into account the relationship between the candidate terms and the query terms (e.g., term proximity). Intuitively, the proximity between candidate expansion terms and query terms can be exploited in the process of query expansion, since terms closer to query terms are more likely to be related to the query topic. In this paper, we study how to incorporate proximity information into Rocchio's model and propose a proximity-based Rocchio's model, called PRoc, with three variants. In our PRoc models, a new concept (proximity-based term frequency, ptf) is introduced to model the proximity information in the pseudo-relevant documents, which is then used in three kinds of proximity measures. Experimental results on TREC collections show that our proposed PRoc models are effective and generally superior to state-of-the-art relevance feedback models with optimal parameters. A direct comparison with the positional relevance model (PRM) on the GOV2 collection also indicates that our proposed model is at least competitive with the most recent progress.
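The core idea of a proximity-based term frequency can be sketched as follows: each occurrence of a candidate expansion term contributes a weight that decays with its distance to the nearest query-term occurrence. The Gaussian decay kernel, the sigma parameter, and the toy documents are assumptions; the paper defines three specific PRoc variants rather than this exact weighting.

```python
# Hedged sketch of a proximity-based term frequency (ptf) for selecting
# expansion terms from pseudo-relevant documents.
import math
from collections import defaultdict

def ptf(doc_tokens, query_terms, sigma=10.0):
    q_pos = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    scores = defaultdict(float)
    for i, t in enumerate(doc_tokens):
        if t in query_terms or not q_pos:
            continue
        d = min(abs(i - p) for p in q_pos)
        scores[t] += math.exp(-d * d / (2 * sigma * sigma))  # Gaussian decay with distance
    return scores

def expansion_terms(feedback_docs, query_terms, k=5):
    total = defaultdict(float)
    for doc in feedback_docs:
        for t, w in ptf(doc, query_terms).items():
            total[t] += w
    return sorted(total, key=total.get, reverse=True)[:k]

docs = ["pseudo relevance feedback expands the query with related terms".split(),
        "rocchio weights feedback terms by their frequency in relevant documents".split()]
print(expansion_terms(docs, {"feedback", "relevance"}))
```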


Bioinformatics and Biomedicine | 2014

Deep learning for healthcare decision making with EMRs

Zhaohui Liang; Gang Zhang; Jimmy Xiangji Huang; Qinmin Vivian Hu

Computer-aided technology is widely applied in decision making and outcome assessment for healthcare delivery, in which modeling knowledge and expert experience is technically important. However, conventional rule-based models are unable to capture the underlying knowledge because they cannot simulate the complexity of the human brain and rely heavily on the feature representation of the problem domain. We therefore attempt to apply a deep model to overcome this weakness. The deep model can simulate human thinking and combines feature representation and learning in a unified model. A modified version of convolutional deep belief networks is used as an effective training method for large-scale data sets. The model is then tested on two instances: a dataset on hypertension retrieved from a hospital information system (HIS), and a dataset on Chinese medical diagnosis and treatment prescriptions from a manually converted electronic medical record (EMR) database. The experimental results indicate that the proposed deep model is able to reveal previously unknown concepts and performs much better than conventional shallow models.
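To illustrate the idea of learning feature representations directly from raw records, here is a small discriminative stand-in: a 1-D convolutional classifier over sequences of EMR event codes. The paper uses a modified convolutional deep belief network trained generatively, which is not reproduced here; the vocabulary size, dimensions, and synthetic inputs are all illustrative assumptions.

```python
# Hedged sketch: a tiny 1-D CNN over EMR event-code sequences (a stand-in for
# the paper's convolutional deep belief network).
import torch
import torch.nn as nn

class EMRConvNet(nn.Module):
    def __init__(self, vocab_size=2000, emb_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classify = nn.Linear(128, n_classes)

    def forward(self, codes):                  # codes: (batch, seq_len) of event ids
        x = self.embed(codes).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = self.conv(x).squeeze(-1)           # (batch, 128)
        return self.classify(x)

model = EMRConvNet()
dummy = torch.randint(0, 2000, (8, 50))        # 8 synthetic records, 50 events each
print(model(dummy).shape)                      # torch.Size([8, 2])
```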


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

TREC-CHEM: large scale chemical information retrieval evaluation at TREC

Mihai Lupu; Jimmy Xiangji Huang; Jianhan Zhu; John Tait

Over the past decades, significant progress has been made in Information Retrieval (IR), ranging from efficiency and scalability to theoretical modeling and evaluation. However, many grand challenges remain. Recently, more and more attention has been paid to research on domain-specific IR applications, as evidenced by the organization of the Genomics and Legal tracks in the Text REtrieval Conference (TREC). Now is the right time to carry out large-scale evaluations on chemical datasets in order to promote research in chemical IR in general, and chemical patent IR in particular. Accordingly, we organize a chemical IR track in TREC (TREC-CHEM) to address the challenges in chemical and patent IR. This paper describes these challenges and the accomplishments of the first year, and opens up the discussion for the next year.


Journal of the Association for Information Science and Technology | 2011

Finding a good query-related topic for boosting pseudo-relevance feedback

Zheng Ye; Jimmy Xiangji Huang; Hongfei Lin

Pseudo-relevance feedback (PRF) via query expansion (QE) assumes that the top-ranked documents from the first-pass retrieval are relevant. The most informative terms in the pseudo-relevant feedback documents are then used to update the original query representation in order to boost retrieval performance. Most current PRF approaches estimate the importance of candidate expansion terms based on document-level statistics. However, a document used for PRF may consist of different topics, not all of which are related to the query even if the document is judged relevant. The main argument of this article is that PRF should be conducted at a granularity finer than the document level. We propose a topic-based feedback model with three different strategies for finding a good query-related topic based on the Latent Dirichlet Allocation model. The experimental results on four representative TREC collections show that QE based on the derived topic achieves statistically significant improvements over a strong feedback model in the language modeling framework, which updates the query representation based on the top-ranked documents.
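A minimal sketch of the overall idea: fit LDA over the pseudo-relevant documents, pick the topic most related to the query, and take its top words as expansion candidates. The topic-selection rule used here (mass of query terms inside each topic) is one simple assumption, not necessarily one of the three strategies evaluated in the article.

```python
# Hedged sketch: select a query-related LDA topic from feedback documents and
# use its top words for query expansion.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

feedback_docs = [
    "query expansion adds related terms to the original query",
    "topic models such as lda group words into latent topics",
    "football season results and match scores",   # off-topic noise
]
query_terms = ["query", "expansion"]

vec = CountVectorizer()
X = vec.fit_transform(feedback_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vec.get_feature_names_out()
word_index = {w: i for i, w in enumerate(vocab)}

# Normalise topic-word weights and score each topic by the mass of query terms.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
q_idx = [word_index[q] for q in query_terms if q in word_index]
best = int(np.argmax(topic_word[:, q_idx].sum(axis=1)))

top_words = [vocab[i] for i in topic_word[best].argsort()[::-1][:8]]
print("expansion candidates:", [w for w in top_words if w not in query_terms])
```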


Information Processing and Management | 2013

High performance query expansion using adaptive co-training

Jimmy Xiangji Huang; Jun Miao; Ben He

The quality of feedback documents is crucial to the effectiveness of query expansion (QE) in ad hoc retrieval. Recently, machine learning methods have been adopted to tackle this issue by training classifiers from feedback documents. However, the lack of proper training data has prevented these methods from selecting good feedback documents. In this paper, we propose a new method, called AdapCOT, which applies co-training in an adaptive manner to select feedback documents for boosting QE's effectiveness. Co-training is an effective technique for classification over limited training data, which makes it particularly suitable for selecting feedback documents. The proposed AdapCOT method makes use of a small set of training documents and labels the feedback documents according to their quality through an iterative process. Two exclusive sets of term-based features are selected to train the classifiers. Finally, QE is performed on the labeled positive documents. Our extensive experiments show that the proposed method improves QE's effectiveness and outperforms strong baselines on various standard TREC collections.
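For readers unfamiliar with co-training, the plain (non-adaptive) version over two disjoint feature views can be sketched as follows. AdapCOT's adaptive labelling schedule and its IR-specific term-based features are not reproduced; the feature split, batch size of five, and synthetic data are illustrative assumptions.

```python
# Hedged sketch of plain co-training over two exclusive feature views.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
A, B = X[:, :10], X[:, 10:]             # two exclusive feature views
labeled = np.arange(30)                 # small seed of labelled feedback documents
unlabeled = np.arange(30, 600)
y_known = y.copy()                      # seed labels; others get overwritten below

clf_a, clf_b = GaussianNB(), GaussianNB()
for _ in range(10):                     # a few co-training rounds
    clf_a.fit(A[labeled], y_known[labeled])
    clf_b.fit(B[labeled], y_known[labeled])
    if len(unlabeled) < 5:
        break
    # Each classifier labels the unlabeled documents it is most confident about.
    conf_a = clf_a.predict_proba(A[unlabeled]).max(axis=1)
    conf_b = clf_b.predict_proba(B[unlabeled]).max(axis=1)
    top_a, top_b = conf_a.argsort()[-5:], conf_b.argsort()[-5:]
    y_known[unlabeled[top_a]] = clf_a.predict(A[unlabeled[top_a]])
    y_known[unlabeled[top_b]] = clf_b.predict(B[unlabeled[top_b]])
    pick = np.unique(np.concatenate([top_a, top_b]))
    labeled = np.concatenate([labeled, unlabeled[pick]])
    unlabeled = np.delete(unlabeled, pick)

print("documents labelled positive after co-training:",
      int((y_known[labeled] == 1).sum()))
```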


ACM Transactions on Information Systems | 2014

Modeling Term Associations for Probabilistic Information Retrieval

Jiashu Zhao; Jimmy Xiangji Huang; Zheng Ye

Traditionally, in many probabilistic retrieval models, query terms are assumed to be independent. Although such models can achieve reasonably good performance, associations can exist among terms from a human being's point of view. Some recent studies investigate how to model term associations/dependencies with proximity measures. However, modeling term associations theoretically under the probabilistic retrieval framework is still largely unexplored. In this article, we introduce a new concept, the cross term, to model term proximity with the aim of boosting retrieval performance. With cross terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text, and the degree of this impact gradually weakens with increasing distance from the place of occurrence. We use shape functions to characterize such impacts. Based on this assumption, we first propose a bigram CRoss TErm Retrieval (CRTER2) model as the basis model, and then recursively propose a generalized n-gram CRoss TErm Retrieval (CRTERn) model for n query terms, where n > 2. Specifically, a bigram cross term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. For an n-gram cross term, we develop several distance metrics with different properties and employ them in the proposed models for ranking. We also show how to extend the language model using the newly proposed cross terms. Extensive experiments on a number of TREC collections demonstrate the effectiveness of our proposed models.
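The bigram case can be illustrated with a short sketch: a pseudo-frequency for the cross term is obtained from the overlap of shape functions centred on each query-term occurrence. The triangular kernel, its width, and how the pseudo-frequency would enter BM25 are assumptions for illustration; the article's CRTER2 specifies these components exactly.

```python
# Hedged sketch: a bigram cross-term pseudo-frequency from the intersection of
# triangular shape functions centred on query-term occurrences.
def triangle(pos, centre, width=5.0):
    # Impact of an occurrence at `centre` on position `pos`, decaying linearly.
    return max(0.0, 1.0 - abs(pos - centre) / width)

def cross_term_frequency(doc_tokens, q1, q2, width=5.0):
    pos1 = [i for i, t in enumerate(doc_tokens) if t == q1]
    pos2 = [i for i, t in enumerate(doc_tokens) if t == q2]
    total = 0.0
    for p in range(len(doc_tokens)):
        impact1 = max((triangle(p, c, width) for c in pos1), default=0.0)
        impact2 = max((triangle(p, c, width) for c in pos2), default=0.0)
        total += min(impact1, impact2)   # intersection of the two shape functions
    return total

doc = "cross terms model the association between nearby query terms".split()
print(cross_term_frequency(doc, "query", "terms"))
```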


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Leveraging conceptual lexicon: query disambiguation using proximity information for patent retrieval

Parvaz Mahdabi; Shima Gerani; Jimmy Xiangji Huang; Fabio Crestani

Patent prior art search is a task in patent retrieval where the goal is to rank documents describing prior art related to a patent application. One of the main properties of patent retrieval is that the query topic is a full patent application and does not represent a focused information need. This query-by-document nature of patent retrieval introduces new challenges and requires new investigations specific to this problem. Researchers have addressed this problem by considering different information resources for query reduction and query disambiguation. However, previous work has not fully studied the effect of using proximity information and exploiting domain-specific resources for query disambiguation. In this paper, we first reduce the query document to the first claim of the document itself. We then build a query-specific patent lexicon based on definitions from the International Patent Classification (IPC). We study how to expand queries by selecting expansion terms from the lexicon that are focused on the query topic. The key problem is how to capture whether an expansion term is focused on the query topic or not. We address this problem by exploiting proximity information: we assign high weights to expansion terms appearing closer to query terms, based on the intuition that terms closer to query terms are more likely to be related to the query topic. Experimental results on two patent retrieval datasets show that the proposed method is effective and robust for query expansion, significantly outperforming standard pseudo relevance feedback (PRF) and existing baselines in patent retrieval.
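The weighting step can be sketched in a few lines: candidate terms drawn from a domain lexicon are weighted by how close they appear to query terms in the reduced query document (the first claim). The toy lexicon, the sample claim, and the reciprocal-distance weight are assumptions for illustration, not the IPC-derived lexicon or the exact weighting used in the paper.

```python
# Hedged sketch: proximity-based weighting of lexicon terms for query expansion.
from collections import defaultdict

def proximity_weights(claim_tokens, query_terms, lexicon):
    q_pos = [i for i, t in enumerate(claim_tokens) if t in query_terms]
    weights = defaultdict(float)
    for i, t in enumerate(claim_tokens):
        if t in lexicon and t not in query_terms and q_pos:
            d = min(abs(i - p) for p in q_pos)
            weights[t] += 1.0 / (1.0 + d)    # closer to a query term -> higher weight
    return dict(weights)

claim = "a rechargeable lithium battery cell with a polymer electrolyte membrane".split()
lexicon = {"lithium", "electrolyte", "membrane", "polymer"}
print(proximity_weights(claim, {"battery", "cell"}, lexicon))
```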

Collaboration


Dive into Jimmy Xiangji Huang's collaborations.

Top Co-Authors

Qinmin Hu, East China Normal University

Ben He, Chinese Academy of Sciences