Jianlong Tan | Researchain

Publication

Featured researches published by Jianlong Tan.

International Conference on Data Mining | 2010

Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams

Peng Zhang; Xingquan Zhu; Jianlong Tan; Li Guo

Ensemble learning is a commonly used tool for building prediction models from data streams, owing to its intrinsic ability to handle large volumes of stream data. Despite its extraordinary success in stream data mining, existing ensemble models in stream data environments mainly fall into the ensemble classifiers category. They overlook the fact that building classifiers requires a labor-intensive labeling process, and it is often the case that only a small number of labeled samples are available to train a few classifiers, while a large number of unlabeled samples are available to build clusters from data streams. Accordingly, in this paper, we propose a new ensemble model which combines both classifiers and clusters for mining data streams. We argue that the main challenges of this new ensemble model are that (1) clusters formed from data streams carry only cluster IDs, with no genuine class label information, and (2) concept drift underlying data streams makes it even harder to combine clusters and classifiers into one ensemble framework. To handle challenge (1), we present a label propagation method that infers each cluster's class label by making full use of both class label information from classifiers and internal structure information from clusters. To handle challenge (2), we present a new weighting scheme that weights all base models according to their consistency with the up-to-date base model. As a result, all classifiers and clusters can be combined, through a weighted average mechanism, for prediction. Experiments on real-world data streams demonstrate that our method outperforms simple classifier ensembles and cluster ensembles for stream data mining.
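The weighted-average combination described in this abstract can be illustrated with a minimal sketch. It assumes that each base model (a classifier, or a cluster whose labels have already been propagated) maps a sample to a vector of per-class scores, and that "consistency" is approximated by the agreement rate with the newest model on recent samples. The function names and the toy models are hypothetical; the paper's actual weighting scheme and label propagation step are not reproduced here.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def consistency_weights(models, newest, samples):
    """Weight each model by how often it agrees with the newest model.

    Assumption: consistency is approximated by the agreement rate on
    recent samples; the paper's actual weighting scheme may differ.
    """
    weights = []
    for model in models:
        agree = sum(argmax(model(x)) == argmax(newest(x)) for x in samples)
        weights.append(agree / len(samples))
    return weights


def weighted_predict(models, weights, x):
    """Weighted average of the per-class score vectors of all base models."""
    total = sum(weights)
    n_classes = len(models[0](x))
    return [
        sum(w * m(x)[c] for m, w in zip(models, weights)) / total
        for c in range(n_classes)
    ]


# Toy base models (hypothetical): each maps a number to two class scores.
def newest(x):
    return [1.0, 0.0] if x < 0.5 else [0.0, 1.0]


def outdated(x):
    # Learned before the concept drifted, so it disagrees with `newest`.
    return [0.0, 1.0] if x < 0.5 else [1.0, 0.0]


models = [newest, outdated]
weights = consistency_weights(models, newest, samples=[0.1, 0.2, 0.8, 0.9])
prediction = weighted_predict(models, weights, 0.1)
```

Because the newest model is itself part of the ensemble, its weight is always 1, so the normalizing sum in `weighted_predict` never becomes zero in this sketch.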

Conference on Information and Knowledge Management | 2011

Mining frequent patterns across multiple data streams

Jing Guo; Peng Zhang; Jianlong Tan; Li Guo

Mining frequent patterns from data streams has drawn increasing attention in recent years. However, previous mining algorithms all focused on a single data stream. In many emerging applications, it is of critical importance to combine multiple data streams for analysis. For example, in real-time news topic analysis, it is necessary to combine multiple news report streams from different media sources to discover collaborative frequent patterns, which are reported frequently in all media, and comparative frequent patterns, which are reported more frequently in one medium than in others. To address this problem, we propose a novel frequent pattern mining algorithm, Hybrid-Streaming, or H-Stream for short. H-Stream builds a new Hybrid-Frequent tree to maintain historical frequent and potentially frequent itemsets from all data streams, and incrementally updates these itemsets for efficient collaborative and comparative pattern mining. Theoretical and empirical studies demonstrate the utility of the proposed method.

Procedia Computer Science | 2014

Sentiment Classification Based on AS-LDA Model

Jiguang Liang; Ping Liu; Jianlong Tan; Shuo Bai

We address the task of sentiment classification, that is, identifying the polarity of a subjective document. We introduce a sentiment classification method called AS-LDA. In this model, we assume that the words in subjective documents consist of two parts: sentiment element words and auxiliary words, which are sampled from sentiment topics and auxiliary topics, respectively. Sentiment element words include targets of the opinions, polarity words and modifiers of polarity words. Experimental results demonstrate that our approach outperforms Latent Dirichlet Allocation (LDA).

String Processing and Information Retrieval | 2005

A partition-based efficient algorithm for large scale multiple-strings matching

Ping Liu; Yanbing Liu; Jianlong Tan

Filtering plays an important role in the Internet security and information retrieval fields, and usually employs a multiple-strings matching algorithm as its key part. All the classical matching algorithms, however, perform badly when the number of keywords exceeds a critical point, which makes the large scale multiple-strings matching problem a great challenge. Based on the observation that the speed of the classical algorithms depends mainly on the length of the shortest keyword, a partition strategy is proposed to decompose the keyword set into a series of subsets, on each of which a classical algorithm is run. For the optimal partition, it is proved that keywords of the same length lie in the same subset, and that the keyword lengths of different subsets do not interlace. In this paper, we propose a shortest-path model for the problem of finding the optimal partition. Experiments on both random and real data demonstrate that our algorithm generally achieves a speed-up of about 100-300% compared with the classical ones.
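Since keywords of equal length stay together and subsets do not interlace in length, an optimal partition is a way of cutting the sorted list of distinct keyword lengths into contiguous groups, which can be found as a shortest path over cut positions. The sketch below illustrates this with dynamic programming. The cost function `toy_cost` is a made-up placeholder (a fixed overhead per subset plus a term that grows with the subset size and shrinks with its shortest keyword); the paper's actual cost model is not given in the abstract.

```python
from collections import Counter


def optimal_partition(keywords, cost):
    """Split keywords into length-contiguous subsets with minimal total cost.

    `cost(min_len, size)` is an assumed per-subset cost model supplied by
    the caller. Nodes of the shortest-path graph are cut positions in the
    sorted list of distinct lengths; an edge i -> j is the subset holding
    all keywords whose length is in lengths[i:j].
    """
    counts = Counter(len(k) for k in keywords)
    lengths = sorted(counts)
    n = len(lengths)
    best = [0.0] + [float("inf")] * n   # best[j]: cheapest way to cover lengths[:j]
    prev = [0] * (n + 1)                # prev[j]: start of the last subset
    for j in range(1, n + 1):
        size = 0
        for i in range(j - 1, -1, -1):
            size += counts[lengths[i]]  # subset now covers lengths[i:j]
            candidate = best[i] + cost(lengths[i], size)
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    # Walk the predecessor links back to recover the chosen subsets.
    groups = []
    j = n
    while j > 0:
        i = prev[j]
        wanted = set(lengths[i:j])
        groups.append([k for k in keywords if len(k) in wanted])
        j = i
    groups.reverse()
    return best[n], groups


def toy_cost(min_len, size):
    # Hypothetical cost: fixed overhead per subset + size / shortest length.
    return 0.5 + size / min_len


total, groups = optimal_partition(["ab", "cd", "abcdefgh", "ijklmnop"], toy_cost)
```

With this toy cost, keeping the short and long keywords in separate subsets is cheaper than scanning with one combined set, so two groups are returned.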

Conference on Information and Knowledge Management | 2010

SKIF: a data imputation framework for concept drifting data streams

Peng Zhang; Xingquan Zhu; Jianlong Tan; Li Guo

Missing data commonly occurs in many applications. While many data imputation methods exist to handle the missing data problem for large-scale databases, these methods face some common difficulties when applied to concept drifting data streams. First, due to large and continuous data volumes, we are unable to maintain all stream records to form a candidate pool and estimate missing values, as most existing methods do. Second, even if we could maintain all complete stream records using a summary structure, concept drift would make some information obsolete and thus degrade the imputation accuracy. Third, in data streams, it is necessary to develop a fast yet accurate algorithm to find the most similar data for imputation. Fourth, due to dynamic and sophisticated data collection environments, the missing rate of most stream data may be much higher than that in generic static databases, so the imputation method should be able to accommodate a high missing rate in the data. To tackle these challenges, we propose, in this paper, a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle the concept drift and large volume problems in data streams, SKIF first summarizes historical complete records in micro-resources (which are high-level statistical data structures), and maintains these micro-resources in a candidate pool as benchmark data. After that, SKIF employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism, to find the most similar micro-resources in the large-scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks.
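The core idea of imputing from a pool of summaries rather than from raw records can be sketched as follows. This is a plain k-nearest-neighbors illustration under simplifying assumptions: each micro-resource is reduced to a single mean vector, distance is Euclidean over the observed attributes only, and missing values are marked with `None`. It does not implement SKIF's hybrid similarity search or its handling of concept drift, and the function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(record, pool, k=2):
    """Fill missing values (None) from the k most similar summary vectors.

    `pool` is a list of complete vectors standing in for micro-resources
    (an assumption made for this sketch).
    """
    observed = [i for i, v in enumerate(record) if v is not None]

    def distance(candidate):
        # Compare only on the attributes the incomplete record actually has.
        return math.sqrt(sum((record[i] - candidate[i]) ** 2 for i in observed))

    nearest = sorted(pool, key=distance)[:k]
    return [
        v if v is not None else sum(c[i] for c in nearest) / len(nearest)
        for i, v in enumerate(record)
    ]


pool = [[1.0, 1.0, 10.0], [1.2, 0.8, 12.0], [9.0, 9.0, 50.0]]
completed = knn_impute([1.1, 0.9, None], pool, k=2)
```

Here the two nearby summaries supply the missing third attribute, while the distant one is ignored.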

International Conference on Implementation and Application of Automata | 2009

A Table Compression Method for Extended Aho-Corasick Automaton

Yanbing Liu; Yifu Yang; Ping Liu; Jianlong Tan

The Aho-Corasick algorithm is a classic method for matching a set of strings. However, the huge memory usage of the Aho-Corasick automaton prevents it from being applied to large-scale pattern sets. Here we present a simple but efficient table compression method to reduce the automaton's space. The basic idea of our method is equivalent-row elimination, which groups state rows into equivalence classes and eliminates the duplicates. Experiments demonstrate that the proposed method significantly reduces memory usage while still running in linear search time comparable to that of the extended Aho-Corasick algorithm. Our method provides a good trade-off between memory usage and search time.
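A minimal sketch of row elimination on a dense transition table is shown below. It assumes that "equivalent" rows are simply identical rows, stores each distinct row once, and adds one level of indirection from state to stored row. The paper's notion of equivalence and its table layout may be more refined; the function names are hypothetical.

```python
def compress_rows(table):
    """Store each distinct transition row once.

    Returns (unique_rows, row_index), where row_index[state] is the position
    of that state's row in unique_rows.
    """
    unique_rows = []
    position = {}          # row contents -> index in unique_rows
    row_index = []
    for row in table:
        key = tuple(row)
        if key not in position:
            position[key] = len(unique_rows)
            unique_rows.append(list(row))
        row_index.append(position[key])
    return unique_rows, row_index


def next_state(unique_rows, row_index, state, symbol):
    """Transition lookup: one extra array access compared with the raw table."""
    return unique_rows[row_index[state]][symbol]


# Toy table: 4 states over a 2-symbol alphabet; states 0/2 and 1/3 share rows.
table = [[1, 0], [1, 2], [1, 0], [1, 2]]
unique_rows, row_index = compress_rows(table)
```

Each lookup still takes constant time, so scanning a text remains linear in its length, while the table shrinks to the number of distinct rows plus one index per state.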

Conference on Information and Knowledge Management | 2015

Lingo: Linearized Grassmannian Optimization for Nuclear Norm Minimization

Qian Li; Wenjia Niu; Gang Li; Yanan Cao; Jianlong Tan; Li Guo

As a popular heuristic for the matrix rank minimization problem, nuclear norm minimization has attracted intensive research attention. Matrix factorization based algorithms can reduce the expensive computation cost of SVD for nuclear norm minimization. However, most matrix factorization based algorithms fail to provide a theoretical guarantee of convergence because their factorizations are non-unique. This paper proposes an efficient and accurate Linearized Grassmannian Optimization (Lingo) algorithm, which adopts matrix factorization and the Grassmann manifold structure to alternately minimize the subproblems. More specifically, the linearization strategy makes the auxiliary variables unnecessary and guarantees a closed-form solution with low per-iteration complexity. Lingo then converts the linearized objective function into a nuclear norm minimization over the Grassmannian manifold, which remedies the non-uniqueness of the solution of the low-rank matrix factorization. Extensive comparison experiments demonstrate the accuracy and efficiency of the Lingo algorithm. The global convergence of Lingo is guaranteed with a theoretical proof, which also verifies the effectiveness of Lingo.

International Conference on Conceptual Structures | 2014

A Multi-layer Event Detection Algorithm for Detecting Global and Local Hot Events in Social Networks

Zhicong Tan; Peng Zhang; Jianlong Tan; Li Guo

In this paper, we present a new approach to detecting global hot events and local hot events. Unlike previous event detection algorithms, which do not distinguish between global events and local events, we believe it is important to make that distinction, as certain events are meaningful only when placed in a specific context while other events may arouse the interest of general users. The main contribution of this paper is that we have customized hot event detection by employing local community detection mechanisms and established a very clear concept of global hot events and local hot events. We present a multi-layer event detection algorithm which constructs a four-stage event detection procedure that produces a relatively comprehensive description of events relevant to the unique makeup and different interests of microblog users. Both the global hot events and the local hot events we gathered are represented by a key tweet which contains sufficient information to depict a complete event. Because our algorithm describes events more precisely than existing event detection algorithms, it is now possible for people to better understand public sentiment towards hot issues on microblogs. Experiments have shown that our multi-layer hot event detection algorithm can produce promising results in mining the interests of different communities, generating relevant event clusters and presenting meaningful events to community users. The most all-round evaluation indicator, the F-value, which takes both precision and recall into account, demonstrates that our algorithm outperforms the other three traditional approaches in detecting hot events.

Conference on Information and Knowledge Management | 2011

Continuous data stream query in the cloud

Jun Li; Peng Zhang; Jianlong Tan; Ping Liu; Li Guo

Cloud computing represents one of the most important research directions for modern computing systems. Existing research efforts on Cloud computing have all focused on designing advanced storage and query techniques for static data. None of them considers the problem that data in a Cloud may appear as continuous and rapid data streams. To address this problem, in this paper we propose a new LCN-Index framework to handle continuous data stream queries in the Cloud. LCN-Index uses the Map-Reduce computing paradigm to process all the queries. In the mapping stage, it divides all the queries into a batch of predicate sets, which are then deployed onto mapping nodes using an interval predicate index. In the reducing stage, it merges results from the mapping nodes using a multi-attribute hash index. In so doing, a data stream can be efficiently evaluated by traversing the LCN-Index framework. Experiments demonstrate the utility of the proposed method.

Conference on Computer Communications Workshops | 2011

Speeding up pattern matching by optimal partial string extraction

Jianlong Tan; Xia Liu; Yanbing Liu; Ping Liu

String matching plays a key role in web content monitoring systems. Suffix matching algorithms have good time efficiency, and thus are widely used. These algorithms require that all patterns in a set have the same length. When the patterns cannot satisfy this requirement, the leftmost m characters, m being the length of the shortest pattern, are extracted to construct the data structure. We call such m-character strings partial strings. However, a simple extraction from the left does not address the impact of partial string locations on search speed. We propose a novel method to extract from each pattern the partial string which maximizes search speed. More specifically, with this method we can compute the search time cost of every candidate location by theoretical derivation, and choose the location which yields an approximately minimal search time. We evaluate our method on two rule sets: Snort and ClamAV. Experiments show that in most cases, our method achieves the fastest search speed among all possible locations of partial string extraction, and is about 5%–20% faster than the alternative methods.
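To make the notion of partial strings concrete, the sketch below enumerates every length-m window of each pattern and keeps the one with the lowest score under a caller-supplied cost function. This is a simplification: it scores each candidate independently, whereas the paper derives a search-time cost that is not given in the abstract. The function names and `toy_cost` (which merely penalizes characters assumed to be common in scanned text) are hypothetical.

```python
def candidate_partial_strings(pattern, m):
    """All length-m substrings of a pattern, ordered by start offset."""
    return [pattern[i:i + m] for i in range(len(pattern) - m + 1)]


def extract_partial_strings(patterns, cost):
    """Pick one m-character partial string per pattern, minimizing `cost`.

    m is the length of the shortest pattern. Ties go to the leftmost
    window, so a constant cost reproduces the plain leftmost extraction.
    """
    m = min(len(p) for p in patterns)
    return [min(candidate_partial_strings(p, m), key=cost) for p in patterns]


COMMON = set("etaoin ")  # assumed set of frequent characters, for illustration


def toy_cost(partial):
    # Hypothetical stand-in for a search-time estimate.
    return sum(ch in COMMON for ch in partial)


leftmost = extract_partial_strings(["attack", "xyz"], cost=lambda s: 0)
chosen = extract_partial_strings(["attack", "xyz"], cost=toy_cost)
```

With the constant cost the result is the usual leftmost extraction, while the toy cost shifts the window of the longer pattern to a different offset.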
