Baile Shi | Researchain

Publication

Featured research published by Baile Shi.

knowledge discovery and data mining | 2006

Utility-based anonymization using local recoding

Jian Xu; Wei Wang; Jian Pei; Xiaoyuan Wang; Baile Shi; Ada Wai-Chee Fu

Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attributes to generalized or changed values. However, global recoding may not always achieve effective anonymization in terms of discernability and query answering accuracy using the anonymized data. Moreover, anonymized data is often used for analysis. As well accepted in many analytical applications, different attributes in a data set may have different utility in the analysis. The utility of attributes has not been considered in the previous methods. In this paper, we study the problem of utility-based anonymization. First, we propose a simple framework to specify utility of attributes. The framework covers both numeric and categorical data. Second, we develop two simple yet efficient heuristic local recoding methods for utility-based anonymization. Our extensive performance study using both real data sets and synthetic data sets shows that our methods outperform the state-of-the-art multidimensional global recoding methods in both discernability and query answering accuracy. Furthermore, our utility-based method can boost the quality of analysis using the anonymized data.
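The contrast between global and local recoding can be illustrated with a minimal sketch. This is not the heuristic from the paper: it assumes a single numeric quasi-identifier (age), and the helper names `global_recode` and `local_recode`, the fixed bin width, and the sort-and-chunk grouping are all assumptions made for illustration. Global recoding maps the whole domain to the same intervals, whereas local recoding generalizes each group of at least k records to the tightest range that covers that group.

```python
def global_recode(ages, width=20):
    """Map every value to a fixed-width interval of the whole domain."""
    recoded = []
    for a in ages:
        lo = (a // width) * width
        recoded.append((lo, lo + width - 1))
    return recoded


def local_recode(ages, k=2):
    """Generalize each run of >= k sorted values to its own [min, max] range.

    Assumes len(ages) >= k; otherwise the single group is smaller than k.
    """
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short trailing group into the previous one so every group has >= k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    recoded = [None] * len(ages)
    for group in groups:
        lo = min(ages[i] for i in group)
        hi = max(ages[i] for i in group)
        for i in group:
            recoded[i] = (lo, hi)
    return recoded


ages = [23, 25, 31, 38, 39, 57, 58]
print(global_recode(ages))       # every age lands in (20, 39) or (40, 59)
print(local_recode(ages, k=2))   # each group gets its own, often narrower, range
```

In this toy data the local ranges for the younger records are much narrower than the global 20-year bins, which is the kind of information loss the abstract refers to. A utility-aware variant would additionally weight attributes by their importance to the analysis; that part is not sketched here.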

pacific-asia conference on knowledge discovery and data mining | 2004

Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining

Chen Wang; Mingsheng Hong; Jian Pei; Haofeng Zhou; Wei Wang; Baile Shi

Mining frequent tree patterns is an important research problem with broad applications in bioinformatics, digital libraries, e-commerce, and so on. Previous studies strongly suggested that pattern-growth methods are efficient in frequent pattern mining. In this paper, we systematically develop pattern-growth methods for mining frequent tree patterns. Two algorithms, Chopper and XSpanner, are devised. An extensive performance study shows that the two newly developed algorithms outperform TreeMinerV [13], one of the fastest methods proposed before, in mining large databases. Furthermore, algorithm XSpanner is substantially faster than Chopper in many cases.

knowledge discovery and data mining | 2004

Scalable mining of large disk-based graph databases

Chen Wang; Wei Wang; Jian Pei; Yongtai Zhu; Baile Shi

Mining frequent structural patterns from graph databases is an interesting problem with broad applications. Most of the previous studies focus on pruning unfruitful search subspaces effectively, but few of them address mining on large, disk-based databases. As many graph databases in applications cannot be held in main memory, scalable mining of large, disk-based graph databases remains a challenging problem. In this paper, we develop an effective index structure, ADI (for adjacency index), to support mining various graph patterns over large databases that cannot be held in main memory. The index is simple and efficient to build. Moreover, the new index structure can be easily adopted in various existing graph pattern mining algorithms. As an example, we adapt the well-known gSpan algorithm by using the ADI structure. The experimental results show that the new index structure enables scalable graph pattern mining over large databases. In one set of experiments, the new disk-based method can mine graph databases with one million graphs, while the original gSpan algorithm can only handle databases of up to 300 thousand graphs. Moreover, our new method is faster than gSpan when both can run in main memory.

SIGKDD Explorations | 2006

Utility-based anonymization for privacy preservation with less information loss

Jian Xu; Wei Wang; Jian Pei; Xiaoyuan Wang; Baile Shi; Ada Wai-Chee Fu

Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attributes to generalized or changed values. However, global recoding may not always achieve effective anonymization in terms of discernability and query answering accuracy using the anonymized data. Moreover, anonymized data is often used for analysis. As well accepted in many analytical applications, different attributes in a data set may have different utility in the analysis. The utility of attributes has not been considered in the previous methods. In this paper, we study the problem of utility-based anonymization. First, we propose a simple framework to specify utility of attributes. The framework covers both numeric and categorical data. Second, we develop two simple yet efficient heuristic local recoding methods for utility-based anonymization. Our extensive performance study using both real data sets and synthetic data sets shows that our methods outperform the state-of-the-art multidimensional global recoding methods in both discernability and query answering accuracy. Furthermore, our utility-based method can boost the quality of analysis using the anonymized data.

international conference on management of data | 2005

GraphMiner: a structural pattern-mining system for large disk-based graph databases and its applications

Wei Wang; Chen Wang; Yongtai Zhu; Baile Shi; Jian Pei; Xifeng Yan; Jiawei Han

Mining frequent structural patterns from graph databases is an important research problem with broad applications. Recently, we developed an effective index structure, ADI, and efficient algorithms for mining frequent patterns from large, disk-based graph databases [5], as well as constraint-based mining techniques. The techniques have been integrated into a research prototype system, GraphMiner. In this paper, we describe a demo of GraphMiner which showcases the technical details of the index structure and the mining algorithms, including their efficient implementation; the mining performance and the comparison with some state-of-the-art methods; the constraint-based graph-pattern mining techniques and the procedure of constrained graph mining; and the mining of real data sets in novel applications.

IEEE Transactions on Knowledge and Data Engineering | 2007

A Low-Granularity Classifier for Data Streams with Concept Drifts and Biased Class Distribution

Peng Wang; Haixun Wang; Xiaochen Wu; Wei Wang; Baile Shi

Many applications track streaming data for actionable alerts, which may include, for example, network intrusions, transaction frauds, bio-surveillance abnormalities, and so forth. Some stream classification models are built for this purpose. Due to concept drifts, maintaining a model's up-to-dateness has become one of the most challenging tasks in mining data streams. State-of-the-art approaches, including both the incrementally updated classifiers and the ensemble classifiers, have proved that model update is a very costly process. In this paper, we show that reducing model granularity reduces the update cost, as models of fine granularity enable us to efficiently pinpoint local components in the model that are affected by the concept drift. It also enables us to derive new model components to reflect the current data distribution, thus avoiding expensive updates on a global scale. Furthermore, those actionable alerts being monitored are usually rare occurrences. The existing stream classifiers cannot handle this problem. We address this problem and show that the low-granularity classifier handles rare events on stream data with ease. Experiments on real and synthetic data show that our approach is able to maintain good prediction accuracy at a fraction of the model updating cost of state-of-the-art approaches.

computer and information technology | 2005

Storage and Query over Encrypted Character and Numerical Data in Database

Zheng-Fei Wang; Wei Wang; Baile Shi

Databases often hold private and sensitive data that needs to be protected from attack. Cryptographic support is a widely used and effective mechanism for reinforcing data security. However, there is a tradeoff between performance and security, because encryption and decryption greatly degrade query performance. To address this problem, this paper proposes a novel approach that can quickly execute SQL queries on encrypted data. Character data is encrypted and also turned into characteristic values via a characteristic function, which are stored as additional fields. Numerical data is encrypted, and a B+ tree index is created on it before encryption in order to preserve the ordering of each record in the index. Furthermore, we give algorithms for querying the encrypted data based on the storage models. Results from sets of experiments validate the functionality and usability of our approach.
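The idea of storing a characteristic value next to each encrypted string can be sketched as follows. The abstract does not define the characteristic function, so the one used here (one bit per letter present in the text) is an assumption, as are the helper names and the key. The XOR routine is an insecure placeholder standing in for a real cipher. The sketch shows how a substring query can filter rows on the extra field first and decrypt only the surviving candidates.

```python
from itertools import cycle

KEY = b"demo-key"  # hypothetical key; XOR is an insecure stand-in for a real cipher


def toy_encrypt(text: str) -> bytes:
    return bytes(b ^ k for b, k in zip(text.encode("utf-8"), cycle(KEY)))


def toy_decrypt(blob: bytes) -> str:
    return bytes(b ^ k for b, k in zip(blob, cycle(KEY))).decode("utf-8")


def characteristic(text: str) -> int:
    """Assumed characteristic function: one bit per letter a-z present in the text."""
    mask = 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def store(names):
    """Each row keeps the ciphertext plus the characteristic value as an extra field."""
    return [(toy_encrypt(n), characteristic(n)) for n in names]


def query_contains(rows, pattern: str):
    """Filter on the characteristic value first, then decrypt only the candidates."""
    need = characteristic(pattern)
    hits = []
    for blob, mask in rows:
        if mask & need == need:  # cheap filter; may admit false positives
            plain = toy_decrypt(blob)
            if pattern.lower() in plain.lower():  # exact check after decryption
                hits.append(plain)
    return hits


rows = store(["Alice", "Carol", "Lacie", "Bob"])
print(query_contains(rows, "lic"))
```

Here "Carol" and "Bob" are rejected without decryption, while "Lacie" passes the filter (it contains l, i, and c) and is discarded only after the exact check. Any such extra field reveals some information about the plaintext, so the choice of characteristic function is itself a security tradeoff.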

international conference on data mining | 2005

On reducing classifier granularity in mining concept-drifting data streams

Peng Wang; Haixun Wang; Xiaochen Wu; Wei Wang; Baile Shi

Many applications use classification models on streaming data to detect actionable alerts. Due to concept drifts in the underlying data, how to maintain a model's up-to-dateness has become one of the most challenging tasks in mining data streams. State-of-the-art approaches, including both the incrementally updated classifiers and the ensemble classifiers, have proved that model update is a very costly process. In this paper, we introduce the concept of model granularity. We show that reducing model granularity will reduce model update cost. Indeed, models of fine granularity enable us to efficiently pinpoint local components in the model that are affected by the concept drift. It also enables us to derive new components that can easily integrate with the model to reflect the current data distribution, thus avoiding expensive updates on a global scale. Experiments on real and synthetic data show that our approach is able to maintain good prediction accuracy at a fraction of the model updating cost of state-of-the-art approaches.
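The argument that fine-grained models make updates local can be illustrated with a toy sketch. It is not the classifier from the paper: the class `CellClassifier`, its grid-cell components over a single numeric feature, and the `replace_cell` method are hypothetical names chosen for illustration. The model is a collection of small per-cell components, so a drift that affects one region is handled by rebuilding only that component.

```python
from collections import Counter, defaultdict


class CellClassifier:
    """Hypothetical fine-grained model: one small class counter per grid cell."""

    def __init__(self, cell_width=1.0):
        self.cell_width = cell_width
        self.cells = defaultdict(Counter)

    def _cell(self, x):
        return int(x // self.cell_width)

    def learn(self, x, label):
        self.cells[self._cell(x)][label] += 1

    def predict(self, x, default=None):
        counts = self.cells.get(self._cell(x))
        if not counts:
            return default
        return counts.most_common(1)[0][0]

    def replace_cell(self, x, recent):
        """Rebuild only the component covering x from recent (value, label) pairs."""
        cell = self._cell(x)
        self.cells[cell] = Counter(
            label for value, label in recent if self._cell(value) == cell
        )


model = CellClassifier()
for x, y in [(0.2, "normal"), (0.7, "normal"), (1.3, "alert"), (1.8, "alert")]:
    model.learn(x, y)

# Drift: values in cell 1 are now "normal"; only that component is rebuilt.
model.replace_cell(1.5, recent=[(1.1, "normal"), (1.6, "normal"), (0.4, "alert")])
print(model.predict(1.4), model.predict(0.5))
```

The recent record at 0.4 falls outside the rebuilt cell and is ignored, so the component for cell 0 is untouched. A coarse, monolithic model would instead have to be retrained as a whole when the distribution changes.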

Journal of Applied Physics | 2002

A two-dimensional nonlinear photonic crystal for strong second harmonic generation

Baile Shi; Z. M. Jiang; X. Zhou; Xuejuan Wang

Detailed numerical analysis and computer simulation of a two-dimensional defective photonic crystal structure fabricated with nonlinear optical materials are carried out. The localized states in the band gap and the electric field distributions of such a structure were calculated by the finite-difference time-domain method, and demonstrated greatly enhanced second harmonic generation, with an efficiency about 4 orders of magnitude higher than that in an ordinary nonlinear crystal.

pacific-asia conference on knowledge discovery and data mining | 2011

Permutation anonymization: improving anatomy for privacy preservation in data publication

Xianmang He; Yanghua Xiao; Yujia Li; Qing Wang; Wei Wang; Baile Shi

Anatomy is a popular technique for privacy preservation in data publication. However, anatomy is fragile under background knowledge attacks and can only be applied to a limited range of applications. To overcome these drawbacks, we develop an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection while retaining significantly more information in the microdata. We present the details of the technique and build its underlying theory. Extensive experiments on real data are conducted, showing that our technique allows highly effective data analysis while offering strong privacy guarantees.
