Ming-Syan Chen | Researchain

Publication

Featured research published by Ming-Syan Chen.

IEEE Transactions on Knowledge and Data Engineering | 1996

Data mining: an overview from a database perspective

Ming-Syan Chen; Jiawei Han; Philip S. Yu

Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.

International Conference on Management of Data | 1995

An effective hash-based algorithm for mining association rules

Jong Soo Park; Ming-Syan Chen; Philip S. Yu

In this paper, we examine the issue of mining association rules among items in a large database of sales transactions. The mining of association rules can be mapped into the problem of discovering large itemsets, where a large itemset is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by first constructing a candidate set of itemsets and then identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally this is done iteratively for each large k-itemset in increasing order of k, where a large k-itemset is a large itemset with k items. Determining large itemsets from a huge number of candidate large itemsets in early iterations is usually the dominating factor for the overall data mining performance. To address this issue, we propose an effective hash-based algorithm for the candidate set generation. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is smaller, by orders of magnitude, than that generated by previous methods, thus resolving the performance bottleneck. Note that the generation of smaller candidate sets enables us to effectively trim the transaction database size at a much earlier stage of the iterations, thereby reducing the computational cost for later iterations significantly. An extensive simulation study is conducted to evaluate the performance of the proposed algorithm.
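
The hash-based filtering idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bucket count, and the hash function `(a * 10 + b) % num_buckets` are assumptions chosen for illustration, and items are assumed to be small integers. During the first pass, every 2-subset of each transaction is hashed into a bucket; a pair of large items is kept as a candidate only if its bucket count reaches the support threshold.

```python
from collections import Counter
from itertools import combinations


def pair_bucket(pair, num_buckets):
    # Hypothetical hash function, chosen for illustration only.
    a, b = pair
    return (a * 10 + b) % num_buckets


def candidate_2_itemsets(transactions, min_support, num_buckets=7):
    """Return candidate 2-itemsets that survive the hash-bucket filter."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets

    # Single pass: count individual items and hash every 2-subset.
    for transaction in transactions:
        items = sorted(transaction)
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[pair_bucket(pair, num_buckets)] += 1

    # Large 1-itemsets: items that meet the support threshold.
    large_items = sorted(i for i, c in item_counts.items() if c >= min_support)

    # A pair can only be large if its bucket count meets the threshold,
    # so pairs falling into light buckets are pruned before counting.
    return [
        pair
        for pair in combinations(large_items, 2)
        if bucket_counts[pair_bucket(pair, num_buckets)] >= min_support
    ]


transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(candidate_2_itemsets(transactions, min_support=2))
```

In this toy database, pairing the four large items without the filter would give six candidates; the bucket counts remove two of them before any exact counting takes place.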

IEEE Transactions on Knowledge and Data Engineering | 1998

Efficient data mining for path traversal patterns

Ming-Syan Chen; Jong Soo Park; Philip S. Yu

The authors explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. The solution procedure consists of two steps. First, they derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, one can filter out the effect of some backward references, which are mainly made for ease of traveling, and concentrate on mining meaningful user access sequences. Second, they derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted.
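
The first step, converting a traversal log into maximal forward references, can be sketched as follows. This is a simplified illustration rather than the authors' algorithm as published: it assumes the log for one session is already available as an ordered sequence of visited nodes, and the function name is hypothetical. A visit to a node already on the current path is treated as a backward reference; the path built so far is emitted if the previous move was forward, and the path is then cut back to the revisited node.

```python
def maximal_forward_references(visits):
    """Split one session's visit sequence into maximal forward references."""
    path = []            # current forward path from the session's start
    references = []      # maximal forward references found so far
    moving_forward = False

    for node in visits:
        if node in path:
            # Backward reference: emit the path only if we were moving
            # forward, then truncate the path at the revisited node.
            if moving_forward:
                references.append(tuple(path))
            path = path[: path.index(node) + 1]
            moving_forward = False
        else:
            path.append(node)
            moving_forward = True

    # A session that ends on a forward move leaves one last reference.
    if moving_forward and path:
        references.append(tuple(path))
    return references


session = "ABCDCBEGHGWAOUOV"  # illustrative click sequence, one letter per page
for ref in maximal_forward_references(session):
    print("".join(ref))
```

The frequent traversal patterns of the second step would then be mined from the collection of such references gathered over many sessions.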

IEEE Transactions on Knowledge and Data Engineering | 1997

Using a hash-based method with transaction trimming for mining association rules

Jong Soo Park; Ming-Syan Chen; Philip S. Yu

We examine the issue of mining association rules among items in a large database of sales transactions. Mining association rules means, given a database of sales transactions, discovering all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. The mining of association rules can be mapped into the problem of discovering large itemsets, where a large itemset is a group of items that appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by first constructing a candidate set of itemsets and then identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally, this is done iteratively for each large k-itemset in increasing order of k, where a large k-itemset is a large itemset with k items. Determining large itemsets from a huge number of candidate sets in early iterations is usually the dominating factor for the overall data mining performance. To address this issue, we develop an effective algorithm for the candidate set generation. It is a hash-based algorithm and is especially effective for the generation of a candidate set for large 2-itemsets. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is smaller, by orders of magnitude, than that generated by previous methods, thus resolving the performance bottleneck. Note that the generation of smaller candidate sets enables us to effectively trim the transaction database size at a much earlier stage of the iterations, thereby reducing the computational cost for later iterations significantly. The advantage of the proposed algorithm also provides us the opportunity of reducing the amount of disk I/O required. An extensive simulation study is conducted to evaluate the performance of the proposed algorithm.

International Conference on Distributed Computing Systems | 1996

Data mining for path traversal patterns in a web environment

Ming-Syan Chen; Jong Soo Park; Philip S. Yu

In this paper, we explore a new data mining capability which involves mining path traversal patterns in a distributed information-providing environment like the World Wide Web. First, we convert the original sequence of log data into a set of maximal forward references and filter out the effect of some backward references, which are mainly made for ease of traveling. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences: one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed.

IEEE Transactions on Computers | 1987

Processor Allocation in an N-Cube Multiprocessor Using Gray Codes

Ming-Syan Chen; Kang G. Shin

The processor allocation problem in an n-dimensional hypercube (or an n-cube) multiprocessor is similar to the conventional memory allocation problem. The main objective in both problems is to maximize the utilization of available resources as well as minimize the inherent system fragmentation. A processor allocation strategy using the buddy system, called the buddy strategy, is discussed first and then a new allocation strategy using a Gray code (GC), called the GC strategy, is proposed. When processor relinquishment is not considered (i.e., static allocation), both of these strategies are proved to be optimal in the sense that each incoming request sequence is always assigned to a minimal subcube. It is also shown that the GC strategy outperforms the buddy strategy in detecting the availability of subcubes. Our results are extended further to implement an allocation strategy using more than one GC and derive the relationship between the GCs used and the corresponding ability of detecting the availability of various subcubes. The minimal number of GCs required for complete subcube recognition in a $Q_n$ is proved to be less than or equal to $\binom{n}{\lfloor n/2 \rfloor}$. Several processor allocation strategies in a $Q_5$ are implemented on the NCUBE/six multiprocessor at the University of Michigan, and their performance is experimentally measured.
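
The difference in subcube recognition between the two strategies can be illustrated with a small Python sketch. The abstract does not spell out the search rules, so the sketch rests on two assumptions: the Gray code is the binary reflected Gray code, and the GC strategy examines cyclic windows of $2^k$ consecutive codewords starting at every multiple of $2^{k-1}$, whereas the buddy strategy examines only aligned blocks of $2^k$ binary addresses. All function names are hypothetical.

```python
def gray(j):
    """Binary reflected Gray code of the integer j."""
    return j ^ (j >> 1)


def is_subcube(nodes):
    """True if the node addresses form a complete subcube."""
    nodes = set(nodes)
    base = next(iter(nodes))
    varying = 0
    for v in nodes:
        varying |= v ^ base  # collect every bit position that varies
    # 2**k distinct nodes over exactly k varying bits fill a k-subcube.
    return len(nodes) == 1 << bin(varying).count("1")


def buddy_regions(n, k):
    """Aligned blocks of 2**k binary addresses (assumed buddy search space)."""
    size = 1 << k
    return {frozenset(range(m * size, (m + 1) * size))
            for m in range(1 << (n - k))}


def gray_regions(n, k):
    """Cyclic windows of 2**k Gray codewords (assumed GC search space)."""
    total, size = 1 << n, 1 << k
    step = max(size // 2, 1)
    return {frozenset(gray((start + i) % total) for i in range(size))
            for start in range(0, total, step)}


n = 4
for k in range(1, n):
    buddy, gc = buddy_regions(n, k), gray_regions(n, k)
    assert all(is_subcube(r) for r in buddy | gc)
    print(f"k={k}: buddy recognizes {len(buddy)}, GC recognizes {len(gc)}")
```

Under these assumptions, every window examined is a genuine subcube, and for $1 \le k \le n-1$ the Gray-code windows cover twice as many distinct subcubes as the aligned blocks, which is one way to read the claim that the GC strategy detects availability better.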

Conference on Information and Knowledge Management | 1995

Efficient parallel data mining for association rules

Jong Soo Park; Ming-Syan Chen; Philip S. Yu

In this paper, we develop an algorithm, called PDM, to conduct parallel data mining for association rules. Consider a transaction as a collection of items; a large itemset is a set of items such that the number of transactions containing it exceeds a pre-specified threshold. PDM is so designed that the global set of large itemsets can be identified efficiently and the amount of inter-node data exchange required is minimized. Specifically, with a given database partition, each processing node will collect (count) information on each itemset from its local database efficiently via a hashing method. The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue-and-poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large itemsets can hence be determined based on the aggregate count of itemsets. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns.

International Conference on Data Engineering | 1996

Energy-efficient caching for wireless mobile computing

Kun Lung Wu; Philip S. Yu; Ming-Syan Chen

Caching can reduce the bandwidth requirement in a mobile computing environment. However, due to battery power limitations, a wireless mobile computer may often be forced to operate in a doze (or even totally disconnected) mode. As a result, the mobile computer may miss some cache invalidation reports broadcast by a server, forcing it to discard the entire cache contents after waking up. In this paper, we present an energy-efficient cache invalidation method, called GCORE (Grouping with COld update-set REtention), that allows a mobile computer to operate in a disconnected mode to save the battery while still retaining most of the caching benefits after a reconnection. We present an efficient implementation of GCORE and conduct simulations to evaluate its caching effectiveness. The results show that GCORE can substantially improve mobile caching by reducing the communication bandwidth (or energy consumption) for query processing.

IEEE Transactions on Parallel and Distributed Systems | 1990

Depth-first search approach for fault-tolerant routing in hypercube multicomputers

Ming-Syan Chen; Kang G. Shin

Using depth-first search, the authors develop and analyze the performance of a routing scheme for hypercube multicomputers in the presence of an arbitrary number of faulty components. They derive an exact expression for the probability of routing messages by way of optimal paths (of length equal to the Hamming distance between the corresponding pair of nodes) from the source node to an obstructed node. The obstructed node is defined as the first node encountered by the message that finds no optimal path to the destination node. It is noted that the probability of routing messages over an optimal path between any two nodes is a special case of the present results and can be obtained by replacing the obstructed node with the destination node. Numerical examples are given to illustrate the results, and they show that, in the presence of component failures, depth-first search routing can route a message to its destination by means of an optimal path with a very high probability.
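
A generic depth-first routing procedure on a hypercube with faulty nodes can be sketched in Python as below. This is an illustration of the general idea only, not the routing scheme analyzed in the paper: it models node faults but not link faults, assumes the message carries the list of nodes it has visited, and uses a hypothetical function name. At each node the message first tries dimensions in which the current address still differs from the destination, so it follows an optimal (Hamming-distance) path whenever one is open, and it backtracks when every neighbor is faulty or already visited.

```python
def dfs_route(n, source, dest, faulty):
    """Route from source to dest in an n-cube, avoiding faulty nodes.

    Returns the list of nodes on the route, or None if dest is unreachable.
    """
    faulty = set(faulty)
    if source in faulty or dest in faulty:
        return None

    path = [source]
    visited = {source}
    while path:
        node = path[-1]
        if node == dest:
            return path
        diff = node ^ dest
        # Preferred dimensions (those that reduce the Hamming distance)
        # sort first; the remaining dimensions serve as detours.
        dims = sorted(range(n), key=lambda d: not (diff >> d) & 1)
        for d in dims:
            neighbor = node ^ (1 << d)
            if neighbor not in faulty and neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                break
        else:
            path.pop()  # dead end: backtrack to the previous node
    return None


print(dfs_route(3, 0b000, 0b111, faulty={0b001, 0b011}))
```

In the example, two faulty nodes block the first preferred neighbor, yet the message still reaches the destination in three hops, which equals the Hamming distance between the endpoints.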

IEEE Transactions on Computers | 1990

Addressing, routing, and broadcasting in hexagonal mesh multiprocessors

Ming-Syan Chen; Kang G. Shin; Dilip D. Kandlur

A family of six-regular graphs, called hexagonal meshes or H-meshes, is considered as a multiprocessor interconnection network. Processing nodes on the periphery of an H-mesh are first wrapped around to achieve regularity and homogeneity. The diameter of a wrapped H-mesh is shown to be of $O(p^{1/2})$, where p is the number of nodes in the H-mesh. An elegant, distributed routing scheme is developed for wrapped H-meshes so that each node in an H-mesh can compute shortest paths from itself to any other node with a straightforward algorithm of O(1) using the addresses of the source-destination pair only, i.e., independent of the network's size. This is in sharp contrast with those previously known algorithms that rely on using routing tables. Furthermore, the authors also develop an efficient point-to-point broadcasting algorithm for the H-meshes which is proved to be optimal in the number of required communication steps. The wrapped H-meshes are compared against some other existing multiprocessor interconnection networks, such as hypercubes, trees, and square meshes. The comparison reinforces the attractiveness of the H-mesh architecture.