Bi-Ru Dai

National Taiwan University of Science and Technology

Publication

Featured researches published by Bi-Ru Dai.

IEEE Transactions on Knowledge and Data Engineering | 2006

Adaptive Clustering for Multiple Evolving Streams

Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen

In the data stream environment, the patterns generated at different time instances are different due to data evolution. As time progresses, the behavior and members of clusters usually change. Hence, clustering continuous data streams allows us to observe the changes of group behavior. In order to support flexible clustering requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two advantageous features, namely, one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. The COD framework consists of two phases, i.e., the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve approximations of desired substreams from summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results, at the cost of slightly more time and twice the storage space of the wavelet-based one.
An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model for building the summary hierarchy. With the adaptive COD, we can obtain clustering results of almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and also validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.
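
As a rough illustration of what a multiresolution summary of a stream can look like, the sketch below builds a Haar-style hierarchy of pairwise averages over one window of values. It is not the COD data structure from the paper: the function names `haar_hierarchy` and `approximate` are hypothetical, the window length is assumed to be a power of two, and the hierarchy is built in one batch rather than maintained online.

```python
from typing import List


def haar_hierarchy(window: List[float]) -> List[List[float]]:
    """Return levels of pairwise averages, finest level first.

    Level 0 is the raw window; each further level halves the resolution.
    Assumption for this sketch: len(window) is a power of two.
    """
    n = len(window)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a power of two")
    levels = [list(window)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Average each adjacent pair to get the next, coarser level.
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def approximate(levels: List[List[float]], level: int) -> List[float]:
    """Expand a coarse level back to raw length by repeating each average."""
    coarse = levels[level]
    repeat = len(levels[0]) // len(coarse)
    return [v for v in coarse for _ in range(repeat)]


levels = haar_hierarchy([1.0, 3.0, 5.0, 7.0])
print(levels)                  # [[1.0, 3.0, 5.0, 7.0], [2.0, 6.0], [4.0]]
print(approximate(levels, 1))  # [2.0, 2.0, 6.0, 6.0]
```

A clustering query over a long time span could then compare streams at a coarse level, while a query over a short span could use a finer one.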

International Conference on Cloud Computing | 2012

Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition

Bi-Ru Dai; I-Chang Lin

DBSCAN is a well-known algorithm for density-based clustering because it can identify groups of arbitrary shapes and deal with noisy datasets. However, with the increasing amount of data, the DBSCAN algorithm running on a single machine faces a scalability problem. In this paper, we propose a Map/Reduce-based DBSCAN algorithm called DBSCAN-MR to solve the scalability problem. In DBSCAN-MR, the input dataset is partitioned into smaller parts and then processed in parallel on the Hadoop platform. However, the choice of partition mechanism affects the execution efficiency and load balance of each node. Therefore, we propose a method, partition with reduce boundary points (PRBP), to select partition boundaries based on the distribution of data points. Our experimental results show that DBSCAN-MR with the design of PRBP has higher efficiency and scalability than competitors.
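
The idea of choosing a partition boundary from the data distribution can be sketched as follows. This is only an illustration under simplifying assumptions, not the PRBP method itself: points are 2-D, a boundary is a vertical line `x = split`, candidate split positions are supplied by the caller, and the helper names (`boundary_count`, `choose_split`, `partition`) are hypothetical. Points within `eps` of the line are treated as boundary points and copied to both sides so that neighborhoods crossing the line can be resolved later.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def boundary_count(points: Sequence[Point], split: float, eps: float) -> int:
    """Number of points within eps of the vertical line x = split."""
    return sum(1 for x, _ in points if abs(x - split) <= eps)


def choose_split(points: Sequence[Point], candidates: Sequence[float], eps: float) -> float:
    """Pick the candidate with the fewest boundary points (first one on ties)."""
    return min(candidates, key=lambda s: boundary_count(points, s, eps))


def partition(points: Sequence[Point], split: float, eps: float) -> Tuple[List[Point], List[Point]]:
    """Split into left and right parts; boundary points are placed in both."""
    left = [p for p in points if p[0] <= split + eps]
    right = [p for p in points if p[0] >= split - eps]
    return left, right


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
split = choose_split(pts, candidates=[1.5, 5.0, 7.0], eps=1.0)
left, right = partition(pts, split, eps=1.0)
print(split, len(left), len(right))  # 7.0 4 2
```

This sketch minimizes only the number of duplicated boundary points. A practical scheme would also have to weigh the load balance between partitions, which the abstract names as a second concern.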

IEEE Transactions on Knowledge and Data Engineering | 2007

Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh; Bi-Ru Dai; Ming-Syan Chen

In applications of multiple data streams such as stock market trading and sensor network data analysis, the clusters of streams change at different times because of data evolution. The information about evolving clusters is valuable to support corresponding online decisions. In this paper, we present a framework for clustering over multiple evolving streams by correlations and events, which, abbreviated as COMET-CORE, monitors the distribution of clusters over multiple data streams based on their correlation. Instead of directly clustering the multiple data streams periodically, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. The incoming streams are smoothed as sequences of end points by employing piecewise linear approximation. At the time when end points are generated, weighted correlations between streams are updated. End points are good indicators of significant change in streams, and this is a main cause of a cluster evolution event. When an event occurs, through split and merge operations we can report the latest clustering results. As shown in our experimental studies, COMET-CORE can be performed effectively with good clustering quality.
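
To make the notion of end points concrete, here is a minimal sketch of one common way to compute a piecewise linear approximation: a greedy pass that closes the current segment as soon as some sample deviates from the straight line by more than a tolerance. The abstract does not say which approximation procedure COMET-CORE uses, so the greedy rule, the function name `end_points`, and the parameter `tol` are all assumptions made for illustration.

```python
from typing import List, Sequence


def end_points(stream: Sequence[float], tol: float) -> List[int]:
    """Indices of segment end points from a greedy piecewise linear fit.

    A segment starting at the last end point is extended while every
    interior sample stays within tol of the line to the newest sample.
    """
    if len(stream) < 2:
        return list(range(len(stream)))
    ends = [0]
    start = 0
    for j in range(2, len(stream)):
        # Line through (start, stream[start]) and (j, stream[j]).
        slope = (stream[j] - stream[start]) / (j - start)
        worst = max(
            abs(stream[start] + slope * (i - start) - stream[i])
            for i in range(start + 1, j)
        )
        if worst > tol:
            # The previous sample closes the segment and opens the next one.
            ends.append(j - 1)
            start = j - 1
    ends.append(len(stream) - 1)
    return ends


print(end_points([0, 1, 2, 3, 2, 1, 0], tol=0.1))  # [0, 3, 6]
```

In the example the only interior end point is index 3, where the stream turns from rising to falling. A change of that kind is what could trigger a correlation update between streams.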

Mobile Data Management | 2013

Opinion Mining on Social Media Data

Po-Wei Liang; Bi-Ru Dai

Microblogging (Twitter or Facebook) has become a very popular communication tool among Internet users in recent years. Information is generated and managed through either computers or mobile devices by one person and is consumed by many other persons, with most of this user-generated content being textual information. As there is a lot of raw data from people posting real-time messages about their opinions on a variety of topics in daily life, it is a worthwhile research endeavor to collect and analyze these data, which may help users or managers make informed decisions. However, this problem is challenging because a microblog post is usually very short and colloquial, and traditional opinion mining algorithms do not work well on this type of text. Therefore, in this paper, we propose a new system architecture that can automatically analyze the sentiments of these messages. We combine this system with manually annotated data from Twitter, one of the most popular microblogging platforms, for the task of sentiment analysis. In this system, machines can learn how to automatically extract the set of messages which contain opinions, filter out non-opinion messages, and determine their sentiment directions (i.e., positive or negative). Experimental results verify the effectiveness of our system on sentiment analysis in real microblogging applications.

International Conference on Data Mining | 2004

Clustering on demand for multiple data streams

Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen

In the data stream environment, the patterns generated by the mining techniques are usually distinct at different times because of the evolution of data. In order to deal with various types of multiple data streams and to support flexible mining requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major features, namely one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multiresolution approximations of data streams, flexible clustering demands can be supported.

ACM Transactions on Knowledge Discovery From Data | 2007

Twain: Two-end association miner with precise frequent exhibition periods

Jen Wei Huang; Bi-Ru Dai; Ming-Syan Chen

We investigate the general model of mining associations in a temporal database, where the exhibition periods of items are allowed to differ from one another. The database is divided into partitions according to the time granularity imposed. Such temporal association rules allow us to observe short-term but interesting patterns that are absent when the whole range of the database is evaluated altogether. Prior work may omit some temporal association rules and thus has limited practicability. To remedy this and to give more precise frequent exhibition periods of frequent temporal itemsets, we devise an efficient algorithm, Twain (standing for TWo end AssocIation miNer). Twain not only generates frequent patterns with more precise frequent exhibition periods, but also discovers more interesting frequent patterns. Twain employs the Start time and End time of each item to provide a precise frequent exhibition period while progressively handling itemsets from one partition to another. Along with one scan of the database, Twain can generate frequent 2-itemsets directly according to the cumulative filtering threshold. Then, Twain adopts the scan reduction technique to generate all frequent k-itemsets (k > 2) from the generated frequent 2-itemsets. Theoretical properties of Twain are derived as well in this article. The experimental results show that Twain outperforms prior work in the quality of frequent patterns, execution time, I/O cost, CPU overhead, and scalability.

Knowledge Discovery and Data Mining | 2007

Incremental clustering in geography and optimization spaces

Chih-Hua Tai; Bi-Ru Dai; Ming-Syan Chen

Spatial clustering has been identified as an important technique in data mining owing to its various applications. In the conventional spatial clustering methods, data points are clustered mainly according to their geographic attributes. In real applications, however, the obtained data points consist of not only geographic attributes but also non-geographic ones. In general, geographic attributes indicate the data locations and non-geographic attributes show the characteristics of data points. It is thus infeasible, by using conventional spatial clustering methods, to partition the geographic space such that similar data points are grouped together. In this paper, we propose an effective and efficient algorithm, named incremental clustering toward the Bound INformation of Geography and Optimization spaces, abbreviated as BINGO, to solve the problem. The proposed BINGO algorithm combines the information in both geographic and non-geographic attributes by constructing a summary structure and possesses incremental clustering capability by appropriately adjusting this structure. Furthermore, most parameters in algorithm BINGO are determined automatically, so that it is easy to apply without resorting to extra knowledge. Experiments on synthetic data are performed to validate the effectiveness and the efficiency of algorithm BINGO.

Advances in Social Networks Analysis and Mining | 2011

A Framework of Recommendation System Based on Both Network Structure and Messages

Bi-Ru Dai; Chang-Yi Lee; Chih-Heng Chung

The evolution of Internet technology allows people to communicate even when they are far away from each other. More and more people share information and exchange their thoughts via website communities and become friends. A larger community usually attracts more users; therefore, how to foster the growth of a social network on a website is an important issue for the survival of that website. In this paper, we incorporate social network features into the recommendation system. In addition to messages between nodes, the features of network structure are taken into consideration. Experimental results show that the recommendation accuracy of our method is higher than that of the existing method, which is based on the message ratio.

Knowledge Discovery and Data Mining | 2011

An instance selection algorithm based on reverse nearest neighbor

Bi-Ru Dai; Shu-Ming Hsu

Data reduction extracts a subset from a dataset. Its advantages are lower storage requirements and more efficient classification. Using the subset as training data can maintain classification accuracy; sometimes accuracy even improves because noise is eliminated. The key is how to choose representative samples while ignoring noise at the same time. Many instance selection algorithms are based on the nearest neighbor decision rule (NN). These algorithms commonly follow one of two strategies, incremental or decremental. Incremental algorithms select some instances as samples and iteratively add to the sample set those instances whose class label differs from that of their nearest sample. Decremental algorithms remove instances whose class label differs from that of the majority of their k nearest neighbors (kNN). In contrast, we propose an algorithm based on Reverse Nearest Neighbor (RNN), called the Reverse Nearest Neighbor Reduction (RNNR). RNNR selects samples which can represent other instances in the same class. In addition, RNNR does not need to scan a dataset iteratively, which takes much processing time. Experimental results show that RNNR achieves comparable accuracy and selects fewer samples than comparators.
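
The sketch below shows one way reverse nearest neighbor sets could drive instance selection. It is an assumed, simplified procedure and not necessarily the RNNR algorithm: nearest neighbors are searched only within the same class, instances are visited in order of decreasing reverse-nearest-neighbor set size, and a kept instance is taken to represent every instance in its set. The names `reverse_nn_sets` and `select_instances` are hypothetical.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Set, Tuple


def reverse_nn_sets(X: Sequence[Tuple[float, ...]], y: Sequence[int]) -> Dict[int, Set[int]]:
    """Map each index to the same-class indices that have it as nearest neighbor."""
    rnn: Dict[int, Set[int]] = defaultdict(set)
    for i, (xi, yi) in enumerate(zip(X, y)):
        same = [j for j in range(len(X)) if j != i and y[j] == yi]
        if same:
            nearest = min(same, key=lambda j: dist(xi, X[j]))
            rnn[nearest].add(i)
    return rnn


def select_instances(X: Sequence[Tuple[float, ...]], y: Sequence[int]) -> List[int]:
    """Keep instances with large reverse-NN sets; skip those already represented."""
    rnn = reverse_nn_sets(X, y)
    # Largest reverse-NN set first; the index breaks ties deterministically.
    order = sorted(range(len(X)), key=lambda i: (-len(rnn[i]), i))
    covered: Set[int] = set()
    kept: List[int] = []
    for i in order:
        if i not in covered:
            kept.append(i)
            covered.add(i)
            covered |= rnn[i]
    return sorted(kept)


X = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1)]
y = [0, 0, 0, 1, 1]
print(select_instances(X, y))  # [1, 3]
```

Here instance 1 is the nearest same-class neighbor of both 0 and 2, so it alone represents class 0, and instance 3 represents class 1. The selection takes a single pass over the sorted instances rather than repeated scans of the dataset.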

Very Large Data Bases | 2007

Constrained data clustering by depth control and progressive constraint relaxation

Bi-Ru Dai; Cheng-Ru Lin; Ming-Syan Chen

In order to import domain knowledge or application-dependent parameters into data mining systems, constraint-based mining has attracted a lot of research attention recently. In this paper, the attributes employed to model the constraints are called constraint attributes and those attributes involved in the objective function to be optimized are called optimization attributes. The constrained clustering considered in this paper is conducted in such a way that the objective function of optimization attributes is optimized subject to the condition that the imposed constraint is satisfied. Explicitly, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. This numerical constrained clustering problem, however, cannot be dealt with by any conventional clustering algorithm. Consequently, we devise several effective and efficient algorithms to solve such a clustering problem. It is noted that due to the intrinsic nature of the numerical constrained clustering, there is an order dependency in the process of attaining the clustering, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall performance of clustering results. Explicitly, by using a smaller (tighter) constraint range in earlier iterations of merge, we will have more room to relax the constraint and seek better solutions in subsequent iterations. It is empirically shown that the progressive constraint relaxation technique is able to improve not only the execution efficiency but also the clustering quality.
