En Tzu Wang

Industrial Technology Research Institute

Publication

Featured research published by En Tzu Wang.

Data Mining and Knowledge Discovery | 2009

A novel hash-based approach for mining frequent itemsets over data streams requiring less memory space

En Tzu Wang; Arbee L. P. Chen

In recent times, data are generated in the form of continuous data streams in many applications. Since handling data streams is necessary and discovering the knowledge behind them can often yield substantial benefits, mining over data streams has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches often consist of two procedures: continuously maintaining synopses for data streams and finding frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be saved in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may significantly degrade memory utilization. In this paper, we consider compressing the information of all the itemsets into a structure with a fixed size using a hash-based technique. This hash-based approach summarizes the information of the whole data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets, and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented, and a series of experiments is performed to show the effectiveness and the efficiency of this approach.
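The idea of squeezing all itemset counts into a fixed-size hash table can be illustrated with a minimal sketch. This is not the estimation technique of the paper, which the abstract does not spell out; it is a plain shared-counter table, and the class name, `num_buckets`, and `max_itemset_size` are assumptions made for illustration.

```python
from itertools import combinations


class HashedItemsetCounter:
    """Fixed-size table of counters; itemsets that collide share one counter."""

    def __init__(self, num_buckets, max_itemset_size=2):
        self.counts = [0] * num_buckets          # memory use is fixed up front
        self.max_itemset_size = max_itemset_size
        self.transactions_seen = 0

    def _bucket(self, itemset):
        # Sort first so the same itemset always maps to the same bucket.
        return hash(tuple(sorted(itemset))) % len(self.counts)

    def add_transaction(self, items):
        """Count every itemset of up to max_itemset_size items in one transaction."""
        self.transactions_seen += 1
        unique = sorted(set(items))
        for size in range(1, self.max_itemset_size + 1):
            for itemset in combinations(unique, size):
                self.counts[self._bucket(itemset)] += 1

    def estimated_support(self, itemset):
        # Collisions can only inflate a counter, so this never underestimates.
        return self.counts[self._bucket(itemset)]
```

Because colliding itemsets share a counter, the estimate is an upper bound on the true support count; a smaller table saves memory at the cost of looser estimates.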

Database and Expert Systems Applications | 2010

Continuous probabilistic skyline queries over uncertain data streams

Hui Zhu Su; En Tzu Wang; Arbee L. P. Chen

Recently, some approaches for finding probabilistic skylines on uncertain data have been proposed. In these approaches, a data object is composed of instances, each associated with a probability. The probabilistic skyline is then defined as a set of non-dominated objects with probabilities exceeding or equaling a given threshold. In many applications, data are generated in the form of continuous data streams. Accordingly, we make the first attempt to study the problem of continuously returning probabilistic skylines over uncertain data streams in this paper. Moreover, the sliding window model over data streams is considered here. To avoid recomputing the probability of not being dominated for each uncertain object according to the instances contained in the current window, our main idea is to estimate the bounds of these probabilities to determine early which objects can be pruned or returned as results. We first propose a basic algorithm adapted from an existing approach of answering skyline queries on static and certain data, which updates these bounds by repeatedly processing instances of each object. Then, we design a novel data structure to keep the dominance relation between some instances for rapidly tightening these bounds, and propose a progressive algorithm based on this new structure. Moreover, these two algorithms are also adapted to solve the problem of continuously maintaining top-k probabilistic skylines. Finally, a set of experiments is performed to evaluate these algorithms, and the experiment results reveal that the progressive algorithm substantially outperforms the basic one, directly demonstrating the effectiveness of our newly designed structure.
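The definition of a probabilistic skyline can be made concrete with a small sketch that recomputes every probability from scratch, which is the costly step the bounds described above are meant to avoid. It assumes that smaller values are preferred in every dimension, that objects are independent, and that the instance probabilities of one object sum to at most 1; the function names are illustrative and the sliding window is left out.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(name, objects):
    """Probability that the named object is dominated by no other object."""
    total = 0.0
    for point, prob in objects[name]:
        not_dominated = 1.0
        for other, instances in objects.items():
            if other == name:
                continue
            # Probability mass of the other object's instances that dominate this point.
            dominating_mass = sum(p for q, p in instances if dominates(q, point))
            not_dominated *= 1.0 - dominating_mass
        total += prob * not_dominated
    return total


def probabilistic_skyline(objects, threshold):
    """Names of the objects whose skyline probability reaches the threshold."""
    return sorted(n for n in objects if skyline_probability(n, objects) >= threshold)
```

Here `objects` maps an object name to a list of `(point, probability)` pairs, for example `{"A": [((1, 1), 0.5), ((4, 4), 0.5)], "B": [((2, 2), 1.0)]}`.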

Data Mining and Knowledge Discovery | 2011

Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis

En Tzu Wang; Arbee L. P. Chen

Mining frequent itemsets over data streams has attracted much research attention in recent years. In the past, we developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed at distinct remote sites. To speed up the mining process, we make the first attempt to address a new problem of continuously maintaining a global synopsis for the union of all the distributed streams. The mining results can therefore be yielded on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data in a central server, which may waste the computation resources of remote sites, distributed computations over the data streams are performed. A distributed computation framework is proposed in this paper, including two communication strategies and one merging operation. These communication strategies are designed according to an accuracy guarantee of the mining results, determining when and what the remote sites should transmit to the central server (named coordinator). The merging operation, in turn, is exploited to merge the information received from the remote sites into the global synopsis maintained at the coordinator. With these strategies and this operation, the goal of continuously maintaining the global synopsis can be achieved. Based on the continuously maintained global synopsis, we propose a mining algorithm for finding global frequent itemsets. Moreover, the correctness guarantees of the communication strategies and merging operation, and the accuracy guarantee analysis of the mining algorithm are provided. Finally, a series of experiments on synthetic datasets and a real dataset is performed to show the effectiveness and efficiency of the distributed computation framework.

Information Systems | 2014

Top-n query processing in spatial databases considering bi-chromatic reverse k-nearest neighbors

Cha-Lun Li; En Tzu Wang; Guo-Jhu Huang; Arbee L. P. Chen

A reverse k-nearest neighbor (RkNN) query retrieves the data points which regard the query point as one of their respective k nearest neighbors. A bi-chromatic reverse k-nearest neighbor (BRkNN) query is a variant of the RkNN query, considering two types of data. Given two types of data G and C, a BRkNN query regarding a data point q in G retrieves the data points from C that regard q as one of their respective k-nearest neighbors among the data points in G. Many existing approaches answer either the RkNN query or the BRkNN query. Different from these approaches, in this paper, we make the first attempt to propose a top-n query based on the concept of BRkNN queries, which ranks the data points in G and retrieves the top-n points according to the cardinalities of the corresponding BRkNN answer sets. For efficiently answering this top-n query, we construct the Voronoi Diagram of G to index the data points in G and C. From the information associated with the Voronoi Diagram of G, the upper bound of the cardinality of the BRkNN answer sets for each data point in G can be quickly computed. Moreover, based on an existing approach to answering the RkNN query and the characteristics of the Voronoi Diagram of G, we propose a method to find the candidate region regarding a BRkNN query, which tightens the corresponding search space. Finally, based on the triangle inequality, we propose an efficient refinement algorithm for finding the exact BRkNN answers from the candidate regions. To evaluate our approach on answering the top-n query, it is compared with an approach which applies a state-of-the-art algorithm for answering the BRkNN query to each data point in G. The experiment results reveal that our approach has a much better performance.
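The top-n query defined above can be pinned down with a brute-force sketch that computes each BRkNN answer set's cardinality directly. It uses none of the Voronoi-diagram or triangle-inequality pruning described in the abstract; it only shows what is being ranked. The function names are assumptions, and distance ties are broken by index for simplicity.

```python
import math


def brknn_counts(G, C, k):
    """For each point in G, count the points of C that rank it among their k nearest in G."""
    counts = [0] * len(G)
    for c in C:
        # Indices of the k points of G nearest to c (ties broken by index).
        nearest = sorted(range(len(G)), key=lambda i: math.dist(c, G[i]))[:k]
        for i in nearest:
            counts[i] += 1
    return counts


def top_n_by_brknn(G, C, k, n):
    """The n points of G with the largest BRkNN answer sets, with their cardinalities."""
    counts = brknn_counts(G, C, k)
    ranked = sorted(range(len(G)), key=lambda i: (-counts[i], i))
    return [(G[i], counts[i]) for i in ranked[:n]]
```

With `G = [(0, 0), (10, 0), (5, 5)]`, `C = [(1, 0), (2, 0), (9, 0), (0, 1)]` and `k = 1`, three points of C are closest to `(0, 0)` and one to `(10, 0)`, so `top_n_by_brknn(G, C, 1, 2)` ranks those two points first.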

International Conference on Big Data | 2016

Mining User Trajectories from Smartphone Data Considering Data Uncertainty

Yu-Chi Chen; En Tzu Wang; Arbee L. P. Chen

The number of Wi-Fi hot spots has increased quickly in recent years. Accordingly, discovering user positions by using Wi-Fi fingerprints has attracted much research attention. Wi-Fi fingerprints are the sets of Wi-Fi scanning results recorded in mobile devices. However, the issue of data uncertainty is not considered in the proposed Wi-Fi positioning systems. In this paper, we propose a framework to find user trajectories from the Wi-Fi fingerprints recorded in smartphones. In this framework, we first discover meaningful places with the proposed Wi-Fi distance metric. Second, we propose two similarity functions to recognize the places and use the proposed uncertain data models to show the probabilities of the places where a user stayed. Finally, an algorithm for probabilistic sequential pattern mining is used for finding user trajectories. A series of experiments is performed to evaluate each step of the framework. The experiment results reveal that each step of our framework achieves high accuracy.

ACM Symposium on Applied Computing | 2014

Finding targets with the nearest favor neighbor and farthest disfavor neighbor by a skyline query

Yi-Wen Lin; En Tzu Wang; Chieh-Feng Chiang; Arbee L. P. Chen

Finding the nearest neighbors and finding the farthest neighbors are fundamental problems in spatial databases. Consider two sets of data points in a two-dimensional data space, which represent a set of favor locations F, such as libraries and schools, and a set of disfavor locations D, such as dumps and gambling houses. Given another set of data points C in this space as houses for rent, one who needs to rent a house may need a recommendation which takes into account the favor and disfavor locations. To solve this problem, a new two-dimensional data space is employed, in which dimension X describes the distance from a data point c in C to its nearest neighbor in D and dimension Y describes the distance from c to its farthest neighbor in F. Notice that the larger value is preferred in dimension X while the smaller value is preferred in dimension Y. Following the above dominance rule, the recommendation for house renting can be achieved by a skyline query. A naïve method for processing this query is 1) to find the nearest neighbor from D and the farthest neighbor from F for each data point in C and then 2) to construct a new two-dimensional data space based on the results from 1) and to apply any of the existing skyline algorithms to get the answer. In this paper, based on the quad-tree index, we propose an efficient algorithm to answer this query by combining the above two steps. A series of experiments with synthetic data and real data is performed to evaluate this approach, and the experiment results demonstrate the efficiency of the approach.
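The naïve two-step method described above can be sketched as follows. This is the baseline, not the quad-tree algorithm proposed in the paper; the function name is an assumption, and the skyline step is a simple quadratic scan standing in for "any of the existing skyline algorithms".

```python
import math


def naive_favor_skyline(C, F, D):
    """Candidates in C not dominated under (larger X preferred, smaller Y preferred)."""
    # Step 1: map each candidate c to the new two-dimensional space (X, Y).
    mapped = []
    for c in C:
        x = min(math.dist(c, d) for d in D)  # distance to the nearest disfavor location
        y = max(math.dist(c, f) for f in F)  # distance to the farthest favor location
        mapped.append((c, x, y))

    # Step 2: keep the candidates that no other candidate dominates.
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [a[0] for a in mapped if not any(dominates(b, a) for b in mapped)]
```

For instance, with one favor location at `(0, 0)` and one disfavor location at `(10, 0)`, a house at `(1, 0)` maps to `(X, Y) = (9, 1)` and dominates a house at `(5, 0)`, which maps to `(5, 5)`.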

Data Warehousing and Knowledge Discovery | 2008

Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Tung-Ying Lee; En Tzu Wang; Arbee L. P. Chen

The problem of discovering episode rules from static databases has been studied for years due to its wide applications in prediction. In this paper, we make the first attempt to study a special episode rule, named serial episode rule with a time lag, in an environment of multiple data streams. This rule can be widely used in different applications, such as traffic monitoring over multiple car passing streams in highways. Mining serial episode rules over the data stream environment is a challenge due to the high data arrival rates and the infinite length of the data streams. In this paper, we propose two methods considering different criteria on space utilization and precision to solve the problem by using a prefix tree to summarize the data streams and then traversing the prefix tree to generate the rules. A series of experiments on real data is performed to evaluate the two methods.

Industrial Conference on Data Mining | 2016

Mining Event Sequences from Social Media for Election Prediction

Kuan-Chieh Tung; En Tzu Wang; Arbee L. P. Chen

Predicting election results is a challenging task for big data analytics. Simple approaches count the number of tweets mentioning candidates or parties to make the prediction. In fact, many other factors may cause the candidates to win or lose in an election, such as their political opinions, social issues, and scandals. In this paper, we mine rules of event sequences from social media to predict election results. An example rule for a candidate can be as follows: “(big event, positive) → (small event, negative) → (big event, positive)” implies a victory for this candidate. We detect events and decide event types to generate event sequences and then apply the rule-based classifier to build the prediction model. A series of experiments is performed to evaluate our approaches, and the experiment results reveal that the accuracy of our approaches in predicting election results is over 80% in most of the cases.

Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2013

Anonymization for Multiple Released Social Network Graphs

Chih-Jui Lin Wang; En Tzu Wang; Arbee L. P. Chen

Nowadays, people share their information via social platforms such as Facebook and Twitter in their daily life. Social networks on the Internet can be regarded as a microcosm of the real world and are worth analyzing. Since the data in social networks can be private and sensitive, privacy preservation in social networks has become a focus of study. Previous works develop anonymization methods for a single social network represented by a single graph, which are not enough for analyzing the evolution of the social network. In this paper, we study the privacy preserving problem considering the evolution of a social network. A time series of social network graphs representing the evolution of the corresponding social network is anonymized to a sequence of sanitized graphs to be released for further analysis. We point out that naively applying the existing approaches to each graph in the time series will violate the privacy goals, and propose an effective anonymization method extended from an existing approach, which takes into account the effect of time for releasing multiple anonymized graphs at one time. We use two real datasets to test our method, and the experiment results demonstrate that our method is very effective in terms of data utility for query answering.

International Database Engineering and Applications Symposium | 2013

Verification of k-coverage on query line segments

Kun-Han Juang; En Tzu Wang; Chieh-Feng Chiang; Arbee L. P. Chen

The coverage problem is one of the fundamental problems in sensor networks, which reflects the degree of a region being monitored by sensors. In this paper, we make the first attempt to address the k-coverage verification problem regarding a given query line segment, which returns all sub-segments from the line segment that are covered by at least k sensors. To deal with the problem, we propose three methods based on the R-tree index. The first method is the most primitive one, which identifies all intersection points of the query line segment and the circumferences of the covering regions of the sensors and then checks each sub-segment to see whether it is k-covered. Improving on the first method, the second method calculates the lower bound of the number of sensors covering a specific sub-segment to reduce the computation costs. The third method partitions the query line segment into sub-segments with equal length and then verifies each of them. A series of experiments on a real dataset and two synthetic datasets is performed to evaluate these methods. The experiment results demonstrate that the third method has the best performance among all three methods.
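The first, most primitive method can be sketched without the R-tree index by scanning every sensor. The sketch assumes circular covering regions given as `(cx, cy, r)` triples and a query segment with distinct endpoints; sub-segments are reported as parameter intervals `(t_lo, t_hi)` along the segment from `a` (t = 0) to `b` (t = 1). The function name is an assumption.

```python
import math


def k_covered_subsegments(a, b, sensors, k):
    """Parameter intervals of segment a-b that are covered by at least k sensors."""
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    qa = dx * dx + dy * dy  # nonzero because a != b is assumed

    # Cut the segment wherever it crosses the boundary circle of a sensor.
    cuts = {0.0, 1.0}
    for cx, cy, r in sensors:
        fx, fy = ax - cx, ay - cy
        qb = 2 * (fx * dx + fy * dy)
        qc = fx * fx + fy * fy - r * r
        disc = qb * qb - 4 * qa * qc
        if disc <= 0:
            continue  # the line misses the circle or only touches it
        root = math.sqrt(disc)
        for t in ((-qb - root) / (2 * qa), (-qb + root) / (2 * qa)):
            if 0 < t < 1:
                cuts.add(t)

    # Coverage is constant between consecutive cuts, so test each midpoint.
    ts = sorted(cuts)
    result = []
    for lo, hi in zip(ts, ts[1:]):
        mid = (lo + hi) / 2
        mx, my = ax + mid * dx, ay + mid * dy
        covering = sum(1 for cx, cy, r in sensors if math.hypot(mx - cx, my - cy) <= r)
        if covering >= k:
            if result and result[-1][1] == lo:
                result[-1] = (result[-1][0], hi)  # merge with the adjacent interval
            else:
                result.append((lo, hi))
    return result
```

For a segment from `(0, 0)` to `(10, 0)` and sensors `(2, 0, 3)` and `(6, 0, 3)`, the two covering regions overlap for x between 3 and 5, so the 2-covered part is roughly the parameter interval from 0.3 to 0.5.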
