Mi-Yen Yeh
Academia Sinica
Publication
Featured research published by Mi-Yen Yeh.
IEEE Transactions on Knowledge and Data Engineering | 2006
Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen
In the data stream environment, the patterns generated at different time instances are different due to data evolution. As time progresses, the behavior and members of clusters usually change. Hence, clustering continuous data streams allows us to observe the changes of group behavior. In order to support flexible clustering requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two advantageous features, namely, one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. The COD framework consists of two phases, i.e., the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve approximations of desired substreams from summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results, at the cost of slightly longer running time and twice the storage space of the wavelet-based one. An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model for building the summary hierarchy. With the adaptive COD, we can obtain clustering results with almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and also validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.
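To make the idea of a wavelet-based summary hierarchy concrete, here is a minimal sketch using pairwise Haar averaging. It is an illustration under stated assumptions, not the COD implementation: the function names `build_summary_hierarchy` and `approximate` are hypothetical, the input length is assumed to be a power of two, and the hierarchy is built in one batch rather than maintained online.

```python
def build_summary_hierarchy(values):
    """Build a Haar-style multiresolution summary of one stream window.

    Level 0 holds the raw values; each higher level halves the resolution
    by averaging adjacent pairs. The detail coefficients (half-differences)
    are kept so that a finer level can be rebuilt from a coarser one.
    Assumption: len(values) is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    levels = [list(values)]
    details = []
    current = list(values)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2
                    for i in range(0, len(current), 2)]
        diffs = [(current[i] - current[i + 1]) / 2
                 for i in range(0, len(current), 2)]
        levels.append(averages)
        details.append(diffs)
        current = averages
    return levels, details


def approximate(levels, level):
    """Expand the averages stored at `level` back to the original length."""
    width = 2 ** level
    return [avg for avg in levels[level] for _ in range(width)]


levels, details = build_summary_hierarchy([2, 4, 6, 8])
coarse = approximate(levels, 1)
```

A clustering query over a long window could then read a coarse level, while a query over a short recent window could read a finer one; that trade-off is the point of keeping several resolutions.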
extending database technology | 2009
Mi-Yen Yeh; Kun Lung Wu; Philip S. Yu; Ming-Syan Chen
We present PROUD -- A PRObabilistic approach to processing similarity queries over Uncertain Data streams, where the data streams here are mainly time series streams. In contrast to data with certainty, an uncertain series is an ordered sequence of random variables. The distance between two uncertain series is also a random variable. We use a general uncertain data model, where only the mean and the deviation of each random variable at each timestamp are available. We derive mathematical conditions for progressively pruning candidates to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, like wavelet synopses, are available. Extensive experiments are conducted to evaluate the effectiveness of PROUD and compare it with Det, a deterministic approach that directly processes data without considering uncertainty. The results show that, compared with Det, PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost. In contrast, Det does not provide such flexibility. This trade-off is important as in some applications false negatives are more costly, while in others, it is more critical to keep the false positives low.
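The abstract describes the distance between two uncertain series as a random variable built from per-timestamp means and deviations. The sketch below shows one way such a probabilistic similarity test could look. It is not PROUD's actual pruning condition: the function names are hypothetical, the per-timestamp variables are assumed independent and normally distributed (an added assumption needed for the variance formula), and the sum is approximated by a normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def distance_moments(mean_x, std_x, mean_y, std_y):
    """Mean and variance of the squared Euclidean distance.

    For each timestamp, D = X - Y has mean mu and variance var (independence
    assumed). Then E[D^2] = mu^2 + var, and, assuming D is normal,
    Var[D^2] = 2*var^2 + 4*mu^2*var.
    """
    total_mean = 0.0
    total_var = 0.0
    for mx, sx, my, sy in zip(mean_x, std_x, mean_y, std_y):
        mu = mx - my
        var = sx * sx + sy * sy
        total_mean += mu * mu + var
        total_var += 2 * var * var + 4 * mu * mu * var
    return total_mean, total_var


def is_probable_match(mean_x, std_x, mean_y, std_y, radius, tau):
    """Return True when Pr[distance^2 <= radius^2] >= tau (normal approximation)."""
    m, v = distance_moments(mean_x, std_x, mean_y, std_y)
    if v == 0:
        # No uncertainty: fall back to the deterministic comparison.
        return m <= radius ** 2
    probability = NormalDist(m, sqrt(v)).cdf(radius ** 2)
    return probability >= tau
```

Raising `tau` makes the test stricter, which reduces false positives at the price of more false negatives; lowering it does the opposite. This mirrors the threshold-controlled trade-off the abstract mentions.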
IEEE Transactions on Knowledge and Data Engineering | 2007
Mi-Yen Yeh; Bi-Ru Dai; Ming-Syan Chen
In applications of multiple data streams such as stock market trading and sensor network data analysis, the clusters of streams change at different times because of data evolution. Information about evolving clusters is valuable for supporting the corresponding online decisions. In this paper, we present a framework for clustering over multiple evolving streams by correlations and events, which, abbreviated as COMET-CORE, monitors the distribution of clusters over multiple data streams based on their correlation. Instead of directly clustering the multiple data streams periodically, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. The incoming streams are smoothed as sequences of end points by employing piecewise linear approximation. At the time when end points are generated, weighted correlations between streams are updated. End points are good indicators of significant change in streams, and such change is a main cause of a cluster evolution event. When an event occurs, we can report the latest clustering results through split and merge operations. As shown in our experimental studies, COMET-CORE can be performed effectively with good clustering quality.
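As a rough illustration of smoothing a stream into end points with piecewise linear approximation, the sketch below closes a segment whenever a straight line from the last end point to the newest value would miss an intermediate value by more than a tolerance. The function names and the `tolerance` parameter are hypothetical, and this particular segmentation rule is an assumption, not necessarily the rule COMET-CORE uses.

```python
def _fits(stream, start, end, tolerance):
    """Check whether the line from stream[start] to stream[end] stays within
    `tolerance` of every value strictly between the two positions."""
    slope = (stream[end] - stream[start]) / (end - start)
    return all(
        abs(stream[start] + slope * (k - start) - stream[k]) <= tolerance
        for k in range(start + 1, end)
    )


def end_points(stream, tolerance):
    """Return the indices of the end points of a piecewise linear approximation."""
    if not stream:
        return []
    points = [0]
    start = 0
    for i in range(2, len(stream)):
        if not _fits(stream, start, i, tolerance):
            # The previous value closes the current segment and opens the next.
            points.append(i - 1)
            start = i - 1
    if len(stream) > 1:
        points.append(len(stream) - 1)
    return points


turning = end_points([0, 1, 2, 3, 2, 1, 0], tolerance=0.1)
```

In a design like the one described, each newly emitted end point would be the moment to refresh the correlations between streams and to check whether a split or merge event should fire.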
knowledge discovery and data mining | 2015
Wush Chi-Hsuan Wu; Mi-Yen Yeh; Ming-Syan Chen
From the perspective of a Demand-Side Platform (DSP), which is the agent of advertisers, we study how to predict the winning price such that the DSP can win the bid by placing a proper bidding value in the real-time bidding (RTB) auction. We propose to leverage machine learning and statistical methods to train the winning price model from the bidding history. A major challenge is that a DSP usually suffers from the censoring of the winning price, especially for bids it lost in the past. To solve it, we utilize the censored regression model, which is widely used in survival analysis and econometrics, to fit the censored bidding data. Note, however, that the assumption of censored regression does not hold on the real RTB data. As a result, we further propose a mixture model, which combines linear regression on bids with observable winning prices and censored regression on bids with censored winning prices, weighted by the winning rate of the DSP. Experimental results show that the proposed mixture model in general substantially outperforms linear regression in terms of prediction accuracy.
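The sketch below illustrates the two ingredients the abstract names: a censored-regression log-likelihood and a mixture weighted by the winning rate. It is a simplified reading under assumptions, not the paper's model: the record layout, the function names, and the fixed noise scale `sigma` are hypothetical, errors are assumed normal, and a lost bid is treated as telling us only that the winning price was at least the bid.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()
TINY = 1e-300  # floor that keeps log() defined for extreme inputs


def censored_log_likelihood(records, predict, sigma):
    """Log-likelihood of bidding records under a censored normal model.

    records: iterable of (features, bid, winning_price); winning_price is
             None for a lost bid, where the price is censored.
    predict: hypothetical callable mapping features to a predicted price.
    """
    total = 0.0
    for features, bid, price in records:
        mu = predict(features)
        if price is not None:
            # Won bid: the winning price is observed, use the density.
            total += log(max(STD_NORMAL.pdf((price - mu) / sigma) / sigma, TINY))
        else:
            # Lost bid: only Pr[winning price >= bid] is known.
            total += log(max(1.0 - STD_NORMAL.cdf((bid - mu) / sigma), TINY))
    return total


def mixture_prediction(win_rate, linear_pred, censored_pred):
    """Blend the two models' predictions, weighted by the winning rate."""
    return win_rate * linear_pred + (1.0 - win_rate) * censored_pred
```

For example, with a winning rate of 0.25, a linear prediction of 100, and a censored prediction of 200, `mixture_prediction` leans toward the censored model and returns 175.0.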
ieee international electric vehicle conference | 2012
Jun-Li Lu; Mi-Yen Yeh; Yu-Ching Hsu; Shun-Neng Yang; Chai-Hien Gan; Ming-Syan Chen
In this paper, we propose a dispatching strategy with charging plans upon the client requests for a commercial fleet of pure electric taxis. To boost the green industry, promoting the development of electric vehicles is one of the most important policies of many governments. In a new scenario where commercial taxi fleets run electric vehicles as their main transportation, traditional dispatching policies for general gasoline taxis are no longer effective. This is because those policies do not consider the issues newly raised by electric vehicles, such as endurance and the related charging problems, when dispatching them. In addition, taxi drivers may worry that their working hours will be taken up by the possibly long waiting time of power recharging, which decreases their chances to carry clients. To overcome the above issues, this paper proposes a new dispatching policy that considers the taxi demand, the remaining power of electric taxis, and the availability of battery charging/switching stations in order to lower the waiting time of power recharging and thus increase the workable hours for taxi drivers. Simulation results show that our dispatching strategy can effectively reduce the waiting time for charging and increase the chances of taking clients compared to a random dispatching strategy.
international conference on data mining | 2004
Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen
In the data stream environment, the patterns generated by the mining techniques are usually distinct at different times because of the evolution of data. In order to deal with various types of multiple data streams and to support flexible mining requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major features, namely one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multiresolution approximations of data streams, flexible clustering demands can be supported.
international conference on data engineering | 2013
Jian Pei; Chi-Hsuan Wu; Mi-Yen Yeh
In this paper, we tackle a novel type of interesting query: shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest unique substring queries have many potential applications, such as information retrieval, bioinformatics, and event context analysis. We develop efficient algorithms for online query answering. First, we present an algorithm to answer a shortest unique substring query in O(n) time using a suffix tree index, where n is the length of string S. Second, we show that, using O(n·h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is a variable that is theoretically in O(n) but on real data sets is often much smaller than n and can be treated as a constant. Once the shortest unique substrings are pre-computed, shortest unique substring queries can be answered online in constant time. In addition to the solid algorithmic results, we empirically demonstrate the effectiveness and efficiency of shortest unique substring queries on real data sets.
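To make the query definition concrete, here is a brute-force sketch that answers a shortest unique substring query by trying every covering substring in order of increasing length. It only illustrates what the query returns; it is far slower than the suffix-tree algorithm described above and is not the paper's method. The function names are hypothetical, and positions are 0-based.

```python
def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text`, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shortest_unique_substring(text, q):
    """Return (start, end), inclusive, of a shortest substring that covers
    position q and occurs exactly once in `text`."""
    n = len(text)
    if not 0 <= q < n:
        raise IndexError("query position out of range")
    for length in range(1, n + 1):
        first = max(0, q - length + 1)   # leftmost start that still covers q
        last = min(q, n - length)        # rightmost start that fits in text
        for start in range(first, last + 1):
            if count_occurrences(text, text[start:start + length]) == 1:
                return start, start + length - 1
    # Not reached for non-empty text: the whole string always occurs once.
    return None


answer = shortest_unique_substring("abcab", 0)
```

For the string "abcab" and query position 0, neither "a" nor "ab" is unique, so the answer is the span (0, 2), that is, "abc".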
IEEE Transactions on Computers | 2014
Hua-Wei Fang; Mi-Yen Yeh; Pei-Lun Suei; Tei-Wei Kuo
This work is motivated by the strong demand for flash-friendly index designs to resolve reliability and performance concerns for data manipulation over flash memory. In comparison to previous work, we propose and explore the impact of hot-data access, sibling-link updates, and different workload types on a tree index structure over flash memory. In particular, a flash-friendly B+-tree, referred to as an Adaptive Durable B+-tree, is proposed to improve not only the endurance but also the performance of a tree index structure. The capability of the proposed methodology and index design is evaluated through a series of experiments, in which significant improvement in endurance was achieved in comparison to previous reports on the subject.
IEEE Transactions on Knowledge and Data Engineering | 2013
Huey-Ru Wu; Mi-Yen Yeh; Ming-Syan Chen
An object can move with various speeds and arbitrarily changing directions. Given a bounded area in which a set of objects moves around, there are some typical moving styles of the objects at different local regions due to the geographic nature or other spatiotemporal conditions. Beyond the paths along which the objects move, we also want to know how different groups of objects move with various speeds. Therefore, given a set of collected trajectories spreading in a bounded area, we are interested in discovering the typical moving styles in different regions of all the monitored moving objects. These regional typical moving styles are regarded as the profile of the monitored moving objects, which may help reflect the geoinformation of the observed area and the moving behaviors of the observed moving objects. In this paper, we present DivCluST, an approach to finding regional typical moving styles by dividing and clustering the trajectories in consideration of both the spatial and temporal constraints. Different from the existing works that consider only the spatial properties or just the interesting regions of trajectories, DivCluST focuses more on typical movements in local regions of a bounded area and takes the temporal information into account when designing the criteria for trajectory dividing and the distance measurement for adaptive
pacific-asia conference on knowledge discovery and data mining | 2013
Hao-Hsiang Wu; Mi-Yen Yeh