Yunyue Zhu

Courant Institute of Mathematical Sciences

Publication

Featured research published by Yunyue Zhu.

very large data bases | 2002

StatStream: statistical monitoring of thousands of data streams in real time

Yunyue Zhu; Dennis E. Shasha

Publisher Summary: Maintaining multistream and time-delayed statistics in a continuous online fashion is a significant challenge in data management. This chapter solves this problem in a scalable way that gives a guaranteed response time with high accuracy. The Discrete Fourier Transform (DFT) technique reduces the enormous raw data streams into a manageable synoptic data structure and gives good I/O performance. For any pair of streams, the pair-wise statistic is computed incrementally using a DFT approximation and requires constant time per update. A sliding/basic window framework is introduced to facilitate the efficient management of streaming data digests. The correlation coefficient similarity measure is reduced to a Euclidean distance measure, and a grid structure is used to detect correlations among thousands of high-speed data streams in real time. Experiments on synthetic and real data show that StatStream detects correlations efficiently and precisely.
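The reduction from correlation to Euclidean distance mentioned in the summary can be checked numerically. The following is a minimal sketch, assuming each series is normalized to zero mean and unit norm over a single window; the function names are illustrative, and the sketch does not reproduce StatStream's DFT digests or grid structure. Under that normalization, the squared distance d² and the correlation r satisfy r = 1 - d²/2.

```python
import math


def normalize(xs):
    """Shift a series to zero mean and scale it to unit Euclidean norm.

    Assumes the series is not constant (otherwise the norm is zero).
    """
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def pearson(xs, ys):
    """Textbook Pearson correlation coefficient of two equal-length series."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def corr_from_distance(xs, ys):
    """Recover the correlation from the distance between normalized series."""
    xn, yn = normalize(xs), normalize(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(xn, yn))
    return 1.0 - d2 / 2.0


if __name__ == "__main__":
    a = [1.0, 2.0, 4.0, 3.0, 5.0]
    b = [2.0, 1.0, 5.0, 4.0, 4.0]
    # The two values agree up to floating-point rounding.
    print(pearson(a, b), corr_from_distance(a, b))
```

Because a high correlation corresponds to a small distance, a search for highly correlated pairs becomes a search for nearby points, which is the kind of query a grid can answer.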

knowledge discovery and data mining | 2003

Efficient elastic burst detection in data streams

Yunyue Zhu; Dennis E. Shasha

Burst detection is the activity of finding abnormal aggregates in data streams. Such aggregates are based on sliding windows over data streams. In some applications, we want to monitor many sliding window sizes simultaneously and to report those windows with aggregates significantly different from other periods. We present a general data structure for detecting interesting aggregates over such elastic windows in near linear time. We present applications of the algorithm for detecting Gamma Ray Bursts in large-scale astrophysical data. Detection of periods with high volumes of trading activity and high stock price volatility is also demonstrated using real-time Trade and Quote (TAQ) data from the New York Stock Exchange (NYSE). Our algorithm beats the direct computation approach by several orders of magnitude.
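To make the problem statement concrete, here is a minimal sketch of the direct computation approach that the abstract uses as its baseline: every window size is tested at every position. It is not the paper's data structure. The function name, the use of sums as the aggregate, and the per-size threshold dictionary are assumptions made for illustration.

```python
from itertools import accumulate


def elastic_bursts_direct(stream, thresholds):
    """Report every window whose sum reaches the threshold for its size.

    `thresholds` maps a window size to the sum that counts as a burst.
    Returns a list of (window_size, start_index, window_sum) tuples.
    The cost grows with (stream length) x (number of window sizes).
    """
    # prefix[i] holds the sum of stream[:i], so any window sum is one subtraction.
    prefix = [0] + list(accumulate(stream))
    hits = []
    for size, threshold in sorted(thresholds.items()):
        for start in range(len(stream) - size + 1):
            total = prefix[start + size] - prefix[start]
            if total >= threshold:
                hits.append((size, start, total))
    return hits


if __name__ == "__main__":
    counts = [1, 0, 0, 5, 6, 0, 1]
    for hit in elastic_bursts_direct(counts, {2: 10, 3: 11}):
        print(hit)
```

Monitoring all window sizes this way is what becomes expensive on long streams, which motivates a structure that can rule out non-burst periods without testing each window.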

international conference on management of data | 2003

Warping indexes with envelope transforms for query by humming

Yunyue Zhu; Dennis E. Shasha

A Query by Humming system allows the user to find a song by humming part of the tune. No musical training is needed. Previous query by humming systems have not provided satisfactory results for various reasons. Some systems have low retrieval precision because they rely on melodic contour information from the hummed tune, which in turn relies on the error-prone note segmentation process. Some systems yield better precision when matching the melody directly from audio, but they are slow because of their extensive use of Dynamic Time Warping (DTW). Our approach improves both the retrieval precision and speed compared to previous approaches. We treat music as a time series and exploit and improve well-developed techniques from time series databases to index the music for fast similarity queries. We improve on existing DTW indexing techniques by introducing the concept of envelope transforms, which gives a general guideline for extending existing dimensionality reduction methods to DTW indexes. The net result is high scalability. We confirm our claims through extensive experiments.
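The cost of DTW referred to above comes from its dynamic program, which fills a table with one cell per pair of positions. The following is a minimal sketch of the textbook algorithm with an optional warping band, assuming squared pointwise differences as the local cost; the function name and the `band` parameter are illustrative and are not taken from the paper.

```python
import math


def dtw_distance(a, b, band=None):
    """Dynamic Time Warping distance between two numeric sequences.

    `band` limits how far the alignment may stray from the diagonal.
    With band=None the warping is unconstrained.
    """
    n, m = len(a), len(b)
    if band is None:
        band = max(n, m)
    # The band must be wide enough to reach the final cell.
    band = max(band, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return math.sqrt(cost[n][m])


if __name__ == "__main__":
    melody = [1, 2, 3, 3, 2]
    hummed = [1, 2, 2, 3, 2]  # same shape, one note held longer
    print(dtw_distance(melody, hummed, band=0))  # no warping allowed
    print(dtw_distance(melody, hummed, band=1))  # one step of warping
```

With `band=0` the measure reduces to the ordinary Euclidean distance on equal-length inputs, so the band width controls how much timing variation in the hum is tolerated.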

very large data bases | 2003

Checks and balances: monitoring data quality problems in network traffic databases

Flip Korn; S. Muthukrishnan; Yunyue Zhu

Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is one of data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.) and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad-hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach - both in principle and in practice - to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, which is the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.
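One way to picture a rule template with open tolerance and likelihood parameters is sketched below. This is a hypothetical illustration, not PAC-Man's actual rule language: the "smoothness" rule, the names `epsilon` and `delta`, and the quantile-based fitting are all assumptions. The rule says that consecutive readings differ by at most `epsilon` in at least a fraction `delta` of cases.

```python
import math


def smoothness_fraction(values, epsilon):
    """Fraction of consecutive pairs that differ by at most epsilon."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(1 for d in diffs if d <= epsilon) / len(diffs)


def pac_holds(values, epsilon, delta):
    """True if the rule is satisfied with at least the required likelihood."""
    return smoothness_fraction(values, epsilon) >= delta


def fit_epsilon(history, delta):
    """Pick the smallest tolerance that makes the rule hold on past data.

    This takes an empirical quantile of the observed step sizes.
    """
    diffs = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    index = max(0, math.ceil(delta * len(diffs)) - 1)
    return diffs[index]


if __name__ == "__main__":
    history = [10, 11, 10, 12, 11, 10, 11, 30, 11]  # one polling glitch
    eps = fit_epsilon(history, delta=0.75)
    print(eps, pac_holds(history, eps, 0.75))
    print(pac_holds([10, 40, 10, 40], eps, 0.75))   # persistent oscillation
```

The point of the likelihood parameter is that an occasional glitch does not raise an alarm, while a sustained change in behavior does.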

international conference on management of data | 2003

Query by humming: in action with its technology revealed

Yunyue Zhu; Dennis E. Shasha; Xiaojian Zhao

You have a tune lingering in your head for many days, but you don’t know where you heard this tune or which song it is from. This demo will show a Query by Humming system that will tell you the name of that song. Most of the research in pre-existing query by humming systems uses pitch contour to match similar melodies (for example [1]). The user’s humming is transcribed to a sequence of discrete notes and the contour information is extracted from the notes. This contour information is represented by a few letters. For example, (“U”, “D”, “S”) represents that a note is above, below or the same as the previous one. The tunes in the databases are also represented by contour information. The edit distance can be used to measure the similarity between two melodies. Unfortunately, it is very hard to segment a user’s humming into discrete notes. Some recent work proposes to match the query directly from audio, using dynamic time warping to compare the hum-query with the melodies in the music databases. But this quality improvement comes at a price because a brute-force search using DTW is very slow. The database community has been researching similarity queries for time series databases for many years. The techniques developed in the area might shed light on the query by humming problem. In this demo, we treat both the melodies in the music databases and the user humming input as time series. Such an approach allows us to integrate many database indexing techniques into a query by humming system, improving the quality of such a system over the traditional (contour) string database approach. We design special searching techniques that are invariant to shifting, time scaling and local time warping. This makes the system robust and allows more flexible user humming input.
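The contour representation and edit-distance matching described above can be sketched in a few lines. This is a minimal illustration of the traditional approach that the demo contrasts itself with, assuming notes are already given as numeric pitches (for example MIDI note numbers); the function names are illustrative.

```python
def contour(pitches):
    """Encode a note sequence as U (up), D (down) or S (same) per step."""
    letters = []
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            letters.append("U")
        elif current < previous:
            letters.append("D")
        else:
            letters.append("S")
    return "".join(letters)


def edit_distance(s, t):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous_row = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current_row = [i]
        for j, ct in enumerate(t, start=1):
            current_row.append(min(
                previous_row[j] + 1,               # delete cs
                current_row[j - 1] + 1,            # insert ct
                previous_row[j - 1] + (cs != ct),  # match or substitute
            ))
        previous_row = current_row
    return previous_row[-1]


if __name__ == "__main__":
    song = contour([60, 62, 64, 64, 62])  # melody stored in the database
    hum = contour([65, 67, 67, 65])       # transposed hum with a note missed
    print(song, hum, edit_distance(song, hum))
```

The contour ignores the absolute key, which is convenient, but it depends entirely on the hum having been segmented into notes correctly, which is the weakness the abstract points out.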

Archive | 2004

Query by Humming

Dennis E. Shasha; Yunyue Zhu

The goal of a Query by Humming system is to allow a user to find a song by humming part of the tune. No musical training is needed. The problem is still unsolved. Some systems have low retrieval precision because they rely on melodic contour information from the hummed tune, which in turn relies on the error-prone note segmentation process. Some systems yield better precision when matching the melody directly from audio, but they are slow because of their extensive use of Dynamic Time Warping (DTW) (see Chapter 4). HumFinder [106, 107] improves both the retrieval precision and speed compared to previous approaches. We treat music as a time series and exploit and improve well-developed techniques from time series databases to index the music for fast similarity queries. We improve on existing DTW indexing techniques by introducing the concept of envelope transforms, which gives a general guideline for extending existing dimensionality reduction methods to DTW indexes. The net result is high scalability. We test our system through experiments. Please read this approach as a case study of the techniques you have seen, not as a complete solution to this hard problem.
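The envelope idea can be illustrated with a widely used envelope-based lower bound for band-constrained DTW. This is a minimal sketch and not HumFinder's code: it assumes equal-length sequences, a warping band of width `k`, and squared pointwise costs, and it does not show the envelope transforms themselves. The function names are illustrative.

```python
import math


def envelope(query, k):
    """Lower and upper envelope of a query under a warping band of width k.

    At each position the envelope spans the minimum and maximum of the
    query values that a band-constrained alignment could pair with it.
    """
    n = len(query)
    lower = [min(query[max(0, i - k): i + k + 1]) for i in range(n)]
    upper = [max(query[max(0, i - k): i + k + 1]) for i in range(n)]
    return lower, upper


def envelope_lower_bound(query, candidate, k):
    """Distance from the candidate to the query's envelope.

    Only the parts of the candidate that fall outside the envelope
    contribute, so this value never exceeds the band-constrained DTW
    distance computed with the same squared local cost.
    """
    lower, upper = envelope(query, k)
    total = 0.0
    for value, lo, hi in zip(candidate, lower, upper):
        if value > hi:
            total += (value - hi) ** 2
        elif value < lo:
            total += (lo - value) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    hum = [1, 2, 3, 3, 2]
    print(envelope(hum, 1))
    print(envelope_lower_bound(hum, [1, 2, 2, 3, 2], 1))
    print(envelope_lower_bound(hum, [5, 2, 2, 3, 0], 1))
```

A cheap bound of this kind lets a search skip the full DTW computation for any candidate whose bound already exceeds the best distance found so far.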

international conference on management of data | 2003

IPSOFACTO: a visual correlation tool for aggregate network traffic data

Flip Korn; S. Muthukrishnan; Yunyue Zhu

IP network operators collect aggregate traffic statistics on network interfaces via the Simple Network Management Protocol (SNMP). This is part of routine network operations for most ISPs; it involves a large infrastructure with multiple network management stations polling information from all the network elements and collating a real-time data feed. This demo will present a tool that manages the live SNMP data feed on a fully operational large ISP at industry scale. The tool primarily serves to study correlations in the network traffic by providing a rich mix of ad-hoc querying, based on a user-friendly correlation interface, as well as canned queries, based on the expertise of network operators with field experience. The tool is called IPSOFACTO for IP Stream-Oriented FAst Correlation TOol.

Archive | 2004

Elastic Burst Detection

Dennis E. Shasha; Yunyue Zhu

Burst detection is the activity of finding abnormal aggregates in data streams. Such aggregates are based on sliding windows over data streams. In some applications, we want to monitor many sliding window sizes simultaneously and to report those windows with aggregates significantly different from other periods. We present a general data structure and system called OmniBurst [104] for detecting interesting aggregates over such elastic windows in near linear time. We present applications of the algorithm to detecting Gamma Ray Bursts in large-scale astrophysical data. Detection of periods with high volumes of trading activity and high stock price volatility is also demonstrated using real-time Trade and Quote (TAQ) data from the New York Stock Exchange (NYSE). Our algorithm filters out periods of non-bursts in linear time, and so beats the quadratic direct computation approach (of testing all window sizes individually) by several orders of magnitude.

Archive | 2004

Flexible Similarity Search

Dennis E. Shasha; Yunyue Zhu

There are many applications for similarity search in time series data, of which the following are just a small sample.

1. In finance, a trader may be interested in finding pairs of stocks that move similarly, perhaps with some lag.
2. In music, a person may want to find a song that is similar to one that he can hum.
3. In business management, spotting products with similar selling patterns can result in more efficient product management.
4. In environmental science, by comparing the pollutant level in different sections of a river, scientists can have a better understanding of environmental changes.
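The finance example, two series that move similarly with some lag, can be sketched as a search over candidate lags. This is a minimal illustration using Pearson correlation as the similarity measure, which is an assumption; the chapter may use other measures. The function names and the sample data are made up for the example, and both series are assumed to have the same length and to be non-constant over each compared span.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Find the delay at which `follower` best tracks `leader`.

    For each lag, leader[t] is compared with follower[t + lag].
    Returns (lag, correlation) for the highest correlation found.
    """
    best = None
    for lag in range(max_lag + 1):
        head = leader[: len(leader) - lag]
        tail = follower[lag:]
        r = pearson(head, tail)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


if __name__ == "__main__":
    stock_a = [1, 3, 2, 5, 4, 6, 5, 8]
    stock_b = [0, 0, 1, 3, 2, 5, 4, 6]  # stock_a delayed by two steps
    print(best_lag(stock_a, stock_b, max_lag=3))
```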

Archive | 2004

Data Reduction and Transformation Techniques

Dennis E. Shasha; Yunyue Zhu

From a data mining point of view, time series data has two important characteristics:

1. High Dimensional. If we think of each time point of a time series as a dimension, a time series is a point in a very high dimensional space. A time series of length 1000 corresponds to a point in a 1000-dimensional space. Though a time series of length 1000 is very common in practice, processing in a 1000-dimensional space is extremely difficult even with modern computer systems.
2. Temporal Order. Fortunately, the consecutive values in a time series are related because of the temporal order of a time series. For example, for financial time series, the differences between consecutive values will be within some predictable threshold most of the time. This temporal relationship between nearby data points in a time series produces some redundancy, and such redundancy provides an opportunity for data reduction.
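As one concrete example of exploiting this redundancy, the sketch below uses piecewise aggregate approximation, which replaces each run of consecutive values with its mean. The excerpt above does not say which reduction techniques the chapter covers, so this technique is chosen here for illustration only; the function name is illustrative, and the series length is assumed to be a multiple of the number of segments.

```python
def piecewise_means(series, segments):
    """Reduce a series to `segments` values by averaging equal-width chunks."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


if __name__ == "__main__":
    prices = [1, 3, 2, 4, 10, 12, 11, 13]
    # Eight dimensions reduced to four, then to two.
    print(piecewise_means(prices, 4))
    print(piecewise_means(prices, 2))
```

Because neighboring values tend to be close, the chunk means keep the overall shape of the series while cutting the number of dimensions that an index has to handle.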
