Ching-Tien Ho
IBM
Publications
Featured research published by Ching-Tien Ho.
international conference on management of data | 1997
Ching-Tien Ho; Rakesh Agrawal; Nimrod Megiddo; Ramakrishnan Srikant
A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run-time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/b^d of the size of the d-dimensional data cube, where b is the blocking factor. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary information, but the overall time complexity is typically reduced significantly. We also discuss how the precomputed information is incrementally updated by batching updates to the data cube. Finally, we present algorithms for choosing the subset of the data cube dimensions for which the auxiliary information is computed and the blocking factor to use for each such subset. Our approach to answering range-max queries is based on precomputed maxima over balanced hierarchical tree structures. We use a branch-and-bound-like procedure to speed up the finding of the max in a region. We also show that with a branch-and-bound procedure, the average-case complexity is much smaller than the worst-case complexity.
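The prefix-sum idea for range-sum queries can be illustrated in two dimensions: precompute P[i][j] = the sum of all cells in the rectangle from the origin to (i, j); any range sum is then four lookups via inclusion-exclusion. The sketch below is illustrative Python (function names invented here), not the paper's implementation:

```python
def build_prefix(cube):
    # P[i+1][j+1] holds the sum of cube[0..i][0..j]; the extra
    # zero row/column avoids boundary special cases.
    rows, cols = len(cube), len(cube[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = cube[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r1, r2, c1, c2):
    # Sum over rows r1..r2 and columns c1..c2 (inclusive), in O(1)
    # regardless of the rectangle's size -- inclusion-exclusion on P.
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
```

With auxiliary storage equal to the cube's size, every range-sum query costs a constant four array accesses; the paper's blocked variant trades some of that storage for a few extra cell accesses per query.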
IEEE Transactions on Parallel and Distributed Systems | 1997
Jehoshua Bruck; Ching-Tien Ho; Shlomo Kipnis; Eli Upfal; Derrick Weathersby
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.
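For the concatenation (all-to-all broadcast) operation with one port and a power-of-two number of processors, a classic round-based scheme is recursive doubling: in round r, each processor exchanges everything it has accumulated with the partner whose rank differs in bit r, so all n blocks are known everywhere after log n rounds. The simulation below is a hedged sketch of that general technique, not the paper's specific algorithm:

```python
def allgather_recursive_doubling(blocks):
    # Simulate n processors; blocks[p] is processor p's initial block.
    # have[p] maps source rank -> block currently known at processor p.
    n = len(blocks)
    assert n & (n - 1) == 0, "sketch assumes n is a power of two"
    have = [{p: blocks[p]} for p in range(n)]
    dist = 1
    while dist < n:
        # Round: p exchanges its whole accumulated set with p XOR dist.
        new = [dict(h) for h in have]
        for p in range(n):
            new[p].update(have[p ^ dist])
        have = new
        dist <<= 1
    # Each processor now holds the concatenation, ordered by source rank.
    return [[have[p][s] for s in range(n)] for p in range(n)]
```

Each processor sends and receives exactly log n messages, matching the start-up-time lower bound for single-port all-to-all broadcast.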
Archive | 2000
Mohammed Javeed Zaki; Ching-Tien Ho
Contents:
Large-Scale Parallel Data Mining
- Parallel and Distributed Data Mining: An Introduction
Mining Frameworks
- The Integrated Delivery of Large-Scale Data Mining: The ACSys Data Mining Project
- A High Performance Implementation of the Data Space Transfer Protocol (DSTP)
- Active Mining in a Distributed Setting
Associations and Sequences
- Efficient Parallel Algorithms for Mining Associations
- Parallel Branch-and-Bound Graph Search for Correlated Association Rules
- Parallel Generalized Association Rule Mining on Large Scale PC Cluster
- Parallel Sequence Mining on Shared-Memory Machines
Classification
- Parallel Predictor Generation
- Efficient Parallel Classification Using Dimensional Aggregates
- Learning Rules from Distributed Data
Clustering
- Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
- A Data-Clustering Algorithm on Distributed Memory Multiprocessors
international conference on data engineering | 1999
Mohammed Javeed Zaki; Ching-Tien Ho; Rakesh Agrawal
Presents parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task-parallel approach uses dynamic subtree partitioning among processors. Our performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup.
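The attribute-scheduling form of data parallelism can be sketched as follows: each worker independently scores the candidate splits for one attribute, and the globally best split is the minimum over attributes. This is a hedged toy sketch (invented helper names, Gini impurity, threads standing in for SMP processors), not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_for_attribute(rows, labels, a):
    # Score every candidate threshold on attribute a; return the best.
    best = (float('inf'), None)
    values = sorted({r[a] for r in rows})
    for t in values[:-1]:
        left = [y for r, y in zip(rows, labels) if r[a] <= t]
        right = [y for r, y in zip(rows, labels) if r[a] > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if w < best[0]:
            best = (w, t)
    return a, best

def best_split(rows, labels, n_workers=4):
    # Attribute scheduling: each worker evaluates a different attribute.
    attrs = range(len(rows[0]))
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        scored = list(ex.map(lambda a: best_split_for_attribute(rows, labels, a), attrs))
    return min(scored, key=lambda s: s[1][0])  # (attribute, (impurity, threshold))
```

Because attributes are scored independently, the per-node work parallelizes with no synchronization beyond the final reduction; the paper layers task pipelining and dynamic load balancing on top of this basic scheme.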
IEEE Transactions on Computers | 1993
Jehoshua Bruck; Robert Cypher; Ching-Tien Ho
This paper presents several techniques for tolerating faults in d-dimensional mesh and hypercube architectures. The approach consists of adding spare processors and communication links so that the resulting architecture will contain a fault-free mesh or hypercube in the presence of faults. The authors optimize the cost of the fault-tolerant architecture by adding exactly k spare processors (while tolerating up to k processor and/or link faults) and minimizing the maximum number of links per processor. For example, when the desired architecture is a d-dimensional mesh and k=1, they present a fault-tolerant architecture that has the same maximum degree as the desired architecture (namely, 2d) and has only one spare processor. They also present efficient layouts for fault-tolerant two- and three-dimensional meshes, and show how multiplexers and buses can be used to reduce the degree of fault-tolerant architectures. Finally, they give constructions for fault-tolerant tori, eight-connected meshes, and hexagonal meshes.
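The one-dimensional analog of this spare-processor idea is easy to check by brute force: an n-node linear array augmented with k = 1 spare and wired as an (n+1)-node cycle contains a fault-free n-node path no matter which single node fails. The verification below is an illustrative sketch of that degenerate case (the paper's d-dimensional constructions are considerably more involved):

```python
from itertools import combinations

def cycle_graph(m):
    # Adjacency map of an m-node cycle: the fault-tolerant version
    # of an (m-1)-node linear array with one spare processor.
    return {v: {(v - 1) % m, (v + 1) % m} for v in range(m)}

def contains_path(adj, nodes, length):
    # Does the subgraph induced on `nodes` contain a simple path
    # visiting `length` vertices?  Brute-force DFS (small inputs only).
    nodes = set(nodes)
    def dfs(v, seen):
        if len(seen) == length:
            return True
        return any(dfs(u, seen | {u}) for u in adj[v] if u in nodes and u not in seen)
    return any(dfs(v, {v}) for v in nodes)

n, k = 6, 1
g = cycle_graph(n + k)  # n-node array plus k spares, wired as a cycle
# Every k-fault pattern still leaves a fault-free n-node linear array.
assert all(contains_path(g, set(g) - set(f), n) for f in combinations(g, k))
```

Note the cycle has maximum degree 2, the same as the linear array it protects, mirroring the paper's goal of tolerating k faults without raising the maximum degree.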
IEEE Transactions on Parallel and Distributed Systems | 1995
Vasanth Bala; Jehoshua Bruck; Robert Cypher; Pablo Elustondo; Alex Ho; Ching-Tien Ho; Shlomo Kipnis; Marc Snir
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
acm symposium on parallel algorithms and architectures | 1994
Jehoshua Bruck; Ching-Tien Ho; Shlomo Kipnis; Derrick Weathersby
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully-connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system. In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors.
We present a concatenation algorithm that is optimal, for most values of <italic>n</italic>, in the number of communication rounds and in the amount of data transferred.
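The start-up-optimal end of the trade-off for the index operation (single port) runs in ceil(log2 n) rounds: after a local rotation, round r ships every block whose index has bit r set to the processor 2^r positions ahead, and a final inverse rotation orders the blocks by source. The round-based simulation below is a hedged sketch of this style of algorithm, with data movement done by array copying rather than real messages:

```python
def bruck_all_to_all(send):
    # send[i][j] is the block processor i wants delivered to processor j.
    # Returns out, where out[p][s] is the block p received from s.
    n = len(send)
    # Phase 1: local rotation, so slot k holds the block for (p + k) mod n.
    tmp = [[send[p][(p + k) % n] for k in range(n)] for p in range(n)]
    # Phase 2: ceil(log2 n) rounds; in round with distance r, every block
    # in a slot whose index has that bit set moves r processors forward.
    r = 1
    while r < n:
        new = [row[:] for row in tmp]
        for p in range(n):
            src = (p - r) % n
            for k in range(n):
                if k & r:
                    new[p][k] = tmp[src][k]
        tmp = new
        r <<= 1
    # Phase 3: slot k now holds the block from source (p - k) mod n;
    # inverse rotation orders the received blocks by source rank.
    return [[tmp[p][(p - s) % n] for s in range(n)] for p in range(n)]
```

A block at initial slot k travels exactly k hops in total (one hop of size 2^r per set bit of k), so only ceil(log2 n) communication rounds are needed, at the cost of each block being forwarded through intermediaries rather than sent directly.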
Journal of Parallel and Distributed Computing | 1997
Jehoshua Bruck; Danny Dolev; Ching-Tien Ho; Marcel-Catalin Rosu; H. Raymond Strong
Parallel computing on clusters of workstations and personal computers has very high potential, since it leverages existing hardware and software. Parallel programming environments offer the user a convenient way to express parallel computation and communication. In fact, recently, a Message Passing Interface (MPI) has been proposed as an industrial standard for writing "portable" message-passing parallel programs. The communication part of MPI consists of the usual point-to-point communication as well as collective communication. However, existing implementations of programming environments for clusters are built on top of a point-to-point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in the collective communication part. In this paper, we present an efficient design and implementation of the collective communication part in MPI that is optimized for clusters of workstations. Our system consists of two main components: the MPI-CCL layer that includes the collective communication functionality of MPI and a User-Level Reliable Transport Protocol (URTP) that interfaces with the LAN Data-Link Layer and leverages the fact that the LAN is a broadcast medium. Our system is integrated with the operating system via an efficient kernel extension mechanism that we developed. The kernel extension significantly improves the performance of our implementation as it can handle part of the communication overhead without involving user space. We have implemented our system on a collection of IBM RS/6000 workstations connected via a 10-Mbit Ethernet LAN. Our performance measurements are taken from typical scientific programs that run in a parallel mode by means of the MPI. The hypothesis behind our design is that the system's performance will be bounded by interactions between the kernel and user space rather than by the bandwidth delivered by the LAN Data-Link Layer.
Our results indicate that the performance of our MPI Broadcast (on top of Ethernet) is about twice as fast as a recently published software implementation of broadcast on top of ATM.
symposium on principles of database systems | 1997
Ching-Tien Ho; Jehoshua Bruck; Rakesh Agrawal
A partial-sum query obtains the summation over a set of specified cells of a data cube. We establish a connection between the covering problem in the theory of covering codes and the partial-sum problem and use this connection to devise algorithms for the partial-sum problem with efficient space-time trade-offs. For example, using our algorithms, with 44% additional storage, the query response time can be improved by about 12%; by roughly doubling the storage requirement, the query response time can be improved by about 34%.
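The covering-code connection can be illustrated on a toy scale: precompute the sum for each subset corresponding to a codeword of a covering code; to answer a query subset, take the nearest stored codeword subset and correct for the few cells in which they differ. The sketch below uses the trivial length-3 repetition code with covering radius 1 (invented helper names; the paper's construction applies this over blocks of a d-dimensional cube):

```python
from itertools import product

def hamming(a, b):
    # Hamming distance between two characteristic (0/1) vectors.
    return sum(x != y for x, y in zip(a, b))

def partial_sum_oracle(x, codewords):
    # Precompute the subset sum for each codeword -- this table is the
    # extra storage that buys faster queries.
    table = {c: sum(v for v, bit in zip(x, c) if bit) for c in codewords}
    def query(chi):
        # chi is the query subset's characteristic vector.
        c = min(codewords, key=lambda w: hamming(chi, w))  # nearest stored subset
        s = table[c]
        # Correct for positions where the query and the codeword differ:
        # at most `covering radius` additions/subtractions.
        for v, want, have in zip(x, chi, c):
            if want and not have:
                s += v
            elif have and not want:
                s -= v
        return s
    return query
```

With a covering code of radius r, every query is answered from one stored sum plus at most r cell corrections, which is exactly the space-time trade-off the covering-code framework formalizes.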
international conference on parallel processing | 1987
S. Lennart Johnsson; Ching-Tien Ho
In a multiprocessor with distributed storage, the data structures have a significant impact on the communication complexity. In this paper we present a few algorithms for performing matrix transposition on a Boolean cube.
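A standard way to organize transposition for a Boolean cube is recursive block exchange: split the matrix into quadrants, swap the two off-diagonal quadrants, and recurse, with each level of recursion corresponding to an exchange along one cube dimension. The sketch below shows only the data movement in sequential Python (a hedged illustration, not the paper's algorithms):

```python
def transpose_recursive(M):
    # Transpose a 2^k x 2^k matrix by recursive quadrant exchange.
    # Swapping the off-diagonal quadrants at each level mirrors one
    # dimension-exchange step between hypercube neighbors.
    n = len(M)
    if n == 1:
        return [row[:] for row in M]
    h = n // 2
    A = [row[:h] for row in M[:h]]; B = [row[h:] for row in M[:h]]
    C = [row[:h] for row in M[h:]]; D = [row[h:] for row in M[h:]]
    # [[A B], [C D]]^T = [[A^T C^T], [B^T D^T]]: swap B and C, recurse.
    A, B, C, D = map(transpose_recursive, (A, C, B, D))
    return [A[i] + B[i] for i in range(h)] + [C[i] + D[i] for i in range(h)]
```

On an actual Boolean cube, each quadrant swap becomes a pairwise exchange between processors differing in one address bit, so a 2^k x 2^k block-distributed matrix is transposed in k exchange steps.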