Takahiko Shintani
University of Tokyo
Publication
Featured research published by Takahiko Shintani.
international conference on parallel and distributed information systems | 1996
Takahiko Shintani; Masaru Kitsuregawa
We propose four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules on shared-nothing parallel machines to improve mining performance. In NPA, the candidate itemsets are simply copied to all the processors, which can lead to memory overflow for large transaction databases. The remaining three algorithms partition the candidate itemsets over the processors. If they are partitioned simply (SPA), transaction data has to be broadcast to all processors. HPA partitions the candidate itemsets using a hash function to eliminate broadcasting, which also reduces the comparison workload significantly. HPA-ELD fully utilizes the available memory space by detecting the extremely large itemsets and copying them, which is also very effective at flattening the load over the processors. We implemented these algorithms in a shared-nothing environment. Performance evaluations show that the best algorithm, HPA-ELD, attains good linearity in speedup ratio and is effective for handling skew.
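The hash-partitioning idea behind HPA can be illustrated with a minimal single-process sketch. It is only a reading of the abstract, not the authors' implementation: the function names (`owner`, `hpa_count`), the CRC32-based hash, and the message counter are assumptions made for illustration, and the processors are simulated as list indices.

```python
import zlib
from collections import Counter
from itertools import combinations


def owner(itemset, num_procs):
    """Map a sorted itemset tuple to the (simulated) processor that owns it.

    CRC32 is an arbitrary deterministic hash chosen for this sketch.
    """
    return zlib.crc32(repr(itemset).encode()) % num_procs


def hpa_count(local_transactions, candidates, k, num_procs):
    """Count support of candidate k-itemsets with hash-partitioned candidates.

    local_transactions[p] holds the transactions stored on processor p.
    Returns (support counts, number of itemsets sent to another processor).
    """
    # Each processor keeps only the candidates that hash to it.
    owned = [set() for _ in range(num_procs)]
    for cand in candidates:
        cand = tuple(sorted(cand))
        owned[owner(cand, num_procs)].add(cand)

    tables = [Counter() for _ in range(num_procs)]
    messages = 0
    for src, transactions in enumerate(local_transactions):
        for transaction in transactions:
            # Generate the k-itemsets of the transaction and route each one
            # to its owner instead of broadcasting the whole transaction.
            for subset in combinations(sorted(set(transaction)), k):
                dst = owner(subset, num_procs)
                if dst != src:
                    messages += 1
                if subset in owned[dst]:
                    tables[dst][subset] += 1

    total = Counter()
    for table in tables:
        total.update(table)
    return total, messages


if __name__ == "__main__":
    data = [[[1, 2, 3], [1, 2]], [[2, 3], [1, 2, 3]]]
    counts, sent = hpa_count(data, [(1, 2), (2, 3), (1, 3)], k=2, num_procs=2)
    print(dict(counts), "itemsets sent:", sent)
```

Each itemset is probed against exactly one processor's table, which is the property the abstract credits with removing the broadcast and reducing the comparison workload.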
knowledge discovery and data mining | 1998
Takahiko Shintani; Masaru Kitsuregawa
In this paper, we study the problem of mining sequential patterns in a large database of customer transactions. Since finding sequential patterns involves a large amount of customer transaction data and requires multiple passes over the database, parallel algorithms are expected to improve the performance significantly. We consider parallel algorithms for mining sequential patterns in a shared-nothing environment. Three parallel algorithms are proposed: Non Partitioned Sequential Pattern Mining (NPSPM), Simply Partitioned Sequential Pattern Mining (SPSPM) and Hash Partitioned Sequential Pattern Mining (HPSPM). In NPSPM, the candidate sequences are simply copied to all the nodes, which can lead to memory overflow for large databases. The remaining two algorithms partition the candidate sequences over the nodes, which can efficiently exploit the total system memory as the number of nodes is increased. If they are partitioned simply, customer transaction data has to be broadcast to all nodes. HPSPM partitions the candidate sequences among the nodes using a hash function, which eliminates the broadcasting of customer transaction data and reduces the comparison workload. We describe the implementation of these algorithms on a shared-nothing parallel computer, the IBM SP2, and its performance evaluation results. Among the three algorithms, HPSPM attains the best performance.
international conference on management of data | 1998
Takahiko Shintani; Masaru Kitsuregawa
Association rule mining has recently attracted strong attention. Usually, a classification hierarchy over the data items is available. Users are interested in generalized association rules that span different levels of the hierarchy, since more interesting rules can sometimes be derived by taking the hierarchy into account. In this paper, we propose new parallel algorithms for mining association rules with classification hierarchy on a shared-nothing parallel machine to improve mining performance. Our algorithms partition the candidate itemsets over the processors, which exploits the aggregate memory of the system effectively. If the candidate itemsets are partitioned without considering the classification hierarchy, both the items and all their ancestor items have to be transmitted, which causes a prohibitively large amount of communication. Our method minimizes interprocessor communication by considering the hierarchy. Moreover, in our algorithm, the available memory space is fully utilized by identifying the frequently occurring candidate itemsets and copying them over all the processors, so that frequent itemsets can be processed locally without any communication. Thus it can effectively reduce the load skew among the processors. Several experiments are done by changing the granularity of the copied itemsets, from the whole tree to a small group of the frequent itemsets along the hierarchy. The coarser the grain, the easier the control, but it is rather difficult to achieve sufficient load balance. The finer the grain, the more complicated the control required, but it can balance the load quite well. We implemented the proposed algorithms on an IBM SP-2. Performance evaluations show that our algorithms are effective for handling skew and attain a sufficient speedup ratio.
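To see why ignoring the hierarchy inflates communication, it helps to look at how a transaction grows once every item's ancestors are added. The sketch below is illustrative only: the `parent` mapping, the item names, and the helper names are hypothetical and are not taken from the paper.

```python
def ancestors(item, parent):
    """Return all ancestors of an item by walking up the hierarchy.

    `parent` maps each item to its direct parent; root items are absent.
    """
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def extend_transaction(transaction, parent):
    """Add every ancestor of every item, as generalized rule mining requires."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return sorted(extended)


if __name__ == "__main__":
    # Hypothetical classification hierarchy (child -> parent).
    hierarchy = {
        "jacket": "outerwear",
        "ski pants": "outerwear",
        "outerwear": "clothes",
        "shirt": "clothes",
    }
    print(extend_transaction(["jacket", "shirt"], hierarchy))
```

In this toy case a two-item transaction becomes four items, and the number of itemsets generated from it grows accordingly, which is the traffic that hierarchy-aware partitioning aims to keep local.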
mobile data management | 2002
Iko Pramudiono; Takahiko Shintani; Katsumi Takahashi; Masaru Kitsuregawa
The rapid growth of Internet access from mobile users has emphasised the importance of location specific information on the Web. A unique Web service called Mobile Info Search (MIS) from NTT Laboratories gathers information and provides location aware search facilities. We performed association rule mining and sequence pattern mining against an access log which was accumulated at the MIS site in order to get insight into the behavior of mobile users regarding spatial information on the Web. Details of the Web log mining process and the rules we derived are reported in this paper.
high performance distributed computing | 1998
Masato Oguchi; Takahiko Shintani; Takayuki Tamura; Masaru Kitsuregawa
PC clusters have been studied intensively for next-generation large scale parallel computers. ATM technology is a strong candidate as a de facto standard of high speed communication networks. Therefore an ATM connected PC cluster is a very promising platform from the cost/performance point of view, as a future high performance computing environment. An ATM connected PC cluster consisting of 100 PCs is reported, and characteristics of a transport layer protocol for the PC cluster are evaluated. Point-to-point communication performance is measured and discussed when a TCP window size parameter is changed. Retransmission caused by cell loss at the ATM switch is analyzed, and parameters of the retransmission mechanism suitable for parallel processing on the large scale PC cluster are clarified. From the viewpoint of applications, data intensive applications such as data mining and ad-hoc query processing in databases are considered to be very important for massively parallel processors, in addition to conventional scientific calculations. Thus, investigating the feasibility of such applications on an ATM connected PC cluster is quite meaningful. Parallel data mining is implemented and evaluated on the cluster. The default TCP protocol cannot provide good performance, since a lot of collisions happen during all-to-all multicasting executed on the large scale PC cluster. Using TCP parameters according to the proposed optimization, sufficient performance improvement is achieved for parallel data mining on 100 PCs.
ieee international conference on high performance computing data and analytics | 1997
Masato Oguchi; Takahiko Shintani; Takayuki Tamura; Masaru Kitsuregawa
Until recently, workstations were overwhelmingly superior to personal computers in terms of performance. However, recent PC technology has dramatically increased its CPU, main memory, and cache memory performance. Therefore massively parallel computer systems are moving away from proprietary components such as CPU, disks, etc. to commodity parts.
european conference on parallel processing | 1999
Takahiko Shintani; Masato Oguchi; Masaru Kitsuregawa
One of the most important problems in data mining is the discovery of association rules in large databases. We previously proposed parallel algorithms for mining generalized association rules with classification hierarchy. In this paper, we implemented the proposed algorithms on a large scale PC cluster which consists of one hundred PCs interconnected by an ATM switch, and analyzed the performance of our algorithms using a large transaction dataset. Performance evaluations show our parallel algorithms are effective for handling skew on such large scale parallel systems.
data warehousing and knowledge discovery | 1999
Iko Pramudiono; Takahiko Shintani; Takayuki Tamura; Masaru Kitsuregawa
Data mining has been widely recognized as a powerful tool to explore added value from large-scale databases. One data mining technique, generalized association rule mining with taxonomy, has the potential to discover more useful knowledge than ordinary flat association mining by taking application specific information into account. We proposed SQL queries, named TTR-SQL and TH-SQL, to perform this kind of mining and evaluated them on a PC cluster. Those queries can be more than 30% faster than the Apriori based SQL query reported previously. Although an RDBMS has powerful query processing ability through SQL, most data mining systems use specialized implementations to achieve better performance. There is a tradeoff between performance and portability. Performance is not necessarily sufficiently high, but seamless integration with an existing RDBMS would be considerably advantageous. Since RDBs are already very popular, the feasibility of generalized association rule mining can be explored using the proposed SQL queries instead of purchasing expensive mining software. In addition, parallel RDBs are now also widely accepted. We showed that parallelizing the SQL execution can offer the same performance as those native programs with 10 to 15 nodes. Since most organizations have a lot of PCs that are not fully utilized, we are able to exploit such resources to improve the performance significantly.
Electronics and Communications in Japan Part I-communications | 1999
Masato Oguchi; Takayuki Tamura; Takahiko Shintani; Masaru Kitsuregawa
A recent tendency in parallel computer design has been to use general-purpose components for system configuration elements such as CPUs, disks, and memories, which used to be specially developed. Although the connection network between the processors has been specially developed, it is now possible to configure a large-scale PC cluster with good performance at low cost by making use of an ATM network as a processor connection network because of the development and cost reduction of ATM network technologies in the communication field. In this paper, a large-scale PC cluster is constructed by connecting 100 personal computers by means of a general-purpose ATM network. Applications to parallel data mining are evaluated and discussed. In particular, an analysis is carried out with a focus on the effect of TCP retransmission with cell discarding of the ATM switch on the performance. The parameter setting of a retransmission mechanism suitable for the parallel processing in the cluster is found. Further, by developing a method for setting the retransmission spacing parameters to random values for each node, it is shown that a further improvement is possible.
knowledge discovery and data mining | 1999
Takahiko Shintani; Masaru Kitsuregawa
One of the most important problems in data mining is the discovery of association rules in large databases. In our previous study, we proposed parallel algorithms and candidate duplication based load balancing strategies for mining generalized association rules and showed that our algorithms could attain good performance on a 16-node parallel computer system. However, as the number of nodes increases, it becomes difficult to achieve a flat workload distribution. In this paper, we present a candidate partition based load balancing strategy for the parallel algorithm of generalized association rule mining. This strategy partitions the candidate itemsets so that the number of candidate probes is equalized across the nodes, using support counts estimated from the information of the previous pass. Moreover, we implement the parallel algorithms and load balancing strategies for mining generalized association rules on a cluster of 100 PCs interconnected with an ATM network, and analyze the performance using a large transaction dataset. Through several experiments, we show that the load balancing strategy, which partitions the candidate itemsets by considering the distribution of candidate probes and duplicates the frequently occurring candidate itemsets, can attain high performance and achieve good workload distribution on a one hundred PC cluster system.
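The idea of equalizing candidate probes across nodes can be sketched as a simple balancing problem. The abstract does not say how the partition is computed, so the greedy "largest first, to the least loaded node" heuristic below is an assumption swapped in for illustration, as are the function name and the example probe estimates.

```python
import heapq


def partition_by_estimated_probes(estimated_probes, num_nodes):
    """Assign candidate itemsets to nodes so estimated probe counts are even.

    estimated_probes maps each candidate itemset to the number of probes
    expected for it (for example, estimated from the previous pass).
    Uses a greedy heuristic: the heaviest remaining candidate goes to the
    node with the smallest load so far. Returns (assignment, loads).
    """
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}

    # Sort by descending probe estimate; break ties by itemset for determinism.
    ordered = sorted(estimated_probes.items(), key=lambda kv: (-kv[1], kv[0]))
    for candidate, probes in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(candidate)
        heapq.heappush(heap, (load + probes, node))

    loads = {
        node: sum(estimated_probes[c] for c in cands)
        for node, cands in assignment.items()
    }
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical probe estimates for five candidate 2-itemsets.
    estimates = {
        ("a", "b"): 50,
        ("a", "c"): 30,
        ("b", "c"): 20,
        ("c", "d"): 10,
        ("d", "e"): 10,
    }
    assignment, loads = partition_by_estimated_probes(estimates, num_nodes=2)
    print(assignment)
    print(loads)
```

A partition by candidate count alone would ignore that one itemset may attract far more probes than another; weighting by the estimate is what lets the per-node work, rather than the per-node table size, come out even.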