Parallelization of K-means++ using CUDA

1st Maliheh Heydarpour Shahrezaei
Pouyandegane-danesh Higher Education Institute
Chalus, [email protected]

2nd Reza Tavoli
Islamic Azad University
Chalus, [email protected]
Abstract—K-means++ is an algorithm invented to improve the process of finding initial seeds in the K-means algorithm. In this algorithm, initial seeds are chosen consecutively, with a probability proportional to the squared distance to the nearest center already chosen. The most crucial problem of this algorithm is that running it in serial mode slows down clustering. In this paper, we parallelize the most time-consuming steps of the K-means++ algorithm. Our purpose is to reduce the running time while maintaining the quality of the serial algorithm.
Index Terms—K-means++, Parallelization, Clustering
I. INTRODUCTION
In the K-means clustering algorithm [1], a well-known version of clustering proposed by Lloyd [2], one first needs to determine the initial set of K seeds. Generally, this is done with a random function, and the seeds are selected randomly. Many approaches have been invented to improve the process of finding the initial seeds in the K-means algorithm. The K-means++ algorithm [3] is one of the methods that can be used to find the centroids.

The K-means++ method can improve the process of finding initial seeds in K-means and can improve the quality of the clusters by providing a more effective approach than the random method used in the K-means algorithm. In the K-means++ approach, the seeds are chosen by a probabilistic method in which any point x is selected with a probability proportional to the square of the distance from x to the nearest center already chosen.

Clustering algorithms are important in different areas of study [4]–[7] and can be used for designing various intelligent systems [8]. They can even be utilized to provide anonymity for networks, further improving methods like [9], [10]. Given this, devising a methodology for accelerating the clustering process is significant. Several methods have been invented to parallelize clustering algorithms [11]–[13], including K-means [14]–[16], to further improve their efficiency. However, there are only a few works on the parallelization of K-means++.

The earliest approach is a master's thesis by Karch, published in 2010 [17]. The thesis discusses the parallelization of both K-means and K-means++ on the GPU. However, it only considers the parallelization of the initialization part of the algorithm and none of its other parts. Bahmani et al. also discuss the K-means++ method in [18]. They present a different K-means initialization technique; however, they do not provide time comparisons between their method and K-means, or even K-means++.

In the presented work, we provide an implementation of the parallelization of K-means++. Our implementation targets the GPU, and we use different approaches to decrease the processing time. Finally, we present our results to illustrate how parallelization can significantly improve the running time in comparison to serial mode.

II. K-MEANS++ PARALLELIZATION
A pseudo-code of the serial implementation of K-means++ is shown in Algorithm 1. It shows that, after the selection of each seed, the distance of each point to the nearest seed is updated.

Algorithm 1 Serial K-means++
  Input: H: list of data points (N), number of clusters (K), distance function (d)
  Output: O: the set of n points, classified into k different clusters
  Function Serial-K-means++(N, K):
    // Select the first seed randomly
    H_1 = rand n ∈ N
    O(l) = ∞
    // The rest of the seeds are selected by distance and by probability
    for m ∈ (1, k−1) do
      for l ∈ (1, n) do
        // Update the nearest distance
        d_{m,l} = d(H_m, n_l)
        d_{sn_l} = min_{x ∈ (1,m)} d_{x,l}
      // Taking H_{m+1}: select n_l with probability
      P(n_l) = d_{sn_l} / Σ_{w=1}^{z} d_{sn_w}
    Return H

In the following, we describe the method for parallelizing K-means++. As previously mentioned, we aim to parallelize the most time-consuming steps of the K-means++ algorithm. As Algorithm 1 indicates, in K-means++ the distance of every point to the chosen centers must be calculated, and the seeds are then chosen based on a probability formula. These are the most challenging aspects of parallelizing K-means++. To make the seeding step concrete, a short serial sketch is given below.
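The following is a minimal serial C++ sketch of the seeding step of Algorithm 1; the Point type, the squared-distance helper dist2, and the use of std::discrete_distribution for the probability step are our illustrative assumptions rather than the paper's exact implementation.

    #include <algorithm>
    #include <limits>
    #include <random>
    #include <vector>

    struct Point { double x, y; };  // 2-D points, as in our experiments

    static double dist2(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;  // squared Euclidean distance
    }

    // Returns the indices of the k seeds chosen by K-means++.
    std::vector<int> serialSeeding(const std::vector<Point>& pts, int k,
                                   std::mt19937& rng) {
        int n = static_cast<int>(pts.size());
        std::vector<double> dsn(n, std::numeric_limits<double>::infinity());
        std::vector<int> seeds;

        // Select the first seed uniformly at random.
        std::uniform_int_distribution<int> uni(0, n - 1);
        seeds.push_back(uni(rng));

        for (int m = 1; m < k; ++m) {
            // Update each point's distance to its nearest chosen seed
            // (the most time-consuming step of the algorithm).
            for (int l = 0; l < n; ++l)
                dsn[l] = std::min(dsn[l], dist2(pts[seeds.back()], pts[l]));
            // Select the next seed with probability P(n_l) proportional to dsn.
            std::discrete_distribution<int> pick(dsn.begin(), dsn.end());
            seeds.push_back(pick(rng));
        }
        return seeds;
    }

The two nested loops of this sketch are the target of the parallelization described next.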
In Algorithm 2, we show the process of parallelizing the K-means++ algorithm. As Algorithm 2 illustrates, the parallelization procedure starts by partitioning the points into l partitions. It then parallelizes the calculation of the distances of the points to the centers, through to the end of the algorithm.

Algorithm 2 Parallel K-means++
  Input: H: list of data points (N), number of clusters (K), distance function (d)
  Output: O: the set of n points, classified into k different clusters
  Function Parallel-K-means++(N, K):
    // Select the first seed randomly
    H_1 = rand n ∈ N
    O(l) = ∞
    // The rest of the seeds are selected by distance and by probability
    Partition N into (N_1, ..., N_l)
    for m ∈ (1, k−1) do
      for l ∈ (1, n) do → in parallel
        // Update the nearest distance
        d_{m,l} = d(H_m, n_l)
        d_{sn_l} = min_{x ∈ (1,m)} d_{x,l}
      // Taking H_{m+1}: select n_l with probability
      P(n_l) = d_{sn_l} / Σ_{w=1}^{z} d_{sn_w}
    Return H

To start the parallelization process, we rely on CUDA [19], a parallel computing platform designed by Nvidia [20]. CUDA is used to accelerate applications in various areas such as computational chemistry, bioinformatics, and other fields of research [21]–[25]. Using CUDA, we consider several different methodologies for the parallelization, including:

• Global memory
• Constant memory
• Texture memory
• Parallel reduction using Thrust

We should mention that we also considered shared memory, kernel launch overhead, and loop unrolling as further methods of parallelization, but these three approaches had problems when applied to our work. First, when using shared memory, all of the blocks need to fetch a common piece of data; in this work, the only parameters that could be placed in shared memory are the centroids. However, fetching centroids one by one from shared memory for several blocks would not be efficient, and alternatives like constant memory are more helpful. Second, loop unrolling was not applicable, since the loop bounds change between iterations, so this method could not be applied to the algorithm. Finally, in the case of kernel launch overhead, our analysis shows that the amount of processing we do in each kernel launch is much greater than the overhead of launching the kernel, so the kernel launch overhead is insignificant as well.

As a result, we use Global memory, Constant memory, Texture memory, and a parallel reduction in Thrust [26] for the parallelization. Furthermore, the focus of our work is on the initial centroid selection; the clustering part is implemented in the same way as in the K-means algorithm. In fact, in parallel mode, the initialization part takes more time to execute than the clustering part. In the centroid selection process, every time a new centroid is found, its index is stored in a separate array. We use these indices to find the coordinates of the centroid in the array containing all the points, and in the constant memory experiment, its coordinates are stored in constant memory for fast retrieval.

Each thread in the centroid selection process is responsible for calculating the distance of a single point from all the selected centroids; hence, each thread has an equal workload. This workload increases equally with each subsequent centroid selection. If K is small, the workload of each thread is small and equal at all times. A minimal sketch of such a per-point distance-update kernel is shown below.
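The kernel below sketches this per-point work with all data resident in global memory; the kernel name updateNearest, the double2 point layout, and the two-dimensional distance are illustrative assumptions, not our exact code.

    // One thread per point: fold the newest centroid into the running
    // nearest-squared-distance array d_sn, everything in global memory.
    __global__ void updateNearest(const double2* points, int n,
                                  double2 centroid, double* d_sn) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double dx = points[i].x - centroid.x;
        double dy = points[i].y - centroid.y;
        double d = dx * dx + dy * dy;
        if (d < d_sn[i]) d_sn[i] = d;  // keep the minimum over all chosen centroids
    }

One launch of this kernel corresponds to one pass of the parallel inner loop of Algorithm 2; the launch geometry is the configuration given in Eq. (1) below.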
Starting the parallelization, we first use Global memory. We allocate all of the data points in global memory. Afterward, we define the blocks of threads, and each thread is used to calculate the distance to a chosen centroid. In our implementation, we consider the following configuration for the size of the blocks and the number of blocks:

    Threads = 1024
    Blocks = ceil(NumberOfPoints / Threads)        (1)

Using Global memory, we improve the performance of the algorithm by a large margin.

Next to Global memory, we use Constant memory. Constant memory can be used for data that will not change while the kernel is executing. It is a read-only memory and is faster than Global memory; using it in place of Global memory can reduce the required memory bandwidth in some cases. NVIDIA hardware provides 64 KB of constant memory, and this space is cached. The advantage of this kind of memory is that when the threads of a half-warp read from the constant cache, the read is as fast as reading from a register, provided all of the threads read the same address. Considering these facts, we use Constant memory as a method for improving the efficiency of the K-means++ algorithm.

Using Constant memory, we first tried to place all of the points in constant memory. However, when the number of points grows by a large margin, the algorithm no longer works: there are too many points for the amount of constant memory. As discussed above, constant memory is only 64 KB, while we consider millions of points in our evaluation. As a result, we place only the centroids in constant memory instead of the points. Using constant memory this way further improves the performance of the algorithm. If we increase the number of clusters from 30 to 50, there is a slight improvement in the constant memory version; increasing it beyond 50 shows much more improvement. We present these results in Sec. III. A sketch of this placement follows.
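Below is a minimal sketch of keeping the centroids in constant memory while the points stay in global memory; the MAX_CLUSTERS bound and the symbol names are assumptions for illustration (1024 double2 values occupy 16 KB of the 64 KB budget).

    #define MAX_CLUSTERS 1024  // 16 KB of double2, well under the 64 KB budget

    // Centroids are read-only during a launch, so they live in constant memory.
    // All threads read the same centroid at the same time, which the constant
    // cache serves as a broadcast, as fast as a register read.
    __constant__ double2 c_centroids[MAX_CLUSTERS];

    __global__ void updateNearestConst(const double2* points, int n,
                                       int numCentroids, double* d_sn) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double best = d_sn[i];
        for (int c = 0; c < numCentroids; ++c) {
            double dx = points[i].x - c_centroids[c].x;
            double dy = points[i].y - c_centroids[c].y;
            double d = dx * dx + dy * dy;
            if (d < best) best = d;
        }
        d_sn[i] = best;
    }

    // Host side, before each launch:
    // cudaMemcpyToSymbol(c_centroids, h_centroids, numCentroids * sizeof(double2));

Because every thread of a half-warp reads the same c_centroids[c] at the same moment, this access pattern is exactly the broadcast case the constant cache is designed for.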
Next to Constant memory and Global memory, we further expand the methodology for parallelizing the K-means++ algorithm by using Texture memory. This type of memory is similar to Constant memory: it can provide better efficiency and reduce memory traffic when accesses and reads follow a specific pattern. It is also similar to Global memory, but it performs better because it is treated as read-only memory. Since Texture memory cannot be modified from the device side, the device does not need to keep track of updates or cache coherence. We should mention that reading from Global memory is generally faster than from Texture memory if the accesses are coalesced. However, if we access neighboring memory locations, Texture memory gives better results than Global memory. In fact, for specific access patterns, reading device memory through texture fetches is more beneficial than fetching from Global or even Constant memory. Since we repeatedly access the same cells in memory, we use Texture memory, which can perform more efficiently than Global memory and can improve the performance of the K-means++ algorithm. The approach here is to store all the points in Texture memory rather than Global memory, because their locations do not change.

Apart from using Global, Constant, and Texture memory, we use the parallel reduction method. Since we have a Sum() function in the K-means++ algorithm, we can rely on a partial-sum method to improve performance. To do this, we use Thrust, a C++ library for CUDA modeled on the Standard Template Library (STL). It provides two different container types, host_vector and device_vector. As the names show, a host_vector resides in host memory, while a device_vector is stored in device memory. They can both be added as headers in the source code:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

To perform the reduction, a reduction algorithm applies a binary operation to reduce an input sequence to a single value; for instance, summing a sequence of numbers reduces the array by repeated addition. In Thrust, a partial sum, or the sum of an array in reduction mode, can be implemented with the Thrust reduce function (thrust::reduce). The following is how Thrust implements reduce for a Sum() function:

    int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

Here, D.begin() and D.end() identify the range of values, while (int) 0 and thrust::plus<int>() give the initial value and the reduction operator, which is plus in this case. Accordingly, in our implementation we use the following call for the sum reduction:

    double sum = thrust::reduce(dptr, dptr + num_points, 0.0);

We should also mention that the Thrust reduction is common to all of the experiments and is assumed to perform consistently. Therefore, the overall performance of the algorithm in each experiment is not impacted differently by the reduction step, and the overall performance can be considered a measure of centroid selection. A small self-contained example of this reduction follows.
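For completeness, the following is a minimal runnable sketch of this reduction step; the small hard-coded array is a stand-in for the d_sn distances, which in our setting already reside on the device.

    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>

    int main() {
        // Stand-in for the array of nearest squared distances, d_sn.
        thrust::device_vector<double> d_sn(4);
        d_sn[0] = 1.0; d_sn[1] = 2.5; d_sn[2] = 0.5; d_sn[3] = 4.0;

        // Sum reduction on the GPU: the denominator of P(n_l).
        double sum = thrust::reduce(d_sn.begin(), d_sn.end(), 0.0,
                                    thrust::plus<double>());
        printf("sum of distances = %f\n", sum);  // prints 8.000000
        return 0;
    }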
III. RESULTS

In this section, we explain the evaluation and analyze the performance of the K-means++ algorithm in both parallel mode (GPU) and serial mode (CPU). Our implementation runs on Amazon Web Services (AWS), on Linux Ubuntu 16.04 with the NVIDIA 367.27 driver.

[Fig. 1: Improvement in performance using Global memory. (a) Increasing number of clusters. (b) Increasing number of points.]
We demonstrate the results of applying Global memory in our implementation. Using Global memory, we consider two scenarios. First, we discuss the improvement of the GPU version with Global memory over the CPU when the number of points increases from 1 million to 10 million and the number of clusters is held constant (NumC = 50). Second, we consider the scenario where the number of clusters increases from 10 to 100 while the number of points is held constant (NumP = 4,000,000). The results of using Global memory in these scenarios are shown in Fig. 1a and Fig. 1b. As Fig. 1 shows, the performance of the K-means++ algorithm is improved when it is implemented on the GPU rather than the CPU, and we see more improvement as the number of points and clusters increases. Using Global memory, when the number of clusters grows from 10 to 100, the GPU time varies between 1 and 5 seconds, while the CPU time increases from 3 seconds to 4 minutes. On the other hand, when increasing the number of points from 1 to 10 million, the GPU time stays within a range of 1 to 5 seconds, while the CPU time goes from 18 seconds to 3 minutes.

[Fig. 2: Improvement in performance using Constant memory. (a) Increasing number of clusters. (b) Increasing number of points. (c) Global vs. Constant memory.]
Next to Global memory, Fig. 2a and Fig. 2b demonstrate the performance of the algorithm in parallel mode when we use Constant memory. As Fig. 2 shows, using Constant memory we obtain a piecemeal improvement in the performance of the algorithm. Increasing the number of clusters from 1 to 100, the GPU time stays within a range of 1 to 5 seconds, while the CPU time goes from 3 seconds to 4 minutes. For growth from 1 to 10 million points, the GPU time is 1 to 5 seconds, while the CPU time is 18 seconds to 3 minutes. One point worth mentioning is that the performance of Constant memory and Global memory appears the same when time is reported in seconds; this changes when we consider time in milliseconds. To show this, Fig. 2c compares the performance of Constant memory and Global memory. As Fig. 2c illustrates, using Constant memory provides better performance for the K-means++ algorithm: comparing Global to Constant memory while changing the number of clusters from 10 to 100, we see a slight improvement of 2 to 11 percent across these steps, and the improvement grows as the number of clusters increases further.

Besides Global and Constant memory, we evaluate the efficiency of the K-means++ algorithm using Texture memory. We apply both of the aforementioned scenarios to this type of memory; the results are shown in Fig. 3. As indicated, using Texture memory we obtain an outstanding improvement in efficiency for the K-means++ algorithm. Incrementing the number of clusters from 1 to 100 (Fig. 3a), the GPU time stays within a range roughly similar to Global memory, while the CPU time goes from 3 seconds to 4 minutes. When increasing the number of points from 1 million to 10 million (Fig. 3b), the GPU time varies between 1 and 4 seconds, which is slightly better than Global memory, while the CPU time is the same as with both Global and Constant memory.

[Fig. 3: Improvement in performance using Texture memory. (a) Increasing number of clusters. (b) Increasing number of points. (c) Global vs. Texture memory.]
As we previously discussed, Texture memory performs better when we access neighboring locations in memory, and because it is not modified from the device side. In Fig. 3c we provide the time comparison between Global memory and Texture memory. As presented there, Texture memory performs better every time; at each step, Texture memory is 10 to 14 percent more efficient than Global memory.
One result is expected: when the number of clusters is small (K < 7), the performance of the CPU is slightly better than that of the GPU. This implies two things. First, the data transfer from host to device and the kernel launch overhead cause the GPU version to take longer. Second, we work in a two-dimensional environment; with higher-dimensional data, the point at which the GPU provides better speed is reached sooner. As we can see, there is a trade-off between the speed of the algorithm and the required workload.

IV. CONCLUSION