Clustering with Penalty for Joint Occurrence of Objects: Computational Aspects
Ondřej Sokol (Corresponding Author)
Prague University of Economics and Business, Winston Churchill Square 4, 130 67 Prague 3, [email protected]

Vladimír Holý
Prague University of Economics and Business, Winston Churchill Square 4, 130 67 Prague 3, [email protected]
Abstract:
The method of Holý, Sokol and Černý (Applied Soft Computing, 2017, Vol. 60, p. 752–762) clusters objects based on their incidence in a large number of given sets. The idea is to minimize the occurrence of multiple objects from the same cluster in the same set. In the current paper, we study computational aspects of the method. First, we prove that the problem of finding the optimal clustering is NP-hard. Second, to numerically find a suitable clustering, we propose to use the genetic algorithm augmented by a renumbering procedure, a fast task-specific local search heuristic and an initial solution based on a simplified model. Third, in a simulation study, we demonstrate that our improvements of the standard genetic algorithm significantly enhance its computational performance.
Keywords:
Cluster Analysis, Computational Complexity, Genetic Algorithm, Local Search.
JEL Codes:
C38, C61, C63.
1 Introduction

Clustering of objects is typically based on a distance between objects or a density of objects in an area. Holý et al. (2017) propose a very different approach and cluster objects based on their joint occurrence in observed sets. It is assumed that there should typically be at most one object from each cluster in a single observed set. Deviation from this behavior is considered an error and the goal is to find the clustering which minimizes this error. Specifically, Holý et al. (2017) define the error as the average ratio of object pairs in which two objects from the same cluster occur in the same set. The problem of finding clusters minimizing this error can be formulated as an integer nonlinear optimization problem.

The motivation for this clustering technique lies in retail analytics. Holý et al. (2017) use this method to cluster products of a retail store into categories of substitutes. In this setting, it is assumed that most customers buy at most one product (object) from each cluster in a single visit (set), i.e. two or more products (objects) from the same cluster rarely occur together in the same shopping basket (set). This is quite reasonable behavior suggesting customers choose only one product for a given purpose. In retail, there is typically a large number of products (objects) and a huge number of shopping baskets (sets), making the method computationally very intensive. The main advantage is that no characteristics of products are needed and only a history of transactions is utilized. Holý et al. (2017) show that this method can uncover meaningful clusters in an empirical study of a Czech drugstore chain. For other clustering approaches in retail business, see e.g. Jonker et al. (2004), Tsai and Chiu (2004), Reutterer et al. (2006), Zhang et al. (2007), Lingras et al. (2014), Ammar et al. (2016), Peker et al. (2017), Wu and Liu (2020), and Sokol and Holý (2020a).

In this paper, we analyze the approach of Holý et al. (2017) from a computational point of view. Preliminary results were presented in Sokol and Holý (2020b). First, we demonstrate that the related optimization problem is NP-hard. We build our proof on the results of Karp (1972) for the Max-Cut problem. Second, we propose to numerically find clusters using the genetic algorithm combined with local search. We adjust the standard genetic algorithm by applying the renumbering procedure of Falkenauer (1998) and Hruschka and Ebecken (2003) suitable for the integer representation of clusters. Note that Hruschka and Ebecken (2003) also propose crossover and mutation operations specifically adapted for the integer representation of clusters. However, these operations are based on a distance between objects and are not applicable in our case. We therefore resort to the standard versions of these operations. Such an approach in the context of clustering is used e.g. by Murthy and Chowdhury (1996). We further enhance the genetic algorithm by a computationally effective local search. This improvement of the genetic algorithm is in general suggested e.g. by Hamzaçebi (2008). Finally, in the initial population, we include the solution of the simplified problem obtained by the k-means method with data transformed to distances. Our modifications significantly improve the computational performance in comparison to the basic genetic algorithm utilized by Holý et al. (2017). Our numerical method falls into the field of clustering methods based on nature-inspired metaheuristics. For an overview of this field, see e.g. Hruschka et al.
(2009), Nanda and Panda (2014) and José-García and Gómez-Flores (2016).

The rest of the paper is structured as follows. In Section 2, we formulate our problem of finding clusters. In Section 3, we prove that this problem is NP-hard. In Section 4, we propose to numerically solve this problem by the improved genetic algorithm. In Section 5, we investigate the computational performance of the proposed algorithm using simulated data. We conclude the paper in Section 6.

2 Problem Formulation

The problem is based on the following data structure. Let $N$ be the number of sets, $M$ the number of objects and $K$ the maximum number of clusters. A matrix $A$ is available with $N$ rows, $M$ columns and elements $a_{ij}$ defined as

$$a_{ij} = \begin{cases} 1 & \text{if object } j \text{ is present in set } i, \\ 0 & \text{otherwise}. \end{cases} \qquad (1)$$

Furthermore, we assume that there are at least two objects in each set, i.e. $\sum_{j=1}^{M} a_{ij} \geq 2$ for all $i \in \{1, \ldots, N\}$. Otherwise, the set is not interesting for our goal and can be ignored.

The variables in the model are vectors of possible object clustering $x = (x_1, \ldots, x_M)'$, where $x_j$ is an integer in the range $1 \leq x_j \leq K$ for every $j \in \{1, \ldots, M\}$. We are looking for such clustering $x$ that minimizes the weighted occurrences of pairs of objects from one cluster in the same set.

For each set $i = 1, \ldots, N$, we denote the total number of object pairs in the set by $D_i$,

$$D_i = \binom{E_i}{2}, \qquad E_i = \sum_{j=1}^{M} a_{ij}, \qquad (2)$$

and the number of violating object pairs from one cluster within the same set by $V_i$,

$$V_i(x) = \sum_{k=1}^{K} \binom{W_{ik}(x)}{2}, \qquad W_{ik}(x) = \sum_{j=1}^{M} a_{ij} I(x_j = k), \qquad (3)$$

where $I(\cdot)$ denotes the indicator function. The cost function is then

$$f_{\text{cost}}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{V_i(x)}{D_i}. \qquad (4)$$

The cost function therefore equals the average ratio of object pairs in which two objects from the same cluster are in the same set. The range of the cost function is from 0 (no set contains two objects from the same cluster) to 1 (every set contains only objects from the same cluster).

The nonlinear integer programming problem is of the form

$$\min_x f_{\text{cost}}(x) \quad \text{s.t.} \quad x_j \leq K \text{ for } j = 1, \ldots, M, \quad x_j \in \mathbb{N} \text{ for } j = 1, \ldots, M. \qquad (5)$$

In practical tasks, we can assume $N \gg M \gg K$.

The model can be straightforwardly transformed from the integer program to a binary program. Let

$$y_{jk} = \begin{cases} 1 & \text{if object } j \text{ is assigned to cluster } k, \\ 0 & \text{otherwise}, \end{cases} \qquad (6)$$

for each $j$ and $k$, and let $Y$ be the matrix with elements $y_{jk}$. Note that in the optimization process itself, the number of variables $y_{jk}$ can be further reduced by $M$ as $y_{jK} = 1 - \sum_{k=1}^{K-1} y_{jk}$ for every $j = 1, \ldots, M$. As the number of violating object pairs is dependent on the clustering $Y$, we define $V'_i$ alternatively as

$$V'_i(Y) = \sum_{k=1}^{K} \binom{W'_{ik}(Y)}{2}, \qquad W'_{ik}(Y) = \sum_{j=1}^{M} a_{ij} y_{jk}. \qquad (7)$$

Similarly to (4), we define the cost function as

$$f'_{\text{cost}}(Y) = \frac{1}{N} \sum_{i=1}^{N} \frac{V'_i(Y)}{D_i}. \qquad (8)$$

The nonlinear binary programming model is then

$$\min_Y f'_{\text{cost}}(Y) \quad \text{s.t.} \quad \sum_{k=1}^{K} y_{jk} = 1 \text{ for } j = 1, \ldots, M, \quad y_{jk} \in \{0, 1\} \text{ for } j = 1, \ldots, M,\ k = 1, \ldots, K. \qquad (9)$$
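To make the cost function concrete, the following R sketch evaluates (4) for a given clustering (our own illustration rather than code from the paper; the function name and the use of a dense 0/1 matrix are assumptions):

```r
# Cost function (4): average ratio of violating object pairs per set.
# A is an N x M binary incidence matrix, x a length-M vector of cluster
# labels in 1..K; every set (row) is assumed to contain at least two objects.
cost <- function(A, x, K) {
  E <- rowSums(A)                       # set sizes E_i
  D <- choose(E, 2)                     # object pairs D_i per set, see (2)
  # W[i, k] = number of objects from cluster k in set i, see (3)
  W <- sapply(1:K, function(k) rowSums(A[, x == k, drop = FALSE]))
  V <- rowSums(choose(W, 2))            # violating pairs V_i(x)
  mean(V / D)                           # (1/N) * sum_i V_i(x) / D_i
}
```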
3 NP-Hardness

Theorem 1. Problem (5) is NP-hard.
Proof. In order to prove it, we reduce problem (9) to the simplest case. Set $K = 2$, i.e. let there be only two clusters of objects, and let $\sum_{j=1}^{M} a_{ij} = 2$ for all $i$, i.e. the size of all sets is 2. The number of object pairs is therefore $D_i = 1$ in each set $i$. Also, as we have only two clusters, we can define

$$z_j = \begin{cases} 0 & \text{if object } j \text{ is assigned to cluster } 1, \\ 1 & \text{if object } j \text{ is assigned to cluster } 2, \end{cases} \qquad (10)$$

and let

$$P_{j\ell} := \sum_{i=1}^{N} a_{ij} a_{i\ell} \quad \text{for } j, \ell = 1, \ldots, M,\ j \neq \ell, \qquad (11)$$

which is the number of occurrences of every pair of objects in the same set.

The problem can now be formulated as

$$\min_z \sum_{j < \ell} P_{j\ell} I(z_j = z_\ell) \quad \text{s.t.} \quad z_j \in \{0, 1\} \text{ for } j = 1, \ldots, M. \qquad (12)$$

Without the loss of generality, we can rewrite the objective function in the maximization form as

$$\max_z \sum_{j < \ell} P_{j\ell} I(z_j \neq z_\ell) \quad \text{s.t.} \quad z_j \in \{0, 1\} \text{ for } j = 1, \ldots, M. \qquad (13)$$

Let $P$ denote the matrix with elements $P_{j\ell}$. The decision problem corresponding to (13) can be formulated as follows:
Problem 1. Is there a binary vector $z = (z_1, z_2, \ldots, z_M)$ such that for a given $M$, $P$, and $C$ it holds that

$$\sum_{j < \ell} P_{j\ell} I(z_j \neq z_\ell) \geq C? \qquad (14)$$

Therefore, an instance of the problem is given by $\{M, P, C\}$. Now, we show that the Max-Cut problem can be reduced to Problem 1. The Max-Cut is an NP-hard decision problem (see the proof of NP-hardness in Karp, 1972) defined as follows:
Problem 2. Having a graph $G = (V_G, E_G)$, a weighting function $w: E_G \to \mathbb{Z}$ and a positive integer $C$, is there a set $S \subset V_G$ such that

$$\sum_{\{j,\ell\} \in E_G,\ j \in S,\ \ell \notin S} w(\{j, \ell\}) \geq C? \qquad (15)$$

Hence, an instance of the Max-Cut problem is given by $\{V_G, E_G, w, C\}$. In order to prove that the Max-Cut problem is reducible to Problem 1, we need to show that every instance of Problem 2 is reducible to Problem 1. The reduction function is the following:

$$g: \{V_G, E_G, w, C\} \mapsto \{n, P, C\}. \qquad (16)$$

$V_G$ is mapped to a vector $(1, 2, \ldots, n)$ where $n$ is the number of vertices. Values of the weighting function $w(j, \ell)$ are directly translated into $P_{j\ell}$. If the graph $G = (V_G, E_G)$ is not complete, then $P_{j\ell}$ is set to zero for the missing edges. The value of $C$ remains the same.
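As an illustration, the reduction $g$ can be sketched in R (our own example; the function and argument names are hypothetical):

```r
# Reduction g: map a Max-Cut instance with n vertices and a weighted edge
# list to the matrix P of Problem 1; missing edges receive weight zero and
# the bound C carries over unchanged.
reduce_maxcut <- function(n, edges) {
  # edges: data frame with vertex columns j, l and integer weights w
  P <- matrix(0, n, n)
  for (e in seq_len(nrow(edges))) {
    P[edges$j[e], edges$l[e]] <- edges$w[e]
    P[edges$l[e], edges$j[e]] <- edges$w[e]  # keep P symmetric
  }
  P
}
```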
Remark. Note that the Max-Cut problem does not assume a non-negative weighting function $w$, while in our setup the $P_{j\ell}$ are naturally non-negative as $P_{j\ell}$ is the number of instances in which two objects are in the same set; however, this is not a problem. As shown in the reduction of the Knapsack problem to the Max-Cut problem through the Partition problem (Karp, 1972), there are two cases in which the $w(j, \ell)$ are negative.

1. If the sum of all items in the Knapsack form is lower than the capacity, then $w(j, \ell)$ may be negative. However, such an instance is trivial to solve.

2. If the weight $w^{(KP)}_j$ of the $j$-th item in the Knapsack problem is negative, then $w(j, \ell)$ may be negative. However, this case can be easily transformed to an instance without negative weights. Such a transformation consists in multiplying the negative weight $w^{(KP)}_j$ by $-1$, switching the affected binary variable $x^{(KP)}_j$ to $1 - x^{(KP)}_j$ and increasing the total capacity of the knapsack by $-w^{(KP)}_j$.

Therefore, the cases in which $w(j, \ell)$ are negative can be straightforwardly transformed to non-negative cases or are easy to solve.

As a result, every hard instance of the Max-Cut problem given by $\{V_G, E_G, w, C\}$ can be transformed to Problem 1 using the reduction function $g$ and therefore problem (5) is NP-hard. ∎
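For completeness, the left-hand side of (14) can be evaluated with a few lines of R (our own hypothetical helper), which makes the correspondence to the weight of a cut explicit:

```r
# Objective (14): total weight P_jl over pairs j < l separated by z,
# i.e. the weight of the cut between {j : z_j = 0} and {j : z_j = 1}.
cut_value <- function(P, z) {
  separated <- outer(z, z, "!=")       # TRUE where z_j != z_l
  sum(P[upper.tri(P) & separated])     # sum over pairs with j < l
}
```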
4 Numerical Optimization

In our model, it is not possible to calculate the similarity between individual objects without losing important information. Hence, we cannot use standard clustering methods such as k-means, DBSCAN or hierarchical clustering methods. For this reason, an integer genetic algorithm is chosen to find a suitable clustering.

4.1 Genetic Algorithm

The standard genetic algorithm has the following phases. First, the initial population is generated, where by population we mean a set of different clusterings x called individuals. Objects x_j assigned to clusters in a clustering x are called genes. Individuals can also be inserted into the population if promising candidates are available. After the initial population is prepared, the following steps are conducted in each generation:

1. Designation of a given number of the best individuals as elite in order to preserve them.
2. Modification of non-elite individuals.
   (a) By crossover, when two different individuals are randomly selected using roulette wheel selection and two new individuals are created by randomly swapping genes, replacing the original ones.
   (b) By mutation, when randomly selected genes are randomly switched.
3. Evaluation of all new individuals.
4. If the specified number of generations or a sufficiently high-quality result is not achieved, the algorithm returns to Step 1; else the best individual is chosen and the algorithm ends.

In order to apply the genetic algorithm, several parameters are to be chosen. The first one is the size of population. With a bigger population of individuals, more possible clusterings are explored, which can result in finding a better solution at the cost of higher computational demands. This parameter is task-specific and in our case a large population is preferred (see Holý et al., 2017). The number of iterations, sometimes called generations in terms of evolutionary algorithms, is another parameter. Simply put, it determines how long the solution space should be searched. The number of elites parameter determines how many individuals with the best evaluation are declared elite and are passed to the next generation without any alteration. This prevents the loss of the best individuals in the population. The parameter mutation chance determines the probability of changing a gene value to a random one. The purpose of mutation in the genetic algorithm is to introduce more diversity into the population, thus avoiding a local minimum by preventing the individuals from becoming too similar to each other. However, if a high value is selected, then the crossover effect is suppressed and the algorithm becomes more of a random search of the solution space.

By far the most computationally demanding part of the algorithm is evaluating newly found clusterings, as it is necessary to work with a matrix with large dimensions, specifically N × M.
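For concreteness, one generation of such an integer-coded genetic algorithm might look as follows (a schematic R sketch under our own naming; the positive fitness transformation and operator details are assumptions, not the authors' implementation):

```r
# One generation: elitism, roulette-wheel selection with uniform crossover,
# and random-reset mutation. pop is a list of clusterings (integer vectors),
# fit a vector of positive fitness values, e.g. 1 / (cost + 1e-9).
# Assumes n_elite < length(pop).
next_generation <- function(pop, fit, K, n_elite, p_mut) {
  ord <- order(fit, decreasing = TRUE)
  new_pop <- pop[ord[seq_len(n_elite)]]               # preserve elite individuals
  while (length(new_pop) < length(pop)) {
    parents <- sample(seq_along(pop), 2, prob = fit)  # roulette wheel
    a <- pop[[parents[1]]]
    b <- pop[[parents[2]]]
    swap <- runif(length(a)) < 0.5                    # uniform gene swap
    new_pop <- c(new_pop, list(ifelse(swap, b, a), ifelse(swap, a, b)))
  }
  new_pop <- new_pop[seq_along(pop)]
  for (i in (n_elite + 1):length(new_pop)) {          # mutate non-elite only
    mutate <- runif(length(new_pop[[i]])) < p_mut
    new_pop[[i]][mutate] <- sample.int(K, sum(mutate), replace = TRUE)
  }
  new_pop
}
```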
4.2 Renumbering

There are several complications concerning the application of traditional genetic algorithms to clustering tasks. The main one is the codification of clusterings. In our case, the clustering x has M elements x_j with values in {1, ..., K}. If the standard genetic algorithm is used, the resulting codification allows symmetries in the solution space; for example, assuming M = 5 and K = 3, the solution x = (3, 1, 3, 2, 2) is in fact identical to x = (2, 1, 2, 3, 3), but the standard genetic algorithm treats them as significantly different. This has inappropriate consequences, especially in the case of crossover.

This shortcoming can be remedied by introducing a simple rule (see Falkenauer, 1998): cluster numbers are renumbered to start from the smallest based on the first occurrence of each cluster, e.g. the solution above would be renumbered to x = (1, 2, 1, 3, 3). The renumbering procedure allows the suitable application of classic genetic algorithms to clustering problems, avoiding the problems of redundant codification (see Hruschka and Ebecken, 2003).
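In R, this renumbering rule is a one-liner (a sketch; the function name is ours):

```r
# Relabel clusters by order of first occurrence, so that e.g.
# c(3, 1, 3, 2, 2) and c(2, 1, 2, 3, 3) both map to c(1, 2, 1, 3, 3).
renumber <- function(x) match(x, unique(x))
```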
4.3 Local Search

The second proposed improvement is the implementation of a task-specific local search. The local search function consists in generating all possible neighbor individuals for the best individual in each iteration and checking whether the new individuals improve the cost function. If a newly found individual gives a better result than the original one, then the original is simply replaced. If the individual has already been checked in the previous iterations, then no local search is executed, as the individual cannot be improved by local search.

By far the most time-consuming part of the used genetic algorithm is the frequently called evaluation of $f_{\text{cost}}(x)$. In the proposed procedure, we therefore try to approach the evaluation efficiently by repeated usage of the intermediate calculations, similar to the dynamic programming approach. The goal is to find the neighbors of an individual $x$, i.e. all possible vectors $x^{(j,k_1)}$ for all $j$ and $k_1$, which differ from $x$ in exactly one element $j$ that is changed from the original cluster $k_0$ to the new cluster $k_1$. To find the neighbors, it is not necessary to calculate the value of the cost function from the beginning, but to use already prepared calculations. The main idea follows from the decomposition

$$V(x^{(j,k_1)}) := \sum_{i=1}^{N} V_i(x^{(j,k_1)}) = \sum_{i=1}^{N} V_i(x) + \underbrace{\sum_{i=1}^{N} \binom{W_{ik_0}(x^{(j,k_1)})}{2}}_{V^{(0)}(x^{(j,k_1)})} - \underbrace{\sum_{i=1}^{N} \binom{W_{ik_0}(x)}{2}}_{V^{(0)}(x)} + \underbrace{\sum_{i=1}^{N} \binom{W_{ik_1}(x^{(j,k_1)})}{2}}_{V^{(1)}(x^{(j,k_1)})} - \underbrace{\sum_{i=1}^{N} \binom{W_{ik_1}(x)}{2}}_{V^{(1)}(x)}. \qquad (17)$$

With the knowledge of $V(x^{(j,k_1)})$, the cost function $f_{\text{cost}}(x^{(j,k_1)})$ can be computed quickly, as the other parts of the formula remain unchanged.

Let D be the vector of all D_i and B the N × K matrix of the numbers of objects by set and cluster. In the algorithm, the operation / stands for element-wise division and \ is the set subtraction operation. The local search function which finds all neighbors of a clustering x is described in Algorithm 1.

Algorithm 1 Local Search function

Require: x, V(x), A, B, D
  V_best := V(x)
  x_best := x
  for j ∈ {1, ..., M} do
    I := which(A[, j] = 1)                          ▷ sets containing object j
    k_0 := x[j]
    B^(0) := B[I, k_0]
    if V^(0)(x) = NULL then
      V^(0)(x) := sum(choose(B^(0), 2) / D[I])
    end if
    B^(0) := B^(0) − 1                              ▷ object j leaves cluster k_0
    V^(0)(x^(j,k_1)) := sum(choose(B^(0), 2) / D[I])
    for k_1 ∈ {1, ..., K} \ {k_0} do
      B^(1) := B[I, k_1]
      V^(1)(x) := sum(choose(B^(1), 2) / D[I])
      B^(1) := B^(1) + 1                            ▷ object j joins cluster k_1
      V^(1)(x^(j,k_1)) := sum(choose(B^(1), 2) / D[I])
      if V(x) − V^(0)(x) + V^(0)(x^(j,k_1)) − V^(1)(x) + V^(1)(x^(j,k_1)) < V_best then
        V_best := V(x) − V^(0)(x) + V^(0)(x^(j,k_1)) − V^(1)(x) + V^(1)(x^(j,k_1))
        x_best := x
        x_best[j] := k_1
      end if
    end for
  end for
  return x_best, V_best
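For readers preferring executable code, here is a direct R translation of Algorithm 1 (our own sketch; the caching of V^(0)(x) from the algorithm is omitted for brevity, and V is understood as the unnormalized cost sum_i V_i(x)/D_i):

```r
# Evaluate all neighbors of clustering x by updating only the sets that
# contain the moved object j; B[i, k] holds the number of objects of
# cluster k in set i and D[i] the number of object pairs in set i.
local_search <- function(x, V, A, B, D, K) {
  V_best <- V
  x_best <- x
  for (j in seq_len(ncol(A))) {
    sets <- which(A[, j] == 1)                   # only these sets are affected
    k0 <- x[j]
    b0 <- B[sets, k0]
    V0_old <- sum(choose(b0, 2) / D[sets])
    V0_new <- sum(choose(b0 - 1, 2) / D[sets])   # object j leaves k0
    for (k1 in setdiff(seq_len(K), k0)) {
      b1 <- B[sets, k1]
      V1_old <- sum(choose(b1, 2) / D[sets])
      V1_new <- sum(choose(b1 + 1, 2) / D[sets]) # object j joins k1
      V_cand <- V - V0_old + V0_new - V1_old + V1_new
      if (V_cand < V_best) {
        V_best <- V_cand
        x_best <- x
        x_best[j] <- k1
      }
    }
  }
  list(x_best = x_best, V_best = V_best)
}
```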
The proposed local search function is significantly faster than naive neighbor generation with repeated cost function calls. A naive procedure would call the cost function M · (K − 1) times, as the cluster of each object can be changed. The complexity of one evaluation of the cost function is of the order of N · M. The total computational complexity of generating and evaluating all neighbors is then O(M · (K − 1) · N · M) = O(M^2 · K · N).

The proposed method also evaluates all neighbors, i.e. M · (K − 1) clusterings, but instead of repeated calls to the cost function, intermediate calculations are used repeatedly. The enumeration of the numbers of object pairs from the same cluster in one set is performed for the original clustering with a complexity of the order of N · M and is then used repeatedly. In the evaluation of the change in the cost function of an individual neighbor, only the parts that actually differ are calculated, which is of the order of 2 · N for one neighbor, since only two sums change with respect to the original clustering, see (17). The total computational complexity is then O(N · M + M · (K − 1) · N · 2) = O(N · M + M · K · N) = O(M · K · N). Since M is expected to be of high value, e.g. hundreds or thousands, the time saved is significant.

4.4 Initial Solution

Finally, we obtain an appropriate initial solution. We transform the data set to a dissimilarity matrix Q based on the simplified relationship between objects. The elements q_{jℓ} of the matrix Q are computed as the proportions of sets in which the objects appear together in the same set:

$$q_{j\ell} := \frac{1}{N} \sum_{i=1}^{N} a_{ij} a_{i\ell} \quad \text{for } j, \ell = 1, \ldots, M,\ j \neq \ell. \qquad (18)$$

Using the dissimilarity matrix Q, we can use standard clustering algorithms such as k-means to find an initial solution for the genetic algorithm. A substantial amount of information is lost through the data transformation; nevertheless, the solution found by k-means can serve as a suitable initial point. The main advantage is that k-means is a very fast method, which allows finding a suitable initial point in a matter of seconds.
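A minimal R sketch of this initialization, assuming that each object's row of Q is treated as its feature vector (our reading of the transformation; the k-means parameters mirror those reported in the simulation study below):

```r
# Initial solution from the simplified model: cluster objects by their
# co-occurrence profiles. Q[j, l] = (1/N) * sum_i a_ij * a_il, see (18).
initial_solution <- function(A, K) {
  N <- nrow(A)
  Q <- crossprod(A) / N        # M x M matrix of co-occurrence rates
  diag(Q) <- 0                 # self-pairs are not defined in (18)
  km <- kmeans(Q, centers = K, nstart = 1000, iter.max = 1000,
               algorithm = "Hartigan-Wong")
  km$cluster                   # cluster labels usable as a GA individual
}
```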
5 Simulation Study

In the simulation study, we compare four modifications of the genetic algorithm for clustering problems in terms of the best solution found and the speed of convergence. The modifications are as follows:

(a) Standard: the algorithm with a completely random initial population and without local search;
(b) Only Local: the algorithm with a completely random initial population but using the proposed local search function;
(c) Only Init: the algorithm with the inserted initial solution from the simplified model and without local search;
(d) Local & Init: the algorithm with both the inserted initial solution from the simplified model and the proposed local search approach.
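The four modifications thus form a 2 × 2 design over the two proposed improvements, which can be summarized as follows (a schematic sketch, not the authors' code):

```r
# The compared configurations as switches on the same genetic algorithm core.
configs <- list(
  "Standard"     = list(local_search = FALSE, kmeans_init = FALSE),
  "Only Local"   = list(local_search = TRUE,  kmeans_init = FALSE),
  "Only Init"    = list(local_search = FALSE, kmeans_init = TRUE),
  "Local & Init" = list(local_search = TRUE,  kmeans_init = TRUE)
)
```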
The main problem of the studied clustering model is the difficulty of finding the optimal solution. Even in test instances, we cannot determine the optimum with certainty. Therefore, as a benchmark, we use the solution of the simplified model found by the k-means method, as described in Section 4.4 and used in the Only Init and Local & Init algorithms. For the parameters of the k-means method, we use the Euclidean distance metric with 1000 starting locations, the maximum number of iterations set to 1000 and the Hartigan-Wong algorithm. With these parameters, the k-means solution is found almost instantly.

The following parameters from Holý et al. (2017) are used for all modifications. The population size is set to 500 individuals.
The initial population is generated randomly, but in the case of the Only Init and Local & Init algorithms we also supply the k-means solution of the simplified model. We always use the fixed number of 500 generations, even though simulations show that the vast majority of instances converge faster. The elite ratio is set to 0.1. The mutation chance is set to 0.01 to allow the algorithm to concentrate on improving one point while retaining some exploratory ability of mutation. Unlike the local search, which is conducted only on the best individual of the generation, the mutation can take place on any non-elite individual.

The basic parameters for data generation are the number of sets N, the number of objects M and the number of clusters K. Sets are generated independently. For each set, it is randomly determined how many unique clusters occur in the set. In doing so, every cluster has the same probability of occurrence given by the parameter ρ. From each cluster assigned to the set, one product is assigned to the set. In the next step, based on the parameter π, other products from the same cluster are assigned into the sets, thus creating situations that violate the model's assumptions.

We work with the default values N = 10000, M = 100, K = 10, ρ = 0. and π = 0.03, which roughly correspond to retail datasets. In the next part, we investigate the effects of changes in the values of the parameters N, M, K, and π on the results of the genetic algorithm modifications.

Throughout this section, we assume the parameters are set to their default values unless said otherwise; then only one parameter changes at a time. Each algorithm is run 10 times for each dataset and each dataset is generated 100 times for each scenario. Computation time is reported for a 3.40 GHz CPU and the algorithm is implemented in R software. We consider the k-means solution of the simplified model from Section 4.4 as the benchmark solution. We report values of the objective function standardized to this benchmark solution (standardized objective = objective / benchmark objective).
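Under our reading of the generation process described above, a sketch of the simulator in R could look like this (the balanced assignment of objects to true clusters and the removal of too-small sets are our assumptions):

```r
# Simulated data generator: each cluster occurs in a set with probability
# rho, one of its products is always included, and each further product of
# that cluster is added with probability pi (a model violation).
generate_data <- function(N, M, K, rho, pi) {
  truth <- rep_len(1:K, M)                 # true cluster of each object
  A <- matrix(0L, N, M)
  for (i in 1:N) {
    for (k in which(runif(K) < rho)) {     # clusters present in set i
      members <- which(truth == k)
      first <- members[sample.int(length(members), 1)]
      A[i, first] <- 1L                    # one product per cluster
      others <- setdiff(members, first)
      A[i, others[runif(length(others)) < pi]] <- 1L  # extra products
    }
  }
  A[rowSums(A) >= 2, , drop = FALSE]       # sets of size < 2 are ignored
}
```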
Convergence of the Method

[Figure 1: Progress of the standardized objective function over time. Line chart of the standardized objective against the computation time in hours for the Standard, Only Local, Only Init and Local & Init algorithms.]

First, we take a look at the progress of the objective function over time in Figure 1. We report the objective function over time rather than over the number of iterations, as some versions of the algorithm require computing the local search in each iteration. The computation time spent on the local search is, however, quite negligible. Apparently, for the default values of the parameters, the Standard algorithm cannot overcome the benchmark solution. The local search improvement in the Only Local algorithm allows surpassing the benchmark solution. The Only Init and Local & Init algorithms use the benchmark solution as their initial point and both are able to quickly improve it, with the Local & Init algorithm being considerably faster.

Next, we focus on differences in performance based on the changes in the parameters of the data generating process and the number of clusters. The average found solutions after 500 iterations are shown in Figure 2 and Table 1. With a low number of sets N, the benchmark approach is clearly worse than the genetic algorithms. With an increasing number of sets N, the results of the Standard algorithm worsen and the differences between the benchmark approach and the other genetic algorithms shrink. However, the Only Init and Local & Init algorithms are still able to improve their initial solution. With an increasing number of objects M in the dataset, the benchmark solution proves to be insufficient; all tested modifications of the genetic algorithm give significantly better results. The same can be said about an increasing probability of extra objects π. Simply put, the more violations occur in the dataset, the worse the benchmark solution is. Concerning the impact of the true number of clusters K, the genetic algorithm gives better results than the benchmark approach when K is significantly lower than M. With increasing K, the Standard algorithm ceases to be suitable. With high values of K, the Only Local, Only Init and Local & Init algorithms give results very similar to the benchmark approach. Overall, it is highly advantageous to use both proposed improvements of the genetic algorithm. Not only does the Local & Init algorithm end in the best solution among the four candidates, but it is also the fastest one to reach it.

Finally, we examine the variation in results from the perspective of repeated data generation and repeated algorithm runs. In Table 1, we report two kinds of standard deviations. The standard deviation capturing repeated data generation (labeled as SD/Sim) is obtained by first averaging the objective values of all runs based on the same dataset and then taking the standard deviation over all generated datasets. Conversely, the standard deviation capturing repeated algorithm runs (labeled as SD/Alg) is obtained by first taking the standard deviation of all runs based on the same dataset and then averaging it over all generated datasets. We can see that SD/Alg is quite small except for the Standard algorithm in some scenarios. Nevertheless, in most cases, random data generation is the dominant source of variation.
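The two summaries can be computed as follows (a hedged R sketch with hypothetical names; obj holds the standardized objectives of all runs and dataset indexes the generated dataset of each run):

```r
# SD/Sim and SD/Alg as described above.
sd_sim_alg <- function(obj, dataset) {
  per_dataset_mean <- tapply(obj, dataset, mean)
  per_dataset_sd   <- tapply(obj, dataset, sd)
  c(SD_Sim = sd(per_dataset_mean),   # variation across generated datasets
    SD_Alg = mean(per_dataset_sd))   # variation across repeated runs
}
```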
[Figure 2: Impact of the data generating process. Four panels showing the standardized objective against the parameters N (number of sets), M (number of objects), K (number of clusters) and π (probability of extra object) for the Standard, Only Local, Only Init and Local & Init algorithms.]

Table 1: Mean values with standard deviations of the standardized objective for several scenarios of the data generating process.

Algorithm     N      M    K    π     Mean    SD/Sim  SD/Alg
Standard      10000  100  10   0.03  1.0326  0.0369  0.0202
              5000   100  10   0.03  0.9864  0.0458  0.0188
              15000  100  10   0.03  1.0397  0.0281  0.0194
              10000  50   10   0.03  1.0073  0.0452  0.0366
              10000  150  10   0.03  0.9100  0.0568  0.0055
              10000  100  5    0.03  0.8955  0.0618  0.0049
              10000  100  15   0.03  1.1505  0.0571  0.0395
              10000  100  10   0.01  1.1341  0.0391  0.1045
              10000  100  10   0.05  0.9031  0.0618  0.0052
Only Local    10000  100  10   0.03  0.9926  0.0308  0.0179
              5000   100  10   0.03  0.9580  0.0436  0.0206
              15000  100  10   0.03  0.9991  0.0213  0.0149
              10000  50   10   0.03  0.9887  0.0405  0.0072
              10000  150  10   0.03  0.8940  0.0563  0.0066
              10000  100  5    0.03  0.8913  0.0616  0.0050
              10000  100  15   0.03  0.9962  0.0325  0.0153
              10000  100  10   0.01  1.0000  0.0000  0.0000
              10000  100  10   0.05  0.8980  0.0615  0.0061
Only Init     10000  100  10   0.03  0.9656  0.0255  0.0014
              5000   100  10   0.03  0.9203  0.0375  0.0029
              15000  100  10   0.03  0.9805  0.0177  0.0009
              10000  50   10   0.03  0.9862  0.0400  0.0004
              10000  150  10   0.03  0.8885  0.0504  0.0032
              10000  100  5    0.03  0.8906  0.0597  0.0036
              10000  100  15   0.03  0.9864  0.0329  0.0004
              10000  100  10   0.01  1.0000  0.0000  0.0000
              10000  100  10   0.05  0.8899  0.0556  0.0038
Local & Init  10000  100  10   0.03  0.9650  0.0257  0.0006
              5000   100  10   0.03  0.9188  0.0381  0.0014
              15000  100  10   0.03  0.9803  0.0178  0.0005
              10000  50   10   0.03  0.9859  0.0410  0.0001
              10000  150  10   0.03  0.8817  0.0522  0.0023
              10000  100  5    0.03  0.8883  0.0603  0.0027
              10000  100  15   0.03  0.9861  0.0335  0.0001
              10000  100  10   0.01  1.0000  0.0000  0.0000
              10000  100  10   0.05  0.8870  0.0567  0.0033
6 Conclusion
The clustering method of Holý et al. (2017) offers a unique way of categorizing products in retail stores. The main limitation of the method lies in its computational complexity. To make the method more usable for practitioners, we revisit the algorithm finding an approximate solution and improve it in several ways. We augment the basic genetic algorithm by the renumbering procedure, the local search heuristic and the initial solution based on the simplified model. Although these are rather common approaches, we tailor them to our specific problem in a non-trivial and efficient way. On a final note, the presented formulation of our problem is quite general, which allows for straightforward use in other potential applications.
Acknowledgements
Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
Funding
The work on this paper was supported by the Internal Grant Agency of the Prague University of Economics and Business under project F4/27/2020.
References
Ammar, A., Elouedi, Z., Lingras, P. 2016. Meta-Clustering of Possibilistically Segmented Retail Datasets. Fuzzy Sets and Systems. Volume 286. Pages 173–196. ISSN 0165-0114. https://doi.org/10.1016/j.fss.2015.07.019.

Falkenauer, E. 1998. Genetic Algorithms and Grouping Problems. John Wiley & Sons, Inc.

Hamzaçebi, C. 2008. Improving Genetic Algorithms' Performance by Local Search for Continuous Function Optimization. Applied Mathematics and Computation. Volume 196. Issue 1. Pages 309–317. ISSN 0096-3003. https://doi.org/10.1016/j.amc.2007.05.068.

Holý, V., Sokol, O., Černý, M. 2017. Clustering Retail Products Based on Customer Behaviour. Applied Soft Computing. Volume 60. Pages 752–762. ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2017.02.004.

Hruschka, E. R., Ebecken, N. F. F. 2003. A Genetic Algorithm for Cluster Analysis. Intelligent Data Analysis. Volume 7. Issue 1. Pages 15–25. ISSN 1088-467X. https://doi.org/10.3233/ida-2003-7103.

Hruschka, E. R., Campello, R. J. G. B., Freitas, A. A., Carvalho, A. C. L. F. 2009. A Survey of Evolutionary Algorithms for Clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). Volume 39. Issue 2. Pages 133–155. ISSN 1094-6977. https://doi.org/10.1109/TSMCC.2008.2007252.

Jonker, J.-J., Piersma, N., Van den Poel, D. 2004. Joint Optimization of Customer Segmentation and Marketing Policy to Maximize Long-Term Profitability. Expert Systems with Applications. Volume 27. Issue 2. Pages 159–168. ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2004.01.010.

José-García, A., Gómez-Flores, W. 2016. Automatic Clustering Using Nature-Inspired Metaheuristics: A Survey. Applied Soft Computing. Volume 41. Pages 192–213. ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2015.12.001.

Karp, R. M. 1972. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations. Boston. Springer. Pages 85–103. ISBN 978-1-4684-2003-6. https://doi.org/10.1007/978-1-4684-2001-2.

Lingras, P., Elagamy, A., Ammar, A., Elouedi, Z. 2014. Iterative Meta-Clustering Through Granular Hierarchy of Supermarket Customers and Products. Information Sciences. Volume 257. Pages 14–31. ISSN 0020-0255. https://doi.org/10.1016/j.ins.2013.09.018.

Murthy, C. A., Chowdhury, N. 1996. In Search of Optimal Clusters Using Genetic Algorithms. Pattern Recognition Letters. Volume 17. Issue 8. Pages 825–832. ISSN 0167-8655. https://doi.org/10.1016/0167-8655(96)00043-8.

Nanda, S. J., Panda, G. 2014. A Survey on Nature Inspired Metaheuristic Algorithms for Partitional Clustering. Swarm and Evolutionary Computation. Volume 16. Pages 1–18. ISSN 2210-6502. https://doi.org/10.1016/j.swevo.2013.11.003.

Peker, S., Kocyigit, A., Eren, P. E. 2017. LRFMP Model for Customer Segmentation in the Grocery Retail Industry: A Case Study. Marketing Intelligence & Planning. Volume 35. Issue 4. Pages 544–559. ISSN 0263-4503. https://doi.org/10.1108/mip-11-2016-0210.

Reutterer, T., Mild, A., Natter, M., Taudes, A. 2006. A Dynamic Segmentation Approach for Targeting and Customizing Direct Marketing Campaigns. Journal of Interactive Marketing. Volume 20. Issue 3-4. Pages 43–57. ISSN 1094-9968. https://doi.org/10.1002/dir.20066.

Sokol, O., Holý, V. 2020a. The Role of Shopping Mission in Retail Customer Segmentation. International Journal of Market Research. ISSN 1470-7853. https://doi.org/10.1177/1470785320921011.

Sokol, O., Holý, V. 2020b. Computational Aspects of Product Clustering Based on Transaction Incidence. In Proceedings of the 20th International Conference Quantitative Methods in Economics. Púchov. Letra Edu. Pages 312–318. ISBN 978-80-89962-60-0.

Tsai, C. Y., Chiu, C. C. 2004. A Purchase-Based Market Segmentation Methodology. Expert Systems with Applications. Volume 27. Issue 2. Pages 265–276. ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2004.02.005.

Wu, T., Liu, X. 2020. A Dynamic Interval Type-2 Fuzzy Customer Segmentation Model and Its Application in E-Commerce. Applied Soft Computing Journal. Volume 94. Pages 106366/1–106366/12. ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2020.106366.

Zhang, Y., Jiao, J., Ma, Y. 2007. Market Segmentation for Product Family Positioning Based on Fuzzy Clustering. Journal of Engineering Design. Volume 18. Issue 3. Pages 227–241. ISSN 0954-4828. https://doi.org/10.1080/09544820600752781.