Learning Neural Models for End-to-End Clustering

Benjamin Bruno Meier, Ismail Elezi, Mohammadreza Amirian, Oliver Dürr, and Thilo Stadelmann

ZHAW Datalab & School of Engineering, Winterthur, Switzerland
ARGUS DATA INSIGHTS Schweiz AG, Zurich, Switzerland
Ca' Foscari University of Venice, Venice, Italy
Institute of Neural Information Processing, Ulm University, Germany
Institute for Optical Systems, HTWG Konstanz, Germany
Abstract.
We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters k, and for each 1 ≤ k ≤ k_max, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this "learning to cluster" and show its conceptual difference to deep metric learning, semi-supervised clustering and other related approaches, while having the advantage of performing learnable clustering fully end-to-end.

Keywords: perceptual grouping · learning to cluster · speech & image clustering

1 Introduction

Consider the illustrative task of grouping images of cats and dogs by perceived similarity: depending on the intention of the user behind the task, the similarity could be defined by animal type (foreground object), environmental nativeness (background landscape, cp. Fig. 1), etc. This is characteristic of clustering perceptual, high-dimensional data like images [15] or sound [24]: a user typically has some similarity criterion in mind when thinking about naturally arising groups (e.g., pictures by holiday destination, or persons appearing; songs by mood, or use of solo instrument). As defining such a similarity for every case is difficult, it is desirable to learn it.
At the same time, the learned model will in many cases not be a classifier (the task will not be solved by classification), since the number and specific type of groups present at application time are not known in advance (e.g., speakers in TV recordings; persons in front of a surveillance camera; object types in the picture gallery of a large web shop).
Fig. 1: Images of cats (top) and dogs (bottom) in urban (left) and natural (right) environments.

Grouping objects with machine learning is usually approached with clustering algorithms [16]. Typical ones like K-means [25], EM [14], hierarchical clustering [29] with a chosen distance measure, or DBSCAN [8] each have a specific inductive bias towards certain similarity structures present in the data (e.g., K-means: Euclidean distance from a central point; DBSCAN: common point density). Hence, to be applicable to the above-mentioned tasks, they need high-level features that already encode the aspired similarity measure. This may be solved by learning salient embeddings [28] with a deep metric learning approach [12], followed by an off-line clustering phase using one of the above-mentioned algorithms.

However, it is desirable to combine these distinct phases (learning salient features, and subsequent clustering) into an end-to-end approach that can be trained globally [19]: it has the advantage of each phase being perfectly adjusted to the other by optimizing a global criterion, and removes the need of manually fitting parts of the pipeline. Numerous examples have demonstrated the success of neural networks for end-to-end approaches on such diverse tasks as speech recognition [2], robot control [21], scene text recognition [34], or music transcription [35].

In this paper, we present a conceptually novel approach that we call "learning to cluster" in the above-mentioned sense of grouping high-dimensional data by some perceptually motivated similarity criterion.
For this purpose, we define a novel neural network architecture with the following properties: (a) during training, it receives pairs of similar or dissimilar examples to learn the intended similarity function implicitly or explicitly; (b) during application, it is able to group objects of groups never encountered before; (c) it is trained end-to-end in a supervised way to produce a tailor-made clustering model; and (d) it is applied like a clustering algorithm to find both the number of clusters as well as the cluster membership of test-time objects in a fully probabilistic way.

Our approach builds upon ideas from deep metric embedding, namely to learn an embedding of the data into a representational space that allows for specific perceptual similarity evaluation via simple distance computation on feature vectors. However, it goes beyond this by adding the actual clustering step (grouping by similarity) directly to the same model, making it trainable end-to-end. Our approach is also different from semi-supervised clustering [4], which uses labels for some of the data points in the inference phase to guide the creation of groups. In contrast, our method uses absolutely no labels during inference, and moreover does not expect to have seen any of the groups it encounters during inference already during training (cp. Fig. 2). Its training stage may be
Fig. 2: Training vs. testing: cluster types encountered during application/inference are never seen in training. Exemplary outputs (right-hand side) contain a partition for each k (1–3 here) and a corresponding probability (best highlighted blue).

compared to creating K-means, DBSCAN etc. in the first place: it creates a specific clustering model, applicable to data with a certain similarity structure, and once created/trained, the model performs "unsupervised learning" in the sense of finding groups. Finally, our approach differs from traditional cluster analysis [16] in how the clustering algorithm is applied: instead of looking for patterns in the data in an unbiased and exploratory way, as is typically the case in unsupervised learning, our approach is geared towards the use case where users know perceptually what they are looking for, and can make this explicit using examples. We then learn appropriate features and the similarity function simultaneously, taking full advantage of end-to-end learning.

Our main contribution in this paper is the creation of a neural network architecture that learns to group data, i.e., that outputs the same "label" for "similar" objects regardless of (a) whether it has ever seen this group before; (b) the actual value of the label (it is hence not a "class"); and (c) the number of groups it will encounter during a single application run, up to a predefined maximum. This is novel in its concept and generality (i.e., learning to cluster previously unseen groups end-to-end for arbitrary, high-dimensional input without any optimization on test data). Due to this novelty in approach, we focus here on the general idea and experimental demonstration of the principal workings, and leave comprehensive hyperparameter studies and optimizations for future work. In Sec. 2, we compare our approach to related work, before presenting the model and training procedure in detail in Sec. 3.
We evaluate our approach on different datasets in Sec. 4, showing promising performance and a high degree of generality for data types ranging from 2D points to audio snippets and images, and discuss these results with conclusions for future work in Sec. 5.

2 Related Work

Learning to cluster based on neural networks has been approached mostly as a supervised learning problem to extract embeddings for a subsequent off-line
clustering phase. The core of all deep metric embedding models is the choice of the loss function. Motivated by the fact that the softmax cross-entropy loss function has been designed as a classification loss and is not suitable for the clustering problem per se,
Chopra et al. [7] developed a "Siamese" architecture, where the loss function is optimized in a way to generate similar features for objects belonging to the same class, and dissimilar features for objects belonging to different classes. A closely related loss function called "triplet loss" has been used by
Schroff et al. [32] to achieve state-of-the-art accuracy in face detection. The main difference from the Siamese architecture is that in the latter case, the network sees same- and different-class objects with every example. It is then optimized to jointly learn their feature representation. A problem of both approaches is that they are typically difficult to train compared to a standard cross-entropy loss.
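The two loss families discussed above can be sketched compactly. The following is a minimal pure-Python sketch, not the exact formulations of [7] or [32]; the margin values and the use of squared Euclidean distance are illustrative assumptions:

```python
import math

def contrastive_loss(z1, z2, same, margin=1.0):
    """Siamese-style contrastive loss: pull same-class pairs together,
    push different-class pairs at least `margin` apart."""
    d = math.dist(z1, z2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Triplet-style loss: the anchor should be closer to the positive
    than to the negative by at least `margin` (squared distances)."""
    d_pos = math.dist(anchor, pos) ** 2
    d_neg = math.dist(anchor, neg) ** 2
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: a and b are "similar", c is "different".
a, b, c = (0.0, 0.0), (0.1, 0.0), (2.0, 0.0)
l_same = contrastive_loss(a, b, same=True)   # small: the pair is already close
l_diff = contrastive_loss(a, c, same=False)  # zero: the pair is beyond the margin
l_trip = triplet_loss(a, b, c)               # zero: the constraint is satisfied
```

Note how the triplet variant couples one positive and one negative comparison in a single term, which is exactly what makes sampling strategy so influential, as discussed below.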
Song et al. [37] developed an algorithm for taking full advantage of all the information available in training batches. They later refined the work [36] by proposing a new metric learning scheme based on structured prediction, which is designed to optimize a clustering quality metric (normalized mutual information [27]). Even better results were achieved by
Wong et al. [38], where the authors proposed a novel angular loss and achieved state-of-the-art results on the challenging real-world datasets
Stanford Cars [17] and
Caltech Birds [5]. On the other hand,
Lukic et al. [23] showed that for certain problems, a carefully chosen deep neural network can simply be trained with softmax cross-entropy loss and still achieve state-of-the-art performance in challenging problems like speaker clustering. Alternatively,
Wu et al. [26] showed that state-of-the-art results can be achieved simply by using a traditional margin loss function and being careful about how sampling is performed during the creation of mini-batches.

Recently, attempts have been made that are more similar to ours in spirit, using deep neural networks only and performing clustering end-to-end [1]. They are trained in a fully unsupervised fashion, and hence solve a different task than the one we motivated above (that is inspired by speaker or image clustering based on some human notion of similarity). Perhaps the first to group objects together in an unsupervised deep learning based manner were
Le et al. [18], detecting high-level concepts like cats or humans.
Xie et al. [40] used an autoencoder architecture to do clustering, but evaluated it experimentally only on simplistic datasets like
MNIST. CNN-based approaches followed, e.g. by
Yang et al. [42], where clustering and feature representation are optimized together.
Greff et al. [10] performed perceptual grouping (of pixels within an image into the objects constituting the complete image, hence a different task than ours) fully unsupervised, using a neural expectation maximization algorithm. Our work differs from the above-mentioned works in several respects: it has no assumption on the type of data, and solves the different task of grouping whole input objects.
Our method learns to cluster end-to-end purely ab initio, without the need to explicitly specify a notion of similarity, only providing the information whether two examples belong together.

3 Model

Fig. 3: Our complete model, consisting of (a) the embedding network, (b) clustering network (including an optional metric learning part, see Sec. 3.3), (c) cluster-assignment network and (d) cluster-count estimating network.

The model uses as input n ≥ 2 examples x_i, where n may be different during training and application and constitutes the number of objects that can be clustered at a time, i.e. the maximum number of objects in a partition. The network's output is two-fold: a probability distribution P(k) over the cluster count 1 ≤ k ≤ k_max; and probability distributions P(·|x_i, k) over all possible cluster indexes for each input example x_i and for each k.

The network architecture (see Fig. 3) allows the flexible use of different input types, e.g. images, audio or 2D points. An input x_i is first processed by an embedding network (a) that produces a lower-dimensional representation z_i = z(x_i). The dimension of z_i may vary depending on the data type; for example, 2D points do not require any embedding network. A fully connected layer (FC) with LeakyReLU activation at the beginning of the clustering network (b) is then used to bring all embeddings to the same size. This approach allows to use identical subnetworks (b)–(d) and only change the subnet (a) for any data type.
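The interplay of subnet (a) and the size-normalizing FC layer can be illustrated with a small sketch. This is not the authors' implementation: the random projection matrices stand in for trained embedding networks, and all dimensions (input size 64, intermediate size 32, common size 8) are made-up values:

```python
import random
random.seed(0)

EMB_DIM = 8  # common embedding size after the FC layer (assumed value)

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def leaky_relu(v, alpha=0.01):
    return [x if x > 0 else alpha * x for x in v]

# Subnet (a): one embedding function per data type. 2D points need no
# embedding; a flattened "image" gets a stand-in projection to 32 dims.
W_img = random_matrix(32, 64)
def embed_point(p):   return list(p)            # z(x) = x for 2D points
def embed_image(img): return matvec(W_img, img)

# Entry of subnet (b): a type-specific FC layer with LeakyReLU brings all
# embeddings to the same size EMB_DIM, so subnets (b)-(d) stay identical.
W_fc_point = random_matrix(EMB_DIM, 2)
W_fc_img   = random_matrix(EMB_DIM, 32)

def to_common(z, W):
    return leaky_relu(matvec(W, z))

z_point = to_common(embed_point([0.3, -1.2]), W_fc_point)
z_image = to_common(embed_image([0.5] * 64), W_fc_img)
assert len(z_point) == len(z_image) == EMB_DIM
```

The design choice this illustrates: only subnet (a) and the first FC layer are data-type specific; everything downstream operates on fixed-size vectors.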
The goal of the subnet (b) is to compare each input z(x_i) with all other z(x_{j≠i}), in order to learn an abstract grouping which is then concretized into an estimation of the number of clusters (subnet (d)) and a cluster assignment (subnet (c)).

To be able to process a non-fixed number of examples n as input, we use a recurrent neural network. Specifically, we use stacked residual bi-directional LSTM layers (RBDLSTM), which are similar to the cells described in [39] and visualized in Fig. 4. The residual connections allow a much more effective gradient flow during training [11] and avoid vanishing gradients. Additionally, the network can learn to use or bypass certain layers using the residual connections, thus reducing the architectural decision on the number of recurrent layers to the simpler one of finding a reasonable upper bound.

The first of overall two outputs is modeled by the cluster assignment network (c). It contains a softmax layer to produce P(ℓ|x_i, k), which assigns a cluster index ℓ to each input x_i, given k clusters (i.e., we get a distribution over possible
Fig. 4: RBDLSTM layer: a BDLSTM with residual connections (dashed lines). The variables x_i and y_i are named independently from the notation in Fig. 3.

cluster assignments for each input and every possible number of clusters). The second output, produced by the cluster-count estimating network (d), is built from another BDLSTM layer. Due to the bi-directionality of the network, we concatenate its first and last output vector into a fully connected layer of twice as many units, using again
LeakyReLUs. The subsequent softmax activation finally models the distribution P(k) for 1 ≤ k ≤ k_max. The next subsection shows how this neural network learns to approximate these two complicated probability distributions [20] purely from pairwise constraints on data that is completely separate from any dataset to be clustered. No labels for clustering are needed.

In order to define a suitable loss function, we first define an approximation (assuming independence) of the probability that x_i and x_j are assigned to the same cluster for a given k as

P_ij(k) = Σ_{ℓ=1}^{k} P(ℓ|x_i, k) P(ℓ|x_j, k).

By marginalizing over k, we obtain P_ij, the probability that x_i and x_j belong to the same cluster:

P_ij = Σ_{k=1}^{k_max} P(k) Σ_{ℓ=1}^{k} P(ℓ|x_i, k) P(ℓ|x_j, k).

Let y_ij = 1 if x_i and x_j are from the same cluster (e.g., have the same group label) and y_ij = 0 otherwise. The loss component for cluster assignments, L_ca, is then given by the weighted binary cross entropy

L_ca = − 2/(n(n−1)) Σ_{i<j} (φ_1 y_ij log P_ij + φ_2 (1−y_ij) log(1−P_ij)),

with weights φ_1 and φ_2 for the two kinds of pairs.

The optional metric learning block in subnetwork (b) (see Fig. 3) learns a distance metric in the spirit of Xing et al. [41]. In contrast to their work, we optimize it end-to-end with backpropagation. This has been proposed in [33] for classification alone; we do it here for a clustering task, for the whole covariance matrix, and jointly with the rest of our network. We construct the non-symmetric, non-negative dissimilarity measure d_A between two data points x_i and x_j as

d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)

and let the neural network training optimize A through the total loss L_tot without intermediate losses. The matrix A as used in d_A can be thought of as a trainable distance metric. In every training step, it is projected into the space of positive semidefinite matrices.
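The quantities P_ij and L_ca can be computed directly from the two network outputs. The following is a minimal pure-Python sketch; the per-pair weights `w_pos`/`w_neg` (standing in for φ_1, φ_2) and the epsilon-clipping of the logarithms are illustrative assumptions:

```python
import math

def p_same(assign_i, assign_j, p_k):
    """P_ij: probability that examples i and j share a cluster.
    assign_i[k-1][l] = P(l | x_i, k); p_k[k-1] = P(k)."""
    total = 0.0
    for k, pk in enumerate(p_k, start=1):
        total += pk * sum(assign_i[k - 1][l] * assign_j[k - 1][l]
                          for l in range(k))
    return total

def l_ca(assignments, p_k, same, w_pos=1.0, w_neg=1.0):
    """Weighted binary cross entropy over all pairs i < j;
    same[i][j] is the pairwise label y_ij."""
    n = len(assignments)
    loss, eps = 0.0, 1e-12
    for i in range(n):
        for j in range(i + 1, n):
            pij = p_same(assignments[i], assignments[j], p_k)
            if same[i][j]:
                loss -= w_pos * math.log(pij + eps)
            else:
                loss -= w_neg * math.log(1.0 - pij + eps)
    return 2.0 * loss / (n * (n - 1))

# Deterministic toy case with k_max = 2: the network is certain that
# k = 2, and assigns x0, x1 to cluster 0 and x2 to cluster 1.
p_k = [0.0, 1.0]
a0 = [[1.0], [1.0, 0.0]]  # per k: a distribution over the k cluster indexes
a1 = [[1.0], [1.0, 0.0]]
a2 = [[1.0], [0.0, 1.0]]
p01 = p_same(a0, a1, p_k)  # same cluster with certainty
p02 = p_same(a0, a2, p_k)  # different clusters with certainty
```

Note that P_ij only compares assignment distributions, never the label values themselves, which is what makes the "labels" unspecific in the sense discussed in the introduction.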
4 Experimental Results

To assess the quality of our model, we perform clustering on three different datasets: for a proof of concept, we test on a set of generated 2D points with a high variety of shapes, coming from different distributions. For speaker clustering, we use the TIMIT [9] corpus, a dataset of studio-quality speech recordings frequently used for pure speaker clustering in related work. For image clustering, we test on the COIL-100 [30] dataset, a collection of different isolated objects in various orientations. To compare to related work, we measure the performance with the standard evaluation scores misclassification rate (MR) [22] and normalized mutual information (NMI) [27].

Architecturally, we choose m = 14 BDLSTM layers and units in the FC layer of subnetwork (b), units for the BDLSTM in subnetwork (d), and α = 0. for all LeakyReLUs in the experiments below. All hyperparameters were chosen based on preliminary experiments to achieve reasonable performance, but not tested nor tweaked extensively. The code and further material and experiments are available online.

We set k_max = 5 and λ = 5 for all experiments. For the 2D point data, we use n = 72 inputs and a batch size of N = 200 (we used a batch size of N = 50 for metric learning with 2D points). For TIMIT, the network input consists of n = 20 audio snippets with a length of . seconds, encoded as mel-spectrograms with × pixels (identical to [24]). For COIL-100, we use n = 20 inputs with a dimension of × ×. For TIMIT and COIL-100, a simple CNN with 3 conv/max-pooling layers is used as subnetwork (a). For TIMIT, we use of the available speakers for training (and of the remaining ones each for validation and evaluation). For COIL-100, we train on of the classes ( for validation, for evaluation). For all runs, we optimize using Adadelta [43] with a learning rate of . . Example clusterings are shown in Fig. 5.
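The two evaluation scores can be sketched as follows. This is a simplified reading, not the exact definitions of [22] and [27]: MR as the minimal error over all one-to-one mappings of predicted to true cluster labels (brute force, which is fine for the small k used here), and NMI in its arithmetic-mean normalization:

```python
import math
from collections import Counter
from itertools import permutations

def misclassification_rate(true, pred):
    """Fraction of wrongly assigned points under the best one-to-one
    relabeling of predicted clusters (assumes no more predicted than
    true clusters)."""
    t_labels, p_labels = sorted(set(true)), sorted(set(pred))
    best = len(true)
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        errors = sum(1 for t, p in zip(true, pred) if mapping[p] != t)
        best = min(best, errors)
    return best / len(true)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """NMI with arithmetic-mean normalization: 2 I(T;P) / (H(T) + H(P))."""
    n = len(true)
    h_t, h_p = entropy(true), entropy(pred)
    if h_t == 0.0 and h_p == 0.0:
        return 1.0
    joint, ct, cp = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum((c / n) * math.log(n * c / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    return 2.0 * mi / (h_t + h_p)

# Predicted cluster indexes are arbitrary identifiers: a relabeled
# perfect clustering still scores MR = 0 and NMI = 1.
true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 2, 2, 0, 0]
mr_perfect = misclassification_rate(true, pred)
nmi_perfect = nmi(true, pred)
```

Both scores are invariant to the permutation of cluster labels, which matches the fact that our network's output "labels" carry no class semantics.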
For all configurations, the used hardware set the limit on parameter values: we used the maximum possible batch size and values for n and k_max that allow reasonable training times. However, larger values of n were tested and lead to a large decrease in model accuracy. This is a major issue for future work.

The results on 2D data as presented in Fig. 5a demonstrate that our method is able to learn specific and diverse characteristics of intuitive groupings. This is superior to any single traditional method, which only detects a certain class of cluster structure (e.g., defined by distance from a central point). Although [24] reaches moderately better scores for the speaker clustering task and [42] reaches a superior NMI for COIL-100, our method finds reasonable clusterings, is more flexible through end-to-end training and is not tuned to a specific kind of data.

See https://github.com/kutoga/learning2cluster.

Fig. 5: Clustering results for (a) 2D point data, (b) COIL-100 objects, and (c) faces from FaceScrub (for illustrative purposes). The color of points / colored borders of images depicts true cluster membership.

Table 1: NMI ∈ [0, 1] and MR ∈ [0, 1], averaged over evaluations of a trained network. We abbreviate our "learning to cluster" method as "L2C".

                            2D Points (self generated)   TIMIT           COIL-100
                            MR      NMI                  MR      NMI     MR      NMI
  L2C (= our method)        0.004   0.993                0.060   0.928   0.116   0.
  L2C + Euclidean           0.177   0.730                0.093   0.883   0.123   0.
  L2C + Mahalanobis         0.185   0.725                0.104   0.882   0.093   0.
  L2C + Metric Learning     0.165   0.740                0.101   0.880   0.100   0.
  Random cluster assignment 0.485   0.232                0.435   0.346   0.435   0.

Baselines (related work): k-means: MR = 0., NMI = 0.; DBSCAN: MR = 0., NMI = 0.; [24]: MR = 0; [42]: NMI = 0.

Hence, we assume, backed by the additional experiments to be found online, that our model works well also for other data types and datasets, given a suitable embedding network. Tab.
1 gives the numerical results for said datasets in the row called "L2C", without using the explicit metric learning block. Extensive preliminary experiments on other public datasets like e.g. FaceScrub [31] confirm these results: learning to cluster reaches promising performance while not yet being on par with tailor-made state-of-the-art approaches.

We compare the performance of our implicit distance metric learning method to versions enhanced by different explicit schemes for pairwise similarity computation prior to clustering. Specifically, three implementations of the optional metric learning block in subnetwork (b) are evaluated: using a fixed diagonal matrix A (resembling the Euclidean distance), training a diagonal A (resembling the Mahalanobis distance), and learning the entire coefficients of the distance matrix A. Since we argue above that our approach combines implicit deep metric embedding with clustering in an end-to-end architecture, one would not expect that adding explicit metric computation changes the results to a large extent. This assumption is largely confirmed by the results in the "L2C + ..." rows in Tab. 1: for COIL-100, Euclidean gives slightly worse, and the other two slightly better results than L2C alone; for TIMIT, all results are worse but still reasonable. We attribute the considerable performance drop on 2D points using all three explicit schemes to the fact that in this case many more instances are to be compared with each other (as each instance is smaller than e.g. an image, n is larger). This might have needed further adaptations like e.g. larger batch sizes (reduced here to N = 50 for computational reasons) and longer training times.

5 Conclusions

We have presented a novel approach to learn neural models that directly output a probabilistic clustering on previously unseen groups of data; this includes a solution to the problem of outputting similar but unspecific "labels" for similar objects of unseen "classes".
A trained model is able to cluster different data types with promising results. This is a complete end-to-end approach to clustering that learns both the relevant features and the "algorithm" by which to produce the clustering itself. It outputs probabilities for cluster membership of all inputs as well as the number of clusters in test data. The learning phase only requires pairwise labels between examples from a separate training set, and no explicit similarity measure needs to be provided. This is especially useful for high-dimensional, perceptual data like images and audio, where similarity is usually semantically defined by humans. Our experiments confirm that our algorithm is able to implicitly learn a metric and directly use it for the included clustering. This is similar in spirit to the very recent work of Hsu et al. [13], but does not need any optimization on the test (clustering) set and finds k autonomously. It is a novel approach to learning to cluster, introducing a novel architecture and loss design.

We observe that the clustering accuracy depends on the availability of a large number of different classes during training. We attribute this to the fact that the network needs to learn intra-class distances, a task inherently more difficult than just distinguishing between objects of a fixed number of classes as in classification problems. We understand the presented work as an early investigation into the new paradigm of learning to cluster by perceptual similarity specified through examples. It is inspired by our work on speaker clustering with deep neural networks, where we increasingly observe the need to go beyond surrogate tasks for learning, training end-to-end specifically for clustering to close a performance leak.
While this works satisfactorily for initial results, points for improvement revolve around scaling the approach to practical applicability, which foremost means getting rid of the dependency on n for the partition size. The number n of input examples to assess simultaneously is very relevant in practice: if an input data set has thousands of examples, incoherent single clusterings of subsets of n points would need to be merged to produce a clustering of the whole dataset based on our model. As the (RBD)LSTM layers responsible for assessing points simultaneously in principle have a long, but still local (short-term) horizon, they are not apt to grasp similarities of thousands of objects. Several ideas exist to change the architecture, including replacing recurrent layers with temporal convolutions, or using our approach to seed some sort of differentiable K-means or EM layer on top of it. Preliminary results on this exist. Increasing n is a prerequisite to also increase the maximum number of clusters k, as k ≪ n. For practical applicability, k needs to be increased by an order of magnitude; we plan to do this in the future. This might open up novel applications of our model in the area of transfer learning and domain adaptation.

Acknowledgements: We thank the anonymous reviewers for helpful feedback.

References

1. Aljalbout, E., Golkov, V., Siddiqui, Y., Cremers, D.: Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648 (2018)
2. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al.: Deep speech 2: End-to-end speech recognition in English and Mandarin. In: ICML. pp. 173–182 (2016)
3. Arias-Castro, E.: Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory pp. 1692–1706 (2011)
4.
Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: ICML. pp. 19–26 (2002)
5. Branson, S., Horn, G.V., Wah, C., Perona, P., Belongie, S.J.: The ignorant led by the blind: A hybrid human-machine vision system for fine-grained categorization. IJCV pp. 3–29 (2014)
6. Chin, C.F., Shih, A.C.C., Fan, K.C.: A novel spectral clustering method based on pairwise distance matrix. J. Inf. Sci. Eng. pp. 649–658 (2010)
7. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR. pp. 539–546 vol. 1 (2005)
8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD. pp. 226–231 (1996)
9. Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L.: DARPA TIMIT acoustic phonetic continuous speech corpus CDROM (1993)
10. Greff, K., van Steenkiste, S., Schmidhuber, J.: Neural expectation maximization. In: NIPS. pp. 6694–6704 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
12. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition. pp. 84–92 (2015)
13. Hsu, Y., Lv, Z., Kira, Z.: Learning to cluster in order to transfer across domains and tasks. In: ICLR (2018)
14. Jin, X., Han, J.: Expectation maximization clustering. In: Encyclopedia of Machine Learning, pp. 382–383. Springer (2011)
15. Kampffmeyer, M., Løkse, S., Bianchi, F.M., Livi, L., Salberg, A.B., Robert, J.: Deep divergence-based clustering. In: IEEE Int'l Workshop on Machine Learning for Signal Processing (MLSP) (2017)
16. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons (1990)
17. Krause, J., Stark, M., Deng, J., Li, F.F.: 3D object representations for fine-grained categorization.
In: Workshop on 3D Representation and Recognition at ICCV (2013)
18. Le, Q.V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. In: ICML. pp. 8595–8598 (2012)
19. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE pp. 2278–2324 (1998)
20. Lee, H., Ge, R., Ma, T., Risteski, A., Arora, S.: On the ability of neural nets to express distributions. In: COLT. pp. 1271–1296 (2017)
21. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. JMLR (1), 1334–1373 (2016)
22. Liu, D., Kubala, F.: Online speaker clustering. In: ICASSP. pp. I–333–6 vol. 1 (2003)
23. Lukic, Y., Vogt, C., Dürr, O., Stadelmann, T.: Speaker identification and clustering using convolutional neural networks. In: IEEE Int'l Workshop on Machine Learning for Signal Processing (MLSP) (2016)
24. Lukic, Y., Vogt, C., Dürr, O., Stadelmann, T.: Learning embeddings for speaker clustering based on voice equality. In: Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on (2017)
25. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symp. on Math. Statist. and Prob. pp. 281–297 (1967)
26. Manmatha, R., Wu, C., Smola, A.J., Krähenbühl, P.: Sampling matters in deep embedding learning. In: ICCV. pp. 2840–2848 (2017)
27. McDaid, A.F., Greene, D., Hurley, N.: Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515 (2011)
28. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
29. Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. The Computer Journal pp. 354–359 (1983)
30.
Nayar, S., Nene, S., Murase, H.: Columbia object image library (COIL-100). Department of Comp. Science, Columbia University, Tech. Rep. CUCS-006-96 (1996)
31. Ng, H.W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: ICIP. pp. 343–347 (2014)
32. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR. pp. 815–823 (2015)
33. Schwenker, F., Kestler, H.A., Palm, G.: Three learning phases for radial-basis-function networks. Neural Networks (4-5), 439–458 (2001)
34. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. PAMI pp. 2298–2304 (2017)
35. Sigtia, S., Benetos, E., Dixon, S.: An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM TASLP 24