Learning Neural Models for End-to-End Clustering

Benjamin Bruno Meier, Ismail Elezi, Mohammadreza Amirian, Oliver Dürr, and Thilo Stadelmann

ZHAW Datalab & School of Engineering, Winterthur, Switzerland
ARGUS DATA INSIGHTS Schweiz AG, Zurich, Switzerland
Ca' Foscari University of Venice, Venice, Italy
Institute of Neural Information Processing, Ulm University, Germany
Institute for Optical Systems, HTWG Konstanz, Germany
Abstract.
We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters k, and for each 1 ≤ k ≤ k_max, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this "learning to cluster" and show its conceptual difference to deep metric learning, semi-supervised clustering and other related approaches, while having the advantage of performing learnable clustering fully end-to-end.

Keywords: perceptual grouping · learning to cluster · speech & image clustering

1 Introduction

Consider the illustrative task of grouping images of cats and dogs by perceived similarity: depending on the intention of the user behind the task, the similarity could be defined by animal type (foreground object), environmental nativeness (background landscape, cp. Fig. 1), etc. This is characteristic of clustering perceptual, high-dimensional data like images [15] or sound [24]: a user typically has some similarity criterion in mind when thinking about naturally arising groups (e.g., pictures by holiday destination, or persons appearing; songs by mood, or use of solo instrument). As defining such a similarity for every case is difficult, it is desirable to learn it.
At the same time, the learned model will in many cases not be a classifier (the task will not be solved by classification), since the number and specific type of groups present at application time are not known in advance (e.g., speakers in TV recordings; persons in front of a surveillance camera; object types in the picture gallery of a large web shop).
Fig. 1: Images of cats (top) and dogs (bottom) in urban (left) and natural (right) environments.

Grouping objects with machine learning is usually approached with clustering algorithms [16]. Typical ones like K-means [25], EM [14], hierarchical clustering [29] with a chosen distance measure, or DBSCAN [8] each have a specific inductive bias towards certain similarity structures present in the data (e.g., K-means: Euclidean distance from a central point; DBSCAN: common point density). Hence, to be applicable to the above-mentioned tasks, they need high-level features that already encode the aspired similarity measure. This may be solved by learning salient embeddings [28] with a deep metric learning approach [12], followed by an off-line clustering phase using one of the above-mentioned algorithms.

However, it is desirable to combine these distinct phases (learning salient features, and subsequent clustering) into an end-to-end approach that can be trained globally [19]: it has the advantage of each phase being perfectly adjusted to the other by optimizing a global criterion, and removes the need of manually fitting parts of the pipeline. Numerous examples have demonstrated the success of neural networks for end-to-end approaches on such diverse tasks as speech recognition [2], robot control [21], scene text recognition [34], or music transcription [35].

In this paper, we present a conceptually novel approach that we call "learning to cluster" in the above-mentioned sense of grouping high-dimensional data by some perceptually motivated similarity criterion.
For this purpose, we define a novel neural network architecture with the following properties: (a) during training, it receives pairs of similar or dissimilar examples to learn the intended similarity function implicitly or explicitly; (b) during application, it is able to group objects of groups never encountered before; (c) it is trained end-to-end in a supervised way to produce a tailor-made clustering model; and (d) it is applied like a clustering algorithm to find both the number of clusters as well as the cluster membership of test-time objects in a fully probabilistic way.

Our approach builds upon ideas from deep metric embedding, namely to learn an embedding of the data into a representational space that allows for specific perceptual similarity evaluation via simple distance computation on feature vectors. However, it goes beyond this by adding the actual clustering step (grouping by similarity) directly to the same model, making it trainable end-to-end. Our approach is also different from semi-supervised clustering [4], which uses labels for some of the data points in the inference phase to guide the creation of groups. In contrast, our method uses absolutely no labels during inference, and moreover does not expect to have seen any of the groups it encounters during inference already during training (cp. Fig. 2). Its training stage may be
Fig. 2: Training vs. testing: cluster types encountered during application/inference are never seen in training. Exemplary outputs (right-hand side) contain a partition for each k (1–3 here) and a corresponding probability (best highlighted blue).

compared to creating K-means, DBSCAN etc. in the first place: it creates a specific clustering model, applicable to data with a certain similarity structure, and once created/trained, the model performs "unsupervised learning" in the sense of finding groups. Finally, our approach differs from traditional cluster analysis [16] in how the clustering algorithm is applied: instead of looking for patterns in the data in an unbiased and exploratory way, as is typically the case in unsupervised learning, our approach is geared towards the use case where users know perceptually what they are looking for, and can make this explicit using examples. We then learn appropriate features and the similarity function simultaneously, taking full advantage of end-to-end learning.

Our main contribution in this paper is the creation of a neural network architecture that learns to group data, i.e., that outputs the same "label" for "similar" objects regardless of (a) whether it has ever seen this group before; (b) the actual value of the label (it is hence not a "class"); and (c) the number of groups it will encounter during a single application run, up to a predefined maximum. This is novel in its concept and generality (i.e., learning to cluster previously unseen groups end-to-end for arbitrary, high-dimensional input without any optimization on test data). Due to this novelty in approach, we focus here on the general idea and experimental demonstration of the principal workings, and leave comprehensive hyperparameter studies and optimizations for future work. In Sec. 2, we compare our approach to related work, before presenting the model and training procedure in detail in Sec. 3.
We evaluate our approach on different datasets in Sec. 4, showing promising performance and a high degree of generality for data types ranging from 2D points to audio snippets and images, and discuss these results with conclusions for future work in Sec. 5.

2 Related Work

Learning to cluster based on neural networks has been approached mostly as a supervised learning problem to extract embeddings for a subsequent off-line
clustering phase. The core of all deep metric embedding models is the choice of the loss function. Motivated by the fact that the softmax cross-entropy loss function has been designed as a classification loss and is not suitable for the clustering problem per se,
Chopra et al. [7] developed a "Siamese" architecture, where the loss function is optimized in a way to generate similar features for objects belonging to the same class, and dissimilar features for objects belonging to different classes. A closely related loss function called "triplet loss" has been used by
Schroff et al. [32] to achieve state-of-the-art accuracy in face detection. The main difference from the Siamese architecture is that in the latter case, the network sees same- and different-class objects with every example. It is then optimized to jointly learn their feature representation. A problem of both approaches is that they are typically difficult to train compared to a standard cross-entropy loss.
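The two loss families discussed above can be sketched compactly. The following is a minimal pure-Python sketch, not the exact formulations of [7] or [32]; the margin values and the use of squared Euclidean distance are illustrative assumptions:

```python
import math

def contrastive_loss(z1, z2, same, margin=1.0):
    """Siamese-style contrastive loss: pull same-class pairs together,
    push different-class pairs at least `margin` apart."""
    d = math.dist(z1, z2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Triplet-style loss: the anchor should be closer to the positive
    than to the negative by at least `margin` (squared distances)."""
    d_pos = math.dist(anchor, pos) ** 2
    d_neg = math.dist(anchor, neg) ** 2
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: a and b are "similar", c is "different".
a, b, c = (0.0, 0.0), (0.1, 0.0), (2.0, 0.0)
l_same = contrastive_loss(a, b, same=True)   # small: the pair is already close
l_diff = contrastive_loss(a, c, same=False)  # zero: the pair is beyond the margin
l_trip = triplet_loss(a, b, c)               # zero: the constraint is satisfied
```

Note how the triplet variant couples one positive and one negative comparison in a single term, which is exactly what makes sampling strategy so influential, as discussed below.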
Song et al. [37] developed an algorithm for taking full advantage of all the information available in training batches. They later refined the work [36] by proposing a new metric learning scheme based on structured prediction, which is designed to optimize a clustering quality metric (normalized mutual information [27]). Even better results were achieved by
Wong et al. [38], where the authors proposed a novel angular loss and achieved state-of-the-art results on the challenging real-world datasets
Stanford Cars [17] and
Caltech Birds [5]. On the other hand,
Lukic et al. [23] showed that for certain problems, a carefully chosen deep neural network can simply be trained with softmax cross-entropy loss and still achieve state-of-the-art performance in challenging problems like speaker clustering. Alternatively,
Wu et al. [26] showed that state-of-the-art results can be achieved simply by using a traditional margin loss function and being careful about how sampling is performed during the creation of mini-batches.

Recently, attempts have been made that are more similar to ours in spirit, using deep neural networks only and performing clustering end-to-end [1]. They are trained in a fully unsupervised fashion, and hence solve a different task than the one we motivated above (that is inspired by speaker or image clustering based on some human notion of similarity). Perhaps the first to group objects together in an unsupervised deep learning based manner were
Le et al. [18], detecting high-level concepts like cats or humans.
Xie et al. [40] used an autoencoder architecture to do clustering, but evaluated it experimentally only on simplistic datasets like
MNIST. CNN-based approaches followed, e.g. by
Yang et al. [42], where clustering and feature representation are optimized together.
Greff et al. [10] performed perceptual grouping (of pixels within an image into the objects constituting the complete image, hence a different task than ours) fully unsupervised, using a neural expectation maximization algorithm. Our work differs from the above-mentioned works in several respects: it has no assumption on the type of data, and solves the different task of grouping whole input objects.
Our method learns to cluster end-to-end purely ab initio, without the need to explicitly specify a notion of similarity, only providing the information whether two examples belong together.

3 Model

Fig. 3: Our complete model, consisting of (a) the embedding network, (b) clustering network (including an optional metric learning part, see Sec. 3.3), (c) cluster-assignment network and (d) cluster-count estimating network.

The model uses as input n ≥ 2 examples x_i, where n may be different during training and application and constitutes the number of objects that can be clustered at a time, i.e. the maximum number of objects in a partition. The network's output is two-fold: a probability distribution P(k) over the cluster count 1 ≤ k ≤ k_max; and probability distributions P(·|x_i, k) over all possible cluster indexes for each input example x_i and for each k.

The network architecture (see Fig. 3) allows the flexible use of different input types, e.g. images, audio or 2D points. An input x_i is first processed by an embedding network (a) that produces a lower-dimensional representation z_i = z(x_i). The dimension of z_i may vary depending on the data type; for example, 2D points do not require any embedding network. A fully connected layer (FC) with LeakyReLU activation at the beginning of the clustering network (b) is then used to bring all embeddings to the same size. This approach allows to use identical subnetworks (b)–(d) and only change the subnet (a) for any data type.
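The interplay of subnet (a) and the size-normalizing FC layer can be illustrated with a small sketch. This is not the authors' implementation: the random projection matrices stand in for trained embedding networks, and all dimensions (input size 64, intermediate size 32, common size 8) are made-up values:

```python
import random
random.seed(0)

EMB_DIM = 8  # common embedding size after the FC layer (assumed value)

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def leaky_relu(v, alpha=0.01):
    return [x if x > 0 else alpha * x for x in v]

# Subnet (a): one embedding function per data type. 2D points need no
# embedding; a flattened "image" gets a stand-in projection to 32 dims.
W_img = random_matrix(32, 64)
def embed_point(p):   return list(p)            # z(x) = x for 2D points
def embed_image(img): return matvec(W_img, img)

# Entry of subnet (b): a type-specific FC layer with LeakyReLU brings all
# embeddings to the same size EMB_DIM, so subnets (b)-(d) stay identical.
W_fc_point = random_matrix(EMB_DIM, 2)
W_fc_img   = random_matrix(EMB_DIM, 32)

def to_common(z, W):
    return leaky_relu(matvec(W, z))

z_point = to_common(embed_point([0.3, -1.2]), W_fc_point)
z_image = to_common(embed_image([0.5] * 64), W_fc_img)
assert len(z_point) == len(z_image) == EMB_DIM
```

The design choice this illustrates: only subnet (a) and the first FC layer are data-type specific; everything downstream operates on fixed-size vectors.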
The goal of the subnet (b) is to compare each input z(x_i) with all other z(x_{j≠i}), in order to learn an abstract grouping which is then concretized into an estimation of the number of clusters (subnet (d)) and a cluster assignment (subnet (c)).

To be able to process a non-fixed number of examples n as input, we use a recurrent neural network. Specifically, we use stacked residual bi-directional LSTM layers (RBDLSTM), which are similar to the cells described in [39] and visualized in Fig. 4. The residual connections allow a much more effective gradient flow during training [11] and avoid vanishing gradients. Additionally, the network can learn to use or bypass certain layers using the residual connections, thus reducing the architectural decision on the number of recurrent layers to the simpler one of finding a reasonable upper bound.

The first of overall two outputs is modeled by the cluster assignment network (c). It contains a softmax layer to produce P(ℓ|x_i, k), which assigns a cluster index ℓ to each input x_i, given k clusters (i.e., we get a distribution over possible
Fig. 4: RBDLSTM layer: a BDLSTM with residual connections (dashed lines). The variables x_i and y_i are named independently from the notation in Fig. 3.

cluster assignments for each input and every possible number of clusters). The second output, produced by the cluster-count estimating network (d), is built from another BDLSTM layer. Due to the bi-directionality of the network, we concatenate its first and last output vector into a fully connected layer of twice as many units, using again
LeakyReLUs. The subsequent softmax activation finally models the distribution P(k) for 1 ≤ k ≤ k_max. The next subsection shows how this neural network learns to approximate these two complicated probability distributions [20] purely from pairwise constraints on data that is completely separate from any dataset to be clustered. No labels for clustering are needed.

In order to define a suitable loss function, we first define an approximation (assuming independence) of the probability that x_i and x_j are assigned to the same cluster for a given k as

P_ij(k) = Σ_{ℓ=1}^{k} P(ℓ|x_i, k) P(ℓ|x_j, k).

By marginalizing over k, we obtain P_ij, the probability that x_i and x_j belong to the same cluster:

P_ij = Σ_{k=1}^{k_max} P(k) Σ_{ℓ=1}^{k} P(ℓ|x_i, k) P(ℓ|x_j, k).

Let y_ij = 1 if x_i and x_j are from the same cluster (e.g., have the same group label) and y_ij = 0 otherwise. The loss component for cluster assignments, L_ca, is then given by the weighted binary cross entropy

L_ca = − 2/(n(n−1)) Σ_{i<j} (φ_1 y_ij log P_ij + φ_2 (1−y_ij) log(1−P_ij)),

with weights φ_1 and φ_2 for the two kinds of pairs.

The optional metric learning block in subnetwork (b) (see Fig. 3) learns a distance metric in the spirit of Xing et al. [41]. In contrast to their work, we optimize it end-to-end with backpropagation. This has been proposed in [33] for classification alone; we do it here for a clustering task, for the whole covariance matrix, and jointly with the rest of our network. We construct the non-symmetric, non-negative dissimilarity measure d_A between two data points x_i and x_j as

d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)

and let the neural network training optimize A through the total loss L_tot without intermediate losses. The matrix A as used in d_A can be thought of as a trainable distance metric. In every training step, it is projected into the space of positive semidefinite matrices.
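The quantities P_ij and L_ca can be computed directly from the two network outputs. The following is a minimal pure-Python sketch; the per-pair weights `w_pos`/`w_neg` (standing in for φ_1, φ_2) and the epsilon-clipping of the logarithms are illustrative assumptions:

```python
import math

def p_same(assign_i, assign_j, p_k):
    """P_ij: probability that examples i and j share a cluster.
    assign_i[k-1][l] = P(l | x_i, k); p_k[k-1] = P(k)."""
    total = 0.0
    for k, pk in enumerate(p_k, start=1):
        total += pk * sum(assign_i[k - 1][l] * assign_j[k - 1][l]
                          for l in range(k))
    return total

def l_ca(assignments, p_k, same, w_pos=1.0, w_neg=1.0):
    """Weighted binary cross entropy over all pairs i < j;
    same[i][j] is the pairwise label y_ij."""
    n = len(assignments)
    loss, eps = 0.0, 1e-12
    for i in range(n):
        for j in range(i + 1, n):
            pij = p_same(assignments[i], assignments[j], p_k)
            if same[i][j]:
                loss -= w_pos * math.log(pij + eps)
            else:
                loss -= w_neg * math.log(1.0 - pij + eps)
    return 2.0 * loss / (n * (n - 1))

# Deterministic toy case with k_max = 2: the network is certain that
# k = 2, and assigns x0, x1 to cluster 0 and x2 to cluster 1.
p_k = [0.0, 1.0]
a0 = [[1.0], [1.0, 0.0]]  # per k: a distribution over the k cluster indexes
a1 = [[1.0], [1.0, 0.0]]
a2 = [[1.0], [0.0, 1.0]]
p01 = p_same(a0, a1, p_k)  # same cluster with certainty
p02 = p_same(a0, a2, p_k)  # different clusters with certainty
```

Note that P_ij only compares assignment distributions, never the label values themselves, which is what makes the "labels" unspecific in the sense discussed in the introduction.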
4 Experimental Results

To assess the quality of our model, we perform clustering on three different datasets: for a proof of concept, we test on a set of generated 2D points with a high variety of shapes, coming from different distributions. For speaker clustering, we use the TIMIT [9] corpus, a dataset of studio-quality speech recordings frequently used for pure speaker clustering in related work. For image clustering, we test on the COIL-100 [30] dataset, a collection of different isolated objects in various orientations. To compare to related work, we measure the performance with the standard evaluation scores misclassification rate (MR) [22] and normalized mutual information (NMI) [27].

Architecturally, we choose m = 14 BDLSTM layers and units in the FC layer of subnetwork (b), units for the BDLSTM in subnetwork (d), and α = 0. for all LeakyReLUs in the experiments below. All hyperparameters were chosen based on preliminary experiments to achieve reasonable performance, but not tested nor tweaked extensively. The code and further material and experiments are available online.

We set k_max = 5 and λ = 5 for all experiments. For the 2D point data, we use n = 72 inputs and a batch size of N = 200 (we used a batch size of N = 50 for metric learning with 2D points). For TIMIT, the network input consists of n = 20 audio snippets with a length of . seconds, encoded as mel-spectrograms with × pixels (identical to [24]). For COIL-100, we use n = 20 inputs with a dimension of × ×. For TIMIT and COIL-100, a simple CNN with 3 conv/max-pooling layers is used as subnetwork (a). For TIMIT, we use of the available speakers for training (and of the remaining ones each for validation and evaluation). For COIL-100, we train on of the classes ( for validation, for evaluation). For all runs, we optimize using Adadelta [43] with a learning rate of . . Example clusterings are shown in Fig. 5.
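The two evaluation scores can be sketched as follows. This is a simplified reading, not the exact definitions of [22] and [27]: MR as the minimal error over all one-to-one mappings of predicted to true cluster labels (brute force, which is fine for the small k used here), and NMI in its arithmetic-mean normalization:

```python
import math
from collections import Counter
from itertools import permutations

def misclassification_rate(true, pred):
    """Fraction of wrongly assigned points under the best one-to-one
    relabeling of predicted clusters (assumes no more predicted than
    true clusters)."""
    t_labels, p_labels = sorted(set(true)), sorted(set(pred))
    best = len(true)
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        errors = sum(1 for t, p in zip(true, pred) if mapping[p] != t)
        best = min(best, errors)
    return best / len(true)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """NMI with arithmetic-mean normalization: 2 I(T;P) / (H(T) + H(P))."""
    n = len(true)
    h_t, h_p = entropy(true), entropy(pred)
    if h_t == 0.0 and h_p == 0.0:
        return 1.0
    joint, ct, cp = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum((c / n) * math.log(n * c / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    return 2.0 * mi / (h_t + h_p)

# Predicted cluster indexes are arbitrary identifiers: a relabeled
# perfect clustering still scores MR = 0 and NMI = 1.
true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 2, 2, 0, 0]
mr_perfect = misclassification_rate(true, pred)
nmi_perfect = nmi(true, pred)
```

Both scores are invariant to the permutation of cluster labels, which matches the fact that our network's output "labels" carry no class semantics.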
For all configurations, the used hardware set the limit on parameter values: we used the maximum possible batch size and values for n and k_max that allow reasonable training times. However, larger values of n were tested and lead to a large decrease in model accuracy. This is a major issue for future work.

The results on 2D data as presented in Fig. 5a demonstrate that our method is able to learn specific and diverse characteristics of intuitive groupings. This is superior to any single traditional method, which only detects a certain class of cluster structure (e.g., defined by distance from a central point). Although [24] reaches moderately better scores for the speaker clustering task and [42] reaches a superior NMI for COIL-100, our method finds reasonable clusterings, is more flexible through end-to-end training and is not tuned to a specific kind of data.

See https://github.com/kutoga/learning2cluster.

Fig. 5: Clustering results for (a) 2D point data, (b) COIL-100 objects, and (c) faces from FaceScrub (for illustrative purposes). The color of points / colored borders of images depicts true cluster membership.

Table 1: NMI ∈ [0, 1] and MR ∈ [0, 1], averaged over evaluations of a trained network. We abbreviate our "learning to cluster" method as "L2C".

                            2D Points (self generated)   TIMIT           COIL-100
                            MR      NMI                  MR      NMI     MR      NMI
  L2C (= our method)        0.004   0.993                0.060   0.928   0.116   0.
  L2C + Euclidean           0.177   0.730                0.093   0.883   0.123   0.
  L2C + Mahalanobis         0.185   0.725                0.104   0.882   0.093   0.
  L2C + Metric Learning     0.165   0.740                0.101   0.880   0.100   0.
  Random cluster assignment 0.485   0.232                0.435   0.346   0.435   0.

Baselines (related work): k-means: MR = 0., NMI = 0.; DBSCAN: MR = 0., NMI = 0.; [24]: MR = 0; [42]: NMI = 0.

Hence, we assume, backed by the additional experiments to be found online, that our model works well also for other data types and datasets, given a suitable embedding network. Tab.
1 gives the numerical results for said datasets in the row called "L2C", without using the explicit metric learning block. Extensive preliminary experiments on other public datasets like e.g. FaceScrub [31] confirm these results: learning to cluster reaches promising performance while not yet being on par with tailor-made state-of-the-art approaches.

We compare the performance of our implicit distance metric learning method to versions enhanced by different explicit schemes for pairwise similarity computation prior to clustering. Specifically, three implementations of the optional metric learning block in subnetwork (b) are evaluated: using a fixed diagonal matrix A (resembling the Euclidean distance), training a diagonal A (resembling the Mahalanobis distance), and learning the entire coefficients of the distance matrix A. Since we argue above that our approach combines implicit deep metric embedding with clustering in an end-to-end architecture, one would not expect that adding explicit metric computation changes the results to a large extent. This assumption is largely confirmed by the results in the "L2C + ..." rows in Tab. 1: for COIL-100, Euclidean gives slightly worse, and the other two slightly better results than L2C alone; for TIMIT, all results are worse but still reasonable. We attribute the considerable performance drop on 2D points using all three explicit schemes to the fact that in this case many more instances are to be compared with each other (as each instance is smaller than e.g. an image, n is larger). This might have needed further adaptations like e.g. larger batch sizes (reduced here to N = 50 for computational reasons) and longer training times.

5 Conclusions

We have presented a novel approach to learn neural models that directly output a probabilistic clustering on previously unseen groups of data; this includes a solution to the problem of outputting similar but unspecific "labels" for similar objects of unseen "classes".
A trained model is able to cluster different data types with promising results. This is a complete end-to-end approach to clustering that learns both the relevant features and the "algorithm" by which to produce the clustering itself. It outputs probabilities for cluster membership of all inputs as well as the number of clusters in test data. The learning phase only requires pairwise labels between examples from a separate training set, and no explicit similarity measure needs to be provided. This is especially useful for high-dimensional, perceptual data like images and audio, where similarity is usually semantically defined by humans. Our experiments confirm that our algorithm is able to implicitly learn a metric and directly use it for the included clustering. This is similar in spirit to the very recent work of Hsu et al. [13], but does not need any optimization on the test (clustering) set and finds k autonomously. It is a novel approach to learning to cluster, introducing a novel architecture and loss design.

We observe that the clustering accuracy depends on the availability of a large number of different classes during training. We attribute this to the fact that the network needs to learn intra-class distances, a task inherently more difficult than just distinguishing between objects of a fixed number of classes as in classification problems. We understand the presented work as an early investigation into the new paradigm of learning to cluster by perceptual similarity specified through examples. It is inspired by our work on speaker clustering with deep neural networks, where we increasingly observe the need to go beyond surrogate tasks for learning, training end-to-end specifically for clustering to close a performance leak.
While this works satisfactorily for initial results, points for improvement revolve around scaling the approach to practical applicability, which foremost means getting rid of the dependency on n for the partition size. The number n of input examples to assess simultaneously is very relevant in practice: if an input data set has thousands of examples, incoherent single clusterings of subsets of n points would need to be merged to produce a clustering of the whole dataset based on our model. As the (RBD)LSTM layers responsible for assessing points simultaneously in principle have a long, but still local (short-term) horizon, they are not apt to grasp similarities of thousands of objects. Several ideas exist to change the architecture, including replacing recurrent layers with temporal convolutions, or using our approach to seed some sort of differentiable K-means or EM layer on top of it. Preliminary results on this exist. Increasing n is a prerequisite to also increase the maximum number of clusters k, as k ≪ n. For practical applicability, k needs to be increased by an order of magnitude; we plan to do this in the future. This might open up novel applications of our model in the area of transfer learning and domain adaptation.

Acknowledgements: We thank the anonymous reviewers for helpful feedback.

References

1. Aljalbout, E., Golkov, V., Siddiqui, Y., Cremers, D.: Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648 (2018)
2. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al.: Deep speech 2: End-to-end speech recognition in English and Mandarin. In: ICML. pp. 173–182 (2016)
3. Arias-Castro, E.: Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory pp. 1692–1706 (2011)
4.
Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: ICML. pp. 19–26 (2002)
5. Branson, S., Horn, G.V., Wah, C., Perona, P., Belongie, S.J.: The ignorant led by the blind: A hybrid human-machine vision system for fine-grained categorization. IJCV pp. 3–29 (2014)
6. Chin, C.F., Shih, A.C.C., Fan, K.C.: A novel spectral clustering method based on pairwise distance matrix. J. Inf. Sci. Eng. pp. 649–658 (2010)
7. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR. pp. 539–546 vol. 1 (2005)
8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD. pp. 226–231 (1996)
9. Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L.: DARPA TIMIT acoustic phonetic continuous speech corpus CDROM (1993)
10. Greff, K., van Steenkiste, S., Schmidhuber, J.: Neural expectation maximization. In: NIPS. pp. 6694–6704 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
12. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition. pp. 84–92 (2015)
13. Hsu, Y., Lv, Z., Kira, Z.: Learning to cluster in order to transfer across domains and tasks. In: ICLR (2018)
14. Jin, X., Han, J.: Expectation maximization clustering. In: Encyclopedia of Machine Learning, pp. 382–383. Springer (2011)
15. Kampffmeyer, M., Løkse, S., Bianchi, F.M., Livi, L., Salberg, A.B., Robert, J.: Deep divergence-based clustering. In: IEEE Int'l Workshop on Machine Learning for Signal Processing (MLSP) (2017)
16. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons (1990)
17. Krause, J., Stark, M., Deng, J., Li, F.F.: 3D object representations for fine-grained categorization.
In: Workshop on 3D Representation and Recognition at ICCV (2013)
18. Le, Q.V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. In: ICML. pp. 8595–8598 (2012)
19. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE pp. 2278–2324 (1998)
20. Lee, H., Ge, R., Ma, T., Risteski, A., Arora, S.: On the ability of neural nets to express distributions. In: COLT. pp. 1271–1296 (2017)
21. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. JMLR (1), 1334–1373 (2016)
22. Liu, D., Kubala, F.: Online speaker clustering. In: ICASSP. pp. I–333–6 vol. 1 (2003)
23. Lukic, Y., Vogt, C., Dürr, O., Stadelmann, T.: Speaker identification and clustering using convolutional neural networks. In: IEEE Int'l Workshop on Machine Learning for Signal Processing (MLSP) (2016)
24. Lukic, Y., Vogt, C., Dürr, O., Stadelmann, T.: Learning embeddings for speaker clustering based on voice equality. In: Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on (2017)
25. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symp. on Math. Statist. and Prob. pp. 281–297 (1967)
26. Manmatha, R., Wu, C., Smola, A.J., Krähenbühl, P.: Sampling matters in deep embedding learning. In: ICCV. pp. 2840–2848 (2017)
27. McDaid, A.F., Greene, D., Hurley, N.: Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515 (2011)
28. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
29. Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. The Computer Journal pp. 354–359 (1983)
30.
Nayar, S., Nene, S., Murase, H.: Columbia object image library (COIL-100). Department of Comp. Science, Columbia University, Tech. Rep. CUCS-006-96 (1996)
31. Ng, H.W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: ICIP. pp. 343–347 (2014)
32. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR. pp. 815–823 (2015)
33. Schwenker, F., Kestler, H.A., Palm, G.: Three learning phases for radial-basis-function networks. Neural Networks (4-5), 439–458 (2001)
34. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. PAMI pp. 2298–2304 (2017)
35. Sigtia, S., Benetos, E., Dixon, S.: An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM TASLP 24