Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song
Sangho Lee*, Jiwan Chung*, Youngjae Yu, Gunhee Kim
Seoul National University
{sangho.lee, jiwanchung, yj.yu}@vision.snu.ac.kr, [email protected]
Thomas Breuel, Gal Chechik
NVIDIA Research
{tbreuel, gchechik}@nvidia.com
Yale Song
Microsoft Research
[email protected]
* Equal contribution
Abstract
Large-scale datasets are the cornerstone of self-supervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is sometimes not straightforward. This restricts the community to rely on datasets that require laborious annotation and/or manual filtering processes. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on the scenario of audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets of similar scales. The most significant benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.
1. Introduction
The overall objective of our work is learning to recognize objects, actions, and sound in videos without the need for manual ground truth labels. This is not only a theoretically interesting problem, since it mimics the discovery of the auditory and visual environment by infants; it is also of immense practical importance, since accurate manual labeling of large amounts of audio-visual data is impractical and any large-scale audio-visual learning needs to be semi- or self-supervised.
Figure 1. We propose an automatic dataset curation pipeline for audio-visual representation learning. We formulate an optimization problem where the goal is to find a subset that maximizes the mutual information between audio and visual channels of videos.

Compared to unsupervised and self-supervised learning on static images, audio-visual inputs pose additional challenges: large portions of a video may contain no relevant information, and auditory and visual inputs may not always be in correspondence. In analogy to context and occlusion in static image analysis, we could address such problems in an end-to-end way, for example using temporal attention over audio and video streams; however, this is impractical for full-length videos given the current generation of GPUs. Therefore, many existing self-supervised methods on audio-visual data either start with datasets consisting of short video clips for which there is a high probability of audio-visual correspondence, or they learn audio-visual properties corresponding only to short-term statistical regularities. The necessary datasets are usually manually created or rely on domain-specific properties (e.g., [8, 19] and below). If we want to carry out self-supervised learning on full length (minutes, hours) of video without manually generating and/or selecting video clips, we need automated ways of curating such collections of audio/video clips from diverse collections of full-length video.

Therefore, we view self-supervised learning from diverse collections of full-length videos as a two-step process: (1) an automatic dataset curation process that generates short, relevant clips small enough for full-length processing on current-generation GPUs, and (2) a self-supervised learning approach that operates on the collection of short clips. This paper focuses on step (1) and not on step (2), providing an automated way of taking a collection of general or domain-specific videos of arbitrary length and reducing it to a collection of shorter clips containing a high proportion of relevant audio-video correspondences. The output of this step can be used both as input to existing self-supervised learning algorithms on audio-visual data [53, 27, 23, 12, 15, 48, 74, 28, 60, 35, 3, 56], as well as for the development of novel self-supervised algorithms.

In this paper, we propose an automated approach to collecting a video dataset for audio-visual representation learning. We pose data collection as the subset selection problem [46] and propose an information-theoretic measure of audio-visual correspondence as a criterion to select data. Specifically, we formulate an optimization problem of finding a subset that maximizes the mutual information (MI) between audio and visual channels of videos.

The main technical challenge we address is how to measure the audio-visual MI and find a subset that maximizes the MI in a scalable manner. Given that video processing is notoriously compute and storage intensive, we put a particular emphasis on scalability, i.e., we want a measure that can easily handle hundreds of millions of video clips. We compare two methods for MI estimation: the classical clustering-based approach [44, 71] and the modern noise contrastive learning approach [21, 27, 53, 59].

For the clustering-based solution, we avoid using memory-heavy components such as Lloyd's algorithm [55] and instead use SGD [7] to perform K-means clustering.
Further, we approximately solve the subset maximization objective with a batch greedy method [13], which greedily selects samples within a batch and then reassembles batches in a greedy manner. Through controlled experiments with ground-truth correspondences, we show that this approach is more robust to real-world correspondence patterns than the contrastive approaches.

Our approach enables building an arbitrarily large dataset without human annotation or manual filtering processes. We present an automated pipeline that selects any number of videos from a pool of several hundred million video clips. To evaluate the effectiveness of our approach, we produce datasets at varying scales and compare them to existing datasets that are either manually annotated or involve human verification, i.e., Kinetics-Sounds [4] at 20K scale, VGG-Sound [11] at 200K scale, and AudioSet [19] at 2M scale. To compare the datasets in a controlled manner, we pretrain a network under the same settings and compare them on downstream tasks using a linear evaluation protocol. We demonstrate that datasets produced by our method achieve similar (and sometimes slightly better) performances to the existing datasets.

To summarize, our main contributions include: 1) We propose an information-theoretic approach to finding a large-scale dataset for audio-visual video representation learning. 2) We thoroughly evaluate different components of our pipeline via controlled experiments using both ground-truth and noisy real-world correspondence patterns. 3) We use the pipeline to construct a large-scale open-domain video dataset consisting of 10M video clips sourced from YouTube and publicly release it. To the best of our knowledge, this is the largest video dataset that provides audio-visual correspondence.
2. Related Work
Audio-Visual Datasets. Several different types of audio-visual datasets have been collected: (1) manually labeled (e.g., [19], [69]), (2) domain specific (e.g., AVA ActiveSpeaker [62], AVA Speech [10], Greatest Hits [54], FAIR-Play [18], YouTube-ASMR-300K [80]), and (3) unlabeled, unrestricted collections from consumer video sites (e.g., Flickr-SoundNet [5, 4]). We are building systems for learning audio-visual correspondence on diverse, unrestricted inputs. This requires large amounts of training data, making manual collection and labeling costly and impractical. Our work intends to extend existing audio-visual datasets by automatically curating collected videos to improve audio-visual correspondence.

Chen et al. [11] created a dataset of 200K clips for audio-visual research; clips were originally obtained by keyword search on YouTube and frames were classified with pretrained visual classifiers. Since keywords and visual classes do not perfectly correspond, such correspondences needed to be manually reviewed and corrected on randomly sampled clips in an iterative and interactive process. In contrast to this prior work, our approach does not involve costly human intervention, which helps improve scalability.
Large-Scale Data Curation. Let us review some of the relevant large-scale video datasets.
AudioSet [19] consists of about 2M clips corresponding to audio events retrieved from YouTube by keyword search; human raters verified the presence of audio events in the candidate videos.
Moments in Time [49] consists of over one million clips corresponding to diverse visual and auditory events; short video clips were selected using keywords (verbs) and manually reviewed for high correspondence between the clips and the keywords.
HowTo100M [45] consists of 136M clips segmented from 1.22M narrated instructional web videos retrieved by text search from YouTube, with an additional filtering step based on metadata.
Web Videos and Text (WVT) [68] consists of 70M clips obtained by searching the web with keywords based on the Kinetics-700 [8] categories and retaining both the video and the associated text. Our work contributes to this line of research by introducing an automatic and scalable data curation pipeline free of a predefined set of keywords, and a large-scale audio-visual dataset curated with the pipeline.
Subset Selection. Our work focuses on data subset selection; extensive prior work exists in supervised [70, 77, 65, 76], unsupervised [22, 78], and active learning settings [40, 64]. Different criteria for subset selection have been explored in the literature.
Submodular functions naturally model notions of information, diversity and coverage [75], and can be optimized efficiently using greedy algorithms [47, 51].
Geometric criteria such as coresets [2] aim to approximate geometric extent measures over a large dataset with a relatively small subset. Mutual information (MI) between input feature values and/or labels has been used successfully [20, 41, 67] as a probabilistically motivated criterion. We propose to use MI as an objective function for subset selection and make the following two unique contributions. First, we use MI to measure audio-visual correspondence within videos by formulating MI between the audio and visual features. Second, we apply MI to the large-scale video dataset curation problem. In the case of clustering-based MI estimation, we demonstrate that optimizing the MI objective with a greedy algorithm is a practical solution for building a large-scale pipeline such as ours. We empirically verify the efficacy of MI in subset selection problems with controlled experiments in Section 4.
3. Data Collection Pipeline
Our pipeline consists of four steps: (i) acquiring raw videos from the web and filtering them based on metadata, (ii) segmenting the videos into clips and extracting features with pretrained extractors, (iii) estimating mutual information (MI) between audio and visual representations via either Noise Contrastive Estimation (NCE) or clustering, and (iv) selecting a subset of videos that maximizes the MI.
We crawl YouTube to download videos with a wide variety of topics. Unlike previous work that uses a carefully curated set of keywords [11], which could inadvertently introduce bias, we aim to capture the natural distribution of topics present on the website. To ensure diversity in topics, cultures, and languages, we create different combinations of search queries (e.g., keywords, locations, events, categories, etc.) to obtain an initial video list.

Before downloading videos, we process the search results using metadata (provided by the YouTube API) to filter out videos with potentially low quality or low audio-visual correspondence. We use the duration to exclude videos shorter than 30 seconds (to avoid low quality videos) and longer than 600 seconds (to avoid large storage costs). We also exclude videos that contain selected keywords (in either title or description) or that belong to certain categories, i.e., gaming, animation, screencast, and music videos, because most such videos exhibit non-natural scenes (computer graphics) and/or low audio-visual correspondence. Finally, we detect language from the titles and descriptions using fastText [31, 32] and keep the languages that together make up most of the cumulative distribution, resulting in eight languages (English, Spanish, Portuguese, Russian, Japanese, French, German, and Korean).

The result is 140 million full-length videos with a total duration of 1030 years (median: 198 seconds). To minimize the storage cost we download 360p resolution videos; this still consumes 1.8 petabytes of storage. Handling such large-scale data requires a carefully designed data pipeline. We discuss our modularized pipeline below.

Clip Segmentation. To avoid redundant clips, we extract up to three 10-second clips from each full-length video. We do this by detecting shot boundaries (using the scdet filter in FFmpeg) and computing pairwise clip similarities based on the MPEG-7 video signatures (using the signature filter in FFmpeg). We then select up to 3 clips that give the minimum total pairwise scores using local search [30]. This gives us about 300M clips.
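To make the last step concrete, the sketch below picks the subset of candidate clips with the minimum total pairwise similarity. It is a minimal illustration only: the similarity matrix is assumed to come from comparing the FFmpeg MPEG-7 signatures, the brute-force search over 3-subsets stands in for the local-search heuristic of [30], and `select_dissimilar_clips` is a hypothetical helper name.

```python
from itertools import combinations
import numpy as np

def select_dissimilar_clips(similarity, n_select=3):
    """Pick n_select clips whose total pairwise similarity is minimal.

    `similarity` is an (S x S) symmetric matrix of clip-to-clip similarity
    scores (assumed to be derived from MPEG-7 signature comparisons).
    Brute force over 3-subsets is cheap for the few candidates per video;
    the paper uses a local-search heuristic instead.
    """
    n_clips = similarity.shape[0]
    if n_clips <= n_select:
        return list(range(n_clips))
    best_subset, best_score = None, np.inf
    for subset in combinations(range(n_clips), n_select):
        score = sum(similarity[i, j] for i, j in combinations(subset, 2))
        if score < best_score:
            best_subset, best_score = subset, score
    return list(best_subset)

# Toy example: 5 candidate clips; clips 0 and 1 are near-duplicates.
sim = np.array([[1.0, 0.9, 0.2, 0.1, 0.3],
                [0.9, 1.0, 0.3, 0.2, 0.2],
                [0.2, 0.3, 1.0, 0.4, 0.1],
                [0.1, 0.2, 0.4, 1.0, 0.2],
                [0.3, 0.2, 0.1, 0.2, 1.0]])
print(select_dissimilar_clips(sim))  # -> [0, 2, 4]; avoids keeping both near-duplicates 0 and 1
```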
Feature Extraction. To measure correspondence between the audio and visual channels of the 300M clips, we need good feature representations. An ideal representation would capture a variety of important aspects, from low-level details (e.g., texture and flow) to high-level concepts (e.g., semantic categories). However, such an oracle extractor is hard to obtain, and the sheer scale of data makes it impractical to learn optimal feature extractors end-to-end. Therefore, we use widely-used pretrained networks to extract features, i.e., SlowFast [16] pretrained on Kinetics-400 [33] and VGGish [25] pretrained on YouTube-8M [1] for visual and audio features, respectively.
Our next goal is to select clips that exhibit strong correspondence between visual and audio channels. To this end, we estimate the mutual information (MI) between the audio and visual signals. Computing the exact MI is infeasible because it requires estimating the joint distribution of very high dimensional variables. As a result, various methods have been proposed for approximating and estimating MI between variables [72]. Here we implement and compare two approaches: a noise-contrastive estimator (NCE) [21], which measures MI in a continuous feature space, and a clustering-based estimator that computes MI in a discrete space via vector quantization.

3.3.1 NCE-Based MI Estimation
Contrastive approaches have become a popular way of estimating MI between different views of the data [53, 27]. We add linear projection heads over the precomputed audio/visual features and train them using the contrastive loss [12]. From a mini-batch $\{(v_i, a_i)\}_{i=1}^{N_b}$, where $v_i$ and $a_i$ are visual and audio features, respectively, we minimize

$$\ell(v_i, a_i) = -\log \frac{\exp(S(z_i^v, z_i^a)/\tau)}{\sum_{j=1}^{N_b} \exp(S(z_i^v, z_j^a)/\tau)}, \quad (1)$$

where $z_i^v$ and $z_i^a$ are embeddings from the linear projection heads, $S(\cdot,\cdot)$ measures the cosine similarity, and $\tau$ is a temperature term. For each mini-batch we compute $\ell(v_i, a_i)$ and $\ell(a_i, v_i)$ to make the loss symmetric. Once trained, we can directly use $S(z^v, z^a)$ to estimate audio-visual MI and find a subset by taking the top $N$ candidates from a ranked list of video clips.

3.3.2 Clustering-Based MI Estimation

Clustering is one of the classical ways of estimating MI [44, 71]. Given two groupings of a dataset $X$ with respect to audio and visual features, $\mathcal{A} = \{A_1, \cdots, A_{|\mathcal{A}|}\}$ and $\mathcal{V} = \{V_1, \cdots, V_{|\mathcal{V}|}\}$, we estimate their MI as

$$\mathrm{MI}(\mathcal{A}, \mathcal{V}) = \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{V}|} \frac{|A_i \cap V_j|}{|X|} \log \frac{|X|\,|A_i \cap V_j|}{|A_i|\,|V_j|}, \quad (2)$$

which can be viewed as an unnormalized version of the Adjusted Mutual Information [71]. This formulation estimates MI in a discrete (vector-quantized) space induced by clustering, and thus the quality of clustering affects the quality of the estimator. A straightforward approach to obtaining $\mathcal{A}$ and $\mathcal{V}$ is to cluster videos using the output from the penultimate layers of the pretrained networks. However, this can introduce distributional bias specific to the datasets on which the networks are pretrained [73].

To address this issue, we cluster samples over each output space induced by different layers of the networks. This allows the MI estimator to consider a wide range of abstract concepts, from low-level (such as textures) to high-level (such as object parts) [6]. Specifically, we use the feature spaces induced by the five convolutional blocks from each of the SlowFast and VGGish feature extractors. We then compute the average MI between all pairs of clusterings as our MI estimator. Let $\mathcal{C}_X^{V(i)} = \{V_1^{(i)}, \cdots, V_{n_i}^{(i)}\}$ and $\mathcal{C}_X^{A(i)} = \{A_1^{(i)}, \cdots, A_{m_i}^{(i)}\}$ denote the clustering results induced by the $i$-th convolutional block of the visual and audio feature extractors, respectively. We compute

$$F(X) = \sum_{(\mathcal{X}, \mathcal{Y}) \in \mathcal{C}_X} \frac{\mathrm{MI}(\mathcal{X}, \mathcal{Y})}{C}, \quad (3)$$

where $\mathcal{C}_X$ denotes the set of 2-combinations of elements from $\{\mathcal{C}_X^{V(i)}\}_{i=1}^{5} \cup \{\mathcal{C}_X^{A(j)}\}_{j=1}^{5}$ and $C$ denotes the number of 2-combinations of the 10 elements, which equals 45. Note that we formulate MI between layers both within and across the extractors of different modalities (referred to as the combination pairing scheme in Section 4.2).
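For concreteness, the sketch below computes the discrete MI of Eq. (2) from two integer cluster-assignment vectors and averages it over all 2-combinations of clusterings as in Eq. (3). The label arrays and their number (five visual-block and five audio-block clusterings) are assumptions mirroring the description above; a library routine such as sklearn's mutual_info_score could likely replace the inner function.

```python
from itertools import combinations
import numpy as np

def discrete_mi(labels_a, labels_v):
    """Eq. (2): MI between two cluster assignments over the same samples."""
    n = len(labels_a)
    mi = 0.0
    for a in np.unique(labels_a):
        for v in np.unique(labels_v):
            n_av = np.sum((labels_a == a) & (labels_v == v))
            if n_av == 0:
                continue
            n_a, n_v = np.sum(labels_a == a), np.sum(labels_v == v)
            mi += (n_av / n) * np.log(n * n_av / (n_a * n_v))
    return mi

def subset_mi_score(clusterings):
    """Eq. (3): average MI over all 2-combinations of the given clusterings
    (e.g., 5 visual-block + 5 audio-block assignments -> 45 pairs)."""
    pairs = list(combinations(clusterings, 2))
    return sum(discrete_mi(x, y) for x, y in pairs) / len(pairs)
```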
Algorithm 1: Batch Greedy Subset Selection

Input: initial dataset D, clustering-based MI estimator F, target subset size M, batch size b, selection size s
Output: X ⊆ D, |X| = M
X_0 ← ∅, i ← 0
while |X_i| < M do
    Randomly sample B ⊆ D \ X_i, |B| = b
    Y_0 ← ∅, j ← 0
    while j < s do
        x ← argmax_{x ∈ B \ Y_j} F(X_i ∪ Y_j ∪ {x})
        Y_{j+1} ← Y_j ∪ {x}, j ← j + 1
        if |X_i ∪ Y_j| = M then break
    end
    X_{i+1} ← X_i ∪ Y_j, i ← i + 1
end
X ← X_i
Return X
Batch Greedy Subset Selection. Since the MI estimator $F(\cdot)$ is a function of $X$, we can formulate an optimization problem where the goal is to find a subset $X$ that maximizes $F(X)$. In general, finding a global solution to problems such as ours is NP-hard and thus greedy heuristic solutions are used instead [52]. They typically select one sample in each iteration and re-evaluate the goodness function, e.g., $F(\cdot)$, on all the remaining candidates. This introduces a challenge to our setting because the time complexity is quadratic in the size of the population; this is clearly not scalable to 300 million instances.

Therefore, we approximate the typical greedy solution using the batch greedy algorithm [13], as shown in Algorithm 1. It randomly samples a batch $B$ from the remaining pool of candidates, and searches for the next element to be included in the solution only within $B$. This batch trick reduces the time complexity down to linear, i.e., $O(N \times |B|)$, where $N$ is the size of the input dataset. We demonstrate the efficacy of the algorithm in Section 4.

Stochastic Clustering. One missing piece in this pipeline is an efficient clustering algorithm scalable to hundreds of millions of instances. The most popular choice among various clustering methods is K-means clustering [79], which is a special case of mixture density estimation for isotropic normal and other densities. Typically, an expectation-maximization (EM) algorithm, such as Lloyd's [55], is used to find the cluster centers. Such algorithms require repeated computation of the distances of all samples from all $k$ cluster centers, followed by cluster assignment, until convergence. Lloyd's algorithm updates cluster centers only after each pass through the entire dataset. But for very large datasets (like ours), a small subset usually contains enough information to obtain good estimates of the cluster centers, meaning that EM-style algorithms tend to take (perhaps too) many epochs to converge.

There are different strategies for addressing this issue, including random sampling and subsetting, but a straightforward approach is to replace the EM algorithm with SGD [43, 7, 63]. In such an approach, for large datasets, the convergence rate and final accuracy of the cluster centers are determined not by the total dataset size, but by the learning rate schedule. A straightforward SGD update rule is to compute the nearest cluster center for each sample in a batch and then update the cluster centers using a convex combination of the cluster centers and their nearest samples, weighting the samples with the learning rate $\lambda$ and the cluster centers with $(1 - \lambda)$. However, mixture density estimators in general suffer from the problem that adding mixture components with zero probability does not change the mixture density; in practice, this means EM- and SGD-based algorithms may end up with cluster centers that stop receiving updates at some point during the optimization. We address this problem by estimating the mixture component utilization rate as the ratio of the total number of updates to the cluster center divided by the total number of estimation steps, and reinitializing cluster centers when that probability falls below $1/k$. In Section 4.2, we demonstrate that our mini-batch SGD update shows comparable accuracy to the batch update in correspondence retrieval tasks.
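The following is a minimal numpy sketch of the mini-batch SGD K-means update with utilization-based reinitialization described above. The learning rate, the exact reinitialization threshold factor, and the sharded execution needed at the 300M-clip scale are simplifications; the class name and parameter defaults are hypothetical.

```python
import numpy as np

class SGDKMeans:
    """Mini-batch SGD K-means that reinitializes under-used cluster centers."""

    def __init__(self, n_clusters, dim, lr=0.05, reinit_factor=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.centers = self.rng.normal(size=(n_clusters, dim)).astype(np.float32)
        self.lr = lr
        # Reinitialize a center when its utilization rate falls below
        # reinit_factor / n_clusters; the paper ties the threshold to 1/k,
        # the extra factor here is an assumption to avoid constant resets.
        self.reinit_thresh = reinit_factor / n_clusters
        self.updates = np.zeros(n_clusters)  # samples assigned to each center so far
        self.steps = 0                       # total samples seen

    def partial_fit(self, batch):
        # Assign each sample in the mini-batch to its nearest center.
        dists = ((batch[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in np.unique(assign):
            mean_k = batch[assign == k].mean(axis=0)
            # Convex combination of the old center and the batch mean (SGD step).
            self.centers[k] = (1 - self.lr) * self.centers[k] + self.lr * mean_k
            self.updates[k] += (assign == k).sum()
        self.steps += len(batch)
        # Reinitialize centers whose utilization rate dropped below the threshold.
        dead = (self.updates / max(self.steps, 1)) < self.reinit_thresh
        if dead.any():
            repl = self.rng.choice(len(batch), size=int(dead.sum()))
            self.centers[dead] = batch[repl]
            self.updates[dead] = self.steps / len(self.centers)  # grace period
        return assign
```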
4. Evaluation on Correspondence Retrieval
Before applying our pipeline to a large-scale set of videos, we systematically evaluate its different components on small-scale datasets. To this end, we created synthetic correspondence-retrieval tasks, using the benchmark datasets CIFAR-10 [36], MNIST [39] and FSDD [29], by generating corresponding and non-corresponding pairs. In each correspondence retrieval task, the aim is to discover the known corresponding samples among the non-corresponding pairs. To show the generality of the findings, we also experiment with Kinetics-Sounds [4], which exhibits real-world audio-visual correspondence.
Datasets
We construct five datasets where each instance is a pair of samples with different correspondence types.

1)–2) CIFAR10-Rotation and CIFAR10-Flip. We use images from five randomly selected categories to construct a "positive pair" set, and use the rest for a "negative pair" set. For the positive set, we create pairs of images by sampling two different images from the same category (e.g., two images of a bird), and apply a geometric transformation to one of them; we apply either a 90° CCW rotation (CIFAR10-Rotation) or a horizontal flip (CIFAR10-Flip). The negative set follows the same process but each pair contains images from different categories. We categorize this type of correspondence as "Natural Class Correspondence" because pairings are made over natural semantic categories.

3)–4) MNIST-CIFAR10 and MNIST-FSDD. We use images from five digit categories to construct a positive set and use the rest for a negative set. Different from above, correspondence is defined via an arbitrary class-level mapping, e.g., "digit 0" images map to the "car" images in CIFAR-10 or "digit 0" audio samples in FSDD. We take samples from the same categories to construct the positive set and samples from different categories for the negative set. We call these "Arbitrary Class Correspondence" to differentiate them from the above.
5) Kinetics-Sounds. Unlike the above datasets where the correspondence is defined over class categories, here the correspondence is defined at the sample level, i.e., a positive set contains pairs of audio and visual channels of the same video, and a negative set contains randomly permuted pairs. We do not utilize class labels to construct the dataset.
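As an illustration, the sketch below builds CIFAR10-Rotation-style positive and negative pairs from an arbitrary labeled image array. The choice of five positive categories and the 90° CCW rotation follow the description above, while the function and array names are placeholders, not part of the released code.

```python
import numpy as np

def build_rotation_pairs(images, labels, n_pos_classes=5, n_pairs=1000, seed=0):
    """Create (positive, negative) pair sets for the correspondence-retrieval task.

    Positive pairs: two different images from the same category, one rotated 90° CCW.
    Negative pairs: same construction, but the two images come from different classes
    drawn from the remaining categories.
    """
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(labels))
    pos_classes, neg_classes = classes[:n_pos_classes], classes[n_pos_classes:]

    def sample_pair(class_pool, same_class):
        c1 = rng.choice(class_pool)
        if same_class:
            i, j = rng.choice(np.flatnonzero(labels == c1), size=2, replace=False)
        else:
            c2 = rng.choice(class_pool[class_pool != c1])
            i = rng.choice(np.flatnonzero(labels == c1))
            j = rng.choice(np.flatnonzero(labels == c2))
        # Rotate the second image 90 degrees counter-clockwise (H x W x C layout assumed).
        return images[i], np.rot90(images[j], k=1, axes=(0, 1))

    positives = [sample_pair(pos_classes, True) for _ in range(n_pairs)]
    negatives = [sample_pair(neg_classes, False) for _ in range(n_pairs)]
    return positives, negatives
```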
Methods
We compare our pipeline (both contrastive-based and clustering-based) to three ranking-based approaches. All the methods use the same precomputed features. For images, we use ResNet-50 [24] pretrained on ImageNet [14]. For videos, we use SlowFast [16] pretrained on Kinetics-400 [33] and VGGish [25] pretrained on YouTube-8M [1] for visual and audio features, respectively. For the ranking baselines, we apply PCA [57] to reduce the feature dimensionality to 64 and rank the instances based on three similarity metrics: inner product, cosine similarity, and (negative) $\ell_2$ distance. Because all our datasets have an equal number of positive and negative instances, we simply select the top 50% of instances as the retrieval result.

Protocol
We split each dataset into train and test partitions of the same size. We conduct a total of five runs for each of the five datasets and report results on the test splits. We use the train sets only for the contrastive estimator to train the projection heads. When constructing each dataset, we sample at most $n = 1000$ instances from each category of the source datasets. For the noise contrastive estimator, we train the linear projection heads for 100 epochs using the AMSGrad variant of the Adam optimizer [61] with a learning rate of 2e-4. We randomly take one sample from each class to build a mini-batch for the class-level correspondence datasets, and sample $N_b = 10$ random clips to build a mini-batch for the sample-level correspondence dataset. When applying our clustering-based method, we perform the SGD K-means clustering with the "ground-truth" number of centroids as the number of classes in each source dataset; we use the batch greedy algorithm with a batch size $b = 100$ and a selection size $s = 25$.

Table 1. Correspondence retrieval precision of the ranking baselines (inner product, cosine similarity, $\ell_2$ distance) and the two variants of our pipeline (contrastive and clustering) on CIFAR10-Rotation, CIFAR10-Flip, MNIST-CIFAR10, MNIST-FSDD, and Kinetics-Sounds.

Table 2. Single-layer vs. multi-layer clustering (Diagonal, Bipartite, Combination) for MI estimation (precision).

Table 1 shows the experimental results. The two variants of our approach (contrastive and clustering) achieve overall higher precision rates than the ranking baselines. The contrastive approach performs particularly well on the two datasets with the "natural class correspondence," conforming to the previous results of [12] showing that contrastive learning is robust to geometric transformations. The clustering approach excels on Kinetics-Sounds, which contains natural audio-visual correspondence. Next, we conduct various ablation studies on Kinetics-Sounds to validate different components of our clustering-based approach.
Multi-Layer Clustering. All the feature extractors that we use consist of five convolutional blocks. As discussed in Section 3.3.2, we cluster samples over each of the five output spaces to capture a wide range of abstract concepts. This raises a question: how should we combine audio-visual clusters for MI estimation? Table 2 compares the single-layer approaches to the multi-layer approaches.
Figure 2. Greedy algorithm vs. batch greedy algorithm with varying selection-to-batch-size ratios, s/b (x axis: iterations, y axis: precision). The shaded regions show 99% confidence intervals obtained by five runs on Kinetics-Sounds. The batch greedy algorithm is robust when the ratio is ≤ 0.25.

Each single-layer approach estimates the audio-visual MI based on a single pair of clustering results. We can see that the precision increases as we use clustering results from higher layers. However, all single-layer methods perform significantly worse than the multi-layer variants. Next, we explore various options for selecting pairs of clusterings for MI estimation. We test three variants:
Diagonal computes an average MI across all five single-layer scores,
Bipartite computes an average MI between all possible combinations of audio-visual clustering results, and
Combination (ours) computes an average MI between all possible combinations of clustering results, regardless of modalities. With $L$ layers per modality, Diagonal computes MI $L$ times, Bipartite computes MI $L^2$ times, and Combination computes MI $\binom{2L}{2}$ times. We observe that the performance increases with the number of connections, as shown in the bottom rows of Table 2. This positive relationship indicates that the consensus between layers from the same extractor, as well as that across extractors, contributes to the clarity of the correspondence signal.

Figure 3. Sensitivity analysis on the number of centroids (x axis: iterations, y axis: precision). We determine under/over-clustering based on the ground-truth number of class categories in Kinetics-Sounds (c = 32). The shaded regions show 99% confidence intervals over five runs.

Mini-Batch SGD K-means Clustering. We compare mini-batch SGD K-means to the standard EM (Lloyd's) K-means [55]. Table 3 shows that both methods perform similarly on Kinetics-Sounds, which supports our use of mini-batch SGD K-means for large-scale clustering.

Batch Greedy Subset Selection. We explore how the use of mini-batches in subset selection affects the quality of the selected subsets. We compare the greedy algorithm and the batch greedy algorithm with a batch size b = 160 and varying selection sizes s = {5, 10, 20, 40, 80}. As shown in Figure 2, the performance gap between the greedy algorithm and the batch greedy algorithm is only marginal (greedy: 98.970 vs. batch greedy with b = 160: 98.020), which validates our use of the batch greedy algorithm. While the batch size itself does not have a large impact on the subset quality, the ratio of selection size to batch size (s/b) highly affects the retrieval performance. We can see the performance dropping sharply as the ratio exceeds 0.25 in several (b, s) configurations; we provide details in the supplementary material.

Number of Centroids. We vary the number of centroids k ∈ {8, 16, 32, 64, 128} to see how sensitive our approach is to this parameter. We apply the batch greedy algorithm with a batch size b = 100 and a selection size s = 25 on Kinetics-Sounds. Figure 3 shows that, although the final performance is similar across different numbers of centroids, they show different trends: underclustering (k ∈ {8, 16}) shows high precision in early iterations, while overclustering (k ∈ {64, 128}) shows a slower drop in the later stage.
5. Large-Scale Evaluation
We construct datasets at varying scales (20K, 200K, 2M) and compare them to existing audio-visual datasets: Kinetics-Sounds [4] (20K), VGG-Sound [11] (200K), and AudioSet [19] (2M). Note that all three datasets involve either human annotation [4, 19] or manual verification [11]. Since our pipeline can generate datasets of size up to the population (300M in our case), we also generate a version with 10M videos and evaluate its performance.

For the contrastive-based variant of our approach, we train the linear projection heads with a batch size of 1024 on a randomly drawn set of 100M videos. We train the model for three epochs and rank the entire video set (300M) based on the cosine similarity [12]. We then take the top N ∈ {20K, 200K, 2M} ranked videos for the final dataset. For the clustering-based variant, we vary the number of clusters C ∈ {100, 200, 500, 1000, 2000} for each size of the datasets.

5.1. Linear Evaluation on Downstream Tasks

To assess the quality of the datasets, we pretrain audio-visual models using a contrastive learning objective [12] and evaluate them on downstream tasks. We follow the linear evaluation protocol [12] by adding a linear classifier on top of the pretrained feature extractors and training it from scratch while fixing the pretrained weights. We use 3D ResNet-50 [9] and ResNet-50 [24] for the visual and audio networks, respectively. We provide details of these experimental settings in the supplementary material. We test on three downstream tasks: visual action recognition on UCF101 [66], environmental sound classification on ESC-50 [58], and audio-visual action recognition on Kinetics-Sounds [4] (we concatenate audio-visual features for the linear classifier). We report mean accuracy across the official splits of UCF101 and ESC-50.

Figure 4 shows that models pretrained on our dataset achieve similar, or even slightly better, performances compared to the baseline datasets. Considering that our data curation process does not involve human intervention (i.e., no manual annotation or verification), this is a promising result showing the potential of our approach for large-scale self-supervised learning. The significant gap between ours (both contrastive and clustering) and a random set shows that the improvement does not come from the initial pool we crawl (the 300M set). The significant performance boost from the 10M model reaffirms the importance of large-scale training. We will release the list of 10M videos to the community.
5.2. Human Evaluation

We conduct a user study to assess the perceived presence/absence of audio-visual correspondence in video clips. We compare clips from four datasets: AudioSet [19], VGG-Sound [11], ours with clustering (2M scale, 1K clusters), and random (drawn from the 300M set). We prepare 100 randomly sampled clips from each of these datasets, for a total of 400 clips. We recruit 12 participants, present each with 100 clips (25 clips per dataset), and ask them whether the audio and visual channels correspond or not. This provides us with 3 votes per video (we provide the details of the questionnaire in the supplementary material).
Figure 4. Linear evaluation on downstream tasks. The top-1/5 accuracy (%) of video classification on UCF101 [66], audio classification on ESC-50 [58], and audio-visual classification on Kinetics-Sounds (KS) [4]. We group the results by the downstream tasks and by the scale of the pretrain datasets. Baselines are Kinetics-Sounds [4] (20K), VGG-Sound [11] (200K), and AudioSet [19] (2M).

Table 4. Human evaluation results assessing the perceived audio-visual correspondence in videos from different datasets.

Dataset      Majority Vote (%)   Fleiss' Kappa
AudioSet     65.66               0.4385
VGG-Sound    84.00               0.4634
Ours (2M)    69.00               0.5110
Random       44.00               0.6112
Table 4 shows the majority voting accuracy and inter-rater agreement (measured by Fleiss' Kappa [17]). Every dataset has a Fleiss' Kappa greater than 0.4, verifying the reliability of the accuracy statistics [38]. Ours significantly improves audio-visual correspondence over a random subset (69% vs. 44%), and is even rated slightly higher than AudioSet. The annotation process for AudioSet focused on audio events, so we suspect that several of its videos do not contain visible sound sources. There is still a significant gap between ours and VGG-Sound; we note that our process finds audio-visual correspondence without relying on manual verification as was done in VGG-Sound.
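For reference, Fleiss' Kappa for this yes/no rating setup (3 raters per clip) can be computed as in the sketch below; the vote table in the example is hypothetical.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for an (n_items x n_categories) table of rating counts,
    with the same number of raters per item (here: 3 raters, Yes/No)."""
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / (n_items * n_raters)          # category proportions
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical table: each row is [#Yes, #No] votes for one clip (3 raters).
votes = np.array([[3, 0], [2, 1], [3, 0], [0, 3], [1, 2]])
print(round(fleiss_kappa(votes), 4))  # -> about 0.44 for this toy table
```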
5.3. Diversity of the Sampled Clips

Although our approach achieves promising performance in downstream tasks, one potential drawback is that we do not explicitly optimize the diversity among the sampled clips (c.f., DPP [37]). While we believe the diversity comes largely from the very first step of our pipeline, i.e., querying YouTube for videos without relying on a carefully curated set of keywords, it would be problematic if our approach selected videos from a highly concentrated region of the audio/visual manifolds. Figure 5 suggests this may not be the case: while we see some concentration effects towards a few clusters, the histograms follow reasonably similar shapes between ours (2M scale, 100 clusters) and a random subset (2M scale), especially in the visual feature space. In the supplementary material, we provide an in-depth analysis of the discovered concept categories and provide evidence showing the diversity of concepts in the generated subset.
Figure 5. Histograms of cluster IDs in the sampled datasets (visual and audio feature spaces; Random vs. Ours). A uniform-shape histogram indicates that samples are drawn uniformly from the 100 clusters (and thus is supposedly diverse).
6. Conclusion
This work complements the existing line of research on self-supervised representation learning with three main contributions: i) proposing an automatic and scalable data collection pipeline for audio-visual representation learning, ii) demonstrating that MI-based subset selection can retrieve correspondence in both artificial and practical settings, and iii) releasing a large-scale open-domain video dataset consisting of 10M clips curated with our pipeline.
A. On the Diversity of Concepts in Sampled Clips (Section 5.3)
A.1. Histogram of Cluster IDs
To analyze the diversity of concepts contained in our curated dataset, we examine the histograms of cluster IDs from the chosen videos. Figure 6 shows audio and visual histograms obtained from either our curated subsets or randomly sampled subsets at varying scales (20K, 200K, and 2M).
Figure 6. Histograms of cluster IDs from our curated subsets and randomly sampled subsets (with 100 cluster centroids), at subset sizes 20K, 200K, and 2M. The blue histograms represent the case where samples are drawn uniformly at random and thus give an unbiased representation of the concepts naturally appearing in the entire population.

To obtain these, we cluster the features from the last layer of the audio and visual feature extractors, respectively, and plot the histograms of cluster IDs. For the purpose of visualization we sort the cluster indices by cluster size in decreasing order (and thus the cluster IDs do not match between "Random" and "Ours" in each of the plots). The histograms from random subsets represent the natural distribution of the entire video population.

In the visual domain, the curated datasets (green histograms) mostly follow the original cluster distributions (reflected in the blue histogram in each subplot). This indicates that the visual concept distribution largely follows the natural distribution in the entire population, suggesting that our subset contains visual concepts that are as diverse as the entire set.

On the other hand, the audio clusters show a noticeable concentration in distribution after subset selection. Upon close inspection of videos from the largest audio clusters, we observe that our curated datasets tend to choose videos from clusters with high audio-visual correspondence (e.g., videos of a single person speaking with no other sound in the background) while random sampling tends to choose videos from clusters with no apparent audio-visual correspondence (e.g., videos of multiple people talking with background music/noise). This shows that the concentration in the audio histograms is caused by filtering out videos of low audio-visual correspondence, which is a highly desirable artifact in the curated subset.
A.2. Qualitative Analysis of Audio-Visual Clustering Results
To further investigate the diversity of concepts appearing in our subsets, we manually inspect the audio and visual clustering results in the 2M dataset and compare the concepts appearing in the largest clusters to those in the smallest ones. Figure 7 and Figure 8 show representative videos from the five largest and five smallest clusters obtained from the audio and visual clustering results, respectively. Figure 7 (from audio clusters) suggests that our curated dataset contains diverse concepts including general sound categories (e.g., voice and object sounds) as well as specific topics (e.g., outdoor interviews and cooking). Similarly, Figure 8 (from visual clusters) also suggests that our dataset contains diverse concepts including both natural ones (e.g., animals and fire) and human-centered ones (e.g., makeup and playing guitar). Clips from larger clusters (depicted in the left column of Figure 7 and Figure 8) contain clear and isolated sound sources, while sounds of smaller clusters (the right column) are less distinguishable due to multiple sound sources or background noise. Our dataset also captures several audio-visual concepts that existing datasets (such as VGG-Sound [11] and AudioSet [19]) do not offer. For instance, in Figure 7, the 77th cluster contains videos recorded from a front-facing camera with voice recordings from a phone mic, and the 46th cluster contains videos of comedians performing exaggerated body actions with the sound of a crowd (cheering and laughter). The 88th cluster in Figure 8 contains shoe unboxing videos.
B. Details of Linear Evaluation on Downstream Tasks (Section 5.1)
B.1. Experimental Settings
We pretrain audio-visual models in a contrastive manner [12] on the different datasets. Specifically, we attach MLP projection heads on top of the audio and visual feature extractors, respectively, and train the whole model end-to-end using the noise-contrastive loss (see Eqn. 1 of the main paper). As the visual and audio backbone feature extractors, we use 3D ResNet-50 [9] and ResNet-50 [24], respectively. Each MLP projection head is composed of two fully-connected layers with ReLU [50] activation, and produces embeddings of dimension 128. We pretrain the model for 50 epochs with a batch size of 64, except for the 10M scale dataset where we use a batch size of 300. We use the AMSGrad variant [61] of the AdamW [42] optimizer with a learning rate of 1e-3 and an L2 weight decay of 1e-5. We apply learning rate warm-up for the first 20,000 iterations followed by a linear decay of the learning rate.
Figure 7. Representative samples and concepts derived from a manual inspection of 100 audio clusters of the 2M subset. We show samples from the five largest clusters on the left column and those from the five smallest clusters on the right. Each cluster captures distinctive audio-visual concepts, indicating that our curated subset contains various concepts with high audio-visual correspondence. The five largest audio clusters: Cluster 91 (3.9%; audio: female voice; visual: woman speaking), Cluster 89 (3.0%; audio: commentaries, crowd cheering; visual: sports), Cluster 67 (2.3%; audio: singing, crowd cheering; visual: concert), Cluster 51 (1.8%; audio: object sounds; visual: handling objects), Cluster 77 (1.7%; audio: phone mic recordings; visual: front camera selfies). The five smallest audio clusters: Cluster 44 (0.7%; audio: metallic sounds; visual: machine parts, tools), Cluster 37 (0.4%; audio: voice, background noise; visual: outdoor interview), Cluster 46 (0.2%; audio: laughing, speech; visual: comedy), Cluster 33 (0.2%; audio: engine sound; visual: car), Cluster 76 (0.1%; audio: sizzling, boiling, stirring; visual: cooking).
Figure 8. Representative samples and concepts derived from a manual inspection of 100 visual clusters of the 2M subset. We show samples from the five largest clusters on the left column and those from the five smallest clusters on the right. Each cluster captures distinctive audio-visual concepts, indicating that our curated subset contains various concepts with high audio-visual correspondence. The five largest visual clusters: Cluster 83 (3.6%; audio: female voice; visual: woman speaking), Cluster 42 (3.6%; audio: clear voice; visual: news), Cluster 33 (2.8%; audio: brushing, voice; visual: makeup), Cluster 23 (2.2%; audio: ambient sounds, animal sounds; visual: nature), Cluster 35 (1.9%; audio: male voice; visual: indoor interview). The five smallest visual clusters: Cluster 0 (0.4%; audio: guitar sounds, singing; visual: playing guitar), Cluster 2 (0.2%; audio: punch, crowd noise; visual: martial arts), Cluster 76 (0.2%; audio: hitting balls, crowd noise; visual: baseball), Cluster 88 (0.1%; audio: object sounds; visual: shoes unboxing), Cluster 9 (0.1%; audio: burning sound; visual: fire).
Figure 9. Linear evaluation of representations pretrained on the datasets constructed by our clustering-based approach. We report the top-1 accuracy (%) on UCF101 [66], ESC-50 [58], and Kinetics-Sounds [4], grouped by the number of cluster centroids (100, 200, 500, 1000, 2000) and the subset size (20K, 200K, 2M). The shaded regions show 99% confidence intervals obtained by runs over the official splits of UCF101 (3 splits) and ESC-50 (5 splits).
For linear evaluation on downstream tasks, we attach a linear classifier on top of the pretrained feature extractors and train it from scratch while fixing the parameters of the feature extractors. We use only the visual CNN for action recognition on UCF101 [66] and only the audio CNN for sound classification on ESC-50 [58]. For audio-visual action recognition on Kinetics-Sounds [4], we concatenate the audio-visual features before feeding them as input to the linear classifier. We apply dropout [26] with a 50% rate before the linear classifier. We train the model for 30 epochs with a batch size of 1024 on ESC-50 [58], for 10 epochs with a batch size of 64 on UCF101 [66], and for 5 epochs with a batch size of 64 on Kinetics-Sounds. We use the Adam [34] optimizer with a learning rate of 1e-3 and an L2 weight decay of 1e-5.
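A minimal PyTorch sketch of this linear-probe setup is given below: a single linear layer (preceded by dropout) is trained on top of frozen pretrained features. The encoder, data loader, and dimensionalities are placeholders, and the helper name is hypothetical.

```python
import torch
import torch.nn as nn

def train_linear_probe(frozen_encoder, loader, feat_dim, n_classes,
                       epochs=5, lr=1e-3, weight_decay=1e-5, device="cuda"):
    """Train only a linear classifier on top of frozen pretrained features."""
    frozen_encoder.eval().to(device)
    for p in frozen_encoder.parameters():
        p.requires_grad_(False)

    probe = nn.Sequential(nn.Dropout(0.5), nn.Linear(feat_dim, n_classes)).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=lr, weight_decay=weight_decay)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in loader:
            with torch.no_grad():                 # the backbone stays frozen
                feats = frozen_encoder(clips.to(device))
            loss = ce(probe(feats), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```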
B.2. Impact of the Number of Centroids

To visualize the impact of the number of clusters in our clustering-based approach, we group the results by the number of clusters as shown in Figure 9. Notice that the number of clusters is not positively correlated with downstream task performance. Instead, clustering with about 500 clusters seems to yield the best performance. Also, experiments using the largest number of centroids (C = 2000) show consistently low accuracy across all datasets and subset sizes. This confirms our findings in Section 4.2 of the main paper: over-clustering tends to have a negative impact on the quality of the selected subset.

We believe that this happens because, as the number of clusters increases, samples with homogeneous concepts in large clusters are scattered into small clusters sharing similar concepts. When we do not have many references to compare, as in the early stage of subset selection, this fragmentation effect inhibits sample count sharing between conceptually similar small clusters, complicating the clustering-based MI estimation.

C. More Discussion on Subset Selection (Section 3.3.2)
C.1. Greedy Algorithm
We provide the details of the greedy algorithm [52] that is approximated by the batch greedy algorithm [13]. As shown in Algorithm 2, the greedy algorithm needs to re-evaluate the clustering-based MI estimator F on all the remaining candidates in each iteration. Thus, the time complexity is $O(N^2)$, where $N$ is the size of the initial dataset $D$. The batch greedy algorithm, on the other hand, approximates this by selecting the next element to be included in the solution within only a randomly chosen batch, not the entire set of candidates. This is shown in Algorithm 3 below (same as Algorithm 1 of the main paper; reproduced here for easy comparison).
Algorithm 2: Greedy Algorithm

Input: initial dataset D, clustering-based MI estimator F, target subset size M
Output: X ⊆ D, |X| = M
X_0 ← ∅
for i = 0 to M − 1 do
    x ← argmax_{x ∈ D \ X_i} F(X_i ∪ {x})
    X_{i+1} ← X_i ∪ {x}
end
X ← X_M
Return X

C.2. Batch Greedy Subset Selection
When using the batch greedy algorithm for subset selection, the batch size b and the selection size s affect the quality of the selected subsets. We explore various (b, s) configurations on Kinetics-Sounds [4], as shown in Figure 10. Note that the performance gap between different batch sizes is small.

Figure 10. Precision of the batch greedy algorithm with varying ratios of selection size to batch size, s/b (x axis: iterations, y axis: precision). We group the plots by the batch size, from the smallest (b = 40) to the largest, from left to right. The shaded regions show 99% confidence intervals obtained by five runs on Kinetics-Sounds. The batch greedy algorithm is robust when the ratio is ≤ 0.25.
Algorithm 3: Batch Greedy Algorithm (reproduced from the main paper for easy comparison)

Input: initial dataset D, clustering-based MI estimator F, target subset size M, batch size b, selection size s
Output: X ⊆ D, |X| = M
X_0 ← ∅, i ← 0
while |X_i| < M do
    Randomly sample B ⊆ D \ X_i, |B| = b
    Y_0 ← ∅, j ← 0
    while j < s do
        x ← argmax_{x ∈ B \ Y_j} F(X_i ∪ Y_j ∪ {x})
        Y_{j+1} ← Y_j ∪ {x}, j ← j + 1
        if |X_i ∪ Y_j| = M then break
    end
    X_{i+1} ← X_i ∪ Y_j, i ← i + 1
end
X ← X_i
Return X

The precisions 93.9%, 94.3% and 94.6% are obtained with the three batch sizes (from b = 40 upward) under the same ratio of selection size to batch size, s/b = 12.5%. On the contrary, the value of s/b highly affects the retrieval performance across all the batch sizes examined; the performance drops sharply as the ratio exceeds 25% regardless of the batch size. When using a small ratio s/b, the batch greedy algorithm finds local optima of the original problem among the batch elements. However, if the ratio becomes too large, the problem reduces to one where the batch elements are the entire candidate set, which leads to poor performance.
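A runnable Python rendering of Algorithm 3 is given below, assuming a set function score_fn that scores a candidate subset (e.g., the clustering-based estimator F of Eq. (3) in the main paper); indices stand in for actual clips, and the function name is hypothetical.

```python
import random

def batch_greedy_select(all_indices, score_fn, target_size, batch_size, select_size):
    """Batch greedy subset selection (Algorithm 3).

    score_fn(subset) returns the MI-based score of a candidate subset.
    Each round draws a random batch from the remaining pool and greedily
    adds up to select_size elements from that batch only.
    """
    selected, remaining = [], set(all_indices)
    while len(selected) < target_size and remaining:
        batch = random.sample(sorted(remaining), min(batch_size, len(remaining)))
        picked = []
        for _ in range(min(select_size, target_size - len(selected))):
            candidates = [x for x in batch if x not in picked]
            if not candidates:
                break
            best = max(candidates, key=lambda x: score_fn(selected + picked + [x]))
            picked.append(best)
        selected.extend(picked)
        remaining.difference_update(picked)
    return selected
```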
D. Details of Automatic Dataset Curation

Here, we describe the details of subset selection via (i) NCE-based MI estimation and (ii) clustering-based MI estimation. To construct datasets, we vary the scale over 20K, 200K and 2M. Based on the results at the three scales, we also generate a version with 10M videos using the clustering-based approach.
D.1. NCE-Based MI Estimation
We use linear projection heads that transform audio and visual features into 128-dimensional embeddings. We randomly sample a subset of 100M clips from the initial 300M set that we crawl, and train on the subset for three epochs with a batch size of N_b = 1,024. We use the AMSGrad variant of the Adam optimizer [61] with a learning rate of 2e-4. We apply learning rate warm-up followed by a linear decay of the learning rate.
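A sketch of the symmetric contrastive objective of Eq. (1) applied to these linear projection heads is shown below. The dimensionalities and the temperature value are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionNCE(nn.Module):
    """Linear projection heads trained with the symmetric NCE loss of Eq. (1)."""

    def __init__(self, visual_dim, audio_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, embed_dim)
        self.proj_a = nn.Linear(audio_dim, embed_dim)
        self.tau = temperature  # placeholder value

    def forward(self, visual_feats, audio_feats):
        zv = F.normalize(self.proj_v(visual_feats), dim=-1)
        za = F.normalize(self.proj_a(audio_feats), dim=-1)
        logits = zv @ za.t() / self.tau          # cosine similarities / tau
        targets = torch.arange(len(zv), device=zv.device)
        # Symmetric loss: video-to-audio and audio-to-video.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# At selection time, the diagonal similarities S(z_v, z_a) rank the clips.
```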
D.2. Clustering-Based MI Estimation

For SGD K-means clustering, we train the cluster centroids with a mini-batch of size 100K for 100 epochs. For subset selection, we use the batch greedy algorithm with a batch size b = 10,000 and a selection size s = 500 (a ratio of s/b = 0.05), and vary the number of clusters C ∈ {100, 200, 500, 1000, 2000} for each size of the datasets, except the dataset of 10M scale (we generate that dataset only with C = 500 for computational reasons).

E. Human Evaluation Interface (Section 5.2)
Figure 11 shows the user interface we developed for human evaluation. We provide guidelines on how to assess audio-visual correspondence:
You will watch a video clip for 10 seconds. Please determine whether there is audio-visual correspondence in the video. In other words, decide whether the sound source is visible or can be inferred from visual context.
After a pilot study we gathered feedback from experts and added additional guidelines to help disambiguate common edge scenarios (shown in Figure 11):
Guidelines for the edge cases. Please mark the below cases with YES: Correspondence in Artificial Scenes (e.g., gunshot sound in FPS games); Mixed Sounds (e.g., music with loud background noise); Immobile Sound Source (e.g., engine sound from an idle car). Please mark the below case with NO: Absent Sound Source (e.g., music from an instrument out of the scene).
Annotators are given one 10-second clip at a time and asked to provide a Yes/No answer judging whether or not there is audio-visual correspondence in the given clip. We do not provide a replay interface, in order to collect intuitive responses from the raters.

Figure 11. Screenshots of the human evaluation interface. The introduction page (top) provides instructions to the annotators, and the test page (bottom) shows clips to the raters and receives the corresponding Yes/No responses.

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric Approximation via Coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint arXiv:1609.08675, 2016. 3, 5
[2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric Approximation via Coresets. Combinatorial and Computational Geometry, 52:1–30, 2005. 3
[3] Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. arXiv preprint arXiv:1911.12667, 2019. 2
[4] Relja Arandjelovic and Andrew Zisserman. Look, Listen and Learn. In ICCV, 2017. 2, 5, 7, 8, 12
[5] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning Sound Representations from Unlabeled Video. In NeurIPS, 2016. 2
[6] David Bau, Bolei Zhou, Aude Oliva, and Antonio Torralba. Interpreting Deep Visual Representations via Network Dissection. PAMI, 41(9):2131–2145, 2019. 4
[7] Léon Bottou and Yoshua Bengio. Convergence Properties of the K-Means Algorithms. In NeurIPS, 1995. 2, 5, 6
[8] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A Short Note on the Kinetics-700 Human Action Dataset. arXiv preprint arXiv:1907.06987, 2019. 1, 2
[9] João Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017. 7, 9
[10] Sourish Chaudhuri, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, Nathan Reale, Loretta Guarino Reid, Kevin Wilson, and Zhonghua Xi. AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies. In Interspeech, 2018. 2
[11] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A Large-Scale Audio-Visual Dataset. In ICASSP, 2020. 2, 3, 7, 8, 9
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, 2020. 2, 4, 6, 7, 9
[13] Yuxin Chen and Andreas Krause. Near-Optimal Batch Mode Active Learning and Adaptive Submodular Optimization. In ICML, 2013. 2, 4, 12
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009. 5
[15] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised Visual Representation Learning by Context Prediction. In ICCV, 2015. 2
[16] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. In ICCV, 2019. 3, 5
[17] Joseph L. Fleiss. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76(5):378, 1971. 8
[18] Ruohan Gao and Kristen Grauman. 2.5D Visual Sound. In CVPR, 2019. 2
[19] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In ICASSP, 2017. 1, 2, 7, 8, 9
[20] Yuhong Guo. Active Instance Sampling via Matrix Partition. In NeurIPS, 2010. 3
[21] Michael Gutmann and Aapo Hyvärinen. Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models. In AISTATS, 2010. 2, 3
[22] Sariel Har-Peled and Soham Mazumdar. On Coresets for K-Means and K-Median Clustering. In STOC, 2004. 3
[23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 2020. 2
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016. 5, 7, 9
[25] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN Architectures for Large-Scale Audio Classification. In ICASSP, 2017. 3, 5
[26] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580, 2012. 12
[27] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR, 2019. 2, 4
[28] Allan Jabri, Andrew Owens, and Alexei Efros. Space-Time Correspondence as a Contrastive Random Walk. In NeurIPS, 2020. 2
[29] Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. Free Spoken Digit Dataset: v1.0.8, Aug. 2018. 5
[30] David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. How Easy Is Local Search? Journal of Computer and System Sciences, 37(1):79–100, 1988. 3
[31] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing Text Classification Models. arXiv preprint arXiv:1612.03651, 2016. 3
[32] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of Tricks for Efficient Text Classification. In EACL, 2017. 3
[33] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950, 2017. 3, 5
[34] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015. 12
[35] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. In NeurIPS, 2018. 2
[36] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009. 5
[37] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. arXiv preprint arXiv:1207.6083, 2012. 8
[38] J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, pages 159–174, 1977. 8
[39] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 5
[40] David D. Lewis and William A. Gale. A Sequential Algorithm for Training Text Classifiers. In SIGIR, 1994. 3
[41] Xin Li and Yuhong Guo. Adaptive Active Learning for Image Classification. In CVPR, 2013. 3
[42] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019. 9
[43] Thomas Martinetz and Klaus Schulten. A "Neural Gas" Network Learns Topologies. In Artificial Neural Networks (ICANN), volume 1, pages 397–402, 1991. 5
[44] Marina Meilă. Comparing Clusterings—An Information Based Distance. Journal of Multivariate Analysis, 98(5), 2007. 2, 4
[45] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019. 2
[46] Alan Miller. Subset Selection in Regression. CRC Press, 2002. 2
[47] Michel Minoux. Accelerated Greedy Algorithms for Maximizing Submodular Set Functions. In Optimization Techniques, pages 234–243. Springer, 1978. 3
[48] Ishan Misra and Laurens van der Maaten. Self-Supervised Learning of Pretext-Invariant Representations. In CVPR, 2020. 2
[49] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, and Carl Vondrick. Moments in Time Dataset: One Million Videos for Event Understanding. PAMI, pages 1–8, 2019. 2
[50] Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010. 9
[51] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions—I. Mathematical Programming, 14(1):265–294, 1978. 3
[52] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions—I. Mathematical Programming, 14(1):265–294, 1978. 4, 12
[53] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018. 2, 4
[54] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually Indicated Sounds. In CVPR, 2016. 2
[55] Stuart P. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 2, 4, 6, 7
[56] Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal Self-Supervision from Generalized Data Transformations. arXiv preprint arXiv:2003.04298, 2020. 2
[57] Karl Pearson. LIII. On Lines and Planes of Closest Fit to Systems of Points in Space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901. 5
[58] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. In ACM-MM, 2015. 7, 8, 12
[59] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On Variational Bounds of Mutual Information. In ICML, 2019. 2
[60] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal Contrastive Video Representation Learning. arXiv preprint arXiv:2008.03800, 2020. 2
[61] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. In ICLR, 2018. 5, 9, 13
[62] Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, and Caroline Pantofaru. AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection. arXiv preprint arXiv:1901.01342, 2019. 2
[63] David Sculley. Web-Scale K-Means Clustering. In WWW, 2010. 5
[64] Burr Settles. Active Learning Literature Survey. Science, 10(3):237–304, 1995. 3
[65] Yusuke Shinohara. A Submodular Optimization Approach to Sentence Set Selection. In ICASSP, 2014. 3
[66] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. arXiv preprint arXiv:1212.0402, 2012. 7, 8, 12
[67] Jamshid Sourati, Murat Akcakaya, Jennifer G. Dy, Todd K. Leen, and Deniz Erdogmus. Classification Active Learning Based on Mutual Information. Entropy, 18(2):51, 2016. 3
[68] Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, and Cordelia Schmid. Learning Video Representations from Textual Web Supervision. arXiv preprint arXiv:2007.14937, 2020. 2
[69] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-Visual Event Localization in Unconstrained Videos. In ECCV, 2018. 2
[70] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core Vector Machines: Fast SVM Training on Very Large Data Sets. Journal of Machine Learning Research, 6(Apr):363–392, 2005. 3
[71] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research, 11, 2010. 2, 4
[72] Janett Walters-Williams and Yan Li. Estimation of Mutual Information: A Survey. In RSKT, 2009. 3
[73] Mei Wang and Weihong Deng. Deep Visual Domain Adaptation: A Survey. Neurocomputing, 312:135–153, 2018. 4
[74] Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning Correspondence from the Cycle-Consistency of Time. In CVPR, 2019. 2
[75] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in Data Subset Selection and Active Learning. In ICML, 2015. 3
[76] Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. Submodular Subset Selection for Large-Scale Speech Training Data. In ICASSP, 2014. 3
[77] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. Using Document Summarization Techniques for Speech Data Subset Selection. In NAACL, 2013. 3
[78] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. Unsupervised Submodular Subset Selection for Speech Data. In ICASSP, 2014. 3
[79] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, S. Yu Philip, et al. Top 10 Algorithms in Data Mining. Knowledge and Information Systems, 14(1):1–37, 2008. 4
[80] Karren Yang, Bryan Russell, and Justin Salamon. Telling Left From Right: Learning Spatial Correspondence of Sight and Sound. In CVPR, 2020. 2