II-20: Intelligent and pragmatic analytic categorization of image collections
Jan Zahálka, Marcel Worring, Senior Member, IEEE, and Jarke J. van Wijk

• Jan Zahálka is with the Czech Technical University in Prague. E-mail: [email protected].
• Marcel Worring is with the University of Amsterdam. E-mail: [email protected].
• Jarke J. van Wijk is with the Eindhoven University of Technology. E-mail: [email protected].
Fig. 1. II-20's novel model fully supports flexible analytic categorization of image collections, closing the pragmatic gap. The user can categorize or discard images, and the system intelligently chooses the level of exploration and search for the categories. Pictured: the novel Tetris interface metaphor.
Abstract — In this paper, we introduce II-20 (Image Insight 2020), a multimedia analytics approach for analytic categorization of image collections. Advanced visualizations for image collections exist, but they need tight integration with a machine model to support the task of analytic categorization. Directly employing computer vision and interactive learning techniques gravitates towards search. Analytic categorization, however, is not machine classification (the difference between the two is called the pragmatic gap): a human adds/redefines/deletes categories of relevance on the fly to build insight, whereas the machine classifier is rigid and non-adaptive. Analytic categorization that truly brings the user to insight requires a flexible machine model that allows dynamic sliding on the exploration-search axis, as well as semantic interactions: a human thinks about image data mostly in semantic terms. II-20 brings three major contributions to multimedia analytics on image collections and towards closing the pragmatic gap. Firstly, a new machine model that closely follows the user's interactions and dynamically models her categories of relevance. II-20's machine model, in addition to matching and exceeding the state of the art's ability to produce relevant suggestions, allows the user to dynamically slide on the exploration-search axis without any additional input from her side. Secondly, the dynamic, 1-image-at-a-time Tetris metaphor that synergizes with the model. It allows a well-trained model to analyze the collection by itself with minimal interaction from the user and complements the classic grid metaphor. Thirdly, the fast-forward interaction, allowing the user to harness the model to quickly expand ("fast-forward") the categories of relevance, expands the multimedia analytics semantic interaction dictionary. Automated experiments show that II-20's machine model outperforms the existing state of the art and also demonstrate the Tetris metaphor's analytic quality. User studies further confirm that II-20 is an intuitive, efficient, and effective multimedia analytics tool.
Index Terms — Multimedia analytics, image data, analytic categorization, pragmatic gap
1 INTRODUCTION
The growing wealth and importance of multimedia data (images, text, videos, audio, and associated metadata) is evident. Processing them meaningfully and efficiently has become crucial for an increasing number of domains, e.g., media and news, forensics, security, marketing, and health. The ubiquity and availability of cameras have made casual content more important than ever. Social networks are a multi-billion-dollar industry and user-contributed content is valuable. Visual data (images and videos) are at the core of the multimedia explosion and there is a great need for advanced analytics of visual data collections.

In recent years, our ability to automatically process large volumes of visual data has improved greatly. The chief reason is the dramatic increase of the semantic quality of machine feature representations, spearheaded by deep neural networks [23]. In many tasks, deep nets approach or surpass human capabilities, e.g., in object recognition (with equal train and test noise levels) [12]. The semantic gap [35] has been closed for many tasks and is rapidly closing for others. The quality and accessibility of advanced classifiers and indexes have entrenched the search engine as the gold standard for analyzing image collections.

However, not all multimedia analytics tasks boil down to just search. In a general analytics task on multimedia data, the user dynamically oscillates between exploration and search on the exploration-search axis [45]. Examples of tasks that are not purely search include:

T1) Structuring the collection — making sense of what is in a collection with unknown contents, and structuring it based on multiple categories of relevance. Examples of this would be a marketing specialist analyzing her company brand's perception on social media, or a quality control manager visually inspecting finished products for flaws.
T2) Finding needles in the haystack — in a collection with only a small portion of relevant items, find them based on complex, often domain- and expertise-dependent semantics. For example, a forensics analyst trying to establish whether there is criminal content on a suspect's seized computer.
T3) Subjective/highly contextual content retrieval — labeling of the content into categories which can only be defined by the user, as they are subjective or contextually defined. In this case, as the notion of relevance cannot be defined beforehand or grounded objectively, the content-based indexes have trouble matching the user's input to their dictionary. An example here is "show me art that I like", which does not match predefined content labels very well.

The need to support varied tasks can be addressed by employing the visual analytics approach, supporting knowledge/insight gain by tight integration of advanced visualizations with a machine model [21, 22, 33]. Image collection analytics belongs to multimedia analytics [6], which has a number of specifics: among others, a strong focus on semantics, high information bandwidth, and difficult summarization.

In general, multimedia analytics tasks can be modeled as analytic categorization, in which the user defines the categories of relevance herself on the fly, and the model adapts to them as the session progresses [45]. This is different from classic machine learning classification, and the difference between the two is the pragmatic gap [45]. Analytic categorization requires support for multiple categories of relevance at once, creating/redefining/deleting categories on the fly during the analytic session, and a strong emphasis on interactivity: the user interactions drive the categorization and vice versa, and they complete in interactive time. New visualizations and models built specifically around tight integration of the two and support of analytic categorization are needed.

There are approaches that incorporate interactive model building to cover a wider range of the exploration-search axis. To advance the analytic session, they usually make use of a rich set of filters on the data (e.g., [4, 19, 20, 26, 39, 40]), an interactive (multimodal) learning model (e.g., [15, 44]), or a combination of both. Whilst these techniques go beyond search, on the exploration-search axis they tend to lean towards search anyway: they simply fetch what the users are looking for or what they found relevant previously. To date, multimedia analytics retains an hourglass interface-model structure: a wide array of visualizations on the one hand, a wide array of automatic multimedia analysis techniques on the other, and a narrow set of interactions between the two — filter, search, interactive learning (relevance feedback, active learning).

To enable meaningful interaction, semantic interactions are vital to multimedia analytics. Semantic interactions translate user interactions performed on high-level visual artifacts in the interface to low-level analytic model adjustments, coupling cognitive and computational processes [10]. The user does not train and adjust the model directly, but rather interacts within her domain of expertise, and the model uses those interactions to improve. Developing new such interactions and thus widening the hourglass would improve multimedia analytics capabilities further.

To address the above challenges, we present
II-20 (Image Insight 2020), a multimedia analytics approach for image collections that brings the following contributions:

• A new analytic categorization model that supports multiple categories of relevance and dynamically slides on the exploration-search axis without explicit user involvement. The model is fully interactive even on large (>1M) datasets and metaphor-agnostic. To the best of our knowledge, II-20's model is the first to fully support analytic categorization of image collections.

• The
Tetris metaphor that streams the images in one by one, with the user steering them to the correct categories of relevance. As the model learns, it starts playing the Tetris game by itself, with the user only correcting the model's mistakes. The metaphor is tightly integrated with the model, with a clear benefit to the user: the number of interactions required from her is inversely proportional to the quality of the model, whilst providing the same or better analytic outcome.

• The fast-forward interaction that allows the user to swiftly categorize a large number of images at once based on the current state of the model.

The rest of the paper is organized as follows. Sect. 2 overviews the related work. Sect. 3 describes II-20. Sect. 4 outlines the experimental setup; results are discussed in Sect. 5. Sect. 6 concludes the paper.
2 RELATED WORK
Analytic categorization of image collections is iterative, requiring tight integration between the visual metaphor and the machine model that provides image suggestions. This is in line with established visual analytics theory [21, 22, 30, 33]. True support of analytic categorization as a task involves semantic interactions (this challenge is shared with general visual analytics [10]), dynamic sliding on the exploration-search axis, and closing the pragmatic gap [45]. In this section, we review related work on the constituent parts of a multimedia analytics system (interface and model) and on means of integrating the two.

Fig. 2. II-20's interface-model scheme. The components innovated by II-20 are coloured orange.

There is a great variety of visual metaphors available. The classic, time-tested approach used by the vast majority of systems visualizing image collections is the grid. Grids score near-perfectly on efficiency of screen space utilization and are very intuitive. They can be enhanced to convey collection structure [4, 31, 46]. Beyond grids, there are many other metaphors, such as spreadsheet-based ones that integrate the image content tightly with metadata [8, 19, 40], semantic-navigation-based ones that allow the user to pursue threads of interest, often semantic [5, 9], or even metadata-driven ones [38, 41]. Rapid serial visual presentation (RSVP) presents images dynamically, flashing images or video clips in a fast-paced manner, with the user providing simple, rapid responses [14, 36, 37]. There are plenty of metaphors with various niches.

Models supporting multimedia analytics can be split roughly into index-based and interactive-learning-based (hybridization possible). Index-based techniques precompute a collection index which is used for filtering and/or search queries. The basic, yet still effective approach is the metadata-based index. Content-based indexing requires feature and/or concept label extraction, and the contemporary computer vision standard is to use deep convolutional neural networks [23]. The features (esp. semantically meaningful ones, such as concept labels) can be used as metadata (e.g., [39]) or to build a content-based index to fuel search. Indexing approaches include clustering-based approaches such as product quantization [17] or hash-based approaches [3, 7]. The current state of the art offers a broad range of techniques that establish a semantic structure of the collection. Relying on indexing alone in multimedia analytics, however, reduces analytics to just search. The model is rigid, non-adaptive, and does not address the pragmatic gap.

Interactive-learning-based approaches collect feedback from the user in the form of explicit "relevant" and "not relevant" labels, then train a new model based on those labels, rank the data according to the new model, and suggest more relevant items. Each interaction round should happen in interactive time. Following visual analytics theory [32], this means operating in the real-time regime. Active learning [2, 34] suggests the items the model is least sure about. This maximizes the model's learning gain and minimizes the number of user interactions. Algorithmically, most of the techniques come from the 2000s (the aforementioned surveys provide a good overview). In the 2010s, interactive learning struggled with the rapid increase in data scale. Recently, it has been improved to work on modern large-scale collections by introducing efficient compression [42] and clustering [18]. Interactive learning is adaptive, dynamic, and flexible: it learns only from the user, making it a good fit for closing the pragmatic gap.
On its own, however, it still gravitates towards search, and is limited by latency: there is only so much that can be computed in interactive time.

In the 2000s and 2010s, there have been a number of systems that integrate advanced visualizations with machine-learning-based models in both visual analytics [11] and multimedia analytics [45]. Moreover, visual analytics has been employed to explain machine learning models. A recent notable instance is the effort to explain deep neural nets [16]. In most of the visual analytics systems, users directly manipulate the machine learning model, which is useful for the data scientist, but might be difficult for an analyst who is not a machine learning expert. The multimedia analytics systems, where semantic navigation is of paramount importance, usually operate with a narrow semantic interaction dictionary: "filter", "search", and "perform interactive learning". Additional semantic interactions would definitely be a big boon for both visual and multimedia analytics [10, 45].

As discussed in the introduction, II-20 brings three main contributions. II-20's machine model combines the advantages of index-based and interactive-learning-based approaches. By flexibly supporting dynamic sliding on the exploration-search axis, it is to the best of our knowledge the first model closing the pragmatic gap [45]. The Tetris metaphor, beyond expanding the family of metaphors for image collections, has a tighter integration with the model than others, decreasing the number of interactions as the model improves. Finally, the fast-forward interaction expands the semantic interaction dictionary, answering a clear visual and multimedia analytics research challenge [10, 45].

3 II-20

II-20 is tailored for full support of analytic categorization, defined as the task of assigning images $i_1, \dots, i_n$ from the collection $I$ into analytic categories, which we henceforth call buckets, consistently with the terminology introduced in related work [8, 40]. The machine model requires the images to be represented with a semantic feature representation.

The support for flexible buckets is formalized as follows. Let $B$ denote the set of user-defined buckets and $b \in B$ an individual bucket. To cater for the pragmatic gap, $B$ is a mutable set: the user can create, redefine, activate/deactivate, and remove buckets at any time throughout the analytics session. In II-20, the user can have between 1 and 7 buckets active at any given time, which is consistent with related work on visualization theory [13]. Individual buckets are mutable as well: images can be added, removed, and transferred between buckets at any time. In addition to $B$, II-20 adds the implicit discard pile bucket ($d$), which at all times contains images that were discarded by the user (marked not relevant). The user can add images to $d$ and restore them to any $b \in B$ as she sees fit. Further, $d$ is always active; it cannot be deactivated, redefined, or deleted. Finally, $P$ is the set of processed images the user has provided feedback on. At all times, $P = B \cup d$.

The main challenge of supporting analytic categorization is its flexibility. There are no constraints, predefined rules, or prior knowledge concerning images the user can add to the buckets. Yet the model must "read the user's mind", supplying suggestions that are relevant to $B$ in its current state. Moreover, it must do so in interactive time, placing challenging constraints on computational efficiency.
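To make the formalization concrete, here is a minimal Python sketch of the bucket bookkeeping, assuming images are referenced by integer IDs; the class and method names are illustrative, not II-20's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class Bucket:
    """A user-defined analytic category; mutable at any time."""
    name: str
    images: Set[int] = field(default_factory=set)  # image IDs
    active: bool = True

class Session:
    MAX_ACTIVE = 7  # 1-7 active buckets, consistent with [13]

    def __init__(self) -> None:
        self.buckets: Dict[str, Bucket] = {}  # B: the mutable bucket set
        self.discard: Set[int] = set()        # d: always-active discard pile

    def create_bucket(self, name: str) -> Bucket:
        if sum(b.active for b in self.buckets.values()) >= self.MAX_ACTIVE:
            raise ValueError("at most 7 active buckets")
        self.buckets[name] = Bucket(name)
        return self.buckets[name]

    def assign(self, image_id: int, bucket: Optional[str]) -> None:
        """Assign an image to a bucket, or discard it (bucket=None)."""
        for b in self.buckets.values():   # an image lives in one place only
            b.images.discard(image_id)
        self.discard.discard(image_id)
        if bucket is None:
            self.discard.add(image_id)
        else:
            self.buckets[bucket].images.add(image_id)

    @property
    def processed(self) -> Set[int]:
        """P: all images the user has given feedback on (P = B ∪ d)."""
        return self.discard.union(*(b.images for b in self.buckets.values()))
```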
However, the payoff for the flexibility is significant: buckets can be fit by the user to a variety of tasks, including T1–T3 from the introduction.

II-20's interface-model scheme is depicted in Fig. 2. The model suggests images that cover the entire exploration-search axis: from pure search through dynamic interactive learning to exploration candidates that take the user to previously unseen parts of the collection. The suggestions come with a bucket confidence score that expresses the model's confidence that an image belongs to the bucket. This is visualized in the UI as additional information to enhance the decision making.

User agency is a core design tenet for II-20. The user starts in the familiar grid interface and the model is deliberately a black box: there are no required inputs from the user to control the sliding on the exploration-search axis; the model determines everything from the user's actions and an internal assessment of its own performance. When the user feels comfortable, she can switch back and forth between grid and Tetris and engage the fast-forward interaction.

II-20's interface is depicted in Fig. 1. The main view has three components. The image view occupies the main portion of the screen and displays images from the collection. The bucket banner below the image view shows the buckets that are active. Finally, the control panel on the right side of the screen provides bucket management, image view settings, and the fast-forward button.
II-20's interaction protocol is based on the standard used in interactive learning: the user labels images for the buckets and discards the non-relevant ones; the model learns from those interactions and provides relevant image suggestions. This approach is especially suited for tasks T1 (the user can easily follow multiple categories of relevance) and T3 (given the classifier is able to capture the nuances important to the user). In addition, II-20 provides a small number of exploration candidates not tied to a bucket to increase coverage of the exploration-search axis.

The grid image view is the static, batch mode showing multiple images at once. Due to the familiarity of the grid, it is II-20's default image view mode. It is integrated into the model rather loosely: it waits for the user's explicit feedback (the user selects the bucket to be labelled and the images to be assigned there) and explicit instructions to show more relevant images. Image suggestions for a bucket appear with a dashed border in the bucket's color, with brightness proportional to bucket confidence. The grid is resizable, so the user can choose to see more images or more detail. The user can also preview individual images by right-clicking the thumbnail. This enlarges the image and provides ample space for displaying any associated text and metadata.
The grid is a familiar, time-tested metaphor, but working with a series of grids for too long might be perceived as tedious. In addition to the static, batch-mode grid, we investigate the topic of dynamic, 1-image-at-a-time (D1I) metaphors, which show an image for a limited amount of time, after which it gets automatically assigned to the bucket suggested by the model unless the user intervenes. A D1I metaphor complements the grid, providing a possibly welcome change of pace, as well as three key strengths:
• Tight integration with the model, with a degree of autonomy — a well-trained model simply feeds images into the correct bucket on its own, and the user interacts only once in a while to correct wrong suggestions.
• Potentially lower number of processed images in total — the model learns incrementally and the UI only shows the top relevant image, so the user needs to process fewer images in total to get the same number of relevant images. In other words, a D1I metaphor reaches the same or higher precision and recall growth compared to the grid (we evaluate this claim in the experiments).
• Focus on detail — with one image shown at a time, the user's attention naturally focuses on details of the image in question, making D1I metaphors a natural candidate for applications where the detail decides the analytic outcome, such as medical imaging or security. This complements the grid, which is better at overviewing the collection and broader-category analytics.

II-20's instantiation of a D1I metaphor is the new
Tetris metaphor (shown in Fig. 1), inspired by the famous game. Tetris operates as follows: images flow from the top one at a time, and descend to one of the buckets in the bucket banner. When an image reaches the bucket, it gets assigned to it, the model processes the assignment, and the next image starts flowing from the top. The user can steer the images between buckets by pressing the left and right arrow keys, pause the flow by hitting spacebar, and increase/decrease the descent speed by hitting the down and up arrow, respectively. Speed and pause/play can also be controlled by buttons in the control panel. Finally, hitting the I key opens up an overlay with any associated text and metadata.

The model will mostly suggest images for the buckets that are active. These flow in already positioned above their suggested bucket, connected to the bucket by a line in the bucket's colour. Exploration suggestions appear over the discard pile without a connecting line to any bucket: they tend to be from previously unseen areas of the collection, so whilst providing exploration directions, they are likely incorrect.
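To illustrate the control flow described above, the following is a minimal, framework-free sketch of one Tetris round; the suggest, poll_key, and commit hooks are assumptions standing in for II-20's model, keyboard handling, and bucket store.

```python
import time

def tetris_round(image_id, buckets, suggest, poll_key, commit,
                 descent_seconds=3.0):
    """One Tetris round: the image descends toward the model-suggested
    bucket and is committed there unless the user steers it elsewhere.
    `suggest`, `poll_key`, and `commit` are illustrative hooks."""
    pos = buckets.index(suggest(image_id))   # start above suggested bucket
    deadline = time.monotonic() + descent_seconds
    paused = False
    while paused or time.monotonic() < deadline:
        key = poll_key()                     # non-blocking; None when idle
        if key == "LEFT":
            pos = max(pos - 1, 0)            # steer one bucket to the left
        elif key == "RIGHT":
            pos = min(pos + 1, len(buckets) - 1)
        elif key == "SPACE":
            paused = not paused              # pause/resume the descent
        elif key == "DOWN":
            break                            # drop: commit immediately
    commit(image_id, buckets[pos])           # assignment; the model learns
```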
Each bucket has an entry in the top part of the control panel. Bucket deactivation is useful whenever the user wants to focus on something else and return to the bucket in the future. A bucket can be activated and deactivated at any time by clicking its name or icon. Deactivating will remove the bucket from the bucket banner, gray it out in the control panel, and pause model suggestions for that bucket. However, the bucket will be preserved and the user can reactivate it again.

The eye button next to the bucket name in the control panel opens up the bucket view, showing all bucket images in a grid. The grid can be switched between 3 (default) and 1 images per row, toggling between more images and more detail. The brightness of an image border is proportional to bucket confidence. The bucket view allows sorting by bucket confidence and the time the image was added to the bucket (newest/oldest first). Finally, the bucket view allows transfer of images between buckets with two modes: move and copy. This implements bucket splitting (the user can transfer images in bulk to a new bucket) and bucket redefinition (moving images between buckets triggers model retraining on both the sending and the receiving bucket), both crucial for analytic categorization flexibility. The edit button allows renaming the bucket. The trash bin button deletes the bucket.

The bucket banner provides a quick overview of the state of the buckets. It shows the number of images in the bucket, as well as bucket archetypes, i.e., the images that the model thinks best represent the bucket. Their number is determined by the screen space available (at least one will be shown for each bucket). The user can thus quickly gauge if the model understands her bucket definition.
II-20's model's pipeline for suggesting relevant images is depicted in Fig. 4. The core (employing just the black-coloured steps) is simply the interactive learning pipeline. II-20's model enhances it significantly, producing exploration and search suggestions dynamically by monitoring its own performance, without direct involvement of the user.
To extract image features, we use the ImageNet Shuffle deep neural net [25] with 4437 concepts representing the visual presence of nouns in the image. We extract two feature representations: the concept representation with the 4437 concepts, i.e., recording the net output, and the abstract representation with the output of the second fully-connected layer containing 1024 dense, abstract features that encode the same semantic information (meaningless to the user, but suitable for indexing). The features are used to construct two key data structures.

Firstly, the collection index, establishing an efficient semantic similarity structure on the collection. To compute the index, we employ product quantization (PQ) [17] on the abstract representation. PQ splits the feature matrix column-wise into $m$ equally-sized submatrices (in our case, $m = 32$, i.e., 32 submatrices of 32 features each), then quantizes each submatrix using k-means (in our case, $k = \min(\cdot\,, \sqrt{n})$, where $n$ is the number of images in the collection), preserving the centroid coordinates for each subquantizer. The PQ representation of each image is the concatenation of subquantizer centroid IDs, with lookup tables of centroid distances set up for quick similarity search.

Secondly, the interactive learning representation, built using Blackthorn [42]. We use the concept feature representation, compressed using Blackthorn to preserve the top 25 concepts by value per image. This number is deliberately larger than the recommendation of 6 [42] due to our concept dictionary being ~4x the size of theirs. The resulting sparse representation preserves image semantics and reduces the size by more than 99 percent. II-20's prototype uses scikit-learn [29], which directly supports efficient sparse matrix computations.
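As a concrete reading of the two data structures, here is a compact sketch of PQ encoding and Blackthorn-style top-k sparsification with numpy/scipy/scikit-learn; it follows the description above (m = 32, top 25 concepts kept), but it is an illustrative reconstruction, not the authors' code, and the subquantizer codebook size k is left as a free parameter.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

def pq_encode(X_abstract, m=32, k=256, seed=0):
    """Product quantization: split features column-wise into m blocks,
    k-means each block, represent an image by its m centroid IDs."""
    n, dim = X_abstract.shape
    d_sub = dim // m
    codes = np.empty((n, m), dtype=np.int32)
    codebooks = []
    for j in range(m):
        block = X_abstract[:, j * d_sub:(j + 1) * d_sub]
        km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(block)
        codes[:, j] = km.labels_           # centroid ID per image
        codebooks.append(km.cluster_centers_)
    return codes, codebooks  # codebooks yield the distance lookup tables

def sparsify_top_k(X_concepts, top=25):
    """Blackthorn-style compression: keep only the `top` largest concept
    scores per image, zero the rest, store as a sparse CSR matrix."""
    n, dim = X_concepts.shape
    idx = np.argpartition(X_concepts, dim - top, axis=1)[:, -top:]
    rows = np.repeat(np.arange(n), top)
    vals = X_concepts[rows, idx.ravel()]
    return csr_matrix((vals, (rows, idx.ravel())), shape=(n, dim))
```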
To cover the entirety of the exploration-search axis, II-20's model has three components capable of suggesting images (as outlined in Fig. 2): the interactive classifier, nearest-neighbour search, and the randomized explorer. The position of each component on the exploration-search axis is shown in Fig. 3.

The interactive classifier maintains a linear SVM model $\sigma_b$ for each non-empty bucket $b \in B$. This classifier choice is consistent with the state of the art in interactive multimodal learning [18, 42]; linear SVM exhibits good performance in interactive time on even very large datasets. The interactive classifier's suggestions are the top images $i \in I$ by classifier score ($score(\sigma_b, i)$). Since each interactive classifier is explicitly tied to a bucket, it is also used to compute bucket confidence, the belief that an image $i \in I$ belongs to bucket $b \in B$:

$$conf(i, b) = \min\left(\max\left(\frac{score(\sigma_b, i)}{\max_{i_b \in b} score(\sigma_b, i_b)},\ 0\right),\ 1\right) \quad (1)$$

As described in Sect. 3.1, bucket confidence is used throughout II-20's UI to provide additional information about the model's reasoning. Easy translation to bucket colour brightness is the reason bucket confidence is confined to the $[0, 1]$ range. If $\sigma_b = \emptyset$, bucket confidence is undefined (but in that case, the model is not suggesting for $b$ anyway).
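A minimal reading of Eq. 1 in code, assuming $\sigma_b$ is a fitted scikit-learn LinearSVC and decision_function plays the role of score($\sigma_b$, ·); the normalization by the best in-bucket score and the clamp to [0, 1] mirror the formula.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bucket_confidence(svm: LinearSVC, x_image, X_bucket) -> float:
    """Eq. 1: classifier score of the image, normalized by the best
    score among the bucket's own images, clamped into [0, 1]."""
    score = svm.decision_function(x_image.reshape(1, -1))[0]
    best_in_bucket = svm.decision_function(X_bucket).max()
    return float(np.clip(score / best_in_bucket, 0.0, 1.0))
```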
Nearest neighbour search uses the collection index to search for images with the lowest distance to a bucket. It has two modes: k-NN (k nearest neighbours) and aNN (approximate nearest neighbours). As shown in Fig. 3, they occupy different positions on the exploration-search axis, and also on the "exactness vs. computational efficiency" tradeoff: the k-NN mode is more exact, but requires a full k-NN matrix of the dataset (especially difficult for datasets of >1M images), whereas the aNN mode is more randomized, but utilizes no precomputed structures. We experimentally evaluate both modes.

The k-NN mode relies on a precomputed k-NN matrix that records the 10 nearest neighbours by PQ distance for each image in the collection. To produce suggestions for bucket $b$, the k-NN mode uniformly samples images from the set of all recorded neighbours of the images in $b$ (for computational efficiency reasons, if $|b| > 50$, the neighbours of a uniform random sample of size 50 drawn from $b$ are used instead).

The aNN mode first uniformly samples 50000 candidates from all unseen images. Then, it computes their distance to up to 25 images in $b$ (sampling uniformly if $|b| > 25$). The distance of each candidate to the bucket is the minimal distance between itself and any image in the bucket (sample). Finally, it returns the top candidates sorted by the distance to bucket $b$ in ascending order. The aNN sample caps of 50000 and 25 were chosen to preserve interactive response time.

Fig. 3. The position of II-20 model components on the exploration-search axis.
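The aNN mode amounts to sampled min-distance ranking. Below is a short numpy/scipy sketch under the stated caps (50000 candidates, bucket sample of 25); Euclidean distance on the abstract features stands in for the PQ-based distance, an assumption made for readability.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

def ann_suggestions(X, unseen_ids, bucket_ids, n_suggest,
                    n_candidates=50_000, bucket_cap=25):
    """aNN mode: sample unseen candidates, rank them by their minimal
    distance to a (sampled) subset of the bucket's images."""
    cand = rng.choice(unseen_ids,
                      size=min(n_candidates, len(unseen_ids)),
                      replace=False)
    if len(bucket_ids) > bucket_cap:
        bucket_ids = rng.choice(bucket_ids, size=bucket_cap, replace=False)
    dist = cdist(X[cand], X[bucket_ids])   # (candidates, bucket sample)
    min_dist = dist.min(axis=1)            # candidate-to-bucket distance
    order = np.argsort(min_dist)           # ascending: closest first
    return cand[order[:n_suggest]]
```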
Finally, II-20 has a randomized explorer component to support exploration. It suggests random images that are as far away from what the user has already processed as possible. This allows quick semantic traversal to the unseen parts of the collection. The randomized explorer first randomly samples candidate suggestions from all unseen images. Then, it sorts the candidates by maximum distance to $P$: the distance of each candidate to $P$ is equal to the minimal distance to any image in $P$. The top images in the sorted set are the randomized explorer suggestions. The number of candidates is a performance-bounded parameter: the larger without violating interactive response time, the better. In II-20, it is set to 100 times the number of requested suggestions.

To model buckets, II-20 maintains three extra sets of images per bucket. Firstly, bucket suggestions ($S_b$), i.e., all images suggested for bucket $b \in B$ throughout the session. Secondly, correct bucket suggestions ($C_b$), all images in $S_b$ which were also subsequently added to the bucket by the user. Thirdly, wrong bucket suggestions ($W_b$), all images in $S_b$ which were then discarded or added to a different bucket. Further, $S^{class}_b$ and $S^{nn}_b$ denote the suggestions produced by the interactive classifier and nearest neighbour search, respectively (similarly for $C_b$ and $W_b$). Let $\llbracket \cdot \rrbracket_w$ denote the sliding window operator, which selects exactly those images added in the last $w$ interaction rounds to an image set.

The relevant images suggestion procedure takes two inputs: firstly, user feedback ($F$), a set of key-value pairs with an image as key and its user-assigned bucket as value; secondly, $s_b$, the number of requested suggestions for each bucket. The suggestion procedure (see Fig. 4) for each bucket operates as follows.

Feedback processing — Establish $F_b$, the set of all images concerning bucket $b$ in $F$. Split the feedback into positive (images suggested for bucket $b$ and added there by the user), neutral (images added to bucket $b$, but not suggested for it), and negative (images suggested for $b$, but added elsewhere); process each separately. Positive and neutral feedback images are added to $b$, negatives are added to $W_b$.

Train images pruning — By default, $\sigma_b$ uses all images in $b$ as positive training examples. For increased quality, it may be worthwhile to prune the training set. Generally, the more data, the better, but reinforcing the importance of archetypal images or clarifying the decision boundary might lead to increased performance. To that end, we propose three strategies to construct the positive training set for $\sigma_b$ (if $\sigma_b = \emptyset$, II-20 falls back to the default of taking all images from $b$):

• Relevance feedback — The $n_{tr}$ images from $b$ with the highest score according to the current $\sigma_b$; emphasizes the archetypes.

• Active learning — The $n_{tr}$ images from $b$ with the lowest score according to the current $\sigma_b$; focuses on the decision boundary between the bucket and the remainder of the collection.

• Hybrid — $n_{tr}$ relevance feedback images and $n_{tr}$ active learning images are obtained; the result is the union of the two sets. Trims images that are neither archetypal nor near the decision boundary.

We experimentally compare all four strategies with various $n_{tr}$ to each other and to the default setting (simply taking all images from $b$); a sketch of the strategies follows below.
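As referenced above, a minimal sketch of the three pruning strategies, assuming a fitted linear SVM whose decision_function supplies score($\sigma_b$, ·); bucket_ids is a numpy array of image IDs, and all names are illustrative.

```python
import numpy as np

def prune_positives(svm, X, bucket_ids, strategy="all", n_tr=500):
    """Select the positive training set for a bucket's classifier."""
    if svm is None or strategy == "all":
        return bucket_ids                  # default: all bucket images
    scores = svm.decision_function(X[bucket_ids])
    order = np.argsort(scores)             # ascending by classifier score
    if strategy == "rf":                   # relevance feedback
        keep = order[-n_tr:]               # highest-scoring: archetypes
    elif strategy == "al":                 # active learning
        keep = order[:n_tr]                # lowest-scoring: boundary
    elif strategy == "hybrid":             # both ends of the ranking
        keep = np.union1d(order[:n_tr], order[-n_tr:])
    return bucket_ids[keep]
```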
Classifier training — If $b \neq \emptyset$, train the classifier. The set of positives is taken from the previous step; the set of negative training examples is initialized to $W_b$. For classifier robustness, we want at least twice as many negatives as positives. If that is not the case, the negatives are supplemented with a random sample of the images in the discard pile, and if that is still not enough, they are filled to the desired size by a random sample from all images in the collection.

Null classifier case — If $\sigma_b = \emptyset$, return $s_b$ randomized explorer suggestions.

Oracle queries — Employing active learning often leads to improved classifier quality whilst reducing the required number of user interactions [2, 34]. II-20 must chiefly employ relevance feedback, as the user is looking for relevant images, but it might help to ask the user (= the oracle) for judgment on a couple of difficult images. An oracle query means that instead of the image with the highest $\sigma_b$ score, II-20 shows an image with the score closest to 0 (= nearest to the decision boundary) and marks it with a question mark. Let $o$ denote the proportion of oracle queries within suggestions ($o = 0$: pure relevance feedback, $o = 1$: pure active learning). II-20 produces oracle queries by replacing each suggestion with an oracle query with probability $o$. Then, $s_b$ is reduced by the number of oracle queries such that the correct requested number of suggestions is maintained. In the experiments, we vary $o$ to gauge the benefits of employing active learning.

Exploration-search split — The model decides the proportion between classifier, nearest neighbour, and randomized explorer suggestions based on the precision achieved by the classifier ($p_{class}$) and nearest neighbour search ($p_{nn}$) in the last $w$ interaction rounds ($w$ is a parameter subject to experimentation):

$$p_{class} = \frac{|\llbracket C^{class}_b \rrbracket_w|}{|\llbracket S^{class}_b \rrbracket_w|} \quad (2)$$

$$p_{nn} = \frac{|\llbracket C^{nn}_b \rrbracket_w|}{|\llbracket S^{nn}_b \rrbracket_w|} \quad (3)$$

If $\llbracket S^{class}_b \rrbracket_w = \emptyset$, $p_{class} := p_{nn}$ (and analogously for $p_{nn}$). For each suggestion to be produced, roulette selection is performed. A uniform random number $r \in [0, 1)$ determines the suggestion source:

• $0 \leq r < p_{class}$: interactive classifier

• $p_{class} \leq r < p_{class} + (1 - p_{class}) \cdot p_{nn}$: nearest neighbour search

• $p_{class} + (1 - p_{class}) \cdot p_{nn} \leq r < 1$: randomized explorer

In other words, the percentage of interactive classifier suggestions is equal to current precision. Should the interactive classifier start faltering, nearest neighbour search comes in, shifting the position on the exploration-search axis. If neither provides meaningful suggestions, both $p_{class}$ and $p_{nn}$ fall to zero and most of the suggestions will be produced by the randomized explorer, which is traversing to yet unseen parts of the collection. Over a couple of exploration rounds, new bucket images (or even buckets) will hopefully manifest, the sliding window will "forget" the bad streak, and the analytics will shift toward search again. Task-wise, this enables the model to support tasks including T1–T3 and allows the user to shift between them on demand (e.g., structuring the collection at first, and looking for needles in the haystack later).

Final suggestions — Based on the exploration-search split, image suggestions are produced by each of the model components, concatenated with the oracle queries, and returned to the user.

Fig. 4. The procedure for suggesting relevant images: feedback is collected from the interface, processed, a new interactive classifier is trained, and then suggestions covering the entire exploration-search axis are produced. The components innovated by II-20 are coloured orange.
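Summarizing the suggestion round, here is a compact sketch of the exploration-search split and the oracle replacement; p_class and p_nn are assumed to be computed from the sliding windows per Eqs. 2 and 3, and the source names are illustrative.

```python
import random

def suggestion_sources(p_class, p_nn, s_b, oracle_prob=0.2, rng=random):
    """Roulette selection per suggestion (Eqs. 2-3 give p_class, p_nn),
    with each suggestion independently replaced by an oracle query
    (image nearest the decision boundary) with probability o."""
    sources = []
    n_oracle = 0
    for _ in range(s_b):
        if rng.random() < oracle_prob:
            n_oracle += 1                  # this slot becomes an oracle query
            continue
        r = rng.random()
        if r < p_class:
            sources.append("classifier")
        elif r < p_class + (1 - p_class) * p_nn:
            sources.append("nearest_neighbour")
        else:
            sources.append("randomized_explorer")
    return sources, n_oracle               # s_b is reduced by n_oracle
```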
The fast-forward interaction quickly expands a bucket using the current model. Fast-forward takes two inputs from the user: the bucket to be expanded ($b_{ff} \in B \cup d$, $b_{ff} \neq \emptyset$), and the number of images to be added to $b_{ff}$ (denoted $n_{ff}$). As the shorthand notation, we propose "fast-forwarding $b_{ff}$ by $n_{ff}$".

After receiving the input, the model directly adds the top $n_{ff}$ images by interactive classifier score to $b_{ff}$. The user is immediately taken to the bucket view with the fast-forwarded images shown at the top of the grid, marked with the fast-forward symbol (double right-pointing triangle). The user can review the fast-forwarded images and transfer the incorrectly-added ones to the discard pile. Note that the fast-forwarded images have already been added to the bucket: not interacting with them will keep them in the bucket, i.e., fast-forward does not merely provide $n_{ff}$ regular suggestions. Closing the bucket view commits the fast-forward; the images will subsequently appear as regular images.

Fast-forwarding brings the following advantages:

• Good model = sped up session — Fast-forward provides a gear shift for the session: it is much faster than producing the same number of regular suggestions, regardless of the metaphor. This is useful whenever the user wants to quickly focus on a bucket and expand on insight related to it, without having to grind the broader analytics session to a halt. By enabling this focus shift, fast-forward greatly enhances II-20's capabilities w.r.t. task T2.

• Responsive — The model processing part of a fast-forward always completes in interactive time, regardless of $n_{ff}$, due to the model scoring all images in the collection whenever producing suggestions (Sect. 3.2.4). The final list of fast-forwards is produced by simply trimming the list to $n_{ff}$, which is computationally trivial.

• Easy discarding — Fast-forwarding the discard pile allows the user to quickly dispose of large chunks of non-relevant data, which comes in handy e.g. whenever she has not received relevant suggestions for a while (discards provide valuable negative examples to the model). Model judgments on which images are not relevant tend to be more reliable than on the relevant ones, allowing setting a large $n_{ff}$.

• Semantic — "Fast-forwarding a bucket" is a comprehensible, clearly defined interaction, universally usable across domains of expertise, which directly translates to a model adjustment. As such, it answers the call for more semantic interactions [10, 45].
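A minimal sketch of fast-forward on top of the earlier illustrative Session: since the classifier already scores the images when producing suggestions, committing the top $n_{ff}$ is just a trim of the ranking, which is why the interaction stays responsive regardless of $n_{ff}$.

```python
import numpy as np

def fast_forward(svm, X, unseen_ids, session, bucket_name, n_ff):
    """Fast-forward a bucket by n_ff: directly commit the top-n_ff
    unseen images by interactive classifier score to the bucket."""
    scores = svm.decision_function(X[unseen_ids])
    top = np.asarray(unseen_ids)[np.argsort(scores)[::-1][:n_ff]]
    for image_id in top:
        # committed immediately; the user reviews afterwards and may
        # move mistakes to the discard pile in the bucket view
        session.assign(int(image_id), bucket_name)
    return top
```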
4 EXPERIMENTAL SETUP
We evaluate II-20 in two ways: firstly, we verify the analytic quality of the model with automated experiments; secondly, we perform an open-ended user study gauging II-20's usability and ability to provide insight. In addition, we also report on time complexity.
We have selected three datasets with different analytic niches: VOPE-8hr, a needles-in-the-haystack dataset with a clear associated real-worldtask (used for both experiments), and two computer vision benchmarkdatasets that we use for the automated experiments: CelebA, a portrait dataset with high binary annotation coverage, and Places205, a large-scale scene recognition dataset with categories of varied granularity.
VOPE-8hr is a dataset on the topic of violent online political extremism (VOPE). VOPE-8hr was constructed for a real use case: the video analytics component of VOX-Pol [1], a European network-of-excellence project connecting social science and forensics research focused on combatting VOPE. The dataset comprises 8 hours of video. 8% is VOPE content from 3 categories: neo-Nazi, Islamic terrorism, and Scottish ultra-nationalism. 28% of the content is "red herring" content, which exhibits some visual similarity to the VOPE content, but is safe (e.g., comedy skits featuring Nazi paraphernalia in a mocking manner). The rest, 64% of the content, is fluff, ranging from gaming streams through feature-length films to fashion and football documentaries. We have extracted 1 frame per 3 seconds of video, resulting in a dataset of 9618 images. VOPE-8hr is a challenging dataset: only a small part is relevant (VOPE), and it is obfuscated by three times as much red herring content. The clearly-defined needles-in-the-haystack task and its basis in a real-world use case make VOPE-8hr very suitable for insight-based evaluation [27, 28].
CelebA contains 202K face images annotated with 40 binary attributes such as "eyeglasses" or "wearing hat" [24]. There can be (and often are) multiple attributes per image, resulting in a large overlap of image sets per attribute. CelebA is a narrow-domain dataset.
Places205 comprises 2.5M scene images, each from one of 205 scene categories [47]. Places205 brings the challenge of scale (it is not trivial to process 2.5M images interactively), as well as variation in the scope of individual categories: some are quite general (e.g., "ocean" or "office"), some more nuanced (e.g., "herb garden" or "shoe shop").
The automated experiments aim to answer these research questions:

Q1) Does II-20's model yield better performance (precision, recall) than classic interactive learning?

Q2) How does the Tetris metaphor perform in comparison to the grid?

Q3) What parameter configuration of II-20 performs the best?

The experimental baseline is Blackthorn [42], the state of the art in interactive learning, in two oracle strategy variants. The first, baseline-rf, employs pure relevance feedback ($o = 0$). The second, baseline-al_0.2, is an active-learning modification with $o = 0.2$.
[Fig. 5 panels: Precision and Recall vs. processed images on VOPE-8hr, CelebA, and Places205; curves show the baselines (baseline-rf, baseline-al_0.2) and the best-performing II-20 variants for the Tetris and grid metaphors.]
Fig. 5. Precision and recall over the number of images processed by the actor, with x-ticks at 300, 600, 900, 1800, and 2700 images, corresponding to 5, 10, 15, 30, and 45 minutes at 1 image processed per second.

The baselines are pitted against II-20, varying the parameters defined in Sect. 3.2 independently. Firstly, the nearest neighbour mode: ann on all three datasets, knn on VOPE-8hr and CelebA (Places205 is too large for k-NN matrix computation). Secondly, $w \in \{5, 10\}$, the number of interaction rounds in the exploration-search split. Thirdly, $o \in \{0, 0.2\}$ (rf, al_0.2), the proportion of oracle queries within the suggested images. Fourthly, the train pruning strategies: all (no pruning), rf (relevance feedback), al (active learning), hybrid. Finally, $n_{tr} \in \{100, 500, 1000\}$, the number of bucket images to be kept when pruning. Henceforth, ii20-<nn mode>_<w>-<oracle>-<pruning>_<n_tr> denotes the II-20 variant with the corresponding parameter values (e.g., ii20-ann_5-al_0.2-hybrid_1000).
[Fig. 6 panels: relative precision and relative recall of II-20 variants, grouped by exploration-search component (SVM only, II-20 k-NN, II-20 aNN), exploration-search window w, oracle strategy (rf, al_0.2), pruning strategy (all, rf, al, hybrid), and pruning $n_{tr}$ (100, 500, 1000, ∞).]

Fig. 6. Parameter tuning results (relative precision and recall).
The user study aims to answer three further research questions:

Q4) Is II-20 an intuitive and effective multimedia analytics tool?

Q5) How does the Tetris metaphor fare in the eyes of the participants?

Q6) How efficient/useful is the fast-forward interaction?

We employ an open-ended insight-based evaluation protocol, in which the users think aloud, recording their insights as they progress with their evaluated task [27, 28]. The evaluated system's analytic efficiency is then gauged by analyzing these insights. The II-20 user studies are designed to be remote, so the "thinking aloud" is replaced by the users hitting the "Record insight" button and recording whatever is on their mind at any point in their session.

The user study scenario has four steps:
1. Introduction — The user is greeted by a description of II-20, analytic categorization, and an outline of the user study. Further, the user is linked to a YouTube tutorial and informed that they can refer back to the video at any time during the user study. The importance of using the "Record insight" functionality liberally throughout the session is heavily stressed. No special attention is drawn to the novel functionality (innovated model, Tetris, fast-forward) anywhere, for the sake of unbiased feedback.
2. Warm-up — The user tries II-20 out on a toy dataset (same as in the tutorial video). Her objective is to familiarize herself with II-20 controls. The user is not being recorded in this step.
3. User study task — After the warm-up, the user performs the evaluated task proper (described in detail below).
4. Final questionnaire — The user answers 3 open-ended questions: strengths of II-20, weaknesses, and any other comments.

The user study task is conducted on VOPE-8hr, and it closely matches its real use case. The user investigates extremists that post propaganda on the Internet. She has just received data from a suspect's computer and is asked to establish whether they contain VOPE content and if so, what kind. In addition, the user is asked to record insights about any content encountered in order to profile the suspect. The user should (among other insights) be able to establish that there indeed is extremist content. The user is instructed to take as much time as needed and can stop the analytic session at any time. In addition to explicit insight, we record user actions and all II-20's image suggestions.

11 users participated in the user study: 9 computer scientists and 2 robotics experts. 9 participants have a master's degree or higher, 2 participants are master students. None of the users are VOPE domain experts; the role of a digital forensics investigator was a role-playing task for them. None of the users have seen the dataset before.
5 RESULTS AND DISCUSSION
Fig. 5 reports the precision and recall of the evaluated algorithms, split by dataset and metaphor. Each curve is the average across all actors running on the dataset. In each plot, for each metaphor, we report the results of the baselines and the best-performing II-20 variants. The x-axis is the number of images processed by the user, which has a direct mapping to time (e.g., considering a hypothetical fast user that processes 1 image per second, the x-axis is time in seconds).

II-20 outperforms the baselines on all datasets with respect to both precision and recall, except for the later stages of analytics on CelebA, where the baselines pull ahead slightly (even then, II-20's performance remains quite acceptable). This makes sense: CelebA has high coverage of the annotations used to construct the actors. There are plenty of positive examples, which increases the reliability of the vanilla interactive learning approach. VOPE-8hr and Places205, however, have more of a needles-in-the-haystack nature: positives are a rarer asset. II-20 is strictly dominant on these datasets, often by a large margin. Therefore, we answer Q1 positively, in favour of II-20.

Comparing the metaphor usage simulations reveals that Tetris indeed shows analytic promise: Tetris outperforms the grid on both precision and recall for most of the session on all three datasets. This validates the "potentially lower total number of processed images" strength claimed in Sect. 3.1.2. It appears that Tetris's performance is tied to whether II-20's model was used: Tetris works well with II-20, whereas the baselines are better off with the grid. Also, there seems to be a breakpoint where switching from Tetris to the grid increases performance. We explain this by the difference in availability of training positives. At first, they are rare, and fine-grained feedback after each image (Tetris) is very beneficial to the model. Later on, there are usually enough positives to train the model, but the remaining ones are trickier to find, so it is beneficial to "fish" for more by showing a larger portion of the ranking in the grid. Whenever this stage of difficult positives is encountered, it might be worthwhile for the user to switch to the grid. We answer Q2 by remarking that Tetris certainly has strong analytic potential.

To answer Q3, we have performed parameter tuning and report the results in Fig. 6: normalized precision at 900 processed images (15 minutes at 1 image/s) and normalized recall at 1800 processed images (30 minutes at 1 image/s), i.e., early precision, late recall. Each bar reports the average normalized precision/recall across all experiments, metaphors, and datasets. Each normalized value is obtained by dividing the absolute value by the maximum achieved on that dataset. This is done to remove differences in absolute performance between datasets.

The differences are not statistically significant: none of the parameters seem to drastically influence the performance (within the evaluated values). However, certain observations can be made. The parameter tuning confirms what Fig. 5 shows as well: the aNN nearest neighbour mode edges ahead of k-NN. This is fortunate: aNN does not require a k-NN matrix. The shorter exploration-search window came ahead, which hints at confirming the importance of the exploration-search sliding being dynamic. Oracle queries seem to improve the model, and pruning should not be done too aggressively (if at all).

Fig. 7 shows the insights recorded by the user study participants over time, split into five categories.
The first are insights related to II-20's functionality; the second are general insights related to the task (for example: "user plays a lot of video games"); and the remaining three correspond to the VOPE content categories — neo-Nazi, ISIS, Scottish ultranationalism — being explicitly referred to by the user (e.g., as one user wrote, "At the moment I can tell that the suspect does have extremist content from islamic terrorist."). The triangular markers mark the user's first encounter of an image from a VOPE category.

Fig. 7. User insights over time by type (one row per user, time in minutes). Circle markers denote insights, triangular markers the user's first encounter of VOPE content by category.

II-20 was able to sufficiently support the task. All participants were suggested VOPE content, and 9 out of 11 participants noted extremism in their insights: all 9 found Islamic terrorism, 5 found evidence of neo-Nazi content, and 1 found out about Scottish ultranationalism (note: this is a difficult, highly contextual category, and none of the users were Scottish). None of the users ended the session prematurely due to being confused or finding the system unusable.

Regarding main strengths and weaknesses, the received feedback is diverse. The strengths reported in the final questionnaire were: intuitive interface and user-friendliness (6 users), full control of the buckets (4 users), and good performance in finding similar images (4 users). The main reported weakness was receiving very similar, non-diverse images from the system (7 users). To an extent, this is an artifact of the dataset (frames extracted from videos, often mono-topical), but that of course does not invalidate the feedback. Other weaknesses include a missing progress bar (percentage of content seen), mentioned by 2 users, and diverse feedback by 1 user each, such as the dark design of the UI or unintuitiveness of certain tools (e.g., grid bucket selection and the meaning of discard). Regarding Q4, we conclude that II-20 has succeeded as a tool: it is intuitive and provides good performance, but additional diversification capabilities and UI improvements are needed.

The Tetris metaphor has received a mixed response. It is fair to say that the majority of the users' time was spent in the grid, with 5 of the 11 users swapping to Tetris at any point in the session. That makes sense: the grid is a familiar, useful, and also the default metaphor. Moreover, we have deliberately not drawn any attention to Tetris so as to obtain unbiased feedback. One user strongly liked Tetris, one found it not useful, and one stated that it would be much better if it had a static variant with immediate accept/reject. To answer Q5, Tetris comes out as a niche metaphor that polarizes users somewhat: extended functionality and/or implementation of another D1I metaphor (e.g., images flashing on the screen for a certain, user-controlled amount of time that get assigned to the suggested bucket unless the user intervenes) might be beneficial, especially given Tetris's strong analytic performance.

Fast-forward has been used by 8 users at least once. The fast-forwards were by 10–25 images, and all concerned user buckets (no user fast-forwarded the discard pile). One user lauded fast-forward as one of the main strengths of II-20; there has been no negative feedback or suggestions for improvement. Therefore, to answer Q6, we conclude that fast-forward has been shown to be a useful, intuitive interaction to those users that have chosen to use it.

Finally, a word on time complexity. Table 1 reports the average time per interaction round; the timings do not include UI image loading, but that is negligible in II-20, as it only loads images with the suggested IDs, which is a very fast DB query.
Table 1. Average time per interaction round for the baseline, ii20_ann, and ii20_knn on VOPE-8hr, CelebA, and Places205 (ii20_knn is intractable on Places205).
Following visual analytics theory [32], II-20's model operates in the direct manipulation latency category. k-NN is faster, as it frontloads a lot of the computation, but is intractable on large datasets and might yield lower precision/recall, as reported above. The aNN mode, while slower, is still interactive even on large data and yields good performance. In the user study, none of the users complained that the system was slow or unresponsive; on the contrary, all received feedback on speed was positive.

6 CONCLUSION
In this paper, we have presented II-20, an approach for multimedia analytics on image collections that contributes towards resolving open challenges in visual and multimedia analytics and closing the pragmatic gap. II-20's new machine model is the first to fully support dynamic sliding on the exploration-search axis without explicit input from the user, and in the automated experiments, it has proven superior to state-of-the-art interactive learning. The Tetris metaphor is a dynamic metaphor with high synergy with the model: an accurate model can "play the game" fully autonomously. Tetris has been shown to offer strong analytic numbers and potential. Functionality enhancements, additional metaphor variants, and user study evaluation of the metaphor's impact on the user's reasoning capacity would help to establish Tetris and the broader dynamic, 1-image-at-a-time family of metaphors further. The fast-forward interaction expands the family of semantic interactions. It provides an intuitive, fully controllable way to speed up the analytics process.

We especially value that II-20's contributions, in addition to passing the evaluation, have turned out to be considered intuitive by the users. User agency is one of II-20's key design paradigms: the basis is a familiar interface backed by a powerful model, and it is completely up to the user when she wants to engage with the new interactions and interface elements. The II-20 prototype is a complete system, which will be made open-source and available to the research community and applied domains alike.

II-20's model, even though tested on image data only, is extensible to the multimodal setting (metadata and/or text associated with the images). Features would be extracted for each modality separately. When asked for suggestions, each model component could either split the suggestions between modalities or fuse rankings per modality by rank aggregation (late fusion). As evidenced by recent work on interactive learning [18, 42], late fusion does not break the interactive response time requirement even on large datasets. Both approaches are plausible as direct extensions to II-20.

We hope that II-20 has contributed to kicking off truly intelligent and pragmatic multimedia analytics on image collections fit for the new decade.

ACKNOWLEDGMENTS
ACKNOWLEDGMENTS

We would like to thank Paul van der Corput for his comments and suggestions, as well as the test users for their time and feedback. This work was supported by the European Regional Development Fund (project Robotics for Industry 4.0, CZ.02.1.01/0.0/0.0/15_003/0000470).

REFERENCES

[1] VOX-Pol.
[2] C. Aggarwal, X. Kong, Q. Gu, J. Han, and P. Yu. Active learning: A survey, pp. 571–605. CRC Press, Jan. 2014.
[3] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, Jan. 2008.
[4] K. U. Barthel, N. Hezel, and K. Jung. Visually browsing millions of images using image graphs. In Proc. MM, pp. 475–479. Association for Computing Machinery, New York, NY, 2017.
[5] P. Brivio, M. Tarini, and P. Cignoni. Browsing large image datasets through Voronoi diagrams. IEEE Transactions on Visualization and Computer Graphics, 16(6):1261–1270, 2010.
[6] N. A. Chinchor, J. J. Thomas, P. C. Wong, M. G. Christel, and W. Ribarsky. Multimedia analysis + visual analytics = multimedia analytics. IEEE Computer Graphics and Applications, 30(5):52–60, Sept. 2010.
[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SCG, pp. 253–262. Association for Computing Machinery, New York, NY, 2004.
[8] O. de Rooij, J. van Wijk, and M. Worring. MediaTable: Interactive categorization of multimedia collections. IEEE Computer Graphics and Applications, 30(5):42–51, May 2010.
[9] O. de Rooij and M. Worring. Browsing video along multiple threads. IEEE Transactions on Multimedia, 12(2):121–130, Feb. 2010.
[10] A. Endert, R. Chang, C. North, and M. Zhou. Semantic interaction: Coupling cognition and computation through usable interactive analytics. IEEE Computer Graphics and Applications, 35(4):94–99, 2015.
[11] A. Endert, W. Ribarsky, C. Turkay, B. L. W. Wong, I. Nabney, I. D. Blanco, and F. Rossi. The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36(8):458–486, 2018.
[12] R. Geirhos, C. R. M. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. In Proc. NIPS, pp. 7538–7550. Curran Associates, Inc., Red Hook, NY, 2018.
[13] T. M. Green, W. Ribarsky, and B. Fisher. Building and applying a human cognition model for visual analytics. Information Visualization, 8(1):1–13, Jan. 2009.
[14] A. G. Hauptmann, W.-H. Lin, R. Yan, J. Yang, and M.-Y. Chen. Extreme video retrieval: Joint maximization of human and computer performance. In Proc. ACM MM, pp. 385–394. Association for Computing Machinery, New York, NY, 2006.
[15] A. G. Hauptmann, J. J. Wang, W.-H. Lin, J. Yang, and M. Christel. Efficient search: The Informedia video retrieval system. In Proc. ACM CIVR, pp. 543–544. Association for Computing Machinery, New York, NY, 2008.
[16] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 25(8):2674–2693, 2019.
[17] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, Jan. 2010.
[18] B. Þ. Jónsson, O. S. Khan, H. Ragnarsdóttir, Þ. Þorleiksdóttir, J. Zahálka, S. Rudinac, G. Þ. Guðmundsson, L. Amsaleg, and M. Worring. Exquisitor: Interactive learning at large. arXiv:1904.08689, Apr. 2019.
[19] S. Kandel, E. Abelson, H. Garcia-Molina, A. Paepcke, and M. Theobald. PhotoSpread: A spreadsheet for managing photos. In Proc. CHI. Association for Computing Machinery, New York, NY, 2008.
[20] T. Kauer, S. Joglekar, M. Redi, L. M. Aiello, and D. Quercia. Mapping and visualizing deep-learning urban beautification. IEEE Computer Graphics and Applications, 38(5):70–83, 2018.
[21] D. Keim, J. Kohlhammer, G. Ellis, and F. Mansmann. Mastering the Information Age – Solving Problems with Visual Analytics. Springer, New York, NY, Jan. 2010.
[22] D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual Analytics: Scope and Challenges, pp. 76–90. Springer, Berlin, Germany, 2008.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pp. 1097–1105. Curran Associates, Inc., Red Hook, NY, 2012.
[24] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. ICCV. IEEE, New York, NY, Dec. 2015.
[25] P. Mettes, D. C. Koelma, and C. G. M. Snoek. The ImageNet Shuffle: Reorganized pre-training for video event detection. In Proc. ACM ICMR, pp. 175–182. Association for Computing Machinery, New York, NY, June 2016.
[26] F. Miranda, M. Hosseini, M. Lage, H. Doraiswamy, G. Dove, and C. T. Silva. Urban Mosaic: Visual exploration of streetscapes using large-scale image data. In Proc. CHI, pp. 1–15. Association for Computing Machinery, New York, NY, 2020.
[27] C. North. Toward measuring visualization insight. IEEE Computer Graphics and Applications, 26(3):6–9, 2006.
[28] C. North, P. Saraiya, and K. Duca. A comparison of benchmark task and insight evaluation methods for information visualization. Information Visualization, 10(3):162–181, 2011.
[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[30] P. Pirolli and S. Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proc. Int. Conf. Intelligence Analysis, pp. 2–4, 2005.
[31] N. Quadrianto, K. Kersting, T. Tuytelaars, and W. L. Buntine. Beyond 2D-grids: A dependence maximization view on image browsing. In Proc. MIR, pp. 339–348. Association for Computing Machinery, New York, NY, 2010.
[32] W. Ribarsky and B. Fisher. The human-computer system: Towards an operational model for problem solving. In Proc. HICSS, pp. 1446–1455. IEEE, New York, NY, 2016.
[33] D. Sacha, A. Stoffel, F. Stoffel, B. C. Kwon, G. Ellis, and D. A. Keim. Knowledge generation model for visual analytics. IEEE Transactions on Visualization and Computer Graphics, 20(12):1604–1613, 2014.
[34] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[35] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, Dec. 2000.
[36] R. Spence. Rapid, serial and visual: A presentation technique with potential. Information Visualization, 1(1):13–19, 2002.
[37] P. van der Corput and J. J. van Wijk. Effects of presentation mode and pace control on performance in image classification. IEEE Transactions on Visualization and Computer Graphics, 20(12):2301–2309, 2014.
[38] P. van der Corput and J. J. van Wijk. Exploring items and features with IF, FI-tables. Computer Graphics Forum, 35(3):31–40, June 2016.
[39] P. van der Corput and J. J. van Wijk. ICLIC: Interactive categorization of large image collections. In Proc. PacificVis, pp. 152–159. IEEE, New York, NY, May 2016.
[40] M. Worring, D. Koelma, and J. Zahálka. Multimedia pivot tables for multimedia analytics on image collections. IEEE Transactions on Multimedia, 18(11):2217–2227, Sept. 2016.
[41] X. Xie, X. Cai, J. Zhou, N. Cao, and Y. Wu. A semantic-based method for visualizing large image collections. IEEE Transactions on Visualization and Computer Graphics, 25(7):2362–2377, 2019.
[42] J. Zahálka, S. Rudinac, B. Þ. Jónsson, D. C. Koelma, and M. Worring. Blackthorn: Large-scale interactive multimodal learning. IEEE Transactions on Multimedia, 20(3):687–698, Mar. 2018.
[43] J. Zahálka, S. Rudinac, and M. Worring. Analytic quality: Evaluation of performance and insight in multimedia collection analysis. In Proc. MM, pp. 231–240. Association for Computing Machinery, New York, NY, 2015.
[44] J. Zahálka, S. Rudinac, and M. Worring. Interactive multimodal learning for venue recommendation. IEEE Transactions on Multimedia, 17(12), Dec. 2015.
[45] J. Zahálka and M. Worring. Towards interactive, intelligent, and integrated multimedia analytics. In Proc. IEEE VAST, pp. 3–12. IEEE, New York, NY, Nov. 2014.
[46] E. Zavesky, S.-F. Chang, and C.-C. Yang. Visual islands: Intuitive browsing of visual search results. In Proc. CIVR, pp. 617–626. Association for Computing Machinery, New York, NY, 2008.
[47] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Proc. NIPS, pp. 487–495. Curran Associates, Inc., Red Hook, NY, 2014.
[48] X. Zhou and T. Huang. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems, 8(6):536–544, 2003.