Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data
Daniel Kang*, John Guibas*, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia
Stanford University
ABSTRACT
Unstructured data is now commonly queried by using target deep neural networks (DNNs) to produce structured information, e.g., object types and positions in video. As these target DNNs can be computationally expensive, recent work uses proxy models to produce query-specific proxy scores. These proxy scores are then used in downstream query processing algorithms for improved query execution speeds. Unfortunately, proxy models are often trained per-query, require large amounts of training data from the target DNN, and require new training methods per query type.

In this work, we develop an index construction method (task-agnostic semantic trainable index, TASTI) that produces reusable embeddings that can be used to generate proxy scores for a wide range of queries, removing the need for query-specific proxies. We observe that many queries over the same dataset only require access to the schema induced by the target DNN. For example, an aggregation query counting the number of cars and a selection query selecting frames of cars require only the object types per frame of video. To leverage this opportunity, TASTI produces embeddings per record that have the key property that close embeddings have similar extracted attributes under the induced schema. Given this property, we show that clustering by embeddings can be used to answer downstream queries efficiently. We theoretically analyze TASTI and show that low training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on four video and text datasets, and three query types. We show that TASTI can be 10× less expensive to construct than proxy models and can outperform them by up to 24× at query time.

PVLDB Reference Format:
Daniel Kang*, John Guibas*, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia. Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data. PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX
* Marked authors contributed equally.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

INTRODUCTION

Unstructured data such as video and text is becoming increasingly feasible to analyze due to automatic methods of analysis in the form of deep neural networks (DNNs). A common approach is to use a target DNN to extract structured information, in an induced schema, from this data. For example, object detection DNNs [34] can be used to extract the list of object types and positions in a frame of video. This information can then be used to answer a range of queries, such as counting the number of cars per frame. However, state-of-the-art target DNNs can be prohibitively expensive to execute on large data volumes, executing as slow as 3 frames per second (fps), or 10× slower than real time [21].

To reduce the cost of analysis over unstructured data, recent work has proposed query-specific proxy models to approximate target DNNs. For example, low-cost proxy models can be used for selecting unstructured data records matching predicates [3, 24, 30, 31, 33], for aggregation queries [29], and for limit queries [29]. For each query, a new proxy model is used to generate proxy scores per data record, in which the goal is to approximate the result of executing the target DNN on the data record for the particular query; these scores are subsequently used for query processing (e.g., BlazeIt debiases proxy scores with the method of control variates [19, 29]).

Unfortunately, each query type often requires large amounts of training data from the target DNN and new, ad-hoc training procedures [3, 29–31, 33].
In particular, methods based on query-specific proxy models have three key drawbacks. First, obtaining large amounts of training data from the target DNN can be computationally expensive. For example, BlazeIt and NoScope require up to 150,000 annotations from the target DNN [29, 30], and other systems require expensive human annotations [3, 24, 33]. Second, these systems require new training procedures for each query type, which can be difficult to develop. Third, query-specific proxy models cannot easily share computation across queries. Thus, implementing queries can be challenging and computationally expensive at ingest time (i.e., when obtaining target DNN annotations).

This prior work ignores a key opportunity for accelerating queries: the target DNN induces a schema that can be shared across many query types. For example, an aggregation query and a selection query over cars in a video only require information about object types and positions. This schema is thus a sufficient statistic that can be used to answer many query types. While fully materializing this sufficient statistic is expensive, query processing systems would ideally leverage it to avoid repeated work and target DNN invocations.

To address these issues and leverage this opportunity, we propose Task-Agnostic Semantic Trainable Indexes (TASTI), a method of indexing unstructured data via embeddings for accelerating downstream proxy score-based queries over the induced schema. Given the target DNN and a user-provided notion of "closeness" over the schema, TASTI produces semantic embeddings for each unstructured data record (e.g., frame of video or line of text), with the desideratum that records with close embeddings are similar under the induced schema for all downstream queries.
TASTI requires that the induced schema has a notion of distance, e.g., that frames of a video with similar object types and object positions are close. Given these embeddings and a small set of records annotated by the target DNN, we demonstrate how to answer queries over the schema efficiently. We show that TASTI can be simultaneously up to 10× more sample efficient to construct and up to 24× more efficient at query time than the recent state-of-the-art.

To generate these indexes in a sample-efficient manner, TASTI uses a triplet loss training procedure to train an embedding DNN. TASTI then uses this DNN to produce embeddings per data record and annotates a small set of records with the target DNN ("cluster representatives"), which are stored as the index. The embedding DNN is task-agnostic, where "task" refers to a query over the induced schema and "agnostic" is in the sense of being agnostic to any downstream queries over the induced schema. Our sample-efficient usage of the triplet loss [38] can use as much as 10× fewer target DNN annotations to achieve similar quality results to proxy model-based methods [29–31]. Furthermore, if training is expensive, we show that pre-trained DNNs can be used to produce embeddings that weakly satisfy our desideratum.

TASTI can be used in conjunction with any query processing algorithm that requires access to proxy scores. We demonstrate how to answer aggregation, selection, and limit queries based on proxy algorithms [3, 29–31, 33] in this work. To answer queries given the cluster representatives and embeddings in TASTI, we cluster the remaining frames by embedding distance and propagate the annotations to unannotated records (Section 4). For example, for counting the number of cars in a video, we would assign an unannotated frame the average number of cars of the closest cluster representatives.
These scores are then used in query processing algorithms, often to debias proxy scores. We further show that as the target DNN is executed over the data, we can further cluster the embeddings, which improves performance (i.e., TASTI's indexes can be "cracked" [25]).

To understand TASTI's performance, we provide a theoretical analysis of how the triplet loss corresponds to downstream query accuracy (Section 5). We prove a positive result: queries that compute Lipschitz-continuous functions of the data will achieve exact results for 0 triplet loss and dense enough clustering. We further prove bounds on query performance when the triplet loss is not 0.

We implement TASTI in a prototype system. To demonstrate the efficacy of our indexing method, we evaluate TASTI on four datasets, including widely studied video datasets [9, 27, 29, 31, 39] and a text dataset [42]. We execute aggregation, selection, and limit queries over these datasets, as studied by prior work [29–31]. Across all queries and datasets, we show that TASTI outperforms the state-of-the-art [29, 31] by up to 24×. We further show that TASTI performs well across a range of parameter settings.

In summary, our contributions are:
(1) We propose a method for constructing semantic indexes (TASTI) for queries over unstructured data.
(2) We theoretically analyze TASTI, providing a statistical understanding of why our algorithms outperform baselines.
(3) We evaluate TASTI on four unstructured datasets and three query types, showing it can outperform the state-of-the-art.

OVERVIEW

TASTI is primarily designed for batch analytics over unstructured data. As its primary input, TASTI takes a target DNN that extracts structured information from unstructured data records, as many contemporary systems do [3, 24, 29–31, 33].
Given the schema induced by the target DNN, TASTI will build an index consisting of per-record embeddings that place input records with similar extracted structure together, and a set of cluster representatives (sample records annotated by the target DNN). This index can then be used to accelerate queries over the induced schema.

As an example, consider queries over object types and object positions in video. An object detection DNN [21] could be used as a target DNN to extract this information. Given the target DNN, TASTI will construct an index that places frames with similar object types and object positions together. Users can then issue queries selecting events of cars or counting the number of cars. As another example, consider a dataset that contains natural language questions. A semantic parsing target DNN could extract SQL statements that answer the questions; TASTI can construct an index that places questions with similar SQL operators and similar predicates together. Users can then issue queries to understand natural language questions, e.g., to count the average number of predicates.

To accelerate queries over the induced schema, TASTI propagates the cluster representatives' annotations via embedding distance to the remainder of the records, producing approximate scores. These approximate scores are used in query processing, e.g., in algorithms to efficiently execute aggregation and selection queries.

We now describe the specific problem statement, index construction, and query processing procedures.
Problem statement.
As input, TASTI takes a target DNN, an induced schema over the target DNN outputs, a target DNN invocation budget for index construction, a user-provided number k of closest distances to store, and a notion of closeness over the induced schema. TASTI's indexes have the desideratum that "close" records are close in embedding space, and vice versa for records that are far. The primary cost in index construction is the target DNN invocations used for training and for annotating cluster representatives; TASTI will attempt to construct high-quality indexes subject to the budget.

TASTI's primary goal is to produce high-quality proxy scores for query processing algorithms that require such scores. High quality typically refers to high correlation between proxy scores and target DNN outputs. We demonstrate how to generate such proxy scores for selection queries [3, 30, 31, 33], aggregation queries [29], and limit queries [29].

Index construction.
TASTI's indexes consist of semantic embeddings per data record, a small set of annotated cluster representatives, and distances from data records to cluster representatives.

To generate the embeddings, TASTI can train a DNN in a task-agnostic manner, such that semantically similar data records have similar embeddings, as measured by ℓ2 distance (Figure 1a). TASTI can also use embeddings from a pre-trained DNN to reduce index construction time (although this may degrade query-time performance). For the video example, TASTI's embedding DNN will attempt to cluster frames with similar object types and object positions together. Perhaps surprisingly, TASTI can also perform well using pre-trained DNNs with no additional training; we present results using pre-trained DNNs as "TASTI-PT" in Section 6.

Figure 1: TASTI system overview.
(a) Overview of TASTI's training procedure. Training data is selected via the induced schema and pre-trained embeddings. This training data is then used to train an embedding DNN.
(b) Overview of TASTI's index construction procedure. TASTI produces embeddings per data record, selects sample records to annotate, and computes embedding distances to those records.
(c) Overview of TASTI's query processing. Given a query, TASTI will propagate query-specific scores to unannotated records. These scores are coalesced into query-specific proxy scores, which are used in downstream query processing algorithms.

The cluster representatives can be selected in any way, but in this work, we use the furthest point first (FPF) algorithm [17]. We use FPF as it performs well in practice and because our theoretical analysis relies on the maximum intra-cluster distance, which FPF provides a guarantee on. TASTI subsequently computes the distances of all embedded records to the cluster representatives and stores the rank ordering by distance (Figure 1b). TASTI can also improve its index as the target DNN is invoked over the data as queries are issued (i.e., TASTI's indexes can naturally be cracked [25]). We describe full details in Section 4.
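To make the cracking step concrete, a minimal sketch of adding a new cluster representative after a query has invoked the target DNN; the class and method names are illustrative, not TASTI's actual API, and embeddings are assumed to be NumPy vectors:

```python
import numpy as np

class TastiIndex:
    """Toy index: embeddings plus cached distances to each cluster representative."""
    def __init__(self, embeddings):
        self.embeddings = np.asarray(embeddings, dtype=float)
        self.rep_ids = []    # indices of cluster representatives
        self.rep_dists = []  # one distance array per representative

    def add_representative(self, idx):
        # "Cracking": a record annotated by the target DNN during a query
        # becomes a new representative; only one distance pass over the
        # embeddings is needed, which is trivially parallelizable.
        self.rep_ids.append(idx)
        d = np.linalg.norm(self.embeddings - self.embeddings[idx], axis=1)
        self.rep_dists.append(d)
```
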
Query processing.
In this work, we show how to use TASTI's indexes to generate proxy scores for query processing algorithms that require such scores, specifically for aggregation [29], selection [3, 30, 31, 33], and limit queries [29].

We show an overview of TASTI's query processing procedure in Figure 1c. Given the structured output of the target DNN on the cluster representatives, TASTI will produce proxy scores on these records and propagate scores to the remainder of the records. For example, suppose the user issued an aggregation query for the number of cars in a frame. On the cluster representatives, TASTI will return the number of cars as predicted by the target DNN as the score, and these scores will be propagated to the remainder of the frames. Downstream query processing occurs over the scores; some query processing algorithms will additionally query the target DNN. We describe full details in Section 4.

In contrast, prior work uses query-specific, ad-hoc proxy models to generate proxy scores. We show that TASTI can generate higher quality proxy scores, which results in more accurate or faster downstream query processing (Section 6).
As a concrete example, consider constructing an index for visual data, in which queries over object types and positions are issued. In this case, the target DNN takes an unstructured frame of video and returns a structured set of records that contains fields about the positions and types of objects in the frame. Consider two queries, one of which counts the number of cars per frame (aggregation query) and one that selects frames with cars (selection query). To understand how the index construction procedure and query processing work, we describe the intuition below.
Index construction.
Each target DNN and induced schema requires a heuristic of "close" and "far" records, either as a Boolean function or as a cutoff based on a continuous distance measure. One such heuristic for this application is to group frames with the same number of objects and similar positions together. The grouping of "close" frames can be specified in pseudocode as follows:

    def IsClose(frame1: List[Boxes], frame2: List[Boxes]) -> bool:
        if len(frame1) != len(frame2):
            return False
        return all_close(frame1, frame2)

where all_close is a helper function that returns true if all boxes in frame1 have a corresponding "close" box in frame2. Given the notion of close frames, TASTI will train an embedding DNN via the triplet loss, which separates "close" and "far" frames. Then, TASTI will compute embeddings over all frames of the video and select a set of frames to annotate with the target DNN. In this case, TASTI will store the object types and positions for the cluster representatives as predicted by the target DNN.

Given the annotated frames, TASTI will compute the embedding ℓ2 distance from all other frames to these annotated frames. TASTI will also store the k closest frames. These distances will be subsequently used in query processing.

Query processing.
Given the structured records from the target DNN output, TASTI will generate scores on the cluster representatives. Then, TASTI will propagate these scores to the remainder of the records. While TASTI provides a default method for doing so, a developer can also provide custom functions for propagation. We describe the intuition behind two example queries.
Approximate aggregation.
Suppose the user issues a query for the average number of cars per frame, as studied by BlazeIt [29]. To optimize this query, BlazeIt trains a proxy model to estimate the number of cars per frame. This proxy model is then used to generate a query-specific proxy score per frame, which is the estimate from the proxy model. BlazeIt then uses these scores as a "control variate" [19] to reduce the variance in sampling.

In contrast, TASTI computes the query-specific proxy score as the weighted average of the number of cars in the k closest annotated frames (we defer pseudocode to Section 4). This will produce an estimate of the number of cars in a given unannotated frame. These scores are then coalesced and can be used by BlazeIt's query processing algorithm to optimize approximate aggregation.

Approximate selection.
Suppose the user issues a query to select 90% of frames with cars with 95% probability of success, as studied by the "recall target" setting in SUPG [31]. SUPG requires a probability that a frame contains a car. As with BlazeIt, SUPG will use a proxy model to estimate whether a record matches the predicate.

To compute this probability (i.e., the query-specific proxy score), TASTI will compute the weighted average as above, except that annotated frames that contain a car receive a score of 1 and annotated frames that do not contain a car receive a score of 0.

We now discuss TASTI's index construction method (Section 3) and TASTI's query processing method (Section 4).
INDEX CONSTRUCTION

We describe how TASTI constructs task-agnostic indexes given the target DNN. Recall that many queries require only a low-dimensional representation of data records to answer, such as object types and positions (as opposed to raw pixels in a video). Furthermore, in many applications, this low-dimensional representation has a natural notion of closeness. TASTI attempts to construct representations that reflect these heuristics by grouping close records and separating far records. We show a schematic of the training in Figure 1a and the index construction in Figure 1b. We note that TASTI's training procedure is optional: pre-trained embeddings can also be used for the index if training is expensive.
TASTI optionally trains a task-agnostic mapping between data records (e.g., frames of a video) and semantic, task-agnostic embeddings. The semantic embeddings have the desideratum that data records with similar extracted attributes are close in embedding space, and vice versa for records with dissimilar extracted attributes. For example, consider queries over object type and position. A frame with a single car in the upper left should be close to another frame with a single car in the upper left, but far from a frame with two cars in the bottom right.

We describe our training method via domain-specific triplet losses and show a schematic in Figure 1a.
Domain-specific triplet loss.
To train the embedding DNN to produce embeddings that fulfill our desiderata, TASTI uses the triplet loss [38]. The triplet loss takes an anchor point, a positive example (i.e., a close example), and a negative example (i.e., a far example). It penalizes examples where the anchor point and the positive point are further apart than the anchor point and the negative point. Formally, the per-example triplet loss is defined as

ℓ_T(x_a, x_p, x_n; ϕ, m) = max(0, m + |ϕ(x_a) − ϕ(x_p)| − |ϕ(x_a) − ϕ(x_n)|)

for some embedding function ϕ.

A key choice in using the triplet loss is selecting points that are "close" and those that are "far." This choice is application-specific, but many applications have natural choices. For example, any two frames of video with different numbers of objects may be far. Furthermore, frames with the same number of objects, but where the objects are far apart, may also be considered far.

Training data selection (FPF mining).
Training via the triplet loss requires invocations of the target DNN to determine whether pairs of records are close or not. Due to the cost of the target DNN, TASTI must sample records to be selected for training; we assume the user provides a budget of target DNN invocations. While TASTI could randomly sample data points, randomly sampled points may mostly contain redundant records (e.g., a majority of empty frames) and miss rare events. We empirically show that randomly sampling training data results in embeddings that perform well on average, but can perform poorly on rare events (Section 6).

To produce embeddings that perform well across queries, we would ideally sample a diverse set of data records. For example, suppose 80% of a video were empty: selecting frames at random would mostly sample empty frames. Selecting frames with a variety of car numbers and positions would be more sample efficient.

When available, TASTI uses a DNN pre-trained on other semantic data to select such diverse points. We note that these pre-trained DNNs are widely available, e.g., DNNs pre-trained on ImageNet [22] or on large text corpora (BERT) [12]. Pre-trained DNNs produce embeddings that are typically semantically meaningful, although typically not adapted to the specific set of queries.

To produce training data that results in embeddings that perform well on rare events, TASTI performs the following selection procedure. First, TASTI uses a pre-trained DNN to generate embeddings over the data records. Then, TASTI executes the FPF algorithm to select the training data. TASTI constructs triplets from the training data via target DNN annotations.
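To make the triplet construction concrete, a minimal (and naive, cubic-time) sketch of mining triplets from annotated records and evaluating the per-example triplet loss over embedding vectors; is_close stands in for the closeness heuristic over target DNN annotations, and all names are illustrative rather than TASTI's actual implementation:

```python
import numpy as np
from itertools import combinations

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Per-example loss: max(0, m + |a - p| - |a - n|) with L2 distances.
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_ap - d_an)

def mine_triplets(records, annotations, is_close):
    """Form (anchor, positive, negative) index triples from annotated records."""
    triplets = []
    for a, p in combinations(range(len(records)), 2):
        if not is_close(annotations[a], annotations[p]):
            continue  # (a, p) must be a "close" pair
        for n in range(len(records)):
            if n not in (a, p) and not is_close(annotations[a], annotations[n]):
                triplets.append((a, p, n))
    return triplets
```

In practice the embedding DNN would be trained by minimizing triplet_loss over mined triplets with gradient descent; this sketch shows only the data side.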
TASTI produces clusters via the embedding DNN. As we describe in Section 4, TASTI propagates annotations/scores from cluster representatives to unannotated data records.

A key choice is deciding which data records to select as cluster representatives. Similar to selecting training data, TASTI could select a set of cluster representatives at random. While random sampling for cluster representatives may do well on average at query time, it may perform poorly on rare events.

To address this issue, TASTI selects cluster representatives via furthest point first (FPF). FPF iteratively chooses the furthest point from the existing set of cluster representatives as the newest representative. FPF is both computationally efficient and provides a 2-approximation to the optimal maximum intra-cluster distance. Intuitively, FPF chooses points that are diverse in embedding space. If the embeddings are semantically meaningful, then FPF will select data records that are diverse. Finally, we mix in a small fraction of random clusters, which helps queries sensitive to average-case performance.

TASTI stores the distances of all embeddings to each cluster representative. As we describe in Section 4, TASTI uses the k nearest cluster representatives for query processing.

In contrast to prior work, which can only share work between queries in an ad-hoc manner, TASTI can naturally be extended with "cracking" functionality [25]. In particular, when any query executes the target DNN, TASTI can cache the target DNN result. The records over which the target DNN is executed can then be added as new cluster representatives. Computing the distance to a new cluster representative is computationally efficient and trivially parallelizable.
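The FPF selection above can be sketched in a few lines; this is a minimal greedy implementation over NumPy embedding vectors (the function name and seed choice are illustrative), maintaining each point's distance to its nearest selected representative:

```python
import numpy as np

def fpf_select(embeddings, num_reps, seed_idx=0):
    """Greedy furthest-point-first: returns indices of selected representatives."""
    embeddings = np.asarray(embeddings, dtype=float)
    reps = [seed_idx]
    # Distance from every point to its nearest selected representative so far.
    min_dist = np.linalg.norm(embeddings - embeddings[seed_idx], axis=1)
    while len(reps) < num_reps:
        nxt = int(np.argmax(min_dist))  # furthest point from the current reps
        reps.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return reps
```

Each iteration is a single vectorized distance pass, so selecting C representatives over N points costs O(N · C) distance computations.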
Suppose there are N data records, D dimensions, L training iterations, and a total target DNN budget of C. Denote the costs of the target DNN, embedding DNN, and distance computation as c_T, c_E, and c_D respectively. Then, the total cost of index construction is O(C · c_T + L · c_E + N · c_E + NCD · c_D), assuming the cost of a training iteration is proportional to the cost of the forward pass [28].

The ratio of these steps depends on the relative computational costs. In many applications, the cost of embedding is less expensive than the cost of the target DNN. For example, the Mask R-CNN we use in this work executes at 3 fps, compared to the embedding DNN, which executes at 12,000 fps.

QUERY PROCESSING

Given an index, how can TASTI accelerate query processing? To execute a query, TASTI will construct query-specific proxy scores for the data records that can then be passed to existing proxy score-based algorithms. These query-specific proxy scores are an approximation of the result of executing the target DNN on the data records for the particular query. Consider an aggregation query counting the average number of cars per frame [29]. In this case, the query-specific proxy scores would be an estimate of the number of cars in a given frame.

Many downstream query processing techniques require only query-specific proxy scores and the target DNN. For example, certain types of selection without guarantees (e.g., binary detection) [3, 30, 33], selection with statistical guarantees [31], aggregation [29], and limit queries [29] require only query-specific proxy scores and the target DNN. We describe concrete examples in Section 4.3.

We assume TASTI is provided methods that take target DNN outputs and produce a numeric score; this can be done automatically in many cases, but a developer may also implement custom functions to produce such scores.
We describe the interface for these scores, describe how TASTI propagates query-specific proxy scores, and give examples of uses of proxy scores for query processing. We show a schematic of the query processing procedure in Figure 1c.
In order to execute queries, a developer must specify a query-specific scoring function, which takes the output of the target DNN and returns a query-specific proxy score. We note that query processing systems must implement such a function even without using TASTI's indexes, so this is a natural requirement. Concretely, consider the query of counting the average number of cars per frame. The scoring function would return the number of cars as predicted by the target DNN.

The API for specifying scoring functions is as follows. Denote the type of the output of the target DNN as TargetDNNOutput (e.g., a list of bounding boxes) and the type of the score as ScoreType (e.g., a float). Using Python typing, the developer would implement:

    def Score(target_output: TargetDNNOutput) -> ScoreType

These functions can often be implemented in a few lines of code. Concretely, we show the pseudocode for the example above:

    def CountCarScore(boxes: Sequence[Boxes]) -> int:
        return len([box for box in boxes
                    if box.object_type == 'car'])
Other queries, e.g., over object positions, can be implemented similarly with a few lines of code.
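For instance, a sketch of a position-based scoring function; the Box dataclass and its fields below are assumptions for illustration (the target DNN's actual output type may differ), mirroring the CountCarScore example:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Box:
    object_type: str
    x: float  # assumed: normalized center x-coordinate in [0, 1]

def count_cars_left_half(boxes: Sequence[Box]) -> int:
    # Score: number of cars whose center lies in the left half of the frame.
    return len([b for b in boxes if b.object_type == 'car' and b.x < 0.5])
```
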
Given the query-specific scoring functions, TASTI will execute the scoring functions on the cluster representatives (as the target DNN outputs are available for these data records). In order to execute downstream query processing, TASTI must also materialize approximate scores for the remainder of the data records.

To produce these query-specific proxy scores, TASTI will propagate scores from the cluster representatives to the unannotated records. Given k, the score for each data record will be the distance-weighted mean of the nearest k cluster representatives for numeric scores. For categorical scores, TASTI will take the distance-weighted majority vote. Since the distances to cluster representatives are cached, this process is computationally efficient.

A developer may also implement a custom method of propagating scores. We show an example of such a method in Section 6.3.

We provide examples of the query-specific scoring functions, score propagation, and downstream query processing for several classes of queries below.
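A minimal sketch of distance-weighted propagation for numeric scores, assuming the scores and distances of a record's k nearest representatives are already cached; inverse-distance weighting is our assumption here, since the text specifies only a "distance-weighted mean":

```python
def propagate_score(rep_scores, rep_dists, eps=1e-8):
    """Distance-weighted mean of the k nearest cluster representatives' scores."""
    weights = [1.0 / (d + eps) for d in rep_dists]  # closer reps weigh more
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, rep_scores)) / total
```

The categorical case would replace the mean with a weighted majority vote over the representatives' labels.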
Approximate aggregation.
Consider the example of counting the average number of cars per frame, as studied by BlazeIt [29]. The scoring function would take the detected boxes in a frame and return the count of the boxes matching "car," as shown above. For k = 1, the query-specific proxy score would be the count for the nearest cluster representative; for k > 1, it would be the distance-weighted mean count of the nearest k cluster representatives for a given frame.

The query-specific proxy scores can be used to answer the query with statistical error bounds, e.g., used as a control variate by BlazeIt's query processing algorithm. The scores could also be used to directly answer the query.

Selection.
Consider a query that selects all frames of a video with a car, as studied by prior work [3, 30, 31, 33]. The scoring function would take the detected boxes in a frame and return 0 if there are no cars and 1 if there is a car in the frame. The query-specific proxy score can be smoothed for k > 1. Frames can then be selected via the proxy scores (e.g., frames with a value above some threshold, either ad-hoc or computed over some validation set), as other systems do [3, 30, 33].

Limit queries.
Consider a query that selects 10 frames containing at least 5 cars [29]. Such queries are often used to manually study rare events. In this case, the scoring function and query-specific proxy scores would be the same as for aggregation. For limit queries, we generally recommend using k = 1, since this query is typically focused on ranking rare events. The query processing algorithm will examine frames with the target DNN as ordered by the query-specific proxy scores. The algorithm will terminate once the requested number of frames is found.
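A hedged sketch of this proxy-ordered scan, assuming precomputed proxy scores and a target_dnn_count callable that runs the (expensive) target DNN on one frame; both names are hypothetical:

```python
def limit_query(proxy_scores, target_dnn_count, min_cars, limit):
    """Examine frames in descending proxy-score order until `limit` matches."""
    order = sorted(range(len(proxy_scores)),
                   key=lambda i: proxy_scores[i], reverse=True)
    matches = []
    for i in order:
        if target_dnn_count(i) >= min_cars:  # invoke the expensive target DNN
            matches.append(i)
            if len(matches) == limit:
                break  # terminate once the requested number of frames is found
    return matches
```
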
We use the notation of Section 3.4. Computing the query-specific proxy scores requires C calls of the scoring function and O(N · k) arithmetic operations, as distances to cluster representatives are cached. In all applications we consider, this procedure is orders of magnitude more efficient than executing the target DNN, which often requires billions of floating point operations on an accelerator.

THEORETICAL ANALYSIS

We present a statistical performance analysis of our index construction and query processing to better understand resulting query quality. Intuitively, if the original data records have a metric structure and the triplet loss recovers this structure, we expect downstream queries to behave well. We formalize this intuition by analyzing how downstream queries behave under the triplet loss. We specifically analyze the case where k = 1.

Notation.
We define the set of data records as D := {x_1, ..., x_N}, the scoring function f : D → ℝ, and the embedding function ϕ : D → ℝ^d. Denote the cluster representatives as R := {x_r : r ∈ R} ⊂ D for some index set R ⊂ {1, ..., N}. Given this set, we denote the representative mapping function as c : D → R, which maps a data record to the nearest cluster representative, and the query-specific scores as f̂(x) := f(c(x)).

Suppose there is a query-specific loss function ℓ_Q(x_i, y_i) : D × ℝ → ℝ, where y_i ∈ ℝ is the predicted label. ℓ_Q will be used to evaluate the quality of f and f̂ as ℓ_Q(x, f(x)) and ℓ_Q(x, f̂(x)). We define the per-example triplet loss as

ℓ_T(x_a, x_p, x_n; ϕ, m) := max(0, m + |ϕ(x_a) − ϕ(x_p)| − |ϕ(x_a) − ϕ(x_n)|),

where we omit ϕ and m where clear. Define the ball of radius M as B_M(x) = {x′ : d(x, x′) < M} and its complement B̄_M(x). For random variables x_a ∼ D, x_p ∼ B_M(x_a), and x_n ∼ B̄_M(x_a) drawn uniformly from the sets, we define the population triplet loss as

L(ϕ; M, m) := E_{x_a, x_p, x_n}[ℓ_T(x_a, x_p, x_n; ϕ, m)]   (1)

for some margin m > 0.

Assumptions and properties.
We make the following assumptions. We first assume that there is a metric d(x_i, x_j) on D and that D is compact with metric d. We further assume that ℓ_Q(x, y) is Lipschitz in x and y with constant K_Q / 2 in both arguments.

For both of our proofs, we assume the triplet loss is low and the cluster representatives are dense enough under ϕ. A low triplet loss controls the quality of the embeddings with respect to the original metric d. The density of the cluster representatives controls how close the unannotated records are to the cluster representatives in the original space.

Example.
As a concrete example, consider the video example described in Section 2. D is the set of frames, ϕ is the trained embedding DNN, and we use the metric induced by the notion of closeness also described in Section 2. Consider two queries: an aggregation query counting the number of cars and a selection query selecting frames of cars. For the aggregation query, f maps frames to the number of cars. For the selection query, f maps frames with cars to 1 and frames without cars to 0.

To theoretically analyze our index and query processing algorithms, we first consider the case where the embedding achieves zero triplet loss (we generalize to non-zero loss below). We show the following positive result: using the query-specific proxy scores in this setting will achieve bounded loss. In fact, for ℓ_Q that are identically 0 (e.g., for the example above), TASTI will achieve exact results.

Theorem and proof.
We now prove the main theorem for the zero-loss case.

Theorem 1 (Zero loss).
Let ϕ be an embedding that achieves L(ϕ; M, m) = 0 and c be such that max_{x ∈ D} |ϕ(x) − ϕ(c(x))| < m. Then, the query procedure will suffer an expected loss gap of at most

E[ℓ_Q(x, f̂(x))] ≤ E[ℓ_Q(x, f(x))] + M · K_Q.   (2)

Lemma 1. If ℓ_T(x_a, x_p, x_n) = 0 for all x_a ∈ D, x_p ∈ B_M(x_a), x_n ∈ B̄_M(x_a), then for all x_i, x_r such that |ϕ(x_i) − ϕ(x_r)| < m, we have d(x_i, x_r) < M.

Proof. We prove the contrapositive: d(x_i, x_r) ≥ M implies |ϕ(x_i) − ϕ(x_r)| ≥ m. Let x_a = x_i, x_n = x_r, and x_p ∈ B_M(x_i); B_M(x_i) must be nonempty since x_i ∈ B_M(x_i). Then ℓ_T(x_a, x_p, x_n) = 0 implies the inequality

0 ≥ m + |ϕ(x_a) − ϕ(x_p)| − |ϕ(x_a) − ϕ(x_n)|,

so |ϕ(x_a) − ϕ(x_n)| ≥ m + |ϕ(x_a) − ϕ(x_p)| ≥ m. □

Proof of Theorem 1. Since the triplet loss is bounded below by 0, L(ϕ; M, m) = 0 implies ℓ_T(x_a, x_p, x_n) = 0 for all x_a, x_n such that d(x_a, x_n) > M. By Lemma 1 with x_i = x and x_r = c(x), and since the maximum intra-cluster embedding distance is at most m, we have d(x, c(x)) < M for all x ∈ D. Then for every x:

|ℓ_Q(x, f(x)) − ℓ_Q(x, f̂(x))|   (3)
≤ |ℓ_Q(x, f(x)) − ℓ_Q(c(x), f(c(x)))| + |ℓ_Q(c(x), f(c(x))) − ℓ_Q(x, f(c(x)))|   (4)
≤ M · K_Q   (5)

This follows from the definition of f̂, the Lipschitz condition of ℓ_Q, and the non-negativity of ℓ_Q. The theorem follows from taking expectations. □

We generalize our analysis to the non-zero loss case below. We show that the loss in queries is bounded by the triplet loss and several other natural quantities.

Theorem 2 (Non-zero loss).
Consider an embedding ϕ that achieves L(ϕ; M, m) = α and a clustering c such that max_{x ∈ D} |ϕ(x) − ϕ(c(x))| < m. Assume that the query loss ℓ_Q is upper bounded by C. Then, the query procedure will suffer an expected loss gap of at most

E[ℓ_Q(x, f̂(x))] ≤ E[ℓ_Q(x, f(x))] + M · K_Q + C · (sup_x |B̄_M(x)| / m) · α.   (6)

Lemma 2. P[ inf_{x′ ∈ B̄_M(x)} |ϕ(x) − ϕ(x′)| ≤ |ϕ(x) − ϕ(x_p)| ] ≥ P[ d(x, c(x)) > M ] for any distribution of x_p such that the conditional distribution of x_p given x has support on B_M(x).

Proof. Recall that c(x) := argmin_{x_r ∈ R} |ϕ(x) − ϕ(x_r)|, and that d(x, c(x)) > M implies c(x) ∈ B̄_M(x). Then, we have that inf_{x′ ∈ B̄_M(x)} |ϕ(x) − ϕ(x′)| ≤ |ϕ(x) − ϕ(x_p)| for all x_p ∈ B_M(x). This gives us the lemma. □

Lemma 3. (1/m) ℓ_T(x, x_p, x_n) ≥ 1{ |ϕ(x) − ϕ(x_n)| ≤ |ϕ(x) − ϕ(x_p)| } for any x_n ∈ B̄_M(x), x_p ∈ B_M(x).

Proof.

(1/m) ℓ_T(x, x_p, x_n) = (1/m) · max(0, m + |ϕ(x) − ϕ(x_p)| − |ϕ(x) − ϕ(x_n)|)
= max(0, 1 − (|ϕ(x) − ϕ(x_n)| − |ϕ(x) − ϕ(x_p)|) / m)
≥ 1{ |ϕ(x) − ϕ(x_n)| ≤ |ϕ(x) − ϕ(x_p)| },

which follows from the hinge dominating the indicator. □

Proof of Theorem 2.
Consider the indicator 1{d(x, c(x)) ≤ M} and its complement 1{d(x, c(x)) > M}. We analyze

E[ℓ_Q(x, f̂(x))] = E[ℓ_Q(x, f̂(x)) · 1{d(x, c(x)) ≤ M}] + E[ℓ_Q(x, f̂(x)) · 1{d(x, c(x)) > M}].

By Theorem 1 and the fact that indicators are bounded above by 1, we have that

E[ℓ_Q(x, f̂(x)) · 1{d(x, c(x)) ≤ M}] ≤ E[ℓ_Q(x, f(x))] + M · K_Q.

To bound the remaining term, we observe that

(sup_x |B̄_M(x)| / m) · E[ℓ_T(x, x_p, x_n)]
= (1/m) · E_{x, x_p}[ (sup_x |B̄_M(x)| / |B̄_M(x)|) Σ_{x′_n ∈ B̄_M(x)} ℓ_T(x, x_p, x′_n) / |B̄_M(x)| · |B̄_M(x)| ]
≥ (1/m) · E_{x, x_p}[ sup_{x′_n ∈ B̄_M(x)} ℓ_T(x, x_p, x′_n) ]
≥ E[ inf_{x′_n ∈ B̄_M(x)} (1/m) ℓ_T(x, x_p, x′_n) ]
≥ E[ 1{ inf_{x′ ∈ B̄_M(x)} |ϕ(x) − ϕ(x′)| ≤ |ϕ(x) − ϕ(x_p)| } ]
≥ P[ d(x, c(x)) > M ],

which follow from Lemmas 2 and 3. Taking expectations, using Hölder's inequality, and maximizing ℓ_Q gives us the result. □

We have shown that many classes of queries will have bounded loss (i.e., bounded discrepancy from exact answers). However, we note that our analysis has several gaps from our exact procedure. First, we use the nearest k cluster representatives to generate the query-specific proxy scores. Second, the triplet loss may be large in practice. Third, not all queries admit Lipschitz losses. Nonetheless, we believe our analysis provides intuition for why TASTI outperforms even recent state-of-the-art. We defer a more detailed analysis to future work.

We evaluate TASTI on four real-world video and text datasets. We describe the experimental setup and baselines. We then demonstrate that TASTI's index construction is less expensive than recent state-of-the-art, that it outperforms on all settings we consider, that all components of TASTI are required for performance, and that TASTI is not sensitive to hyperparameter settings.
Datasets, target DNNs, and triplet loss.
We consider three video datasets and one text dataset. We use the night-street, taipei, and amsterdam videos as used by BlazeIt [29]. night-street is widely used in video analytics evaluations [9, 29, 30, 39] and taipei has two object classes (car and bus). We use Mask R-CNN as the target DNN and ResNet-18 as our embedding DNN. The triplet loss separates frames with objects that are far apart and frames with different numbers of objects.

For the text dataset, we use a semantic parsing dataset [42]. The dataset consists of pairs of natural language questions and corresponding SQL statements. We assume the SQL statements are not known at query time and must be annotated by crowd workers (i.e., that crowd workers are the "target DNN"). We use BERT [12] for the embedding DNN. We consider queries over SQL operators and number of predicates. The triplet loss separates questions with different operators and numbers of predicates.

Queries and metrics.
We evaluate TASTI and recent state-of-the-art on three general classes of queries: aggregation, selection, and limit queries.

For aggregation queries, we query for the approximate count of some statistic of the target DNN executed on the unstructured data records. We compute the average number of objects per frame for the video datasets and the average number of predicates per query for the WikiSQL dataset. For all methods and datasets, we use the EBS sampling algorithm as used by the BlazeIt system [29], which provides guarantees on error. EBS sampling is adaptive with respect to how well the query-specific proxy scores are correlated with the target DNN, so better proxy scores will result in fewer target DNN invocations. As a result, we measure the number of target DNN invocations (lower is better).

For selection queries, we execute approximate selection queries with recall targets (SUPG queries [31]). Given a target DNN invocation budget, these queries return a set of records matching a predicate with a given recall target at a given confidence level (e.g., "return 90% of instances of cars with 95% probability"): such queries are useful in scientific applications or mission-critical settings [31]. In contrast to queries that do not provide statistical guarantees, SUPG guarantees the recall target is satisfied with high probability. Furthermore, in contrast to limit queries, SUPG guarantees the recall of the returned set, as opposed to returning a set number of records. We select for frames with objects for the video datasets and natural language questions that are parsed into selection SQL statements for the WikiSQL dataset. Since SUPG queries fix the number of target DNN invocations, we measure the false positive rate at the recall target settings (lower is better).

For limit queries, we use the ranking algorithm proposed in [29]. This ranking algorithm will examine data records that are likely to match the predicate of interest in descending order by the proxy score.
Proxy scores that have high recall for a given number of records will perform better. The algorithm will execute the target DNN on all proposed records until the requested number of records is found. As such, we measure the number of target DNN invocations (lower is better).
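As a much-simplified sketch of the recall-target idea: on an oracle-labeled sample, pick a proxy-score threshold that meets the target empirically. The actual SUPG algorithm additionally uses importance sampling and confidence intervals so the guarantee holds on the full dataset with high probability; `recall_threshold` is our illustrative name, not SUPG's API.

```python
import math

def recall_threshold(proxy, sample_idx, sample_labels, target=0.9):
    """Choose a proxy-score threshold whose *empirical* recall on an
    oracle-labeled sample meets the target: keep at least a `target`
    fraction of the sampled positives above the threshold."""
    pos = sorted((proxy[i] for i, y in zip(sample_idx, sample_labels) if y),
                 reverse=True)
    if not pos:
        return float("-inf")    # no positives observed: return everything
    k = math.ceil(target * len(pos))
    return pos[k - 1]
```

Records with proxy score at or above the threshold form the returned set; the false positive rate of that set is the quality metric we report.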
Methods evaluated.
We use the exact setup of BlazeIt and SUPG for the respective query types where possible. These systems train a proxy model at query time using an ad-hoc loss. We use the exact proxy models for the video datasets (a "tiny ResNet") and FastText embeddings [7] for the WikiSQL dataset. FastText embeddings are less expensive than BERT embeddings.

Throughout, we refer to TASTI using a pre-trained DNN as the embedding DNN as "TASTI-PT" (pre-trained) and TASTI using a triplet-loss trained embedding DNN as "TASTI-T" (trained). We demonstrate that TASTI-T generally outperforms TASTI-PT.
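The triplet training that distinguishes TASTI-T can be sketched as follows: triplets are mined from the target DNN's annotations on the training records (here only object counts; the loss described above also separates frames by object position), and the embedding is trained with the standard hinge triplet loss. Names are illustrative, not TASTI's actual API.

```python
import random
import numpy as np

def sample_triplet(frame_ids, counts, rng):
    """Sample an (anchor, positive, negative) triplet from target-DNN
    annotations: positives share the anchor's object count,
    negatives differ."""
    a = rng.choice(frame_ids)
    pos = [f for f in frame_ids if f != a and counts[f] == counts[a]]
    neg = [f for f in frame_ids if counts[f] != counts[a]]
    if not pos or not neg:
        return None             # anchor has no valid positive/negative
    return a, rng.choice(pos), rng.choice(neg)

def triplet_loss(phi_a, phi_p, phi_n, m=1.0):
    """Per-example hinge triplet loss:
    max(0, m + |phi_a - phi_p| - |phi_a - phi_n|)."""
    return max(0.0, m + np.linalg.norm(phi_a - phi_p)
                      - np.linalg.norm(phi_a - phi_n))
```

In training, the loss from sampled triplets is backpropagated through the embedding DNN so that frames with the same extracted attributes land close together.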
Hardware and timing.
We evaluate TASTI on a private server with a single NVIDIA V100 GPU, 2 Intel Xeon Gold 6132 CPUs (56 hyperthreads), and 504 GB of memory. In contrast to prior work, we time the end-to-end process of loading video and executing the embedding DNN when measuring wall-clock times for TASTI. Our code is open-source: https://github.com/stanford-futuredata/tasti
Figure 2: Breakdown of time to construct indexes for TASTI and for BlazeIt on the night-street dataset. The BlazeIt index is the "target-model annotated set" (TMAS) [29]. Similar results hold for other datasets.
Due to the large cost of executing the target DNN, we cache target DNN results and compute the average execution time of the target DNN. For baselines, we only time the target DNN computation and exclude the computational cost of proxy models, which strictly improves the baselines. We exclude the cost of query processing, as it is negligible in all cases (over 3 orders of magnitude less expensive for all queries we consider).
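The caching described above can be as simple as memoizing the target DNN on record identifiers. A minimal sketch, with a fake annotation standing in for a Mask R-CNN invocation:

```python
import functools

calls = {"n": 0}   # counts actual (uncached) inference invocations

@functools.lru_cache(maxsize=None)
def target_dnn(frame_idx):
    """Stand-in for an expensive target-DNN call; results are memoized
    so repeated queries over the same frames pay the inference cost
    only once."""
    calls["n"] += 1
    return frame_idx % 3   # fake per-frame object count
```

Repeated calls with the same frame index return the cached annotation without re-running inference, which is what makes repeated-query experiments tractable.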
To understand index construction performance, we measure the wall-clock time to construct TASTI indexes. We compare to BlazeIt, which effectively constructs indexes by executing the target DNN on a subset of the data (referred to as the "TMAS" in [29]). For BlazeIt, we only consider the cost of constructing the TMAS. For TASTI, we measure the full index construction time, including the embedding DNN time and the distance computation time. We compute the construction times on the night-street dataset; similar results hold for other datasets.

We show the breakdown of index construction time for TASTI and BlazeIt in Figure 2 using the parameters in Section 6.3. As shown, TASTI is substantially faster than BlazeIt, costing 10× less. The reduced cost is due to fewer invocations of the target DNN. We additionally show index construction time vs. performance for BlazeIt and a range of parameters for TASTI in Figure 3. As shown, TASTI can outperform or match BlazeIt with up to 10× less expensive index construction times.

We show that TASTI outperforms recent state-of-the-art for approximate aggregation, selection with guarantees, and limit queries. For all video datasets in this section, we use 3,000 training records, 7,000 cluster representatives, and an embedding size of 128. For the WikiSQL dataset, we use 500 training examples and 500 cluster representatives. For the remainder of the experiments, we do not include the cost of constructing the index. Other work also excludes this cost [3, 29–31, 33], and TASTI can construct indexes substantially faster than prior work.
Figure 3: Index construction time vs. performance of TASTI and BlazeIt for aggregation queries on the night-street dataset. Similar results hold for other datasets.

Figure 4: Number of target DNN invocations for baselines and TASTI for aggregation queries (lower is better). As shown, TASTI outperforms baselines in all cases, including prior state-of-the-art by up to 2×.

Figure 5: False positive rate for recall-target SUPG queries (lower is better). We show the performance of baselines and TASTI. As shown, TASTI outperforms baselines in all cases.

Approximate aggregation. For approximate aggregation queries, we compare TASTI (with and without training) to standard random sampling and an ad-hoc trained proxy model. We use the exact experimental setup of BlazeIt [29] for the video datasets (the exact code) and BERT embeddings for the WikiSQL dataset. We aggregate over the average number of cars per frame for all video datasets (TASTI also outperforms on bus for taipei).

We show results in Figure 4. As shown, TASTI outperforms for aggregation queries on all datasets. In particular, TASTI outperforms recent state-of-the-art optimized for aggregation queries (BlazeIt) by up to 2× while maintaining less expensive index construction costs. Further, compared to standard random sampling, TASTI outperforms by up to 3×.

TASTI's improved performance comes from better query-specific proxy scores (ρ of 0.91 vs. 0.55). As the correlation of the proxy scores with the target DNN increases, the control variate's variance decreases. Reduced variance results in fewer samples, as the EBS stopping algorithm is adaptive with the variance.
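The variance-reduction mechanism can be sketched as a standard control-variate estimator (a simplification of the actual EBS-based procedure; the function name is ours):

```python
import numpy as np

def cv_estimate(sample_true, sample_proxy, proxy_mean):
    """Control-variate estimate of the mean target-DNN statistic.
    The proxy mean over the *full* dataset is cheap to compute; the
    sampled proxy values correct the sampled true values. Variance
    shrinks as corr(true, proxy) grows."""
    cov = np.cov(sample_true, sample_proxy)
    c = cov[0, 1] / cov[1, 1]                     # optimal coefficient
    return float(np.mean(sample_true)
                 - c * (np.mean(sample_proxy) - proxy_mean))
```

With a perfectly correlated proxy the estimator returns the full-dataset proxy mean exactly (zero variance); weaker correlation interpolates toward plain random sampling.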
Selection.
For selection queries with statistical guarantees (SUPG queries), we compare TASTI (with and without training) to an ad-hoc trained proxy model (standard random sampling is not appropriate for SUPG queries). We use the exact same experimental setup as SUPG [31] for the video datasets and BERT embeddings for the WikiSQL dataset. For all queries, we use a recall target of 90% with a confidence of 95%, as used in [31]. We search for cars in night-street and amsterdam, buses in taipei (TASTI also outperforms for cars), and star operators for WikiSQL.

As shown in Figure 5, TASTI outperforms on all datasets. In particular, TASTI can improve the false positive rate by almost 2× over recent state-of-the-art. We further show that triplet training improves performance. As with aggregation queries, TASTI's improved performance comes from better query-specific proxy scores (ρ of 0.90 vs. 0.79).

Limit queries.
For limit queries, we use the ranking algorithm proposed by BlazeIt [29]. We use the exact same experimental setup as BlazeIt [29] for the video datasets (including the query configurations, e.g., number of objects, etc.) and BERT embeddings for the WikiSQL dataset. For limit queries, we use a custom scoring function, which is the regular scoring function with k = 1.

As shown in Figure 6, TASTI outperforms by up to 34× compared to recent state-of-the-art. As we demonstrate in Section 6.7, TASTI's FPF mining and FPF clustering are critical for performance when searching for rare events (Figures 9 and 10). The FPF algorithm naturally produces clusters that are far apart, which is beneficial when searching for rare events.
Figure 6: Number of target DNN invocations for baselines and TASTI for limit queries (lower is better). As shown, TASTI outperforms baselines in all cases, including prior state-of-the-art by up to 34×.

Figure 7: SUPG queries for selecting objects of interest on the left-hand side of the frame. This query violates the Lipschitz condition, but TASTI still outperforms baselines.

Figure 8: Aggregation query for the average x position of objects in frames of a video. Recent state-of-the-art is not well suited for this query, as regression can be difficult for proxy models. In contrast, TASTI performs well on these queries.

Dataset        Method   Query      Quality metric
night-street   TASTI    Agg.       3.3%
night-street   BlazeIt  Agg.       4.4%
night-street   TASTI    Selection  5.5
night-street   NoScope  Selection  14.9

Table 1: Performance of TASTI and baselines on queries without statistical guarantees (lower is better). The quality metrics are percent error and 100 − F1 score for aggregation and selection queries, respectively. TASTI outperforms on all settings we considered.
In addition to the queries above, we demonstrate that TASTI can be used to efficiently answer queries that prior work is not well suited for. We consider two queries over positions of objects in video. These tasks require modified data preprocessing or losses for proxy models, but TASTI can naturally produce proxy scores for both tasks.
Selecting objects by position.
We consider the query of selecting objects on the left-hand side of the video, as measured by the average x-position of the bounding box. We compare TASTI to training a proxy model by extending SUPG, and to TASTI without triplet training. Results are shown in Figure 7.

We note that prior proxy models were not designed to take position into account, which may explain their poor performance: there is a sharp discontinuity in the labels at the center of the frame. Learning this boundary may require large amounts of training data. In contrast, TASTI performs well on this query as it uses the information from the target DNN, despite the query violating the assumptions in our theoretical analysis. As shown, TASTI outperforms both baselines.
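The scoring function for this query is simple to express over the target DNN's output. A sketch, assuming a hypothetical (x_min, y_min, x_max, y_max) box format:

```python
def left_half_score(boxes, frame_width):
    """Return 1 if any detected box's center falls in the left half
    of the frame, else 0. Note the sharp discontinuity at the center
    line, which is what makes this hard for pixel-based proxy
    models trained from scratch."""
    for x_min, _, x_max, _ in boxes:
        if (x_min + x_max) / 2.0 < frame_width / 2.0:
            return 1
    return 0
```

Because TASTI's index stores target-DNN annotations for the cluster representatives, a new scoring function like this one can be evaluated without retraining anything.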
Average position.
We consider the query of computing the average position of objects in frames of video (specifically the x-coordinate). We compare TASTI to random sampling and to TASTI without triplet training. We attempted to train a BlazeIt proxy model by regressing the output to the average position but were unable to train a model that outperformed random sampling. We note that BlazeIt was not configured for such queries and that we are unaware of work on proxy models for pure regression. We show results in Figure 8. As shown, TASTI outperforms random sampling by up to 3×, without having to implement custom training code for a new proxy model.

In addition to queries with statistical guarantees, we executed aggregation and selection queries without statistical guarantees. For aggregation queries, we used the proxy scores to directly compute the statistic of interest and measured the percent error from the ground truth. For selection queries, we used the proxy scores to select records above some threshold. As some selection queries are class imbalanced, we measured 100 − F1 score (so lower is better). We show results for TASTI and for proxy model-based baselines in Table 1. As shown, TASTI outperforms on quality metrics for all settings we considered, indicating that TASTI's proxy scores are higher quality.

Dataset        1st query  2nd query  Quality metric
night-street   Agg.       SUPG       4.9% (8.6%)
taipei         Agg.       SUPG       40.1% (55.9%)
night-street   SUPG       Agg.       18.9k (21.2k)
taipei         SUPG       Agg.       34.6k (39.1k)

Table 2: Performance of TASTI after cracking. We measured query performance of a SUPG/aggregation query after cracking (false positive rate and number of target DNN invocations; lower is better for both). Results after cracking are shown along with results before cracking in parentheses. TASTI improves results in all settings we tested.

Figure 9: Factor analysis, in which optimizations are added in sequence. As shown, all optimizations improve performance for aggregation queries. For limit queries, FPF training and clustering are required for triplet training to improve performance.
We further demonstrate that TASTI's index can easily be used for "cracking" [25]. To show this, we executed an aggregation query followed by a SUPG query, and vice versa. We used the target DNN annotations from the first query to improve TASTI's index before executing the second query. We use the same quality/runtime metrics as for queries with statistical guarantees.

As shown in Table 2, TASTI improves in performance for both queries. In particular, TASTI can improve (decrease) the false positive rate for SUPG queries by up to 1.7× after repeated queries.

We investigated whether all of TASTI's components contribute to performance. We find that all components of TASTI (triplet loss, FPF mining, and FPF clustering) are critical to performance.
Factor analysis.
We first performed a factor analysis, in which we began with no optimizations and added the triplet loss, FPF mining, and FPF clustering in turn. For brevity, we show results for night-street for aggregation and limit queries. Aggregation queries highlight "average-case" performance and limit queries highlight "rare-event" performance. We choose night-street as it has been widely studied in visual analytics [9, 29–31, 39]; other datasets have similar behaviors.

As shown in Figure 9, all optimizations help performance. In particular, FPF clustering substantially improves limit query performance, as it selects frames that are semantically distinct.

Figure 10: Lesion study, in which optimizations are removed individually (they are not removed cumulatively). As shown, all optimizations improve performance. Recall that lower is better for both aggregation and limit queries.

Figure 11: Number of cluster representatives vs. performance on aggregation and limit queries on the night-street dataset. As shown, TASTI outperforms baselines on a range of parameter settings.
Lesion study.
We then performed a lesion study, in which we start with all optimizations and remove each optimization individually (triplet loss, FPF mining, and FPF clustering). As with the factor analysis, we show results for night-street for aggregation and limit queries; other datasets have similar behaviors.

We show results in Figure 10. As shown, triplet training significantly improves aggregation performance. Furthermore, FPF clustering is critical for limit query performance.
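The FPF (farthest-point-first) clustering referenced above is the classic greedy k-center heuristic of Gonzalez; a sketch over embedding vectors:

```python
import numpy as np

def fpf_representatives(embeddings, k, first=0):
    """Greedily pick k cluster representatives: start from `first`,
    then repeatedly take the point farthest from all representatives
    chosen so far. The chosen points are maximally spread out, which
    helps cover rare events."""
    reps = [first]
    # distance from every point to its nearest chosen representative
    dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        reps.append(nxt)
        dist = np.minimum(dist,
                          np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return reps
```

Each record is then assigned to its nearest representative; the same farthest-first rule can also drive which training records to annotate (the "FPF mining" step).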
We investigated whether TASTI is sensitive to hyperparameters by varying the number of training examples, number of cluster representatives, and embedding size. As we show, TASTI outperforms baselines on a wide range of parameter settings, demonstrating that hyperparameters are not difficult to select.
Number of buckets.
A critical parameter that determines TASTI's performance is the number of buckets in the index. To understand the effect of the number of buckets on performance, we varied the number of buckets and measured performance on aggregation and limit queries on the night-street dataset. We use 3,000, 5,000,
Figure 12: Number of training examples vs performance onaggregation and limit queries on the night-street dataset.As shown, TASTI outperforms baselines on a range of pa-rameter settings. Embedding dimension20000250003000035000 T a r g e t D NN i n v o c a t i o n s a) Aggregation TASTI-TBlazeIt Embedding dimension10002000 b) Limit
Figure 13: Embedding size vs performance on aggregationand limit queries on the night-street dataset. TASTI out-performs baselines on a range of parameter settings. × less expensive than the baseline. Number of training examples.
To understand how the number of training examples affects the performance of TASTI, we used 1,000, 2,000, 3,000, 4,000, and 5,000 training examples. We measured performance on aggregation and limit queries on the night-street dataset. As shown in Figure 12, the performance of TASTI does not significantly change with the number of training examples. TASTI outperforms baselines across all settings we consider.
Embedding size.
To understand how the embedding size affects TASTI's performance, we varied the embedding size and measured performance on aggregation and limit queries on the night-street dataset. We used embedding sizes of 32, 64, 128, 256, and 512.

We show results in Figure 13. As shown, aggregation performance does not substantially change with the embedding size. However, the performance of SUPG precision-target queries does vary with the embedding size. We hypothesize the difference primarily comes from the downstream query processing procedures. For aggregation queries, only the correlation of the proxy and target DNN affects performance. In contrast, calibration affects the performance of SUPG queries. As the embedding size decreases, we hypothesize that the aggregate correlation remains approximately constant, but the calibration may change. We defer further investigation to future work.
Structured indexes.
There is a long history in the database literature of indexes for structured data [37]. These indexes are generally used to accelerate lookups on certain columns. Techniques range from tree-like structures [6, 11, 23] to hash tables [16]. However, these indexes assume that the data is present in a structured format, which is not the case for the data we consider.
Unstructured indexes.
The analytics community has also long studied indexes for unstructured data. Many of these indexing methods are modality-specific, such as indexes for spatial data [10, 18], time series [2, 13], and low-level visual features [14, 36]. Other indexes accelerate KNN search in possibly high dimensions, when the distances are meaningful [26, 40]. Work in retrieval has used indexed embeddings to accelerate search for semantically similar items, in particular for visual data [4, 32, 41]. While this work also accelerates queries over unstructured data, our work differs in focusing on constructing proxy scores to address the unique challenges that arise when queries require executing expensive target DNNs.
Coresets.
Several communities, including the theory and deep learning communities, have considered coresets [1]. Coresets are concise summaries of data. They have been used for nearest neighbor searches [20], streaming data [8], active learning [35], and other applications. We are unaware of work that trains embedding DNNs in a task-agnostic manner for the construction of indexes.
DNN-based queries.
Recent work in the database community has focused on accelerating DNN-based queries. Many systems have been developed to accelerate certain classes of queries, including selection without statistical guarantees [3, 24, 30, 33], selection with statistical guarantees [31], aggregation queries [29], limit queries [29], tracking queries [5], and a range of other queries. These systems aim to reduce the cost of expensive target DNNs, often by using cheap, query-specific proxy models. In this work, we propose a general index to accelerate many such queries over the schema induced by the target DNN. We leverage many of the downstream query processing techniques used in this prior work.

Other work assumes that the target DNN is not expensive to execute or that extracting bounding boxes is not expensive [15, 41]. We have found that many applications require accurate and expensive target DNNs, so we focus on reducing executions of the target DNN wherever possible.
To reduce the cost of queries using expensive target DNNs, we introduce a method of constructing task-agnostic, trainable indexes for unstructured data. TASTI relies on the key property that many queries only require a low-dimensional representation of unstructured data records. TASTI uses an embedding DNN and target DNN-annotated cluster representatives as its index, which allows for more accurate and generalizable proxy scores across a range of query types. We theoretically analyze TASTI to understand its statistical accuracy. We show that these indexes can be constructed up to 10× more efficiently than recent work. We further show that they can be used to answer queries up to 24× more efficiently than recent state-of-the-art. We hope that our work serves as a starting point for other indexing and query processing techniques over unstructured data.

REFERENCES

[1] Pankaj K Agarwal, Sariel Har-Peled, Kasturi R Varadarajan, et al. 2005. Geometric approximation via coresets.
Combinatorial and computational geometry
52 (2005),1–30.[2] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. 1993. Efficient similaritysearch in sequence databases. In
International conference on foundations of dataorganization and algorithms . Springer, 69–84.[3] Michael R Anderson, Michael Cafarella, Thomas F Wenisch, and German Ros.2019. Predicate Optimization for a Visual Analytics Database.
ICDE (2019).[4] Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. 2014.Neural codes for image retrieval. In
European conference on computer vision .Springer, 584–599.[5] Favyen Bastani, Songtao He, Arjun Balasingam, Karthik Gopalakrishnan, Mo-hammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, and SamMadden. 2020. MIRIS: Fast Object Track Queries in Video. In
Proceedings of the2020 ACM SIGMOD International Conference on Management of Data . 1907–1921.[6] Rudolf Bayer and Edward McCreight. 1970. Organization and maintenance oflarge ordered indexes. In
SIGFIDET Workshop on Data Description, Access andControl .[7] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. En-riching Word Vectors with Subword Information.
Transactions of the Associationfor Computational Linguistics arXiv preprint arXiv:1612.00889 (2016).[9] Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim,David Andersen, Michael Kaminsky, and Subramanya Dulloor. 2019. ScalingVideo Analytics on Constrained Edge Nodes.
SysML (2019).[10] Paolo Ciaccia, Marco Patella, and Pavel Zezula. 1997. M-tree: An E cient AccessMethod for Similarity Search in Metric Spaces. In
Proceedings of the 23rd VLDBconference, Athens, Greece . Citeseer, 426–435.[11] Douglas Comer. 1979. Ubiquitous B-tree.
ACM Computing Surveys (CSUR)
11, 2(1979), 121–137.[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:Pre-training of deep bidirectional transformers for language understanding. arXivpreprint arXiv:1810.04805 (2018).[13] Christos Faloutsos, Mudumbai Ranganathan, and Yannis Manolopoulos. 1994.Fast subsequence matching in time-series databases.
Acm Sigmod Record
23, 2(1994), 419–429.[14] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, QianHuang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic,et al. 1995. Query by image and video content: The QBIC system. computer
28, 9(1995), 23–32.[15] Daniel Y Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong,Avanika Narayan, Maneesh Agrawala, Christopher Ré, and Kayvon Fatahalian.2019. Rekall: Specifying video events using compositions of spatiotemporal labels. arXiv preprint arXiv:1910.02993 (2019).[16] Hector Garcia-Molina. 2008.
Database systems: the complete book . PearsonEducation India.[17] Teofilo F Gonzalez. 1985. Clustering to minimize the maximum interclusterdistance.
Theoretical computer science
38 (1985), 293–306.[18] Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching.In
Proceedings of the 1984 ACM SIGMOD international conference on Managementof data . 47–57.[19] John Michael Hammersley and David Christopher Handscomb. 1964. Generalprinciples of the Monte Carlo method. In
Monte Carlo Methods . Springer, 50–75.[20] Sariel Har-Peled and Soham Mazumdar. 2004. On coresets for k-means andk-median clustering. In
Proceedings of the thirty-sixth annual ACM symposium onTheory of computing . 291–300.[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn.In
ICCV . IEEE, 2980–2988.[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residuallearning for image recognition. In
CVPR . 770–778.[23] Joseph M Hellerstein, Jeffrey F Naughton, and Avi Pfeffer. 1995.
Generalizedsearch trees for database systems . September. [24] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Paramvir Bahl, MatthaiPhilipose, Phillip B Gibbons, and Onur Mutlu. 2018. Focus: Querying Large VideoDatasets with Low Latency and Low Cost.
OSDI (2018).[25] Stratos Idreos, Martin L Kersten, Stefan Manegold, et al. 2007. Database Cracking..In
CIDR , Vol. 7. 68–78.[26] Hosagrahar V Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang.2005. iDistance: An adaptive B+-tree based indexing method for nearest neighborsearch.
ACM Transactions on Database Systems (TODS)
30, 2 (2005), 364–397.[27] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and IonStoica. 2018. Chameleon: scalable adaptation of video analytics. In
Proceedings ofthe 2018 Conference of the ACM Special Interest Group on Data Communication .ACM, 253–266.[28] Daniel Justus, John Brennan, Stephen Bonner, and Andrew Stephen McGough.2018. Predicting the computational cost of deep learning models. In . IEEE, 3873–3882.[29] Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. BlazeIt: Optimizing Declara-tive Aggregation and Limit Queries for Neural Network-Based Video Analytics.
PVLDB (2019).[30] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017.NoScope: optimizing neural network queries over video at scale.
PVLDB
10, 11(2017), 1586–1597.[31] Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia.2020. Approximate Selection with Guarantees using Proxies.
PVLDB (2020).[32] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. 2015. Deeplearning of binary hash codes for fast image retrieval. In
Proceedings of the IEEEconference on computer vision and pattern recognition workshops . 27–35.[33] Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018.Accelerating Machine Learning Inference with Probabilistic Predicates. In
SIG-MOD . ACM, 1493–1508.[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn:Towards real-time object detection with region proposal networks. In
NIPS .[35] Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neuralnetworks: A core-set approach. arXiv preprint arXiv:1708.00489 (2017).[36] John R Smith and Shih-Fu Chang. 1997. VisualSEEk: a fully automated content-based image query system. In
Proceedings of the fourth ACM international confer-ence on Multimedia . 87–98.[37] Jeffrey D Ullman. 1984.
Principles of database systems . Galgotia publications.[38] Kilian Q Weinberger and Lawrence K Saul. 2009. Distance metric learningfor large margin nearest neighbor classification.
Journal of Machine LearningResearch
10, 2 (2009).[39] Tiantu Xu, Luis Materon Botelho, and Felix Xiaozhu Lin. 2019. VStore: A DataStore for Analytics on Large Videos. In
Proceedings of the Fourteenth EuroSysConference 2019 . ACM, 16.[40] Cui Yu, Beng Chin Ooi, Kian-Lee Tan, and HV Jagadish. 2001. Indexing thedistance: An efficient method to knn processing. In
VLDB , Vol. 1. 421–430.[41] Yuhao Zhang and Arun Kumar. 2019. Panorama: a data system for unboundedvocabulary querying over video.
Proceedings of the VLDB Endowment
13, 4 (2019),477–491.[42] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: GeneratingStructured Queries from Natural Language using Reinforcement Learning.