Auctus: A Dataset Search Engine for Data Augmentation

Fernando Chirigati, Rémi Rampin, Aécio Santos, Aline Bessa, Juliana Freire
New York University
{fchirigati, remi.rampin, aecio.santos, aline.bessa, juliana.freire}@nyu.edu

ABSTRACT
Machine Learning models are increasingly being adopted in many applications. The quality of these models critically depends on the input data on which they are trained, and by augmenting their input data with external data, we have the opportunity to create better models. However, the massive number of datasets available on the Web makes it challenging to find data suitable for augmentation. In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. Our prototype, named Auctus, automatically discovers datasets on the Web and, different from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries. Auctus is already being used in a real deployment environment to improve the performance of ML models. The demonstration will include various real-world data augmentation examples, and visitors will be able to interact with the system.
PVLDB Reference Format:
F. Chirigati, R. Rampin, A. Santos, A. Bessa, and J. Freire. Auctus: A Dataset Search Engine for Data Augmentation. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
Machine Learning (ML) models are widely used in many applications, including traffic and weather prediction, news and product recommendation, spam filtering, and fraud detection, to name a few. These models contain two key components: the input data and an estimator algorithm. The estimator uses a subset of the input data attributes, the features (independent variables), to predict the value of a target attribute (dependent variable). The quality of these models critically depends on the input data. In fact, studies have shown that more data leads to higher classification accuracy and that the accuracy of different learning algorithms converges with increasing amounts of training data [3].
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
Figure 1: (a) An ML model to predict taxi demand in NYC. This model can be improved through data augmentation, by adding (b) taxi data for other time periods, or (c) features that influence demand (e.g., weather information).

Augmenting Data to Increase Accuracy.
Consider, for example, the problem of predicting the number of taxis in New York City (NYC). As shown in Figure 1(a), the input data contains taxi data for the first half of 2017, and the attributes include the date and time of the trip pickup (pickup_datetime), the location of the pickup (LocationID), and the number of trips (n. trips) for a particular time and location. Here, the first two attributes are the features and the last one is the target variable. The estimator is a random forest regressor that achieves a mean absolute error (MAE) of 66.67 on the test data (the smaller the error, the better). To reduce this error, we can try augmenting the original data. We can add new records to include data for time periods other than Jan-Jun 2017, which will likely introduce examples that are more representative and diverse. By adding data for Jul-Dec 2017, the error decreases to 64.36 (see Figure 1(b)). Another strategy to further decrease the error is to include features that can affect taxi demand, such as temperature, visibility, and precipitation. As shown in Figure 1(c), with these new features, the error goes down to 39.30, which represents a substantial improvement.
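The two augmentation strategies in Figure 1 correspond to a union (adding rows) and a join (adding feature columns). As a sketch with pandas, using synthetic example rows rather than the actual NYC taxi records, the two operations look like this:

```python
import pandas as pd

# Original input: taxi trips for Jan-Jun 2017 (synthetic example rows).
taxi = pd.DataFrame({
    "pickup_datetime": ["2017-01-01 09:00", "2017-06-30 18:00"],
    "LocationID": [41, 7],
    "n_trips": [120, 95],
})

# (b) Union: concatenate records for Jul-Dec 2017 to add training examples.
taxi_h2 = pd.DataFrame({
    "pickup_datetime": ["2017-07-04 12:00"],
    "LocationID": [41],
    "n_trips": [150],
})
unioned = pd.concat([taxi, taxi_h2], ignore_index=True)

# (c) Join: attach weather features keyed on the pickup time.
weather = pd.DataFrame({
    "pickup_datetime": ["2017-01-01 09:00", "2017-06-30 18:00", "2017-07-04 12:00"],
    "Temp": [30.1, 84.0, 88.5],
    "Precip": [0.0, 0.2, 0.0],
})
augmented = unioned.merge(weather, on="pickup_datetime")
```

The augmented table keeps the original features and target while gaining Temp and Precip as additional features for the regressor.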
Opportunities and Challenges.
Our increasing ability to collect, transmit, and store data, coupled with the growing trend towards openness, has led to an explosion in the number of datasets available on the Web. By discovering relevant data, we have the opportunity to create more accurate models. However, finding these datasets is challenging because they are spread over a large number of Web sites and repositories. Recognizing this challenge, Google has recently released Google Dataset Search [17], a search engine for datasets on the Web. Their approach, however, has important limitations for data augmentation. First, it only indexes datasets published in Web pages that have metadata annotated with the schema.org vocabulary. Second, even when this metadata is available, it can be incomplete or inconsistent. For instance, consider the 2018 Green Taxi Trip Data available at NYC Open Data (https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i). The name of this dataset suggests that the data corresponds to the year 2018. However, by further inspecting it, we can find records dating back to 2008: if we use only the title to capture such information, many relevant records may be missed. Another limitation is that the provided query interface only allows keyword-based queries over the metadata. Using this interface, a potentially large number of datasets will be returned, but many may be irrelevant for augmentation purposes: it is important to support queries that, given an initial dataset D, return datasets that can be concatenated or joined with D.

Our Approach.
In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. The goal is to, given an input dataset D and an optional query Q, efficiently and effectively discover and rank a set D of datasets from the Web that can be integrated (join or union) with D. To address the limitations discussed above, our approach:

• Discovers datasets on the Web. Our approach does not rely on the availability of schema.org annotations. It can connect to multiple APIs to capture datasets from different repositories. Also, the infrastructure is extensible and allows multiple discovery plugins to be implemented.

• Automatically infers metadata from datasets. Instead of relying on published metadata, our approach profiles datasets as they are discovered in a scalable and consistent way to power the search engine.

• Supports join and union search queries. To support data augmentation tasks, queries can be issued to find datasets that can be either joined or concatenated with D. To make these queries efficient, we create data summaries for the discovered datasets, and use these summaries to perform join and union queries.

We will demonstrate our prototype system Auctus, which is being developed in the context of the DARPA D3M program [1], a program that aims to automatically generate and improve ML models. The prototype has been deployed and is currently being used by different groups that participate in the project. Demo visitors will be able to see how Auctus is being used to improve model performance in D3M, and they will also interact directly with Auctus to query over 19,000 datasets.
2. RELATED WORK
Dataset Search.
Many dataset search engines have recently become available [9]. Cafarella et al. [6] proposed a system to perform keyword search over a corpus of HTML tables. In addition to Google Dataset Search, Google also introduced Goods, a dataset search system to manage data generated and used within the company [13]. There is also a plethora of open data portals based on data donations (e.g., [18, 19, 20]) that provide search capabilities. The search methodology available in all of these engines and portals is largely keyword-based over the metadata published by the data sources, which, as described earlier, can be incomplete or inconsistent. The keyword-based search does not consider potential data augmentation tasks either.

Figure 2: An overview of Auctus. (Components shown: data discovery via Web crawlers and plugins for NOAA, Socrata, and other sources; on-demand ingestion and profiling with statistics, type detection, and data summaries; the search index; and search, augment, and upload operations exposed through a Web interface and Python/HTTP APIs.)
Data Integration.
Data augmentation for structured data requires finding tables for data integration (join and union), a problem that has been commonly studied in the database community. However, existing work is focused on either entity-centric datasets (e.g., HTML tables) [4, 5, 10, 14, 24, 25], i.e., their techniques require the presence of named entities, or primary key/foreign key relationships [12]. On the other hand, datasets published on open data platforms may not have any entities, and the augmentation may not follow a primary key/foreign key relationship (see Figure 1). The aforementioned systems also require the presence of all of the data for finding potential joins and unions, which is unrealistically expensive in the context of a search engine for datasets over the Web. Some systems are not suitable for real-time queries [5, 10], or assume that the relationship between the input D and the repository has been precomputed [4, 10]. In our context, this assumption may not hold: D is potentially an unknown dataset to the search engine.

Nargesian et al. [15] proposed a framework for finding top-k datasets that can be concatenated (union) with an external input dataset (i.e., not present in the framework's repository). They used different notions of "unionability" (e.g., based on value intersection and embedding vectors), and various indices to improve search performance. However, their approach is focused on textual attributes only.
3. THE AUCTUS PROTOTYPE
An overview of the Auctus prototype is depicted in Figure 2. In what follows, we describe its key components.

Data Discovery. In Auctus, we developed plugins to retrieve datasets (i.e., the data itself and any available metadata) from specific repositories, using their respective APIs. Currently, we have support to obtain datasets from Socrata [22], which provides a platform for open government data; Zenodo [?], which contains open-access data deposited by researchers around the world; and the World Bank Open Data [23], which provides datasets on global development, among other open data portals. The prototype is extensible, and new plugins can be developed for other repositories. While our approach is different from Google Dataset Search [17], we are currently working on leveraging multiple search engines (including Google Dataset Search) to discover new datasets that are not present in the Auctus index: these search engines will complement our current approach. Finally, Auctus also allows users to upload their own datasets.
Profiling. Once datasets are discovered or uploaded, Auctus profiles each dataset to infer relevant metadata. The reason for that is twofold. First, metadata published on the Web can be incomplete or inconsistent, as described in Section 1. Second, some datasets (including data uploaded by users, and the input data D) may not have associated metadata. For input data, profiling must be done at query time.

As depicted in Figure 2, we perform different profiling tasks, including type detection (categorical, numerical, spatial, and temporal attributes), statistics computation (e.g., frequent items, mean, and variance), and data summarization (explained below). We also store provenance information, i.e., information on how to materialize each dataset. This is useful for users to be able to download these datasets and also to perform the augmentation tasks. Once all of the metadata is captured and indexed, the raw data becomes unnecessary: only the metadata is kept in our search index.

Data Summaries.
An important step of the profiling task is to compute data summaries for the different attributes of a dataset. Instead of storing all of the datasets and using their original data during the search phase (which would be expensive both in terms of storage and for query evaluation), we generate summaries of the data and use these when doing the data integration search. Currently, our prototype creates summaries for categorical, numerical, spatial (GPS data), and temporal attributes, which are the most common in the data augmentation tasks in the DARPA D3M program. We plan to investigate and integrate some of the techniques proposed by Nargesian et al. [15] in the future, in order to provide better support for categorical attributes.

The data summaries generated by Auctus are represented by the ranges of their corresponding attributes. During the search phase, Auctus uses these summaries to estimate the size of the intersection between two attributes, concluding whether a join is feasible. A naïve approach to compute these ranges is to simply consider the smallest and the largest values. For instance, the smallest and largest values of the pickup_datetime attribute (Figure 1) would constitute its range. While this range is accurate, suppose there are no other records from June 2017 except for the one corresponding to the largest value: the range would not capture this information, and it could erroneously indicate that a join between this data and other data from June 2017 is possible. Thus, we need to capture finer-grained ranges to better estimate join intersections. We use the k-means clustering algorithm for computing these ranges, as it produces the desired results while being efficient.
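A minimal sketch of this idea (not Auctus's actual code; the function names are ours, and the hand-rolled 1-D Lloyd's iteration is a deterministic stand-in for a library k-means): ranges are taken per cluster, and join feasibility is then a range-intersection test.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on a 1-D attribute, with evenly spaced
    initial centers so the result is deterministic."""
    vals = sorted(float(v) for v in values)
    if k == 1:
        return [vals]
    centers = [vals[int(i * (len(vals) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vals:
            # Assign each value to its nearest center.
            clusters[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        # Recompute centers; keep the old center for empty clusters.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def attribute_ranges(values, k=3):
    """One (min, max) range per k-means cluster, so sparse gaps in the
    attribute are not bridged by a single coarse range."""
    return [(min(c), max(c)) for c in kmeans_1d(values, k)]

def ranges_intersect(a, b):
    """True if any range of one attribute overlaps any range of the other,
    i.e., a join between the two attributes may be feasible."""
    return any(lo1 <= hi2 and lo2 <= hi1 for lo1, hi1 in a for lo2, hi2 in b)
```

With a naive single range, an isolated value such as 100 would stretch the summary of [1, 5] ∪ {100} to [1, 100] and make it appear joinable with values in [60, 80]; the per-cluster ranges avoid exactly that.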
Once all of the metadata is generated, including the data summaries, it is indexed in an Elasticsearch [11] cluster. Numerical and temporal summaries are indexed using range data types, and spatial summaries are indexed using geo-shape data types. We use Lazo [8], a method for finding joinable datasets in a data lake based on the overlap of their categorical attributes, to store summaries of categorical data that can then be used for the retrieval of related datasets. While this architecture provides reasonable performance for real-time queries, we are currently investigating other data structures to improve query processing efficiency.
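To make this concrete, here is an illustrative sketch of how range summaries could be laid out in Elasticsearch; the field and function names are our assumptions, not Auctus's actual schema. Elasticsearch's range field types (double_range, date_range) support range queries with relation set to "intersects", which corresponds to the summary-intersection test described above:

```python
# Illustrative Elasticsearch mapping for attribute summaries
# (field names are assumptions, not Auctus's actual schema).
summary_mapping = {
    "mappings": {
        "properties": {
            "dataset_id":       {"type": "keyword"},
            "attribute_name":   {"type": "keyword"},
            "numerical_range":  {"type": "double_range"},
            "temporal_range":   {"type": "date_range"},
            "spatial_coverage": {"type": "geo_shape"},
        }
    }
}

def intersecting_ranges_query(lo, hi):
    """Query body for summaries whose numerical range intersects [lo, hi]."""
    return {
        "query": {
            "range": {
                "numerical_range": {"gte": lo, "lte": hi, "relation": "intersects"}
            }
        }
    }
```

The query body would be sent to the cluster with an Elasticsearch client's search call; range fields also accept "within" and "contains" relations when stricter containment is needed.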
Querying and Ranking. Auctus allows users to pose queries of the following form: given an input dataset D and an optional query Q, return a ranked set of datasets D that can be joined with or concatenated to D. Dataset D is first profiled at query time in Auctus. Using the profiled metadata, Auctus performs either a join or a union search, and then ranks all of the search results. If query Q is present, datasets are filtered from the search results based on Q. With Q, users are able to specify keywords and to determine the desired temporal and spatial coverage of the dataset, which helps fine-tune the results.

Join Search.
To find other datasets that can be joined with D, Auctus first searches, for each attribute a of D, which other attributes, in the index that corresponds to a's data type (e.g., temporal), have summaries intersecting the summary of a. Every dataset that has at least one intersecting attribute is a potential join result.

Union Search.
To find other datasets that can be concatenated with D, the indices are searched for any dataset that has attributes with the same data types present in D, as well as similar names. Name similarity, in this case, is accomplished by using the fuzzy query in Elasticsearch, which applies Levenshtein distance for the similarity search. The union search does not require all of the attributes from D to be matched: a result dataset may only match a subset of D.

Ranking.
Given the results of the join and union searches, and after filtering them based on query Q, the datasets are finally ranked and returned as D. Join results are ranked based on the intersection of the summaries: datasets with higher intersection are ranked higher. Union results are ranked based on the Levenshtein similarity between the names of the matching attributes.

Augmentation.
In addition to providing dataset search capabilities for data augmentation, Auctus can also perform the augmentation itself. Users can choose a dataset R ∈ D, and Auctus will materialize it (using the provenance annotated in the metadata) and perform the join or union operation with D, returning the new, augmented dataset A. If multiple attribute pairs match between R and D for a join operation, users can choose which pair(s) they want for the join. For temporal and spatial joins, the attributes are translated into the same resolution before the augmentation.

Auctus has been implemented with scalability in mind: the search engine is entirely containerized using Docker. Each data discovery plugin, for instance, is an independent container, which allows multiple plugins to be executed in parallel. Also, we can spin up as many profiling and query containers as required in response to load. The search engine can be accessed via a Web interface or programmatically via Python and HTTP APIs. So far, we have indexed over 19,000 datasets.
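The resolution translation for temporal joins can be sketched as follows (a simplification of the idea described above, not Auctus's implementation; the function name is ours): both join keys are truncated to a shared granularity before merging.

```python
import pandas as pd

def temporal_join(left, right, left_key, right_key, freq="h"):
    """Join two tables on temporal attributes after translating both
    keys to the same resolution (here, truncation to `freq`)."""
    l = left.assign(_t=pd.to_datetime(left[left_key]).dt.floor(freq))
    r = right.assign(_t=pd.to_datetime(right[right_key]).dt.floor(freq))
    return l.merge(r.drop(columns=[right_key]), on="_t").drop(columns=["_t"])

# Minute-level taxi data joined with hourly weather observations.
taxi = pd.DataFrame({"pickup_datetime": ["2017-01-01 09:15", "2017-01-01 10:40"],
                     "n_trips": [120, 95]})
weather = pd.DataFrame({"datetime": ["2017-01-01 09:00", "2017-01-01 10:00"],
                        "Temp": [30.1, 31.4]})
augmented = temporal_join(taxi, weather, "pickup_datetime", "datetime")
```

Without the floor to a common hourly resolution, none of the minute-level pickups would match the hourly weather keys and the join would be empty.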
4. USE CASES
Visitors will be able to see how Auctus has been used to perform data augmentation for different use cases related to the DARPA D3M program, including:
Vision Zero.
As part of the NYC Vision Zero Initiative [2], an expert from the Department of Transportation is trying to devise policies to reduce the number of traffic fatalities and improve safety in NYC streets. Initially, she uses data about traffic collisions and taxi trips to create a model that predicts the number of collisions; such a model can allow her to explore what-if scenarios. To further improve the model, she looks for weather information, as weather conditions can potentially increase the number of collisions. She uses Auctus to find datasets that have the keyword "weather" and that are related to her input data. Because many collisions also involve cyclists, particularly after the arrival of Citi Bike in NYC, she also searches for datasets having the keyword "citi bike". After performing the augmentations with weather and Citi Bike data, the model achieves good performance for her analysis. A video demonstrating this use case is available at http://bit.ly/auctus-video.

Taxi Demand Prediction.
Predicting the demand for taxis is important for providing better transportation services around the city. An expert from the Taxi and Limousine Commission in NYC decides to build a model using the Yellow Taxi Trip data (Figure 1). Yellow taxis are predominantly present in Manhattan; to take into account other boroughs in NYC, she uses Auctus to find Green Taxi Trip data (which covers other boroughs) that can be concatenated to her original input data. To further improve the model, she also uses Auctus to integrate weather data, as weather has a significant impact on taxi demand in the city.
5. CHALLENGES AND FUTURE WORK
To the best of our knowledge, Auctus is the first dataset search engine for data augmentation. The system has been successfully deployed and is being used by D3M project members. Auctus is under active development, and we are working on many improvements, from new discovery plugins to more sophisticated join and union algorithms. We note that, while our original motivation to build Auctus was to improve ML model performance, the system can also be useful in many data exploration and integration scenarios.

There are many open research challenges. Notably, real data is messy, and noise in data can negatively impact the quality of search results, leading to both false positives and false negatives. Thus, research on techniques that address data noise is needed. There are also many open questions about how to rank datasets. Examples include how to take the reputation of a dataset repository into account, how to diversify results, and how to prioritize augmentations that are more likely to improve accuracy [21].

Auctus is available at https://auctus.vida-nyu.org/.
6. REFERENCES
[1] Data-Driven Discovery of Models (D3M).
[2] NYC Vision Zero.
[3] M. Banko and E. Brill. Scaling to Very Very Large Corpora for Natural Language Disambiguation. In ACL '01, pages 26–33, 2001.
[4] C. S. Bhagavatula, T. Noraset, and D. Downey. Methods for Exploring and Mining Tables on Wikipedia. In IDEA '13, pages 18–26, 2013.
[5] M. J. Cafarella, A. Halevy, and N. Khoussainova. Data Integration for the Relational Web. Proc. VLDB Endow., 2(1):1090–1101, 2009.
[6] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. Proc. VLDB Endow., 1(1):538–549, 2008.
[7] R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining, pages 160–172, 2013.
[8] R. Castro Fernandez, J. Min, D. Nava, and S. Madden. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In ICDE '19, pages 1190–1201, 2019.
[9] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L. D. Ibáñez-Gonzalez, E. Kacprzak, and P. T. Groth. Dataset Search: A Survey. CoRR, abs/1901.00735, 2019.
[10] A. Das Sarma, L. Fang, N. Gupta, A. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. Finding Related Tables. In SIGMOD '12, pages 817–828, 2012.
[11] Elasticsearch: Search Engine Based on the Lucene Library.
[12] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and M. Stonebraker. Aurum: A Data Discovery System. In ICDE '18, pages 1001–1012, 2018.
[13] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang. Goods: Organizing Google's Datasets. In SIGMOD '16, pages 795–806, 2016.
[14] O. Lehmberg, D. Ritze, P. Ristoski, R. Meusel, H. Paulheim, and C. Bizer. The Mannheim Search Join Engine. Journal of Web Semantics, 35:159–166, 2015.
[15] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. Table Union Search on Open Data. Proc. VLDB Endow., 11(7):813–825, 2018.
[16] National Oceanic and Atmospheric Administration.
[17] N. Noy, M. Burgess, and D. Brickley. Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem. In WebConf '19, 2019.
[18] Chicago Data Portal. https://data.cityofchicago.org/.
[19] NYC Open Data. https://nycopendata.socrata.com.
[20] UK Open Data. https://data.gov.uk/.
[21] V. Shah, A. Kumar, and X. Zhu. Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? Proc. VLDB Endow., 11(3):366–379, 2017.
[22] Socrata. https://socrata.com/.
[23] World Bank Open Data. https://data.worldbank.org/.
[24] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. InfoGather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables. In SIGMOD '12, pages 97–108, 2012.
[25] M. Zhang and K. Chakrabarti. InfoGather+: Semantic Matching and Annotation of Numeric and Time-varying Attributes in Web Tables. In