DFS: A Dataset File System for Data Discovering Users
Yasith Jayawardana and Sampath Jayarathna
{yasith, sampath}@cs.odu.edu
Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
Abstract—Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research efficiency. In this paper we propose DFS, a file system to standardize the metadata representation of datasets, and DDU, a scalable architecture based on DFS for semi-automated metadata generation and data recommendation on the cloud. We discuss how DFS and DDU lay the groundwork for automatic dataset aggregation, how they integrate with existing data wrangling and machine learning tools, and explore their implications for datasets stored in digital libraries.
Index Terms—Metadata, Data Recommendation, Data Discovering Users
I. INTRODUCTION
With the advancements in digital technology, researchers have access to a vast amount of data collected for past studies. Such data are utilized by many research communities to fuel entirely new research or to expand on the original study. This practice, termed secondary data analysis (SDA), enables conducting non-experimental research with minimal cost.

Due to the nature of the WWW, not all datasets are regulated by an authority or follow any universally agreed convention. As a result, datasets have inconsistent naming conventions and file formats, making it challenging to understand their semantics. Selecting a dataset for SDA has thus become a complex process that involves searching for datasets, analyzing candidate datasets for applicability, and data wrangling [1]. The required pre-processing varies across file types and data, and cannot be pre-determined without understanding the nature of the data. This exerts a heavy workload on users to discover and pre-process data, consuming time and effort that could be utilized more productively in the presence of a unified semantic representation for datasets.

Another challenge in SDA is ensuring the authenticity of datasets. While cryptography plays a major role in ensuring trust and authenticity of digital content, ensuring the authenticity of datasets has not been explored. Datasets could easily be forged and uploaded to data sharing platforms, and researchers depending on such falsified data could arrive at misleading conclusions in their publications. These errors cannot be cross-referenced back to the data source without proper citation practices. Hence datasets require a mechanism to ensure trust and reference immutability, neither of which is available at present.

Under such constraints, the SDA task can become overly complex, which is detrimental to the quality and efficiency of research.

II. BACKGROUND
At present, the approach for discovery of data is to a large extent based on "Users Discovering Data" (UDD). However, its opposite, "Data Discovering Users" (DDU), is now gaining traction with the introduction of dataset repositories, search platforms, and recommender systems. Commercial retailers increasingly use advanced algorithms including big data analytics, deep learning, deep search, and crowd-sourcing to enable such interactions. Thus, theoretical concepts have been developed to capture connections between agents, products, tools, activities, and transactions, and to construct graph data describing the chains and networks between these elements.
A. Related Work
There have been recent studies attempting to link open datasets based on the presence of schema overlap between datasets [2]. Each linked dataset on the Linked Open Data Cloud was characterized through a set of schema concept labels that describe it. Schema overlaps were identified using a semantico-frequential concept similarity measure and a ranking criterion based on TF-IDF cosine similarity. The mappings between the schema concepts were also obtained. Through this, they obtained an average precision of up to 53% for a recall of 100%.

Another study [3] proposes a mechanism to create linked dataset profiles. Each profile consists of structured dataset metadata describing topics and their relevance, all generated by sampling resources from datasets, extracting topics from reference datasets, and ranking based on graphical models. They created topic profiles for all accessible datasets in the Linked Open Data Cloud and showed that this approach generates accurate profiles even with comparably small sample sizes (10%), while outperforming established topic modelling approaches.

Parekh [4] has proposed an ontology-based semantic metadata paradigm using Semantic Web languages. They defined elements to incorporate information about data identification, spatial extent, temporal extent, data presentation form, data content, and data distribution regarding the dataset, and allowed data providers to select concepts from domain ontologies that best describe the dataset. These selections, along with links to domain ontologies, were stored in a metadata file, thereby generating semantic metadata for datasets that facilitates content-based discovery of datasets irrespective of their locations and formats.

Another study [5] provides a storage-efficient approach to version control for datasets. It states that the amount of storage used is proportional to the speed of recreating or retrieving dataset versions. A suite of inexpensive heuristics was created based on techniques in delay-constrained scheduling and the spanning tree literature. Results show that these heuristics provide efficient solutions in practical dataset versioning scenarios.

B. Secondary Data Analysis (SDA)
A typical SDA task is conducted in several stages. Initially, data needs to be discovered using dataset search or obtained directly from collaborators. The datasets may come from different sources, and should be aggregated for analysis. Next, the data should be loaded into a data manipulation/visualization tool to determine its relevance and identify any preprocessing needed. Following this, the data should be preprocessed: redundant data should be cleaned, and complementary data should be aggregated through identification of matching fields and patterns. If multiple datasets are used, the user needs to determine how to join them into a single data pool. Finally, the user should drop the attributes irrelevant to the hypothesis, and apply any algorithms needed to model their hypotheses to obtain results.
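For concreteness, the sketch below walks through these stages with pandas; the file names, column names, and the join key are hypothetical stand-ins, not artifacts from any particular study.

# A minimal sketch of a typical SDA workflow using pandas.
# File names, column names, and the join key ("subject") are
# hypothetical; a real task substitutes its own datasets.
import pandas as pd

# Stages 1-2: load datasets obtained from different sources.
scores = pd.read_csv("ados_scores.csv")    # e.g., clinical scores
signals = pd.read_csv("eeg_summary.csv")   # e.g., per-subject EEG summaries

# Stage 3: preprocessing - drop duplicates, fill missing values.
scores = scores.drop_duplicates()
signals = signals.fillna(signals.mean(numeric_only=True))

# Stage 4: aggregate complementary data by joining on a matching field.
pool = scores.merge(signals, on="subject", how="inner")

# Stage 5: drop attributes irrelevant to the hypothesis, then model.
pool = pool.drop(columns=["notes"], errors="ignore")
print(pool.describe())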
C. Data Wrangling
Data Wrangler [6] is a tool developed by researchers at Stanford to simplify the task of "data wrangling", which involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. It attempts to automatically infer the transforms required for cleaning and organizing the data, and leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. These semantic data types are probabilistic estimates from the provided data, and are prone to errors. Thus, having rich and accurate semantic data is vital for wrangling the data successfully. In the presence of such semantic data, Data Wrangler can potentially infer the transforms with higher confidence, and apply them automatically to prepare datasets for studies.
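To make the notion of probabilistic semantic types concrete, the sketch below scores a column of raw strings against a few candidate types; the patterns and scoring are simplified assumptions, not Data Wrangler's actual inference logic.

# A simplified sketch of semantic type inference over a column of
# strings. The candidate types and regex patterns are illustrative
# assumptions; real inference (as in Data Wrangler) is far richer.
import re

PATTERNS = {
    "date":    re.compile(r"^\d{2}[-/]\d{2}[-/]\d{4}$"),
    "integer": re.compile(r"^-?\d+$"),
    "float":   re.compile(r"^-?\d+\.\d+$"),
}

def infer_type(values):
    """Return (best_type, confidence) as the fraction of matches."""
    scores = {
        name: sum(bool(p.match(v)) for v in values) / len(values)
        for name, p in PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else ("string", 1.0)

print(infer_type(["01-20-2019", "05-27-2019", "n/a"]))  # ('date', 0.66...)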
D. Automated Machine Learning
There is a research trend towards automatic machine learning, which has led to the development of frameworks such as Auto-Keras [7] and Data Wrangler [6]. Auto-Keras uses Neural Architecture Search (NAS) to select an optimal configuration for training a neural network on the given data, based on Keras [8]. Research efforts spanning many domains have used NAS for automating machine learning [9], [10], [11]. If the dataset semantics are readily available, they could act as a heuristic for optimizing the NAS process in Auto-Keras and simplify the learning process further.
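As an illustration of the intended direction, the sketch below runs a structured-data architecture search with Auto-Keras; note that it uses the later 1.x API rather than the 2018 release cited above, and the data is synthetic.

# A minimal NAS sketch using the Auto-Keras 1.x structured-data API.
# Data here is synthetic; a DFS-aware pipeline would instead load real
# fields, and could seed the search with dataset semantics.
import numpy as np
import autokeras as ak

x = np.random.rand(200, 14)              # e.g., 14 EEG-derived features
y = np.random.randint(0, 2, size=200)    # binary label

clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
clf.fit(x, y, epochs=5)                  # searches architectures, then trains
print(clf.evaluate(x, y))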
E. File Formats and Metadata
Different data types are represented using different file formats specifically optimized for them. CSV/XLSX target tabular data, PNG/MP3/MP4 target multimedia, and PDF/DOCX/HTML target documents. Formats such as CDF/GRIB [12] target storage efficiency, while formats such as RDF [13] provide the ability to store semantic relationships. However, not all file formats can store metadata. Hence, a file system that maintains metadata and links related files together is ideal for supporting a wide variety of datasets.

HDF5 [14] is a file container that supports metadata. It provides file archival, and maintains hierarchical information along with metadata. However, transmitting gigabytes of data across a network just to compare metadata is not an efficient solution. To support data recommendation, aggregation, and clustering while preserving the immutability of data, metadata should be maintained external to the data files. Studies [15] show that researchers without a computer science background prefer integrated single-click graphical programs over the terminal and code. Thus, an efficient metadata format will help in generating text summaries and creating graphical programs that work for any type of dataset, and in turn, improve cross-domain research productivity.

III. HYPOTHESIS
We hypothesize that a standardized metadata format would compensate for the tedious preprocessing and knowledge gathering steps required to understand and interpret poorly documented datasets, by streamlining information management in datasets, introducing dataset versioning, and laying the groundwork for rule-based and machine learning algorithms to generate metadata and data recommendations. Data recommendations based on user interest would connect like-minded users and increase collaboration on datasets in digital libraries. This would revolutionize the current approach to data discovery, sharing, and usage, and by doing so greatly enhance the exploitation of available open-access datasets for research and the realization of societal benefits.

IV. KEY CONCEPTS
The subsections below introduce the key components of the envisioned architecture for managing datasets in digital libraries.
A. Data Discovering Users (DDU)
In DDU, datasets and products derived from interactions will be associated with "Intelligent Semantic Data" (ISD) that can communicate semantic information in response to queries, including access conditions, quality, uncertainties, guidance on applicability, and user feedback. Figure 1 shows an overview of how different components are interconnected in DDU.

Fig. 1. Architecture of DDU including DDU Profilers [16]

Datasets are stored in repositories, which are responsible for handling data replication and version control. The DDU servers maintain metadata files (i.e., metafiles) for each dataset, and index datasets by their fields. The metadata is generated using machine learning and improved through crowd-sourcing. The ISD Agents expose APIs for users to query, discover, and fetch datasets. Crowd-sourced metadata is fed into the ISD Agents, which in turn improve the existing metadata on DDU Servers. DDU Profilers maintain interest profile models (IPMs) [16] for tracking user interests and providing intelligent matches. The user-facing component, DDU SaaS, utilizes all the aforementioned components to provide an ecosystem for collaborative research.
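Since DDU does not yet define a concrete API, the following is a purely hypothetical sketch of how a client might query an ISD Agent; the endpoint URL, query parameters, and response shape are all assumptions for illustration.

# Hypothetical client-side query to an ISD Agent. The endpoint URL,
# parameters, and response fields are illustrative assumptions; the
# paper defines the architecture, not a concrete API.
import requests

resp = requests.get(
    "https://isd-agent.example.com/v1/datasets",
    params={"q": "EEG ADOS-2", "keywords": "ASD", "limit": 5},
    timeout=10,
)
for ds in resp.json().get("results", []):
    print(ds["$id"], ds["meta"]["name"])  # e.g., metafile id and name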
B. Dataset File System
The main component of DFS is the metadata file (or metafile), which serves as the entry point to a dataset. Its objective is to capture as much information as possible about the underlying dataset, such that it eliminates the need to rely on external documentation to understand the dataset semantics. Each metafile stores information about the dataset, data files, and data fields. This enables multiple data files to behave as one coherent set of data, and also allows data files to be shared across datasets.

Figure 2 shows a sample metafile (shortened for brevity) in DFS. Each metafile contains a "$schema" field to identify the JSON schema, and the fields "$id" and "meta-version" to uniquely identify a dataset. The "meta-version" increments as the metadata and data files mature over time. The "created" and "modified" fields provide a timeline of any changes to the data or metadata. The "checksum" field maintains the hash of the contents in the "meta" field, which enforces integrity.

{
  "$schema": "https://some.example.com/schema-v1.3.json",
  "$id": "A2BE-FC28-906C-03B7",
  "meta-version": 2,
  "created": "01-20-2019",
  "modified": "05-27-2019",
  "checksum": 43278947328957439805847390257439,
  "meta": {
    "name": "EEG recordings during ADOS-2",
    "description": "...",
    "copyright": ".....",
    "keywords": ["EEG", "ADOS-2", "ASD", "MEDICINE"],
    "authors": [{
      "$id": "023-23-425325",
      "name": "John Doe",
      "affiliation": "Some University",
      "email": "[email protected]"
    }],
    "files": [{
      "$id": 21,
      "path": "./020.json",
      "encoding": "JSON",
      "version": 1,
      "checksum": 0123015035783941274895378,
      "description": "Participant 020",
      "measurement": {
        "name": "Brain Activity",
        "device": "Emotiv EPOC+",
        "units": "mV"
      },
      "fields": [
        {"name": "Fpz", "type": "Float", "description": "Electrode Fpz"},
        {"name": "Oz", "type": "Float", "description": "Electrode Oz"}
      ]
    }],
    "links": [{
      "$type": "ID",
      "description": "Subject ID",
      "fields": [
        {"file_id": 1, "field": "subject"},
        {"file_id": 2, "field": "*"}
      ]
    }]
  }
}

Fig. 2. Sample Metafile in DFS (Shortened for Brevity)

The "meta" field keeps all information related to the dataset. The "name" and "description" fields describe the dataset in a human-readable format, and the "copyright" field stores any copyright information. The "keywords" field indicates the research sub-domains, and is updated dynamically through crowd-sourcing and user profiling. The "authors" field maintains a list of dataset authors, and provides fields to store their "$id", "name", "affiliation", and "email" for authenticity.

The metafile points to data files through the "files" field. Each entry in "files" maintains an "$id" field for internal reference. Its "path" and "encoding" fields store the file path and the format to read the file. The "version" field increments each time the file is changed, and the "checksum" field provides integrity by ensuring that files cannot be arbitrarily modified without invalidating the metadata. The "description" field provides a textual description of the file, and the "measurement" field provides the type of measurement, the device used to measure it, and the units in which the measurements are stored in the file. The "fields" field keeps a list of each field in the file, and includes their type and description to provide better semantics.

Following the "files" field, the "links" field keeps track of the semantic relationships between the fields in data files. For example, if Field X of File A maps to Field Y of File B, this can be stored as a link by providing the link type (e.g., ID), a description of the link, and the fields involved in the link.

As an added benefit, the "$id" and "meta-version" fields provide version control capability and reference immutability, making metafiles an ideal candidate for citation. Citing datasets using DFS would resolve ambiguities caused by evolving datasets.
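The schema does not fix a hash algorithm or serialization for the "checksum" field; the sketch below assumes SHA-256 over a canonical (sorted-key) JSON dump of the "meta" object, purely for illustration.

# Verifying metafile integrity. The hash algorithm (SHA-256) and the
# canonical serialization (sorted-key JSON) are assumptions for
# illustration; DFS itself only mandates that a checksum exist. The
# sample metafile stores the checksum numerically, so the comparison
# format below is likewise an assumption.
import hashlib
import json

def meta_checksum(metafile: dict) -> str:
    canonical = json.dumps(metafile["meta"], sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(metafile: dict) -> bool:
    # Compare the recomputed hash against the stored checksum field.
    return meta_checksum(metafile) == metafile["checksum"]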
V. APPLICATIONS OF DFS

Introducing DFS creates a significant impact on existing and future research. It provides the infrastructure for representing dataset semantics in rich detail, effectively eliminating dependencies on external sources to comprehend them. The most obvious application of DFS is for DDU, where DFS lays the groundwork for semi-automated metadata management. Apart from DDU, we identified several applications of DFS, described in the sections below.
A. SDA Automation
For the data acquisition stage of SDA, DFS provides the groundwork for semantic searching, and could potentially provide targeted results based on the domain and topic of research. DDU realizes this concept by including user interest as a factor in dataset search. For the preprocessing stage, the metafiles serve as a semantic marker to determine the preprocessing needed for each data file in a dataset. Transforms could be inferred from the semantic information in the metafiles, and applied on the source data to obtain the target format. Complementary data could be aggregated, and redundant data could be removed during this process by identifying the semantic similarities between each field. Section V-C discusses in more detail how data could be aggregated using DFS metadata. When evaluating hypotheses, the semantic information could be leveraged by machine learning libraries to identify the type of model required. If neural networks are involved, the semantic information can be used to heuristically tune the model hyper-parameters using NAS.
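As a sketch of what transform inference could look like, the snippet below loads a data file according to its declared encoding and coerces columns to their declared types; the DFS-to-pandas type mapping and the helper name are illustrative assumptions.

# Inferring load-time behavior from DFS field semantics. The mapping
# from DFS types to pandas dtypes is an illustrative assumption; a
# real implementation would cover more encodings and types.
import pandas as pd

DFS_TO_PANDAS = {"Float": "float64", "Int": "int64", "String": "object"}

def load_file(entry: dict) -> pd.DataFrame:
    """Load one entry of a metafile's "files" list using its declared
    encoding, then coerce columns to their declared types."""
    if entry["encoding"] == "CSV":
        df = pd.read_csv(entry["path"])
    elif entry["encoding"] == "JSON":
        df = pd.read_json(entry["path"])
    else:
        raise ValueError(f"unsupported encoding: {entry['encoding']}")
    dtypes = {f["name"]: DFS_TO_PANDAS.get(f["type"], "object")
              for f in entry["fields"] if f["name"] in df.columns}
    return df.astype(dtypes)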
B. Integration With Existing Tools
Figure 3 shows how DDU and DFS can be complemented with existing tools and platforms to streamline data analytics. Here, user queries can be cross-referenced with IPM [16] to conduct user profiling and tune the results. Data Wrangler [6] can leverage the semantic information in DFS to generate the transforms required to pre-process data. Data aggregation, as described in Section V-C, can be used to merge multiple datasets based on relevance, and enables automatic clustering of data. The resulting aggregate data and metadata can then be passed to Auto-Keras [7] to heuristically determine the best way to model the hypothesis of the experiments, effectively producing a fully automated pipeline for data analytics with minimal user intervention. A compact sketch of this pipeline follows the figure below.
Fig. 3. Integration with Data Wrangler and Auto-Keras
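In the sketch, the wrangle() helper is a hypothetical stand-in for the Data Wrangler stage, the aggregation step is reduced to a naive inner concatenation, and only the Auto-Keras calls use a real API.

# Hypothetical end-to-end flow mirroring Fig. 3.
import pandas as pd
import autokeras as ak

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in pre-processing: deduplicate and fill numeric gaps.
    return df.drop_duplicates().fillna(df.mean(numeric_only=True))

def analyze(frames, features, label):
    # Naive aggregation of the wrangled frames on shared columns.
    pool = pd.concat([wrangle(f) for f in frames], join="inner")
    clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
    clf.fit(pool[features].to_numpy(), pool[label].to_numpy(), epochs=5)
    return clf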
C. Dataset Aggregation
Dataset aggregation is the process of comparing datasets using their field information to determine if they could be merged. Studies have proposed methods for calculating dataset similarity using schema overlap [2], and for merging datasets using scalable algorithms [17]. Since DFS metafiles provide information about the data fields and how they are related to each other, datasets could be compared using their metafiles to determine if they could be merged.

Algorithm 1 provides pseudo-code for dataset aggregation using DFS. For each metafile, the fields and their links are represented as a graph. Next, the two graphs are compared using graph similarity algorithms to determine if the datasets are comparable. If so, a join operation is performed on the fields and links of the two metafiles based on the schema overlap. This results in a connected graph which represents links among both datasets. This information is then used to create a new metafile which represents data from both datasets.

Algorithm 1: Dataset Aggregation using Metafiles

function aggregate(α, β):
    if similarity(graph(α), graph(β)) ≤ ε then
        throw error;
    forall γ ← fields(α) do
        forall δ ← fields(β) do
            if overlap(γ, δ) ≥ σ then
                α ← metajoin(α, β, γ, δ);
    return α;

Figure 4 provides a visualization of this aggregation process. Each dataset contains metadata that describes the fields (columns in tabular data), domain, tags, encoding, authors, mode of data extraction, etc., defining the dataset in a structured, standardized format. Comparing the metadata of each file, the aggregation process combines the two datasets into a new dataset with aggregated metadata.

Fig. 4. Dataset Aggregation Process
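For concreteness, below is a simplified, runnable rendering of Algorithm 1 over metafile "meta" objects; the Jaccard-style graph similarity, the (name, type) field-overlap test, the thresholds ε and σ, and the naive metajoin are all assumptions rather than normative DFS behavior.

# A simplified, runnable rendering of Algorithm 1. alpha and beta are
# the "meta" objects of two metafiles. Graph similarity is approximated
# by Jaccard overlap of link edges, field overlap by (name, type)
# equality, and metajoin() by a naive union; EPSILON and SIGMA are
# illustrative thresholds.
EPSILON, SIGMA = 0.1, 0.8

def fields(meta):
    """All field descriptors across the metafile's data files."""
    return [f for entry in meta["files"] for f in entry["fields"]]

def edges(meta):
    """Represent the metafile's fields and links as a set of edges."""
    out = set()
    for link in meta.get("links", []):
        ends = [(f["file_id"], f["field"]) for f in link["fields"]]
        out.update((a, b) for a in ends for b in ends if a < b)
    return out

def similarity(g1, g2):
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

def overlap(f1, f2):
    return 1.0 if (f1["name"], f1["type"]) == (f2["name"], f2["type"]) else 0.0

def metajoin(alpha, beta, g, d):
    """Naive merge: union the file lists and record the join as a link."""
    merged = dict(alpha)
    merged["files"] = alpha["files"] + beta["files"]
    merged["links"] = alpha.get("links", []) + [
        {"$type": "JOIN", "fields": [g["name"], d["name"]]}]
    return merged

def aggregate(alpha, beta):
    if similarity(edges(alpha), edges(beta)) <= EPSILON:
        raise ValueError("datasets are not comparable")
    for g in fields(alpha):
        for d in fields(beta):
            if overlap(g, d) >= SIGMA:
                alpha = metajoin(alpha, beta, g, d)
    return alpha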
DFS and DDU provide a fresh outlook on how data is discovered, wrangled, and used for data analytics and machine learning. With DFS bringing new techniques for dataset aggregation and DDU enabling semi-automated metadata management and user interest profiling, research communities could collaborate efficiently on research and accelerate workflows. As future work, we plan to conduct a comprehensive survey with researchers, data curators, and practitioners, and incorporate their feedback to fine-tune the DFS schema and the DDU architecture. We also plan to assess the cross-domain coverage of DFS by evaluating its compatibility across multiple domains and file types through scenarios and case studies.

REFERENCES
[1] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck, and P. Buono, "Research directions in data wrangling: Visualizations and transformations for usable and credible data," Information Visualization, vol. 10, no. 4, pp. 271–288, 2011.
[2] M. Ben Ellefi, Z. Bellahsene, S. Dietze, and K. Todorov, "Dataset recommendation for data linking: An intensional approach," in The Semantic Web. Latest Advances and New Domains, H. Sack, E. Blomqvist, M. d'Aquin, C. Ghidini, S. P. Ponzetto, and C. Lange, Eds. Cham: Springer International Publishing, 2016, pp. 36–51.
[3] B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl, "A scalable approach for efficiently generating structured dataset topic profiles," in The Semantic Web: Trends and Challenges, V. Presutti, C. d'Amato, F. Gandon, M. d'Aquin, S. Staab, and A. Tordai, Eds. Cham: Springer International Publishing, 2014, pp. 519–534.
[4] V. Parekh, J. Gwo, and T. W. Finin, "Ontology based semantic metadata for geoscience data," in Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04, June 21-24, 2004, Las Vegas, Nevada, USA, H. R. Arabnia, Ed. CSREA Press, 2004, pp. 485–490.
[5] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. Parameswaran, "Principles of dataset versioning: Exploring the recreation/storage tradeoff," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1346–1357, 2015.
[6] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, "Wrangler: Interactive visual specification of data transformation scripts," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI '11. New York, NY, USA: ACM, 2011, pp. 3363–3372. [Online]. Available: http://doi.acm.org/10.1145/1978942.1979444
[7] H. Jin, Q. Song, and X. Hu, "Efficient neural architecture search with network morphism," CoRR, vol. abs/1806.10282, 2018. [Online]. Available: http://arxiv.org/abs/1806.10282
[8] A. Gulli and S. Pal, Deep Learning with Keras. Packt Publishing Ltd, 2017.
[9] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," CoRR, vol. abs/1611.01578, 2016. [Online]. Available: http://arxiv.org/abs/1611.01578
[10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in The European Conference on Computer Vision (ECCV), September 2018.
[11] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," CoRR, vol. abs/1807.11626, 2018. [Online]. Available: http://arxiv.org/abs/1807.11626
[12] F. Luengo, A. S. Cofiño, and J. M. Gutiérrez, "Grid oriented implementation of self-organizing maps for data mining in meteorology," in Grid Computing. Springer, 2004, pp. 163–170.
[13] B. Quilitz and U. Leser, "Querying distributed RDF data sources with SPARQL," in European Semantic Web Conference. Springer, 2008, pp. 524–538.
[14] M. Folk, G. Heber, Q. Koziol, E. Pourmal, and D. Robinson, "An overview of the HDF5 technology suite and its applications," in Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, ser. AD '11. New York, NY, USA: ACM, 2011, pp. 36–47. [Online]. Available: http://doi.acm.org/10.1145/1966895.1966900
[15] G. H. Brimhall and A. Vanegas, "Removing science workflow barriers to adoption of digital geological mapping by using the GeoMapper universal program and visual user interface," US Geological Survey Open File Report, pp. 01–223, 2001.
[16] S. Jayarathna and F. Shipman, "Analysis and modeling of unified user interest," in