DFS: A Dataset File System for Data Discovering Users
Yasith Jayawardana and Sampath Jayarathna
{yasith, sampath}@cs.odu.edu
Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
Abstract—Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research efficiency. In this paper we propose DFS, a file system to standardize the metadata representation of datasets, and DDU, a scalable architecture based on DFS for semi-automated metadata generation and data recommendation on the cloud. We discuss how DFS and DDU lay the groundwork for automatic dataset aggregation, how they integrate with existing data wrangling and machine learning tools, and explore their implications for datasets stored in digital libraries.
Index Terms—Metadata, Data Recommendation, Data Discovering Users
I. INTRODUCTION
With the advancements in digital technology, researchers have access to a vast amount of data collected for past studies. Such data are utilized by many research communities to fuel entirely new research or to expand on the original study. This practice, termed secondary data analysis (SDA), enables conducting non-experimental research with minimal cost.

Due to the nature of the WWW, not all datasets are regulated by an authority or follow any universally agreed convention. As a result, datasets have inconsistent naming conventions and file formats, making it challenging to understand their semantics. Selecting a dataset for SDA has thus become a complex process that involves searching for datasets, analyzing candidate datasets for applicability, and data wrangling [1]. The required pre-processing varies across file types and data, and cannot be pre-determined without understanding the nature of the data. This exerts a heavy workload on users to discover and pre-process data, consuming time and effort that could be utilized more productively in the presence of a unified semantic representation for datasets.

Another challenge in SDA is ensuring the authenticity of datasets. While cryptography plays a major role in ensuring trust and authenticity of digital content, ensuring the authenticity of datasets has not been explored. Datasets could easily be forged and uploaded to data sharing platforms, and researchers depending on such falsified data could arrive at misleading conclusions in their publications. These errors cannot be cross-referenced back to the data source without proper citation practices. Hence datasets require a mechanism to ensure trust and reference immutability, neither of which is available at present.

Under such constraints, the SDA task can become overly complex, which is detrimental to the quality and efficiency of research.

II. BACKGROUND
At present, the approach for discovery of data is to a large extent based on "Users Discovering Data" (UDD). However, its opposite, "Data Discovering Users" (DDU), is now gaining traction with the introduction of dataset repositories, search platforms, and recommender systems. Commercial retailers increasingly use advanced algorithms including big data analytics, deep learning, deep search, and crowd-sourcing to enable such interactions. Thus, theoretical concepts have been developed to capture connections between agents, products, tools, activities, and transactions, and to construct graph data describing the chains and networks between these elements.
A. Related Work
There have been recent studies attempting to link open datasets based on the presence of schema overlap between datasets [2]. Each linked dataset on the Linked Open Data Cloud was characterized through a set of schema concept labels that describe it. Schema overlaps were identified using a semantico-frequential concept similarity measure and a ranking criterion based on TF-IDF cosine similarity. The mappings between the schema concepts were also obtained. Through this, they obtained an average precision of up to 53% for a recall of 100%.

Another study [3] proposes a mechanism to create linked dataset profiles. Each profile consists of structured dataset metadata describing topics and their relevance, all generated by sampling resources from datasets, extracting topics from reference datasets, and ranking based on graphical models. They created topic profiles for all accessible datasets in the Linked Open Data Cloud and showed that this approach generates accurate profiles even with comparably small sample sizes (10%), while outperforming established topic modelling approaches.

Parekh [4] has proposed an ontology-based semantic metadata paradigm using Semantic Web languages. They defined elements to incorporate information about data identification, spatial extent, temporal extent, data presentation form, data content, and data distribution regarding the dataset, and allowed data providers to select concepts from domain ontologies that best describe the dataset. These selections, along with links to domain ontologies, were stored in a metadata file, thereby generating semantic metadata for datasets that facilitates content-based discovery of datasets irrespective of their locations and formats.

Another study [5] provides a storage-efficient approach to version control for datasets. It states that the amount of storage used is proportional to the speed of recreating or retrieving dataset versions. A suite of inexpensive heuristics was created based on techniques in delay-constrained scheduling and the spanning tree literature. Results show that these heuristics provide efficient solutions in practical dataset versioning scenarios.

B. Secondary Data Analysis (SDA)
A typical SDA task is conducted in several stages. Initially, data needs to be discovered using dataset search or obtained directly from collaborators. The datasets may come from different sources, and should be aggregated for analysis. Next, the data should be loaded into a data manipulation/visualization tool to determine its relevance and identify any preprocessing needed. Following this, the data should be preprocessed: redundant data should be cleaned, and complementary data should be aggregated through identification of matching fields and patterns. If multiple datasets are used, the user needs to determine how to join them into a single data pool. Finally, the user should drop the attributes irrelevant to the hypothesis, and apply any algorithms needed to model their hypotheses to obtain results.
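For concreteness, the sketch below walks through these stages with pandas; the file names, column names, and the join key are hypothetical stand-ins, not artifacts from any particular study.

# A minimal sketch of a typical SDA workflow using pandas.
# File names, column names, and the join key ("subject") are
# hypothetical; a real task substitutes its own datasets.
import pandas as pd

# Stages 1-2: load datasets obtained from different sources.
scores = pd.read_csv("ados_scores.csv")    # e.g., clinical scores
signals = pd.read_csv("eeg_summary.csv")   # e.g., per-subject EEG summaries

# Stage 3: preprocessing - drop duplicates, fill missing values.
scores = scores.drop_duplicates()
signals = signals.fillna(signals.mean(numeric_only=True))

# Stage 4: aggregate complementary data by joining on a matching field.
pool = scores.merge(signals, on="subject", how="inner")

# Stage 5: drop attributes irrelevant to the hypothesis, then model.
pool = pool.drop(columns=["notes"], errors="ignore")
print(pool.describe())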
C. Data Wrangling
Data Wrangler [6] is a tool developed by researchers at Stanford to simplify the task of "data wrangling", which involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. It attempts to automatically infer the transforms required for cleaning and organizing the data, and leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. These semantic data types are probabilistic estimates from the provided data, and are prone to errors. Thus, having rich and accurate semantic data is vital for wrangling the data successfully. In the presence of such semantic data, Data Wrangler can potentially infer the transforms with higher confidence, and apply them automatically to prepare datasets for studies.
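To make the notion of probabilistic semantic types concrete, the sketch below scores a column of raw strings against a few candidate types; the patterns and scoring are simplified assumptions, not Data Wrangler's actual inference logic.

# A simplified sketch of semantic type inference over a column of
# strings. The candidate types and regex patterns are illustrative
# assumptions; real inference (as in Data Wrangler) is far richer.
import re

PATTERNS = {
    "date":    re.compile(r"^\d{2}[-/]\d{2}[-/]\d{4}$"),
    "integer": re.compile(r"^-?\d+$"),
    "float":   re.compile(r"^-?\d+\.\d+$"),
}

def infer_type(values):
    """Return (best_type, confidence) as the fraction of matches."""
    scores = {
        name: sum(bool(p.match(v)) for v in values) / len(values)
        for name, p in PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else ("string", 1.0)

print(infer_type(["01-20-2019", "05-27-2019", "n/a"]))  # ('date', 0.66...)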
D. Automated Machine Learning
There is a research trend towards automatic machine learning, which has led to the development of frameworks such as Auto-Keras [7] and Data Wrangler [6]. Auto-Keras uses Neural Architecture Search (NAS) to select an optimal configuration for training a neural network on the given data, based on Keras [8]. Research efforts spanning many domains have used NAS for automating machine learning [9], [10], [11]. If the dataset semantics are readily available, they could act as a heuristic for optimizing the NAS process in Auto-Keras and simplify the learning process further.
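As an illustration of the intended direction, the sketch below runs a structured-data architecture search with Auto-Keras; note that it uses the later 1.x API rather than the 2018 release cited above, and the data is synthetic.

# A minimal NAS sketch using the Auto-Keras 1.x structured-data API.
# Data here is synthetic; a DFS-aware pipeline would instead load real
# fields, and could seed the search with dataset semantics.
import numpy as np
import autokeras as ak

x = np.random.rand(200, 14)              # e.g., 14 EEG-derived features
y = np.random.randint(0, 2, size=200)    # binary label

clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
clf.fit(x, y, epochs=5)                  # searches architectures, then trains
print(clf.evaluate(x, y))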
E. File Formats and Metadata
Different data types are represented using different file formats specifically optimized for them. CSV/XLSX target tabular data, PNG/MP3/MP4 target multimedia, and PDF/DOCX/HTML target documents. Formats such as CDF/GRIB [12] target storage efficiency, while formats such as RDF [13] provide the ability to store semantic relationships. However, not all file formats can store metadata. Hence, a file system that maintains metadata and links related files together is ideal for supporting a wide variety of datasets.

HDF5 [14] is a file container that supports metadata. It provides file archival, and maintains hierarchical information along with metadata. However, transmitting gigabytes of data across a network just to compare metadata is not an efficient solution. To support data recommendation, aggregation, and clustering while preserving the immutability of data, metadata should be maintained external to the data files. Studies [15] show that researchers without a computer science background prefer integrated single-click graphical programs over the terminal and code. Thus, an efficient metadata format will help in generating text summaries and creating graphical programs that work for any type of dataset, and in turn, improve cross-domain research productivity.

III. HYPOTHESIS
We hypothesize that a standardized metadata format would compensate for the tedious preprocessing and knowledge gathering steps required to understand and interpret poorly documented datasets, by streamlining information management in datasets, introducing dataset versioning, and laying the groundwork for rule-based and machine learning algorithms to generate metadata and data recommendations. Data recommendations based on user interest would connect like-minded users and increase collaboration on datasets in digital libraries. This would revolutionize the current approach to data discovery, sharing, and usage, and by doing so greatly enhance the exploitation of available open-access datasets for research and the realization of societal benefits.

IV. KEY CONCEPTS
The subsections below introduce the key components of the envisioned architecture for managing datasets in digital libraries.
A. Data Discovering Users (DDU)
In DDU, datasets and products derived from interactions will be associated with "Intelligent Semantic Data" (ISD) that can communicate semantic information in response to queries, including access conditions, quality, uncertainties, guidance on applicability, and user feedback. Figure 1 shows an overview of how different components are interconnected in DDU.

Fig. 1. Architecture of DDU including DDU Profilers [16]

Datasets are stored in repositories, which are responsible for handling data replication and version control. The DDU servers maintain metadata files (i.e., metafiles) for each dataset, and index datasets by their fields. The metadata is generated using machine learning and improved through crowd-sourcing. The ISD Agents expose APIs for users to query, discover, and fetch datasets. Crowd-sourced metadata is fed into the ISD Agents, which in turn improve the existing metadata on DDU Servers. DDU Profilers maintain interest profile models (IPMs) [16] for tracking user interests and providing intelligent matches. The user-facing component, DDU SaaS, utilizes all the aforementioned components to provide an ecosystem for collaborative research.
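Since DDU does not yet define a concrete API, the following is a purely hypothetical sketch of how a client might query an ISD Agent; the endpoint URL, query parameters, and response shape are all assumptions for illustration.

# Hypothetical client-side query to an ISD Agent. The endpoint URL,
# parameters, and response fields are illustrative assumptions; the
# paper defines the architecture, not a concrete API.
import requests

resp = requests.get(
    "https://isd-agent.example.com/v1/datasets",
    params={"q": "EEG ADOS-2", "keywords": "ASD", "limit": 5},
    timeout=10,
)
for ds in resp.json().get("results", []):
    print(ds["$id"], ds["meta"]["name"])  # e.g., metafile id and name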
B. Dataset File System
The main component of DFS is the metadata file (or metafile), which serves as the entry point to a dataset. Its objective is to capture as much information as possible about the underlying dataset, such that it eliminates the need to rely on external documentation to understand the dataset semantics. Each metafile stores information about the dataset, data files, and data fields. This enables multiple data files to behave as one coherent set of data, and also allows data files to be shared across datasets.

Figure 2 shows a sample metafile (shortened for brevity) in DFS. Each metafile contains a "$schema" field to identify the JSON schema, and the fields "$id" and "meta-version" to uniquely identify a dataset. The "meta-version" increments as the metadata and data files mature over time. The "created" and "modified" fields provide a timeline of any changes to the data or metadata. The "checksum" field maintains the hash of the contents in the "meta" field, which enforces integrity.

{
  "$schema": "https://some.example.com/schema-v1.3.json",
  "$id": "A2BE-FC28-906C-03B7",
  "meta-version": 2,
  "created": "01-20-2019",
  "modified": "05-27-2019",
  "checksum": 43278947328957439805847390257439,
  "meta": {
    "name": "EEG recordings during ADOS-2",
    "description": "...",
    "copyright": ".....",
    "keywords": ["EEG", "ADOS-2", "ASD", "MEDICINE"],
    "authors": [{
      "$id": "023-23-425325",
      "name": "John Doe",
      "affiliation": "Some University",
      "email": "[email protected]"
    }],
    "files": [{
      "$id": 21,
      "path": "./020.json",
      "encoding": "JSON",
      "version": 1,
      "checksum": 0123015035783941274895378,
      "description": "Participant 020",
      "measurement": {
        "name": "Brain Activity",
        "device": "Emotiv EPOC+",
        "units": "mV"
      },
      "fields": [
        {"name": "Fpz", "type": "Float", "description": "Electrode Fpz"},
        {"name": "Oz", "type": "Float", "description": "Electrode Oz"}
      ]
    }],
    "links": [{
      "$type": "ID",
      "description": "Subject ID",
      "fields": [
        {"file_id": 1, "field": "subject"},
        {"file_id": 2, "field": "*"}
      ]
    }]
  }
}

Fig. 2. Sample Metafile in DFS (Shortened for Brevity)

The "meta" field keeps all information related to the dataset. The "name" and "description" fields describe the dataset in a human-readable format, and the "copyright" field stores any copyright information. The "keywords" field indicates the research sub-domains, and is updated dynamically through crowd-sourcing and user profiling. The "authors" field maintains a list of dataset authors, and provides fields to store their "$id", "name", "affiliation", and "email" for authenticity.

The metafile points to data files through the "files" field. Each entry in "files" maintains an "$id" field for internal reference. Its "path" and "encoding" fields store the file path and the format to read the file. The "version" field increments each time the file is changed, and the "checksum" field provides integrity by ensuring that files cannot be arbitrarily modified without invalidating the metadata. The "description" field provides a textual description of the file, and the "measurement" field provides the type of measurement, the device used to measure it, and the units in which the measurements are stored in the file. The "fields" field keeps a list of each field in the file, and includes their type and description to provide better semantics.

Following the "files" field, the "links" field keeps track of the semantic relationships between the fields in data files. For example, if Field X of File A maps to Field Y of File B, this can be stored as a link by providing the link type (e.g., ID), a description of the link, and the fields involved in the link.

As an added benefit, the "$id" and "meta-version" fields provide version control capability and reference immutability, making metafiles an ideal candidate for citation. Citing datasets using DFS would resolve ambiguities caused by evolving datasets.
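The schema does not fix a hash algorithm or serialization for the "checksum" field; the sketch below assumes SHA-256 over a canonical (sorted-key) JSON dump of the "meta" object, purely for illustration.

# Verifying metafile integrity. The hash algorithm (SHA-256) and the
# canonical serialization (sorted-key JSON) are assumptions for
# illustration; DFS itself only mandates that a checksum exist. The
# sample metafile stores the checksum numerically, so the comparison
# format below is likewise an assumption.
import hashlib
import json

def meta_checksum(metafile: dict) -> str:
    canonical = json.dumps(metafile["meta"], sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(metafile: dict) -> bool:
    # Compare the recomputed hash against the stored checksum field.
    return meta_checksum(metafile) == metafile["checksum"]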
V. APPLICATIONS OF DFS

Introducing DFS creates a significant impact on existing and future research. It provides the infrastructure for representing dataset semantics in rich detail, effectively eliminating dependencies on external sources to comprehend them. The most obvious application of DFS is for DDU, where DFS lays the groundwork for semi-automated metadata management. Apart from DDU, we identified several applications of DFS, described in the sections below.
A. SDA Automation
For the data acquisition stage of SDA, DFS provides the groundwork for semantic searching, and could potentially provide targeted results based on the domain and topic of research. DDU realizes this concept by including user interest as a factor in dataset search. For the preprocessing stage, the metafiles serve as a semantic marker to determine the preprocessing needed for each data file in a dataset. Transforms could be inferred from the semantic information in the metafiles, and applied on the source data to obtain the target format. Complementary data could be aggregated, and redundant data could be removed during this process by identifying the semantic similarities between each field. Section V-C discusses in more detail how data could be aggregated using DFS metadata. When evaluating hypotheses, the semantic information could be leveraged by machine learning libraries to identify the type of model required. If neural networks are involved, the semantic information can be used to heuristically tune the model hyper-parameters using NAS.
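As a sketch of what transform inference could look like, the snippet below loads a data file according to its declared encoding and coerces columns to their declared types; the DFS-to-pandas type mapping and the helper name are illustrative assumptions.

# Inferring load-time behavior from DFS field semantics. The mapping
# from DFS types to pandas dtypes is an illustrative assumption; a
# real implementation would cover more encodings and types.
import pandas as pd

DFS_TO_PANDAS = {"Float": "float64", "Int": "int64", "String": "object"}

def load_file(entry: dict) -> pd.DataFrame:
    """Load one entry of a metafile's "files" list using its declared
    encoding, then coerce columns to their declared types."""
    if entry["encoding"] == "CSV":
        df = pd.read_csv(entry["path"])
    elif entry["encoding"] == "JSON":
        df = pd.read_json(entry["path"])
    else:
        raise ValueError(f"unsupported encoding: {entry['encoding']}")
    dtypes = {f["name"]: DFS_TO_PANDAS.get(f["type"], "object")
              for f in entry["fields"] if f["name"] in df.columns}
    return df.astype(dtypes)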
B. Integration With Existing Tools
Figure 3 shows how DDU and DFS can be complemented with existing tools and platforms to streamline data analytics. Here, user queries can be cross-referenced with IPM [16] to conduct user profiling and tune the results. Data Wrangler [6] can leverage the semantic information in DFS to generate the transforms required to pre-process data. Data aggregation, as described in Section V-C, can be used to merge multiple datasets based on relevance, and enables automatic clustering of data. The resulting aggregate data and metadata can then be passed to Auto-Keras [7] to heuristically determine the best way to model the hypothesis of the experiments, effectively producing a fully automated pipeline for data analytics with minimal user intervention. A compact sketch of this pipeline follows the figure below.
Fig. 3. Integration with Data Wrangler and Auto-Keras
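In the sketch, the wrangle() helper is a hypothetical stand-in for the Data Wrangler stage, the aggregation step is reduced to a naive inner concatenation, and only the Auto-Keras calls use a real API.

# Hypothetical end-to-end flow mirroring Fig. 3.
import pandas as pd
import autokeras as ak

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in pre-processing: deduplicate and fill numeric gaps.
    return df.drop_duplicates().fillna(df.mean(numeric_only=True))

def analyze(frames, features, label):
    # Naive aggregation of the wrangled frames on shared columns.
    pool = pd.concat([wrangle(f) for f in frames], join="inner")
    clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
    clf.fit(pool[features].to_numpy(), pool[label].to_numpy(), epochs=5)
    return clf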
C. Dataset Aggregation
Dataset aggregation is the process of comparing datasets using their field information to determine if they could be merged. Studies have proposed methods for calculating dataset similarity using schema overlap [2], and for merging datasets using scalable algorithms [17]. Since DFS metafiles provide information about the data fields and how they are related to each other, datasets could be compared using their metafiles to determine if they could be merged.

Algorithm 1 provides pseudo-code for dataset aggregation using DFS. For each metafile, the fields and their links are represented as a graph. Next, the two graphs are compared using graph similarity algorithms to determine if the datasets are comparable. If so, a join operation is performed on the fields and links of the two metafiles based on the schema overlap. This results in a connected graph which represents links among both datasets. This information is then used to create a new metafile which represents data from both datasets.

Algorithm 1: Dataset Aggregation using Metafiles

function aggregate(α, β):
    if similarity(graph(α), graph(β)) ≤ ε then
        throw error;
    forall γ ← fields(α) do
        forall δ ← fields(β) do
            if overlap(γ, δ) ≥ σ then
                α ← metajoin(α, β, γ, δ);
    return α;

Figure 4 provides a visualization of this aggregation process. Each dataset contains metadata that describes the fields (columns in tabular data), domain, tags, encoding, authors, mode of data extraction, etc., defining the dataset in a structured, standardized format. Comparing the metadata of each file, the aggregation process combines the two datasets into a new dataset with aggregated metadata.

Fig. 4. Dataset Aggregation Process
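For concreteness, below is a simplified, runnable rendering of Algorithm 1 over metafile "meta" objects; the Jaccard-style graph similarity, the (name, type) field-overlap test, the thresholds ε and σ, and the naive metajoin are all assumptions rather than normative DFS behavior.

# A simplified, runnable rendering of Algorithm 1. alpha and beta are
# the "meta" objects of two metafiles. Graph similarity is approximated
# by Jaccard overlap of link edges, field overlap by (name, type)
# equality, and metajoin() by a naive union; EPSILON and SIGMA are
# illustrative thresholds.
EPSILON, SIGMA = 0.1, 0.8

def fields(meta):
    """All field descriptors across the metafile's data files."""
    return [f for entry in meta["files"] for f in entry["fields"]]

def edges(meta):
    """Represent the metafile's fields and links as a set of edges."""
    out = set()
    for link in meta.get("links", []):
        ends = [(f["file_id"], f["field"]) for f in link["fields"]]
        out.update((a, b) for a in ends for b in ends if a < b)
    return out

def similarity(g1, g2):
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

def overlap(f1, f2):
    return 1.0 if (f1["name"], f1["type"]) == (f2["name"], f2["type"]) else 0.0

def metajoin(alpha, beta, g, d):
    """Naive merge: union the file lists and record the join as a link."""
    merged = dict(alpha)
    merged["files"] = alpha["files"] + beta["files"]
    merged["links"] = alpha.get("links", []) + [
        {"$type": "JOIN", "fields": [g["name"], d["name"]]}]
    return merged

def aggregate(alpha, beta):
    if similarity(edges(alpha), edges(beta)) <= EPSILON:
        raise ValueError("datasets are not comparable")
    for g in fields(alpha):
        for d in fields(beta):
            if overlap(g, d) >= SIGMA:
                alpha = metajoin(alpha, beta, g, d)
    return alpha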
DFS and DDU provide a fresh outlook on how data is discovered, wrangled, and used for data analytics and machine learning. With DFS bringing new techniques for dataset aggregation and DDU enabling semi-automated metadata management and user interest profiling, research communities could collaborate efficiently on research and accelerate workflows. As future work, we plan to conduct a comprehensive survey with researchers, data curators, and practitioners, and incorporate their feedback to fine-tune the DFS schema and the DDU architecture. We also plan to assess the cross-domain coverage of DFS by evaluating its compatibility across multiple domains and file types through scenarios and case studies.

REFERENCES
[1] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck, and P. Buono, "Research directions in data wrangling: Visualizations and transformations for usable and credible data," Information Visualization, vol. 10, no. 4, pp. 271–288, 2011.
[2] M. Ben Ellefi, Z. Bellahsene, S. Dietze, and K. Todorov, "Dataset recommendation for data linking: An intensional approach," in The Semantic Web. Latest Advances and New Domains, H. Sack, E. Blomqvist, M. d'Aquin, C. Ghidini, S. P. Ponzetto, and C. Lange, Eds. Cham: Springer International Publishing, 2016, pp. 36–51.
[3] B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl, "A scalable approach for efficiently generating structured dataset topic profiles," in The Semantic Web: Trends and Challenges, V. Presutti, C. d'Amato, F. Gandon, M. d'Aquin, S. Staab, and A. Tordai, Eds. Cham: Springer International Publishing, 2014, pp. 519–534.
[4] V. Parekh, J. Gwo, and T. W. Finin, "Ontology based semantic metadata for geoscience data," in Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04, June 21-24, 2004, Las Vegas, Nevada, USA, H. R. Arabnia, Ed. CSREA Press, 2004, pp. 485–490.
[5] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. Parameswaran, "Principles of dataset versioning: Exploring the recreation/storage tradeoff," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1346–1357, 2015.
[6] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, "Wrangler: Interactive visual specification of data transformation scripts," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI '11. New York, NY, USA: ACM, 2011, pp. 3363–3372. [Online]. Available: http://doi.acm.org/10.1145/1978942.1979444
[7] H. Jin, Q. Song, and X. Hu, "Efficient neural architecture search with network morphism," CoRR, vol. abs/1806.10282, 2018. [Online]. Available: http://arxiv.org/abs/1806.10282
[8] A. Gulli and S. Pal, Deep Learning with Keras. Packt Publishing Ltd, 2017.
[9] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," CoRR, vol. abs/1611.01578, 2016. [Online]. Available: http://arxiv.org/abs/1611.01578
[10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in The European Conference on Computer Vision (ECCV), September 2018.
[11] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," CoRR, vol. abs/1807.11626, 2018. [Online]. Available: http://arxiv.org/abs/1807.11626
[12] F. Luengo, A. S. Cofiño, and J. M. Gutiérrez, "Grid oriented implementation of self-organizing maps for data mining in meteorology," in Grid Computing. Springer, 2004, pp. 163–170.
[13] B. Quilitz and U. Leser, "Querying distributed RDF data sources with SPARQL," in European Semantic Web Conference. Springer, 2008, pp. 524–538.
[14] M. Folk, G. Heber, Q. Koziol, E. Pourmal, and D. Robinson, "An overview of the HDF5 technology suite and its applications," in Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, ser. AD '11. New York, NY, USA: ACM, 2011, pp. 36–47. [Online]. Available: http://doi.acm.org/10.1145/1966895.1966900
[15] G. H. Brimhall and A. Vanegas, "Removing science workflow barriers to adoption of digital geological mapping by using the GeoMapper universal program and visual user interface," US Geological Survey Open File Report, pp. 01–223, 2001.
[16] S. Jayarathna and F. Shipman, "Analysis and modeling of unified user interest," in