Knowledge Graph for Microdata of Statistics Netherlands
Chang Sun
Institute of Data Science, Maastricht University
[email protected]
Statistics Netherlands (CBS) hosts a huge amount of data, not only at the statistical level but also at the individual level, collected and maintained from the whole Dutch population for over 100 years. With the development of data science technologies, more and more researchers request to conduct their research using high-quality individual-level data from CBS (called CBS Microdata) or to combine it with other data sources. CBS Microdata is linkable data at the personal, company, and address level with which researchers can conduct statistical research themselves under strict conditions [1]. The requester (who has to be a researcher) is given a protected working environment in which to store the data files, intermediate files, and output. CBS Microdata is considered a reliable and informative data source covering health, socio-economic, educational, and financial topics along with 14 other categories. Making good use of these data for research and scientific purposes can benefit society tremendously.

However, CBS Microdata has been collected and maintained in different ways by different departments inside and outside CBS. The representation, quality, and metadata of the datasets are not sufficiently harmonized. Each dataset is briefly described in one to three sentences on the CBS website, and a more detailed description of each dataset is provided separately in a PDF file in Dutch. Due to the lack of integration of all Microdata sets and of a centralized platform to query the metadata, it is a very time-consuming and costly task for researchers to find all the datasets or particular variables they need. Researchers first have to dive into all the dataset description pages in a specific category. Then they have to download and read (and, if needed, translate to English) all the lengthy PDF files just to learn the basic information about the datasets. In this way, researchers miss the relations between different datasets and are not able to easily find all the needed variables across multiple datasets. Therefore, a general research question is formulated for this project: Can we convert the descriptions of all CBS Microdata sets into one knowledge graph with high-quality and comprehensive metadata, so that researchers can easily query the metadata, explore the relations among multiple datasets, and find the variables they need?
The above general research question can be divided into the following sub-questions:

1. Can we extract key information about CBS Microdata from the text (PDF files)?
2. What are the most suitable ontologies for the CBS Microdata metadata?
3. Can we use the extracted information to make a knowledge graph of CBS Microdata metadata?
4. Can we find relations across different datasets and categories?
Semantic web and linked data technologies are not new to statistics offices. In 2001, SDMX (Statistical Data and Metadata eXchange) was launched to standardize and modernize the mechanisms and processes for exchanging statistical data and metadata among international organisations and their member countries [2]. However, few publications or publicly available software tools describe how to convert statistical (meta)data into a knowledge graph. The EU Open Data Portal (https://data.europa.eu/) provides a SPARQL tool to query the metadata of its linked data [3]; its metadata vocabulary builds on the Data Catalogue Vocabulary (DCAT) and the Dublin Core Terms (DCT) vocabulary. Sarker et al. proposed a plan and methods for implementing semantic web technology at the Australian Bureau of Statistics in 2017 [4]; it is a proof-of-concept paper that does not provide an actual implementation. In 2018, Chaves-Fraga et al. presented a mapping translator from RMLC to R2RML and a comparative analysis of two real statistical datasets using the Data Cube Vocabulary [5]; that study focuses on converting CSV to RDF and reducing the size of R2RML mapping documents. Each of these existing studies covers only one part of this project.

Since the descriptions of CBS Microdata are only available as text in PDF files, heavy data pre-processing is required before the knowledge graph can be built. The pre-processing tasks include extracting text from PDF files with diverse layouts, translating Dutch to English, and extracting key information from the text.

The data description of each dataset is presented in a PDF file that can be downloaded separately from the CBS Microdata websites. To download all PDF files automatically, I wrote a Python script that crawls the related CBS web pages and collects the download links of the PDF files. The code is publicly available in a GitHub repository (https://github.com/sunchang0124/KG-CBSMicrodata).

Extracting text accurately from a PDF document is still regarded as a very challenging task. PDF was designed as an output format that provides a good viewing layout rather than as a data input format, so most of the content semantics are lost when a text or word-processing document is converted to PDF. To get better results, I applied a Python package called PDFMiner (https://pypi.org/project/pdfminer/) to extract text from the PDF documents. In addition to extracting plain text, it also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) of each text segment. The tool has been developed and well maintained since 2008 and is well recognized in the text mining community for its good performance.

Many international researchers in the Netherlands are interested in CBS Microdata, but the Microdata websites and data descriptions are both written in Dutch. An additional challenge of this project is therefore translating the text from Dutch to English. Considering the time frame of the project, the best option for this task was Google Translate; I used the Google Translate API in Python.
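The actual crawler lives in the GitHub repository linked above. Purely as a rough illustration of the crawl-and-download step, a minimal sketch could look as follows; the entry URL, the requests/BeautifulSoup combination, and the 1-to-5-second sleep are assumptions based on the description, not the author's code:

    import os
    import random
    import time

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical entry page listing the Microdata documentation; the real
    # crawler targets the actual CBS catalogue pages.
    BASE_URL = "https://www.cbs.nl/microdata-catalogue"

    def collect_pdf_links(page_url):
        """Fetch a catalogue page and return all links pointing to PDF files."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".pdf")]

    def download_pdfs(links, out_dir="pdfs"):
        """Download each PDF, sleeping 1-5 seconds between requests to avoid
        being detected and blocked for sending requests too frequently."""
        os.makedirs(out_dir, exist_ok=True)
        for url in links:
            name = url.rsplit("/", 1)[-1]
            with open(os.path.join(out_dir, name), "wb") as fh:
                fh.write(requests.get(url, timeout=60).content)
            time.sleep(random.uniform(1, 5))

    if __name__ == "__main__":
        download_pdfs(collect_pdf_links(BASE_URL))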
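Similarly, a minimal sketch of the extraction step with pdfminer.six (the maintained fork of PDFMiner) might look like this; the high-level extract_text call and the directory names are illustrative, and documents that cannot be parsed are simply counted as failures, mirroring the statistics reported in the results:

    from pathlib import Path

    from pdfminer.high_level import extract_text

    def extract_all(pdf_dir="pdfs", txt_dir="txt"):
        """Extract plain text from every downloaded PDF; documents whose
        layout cannot be parsed are counted as failures and skipped."""
        Path(txt_dir).mkdir(exist_ok=True)
        ok, failed = 0, 0
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            try:
                text = extract_text(str(pdf))
                (Path(txt_dir) / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
                ok += 1
            except Exception:  # unrecognizable layout, broken PDF versions, ...
                failed += 1
        print(f"extracted: {ok}, failed: {failed}")

    extract_all()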
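The paper does not show the translation code; below is a hedged sketch using the official google-cloud-translate client, one way of calling the Google Translate API from Python. The credential setup and the example string are assumptions:

    # Assumes a Google Cloud project with the Translation API enabled and
    # GOOGLE_APPLICATION_CREDENTIALS set in the environment.
    from google.cloud import translate_v2 as translate

    client = translate.Client()

    def nl_to_en(text):
        """Translate a Dutch description to English."""
        result = client.translate(text, source_language="nl", target_language="en")
        return result["translatedText"]

    print(nl_to_en("Datum van overlijden"))  # expected: "Date of death"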
To produce high-quality metadata, a free-text data description that lumps all information together is clearly not enough. Key information such as the data release date, the data publisher, subject identifiers, and other metadata elements needs to be extracted from the description text, which requires text mining techniques. I applied two well-known Python text mining libraries, NLTK and spaCy (https://spacy.io/), to recognize entities in the text. NLTK has a very …

Finding a suitable vocabulary is key to building the knowledge graph for the metadata of CBS Microdata. As CBS is a national statistics office, I searched for related ontologies and vocabularies in the statistics community. The best options I found are the Data Catalogue Vocabulary (DCAT) and the Dublin Core Terms (DCT) vocabulary; the EU Open Data Portal also applied these two vocabularies to the metadata of its Linked Data project.
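As an illustration of the entity recognition step, the sketch below runs both NLTK and spaCy over a translated description; the sample sentence and the model name are illustrative, not taken from the paper:

    # Requires: python -m spacy download en_core_web_sm, plus the NLTK
    # resources downloaded in the commented lines below.
    import nltk
    import spacy

    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    description = ("This dataset on causes of death was published by "
                   "Statistics Netherlands on 31 March 2020.")  # made-up example

    # NLTK: tokenize -> POS-tag -> chunk named entities
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(description)))
    print([(" ".join(w for w, _ in st.leaves()), st.label())
           for st in tree.subtrees() if st.label() != "S"])

    # spaCy: pretrained pipeline with built-in NER
    nlp = spacy.load("en_core_web_sm")
    for ent in nlp(description).ents:
        print(ent.text, ent.label_)  # e.g. ORG and DATE entities for the metadata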
After matching the vocabularies with the extracted key information (the metadata), I applied R2RML [6] to convert the CSV files containing the metadata to RDF. I mapped the datasets, catalog, organization, variables, and keywords (of each dataset) as subjects. Language tags are used to mark Dutch and English content. After the RDF was generated successfully, all triples were imported into and stored in GraphDB.
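The actual conversion is driven by an R2RML mapping document rather than by Python code. Purely to illustrate the target shape of the triples (DCAT/DCT terms with language tags), here is a small rdflib sketch; the CSV column names and the namespace are hypothetical:

    import csv

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCAT, DCTERMS, RDF

    CBS = Namespace("https://example.org/cbs/")  # placeholder namespace

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)

    with open("metadata.csv", encoding="utf-8") as f:  # hypothetical extract
        for row in csv.DictReader(f):
            ds = URIRef(CBS[row["identifier"]])
            g.add((ds, RDF.type, DCAT.Dataset))
            g.add((ds, DCTERMS.identifier, Literal(row["identifier"])))
            g.add((ds, DCTERMS.title, Literal(row["title_nl"], lang="nl")))
            g.add((ds, DCTERMS.title, Literal(row["title_en"], lang="en")))
            g.add((ds, DCTERMS.issued, Literal(row["issued"])))
            g.add((ds, DCTERMS.publisher, URIRef(CBS["org/CBS"])))
            for kw in row["keywords"].split(";"):
                g.add((ds, DCAT.keyword, Literal(kw.strip(), lang="en")))

    g.serialize("cbs_metadata.ttl", format="turtle")  # then import into GraphDB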
To complete the knowledge graph, I applied AMIE+ [7] to predict potential relations between entities. Additionally, I tried a graph embedding method using the Python library Gensim. The prediction results are discussed in the following section.
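The embedding setup is not detailed in the paper; one lightweight approach consistent with using Gensim is to treat each triple as a three-token sentence and train Word2Vec on those sentences. Parameter names follow Gensim 4.x, and the triples are illustrative:

    from gensim.models import Word2Vec

    triples = [
        ("ds:AgeAtDeath", "dcat:keyword", "Death"),
        ("ds:DateOfDeath", "dcat:keyword", "Death"),
        ("ds:AgeAtDeath", "dct:publisher", "org:CBS"),
    ]  # illustrative triples

    sentences = [list(t) for t in triples]
    model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

    # entities sharing context (e.g. the keyword "Death") end up close together
    print(model.wv.most_similar("ds:AgeAtDeath", topn=2))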
Crawling and downloading all PDF files from the CBS Microdata website took less than 10 minutes, including sleeping time. The sleeping time (1 to 5 seconds between requests) avoids being detected and blocked by the website when requests are sent too frequently. In total, 505 PDF documents were downloaded on 31 March 2020.

As discussed in the previous section, text extraction from PDF files remains a challenge. In the end, 420 PDF documents (83.2%) were processed successfully by PDFMiner, while 85 documents could not be extracted to text. The 420 documents come from 18 different categories, as Table 1 shows. The two main reasons for extraction failure are an unrecognizable layout and a failure to detect words and paragraphs properly.

[Table 1: number of datasets and number of variables per category.]

As Figure 2 shows, metadata elements such as dct:issued (the date of formal issuance, e.g. publication, of the item), dct:title, dct:description, dct:identifier, dct:language, dct:isPartOf, dcat:landingPage, dcat:keyword, dct:publisher, and dct:creator can be filled with the extracted information (see https://github.com/sunchang0124/KG-CBSMicrodata).

[Figure 2: R2RML mapping examples. (a) R2RML triple mapping diagrams for the "dataset" and "publisher" entities; (b) R2RML triple mapping for the "publisher" entity.]

In the last step, I applied AMIE+ to predict potential relations between entities. As this knowledge graph is not very complicated, only 9 rules were found by AMIE+. For instance, ?b …
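The resulting graph can be queried directly. For example, a researcher looking for all datasets tagged with the keyword "Death" (the scenario discussed in the conclusion) could use a SPARQL query like the hedged sketch below; the GraphDB endpoint URL and repository name are placeholders:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:7200/repositories/cbs-microdata")
    sparql.setQuery("""
        PREFIX dcat: <http://www.w3.org/ns/dcat#>
        PREFIX dct:  <http://purl.org/dc/terms/>
        SELECT ?dataset ?title WHERE {
            ?dataset a dcat:Dataset ;
                     dcat:keyword "Death"@en ;
                     dct:title ?title .
            FILTER (lang(?title) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])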
This project converts the descriptions of all CBS Microdata sets into one knowledge graph with comprehensive metadata in Dutch and English. Researchers can easily query the metadata, explore the relations among multiple datasets, and find the variables they need. For example, if a researcher searches for a dataset about "Age at Death" in the Health and Well-being category, all information related to this dataset appears, including its keywords and variable names. The "Age at Death" dataset has the keyword "Death", which leads to other datasets such as "Date of Death", "Cause of Death", and "Production statistics Health and welfare" from the Population, Business, and Health and Well-being categories. This tremendously reduces time and costs not only for data requesters but also for data maintainers.

However, this short-term project has some limitations. First, only 83.2% of the PDF documents could be extracted to text, due to reasons such as different versions of PDF files and unrecognizable layouts. Second, the accuracy of the language translation and of the entity recognition needs to be evaluated on a larger scale and optimized. For example, several dates can be extracted from the description of a single dataset; a date might be the data collection time, the publishing time, or the modification time. More information needs to be accurately extracted from the data description documents and mapped to the metadata vocabulary.
References

[1] George Kour and Raid Saabne. Real-time segmentation of on-line handwritten Arabic script. In …