Knowledge Graph for Microdata of Statistics Netherlands
Chang Sun
Institute of Data Science, Maastricht University
[email protected]
Statistics Netherlands (CBS) hosts a huge amount of data, not only at the statistical level but also at the individual level, collected and maintained from the whole Dutch population for over 100 years. With the development of data science technologies, more and more researchers request to conduct their research using high-quality individual-level data from CBS (called CBS Microdata) or to combine it with other data sources. CBS Microdata is linkable data at the personal, company, and address level with which researchers can conduct statistical research themselves under strict conditions [1]. The requester (who has to be a researcher) is given a protected working environment in which to store the data files, intermediate files, and output. CBS Microdata is considered a reliable and informative data source covering health, socio-economic, educational, and financial topics along with 14 other categories. Making good use of these data for research and scientific purposes can benefit society tremendously.

However, CBS Microdata has been collected and maintained in different ways by different departments inside and outside CBS. The representation, quality, and metadata of the datasets are not sufficiently harmonized. Each dataset is briefly described in one to three sentences on the CBS website, and a more detailed description of each dataset is provided separately in a PDF file in Dutch. Due to the lack of integration of all Microdata sets and of a centralized platform to query the metadata, it is a very time-consuming and costly task for researchers to find all the datasets or particular variables they need. Researchers first have to dive into all the dataset description pages in a specific category. Then they have to download and read (and, if needed, translate to English) all the lengthy PDF files just to learn the basic information about the datasets. In this way, researchers miss the relations between different datasets and are not able to easily find all the needed variables across multiple datasets. Therefore, a general research question is formulated for this project: Can we convert the descriptions of all CBS Microdata sets into one knowledge graph with high-quality and comprehensive metadata, so that researchers can easily query the metadata, explore the relations among multiple datasets, and find the variables they need?
The above general research question can be divided into the following sub-questions:

1. Can we extract key information about CBS Microdata from the text (PDF files)?
2. What are the most suitable ontologies for the CBS Microdata metadata?
3. Can we use the extracted information to make a knowledge graph of CBS Microdata metadata?
4. Can we find relations across different datasets and categories?
Semantic web and linked data technologies are not new to statistics offices. In 2001, SDMX (Statistical Data and Metadata eXchange) was launched to standardize and modernize the mechanisms and processes for exchanging statistical data and metadata among international organisations and their member countries [2]. However, few publications or publicly available software tools describe how to convert statistical (meta)data into a knowledge graph. The EU Open Data Portal (https://data.europa.eu/) provides a SPARQL tool to query the metadata of its linked data [3]; its metadata vocabulary builds on the Data Catalogue Vocabulary (DCAT) and the Dublin Core Terms (DCT) vocabulary. Sarker et al. proposed a plan and methods for implementing semantic web technology at the Australian Bureau of Statistics in 2017 [4]; it is a proof-of-concept paper that does not provide an actual implementation. In 2018, Chaves-Fraga et al. presented a mapping translator from RMLC to R2RML and a comparative analysis of two real statistical datasets using the Data Cube Vocabulary [5]; that study focuses on converting CSV to RDF and reducing the size of R2RML mapping documents. Each of these existing studies covers only one part of this project.

Since the descriptions of CBS Microdata are only available as text in PDF files, heavy data pre-processing is required before the knowledge graph can be built. The pre-processing tasks include extracting text from PDF files with diverse layouts, translating Dutch to English, and extracting key information from the text.

The data description of each dataset is presented in a PDF file that can be downloaded separately from the CBS Microdata websites. To download all PDF files automatically, I wrote a Python script that crawls the related CBS web pages and collects the download links of the PDF files. The code is publicly available in a GitHub repository (https://github.com/sunchang0124/KG-CBSMicrodata).

Extracting text accurately from a PDF document is still regarded as a very challenging task. PDF was designed as an output format that provides a good viewing layout rather than as a data input format, so most of the content semantics are lost when a text or word-processing document is converted to PDF. To get better results, I applied a Python package called PDFMiner (https://pypi.org/project/pdfminer/) to extract text from the PDF documents. In addition to extracting plain text, it also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) of each text segment. The tool has been developed and well maintained since 2008 and is well recognized in the text mining community for its good performance.

Many international researchers in the Netherlands are interested in CBS Microdata, but the Microdata websites and data descriptions are both written in Dutch. An additional challenge of this project is therefore translating the text from Dutch to English. Considering the time frame of the project, the best option for this task was Google Translate; I used the Google Translate API in Python.
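The actual crawler lives in the GitHub repository linked above. Purely as a rough illustration of the crawl-and-download step, a minimal sketch could look as follows; the entry URL, the requests/BeautifulSoup combination, and the 1-to-5-second sleep are assumptions based on the description, not the author's code:

    import os
    import random
    import time

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical entry page listing the Microdata documentation; the real
    # crawler targets the actual CBS catalogue pages.
    BASE_URL = "https://www.cbs.nl/microdata-catalogue"

    def collect_pdf_links(page_url):
        """Fetch a catalogue page and return all links pointing to PDF files."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".pdf")]

    def download_pdfs(links, out_dir="pdfs"):
        """Download each PDF, sleeping 1-5 seconds between requests to avoid
        being detected and blocked for sending requests too frequently."""
        os.makedirs(out_dir, exist_ok=True)
        for url in links:
            name = url.rsplit("/", 1)[-1]
            with open(os.path.join(out_dir, name), "wb") as fh:
                fh.write(requests.get(url, timeout=60).content)
            time.sleep(random.uniform(1, 5))

    if __name__ == "__main__":
        download_pdfs(collect_pdf_links(BASE_URL))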
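Similarly, a minimal sketch of the extraction step with pdfminer.six (the maintained fork of PDFMiner) might look like this; the high-level extract_text call and the directory names are illustrative, and documents that cannot be parsed are simply counted as failures, mirroring the statistics reported in the results:

    from pathlib import Path

    from pdfminer.high_level import extract_text

    def extract_all(pdf_dir="pdfs", txt_dir="txt"):
        """Extract plain text from every downloaded PDF; documents whose
        layout cannot be parsed are counted as failures and skipped."""
        Path(txt_dir).mkdir(exist_ok=True)
        ok, failed = 0, 0
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            try:
                text = extract_text(str(pdf))
                (Path(txt_dir) / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
                ok += 1
            except Exception:  # unrecognizable layout, broken PDF versions, ...
                failed += 1
        print(f"extracted: {ok}, failed: {failed}")

    extract_all()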
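The paper does not show the translation code; below is a hedged sketch using the official google-cloud-translate client, one way of calling the Google Translate API from Python. The credential setup and the example string are assumptions:

    # Assumes a Google Cloud project with the Translation API enabled and
    # GOOGLE_APPLICATION_CREDENTIALS set in the environment.
    from google.cloud import translate_v2 as translate

    client = translate.Client()

    def nl_to_en(text):
        """Translate a Dutch description to English."""
        result = client.translate(text, source_language="nl", target_language="en")
        return result["translatedText"]

    print(nl_to_en("Datum van overlijden"))  # expected: "Date of death"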
To produce high-quality metadata, a free-text data description that lumps all information together is clearly not enough. Key information such as the data release date, the data publisher, subject identifiers, and other metadata elements needs to be extracted from the description text, which requires text mining techniques. I applied two well-known Python text mining libraries, NLTK and spaCy (https://spacy.io/), to recognize entities in the text. NLTK has a very …

Finding a suitable vocabulary is key to building the knowledge graph for the metadata of CBS Microdata. As CBS is a national statistics office, I searched for related ontologies and vocabularies in the statistics community. The best options I found are the Data Catalogue Vocabulary (DCAT) and the Dublin Core Terms (DCT) vocabulary; the EU Open Data Portal also applied these two vocabularies to the metadata of its Linked Data project.
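As an illustration of the entity recognition step, the sketch below runs both NLTK and spaCy over a translated description; the sample sentence and the model name are illustrative, not taken from the paper:

    # Requires: python -m spacy download en_core_web_sm, plus the NLTK
    # resources downloaded in the commented lines below.
    import nltk
    import spacy

    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    description = ("This dataset on causes of death was published by "
                   "Statistics Netherlands on 31 March 2020.")  # made-up example

    # NLTK: tokenize -> POS-tag -> chunk named entities
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(description)))
    print([(" ".join(w for w, _ in st.leaves()), st.label())
           for st in tree.subtrees() if st.label() != "S"])

    # spaCy: pretrained pipeline with built-in NER
    nlp = spacy.load("en_core_web_sm")
    for ent in nlp(description).ents:
        print(ent.text, ent.label_)  # e.g. ORG and DATE entities for the metadata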
After matching the vocabularies with the extracted key information (the metadata), I applied R2RML [6] to convert the CSV files containing the metadata to RDF. I mapped the datasets, catalog, organization, variables, and keywords (of each dataset) as subjects. Language tags are used to mark Dutch and English content. After the RDF was generated successfully, all triples were imported into and stored in GraphDB.
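The actual conversion is driven by an R2RML mapping document rather than by Python code. Purely to illustrate the target shape of the triples (DCAT/DCT terms with language tags), here is a small rdflib sketch; the CSV column names and the namespace are hypothetical:

    import csv

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCAT, DCTERMS, RDF

    CBS = Namespace("https://example.org/cbs/")  # placeholder namespace

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)

    with open("metadata.csv", encoding="utf-8") as f:  # hypothetical extract
        for row in csv.DictReader(f):
            ds = URIRef(CBS[row["identifier"]])
            g.add((ds, RDF.type, DCAT.Dataset))
            g.add((ds, DCTERMS.identifier, Literal(row["identifier"])))
            g.add((ds, DCTERMS.title, Literal(row["title_nl"], lang="nl")))
            g.add((ds, DCTERMS.title, Literal(row["title_en"], lang="en")))
            g.add((ds, DCTERMS.issued, Literal(row["issued"])))
            g.add((ds, DCTERMS.publisher, URIRef(CBS["org/CBS"])))
            for kw in row["keywords"].split(";"):
                g.add((ds, DCAT.keyword, Literal(kw.strip(), lang="en")))

    g.serialize("cbs_metadata.ttl", format="turtle")  # then import into GraphDB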
To complete the knowledge graph, I applied AMIE+ [7] to predict potential relations between entities. Additionally, I tried a graph embedding method using the Python library Gensim. The prediction results are discussed in the following section.
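The embedding setup is not detailed in the paper; one lightweight approach consistent with using Gensim is to treat each triple as a three-token sentence and train Word2Vec on those sentences. Parameter names follow Gensim 4.x, and the triples are illustrative:

    from gensim.models import Word2Vec

    triples = [
        ("ds:AgeAtDeath", "dcat:keyword", "Death"),
        ("ds:DateOfDeath", "dcat:keyword", "Death"),
        ("ds:AgeAtDeath", "dct:publisher", "org:CBS"),
    ]  # illustrative triples

    sentences = [list(t) for t in triples]
    model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

    # entities sharing context (e.g. the keyword "Death") end up close together
    print(model.wv.most_similar("ds:AgeAtDeath", topn=2))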
Crawling and downloading all PDF files from the CBS Microdata website took less than 10 minutes, including sleeping time. The sleeping time (1 to 5 seconds between requests) avoids being detected and blocked by the website when requests are sent too frequently. In total, 505 PDF documents were downloaded on 31 March 2020.

As discussed in the previous section, text extraction from PDF files remains a challenge. In the end, 420 PDF documents (83.2%) were processed successfully by PDFMiner, while 85 documents could not be extracted to text. The 420 documents come from 18 different categories, as Table 1 shows. The two main reasons for extraction failure are an unrecognizable layout and a failure to detect words and paragraphs properly.

[Table 1: number of datasets and number of variables per category.]

As Figure 2 shows, metadata elements such as dct:issued (the date of formal issuance, e.g. publication, of the item), dct:title, dct:description, dct:identifier, dct:language, dct:isPartOf, dcat:landingPage, dcat:keyword, dct:publisher, and dct:creator can be filled with the extracted information (see https://github.com/sunchang0124/KG-CBSMicrodata).

[Figure 2: R2RML mapping examples. (a) R2RML triple mapping diagrams for the "dataset" and "publisher" entities; (b) R2RML triple mapping for the "publisher" entity.]

In the last step, I applied AMIE+ to predict potential relations between entities. As this knowledge graph is not very complicated, only 9 rules were found by AMIE+. For instance, ?b …
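The resulting graph can be queried directly. For example, a researcher looking for all datasets tagged with the keyword "Death" (the scenario discussed in the conclusion) could use a SPARQL query like the hedged sketch below; the GraphDB endpoint URL and repository name are placeholders:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:7200/repositories/cbs-microdata")
    sparql.setQuery("""
        PREFIX dcat: <http://www.w3.org/ns/dcat#>
        PREFIX dct:  <http://purl.org/dc/terms/>
        SELECT ?dataset ?title WHERE {
            ?dataset a dcat:Dataset ;
                     dcat:keyword "Death"@en ;
                     dct:title ?title .
            FILTER (lang(?title) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])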
This project converts the descriptions of all CBS Microdata sets into one knowledge graph with comprehensive metadata in Dutch and English. Researchers can easily query the metadata, explore the relations among multiple datasets, and find the variables they need. For example, if a researcher searches for a dataset about "Age at Death" in the Health and Well-being category, all information related to this dataset appears, including its keywords and variable names. The "Age at Death" dataset has the keyword "Death", which leads to other datasets such as "Date of Death", "Cause of Death", and "Production statistics Health and welfare" from the Population, Business, and Health and Well-being categories. This tremendously reduces time and costs not only for data requesters but also for data maintainers.

However, this short-term project has some limitations. First, only 83.2% of the PDF documents could be extracted to text, due to reasons such as different versions of PDF files and unrecognizable layouts. Second, the accuracy of the language translation and of the entity recognition needs to be evaluated on a larger scale and optimized. For example, several dates can be extracted from the description of a single dataset; a date might be the data collection time, the publishing time, or the modification time. More information needs to be accurately extracted from the data description documents and mapped to the metadata vocabulary.
References

[1] George Kour and Raid Saabne. Real-time segmentation of on-line handwritten Arabic script. In …