J Julia Efremova

Eindhoven University of Technology

Publication

Featured research published by J Julia Efremova.

Population Reconstruction | 2015

Multi-Source Entity Resolution for Genealogical Data

J Julia Efremova; Bijan Ranjbar-Sahraei; Hossein Rahmani; Toon Calders; Karl Tuyls; Gerhard Weiss

In this chapter, we study the application of existing entity resolution (ER) techniques to a real-world multi-source genealogical dataset. Our goal is to identify all persons involved in various notary acts and link them to their birth, marriage, and death certificates. We analyze the influence of additional ER features, such as name popularity, geographical distance, and co-reference information, on the overall ER performance. We study two prediction models: regression trees and logistic regression. To evaluate the performance of the applied algorithms and to obtain a training set for learning the models, we developed an interactive interface for collecting feedback from human experts. We perform an empirical evaluation on the manually annotated dataset in terms of precision, recall, and F-score. We show that using name popularity and geographical distance together with co-reference information significantly improves ER results.
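
The evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes that ER output and expert annotations are both represented as sets of (notary act, certificate) link pairs; the record identifiers and the function name are hypothetical and do not come from the chapter.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted record links against expert-annotated links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical link pairs, invented for illustration only.
gold_links = {("act_1", "birth_7"), ("act_2", "marriage_3"), ("act_3", "death_9")}
predicted_links = {("act_1", "birth_7"), ("act_2", "marriage_4"), ("act_3", "death_9")}

precision, recall, f1 = precision_recall_f1(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f} F-score={f1:.2f}")
```

Here two of the three predicted links are correct and two of the three annotated links are found, so all three scores coincide.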

European Conference on Information Retrieval | 2015

Classification of historical notary acts with noisy labels

J Julia Efremova; A Alejandro Montes Garcia; Tgk Toon Calders

This paper approaches the problem of automatic classification of real-world historical notary acts from the 14th to the 20th century. We deal with category ambiguity, noisy labels, and imbalanced data. Our goal is to assign an appropriate category to each notary act from the archive collection. We investigate a variety of existing techniques and describe a framework for dealing with noisy labels, which includes category resolution, evaluation of inter-annotator agreement, and the application of a two-level classification. The maximum accuracy we achieve is 88%, which is comparable to the agreement between human annotators.

SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities | 2014

A Hybrid Disambiguation Measure for Inaccurate Cultural Heritage Data

J Julia Efremova; Bijan Ranjbar-Sahraei; Tgk Toon Calders

Cultural heritage data is always associated with inaccurate information and different types of ambiguities. For instance, names of persons, occupations, or places mentioned in historical documents are not standardized and contain numerous variations. This article examines in detail various existing similarity functions and proposes a hybrid technique for the following task: among the list of possible names, occupations, and places extracted from historical documents, identify those that are variations of the same person name, occupation, and place, respectively. The performance of our method is evaluated on three manually constructed datasets and one public dataset in terms of precision, recall, and F-measure. The results demonstrate that the hybrid technique outperforms current methods and significantly improves the quality of cultural heritage data.
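
The general idea of combining similarity functions can be sketched as follows. This is not the measure proposed in the article: the two components (a `difflib` sequence ratio and a character-bigram Jaccard score), the equal weighting, and the example surnames are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def char_bigrams(text):
    """Character bigrams of a lowercased string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Set-overlap similarity on character bigrams."""
    grams_a, grams_b = char_bigrams(a), char_bigrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def sequence_ratio(a, b):
    """Order-sensitive similarity from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hybrid_similarity(a, b, weight=0.5):
    """Weighted mix of the two scores; the weight is an assumed parameter."""
    return weight * sequence_ratio(a, b) + (1 - weight) * jaccard(a, b)


# Invented example names: a spelling variant should score higher
# than an unrelated name.
print(hybrid_similarity("Jansen", "Janssen"))
print(hybrid_similarity("Jansen", "Peters"))
```

In practice, a threshold on such a score would decide whether two strings are treated as variants of the same name, occupation, or place; that threshold would have to be tuned on annotated data.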

European Conference on Machine Learning | 2015

HiDER: Query-Driven Entity Resolution for Historical Data

Bijan Ranjbar-Sahraei; J Julia Efremova; Hossein Rahmani; Toon Calders; Karl Tuyls; Gerhard Weiss

Entity Resolution (ER) is the task of finding references that refer to the same entity across different data sources. Cleaning a data warehouse and applying ER to it is a computationally demanding task, particularly for large data sets that change dynamically. Therefore, a query-driven approach that analyses a small subset of the entire data set and integrates the results in real time is significantly beneficial. Here, we present an interactive tool, called HiDER, which allows for query-driven ER in large collections of uncertain dynamic historical data. The input data includes civil registers such as birth, marriage, and death certificates in the form of structured data, and notarial acts such as estate tax and property transfers in the form of free text. The outputs are family networks and event timelines visualized in an integrated way. HiDER is being used and tested at the BHIC center (Brabant Historical Information Center, https://www.bhic.nl); despite the uncertainties of the BHIC input data, the extracted entities have high certainty and are enriched by extra information.

Communications in Computer and Information Science | 2015

Who Are My Ancestors? Retrieving Family Relationships from Historical Texts

J Julia Efremova; A Alejandro Montes Garcia; Alfredo Bolt Iriondo; Toon Calders

This paper presents an approach for automatically retrieving family relationships from a real-world collection of Dutch historical notary acts. We aim to retrieve relationships such as husband-wife, parent-child, and widow of. Our approach includes person name extraction, reference disambiguation, candidate generation, and family relationship prediction. Since we have a limited amount of training data, we evaluate different feature configurations based on n-gram analysis. The best results were obtained by using a combination of word bi-grams and tri-grams together with the distance in words between two names. We evaluate our results for each type of relationship in terms of precision, recall, and F-score.
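
The feature configuration described in this abstract can be sketched roughly as below. The sketch assumes that n-grams are taken from the words between the two candidate names and that names are given as token spans; the paper may define the context window differently. The function names and the English example sentence are invented for illustration (the actual corpus is Dutch).

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pair_features(tokens, first_span, second_span):
    """Features for a candidate name pair.

    Spans are (start, end) token indexes with an exclusive end, and the
    first span is assumed to precede the second one.
    """
    between = [t.lower() for t in tokens[first_span[1]:second_span[0]]]
    features = {"distance_in_words": len(between)}
    for n in (2, 3):  # word bi-grams and tri-grams
        for gram in word_ngrams(between, n):
            key = f"{n}gram={gram}"
            features[key] = features.get(key, 0) + 1
    return features


# Invented sentence with two name mentions at token spans (0, 2) and (4, 6).
tokens = "Jan Peters widower of Maria Jansen sells a house".split()
features = pair_features(tokens, (0, 2), (4, 6))
print(features)
```

A dictionary of this shape could then be vectorized and passed to any standard classifier to predict the relationship type for each candidate pair.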

ACM Symposium on Applied Computing | 2016

A robust density-based clustering algorithm for multi-manifold structure

J Jianpeng Zhang; Mykola Pechenizkiy; Yulong Pei; J Julia Efremova

In real-world pattern recognition tasks, data with a multi-manifold structure is ubiquitous and unpredictable. Performing effective clustering on such data is a challenging problem. In particular, it is not obvious how to design a similarity measure for multiple manifolds. In this paper, we address this problem by proposing a new manifold distance measure, which can better capture both local and global spatial manifold information. We define a new way of estimating local density that accounts for the density characteristic; it represents local density more accurately and is less sensitive to the parameter settings. In addition, in order to select the cluster centers automatically, a two-phase exemplar determination method is proposed. The experiments on several synthetic and real-world datasets show that the proposed algorithm has higher clustering effectiveness and better robustness for data with varying density, multi-scale, and noise overlap characteristics.

Lecture Notes in Computer Science | 2015

Effects of Evolutionary Linguistics in Text Classification

J Julia Efremova; A Alejandro Montes Garcia; J Jianpeng Zhang; Tgk Toon Calders

We perform an empirical study to explore the role of evolutionary linguistics in the text classification problem. We conduct experiments on a real-world collection of more than 100,000 Dutch historical notary acts. The document collection spans six centuries. Over such a long time period, some lexical terms changed significantly. Person names, professions, and other information changed over time as well. Standard text classification techniques that ignore the temporal information of the documents might not produce optimal results in our case. Therefore, we analyse the temporal aspects of the corpus. We explore the effect of training and testing the model on different time periods. We use time periods that correspond to the main historical events and also apply clustering techniques in order to create time periods in a data-driven way. All experiments show a strong time-dependency of our corpus. Exploiting this dependence, we extend standard classification techniques by combining different models trained on particular time periods and achieve overall accuracy above

Workshop on Population Reconstruction | 2014