
Publications


Featured research published by Myriam C. Traub.


International Conference on Theory and Practice of Digital Libraries | 2015

Impact analysis of OCR quality on research tasks in digital archives

Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate is important for assessing whether results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge of users, tool makers and data providers alike is insufficient and needs to be improved.
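
As a minimal illustration of this kind of bias (not taken from the paper), the sketch below contrasts a keyword search over a hypothetical ground-truth transcript with the same search over its noisy OCR counterpart: documents whose query term was garbled silently drop out of the result set.

```python
# Illustrative sketch (not from the paper): how OCR noise can bias a
# simple keyword search. Documents whose query term was garbled by OCR
# are silently missing from the result set.

ground_truth = {
    "doc1": "the minister announced new railway tariffs",
    "doc2": "railway strikes delayed the morning trains",
    "doc3": "parliament debated the railway budget",
}

# Hypothetical OCR output with typical character confusions.
ocr_output = {
    "doc1": "the minister announced new railvvay tariffs",  # w -> vv
    "doc2": "railway strikes delayed the morning trains",
    "doc3": "parliament debated the rai1way budget",        # l -> 1
}

def keyword_search(corpus, term):
    """Return ids of documents whose text contains the query term."""
    return {doc_id for doc_id, text in corpus.items() if term in text}

query = "railway"
true_hits = keyword_search(ground_truth, query)
ocr_hits = keyword_search(ocr_output, query)

print(f"relevant documents: {len(true_hits)}, retrieved from OCR text: {len(ocr_hits)}")
print("missed due to OCR errors:", true_hits - ocr_hits)
```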


ACM/IEEE Joint Conference on Digital Libraries | 2016

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus

Myriam C. Traub; Thaer Samar; Jacco van Ossenbruggen; Jiyin He; Arjen P. de Vries; Lynda Hardman

Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability for all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguish our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling and language use; and (2) instead of simulated queries, the collection comes with real user query logs including click data. First, we assess the retrievability bias imposed on the newspaper collection by different IR models. We assess the retrievability measure and confirm its ability to capture the retrievability bias in our setup. Second, we show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the retrievability results.
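
The retrievability measure referred to here is typically computed by issuing a large set of queries and counting, for each document, how often it appears within a rank cutoff; the bias of the resulting distribution is then summarized with a Gini coefficient. A minimal sketch, assuming a generic search(query, k) function that stands in for the IR models evaluated in the paper:

```python
# Minimal sketch of the retrievability measure and its bias summary
# (Gini coefficient). `search` stands in for any retrieval system and
# is an assumption, not the system used in the paper.

from collections import defaultdict

def retrievability_scores(doc_ids, queries, search, cutoff=100):
    """r(d): how often document d appears in the top-`cutoff` results
    over all queries (here every query is weighted equally)."""
    r = defaultdict(int)
    for q in queries:
        for doc_id in search(q, k=cutoff):
            r[doc_id] += 1
    # Documents never retrieved get a score of 0.
    return {d: r.get(d, 0) for d in doc_ids}

def gini(scores):
    """Gini coefficient of the score distribution: 0 means every
    document is equally retrievable, values near 1 indicate strong bias."""
    values = sorted(scores.values())
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * cum) / (n * total) - (n + 1) / n
```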


International World Wide Web Conference | 2014

Crowd vs. experts: nichesourcing for knowledge intensive tasks in cultural heritage

Jasper Oosterman; Alessandro Bozzon; Geert-Jan Houben; Archana Nottamkandath; Chris Dijkshoorn; Lora Aroyo; Mieke H. R. Leyssen; Myriam C. Traub

The results of our exploratory study provide new insights into crowdsourcing knowledge-intensive tasks. We designed and performed an annotation task on a print collection of the Rijksmuseum Amsterdam, involving experts and crowd workers in the domain-specific description of depicted flowers. We created a testbed to collect annotations from flower experts and crowd workers and analyzed these with regard to user agreement. The findings show promising results, demonstrating how, for given categories, nichesourcing can provide useful annotations by connecting crowdsourcing to domain expertise.
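
A simple way to analyze such expert-crowd agreement is a per-item comparison of the crowd's majority vote against the expert label. The sketch below is purely illustrative; the flower labels and data layout are invented and not the study's actual testbed:

```python
# Hypothetical sketch of an expert-crowd agreement analysis: compare
# crowd annotations against expert labels per print, using simple
# per-item accuracy of the crowd's majority vote.

from collections import Counter

expert_labels = {"print_01": "tulip", "print_02": "rose", "print_03": "iris"}
crowd_annotations = {
    "print_01": ["tulip", "tulip", "lily"],
    "print_02": ["rose", "rose", "rose"],
    "print_03": ["iris", "orchid", "orchid"],
}

def majority_vote(labels):
    """Most frequent label among the crowd annotations for one item."""
    return Counter(labels).most_common(1)[0][0]

agreement = {
    item: majority_vote(votes) == expert_labels[item]
    for item, votes in crowd_annotations.items()
}
accuracy = sum(agreement.values()) / len(agreement)
print(f"crowd majority vote matches expert on {accuracy:.0%} of prints")
```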


International Journal on Digital Libraries | 2018

Quantifying retrieval bias in Web archive search

Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries

A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
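
The URL-based aggregation described here can be sketched as collapsing a ranked result list so that only the highest-ranked version of each URL remains; the result-tuple format below is an assumption for illustration:

```python
# Sketch of URL-based aggregation of archive versions in a ranked result
# list: keep only the highest-ranked version of each URL. The tuple
# format (version_id, url, score) is an assumption for illustration.

def collapse_by_url(ranked_results):
    """ranked_results: list of (version_id, url, score), best first.
    Returns the list with all but the top-ranked version per URL removed."""
    seen_urls = set()
    collapsed = []
    for version_id, url, score in ranked_results:
        if url not in seen_urls:
            seen_urls.add(url)
            collapsed.append((version_id, url, score))
    return collapsed

results = [
    ("v3", "http://example.nl/page", 9.1),
    ("v1", "http://example.nl/page", 8.7),   # older version, dropped
    ("v2", "http://example.nl/other", 7.5),
]
print(collapse_by_url(results))
```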


ACM/IEEE Joint Conference on Digital Libraries | 2018

Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Myriam C. Traub; Thaer Samar; Jacco van Ossenbruggen; Lynda Hardman

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
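
The two quantities related in this study can be sketched as follows: a character error rate computed as the edit distance between the OCR text and its corrected transcript (normalised by transcript length), and a Pearson correlation between per-document error rates and retrievability scores. The helpers below are generic implementations, not the paper's code:

```python
# Generic sketch: OCR character error rate via Levenshtein distance and
# a Pearson correlation between error rates and retrievability scores.

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text, corrected_text):
    """Edits needed to turn the OCR text into the corrected transcript,
    normalised by transcript length."""
    return edit_distance(ocr_text, corrected_text) / max(len(corrected_text), 1)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0
```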


International Conference on Theory and Practice of Digital Libraries | 2016

Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Arjen P. de Vries

Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web, which focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset mainly covers the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls are different in terms of their size, number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.
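
The coverage check described here amounts to exact string matching of popular topics against a crawl's anchor texts. A small sketch with invented anchor texts and topics:

```python
# Sketch of the topic-coverage check: exact string matching of popular
# topics against the anchor texts of a crawl. The anchor texts and
# topics below are illustrative placeholders, not the study's data.

def topic_coverage(anchor_texts, topics):
    """Fraction of topics that occur verbatim as an anchor text."""
    anchors = {a.lower().strip() for a in anchor_texts}
    covered = {t for t in topics if t.lower().strip() in anchors}
    return len(covered) / len(topics), covered

kb_anchors = ["koningsdag", "elfstedentocht", "home", "contact"]
cc_anchors = ["koningsdag", "olympics 2012", "elfstedentocht", "eurovision"]
popular_topics = ["Koningsdag", "Olympics 2012", "Elfstedentocht"]

for name, anchors in [("depth-first (KB)", kb_anchors), ("breadth-first (CC)", cc_anchors)]:
    fraction, covered = topic_coverage(anchors, popular_topics)
    print(f"{name}: {fraction:.0%} of topics covered -> {sorted(covered)}")
```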


Information Interaction in Context | 2014

Measuring and improving data quality of media collections for professional tasks

Myriam C. Traub

Carrying out research tasks on data collections is hampered, or even made impossible, by data quality issues of varying type and severity, such as incompleteness or inconsistency. We identify research tasks carried out by professional users of data collections that are hampered by inherent quality issues. We investigate what types of issues exist and how they influence these research tasks. To measure the quality perceived by professional users, we develop a quality metric. This allows us to measure the suitability of the data quality for a chosen user task. For a chosen task, we study how the data quality can be improved using crowdsourcing. We validate our quality metric by investigating whether professionals perform better on the chosen research task.
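
As a generic illustration of a task-dependent quality metric (not the metric developed in this work), one common formulation weights per-dimension quality scores by their importance for the chosen task:

```python
# Generic illustration (not the metric developed in this work): a
# task-weighted, fitness-for-use quality score, where each quality
# dimension of a record is weighted by its importance for the task.
# Dimension names, weights and values are hypothetical.

def task_quality(record_scores, task_weights):
    """record_scores: per-dimension quality in [0, 1] (e.g. completeness).
    task_weights: importance of each dimension for the chosen task."""
    total_weight = sum(task_weights.values())
    return sum(record_scores.get(dim, 0.0) * w
               for dim, w in task_weights.items()) / total_weight

record = {"completeness": 0.6, "consistency": 0.9, "ocr_accuracy": 0.4}
dating_task = {"completeness": 3, "consistency": 1, "ocr_accuracy": 2}
print(f"quality for dating task: {task_quality(record, dating_task):.2f}")
```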


CEUR Workshop Proceedings | 2013

Personalized Nichesourcing: Acquisition of Qualitative Annotations from Niche Communities

Chris Dijkshoorn; Mieke H. R. Leyssen; Archana Nottamkandath; Jasper Oosterman; Myriam C. Traub; Lora Aroyo; Alessandro Bozzon; Wan Fokkink; Geert-Jan Houben; H. Hovelmann; Lizzy Jongma; J.R. van Ossenbruggen; Guus Schreiber; Jan Wielemaker


European Conference on Information Retrieval | 2014

Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations

Myriam C. Traub; Jacco van Ossenbruggen; Jiyin He; Lynda Hardman


Archive | 2013

Second Screen Interactions for Automatically Web-Enriched Broadcast Video

Lilia Perez Romero; Myriam C. Traub; Mieke H. R. Leyssen; Hazel Lynda Hardman

Collaboration


Dive into Myriam C. Traub's collaborations.

Top Co-Authors

Alessandro Bozzon (Delft University of Technology)
Arjen P. de Vries (Radboud University Nijmegen)
Geert-Jan Houben (Delft University of Technology)
Jasper Oosterman (Delft University of Technology)