Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Thaer Samar is active.

Publication


Featured researches published by Thaer Samar.


International Journal on Digital Libraries | 2015

Lost but not forgotten: finding pages on the unarchived web

Hugo C. Huurdeman; Jaap Kamps; Thaer Samar; Arjen P. de Vries; Anat Ben-David; Richard Rogers

Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.


acm/ieee joint conference on digital libraries | 2016

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus

Myriam C. Traub; Thaer Samar; Jacco van Ossenbruggen; Jiyin He; Arjen P. de Vries; Lynda Hardman

Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability for all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguishes our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, the collection comes with real user query logs including click data. First, we assess the retrievability bias imposed on the newspaper collection by different IR models. We assess the retrievability measure and confirm its ability to capture the retrievability bias in our setup. Second, we show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the retrievability results.


international acm sigir conference on research and development in information retrieval | 2014

Uncovering the unarchived web

Thaer Samar; Hugo C. Huurdeman; Anat Ben-David; Jaap Kamps; Arjen P. de Vries

Many national and international heritage institutes realize the importance of archiving the web for future culture heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation; which could be used to reconstruct impressions of pages that existed on the live web of the crawl date, but would have been lost forever. We present a method to create representations of what we will refer to as a web collections (aura): the web documents that were not included in the archived collection, but are known to have existed --- due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.


acm/ieee joint conference on digital libraries | 2014

Finding pages on the unarchived web

Hugo C. Huurdeman; Anat Ben-David; Jaap Kamps; Thaer Samar; Arjen P. de Vries

Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies-most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the DutchWeb archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.


european conference on information retrieval | 2014

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Alejandro Bellogín; Thaer Samar; Arjen P. de Vries; Alan Said

The TREC 2013 Contextual Suggestion Track allowed participants to submit personalised rankings using documents either from the OpenWeb or from an archived, static Web collection, the ClueWeb12 dataset. We argue that this setting poses problems in how the performance of the participants should be compared. We analyse biases found in the process, both objective and subjective, and discuss these issues in the general framework of evaluating personalised Information Retrieval using dynamic against static datasets.


International Journal on Digital Libraries | 2018

Quantifying retrieval bias in Web archive search

Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries

A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.


acm ieee joint conference on digital libraries | 2018

Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Myriam C. Traub; Thaer Samar; Jacco van Ossenbruggen; Lynda Hardman

Digitized document collections often suffer from OCR errors that may impact a documents readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the librarys search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.


international conference theory and practice digital libraries | 2016

Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Arjen P. de Vries

Web archives preserve the fast changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal to crawl websites as completes as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web, this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage, while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web; both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls are different in terms of their size, number of included websites and domains. To allow fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.


european conference on information retrieval | 2014

Column Stores as an IR Prototyping Tool

Hannes Mühleisen; Thaer Samar; Jimmy J. Lin; Arjen P. de Vries

We make the suggestion that instead of implementing custom index structures and query evaluation algorithms, IR researchers should simply store document representations in a column-oriented relational database and write ranking models using SQL. For rapid prototyping, this is particularly advantageous since researchers can explore new ranking functions and features by simply issuing SQL queries, without needing to write imperative code. We demonstrate the feasibility of this approach by an implementation of conjunctive BM25 using MonetDB on a part of the ClueWeb12 collection.


text retrieval conference | 2013

CWI and TU Delft Notebook TREC 2013: Contextual Suggestion, Federated Web Search, KBA, and Web Tracks

Alejandro Bellogín; Gebrekirstos G. Gebremeskel; Jiyin He; Alan Said; Thaer Samar; Arjen P. de Vries; Jimmy J. Lin; Jeroen B. P. Vuurens

Collaboration


Dive into the Thaer Samar's collaboration.

Top Co-Authors

Avatar

Arjen P. de Vries

Radboud University Nijmegen

View shared research outputs
Top Co-Authors

Avatar

Alejandro Bellogín

Autonomous University of Madrid

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Jaap Kamps

University of Amsterdam

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Anat Ben-David

Open University of Israel

View shared research outputs
Top Co-Authors

Avatar

Alan Said

Delft University of Technology

View shared research outputs
Top Co-Authors

Avatar
Researchain Logo
Decentralizing Knowledge