Harihar Shankar | Researchain

Publication

Featured research published by Harihar Shankar.

PLOS ONE | 2014

Scholarly context not found: One in five articles suffers from reference rot

Martin Klein; Herbert Van de Sompel; Robert Sanderson; Harihar Shankar; Lyudmila Balakireva; Ke Zhou; Richard Tobin

The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. However, the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTP URI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored in this regard in the context of the Hiberlink project.
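
The two checks described in this abstract (is the URI still responsive, and does an archive hold a snapshot close to the time of referencing) can be illustrated with a minimal Python sketch. The abstract does not give the exact criteria, so the 14-day window, the status-code rule, and the function names below are assumptions made purely for illustration, not the authors' method.

```python
from datetime import datetime, timedelta


def is_unresponsive(status_code):
    """Treat a missing response or an HTTP 4xx/5xx status as link rot (assumed rule)."""
    return status_code is None or status_code >= 400


def closest_snapshot(reference_date, snapshot_dates, window_days=14):
    """Return the snapshot date nearest to reference_date, or None.

    Only snapshots within window_days of the reference date are considered.
    The window size is a hypothetical stand-in for "representative".
    """
    window = timedelta(days=window_days)
    candidates = [d for d in snapshot_dates if abs(d - reference_date) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs(d - reference_date))


if __name__ == "__main__":
    referenced = datetime(2010, 5, 1)
    snapshots = [datetime(2010, 4, 25), datetime(2010, 5, 3), datetime(2011, 1, 1)]
    print(is_unresponsive(404), closest_snapshot(referenced, snapshots))
```

Under these assumptions, a reference would count as affected when the live URI is unresponsive and no snapshot falls inside the window.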

International Conference on Theory and Practice of Digital Libraries | 2015

Web Archive Profiling Through CDX Summarization

Sawood Alam; Michael L. Nelson; Herbert Van de Sompel; Lyudmila Balakireva; Harihar Shankar; David S. H. Rosenthal

With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should only poll the archives that are likely to have a copy of the requested URI. Using the CDX files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator’s URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we gained up to 22% routing precision with less than 5% relative cost as compared to the complete knowledge profile without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while complete hostname and one path segment gave a fivefold increase in routing precision.
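
To make the range of profiling strategies concrete, the following Python sketch derives a profile key from a URI at different granularities, from TLD-only up to full hostname plus one path segment. The key format (reversed host labels joined by commas) and the function names are assumptions for illustration; the abstract does not specify them. The two-label case only approximates a registered domain, since a correct answer requires a public suffix list.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_labels=None, path_segments=0):
    """Build a coarse lookup key for a URI.

    host_labels: number of host labels to keep, counted from the TLD
                 (1 = TLD only, None = complete hostname).
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # TLD first, e.g. ["com", "example", "www"]
    if host_labels is not None:
        labels = labels[:host_labels]
    segments = [s for s in parts.path.split("/") if s]
    return ",".join(labels) + ")/" + "/".join(segments[:path_segments])


def should_poll(archive_profile, uri, host_labels=None, path_segments=0):
    """Poll an archive only if its profile contains the key for this URI."""
    return profile_key(uri, host_labels, path_segments) in archive_profile


if __name__ == "__main__":
    uri = "http://www.example.com/a/b/c.html"
    print(profile_key(uri, host_labels=1))
    print(profile_key(uri, host_labels=2))
    print(profile_key(uri, path_segments=1))
```

Coarser keys keep the profile small but match more URIs the archive does not actually hold, which is the trade-off between profile size and false positives that the abstract describes.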

PLOS ONE | 2016

Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content

Shawn M. Jones; Herbert Van de Sompel; Harihar Shankar; Martin Klein; Richard Tobin; Claire Grover

Increasingly, scholarly articles contain URI references to “web at large” resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they report on. A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article. We do so in a two-step approach that relies on various well-established similarity measures to compare textual content. In a first step, we use 19 web archives to find snapshots of referenced web at large resources that have textual content that is representative of the state of the resource around the time of publication of the referencing paper. We find that representative snapshots exist for about 30% of all URI references. In a second step, we compare the textual content of representative snapshots with that of their live web counterparts. We find that for over 75% of references the content has drifted away from what it was when referenced. These results raise significant concerns regarding the long-term integrity of the web-based scholarly record and call for the deployment of techniques to combat these problems.
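
The second step, comparing a snapshot's text with its live web counterpart, can be sketched with one simple similarity measure. The abstract does not name the measures used, so Jaccard similarity over word sets and the 0.9 threshold below are assumptions chosen only to illustrate the idea.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the lowercase word sets of two texts (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def has_drifted(snapshot_text, live_text, threshold=0.9):
    """Flag content drift when similarity falls below a hypothetical threshold."""
    return jaccard_similarity(snapshot_text, live_text) < threshold


if __name__ == "__main__":
    archived = "project home page with dataset download links"
    live = "this domain is for sale"
    print(jaccard_similarity(archived, live), has_drifted(archived, live))
```

A set-based measure ignores word order and repetition, so a real comparison would likely combine several measures, as the abstract indicates.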

International Conference on Intelligent Computer Mathematics | 2014

Towards Robust Hyperlinks for Web-Based Scholarly Communication

Herbert Van de Sompel; Martin Klein; Harihar Shankar

As the scholarly communication system evolves to become natively web-based, hyperlinks are increasingly used to refer to web resources that are created or used in the course of the research process. These hyperlinks are subject to reference rot: a link may break or the linked content may drift and eventually no longer be representative of the content intended by the link. The Hiberlink project quantifies the problem and investigates approaches aimed at alleviating it. The presentation will provide insight into the project’s findings that result from mining a massive body of scholarly literature spanning the period from 1997 to 2012. It will also provide an overview of components of a possible solution: pro-active web archiving, links with added attributes, and the Memento “Time Travel for the Web” protocol.
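
As a rough illustration of the Memento component, the sketch below builds the request header a client would send to ask for a version of a resource near a given date. The `Accept-Datetime` header comes from the Memento protocol (RFC 7089), which is not detailed in this abstract; the function name is hypothetical, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(reference_date):
    """Return headers asking a TimeGate for a version near reference_date.

    reference_date must be a timezone-aware datetime; it is converted to UTC
    and rendered in the HTTP date format.
    """
    utc_date = reference_date.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_date, usegmt=True)}


if __name__ == "__main__":
    headers = memento_request_headers(datetime(2010, 4, 1, tzinfo=timezone.utc))
    print(headers)
```

A client would send these headers to a TimeGate for the referenced URI; which TimeGate to use is left open here.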

ACM/IEEE Joint Conference on Digital Libraries | 2018

Robust Links in Scholarly Communication

Martin Klein; Harihar Shankar; Herbert Van de Sompel

Web resources change over time and many ultimately disappear. While this has become an inconvenient reality in day-to-day use of the web, it is problematic when these resources are referenced in scholarship where it is expected that referenced materials can reliably be revisited. We introduce Robust Links, an approach aimed at maintaining the integrity of the scholarly record in a dynamic web environment. The approach consists of archiving web resources when referencing them and decorating links to convey information that supports accessing referenced resources both on the live web and in web archives.
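
Link decoration of this kind can be sketched as a small Python helper that emits an HTML anchor carrying both the live URL and a pointer to an archived version. The attribute names `data-versionurl` and `data-versiondate` reflect our understanding of the Robust Links convention and are not stated in this abstract, so treat them, along with the function name and example URLs, as assumptions.

```python
from html import escape


def robust_link(original_url, snapshot_url, version_date, link_text):
    """Return an HTML anchor decorated with archive information.

    original_url: the live web URL being referenced.
    snapshot_url: URL of the archived snapshot created when referencing.
    version_date: date of referencing, e.g. "2018-01-01".
    """
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_url, quote=True),
        escape(snapshot_url, quote=True),
        escape(version_date, quote=True),
        escape(link_text),
    )


if __name__ == "__main__":
    print(robust_link(
        "https://example.org/page?a=1&b=2",
        "https://archive.example/20180101/https://example.org/page",
        "2018-01-01",
        "Example page",
    ))
```

Keeping the live URL in `href` means the link behaves normally for readers, while the extra attributes give tools enough information to fall back to the archived copy.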

arXiv: Information Retrieval | 2009