Sawood Alam | Researchain

Publication

Featured research published by Sawood Alam.

international conference theory and practice digital libraries | 2015

Web Archive Profiling Through CDX Summarization

Sawood Alam; Michael L. Nelson; Herbert Van de Sompel; Lyudmila Balakireva; Harihar Shankar; David S. H. Rosenthal

With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to have a copy of the requested URI. Using the CDX files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator’s URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we gained up to 22% routing precision with less than 5% relative cost as compared to the complete knowledge profile, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while complete hostname and one path segment gave a fivefold increase in routing precision.
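The range of profiling policies described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `host_segments` and `path_segments` parameters, and the SURT-like key layout are assumptions made for illustration. Limiting the key to a fixed number of host labels is also only a rough stand-in for the registered domain policy, which in practice needs a public suffix list.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Build a SURT-like key: reversed host labels, then leading path segments.

    host_segments=None keeps the complete hostname (hypothetical parameter).
    """
    parts = urlsplit(uri)
    # "www.example.com" -> ["com", "example", "www"]
    labels = parts.hostname.lower().split(".")[::-1]
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in uris}


def should_poll(profile, uri, **policy):
    """Route a request to the archive only if its key is in the profile."""
    return profile_key(uri, **policy) in profile


holdings = ["http://www.example.com/a/b", "http://news.example.org/x"]
tld_profile = build_profile(holdings, host_segments=1)   # TLD-only policy
h_p1_profile = build_profile(holdings, path_segments=1)  # hostname + one path segment
```

Because every held URI maps to a key that is in the profile, neither policy produces false negatives. The TLD-only profile would route any `.com` request to this archive, while the longer keys reject `http://other.com/` at the cost of a larger profile.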

acm/ieee joint conference on digital libraries | 2016

MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go

Sawood Alam; Michael L. Nelson

The Memento protocol makes it easy to build a uniform lookup service to aggregate the holdings of web archives. However, there is a lack of tools to utilize this capability in archiving applications and research projects. We created MemGator, an open source, easy to use, portable, concurrent, cross-platform, and self-documented Memento aggregator CLI and server tool written in Go. MemGator implements all the basic features of a Memento aggregator (e.g., TimeMap and TimeGate) and gives the ability to customize various options, including which archives are aggregated. It is being used heavily by tools and services such as Mink, WAIL, and oldweb.today, as well as by archiving research projects, and has proved to be reliable even under extreme load.
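The idea of concurrent TimeMap aggregation can be sketched as follows. This is not MemGator's code (MemGator is written in Go) and it performs no HTTP requests: the `ARCHIVES` dictionary, `fetch_timemap`, and `aggregate_timemap` are hypothetical stand-ins that only illustrate querying several archives in parallel and merging their mementos by datetime.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for archives: each maps an original URI to a list of
# (14-digit datetime, memento URI) pairs. Real archives are queried over HTTP.
ARCHIVES = {
    "archive-a": {
        "http://example.com/": [
            ("20150101000000", "http://a.example/web/20150101000000/http://example.com/"),
        ],
    },
    "archive-b": {
        "http://example.com/": [
            ("20140601000000", "http://b.example/20140601000000/http://example.com/"),
        ],
    },
}


def fetch_timemap(archive, uri):
    """Stand-in for a TimeMap request to one archive."""
    return ARCHIVES[archive].get(uri, [])


def aggregate_timemap(uri, archives):
    """Query all archives concurrently and merge the mementos by datetime."""
    with ThreadPoolExecutor(max_workers=max(1, len(archives))) as pool:
        timemaps = list(pool.map(lambda a: fetch_timemap(a, uri), archives))
    return sorted(memento for timemap in timemaps for memento in timemap)


merged = aggregate_timemap("http://example.com/", list(ARCHIVES))
```

Restricting the `archives` argument to a subset corresponds loosely to customizing which archives are aggregated.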

international conference theory and practice digital libraries | 2016

Web Archive Profiling Through Fulltext Search

Sawood Alam; Michael L. Nelson; Herbert Van de Sompel; David S. H. Rosenthal

An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from the CDX files (the index of a web archive), which we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive involve sampling-based approaches, such as fulltext keyword searching to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. Fulltext search-based discovery and profiling is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make correct routing decisions for 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
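A random search walk of this kind can be sketched in a few lines. The sketch is a simplification under stated assumptions, not the RSM configurations evaluated in the paper: the toy `ARCHIVE`, the `search` function standing in for an archive's fulltext search, and the rule of drawing the next keyword uniformly from unused words seen in earlier results are all hypothetical.

```python
import random

# Hypothetical toy archive: URI -> page text. A real archive would expose a
# fulltext search service instead of an in-memory dictionary.
ARCHIVE = {
    "http://example.com/cats": "cats chase mice",
    "http://example.com/mice": "mice eat cheese",
    "http://example.org/cheese": "cheese ages slowly",
}


def search(keyword):
    """Stand-in for a fulltext search: return (URI, text) pairs of matches."""
    return [(uri, text) for uri, text in ARCHIVE.items() if keyword in text.split()]


def random_search_walk(seed, max_queries, rng):
    """Discover URIs by repeatedly searching for a randomly chosen known word."""
    discovered, vocabulary, used = set(), [seed], set()
    for _ in range(max_queries):
        candidates = [w for w in vocabulary if w not in used]
        if not candidates:
            break  # every learned word has already been searched
        word = rng.choice(candidates)
        used.add(word)
        for uri, text in search(word):
            discovered.add(uri)
            # Learn new keywords from the result text for later queries.
            for w in text.split():
                if w not in vocabulary:
                    vocabulary.append(w)
    return discovered


sample = random_search_walk("cats", max_queries=3, rng=random.Random(0))
```

The number of queries issued plays the role of the search cost, and the discovered URIs would then be summarized into a profile under a chosen profiling policy.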

international conference theory and practice digital libraries | 2016

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives

Mat Kelly; Sawood Alam; Michael L. Nelson; Michele C. Weigle

We have integrated Web ARChive (WARC) files with the peer-to-peer, content-addressable InterPlanetary File System (IPFS) to allow the payload content of web archives to be easily propagated. We also provide an archival replay system extended from pywb to fetch the WARC content from IPFS and re-assemble the originally archived HTTP responses for replay. From a 1.0 GB sample Archive-It collection of WARCs containing 21,994 mementos, we show that extracting and indexing the HTTP response content of WARCs containing IPFS lookup hashes takes 66.6 min, inclusive of dissemination into IPFS.

acm/ieee joint conference on digital libraries | 2016

InterPlanetary Wayback: The Permanent Web Archive

Sawood Alam; Mat Kelly; Michael L. Nelson

To facilitate permanence and collaboration in web archives, we built InterPlanetary Wayback to disseminate the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. We split the header and payload of WARC response records before disseminating them into IPFS to leverage the deduplication, build a CDXJ index, and combine them at the time of replay. From a 1.0 GB sample Archive-It collection of WARCs containing 21,994 mementos, we found that, on average, 570 files can be indexed and disseminated into IPFS per minute. We also found that in our naive prototype implementation, replay took 370 milliseconds per request on average.
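The benefit of splitting headers from payloads can be illustrated with a small in-memory sketch. It does not use IPFS or the CDXJ format: `ContentStore` is a hypothetical stand-in that keys blocks by their SHA-256 digest (IPFS uses its own content identifiers), and a plain dictionary stands in for the index. The sketch also assumes each HTTP response contains a blank line separating header and payload.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressable store such as IPFS."""

    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data  # identical content always maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self.blocks[key]


def index_response(store, index, uri, timestamp, http_response: bytes):
    """Split a response at the first blank line and store both parts separately."""
    header, _, payload = http_response.partition(b"\r\n\r\n")
    index[(uri, timestamp)] = (store.add(header), store.add(payload))


def replay(store, index, uri, timestamp) -> bytes:
    """Look up both keys in the index and re-assemble the original response."""
    header_key, payload_key = index[(uri, timestamp)]
    return store.get(header_key) + b"\r\n\r\n" + store.get(payload_key)


store, index = ContentStore(), {}
r1 = b"HTTP/1.1 200 OK\r\nDate: Fri, 01 Jan 2016 00:00:00 GMT\r\n\r\n<html>same</html>"
r2 = b"HTTP/1.1 200 OK\r\nDate: Sat, 02 Jan 2016 00:00:00 GMT\r\n\r\n<html>same</html>"
index_response(store, index, "http://example.com/", "20160101000000", r1)
index_response(store, index, "http://example.com/", "20160102000000", r2)
```

The two captures differ only in their `Date` header, so the store ends up holding two header blocks and a single shared payload block. Had the responses been stored whole, the differing headers would have prevented any deduplication.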

acm ieee joint conference on digital libraries | 2018

Unobtrusive and Extensible Archival Replay Banners Using Custom Elements

Sawood Alam; Mat Kelly; Michele C. Weigle; Michael L. Nelson

We compare and contrast three different ways to implement an archival replay banner. We propose an implementation that utilizes Custom Elements and adds some unique behaviors, not common in existing archival replay systems, to enhance the user experience. Our approach has a minimal user interface footprint and resource overhead while still providing rich interactivity and extended on-demand provenance information about the archived resources.

acm ieee joint conference on digital libraries | 2018

ArchiveNow: Simplified, Extensible, Multi-Archive Preservation

Mohamed Aturban; Mat Kelly; Sawood Alam; John A. Berlin; Michael L. Nelson; Michele C. Weigle

ArchiveNow is a Python module for preserving web pages in on-demand web archives. This module allows a user to submit a URI of a web page for archiving at several configured web archives. Once the web page is captured, ArchiveNow provides the user with links to the archived copies of the web page. ArchiveNow is initially configured to use four archives, but archives can easily be added or removed. In addition to pushing web pages to public archives, ArchiveNow, through the use of Wget and Squidwarc, allows users to generate local WARC files, enabling them to create their own personal and private archives.

acm ieee joint conference on digital libraries | 2017