Suhit Gupta
Columbia University
Publication
Featured research published by Suhit Gupta.
International World Wide Web Conferences | 2005
Suhit Gupta; Gail E. Kaiser; Peter Grimm; Michael F. Chiang; Justin Starren
Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of “useful and relevant” content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage’s inherent look and feel. Unlike “Content Reformatting,” which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses “Content Extraction.” We have developed a framework that employs an easily extensible set of techniques, incorporating the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available web proxy that extracts content from HTML web pages. This proxy can be administered centrally for groups of users or run by individuals for their personal browsers. After receiving user feedback on the proxy, we also created a revised version designed with improved performance and accessibility in mind.
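As a rough illustration of the DOM-based approach (not the Crunch implementation itself, whose filters and thresholds differ), the following Python sketch assumes BeautifulSoup and uses a hypothetical link-to-text ratio threshold to prune navigation-style subtrees:

```python
# Minimal sketch of DOM-tree clutter filtering, assuming BeautifulSoup.
# The tag list and link-to-text ratio threshold are illustrative.
from bs4 import BeautifulSoup

def extract_content(html, max_link_ratio=0.5):
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that are almost always clutter.
    for tag in soup(["script", "style", "iframe"]):
        tag.decompose()
    # Prune subtrees dominated by link text (navigation bars, ad blocks).
    for node in soup.find_all(["div", "table", "ul"]):
        if node.decomposed:  # already removed with an ancestor
            continue
        text = node.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(a.get_text(" ", strip=True)
                             for a in node.find_all("a"))
        if len(link_text) / len(text) > max_link_ratio:
            node.decompose()
    return soup.get_text("\n", strip=True)
```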
Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A) | 2005
Suhit Gupta; Gail E. Kaiser
Web pages often contain clutter (such as ads, unnecessary animations and extraneous links) around the body of an article, which distracts a user from the actual content. This can be especially inconvenient for blind and visually impaired users. The W3C’s Web Accessibility Initiative (WAI) has defined a set of guidelines to make web pages more compatible with tools built specifically for persons with disabilities. While this initiative has put forth an excellent set of principles, many websites unfortunately continue to be inaccessible as well as cluttered. To address the clutter problem, we have developed a framework that employs a host of heuristics, in the form of tunable filters, for content extraction. Our hypothesis is that automatically filtering out selected elements from websites will leave the base content that users are interested in and, as a side effect, render the pages more accessible. Although our heuristics are intuition-based rather than derived from the W3C accessibility guidelines, we expected that they would have little impact on web pages that are fully compliant with those guidelines. We were wrong: some (technically) accessible web pages still include significant clutter. This paper discusses our content extraction framework and its application to accessible web pages.
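A hypothetical sketch of what “tunable filters” might look like in code; the filter names, settings keys, and defaults below are illustrative, not Crunch’s actual API:

```python
# Hypothetical filter-settings sketch; Crunch's real filter names,
# thresholds, and defaults differ. Each entry is a toggle or a threshold.
settings = {
    "remove_images": False,       # keep images unless the user disables them
    "remove_link_lists": True,    # strip navigation/ad link blocks
    "link_ratio_threshold": 0.5,  # a subtree is a "link list" above this
    "remove_empty_tables": True,
}

def run_filters(dom, filters, settings):
    """Apply each enabled filter to the DOM tree, in order."""
    for name, filter_fn in filters:
        if settings.get(name, False):
            filter_fn(dom, settings)  # each filter reads its own thresholds
    return dom
```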
Archive | 2001
Philip Gross; Suhit Gupta; Gail E. Kaiser; Gaurav S. Kc; Janak J. Parekh
We present an interaction model enabling data-source probes and action-based gauges to communicate using an intelligent event model known as ActEvents. ActEvents build on conventional event concepts by associating structural and semantic information with raw data, thereby allowing recipients to dynamically understand the content of new kinds of events. Two submodels of ActEvents are proposed: SmartEvents, which are XML-structured events containing references to their syntactic and semantic models, and Gaugents, which are heavier but more flexible intelligent mobile software agents. This model is presented in light of DARPA’s DASADA program, where ActEvents are used in a larger-scale subsystem, called KX, which supports continual validation of distributed, component-based systems. In this architecture, ActEvents are emitted by probes and propagated to gauges, where “measurements” of the raw probe data are made, thereby continually determining updated target-system properties. ActEvents are also proposed as solutions for a number of other applications, including a distributed collaborative virtual environment (CVE) known as CHIME.
Introduction and Motivation
DARPA’s DASADA program has focused on standards for distributed systems to ease assembly and maintenance of systems that are composed of components “from anywhere” (e.g., COTS, GOTS, open source, etc.). This program has focused on four areas: architecture description languages to describe the composed system, probes to gather information about the current system configuration and state, gauges to interpret this information, and adaptation engines that can reconfigure the system as necessary. This paper focuses on the interaction between probes and gauges, and proposes a standard for data interchange between them. The control interfaces for both probes and gauges have been developed extensively, and standards have been proposed by others. However, the format and transmission mechanism for data collected from probes is underdeveloped. We examine the problem and suggest possible models and architectures, along with a description of our implementation and experience using it.
Probes, Gauges and Events
A probe is defined as “an individual sensor attached, either statically or dynamically, to a running program” [1]. Probes emit events that describe some aspect of a program’s execution, either at a specific point in time or over some duration. Probes usually:
• are integrated into or wrapped onto the application itself;
• communicate with the application via an API; or
• look at indirect measures such as operating system or network resource usage.
The proposed control interface for probes consists of the following methods: Deploy, Install, Activate (and their inverses), Query-Sensed and Generate-Sensed to enumerate the events that a probe can send, and the Sensed method to publish an event. The newer Focus interface allows additional probes to be activated for detailed examination of a problem. The DASADA standard assumes that probe data will be emitted in the form of Siena events. For the purposes of this paper, we define an event as “a collection of data produced by a system component, and of interest to zero or more other system components.” Note that this definition makes no assertions about formatting, routing, or transport. The University of Colorado at Boulder’s Siena event system [2] enables Internet-scale content-based event delivery. Siena models events as an unordered, flat collection of attribute-value pairs.
Gauges [3] are defined as “software entities that gather, aggregate, compute, analyze, disseminate and/or visualize measurement information about software systems.” Gauges support a simple configuration interface. The proposed gauge standard includes the concept of a “Gauge Reporting Bus,” which is specifically for communicating gauge reports to consumers (who might, e.g., authorize repairs). Consumers supply callbacks to the reporting bus, which are called when an event of interest occurs.
Probe-Gauge Interaction
Probes use system-specific techniques to extract data from the target system. Gauges use the Gauge Reporting Bus interface to report to higher-level components. While the respective APIs for probes and gauges are clearly specified, there is no proposed standard for formatting probe data and sending it to the appropriate gauges. Since one cannot assume that probes and gauges will be located on the same machine, some form of networked interprocess communication (IPC) is necessary. Since the machines may be of heterogeneous type, the format for probe data should be as portable as possible. While we do not address the issue in this paper, we also note that the standard interface for controlling probes, although presumably intended to be an event-based interface, is in fact specified as an RPC-style function API.
The Problem
There are three aspects of the probe-gauge relationship that make the connection problem difficult: the dynamic nature of individual probes, the dynamic topology of the various components, and the heterogeneous nature of the systems involved. Individual probes may be frequently added to and removed from the system. Probes may be heterogeneously sourced, with possibly different semantics for similar-looking data; simply labeling the type of data elements within the event, as in traditional attribute/value pairs, is insufficient. Instead, the semantic information required for proper interpretation of the probe data must be associated with the event. Probes and gauges will be activated and deactivated, and may migrate from machine to machine. Some of these components (especially probes) may be running on constrained devices, and requiring every component to maintain a complete network topology is not feasible. Further, since the main tasks of most probes are straightforward, requiring all of them to add the data and logic necessary to manage bidirectional RPC with gauges in a changing environment would increase their complexity considerably. Detailed knowledge of event routing and dispatch should ideally be removed from most probes and gauges. While more advanced systems such as CORBA can help with component discovery, probes will typically have many consumers for a single event, which is not handled efficiently under the CORBA model nor under analogous RPC extensions. The systems involved may be completely heterogeneous, with different byte orderings, operating systems, architectures, etc. Message formatting should be completely architecture-independent, and should leverage industry standards to the degree possible.
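To make the SmartEvent idea concrete, here is a hedged Python sketch of an XML event carrying references to its syntactic (schema) and semantic models; the element names and URLs are invented for illustration and are not the DASADA/KX wire format:

```python
# Hedged sketch of a SmartEvent: an XML event that references its own
# syntactic model (schema) and semantic model, so a recipient can interpret
# previously unseen event kinds. Element names and URLs are invented here.
import xml.etree.ElementTree as ET

def make_smart_event(probe_id, metric, value):
    event = ET.Element("SmartEvent", {
        "schemaRef": "http://example.org/probe-event.xsd",     # syntactic model
        "semanticRef": "http://example.org/metric-ontology#",  # semantic model
    })
    ET.SubElement(event, "source").text = probe_id
    datum = ET.SubElement(event, "datum", {"name": metric})
    datum.text = str(value)
    return ET.tostring(event, encoding="unicode")

print(make_smart_event("probe-42", "cpu_load", "0.83"))
```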
International Conference on Web-Based Learning | 2005
Suhit Gupta; Gail E. Kaiser
We previously developed a collaborative virtual environment (CVE) for small-group virtual classrooms, intended for distance learning by geographically dispersed students. The CVE employs a P2P approach to the frequent real-time updates to the 3D virtual worlds required by avatar movements (fellow students in the same room). This paper focuses on our extensions to support group viewing of lecture videos, called VECTORS, for Video Enhanced Collaboration for Team Oriented Remote Synchronization. VECTORS supports synchronized viewing of lecture videos, so the students all see “the same thing at the same time” and can pause, rewind, etc. in synchrony while discussing the lecture via “chat”. We are particularly concerned with the needs of the technologically disenfranchised, e.g., those whose only Internet access is via dialup networking. VECTORS therefore employs semantically compressed videos with meager bandwidth requirements.
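As an illustration of synchronized playback control (a sketch under assumed primitives, not the actual VECTORS protocol or its semantic compression), each player action can be broadcast with the video position and a timestamp so peers can compensate for transit delay:

```python
# Illustrative sketch of synchronized playback; `channel` and `player`
# are hypothetical primitives. Every action carries the video position so
# peers stay on "the same thing at the same time".
import time

class SyncedPlayer:
    def __init__(self, channel, player):
        self.channel = channel  # broadcasts dicts to all group members
        self.player = player    # local video player with seek/play/pause

    def do(self, action, position):
        """Perform an action locally and announce it to the group."""
        self.channel.broadcast({"action": action, "pos": position,
                                "at": time.time()})
        self.apply(action, position)

    def on_message(self, msg):
        """Apply a peer's action, compensating for transit delay."""
        lag = time.time() - msg["at"]
        pos = msg["pos"] + (lag if msg["action"] == "play" else 0)
        self.apply(msg["action"], pos)

    def apply(self, action, position):
        self.player.seek(position)
        getattr(self.player, action)()  # play() or pause()
```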
International World Wide Web Conferences | 2006
Suhit Gupta; Hila Becker; Gail E. Kaiser; Salvatore J. Stolfo
The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, that content may be lost in the clutter, particularly hurting users browsing on small cell phone and PDA screens and visually impaired users relying on speech rendering of web pages. Using the genre of a web page, we have created a solution, Crunch, that automatically identifies clutter and removes it, leaving a clean, content-full page. To evaluate the improvement this technology brings to those applications, we designed a number of experiments. In this paper, we present those experiments, the associated results, and their evaluation.
CATE | 2004
Suhit Gupta; Gail E. Kaiser
We present a 3D collaborative virtual environment, CHIME, in which geographically dispersed students can meet in study groups or work on team projects. Conventional educational materials from heterogeneous backend data sources are reflected in the virtual world through an automated metadata extraction and projection process that structurally organizes container materials into rooms and interconnecting doors, with the atomic objects within containers depicted as furnishings and decorations. A novel in-world authoring tool makes it easy for instructors to design environments, with additional in-world modification afforded to the students themselves, in both cases without programming. Specialized educational services can also be added to virtual environments via programmed plugins. We present an example plugin that supports synchronized viewing of lecture videos by groups of students with widely varying bandwidths.
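The projection process can be pictured as a recursive walk over the container hierarchy; the world-building API below is hypothetical, intended only to illustrate the container-to-room and item-to-furnishing mapping:

```python
# Hypothetical projection sketch: containers become rooms joined by doors,
# and each container's atomic items become furnishings. The world API is
# invented for illustration; CHIME's actual data model is not shown.
def project(container, world):
    room = world.add_room(name=container["name"])
    for item in container.get("items", []):
        room.add_furnishing(item["name"], kind=item.get("type", "object"))
    for child in container.get("children", []):
        child_room = project(child, world)   # recurse into sub-containers
        world.add_door(room, child_room)     # doors mirror containment links
    return room
```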
Archive | 2005
Suhit Gupta; Gail E. Kaiser; Salvatore J. Stolfo; Hila Becker
Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from the actual content. Automatic extraction of “useful and relevant” content from web pages has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. Prior work led to the development of Crunch, a framework that employs various heuristics, in the form of filters and filter settings, for content extraction. Crunch allows users to tune these settings, essentially the thresholds for applying each filter. However, to reduce human involvement in selecting these heuristic settings, we have extended this work to utilize a website’s classification, defined by its genre and physical layout. In particular, Crunch obtains the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings, which in practice produces better content extraction results than a single one-size-fits-all set of defaults. In this paper, we present our approach to clustering a large corpus of websites by genre, utilizing the snippets generated by sending each website’s domain name to search engines as well as the website’s own text. We find that exploiting these snippets not only increases the frequency of the function words that directly assist in detecting a website’s genre, but also allows for easier clustering of websites. We use existing techniques, such as the Manhattan distance measure and hierarchical clustering, with some modifications, to pre-classify websites into genres. Our clustering method does not require prior knowledge of the set of genres that websites fit into, but instead discovers these relationships among websites. Subsequently, we are able to classify newly encountered websites in linear time and then apply the corresponding filter settings, with no noticeable delay introduced by the content-extracting web proxy.
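A minimal sketch of the offline pre-clustering step, assuming SciPy and a precomputed function-word frequency matrix (the snippet-augmented feature extraction is not shown); the fixed cluster-count cut below is illustrative, whereas the paper’s method discovers genre groupings without a predetermined genre set:

```python
# Sketch of offline genre pre-clustering, assuming SciPy. Rows of `freq`
# are websites; columns are function-word frequencies drawn from each
# site's text augmented with its search-engine snippets.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

freq = np.random.rand(20, 50)  # placeholder for the real feature matrix

# Agglomerative clustering under the Manhattan (cityblock) distance.
Z = linkage(freq, method="average", metric="cityblock")

# Cut the dendrogram into clusters; a fixed count is used here only for
# illustration -- the paper's method does not require one.
genres = fcluster(Z, t=5, criterion="maxclust")
print(genres)
```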
Archive | 2004
Dan Phung; Giuseppe Valetto; Gail E. Kaiser; Suhit Gupta
The increasing popularity of distance learning and online courses has highlighted the lack of collaborative tools for student groups. In addition, the introduction of lecture videos into the online curriculum has drawn attention to the disparity in the network resources available to students. We present an architecture and adaptation model called AITV (Adaptive Internet Interactive Team Video), a system that allows geographically dispersed participants, some or all of whom may be disadvantaged in network resources, to collaboratively view a video in synchrony. AITV upholds the invariant that each participant views semantically equivalent content at all times. Video player actions, such as play, pause and stop, can be initiated by any participant, and the results of those actions are seen by all members. These features allow group members to review a lecture video in tandem, facilitating the learning process. We employ an autonomic (feedback-loop) controller that monitors clients’ video status and adjusts the quality of the video according to each client’s resources. We show in experimental trials that our system successfully synchronizes video for distributed clients while, at the same time, optimizing video quality under actual (fluctuating) bandwidth by adaptively adjusting the quality level for each participant.
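One controller iteration might look like the following sketch; the quality levels, thresholds, and client fields are assumptions for illustration, not AITV’s actual policy:

```python
# Hedged sketch of the feedback-loop idea: monitor each client's measured
# bandwidth and step its quality level so synchronized playback never
# stalls. Encodings and headroom factor are illustrative assumptions.
QUALITY_LEVELS = [56, 128, 256, 512]  # kbps encodings, lowest first

def adjust_quality(client):
    """One controller iteration for one client."""
    level = client.quality_level
    if client.measured_kbps < QUALITY_LEVELS[level]:
        # Step down before the client falls behind the group.
        client.quality_level = max(level - 1, 0)
    elif (level + 1 < len(QUALITY_LEVELS)
          and client.measured_kbps > 1.25 * QUALITY_LEVELS[level + 1]):
        # Step up only with comfortable headroom over the next encoding.
        client.quality_level = level + 1
```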
Archive | 2005
Suhit Gupta; Hila Becker; Gail E. Kaiser; Salvatore J. Stolfo
The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter (defined as cosmetic features such as animations, menus, sidebars, and obtrusive banners). Automatic content extraction has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. We have developed a framework, Crunch, which employs various heuristics for content extraction in the form of filters applied to the webpage’s DOM tree; the filters aim to prune or transform the clutter, leaving only the content. Crunch allows users to tune what we call “settings,” consisting of thresholds for applying a particular filter and/or toggles for turning a filter on or off, because the HTML components that characterize clutter can vary significantly from website to website. However, we have found that the same settings tend to work well across different websites of the same genre, e.g., news or shopping, since designers often employ similar page layouts. In particular, Crunch can obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings. We present our approach to clustering a large corpus of websites into genres, using their pre-extraction textual material augmented with the snippets generated by searching for each website’s domain name in web search engines. Including these snippets increases the frequency of the function words needed for clustering. We use existing Manhattan distance and hierarchical clustering techniques, with some modifications, to pre-classify the corpus into genres offline. Our method does not require prior knowledge of the set of genres that websites fit into, but to be useful, a priori settings must be available for some member of each cluster or a nearby cluster (otherwise defaults are used). Crunch classifies newly encountered websites online in linear time, and then applies the corresponding filter settings, with no noticeable delay added by our content-extracting web proxy.
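The online, linear-time classification step can be sketched as a nearest-centroid lookup under Manhattan distance; the centroid matrix and settings tables are assumed to come from the offline clustering, and all names below are illustrative:

```python
# Illustrative online step: assign a new site's function-word vector to
# the nearest cluster centroid (linear in the number of clusters) and
# reuse that cluster's filter settings, falling back to defaults.
import numpy as np

def settings_for(site_vector, centroids, settings_by_cluster, defaults):
    dists = np.abs(centroids - site_vector).sum(axis=1)  # Manhattan distance
    nearest = int(np.argmin(dists))
    return settings_by_cluster.get(nearest, defaults)
```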
International World Wide Web Conferences | 2003
Suhit Gupta; Gail E. Kaiser; David Neistadt; Peter Grimm