Parag Joshi | Researchain

Publication

Featured research published by Parag Joshi.

document engineering | 2009

Web document text and images extraction using DOM analysis and natural language processing

Parag Joshi; Sam Liu

The Web has emerged as the most important source of information in the world. This has resulted in a need for automated software components to analyze web pages and harvest useful information from them. However, in typical web pages the informative content is surrounded by a very high degree of noise in the form of advertisements, navigation bars, links to other content, etc. Often the noisy content is interspersed with the main content, leaving no clean boundaries between them. This noisy content makes the problem of information harvesting from web pages much harder. Therefore, it is essential to be able to identify the main content of a web page and automatically isolate it from noisy content for any further analysis. Most existing approaches rely on prior knowledge of website-specific templates and hand-crafted rules specific to websites for extraction of relevant content. We propose a generic approach that does not require prior knowledge of website templates. While HTML DOM analysis and visual layout analysis approaches have sometimes been used, we believe that for higher accuracy in content extraction, the analyzing software needs to mimic a human user and understand content in natural language, similar to the way humans intuitively do, in order to eliminate noisy content. In this paper, we describe a combination of HTML DOM analysis and Natural Language Processing (NLP) techniques for automated extraction of the main article with associated images from web pages.

IEEE Internet Computing | 2013

Casebook: A Cloud-Based System of Engagement for Case Management

Hamid Reza Motahari-Nezhad; Susan Spence; Claudio Bartolini; Sven Graupner; Charles Edgar Bess; Marianne Hickey; Parag Joshi; Roberto Mirizzi; Kivanc M. Ozonat; Maher Rahmouni

Casebook embraces social and collaboration technology, analytics, and intelligence to advance the state of the art in case management from systems of record to a system of engagement for knowledge workers. It addresses complex, inefficient work practices, information loss during handoffs between teams, and failure to learn from previous case experience. Intelligent agents help people adapt to changing work practices by tracking process evolution and providing updates and recommendations. Social collaboration surrounding cases integrates communication with information and supports collaborative roadmapping to enable people to work as they collaborate, thus accelerating how quickly and accurately they handle cases.

knowledge discovery and data mining | 2011

Article clipper: a system for web article extraction

Jian Fan; Ping Luo; Suk Hwan Lim; Sam Liu; Parag Joshi; Jerry Liu

Many people use the Web as the main source of information in their daily lives. However, most web pages contain non-informative components such as side bars, footers, headers, and advertisements, which are undesirable for certain applications like printing. We demonstrate a system that automatically extracts the informative content from news- and blog-like web pages. In contrast to many existing methods that are limited to identifying only the text or the bounding rectangular region, our system identifies not only the content but also the structural roles of various content components such as title, paragraphs, images and captions. The structural information enables re-layout of the content in a pleasing way. Besides the article text extraction, our system includes the following components: 1) print-link detection to identify the URL link for printing, and to use it for more reliable analysis and recognition; 2) title detection incorporating both visual cues and HTML tags; 3) image and caption detection utilizing extensive visual cues; 4) multiple-page and next-page URL detection. The performance of our system has been thoroughly evaluated using a human-labeled ground truth dataset consisting of 2000 web pages from 100 major web sites. We show accurate results using such a dataset.

Proceedings of SPIE | 2011

Title identification of web article pages using HTML and visual features

Jian Fan; Ping Luo; Parag Joshi

Extracting informative content from Web article pages has many applications such as printing and content reuse. The title is a very significant and unique component of an article. However, identifying the true title is not an easy problem even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based on only the font size.
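
The abstract names four cues (the HTML title field, the DOM node's tag, font size, and horizontal alignment) but does not say how they are combined. The Python sketch below is only an illustration of one possible combination, a weighted sum, and is not the authors' method. The `Candidate` class, the `TAG_SCORES` table, and all weights are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A DOM text node that might be the article title (hypothetical structure)."""
    text: str
    tag: str          # HTML tag of the DOM node, e.g. "h1"
    font_size: float  # rendered font size in pixels
    centered: bool    # horizontal alignment cue


# Assumed tag preferences; the abstract does not give any values.
TAG_SCORES = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def score(c: Candidate, page_title: str, max_font: float) -> float:
    """Weighted sum of the four cues; the weights are illustrative assumptions."""
    title_sim = SequenceMatcher(None, c.text.lower(), page_title.lower()).ratio()
    tag_score = TAG_SCORES.get(c.tag, 0.0)
    font_score = c.font_size / max_font if max_font else 0.0
    align_score = 1.0 if c.centered else 0.0
    return 0.4 * title_sim + 0.25 * tag_score + 0.25 * font_score + 0.1 * align_score


def pick_title(candidates: list[Candidate], page_title: str) -> Candidate:
    """Return the highest-scoring candidate."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, page_title, max_font))


def pick_by_font_only(candidates: list[Candidate]) -> Candidate:
    """Font-size-only baseline of the kind the abstract compares against."""
    return max(candidates, key=lambda c: c.font_size)


candidates = [
    Candidate("Daily News", "div", 40.0, True),           # site banner
    Candidate("Storm hits the coast", "h1", 28.0, False),  # real headline
]
page_title = "Storm hits the coast - Daily News"
best = pick_title(candidates, page_title)
baseline = pick_by_font_only(candidates)
```

In this toy page the banner has the largest font, so the font-only baseline selects it, while the combined score favors the headline because it also matches the HTML title field and sits in an `h1` node.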

international conference on data engineering | 2011

Analytics for similarity matching of IT cases with collaboratively-defined activity flows

Hamid R. Motahari Nezhad; Claudio Bartolini; Parag Joshi

Handling IT support cases efficiently is very important for operational excellence of IT organizations. Many IT service centers receive thousands of cases per day, some of which are similar to previously reported cases. To improve efficiency, it is important to build upon lessons learned from past cases in the resolution of new cases. Therefore, a desired functionality of case management tools is finding previous cases similar to an open one, in order to leverage information about previous cases to effectively find a resolution. A new generation of tools for IT case management, e.g., IT Support Conversation Manager, enables collaborative and adaptive process definition for IT case resolution. Leveraging collaborative and social networking technology makes the case information model increasingly richer and more structured compared to flat textual format case reports in traditional IT case management systems. We have developed an automated method for matching IT support cases that takes into account multiple information attributes, including the collaborative flow of activities during case handling. We evaluated the system, and the early evaluation results show that this method achieves higher accuracy than, and comparable efficiency to, text-based similarity approaches.
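
To make the idea of matching on more than text concrete, here is a minimal Python sketch that blends a textual similarity with a similarity over the sequence of activities. It is an assumption-laden illustration, not the method from the paper: the case dictionaries, the field names `summary` and `flow`, the Jaccard and sequence-matching measures, and the equal weights are all hypothetical.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two case summaries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def flow_similarity(f1: list[str], f2: list[str]) -> float:
    """Similarity of two activity sequences (order-sensitive)."""
    return SequenceMatcher(None, f1, f2).ratio()


def case_similarity(c1: dict, c2: dict, w_text: float = 0.5, w_flow: float = 0.5) -> float:
    """Weighted blend of text and activity-flow similarity (weights are assumed)."""
    return (w_text * jaccard(c1["summary"], c2["summary"])
            + w_flow * flow_similarity(c1["flow"], c2["flow"]))


def most_similar(open_case: dict, past_cases: list[dict]) -> dict:
    """Return the past case with the highest combined similarity."""
    return max(past_cases, key=lambda c: case_similarity(open_case, c))


open_case = {
    "id": "open",
    "summary": "email server down",
    "flow": ["check logs", "restart service", "verify"],
}
past_cases = [
    {"id": "A", "summary": "email server slow", "flow": ["replace disk", "reimage"]},
    {"id": "B", "summary": "mail outage reported",
     "flow": ["check logs", "restart service", "verify"]},
]
match = most_similar(open_case, past_cases)
text_only = max(past_cases, key=lambda c: jaccard(open_case["summary"], c["summary"]))
```

In this toy data a text-only comparison prefers case A, whose wording overlaps with the open case, while the combined score prefers case B, which was handled through the same sequence of activities.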

electronic imaging | 2006

Automated campaign system

Gary L. Vondran; Hui Chao; Dirk Beyer; Parag Joshi; Brian Atkins; Pere Obrador

Running a targeted campaign involves coordination and management across numerous organizations and complex process flows. Everything from market analytics on customer databases, acquiring content and images, composing the materials, meeting the sponsoring enterprise's brand standards, and driving through production and fulfillment, to evaluating results is currently performed by experienced, highly trained staff. Presented is a solution that not only brings together technologies that automate each process, but also automates the entire flow so that a novice user could easily run a successful campaign from their desktop. This paper presents the technologies, structure, and process flows used to bring this system together. Highlighted will be how the complexity of running a targeted campaign is hidden from the user through technologies, all while providing the benefits of a professionally managed campaign.

electronic imaging | 2006

WARP (workflow for automated and rapid production): a framework for end-to-end automated digital print workflows

Parag Joshi

The publishing industry is experiencing a major paradigm shift with the advent of digital publishing technologies. A large number of components in the publishing and print production workflow are transformed in this shift. However, the process as a whole requires a great deal of human intervention for decision making and for resolving exceptions during job execution. Furthermore, a majority of the best-of-breed applications for publishing and print production are intrinsically designed and developed to be driven by humans. Thus, the human-intensive nature of the current prepress process accounts for a very significant amount of the overhead costs in fulfillment of jobs on press. It is a challenge to automate the functionality of applications built with the model of human-driven execution. Another challenge is to orchestrate various components in the publishing and print production pipeline such that they work in a seamless manner to enable the system to perform automatic detection of potential failures and take corrective actions in a proactive manner. Thus, there is a great need for a coherent and unifying workflow architecture that streamlines the process and automates it as a whole in order to create an end-to-end digital automated print production workflow that does not involve any human intervention. This paper describes an architecture and building blocks that lay the foundation for a plurality of automated print production workflows.

document engineering | 2006

From video to photo albums: digital publishing workflow for automatic album creation

Parag Joshi; C. Brian Atkins; Tong Zhang

The revolution in consumer electronics for capturing video has been followed by an explosion of video content. However, meaningful consumption models of such rich media for nonprofessional users are still emerging. In contrast to those of video cameras, the consumption models for the output of still cameras have been long established and are considerably simpler. The output of a still camera is an image of sufficiently high quality and high resolution for a good quality production on paper. Due to ease of use, mobility, high quality, and simplicity, paper photographs are still incomparable in terms of overall human experience. On the other hand, video content by itself is not as easy to use. Consumption of video content requires computers and/or video display devices, and so it cannot be instantaneously displayed or shared. Rendition on paper is much more complex for video content compared to still camera images. In contrast with the simplicity of usage of still cameras, video camera output has to be edited on a computer, and key frames with good visual quality have to be manually extracted, digitally edited, and prepared for printing before usable, good quality photographs are obtained. Due to the complexity of the video content, users often prefer to take still pictures instead of recording video clips. In this paper we describe an approach to construct an end-to-end digital publishing workflow system that automatically composes visually appealing photo albums with high quality photographic images from video content input.

Archive | 2005