Webpage Segmentation for Extracting Images and Their Surrounding Contextual Information
F. Fauzi, H. J. Long, M. Belkhatir
School of IT, Monash University
Preprint accepted at ACM Multimedia.
ABSTRACT
Web images come with valuable contextual information. Although this information has long been mined for various uses, such as image annotation, clustering of images and inference of image semantic content, insufficient attention has been given to the issues involved in mining it. In this paper, we propose a webpage segmentation algorithm targeting the extraction of web images and their contextual information, based on their characteristics as they appear on webpages. We conducted a user study to obtain a human-labeled dataset for validating the effectiveness of our method, and experiments demonstrate that our method achieves better results than an existing segmentation algorithm.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – indexing methods
General Terms
Algorithms, Experimentation, Human Factors.
1. INTRODUCTION
As the World Wide Web becomes woven into everyday life, an abundance of images can be found on the Web. These Web images come with rich contextual information: the text associated with an image, used jointly with its filename, alt description and page title. This contextual information has varying definitions and has been perceived as a window of words [14], a paragraph [5, 15], a section [4, 9, 12] and even the entire page [6, 7]. There are two general methods for extracting image contextual information. The first and simplest is to use a fixed window size (min: 20 terms, max: the entire page), whereby a fixed number of words before and after the image are taken as the image's surrounding context. The second method performs webpage segmentation to extract the sections containing the images and their surrounding context [1, 4, 9, 12]. Webpage segmentation is the task of breaking a webpage into sections that appear coherent to a user browsing the webpage, as discussed further in the next section. Neither method is without problems. The first, although straightforward, tends to produce low accuracy, as text is easily associated with the wrong image, for instance when the image description appears only after the image; and when the entire page is taken, the surrounding context contains too much noisy information. As for the second method, we believe that webpage segmentation is the natural approach for extracting image surrounding context.
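The fixed-window approach described above can be sketched as follows. The `window` size, the whitespace tokenization and the `image_marker` placeholder are illustrative assumptions, not parameters taken from the cited work:

```python
import re

def window_context(page_text, image_marker, window=20):
    """Return up to `window` words before and after an image placeholder.

    `image_marker` is a hypothetical stand-in for the image's position in
    the serialized page text; a real system would locate the <img> tag.
    """
    tokens = re.findall(r"\S+", page_text)
    try:
        pos = tokens.index(image_marker)
    except ValueError:
        return []  # image not found on the page
    before = tokens[max(0, pos - window):pos]
    after = tokens[pos + 1:pos + 1 + window]
    return before + after
```

The sketch also makes the failure mode visible: when the description appears only after the image, half of the window is spent on unrelated text that precedes it.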
Nevertheless, there are problems that need addressing: i) the ambiguity in defining the boundary of each image's contextual information; ii) the heterogeneity of webpages, with different websites having different content layouts; iii) the parameters/modifications required to tune a general webpage segmentation algorithm to extract images and their surrounding context; and iv) the performance of the segmentation algorithm in terms of the time required to process a webpage and extract images with their surrounding context; a fast algorithm is required to cater to the large and growing number of images on the Web.
Our Contributions. To address these concerns, we propose a fast DOM-Tree-based segmentation algorithm that requires no tuning parameters and targets the extraction of images and their surrounding context, which we refer to as image segments, and we test it against a human-labeled dataset obtained via a user study. Our method can extract image segments from a diverse range of websites, making it practical and scalable. Experimental results indicate that our method outperforms an existing state-of-the-art segmentation algorithm, VIPS [2], in precision and recall.
2. RELATED WORK
Efforts to segment webpages for extracting surrounding context fall into two categories: i) DOM-Tree-based, and ii) DOM-Tree-based with additional visual information obtained from rendering the DOM Tree. Typically, the webpage DOM tree structure is analyzed to discover segment-specific patterns. [5, 15] extract the paragraph of text containing the image. Hua et al. [9] rely on the border properties of structural HTML markup elements. Feng et al. [4] consider these structural tags as separators, with a cutoff at a text description length greater than 32 words before and after an image. While efficient, the above heuristics work only on limited webpages, and [4] falls back on a fixed window size. Hence, better heuristics are needed to improve scalability to various types of webpages. Cai et al.'s Vision-based Page Segmentation (VIPS) algorithm [2] is a general webpage segmentation algorithm that uses visual information obtained from rendering the webpage, in addition to the DOM tree structure. [1, 13] implement VIPS for the extraction of image surrounding context by reducing webpages to image blocks and taking all text within a block as the surrounding context. The major problem in VIPS is the choice of the Permitted Degree of Coherence (PDoC), which ranges from 1 to 10 and controls the granularity of the segmentation to cater for different applications. In [8], the PDoC is empirically set to 5; while this may work for some pages, it generally takes more contextual information than required by considering a larger section encompassing an image, and increasing the PDoC causes the opposite effect. Li et al. [12] also include visual cues of size and position in their page segmentation algorithm. Even though visual cues might improve accuracy, these algorithms are known to be computationally expensive, which becomes crucial when processing the large-scale Web. Other webpage segmentation algorithms developed for information retrieval applications are also relevant. Kao et al. [10] separate blocks of DOM sub-trees by comparing the entropies of the terms within the blocks. Chakrabarti et al.'s meta-heuristic graph-theoretic approach [3] casts the DOM tree as a weighted graph, where edge weights indicate whether two DOM nodes should be placed in the same or different segments, and Kohlschutter et al.
[11] applied quantitative linguistics and computer vision strategies to the segmentation problem. These segmentation algorithms would require further modifications to suit our purpose.
3. FORMULATION
3.1 Characteristics of Web Images
Our observations of Web images embedded within webpages sampled from business, shopping, governmental, education, news and informational sites reveal three classes of Web images, irrespective of webpage category: unlisted, listed and semi-listed images. A webpage is parsed by a browser to obtain its Document Object Model (DOM) Tree structure. The DOM Tree is examined to discover a distinct DOM Tree pattern for each class of Web image. Unlisted images are standalone or random images that appear anywhere on a page (c.f. Fig 1a: Segment 9), for example profile photos on personal homepages, company logos, advertisements, etc. The corresponding DOM Tree for such an image and its surrounding context is consistently an image node with its surrounding text as text-node siblings, with a root HTML tag representing the boundary of this image segment (c.f. Fig 1b). Listed images are two or more images that are systematically ordered within the webpage (c.f. Fig 1a: Segments 1-8). Examples of listed images are representative images, lists of product images, news images, etc. The associated DOM Tree for such an image segment is characteristically the image node with its surrounding text nodes forming a sub-tree under a root HTML tag defining the segment boundary; other siblings under this root HTML tag share a similar sub-tree structure (c.f. Fig 1d). Semi-listed images are visually similar to listed images; the difference is characterized by their DOM tree.
Their DOM tree resembles that of an unlisted image, in the sense that the image node with its surrounding text nodes sits under a root HTML tag representing the segment boundary; but alongside those nodes there are other image nodes, each with its own surrounding text nodes, at the same level (c.f. Fig 1c). Commonly, for all image classes, the surrounding context consists of text in close proximity to the image within the webpage; in the webpage's DOM Tree structure, the corresponding text nodes are neighbors of the image node, and all image nodes and text nodes are leaf nodes in the DOM Tree.
3.2 Algorithm
We propose a novel DOM-Tree-based segmentation algorithm to extract image segments from webpages, using the image characteristics described above to determine the heuristics for segmentation. For every image node found in the DOM Tree, the algorithm finds the image segment using a heuristic determined by the image class. This is accomplished by detecting the variation in the total number of text nodes at each upward level of the DOM Tree, beginning from the image node. We use Segment 1 from Fig. 1a to explain this: from the image node, the algorithm traverses up the DOM Tree, and stops at *
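The upward traversal can be sketched as follows, using a minimal stand-in for DOM nodes rather than a real DOM API. Since the exact stopping criterion is given later in the paper, the rule used here (stop when the text-node count jumps sharply, taking the previous ancestor as the segment root) and the `jump_factor` threshold are assumptions for illustration only:

```python
class Node:
    """Minimal stand-in for a DOM node (illustrative, not a real DOM API)."""
    def __init__(self, tag, children=None, text=None):
        self.tag = tag
        self.text = text              # non-None only for text leaves
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def text_count(node):
    """Total number of text leaf nodes in the subtree rooted at `node`."""
    if node.text is not None:
        return 1
    return sum(text_count(c) for c in node.children)

def image_segment_root(img_node, jump_factor=3):
    """Walk upward from an image node, tracking the text-node count at
    each level; stop when the count grows sharply, which suggests the
    ancestor now spans neighboring segments as well."""
    current = img_node
    count = text_count(current)       # 0 at the image node itself
    while current.parent is not None:
        parent_count = text_count(current.parent)
        if count > 0 and parent_count > jump_factor * count:
            break                     # parent crosses the segment boundary
        current, count = current.parent, parent_count
    return current
```

For a listed-image layout (several sibling sub-trees with similar structure, as in Fig 1d), the count stabilizes within one segment and jumps once the ancestor covers the sibling segments, so the traversal stops at the segment's root tag.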