Intelligent Self-Repairable Web Wrappers
Emilio Ferrara and Robert Baumgartner

Dept. of Mathematics, University of Messina, V. Stagno d'Alcontres 31, 98166 Messina, Italy
[email protected]

Lixto Software GmbH, Favoritenstrasse 16/DG, 1040 Vienna, Austria
[email protected]
Abstract.
The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated over the last years. On the one hand, reliable solutions should provide robust algorithms of Web data mining which can automatically face possible malfunctioning or failures. On the other hand, the literature lacks solutions regarding the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctioning or the acquisition of corrupted data can be caused, for example, by structural modifications of the data sources introduced by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources – the so-called Web wrappers – which can face possible malfunctioning caused by modifications of the structure of the data source, and can automatically repair themselves.
Keywords:
Web data extraction, wrappers, automatic adaptation
Introduction

The current panorama of the distribution of information through the Web depicts a clear situation: there is an incredible amount of data delivered in the form of Web data sources, and a corresponding need for the capability of mining this information in a reliable and efficient way. Mining information from Web sources is a task which can be useful in several different areas of knowledge, and this topic interests both academia and enterprises. For example, consider the following scenarios: i) a research group which needs to acquire a dataset of information delivered through online services, say an online database publishing, day by day, information about the mapping of some genes; ii) a company for which it is essential, for marketing and product placement, to monitor the pricing trends of services offered by its competitors, provided through the Web. Both actors need to extract a possibly huge amount of data over an extended period of time (e.g., months), at regular intervals (say, each day). One important aspect in both cases is the reliability and the quality of the extracted data. It is utterly important that the acquired information is correct, because the research group cannot accept corrupted data, and the comparison with competitors would fail in case of bad product data.

These two examples highlight common requirements in the panorama of Web data mining, and depict different related problems. Although some techniques to design systems for the extraction of data from Web sources have been presented in the literature, there is a lack of work in the area of their maintenance. A large number of questions and problems related to the possibility of automating the maintenance process are still open. This work focuses on some aspects related to the maintenance of these systems. We first introduce the theoretical background required to create intelligent procedures of Web data extraction. Then, we explain how to face malfunctioning likely to happen during the extraction process, for example caused by modifications in the structure of the data source; this second point, in particular, is the main focus of this work.

Let us contextualize this problem: essentially, there exist two different approaches to extract information from Web sources. The first one relies on machine learning platforms [5]; a system analyzes a possibly huge amount of positive and negative examples during a training period, and then infers a set of rules that makes it able to perform its tasks in the same domain or Web site. Related approaches rely on logic-based algorithms which analyze the structure of the data source and induce procedures that extract the required information by exploiting structural characteristics of the Web source to identify and find the required data. The second approach utilizes the knowledge a human can bring in about a particular site or domain: the wrapper is generated in such a way that the human creates the rules and navigation paths together with the system, in a supervised and interactive fashion. Still, the system can assist the wrapper designer and offer possibilities that make the wrapper execution as robust as possible, even in case of structural changes. From now on, we assume that the platform we are going to describe and improve adopts the latter philosophy.
Organization of the paper
We describe related work in Section 2. In Section 3, the algorithmic background is introduced, describing an efficient tree matching technique. Section 4 covers the design of robust and adaptable procedures of Web data extraction, henceforth called intelligent self-repairable Web wrappers. Then, in Section 5 we describe the adaptation process during wrapper execution. We explain how these procedures can automatically, in an autonomous way, face malfunctioning, trying to adapt themselves to the modifications that possibly caused the problems. A prototype has been implemented on top of a state-of-the-art extraction platform, the Lixto Visual Developer. The performance of this system is shown in Section 6, by means of precision and recall scores. Section 7 concludes, summarizing our main achievements and depicting some future work.
Background and Related Work
We split the related literature into three main topics: i) Web data extraction systems; ii) maintenance and related problems; iii) tree matching algorithms.
Web data extraction systems
The work related to systems of Web information extraction is manifold but well depicted by several surveys. Laender et al. [13] provided the first rigorous taxonomical classification of Web data extraction systems. Kushmerick [11] classified several finite-state approaches to generate wrappers, such as wrapper induction, natural language processing approaches and hidden Markov models. Sarawagi [17] provided the most comprehensive survey on the information extraction panorama; this work covers different existing techniques, explaining several approaches. In the last years, first Baumgartner et al. [1] and later Ferrara et al. [8] provided two different surveys on the discipline of Web data extraction: the first is mainly addressed to practitioners, while the latter focuses on the application fields of this discipline.
Maintenance and related problems
Despite some interesting work, we can identify a general lack of solutions in the area of Web wrapper maintenance. Kushmerick [12,10] first introduced the concept of wrapper maintenance as the process of verifying the correct functioning of the data extraction procedures and of intervening, manually, automatically or in a semi-automatic way, in case of malfunctioning. Lerman and Minton [14], instead, faced both the problem of verifying the correctness of the data extracted by a wrapper and that of eventually trying to repair it; their approach mixes machine learning techniques. Another approach based on machine learning has been provided by Chidlovskii [4]; he described a system which can automatically classify Web pages in order to extract information from those pages which can be handled, adopting both conventional extraction rules and ensemble methods of machine learning, such as content feature analysis. Meng et al. [15] developed SG-WRAM (Schema-Guided WRApper Maintenance), slightly modifying the perspective of Web wrapper generation, observing that changes in Web pages, even substantial ones, always preserve syntactic features (i.e., syntactic characteristics of data items like data patterns, string lengths, etc.), hyperlinks and annotations (e.g., descriptive information representing the semantic meaning of a piece of information in its context). Finally, another heuristic approach has been presented by Raposo et al. [16]; they adopted a sample of positive labeled examples, collected during the normal execution of the wrappers, to be exploited in case of malfunctioning in order to re-induce the broken wrapper, ensuring a good accuracy of the process.
Tree Matching
In general, the process of comparing the structure of two trees is a well-known classic problem. The possibility of transforming a tree into another one, through a sequence of (possibly different) operations, is another well-known algorithmic challenge, namely the tree editing problem. The minimum number of elementary transformations, such as adding/removing nodes, relabeling nodes or moving nodes, represents the distance between two trees; this value can be used as a measure of the dissimilarity between two trees. The tree edit distance problem is a well-known NP-hard problem [3]. Several approximate solutions have been advanced over the years; the most appropriate algorithm for the problem of matching up similar trees has been suggested by Selkow [18]. This technique relies on the concept of finding isomorphic elements present in both compared trees, implementing a light-weight recursive top-down resolution during which the algorithm evaluates the position of nodes to measure the degree of isomorphism between them, analyzing and comparing their sub-trees. Different versions of this algorithm exist, each presenting some optimizations. Ferrara and Baumgartner [6,7], as well as Yang [19], adopt weights, obtaining a variant of this algorithm with the capability of discovering clusters of similar sub-trees. An interesting evaluation of the simple tree matching and of its weighted version, presented by Kim et al. [9], has been performed by exploiting these two algorithms to extract information from HTML Web pages. These optimized algorithms underlie the design of our self-repairable Web wrappers.
Algorithmic Background

This work relies on some assumptions: i) Web pages are represented by using DOM trees, as the HTML standard imposes; ii) it is possible to identify elements within a DOM tree by using the XPath language; iii) the logics of XPath underlie the functioning of Web wrappers (this is further explained in the following sections and in [1,2]). Given these milestones, the main idea of our approach is to compare two trees, one representing the original Web page and the other representing the page after some modifications have occurred. This is practical in order to automate the adaptive process of automatic wrapper repair. To do so, we utilize a variant of the seminal Simple Tree Matching (STM) [18], optimized by Ferrara and Baumgartner [6,7]. Let d(n) be the degree of a node n (i.e., the number of its first-level children); let T(i) be the i-th sub-tree of the tree rooted at node T; let t(n) be the number of total siblings of a node n, including itself. The Weighted Tree Matching described here (see Algorithm 1) optimizes the simple tree matching for our specific domain.
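Before stating Algorithm 1, it helps to see the unweighted Simple Tree Matching concretely. Below is a minimal Python sketch; the Node representation (a label plus an ordered list of children) is our own illustrative assumption, not part of the original formulation:

    class Node:
        """Minimal tree node: a label and an ordered list of children."""
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def simple_tree_matching(t1, t2):
        # Selkow-style top-down matching: size of the largest top-down
        # common substructure of t1 and t2 (0 if the root labels differ).
        if t1.label != t2.label:
            return 0
        m, n = len(t1.children), len(t2.children)
        # M[i][j]: best match between the first i children of t1
        # and the first j children of t2 (dynamic programming).
        M = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                M[i][j] = max(M[i][j - 1], M[i - 1][j],
                              M[i - 1][j - 1] +
                              simple_tree_matching(t1.children[i - 1],
                                                   t2.children[j - 1]))
        return M[m][n] + 1  # +1 accounts for the matched roots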
Algorithm 1 WeightedTreeMatching(T′, T′′)
1:  if T′ has the same label as T′′ then
2:      m ← d(T′)
3:      n ← d(T′′)
4:      for i = 0 to m do
5:          M[i][0] ← 0
6:      for j = 0 to n do
7:          M[0][j] ← 0
8:      for all i such that 1 ≤ i ≤ m do
9:          for all j such that 1 ≤ j ≤ n do
10:             M[i][j] ← Max(M[i][j−1], M[i−1][j], M[i−1][j−1] + W[i][j]),
                where W[i][j] = WeightedTreeMatching(T′(i−1), T′′(j−1))
11:     if m > 0 and n > 0 then
12:         return M[m][n] · 1/Max(t(T′), t(T′′))
13:     else
14:         return M[m][n] + 1/Max(t(T′), t(T′′))
15: else
16:     return 0
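A direct Python transcription of Algorithm 1 might look as follows, reusing the Node sketch above. Since t(n) depends on a node's context, the sibling counts of the two roots are passed in as parameters (1 for the true roots); this is a sketch under those assumptions, not the Lixto implementation:

    def weighted_tree_matching(t1, t2, t1_siblings=1, t2_siblings=1):
        # t1_siblings/t2_siblings play the role of t(n): the number of
        # siblings of each root, including itself.
        if t1.label != t2.label:
            return 0.0
        m, n = len(t1.children), len(t2.children)
        M = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                # W[i][j]: weighted match of the i-th and j-th sub-trees;
                # each child of t1 has m siblings (incl. itself), of t2 has n.
                w = weighted_tree_matching(t1.children[i - 1],
                                           t2.children[j - 1], m, n)
                M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
        if m > 0 and n > 0:
            return M[m][n] * 1.0 / max(t1_siblings, t2_siblings)
        return M[m][n] + 1.0 / max(t1_siblings, t2_siblings)

The division by Max(t(T′), t(T′′)) is what makes the matching "weighted": a matched node contributes proportionally less when it competes with many siblings, which is what lets the algorithm single out clusters of similar sub-trees.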
Robust Web Wrappers

In supervised and interactive wrapper generation, the application designer is in charge of deciding how to characterize the Web objects that are used for traversing the Web and for extracting information. One of the most important aspects of a wrapper is its resilience against changes (both changes over time and variations across similarly structured pages), and part of the robustness of a data extractor depends on how the application designer configures it. It is therefore crucial that the wrapper generation system assists the wrapper designer and suggests how to make the identification of Web objects and trails through Web sites as stable as possible.
In Lixto Visual Developer (VD) [2], a number of mechanisms are offered to create a resilient wrapper. During recording, one task is to generate a robust XPath or regular expression, interactively and supported by the system. During wrapper generation, in many cases only one labeled example object is available, especially in automatically recorded deep Web navigation sequences. In such cases, efficient heuristics in XPath generation and fallback strategies during replay are required. Typical heuristics during recording for reliably identifying such single Web objects include:
– Generalization of a chosen XPath by using form properties, element properties, textual properties and formatting properties. During replay, these ingredients are used as input for an algorithm that checks in which constellation to best apply this property information to satisfy the integrity constraints imposed on a rule (e.g., as a result, a single instance is required).
– DOM structural generalization: starting from the full path, several generalized paths are created, using only characteristic elements and characteristic element sequences. A number of stable anchor points are identified and stored, from which relative paths to this object are created. Typical stable anchor points are identified automatically and include, e.g., the outermost table structure and the main content area (chosen upon factors such as the longest content).
– Positional information is considered if the structurally generalized paths identify more than one element. In this case, during execution, variations of the XPath generated with this "index heuristics" are applied on the active Web page, removing indexes until the integrity constraints of the current rule are satisfied.
– Attributes and properties of elements are taken into account, in particular of the element of choice; ancestor attributes are also considered if the element attributes are not sufficient.
– Attributes that make an element unique are preferred, i.e., similar elements are checked for distinguishing criteria.
– Attribute values are considered if attribute names are not sufficient. Attribute value fragments are considered if attribute values are not sufficient (using regular expressions).
– ID attributes are used as far as possible. If an ID is unique and meaningful for characterizing an element, it is considered in the fallback strategies with a high weight.
– Textual information and label information are used only if explicitly turned on (since this might fail in case of a language switch).

The output of the heuristic step is a "best XPath" shown to the wrapper designer, and a set of XPath expressions and priorities regarding when to use which fallback strategy, stored in the configuration. Figure 1 illustrates which information is stored by the system during recording. In this case, a drop down was selected by the application designer, and the system decided that the "id" attribute is the most reliable one, so it is chosen as best XPath. If this evaluation fails, the system applies heuristics based on the (in this example, three) stored fallback XPaths, which mainly exploit form and index properties. In case one of the heuristics generates results that do not invalidate the defined integrity constraints, these Web objects are considered as the result.

During generation of rules (e.g., "extract") and actions (e.g., "click"), the wrapper designer imposes constraints on the results to be obtained, such as:
– Cardinality constraints: restrictions on the number of results, e.g., exactly one element or at least one element must be matched.
– Data type constraints: restrictions on the data type of a result, e.g., a result must be of type integer or match a particular regular expression.

Constraints can be defined individually per rule and action, or defined globally by using a schema on the output data model; a sketch of such a fallback evaluation is given below.

Fig. 1. Robust Web Object Detection in Visual Developer.
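As an illustration of how fallback XPaths and cardinality constraints can interact, here is a minimal sketch using lxml; the expressions and the expected count are hypothetical examples, and the real Visual Developer logic is considerably richer:

    from lxml import html

    def first_satisfying_xpath(page_source, candidates, expected=1):
        # Try a prioritized list of (generalized) XPath expressions and
        # return the matches of the first one whose result cardinality
        # satisfies the integrity constraint.
        tree = html.fromstring(page_source)
        for xpath in candidates:
            matches = tree.xpath(xpath)
            if len(matches) == expected:  # cardinality constraint holds
                return xpath, matches
        return None, []  # every heuristic failed: trigger adaptation

    # Hypothetical "best XPath" plus fallback expressions, by priority.
    candidates = [
        "//select[@id='destination']",        # best XPath: unique id
        "//form[@name='search']//select[1]",  # fallback: form + index
        "//select",                           # most generic fallback
    ]
    xpath, elements = first_satisfying_xpath("<html><body></body></html>",
                                             candidates)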
Wrapper Adaptation

The procedures described in the previous section do not adapt the wrapper, but address situations in which the initially chosen XPath no longer matches, simply trying different ones based on it. In the configuration of wrapper adaptation, we go one step further: on the one hand, we exploit tree and string similarity techniques to find the most similar Web object(s) on the new page; on the other hand, in case the adaptation is triggered, the wrapper is changed on the fly using the new configuration created by the adaptation algorithms. As before, integrity constraints can be imposed on extraction and navigation rules. Moreover, the application designer can choose whether to use wrapper adaptation on a particular rule in case the constraints are violated during runtime. When adaptation is chosen, alternatively to using XPath-based means to identify Web objects, we store the actual result subtree. In the case of HTML leaf elements, which are usually the elements under consideration for navigation actions, we instead store the tree rooted at the n-th ancestor of the element, together with the position of the result element within this tree. In this way, tree matching can also be exploited for HTML leaf elements.

Wrapper designers can choose between various similarity measures: this includes in particular the Simple Tree Matching algorithm [18] and the Weighted Tree Matching algorithm described in Section 3. In the future, further algorithms will extend the capabilities of the tool, e.g., a bigram-based tree matching that is capable of dealing with node permutations in a more favorable fashion. In addition to the similarity function, one can choose certain parameters, e.g., whether to use the HTML element name as node label or instead to use attributes such as class and id. Figure 2 illustrates the configuration of wrapper adaptation in Visual Developer.

Fig. 2. Configuration of Wrapper Adaptation in Lixto VD.
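The stored-tree idea can be made concrete with a small sketch. The AdaptationConfig record and the store_leaf_example helper are hypothetical names of our own (the actual VD configuration stores richer metadata), and Node is the class sketched earlier:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AdaptationConfig:
        # Per-rule data stored when adaptation is enabled: an example
        # sub-tree (instead of, or besides, an XPath), plus parameters.
        stored_tree: Node                # example result sub-tree
        leaf_path: List[int] = field(default_factory=list)
        # child indexes locating a leaf element inside stored_tree
        algorithm: str = "weighted"      # 'simple' or 'weighted' matching
        label_attributes: bool = True    # use class/id attrs as node labels

    def store_leaf_example(leaf, parents, n=3):
        # For an HTML leaf element, store the tree rooted at its n-th
        # ancestor and remember where the leaf sits inside that tree.
        path, node = [], leaf
        for _ in range(n):
            parent = parents.get(node)   # child -> parent map
            if parent is None:
                break
            path.insert(0, parent.children.index(node))
            node = parent
        return AdaptationConfig(stored_tree=node, leaf_path=path)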
Figure 3 describes the adaptation process, which is triggered upon violation of the defined constraints. In case an element in the initial wrapper is detected with an XPath, the adaptation procedure substitutes this by storing the subtree of a matched element. In case the wrapper definition already stores the example tree, and the similarity computation returns results that violate the defined constraints, the threshold is lowered or raised until a perfect match is generated.

During runtime, the stored tree is compared to the elements on the new page, and the best fitting element(s) are considered as extraction results. During configuration, wrapper designers can choose an algorithm (such as the Weighted Tree Matching) and a similarity threshold. The similarity threshold can be constant, or defined to lie within an interval of acceptable thresholds. During execution, various thresholds within the allowed range are considered, and the one generating the best fit with respect to the defined constraints is chosen; a sketch of this search is given below.

As a next step, the stored tree is refined and generalized so that it maximizes the matching value for both the original subtree and the new trees, reflecting the changes of a Web page over time. This generalization process generates a simple tree grammar, a "tree template" that is allowed to use occurrence indicators (one or more elements, at least one element, etc.) and optional depth levels. In further runs, the tree template is compared against the sub-trees of an active Web page during execution. First, the algorithm checks which trees on the new page satisfy the tree template. In case the results are within the defined integrity constraints, no further action is taken. In case the results are not satisfying, the system searches for the most similar trees based on the defined distance metrics; in this case, the wrapper is auto-adapted, the tree template is further refined, and the threshold or threshold interval is automatically re-adjusted. At the very end of the process, the corrected wrapper is stored in the wrapper repository and committed to a versioning system to keep track of all changes.

Fig. 3. Wrapper Adaptation Process.
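A minimal sketch of the threshold-interval search follows, reusing the weighted_tree_matching sketch from above; candidates would be the sub-trees of the active page, and the integrity check is reduced to a simple expected cardinality (an assumption made for brevity):

    def adapt_rule(stored_tree, candidates, lo=0.40, hi=0.95, step=0.05,
                   expected=1):
        # Scan thresholds within [lo, hi], strictest first; return the
        # first threshold whose matching candidates satisfy the expected
        # result cardinality, together with those matches.
        scored = [(weighted_tree_matching(stored_tree, c), c)
                  for c in candidates]
        threshold = hi
        while threshold >= lo:
            matches = [c for score, c in scored if score >= threshold]
            if len(matches) == expected:  # integrity constraint satisfied
                return threshold, matches
            threshold = round(threshold - step, 2)
        return None, []  # no acceptable threshold: adaptation failed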
In practice, single adaptation steps of rules and actions are embedded into the whole execution process of a wrapper, and the adapted wrapper is stored in the repository after all adaptation steps have been concluded. The need for adapting a particular rule influences the further execution steps.

Usually, wrapper generation in VD is a hierarchical top-down process – e.g., first a "hotel record" is characterized, and inside the hotel record, entities such as "rating" and "room types". To define a rule to match such entities, the wrapper designer visually selects an example and, together with system suggestions, generalizes the rule configuration until the desired instances are matched. To support the automatic adaptation process during runtime, as described above, the wrapper designer further specifies what it means that extraction failed. In general, this means wrong or missing data, and with integrity constraints one can give indications of how correct results look. The upper half of Figure 4 summarizes the wrapper generation.

Fig. 4. Diagram of the Web wrapper creation, execution and maintenance flow.
During wrapper creation, the application designer provides a number of configuration settings to this process. These include:
– Threshold values.
– Priorities/order of the adaptation algorithms used.
– Flags of the chosen algorithm (e.g., using the HTML element name as node label, using id/class attributes as node labels, etc.).
– Triggers for bottom-up, top-down and process flow adaptation bubbling.
– Whether stored tree-grams and XPath statements are updated based on adaptation results, to be additionally used as inputs in future adaptation procedures (reflecting and addressing regular slight changes of a Web page over time).

Triggers in the adaptation settings can be used to force adaptation of further fragments of the wrapper, as depicted in the lower half of Figure 4 and sketched in the code below:
– Top-down: forcing adaptation of all/some descendant rules (e.g., adapt the "price" rule as well, to identify prices within a record, if the "record" rule was adapted).
– Bottom-up: forcing adaptation of a parent rule in case adaptation of a particular rule was not successful. Experimental evaluation pointed out that in such cases the problem is often that the parent rule already provides wrong or missing results (even if matched by the integrity constraints) and has to be adapted first.
– Process flow: it might happen that particular rule matches can no longer be detected because the wrapper evaluates on the wrong page. Hence, there is the need to use variations in the deep Web navigation actions. In particular, a simple approach explored at this time is to use a switch window or back step action to check whether the previous window or another tab/popup provides the required information.
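As a rough illustration of the top-down and bottom-up trigger mechanics, the following sketch reuses adapt_rule from above; the Rule record is a hypothetical, heavily simplified stand-in for the actual VD rule model:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Rule:
        stored_tree: Node                 # example sub-tree for this rule
        parent: Optional["Rule"] = None
        children: List["Rule"] = field(default_factory=list)
        top_down: bool = False            # trigger: adapt descendants too
        threshold: float = 0.0

    def adapt_with_bubbling(rule, page_subtrees):
        threshold, matches = adapt_rule(rule.stored_tree, page_subtrees)
        if not matches and rule.parent is not None:
            # Bottom-up trigger: the parent often already delivers wrong
            # results, so adapt it first, then retry this rule.
            if adapt_with_bubbling(rule.parent, page_subtrees):
                threshold, matches = adapt_rule(rule.stored_tree,
                                                page_subtrees)
        if not matches:
            return False
        rule.threshold = threshold        # commit the adapted rule
        if rule.top_down:
            # Top-down trigger: e.g., re-adapt "price" inside a "record".
            for child in rule.children:
                adapt_with_bubbling(child, page_subtrees)
        return True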
Experimental Evaluation

Table 1. Experimental performance evaluation in real world scenarios.

                         Simple Tree Matching   Weighted Tree Matching
    Scenario     thresh.   tp     fp     fn       tp     fp     fn
    Delicious      40%    100      4      -      100      -      -
    Ebay           85%    200     12      -      196      -      4
    Facebook       65%    240     72      -      240     12      -
    Google news    90%    604      -     52      644      -     12
    Google.com     80%    100      -     60      136      -     24
    Kelkoo         40%     60      4      -       58      -      2
    Techcrunch     85%     52      -     28       80      -      -
    Total           -    1356     92    140     1454     12     42
    Recall          -        90.64%                 97.19%
    Precision       -        93.65%                 99.18%
    F-Measure       -        92.13%                 98.18%
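As a sanity check, the aggregate scores in Table 1 follow directly from the tp/fp/fn totals; a few lines of Python reproduce them:

    def prf(tp, fp, fn):
        # Precision, recall and F-measure from raw counts.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    print(prf(1356, 92, 140))  # STM: ~(0.9365, 0.9064, 0.9212)
    print(prf(1454, 12, 42))   # WTM: ~(0.9918, 0.9719, 0.9818)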
For our initial performance evaluation we tested the robustness of our wrappers against real-world use cases. Current areas of interest for Web data extraction problems include social networks, the retail market and Web communities. We defined a total of 7 scenarios and designed 10 adaptive wrappers for each. Results, by means of precision, recall and F-measure, are shown in Table 1. Column thresh. reports the fixed threshold value; tp, fp and fn summarize true positives, false positives and false negatives, respectively. The performance obtained by using simple and weighted tree matching is good; these algorithms are definitely viable solutions to our initial purpose and provide a high degree of reliability (F-Measure above 90% in both cases).

Conclusions

In the literature, several implementations of systems to extract data from Web sources have been presented, but there is a lack of solutions regarding their maintenance. This paper tries to address this problem, describing adaptive techniques that make Web data extraction systems, based on wrappers, self-maintainable, adopting algorithms optimized for this purpose. Enhanced Web wrappers thus become able to recognize structural modifications of Web sources and to adapt their functioning accordingly. The characteristics of our self-repairable solution have been discussed in detail, providing first experimental results to evaluate its robustness; more experimentation is planned for the near future.

Moreover, as future work, additional algorithms could be included in order to improve the capabilities of the adaptation feature; in particular, a viable idea could be to generalize a bigram-based tree matching algorithm capable of dealing with node permutations in a more efficient way with respect to the Simple Tree Matching based algorithms adopted to date. Similarly, the Jaro-Winkler distance could be adapted to our tree matching problem in order to better reflect missing or added node levels, thus improving the performance of our adaptation process. Finally, the tree grammar could be extended to classify different topologies of templates (those frequently adopted by Web pages), in order to define several standard protocols of automatic adaptation, to be adopted in specific contexts.