Harvest -- An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums
Albert Weichselbraun*, Adrian M. P. Brasoveanu†, Roger Waldvogel*, Fabian Odoni*

* University of Applied Sciences of the Grisons, Pulvermühlestrasse 57, 7000 Chur, Switzerland
Email: {firstname.lastname}@fhgr.ch
† MODUL Technology GmbH, Am Kahlenberg 1, 1090 Vienna, Austria
Email: [email protected]
Abstract—Automatic extraction of forum posts and metadata is a crucial but challenging task since forums do not expose their content in a standardized structure. Content extraction methods, therefore, often need customizations such as adaptations to page templates and improvements of their extraction code before they can be deployed to new forums. Most of the current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content such as the identification of author metadata and information on the thread structure. This paper, therefore, presents a method that determines the XPath of forum posts, eliminating incorrect mergers and splits of the extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL and structure are extracted. We also introduce Harvest, a new open source toolkit that implements the presented methods, and create a gold standard extracted from 52 different web forums for evaluating our approach. A comprehensive evaluation reveals that Harvest clearly outperforms competing systems.
Index Terms—Information Extraction, Forum Extraction, Natural Language Processing
I. INTRODUCTION

The main contributions of this paper are:
• the introduction of a method that extracts forum posts and metadata on the post's date, author, sequence and URL from web forums;
• the development of the WEB-FORUM-52 gold standard that contains pages from 52 different web forums used for evaluating the proposed approach; and
• extensive experiments that draw upon multiple evaluation settings to evaluate our method against a baseline and three state-of-the-art content extraction frameworks.

The article is organized as follows: Section II provides an overview of related work; Section III presents the algorithms we developed for extracting post, date and user information from forum pages; Section IV describes the created WEB-FORUM-52 gold standard and discusses the results of our algorithms. The last section concludes the paper with a set of observations and details on future work.

II. RELATED WORK
Applying machine learning (ML) and deep learning (DL) algorithms to the task of extracting forum content is usually modeled as a multi-step process. At its core, we can identify several large tasks: (i) extraction of the page source code (if needed); (ii) correct identification of all page regions that might contain important data such as navigation, forum posts, user and date information; (iii) extraction of additional information relevant to each of the identified blocks (e.g., description, author metadata, replies to, etc.).

An early method for extracting forum posts was proposed by Bing Liu in [2] and [3]. The basic idea was to use an unsupervised learning algorithm called Mining Data Region (MDR) which identified data-rich regions from a web page. The method can be applied to different page types, from regular blogs to forums. The basic MDR algorithm is still limited by several constraints: (i) it does not work for websites with a flexible structure; (ii) training needs to be performed for each website separately; and (iii) there is no reliable method to separate the various types of data regions from a page (e.g., if a web forum contains text advertising, text, comments and annotated comments, all of these will simply be marked as data regions when using MDR). There are multiple variants of MDR, including more generalized tree extraction methods like PyDepta [3], trinary trees [4] and template matching [5]. Another induction method used to extract user data and a minimal set of metadata is presented in [6], showing good results.

github.com/fhgr/harvest
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
However, this method still requires up to twenty-five manual extraction rules. AutoRM [7] adds a candidate step (Candidate Records) and a filtering step in order to make sure only the interesting candidates are selected. CMDR [8] uses a deep learning node classifier in order to remove the need to train separately for each website. Another article [9] goes one step further: it uses a convolutional network trained to identify data regions based on their visual properties and feeds its output into an MDR-like algorithm.

Most of the current ML approaches rely on additional features (e.g., the tag ratios within the HTML document) [10] and may even require further components such as boilerplate detection [11] to provide meaningful results. Dragnet [12] combines several feature sets into a single library which has demonstrated good results on several early forum extraction datasets. Another recent approach has focused on navigating the hierarchy of objects extracted from web pages (e.g., hyperlink blocks extracted from DOM trees) [13].

Sumit Bhatia uses inference networks to extract forum threads [14] and includes a dataset of forum pages collected from Ubuntu and TripAdvisor pages. Some of Bhatia's later articles are focused on forum classification [15], forum summarization [16], subjectivity detection in forums [17] and detection of factual or discursive threads [18], showcasing the different processing options available for the information extracted from forum data.

In the medical domain, extracting forum data has also been approached as a domain-specific problem. This is due to the fact that forums are often one of the few available sources that provide user-generated content on rare diseases or drug side effects. Medical forum thread retrieval includes an additional step of filtering irrelevant information [19].
Some of the recent medical tools have also showcased good results (e.g., the Vigi4Med scraper [20]), but as far as we have noticed, their results and methodology have not been replicated in other domains. PREDOSE [21] uses semantic, syntactic and contextual features to enable prescription drug abuse data search through medical forums and social media; however, its main focus is on entity and relation extraction, not necessarily full content extraction. Baskaran [22] presents an automated method of extracting medical information from forums that is based on semantic analysis, but is unfortunately difficult to adapt to other domains.

III. METHOD
Web forums are complex websites that contain a considerable amount of content and metadata such as information on the post's author, date and thread structure. From a technical perspective, extracting these discussions is challenging since they leverage different blog and forum engines which usually do not have a clear, machine-readable structure and differ significantly in terms of navigation, thread structure and page style, rendering most automatic content extraction attempts ineffective. Forum posts might also contain different types of styling, images and emojis. In addition, post metadata (e.g., user data, date, language) can provide context for subsequent tasks such as forum classification when the post is processed by an automated Natural Language Processing pipeline.

The discussions in this section elaborate on the following sub-tasks relevant to the presented approach: (i) the identification of the forum posts' XPath (Section III-A), which acts as an anchor for extracting the post content and metadata; and (ii) the subsequent tasks of extracting the post's content (Section III-B), date (Section III-C), URL (Section III-D), and author (Section III-E).
A. Identification of forum posts
We identify posts by combining a textual representation of the web page with information on the document's DOM tree. Our content extraction strategy draws upon the following two observations:
1) usually most of the textual content present in web forums is located in the posts;
2) the XPath to the forum posts yields multiple sibling nodes, i.e. one sibling per post.

We, therefore, first obtain a textual representation of the page's content (pageContent) that has been derived from the HTML to text engine inscriptis (gitlab.com/weblyzard/inscriptis). Algorithm 1 illustrates how the textual content is then split into lines for which the corresponding XPath is obtained (getContentXPath). Afterwards, the algorithm computes a score (xpathScore; see Algorithm 2) that indicates the textual coverage of the given XPath. Line 5 of Algorithm 1 introduces the constraint that the extracted candidate path needs to yield at least MIN_POST_COUNT siblings, therefore ensuring that the extracted structure appears (similar to a forum post) multiple times on the analyzed forum page. Selecting the XPath with the highest xpathScore yields the XPath for the forum post.

Algorithm 1: Computation of the forum post XPath.
Data: pageContent, domTree
Result: The XPath to the forum posts
  candidatePaths = [];
  foreach line in pageContent.split('\n') do
    xpath ← getContentXPath(domTree, line);
    xpathScore ← getScore(pageContent, xpath, domTree);
    xpathElementCount ← countSiblings(xpath, domTree);
    if (xpathElementCount > MIN_POST_COUNT) then
      candidatePaths.append([xpath, xpathScore, xpathElementCount]);
    end
  end
  return getHighestScoringPath(candidatePaths);

Algorithm 2 outlines the computation of the similarity metric used for assessing an XPath's coverage of the total page content. The algorithm takes three inputs: (a) a text representation of the page's content (pageContent), (b) the XPath to evaluate and (c) the forum's DOM tree. It then computes the cosine similarity to determine the overlap between the forum coverage (i.e. the text present in the nodes that match the provided XPath) and the total page content. In addition, Algorithm 2 introduces constraints on the forum XPath by blacklisting certain HTML tags.
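The selection loop of Algorithm 1 can be sketched as follows. The DOM-dependent helpers (getContentXPath, countSiblings) are assumed to have already produced, for each candidate XPath, the text its nodes enclose and its sibling count; all function and variable names are illustrative, not Harvest's actual API.

```python
import math
from collections import Counter

MIN_POST_COUNT = 3  # assumed value of the paper's MIN_POST_COUNT constant

def cosine(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_post_xpath(page_content, xpath_text, xpath_siblings):
    """Select the candidate XPath whose nodes cover most of the page text.

    xpath_text:     candidate XPath -> concatenated text of its matching nodes
    xpath_siblings: candidate XPath -> number of sibling nodes it matches
    (both mappings would be derived from the DOM tree in the full algorithm)
    """
    candidates = [
        (cosine(page_content, node_text), xpath)
        for xpath, node_text in xpath_text.items()
        if xpath_siblings[xpath] >= MIN_POST_COUNT  # the pseudocode uses >
    ]
    return max(candidates)[1] if candidates else None
```

The XPath that encloses the posts dominates the page's text and repeats for every post, so it wins the coverage comparison against navigation or footer nodes.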
B. Post content extraction

Harvest obtains the post content by selecting an ordered list of DOM nodes that match the post XPath. Afterwards, the HTML within each of the selected DOM nodes is converted to the corresponding text, yielding the list of forum posts.
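This conversion step can be sketched with Python's standard-library HTML parser; Harvest itself delegates the HTML-to-text step to inscriptis, which preserves layout far more faithfully, so the snippet below is only a simplified stand-in.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal HTML-to-text conversion (a stand-in for inscriptis)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

def posts_to_text(post_html_fragments):
    """Convert each DOM node matched by the post XPath (given here as an
    HTML string) into its text, yielding one entry per forum post."""
    out = []
    for fragment in post_html_fragments:
        parser = TextExtractor()
        parser.feed(fragment)
        out.append(parser.text())
    return out
```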
C. Date extraction
Harvest's date extraction component analyses the DOM tree in the vicinity of the post XPath and identifies candidate dates by locating HTML time elements which contain the datetime attribute, and by using dateparser (github.com/scrapinghub/dateparser) to locate additional candidate elements. We collect all candidate elements that match the criteria outlined above and then remove dates that are older than 1993-04-30 (the date on which the Internet was opened to the public by CERN), as well as candidates where the number of extracted
Algorithm 2: Evaluation of a node's likelihood of being a post container.
Data: pageContent, xpath, domTree
Result: A score that indicates the node's likelihood of being the post container.
  /* ignore nodes with descendants in the tag blacklist */
  if (descendantsContainBlacklistedTags(domTree, xpath)) then
    return 0.0;
  end
  /* cosine similarity between the page's content and the text enclosed by the selected xpath */
  nodeText ← getNodeTreeText(domTree, xpath);
  vsmPageContent ← getVsm(pageContent);
  vsmNode ← getVsm(nodeText);
  sim ← (vsmPageContent · vsmNode) / (||vsmPageContent|| · ||vsmNode||);
  /* discount nodes with blacklisted ancestors */
  if (ancestorsContainBlacklistedTags(domTree, xpath)) then
    sim ← sim / 10;
  end
  return sim;

dates does not correspond to the number of forum posts. We have relaxed the latter constraint to allow for a difference of up to two posts less, since our experiments showed some rare cases where the leading posts used a different layout with another element storing the post's creation date. Finally, the candidate elements are scored and sorted according to the following criteria:
1) Post sequence: XPaths yielding chronologically ascending or descending dates are preferred.
2) Recentness: In most cases, the most recent date refers to the post's creation date. We, therefore, favor XPath candidates returning the most recent dates.

Afterwards, the highest scoring XPath is selected.
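The filtering and scoring steps described above might be sketched as follows. The per-XPath candidate date lists are assumed to have been extracted already (e.g., via dateparser); all names are illustrative rather than Harvest's actual API.

```python
from datetime import date

WWW_PUBLIC = date(1993, 4, 30)  # dates before the public web are implausible

def filter_candidates(candidates, n_posts):
    """Keep XPaths whose extracted dates are plausible and whose count
    matches the number of posts (allowing up to two fewer dates, since
    leading posts sometimes use a different layout).

    candidates: XPath -> list of dates found in the nodes it matches."""
    keep = {}
    for xpath, dates in candidates.items():
        if any(d < WWW_PUBLIC for d in dates):
            continue
        if not (n_posts - 2 <= len(dates) <= n_posts):
            continue
        keep[xpath] = dates
    return keep

def score(dates):
    """Prefer chronologically ordered sequences, then the most recent dates."""
    ordered = dates in (sorted(dates), sorted(dates, reverse=True))
    return (ordered, max(dates))

def best_date_xpath(candidates, n_posts):
    keep = filter_candidates(candidates, n_posts)
    return max(keep, key=lambda xp: score(keep[xp])) if keep else None
```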
D. Post link extraction
Harvest locates candidates for the post link extraction by searching for HTML link (href) and anchor (name) attributes within the individual posts. Only candidates that point to the forum page's URL and yield one link per individual post are considered. In addition, Harvest searches for a number at the end of the URL: if this number increases continuously from post to post, it is almost certainly a direct link to the post.
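The trailing-number heuristic can be sketched as below; the function names and the input shape (one list of href/anchor values per post) are hypothetical, and Harvest's actual implementation may differ.

```python
import re
from urllib.parse import urljoin

def post_links(page_url, hrefs_per_post):
    """hrefs_per_post: for each post, the href/anchor values found inside it.
    A candidate qualifies if every link points back to the forum page's URL
    and the trailing numbers increase strictly from post to post."""
    trailing = re.compile(r"(\d+)$")
    links, numbers = [], []
    for hrefs in hrefs_per_post:
        match = next(
            (h for h in hrefs
             if urljoin(page_url, h).startswith(page_url) and trailing.search(h)),
            None,
        )
        if match is None:
            return None  # not one qualifying link per post -> reject
        links.append(urljoin(page_url, match))
        numbers.append(int(trailing.search(match).group(1)))
    if numbers == sorted(numbers) and len(set(numbers)) == len(numbers):
        return links  # strictly increasing -> very likely direct post links
    return None
```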
E. User extraction
Harvest's post user extraction identifies candidates for the post's author by (i) searching link elements other than the post link that do not refer to another web page, and (ii) identifying potential user names in the HTML elements span, strong, div and b. We only consider candidates that suggest user names with less than 100 characters and less than four words. Class attributes containing the words user, member, person or profile raise the score of the corresponding XPath. The score also increases if the extracted user name differs between individual posts. In addition, links are weighted higher than text. Sorting the candidate XPaths based on the obtained score yields the most likely XPath for the post's creator.

IV. EVALUATION
Appropriate benchmarking suites and gold standard data are key towards evaluating content extraction methods and identifying their strengths and weaknesses. We, therefore, have created a gold standard dataset that is used in conjunction with the open source Orbis benchmarking framework [23] to evaluate Harvest's performance.

The evaluation section first provides a short description of the created WEB-FORUM-52 gold standard and then describes the evaluation system. Afterwards, we introduce two sets of experiments:
1) an evaluation of the post content extraction which compares Harvest to a baseline and other state-of-the-art content extraction frameworks; and
2) an assessment of Harvest's forum metadata extraction capabilities.

We conclude the section with a discussion of the evaluation results while taking into account some of the systems that were unavailable for testing in order to provide a clear picture of where Harvest currently stands.

Table I summarizes the capabilities of the evaluated systems. General content extraction systems such as BoilerPy3 and jusText only provide a text representation of the relevant page content. Dragnet and Harvest, in contrast, also extract singular posts, and Harvest is the only system capable of extracting post metadata as well. Inscriptis acts as a baseline since it has been designed for converting HTML to text and, therefore, also returns boilerplate elements such as navigation areas and copyright notes.
TABLE I
Capabilities of the evaluated systems. "text" refers to the extraction of the forum's text, "post" to the identification of individual posts, and "date", "user", "URL" to the extraction of the corresponding metadata.

System                | text | post | date | user | URL
Inscriptis (baseline) |  X   |      |      |      |
BoilerPy3             |  X   |      |      |      |
Dragnet               |  X   |  X   |      |      |
jusText               |  X   |      |      |      |
Harvest               |  X   |  X   |  X   |  X   |  X
A. The WEB-FORUM-52 gold standard
The WEB-FORUM-52 gold standard comprises (i) 13 web forums from the health domain, (ii) 15 forums obtained from a Wikipedia list of popular forums (en.wikipedia.org/wiki/List_of_Internet_forums), (iii) 13 forums mentioned on a list of popular German web forums, (iv) nine forums obtained from WPressBlog and (v) two additional forums. For most forums, two web pages (from different threads) were used and stored together with gold standard annotations that have been manually created by domain experts and describe the post text, post date, post user and direct URL to the post. The gold standard is publicly available on GitHub (github.com/fhgr/harvest).

B. Evaluation system
Since one of our long-term goals when creating new experiments is to enhance transparency and reproducibility, we have decided to use an open source framework for computing the evaluation scores. We have selected Orbis [1], which was designed with extensibility in mind. Although Orbis also enables visual evaluations, we created a forum-extraction evaluation plugin that solely uses the Orbis command line interface, since designing new visual evaluations is beyond the scope of this work.

Barbaresi and Lejeune [24] present an extensive evaluation of the best content extraction tools available in early 2020. In this evaluation, Inscriptis proved to be the fastest tool and also yielded the best recall. Dragnet, in turn, provided the highest precision, and depending on the metric (e.g., clean-eval, euclidean or cosine distances, etc.) either Dragnet or News-Please yielded the best F1 score. For our experiments, we have selected several of the systems used in Barbaresi's evaluation [24], including Inscriptis (github.com/weblyzard/inscriptis), BoilerPy3 (github.com/jmriebold/BoilerPy3), jusText (github.com/miso-belica/jusText) and Dragnet (github.com/dragnet-org/dragnet). Several other tools were initially targeted but not included since their source code was not available online at the date the experiments were performed (e.g., Sido's forum extraction tool [25]) or due to various errors (e.g., the News-Please tool, github.com/fhamborg/news-please). We will continue to pursue the developers of these tools to include their systems in future versions of the created evaluation plugin.

C. Post content extraction
This experiment evaluates how well the extracted post content corresponds to the gold standard data. Harvest is compared to three other content extraction methods and to a baseline (Inscriptis) that extracts the whole text from the web page (i.e. forum posts and boilerplate content).

Most forum extraction evaluations tend to be recall-oriented; therefore, the classic metrics used within them are recall at various cutoffs (R@N) and Mean Reciprocal Rank (MRR), which evaluates a list of possible responses to a set of sample queries and generally makes sense when only a single relevant document is known (e.g., one relevant page from a forum). Recall-oriented evaluations are rather well-suited for settings in which pages are extracted from a single forum, and, therefore, have not been considered in the present evaluation.

In precision-oriented settings (e.g., production environments), it is also customary to compute precision at various cutoffs (e.g., P@N) and Mean Average Precision (mAP). Depending on the algorithms that are evaluated (e.g., MDR algorithms, clustering algorithms), some forum evaluations have also provided additional metrics like Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI), measures that are used for establishing the similarity between clustering algorithms.

Since we were interested in comparing the actual text of the posts returned by each tool, we have considered the following performance metrics that are tailored towards evaluating content extraction tools: (a) Levenshtein distance [26], (b) the Jaccard Coefficient, and (c) a token-based computation of precision, recall, and the F1 measure [27].

The equations below use the following notation: S_g refers to the string containing the gold standard text, S_e to the string that has been extracted by the evaluated systems. For token-based measures, these strings are split into tokens t_i, yielding two token sets, one for the gold standard (T_g) and a second one for the extracted text (T_e).
1) Levenshtein distance: The first approach computes the normalized Levenshtein distance (lev_norm) between the gold standard forum text (S_g) and the forum text extracted by the systems (S_e):

    lev_norm(S_g, S_e) = lev(S_g, S_e) / max(|S_g|, |S_e|)    (1)

where |S_g| and |S_e| refer to the length of the gold standard and extracted text, respectively. The evaluated systems often halved, split or incorrectly merged posts, which seriously impacts the time required for computing the Levenshtein distance. We, therefore, selected the FuzzyWuzzy Python package (github.com/seatgeek/fuzzywuzzy) which provides a fast computation of the Levenshtein distance [26] even under the stated conditions.
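For illustration, Equation (1) can be computed with a plain dynamic-programming implementation; the evaluation itself uses the FuzzyWuzzy package for speed, so this sketch is only meant to make the metric concrete.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lev_norm(gold, extracted):
    """Normalized Levenshtein distance from Equation (1)."""
    if not gold and not extracted:
        return 0.0
    return levenshtein(gold, extracted) / max(len(gold), len(extracted))
```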
2) Jaccard Coefficient: A second similarity measure is the Jaccard Coefficient, which is computed based on the extracted token sets as outlined below:

    J(T_g, T_e) = |T_g ∩ T_e| / |T_g ∪ T_e|    (2)
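Equation (2) reduces to a few set operations; a minimal sketch, assuming whitespace tokenization:

```python
def jaccard(gold_text, extracted_text):
    """Jaccard coefficient over the two token sets (Equation (2))."""
    tg, te = set(gold_text.split()), set(extracted_text.split())
    union = tg | te
    return len(tg & te) / len(union) if union else 1.0
```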
3) Token-based similarity:
The token-based similarity metric computes precision, recall and, consequently, the F1 measure based on the common tokens between the gold standard and the extracted text:

    P = |T_g ∩ T_e| / |T_e|    (3)
    R = |T_g ∩ T_e| / |T_g|    (4)

We compute both micro (mP, mR and mF1) and macro (MP, MR and MF1) results for the evaluations. The micro results correspond to the weighted average scores, whereas the macro results represent the arithmetic mean of the per-class (e.g., type) scores. Micro results are well-suited for evaluating results of imbalanced classes, whereas the macro-averages compute the metrics separately for each class and then take the averages. It is important to provide both metrics precisely because class distributions differ wildly between various pages or corpora.

TABLE II
Evaluation of the post extraction task using Levenshtein distance. Micro and macro precision, recall and F1 scores.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.26 | 0.47 | 0.33 | 0.35 | 0.48 | 0.37
jusText 2.2.0    | 0.73 | 0.63 | 0.68 | 0.63 | 0.63 | 0.63
BoilerPy3 1.0.2  | 0.50 | 0.49 | 0.50 | 0.49 | 0.49 | 0.49
Inscriptis 1.1.0 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37
Harvest 1.0.0    | 0.93 | 0.87 | 0.90 | 0.86 | 0.86 | 0.86

TABLE III
Evaluation of the post extraction task using the Jaccard Coefficient. Micro and macro precision, recall and F1 scores.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.24 | 0.43 | 0.30 | 0.33 | 0.45 | 0.34
jusText 2.2.0    | 0.52 | 0.45 | 0.48 | 0.45 | 0.45 | 0.45
BoilerPy3 1.0.2  | 0.30 | 0.29 | 0.30 | 0.29 | 0.29 | 0.29
Inscriptis 1.1.0 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31
Harvest 1.0.0    | 0.92 | 0.87 | 0.89 | 0.86 | 0.85 | 0.85

D. Experimental results
Tables II, III and IV summarize the evaluation results. Regardless of the performance metric used, Harvest provided the best performance for the post extraction tasks.

TABLE IV
Micro and macro precision, recall and F1 scores computed with the approach from Weninger et al.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.94 | 0.49 | 0.65 | 0.87 | 0.64 | 0.70
jusText 2.2.0    | 0.93 | 0.73 | 0.82 | 0.78 | 0.60 | 0.66
BoilerPy3 1.0.2  | 0.95 | 0.50 | 0.65 | 0.83 | 0.47 | 0.57
Inscriptis 1.1.0 | 0.71 | 0.99 | 0.83 | 0.34 | 0.55 | 0.41
Harvest 1.0.0    | 0.99 | 0.99 | 0.99 | 0.91 | 0.91 | 0.91

TABLE V
Evaluation of Harvest's metadata extraction performance.

Metadata field | mP   | mR   | mF1  | MP   | MR   | MF1
post user      | 0.86 | 0.79 | 0.83 | 0.76 | 0.76 | 0.76
post date      | 0.41 | 0.33 | 0.36 | 0.37 | 0.38 | 0.38
post URL       | 0.51 | 0.66 | 0.58 | 0.43 | 0.42 | 0.42

Since some of the tools included in this evaluation were primarily designed for general content extraction (e.g., Dragnet), a wrapper was added to correct mistakes in the partitioning of the posts. Examples of such errors include: (i) the merging of multiple forum posts into a single post, a variant of which affects most tools (e.g., merging of consecutive posts, merging of the final post on a page with the footer of the page); (ii) the reverse error, i.e. splitting a post into multiple posts (e.g., due to pictures or other media, posts can sometimes end up being broken into multiple pieces); or (iii) merging of post metadata and post content (e.g., instead of extracting user metadata and post content into separate slots, they end up all merged into the content slot). The wrappers draw upon lists of conversation starters and enders in several languages (e.g., English, German, Spanish) to correct the partitioning obtained from the original tools.

Besides the three error classes mentioned above, we have also encountered error classes that were less frequent or specific to particular tools. Dragnet, for example, has repeatedly lost posts from multiple forum websites (e.g., instead of retrieving the content for ten posts from a page, it only retrieved the content for eight or nine posts) or sometimes returned the content only partially (e.g., retrieved only a sentence or several sentences from a post, but not the full content of the post).
The missing posts could be related to the various features used for training Dragnet, but it is currently difficult to understand which features have led to this outcome. jusText has performed well in situations in which the post segmentation was clear, but has often failed to correctly separate the posts when media objects were included in the pages (e.g., when lots of adverts or different sections of a page intermingled). BoilerPy3 has sometimes retrieved only the titles or the first forum posts from a page and has barely retrieved any content when the page layouts were similar to blog posts with a larger content section and many comments. Also, similar to Dragnet, in some cases it has randomly lost forum posts. Inscriptis is typically used for extracting the HTML and DOM content of a page; therefore, some of its errors were related to these use cases. Harvest also loses some posts due to page layout (e.g., in several cases it fails to recognize the first post on a page), but generally has fewer errors due to the content. Most of the current Harvest errors are due to the failure to retrieve some of the post URLs or user names correctly, but these are currently being fixed. However, compared to the rest of the competitors, Harvest does manage to extract almost all of the content available on a forum page.

The second set of experiments (see Table V) evaluates the performance of Harvest's metadata extraction components, which also extract the post's date, user and link. Of these three tasks, the post user extraction is the most reliable one. The results also clearly indicate that the extraction of the post creation date is currently the most challenging task, since dates are not always organized in a separate XPath, which requires the combination of multiple tools, i.e., Harvest for computing the XPath and dateparser for extracting the date. Consequently, even smaller mistakes multiply and reduce the overall effectiveness of the extraction task.

V. OUTLOOK AND CONCLUSIONS
Most content extraction frameworks are generic enough to work in multiple settings (e.g., crawlers that can extract content from classic web pages and social media), but rarely provide good results for custom scenarios such as the extraction of web forum posts. This observation has also been confirmed by the evaluation results presented in Section IV, which indicate that Harvest clearly outperformed other systems for the given forum extraction task. Consequently, users can either adapt existing tools such as Dragnet to their tasks or draw upon domain-specific content extractors such as the Vigi4Med scraper, if they are available for the targeted application domain. Harvest, although limited to forum extraction, in contrast addresses both needs, (i) generality and (ii) domain-specificity, therefore simplifying the task of extracting and processing information from web forums.

Since Harvest focuses on extracting the entire forum content (i.e. high recall) while maintaining a high precision, our evaluations draw upon the F1 metric. Other measures such as ARI, AMI, mAP or P@N are also frequently used in the literature. We plan to extend the created Orbis forum extraction evaluation plugin with some of these metrics and also aim to include additional evaluation types that cover an even larger array of content extraction issues in future work.

Our efforts ultimately aim at correctly classifying information from the extracted text based on multiple features like sentiment, entities and symptoms, but during early evaluations we discovered that content extraction is rarely perfect. Future work will return to this original goal and focus mostly on classification and related tasks like joint intent detection and slot filling.
Acknowledgement

The research presented in this paper has been conducted within the MedMon project.
REFERENCES

[1] F. Odoni, P. Kuntschik, A. M. P. Brasoveanu, and A. Weichselbraun, "On the importance of drill-down analysis for assessing gold standards and named entity linking performance," in Proceedings of the 14th International Conference on Semantic Systems, SEMANTICS 2018, Vienna, Austria, September 10-13, 2018, ser. Procedia Computer Science, vol. 137. Elsevier, 2018, pp. 33-42. https://doi.org/10.1016/j.procs.2018.09.004
[2] B. Liu, R. L. Grossman, and Y. Zhai, "Mining web pages for data records," IEEE Intell. Syst., vol. 19, no. 6, pp. 49-55, 2004. https://doi.org/10.1109/MIS.2004.68
[3] Y. Zhai and B. Liu, "Structured data extraction from the web based on partial tree alignment," IEEE Trans. Knowl. Data Eng., vol. 18, no. 12, pp. 1614-1628, 2006. https://doi.org/10.1109/TKDE.2006.197
[4] H. A. Sleiman and R. Corchuelo, "Trinity: On using trinary trees for unsupervised web data extraction," IEEE Trans. Knowl. Data Eng., vol. 26, no. 6, pp. 1544-1556, 2014. https://doi.org/10.1109/TKDE.2013.161
[5] M. Kayed and C. Chang, "FiVaTech: Page-level web data extraction from template pages," IEEE Trans. Knowl. Data Eng., vol. 22, no. 2, pp. 249-263, 2010. https://doi.org/10.1109/TKDE.2009.82
[6] J. Zhang, C. Jin, Y. Lin, and X. Gong, "Forum data extraction without explicit rules," J. Liu, J. Chen, and G. Xu, Eds. IEEE Computer Society, 2012, pp. 460-465. https://doi.org/10.1109/CGC.2012.72
[7] S. Shi, C. Liu, Y. Shen, C. Yuan, and Y. Huang, "AutoRM: An effective approach for automatic web data record mining," Knowl. Based Syst., vol. 89, pp. 314-331, 2015. https://doi.org/10.1016/j.knosys.2015.07.012
[8] F. K. Wai, L. W. Yong, V. L. Thing, and V. Pomponiu, "CMDR: Classifying nodes for mining data records with different html structures," in TENCON 2017 - 2017 IEEE Region 10 Conference. IEEE, 2017, pp. 1862-1862.
[9] J. Liu, L. Lin, Z. Cai, J. Wang, and H.-j. Kim, "Deep web data extraction based on visual information processing," Journal of Ambient Intelligence and Humanized Computing, pp. 1-11, 2017.
[10] T. Weninger, W. H. Hsu, and J. Han, "CETR: content extraction via tag ratios," in Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010. ACM, 2010, pp. 971-980. https://doi.org/10.1145/1772690.1772789
[11] C. Kohlschütter, P. Fankhauser, and W. Nejdl, "Boilerplate detection using shallow text features," in Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 441-450.
[12] M. E. Peters and D. Lecocq, "Content extraction using diverse feature sets," International World Wide Web Conferences Steering Committee / ACM, 2013, pp. 89-90. https://doi.org/10.1145/2487788.2487828
[13] K. Zhao, B. Li, Z. Peng, J. Bu, and C. Wang, "Navigation objects extraction for better content structure understanding," in Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, August 23-26, 2017. ACM, 2017, pp. 629-636. https://doi.org/10.1145/3106426.3106437
[14] S. Bhatia and P. Mitra, "Adopting inference networks for online thread retrieval," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010.
[15] ——, in Proceedings of the 15th International Workshop on the Web and Databases 2012, WebDB 2012, Scottsdale, AZ, USA, May 20, 2012, pp. 13-18. http://db.disi.unitn.eu/pages/WebDB2012/papers/p26.pdf
[16] ——, "Summarizing online forum discussions - can dialog acts of individual messages help?" in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar. ACL, 2014, pp. 2127-2131. https://doi.org/10.3115/v1/d14-1226
[17] P. Biyani, S. Bhatia, C. Caragea, and P. Mitra, "Using subjectivity analysis to improve thread retrieval in online forums," in Advances in Information Retrieval - 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, ser. Lecture Notes in Computer Science, vol. 9022, 2015, pp. 495-500. https://doi.org/10.1007/978-3-319-16354-3_54
[18] ——, "Using non-lexical features for identifying factual and opinionative threads in online forums," Knowl. Based Syst., vol. 69, pp. 170-178, 2014. https://doi.org/10.1016/j.knosys.2014.04.048
[19] J. H. D. Cho, P. Sondhi, C. Zhai, and B. R. Schatz, "Resolving healthcare forum posts via similar thread retrieval," in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB '14, Newport Beach, California, USA. ACM, 2014, pp. 33-42. https://doi.org/10.1145/2649387.2649399
[20] B. Audeh, M. Beigbeder, A. Zimmermann, P. Jaillon, and C. Bousquet, "Vigi4Med scraper: A framework for web forum structured data extraction and semantic representation," PLoS ONE, vol. 12, no. 1, 2017.
[21] D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, and R. Falck, "PREDOSE: A semantic web platform for drug abuse epidemiology using social media," J. Biomed. Informatics, vol. 46, no. 6, pp. 985-997, 2013. https://doi.org/10.1016/j.jbi.2013.07.007
[22] U. Baskaran and K. Ramanujam, "Automated scraping of structured data records from health discussion forums using semantic analysis," Informatics in Medicine Unlocked, vol. 10, pp. 149-158, 2018. https://doi.org/10.1016/j.imu.2018.01.003
[23] F. Odoni, A. M. Brasoveanu, P. Kuntschik, and A. Weichselbraun, "Introducing Orbis: An extendable evaluation pipeline for named entity linking drill-down analysis," Melbourne, Australia, October 2019.
[24] A. Barbaresi and G. Lejeune, "Out-of-the-box and into the ditch? Multilingual evaluation of generic text extraction tools," in Proceedings of the 12th Web as Corpus Workshop, WAC@LREC 2020, Marseille, France, May 2020.
[25] Computación y Sistemas.
[26] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[27] T. Weninger, W. H. Hsu, and J. Han, "CETR: content extraction via tag ratios," in Proceedings of the 19th International Conference on World Wide Web, WWW 2010.