Harvest -- An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums
Albert Weichselbraun*, Adrian M. P. Brasoveanu†, Roger Waldvogel*, Fabian Odoni*

* University of Applied Sciences of the Grisons, Pulvermühlestrasse 57, 7000 Chur, Switzerland
Email: {firstname.lastname}@fhgr.ch
† MODUL Technology GmbH, Am Kahlenberg 1, 1090 Vienna, Austria
Email: [email protected]
Abstract—Automatic extraction of forum posts and metadata is a crucial but challenging task since forums do not expose their content in a standardized structure. Content extraction methods, therefore, often need customizations such as adaptations to page templates and improvements of their extraction code before they can be deployed to new forums. Most of the current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content such as the identification of author metadata and information on the thread structure. This paper, therefore, presents a method that determines the XPath of forum posts, eliminating incorrect mergers and splits of the extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL and structure are extracted. We also introduce Harvest, a new open source toolkit that implements the presented methods, and create a gold standard extracted from 52 different web forums for evaluating our approach. A comprehensive evaluation reveals that Harvest clearly outperforms competing systems.
Index Terms—Information Extraction, Forum Extraction, Natural Language Processing
I. INTRODUCTION

The main contributions of this paper are:
• the introduction of a method that extracts forum posts and metadata on the post's date, author, sequence and URL from web forums;
• the development of the WEB-FORUM-52 gold standard that contains pages from 52 different web forums used for evaluating the proposed approach; and
• extensive experiments that draw upon multiple evaluation settings to evaluate our method against a baseline and three state-of-the-art content extraction frameworks.

The article is organized as follows: Section II provides an overview of related work; Section III presents the algorithms we developed for extracting post, date and user information from forum pages; Section IV describes the created WEB-FORUM-52 gold standard and discusses the results of our algorithms. The last section concludes the paper with a set of observations and details on future work.

II. RELATED WORK
Applying machine learning (ML) and deep learning (DL) algorithms to the task of extracting forum content is usually modeled as a multi-step process. At its core, we can identify several large tasks: (i) extraction of the page source code (if needed); (ii) correct identification of all page regions that might contain important data such as navigation, forum posts, user and date information; (iii) extraction of additional information relevant to each of the identified blocks (e.g., description, author metadata, replies to, etc.).

An early method for extracting forum posts was proposed by Bing Liu in [2] and [3]. The basic idea was to use an unsupervised learning algorithm called Mining Data Region (MDR) which identified data-rich regions from a web page. The method can be applied to different page types, from regular blogs to forums. The basic MDR algorithm is still limited by several constraints: (i) it does not work for websites with a flexible structure; (ii) training needs to be performed for each website separately; and (iii) there is no reliable method to separate the various types of data regions from a page (e.g., if a web forum contains text advertising, text, comments and annotated comments, all of these will simply be marked as data regions when using MDR). There are multiple variants of MDR, including more generalized tree extraction methods like PyDepta [3], trinary trees [4] and template matching [5]. Another induction method used to extract user data and a minimal set of metadata is presented in [6], showing good results.

github.com/fhgr/harvest
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
However, this method still requires up to twenty-five manual extraction rules. AutoRM [7] adds a candidate step (Candidate Records) and a filtering step in order to make sure only the interesting candidates are selected. CMDR [8] uses a deep learning node classifier in order to remove the need to train separately for each website. Another article [9] goes one step further: it uses a convolutional network trained to identify data regions based on their visual properties and feeds its output into an MDR-like algorithm.

Most of the current ML approaches rely on additional features (e.g., the tag ratios within the HTML document) [10] and may even require further components such as boilerplate detection [11] to provide meaningful results. Dragnet [12] combines several feature sets into a single library which has demonstrated good results on several early forum extraction datasets. Another recent approach has focused on navigating the hierarchy of objects extracted from web pages (e.g., hyperlink blocks extracted from DOM trees) [13].

Sumit Bhatia uses inference networks to extract forum threads [14] and includes a dataset of forum pages collected from Ubuntu and TripAdvisor pages. Some of Bhatia's later articles are focused on forum classification [15], forum summarization [16], subjectivity detection in forums [17] and detection of factual or discursive threads [18], showcasing the different processing options available for the information extracted from forum data.

In the medical domain, extracting forum data has also been approached as a domain-specific problem. This is due to the fact that forums are often one of the few available sources that provide user-generated content on rare diseases or drug side effects. Medical forum thread retrieval includes an additional step of filtering irrelevant information [19].
Some of the recent medical tools have also showcased good results (e.g., the Vigi4Med scraper [20]), but as far as we have noticed, their results and methodology have not been replicated in other domains. PREDOSE [21] uses semantic, syntactic and contextual features to enable prescription drug abuse data search through medical forums and social media; however, its main focus is on entity and relation extraction, not necessarily full content extraction. Baskaran [22] presents an automated method of extracting medical information from forums that is based on semantic analysis, but is unfortunately difficult to adapt to other domains.

III. METHOD
Web forums are complex websites that contain a considerable amount of content and metadata such as information on the post's author, date and thread structure. From a technical perspective, extracting these discussions is challenging since they leverage different blog and forum engines which usually do not have a clear, machine-readable structure and differ significantly in terms of navigation, thread structure and page style, rendering most automatic content extraction attempts ineffective. Forum posts might also contain different types of styling, images and emojis. In addition, post metadata (e.g., user data, date, language) can provide context for subsequent tasks such as forum classification when the post is processed by an automated Natural Language Processing pipeline.

The discussions in this section elaborate on the following sub-tasks relevant to the presented approach: (i) the identification of the forum posts' XPath (Section III-A), which acts as an anchor for extracting the post content and metadata; and (ii) the subsequent tasks of extracting the post's content (Section III-B), date (Section III-C), URL (Section III-D), and author (Section III-E).
A. Identification of forum posts
We identify posts by combining a textual representation of the web page with information on the document's DOM tree. Our content extraction strategy draws upon the following two observations:
1) usually most of the textual content present in web forums is located in the posts;
2) the XPath to the forum posts yields multiple sibling nodes, i.e. one sibling per post.

We, therefore, first obtain a textual representation of the page's content (pageContent) that has been derived from the HTML to text engine inscriptis (gitlab.com/weblyzard/inscriptis). Algorithm 1 illustrates how the textual content is then split into lines for which the corresponding XPath is obtained (getContentXPath). Afterwards, the algorithm computes a score (xpathScore; see Algorithm 2) that indicates the textual coverage of the given XPath. Line 5 of Algorithm 1 introduces the constraint that the extracted candidate path needs to yield at least MIN_POST_COUNT siblings, therefore ensuring that the extracted structure appears (similar to a forum post) multiple times on the analyzed forum page. Selecting the XPath with the highest xpathScore yields the XPath for the forum post.

Algorithm 1: Computation of the forum post XPath.
Data: pageContent, domTree
Result: The XPath to the forum posts
  candidatePaths = [];
  foreach line in pageContent.split('\n') do
    xpath ← getContentXPath(domTree, line);
    xpathScore ← getScore(pageContent, xpath, domTree);
    xpathElementCount ← countSiblings(xpath, domTree);
    if (xpathElementCount > MIN_POST_COUNT) then
      candidatePaths.append([xpath, xpathScore, xpathElementCount]);
    end
  end
  return getHighestScoringPath(candidatePaths);

Algorithm 2 outlines the computation of the similarity metric used for assessing an XPath's coverage of the total page content. The algorithm takes three inputs: (a) a text representation of the page's content (pageContent), (b) the XPath to evaluate and (c) the forum's DOM tree. It then computes the cosine similarity to determine the overlap between the forum coverage (i.e. the text present in the nodes that match the provided XPath) and the total page content. In addition, Algorithm 2 introduces constraints on the forum XPath by blacklisting certain HTML tags.
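The selection loop of Algorithm 1 can be sketched as follows. The DOM-dependent helpers (getContentXPath, countSiblings) are assumed to have already produced, for each candidate XPath, the text its nodes enclose and its sibling count; all function and variable names are illustrative, not Harvest's actual API.

```python
import math
from collections import Counter

MIN_POST_COUNT = 3  # assumed value of the paper's MIN_POST_COUNT constant

def cosine(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_post_xpath(page_content, xpath_text, xpath_siblings):
    """Select the candidate XPath whose nodes cover most of the page text.

    xpath_text:     candidate XPath -> concatenated text of its matching nodes
    xpath_siblings: candidate XPath -> number of sibling nodes it matches
    (both mappings would be derived from the DOM tree in the full algorithm)
    """
    candidates = [
        (cosine(page_content, node_text), xpath)
        for xpath, node_text in xpath_text.items()
        if xpath_siblings[xpath] >= MIN_POST_COUNT  # the pseudocode uses >
    ]
    return max(candidates)[1] if candidates else None
```

The XPath that encloses the posts dominates the page's text and repeats for every post, so it wins the coverage comparison against navigation or footer nodes.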
B. Post content extraction

Harvest obtains the post content by selecting an ordered list of DOM nodes that match the post XPath. Afterwards, the HTML within each of the selected DOM nodes is converted to the corresponding text, yielding the list of forum posts.
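This conversion step can be sketched with Python's standard-library HTML parser; Harvest itself delegates the HTML-to-text step to inscriptis, which preserves layout far more faithfully, so the snippet below is only a simplified stand-in.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal HTML-to-text conversion (a stand-in for inscriptis)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

def posts_to_text(post_html_fragments):
    """Convert each DOM node matched by the post XPath (given here as an
    HTML string) into its text, yielding one entry per forum post."""
    out = []
    for fragment in post_html_fragments:
        parser = TextExtractor()
        parser.feed(fragment)
        out.append(parser.text())
    return out
```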
C. Date extraction
Harvest's date extraction component analyses the DOM tree in the vicinity of the post XPath and identifies candidate dates by locating HTML time elements which contain the datetime attribute, and by using dateparser (github.com/scrapinghub/dateparser) to locate additional candidate elements. We collect all candidate elements that match the criteria outlined above and then remove dates that are older than 1993-04-30 (the date on which the Internet was opened to the public by CERN), as well as candidates where the number of extracted
Algorithm 2: Evaluation of a node's likelihood of being a post container.
Data: pageContent, xpath, domTree
Result: A score that indicates the node's likelihood of being the post container.
  /* ignore nodes with descendants in the tag blacklist */
  if (descendantsContainBlacklistedTags(domTree, xpath)) then
    return 0.0;
  end
  /* cosine similarity between the page's content and the text enclosed by the selected xpath */
  nodeText ← getNodeTreeText(domTree, xpath);
  vsmPageContent ← getVsm(pageContent);
  vsmNode ← getVsm(nodeText);
  sim ← (vsmPageContent · vsmNode) / (||vsmPageContent|| · ||vsmNode||);
  /* discount nodes with blacklisted ancestors */
  if (ancestorsContainBlacklistedTags(domTree, xpath)) then
    sim ← sim / 10;
  end
  return sim;

dates does not correspond to the number of forum posts. We have relaxed the latter constraint to allow for a difference of up to two posts less, since our experiments showed some rare cases where the leading posts used a different layout with another element storing the post's creation date. Finally, the candidate elements are scored and sorted according to the following criteria:
1) Post sequence: XPaths yielding chronologically ascending or descending dates are preferred.
2) Recentness: In most cases, the most recent date refers to the post's creation date. We, therefore, favor XPath candidates returning the most recent dates.

Afterwards, the highest scoring XPath is selected.
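The filtering and scoring steps described above might be sketched as follows. The per-XPath candidate date lists are assumed to have been extracted already (e.g., via dateparser); all names are illustrative rather than Harvest's actual API.

```python
from datetime import date

WWW_PUBLIC = date(1993, 4, 30)  # dates before the public web are implausible

def filter_candidates(candidates, n_posts):
    """Keep XPaths whose extracted dates are plausible and whose count
    matches the number of posts (allowing up to two fewer dates, since
    leading posts sometimes use a different layout).

    candidates: XPath -> list of dates found in the nodes it matches."""
    keep = {}
    for xpath, dates in candidates.items():
        if any(d < WWW_PUBLIC for d in dates):
            continue
        if not (n_posts - 2 <= len(dates) <= n_posts):
            continue
        keep[xpath] = dates
    return keep

def score(dates):
    """Prefer chronologically ordered sequences, then the most recent dates."""
    ordered = dates in (sorted(dates), sorted(dates, reverse=True))
    return (ordered, max(dates))

def best_date_xpath(candidates, n_posts):
    keep = filter_candidates(candidates, n_posts)
    return max(keep, key=lambda xp: score(keep[xp])) if keep else None
```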
D. Post link extraction
Harvest locates candidates for the post link extraction by searching for HTML link (href) and anchor (name) attributes within the individual posts. Only candidates that point to the forum page's URL and yield one link per individual post are considered. In addition, Harvest searches for a number at the end of the URL: if this number increases continuously from post to post, it is almost certainly a direct link to the post.
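The trailing-number heuristic can be sketched as below; the function names and the input shape (one list of href/anchor values per post) are hypothetical, and Harvest's actual implementation may differ.

```python
import re
from urllib.parse import urljoin

def post_links(page_url, hrefs_per_post):
    """hrefs_per_post: for each post, the href/anchor values found inside it.
    A candidate qualifies if every link points back to the forum page's URL
    and the trailing numbers increase strictly from post to post."""
    trailing = re.compile(r"(\d+)$")
    links, numbers = [], []
    for hrefs in hrefs_per_post:
        match = next(
            (h for h in hrefs
             if urljoin(page_url, h).startswith(page_url) and trailing.search(h)),
            None,
        )
        if match is None:
            return None  # not one qualifying link per post -> reject
        links.append(urljoin(page_url, match))
        numbers.append(int(trailing.search(match).group(1)))
    if numbers == sorted(numbers) and len(set(numbers)) == len(numbers):
        return links  # strictly increasing -> very likely direct post links
    return None
```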
E. User extraction
Harvest's post user extraction identifies candidates for the post's author by (i) searching link elements other than the post link that do not refer to another web page, and (ii) identifying potential user names in the HTML elements span, strong, div and b. We only consider candidates that suggest user names with less than 100 characters and less than four words. Class attributes containing the words user, member, person or profile raise the score of the corresponding XPath. The score also increases if the extracted user name differs between individual posts. In addition, links are weighted higher than text. Sorting the candidate XPaths based on the obtained score yields the most likely XPath for the post's creator.

IV. EVALUATION
Appropriate benchmarking suites and gold standard data are key towards evaluating content extraction methods and identifying their strengths and weaknesses. We, therefore, have created a gold standard dataset that is used in conjunction with the open source Orbis benchmarking framework [23] to evaluate Harvest's performance.

The evaluation section first provides a short description of the created WEB-FORUM-52 gold standard and then describes the evaluation system. Afterwards, we introduce two sets of experiments:
1) an evaluation of the post content extraction which compares Harvest to a baseline and other state-of-the-art content extraction frameworks; and
2) an assessment of Harvest's forum metadata extraction capabilities.

We conclude the section with a discussion of the evaluation results while taking into account some of the systems that were unavailable for testing in order to provide a clear picture of where Harvest currently stands.

Table I summarizes the capabilities of the evaluated systems. General content extraction systems such as BoilerPy3 and jusText only provide a text representation of the relevant page content. Dragnet and Harvest, in contrast, also extract singular posts, and Harvest is the only system capable of extracting post metadata as well. Inscriptis acts as a baseline since it has been designed for converting HTML to text and, therefore, also returns boilerplate elements such as navigation areas and copyright notes.
TABLE I
Capabilities of the evaluated systems. "text" refers to the extraction of the forum's text, "post" to the identification of individual posts, and "date", "user", "URL" to the extraction of the corresponding metadata.

System                | text | post | date | user | URL
Inscriptis (baseline) |  X   |      |      |      |
BoilerPy3             |  X   |      |      |      |
Dragnet               |  X   |  X   |      |      |
jusText               |  X   |      |      |      |
Harvest               |  X   |  X   |  X   |  X   |  X
A. The WEB-FORUM-52 gold standard
The WEB-FORUM-52 gold standard comprises (i) 13 web forums from the health domain, (ii) 15 forums obtained from a Wikipedia list of popular forums (en.wikipedia.org/wiki/List_of_Internet_forums), (iii) 13 forums mentioned on a list of popular German web forums, (iv) nine forums obtained from WPressBlog and (v) two additional forums. For most forums, two web pages (from different threads) were used and stored together with gold standard annotations that have been manually created by domain experts and describe the post text, post date, post user and direct URL to the post. The gold standard is publicly available on GitHub (github.com/fhgr/harvest).

B. Evaluation system
Since one of our long-term goals when creating new experiments is to enhance transparency and reproducibility, we have decided to use an open source framework for computing the evaluation scores. We have selected Orbis [1], which was designed with extensibility in mind. Although Orbis also enables visual evaluations, we created a forum-extraction evaluation plugin that solely uses the Orbis command line interface, since designing new visual evaluations is beyond the scope of this work.

Barbaresi and Lejeune [24] present an extensive evaluation of the best content extraction tools available in early 2020. In this evaluation, Inscriptis proved to be the fastest tool and also yielded the best recall. Dragnet, in turn, provided the highest precision, and depending on the metric (e.g., clean-eval, euclidean or cosine distances, etc.) either Dragnet or News-Please yielded the best F1 score. For our experiments, we have selected several of the systems used in Barbaresi's evaluation [24], including Inscriptis (github.com/weblyzard/inscriptis), BoilerPy3 (github.com/jmriebold/BoilerPy3), jusText (github.com/miso-belica/jusText) and Dragnet (github.com/dragnet-org/dragnet). Several other tools were initially targeted but not included since their source code was not available online at the date the experiments were performed (e.g., Sido's forum extraction tool [25]) or due to various errors (e.g., the News-Please tool, github.com/fhamborg/news-please). We will continue to pursue the developers of these tools to include their systems in future versions of the created evaluation plugin.

C. Post content extraction
This experiment evaluates how well the extracted post content corresponds to the gold standard data. Harvest is compared to three other content extraction methods and to a baseline (Inscriptis) that extracts the whole text from the web page (i.e. forum posts and boilerplate content).

Most forum extraction evaluations tend to be recall-oriented; therefore, the classic metrics used within them are recall at various cutoffs (R@N) and Mean Reciprocal Rank (MRR), which evaluates a list of possible responses to a set of sample queries and generally makes sense when only a single relevant document is known (e.g., one relevant page from a forum). Recall-oriented evaluations are rather well-suited for settings in which pages are extracted from a single forum, and, therefore, have not been considered in the present evaluation.

In precision-oriented settings (e.g., production environments), it is also customary to compute precision at various cutoffs (e.g., P@N) and Mean Average Precision (mAP). Depending on the algorithms that are evaluated (e.g., MDR algorithms, clustering algorithms), some forum evaluations have also provided additional metrics like Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI), measures that are used for establishing the similarity between clustering algorithms.

Since we were interested in comparing the actual text of the posts returned by each tool, we have considered the following performance metrics that are tailored towards evaluating content extraction tools: (a) Levenshtein distance [26], (b) the Jaccard Coefficient, and (c) a token-based computation of precision, recall, and the F1 measure [27].

The equations below use the following notation: S_g refers to the string containing the gold standard text, S_e to the string that has been extracted by the evaluated systems. For token-based measures, these strings are split into tokens t_i, yielding two token sets, one for the gold standard (T_g) and a second one for the extracted text (T_e).
1) Levenshtein distance: The first approach computes the normalized Levenshtein distance (lev_norm) between the gold standard forum text (S_g) and the forum text extracted by the systems (S_e):

    lev_norm(S_g, S_e) = lev(S_g, S_e) / max(|S_g|, |S_e|)    (1)

where |S_g| and |S_e| refer to the length of the gold standard and extracted text, respectively. The evaluated systems often halved, split or incorrectly merged posts, which seriously impacts the time required for computing the Levenshtein distance. We, therefore, selected the FuzzyWuzzy Python package (github.com/seatgeek/fuzzywuzzy) which provides a fast computation of the Levenshtein distance [26] even under the stated conditions.
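For illustration, Equation (1) can be computed with a plain dynamic-programming implementation; the evaluation itself uses the FuzzyWuzzy package for speed, so this sketch is only meant to make the metric concrete.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lev_norm(gold, extracted):
    """Normalized Levenshtein distance from Equation (1)."""
    if not gold and not extracted:
        return 0.0
    return levenshtein(gold, extracted) / max(len(gold), len(extracted))
```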
2) Jaccard Coefficient: A second similarity measure is the Jaccard Coefficient, which is computed based on the extracted token sets as outlined below:

    J(T_g, T_e) = |T_g ∩ T_e| / |T_g ∪ T_e|    (2)
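Equation (2) reduces to a few set operations; a minimal sketch, assuming whitespace tokenization:

```python
def jaccard(gold_text, extracted_text):
    """Jaccard coefficient over the two token sets (Equation (2))."""
    tg, te = set(gold_text.split()), set(extracted_text.split())
    union = tg | te
    return len(tg & te) / len(union) if union else 1.0
```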
3) Token-based similarity:
The token-based similarity metric computes precision, recall and, consequently, the F1 measure based on the common tokens between the gold standard and the extracted text:

    P = |T_g ∩ T_e| / |T_e|    (3)
    R = |T_g ∩ T_e| / |T_g|    (4)

We compute both micro (mP, mR and mF1) and macro (MP, MR and MF1) results for the evaluations. The micro results correspond to the weighted average scores, whereas the macro results represent the arithmetic mean of the per-class (e.g., type) scores. Micro results are well-suited for evaluating results of imbalanced classes, whereas the macro-averages compute the metrics separately for each class and then take the averages. It is important to provide both metrics precisely because class distributions differ wildly between various pages or corpora.

TABLE II
Evaluation of the post extraction task using Levenshtein distance. Micro and macro precision, recall and F1 scores.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.26 | 0.47 | 0.33 | 0.35 | 0.48 | 0.37
jusText 2.2.0    | 0.73 | 0.63 | 0.68 | 0.63 | 0.63 | 0.63
BoilerPy3 1.0.2  | 0.50 | 0.49 | 0.50 | 0.49 | 0.49 | 0.49
Inscriptis 1.1.0 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 | 0.37
Harvest 1.0.0    | 0.93 | 0.87 | 0.90 | 0.86 | 0.86 | 0.86

TABLE III
Evaluation of the post extraction task using the Jaccard Coefficient. Micro and macro precision, recall and F1 scores.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.24 | 0.43 | 0.30 | 0.33 | 0.45 | 0.34
jusText 2.2.0    | 0.52 | 0.45 | 0.48 | 0.45 | 0.45 | 0.45
BoilerPy3 1.0.2  | 0.30 | 0.29 | 0.30 | 0.29 | 0.29 | 0.29
Inscriptis 1.1.0 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31 | 0.31
Harvest 1.0.0    | 0.92 | 0.87 | 0.89 | 0.86 | 0.85 | 0.85

D. Experimental results
Tables II, III and IV summarize the evaluation results. Regardless of the performance metric used, Harvest provided the best performance for the post extraction tasks.

TABLE IV
Micro and macro precision, recall and F1 scores computed with the approach from Weninger et al.

Method           | mP   | mR   | mF1  | MP   | MR   | MF1
Dragnet 2.0.4    | 0.94 | 0.49 | 0.65 | 0.87 | 0.64 | 0.70
jusText 2.2.0    | 0.93 | 0.73 | 0.82 | 0.78 | 0.60 | 0.66
BoilerPy3 1.0.2  | 0.95 | 0.50 | 0.65 | 0.83 | 0.47 | 0.57
Inscriptis 1.1.0 | 0.71 | 0.99 | 0.83 | 0.34 | 0.55 | 0.41
Harvest 1.0.0    | 0.99 | 0.99 | 0.99 | 0.91 | 0.91 | 0.91

TABLE V
Evaluation of Harvest's metadata extraction performance.

Metadata field | mP   | mR   | mF1  | MP   | MR   | MF1
post user      | 0.86 | 0.79 | 0.83 | 0.76 | 0.76 | 0.76
post date      | 0.41 | 0.33 | 0.36 | 0.37 | 0.38 | 0.38
post URL       | 0.51 | 0.66 | 0.58 | 0.43 | 0.42 | 0.42

Since some of the tools included in this evaluation were primarily designed for general content extraction (e.g., Dragnet), a wrapper was added to correct mistakes in the partitioning of the posts. Examples of such errors include: (i) the merging of multiple forum posts into a single post, a variant of which affects most tools (e.g., merging of consecutive posts, merging of the final post on a page with the footer of the page); (ii) the reverse error, i.e. splitting a post into multiple posts (e.g., due to pictures or other media, posts can sometimes end up being broken into multiple pieces); or (iii) merging of post metadata and post content (e.g., instead of extracting user metadata and post content into separate slots, they end up all merged into the content slot). The wrappers draw upon lists of conversation starters and enders in several languages (e.g., English, German, Spanish) to correct the partitioning obtained from the original tools.

Besides the three error classes mentioned above, we have also encountered error classes that were less frequent or specific to particular tools. Dragnet, for example, has repeatedly lost posts from multiple forum websites (e.g., instead of retrieving the content for ten posts from a page, it only retrieved the content for eight or nine posts) or sometimes returned the content only partially (e.g., retrieved only a sentence or several sentences from a post, but not the full content of the post).
The missing posts could be related to the various features used for training Dragnet, but it is currently difficult to understand which features have led to this outcome. jusText has performed well in situations in which the post segmentation was clear, but has often failed to correctly separate the posts when media objects were included in the pages (e.g., when lots of adverts or different sections of a page intermingled). BoilerPy3 has sometimes retrieved only the titles or the first forum posts from a page and has barely retrieved any content when the page layouts were similar to blog posts with a larger content section and many comments. Also, similar to Dragnet, in some cases it has randomly lost forum posts. Inscriptis is typically used for extracting the HTML and DOM content of a page; therefore, some of its errors were related to these use cases. Harvest also loses some posts due to page layout (e.g., in several cases it fails to recognize the first post on a page), but generally has fewer errors due to the content. Most of the current Harvest errors are due to the failure to retrieve some of the post URLs or user names correctly, but these are currently being fixed. However, compared to the rest of the competitors, Harvest does manage to extract almost all of the content available on a forum page.

The second set of experiments (see Table V) evaluates the performance of Harvest's metadata extraction components, which also extract the post's date, user and link. Of these three tasks, the post user extraction is the most reliable one. The results also clearly indicate that the extraction of the post creation date is currently the most challenging task, since dates are not always organized in a separate XPath, which requires the combination of multiple tools, i.e., Harvest for computing the XPath and dateparser for extracting the date. Consequently, even smaller mistakes multiply and reduce the overall effectiveness of the extraction task.

V. OUTLOOK AND CONCLUSIONS
Most content extraction frameworks are generic enough to work in multiple settings (e.g., crawlers that can extract content from classic web pages and social media), but rarely provide good results for custom scenarios such as the extraction of web forum posts. This observation has also been confirmed by the evaluation results presented in Section IV, which indicate that Harvest clearly outperformed other systems for the given forum extraction task. Consequently, users can either adapt existing tools such as Dragnet to their tasks or draw upon domain-specific content extractors such as the Vigi4Med scraper, if they are available for the targeted application domain. Harvest, although limited to forum extraction, in contrast addresses both needs, (i) generality and (ii) domain-specificity, therefore simplifying the task of extracting and processing information from web forums.

Since Harvest focuses on extracting the entire forum content (i.e. high recall) while maintaining a high precision, our evaluations draw upon the F1 metric. Other measures such as ARI, AMI, mAP or P@N are also frequently used in the literature. We plan to extend the created Orbis forum extraction evaluation plugin with some of these metrics and also aim to include additional evaluation types that cover an even larger array of content extraction issues in future work.

Our efforts ultimately aim at correctly classifying information from the extracted text based on multiple features like sentiment, entities and symptoms, but during early evaluations we discovered that content extraction is rarely perfect. Future work will return to this original goal and focus mostly on classification and related tasks like joint intent detection and slot filling.
Acknowledgement

The research presented in this paper has been conducted within the MedMon project.
REFERENCES

[1] F. Odoni, P. Kuntschik, A. M. P. Brasoveanu, and A. Weichselbraun, "On the importance of drill-down analysis for assessing gold standards and named entity linking performance," in Proceedings of the 14th International Conference on Semantic Systems, SEMANTICS 2018, Vienna, Austria, September 10-13, 2018, ser. Procedia Computer Science, vol. 137. Elsevier, 2018, pp. 33-42. https://doi.org/10.1016/j.procs.2018.09.004
[2] B. Liu, R. L. Grossman, and Y. Zhai, "Mining web pages for data records," IEEE Intell. Syst., vol. 19, no. 6, pp. 49-55, 2004. https://doi.org/10.1109/MIS.2004.68
[3] Y. Zhai and B. Liu, "Structured data extraction from the web based on partial tree alignment," IEEE Trans. Knowl. Data Eng., vol. 18, no. 12, pp. 1614-1628, 2006. https://doi.org/10.1109/TKDE.2006.197
[4] H. A. Sleiman and R. Corchuelo, "Trinity: On using trinary trees for unsupervised web data extraction," IEEE Trans. Knowl. Data Eng., vol. 26, no. 6, pp. 1544-1556, 2014. https://doi.org/10.1109/TKDE.2013.161
[5] M. Kayed and C. Chang, "FiVaTech: Page-level web data extraction from template pages," IEEE Trans. Knowl. Data Eng., vol. 22, no. 2, pp. 249-263, 2010. https://doi.org/10.1109/TKDE.2009.82
[6] J. Zhang, C. Jin, Y. Lin, and X. Gong, "Forum data extraction without explicit rules," J. Liu, J. Chen, and G. Xu, Eds. IEEE Computer Society, 2012, pp. 460-465. https://doi.org/10.1109/CGC.2012.72
[7] S. Shi, C. Liu, Y. Shen, C. Yuan, and Y. Huang, "AutoRM: An effective approach for automatic web data record mining," Knowl. Based Syst., vol. 89, pp. 314-331, 2015. https://doi.org/10.1016/j.knosys.2015.07.012
[8] F. K. Wai, L. W. Yong, V. L. Thing, and V. Pomponiu, "CMDR: Classifying nodes for mining data records with different html structures," in TENCON 2017 - 2017 IEEE Region 10 Conference. IEEE, 2017, pp. 1862-1862.
[9] J. Liu, L. Lin, Z. Cai, J. Wang, and H.-j. Kim, "Deep web data extraction based on visual information processing," Journal of Ambient Intelligence and Humanized Computing, pp. 1-11, 2017.
[10] T. Weninger, W. H. Hsu, and J. Han, "CETR: content extraction via tag ratios," in Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010. ACM, 2010, pp. 971-980. https://doi.org/10.1145/1772690.1772789
[11] C. Kohlschütter, P. Fankhauser, and W. Nejdl, "Boilerplate detection using shallow text features," in Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 441-450.
[12] M. E. Peters and D. Lecocq, "Content extraction using diverse feature sets," International World Wide Web Conferences Steering Committee / ACM, 2013, pp. 89-90. https://doi.org/10.1145/2487788.2487828
[13] K. Zhao, B. Li, Z. Peng, J. Bu, and C. Wang, "Navigation objects extraction for better content structure understanding," in Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, August 23-26, 2017. ACM, 2017, pp. 629-636. https://doi.org/10.1145/3106426.3106437
[14] S. Bhatia and P. Mitra, "Adopting inference networks for online thread retrieval," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010.
[15] ——, in Proceedings of the 15th International Workshop on the Web and Databases 2012, WebDB 2012, Scottsdale, AZ, USA, May 20, 2012, pp. 13-18. http://db.disi.unitn.eu/pages/WebDB2012/papers/p26.pdf
[16] ——, "Summarizing online forum discussions - can dialog acts of individual messages help?" in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar. ACL, 2014, pp. 2127-2131. https://doi.org/10.3115/v1/d14-1226
[17] P. Biyani, S. Bhatia, C. Caragea, and P. Mitra, "Using subjectivity analysis to improve thread retrieval in online forums," in Advances in Information Retrieval - 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, ser. Lecture Notes in Computer Science, vol. 9022, 2015, pp. 495-500. https://doi.org/10.1007/978-3-319-16354-3_54
[18] ——, "Using non-lexical features for identifying factual and opinionative threads in online forums," Knowl. Based Syst., vol. 69, pp. 170-178, 2014. https://doi.org/10.1016/j.knosys.2014.04.048
[19] J. H. D. Cho, P. Sondhi, C. Zhai, and B. R. Schatz, "Resolving healthcare forum posts via similar thread retrieval," in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB '14, Newport Beach, California, USA. ACM, 2014, pp. 33-42. https://doi.org/10.1145/2649387.2649399
[20] B. Audeh, M. Beigbeder, A. Zimmermann, P. Jaillon, and C. Bousquet, "Vigi4Med scraper: A framework for web forum structured data extraction and semantic representation," PLoS ONE, vol. 12, no. 1, 2017.
[21] D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, and R. Falck, "PREDOSE: A semantic web platform for drug abuse epidemiology using social media," J. Biomed. Informatics, vol. 46, no. 6, pp. 985-997, 2013. https://doi.org/10.1016/j.jbi.2013.07.007
[22] U. Baskaran and K. Ramanujam, "Automated scraping of structured data records from health discussion forums using semantic analysis," Informatics in Medicine Unlocked, vol. 10, pp. 149-158, 2018. https://doi.org/10.1016/j.imu.2018.01.003
[23] F. Odoni, A. M. Brasoveanu, P. Kuntschik, and A. Weichselbraun, "Introducing Orbis: An extendable evaluation pipeline for named entity linking drill-down analysis," Melbourne, Australia, October 2019.
[24] A. Barbaresi and G. Lejeune, "Out-of-the-box and into the ditch? Multilingual evaluation of generic text extraction tools," in Proceedings of the 12th Web as Corpus Workshop, WAC@LREC 2020, Marseille, France, May 2020.
[25] Computación y Sistemas.
[26] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[27] T. Weninger, W. H. Hsu, and J. Han, "CETR: content extraction via tag ratios," in Proceedings of the 19th International Conference on World Wide Web, WWW 2010.