Metaknowledge Extraction Based on Multi-Modal Documents
Shukan Liu, Ruilin Xu, Boying Geng, Qiao Sun, Li Duan, Yiming Liu
PREPRINT VERSION ON ARXIV
Shukan Liu∗, Ruilin Xu∗, Boying Geng†, Qiao Sun, Li Duan, and Yiming Liu

Abstract — The triple-based knowledge in large-scale knowledge bases is most likely lacking in structural logic and problematic for constructing a knowledge hierarchy. In this paper, we introduce the concept of metaknowledge to knowledge engineering research for the purpose of structural knowledge construction. The Metaknowledge Extraction Framework and the Document Structure Tree model are presented to extract and organize metaknowledge elements (titles, authors, abstracts, sections, paragraphs, etc.), so that it is feasible to extract structural knowledge from multi-modal documents. Experiment results prove the effectiveness of metaknowledge element extraction by our framework. Meanwhile, detailed examples are given to demonstrate what exactly metaknowledge is and how to generate it. At the end of this paper, we propose and analyze the task flow of metaknowledge applications and the associations between knowledge and metaknowledge.
Index Terms — Metaknowledge, Multi-Modal, Document Layout Analysis, Knowledge Graph.
INTRODUCTION

Recently, widely used large-scale knowledge bases, such as DBpedia [1], FreeBase [2], and YAGO [3], are all based on semantic entities and relations, focusing on specific concepts and things. These widely used large-scale knowledge bases have achieved good performance in information retrieval and question answering tasks. However, the knowledge that exists in large-scale knowledge bases is based on entity-relation triples, which is not exactly the same as the knowledge in human perception. Knowledge in human minds is a complex of hierarchical, structured, and systematized elements with strong logical or topological associations, especially presented in structure or sequence.

To match humans' natural intuition of knowledge, it is necessary to find a brand-new representation that makes knowledge in the computer structured and hierarchical. Thus, to address the significant issue that current knowledge bases have difficulty in forming a knowledge hierarchy, this work introduces the concept of metaknowledge into knowledge engineering research. Ref. [4] studies metaknowledge in scientific articles and holds that "... metaknowledge results from the critical scrutiny of what is known, how, and by whom. It can now be obtained on large scales, enabled by a concurrent informatics revolution." Inspired by their thoughts, we believe that metaknowledge is knowledge about knowledge: it describes the patterns of how human beings get, analyze, and organize knowledge.

∗ Contributed equally and should be considered as co-first authors.
† Corresponding author (e-mail: boying [email protected]).
• Shukan Liu was with the School of Computer Science and Engineering, Southeast University, Nanjing, 211189, China; and the School of Electronic Engineering, Naval University of Engineering. E-mail: [email protected]
• Ruilin Xu and Yiming Liu were with the Graduate School, Naval University of Engineering, Wuhan, 430033, China; and the School of Electronic Engineering, Naval University of Engineering. E-mail: [email protected] and [email protected]
• Boying Geng, Qiao Sun, and Li Duan were with the School of Electronic Engineering, Naval University of Engineering.
Similar to metadata, metaknowledge is the structural representation of knowledge, with fine-grained and hierarchical characteristics. Analogous to the scientific articles in Ref. [4], documents such as textbooks, papers, theses, governmental documents, laws, and regulations are the most commonly used materials for acquiring knowledge. These materials are normally formed according to the author's logic and intentions, from which people can easily understand the backgrounds, motivations, and themes only from structural patterns such as titles, section titles, keywords, etc. They are good data sources for acquiring metaknowledge. Nevertheless, these structural patterns of acquiring knowledge cannot be well described by triple-based knowledge. This work holds that structural elements such as titles, section titles, abstracts, and keywords potentially reflect the structural logical associations between knowledge. Therefore, in our study, the elements above are defined as Metaknowledge Elements.

The process of reading can be divided into three steps: see the words, know the words, and understand the words. When the computer imitates this human process, the three steps become object detection, optical character recognition (OCR), and natural language processing (NLP). Object detection makes the computer classify and locate the layout elements of documents; OCR makes the computer know the exact textual content; NLP makes the computer analyze and understand the textual content. Documents of three modalities are involved in this process: the image modal, the layout modal, and the text modal. Therefore, a multi-modal method should be designed to extract metaknowledge elements from documents.

To extract metaknowledge elements from multi-modal documents, our work proposes MEF, a multi-modal Metaknowledge Extraction Framework. Through experiments on the multi-modal Chinese governmental documents dataset GovDoc-CN, MEF performs better than single-modal models.
Furthermore, this work refines and explains the concept, the functions, and the application task flow of metaknowledge, and proposes that metaknowledge applications are at a higher level compared with knowledge applications. Metaknowledge applications are partly similar to knowledge applications but play a more significant role in managing and organizing knowledge.
RELATED WORKS
Visually-Rich Document Analysis (VRDA) is a task aiming to analyze visually-rich documents (VRDs), which are scanned or digital-born, such as page images or PDFs; it plays a crucial role in governmental and commercial applications. VRDA basically relies on three modalities: the text modal, the layout modal, and the image modal. Ref. [5] presents an end-to-end, multi-modal fully convolutional network to extract document semantic elements. Ref. [6] introduces a graph convolutional model that combines textual and visual information in VRDs. Ref. [7] combines large pre-trained language models and graph neural networks to encode both textual and visual information in VRDs. Ref. [8] presents an approach based on graph neural networks to build a rich representation of text fields on a webpage and the relationships between them. The works above have shown significant improvements on VRDA, whereas they do not comprehensively consider the combination of all three modalities (text, layout, and image). LayoutLM [9] and its improved model [10] are pre-trained and fine-tuned multi-modal frameworks for analyzing VRDs; they unite the text modal, layout modal, and image modal to better extract semantic elements. Nonetheless, the textual information in the LayoutLMs is extracted by OCR from the image region it corresponds to; that is, these frameworks treat the textual information of a document as independent semantic units instead of a coherent passage.
The Hierarchical Graph was initially proposed for document reading comprehension. Focusing on natural question answering, Ref. [11] presents a multi-grained machine reading comprehension framework for modeling documents with their hierarchical natures. The framework in Ref. [11] divides a document into four levels of granularity: document, paragraphs, sentences, and tokens, then utilizes graph attention networks (GATs) to obtain a multi-grained representation of the document. Meanwhile, the recent work [12] presents the hierarchical graph network (HGN) for the multi-hop question answering task. Different from one-hop question answering, where answers originate from a single paragraph, multi-hop question answering focuses on acquiring answers from multiple paragraphs or the whole passage. HGN is built by constructing nodes on different levels of granularity, including questions, paragraphs, sentences, and entities; it aggregates clues from texts across multiple paragraphs. In creating HGN, the initial node representations are updated through graph propagation and traverse the graph edges of subsequent sub-tasks for multi-hop reasoning. The works above present exquisite models for modeling documents, but are restricted to the single modal of text.

To make the computer effectively organize and utilize documents described by natural language, the first task is to convert nonstructural documents in natural language into structural data. Natural Language to SQL (NL-to-SQL) is one such task. Approaches (Seq2SQL [13], SQLNet [14], RE-SQL [15], etc.) and datasets (SPLASH [16], ACL-SQL [17], etc.) all aim to transform natural language into a structural data format, which generally is the Structured Query Language (SQL). The natural language materials these approaches and datasets process are commonly tokens and sentences, whereby passage-level information is not referred to. Recently, Ref. [18] designed a weakly-supervised text-to-graph model Doc2Graph to bridge the gap between concept map construction and neural networks; it is able to translate passage-level documents into graph data.

Thanks to the inspirational works above, in this work we consider getting through the key links between how to extract the semantic structure of documents (where VRDA focuses), how to model structured documents (where HGN focuses), and how to translate documents in natural language into data that the computer can understand (where NL-to-SQL and Doc2Graph focus).
METAKNOWLEDGE EXTRACTION FRAMEWORK
In this work, a multi-modal framework is designed to extract metaknowledge elements from documents in the text modal and the image modal, as described in Fig. 1. There are three parts in the framework: (1) metaknowledge elements extraction modules, which extract metaknowledge elements from documents in both the text modal and the image modal, such as titles, subtitles, authors' information, etc.; (2) a verification and alignment module, which aligns metaknowledge elements from the text modal and the image modal; (3) a metaknowledge generating module, which organizes metaknowledge elements by document topology and hierarchy, then generates metaknowledge through entity recognition and relation extraction.
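The three parts above can be sketched as a minimal pipeline. All function bodies below are hypothetical stand-ins (the actual modules are the BERT+BiLSTM+CRF and YOLOv4+PaddleOCR models described later), intended only to show how the parts compose:

```python
# A minimal orchestration sketch of MEF's three parts. Every function body
# here is a hypothetical stub, not the framework's real implementation.

def extract_text_elements(text_doc):
    # Stand-in for the text-modal extractor (BERT+BiLSTM+CRF).
    return {"title": "Notice on epidemic situation"}

def extract_image_elements(image_doc):
    # Stand-in for the image-modal extractor (YOLOv4 + PaddleOCR).
    return {"title": "Notice on epidemic situatlon"}  # OCR output may be noisy

def verify_and_align(text_elems, image_elems):
    # Part (2): when both modalities report the same element type,
    # prefer the text-modal string (OCR is corrected against it).
    return {k: text_elems.get(k, v) for k, v in image_elems.items()}

def generate_metaknowledge(elements):
    # Part (3): organize the aligned elements; stubbed as simple triples.
    return [("document", "has_title", elements["title"])]

def mef_pipeline(text_doc, image_doc):
    aligned = verify_and_align(extract_text_elements(text_doc),
                               extract_image_elements(image_doc))
    return generate_metaknowledge(aligned)
```

A call such as `mef_pipeline("doc.txt", "doc.png")` then yields the triple built from the verified title.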
Fig. 1: The metaknowledge extraction framework (MEF), including: (1) metaknowledge elements extraction modules (from both the text modal and the image modal); (2) a verification and alignment module; (3) a metaknowledge generating module. The text modal extraction uses the BERT+BiLSTM+CRF [19] framework; the image modal extraction uses the YOLOv4 [20] + PaddleOCR [21] framework.

BERT [22] is a pre-trained transformer model. It designs two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In MEF, BERT is applied to vectorize the text in the text modal: when a text modal document D_t is input, BERT transforms it into a vector v_t. The BiLSTM structure adds a layer of reverse LSTM network on top of a unidirectional LSTM network to use context information more effectively. The BiLSTM network can better capture the preceding and subsequent messages at a certain time step. It calculates two different hidden layer representations for each sentence, using the sequential and reverse orders, and then obtains the final hidden layer representation by splicing the two vectors.

In the task of NER, BiLSTM can predict the probability of each input word corresponding to different output labels. From these directly obtained probabilities, we can pick the label with the highest probability for the current token. However, using only the BiLSTM network makes the output ignore the correlation between labels and only pay attention to the association between input characters and labels, which leads to great deviation in the recognition results. To solve this problem, a conditional random field (CRF) is added to BiLSTM. The BiLSTM+CRF model [23] uses the CRF layer to add constraints, which effectively reduces the false predictions of BiLSTM.

Fig. 2: An example of the BERT+BiLSTM+CRF model: the input Chinese text sequence means "Notice on epidemic situation (line break) University Office", in which "Notice on epidemic situation" is the title and "University Office" is the addressee. All the tokens in the sequence are annotated in BIO format.

Inspired by the above works, in this paper we use BERT+BiLSTM+CRF [19] (Fig. 2) to extract metaknowledge elements from text modal documents in a way similar to NER. The word vectors output from BERT are input into the BiLSTM network. The output of BiLSTM is spliced and integrated, and the CRF layer is superimposed behind the BiLSTM to increase the utilization of label information. The CRF layer uses transition features to consider the correlation between labels; combining BiLSTM and CRF yields the globally optimal output sequence, so that the predicted labels form a sequence that takes label correlation into account.
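The effect of the CRF layer's transition constraints can be illustrated with a tiny Viterbi decoder. The labels, scores, and transition matrix below are made up for illustration; they are not the model's trained parameters:

```python
# Illustration: CRF transition scores forbid invalid BIO sequences
# (e.g. "O" followed by "I-TITLE"), which per-token BiLSTM argmax cannot.

LABELS = ["O", "B-TITLE", "I-TITLE"]
NEG = -1e9  # effectively forbids a transition

# TRANS[i][j]: score of moving from LABELS[i] to LABELS[j] (made-up values)
TRANS = [
    [0.0, 0.0, NEG],   # O -> I-TITLE is invalid
    [0.0, 0.0, 1.0],   # B-TITLE -> I-TITLE is encouraged
    [0.0, 0.0, 0.5],   # I-TITLE -> I-TITLE is allowed
]

def viterbi(emissions):
    """emissions: per-token lists of scores, one score per label."""
    n = len(LABELS)
    score = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        ptr, new = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: score[i] + TRANS[i][j])
            new.append(score[i_best] + TRANS[i_best][j] + emit[j])
            ptr.append(i_best)
        backptrs.append(ptr)
        score = new
    j = max(range(n), key=lambda j: score[j])
    path = [j]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return [LABELS[i] for i in reversed(path)]
```

With emissions `[[2.0, 1.0, 0.0], [0.0, 0.5, 1.0]]`, greedy per-token argmax yields the invalid sequence ["O", "I-TITLE"], while the decoder above corrects it to ["B-TITLE", "I-TITLE"].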
In the image modal, metaknowledge elements extraction is similar to the document layout analysis task, which is a combination of optical character recognition (OCR) and object detection. Specifically, it first finds the regions of the metaknowledge elements through object detection and then recognizes the text content of the metaknowledge elements from the regions they belong to.

The image modal metaknowledge elements extraction module is designed based on the YOLOv4 [20] framework and PaddleOCR [21]. As shown in Fig. 1, the extraction module includes an object detection part and an OCR part. The object detection part uses the YOLOv4 framework to detect metaknowledge elements in the document image. Each metaknowledge element corresponds to a vector v_i = (l_i, p_i, x_i, y_i, w_i, d_i), where l_i represents the label of the object, p_i indicates the probability that the object belongs to this type of element, and (x_i, y_i, w_i, d_i) is the bounding box of the target, representing the location origin (x_i, y_i), width w_i, and height d_i.

Due to the limited accuracy of OCR, the text recognition result of the detected image region cannot be guaranteed to be accurate, but it can be corrected using the corresponding text modal data. In this work, the Levenshtein distance (Eq. 1) between the OCR extraction results and the original texts in the text modal is calculated to correct the OCR errors. In this OCR post-process, for each element of image recognition, the text element with the shortest Levenshtein distance to it is found.
Considering the possible errors of the optical character recognition module in the extraction of image modal information, this work holds that, for general elements, the two results are consistent when the Levenshtein distance between the OCR recognition result and the corresponding named entity recognition result is not more than 1/3 of the length of the OCR recognition result, that is, when the difference is not more than 1/3 of the total sentence length of the image recognition.

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1, lev_{a,b}(i, j-1) + 1, lev_{a,b}(i-1, j-1) + 1_(a_i ≠ b_j) ), otherwise.  (1)

The single-modal approaches have different performances when extracting various metaknowledge elements. Therefore, it is necessary to consider the extraction results of the two modalities at the same time, to fix the fault-tolerance problem of single-modal extraction and improve the extraction quality.

For metaknowledge element i, let the extraction result of the text modal be the one-hot vector v_i^text and the extraction result of the image modal be the one-hot vector v_i^image. If the total number of metaknowledge elements is n and the total number of element categories is k, then the total number of all possible combinations of the multi-modal extraction results is k^2. Assume the k^2 × 2 matrix W is the decision matrix, W = [w_j1, w_j2]_(k^2 × 2). Each row in W represents a possible combination of multi-modal extraction results. Suppose the j-th row in W represents the case that "the extraction result of element i in the text modal is the k1-th type, and the extraction result of the same element i in the image modal is the k2-th type", where j = k1 × k2. If the extraction result of the text modal is correct but the extraction result of the image modal is incorrect, set w_j1 = 1, w_j2 = 0; otherwise, set w_j1 = 0, w_j2 = 1. If the extraction results of the two modalities are both correct, then set w_j1 = w_j2 = 0.5. Therefore, for the metaknowledge element i, the final extraction result is v_i^final = w_j1 · v_i^text + w_j2 · v_i^image. Once W has been trained with enough samples, given the extraction results in the two modalities (k1 and k2), one only needs to look up the (k1 × k2)-th row of matrix W, and the weighted summation is the final extraction result.

The framework proposed above realizes the classification of various types of metaknowledge elements. Still, it does not clarify the juxtaposition and inclusion relationships between the elements, especially between sections at different levels, and does not form a hierarchical document structure. From the perspective of people's writing and reading habits, to determine the relationship between sections at different levels, it is only necessary to consider the order in which sections appear in the whole document. Among sections of different levels with an "inclusion" relationship, the level of the section that appears first must be higher than that of the later one; among sections of the same level with a "juxtaposition" relationship, the order in which they appear in the document also reflects their relationship. Generally speaking, sections with the "inclusion" relationship are ranked, and sections with the "juxtaposition" relationship are sequenced, according to the order in which sections of different levels appear in the document.

In MEF, through segmenting paragraphs and sentences, the document is divided into a set T_Document = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each coordinate (x, y) is the Paragraph Index (PI) and Sentence Index (SI) of a sentence in the document. For instance, assume the coordinate of sentence i is (x_i, y_i): x_i represents that sentence i belongs to the x_i-th paragraph in the whole document, and y_i represents that sentence i is the y_i-th sentence in the x_i-th paragraph.
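Eq. 1 can be transcribed directly into code, together with the consistency check described above. The 1/3-of-length threshold is our reading of the text and should be treated as illustrative:

```python
# A direct transcription of Eq. 1 (recursive Levenshtein distance) plus
# the OCR/NER consistency check. The 1/3 threshold is an assumption.

from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:
            return max(i, j)            # base case of Eq. 1
        return min(
            lev(i - 1, j) + 1,          # deletion
            lev(i, j - 1) + 1,          # insertion
            lev(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution
        )
    return lev(len(a), len(b))

def consistent(ocr_text: str, ner_text: str) -> bool:
    # The OCR result and the text-modal NER result are treated as the same
    # element when their edit distance is small relative to the OCR length.
    return levenshtein(ocr_text, ner_text) <= len(ocr_text) / 3
```

For example, `levenshtein("kitten", "sitting")` is 3, and a one-character OCR error in a six-character string passes the consistency check.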
The smaller the PI, the nearer the front of the document the sentence's paragraph is; the smaller the SI within a paragraph, the nearer the front of the paragraph the sentence is.

The Exact Subgraph Enumeration Tree [24] (ESU-Tree) is a structural model designed for network motif recognition. This model is used to search for subgraphs of a specified scale in a network. Since the structure of the ESU-Tree can well reflect hierarchical and structural relationships, based on the ESU-Tree we propose a special structure for the hierarchical representation of documents, called the "Document Structure Tree (DST)" (Fig. 3).
Fig. 3: Document Structure Tree
DEFINITION 1
The Document Structure Tree (DST) is a directed rooted tree such that:
• each child vertex points to its parent vertex;
• its root vertex is located at level 0, the number of full tree levels is 4, the depth is 4, and the height is 5;
• its fourth layer consists entirely of leaf vertices;
• its vertices have weights but its edges have no weights; each vertex weight is composed of two parts, a front weight and a back weight. When comparing weights, the front weights are compared first, and the back weights are compared when the two front weights are equal;
• a left vertex has a smaller weight than its right vertex, and a parent vertex has a smaller weight than its child vertices.
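Since Definition 1 compares front weights first and falls back to back weights only on ties, a (front, back) pair behaves exactly like Python's lexicographic tuple comparison. A minimal sketch with hypothetical weight values:

```python
# Vertex weights as (front_weight, back_weight) tuples: Python compares
# tuples lexicographically, which matches Definition 1's comparison rule.
# The numeric values are hypothetical, not taken from the paper.

w_parent = (2, 0)
w_child = (2, 5)         # same front weight: the back weight breaks the tie
w_right_parent = (3, 0)  # larger front weight: back weight never consulted

# "parent < child < right parent" holds under lexicographic comparison
assert w_parent < w_child < w_right_parent
```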
DEFINITION 2
If the vertex RP is the nearest right neighbor of a Child vertex's parent, define RP as the Right Parent of the Child. Similarly, if the vertex LP is the nearest left neighbor of a Child vertex's parent, define LP as the Left Parent of the Child.

Let P be the Parent, C the Child, LP the Left Parent, RP the Right Parent, ST a subtree, DST the entire document structure tree, RST the right subtree, and L the section level; use "←" to represent assignment. By analyzing the characteristics of the document structure tree, the following three basic properties can be summarized:

PROPERTY 1
In any ST of the DST, there is the following weight relationship:

weight(P) < weight(C) < weight(RP)  (2)
PROPERTY 2
In any ST of the DST, a vertex's subtree membership is determined by the following weight relationship:

∀ vertex ∈ DST, ∃ ST ⊆ DST, p ← Root(ST), rp ← Root(RST):
IF weight(p) < weight(vertex) < weight(rp), THEN vertex ∈ ST.  (3)
PROPERTY 3
For any vertex in the DST, its hierarchical attribution satisfies:

∀ vertex ∈ DST, ∃ L_i, L_{i+1} with L_{i+1} = L_i + 1:
IF min{weight(L_i)} < weight(vertex) < min{weight(L_{i+1})}, THEN vertex ∈ L_i.  (4)

The establishment order and traversal order of the document structure tree are consistent with the Chinese reading order, basically "root → left node → right node". The establishment problem can be abstracted into the following representation:

Known: (1) the section levels of the vertices; (2) the weight of each vertex.
Solve: (1) the parent-child relationship of each vertex; (2) the hierarchy of the vertices.

To make the computer able to organize metaknowledge elements, it is necessary to design a data structure for the above document structure tree model. The main problem of the design is to realize the relationship judgment "parent < child < right parent" on the computer. To accomplish this task, each vertex needs to be accessed from left to right and from top to bottom, judging the juxtaposition and inclusion relationships between left and right, superior and subordinate vertices (subtrees). For the inequality "parent < child < right parent", the condition may be incomplete. Because the tree is traversed from top to bottom, the parent vertex must be accessed before the child vertex, that is, the left part of the inequality always holds. Therefore, it is only necessary to consider the case where the right part of the condition is incomplete, that is, where the right parent vertex (right subtree) does not exist.

Obviously, if a case-by-case discussion were adopted, it would be more expensive to add supplementary rules for the case where the right parent does not exist. Therefore, we construct conditions that make the right part of the inequality always hold, adapting to the original rules rather than establishing new ones. For this reason, the concept of the "absolute right subtree" is introduced.

DEFINITION 3
The Absolute Right Subtree (ARS) is a document structure tree in which the root vertex weight is sufficiently large and whose children are empty. It is actually a fully weighted vertex at the rightmost end of a level; it only participates in weight comparisons but cannot be accessed. The 4th layer belongs to the 4th section level and consists entirely of leaf vertices. Their subtrees are empty, so it is only necessary to establish an ARS in the 1st, 2nd, and 3rd layers. Moreover, by setting the traversal condition, the ARS can participate in weight comparisons without being accessed, which solves the situation where the right parent vertex does not exist.
Fig. 4: Absolute Right Subtree: the vertex weights are part of the DST established from the extracted metaknowledge elements of the Report on the Work of the Government, Year 2019, The Central People's Government, P. R. China.

As shown in Fig. 4, for all the child vertices e to n of vertex c, there is no right parent, and Eq. 2 is no longer valid. To ensure that Eq. 2 always holds, weight(ARS) should be a sufficiently large number. In this work, 0x3F3F3F3F is set as the sufficiently large number, which not only avoids data overflow but also has the same order of magnitude as the maximum 0x7FFFFFFF of 32-bit integer data. Due to the introduction of the ARS d, the right parent of the child vertices e to n becomes d, and the inequality 38 < weight(subnodes) < 0x3F3F3F3F holds.

Therefore, the minimum data unit of the document structure tree is a structure containing the attributes of the root vertex and all the child vertices, which can be defined recursively to realize the construction of the document structure tree:

class DST:
    def __init__(self):
        self.content = []
        self.subtree = []
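The recursive definition above can be exercised with a small construction sketch. The vertex names and weights below are hypothetical, and the ARS appears only as a sentinel weight that keeps the right-hand side of the inequality true:

```python
# A sketch of DST construction from (level, name, weight) triples given in
# reading order. The ARS sentinel weight 0x3F3F3F3F stands in for a
# missing right parent, so "parent < child < right parent" always holds.

ARS_WEIGHT = 0x3F3F3F3F

class Vertex:
    def __init__(self, name, weight):
        self.name, self.weight, self.children = name, weight, []

def build_dst(sections):
    """sections: list of (level, name, weight) triples in reading order."""
    root = Vertex("Root", 0)
    stack = [root]                # stack[k] holds the open vertex at level k
    for level, name, weight in sections:
        v = Vertex(name, weight)
        stack = stack[:level]     # close any deeper (finished) levels
        parent = stack[-1]
        # the ARS sentinel guarantees a right parent exists for the check
        assert parent.weight < v.weight < ARS_WEIGHT
        parent.children.append(v)
        stack.append(v)
    return root

# hypothetical sections in reading order (title, sections, subsections)
doc = build_dst([(1, "Title", 3), (2, "Section 1", 27),
                 (3, "Section 1.1", 38), (3, "Section 1.2", 40),
                 (2, "Section 2", 47)])
```

Walking `doc` reproduces the hierarchy: "Section 1.1" and "Section 1.2" end up as children of "Section 1", while "Section 2" becomes its right sibling under "Title".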
EXPERIMENTS AND ANALYSIS
To evaluate the metaknowledge elements extraction framework in this work, experiments are designed to verify the feasibility of extracting metaknowledge elements from multi-modal documents. We use the governmental documents dataset GovDoc-CN to evaluate our framework.
In our previous work, we released the multi-modal governmental documents dataset GovDoc-CN¹, which contains 6 816 government document pages from the Central People's Government of China (CPGC) and its subordinate ministries and commissions, collected from the policy document database of the CPGC. These documents are automatically typeset with LaTeX and manually annotated in the text modal and the image modal. Table 1 shows the statistical information of GovDoc-CN, in which there are 10 kinds of metaknowledge elements: the sign of issuing authority, the document number, the title, the addressee, the 1st level section, the 2nd level section, the 3rd level section, the main text, the issuing authority, and the date of writing.
1. https://github.com/RuilinXu/GovDoc-CN
REPRINT VERSION ON ARXIV 6
TABLE 1: The Statistical Information of GovDoc-CN

Type                        Count^a
Sign of issuing authority    1 347
Document number              1 344
Title                        1 359
Addressee                    1 065
1st level section            5 184
2nd level section            4 697
3rd level section            1 415
Issuing authority            2 073
Date of writing              1 178
Paragraph                   10 280

^a The numbers of these elements come from the text modal and the image modal, and the annotations of the two modalities correspond to each other.
This work uses the same train, evaluation, and test sets to train both the text modal model (BERT_BASE+BiLSTM+CRF) and the image modal model (YOLOv4+PaddleOCR): 3 672 document pages for training, 918 pages for evaluation, and 511 pages for testing. The setups are as follows:
• Text Modal. Pre-trained Chinese BERT_BASE is used for fine-tuning. The batch size is set to 4, the learning rate to 2e-5, and the max sequence length to 512; the model is trained for 30 epochs.
• Image Modal. YOLOv4 is trained with batch size 64 and learning rate 1e-5, for 20 000 iterations.
TABLE 2: Results on GovDoc-CN
Metaknowledge Element      BERT+BiLSTM+CRF   YOLOv4+PaddleOCR^a   MEF
Sign of issuing authority   0.9870            0.8308               0.9999
Document number             0.8977            0.8588               0.8977
Title                       0.9290            0.9211               0.9299
Addressee                   0.9608            0.8090               0.9905
1st level section           0.9608            0.7453               0.8845
2nd level section           0.7770            0.6384               0.8214
3rd level section           0.7348            0.6194               0.8235
Issuing authority           0.8646            0.5342               0.9347
Date of writing             0.8462            0.8095               0.9510
Main text                   0.7097            0.6452               0.8146
Average                     0.8668            0.7412               0.9048

^a Considering OCR mismatch.
The metaknowledge elements extraction results on GovDoc-CN are shown in Table 2; the Micro-F1 Score is used for evaluating our framework. Compared with the single-modal approaches BERT+BiLSTM+CRF and YOLOv4+PaddleOCR, our framework MEF reaches better average F1-Score performance (+3.80% over BERT+BiLSTM+CRF, +16.36% over YOLOv4+PaddleOCR). Table 2 also shows that the two single-modal approaches of text and image complement each other in extracting various metaknowledge elements. Example outputs are shown in Fig. 5.

Fig. 5: Examples of original document pages, annotated ground truth pages, and outputs of the MEF.

As the layout structure of official documents is relatively simple, to verify the generalization ability of MEF in extracting metaknowledge elements from complex documents, we originally planned to use the large-scale multi-modal document layout analysis dataset DocBank [25] for evaluation. However, when using the dataset, we found that its automatic annotation approach has a terrible impact on the data quality: there is a lot of dirty data mixed into the bounding boxes of the image modal annotations. In the text modal annotations, the dataset only gives the label of each token but does not provide the starting and ending positions of paragraphs, making it very difficult to convert the annotated tokens into the BIO format commonly used in NLP. Limited by computing capability, we had to run our experiments on a subset of 50 000 samples of DocBank. We found that the text modal results are deplorable under either the text classification or the NER approach. In the image modal, after a large amount of complex bounding box merging work that transformed the token-level bounding boxes into component-level ones, we finally achieved SOTA results (Table 3).

As shown in Table 3 and Fig. 6, MEF's extraction performance on 7 types of elements exceeds that of LayoutLM trained with 10 times more samples.
Due to the issues of DocBank, MEF's performance in the text modal is not as good as hoped. Nevertheless, considering its performance on GovDoc-CN, we believe that switching to a better multi-modal document dataset would achieve more ideal results. Unfortunately, there is currently a lack of relevant datasets.
Fig. 6: The metaknowledge elements extraction results on DocBank 50K.
The layout analysis datasets represented by PubLayNet [26] have only the single modal of image. Therefore, we will strengthen the weak links in future work and expand GovDoc-CN into a multilingual and multi-modal document dataset with more document types.

TABLE 3: Results on DocBank 50K
Metaknowledge Element   MEF (Image Modal)   Pre-trained LayoutLM^a
Abstract
Footer
Reference               0.8461              0.9643
Section
Title                   0.9834

^a Results come from Ref. [25]; the model was trained and evaluated on the complete DocBank.
The previous processes have acquired the hierarchicalstructure of documents, that is, the typologies of documentsmetaknowledge. However, the semantic information has notbeen linked to document metaknowledge yet. Therefore,this work picks Ref. [27] from DocBank 50K as an exam-ple to demonstrate the whole process of how to generatemetaknowledge from documents.The metaknowledge elements are extracted throughMEF and organized by the DST model. Furthermore, entity-relation triples, or the
Structure Contextual Triples called inRef. [28], are extracted from the paragraphs and linked to thesections or subsections they belong to, and then we get themetaknowledge of Ref. [27]. The metaknowledge is saved inthe Neo4j database, which is shown in Fig. 7(a).We explain the metaknowledge of Ref. [27] in Fig. 7(b).Metaknowledge is a graph structure with hierarchical char-acteristics, which is basically consistent with the directorystructure of documents. In Fig. 7(b), the hierarchical char-acteristics are represented by three layers: the 1st layer,the title; the 2nd layer, the sections; and the 3rd layer,the subsections. In the 1st layer, metaknowledge elementsdescribe the attributes of documents such as the authors,the institutes, and the supporters. These metaknowledgeelements are represented as a form similar to triple-basedknowledge. In the 2nd and 3rd layers, we extract entitiesand relations from paragraphs, which consists of triple-based knowledge, and then linked to their nearest sectionor subsection they belong to.This work recognizes that the citation network is alsowell-organized in metaknowledge. As shown in Fig. 8, allthe references of Ref. [27] are extracted and organized byhierarchical structure. Compared with the current citationnetwork analysis research, the citation network based onmetaknowledge no longer takes the authors as the mainresearch target, but pays more attention to the knowledgestructure and knowledge itself. 
Fig. 7: An example: the metaknowledge of Ref. [27]. (a) The metaknowledge of Ref. [27] displayed in Neo4j; (b) the hierarchical structure of Ref. [27]'s metaknowledge.

Each reference is a document that can generate metaknowledge. Therefore, a large-scale metaknowledge network can be constructed by converting a large number of references into metaknowledge.
Fig. 8: The reference network of Ref. [27] constructed by metaknowledge.
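The expansion of references into a network can be sketched as follows. This is a simplified illustration under assumed names (`add_document`, the dictionary layout) rather than the paper's actual construction procedure; the titles come from the Fig. 8 example.

```python
# Sketch: growing a metaknowledge network by registering each document
# with its sections and its outgoing reference edges. Names illustrative.

from collections import defaultdict

def add_document(network, doc_title, sections, references):
    """Register a document, its sections, and its cited works."""
    network[doc_title]["sections"] = list(sections)
    network[doc_title]["references"] = list(references)

network = defaultdict(dict)
add_document(network, "You Only Look Once",
             sections=["Introduction", "Unified Detection"],
             references=["Fast R-CNN", "OverFeat"])

# Each reference is itself a document, so it can be expanded in turn,
# linking the networks through shared cited works.
add_document(network, "Fast R-CNN",
             sections=["Introduction"],
             references=["R-CNN"])

print(sorted(network["You Only Look Once"]["references"]))
```

Because edges run from a citing document through its sections to the cited works, the resulting network centers on knowledge structure rather than on authors, matching the contrast with conventional citation analysis drawn above.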
TASK FLOW OF METAKNOWLEDGE APPLICATIONS
Metaknowledge is the knowledge about knowledge; in other words, metaknowledge realizes the organization and management of specific knowledge. Empowered by metaknowledge, knowledge can be used effectively at a higher level. We analyze the whole task flow of current
[Fig. 9 content. Metaknowledge level: Metaknowledge Extraction (layout analysis, document structuring, structure contextual triples, ...) → Metaknowledge Representation (GCN, GAT, ...) → Metaknowledge Network Construction (motif analysis, graph fusion, graph linking, ...) → Metaknowledge Utilization (question answering, reading comprehension, ...). Knowledge level: Knowledge Extraction (named entity recognition, relation extraction, attribute extraction, ...) → Knowledge Representation (TransX, Word2Vec, Sen2Vec, ...; Node2Vec, DeepWalk, GNNs, ...) → Knowledge Base Construction (entity alignment, entity disambiguation, ontology construction, ...) → Knowledge Utilization (intelligent retrieval, knowledge recommendation, ...).]
Fig. 9: The task flow of metaknowledge applications. The horizontal axis represents the task flow of knowledge and metaknowledge, and the vertical axis represents the low-to-high levels of knowledge and metaknowledge.

knowledge engineering applications; inspired by that, we propose a novel task flow of metaknowledge applications (Fig. 9).

Fig. 9 shows the task flows of knowledge and metaknowledge at two different levels. From the perspective of tasks, the knowledge application task flow can be divided into four parts: (1) Knowledge Extraction, including named entity recognition, relation extraction, attribute extraction, etc. (2) Knowledge Representation. Generally, it has two types: one is based on entity-relation triples, where the approaches include TransX [29], [30], [31], Word2Vec [32], Sen2Vec [33], and Para2Vec [34]; the other is based on knowledge networks, where the approaches include Node2Vec [35], DeepWalk [36], Graph Neural Networks (GNNs) [37], etc. (3) Knowledge Base Construction, including entity alignment, entity disambiguation, ontology construction, knowledge reasoning, etc. (4) Knowledge Utilization, including intelligent retrieval, knowledge recommendation, etc.

The metaknowledge application task flow can also be divided into four parts: (1) Metaknowledge Extraction. The data generally come from multiple modalities, and the approaches include document layout analysis, document structuring, structure contextual triple extraction, etc. (2) Metaknowledge Representation. Considering that metaknowledge is weighted and directed graph data, approaches such as the graph convolutional network (GCN) [38], [39] and the graph attention network (GAT) [40] should theoretically perform well on this task. (3) Metaknowledge Network Construction. Its main task is building a network from large amounts of metaknowledge, using approaches such as graph fusion, graph linking, etc.
(4) Metaknowledge Utilization, which includes Metaknowledge-based Question Answering (MbQA) and Metaknowledge-based Document Understanding (MbDU). MbQA can be regarded as a combination of knowledge base question answering (KBQA) [41], [42] and document-based question answering (DBQA) [43], [44], since it considers both the semantic logic of the knowledge and the hierarchical structure of the document.

From the perspective of levels (Fig. 9), metaknowledge applications are higher-level, more holistic, and more macroscopic than knowledge applications. The information elements of knowledge are entities and relations (entity-relation triples), while the information elements of metaknowledge are pieces of knowledge and the relations between them (structure contextual triples). Specifically, metaknowledge is weighted and directed graph data, which could perform better than triple-based knowledge bases on tasks such as question answering, machine reading comprehension, etc. Knowledge applications realize the management of information; similarly, metaknowledge applications will realize the management and organization of knowledge.
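The representation step above can be sketched as a single GCN-style propagation over a small weighted, directed graph. This is an illustrative toy, not the paper's model: the adjacency matrix, feature values, and normalization scheme are all assumed for demonstration.

```python
# Sketch: one GCN-style propagation step H = ReLU(D^-1 (A + I) X W)
# over a tiny weighted, directed metaknowledge graph. Illustrative only.

import numpy as np

# Adjacency: rows = source node, cols = target node. Node 0 is a title
# node with weighted edges to two section nodes (1 and 2).
A = np.array([[0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

A_hat = A + np.eye(3)                       # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))    # degree normalization
X = np.ones((3, 4))                         # node feature matrix
W = np.full((4, 2), 0.5)                    # layer weight matrix

H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)  # propagate and apply ReLU
print(H.shape)  # (3, 2)
```

Edge weights and directions enter through `A`, which is why weighted, directed metaknowledge graphs fit this family of models more naturally than plain triple stores do.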
CONCLUSION
In this work, we introduce the concept of metaknowledge to knowledge engineering research, propose a framework, MEF, to extract metaknowledge elements from multi-modal documents, and the Document Structure Tree model to organize metaknowledge with document topological and hierarchical features. Experiments and analysis demonstrate the effectiveness of MEF, and a metaknowledge example of Ref. [27] is given to analyze what metaknowledge actually is and how it is organized. Finally, our work proposes the task flow of metaknowledge applications and analyzes the upstream-to-downstream tasks in which metaknowledge is utilized. Future work will focus on metaknowledge representation learning, metaknowledge network construction, and how to verify the functionality of metaknowledge.

REFERENCES

[1] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, "Dbpedia - a crystallization point for the web of data,"
Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, pp. 154–165, 2009.
[2] K. Bollacker, R. Cook, and P. Tufts, "Freebase: A shared database of structured general human knowledge," in AAAI Conference on Artificial Intelligence, 2007.
[3] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: A core of semantic knowledge unifying WordNet and Wikipedia," in International Conference on World Wide Web, 2007.
[4] J. A. Evans and J. G. Foster, "Metaknowledge,"
Science, vol. 331, no. 6018, pp. 721–725, 2011.
[5] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. Giles, "Learning to extract semantic structure from documents using multimodal fully convolutional neural networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, Jul. 2017, pp. 4342–4351. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR.2017.462
[6] X. Liu, F. Gao, Q. Zhang, and H. Zhao, "Graph convolution for multimodal information extraction from visually rich documents," in
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), 2019.
[7] Robust Layout-Aware IE for Visually Rich Documents with Pre-Trained Language Models. New York, NY, USA: Association for Computing Machinery, 2020, pp. 2367–2376. [Online]. Available: https://doi.org/10.1145/3397271.3401442
[8] C. Lockard, P. Shiralkar, X. L. Dong, and H. Hajishirzi, "ZeroShotCeres: Zero-shot relation extraction from semi-structured webpages," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[9] Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul. 2020. [Online]. Available: http://dx.doi.org/10.1145/3394486.3403172
[10] Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou, "LayoutLMv2: Multi-modal pre-training for visually-rich document understanding," 2020.
[11] B. Zheng, H. Wen, Y. Liang, N. Duan, W. Che, D. Jiang, M. Zhou, and T. Liu, "Document modeling with graph attention networks for multi-grained machine reading comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[15] Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1219–1225, 2019.
[16] A. Elgohary, S. Hosseini, and A. Hassan Awadallah, "Speak to your parser: Interactive text-to-SQL with natural language feedback," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[17] ser. CODS COMAD 2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 423. [Online]. Available: https://doi.org/10.1145/3430984.3431046
[18] C. Yang, J. Zhang, H. Wang, B. Li, and J. Han, Neural Concept Map Generation for Effective Document Classification with Interpretable Structured Summarization. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1629–1632. [Online]. Available: https://doi.org/10.1145/3397271.3401312
[19] L. Gu, W. Zhang, Y. Wang, B. Li, and S. Mao, "Named entity recognition in judicial field based on BERT-BiLSTM-CRF model," 2020.
[20] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020.
[21] Y. Du, C. Li, R. Guo, X. Yin, W. Liu, J. Zhou, Y. Bai, Z. Yu, Y. Yang, Q. Dang, and H. Wang, "PP-OCR: A practical ultra lightweight OCR system," Sep. 2020.
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.
[23] L. Ling, Y. Zhihao, Y. Pei, Z. Yin, W. Lei, L. Hongfei, and W. Jian, "An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition,"
Bioinformatics, no. 8, p. 8.
[24] M. N. Yudina, "Assessment of accuracy in calculations of network motif concentration by Rand ESU algorithm," Journal of Physics: Conference Series, vol. 1260, p. 022012, 2019.
[25] M. Li, Y. Xu, L. Cui, S. Huang, F. Wei, Z. Li, and M. Zhou, "DocBank: A benchmark dataset for document layout analysis," in Proceedings of the 28th International Conference on Computational Linguistics, 2020.
[27] Computer Vision & Pattern Recognition, 2016.
[28] W. Zhang, Z. Huang, G. Ye, B. Wen, W. Zhang, and H. Chen, "Billion-scale pre-trained e-commerce product knowledge graph model," 2021.
[29] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in
Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'13. Red Hook, NY, USA: Curran Associates Inc., 2013, pp. 2787–2795.
[30] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, "Knowledge graph embedding via dynamic mapping matrix," in Meeting of the Association for Computational Linguistics & the International Joint Conference on Natural Language Processing, 2015.
[31] Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, ser. AAAI'14. AAAI Press, 2014, pp. 1112–1119.
[32] K. W. Church, "Word2Vec,"
Natural Language Engineering, vol. 23, no. 1, pp. 155–162, 2017.
[33] S. Arora, Y. Liang, and T. Ma, "A simple but tough-to-beat baseline for sentence embeddings," in 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, Princeton University, 2017.
[34] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ser. ICML'14. JMLR.org, 2014, pp. II-1188–II-1196.
[35] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference, 2016.
[36] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
[37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks,"
IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 4–24, 2021.
[38] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2017.
[39] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS'17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 1025–1035.
[40] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks,"
ArXiv, vol. abs/1710.10903, 2018.
[41] Y. Lan and J. Jiang, "Query graph generation for answering multi-hop complex questions from knowledge bases," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[42] Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.