Representation Learning for Natural Language Processing
Zhiyuan Liu · Yankai Lin · Maosong Sun

Zhiyuan Liu, Tsinghua University, Beijing, China
Yankai Lin, Pattern Recognition Center, Tencent Wechat, Beijing, China
Maosong Sun, Department of Computer Science and Technology, Tsinghua University, Beijing, China

ISBN 978-981-15-5572-5    ISBN 978-981-15-5573-2 (eBook)
https://doi.org/10.1007/978-981-15-5573-2

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication.
Open Access. This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third-party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.

Preface

In traditional Natural Language Processing (NLP) systems, language entries such as words and phrases are taken as distinct symbols. Various classic ideas and methods, such as n-gram and bag-of-words models, were proposed and are still widely used today in many industrial applications. All these methods take words as the minimum units of semantic representation, which are either used to estimate the conditional probabilities of next words given previous words (e.g., n-gram models) or used to represent the semantic meanings of text (e.g., bag-of-words models). Even when people find it necessary to model word meanings, they either manually build linguistic knowledge bases such as WordNet or use context words to represent the meaning of a target word (i.e., distributional representation). All these semantic representation methods are still based on symbols!

With the development of NLP techniques over the years, it has been realized that many issues in NLP are caused by symbol-based semantic representation. First, symbol-based representation always suffers from the data sparsity problem. Take statistical NLP methods such as n-gram models over large-scale corpora as an example: due to the intrinsic power-law distribution of words, performance decays dramatically for few-shot words, even though many smoothing methods have been developed to calibrate their estimated probabilities. Moreover, there are multi-grained entries in natural languages, from words and phrases to sentences and documents, and it is difficult to find a unified symbol set to represent the semantic meanings of all of them simultaneously. Meanwhile, many NLP tasks require measuring the semantic relatedness between these language entries at different levels.
For example, we have to measure the semantic relatedness between words/phrases and documents in information retrieval. Due to the absence of a unified scheme for semantic representation, distinct approaches used to be proposed and explored for different tasks in NLP, which sometimes makes NLP look like a less unified community.

As an alternative to symbol-based representation, distributed representation was originally proposed by Geoffrey E. Hinton in a technical report in 1984. The report was then included in the well-known two-volume book Parallel Distributed Processing (PDP), which introduced neural networks to model human cognition and intelligence. According to this report, distributed representation is inspired by the neural computation scheme of humans and other animals, and its essential idea is as follows:

Each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities.
It means that each entity is represented by multiple neurons, and each neuron is involved in the representation of many concepts. This also explains the meaning of distributed in distributed representation. In contrast, people used to assume that one neuron represents only a specific concept or object, e.g., that there exists a single neuron that is activated only when recognizing a particular person or object, such as his/her grandmother, well known as the grandmother-cell hypothesis or local representation. We can see the straightforward connection between the grandmother-cell hypothesis and symbol-based representation.

About 20 years after distributed representation was proposed, the neural probabilistic language model was proposed by Yoshua Bengio in 2003 to model natural languages, in which words are represented as low-dimensional and real-valued vectors based on the idea of distributed representation. However, it was not until 2013, when the simpler and more efficient framework word2vec was proposed to learn distributed word representations from large-scale corpora, that distributed representation and neural network techniques became popular in NLP. The performance of almost all NLP tasks has been significantly improved with the support of the distributed representation scheme and deep learning methods.

This book aims to review and present the recent advances of distributed representation learning for NLP, including why representation learning can improve NLP, how representation learning takes part in various important topics of NLP, and what challenges are still not well addressed by distributed representation.

Book Organization
This book is organized into 11 chapters in 3 parts. The first part of the book depicts key components in NLP and how representation learning works for them. In this part, Chap. 1 first introduces the basics of representation learning and why it is important for NLP. Then we give a comprehensive review of representation learning techniques for multi-grained entries in NLP, including word representation (Chap. 2), phrase representation, also known as compositional semantics (Chap. 3), sentence representation (Chap. 4), and document representation (Chap. 5).

The second part presents representation learning for those components closely related to NLP. These components include sememe knowledge that describes the commonsense knowledge of words as human concepts, world knowledge (also known as knowledge graphs) that organizes relational facts between entities in the real world, various network data such as social networks and document networks, and cross-modal data that connects natural languages to other modalities such as visual data. A deep understanding of natural languages requires these complex components as a rich context. Therefore, we provide an extensive introduction to these components, i.e., sememe knowledge representation (Chap. 6), world knowledge representation (Chap. 7), network representation (Chap. 8), and cross-modal representation (Chap. 9).

In the third part, we further provide some widely used open resources and tools for representation learning techniques (Chap. 10) and finally discuss the remaining challenges and future research directions of representation learning for NLP (Chap. 11).

Although this book is about representation learning for NLP, the theories and algorithms can also be applied in other related domains, such as machine learning, social network analysis, the semantic web, information retrieval, data mining, and computational biology.

Note that some parts of this book are based on our previously published or preprinted papers, including [1, 11] in Chap. 2, [32] in Chap. 3, [10, 5, 29] in Chap. 4, [12, 7] in Chap. 5, [17, 14, 24, 30, 6, 16, 2, 15] in Chap. 6, [9, 8, 13, 21, 22, 23, 3, 4, 31] in Chap. 7, and [25, 19, 18, 20, 26, 27, 33, 28] in Chap. 8.

Book Cover
The book cover shows an oracle bone divided into three parts, corresponding to three revolutionary stages of cognition and representation in human history.

The left part shows oracle bone script, the earliest known form of Chinese writing, used on oracle bones around 1200 BC. It represents the emergence of human languages, especially writing systems. We consider this the first representation revolution for human beings about the world.

The upper right part shows the digitalized representation of information and signals. After the invention of electronic computers in the 1940s, big data could be efficiently represented and processed in computer programs. This can be regarded as the second representation revolution for human beings about the world.

The bottom right part shows the distributed representation in artificial neural networks, originally proposed in the 1980s. As the representation basis of deep learning, it has extensively revolutionized many fields in artificial intelligence, including natural language processing, computer vision, and speech recognition, ever since the 2010s. We consider this the third representation revolution about the world. This book focuses on the theory, methods, and applications of distributed representation learning in natural language processing.

Prerequisites
This book is designed for advanced undergraduate and graduate students, postdoctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and NLP. We expect the readers to have some prior knowledge of probability, linear algebra, and machine learning. We recommend that readers who are specifically interested in NLP read the first part (Chaps. 1-5) sequentially. The second and third parts can be read in any order according to readers' interests.

Contact Information
We welcome any feedback, corrections, and suggestions on the book, which may be sent to [email protected]. The readers can also find updates about the book on the personal homepage http://nlp.csai.tsinghua.edu.cn/~lzy/.

Beijing, China
March 2020
Zhiyuan Liu
Yankai Lin
Maosong Sun

References
1. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning of character and word embeddings. In Proceedings of IJCAI, 2015.
2. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP, pages 4642-.
3. ... arXiv preprint arXiv:1611.04125, 2016.
4. Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of AAAI, pages 4832-.
5. ...cation dataset with state-of-the-art evaluation. In Proceedings of EMNLP, 2018.
6. Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Incorporating Chinese characters of words for lexical sememe prediction. In Proceedings of ACL, 2018.
7. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-domain question answering. In Proceedings of ACL, 2018.
8. Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In Proceedings of EMNLP, 2015.
9. Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, 2015.
10. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extraction with selective attention over instances. In Proceedings of ACL, 2016.
11. Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of AAAI, 2015.
12. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. In Proceedings of ACL, 2018.
13. Zhiyuan Liu, Maosong Sun, Yankai Lin, and Ruobing Xie. Knowledge representation learning: A review. JCRD, 53(2):247-.
14. ... In Proceedings of ACL, 2017.
15. Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, and Maosong Sun. Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL, 2019.
16. Fanchao Qi, Yankai Lin, Maosong Sun, Hao Zhu, Ruobing Xie, and Zhiyuan Liu. Cross-lingual lexical sememe prediction. In Proceedings of EMNLP, 2018.
17. Maosong Sun and Xinxiong Chen. Embedding for words and word senses based on human annotated knowledge base: A case study on HowNet. Journal of Chinese Information Processing, 30:1-6, 2016.
18. Cunchao Tu, Hao Wang, Xiangkai Zeng, Zhiyuan Liu, and Maosong Sun. Community-enhanced network representation learning for network analysis. arXiv preprint arXiv:1611.06645, 2016.
19. Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin DeepWalk: Discriminative learning of network representation. In Proceedings of IJCAI, 2016.
20. Cunchao Tu, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. TransNet: Translation-based network representation learning for social relation extraction. In Proceedings of IJCAI, 2017.
21. Ruobing Xie, Zhiyuan Liu, Tat-Seng Chua, Huanbo Luan, and Maosong Sun. Image-embodied knowledge representation learning. In Proceedings of IJCAI, 2016.
22. Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI, 2016.
23. Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphs with hierarchical types. In Proceedings of IJCAI, 2016.
24. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of IJCAI, 2017.
25. Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In Proceedings of IJCAI, 2015.
26. Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhancement via high order proximity approximation. In Proceedings of IJCAI, 2017.
27. Cheng Yang, Maosong Sun, Wayne Xin Zhao, Zhiyuan Liu, and Edward Y Chang. A neural network approach to jointly modeling social networks and mobile trajectories. ACM Transactions on Information Systems (TOIS), 35(4):36, 2017.
28. Cheng Yang, Jian Tang, Maosong Sun, Ganqu Cui, and Zhiyuan Liu. Multi-scale information diffusion prediction with reinforced recurrent networks. In Proceedings of IJCAI, 2019.
29. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL, 2019.
30. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention. In Proceedings of AAAI, 2018.
31. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of ACL, 2019.
32. Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing model for semantic composition. In Proceedings of AAAI, 2015.
33. Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of ACL, 2019.

Acknowledgements
The authors are very grateful for the contributions of our students and research collaborators, who have prepared initial drafts of some chapters or have given us comments, suggestions, and corrections. We list the main contributors for preparing the initial drafts of each chapter as follows:
• Chapter 1: Tianyu Gao, Zhiyuan Liu.
• Chapter 2: Lei Xu, Yankai Lin.
• Chapter 3: Yankai Lin, Yang Liu.
• Chapter 4: Yankai Lin, Zhengyan Zhang, Cunchao Tu, Hongyin Luo.
• Chapter 5: Jiawei Wu, Yankai Lin, Zhenghao Liu, Haozhe Ji.
• Chapter 6: Fanchao Qi, Chenghao Yang.
• Chapter 7: Ruobing Xie, Xu Han.
• Chapter 8: Cheng Yang, Jie Zhou, Zhengyan Zhang.
• Chapter 9: Ji Xin, Yuan Yao, Deming Ye, Hao Zhu.
• Chapter 10: Xu Han, Zhengyan Zhang, Cheng Yang.
• Chapter 11: Cheng Yang, Zhiyuan Liu.

For the whole book, we thank Chaojun Xiao and Zhengyan Zhang for drawing the model figures, thank Chaojun Xiao for unifying the styles of figures and tables in the book, thank Shengding Hu for making the notation table and unifying the notations across chapters, thank Jingcheng Yuzhi and Chaojun Xiao for organizing the format of the references, thank Jingcheng Yuzhi, Jiaju Du, Haozhe Ji, Sicong Ouyang, and Ayana for the first-round proofreading, and thank Weize Chen, Ganqu Cui, Bowen Dong, Tianyu Gao, Xu Han, Zhenghao Liu, Fanchao Qi, Guangxuan Xiao, Cheng Yang, Yuan Yao, Shi Yu, Yuan Zang, Zhengyan Zhang, Haoxi Zhong, and Jie Zhou for the second-round proofreading. We also thank Cuncun Zhao for designing the book cover.

In this book, there is a specific chapter on sememe knowledge representation. Many works in this chapter were carried out by our research group. These works received great encouragement from the inventor of HowNet, Mr. Zhendong Dong, who died at 82 on February 28, 2019. HowNet is the great linguistic and commonsense knowledge base composed by Mr. Dong over about 30 years. At the end of his life, he and his son Mr. Qiang Dong decided to collaborate with us and released the open-source version of HowNet, OpenHowNet. As a pioneer of machine translation in China, Mr. Zhendong Dong devoted his whole life to natural language processing. He will be missed by all of us forever.

We thank our colleagues and friends, Yang Liu and Juanzi Li at Tsinghua University, and Peng Li at Tencent Wechat, who offered close and frequent discussions that substantially improved this book. We also want to express our special thanks to Prof. Bo Zhang. His insights into deep learning and representation learning, and his sincere encouragement of our research on representation learning for NLP, have greatly stimulated us to move forward with more confidence and passion.

We proposed the plan of this book in 2015 after discussing it with the Springer Senior Editor, Dr. Celine Lanlan Chang. As it was our first time preparing a technical book, we did not expect it to take so long to finish. We thank Celine for providing insightful comments and incredible patience during the preparation of this book. We are also grateful to Springer's Assistant Editor, Jane Li, for offering invaluable help during manuscript preparation.

Finally, we give our appreciation to our organizations, the Department of Computer Science and Technology at Tsinghua University, the Institute for Artificial Intelligence at Tsinghua University, the Beijing Academy of Artificial Intelligence (BAAI), the Chinese Information Processing Society of China, and Tencent Wechat, which have provided an outstanding environment, support, and facilities for preparing this book.

This book is supported by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFG TRR-169.
Acronyms

ACNN  Anisotropic Convolutional Neural Network
AI  Artificial Intelligence
AUC  Area Under the Receiver Operating Characteristic Curve
BERT  Bidirectional Encoder Representations from Transformers
BFS  Breadth-First Search
BiDAF  Bi-Directional Attention Flow
BRNN  Bidirectional Recurrent Neural Network
CBOW  Continuous Bag-of-Words
ccDCLM  Context-to-Context Document-Context Language Model
CIDEr  Consensus-based Image Description Evaluation
CLN  Column Network
CLSP  Cross-Lingual Lexical Sememe Prediction
CNN  Convolutional Neural Network
CNRL  Community-enhanced Network Representation Learning
COCO-QA  Common Objects in COntext Question Answering
ConSE  Convex Combination of Semantic Embeddings
Conv-KNRM  Convolutional Kernel-based Neural Ranking Model
CSP  Character-enhanced Sememe Prediction
CWE  Character-based Word Embeddings
DCNN  Diffusion-Convolutional Neural Network
DeViSE  Deep Visual-Semantic Embedding Model
DFS  Depth-First Search
DGCN  Dual Graph Convolutional Network
DGE  Directed Graph Embedding
DKRL  Description-embodied Knowledge Graph Representation Learning
DRMM  Deep Relevance Matching Model
DSSM  Deep Structured Semantic Model
ECC  Edge-Conditioned Convolution
ERNIE  Enhanced Language Representation Model with Informative Entities
FM-IQA  Freestyle Multilingual Image Question Answering
GAAN  Gated Attention Network
GAT  Graph Attention Networks
GCN  Graph Convolutional Network
GCNN  Geodesic Convolutional Neural Network
GEAR  Graph-based Evidence Aggregating and Reasoning
GENQA  Generative Question Answering Model
GGNN  Gated Graph Neural Network
GloVe  Global Vectors for Word Representation
GNN  Graph Neural Networks
GRN  Graph Recurrent Network
GRU  Gated Recurrent Unit
HAN  Heterogeneous Graph Attention Network
HMM  Hidden Markov Model
HOPE  High-Order Proximity preserved Embeddings
IDF  Inverse Document Frequency
IE  Information Extraction
IKRL  Image-embodied Knowledge Graph Representation Learning
IR  Information Retrieval
KALM  Knowledge-Augmented Language Model
KB  Knowledge Base
KBC  Knowledge Base Completion
KG  Knowledge Graph
KL  Kullback-Leibler
KNET  Knowledge-guided Attention Neural Entity Typing
K-NRM  Kernel-based Neural Ranking Model
KR  Knowledge Representation
LBSN  Location-Based Social Network
LDA  Latent Dirichlet Allocation
LIWC  Linguistic Inquiry and Word Count
LLE  Locally Linear Embedding
LM  Language Model
LSA  Latent Semantic Analysis
LSHM  Latent Space Heterogeneous Model
LSTM  Long Short-Term Memory
MAP  Mean Average Precision
METEOR  Metric for Evaluation of Translation with Explicit ORdering
MMD  Maximum Mean Discrepancy
MMDW  Max-Margin DeepWalk
M-NMF  Modularized Nonnegative Matrix Factorization
movMF  mixture of von Mises-Fisher distributions
MRF  Markov Random Field
MSLE  Mean-Square Log-Transformed Error
MST  Minimum Spanning Tree
MV-RNN  Matrix-Vector Recursive Neural Network
NEU  Network Embedding Update
NKLM  Neural Knowledge Language Model
NLI  Natural Language Inference
NLP  Natural Language Processing
NRE  Neural Relation Extraction
OOKB  Out-of-Knowledge-Base
PCNN  Piece-wise Convolution Neural Network
pLSI  Probabilistic Latent Semantic Indexing
PMI  Point-wise Mutual Information
POS  Part-of-Speech
PPMI  Positive Point-wise Mutual Information
PTE  Predictive Text Embedding
PV-DBOW  Paragraph Vector with Distributed Bag-of-Words
PV-DM  Paragraph Vector with Distributed Memory
QA  Question Answering
RBF  Restricted Boltzmann Machine
RC  Relation Classification
R-CNN  Region-based Convolutional Neural Network
RDF  Resource Description Framework
RE  Relation Extraction
RMSE  Root Mean Squared Error
RNN  Recurrent Neural Network
RNTN  Recursive Neural Tensor Network
RPN  Region Proposal Network
SAC  Sememe Attention over Context Model
SAT  Sememe Attention over Target Model
SC  Semantic Compositionality
SCAS  Semantic Compositionality with Aggregated Sememe
SCMSA  Semantic Compositionality with Mutual Sememe Attention
SDLM  Sememe-Driven Language Model
SDNE  Structural Deep Network Embeddings
SE-WRL  Sememe-Encoded Word Representation Learning
SGD  Stochastic Gradient Descent
SGNS  Skip-gram with Negative Sampling Model
S-LSTM  Sentence Long Short-Term Memory
SPASE  Sememe Prediction with Aggregated Sememe Embeddings
SPCSE  Sememe Prediction with Character and Sememe Embeddings
SPICE  Semantic Propositional Image Caption Evaluation
SPSE  Sememe Prediction with Sememe Embeddings
SPWCF  Sememe Prediction with Word-to-Character Filtering
SPWE  Sememe Prediction with Word Embeddings
SSA  Simple Sememe Aggregation Model
SSWE  Sentiment-Specific Word Embeddings
SVD  Singular Value Decomposition
SVM  Support Vector Machine
TADW  Text-associated DeepWalk
TF  Term Frequency
TF-IDF  Term Frequency-Inverse Document Frequency
TKRL  Type-embodied Knowledge Graph Representation Learning
TSP  Traveling Salesman Problem
TWE  Topical Word Embeddings
VQA  Visual Question Answering
VSM  Vector Space Model
WRL  Word Representation Learning
WSD  Word Sense Disambiguation
YAGO  Yet Another Great Ontology

Symbols and Notations
Tokyo  Word example
∗  Convolution operator
≜  Defined as
⊙  Element-wise multiplication (Hadamard product)
⇒  Induces
∝  Proportional to
∑  Summation operator
min  Minimize
max  Maximize / max pooling
arg min  The parameter that minimizes a function
arg max  The parameter that maximizes a function
sim  Similarity
exp  Exponential function
Att  Attention function
Avg  Average function
F1-Score  F1 score
PMI  Pair-wise mutual information
ReLU  ReLU activation function
Sigmoid  Sigmoid function
Softmax  Softmax function
V  Vocabulary set
w; w  Word; word embedding vector
E  Word embedding matrix
R  Relation set
r; r  Relation; relation embedding vector
a  Answer to question
q  Query
W, M, U  Weight matrices
b, d  Bias vectors
M_{i,:}; M_{:,j}  Matrix's i-th row; matrix's j-th column
(·)^T  Transpose of a vector or a matrix
tr(·)  Trace of a matrix
𝓜  Tensor
α, β, λ  Hyperparameters
g, f, φ, Φ, δ  Functions
N_a  Number of occurrences of a
d(·, ·)  Distance function
s(·, ·)  Similarity function
P(·), p(·)  Probability
I(·, ·)  Mutual information
H(·)  Entropy
O(·)  Time complexity
D_KL(·‖·)  KL divergence
L  Loss function
O  Objective function
E  Energy function
α  Attention score
(·)*  Optimal value of a variable/function
|·|  Vector length; set size
‖·‖_α  α-norm
1(·)  Indicator function
θ  Parameters of a neural network
E  Expectation of a random variable
(·)+; (·)−  Positive sample; negative sample
γ  Margin
μ  Mean of normal distribution
μ  Mean vector of Gaussian distribution
σ  Standard error of normal distribution
Σ  Covariance matrix of Gaussian distribution
I_n  n-dimensional identity matrix
𝒩  Normal distribution

Chapter 1
Representation Learning and NLP
Abstract
Natural languages are typical unstructured information. Conventional Natural Language Processing (NLP) heavily relies on feature engineering, which requires careful design and considerable expertise. Representation learning aims to learn representations of raw data as useful information for further classification or prediction. This chapter presents a brief introduction to representation learning, including its motivation and basic idea, and also reviews its history and recent advances in both machine learning and NLP.
1.1 Motivation

Machine learning addresses the problem of automatically learning computer programs from data. A typical machine learning system consists of three components [5]:

Machine Learning = Representation + Objective + Optimization.    (1.1)

That is, to build an effective machine learning system, we first transform useful information from raw data into internal representations such as feature vectors. Then, by designing appropriate objective functions, we can employ optimization algorithms to find the optimal parameter settings for the system.

Data representation determines how much useful information can be extracted from raw data for further classification or prediction. If more useful information is transformed from raw data into feature representations, the performance of classification or prediction will tend to be better. Hence, data representation is a crucial component supporting effective machine learning.

Conventional machine learning systems adopt careful feature engineering as preprocessing to build feature representations from raw data. Feature engineering needs careful design and considerable expertise, and a specific task usually requires customized feature engineering algorithms, which makes feature engineering labor intensive, time consuming, and inflexible.

Representation learning aims to learn informative representations of objects from raw data automatically. The learned representations can be further fed as input to machine learning systems for prediction or classification. In this way, machine learning algorithms will be more flexible and desirable while handling large-scale and noisy unstructured data, such as speech, images, videos, time series, and texts.

Deep learning [9] is a typical approach for representation learning, which has recently achieved great success in speech recognition, computer vision, and natural language processing. Deep learning has two distinguishing features:
• Distributed Representation. Deep learning algorithms typically represent each object with a low-dimensional real-valued dense vector, which is known as distributed representation. Compared to one-hot representation in conventional representation schemes (such as bag-of-words models), distributed representation is able to represent data in a more compact and smooth way, as shown in Fig. 1.1, and hence is more robust in addressing the sparsity issue in large-scale data.
• Deep Architecture. Deep learning algorithms usually learn a hierarchical deep architecture to represent objects, known as multilayer neural networks. The deep architecture is able to extract abstractive features of objects from raw data, which is regarded as an important reason for the great success of deep learning in speech recognition and computer vision.

Currently, the improvements brought by deep learning for NLP may still not be as significant as those for speech and vision. However, deep learning for NLP has been able to significantly reduce the work of feature engineering in NLP while improving performance. Hence, many researchers are devoting themselves to developing efficient algorithms for representation learning (especially deep learning) for NLP.

In this chapter, we will first discuss why representation learning is important for NLP and introduce the basic ideas of representation learning. Afterward, we will
briefly review the development history of representation learning for NLP, introduce typical approaches of contemporary representation learning, and summarize existing and potential applications of representation learning. Finally, we will introduce the general organization of this book.

Fig. 1.1  Distributed representation of words and entities in human languages (e.g., entities such as Apple Inc., Steve Jobs, Tim Cook, and iPhone mapped into embeddings, connected by relations such as CEO, founder, and product).
1.2 Why Representation Learning Is Important for NLP

NLP aims to build linguistic-specific programs for machines to understand languages. Natural language texts are typical unstructured data, with multiple granularities, multiple tasks, and multiple domains, which makes it challenging for NLP to achieve satisfactory performance.
Multiple Granularities. NLP concerns multiple levels of language entries, including but not limited to characters, words, phrases, sentences, paragraphs, and documents. Representation learning can help to represent the semantics of these language entries in a unified semantic space and to build complex semantic relations among them.
Multiple Tasks. There are various NLP tasks based on the same input. For example, given a sentence, we can perform multiple tasks such as word segmentation, part-of-speech tagging, named entity recognition, relation extraction, and machine translation. In this case, it will be more efficient and robust to build a unified representation space of inputs for multiple tasks.
Multiple Domains. Natural language texts may be generated from multiple domains, including but not limited to news articles, scientific articles, literary works, and online user-generated content such as product reviews. Moreover, we can also regard texts in different languages as multiple domains. Conventional NLP systems have to design specific feature extraction algorithms for each domain according to its characteristics. In contrast, representation learning enables us to build representations automatically from large-scale domain data.

In summary, as shown in Fig. 1.2, representation learning can facilitate knowledge transfer across multiple language entries, multiple NLP tasks, and multiple application domains, and significantly improve the effectiveness and robustness of NLP performance.
1.3 Basic Ideas of Representation Learning

In this book, we focus on the distributed representation scheme (i.e., embedding), and talk about recent advances in representation learning methods for multiple language entries, including words, phrases, sentences, and documents, as well as their closely related objects, including sememe-based linguistic knowledge, entity-based world knowledge, networks, and cross-modal entries.
Fig. 1.2  Distributed representation can provide a unified semantic space for multi-grained language entries (words, phrases, sentences, documents, knowledge, and networks) and for multiple NLP tasks (lexical analysis, syntactic analysis, semantic analysis, and NLP applications).
By distributed representation learning, all objects that we are interested in are projected into a unified low-dimensional semantic space. As demonstrated in Fig. 1.1, the geometric distance between two objects in the semantic space indicates their semantic relatedness; the semantic meaning of an object is related to which objects are close to it. In other words, it is the relative closeness to other objects that reveals an object's meaning, rather than its absolute position.
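To make this concrete, here is a minimal sketch in Python (the vectors below are toy values invented for illustration, not learned embeddings) that measures semantic relatedness as cosine similarity in the embedding space:

```python
import numpy as np

# Toy 4-dimensional embeddings; in practice these would be learned from
# large corpora (e.g., by word2vec) and have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.06]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine(u, v):
    """Cosine similarity: closeness in direction indicates semantic relatedness."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related meanings
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated meanings
```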
1.4 Development of Representation Learning for NLP

In this section, we introduce the development of representation learning for NLP, also shown in Fig. 1.3. To study representation schemes in NLP, words are a good starting point, since they are the minimum units in natural languages. The easiest way to represent a word in a computer-readable way (e.g., using a vector) is the one-hot vector, which has the dimension of the vocabulary size and assigns 1 to the word's corresponding position and 0 to all others. It is apparent that one-hot vectors hardly contain any semantic information about words except simply distinguishing them from each other.

One of the earliest ideas of word representation learning can be dated back to n-gram models [15]. It is easy to understand: when we want to predict the next word in a sequence, we usually look at some previous words (and in the case of an n-gram, they are the previous n − 1 words). The idea of n-gram models is coherent with the distributional hypothesis: linguistic items with similar distributions have similar meanings [7]. Put another way, "a word is characterized by the company it keeps" [6]. This became the fundamental idea of many NLP models, from word2vec to BERT.
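As a minimal illustration of these symbol-based schemes, the following Python sketch (with a made-up toy corpus) builds one-hot vectors and estimates bigram probabilities, i.e., an n-gram model with n = 2, by simple counting:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

# One-hot representation: a |V|-dimensional vector with a single 1.
def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Bigram model: P(w_i | w_{i-1}) estimated from corpus counts.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(one_hot("cat"))             # sparse vector, no notion of similarity
print(bigram_prob("the", "cat"))  # P(cat | the) from counts
```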
Fig. 1.3  The timeline for the development of representation learning in NLP, from n-gram models, the distributional hypothesis, and bag-of-words models to distributed representation, neural probabilistic language models, word2vec, and contextual pre-trained language models. With the growing computing power and large-scale text data, distributed representation trained with neural networks and large corpora has become the mainstream.
Another example of the distributional hypothesis is
Bag-Of-Words (BOW) models [7]. BOW models regard a document as a bag of its words, disregarding the order of these words in the document. In this way, the document can be represented as a vocabulary-size vector, in which each word that has appeared in the document corresponds to a unique and nonzero dimension. Then a score can be computed for each word (e.g., its number of occurrences) to indicate its weight in the document. Though very simple, BOW models work great in applications like spam filtering, text classification, and information retrieval, proving that the distributions of words can serve as a good representation for text.

In the above cases, each value in the representation clearly matches one entry (e.g., word scores in BOW models). This one-to-one correspondence between concepts and representation elements is called local representation or symbol-based representation, which is natural and simple.

In distributed representation, on the other hand, each entity (or attribute) is represented by a pattern of activation distributed over multiple elements, and each computing element is involved in representing multiple entities [11]. Distributed representation has been proved to be more efficient because it usually has low dimensions, which prevents the sparsity issue. Useful hidden properties can be learned from large-scale data and emerge in distributed representation. The idea of distributed representation was originally inspired by the neural computation scheme of humans and other animals. It comes from neural networks (activations of neurons), and with the great success of deep learning, distributed representation has become the most commonly used approach for representation learning.

One of the pioneering practices of distributed representation in NLP is the Neural Probabilistic Language Model (NPLM) [1]. A language model predicts the joint probability of a sequence of words (n-gram models are simple language models). NPLM first assigns a distributed vector to each word, then uses a neural network to predict the next word. By going through the training corpora, NPLM successfully learns how to model the joint probability of sentences, while bringing word embeddings (i.e., low-dimensional word vectors) as learned parameters. Though it is hard to tell what each element of a word embedding actually means, the vectors indeed encode semantic meanings about the words, as verified by the performance of NPLM.

Fig. 1.4  This figure shows how word embeddings and pre-trained language models work in NLP pipelines. Both learn distributed representations for language entries (e.g., words) through pretraining objectives and transfer them to target tasks. Furthermore, pre-trained language models can also transfer model parameters.
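The following sketch shows the core NPLM idea in code, assuming PyTorch; the layer sizes and module names are illustrative simplifications, not taken from the original paper. It looks up a distributed vector for each context word and feeds them into a small network that scores the next word over the whole vocabulary; the embedding table is trained jointly with the rest of the network.

```python
import torch
import torch.nn as nn

class TinyNPLM(nn.Module):
    """A minimal NPLM-style next-word predictor (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # distributed word vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # scores over the vocabulary

    def forward(self, context_ids):                 # (batch, context_size)
        vectors = self.embed(context_ids)           # (batch, context_size, embed_dim)
        features = torch.tanh(self.hidden(vectors.flatten(1)))
        return self.out(features)                   # next-word logits

model = TinyNPLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 3)))     # two contexts of three word ids
print(logits.shape)                                  # torch.Size([2, 10000])
```

After training with a language modeling loss, the rows of `model.embed.weight` are exactly the word embeddings discussed in the text.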
Inspired by NPLM, many methods have since embedded words into distributed representations and used the language modeling objective to optimize them as model parameters. Famous examples include word2vec [12], GloVe [13], and fastText [3]. Though differing in detail, these methods are all very efficient to train, utilize large-scale corpora, and have been widely adopted as word embeddings in many NLP models. Word embeddings in the NLP pipeline map discrete words into informative low-dimensional vectors, and help to shine a light on neural networks for computing and understanding languages. This makes representation learning a critical part of natural language processing.

The research on representation learning in NLP took a big leap when
ELMo [14] and
BERT [4] came out. Besides using larger corpora, more parameters, and more computing resources compared to word2vec, they also take the complicated context in text into consideration. This means that instead of assigning each word a fixed vector, ELMo and BERT use multilayer neural networks to calculate dynamic representations for words based on their context, which is especially useful for words with multiple meanings. Moreover, BERT popularized (though did not originate) the pretraining fine-tuning pipeline. Previously, word embeddings were simply adopted as input representations. But after BERT, it has become a common practice to keep using the same neural network structure, such as BERT, in both pretraining and fine-tuning, that is, taking the parameters of BERT for initialization and fine-tuning the model on downstream tasks (Fig. 1.4). Though not a big theoretical breakthrough, BERT-like models (also known as
Pre-trained Language Models (PLMs), for they are pretrained through a language modeling objective on large corpora) have attracted wide attention in the NLP and machine learning communities, for they have been so successful and achieve state-of-the-art results on almost every NLP benchmark. These models show what large-scale data and computing power can lead to, and new research works on the topic of PLMs emerge rapidly. Probing experiments demonstrate that PLMs implicitly encode a variety of linguistic knowledge and patterns inside their multilayer network parameters [8, 10]. All these significant performances and interesting analyses suggest that there are still a lot of open problems to explore in PLMs, as the future of representation learning for NLP.

Based on the distributional hypothesis, representation learning for NLP has evolved from symbol-based representation to distributed representation. Starting from word2vec, word embeddings trained from large corpora have shown significant power in most NLP tasks. Recently, emerging PLMs (like BERT) take complicated context into word representation and start a new trend of the pretraining fine-tuning pipeline, bringing NLP to a new level. What will be the next big change in representation learning for NLP? We hope the contents of this book can give you some inspiration.
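To make the two usages concrete, here is a hedged sketch assuming the Hugging Face Transformers library (the checkpoint name and task head are merely examples, not something prescribed by this book): the feature-based style takes the pretrained hidden states as input features, while the fine-tuning style reuses and updates the pretrained parameters on the downstream task.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Representation learning is useful."], return_tensors="pt")

# Feature-based use: take contextual representations as fixed input features.
encoder = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    features = encoder(**batch).last_hidden_state   # (batch, seq_len, hidden)

# Fine-tuning use: reuse the pretrained parameters and train a task head end to end.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
logits = classifier(**batch).logits                 # gradients flow through BERT itself
```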
1.5 Learning Approaches to Representation Learning for NLP

People have developed various effective and efficient approaches to learn semantic representations for NLP. Here we list some typical approaches.
Statistical Features: As introduced before, semantic representations for NLP in the early stage often came from statistics, instead of emerging from an optimization process. For example, in n-gram or bag-of-words models, elements in the representation are usually frequencies or numbers of occurrences of the corresponding entries counted in large-scale corpora.

Hand-crafted Features: In certain NLP tasks, syntactic and semantic features are useful for solving the problem, for example, types of words and entities, semantic roles, and parse trees. These linguistic features may be provided with the tasks or can be extracted by specific NLP systems. For a long period before the wide use of distributed representation, researchers devoted a lot of effort to designing useful features and combining them as the inputs for NLP models.
Supervised Learning: Distributed representations emerge from the optimization process of neural networks under supervised learning. In the hidden layers of neural networks, different activation patterns of neurons represent different entities or attributes. With a training objective (usually a loss function for the target task) and supervised signals (usually the gold-standard labels for training instances of the target task), the networks can learn better parameters via optimization (e.g., gradient descent). With proper training, the hidden states become informative and generalized, serving as good semantic representations of natural languages.

For example, to train a neural network for a sentiment classification task, the loss function is usually set as the cross-entropy of the model predictions with respect to the gold-standard sentiment labels as supervision. While optimizing the objective, the loss gets smaller and the model performance gets better. In the meantime, the hidden states of the model gradually form good sentence representations by encoding the information necessary for sentiment classification inside the continuous hidden space.
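A minimal sketch of this process, assuming PyTorch (the GRU encoder, the layer sizes, and the random toy batch are all illustrative choices, not the book's prescription): the cross-entropy objective drives the hidden state to double as a sentence representation.

```python
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, hidden = self.encoder(self.embed(token_ids))
        sentence_repr = hidden[-1]          # hidden state serves as the sentence representation
        return self.classifier(sentence_repr), sentence_repr

model = SentimentNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 5000, (8, 20))    # a fake batch: 8 sentences, 20 token ids each
labels = torch.randint(0, 2, (8,))          # fake gold-standard sentiment labels

logits, sentence_repr = model(tokens)
loss = loss_fn(logits, labels)              # supervised signal: cross-entropy
loss.backward()
optimizer.step()                            # hidden states gradually become informative
```

The design choice worth noting is that nothing forces `sentence_repr` to be meaningful; it becomes informative only because the supervised objective needs it to be.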
Self-supervised Learning: In some cases, we simply want to get good representations for certain elements, so that these representations can be transferred to other tasks. For example, in most neural NLP models, words in sentences are first mapped to their corresponding word embeddings (maybe from word2vec or GloVe) before being sent to the networks. However, there are no human-annotated "labels" for learning word embeddings. To acquire the training objective necessary for neural networks, we need to generate "labels" intrinsically from existing data. This is called self-supervised learning (one way of doing unsupervised learning).

For example, language modeling is a typical "self-supervised" objective, for it does not require any human annotations. Based on the distributional hypothesis, using the language modeling objective can lead to hidden representations that encode the semantics of words. You may have heard of a famous equation: w(king) − w(man) + w(woman) = w(queen), which demonstrates the analogical properties that word embeddings possess through self-supervised learning.

We can see another angle of self-supervised learning in autoencoders, which are also a way to learn representations for a set of data. Typical autoencoders have a reduction (encoding) phase and a reconstruction (decoding) phase. In the reduction phase, an item from the data is encoded into a low-dimensional representation, and in the reconstruction phase, the model tries to reconstruct the item from the intermediate representation. Here, the training objective is the reconstruction loss, derived from the data itself. During the training process, meaningful information is encoded and kept in the latent representation, while noise signals are discarded.

Self-supervised learning has achieved great success in NLP, for plain text itself contains abundant knowledge and patterns about languages, and self-supervised learning can fully utilize existing large-scale corpora. Nowadays, it is still the most exciting research area of representation learning for natural languages, and researchers continue to put their efforts into this direction.

Besides, many other machine learning approaches have also been explored in representation learning for NLP, such as adversarial training, contrastive learning, few-shot learning, meta-learning, continual learning, and reinforcement learning. How to develop more effective and efficient approaches of representation learning for NLP, and how to better take advantage of large-scale and complicated corpora and computing power, are still important research topics.

1.6 Applications of Representation Learning for NLP

In general, there are two kinds of applications of representation learning for NLP. In one case, the semantic representation is trained in a pretraining task (or designed by human experts) and is transferred to the model for the target task. Word embedding is an example of this kind of application: it is trained using a language modeling objective and taken as input for other downstream NLP models. In this book, we will also introduce sememe knowledge representation and world knowledge representation, which can also be integrated into some NLP systems as additional knowledge augmentation to enhance their performance in certain aspects.

In other cases, the semantic representation lies within the hidden states of the neural model and directly aims for better performance on target tasks in an end-to-end fashion.
For example, many NLP tasks need to semantically compose sentence or document representations: tasks like sentiment classification, natural language inference, and relation extraction require sentence representations, and tasks like question answering need document representations. As shown in the latter part of the book, many representation learning methods have been developed for sentences and documents and benefit these NLP tasks.
We start the book from word representation. By giving a thorough introduction to word representation, we hope the readers can grasp the basic ideas of representation learning for NLP. Based on that, we further talk about how to compositionally acquire representations for higher-level language components, from sentences to documents.

As shown in Fig. 1.5, representation learning will be able to incorporate various types of structural knowledge to support a deep understanding of natural languages, which we call knowledge-guided NLP. Hence, we next introduce two forms of knowledge representation that are closely related to NLP. On the one hand, sememe representation tries to encode linguistic and commonsense knowledge in natural languages. A sememe is defined as the minimum indivisible unit of semantic meaning [2]. With the help of sememe representation learning, we can build more interpretable and more robust NLP models. On the other hand, world knowledge representation studies how to encode world facts into a continuous semantic space. It can not only help with knowledge graph tasks but also benefit knowledge-guided NLP applications.

Besides, the network is also a natural way to represent objects and their relationships. In the network representation section, we study how to embed vertices and edges in a network and how these elements interact with each other. Through the applications, we further show how network representations can help NLP tasks.

Another interesting topic related to NLP is cross-modal representation, which studies how to model unified semantic representations across different modalities (e.g., text, audio, images, videos, etc.). In this section, we review several cross-modal problems along with representative models.

At the end of the book, we introduce some useful resources to the readers, including deep learning frameworks and open-source code. We also share some views about the next big topics in representation learning for NLP. We hope that the resources and the outlook can help our readers better understand the content of the book, and inspire them about how representation learning in NLP may further develop.
Fig. 1.5 The architecture of knowledge-guided NLP
References
1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
2. Leonard Bloomfield. A set of postulates for the science of language. Language, 2(3):153–164, 1926.
3. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
5. Pedro Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.
6. John R. Firth. A synopsis of linguistic theory, 1930–1955. 1957.
7. Zellig S. Harris. Distributional structure. Word, 10(2–3):146–162, 1954.
8. John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT, 2019.
9. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
10. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, 2019.
11. James L. McClelland, David E. Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the Microstructure of Cognition, 2:216–271, 1986.
12. T. Mikolov and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.
13. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
14. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
15. Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
Chapter 2
Word Representation
Abstract
Word representation, which aims to represent a word with a vector, plays an essential role in NLP. In this chapter, we first introduce several typical word representation learning methods, including one-hot representation and distributed representation. After that, we present two widely used evaluation tasks for measuring the quality of word embeddings. Finally, we introduce recent extensions of word representation learning models.
Words are usually considered the smallest meaningful units of speech or writing in human languages. High-level structures in a language, such as phrases and sentences, are further composed of words. For human beings, to understand a language, it is crucial to understand the meanings of words. Therefore, it is essential to accurately represent words, which helps models better understand, categorize, or generate text in NLP tasks.

A word can be naturally represented as a sequence of characters. However, it is very inefficient and ineffective to use only raw character sequences to represent words. First, the variable lengths of words make them hard to process in machine learning methods. Moreover, such a representation is very sparse, because only a tiny proportion of character arrangements are meaningful. For example, English words are usually character sequences composed of 1–20 characters from the English alphabet, but most of these character sequences, such as "aaaaa", are meaningless.

One-hot representation is another natural approach to represent words, which assigns a unique index to each word. However, it is also not good enough. First, one-hot representation cannot capture the semantic relatedness among words. Second, it is a high-dimensional sparse representation, which is very inefficient. Third, it is very inflexible when dealing with new words, which requires assigning new indexes and changes the dimensionality of the representation; such changes may cause problems for existing NLP systems.
Recently, distributed word representation approaches have been proposed to address the problems of one-hot word representation. The distributional hypothesis [23, 30], which states that linguistic objects with similar distributions have similar meanings, is the basis for distributed word representation learning. Based on the distributional hypothesis, various word representation models, such as CBOW and Skip-gram, have been proposed and applied in different areas.

In the remaining part of this chapter, we start with one-hot word representation. Further, we introduce distributed word representation models, including Brown Cluster, Latent Semantic Analysis, word2vec, and GloVe, in detail. Then we introduce two typical evaluation tasks for word representation. Finally, we discuss various extensions of word representation models.
In this section, we introduce one-hot word representation in detail. Given a fixed vocabulary V = {w_1, w_2, \ldots, w_{|V|}}, one very intuitive way to represent a word w is to encode it with a |V|-dimensional vector \mathbf{w}, where each dimension of \mathbf{w} is either 0 or 1, and only one dimension can be 1 while all the other dimensions are 0. Formally, each dimension of \mathbf{w} can be represented as

\mathbf{w}_i = \begin{cases} 1 & \text{if } w = w_i, \\ 0 & \text{otherwise}. \end{cases}    (2.1)

One-hot word representation, in essence, maps each word to an index of the vocabulary, which can be very efficient for storage and computation. However, it does not contain rich semantic or syntactic information about words. Therefore, one-hot representation cannot capture the relatedness among words: the difference between cat and dog is as large as the difference between cat and bed in one-hot word representation. Besides, one-hot word representation embeds each word into a |V|-dimensional vector, which can only work for a fixed vocabulary and is therefore inflexible for dealing with new words in real-world scenarios.

Recently, distributed word representation approaches have been proposed to address the problems of one-hot word representation. The distributional hypothesis [23, 30], which states that linguistic objects with similar distributions have similar meanings, is the basis for semantic word representation learning.

Based on the distributional hypothesis, Brown Cluster [9] groups words into hierarchical clusters, where words in the same cluster have similar meanings. The cluster label can roughly represent the similarity between words, but it cannot precisely compare words in the same group. To address this issue, distributed word representation aims to embed each word into a continuous real-valued vector. It is a dense representation: "dense" means that one concept is represented by more than one dimension of the vector, and each dimension of the vector is involved in representing multiple concepts. Due to its continuous characteristic, distributed word representation can be easily applied in deep neural models for NLP tasks. Distributed word representation approaches such as word2vec and GloVe usually learn word vectors from a large corpus based on the distributional hypothesis. In this section, we introduce several distributed word representation approaches in detail. We emphasize that distributed representation and distributional representation are two completely different aspects of representations, and a word representation method may belong to both categories: distributed representation indicates that the representation is a real-valued vector, while distributional representation indicates that the meaning of a word is learned under the distributional hypothesis.

Brown Cluster classifies words into several clusters that have similar semantic meanings. In detail, Brown Cluster learns a binary tree from a large-scale corpus, in which the leaves of the tree indicate words and the internal nodes indicate hierarchical word clusters. This is a hard clustering method, since each word belongs to exactly one group.

The idea of clustering words in Brown Cluster comes from the n-gram language model. A language model evaluates the probability of a sentence; for example, the sentence have a nice day should have a higher probability than a random sequence of words. Using a k-gram language model, the probability of a sentence s = {w_1, w_2, \ldots, w_n} can be represented as

P(s) = \prod_{i=1}^{n} P(w_i \mid w_{i-k}^{i-1}).    (2.2)

It is easy to estimate P(w_i \mid w_{i-k}^{i-1}) from a large corpus, but the model has on the order of |V|^k independent parameters; even when k is 2, the number of parameters is considerable. Moreover, the estimation is inaccurate for rare words. To address these problems, [9] proposes to group words into clusters and train a cluster-level n-gram language model rather than a word-level model.
By assigning a cluster to each word, the probability can be written as

P(s) = \prod_{i=1}^{n} P(c_i \mid c_{i-k}^{i-1}) \, P(w_i \mid c_i),    (2.3)

where c_i is the corresponding cluster of w_i. In the cluster-level language model, there are only about |C|^k + |V| − |C| independent parameters, where C is the cluster set, which is usually much smaller than the vocabulary V.

The quality of the clustering affects the performance of the language model. Given a training text s, for a 2-gram language model, the quality of a mapping \pi from words to clusters is defined as

Q(\pi) = \frac{1}{n} \log P(s)    (2.4)
       = \frac{1}{n} \sum_{i=1}^{n} \big[ \log P(c_i \mid c_{i-1}) + \log P(w_i \mid c_i) \big].    (2.5)

Let N_w be the number of times word w appears in corpus s, N_{w_1 w_2} be the number of times the bigram w_1 w_2 appears, and N_{\pi(w)} be the number of times a cluster appears. Then the quality function Q(\pi) can be rewritten in a statistical way as follows:

Q(\pi) = \frac{1}{n} \sum_{i=1}^{n} \big[ \log P(c_i \mid c_{i-1}) + \log P(w_i \mid c_i) \big]    (2.6)
       = \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log P\big(\pi(w_2) \mid \pi(w_1)\big) P\big(w_2 \mid \pi(w_2)\big)    (2.7)
       = \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{N_{\pi(w_1)\pi(w_2)}}{N_{\pi(w_1)}} \frac{N_{w_2}}{N_{\pi(w_2)}}    (2.8)
       = \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{n \, N_{\pi(w_1)\pi(w_2)}}{N_{\pi(w_1)} N_{\pi(w_2)}} + \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{N_{w_2}}{n}    (2.9)
       = \sum_{c_1, c_2} \frac{N_{c_1 c_2}}{n} \log \frac{n \, N_{c_1 c_2}}{N_{c_1} N_{c_2}} + \sum_{w} \frac{N_w}{n} \log \frac{N_w}{n}.    (2.10)

Since P(w) = \frac{N_w}{n}, P(c) = \frac{N_c}{n}, and P(c_1 c_2) = \frac{N_{c_1 c_2}}{n}, the quality function can be rewritten as

Q(\pi) = \sum_{c_1, c_2} P(c_1 c_2) \log \frac{P(c_1 c_2)}{P(c_1) P(c_2)} + \sum_{w} P(w) \log P(w)    (2.11)
       = I(C) - H(V),    (2.12)

where I(C) is the mutual information between clusters and H(V) is the entropy of the word distribution, which is a constant value. Therefore, optimizing Q(\pi) is equivalent to maximizing the mutual information.

There is no practical method to obtain the optimal partition. Nevertheless, Brown Cluster uses a greedy strategy to obtain a suboptimal result. Initially, it assigns a distinct class to each word. Then it merges the two classes whose merging causes the least loss in average mutual information. After |V| − |C| merges, the partition into |C| clusters is generated.
Table 2.1 Some clusters produced by Brown Cluster

Keeping the |C| clusters, further merges can be performed to obtain the hierarchical structure; the complexity of this greedy algorithm is polynomial in |V|. We show some clusters in Table 2.1. From the table, we can find that each cluster relates to a sense in natural language: the words in the same cluster tend to express similar meanings or can be used interchangeably.

Latent Semantic Analysis (LSA) is a family of strategies derived from vector space models, which can capture word semantics much better. LSA aims to explore latent factors for words and documents by matrix factorization to improve the estimation of word similarities. Reference [14] applies Singular Value Decomposition (SVD) to the word-document matrix and exploits uncorrelated factors for both words and documents. The SVD of the word-document matrix M yields three matrices E, \Sigma, and D such that

M = E \Sigma D^{\top},    (2.13)

where \Sigma is the diagonal matrix of singular values of M, each row vector \mathbf{w}_i in matrix E corresponds to word w_i, and each row vector \mathbf{d}_i in matrix D corresponds to document d_i. Then the similarity between two words can be computed as

\text{sim}(w_i, w_j) = M_{i,:} M_{j,:}^{\top} = \mathbf{w}_i \Sigma^2 \mathbf{w}_j^{\top}.    (2.14)

Here, the number of singular values k retained in \Sigma is a hyperparameter that needs to be tuned. With a reasonable number of the largest singular values kept, LSA can capture much useful information in the word-document matrix and provide a smoothing effect that prevents large variance.

With a relatively small k, once the matrices E, \Sigma, and D are computed, measuring word similarity can be very efficient because there are often fewer nonzero dimensions in the word vectors. However, the computation of E and D can be costly because a full SVD on an n × m matrix takes O(\min\{m^2 n, m n^2\}) time, while the parallelization of SVD is not trivial.

Another algorithm for LSA is Random Indexing [34, 55]. It overcomes the difficulty of SVD-based LSA by avoiding the costly preprocessing of a huge word-document matrix. In random indexing, each document is assigned a randomly generated high-dimensional sparse ternary vector (called an index vector). Then, for each word in the document, the index vector is added to the word's vector. The index vectors are supposed to be orthogonal or nearly orthogonal. This algorithm is simple and scalable, easy to parallelize, and can be implemented incrementally. Moreover, its performance is comparable with SVD-based LSA, according to [55].
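As an illustration (not the authors' code; the counts and the dimension k are made up), the following NumPy sketch builds a tiny word-document matrix and applies a truncated SVD in the spirit of Eqs. (2.13)-(2.14).

```python
import numpy as np

# Toy word-document count matrix M (rows: words, columns: documents).
words = ["cat", "dog", "car", "truck"]
M = np.array([
    [3, 2, 0, 0],   # cat
    [2, 3, 0, 0],   # dog
    [0, 0, 4, 1],   # car
    [0, 1, 2, 3],   # truck
], dtype=float)

# Full SVD: M = E * Sigma * D^T.
E, sigma, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values (truncated SVD / LSA).
k = 2
word_vecs = E[:, :k] * sigma[:k]   # low-dimensional word representations

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words appearing in similar documents end up with similar latent vectors.
print(cosine(word_vecs[0], word_vecs[1]))  # cat vs. dog  -> high
print(cosine(word_vecs[0], word_vecs[2]))  # cat vs. car  -> low
```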
Google's word2vec toolkit was released in 2013. It can efficiently learn word vectors from a large corpus. The toolkit has two models, Continuous Bag-Of-Words (CBOW) and Skip-gram. Based on the assumption that the meaning of a word can be learned from its context, CBOW optimizes the embeddings so that they can predict a target word given its context words. Skip-gram, on the contrary, learns embeddings that can predict the context words given a target word. In this section, we introduce these two models in detail.

CBOW predicts the center word given a window of context. Figure 2.1 shows the idea of CBOW with a window of 5 words. Formally, CBOW predicts w_i according to its contexts as

P(w_i \mid w_{j (|j-i| \le l, j \ne i)}) = \text{Softmax}\Big( \mathbf{M} \Big( \sum_{|j-i| \le l, j \ne i} \mathbf{w}_j \Big) \Big),    (2.15)

where P(w_i \mid w_{j (|j-i| \le l, j \ne i)}) is the probability of word w_i given its contexts, l is the size of the training context, \mathbf{M} \in \mathbb{R}^{|V| \times m} is the weight matrix, V indicates the vocabulary, and m is the dimension of the word vectors.

The CBOW model is optimized by minimizing the sum of negative log probabilities:

\mathscr{L} = - \sum_{i} \log P(w_i \mid w_{j (|j-i| \le l, j \ne i)}).    (2.16)

Here, the window size l is a hyperparameter to be tuned. A larger window size may lead to higher accuracy, at the expense of longer training time. (The word2vec toolkit is available at https://code.google.com/archive/p/word2vec/.)
Fig. 2.1 The architecture of the CBOW model
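The following NumPy sketch (illustrative only; dimensions, data, and variable names are made up) computes the CBOW prediction of Eq. (2.15) and one term of the loss in Eq. (2.16).

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 10, 4

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input word vectors w_j
M = rng.normal(scale=0.1, size=(vocab_size, dim))      # output weight matrix

def cbow_prob(context_ids):
    """Distribution over the center word given context word ids (Eq. 2.15)."""
    h = W_in[context_ids].sum(axis=0)    # sum of the context word vectors
    logits = M @ h
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

context = [2, 3, 5, 6]    # words around position i within window l
center = 4                # the target word w_i

probs = cbow_prob(context)
loss = -np.log(probs[center])    # one term of Eq. (2.16)
print("P(w_i | context) =", probs[center], "loss =", loss)
```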
On the contrary to CBOW, Skip-gram predicts the context given the center word. Figure 2.2 shows the model. Formally, given a word w_i, Skip-gram predicts its context as

P(w_j \mid w_i) = \text{Softmax}(\mathbf{M} \mathbf{w}_i), \quad |j - i| \le l, \; j \ne i,    (2.17)

where P(w_j \mid w_i) is the probability of context word w_j given w_i, and \mathbf{M} is the weight matrix. The loss function is defined similarly to CBOW as

\mathscr{L} = - \sum_{i} \sum_{j (|j-i| \le l, j \ne i)} \log P(w_j \mid w_i).    (2.18)
Fig. 2.2 The architecture of the Skip-gram model
Training CBOW or Skip-gram directly is very time consuming. The most time-consuming part is the softmax layer: the conventional softmax layer needs to compute the scores of all words even though only one word is used in the loss function. An intuitive idea to improve efficiency is to use a reasonable but much faster approximation. Here, we introduce two typical approximation methods included in the toolkit, hierarchical softmax and negative sampling. We explain these two methods using CBOW as an example.

The idea of hierarchical softmax is to build hierarchical classes for all words and to estimate the probability of a word by estimating the conditional probabilities of its corresponding hierarchical classes. Figure 2.3 gives an example. Each internal node of the tree indicates a hierarchical class and has a feature vector, while each leaf node indicates a word. In this example, the probability of the word the is the product of the two branching probabilities along its path from the root, while the probability of cat is the product of the three branching probabilities along its (longer) path. Each conditional probability is computed from the feature vectors at the corresponding node and the context vector. For example,

p_1 = \frac{\exp(\mathbf{w}_1 \cdot \mathbf{w}_c)}{\exp(\mathbf{w}_1 \cdot \mathbf{w}_c) + \exp(\mathbf{w}_2 \cdot \mathbf{w}_c)},    (2.19)

p_2 = 1 - p_1,    (2.20)

where \mathbf{w}_c is the context vector and \mathbf{w}_1, \mathbf{w}_2 are the feature vectors.

Hierarchical softmax generates the hierarchical classes according to word frequency, i.e., it builds a Huffman tree. With this approximation, the probability of each word can be computed much faster, and the complexity of calculating the probability of each word is O(\log |V|).

Negative sampling is more straightforward. To calculate the probability of a word, negative sampling directly samples k words as negative samples according to word frequency. Then, it computes a softmax over these k + 1 words (the k negative samples plus the target word) to approximate the probability of the target word.
Fig. 2.3 An illustration of hierarchical softmax (leaf words: dog, cat, the, is)
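As a rough illustration of negative sampling as described above (a sketch with made-up data, following the softmax-over-(k+1)-words description rather than the toolkit's actual implementation), the snippet below scores a target word against k sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim, k = 1000, 50, 5

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # center word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # context word vectors
unigram_probs = np.full(vocab_size, 1.0 / vocab_size)   # word frequencies (uniform here)

def neg_sampling_loss(center_id, context_id):
    """Approximate -log P(context | center) using k negative samples."""
    negatives = rng.choice(vocab_size, size=k, p=unigram_probs)
    candidates = np.concatenate(([context_id], negatives))   # k + 1 words
    scores = W_out[candidates] @ W_in[center_id]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                      # softmax over k + 1 words
    return -np.log(probs[0])                                  # true context is at index 0

print(neg_sampling_loss(center_id=42, context_id=7))
```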
Table 2.2 Co-occurrence probabilities and their ratios for the target words ice and steam with the context words solid, gas, water, and fashion (values from the GloVe paper)

Probability and ratio      | k = solid | k = gas  | k = water | k = fashion
P(k | ice)                 | 1.9e-4    | 6.6e-5   | 3.0e-3    | 1.7e-5
P(k | steam)               | 2.2e-5    | 7.8e-4   | 2.2e-3    | 1.8e-5
P(k | ice) / P(k | steam)  | 8.9       | 8.5e-2   | 1.36      | 0.96
Methods like Skip-gram and CBOW are shallow window-based methods. These methods scan a context window across the entire corpus, which fails to take advantage of some global information. Global Vectors for Word Representation (GloVe), on the contrary, can capture corpus statistics directly.

As shown in Table 2.2, the meaning of a word can be learned from the co-occurrence matrix, and the ratio of co-occurrence probabilities can be especially useful. In the example, the meanings of ice and steam can be examined by studying the ratios of their co-occurrence probabilities with various probe words. For words related to ice but not steam, for example solid, the ratio P(solid | ice) / P(solid | steam) is large. Similarly, gas is related to steam but not ice, so P(gas | ice) / P(gas | steam) is small. For words that are relevant or irrelevant to both words, the ratio is close to 1.

Based on this idea, GloVe models

F(\mathbf{w}_i, \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}},    (2.21)

where \tilde{\mathbf{w}} \in \mathbb{R}^d are separate context word vectors, and P_{ij} is the probability of word j appearing in the context of word i, formally

P_{ij} = \frac{N_{ij}}{N_i},    (2.22)

where N_{ij} is the number of occurrences of word j in the context of word i, and N_i = \sum_k N_{ik} is the number of times any word appears in the context of word i.

F(\cdot) is supposed to encode the information presented in the ratio P_{ik}/P_{jk} in the word vector space. To keep the inherently linear structure, F should only depend on the difference of the two target words:

F(\mathbf{w}_i - \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}.    (2.23)

The arguments of F are vectors while the right side of the equation is a scalar; to avoid F obfuscating the linear structure, a dot product is used:

F\big( (\mathbf{w}_i - \mathbf{w}_j)^{\top} \tilde{\mathbf{w}}_k \big) = \frac{P_{ik}}{P_{jk}}.    (2.24)

The model should keep invariance under relabeling the target word and the context word. This requires F to be a homomorphism between the groups (\mathbb{R}, +) and (\mathbb{R}_{>0}, \times), whose solution is F = \exp. Then

\mathbf{w}_i^{\top} \tilde{\mathbf{w}}_k = \log N_{ik} - \log N_i.    (2.25)

To keep the exchange symmetry, \log N_i is eliminated by adding biases b_i and \tilde{b}_k. The model becomes

\mathbf{w}_i^{\top} \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log N_{ik},    (2.26)

which is significantly simpler than Eq. (2.21). The loss function is defined as

\mathscr{L} = \sum_{i,j=1}^{|V|} f(N_{ij}) \big( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log N_{ij} \big)^2,    (2.27)

where f(\cdot) is a weighting function:

f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise}. \end{cases}    (2.28)
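The following sketch (toy co-occurrence counts and hyperparameters are made up; it is not the official GloVe code) evaluates the weighted least-squares loss of Eqs. (2.27)-(2.28) for randomly initialized vectors; training would minimize this loss by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, dim = 6, 4
x_max, alpha = 100.0, 0.75

N = rng.integers(0, 50, size=(vocab_size, vocab_size)).astype(float)  # counts N_ij

W = rng.normal(scale=0.1, size=(vocab_size, dim))        # target word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(vocab_size, dim))  # context word vectors ~w_j
b = np.zeros(vocab_size)                                 # biases b_i
b_tilde = np.zeros(vocab_size)                           # biases ~b_j

def weight(x):
    """GloVe weighting function f(x), Eq. (2.28)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss():
    """Weighted least-squares loss of Eq. (2.27), summed over observed pairs."""
    loss = 0.0
    for i in range(vocab_size):
        for j in range(vocab_size):
            if N[i, j] > 0:   # only pairs that actually co-occur contribute
                diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(N[i, j])
                loss += weight(N[i, j]) * diff ** 2
    return loss

print(glove_loss())
```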
In natural language, the meaning of an individual word usually relates to its context in a sentence. For example:

• The central bank has slashed its forecast for economic growth this year from 4.1 to 2.6%.
• More recently, on a blazing summer day, he took me back to one of the den sites, in a slumping bank above the South Saskatchewan River.

In these two sentences, although the word bank is always the same, its meanings are different. However, most traditional word embeddings (CBOW, Skip-gram, GloVe, etc.) cannot well capture these nuances of word meaning under different surrounding texts. The reason is that these models only learn a unique representation for each word, and therefore it is impossible for them to capture how the meanings of words change based on their surrounding contexts.

To address this issue, [48] proposes ELMo, which uses a deep, bidirectional LSTM model to build word representations. ELMo can represent each word depending on the entire context in which it is used. More specifically, rather than looking up a word embedding matrix, ELMo converts words into low-dimensional vectors on-the-fly by feeding the word and its surrounding text into a deep neural network. ELMo utilizes a bidirectional language model to conduct word representation. Formally, given a sequence of N words, (w_1, w_2, \ldots, w_N), a forward language model (LM; the details of language models are in Sect. 4) models the probability of the sequence by predicting the probability of each word w_k according to the historical context:

P(w_1, w_2, \ldots, w_N) = \prod_{k=1}^{N} P(w_k \mid w_1, w_2, \ldots, w_{k-1}).    (2.29)

The forward LM in ELMo is a multilayer LSTM, and the j-th layer of the LSTM-based forward LM generates the context-dependent word representation \overrightarrow{\mathbf{h}}^{LM}_{k,j} for the word w_k. The backward LM is similar to the forward LM; the only difference is that it reverses the input word sequence to (w_N, w_{N-1}, \ldots, w_1) and predicts each word according to the future context:

P(w_1, w_2, \ldots, w_N) = \prod_{k=1}^{N} P(w_k \mid w_{k+1}, w_{k+2}, \ldots, w_N).    (2.30)

In the same way as the forward LM, the j-th backward LM layer generates the representation \overleftarrow{\mathbf{h}}^{LM}_{k,j} for the word w_k.

ELMo generates a task-specific word representation, which combines all layer representations of the bidirectional LM. Formally, it computes a task-specific weighting of all bidirectional LM layers:

\mathbf{w}_k = \alpha^{task} \sum_{j=0}^{L} s^{task}_j \mathbf{h}^{LM}_{k,j},    (2.31)

where s^{task} are softmax-normalized weights and \alpha^{task} is a scalar that scales the entire word vector for the task.
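The layer-combination step of Eq. (2.31) can be sketched as follows (shapes and weights are made up for illustration; this is not the ELMo implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
L, dim = 2, 8                        # number of biLM layers and representation size

# Hidden states h^{LM}_{k,j} of one word w_k from layers j = 0..L
# (layer 0 is the context-independent token representation).
h = rng.normal(size=(L + 1, dim))

s_raw = rng.normal(size=L + 1)       # task-specific layer scores (learned in practice)
s = np.exp(s_raw) / np.exp(s_raw).sum()   # softmax-normalized weights s^{task}_j
alpha = 1.0                          # task-specific scalar alpha^{task}

w_k = alpha * (s[:, None] * h).sum(axis=0)   # Eq. (2.31): weighted sum over layers
print(w_k.shape, w_k[:4])
```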
Besides these very popular toolkits, such as word2vec and GloVe, various works focus on different aspects of word representation and contribute numerous extensions. These extensions usually focus on the following directions.

With the success of word representation, researchers have begun to explore the theory of word representation. Some works attempt to give more theoretical analyses to justify existing tricks of word representation learning [39, 45], while other works discuss new learning methods [26, 61].
Reasonability. Word2vec and other similar tools are empirical methods of word representation learning. Many tricks are proposed in [43] to learn word representations from a large corpus efficiently, for example, negative sampling. Considering the effectiveness of these methods, more theoretical analyses are needed to justify these tricks. Reference [39] gives such an analysis: it formalizes the Skip-gram model with negative sampling as an implicit matrix factorization process. The Skip-gram model generates a word embedding matrix E and a context matrix C, where the size of the word embedding matrix E is |V| × m and each row of the context matrix C is the m-dimensional vector of a context word. The training process of Skip-gram is an implicit factorization of M = E C^{\top}, although C is not explicitly computed in word2vec. This work further shows that the matrix M is

M_{ij} = \mathbf{w}_i \cdot \mathbf{c}_j = \text{PMI}(w_i, c_j) - \log k,    (2.32)

where k is the number of negative samples and PMI(w, c) is the point-wise mutual information

\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) P(c)}.    (2.33)

The shifted PMI matrix can directly be used to compare the similarity of words. Another intuitive idea is to factorize the shifted PMI matrix directly. Reference [39] evaluates the performance of applying SVD matrix factorization to the implicit matrix M. Matrix factorization achieves a significantly better objective value when the embedding size is smaller than 500 dimensions and the number of negative samples is 1. With more negative samples and higher embedding dimensions, Skip-gram with negative sampling obtains a better objective value; this is because the number of zeros in M increases, and SVD prefers to factorize a matrix with minimum values. With 1,000-dimensional embeddings and various numbers of negative samples, SVD achieves slightly better performance on word analogy and word similarity, whereas Skip-gram with negative sampling achieves about 2% better performance on syntactic analogy.

Interpretability. Most existing distributional word representation methods generate a dense real-valued vector for each word. However, the word embeddings obtained by these models are hard to interpret. Reference [26] introduces non-negative and sparse embeddings, where the models are interpretable and each dimension indicates a unique concept. This method factorizes the corpus statistics matrix X \in \mathbb{R}^{|V| \times |D|} into a word embedding matrix E \in \mathbb{R}^{|V| \times m} and a document statistics matrix D \in \mathbb{R}^{m \times |D|}. Its training objective is

\arg\min_{E, D} \sum_{i=1}^{|V|} \| X_{i,:} - E_{i,:} D \|_2^2 + \lambda \| E_{i,:} \|_1, \quad \text{s.t.} \; D_{i,:} D_{i,:}^{\top} \le 1, \; \forall\, 1 \le i \le m, \quad E_{i,j} \ge 0, \; 1 \le i \le |V|, \; 1 \le j \le m.    (2.34)

By iteratively optimizing E and D via gradient descent, this model can learn non-negative and sparse embeddings for words. Since the embeddings are sparse and non-negative, the words with the highest scores in each dimension show high similarity, so each dimension can be viewed as a concept. To further improve the embeddings, this work also introduces phrasal-level constraints into the loss function. With the new constraints, it achieves both interpretability and compositionality.
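Returning to the reasonability analysis above, a minimal sketch of the shifted-PMI view (Eqs. 2.32-2.33) with made-up co-occurrence counts looks as follows; the resulting matrix could then be factorized by SVD as in [39].

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: context words).
counts = np.array([
    [10, 2, 0],
    [3, 8, 1],
    [0, 1, 6],
], dtype=float)
k = 5  # number of negative samples

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))           # Eq. (2.33)
shifted_pmi = pmi - np.log(k)                   # Eq. (2.32): M_ij = PMI - log k
shifted_pmi[np.isneginf(shifted_pmi)] = 0.0     # common practice: zero out -inf cells

# Factorizing M with a truncated SVD yields explicit word embeddings.
U, S, Vt = np.linalg.svd(shifted_pmi)
word_vecs = U[:, :2] * S[:2]
print(word_vecs)
```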
Using only one single vector to represent a word is problematic due to the ambiguity of words: a single vector cannot represent the multiple meanings of a word well, and it may lead to semantic confusion among the different senses of the word.

The multi-prototype vector space model [51] is proposed to better represent the different meanings of a word. In the multi-prototype vector space model, a mixture of von Mises-Fisher distributions (movMF) clustering method with first-order unigram contexts [5] is used to cluster the different meanings of a word. Formally, it assigns a different word representation \mathbf{w}_i(x) to the same word x in each different cluster i. With multi-prototype embeddings, the similarity between two words x, y can be computed straightforwardly. If the contexts of the words are not available, the similarity between two words is defined as

\text{AvgSim}(x, y) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big),    (2.35)

\text{MaxSim}(x, y) = \max_{1 \le i, j \le K} s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big),    (2.36)

where K is a hyperparameter indicating the number of clusters and s(\cdot) is a similarity function of two vectors, such as cosine similarity. When contexts are available, the similarity can be computed more precisely as

\text{AvgSimC}(x, y) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} s_{c,x,i} \, s_{c,y,j} \, s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big),    (2.37)

\text{MaxSimC}(x, y) = s\big( \hat{\mathbf{w}}(x), \hat{\mathbf{w}}(y) \big),    (2.38)

where s_{c,x,i} = s(\mathbf{w}_i(c), \mathbf{w}_i(x)) is the likelihood of context c belonging to cluster i, and \hat{\mathbf{w}}(x) = \mathbf{w}_{\arg\max_{1 \le i \le K} s_{c,x,i}}(x) is the maximum-likelihood cluster vector for x in context c. With multi-prototype embeddings, the accuracy on the word similarity task is significantly improved, but the performance is still sensitive to the number of clusters.

Although the multi-prototype embedding method can effectively cluster different meanings of a word via its contexts, the clustering is offline, and the number of clusters is fixed and needs to be predefined. It is difficult for such a model to select an appropriate number of meanings for different words, to adapt to new senses, new words, or new data, and to align the senses with prototypes. To address these problems, [12] proposes a unified model for word sense representation and word sense disambiguation. This model uses available knowledge bases such as WordNet [46] to determine the senses of a word. Each word and each sense has a single vector, and they are trained jointly. This model can learn representations of both words and senses, and two simple methods are proposed to perform disambiguation using the word and sense vectors.
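A small sketch of Eqs. (2.35)-(2.36) with made-up prototype vectors (K = 2 senses per word):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two prototype (sense) vectors per word, e.g., for an ambiguous word like "bank".
protos_x = np.array([[1.0, 0.0, 0.2], [0.1, 0.9, 0.0]])   # senses of word x
protos_y = np.array([[0.9, 0.1, 0.1], [0.0, 1.0, 0.3]])   # senses of word y

sims = np.array([[cosine(px, py) for py in protos_y] for px in protos_x])

avg_sim = sims.mean()   # AvgSim, Eq. (2.35): average over all K x K sense pairs
max_sim = sims.max()    # MaxSim, Eq. (2.36): best-matching sense pair
print(avg_sim, max_sim)
```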
There is much information residing inside words, which can be further utilized to improve the quality of word representations.
Using Character Information.
Many languages such as Chinese and Japanese have thousands of characters, and the words in these languages are composed of several characters. Characters in these languages carry richer semantic information compared with languages containing only dozens of characters. Hence, the meaning of a word can be learned not only from its contexts but also from the composition of its characters. Driven by this intuitive idea, [13] proposes a joint learning model for Character and Word Embeddings (CWE). In CWE, a word representation is a composition of a word embedding and its character embeddings. Formally,

\mathbf{x} = \mathbf{w} + \frac{1}{|w|} \sum_{i=1}^{|w|} \mathbf{c}_i,    (2.39)

where \mathbf{x} is the representation of the word, composed of the word vector \mathbf{w} and several character vectors \mathbf{c}_i, and |w| is the number of characters in the word. Note that this model can be integrated with various models such as Skip-gram, CBOW, and GloVe.

Further, position-based and cluster-based methods are proposed to address the issue that characters are highly ambiguous. In the position-based approach, each character is assigned three vectors, used when it appears at the beginning, middle, or end of a word, respectively. Since the meaning of a character varies with its position in a word, this method can significantly reduce the ambiguity problem. However, characters that appear in the same position may still have different meanings. In the cluster-based method, a character is assigned K different vectors for its different meanings, and a word's context is used to determine which vector to use.

By introducing character embeddings, the representations of low-frequency words can be significantly improved. Besides, this method can deal with new words where other methods fail. Experiments show that the joint learning method achieves better performance on both word similarity and word analogy tasks, and disambiguating characters with the position-based and cluster-based methods further improves the performance.

Using Morphology Information.
Many languages such as English have rich morphology and plenty of rare words. Most word representation models assign a distinct vector to each word and ignore the rich morphological information. This is a limitation because the affixes of a word can help infer its meaning, and morphological information is essential especially when a word appears in rare contexts.

To address this issue, [8] proposes to represent a word as a bag of morphological n-grams. This model substitutes the word vectors in Skip-gram with the sum of its n-gram vectors. When creating the dictionary of n-grams, all n-grams with a length between 3 and 6 are selected. To distinguish prefixes and suffixes from other affixes, special characters are added to indicate the beginning and the end of a word. This model is simple and efficient, and achieves good performance on word similarity and word analogy tasks, especially when the training set is small.

Reference [41] further uses a bidirectional LSTM to generate word representations by composing morphemes. This model does not use a look-up table to assign a distinct vector to each word, as independent word embedding methods do. Hence, it not only significantly reduces the number of parameters but also addresses some disadvantages of independent word embeddings. Moreover, the embeddings of words in this model can affect each other.
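The subword decomposition used by [8] can be sketched as follows (a simplified illustration, not the actual fastText implementation); the word vector would then be the sum of the vectors of these n-grams.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams of a word, with '<' and '>' marking its boundaries."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    grams.append(marked)   # the whole word is also kept as one unit
    return grams

# The word vector is the sum of the (learned) vectors of these n-grams.
print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ..., '<where>']
```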
Besides the internal information of words, there is much external knowledge that can help us learn better word representations.

Using Knowledge Base.
Some languages have rich internal information inside words, and meanwhile people have annotated many knowledge bases, which can be used in word representation learning to constrain the embeddings. Reference [62] introduces relation constraints into the CBOW model. With these constraints, the embeddings can not only predict their contexts but also predict words with relations. The objective is to maximize the sum of the log probabilities of all relations:

O = \frac{1}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log P(w \mid w_i),    (2.40)

where R_{w_i} indicates the set of words that have a relation with w_i. The joint objective is then defined as

O = \frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{j (|j-i| < l, j \ne i)}) + \frac{\beta}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log P(w \mid w_i),    (2.41)

where \beta is a hyperparameter. The external information helps to train better word representations, which show significant improvements on word similarity benchmarks.

Moreover, Retrofitting [19] introduces a post-processing step that incorporates knowledge bases into word representation learning. It is more modular than other approaches, which consider the knowledge base during training. Let the word embeddings learned by existing word representation approaches be E. Retrofitting attempts to find another embedding space \hat{E}, which is close to E but also respects the relations in the knowledge base. Formally,

\mathscr{L} = \sum_{i} \Big( \alpha_i \| \hat{\mathbf{w}}_i - \mathbf{w}_i \|^2 + \sum_{(i,j) \in R} \beta_{ij} \| \hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j \|^2 \Big),    (2.42)

where \alpha and \beta are hyperparameters indicating the strength of the associations and R is the set of relations in the knowledge base. The adapted embeddings \hat{E} can be optimized by several iterations of the following online updates:

\hat{\mathbf{w}}_i = \frac{ \sum_{\{j \mid (i,j) \in R\}} \beta_{ij} \hat{\mathbf{w}}_j + \alpha_i \mathbf{w}_i }{ \sum_{\{j \mid (i,j) \in R\}} \beta_{ij} + \alpha_i },    (2.43)

where \alpha_i is usually set to 1 and \beta_{ij} is \deg(i)^{-1} (\deg(\cdot) is a node's degree in the knowledge graph). With knowledge bases such as the paraphrase database [27], WordNet [46], and FrameNet [3], this model achieves consistent improvements on word similarity tasks, but it may also significantly reduce the performance on the analogy of syntactic relations. Since this module is a post-processing step over word embeddings, it is compatible with various distributed representation models.

In addition to the aforementioned synonym-based knowledge bases, there are also sememe-based knowledge bases, in which a sememe is defined as the minimum semantic unit of word meanings. HowNet [16] is one such knowledge base, which annotates each Chinese word with one or more relevant sememes. General knowledge injection methods do not apply to HowNet. As a result, [47] proposes a specific model to introduce HowNet into word representation learning.

Based on the Skip-gram model, [47] introduces sense and sememe embeddings to represent the target word w_i. More specifically, this model leverages context words, which are represented with the original word embeddings, as attention over the multiple senses of the target word w_i to obtain its new embedding:
\mathbf{w}_i = \sum_{k=1}^{|S^{(w_i)}|} \text{Att}\big( \mathbf{s}^{(w_i)}_k \big) \, \mathbf{s}^{(w_i)}_k,    (2.44)

where \mathbf{s}^{(w_i)}_k denotes the k-th sense embedding of w_i and S^{(w_i)} is the sense set of w_i. The attention term is as follows:

\text{Att}\big( \mathbf{s}^{(w_i)}_k \big) = \frac{ \exp\big( \mathbf{w}'_c \cdot \hat{\mathbf{s}}^{(w_i)}_k \big) }{ \sum_{n=1}^{|S^{(w_i)}|} \exp\big( \mathbf{w}'_c \cdot \hat{\mathbf{s}}^{(w_i)}_n \big) },    (2.45)

where \hat{\mathbf{s}}^{(w_i)}_k stands for the average of the sememe embeddings of sense s_k, i.e., \hat{\mathbf{s}}^{(w_i)}_k = \text{Avg}(\mathbf{x}^{(s_k)}), and \mathbf{w}'_c is the average of the context word embeddings, \mathbf{w}'_c = \text{Avg}(\mathbf{w}_j), |j-i| \le l, j \ne i.

This model shows a substantial advance in both word similarity and analogy tasks. Moreover, the introduced sense embeddings can also be used for word sense disambiguation.

Considering Document Information. Word embedding methods like Skip-gram simply consider the context information within a window to learn word representations. However, the information in the whole document can also help word representation learning. Topical Word Embeddings (TWE) [42] introduces topic information generated by Latent Dirichlet Allocation (LDA) to help distinguish different meanings of a word. The model is defined to maximize the following average log probability:

O = \frac{1}{N} \sum_{i=1}^{N} \sum_{-k \le c \le k, c \ne 0} \big( \log P(w_{i+c} \mid w_i) + \log P(w_{i+c} \mid z_i) \big),    (2.46)

where \mathbf{w}_i is the word embedding and \mathbf{z}_i is the topic embedding of w_i. Each word w_i is assigned a unique topic, and each topic has a topic embedding. The topical word embedding model shows advantages on contextual word similarity and document classification tasks.

However, TWE simply combines LDA with word embeddings and lacks statistical foundations, and the LDA topic model needs numerous documents to learn semantically coherent topics. Reference [40] further proposes the TopicVec model, which encodes words and topics in the same semantic space. TopicVec outperforms TWE and other word embedding methods on text classification datasets, and it can learn coherent topics from only one document, which is not possible for other topic models.

Human knowledge is organized in hierarchical structures. Recently, many works have also introduced the hierarchical structure of texts into word representation learning.
Dependency-based Word Representation.
Continuous word embeddings combine semantic and syntactic information. However, existing word representation models depend solely on linear contexts and exhibit more semantic than syntactic information. To make the embeddings carry more syntactic information, the dependency-based word embedding [38] uses dependency-based contexts. The dependency-based embeddings are less topical and exhibit more functional similarity than the original Skip-gram embeddings. The model takes the information of the dependency parse tree into consideration when learning word representations. The contexts of a target word w are the modifiers of this word, i.e., (m_1, r_1), \ldots, (m_k, r_k), where r_i is the type of the dependency relation between the head node and the modifier m_i. During training, the model optimizes the probability of dependency-based contexts rather than neighboring contexts. This model gains some improvements on word similarity benchmarks compared with Skip-gram, and experiments also show that words with syntactic similarity are closer in the vector space.

Semantic Hierarchies.
Because of the linear substructure of the vector space, word embeddings are able to make simple analogies; for example, the difference between Japan and Tokyo is similar to the difference between China and Beijing. But they have trouble identifying hypernym-hyponym relations, since these relations are complicated and do not necessarily have a linear substructure. To address this issue, [25] tries to identify hypernym-hyponym relations using word embeddings. The basic idea is to learn a linear projection, rather than simply using an embedding offset, to represent the relation. The model optimizes the projection as

\mathbf{M}^* = \arg\min_{\mathbf{M}} \frac{1}{N} \sum_{(i,j)} \| \mathbf{M} \mathbf{x}_i - \mathbf{y}_j \|^2,    (2.47)

where \mathbf{x}_i and \mathbf{y}_j are hypernym and hyponym embeddings, respectively. To further increase the capability of the model, they propose to first cluster word pairs into several groups and learn a linear projection for each group. The linear projections can help identify various hypernym-hyponym relations.

There are thousands of languages in the world. At the word level, how to represent words from different languages in a unified vector space is an interesting problem. The bilingual word embedding model [64] uses machine translation word alignments as constraining translational evidence and embeds the words of two languages into a single vector space. The basic idea is (1) to initialize each word according to its aligned words in the other language and (2) to constrain the distance between the two languages during training using translation pairs.

When learning bilingual word embeddings, the model first trains the source-language word embeddings. Then it uses aligned sentence pairs to count the co-occurrences of source and target words. The target word embeddings can be initialized as

\mathbf{E}_{t\text{-}init} = \sum_{s=1}^{S} \frac{N_{ts} + 1}{N_t + S} \mathbf{E}_s,    (2.48)

where \mathbf{E}_s and \mathbf{E}_{t\text{-}init} are the trained embedding of the source word and the initial embedding of the target word, respectively, N_{ts} is the number of times the target word is aligned with the source word, S is the set of all possible alignments of word t, and N_t + S normalizes the weights into a distribution. During training, the model jointly optimizes the word embedding objective as well as the bilingual constraint, which is defined as

\mathscr{L}_{cn \rightarrow en} = \| \mathbf{E}_{en} - \mathbf{N}_{en \rightarrow cn} \mathbf{E}_{cn} \|^2,    (2.49)

where \mathbf{N}_{en \rightarrow cn} contains the normalized alignment counts.

When given a lexicon of bilingual word pairs, [44] proposes a simple model that can learn bilingual word embeddings in a unified space. Based on the distributional geometric similarities of the word vectors of two languages, this model learns a linear transformation matrix \mathbf{T} that transforms the vector space of the source language into that of the target language. The training loss is

\mathscr{L} = \| \mathbf{T} \mathbf{E}_s - \mathbf{E}_t \|^2,    (2.50)

where \mathbf{E}_t is the word vector matrix of the aligned words in the target language.

However, this model performs badly when the seed lexicon is small. To tackle this limitation, some works introduce the idea of bootstrapping into bilingual word representation learning. Take [63] for example. In this work, in addition to monolingual word embedding learning and seed-lexicon-based bilingual word embedding alignment, a new matching mechanism is introduced. The main idea of matching is to find the most probably matched source (target) word for each target (source) word and make their embeddings closer. Next, we explain the target-to-source matching process formally; the source-to-target side is similar. The target-to-source matching loss function is defined as

\mathscr{L}_{T \rightarrow S} = - \log P\big( C^{(T)} \mid \mathbf{E}^{(S)} \big) = - \log \sum_{\mathbf{m}} P\big( C^{(T)}, \mathbf{m} \mid \mathbf{E}^{(S)} \big),    (2.51)

where C^{(T)} denotes the target corpus and \mathbf{m} is a latent variable specifying the matched source word for each target word.
Under an independence assumption, we have

P\big( C^{(T)}, \mathbf{m} \mid \mathbf{E}^{(S)} \big) = \prod_{w^{(T)}_i \in C^{(T)}} P\big( w^{(T)}_i, \mathbf{m} \mid \mathbf{E}^{(S)} \big) = \prod_{i=1}^{|V^{(T)}|} P\big( w^{(T)}_i \mid w^{(S)}_{m_i} \big)^{N_{w^{(T)}_i}},    (2.52)

where N_{w^{(T)}_i} is the number of occurrences of w^{(T)}_i in the target corpus. By training with a Viterbi EM algorithm, this method can improve bilingual word embeddings on its own and address the limitation of a small seed lexicon.
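For the linear-transformation approach of Eq. (2.50), a least-squares solution can be sketched as follows (toy embeddings and an assumed seed lexicon; not the original implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
dim, n_pairs = 4, 50

# Source-language embeddings of the seed-lexicon words.
E_s = rng.normal(size=(n_pairs, dim))
# Target-language embeddings of the aligned translations (here: a noisy rotation).
true_T = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
E_t = E_s @ true_T.T + 0.01 * rng.normal(size=(n_pairs, dim))

# Solve min_T ||E_s T^T - E_t||^2 by ordinary least squares (Eq. 2.50).
X, *_ = np.linalg.lstsq(E_s, E_t, rcond=None)
T = X.T

# Map a source word into the target space; it should land near its translation.
mapped = T @ E_s[0]
print(np.allclose(mapped, E_t[0], atol=0.1))
```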
In recent years, word representation learning has achieved great success and played a crucial role in NLP tasks. However, word representations learned on the general domain can still be limiting for a specific task, so researchers have begun to explore task-specific word representation learning. In this section, we take sentiment analysis as an example.

Word Representation for Sentiment Analysis. Most word representation methods capture syntactic and semantic information while ignoring the sentiment of text. This is problematic because words with similar syntactic roles but opposite sentiment polarities may obtain close word vectors. Reference [58] proposes to learn Sentiment-Specific Word Embeddings (SSWE) by integrating sentiment information. An intuitive idea is to jointly optimize a sentiment classification model that uses the word embeddings as features, and SSWE minimizes the cross-entropy loss to achieve this goal. To better combine the unsupervised word embedding method and the supervised discriminative model, they further use the words in a window, rather than a whole sentence, to classify sentiment polarity, and propose the following ranking-based loss:

\mathscr{L}_r(t) = \max\big( 0, 1 - s(t) f^r_0(t) + s(t) f^r_1(t) \big),    (2.53)

where f^r_0(t) and f^r_1(t) are the predicted positive and negative scores, and s(t) is an indicator function:

s(t) = \begin{cases} 1 & \text{if the gold polarity of } t \text{ is positive}, \\ -1 & \text{if the gold polarity of } t \text{ is negative}. \end{cases}    (2.54)

This loss function only punishes the model when it gives an incorrect result. To obtain massive training data, they use distant supervision to generate sentiment labels for documents. The increase of labeled data improves the sentiment information in the word embeddings. On sentiment classification tasks, sentiment embeddings outperform other strong baselines, including SVMs and other word embedding methods. SSWE also shows strong polarity consistency: the closest words of a word are more likely to have the same sentiment polarity compared with existing word representation models. This sentiment-specific word embedding method provides a general way to learn task-specific word embeddings: design a joint loss function and generate massive labeled data automatically.

The meaning of a word changes over time. Analyzing the changing meaning of a word is an exciting topic in both linguistic and NLP research. With the rise of word embedding methods, some works [29, 35] use embeddings to analyze the change of word meanings. They separate the corpus into bins with respect to years to train time-specific word embeddings and compare the embeddings of different time periods to analyze the change of word semantics. This method is intuitive but has some problems. Dividing the corpus into bins causes a data sparsity issue. The objective of word embedding methods is nonconvex, so different random initializations lead to different results, which makes comparing word embeddings difficult. Moreover, embeddings of a word in different years lie in different semantic spaces and cannot be compared directly; most works indirectly compare the meanings of a word at different times through the changes of the word's closest neighbors in the semantic space.

To address these issues, [4] proposes a dynamic Skip-gram model which connects several Bayesian Skip-gram models [6] using Kalman filters [33]. In this model, the embeddings of words in different periods can affect each other; for example, a word that appears in a document from the 1990s can affect the embeddings of that word in the 1980s and 2000s. Moreover, it trains the embeddings of different periods on the whole corpus to reduce the sparsity issue. This model also puts all the embeddings into the same semantic space, which is a significant improvement over other methods and makes word embeddings in different periods comparable. Therefore, the change of word embeddings in this model is continuous and smooth.
Experimental results show that the cosine distance between two words changes much more smoothly in this model than in models that simply divide the corpus into bins.
In recent years, various methods to embed words into a vector space have been proposed. Hence, it is essential to evaluate different methods. There are two general evaluations of word embeddings, word similarity and word analogy. They both aim to check whether the word distribution is reasonable. These two evaluations sometimes give different results; for example, CBOW achieves better performance on word similarity, whereas Skip-gram outperforms CBOW on word analogy. Therefore, which method to choose depends on the high-level application. Task-specific word embedding methods are usually designed for specific high-level tasks and achieve significant improvements on these tasks compared with baselines such as CBOW and Skip-gram. However, they only marginally outperform the baselines on the two general evaluations.
The dynamics of words are very complex and subtle. There is no static, finite set of relations that can describe all interactions between two words, and it is also not trivial for downstream tasks to leverage different kinds of word relations. A more practical way is to assign a score to a pair of words representing to what extent they are related. This measurement is called word similarity. When talking about the term word similarity, the precise meaning may vary a lot in different situations. There are several kinds of similarity that may be referred to in the literature.
Morphological similarity. Many languages, including English, define morphology. The same morpheme can have multiple surface forms according to its syntactic function. For example, the word active is an adjective and activeness is its noun version; the word activate is a verb and activation is its noun version. Morphology is an important dimension when considering the meaning and usage of words, and it defines some relations between words from a syntactic view. Some of these relations are used in the Syntactic Word Relationship test set [43], including adjectives to adverbs, past tense, and so on. However, in many higher-level applications and tasks, words are often morphologically normalized to their base forms (this process is also known as lemmatization). One widely used technique is the Porter stemming algorithm [49], which converts active, activeness, activate, and activation to the same root form activ. By removing morphological features, the semantic meaning of words is more emphasized.

Semantic similarity. Two words are semantically similar if they can express the same concept, or sense, like article and document. One word may have different senses, and each of its synonyms is associated with one or more of its senses. WordNet [46] is a lexical database that organizes words into groups according to their senses. Each group of words is called a synset, which contains all the synonymous words sharing one specific sense. The words within the same synset are considered semantically similar. Words from two synsets that are linked by a certain relation (such as hyponymy) are also considered semantically similar to some degree, like bank (river) and bank, where bank (river) is a hyponym of bank.

Semantic relatedness. Most modern literature that considers word similarity refers to the semantic relatedness of words. Semantic relatedness is more general than semantic similarity: words that are not semantically similar can still be related in many ways, such as meronymy (car and wheel) or antonymy (hot and cold). Semantic relatedness often yields co-occurrence, but the two are not equivalent; syntactic structure can also yield co-occurrence. Reference [10] argues that distributional similarity is not an adequate proxy for semantic relatedness.
Table 2.3 Datasets for evaluating word similarity/relatedness

Dataset              | Similarity Type
RG-65 [52]           | Word Similarity
WordSim-353 [22]     | Mixed
WordSim-353 REL [1]  | Word Relatedness
WordSim-353 SIM [1]  | Word Similarity
MTurk-287 [50]       | Word Relatedness
SimLex-999 [31]      | Word Similarity
To evaluate a word representation system intrinsically, the most popular approach is to collect a set of word pairs and compute the correlation between human judgments and system outputs. So far, many datasets have been collected and made public. Some datasets focus on word similarity, such as RG-65 [52] and SimLex-999 [31]; other datasets concern word relatedness, such as MTurk [50]. WordSim-353 [22] is a very popular dataset for word representation evaluation, but its annotation guideline does not differentiate similarity and relatedness very clearly. Reference [1] conducts another round of annotation based on WordSim-353 and generates two subsets, one for similarity and the other for relatedness. Some information about these datasets is summarized in Table 2.3.

To evaluate the similarity of two distributed word vectors, researchers usually select cosine similarity as the metric. The cosine similarity of words w and v is defined as

\text{sim}(w, v) = \frac{\mathbf{w} \cdot \mathbf{v}}{\| \mathbf{w} \| \| \mathbf{v} \|}.    (2.55)

When evaluating a word representation approach, the similarity of each word pair is first computed using cosine similarity. After that, Spearman's correlation coefficient \rho is used to evaluate the agreement between the human annotations and the word representation model:

\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)},    (2.56)

where d_i is the rank difference of the i-th word pair and n is the number of word pairs; a higher Spearman's correlation coefficient indicates closer agreement with human judgments.

Reference [10] describes a series of WordNet-based methods to evaluate the similarity of a pair of words. After comparing the traditional WordNet-based methods and distributed word representations, [1] points out that relatedness and similarity are two different concerns: WordNet-based methods perform better on similarity than on relatedness, while distributed word representations show similar performance on both. A series of distributed word representations are compared on a wide variety of datasets in [56]; the state of the art on both similarity and relatedness is achieved by distributed representations, without a doubt.

This evaluation method is simple and straightforward. However, as stated in [20], there are several problems with it. Since the datasets are small (fewer than 1,000 word pairs in each dataset), one system may yield very different scores on different partitions; testing on the whole dataset makes it easy to overfit and hard to compute statistical significance. Moreover, the performance of a system on these datasets may not correlate well with its performance on downstream tasks.

The word similarity measurement can also come in an alternative format, the TOEFL synonyms test. In this test, a cue word is given, and the system is required to choose its synonym from four candidate words. The exciting part of this task is that the performance of a system can be compared with human beings. Reference [37] evaluates a system with the TOEFL synonyms test to address the knowledge inquiring and representing of LSA; the reported score is 64.4%, which is very close to the average score of human test-takers. On this test set with 80 queries, [54] reports a score of 72.0%. Reference [24] extends the original dataset with the help of WordNet and generates a new dataset (named the WordNet-based synonymy test) containing thousands of queries.
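The evaluation protocol of Eqs. (2.55)-(2.56) can be sketched as follows (made-up vectors and human scores; scipy is assumed to be available for the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical embeddings and human-annotated relatedness scores for word pairs.
emb = {
    "cat":   np.array([0.9, 0.1, 0.3]),
    "dog":   np.array([0.8, 0.2, 0.4]),
    "car":   np.array([0.1, 0.9, 0.2]),
    "truck": np.array([0.2, 0.8, 0.1]),
}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 2.0)]

def cosine(u, v):                      # Eq. (2.55)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [gold for _, _, gold in pairs]

rho, _ = spearmanr(model_scores, human_scores)   # Spearman's rho, Eq. (2.56)
print("Spearman correlation:", rho)
```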
Besides word similarity, the word analogy task is an alternative way to measure how well representations capture the semantic meanings of words. This task gives three words w_1, w_2, and w_3, and requires the system to predict a word w_4 such that the relation between w_3 and w_4 is the same as that between w_1 and w_2. The task has been used since [43, 45] to probe the structural relationships among words. The word relations can be divided into two categories, semantic relations and syntactic relations. This is a relatively recent method for word representation evaluation, but it quickly became a standard evaluation metric after the dataset was released. Unlike the TOEFL synonyms test, most words in this dataset are frequent across all kinds of corpora, and the fourth word is chosen from the whole vocabulary instead of from four options. This test favors distributed word representations because it emphasizes the structure of the word space. Comparisons between different models on the word analogy task, measured by accuracy, can be found in [7, 56, 57, 61].

In this chapter, we first introduce word representation methods, including one-hot representation and various distributed representation methods. These classical methods are an important foundation of many NLP models, and they also present the major concepts and mechanisms of word representation learning to the reader. Next, considering that classical word representation methods often suffer from word polysemy, we further introduce an effective contextualized word representation method, ELMo, to show how to capture complex word features across different linguistic contexts. As word representation methods are widely utilized in various downstream tasks, we then overview numerous extensions toward some representative directions and discuss how to adapt word representations to specific scenarios. Finally, we introduce several evaluation tasks for word representation, including word similarity and word analogy, which are the basic experimental settings for research on word representation methods.

In the past decade, learning methods and applications of word representation have been studied in depth. Here we recommend some surveys and books on word representation learning for further reading:

• Erk. Vector Space Models of Word Meaning and Phrase Meaning: A Survey [18].
• Lai et al. How to Generate a Good Word Embedding [36].
• Camacho-Collados et al. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning [11].
• Ruder et al. A Survey of Cross-lingual Word Embedding Models [53].
• Bakarov. A Survey of Word Embeddings Evaluation Methods [2].

In the future, toward more effective word representation learning, some directions require further efforts:

(1)
Utilizing More Knowledge . Current word representation learning models focuson representing words based on plain textual corpora. In fact, besides richsemantic information in text, there are also various kinds of word-related infor-mation hidden in heterogeneous knowledge in the real world, such as visualknowledge, factual knowledge, and commonsense knowledge. Some prelimi-nary explorations have attempted [59, 60] to utilize heterogeneous knowledgefor learning better word representations, and these explorations indicate thatutilizing more knowledge is a promising direction toward enhancing word rep-resentations. There remain open problems for further explorations.(2)
Considering More Contexts . As shown in this chapter, those word representa-tion learning methods considering contexts can achieve more expressive wordembeddings, which can grasp richer semantic information and further bene-fit downstream NLP tasks than classical distributed methods. Context-awareword representations have been systematically verified for their effectivenessin existing works [32, 48], and adopting those context-aware word representa-tions has also become a necessary and mainstream operation for various NLPtasks. After BERT [15] has been proposed, language models pretrained on large-scale corpora have entered the public vision and their fine-tuning models havealso achieved the state-of-the-art performance on specific NLP tasks. These newexplorations based on large-scale textual corpora and pretrained fine-tuning lan-guage representation architectures indicate a promising direction to considermore contexts with more powerful representation architectures, and we will dis-cuss them more in the next chapter. (3)
Orienting Finer Granularity . Polysemy is a widespread phenomenon forwords. Hence, it is essential and meaningful to consider the finer granulatedsemantic information than the words themselves. As some linguistic knowledgebases have been developed, such as synonym-based knowledge bases Word-Net [21] and sememe-based knowledge bases HowNet [17], we thus have waysto study the atomic semantics of words. The current work on word representa-tions learning is coarse-grained, and mainly focuses on shallow semantics of thewords themselves in text, and ignores the rich semantic information inside thewords, which is also an important resource for achieving better word embed-dings. Reference [28] explores to inject finer granulated atomic semantics ofwords into word representations and performs much better language understand-ing. Although these explorations are still preliminary, orienting finer granularityof word representations is important. In the next chapter, we will also introducemore details in this part.In the past decade, learning methods and applications of distributed representationhave been studied in depth. Because of its efficiency and effectiveness, lots of task-specific models have been proposed for various tasks. Word representation learninghas become a popular and important topic in NLP. However, word representationlearning is still challenging due to its ambiguity, data sparsity, and interpretability. Inrecent years, word representation learning has been no longer studied in isolation, butexplored together with sentence or document representation learning using pretrainedlanguage models. Readers are recommended to refer to the following chapters tofurther learn the integration of word representations in other scenarios.
References
1. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pa¸sca, and Aitor Soroa.A study on similarity and relatedness using distributional and wordnet-based approaches. In
Proceedings of HLT-NAACL , 2009.2. Amir Bakarov. A survey of word embeddings evaluation methods. arXivpreprint arXiv:1801.09536, 2018.3. Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. In
Proceedings of ACL , 1998.4. Robert Bamler and Stephan Mandt. Dynamic word embeddings via skip-gram filtering. arXivpreprint arXiv:1702.08359, 2017.5. Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unithypersphere using von mises-fisher distributions.
Journal of Machine Learning Research , 2005.6. Oren Barkan. Bayesian neural word embedding. In
Proceedings of AAAI , 2017.7. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Dont count, predict a systematiccomparison of context-counting vs. context-predicting semantic vectors. In
Proceedings ofACL , 2014.8. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectorswith subword information.
Transactions of the Association for Computational Linguistics ,5:135–146, 2017.9. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.Class-based n-gram models of natural language.
Computational Linguistics, 18(4):467–479, 1992.
10. Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness.
Computational Linguistics , 32(1):13–47, 2006.11. Jose Camacho-Collados and Mohammad Taher Pilehvar. From word to sense embeddings:A survey on vector representations of meaning.
Journal of Artificial Intelligence Research ,63:743–788, 2018.12. Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representationand disambiguation. In
Proceedings of EMNLP , 2014.13. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning ofcharacter and word embeddings. In
Proceedings of IJCAI , 2015.14. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A.Harshman. Indexing by latent semantic analysis.
Journal of the American Society for Information Science, 41(6):391–407, 1990.
15. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of NAACL , 2019.16. Zhendong Dong and Qiang Dong. Hownet-a hybrid language and knowledge resource. In
Proceedings of NLP-KE , 2003.17. Zhendong Dong and Qiang Dong.
HowNet and the Computation of Meaning (With CD-Rom) .World Scientific, 2006.18. Katrin Erk. Vector space models of word meaning and phrase meaning: A survey.
Languageand Linguistics Compass , 6(10):635–653, 2012.19. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah ASmith. Retrofitting word vectors to semantic lexicons. In
Proceedings of NAACL-HLT , 2015.20. Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. Problems with evaluationof word embeddings using word similarity tasks. In
Proceedings of RepEval , 2016.21. Christiane Fellbaum. Wordnet.
The encyclopedia of applied linguistics , 2012.22. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman,and Eytan Ruppin. Placing search in context: The concept revisited. In
Proceedings of WWW ,2001.23. John R Firth. A synopsis of linguistic theory, 1930–1955. 1957.24. Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer,and Zhiqiang Wang. New experiments in distributional representations of synonymy. In
Pro-ceedings of CoNLL , 2005.25. Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. Learning semantichierarchies via word embeddings. In
Proceedings of ACL , 2014.26. Alona Fyshe, Leila Wehbe, Partha Pratim Talukdar, Brian Murphy, and Tom M Mitchell. Acompositional and interpretable semantic space. In
Proceedings of HLT-NAACL , 2015.27. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. Ppdb: The paraphrasedatabase. In
Proceedings of HLT-NAACL , 2013.28. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and LeyuLin. Language modeling with sparse product of sememe experts. In
Proceedings of EMNLP ,pages 4642–4651, 2018.29. William L Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings revealstatistical laws of semantic change. In
Proceedings of ACL , 2016.30. Zellig S Harris. Distributional structure.
Word , 10(2–3):146–162, 1954.31. Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with(genuine) similarity estimation.
Computational Linguistics , 2015.32. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classifi-cation. In
Proceedings of ACL , pages 328–339, 2018.33. Rudolph Emil Kalman et al. A new approach to linear filtering and prediction problems.
Journalof Basic Engineering , 82(1):35–45, 1960.34. Pentti Kanerva, Jan Kristofersson, and Anders Holst. Random indexing of text samples forlatent semantic analysis. In
Proceedings of CogSci , 2000.35. Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis oflanguage through neural language models. In
Proceedings of the ACL Workshop, 2014.
36. Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to generate a good word embedding.
IEEE Intelligent Systems , 31(6):5–14, 2016.37. Thomas K Landauer and Susan T Dumais. A solution to plato’s problem: The latent seman-tic analysis theory of acquisition, induction, and representation of knowledge.
Psychologicalreview , 104(2):211, 1997.38. Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In
Proceedings of ACL ,2014.39. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In
Proceedings of NeurIPS , 2014.40. Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. Generative topic embedding: acontinuous representation of documents. In
Proceedings of ACL , 2016.41. Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramón Fermandez, Silvio Amir, LuisMarujo, and Tiago Luís. Finding function in form: Compositional character models for openvocabulary word representation. In
Proceedings of EMNLP , 2015.42. Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In
Proceedings of AAAI , 2015.43. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of wordrepresentations in vector space. In
Proceedings of ICLR , 2013.44. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages formachine translation. arXiv preprint arXiv:1309.4168, 2013.45. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous spaceword representations. In
Proceedings of HLT-NAACL , 2013.46. George A Miller. Wordnet: a lexical database for english.
Communications of the ACM ,38(11):39–41, 1995.47. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Improved word representation learn-ing with sememes. In
Proceedings of ACL , 2017.48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,and Luke Zettlemoyer. Deep contextualized word representations. In
Proceedings of NAACL-HLT , pages 2227–2237, 2018.49. Martin F Porter. An algorithm for suffix stripping.
Program , 14(3):130–137, 1980.50. Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at atime: computing word relatedness using temporal semantic analysis. In
Proceedings of WWW ,2011.51. Joseph Reisinger and Raymond J Mooney. Multi-prototype vector-space models of word mean-ing. In
Proceedings of HLT-NAACL , 2010.52. Herbert Rubenstein and John B Goodenough. Contextual correlates of synonymy.
Communi-cations of the ACM , 8(10):627–633, 1965.53. Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard. A survey of cross-lingual word embeddingmodels.
Journal of Artificial Intelligence Research , 65:569–631, 2019.54. Magnus Sahlgren. Vector-based semantic analysis: Representing word meanings based onrandom labels. In
Proceedings of Workshop on SKAC , 2001.55. Magnus Sahlgren. An introduction to random indexing. In
Proceedings of TKE , 2005.56. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods forunsupervised word embeddings. In
Proceedings of EMNLP , 2015.57. Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. Learning word representationsby jointly modeling syntagmatic and paradigmatic relations. In
Proceedings of ACL , 2015.58. Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In
Proceedings of ACL , 2014.59. Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and MichaelGamon. Representing text for joint embedding of text and knowledge bases. In
Proceedings ofthe EMNLP , pages 1499–1509, 2015.60. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointlyembedding. In
Proceedings of EMNLP, pages 1591–1601, 2014.
61. Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A Smith. Learning word representations with hierarchical sparse coding. In
Proceedings of ICML , 2015.62. Mo Yu and Mark Dredze. Improving lexical embeddings with semantic knowledge. In
Pro-ceedings of ACL , 2014.63. Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. Bilingual lexiconinduction from non-parallel data with minimal supervision. In
Proceedings of AAAI , 2017.64. Will Y Zou, Richard Socher, Daniel M Cer, and Christopher D Manning. Bilingual wordembeddings for phrase-based machine translation. In
Proceedings of EMNLP , 2013.
Chapter 3
Compositional Semantics
Abstract
Many important applications in NLP fields rely on understanding morecomplex language units such as phrases, sentences, and documents beyond words.Therefore, compositional semantics has remained a core task in NLP. In this chapter,we first introduce various models for binary semantic composition, including additivemodels and multiplicative models. After that, we present various typical models forN-ary semantic composition including recurrent neural network, recursive neuralnetwork, and convolutional neural network.
As shown in the previous chapter, following the distributional hypothesis, one can project the semantic meaning of a word into a low-dimensional real-valued vector according to its context information; such vectors are known as word vectors. A further problem then arises: how to compress a higher level semantic unit into a vector, or into other kinds of mathematical representations such as a matrix or a tensor. In other words, using representation learning to model a semantic composition function remains an open and actively studied research topic.

Compositionality enables natural languages to construct complex semantic meanings from combinations of simpler semantic elements. This property is often captured by the following principle: the semantic meaning of a whole is a function of the semantic meanings of its several parts. Therefore, the semantic meanings of complex structures depend on how their semantic elements combine.

Here we consider the composition of two semantic units, denoted as u and v, respectively. The most intuitive way to define the joint representation is

p = f(u, v), (3.1)

where p corresponds to the representation of the joint semantic unit (u, v). It should be noted that u and v can denote words, phrases, sentences, paragraphs, or even higher level semantic units.
However, given the representations of two semantic constituents, it is not sufficient to derive their joint embedding without syntactic information. For instance, although the phrases machine learning and learning machine contain the same words, they convey different meanings: machine learning refers to a research field in artificial intelligence, while learning machine refers to specific learning algorithms. This phenomenon stresses the importance of syntactic and order information in semantic composition. Reference [12] takes the role of syntactic and order information into consideration and suggests a refinement of the above principle: the meaning of a whole is a function of the meanings of its several parts and the way they are syntactically combined. The composition function in Eq. (3.1) is therefore redefined to include the syntactic relationship R between the semantic units u and v:

p = f(u, v, R), (3.2)

where R denotes the syntactic relationship between the two constituent semantic units.

Unfortunately, even this formulation may not be fully adequate. Reference [7] claims that the meaning of a whole is greater than the meanings of its several parts. This implies that constructing complex meanings involves more than simply understanding the meanings of the parts and their syntactic relations. In real language, the same sentence can have different meanings in different contexts, and some sentences are hard to understand without background information. For example, the sentence Tom and Jerry is one of the most popular comedies in that style. requires two pieces of background knowledge: first, Tom and Jerry is a noun phrase or knowledge entity that refers to a cartoon comedy rather than two ordinary people; second, that style needs further explanation from the preceding sentences. Hence, a full understanding of compositional semantics needs to take existing knowledge into account. An argument K is therefore added to the composition function, incorporating knowledge information as a prior in the compositional process:

p = f(u, v, R, K), (3.3)

where K represents the background knowledge.

Reference [4] claims that we should never ask for the meaning of a word in isolation, but only in the context of a statement. That is, the meaning of a whole is constructed from its parts, while the meanings of the parts are in turn derived from the whole. Moreover, compositionality is a matter of degree rather than a binary notion. Linguistic structures range from fully compositional expressions (e.g., black hair), to partly compositional, syntactically fixed expressions (e.g., take advantage), in which the constituents can still be assigned separate meanings, to non-compositional idioms (e.g., kick the bucket) or multiword expressions (e.g., by and large), whose meaning cannot be distributed across their constituents [11].

From the three equations above, it can be concluded that composition can be viewed as a binary operation, but also as more than that: the syntactic information helps to indicate a particular composition approach, while background knowledge helps to explain obscure words or context-dependent entities such as pronouns. Beyond binary compositional operations, one can build sentence-level composition by applying binary composition operations recursively. In this chapter, we will first explain several basic binary composition functions in both the semantic vector space and the matrix-vector space. Afterward, we will move on to more complex composition scenarios and introduce several approaches to model sentence-level composition.
In general, the central task in semantic representation is projecting words from anabstract semantic space to a mathematical low-dimensional space. As introduced inthe previous chapters, to make the transformation reasonable, the purpose is to main-tain the word similarity in this new projected space. In other words, the more similarthe words are, the closer their vectors should be. For instance, we hope the wordvectors w ( book ) and w ( magazine ) are close while the word vectors w ( apple ) and w ( computer ) are far away. In this chapter, we will introduce several widely usedtypical semantic vector space including one-hot representation, distributed represen-tation, and distributional representation. Despite the wide use of semantic vector spaces, an alternative semantic space isproposed to be a more powerful and general compositional semantic framework.Different from conventional vector spaces, matrix-vector semantic space utilizes amatrix to represent the word meaning rather than a skinny vector. The motivationbehind this is when modeling the semantic meaning under a specific context, one iswondering not only what is the meaning of each word, but also the holistic meaningof the whole sentence. Thus, we concern about the semantic transformation betweenadjacent words inside each sentence. However, the semantic vector space could notcharacterize the semantic transformation of one word on the others explicitly.Driven by the idea of modeling semantic transformation, some researchers haveproposed to use a matrix to represent the transformation operation of one word on theothers. Different from those vector space models, it could incorporate some structuralinformation like the word order and syntax composition.
The goal is to construct vector representations for phrases, sentences, paragraphs, and documents. Without loss of generality, we assume that each constituent of a phrase (sentence, paragraph, or document) is embedded into a vector, and these vectors are subsequently combined in some way to generate a representation vector for the phrase (sentence, paragraph, or document).

In this section, we focus on binary composition and take phrases consisting of a head and a modifier or complement as an example. If we cannot model binary composition (i.e., phrase representation), there is little hope of constructing more complex compositional representations for sentences or even documents. Given a phrase such as "machine learning" and the vectors u and v representing the constituents "machine" and "learning", respectively, we aim to produce a representation vector p of the whole phrase. In the following, we treat u = w(machine) and v = w(learning) as hypothetical low-dimensional vectors; this simplified semantic space serves to illustrate the composition functions considered in this section.

The fundamental problem in representing a two-word phrase is designing a primitive composition function as a binary operator. Based on this function, one can apply it to a word sequence recursively and derive sentence-level composition, where a word sequence can be a semantic unit of any level, such as a phrase, a sentence, a paragraph, a knowledge entity, or even a document.

From the previous section, a basic formulation of semantic composition f is

p = f(u, v, R, K), (3.4)

where u and v denote the representations of the constituent parts, p denotes the joint representation, R indicates the syntactic relationship, and K indicates the necessary background knowledge. This expression defines a wide class of composition functions. For easier discussion, we impose some constraints to narrow the space of functions under consideration. First, we ignore the background knowledge K to explore what can be achieved without any background or world knowledge. Second, regarding the syntactic relation R, we can proceed by investigating only one relation at a time, and then remove any explicit dependence on R, which allows us to explore a distinct composition function for each syntactic relation. That is, we simplify the formula to p = f(u, v) by ignoring the background knowledge and the relationship. Note that the problem of combining semantic vectors of small units into a representation of a multiword sequence is different from the problem of incorporating information about multiword contexts into a distributional representation of a single target word.
In recent years, modeling the binary composition function is a well-studied butstill challenging problem. There are mainly two perspectives toward this question,including the additive model and the multiplicative model.
The additive model has a constraint in that it assumes p, u, and v lie in the same semantic space, which essentially means that all syntactic types have the same dimension. One of the simplest choices is to use the sum as the joint representation:

p = u + v. (3.5)

According to Eq. (3.5), the representation of "machine learning" is simply w(machine) + w(learning). This assumes that the composition is a symmetric function of the constituents; in other words, it does not consider the order of the constituents. Although it has many drawbacks, such as the inability to model word order and the absence of background syntactic or knowledge information, this approach still provides a relatively strong baseline [9].

To mitigate the word order issue, an easy variant is to apply a weighted sum instead of uniform weights, i.e., the composition takes the form

p = αu + βv, (3.6)

where α and β correspond to different weights for the two vectors. Under this setting, the two sequences (u, v) and (v, u) have different representations, which is consistent with real language phenomena. For example, "machine learning" and "learning machine" have different meanings and thus require different representations. In this setting, we can also give greater emphasis to heads than to other constituents, for instance by assigning a larger weight (such as 0.7) to the head and a smaller weight to the modifier, so that the head contributes more to the phrase representation.

However, this model still cannot consider prior knowledge and syntax information. To incorporate prior information into the additive model, one method combines nearest-neighborhood semantics into the composition, deriving

p = u + v + Σ_{i=1}^{K} n_i, (3.7)

where n_1, n_2, ..., n_K denote the semantic neighbors of v. This method thus ensembles the synonyms of a component as a smoothing factor in the composition function, which reduces the variance of language. For example, if in the composition of "machine" and "learning" the chosen neighbor is "optimizing", the representation of "machine learning" becomes w(machine) + w(learning) + w(optimizing).

Since the joint representations of an additive model still lie in the same semantic space as their component vectors, it is natural to use cosine similarity to measure their semantic relationships. Under the naive additive model, we have the following similarity equation:

s(p, w) = (p · w) / (‖p‖ ‖w‖) = ((u + v) · w) / (‖u + v‖ ‖w‖) (3.8)
        = (‖u‖ / ‖u + v‖) s(u, w) + (‖v‖ / ‖u + v‖) s(v, w), (3.9)

where w denotes any other word in the vocabulary and s indicates the similarity function. From this derivation, it can be concluded that the composition function combines both the magnitudes and the directions of the two component vectors; in other words, if one vector dominates in magnitude, it will also dominate the similarity. Furthermore, we have

‖p‖ = ‖u + v‖ ≤ ‖u‖ + ‖v‖. (3.10)

This suggests that when a semantic unit with a deeper parse tree is combined with a shallow unit, the deeper unit tends to determine the joint representation, because the deeper the semantic unit is, the larger its magnitude tends to be.

Moreover, from a geometric point of view, the additive model can be given a more solid interpretation. Supposing that our component vectors are u and v, the additive model projects v into x and y, where x follows the direction of u while y is orthogonal to u, as illustrated in Fig. 3.1.
Fig. 3.1 An illustration of the additive model
From the figure, the vectors x and y can be written as

x = ((u · v) / (u · u)) u,   y = v − x = v − ((u · v) / (u · u)) u. (3.11)

Then, a linear combination of these two new vectors x and y yields a new additive model:

p = αx + βy (3.12)
  = α ((u · v) / (u · u)) u + β (v − ((u · v) / (u · u)) u) (3.13)
  = (α − β) ((u · v) / (u · u)) u + βv. (3.14)

Furthermore, using the cosine similarity measurement, the relationship can be written as

s(p, w) = (|α − β| / |α|) s(u, w) + (|β| / |α|) s(v, w). (3.15)

This derivation indicates that with this projection method, the composition similarity can be viewed as a linear combination of the similarities of the two components, which means that when combining semantic units of different semantic depths, the deeper one will not dominate the representation.

Though the additive model achieves great success in semantic composition, the simplification it adopts may be too restrictive, because it assumes that all words, phrases, sentences, and documents are similar enough to be represented in a unified semantic space. Different from the additive model, which regards composition as a simple linear transformation, the multiplicative model aims to capture higher order interactions. Among the models from this perspective, the most intuitive approach applies the pairwise (element-wise) product as an approximation of the composition function:

p = u ⊙ v, (3.16)

where p_i = u_i · v_i, which implies that each dimension of the output depends only on the corresponding dimension of the two input vectors. However, similar to the simplest additive model, this model also suffers from the inability to model word order and the absence of background syntactic or knowledge information.

In the additive model, we had p = αu + βv to alleviate the word order issue. Note that α and β there are two scalars, which can easily be generalized to two matrices, so that the composition function becomes

p = W_α u + W_β v, (3.17)

where W_α and W_β are matrices that determine the importance of u and v to p. With this expression, the composition is more expressive and flexible, although much harder to train.

Generalizing the multiplicative model further, another approach utilizes tensors as multiplicative descriptors, and the composition function can be written as

p = W · u v, (3.18)

where W denotes a 3-order tensor, i.e., p_k = Σ_{i,j} W_{ijk} · u_i · v_j. Hence, in this model each element of p can be influenced by all elements of both u and v through a linear combination, with a unique weight assigned to each pair (i, j).

Starting from this simple but general baseline, some researchers proposed to make the function asymmetric in order to account for word order. Paying more attention to the first element, the composition function can be

p = W · u u v, (3.19)

where W here denotes a 4-order tensor. This method can be understood as replacing the linear transformation of u with a quadratic one, asymmetrically, and is thus a variant of the tensor-based multiplicative compositional model.

Different from expanding the simple multiplicative model into more complex ones, other approaches aim to reduce the parameter space. With a smaller parameter size, composition becomes much more efficient than the O(n³) cost of the tensor-based model.
Thus, some compression techniques can be applied to the original tensor model. One representative instance is the circular convolution model:

p = u ⊛ v, (3.20)

where ⊛ represents the circular convolution operation defined as

p_i = Σ_j u_j · v_{i−j}, (3.21)

with the indices taken modulo the vector dimension. If we assign each pair of dimensions a unique weight, the composition function becomes

p_i = Σ_j W_{ij} · u_j · v_{i−j}. (3.22)

Note that the circular convolution model can be viewed as a special instance of a tensor-based composition model: if we write the circular convolution in tensor form, we have W_{ijk} = 0 whenever k ≠ i + j. Thus, the number of parameters can be reduced from n³ to n², while maintaining the interactions between each pair of dimensions in the input vectors.

In both the additive and multiplicative models, the basic assumption is that all components lie in the same semantic space as the output. Nevertheless, modeling different types of words in different semantic spaces can bring a different perspective. For instance, given (u, v), the tensor-based multiplicative model can be reformulated as

p = W · u v = U v, (3.23)

where U = W · u. This implies that the left unit can be treated as an operation on the representation of the right one: the left unit is formulated as a transformation matrix, while the right one is represented as a semantic vector. This view can be meaningful, especially for some kinds of phrase composition. Reference [2] argues that for ADJ-NOUN phrases, the joint semantic information can be viewed as the conjunction of the semantic meanings of the two components. Given the phrase red car, its semantic meaning is the conjunction of all red things and all different kinds of cars. Thus, red can be formulated as an operator on the vector of car, deriving a new semantic vector that expresses the meaning of red car. These observations lead to another genre of semantic compositional modeling: the semantic matrix-composition space.
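Before moving on to N-ary composition, the following NumPy sketch collects the binary composition functions discussed above: the additive models of Eqs. (3.5)-(3.7), the element-wise product of Eq. (3.16), and the circular convolution of Eqs. (3.20)-(3.21). The example vectors, neighbor vector, and weights are purely illustrative.

```python
import numpy as np

def additive(u, v):
    return u + v                                   # Eq. (3.5)

def weighted_additive(u, v, alpha=0.3, beta=0.7):
    return alpha * u + beta * v                    # Eq. (3.6), asymmetric in word order

def smoothed_additive(u, v, neighbors):
    return u + v + np.sum(neighbors, axis=0)       # Eq. (3.7), neighbors of v as smoothing

def elementwise_product(u, v):
    return u * v                                   # Eq. (3.16)

def circular_convolution(u, v):
    # Eqs. (3.20)-(3.21): p_i = sum_j u_j * v_{(i-j) mod n}
    n = len(u)
    return np.array([sum(u[j] * v[(i - j) % n] for j in range(n)) for i in range(n)])

# Hypothetical 5-dimensional vectors for "machine" and "learning" (illustrative values).
u = np.array([0.2, 0.5, 0.1, 0.9, 0.3])
v = np.array([0.7, 0.1, 0.4, 0.2, 0.8])
neighbors = [np.array([0.6, 0.2, 0.3, 0.1, 0.7])]  # e.g., a vector for "optimizing"

for f in (additive, weighted_additive, elementwise_product, circular_convolution):
    print(f.__name__, f(u, v))
print("smoothed_additive", smoothed_additive(u, v, neighbors))
```

The matrix- and tensor-based variants of Eqs. (3.17)-(3.19) follow the same pattern with learned parameter matrices or tensors in place of the fixed scalar weights.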
In real-world NLP tasks, the input is usually a sequence of multiple words rather than just a pair of words. Therefore, besides designing a suitable binary compositional operator, the order in which the binary operations are applied is also important. In this section, we introduce three mainstream strategies for N-ary composition, taking language modeling as an example. To illustrate the task more clearly, the composition problem for a sentence or a document can be formulated as follows.

Given a sentence/document consisting of a word sequence {w_1, w_2, ..., w_n}, we aim to design the following components to obtain the joint semantic representation of the whole sentence/document:

1. A semantic representation method, such as the semantic vector space or the compositional matrix space.
2. A binary compositional operation function f(u, v), as introduced in the previous sections. Here the inputs u and v denote the representations of two constituent semantic units, and the output is a representation in the same space.
3. A sequential order in which to apply the binary function in step 2. To describe this order in detail, we can use brackets to identify how the composition function is applied; for instance, ((w_1, w_2), w_3) represents the sequential order from beginning to end.

In this section, we introduce several systematic strategies to model sentence semantics by describing solutions to the three problems above. We classify the methods by word-level order: sequential order, recursive order (following parsing trees), and convolution order.

To design orders in which to apply binary compositional functions, the most intuitive method is to use sequentiality. Namely, the composition order is s_n = (s_{n−1}, w_n), where s_{n−1} is the composition of the first n − 1 words. Supposing that h_t denotes the representation of the first t words and w_t represents the t-th word, the general composition can be formulated as

h_t = f(h_{t−1}, w_t), (3.24)

where f is a well-designed binary composition function. From the definition of the RNN, the composition function can be formulated as

h_t = tanh(W_1 h_{t−1} + W_2 w_t), (3.25)

where W_1 and W_2 are two weight matrices. We can see that a matrix-weighted summation is used here to represent the binary semantic composition:

p = W_α u + W_β v. (3.26)
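A minimal sketch of the sequential composition in Eqs. (3.24)-(3.25): an RNN-style recurrence that folds the word vectors of a sequence into a single representation. The dimensions and random parameters are illustrative and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                        # illustrative hidden/word dimension

W1 = rng.normal(scale=0.1, size=(dim, dim))    # recurrent weight matrix
W2 = rng.normal(scale=0.1, size=(dim, dim))    # input weight matrix

def rnn_compose(word_vectors):
    """h_t = tanh(W1 h_{t-1} + W2 w_t), Eq. (3.25), applied left to right."""
    h = np.zeros(dim)
    for w in word_vectors:
        h = np.tanh(W1 @ h + W2 @ w)
    return h

# Compose a hypothetical three-word sequence into one vector.
sentence = [rng.normal(size=dim) for _ in range(3)]
print(rnn_compose(sentence))
```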
LSTM. Since the raw RNN only utilizes a simple tangent function, it is hard for it to capture the long-term dependencies of a long sentence or document. Reference [5] proposes Long Short-Term Memory (LSTM) networks to strengthen the ability to model long-term semantic dependencies in RNNs. In detail, the composition function of the LSTM allows information from previous steps to flow directly to the following ones. The composition function is defined as

f_t = Sigmoid(W_{hf} h_{t−1} + W_{xf} x_t + b_f), (3.27)
i_t = Sigmoid(W_{hi} h_{t−1} + W_{xi} x_t + b_i), (3.28)
o_t = Sigmoid(W_{ho} h_{t−1} + W_{xo} x_t + b_o), (3.29)
ĉ_t = tanh(W_{hc} h_{t−1} + W_{xc} x_t + b_c), (3.30)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t, (3.31)
h_t = o_t ⊙ c_t. (3.32)
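The gated composition of Eqs. (3.27)-(3.32) can be sketched as follows. All parameters are randomly initialized for illustration, and the output equation h_t = o_t ⊙ c_t is kept exactly as written in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# One (recurrent weight, input weight, bias) triple per gate, Eqs. (3.27)-(3.30).
params = {g: (rng.normal(scale=0.1, size=(dim, dim)),
              rng.normal(scale=0.1, size=(dim, dim)),
              np.zeros(dim)) for g in ("f", "i", "o", "c")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t):
    def gate(name, activation):
        Wh, Wx, b = params[name]
        return activation(Wh @ h_prev + Wx @ x_t + b)
    f_t = gate("f", sigmoid)                  # forget gate, Eq. (3.27)
    i_t = gate("i", sigmoid)                  # input gate, Eq. (3.28)
    o_t = gate("o", sigmoid)                  # output gate, Eq. (3.29)
    c_hat = gate("c", np.tanh)                # candidate cell state, Eq. (3.30)
    c_t = f_t * c_prev + i_t * c_hat          # Eq. (3.31)
    h_t = o_t * c_t                           # Eq. (3.32), as given in the text
    return h_t, c_t

h, c = np.zeros(dim), np.zeros(dim)
for x in [rng.normal(size=dim) for _ in range(3)]:   # a hypothetical 3-word sequence
    h, c = lstm_step(h, c, x)
print(h)
```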
Variants of LSTM. To simplify the LSTM and obtain a more efficient algorithm, [3] proposes a simpler but comparable RNN architecture named the Gated Recurrent Unit (GRU). Compared with the LSTM, the GRU has fewer parameters, which brings higher efficiency. Its composition function is

z_t = Sigmoid(W_{hz} h_{t−1} + W_{xz} x_t + b_z), (3.33)
r_t = Sigmoid(W_{hr} h_{t−1} + W_{xr} x_t + b_r), (3.34)
ĥ_t = tanh(W_h (r_t ⊙ h_{t−1}) + W_{xh} x_t + b_h), (3.35)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t. (3.36)
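Similarly, a minimal sketch of the GRU composition in Eqs. (3.33)-(3.36), again with randomly initialized parameters for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
Whz, Wxz = rng.normal(scale=0.1, size=(dim, dim)), rng.normal(scale=0.1, size=(dim, dim))
Whr, Wxr = rng.normal(scale=0.1, size=(dim, dim)), rng.normal(scale=0.1, size=(dim, dim))
Whh, Wxh = rng.normal(scale=0.1, size=(dim, dim)), rng.normal(scale=0.1, size=(dim, dim))
bz, br, bh = np.zeros(dim), np.zeros(dim), np.zeros(dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t):
    z_t = sigmoid(Whz @ h_prev + Wxz @ x_t + bz)             # update gate, Eq. (3.33)
    r_t = sigmoid(Whr @ h_prev + Wxr @ x_t + br)             # reset gate, Eq. (3.34)
    h_hat = np.tanh(Whh @ (r_t * h_prev) + Wxh @ x_t + bh)   # candidate state, Eq. (3.35)
    return (1.0 - z_t) * h_prev + z_t * h_hat                # Eq. (3.36)

h = np.zeros(dim)
for x in [rng.normal(size=dim) for _ in range(3)]:           # a hypothetical 3-word sequence
    h = gru_step(h, x)
print(h)
```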
Besides the recurrent neural network, another strategy applies the binary compositional function following a parsing tree instead of the sequential word order. Based on this philosophy, [15] proposes a recursive neural network to model different levels of semantic units. In this subsection, we introduce some algorithms that follow the recursive parsing tree with different binary compositional functions.

Since all the recursive neural networks are binary trees, the basic problem is how to derive the representation of a parent component in the tree given its two child components. Reference [15] proposes a recursive matrix-vector model (MV-RNN), which captures constituent parsing tree structure information by assigning a matrix-vector representation to each constituent. The vector captures the meaning of the constituent itself, and the matrix represents how it modifies the meaning of the word it combines with. Suppose we have two child components a and b and their parent component p; the composition can be formulated as

p = f_vec(a, b) = g(W [Ba; Ab]), (3.37)
P = f_matrix(a, b) = W_M [A; B], (3.38)

where a, b, and p are the embedding vectors of the components and A, B, and P are their matrices, W is a matrix that maps the transformed words into another semantic space, the element-wise function g is an activation function, and W_M is a matrix that maps the two stacked matrices into one combined matrix P of the same dimension. The whole process is illustrated in Fig. 3.2. MV-RNN then selects the highest node on the path in the parse tree between the two target entities to represent the input sentence.

Fig. 3.2 The architecture of the matrix-vector recursive encoder

In fact, the composition operation used in the above recursive network is similar to an RNN unit introduced in the previous subsection, and the RNN unit here can be replaced by LSTM or GRU units. Reference [16] proposes two types of tree-structured LSTMs, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, to capture constituent or dependency parsing tree structure information. For the Child-Sum Tree-LSTM, given a tree, let C(t) denote the set of children of node t. Its transition equations are defined as

ĥ_t = Σ_{k∈C(t)} h_k, (3.39)
i_t = Sigmoid(W^(i) w_t + U^(i) ĥ_t + b^(i)), (3.40)
f_{tk} = Sigmoid(W^(f) w_t + U^(f) h_k + b^(f))  (k ∈ C(t)), (3.41)
o_t = Sigmoid(W^(o) w_t + U^(o) ĥ_t + b^(o)), (3.42)
u_t = tanh(W^(u) w_t + U^(u) ĥ_t + b^(u)), (3.43)
c_t = i_t ⊙ u_t + Σ_{k∈C(t)} f_{tk} ⊙ c_k, (3.44)
h_t = o_t ⊙ tanh(c_t). (3.45)

The N-ary Tree-LSTM has similar transition equations to the Child-Sum Tree-LSTM; the only difference is that it limits the tree structures to have at most N branches.

Reference [6] proposes to embed an input sentence with a Convolutional Neural Network (CNN), which extracts local features with a convolution layer and combines all local features via a max-pooling operation to obtain a fixed-size vector for the input sentence. Formally, the convolution operation is defined as a matrix multiplication between a sequence of vectors, a convolution matrix W, and a bias vector b within a sliding window. Let the vector q_i be the concatenation of the subsequence of input representations in the i-th window; then

h_j = max_i [f(W q_i + b)]_j, (3.46)

where f indicates a nonlinear function such as the sigmoid or tangent function, and h is the final representation of the sentence.

In this chapter, we first introduce the semantic space for compositional semantics. Afterwards, we take phrase representation as an example to introduce various models for binary semantic composition, including additive models and multiplicative models. Finally, we introduce typical models for N-ary semantic composition, including the recurrent neural network, the recursive neural network, and the convolutional neural network. Compositional semantics allows languages to construct complex meanings from combinations of simpler elements, and binary and N-ary semantic composition are the foundation of multiple NLP tasks, including sentence representation, document representation, relational path representation, etc. We will give a detailed introduction to these scenarios in the following chapters.

For a further understanding of compositional semantics, there are also some recommended surveys and books:

• Pelletier. The Principle of Semantic Compositionality [13].
• Mitchell and Lapata. Composition in Distributional Models of Semantics [10].

For better modeling of compositional semantics, some directions require further efforts in the future:

(1)
Neurobiology-inspired Compositional Semantics . What is the neurobiologyfor dealing with compositional semantics in human language? Recently, [14]finds that the human combinatory system is related to rapidly peaking activityin the left anterior temporal lobe and later engagement of the medial prefrontalcortex. The analysis of how language builds meaning and lays out directionsin neurobiological research may bring some instructive reference for modelingcompositional semantics in representation learning. It is valuable to design novelcompositional forms inspired by recent neurobiological advances.(2)
Combination of Symbolic and Distributed Representation . Human languageis inherently a discrete symbolic representation of knowledge. However, werepresent the semantics of discrete symbols with distributed/distributional rep-resentations when dealing with natural language in deep learning. Recently, thereare some approaches such as neural module networks [1] and neural symbolicmachine [8] attempting to consider discrete symbols in neural networks. Howto take advantage of these symbolic neural models to represent the compositionof semantics is an open problem to be explored.
References
1. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In
Proceedings of CVPR , pages 39–48, 2016.2. Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representingadjective-noun constructions in semantic space. In
Proceedings of EMNLP , 2010.3. Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedbackrecurrent neural networks. In
Proceedings of ICML , 2015.4. Gottlob Frege. Die grundlagen derarithmetik.
Eine logisch mathematische Untersuchung u’berden Begrijfder Zahl. Breslau: Koebner , 1884.5. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation ,9(8):1735–1780, 1997.6. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural networkfor modelling sentences. In
Proceedings of ACL , 2014.7. George Lakoff. Linguistic gestalts. In
Proceedings of ILGISA , 1977.8. Chen Liang, Jonathan Berant, Quoc Le, Kenneth Forbus, and Ni Lao. Neural symbolicmachines: Learning semantic parsers on freebase with weak supervision. In
Proceedings ofACL , pages 23–33, 2017.eferences 579. Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In
Proceedingsof ACL , 2008.10. Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics.
Cognitivescience , 34(8):1388–1429, 2010.11. Geoffrey Nunberg, Ivan A Sag, and Thomas Wasow. Idioms.
Language , pages 491–538, 1994.12. Barbara Partee. Lexical semantics and compositionality.
An Invitation to Cognitive Science:Language , 1:311–360, 1995.13. Francis Jeffry Pelletier. The principle of semantic compositionality.
Topoi , 13(1):11–24, 1994.14. Liina Pylkkänen. The neural basis of combinatory syntax and semantics.
Science ,366(6461):62–66, 2019.15. Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compo-sitionality through recursive matrix-vector spaces. In
Proceedings of EMNLP , 2012.16. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representa-tions from tree-structured long short-term memory networks. In
Proceedings of ACL , 2015.
Chapter 4
Sentence Representation
Abstract
Sentence is an important linguistic unit of natural language. Sentence representation has remained a core task in natural language processing, because many important applications in related fields rely on understanding sentences, for example, summarization, machine translation, sentiment analysis, and dialogue systems. Sentence representation aims to encode the semantic information of a sentence into a real-valued representation vector, which is then utilized in further sentence classification or matching tasks. With large-scale text data available on the Internet and recent advances in deep neural networks, researchers tend to employ neural networks (e.g., convolutional neural networks and recurrent neural networks) to learn low-dimensional sentence representations and have achieved great progress on relevant tasks. In this chapter, we first introduce the one-hot representation for sentences and the n-gram sentence representation (i.e., the probabilistic language model). Then we extensively introduce neural-based models for sentence modeling, including the feedforward neural network, convolutional neural network, recurrent neural network, the latest Transformer, and pre-trained language models. Finally, we introduce several typical applications of sentence representations.

Natural language sentences consist of words or phrases, follow grammatical rules, and convey complete semantic information. Compared with words and phrases, sentences have more complex structures, both sequential and hierarchical, which are essential for understanding them. In NLP, how to represent sentences is critical for related applications, such as sentence classification, sentiment analysis, sentence matching, and so on.

Before deep learning took off, sentences were usually represented as one-hot vectors or TF-IDF vectors, following the bag-of-words assumption. In this case, a sentence is represented as a vocabulary-sized vector, in which each element represents the importance of a specific word (either term frequency or TF-IDF) to the sentence. However, this method confronts two issues. Firstly, the dimension of such representation vectors usually runs to thousands or millions, so they face a sparsity problem and bring computational efficiency problems. Secondly, such a representation method follows the bag-of-words assumption and ignores the sequential and structural information, which can be crucial for understanding the semantic meanings of sentences.

Inspired by recent advances of deep learning models in computer vision and speech, researchers have proposed to model sentences with deep neural networks, such as convolutional neural networks, recurrent neural networks, and so on. Compared with conventional word frequency-based sentence representations, deep neural networks can capture the internal structures of sentences, e.g., sequential and dependency information, through convolutional or recurrent operations. Thus, neural network-based sentence representations have achieved great success in sentence modeling and NLP tasks.

One-hot representation is the simplest and most straightforward method for word representation. This method represents each word with a fixed-length binary vector. Specifically, for a vocabulary V = {w_1, w_2, ..., w_|V|}, the one-hot representation of word w_i is w_i = [0, ..., 0, 1, 0, ..., 0], in which only the dimension corresponding to w_i is 1. Based on the one-hot word representation and the vocabulary, a sentence s = {w_1, w_2, ..., w_l} can be represented as

s = Σ_{i=1}^{l} w_i, (4.1)

where l indicates the length of the sentence s. The sentence representation s is the sum of the one-hot representations of the l words within the sentence, i.e., each element in s represents the Term Frequency (TF) of the corresponding word.

Moreover, researchers usually take the importance of different words into consideration, rather than treating all words equally. For example, function words such as "a", "an", and "the" appear in most sentences and carry little meaning. Therefore, the Inverse Document Frequency (IDF) is employed to measure the importance of w_i in V as follows:

idf_{w_i} = log (|D| / df_{w_i}), (4.2)

where |D| is the number of documents in the corpus D and df_{w_i} represents the Document Frequency (DF) of w_i, i.e., the number of documents containing w_i. With the importance of each word, sentences can be represented more precisely as

ŝ = s ⊗ idf, (4.3)

where ⊗ is the element-wise product. Here, ŝ is the TF-IDF representation of the sentence s.

One-hot sentence representation usually neglects the structural information in a sentence. To address this issue, researchers proposed the probabilistic language model, which treats n-grams rather than words as the basic components. An n-gram is a subsequence of words in a context window of length n, and a probabilistic language model defines the probability of a sentence s = [w_1, w_2, ..., w_l] as

P(s) = Π_{i=1}^{l} P(w_i | w_1^{i−1}). (4.4)

In fact, the model in Eq. (4.4) is not practicable due to its enormous parameter space. In practice, we simplify the model and set an n-sized context window, assuming that the probability of word w_i only depends on [w_{i−n+1}, ..., w_{i−1}]. More specifically, an n-gram language model predicts word w_i in the sentence s based on its previous n − 1 words:

P(s) = Π_{i=1}^{l} P(w_i | w_{i−n+1}^{i−1}), (4.5)

where the probability of the word w_i can be calculated from n-gram frequency counts:

P(w_i | w_{i−n+1}^{i−1}) = P(w_{i−n+1}^{i}) / P(w_{i−n+1}^{i−1}). (4.6)

Typically, the conditional probabilities in n-gram language models are not calculated directly from the frequency counts, since this suffers severe problems when confronted with n-grams that have never been seen before. Therefore, researchers proposed several types of smoothing approaches, which assign some of the total probability mass to unseen words or n-grams, such as "add-one" smoothing, Good-Turing discounting, or back-off models.

The n-gram model is a typical probabilistic language model for predicting the next word in an n-gram sequence. It follows the Markov assumption that the probability of the target word only relies on the previous n − 1 words, so that the n-gram language model serves as an approximation of the true underlying language model. This assumption is crucial because it massively simplifies the problem of learning the parameters of language models from data. Recent works on word representation learning [3, 40, 43] are mainly based on the n-gram language model.
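As a concrete illustration of Eqs. (4.5) and (4.6), the following is a minimal Python sketch of a bigram (n = 2) language model with add-one smoothing. The toy corpus and the choice of n are illustrative only.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    # Eq. (4.6) with add-one smoothing:
    # P(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + |V|)
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(unigrams, bigrams, sentence):
    # Eq. (4.5): P(s) is the product of the conditional bigram probabilities.
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(unigrams, bigrams, prev, word)
    return prob

# Toy corpus (illustrative).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
unigrams, bigrams = train_bigram_lm(corpus)
print(sentence_prob(unigrams, bigrams, ["the", "cat", "ran"]))
```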
Although smoothing approaches can alleviate the sparsity problem in probabilistic language models, they still perform poorly for unseen or uncommon words and n-grams. Moreover, as probabilistic language models are constructed on larger and larger texts, the number of unique words (the vocabulary) increases, and the number of possible word sequences increases exponentially with the size of the vocabulary, causing a data sparsity problem that makes it hard to estimate the probabilities accurately.

To address this issue, researchers proposed neural language models, which use continuous representations or embeddings of words and neural networks to make their predictions. Embeddings in the continuous space help to alleviate the curse of dimensionality in language modeling, and neural networks avoid this problem by representing words in a distributed way, as nonlinear combinations of weights in a neural net [2]. An alternative description is that a neural network approximates the language function. The architecture might be feedforward or recurrent; the former is simpler, while the latter is more common. Similar to probabilistic language models, neural language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution:

P(s) = Π_{i=1}^{l} P(w_i | w_1^{i−1}), (4.7)

where the conditional probability of the word w_i can be calculated by various kinds of neural networks, such as feedforward neural networks, recurrent neural networks, and so on. In the following sections, we introduce these neural language models in detail.

The goal of the feedforward neural network language model is to estimate the conditional probability P(w_i | w_1, ..., w_{i−1}). However, the feedforward neural network (FNN) lacks an effective way to represent long-term historical context. Therefore, it adopts the idea of n-gram language models to approximate the conditional probability, assuming that each word in a word sequence depends statistically more on the words closer to it, and considering only the n − 1 previous words: P(w_i | w_1^{i−1}) ≈ P(w_i | w_{i−n+1}^{i−1}).

The overall architecture of the FNN language model is proposed in [3]. To evaluate the conditional probability of the word w_i, it first projects its n − 1 previous words into their word vectors and concatenates them as the input x = [w_{i−n+1}; ...; w_{i−1}], which is then fed into an FNN. This can be generally represented as

y = M f(W x + b) + d, (4.8)

where W is a weight matrix that transforms the word vectors into hidden representations, M is a weight matrix for the connections between the hidden layer and the output layer, and b, d are bias vectors. The conditional probability of the word w_i can then be calculated as

P(w_i | w_{i−n+1}^{i−1}) = exp(y_{w_i}) / Σ_j exp(y_j). (4.9)
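To make the feedforward language model of Eqs. (4.8)-(4.9) concrete, here is a minimal NumPy sketch that scores the next word given the n − 1 previous words. The dimensions, random initialization, and vocabulary size are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary of 10,000 words, 100-d embeddings, 4-gram model.
vocab_size, embed_dim, n, hidden_dim = 10_000, 100, 4, 128

E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))            # word embeddings
W = rng.normal(scale=0.1, size=(hidden_dim, (n - 1) * embed_dim))  # input-to-hidden
b = np.zeros(hidden_dim)
M = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))           # hidden-to-output
d = np.zeros(vocab_size)

def next_word_distribution(context_ids):
    """P(w_i | previous n-1 words), Eqs. (4.8)-(4.9)."""
    x = E[context_ids].reshape(-1)          # concatenate the n-1 context embeddings
    y = M @ np.tanh(W @ x + b) + d          # Eq. (4.8) with f = tanh
    y = y - y.max()                         # numerical stability for the softmax
    p = np.exp(y) / np.exp(y).sum()         # Eq. (4.9)
    return p

# Probability of word id 42 following the (hypothetical) context [7, 15, 3].
probs = next_word_distribution(np.array([7, 15, 3]))
print(probs[42])
```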
The architecture of the CNN language model is shown in Fig. 4.1. Moreover, [12] also introduces a convolutional neural network for language modeling with a novel gating mechanism.

Fig. 4.1  The architecture of CNN (input representation, convolution layer with filter W_c, max-pooling, and a tanh nonlinearity)

To address the FNN language model's inability to model long-term dependency, [41] proposes a Recurrent Neural Network (RNN) language model, which applies RNNs to language modeling. RNNs differ fundamentally from FNNs in that they operate not only on an input space but also on an internal state space, and this internal state space enables the representation of sequentially extended dependencies. Therefore, the RNN language model can deal with sentences of arbitrary length. At every time step, its input is the vector of the previous word rather than the concatenation of the vectors of its n previous words, and the information of all other previous words is taken into account through its internal state. Formally, the RNN language model is defined as

h_i = f(W_1 h_{i-1} + W_2 w_i + b),   (4.11)
y = M h_{i-1} + d,   (4.12)

where W_1, W_2, M are weight matrices and b, d are bias vectors. Here, the RNN unit can also be implemented with LSTM or GRU cells. The architecture of RNN is shown in Fig. 4.2.

Recently, researchers have compared neural network language models with different architectures on both small and large corpora. The experimental results show that, generally, the RNN language model outperforms the CNN language model.

In 2018, Google proposed a pre-trained language model (PLM) called BERT, which achieved state-of-the-art results on a variety of NLP tasks. It was big news at the time, and since then NLP researchers have been considering how PLMs can benefit their own research tasks.

Fig. 4.2  The architecture of RNN, where the recurrent unit can be a vanilla RNN, LSTM, or GRU cell
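To make the feedforward formulation in Eqs. (4.8) and (4.9) concrete, the following is a minimal PyTorch sketch of an n-gram neural language model in the spirit of [3]; the class name, dimensions, and hyperparameters are illustrative assumptions rather than the configuration used in the original work.

import torch
import torch.nn as nn

class FNNLanguageModel(nn.Module):
    """Predict w_i from its n-1 previous words, cf. Eqs. (4.8)-(4.9)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, context_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # word vectors
        self.W = nn.Linear(context_size * embed_dim, hidden_dim)   # Wx + b
        self.M = nn.Linear(hidden_dim, vocab_size)                  # M f(.) + d

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the n-1 previous words
        x = self.embed(context_ids).flatten(start_dim=1)   # concatenated word vectors
        y = self.M(torch.tanh(self.W(x)))                   # unnormalized scores y
        return torch.log_softmax(y, dim=-1)                 # log P(w_i | context), Eq. (4.9)

model = FNNLanguageModel(vocab_size=10000)
log_probs = model(torch.randint(0, 10000, (4, 3)))  # a batch of 4 contexts of 3 words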
In this section, we will first introduce the Transformer architecture and then talkabout BERT and other PLMs in detail.
Transformer [65] is a nonrecurrent encoder-decoder architecture with a series of attention-based blocks. The encoder consists of 6 layers, and each layer is composed of a multi-head attention sublayer and a position-wise feedforward sublayer, with residual connections around the sublayers. The architecture of the Transformer is shown in Fig. 4.3.

There are several attention heads in the multi-head attention sublayer. A head is a scaled dot-product attention structure, which takes the query matrix Q, the key matrix K, and the value matrix V as inputs and computes the output as

Attention(Q, K, V) = Softmax(QK^T / √d_k) V,   (4.13)

where d_k is the dimension of the query vectors.

The multi-head attention sublayer linearly projects the input hidden states H several times into the query, key, and value matrices for the h heads. The dimensions of the query, key, and value vectors are d_k, d_k, and d_v, respectively.

Fig. 4.3  The architecture of the Transformer: stacks of N encoder and decoder blocks built from multi-head attention, masked multi-head attention, feedforward sublayers, and add-and-norm operations, with positional encodings on the inputs and outputs (shifted right) and a final linear-softmax layer producing output probabilities
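The scaled dot-product attention of Eq. (4.13) takes only a few lines to express; the PyTorch sketch below, with assumed tensor shapes, is an illustration rather than the reference Transformer implementation.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V, cf. Eq. (4.13)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    weights = torch.softmax(scores, dim=-1)             # attention distribution
    return weights @ V                                   # weighted sum of values

Q = torch.randn(2, 5, 64)   # batch of 2, 5 query positions, d_k = 64
K = torch.randn(2, 7, 64)   # 7 key/value positions
V = torch.randn(2, 7, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (2, 5, 64)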
The multi-head attention sublayer can be formulated as

Multihead(H) = [head_1, head_2, ..., head_h] W^O,   (4.14)

where head_i = Attention(H W_i^Q, H W_i^K, H W_i^V), and W_i^Q, W_i^K, and W_i^V are linear projections. W^O is also a linear projection for the output. The fully connected position-wise feedforward sublayer contains two linear transformations with a ReLU activation:

FFN(x) = W_2 max(0, W_1 x + b_1) + b_2.   (4.15)

The Transformer is better than RNNs at modeling long-term dependency, since all tokens are considered equally during the attention operation. The Transformer was originally proposed for machine translation; because it has a very powerful ability to model sequential data, it has become the most popular backbone of NLP applications.

Neural models can learn large amounts of language knowledge from language modeling. Since this language knowledge covers the demands of many downstream NLP tasks and provides powerful representations of words and sentences, researchers found that it can be transferred to other NLP tasks easily. The transferred models are called Pre-trained Language Models (PLMs).

Language modeling is the most basic and most important NLP task. It involves a variety of knowledge for language understanding, such as linguistic knowledge and factual knowledge. For example, the model needs to decide whether it should add an article before a noun, which requires linguistic knowledge about articles. Another example is the question of which word follows "Trump is the president of". The answer is "America", which requires factual knowledge. Since language modeling is very complex, models can learn a lot from this task.

On the other hand, language modeling only requires plain text without any human annotation. As a result, models can learn complex NLP abilities from very large-scale corpora. Since deep learning needs large amounts of data and language modeling can make full use of all the text in the world, PLMs significantly benefit the development of NLP research.

Inspired by the success of the Transformer, GPT [50] and BERT [14] adopt the Transformer as the backbone of their pre-trained language models. GPT and BERT are the most representative Transformer-based PLMs; since they achieved state-of-the-art performance on various NLP tasks, nearly all PLMs after them are based on the Transformer. In this subsection, we discuss GPT and BERT in more detail.

GPT is the first work to pretrain a PLM based on the Transformer. The training procedure of GPT [50] contains two classic stages: generative pretraining and discriminative fine-tuning.

In the pretraining stage, the input of the model is a large-scale unlabeled corpus denoted as U = {u_1, u_2, ..., u_n}. The pretraining stage aims to optimize a language model, and the learning objective over the corpus is to maximize a conditional likelihood in a fixed-size window:

L_1(U) = ∑_i log P(u_i | u_{i-k}, ..., u_{i-1}; Θ),   (4.16)

where k is the size of the window and the conditional likelihood P is modeled by a neural network with parameters Θ.

For a supervised dataset χ, the input is a sequence of words s = (w_1, w_2, ..., w_l) and the output is a label y. The pretraining stage provides an advantageous starting point for the parameters, which are used to initialize subsequent supervised tasks.
In the fine-tuning stage, the objective is a discriminative task that maximizes the conditional probability:

L_2(χ) = ∑_{(s, y)} log P(y | w_1, ..., w_l),   (4.17)

where P(y | w_1, ..., w_l) is modeled by a K-layer Transformer. After the input tokens pass through the pretrained GPT, a hidden vector h_l^K of the final layer is produced. To obtain the output distribution, a linear transformation layer with the same size as the number of labels is added:

P(y | w_1, ..., w_l) = Softmax(W_y h_l^K).   (4.18)

The final training objective combines the discriminative objective with the language modeling objective L_1 for better generalization:

L_3(χ) = L_2(χ) + λ L_1(χ),   (4.19)

where λ is a weight hyperparameter.

BERT [14] is a milestone work in the field of PLMs. BERT achieved significant empirical results on 17 different NLP tasks, including SQuAD (outperforming human performance), GLUE (7.7% absolute improvement), MultiNLI (4.6% absolute improvement), etc. Compared to GPT, BERT uses a bidirectional deep Transformer as the model backbone. As illustrated in Fig. 4.4, BERT contains pretraining and fine-tuning stages.

In the pretraining stage, two objectives are designed: Masked Language Model (MLM) and
Next Sentence Prediction (NSP).

Fig. 4.4  The pretraining (Masked LM and NSP over sentence pairs) and fine-tuning (e.g., MNLI, NER, SQuAD) stages for BERT

(1) For MLM, tokens are randomly masked with a special token [MASK]. The training objective is to predict the masked tokens based on their contexts. Compared with a standard unidirectional conditional language model, which can only be trained in one direction, MLM aims to train a deep bidirectional representation model. This task is inspired by the Cloze task [64]. (2) The objective of NSP is to capture relationships between sentences for sentence-based downstream tasks such as natural language inference (NLI) and question answering (QA). In this task, a binary classifier is trained to predict whether a sentence is the next sentence of the current one. This task effectively captures the deep relationship between sentences, exploring semantic information at a different level.

After pretraining, BERT captures various language knowledge for downstream supervised tasks. By modifying inputs and outputs, BERT can be fine-tuned for any NLP task, including applications whose input is a single text or a text pair. The input consists of sentence A and sentence B, which can represent (1) sentence pairs in paraphrase, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in QA, and (4) a text-∅ pair for text classification or sequence tagging. For the output, BERT produces a token-level representation for each token, which is used for sequence tagging or question answering. Besides, the special token [CLS] in BERT is fed into a classification layer for sequence classification.

Pre-trained language models have made rapid progress after BERT. We summarize several important directions of PLMs and show some representative models and their relationships in Fig. 4.5.

Here is a brief introduction to the PLMs after BERT. Firstly, there are variants of BERT for better general language representation, such as RoBERTa [38] and XLNet [70]; these models mainly focus on improving the pretraining tasks. Secondly, some works pretrain generation models, such as MASS [57] and UniLM [15], which achieve promising results on generation tasks rather than the Natural Language Understanding (NLU) tasks targeted by BERT. Thirdly, the sentence-pair input format of BERT inspired works in the cross-lingual and cross-modal fields; XLM [8], ViLBERT [39], and VideoBERT [59] are important works in this direction. Lastly, some works [46, 81] explore incorporating external knowledge into PLMs, since some low-frequency knowledge cannot be efficiently learned by PLMs.

Fig. 4.5  The pre-trained language model family
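In practice, PLMs such as BERT are usually fine-tuned through open-source libraries; the hedged sketch below uses the Hugging Face transformers package to illustrate the fine-tuning recipe described above. The model name, toy sentence pair, label, and learning rate are illustrative assumptions, and this library is not part of the original text.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT and attach a classification head on top of [CLS].
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

# A toy sentence-pair batch (e.g., premise-hypothesis for entailment).
batch = tokenizer(["The cat sat on the mat."], ["A cat is sitting."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

# One fine-tuning step: the loss is returned when labels are provided.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()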
Inspired by the contrastive divergence model, [4] proposes to adopt importance sampling to accelerate the training of neural language models. They first normalize the outputs of the neural network language model and view neural network language models as a special case of energy-based probability models:

P(w_i | w_{i-n+1}^{i-1}) = exp(-y_{w_i}) / ∑_j exp(-y_j).   (4.20)

The key idea of importance sampling is to approximate the log-likelihood gradient of the loss function of the neural network language model by sampling several important words instead of calculating the explicit gradient. The log-likelihood gradient can be generally represented as

∂ log P(w_i | w_{i-n+1}^{i-1}) / ∂θ = -∂y_{w_i}/∂θ + ∑_{j=1}^{|V|} P(w_j | w_{i-n+1}^{i-1}) ∂y_j/∂θ
                                    = -∂y_{w_i}/∂θ + E_{w_k ∼ P}[∂y_k/∂θ],   (4.21)

where θ indicates all parameters of the neural network language model. The gradient consists of two parts: a positive gradient for the target word w_i and a negative gradient for all words w_j, i.e., E_{w_k ∼ P}[∂y_k/∂θ]. The second part can be approximated by sampling important words following the probability distribution P:

E_{w_k ∼ P}[∂y_k/∂θ] ≈ (1/|V'|) ∑_{w_k ∈ V'} ∂y_k/∂θ,   (4.22)

where V' is the word set sampled under P.

However, since we cannot obtain the probability distribution P in advance, it is impossible to sample important words following P directly. Therefore, importance sampling adopts a Monte Carlo scheme that uses an existing proposal distribution Q to approximate P, and then we have

E_{w_k ∼ P}[∂y_k/∂θ] ≈ (1/|V''|) ∑_{w_l ∈ V''} [P(w_l | w_{i-n+1}^{i-1}) / Q(w_l)] ∂y_l/∂θ,   (4.23)

where V'' is the word set sampled under Q. Moreover, the sample size of the importance sampling approach should be increased as training proceeds in order to avoid divergence, which aims to ensure an effective sample size S:

S = (∑_{w_l ∈ V''} r_l)^2 / ∑_{w_l ∈ V''} r_l^2,   (4.24)

where r_l is further defined as

r_l = [P(w_l | w_{i-n+1}^{i-1}) / Q(w_l)] / ∑_{w_j ∈ V''} [P(w_j | w_{i-n+1}^{i-1}) / Q(w_j)].   (4.25)

Besides importance sampling, researchers [7, 22] also propose the class-based language model, which adopts word classification to improve the performance and speed of a language model. In the class-based language model, every word is assigned to a unique class, and the conditional probability of a word given its context is decomposed into the probability of the word's class given its previous words and the probability of the word given its class and history, which is formally defined as

P(w_i | w_{i-n+1}^{i-1}) = ∑_{c(w_i) ∈ C} P(w_i | c(w_i)) P(c(w_i) | w_{i-n+1}^{i-1}),   (4.26)

where C indicates the set of all classes and c(w_i) indicates the class of word w_i.

Moreover, [44] proposes a hierarchical neural network language model, which extends word classification to a hierarchical binary clustering of words. Instead of simply assigning each word a unique class, it first builds a hierarchical binary tree of words according to the word similarity obtained from WordNet. Next, it assigns each word a unique bit vector c(w_i) = [c_1(w_i), c_2(w_i), ..., c_l(w_i)], which indicates its hierarchical classes. The conditional probability of each word can then be defined as

P(w_i | w_{i-n+1}^{i-1}) = ∏_{j=1}^{l} P(c_j(w_i) | c_1(w_i), c_2(w_i), ..., c_{j-1}(w_i), w_{i-n+1}^{i-1}).   (4.27)
The hierarchical neural network language model can achieve an O(k / log k) speedup compared to a standard language model. However, the experimental results of [44] show that although the hierarchical model achieves an impressive speedup for modeling sentences, it performs worse than the standard language model. The reason is perhaps that the introduction of the hierarchical architecture or word classes imposes a negative influence on the word classification performed by neural network language models.

Caching is another important extension of language models. One type of cache-based language model assumes that each word in the recent context is more likely to appear again [58]. Hence, the conditional probability of a word is calculated by combining the information from history and from the cache:

P(w_i | w_{i-n+1}^{i-1}) = λ P_s(w_i | w_{i-n+1}^{i-1}) + (1 - λ) P_c(w_i | w_{i-n+1}^{i-1}),   (4.28)

where P_s(·) indicates the conditional probability generated by the standard language model, P_c(·) indicates the conditional probability generated by the cache, and λ is a constant.

Another kind of cache-based language model is used to speed up RNN language modeling [27]. The main idea of this approach is to store the outputs and states of the language model for future predictions given the same contextual history.

4.5 Applications

In this section, we will introduce two typical sentence-level NLP applications, text classification and relation extraction, and show how to utilize sentence representations for these applications.
Text classification is a typical NLP application with many important real-world tasks such as parsing and semantic analysis, and it has therefore attracted the interest of many researchers. Conventional text classification models (e.g., the LDA [6] and tree-kernel [48] models) focus on capturing more contextual information and correct word order by extracting more useful and distinct features, but they still suffer from issues such as data sparseness, which has a significant impact on classification accuracy. Recently, with the development of deep learning in various fields of artificial intelligence, neural models have been introduced into text classification due to their ability to learn text representations. In this section, we introduce two typical text classification tasks: sentence classification and sentiment classification.
Sentence classification aims to assign a sentence an appropriate category and is a basic task of text classification.

Considering the effectiveness of CNN models in capturing sentence semantics, [31] first proposes to utilize CNN models trained on top of pre-trained word embeddings to classify sentences, achieving promising results on several sentence classification datasets. Then, [30] introduces a dynamic CNN model for the semantic modeling of sentences, which handles sentences of varying lengths and uses dynamic max-pooling over linear sequences, helping the model capture both short-range and long-range semantic relations in sentences. Furthermore, [9] proposes a CNN-based model named Very Deep CNN, which operates directly at the character level; it shows that deeper models achieve better results on sentence classification and can capture the hierarchical information from scattered characters to whole sentences. Yin and Schütze [74] also propose MV-CNN, which utilizes multiple types of pretrained word embeddings and extracts features from multi-granular phrases with variable-sized convolutional layers. To address the drawbacks of MV-CNN, such as its model complexity and the requirement that all embeddings have the same dimension, [80] proposes a model called MG-CNN that captures multiple features from multiple sets of embeddings, which are concatenated at the penultimate layer. Zhang et al. [79] present RA-CNN to jointly exploit labels on documents and their constituent sentences; it estimates the probability that a given sentence is informative and then scales the contribution of each sentence to the aggregated document representation in proportion to these estimates.

RNN models, which capture the sequential information of sentences, are also widely used in sentence classification. Lai et al. [32] propose a recurrent convolutional neural network for text classification, which applies a recurrent structure to capture contextual information. Moreover, [37] introduces a multitask learning framework based on RNNs to jointly learn across multiple sentence classification tasks, employing three different mechanisms of sharing information to model sentences with both task-specific and shared layers. Yang et al. [71] introduce word-level and sentence-level attention mechanisms into an RNN-based model, as well as a hierarchical structure to capture the hierarchical information of documents for classification.
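As a concrete illustration of the CNN-based sentence classifiers discussed above (in the spirit of [31]), here is a minimal PyTorch sketch; the filter sizes, dimensions, and class name are illustrative assumptions rather than the settings of any particular paper.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Convolution + max-pooling over word embeddings for sentence classification."""
    def __init__(self, vocab_size, num_classes, embed_dim=100, num_filters=50,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)                    # (batch, embed_dim, seq_len)
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))                      # class logits

model = TextCNN(vocab_size=20000, num_classes=2)
logits = model(torch.randint(0, 20000, (8, 40)))  # a batch of 8 sentences of length 40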
Sentiment classification is a special case of sentence classification whose objective is to classify the sentiment polarity of the opinions a piece of text contains, e.g., favorable or unfavorable, positive or negative. The task appeals to the NLP community since it has many potential downstream applications, such as movie review recommendation.

Similar to general text classification, neural sentence representations have also been widely explored for sentiment classification. Glorot et al. [20] were the first to use a stacked denoising autoencoder for sentiment classification. Then, a series of recursive neural network models based on the recursive tree structure of sentences were proposed to learn sentence representations for sentiment classification, including the recursive autoencoder (RAE) [55], the matrix-vector recursive neural network (MV-RNN) [54], and the recursive neural tensor network (RNTN) [56]. Besides, [29] adopts a CNN to learn sentence representations and achieves promising performance in sentiment classification.

RNN models also benefit sentiment classification because they capture sequential information. Li et al. [35] and Tai et al. [62] investigate tree-structured LSTM models for text classification. There are also hierarchical models proposed for document-level sentiment classification [5, 63], which generate semantic representations at different levels (e.g., phrase, sentence, or document) within a document. Moreover, the attention mechanism has been introduced into sentiment classification to select important words from a sentence or important sentences from a document [71].
To enrich existing KGs, researchers have devoted many efforts to automatically discovering novel relational facts in text. Relation extraction (RE), which aims at extracting relational facts according to the semantic information in plain text, has therefore become a crucial NLP application. As RE is also an important downstream application of sentence representation, we introduce its techniques and extensions to show how sentence representations are utilized in different RE scenarios. Considering that neural networks have become the backbone of recent NLP research, we mainly focus on Neural RE (NRE) models in this section.
Fig. 4.6  An example of sentence-level relation extraction
Sentence-level NRE aims at predicting the semantic relation between a given entity (or nominal) pair in a sentence. As shown in Fig. 4.6, given an input sentence s consisting of n words, s = {w_1, w_2, ..., w_n}, and its corresponding entity pair e_1 and e_2, sentence-level NRE obtains the conditional probability P(r | s, e_1, e_2) of a relation r (r ∈ R) via a neural network, which can be formalized as

P(r | s, e_1, e_2) = P(r | s, e_1, e_2, θ),   (4.29)

where θ denotes all parameters of the neural network and r is a relation in the relation set R.

A basic form of sentence-level NRE consists of three components: (a) an input encoder that gives a representation for each input word, (b) a sentence encoder that computes either a single vector or a sequence of vectors to represent the original sentence, and (c) a relation classifier that calculates the conditional probability distribution over all relations.

Input Encoder. First, a sentence-level NRE system projects the discrete words of the source sentence into a continuous vector space and obtains the input representation w = {w_1, w_2, ..., w_m} of the source sentence.

(1) Word Embeddings. Word embeddings transform words into distributed representations that capture their syntactic and semantic meanings. In the sentence s, every word w_i is represented by a real-valued vector. Word representations are encoded by column vectors in an embedding matrix E ∈ R^{d_a × |V|}, where V is a fixed-sized vocabulary. Although word embeddings are the most common way to represent input words, efforts have also been made to utilize more complicated information about input sentences for RE.

(2) Position Embeddings. In RE, the words close to the target entities are usually informative for determining the relation between the entities. Therefore, position embeddings are used to help models keep track of how close each word is to the head or tail entity. They are defined as the combination of the relative distances from the current word to the head and tail entities. For example, in the sentence "Bill_Gates is the founder of Microsoft.", the relative distance from the word founder to the head entity
Bill_Gates is -3 and to the tail entity Microsoft is 2. Besides word and position embeddings, more linguistic features can also be considered to enrich the representation of the input sentence.

(3) Part-of-Speech (POS) Tag Embeddings. POS tag embeddings represent the lexical information of the target word in the sentence. Because word embeddings are obtained from a large-scale general corpus, the general information they contain may not accord with the meaning of the word in a specific sentence. Hence, it is necessary to align each word with its linguistic information in its specific context, e.g., noun or verb. Formally, each word w_i is encoded by the corresponding column vector in an embedding matrix E_p ∈ R^{d_p × |V_p|}, where d_p is the dimension of the embedding vectors and V_p indicates a fixed-sized POS tag vocabulary.

(4) WordNet Hypernym Embeddings. WordNet hypernym embeddings take advantage of the prior hypernym knowledge in WordNet to help RE models. Given the hypernym information of each word (e.g., noun.food and verb.motion), it is easier to build connections between different but conceptually similar words. Formally, each word w_i is encoded by the corresponding column vector in an embedding matrix E_h ∈ R^{d_h × |V_h|}, where d_h is the dimension of the embedding vectors and V_h indicates a fixed-sized hypernym vocabulary.

For each word, NRE models often concatenate some of the above four feature embeddings as its input embedding. The feature embeddings of all words are then concatenated and denoted as the final input sequence w = {w_1, w_2, ..., w_m}, where w_i ∈ R^d and d is the total dimension of all concatenated feature embeddings for each word.

Sentence Encoder. The sentence encoder is the core of sentence representation; it encodes the input representations into either a single vector or a sequence of vectors x representing the sentence. We introduce the different sentence encoders in the following.

(1) Convolutional Neural Network Encoder. Zeng et al. [76] propose to encode input sentences using a CNN model, which extracts local features with a convolutional layer and combines all local features via a max-pooling operation to obtain a fixed-sized vector for the input sentence. Formally, the convolutional layer is defined as an operation on the vector sequence w:

p = CNN(w),   (4.30)

where CNN indicates the convolution operation inside the convolutional layer. The i-th element of the sentence vector x is then calculated as

[x]_i = f(max(p_i)),   (4.31)

where f is a nonlinear function applied at the output, such as the hyperbolic tangent.

Furthermore, PCNN [75], a variant of CNN, adopts a piecewise max-pooling operation. All hidden vectors {p_1, p_2, ...} are divided into three segments by the head and tail entities, the max-pooling operation is performed over the three segments separately, and x is the concatenation of the pooling results over the three segments.

(2) Recurrent Neural Network Encoder. Zhang and Wang [78] propose to embed input sentences with an RNN model that can learn temporal features. Formally, each input word representation is fed into the recurrent layer step by step.
For each step i, the network takes the i-th word representation vector w_i and the output of the previous step h_{i-1} as input:

h_i = RNN(w_i, h_{i-1}),   (4.32)

where RNN indicates the transformation function inside the RNN cell, which can be the LSTM or GRU units mentioned before.

Conventional RNN models typically process text sequences from start to end and build the hidden state of each word considering only its preceding words. It has been verified that hidden states that also consider the following words are more effective. Hence, the bidirectional RNN (BRNN) [52] is adopted to learn hidden states using both preceding and following words.

Similar to the CNN models in RE, the RNN model combines the output vectors of the recurrent layer as local features and then uses a max-pooling operation to extract the global feature, which forms the representation of the whole input sentence. The max-pooling layer is formulated as

[x]_j = max_i [h_i]_j.   (4.33)

Besides max-pooling, word attention can also combine all local feature vectors. The attention mechanism [1] learns attention weights over the steps. Supposing H = [h_1, h_2, ..., h_m] is the matrix consisting of all output vectors produced by the recurrent layer, the feature vector x of the whole sentence is formed by a weighted sum of these output vectors:

α = Softmax(s^T tanh(H)),   (4.34)
x = H α^T,   (4.35)

where s is a trainable query vector.

Besides, [42] proposes a model that captures information from both the word sequence and the tree-structured dependency by stacking bidirectional path-based LSTM-RNNs (i.e., bottom-up and top-down). More specifically, it focuses on the shortest path between the two target entities in the dependency tree and utilizes the stacked layers to encode the shortest path into the whole sentence representation. In fact, preliminary work [69] has shown that these paths are useful in RE, and various recursive neural models have also been proposed along this line. We introduce these recursive models next.

(3) Recursive Neural Network Encoder. The recursive encoder extracts features from the information in syntactic parse trees, considering that syntactic information is beneficial for extracting relations from sentences. Generally, these encoders treat the tree structure of a syntactic parse tree as a strategy of composition as well as a direction for combining word features.

Socher et al. [54] propose a recursive matrix-vector model (MV-RNN) that captures structure information by assigning a matrix-vector representation to each constituent in the parse tree. The vector captures the meaning of the constituent itself, and the matrix represents how it modifies the meaning of the word it combines with. Tai et al. [62] further propose two types of tree-structured LSTMs, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, to capture tree structure information. For the Child-Sum Tree-LSTM, given a tree, let C(t) denote the set of children of node t. Its transition equations are defined as

ĥ_t = ∑_{k ∈ C(t)} TLSTM(h_k),   (4.36)

where TLSTM(·) indicates a Tree-LSTM cell, which is a simple modification of the LSTM cell. The N-ary Tree-LSTM has transition equations similar to the Child-Sum Tree-LSTM; the only difference is that it limits the tree structures to have at most N branches.
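The attention-based pooling of Eqs. (4.34) and (4.35) can be written compactly; the following PyTorch sketch, with assumed tensor shapes and variable names, is only an illustration.

import torch

def attention_pooling(H, s):
    """Combine RNN outputs H into a sentence vector x, cf. Eqs. (4.34)-(4.35).
    H: (hidden_dim, seq_len) matrix of output vectors [h_1, ..., h_m]
    s: (hidden_dim,) trainable query vector
    """
    alpha = torch.softmax(s @ torch.tanh(H), dim=-1)  # (seq_len,) attention weights
    return H @ alpha                                   # weighted sum of output vectors

H = torch.randn(128, 30)                  # 30 words, hidden size 128
s = torch.randn(128, requires_grad=True)  # query vector
x = attention_pooling(H, s)               # sentence representation, shape (128,)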
Relation Classifier. After obtaining the representation x of the input sentence, the relation classifier calculates the conditional probability P(r | x, e_1, e_2) via a softmax layer:

P(r | x, e_1, e_2) = Softmax(Mx + b),   (4.37)

where M indicates the relation matrix and b is a bias vector.

Although existing neural models have achieved great success in extracting novel relational facts, they always suffer from the lack of training data. To address this issue, researchers proposed the distant supervision assumption to generate training data by automatically aligning KGs and plain text. The intuition of the distant supervision assumption is that any sentence containing two entities will express the relation between them in the KG. For example,
(New York, city_of, United States) is a relational fact in a KG, so the distant supervision assumption regards all sentences containing these two entities as positive instances for the relation city_of. It offers a natural way of utilizing information from multiple sentences (bag level) rather than a single sentence (sentence level) to decide whether a relation holds between two entities.

Therefore, bag-level NRE aims to predict the semantic relation between an entity pair using all involved sentences. As shown in Fig. 4.7, given an input sentence set S consisting of n sentences, S = {s_1, s_2, ..., s_n}, and its corresponding entity pair e_1 and e_2 as inputs, bag-level NRE obtains the conditional probability P(r | S, e_1, e_2) of a relation r (r ∈ R) via a neural network, which can be formalized as

P(r | S, e_1, e_2) = P(r | S, e_1, e_2, θ).   (4.38)

Fig. 4.7  An example of bag-level relation extraction

A basic form of bag-level NRE consists of four components: (a) an input encoder similar to that of sentence-level NRE, (b) a sentence encoder similar to that of sentence-level NRE, (c) a bag encoder that computes a vector representing all related sentences in a bag, and (d) a relation classifier similar to that of sentence-level NRE, which takes the bag vector as input instead of a sentence vector. Since the input encoder, sentence encoder, and relation classifier of bag-level NRE are similar to those of sentence-level NRE, we mainly focus on the bag encoder in detail.

Bag Encoder. The bag encoder encodes all sentence vectors into a single vector S. We introduce the different bag encoders in the following.

(1) Random Encoder. It simply assumes that each sentence can express the relation between the two target entities and randomly selects one sentence to represent the bag. Formally, the bag representation is defined as

S = s_i  (i ∈ {1, 2, ..., n}),   (4.39)

where s_i indicates the sentence representation of s_i ∈ S and i is a random index.

(2) Max Encoder. As introduced above, not all sentences containing the two target entities express their relation. For example, the sentence "New York City is the premier gateway for legal immigration to the United States" does not express the relation city_of. Hence, [75] follows the at-least-one assumption, which assumes that at least one sentence containing the two target entities expresses their relation, and selects the sentence with the highest probability for the relation to represent the bag. Formally, the bag representation is defined as

S = s_i  (i = arg max_i P(r | s_i, e_1, e_2)).   (4.40)

(3) Average Encoder. Both the random encoder and the max encoder use only one sentence to represent the bag, which ignores the rich information in the other sentences. To exploit the information of all sentences, [36] assumes that the representation S of the bag depends on all sentences' representations, each sentence representation s_i giving some information about the relation between the two entities. The average encoder assumes that all sentences contribute equally to the representation of the bag, i.e., the embedding S of the bag is the average of all sentence vectors:

S = (1/n) ∑_i s_i.   (4.41)

(4) Attentive Encoder. Because of the wrong-label problem inevitably brought by the distant supervision assumption, the performance of the average encoder is influenced by sentences that contain no relation information. To address this issue, [36] further proposes to employ selective attention to de-emphasize those noisy sentences. Formally, the bag representation is defined as a weighted sum of the sentence vectors:

S = ∑_i α_i s_i,   (4.42)

where α_i is defined as

α_i = exp(s_i^T A r) / ∑_j exp(s_j^T A r),   (4.43)

where A is a diagonal matrix and r is the representation vector of the relation r.
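The selective attention of Eqs. (4.42) and (4.43) can be sketched as follows; the dimensions and function name are illustrative assumptions, and A is restricted to a diagonal matrix as in the text.

import torch

def selective_attention(S, A_diag, r):
    """Aggregate sentence vectors in a bag, cf. Eqs. (4.42)-(4.43).
    S: (num_sent, dim) sentence representations s_i
    A_diag: (dim,) diagonal of the matrix A
    r: (dim,) query vector of the relation
    """
    scores = (S * A_diag) @ r               # s_i^T A r for every sentence in the bag
    alpha = torch.softmax(scores, dim=0)    # attention weights over the bag
    return alpha @ S                        # weighted sum: bag representation

S = torch.randn(5, 230)                     # a bag of 5 sentence vectors
A_diag = torch.ones(230)
r = torch.randn(230)
bag_vec = selective_attention(S, A_diag, r) # shape (230,)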
Relation Classifier. Similar to sentence-level NRE, after obtaining the bag representation S, the relation classifier calculates the conditional probability P(r | S, e_1, e_2) via a softmax layer:

P(r | S, e_1, e_2) = Softmax(MS + b),   (4.44)

where M indicates the relation matrix and b is a bias vector.

Recently, NRE systems have achieved significant improvements in both the supervised and the distantly supervised scenarios. However, there are still many challenges in RE, and researchers have been focusing on further aspects to improve the performance of NRE. In this section, we introduce these extensions in detail.
Utilization of External Information. Most existing NRE systems concentrate only on the sentences from which relations are extracted, regardless of rich external information such as KGs. This heterogeneous information provides additional knowledge from the KG and is essential when extracting new relational facts.

Han et al. [24] propose a novel joint representation learning framework for knowledge acquisition. The key idea is that the joint model learns knowledge and text representations within a unified semantic space via KG-text alignments. For the text part, a sentence containing the two entities Mark Twain and Florida is taken as the input of a CNN encoder, and the output of the CNN is regarded as the latent relation PlaceOfBirth of this sentence. For the KG part, entity and relation representations are learned via translation-based methods. The learned representations of the KG and text parts are aligned during training. Besides this preliminary attempt, many efforts have been devoted to this direction [25, 28, 51, 67, 68].
Incorporating Relational Paths. Although existing NRE systems have achieved promising results, they still suffer from a major problem: the models can only learn directly from sentences that contain both target entities. However, sentences containing only one of the entities can also provide useful information and help build inference chains. For example, if we know that "A is the son of B" and "B is the son of C", we can infer that A is the grandson of C.

To utilize the information of both direct and indirect sentences, [77] introduces a path-based NRE model that incorporates textual relational paths. The model first employs a CNN encoder to embed the semantic meanings of sentences. Then, it builds a relation path encoder, which measures the probability of relations given an inference chain in the text. Finally, the model combines the information from both direct sentences and relational paths and predicts the confidence of each relation. This work is the first effort to consider relation paths in text for NRE, and several later methods also consider reasoning paths over sentence semantics for RE [11, 19].
Document-level Relation Extraction. In fact, not all relational facts can be extracted by sentence-level RE, i.e., a large number of relational facts are expressed in multiple sentences. Taking Fig. 4.9 as an example, multiple entities are mentioned in the document and exhibit complex interactions. In order to identify the relational fact (Riddarhuset, country, Sweden), one has to first identify the fact that Riddarhuset is located in Stockholm from Sentence 4, then identify the facts that Stockholm is the capital of Sweden and Sweden is a country from Sentence 1. With the above facts, we can finally infer that the sovereign state of Riddarhuset is Sweden. This process requires reading and reasoning over multiple sentences in a document, which is intuitively beyond the reach of sentence-level RE methods. According to the statistics on a human-annotated corpus sampled from Wikipedia documents [72], at least 40.7% of relational facts can only be extracted from multiple sentences, which is not negligible. Swampillai and Stevenson [61] and Verga et al. [66] also report similar observations. Therefore, it is necessary to move RE forward from the sentence level to the document level. Figure 4.8 is an example of document-level RE.

Fig. 4.8  An example of document-level relation extraction
Fig. 4.9  An example from DocRED [72]

However, existing datasets for document-level RE either only have a small number of manually annotated relations and entities [34], or exhibit noisy annotations from distant supervision [45, 49], or serve specific domains or approaches [33]. To address this issue, [72] constructs a large-scale, manually annotated, and general-purpose document-level RE dataset named DocRED. DocRED is constructed from Wikipedia and Wikidata and has two key features. First, DocRED contains 132,375 entities and 56,354 relational facts annotated on 5,053 Wikipedia documents, which makes it the largest human-annotated document-level RE dataset to date. Second, over 40% of the relational facts in DocRED can only be extracted from multiple sentences. DocRED therefore requires reading multiple sentences in a document to recognize entities and inferring their relations by synthesizing all information of the document.

The experimental results on DocRED show that the performance of existing sentence-level RE methods declines significantly on DocRED, indicating that document-level RE is more challenging than sentence-level RE and remains an open problem. It also relates to document representation, which will be introduced in the next chapter.
Few-shot Relation Extraction. As mentioned before, the performance of conventional RE models [23, 76] depends heavily on time-consuming and labor-intensive annotated data, which makes it hard for them to generalize well. Although distant supervision is a primary approach to alleviating this problem, the distantly supervised data also exhibit a long-tail distribution, where most relations have very limited instances. Furthermore, distant supervision suffers from the wrong-labeling problem, which makes it even harder to classify long-tail relations. Hence, it is necessary to study how to train RE models with insufficient training instances. Figure 4.10 gives an example of few-shot RE.
Fig. 4.10  An example of few-shot relation extraction

Table 4.1  An example of a 3-way 2-shot scenario. Different colors indicate different entities, with the head entity underlined and the tail entity emphasized (in the original table)

Supporting set:
(A) capital_of
  (1) London is the capital of the U.K.
  (2) Washington is the capital of the U.S.A.
(B) member_of
  (1) Newton served as the president of the Royal Society.
  (2) Leibniz was a member of the Prussian Academy of Sciences.
(C) birth_name
  (1) Samuel Langhorne Clemens, better known by his pen name Mark Twain, was an American writer.
  (2) Alexei Maximovich Peshkov, primarily known as Maxim Gorky, was a Russian and Soviet writer.
Test instance:
(A) or (B) or (C): Euler was elected a foreign member of the Royal Swedish Academy of Sciences.
FewRel [26] is a new large-scale supervised few-shot RE dataset, which requires models to handle classification with only a handful of training instances, as shown in Table 4.1. Benefiting from the FewRel dataset, there have been several efforts exploring few-shot RE [17, 53, 73] that achieve promising results. Yet few-shot RE still remains a challenging problem for further research [18].
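Many few-shot RE methods, e.g., the prototypical networks explored in [17], classify a query instance by comparing its representation with class prototypes averaged from the support set. The following sketch, with a hypothetical encoder output and assumed dimensions, only illustrates the N-way K-shot decision rule and is not the method of any particular paper.

import torch

def prototypical_predict(support, query):
    """N-way K-shot classification by nearest class prototype.
    support: (N, K, dim) encoded support instances for N relations, K shots each
    query:   (dim,) encoded query instance
    """
    prototypes = support.mean(dim=1)                 # (N, dim) class prototypes
    dists = ((prototypes - query) ** 2).sum(dim=-1)  # squared Euclidean distances
    return torch.argmin(dists).item()                # index of the predicted relation

support = torch.randn(3, 2, 256)   # 3-way 2-shot, as in Table 4.1
query = torch.randn(256)
pred = prototypical_predict(support, query)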
4.6 Summary

In this chapter, we introduce sentence representation learning. Sentence representation encodes the semantic information of a sentence into a real-valued vector and can be utilized in downstream sentence classification or matching tasks. First, we introduce the one-hot representation for sentences and probabilistic language models. Second, we introduce several neural language models in detail, including language models based on feedforward, convolutional, and recurrent neural networks as well as the Transformer. These neural models can learn rich linguistic and semantic knowledge from language modeling. Benefiting from this, pre-trained language models trained on large-scale corpora have achieved state-of-the-art performance on various downstream NLP tasks by transferring the semantic knowledge learned from general corpora to the target tasks. Finally, we introduce several typical applications of sentence representation, including text classification and relation extraction.
For further understanding of sentence representation learning and its applications, we recommend the following surveys and books:

• Goldberg, Neural network methods for natural language processing [21].
• Deng and Liu, Deep learning in natural language processing [13].

In the future, for better sentence representation, some directions require further effort:

(1) Exploring Advanced Architectures. The improvement of model architectures is the key factor in the success of sentence representation. From feedforward neural networks to the Transformer, researchers keep designing neural models better suited to sequential inputs. Based on the Transformer, some researchers are working on new NLP architectures; for instance, Transformer-XL [10] is proposed to overcome the fixed-length context of the Transformer. Since the Transformer is the state-of-the-art NLP architecture, current works mainly adopt attention mechanisms. Beyond these works, is it possible to introduce more human cognitive mechanisms into neural models?

(2) Modeling Long Documents. The representation of long documents is an important extension of sentence representation. Modeling long documents brings new challenges, such as discourse analysis and coreference resolution. Although some existing works already provide document-level NLP tasks (e.g., DocRED [72]), model performance on these tasks is still much lower than human performance. We will introduce advances in document representation learning in the following chapter.

(3) Performing Efficient Representation. Although the combination of the Transformer and large-scale data leads to very powerful sentence representations, these representation models require expensive computation, which limits their application in downstream tasks. Some existing works explore model compression techniques for more efficient models, including knowledge distillation [60] and parameter pruning [16]. Beyond these works, there remain many unsolved problems in developing better representation models that can efficiently learn from large-scale data and provide effective vectors for downstream tasks.
References
1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointlylearning to align and translate. In
Proceedings of ICLR , 2015.2. Yoshua Bengio. Neural net language models.
Scholarpedia , 3(1):3881, 2008.3. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilisticlanguage model.
Journal of Machine Learning Research , 3(Feb):1137–1155, 2003.4. Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets byimportance sampling. In
Proceedings of AISTATS, 2003. 5. Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. Better document-level sentiment analysis from RST discourse parsing. In
Proceedings of EMNLP , 2015.6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation.
Journal ofMachine Learning Research , 3:993–1022, 2003.7. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.Class-based n-gram models of natural language.
Computational linguistics , 18(4):467–479,1992.8. Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In
Pro-ceedings of NeurIPS , 2019.9. Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutionalnetworks for text classification. In
Proceedings of EACL , volume 1, 2017.10. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov.Transformer-xl: Attentive language models beyond a fixed-length context. In
Proceedings ofACL , page 2978–2988, 2019.11. Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reason-ing over entities, relations, and text using recurrent neural networks. In
Proceedings of EACL ,pages 132–141, 2017.12. Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling withgated convolutional networks. In
Proceedings of ICML , 2017.13. Li Deng and Yang Liu.
Deep learning in natural language processing . Springer, 2018.14. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training ofdeep bidirectional transformers for language understanding. In
Proceedings of NAACL , 2019.15. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, MingZhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language under-standing and generation. In
Proceedings of NeurIPS , 2019.16. Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand withstructured dropout. In
Proceedings of ICLR , 2020.17. Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. Hybrid attention-based prototypicalnetworks for noisy few-shot relation classification. In
Proceedings of AAAI , pages 6407–6414,2019.18. Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. FewRel 2.0:Towards more challenging few-shot relation classification. In
Proceedings of EMNLP-IJCNLP ,pages 6251–6256, 2019.19. Michael Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, and Gae-tano Rossiello. Inducing implicit relations from text using distantly supervised deep nets. In
International Semantic Web Conference , pages 38–55. Springer, 2018.20. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale senti-ment classification: A deep learning approach. In
Proceedings of ICML , 2011.21. Yoav Goldberg. Neural network methods for natural language processing.
Synthesis Lectureson Human Language Technologies , 10(1):1–309, 2017.22. Joshua Goodman. Classes for fast maximum entropy training. In
Proceedings of ASSP , 2001.23. Matthew R Gormley, Mo Yu, and Mark Dredze. Improved relation extraction with feature-richcompositional embedding models. In
Proceedings of EMNLP , 2015.24. Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledgefor knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016.25. Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attentionbetween knowledge graph and text. In
Proceedings of AAAI , pages 4832–4839, 2018.26. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun.FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-artevaluation. In
Proceedings of EMNLP , 2018.27. Zhiheng Huang, Geoffrey Zweig, and Benoit Dumoulin. Cache based recurrent neural networklanguage model inference for first pass speech recognition. In
Proceedings of ICASSP , 2014.28. Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Distant supervision for relation extractionwith sentence-level attention and entity descriptions. In
Proceedings of AAAI, pages 3060–3066, 2017. 29. Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. In
Proceedings of ACL-HLT , 2015.30. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural networkfor modelling sentences. In
Proceedings of ACL , 2014.31. Yoon Kim. Convolutional neural networks for sentence classification. In
Proceedings ofEMNLP , 2014.32. Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks fortext classification. In
Proceedings of AAAI , 2015.33. Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extractionvia reading comprehension. In
Proceedings of CoNLL , 2017.34. Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman,Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. BioCreative VCDR task corpus: a resource for chemical disease relation extraction.
Database , pages 1–10,2016.35. Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. When are tree structures nec-essary for deep learning of representations? In
Proceedings of EMNLP , 2015.36. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extrac-tion with selective attention over instances. In
Proceedings of ACL , 2016.37. Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classificationwith multi-task learning. In
Proceedings of IJCAI , 2016.38. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, MikeLewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretrainingapproach. arXiv preprint arXiv:1907.11692, 2019.39. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visi-olinguistic representations for vision-and-language tasks. In
Proceedings of NeurIPS , 2019.40. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of wordrepresentations in vector space. In
Proceedings of ICLR , 2013.41. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recur-rent neural network based language model. In
Proceedings of InterSpeech , 2010.42. Makoto Miwa and Mohit Bansal. End-to-end relation extraction using lstms on sequences andtree structures. In
Proceedings of ACL , 2016.43. Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilisticlanguage models. In
Proceedings of ICML , 2012.44. Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model.In
Proceedings of Aistats , 2005.45. Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross-sentence n-ary relation extraction with graph LSTMs.
Transactions of the Association forComputational Linguistics , 5:101–115, 2017.46. Matthew E Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh,and Noah A Smith. Knowledge enhanced contextual word representations. In
Proceedings ofEMNLP-IJCNLP , 2019.47. Ngoc-Quan Pham, German Kruszewski, and Gemma Boleda. Convolutional neural networklanguage models. In
Proceedings of EMNLP , 2016.48. Matt Post and Shane Bergsma. Explicit and implicit syntactic features for text classification.In
Proceedings of ACL , 2013.49. Chris Quirk and Hoifung Poon. Distant supervision for relation extraction beyond the sentenceboundary. In
Proceedings of EACL , 2017.50. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving languageunderstanding by generative pre-training.
URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf, 2018.51. Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. Relation extractionwith matrix factorization and universal schemas. In
Proceedings of NAACL-HLT, pages 74–84, 2013. 52. Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks.
IEEE Transac-tions on Signal Processing , 45(11):2673–2681, 1997.53. Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching theBlanks: Distributional similarity for relation learning. In
Proceedings of ACL , pages 2895–2905, 2019.54. Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compo-sitionality through recursive matrix-vector spaces. In
Proceedings of EMNLP , 2012.55. Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning.Semi-supervised recursive autoencoders for predicting sentiment distributions. In
Proceedingsof EMNLP , 2011.56. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew YNg, and Christopher Potts. Recursive deep models for semantic compositionality over a senti-ment treebank. In
Proceedings of EMNLP , 2013.57. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence tosequence pre-training for language generation. In
Proceedings of ICML , 2019.58. Daniel Soutner, Zdenˇek Loose, Ludˇek Müller, and Aleš Pražák. Neural network languagemodel with cache. In
Proceedings of ICTSD , 2012.59. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: Ajoint model for video and language representation learning. In
Proceedings of ICCV , 2019.60. Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert modelcompression. In
Proceedings of EMNLP-IJCNLP , page 4314–4323, 2019.61. Kumutha Swampillai and Mark Stevenson. Inter-sentential relations in information extractioncorpora. In
Proceedings of LREC , 2010.62. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representa-tions from tree-structured long short-term memory networks. In
Proceedings of ACL , 2015.63. Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural networkfor sentiment classification. In
Proceedings of EMNLP , 2015.64. Wilson L Taylor. “cloze procedure": A new tool for measuring readability.
Journalism Bulletin ,30(4):415–433, 1953.65. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez,and Lukasz Kaiser. Attention is all you need. In
Proceedings of NeurIPS , 2017.66. Patrick Verga, Emma Strubell, and Andrew McCallum. Simultaneously self-attending to allmentions for full-abstract biological relation extraction. In
Proceedings of NAACL-HLT , 2018.67. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointlyembedding. In
Proceedings of EMNLP , pages 1591–1601, 2014.68. Zhigang Wang and Juan-Zi Li. Text-enhanced representation learning for knowledge graph. In
Proceedings of IJCAI , pages 1293–1299, 2016.69. Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. Semantic relation classificationvia convolutional neural networks with simple negative sampling. In
Proceedings of EMNLP ,2015.70. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc VLe. Xlnet: Generalized autoregressive pretraining for language understanding. In
Proceedingsof NeurIPS , 2019.71. Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchicalattention networks for document classification. In
Proceedings of NAACL , 2016.72. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang,Jie Zhou, and Maosong Sun. DocRED: A large-scale document-level relation extraction dataset.In
Proceedings of ACL , 2019.73. Zhi-Xiu Ye and Zhen-Hua Ling. Multi-level matching and aggregation network for few-shotrelation classification. In
Proceedings of ACL , pages 2872–2881, 2019.74. Wenpeng Yin and Hinrich Schütze. Multichannel variable-size convolution for sentence clas-sification. In
Proceedings of CoNLL , 2015.75. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extractionvia piecewise convolutional neural networks. In
Proceedings of EMNLP , 2015.eferences 8976. Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation classification viaconvolutional deep neural network. In
Proceedings of COLING , 2014.77. Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Incorporating relation paths inneural relation extraction. In
Proceedings of EMNLP , 2017.78. Dongxu Zhang and Dong Wang. Relation classification via recurrent neural network. arXivpreprint arXiv:1508.01006, 2015.79. Ye Zhang, Iain Marshall, and Byron C Wallace. Rationale-augmented convolutional neuralnetworks for text classification. In
Proceedings of EMNLP , 2016.80. Ye Zhang, Stephen Roller, and Byron C Wallace. Mgnc-cnn: A simple approach to exploitingmultiple word embeddings for sentence classification. In
Proceedings of NAACL , 2016.81. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie:Enhanced language representation with informative entities. In
Proceedings of ACL , 2019.
Chapter 5
Document Representation
Abstract
A document is usually the highest linguistic unit of natural language. Document representation aims to encode the semantic information of the whole document into a real-valued representation vector, which could be further utilized in downstream tasks. Recently, document representation has become an essential task in natural language processing and has been widely used in many document-level real-world applications such as information retrieval and question answering. In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that learn the topic distribution of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.
Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While more and more textual information is available electronically, effective retrieval and mining are getting more and more difficult without the efficient organization, summarization, and indexing of document content. Therefore, document representation is playing an important role in many real-world applications, e.g., document retrieval, web search, and spam filtering. Document representation aims to represent the document input as a fixed-length vector, which could describe the contents of the document, to reduce the complexity of the documents and make them easier to handle. Traditional document representation models such as one-hot document representation have achieved promising results in many document classification and clustering tasks due to their simplicity, efficiency, and often surprising accuracy.

However, the one-hot document representation model has many disadvantages. First, it loses the word order, and thus, different documents can have the same representation, as long as the same words are used. Second, it usually suffers data sparsity and high dimensionality. The one-hot document representation model has very little sense about the semantics of the words or, more formally, the distances between the words. Hence, one approach for representing text documents uses multi-word terms as vector components, which are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the notion of topic models that terms should contain more semantic information than individual words. And another advantage of using terms for representing a document is its lower dimensionality compared with the traditional one-hot document representation.

Nevertheless, applying these to generation tasks remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit, and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns, but developing representation-based alternatives has also been difficult. Although document representation can capture aspects of coherent sentence structure, it is not clear how it could help in generating more broadly cohesive text.

Recently, neural network models have shown compelling results in generating meaningful and grammatical documents in sequence generation tasks like machine translation or parsing. It is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form meanings that they wish to express. Based on neural network models, many research works have developed a variety of ways to incorporate document-level contextual information. These models are all hybrid architectures in that they are recurrent at the sentence level, but use a different structure to summarize the context outside the sentence. Furthermore, some models explore multilevel recurrent architectures for combining local and global information in language modeling.

In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that aim to learn latent topic distributions of words and documents. Further, we give an introduction to distributed document representation including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representations, including information retrieval and question answering.
5.2 One-Hot Document Representation

The majority of machine learning algorithms take a fixed-length vector as input, so documents need to be represented as vectors. The bag-of-words model is the most common and simple representation method for documents. Similar to one-hot sentence representation, for a document $d = \{w_1, w_2, \dots, w_l\}$, a bag-of-words representation $\mathbf{d}$ can be used to represent this document. Specifically, for a vocabulary $V = [w_1, w_2, \dots, w_{|V|}]$, the one-hot representation of word $w_i$ is $\mathbf{w}_i = [0, 0, \dots, 1, \dots, 0]$. Based on the one-hot word representation and a vocabulary $V$, it can be extended to represent a document as

$\mathbf{d} = \sum_{k=1}^{l} \mathbf{w}_k$,   (5.1)

where $l$ is the length of the document $d$. And similar to one-hot sentence representation, the TF-IDF method is also proposed to enhance the ability of bag-of-words representation in reflecting how important a word is to a document in a corpus.

Actually, the bag-of-words representation is mainly used as a tool of feature generation, and the most common type of feature calculated from this method is word frequency in the documents. This method is simple but efficient and sometimes can reach excellent performance in many real-world applications. However, the bag-of-words representation still entirely ignores the word order information, which means different documents can have the same representation as long as the same words are used. Furthermore, the bag-of-words representation has little sense about the semantics of the words or, more formally, the distances between words, which means this method cannot utilize rich information hidden in the word representations.

5.3 Topic Model

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, audio, videos, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information.

Right now, we work with online information using two main tools: search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing.

Imagine searching and exploring documents based on the themes that run through them. We might "zoom in" and "zoom out" to find specific or broader themes; we might look at how those themes changed through time or how they are connected. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper, such as foreign policy, national affairs, and sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it, such as Chinese foreign policy, the conflict in the Middle East, and the United States' relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes.
The thematic structure would be a new kind of window through which to explore and digest the collection.

But we do not interact with electronic archives in this way. While more and more texts are available online, we do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate vast archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to explore the themes that run through them, how those themes are connected, and how they change over time. Topic modeling algorithms do not require any prior annotations or labeling of the documents. The topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.
A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. Hofmann first introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing method (pLSI). The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalization ability of the model to new documents. Thus, Latent Dirichlet Allocation (LDA) was extended from this model by introducing a Dirichlet prior to the model. LDA is regarded as a simple but efficient topic model. We first describe the basic ideas of LDA [6].

The intuition behind LDA is that documents exhibit multiple topics. LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose.

We formally define a topic to be a distribution over a fixed vocabulary. We assume that these topics are specified before any data has been generated. Now for each document in the collection, we generate the words in a two-stage process.

1. Randomly choose a distribution over topics.
2. For each word in the document,
• Randomly choose a topic from the distribution over topics in step 1.
• Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportions (step 1); each word in each document is drawn from one of the topics (step 2).
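To make this two-stage generative process concrete, a minimal simulation of it is sketched below. The vocabulary size, topic number, document lengths, and Dirichlet hyperparameters are illustrative assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (assumptions): K topics over a V-word vocabulary.
K, V, num_docs, doc_len = 3, 8, 4, 10
alpha = np.full(K, 0.5)                            # Dirichlet prior over per-document topic proportions
beta = rng.dirichlet(np.full(V, 0.1), size=K)      # K topic-word distributions, fixed before generation

corpus = []
for d in range(num_docs):
    theta_d = rng.dirichlet(alpha)                 # step 1: choose a distribution over topics
    words = []
    for n in range(doc_len):
        z = rng.choice(K, p=theta_d)               # step 2a: choose a topic for this word
        w = rng.choice(V, p=beta[z])               # step 2b: choose a word from that topic's distribution
        words.append(w)
    corpus.append(words)

print(corpus[0])   # under this model, a document is simply a list of word indices
```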
LDA and other topic models are part of the broader field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. Given the observed variables, we perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables. This conditional distribution is also called the posterior distribution.

LDA falls precisely into this framework. The observed variables are the words of the documents, the hidden variables are the topic structure, and the generative process is as described above. The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables given the documents.

We can describe LDA more formally with the following notation. The topics are $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{dk}$ is the topic proportion for topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. Finally, the observed words for document $d$ are $w_d$, where $w_{d,n}$ is the $n$th word in document $d$, which is an element from the fixed vocabulary.

With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

$P(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} P(\beta_i) \prod_{d=1}^{D} P(\theta_d) \left( \prod_{n=1}^{N} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$.   (5.2)

Notice that this distribution specifies a number of dependencies. For example, the topic assignment $z_{d,n}$ depends on the per-document topic proportions $\theta_d$. As another example, the observed word $w_{d,n}$ depends on the topic assignment $z_{d,n}$ and all of the topics $\beta_{1:K}$.

These dependencies define LDA. They are encoded in the statistical assumptions behind the generative process, in the particular mathematical form of the joint distribution, and in a third way, in the probabilistic graphical model for LDA. Probabilistic graphical models provide a graphical language for describing families of probability distributions.

Fig. 5.1 The architecture of the graphical model for Latent Dirichlet Allocation

The graphical model for LDA is in Fig. 5.1. Each node is a random variable and is labeled according to its role in the generative process. The hidden nodes, the topic proportions, assignments, and topics, are unshaded. The observed nodes, the words of the documents, are shaded. We use rectangles as plate notation to denote replication. The $N$ plate denotes the collection of words within documents; the $D$ plate denotes the collection of documents within the collection. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.

We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is

$P(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \dfrac{P(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})}$.   (5.3)

The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.

Topic modeling algorithms form an approximation of the above equation by forming an alternative distribution over the latent topic structure that is adapted to be close to the true posterior. Topic modeling algorithms generally fall into two categories: sampling-based algorithms and variational algorithms.

Sampling-based algorithms attempt to collect samples from the posterior by approximating it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain, a sequence of random variables, each dependent on the previous, whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm is to run the chain for a long time, collect samples from the limiting distribution, and then approximate the distribution with the collected samples.

Variational methods are a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Thus, the inference problem is transformed into an optimization problem. Variational methods open the door for innovations in optimization to have a practical impact on probabilistic modeling.
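As an illustration of the sampling-based family, the following is a minimal, unoptimized collapsed Gibbs sampler for LDA. The corpus format, the symmetric hyperparameters, and the count-matrix bookkeeping are standard textbook choices assumed here for clarity; they are not taken from the chapter.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    C_d = np.zeros((D, K))                 # document-topic counts
    C_w = np.zeros((V, K))                 # word-topic counts
    C_k = np.zeros(K)                      # per-topic totals
    z = [[0] * len(doc) for doc in docs]

    # Random initialization of the topic assignments.
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = int(rng.integers(K))
            z[d][n] = k
            C_d[d, k] += 1; C_w[w, k] += 1; C_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts.
                C_d[d, k] -= 1; C_w[w, k] -= 1; C_k[k] -= 1
                # Conditional distribution of z_{d,n} given all other assignments.
                p = (C_d[d] + alpha) * (C_w[w] + beta) / (C_k + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][n] = k
                # Counts are updated instantly after every token, the defining trait of CGS.
                C_d[d, k] += 1; C_w[w, k] += 1; C_k[k] += 1
    return C_d, C_w

# Toy corpus of word-id lists (purely illustrative).
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 1, 0]]
C_d, C_w = gibbs_lda(docs, K=2, V=6)
print(C_d)   # document-topic counts; normalized rows estimate the topic proportions
```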
The simple LDA model provides a powerful tool for discovering and exploitingthe hidden thematic structure in large archives of text. However, one of the mainadvantages of formulating LDA as a probabilistic model is that it can easily beused as a module in more complicated models for more complex goals. Since itsintroduction, LDA has been extended and adapted in many ways.
LDA is defined by the statistical assumptions it makes about the corpus. One activearea of topic modeling research is how to relax and extend these assumptions touncover a more sophisticated structure in the texts.One assumption that LDA makes is the bag-of-words assumption that the orderof the words in the document does not matter. While this assumption is unrealistic, itis reasonable if our only goal is to uncover the coarse semantic structure of the texts.For more sophisticated goals, such as language generation, it is patently not appropri-ate. There have been many extensions to LDA that model words non-exchangeable.For example, [59] develops a topic model that relaxes the bag-of-words assumptionby assuming that the topics generate words conditional on the previous word; [22]develops a topic model that switches between LDA and a standard HMM. These mod-els expand the parameter space significantly but show improved language modelingperformance.Another assumption is that the order of documents does not matter. Again, thiscan be seen by noticing that Eq. 5.3 remains invariant to permutations of the orderingof documents in the collection. This assumption may be unrealistic when analyzinglong-running collections that span years or centuries. In such collections, we maywant to assume that the topics change over time. One approach to this problem is thedynamic topic model [5], a model that respects the ordering of the documents andgives a more productive posterior topical structure than LDA.The third assumption about LDA is that the number of topics is assumed knownand fixed. The Bayesian nonparametric topic model provides an elegant solution: Thecollection determines the number of topics during posterior inference, and new doc-uments can exhibit previously unseen topics. Bayesian nonparametric topic models have been extended to hierarchies of topics, which find a tree of topics, moving frommore general to more concrete, whose particular structure is inferred from the data[4].
In many text analysis settings, the documents contain additional information suchas author, title, geographic location, links, and others that we might want to accountfor when fitting a topic model. There has been a flurry of research on adapting topicmodels to include meta-data.The author-topic model [51] is an early success story for this kind of research. Thetopic proportions are attached to authors; papers with multiple authors are assumedto attach each word to an author, drawn from a topic drawn from his or her topicproportions. The author-topic model allows for inferences about authors as well asdocuments.Many document collections are linked. For example, scientific papers are linked bycitations, or web pages are connected by hyperlinks. And several topic models havebeen developed to account for those links when estimating the topics. The relationaltopic model of [9] assumes that each document is modeled as in LDA and that thelinks between documents depend on the distance between their topic proportions.This is both a new topic model and a new network model. Unlike traditional statisticalmodels of networks, the relational topic model takes into account node attributes inmodeling the links.Other work that incorporates meta-data into topic models includes models oflinguistic structure [8], models that account for distances between corpora [60], andmodels of named entities [42]. General-purpose methods for incorporating meta-data into topic models include Dirichlet-multinomial regression models [39] andsupervised topic models [37].
In the existing fast algorithms, it is difficult to decouple the access to the document-topic counts $C_d$ and the word-topic counts $C_w$ because both counts need to be updated instantly after the sampling of every token. Many algorithms have been proposed to accelerate LDA based on the CGS sampling equation. WarpLDA [13] is built based on a new Monte Carlo Expectation Maximization (MCEM) algorithm, which is similar to CGS, but both counts are fixed until the sampling of all tokens is finished. This scheme can be used to develop a reordering strategy to decouple the accesses to $C_d$ and $C_w$, and minimize the size of randomly accessed memory.

Specifically, WarpLDA seeks a MAP solution of the latent variables $\Theta$ and $\Phi$, with the latent topic assignments $Z$ integrated out, i.e., it maximizes $\log P(\Theta, \Phi \mid W, \alpha', \beta')$, where $\alpha'$ and $\beta'$ are the Dirichlet hyperparameters. Reference [2] has shown that this MAP solution is almost identical with the solution of CGS, with proper hyperparameters.

Computing $\log P(\Theta, \Phi \mid W, \alpha', \beta')$ directly is expensive because it needs to enumerate all the $K$ possible topic assignments for each token. We, therefore, optimize its lower bound as a surrogate. Let $Q(Z)$ be a variational distribution. Then, by Jensen's inequality, the lower bound $J(\Theta, \Phi, Q(Z))$ can be obtained as

$\log P(\Theta, \Phi \mid W, \alpha', \beta') \ge E_Q\left[\log P(W, Z \mid \Theta, \Phi) - \log Q(Z)\right] + \log P(\Theta \mid \alpha') + \log P(\Phi \mid \beta') \triangleq J(\Theta, \Phi, Q(Z))$.   (5.4)

An Expectation Maximization (EM) algorithm is implemented to find a local maximum of the posterior $P(\Theta, \Phi \mid W, \alpha', \beta')$, where the E-step maximizes $J$ with respect to the variational distribution $Q(Z)$ and the M-step maximizes $J$ with respect to the model parameters $(\Theta, \Phi)$, while keeping $Q(Z)$ fixed. One can prove that the optimal solution at the E-step is $Q(Z) = P(Z \mid W, \Theta, \Phi)$ without further assumption on $Q$. We apply the Monte Carlo approximation to the expectation in Eq. 5.4,

$E_Q\left[\log P(W, Z \mid \Theta, \Phi) - \log Q(Z)\right] \approx \dfrac{1}{S}\sum_{s=1}^{S}\left[\log P(W, Z^{(s)} \mid \Theta, \Phi) - \log Q(Z^{(s)})\right]$,   (5.5)

where $Z^{(1)}, \dots, Z^{(S)} \sim Q(Z) = P(Z \mid W, \Theta, \Phi)$. The sample size is set as $S = 1$, and we use $Z$ as an abbreviation of $Z^{(1)}$.

Sampling $Z$: Each dimension of $Z$ can be sampled independently:

$Q(z_{d,n} = k) \propto P(W, Z \mid \Theta, \Phi) \propto \theta_{dk}\,\phi_{w_{d,n},k}$.   (5.6)

Optimizing $\Theta, \Phi$: With the Monte Carlo approximation, we have

$J \approx \log P(W, Z \mid \Theta, \Phi) + \log P(\Theta \mid \alpha') + \log P(\Phi \mid \beta') + \text{const.} = \sum_{d,k}(C_{dk} + \alpha'_k - 1)\log\theta_{dk} + \sum_{k,w}(C_{kw} + \beta'_w - 1)\log\phi_{kw} + \text{const.}$,   (5.7)

and with the optimal solutions, we have

$\hat{\theta}_{dk} \propto C_{dk} + \alpha'_k - 1, \qquad \hat{\phi}_{wk} = \dfrac{C_{wk} + \beta'_w - 1}{C_k + \bar{\beta}' - V}$.   (5.8)

Instead of computing and storing $\hat{\Theta}$ and $\hat{\Phi}$, we compute and store $C_d$ and $C_w$ to save memory because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6, and letting $\alpha = \alpha' - 1$, $\beta = \beta' - 1$, we get the full MCEM algorithm, which iteratively performs the following two steps until a given iteration number is reached:

• E-step: We can sample $z_{d,n} \sim Q(z_{d,n})$ according to

$Q(z_{d,n} = k) \propto (C_{dk} + \alpha_k)\dfrac{C_{wk} + \beta_w}{C_k + \bar{\beta}}$.   (5.9)

• M-step: Compute $C_d$ and $C_w$ from $Z$.

Note that the resemblance of this update to CGS intuitively justifies why MCEM leads to similar results with CGS. The difference between MCEM and CGS is that MCEM updates the counts $C_d$ and $C_w$ after sampling all $z_{d,n}$s, while CGS updates the counts instantly after sampling each $z_{d,n}$. The strategy that MCEM updates the counts after sampling all $z_{d,n}$s is called delayed count update, or simply delayed update. MCEM can be viewed as a CGS with a delayed update, which has been widely used in other algorithms [1, 41]. While previous work uses the delayed update as a trick, we here present a theoretical guarantee to converge to a MAP solution. The delayed update is essential for us to decouple the accesses of $C_d$ and $C_w$ to improve cache locality, without affecting the correctness.

5.4 Distributed Document Representation

To address the disadvantages of bag-of-words document representation, [31] proposes paragraph vector models, including the version with Distributed Memory (PV-DM) and the version with Distributed Bag-of-Words (PV-DBOW). Moreover, researchers also proposed several hierarchical neural network models to represent documents. In this section, we will introduce these models in detail.
As shown in Fig. 5.2, the paragraph vector model maps every paragraph to a unique vector, represented by a column in the matrix $P$, and maps every word to a unique vector, represented by a column in the word embedding matrix $E$. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. More formally, compared to the word vector framework, the only change in this model is in the following equation:

$\mathbf{y} = \text{Softmax}(\mathbf{h}(w_{t-k}, \dots, w_{t+k}; E, P))$,   (5.10)

where $\mathbf{h}$ is constructed by the concatenation or average of the paragraph vector and word vectors extracted from $P$ and $E$.

Fig. 5.2 The architecture of PV-DM model
The other part of this model is that, given a sequence of training words $w_1, w_2, w_3, \dots, w_l$, the objective of the paragraph vector model is to maximize the average log probability:

$\mathcal{O} = \dfrac{1}{l}\sum_{i=k}^{l-k} \log P(w_i \mid w_{i-k}, \dots, w_{i+k})$.   (5.11)

And the prediction task is typically done via a multi-class classifier, such as softmax. Thus, the probability equation is

$P(w_i \mid w_{i-k}, \dots, w_{i+k}) = \dfrac{e^{y_{w_i}}}{\sum_j e^{y_j}}$.   (5.12)

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, this model is often called the Distributed Memory Model of Paragraph Vectors (PV-DM).

The above method considers the concatenation of the paragraph vector with the word vectors to predict the next word in a text window. Another way is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In reality, what this means is that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the paragraph vector. This technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory version of Paragraph Vector (PV-DM) in the previous section.
Fig. 5.3 The architecture of PV-DBOW model
In addition to being conceptually simple, this model requires storing less data: only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also similar to the Skip-gram model used for word vectors.
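To make the PV-DM computation in Eqs. 5.10 to 5.12 concrete, a minimal single-update sketch is given below. The matrix sizes, initialization, and plain softmax (rather than the hierarchical softmax or negative sampling used in practice) are illustrative assumptions, so this shows the data flow rather than the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, num_paragraphs = 1000, 50, 100                  # illustrative sizes
E = rng.normal(scale=0.1, size=(V, dim))                # word embedding matrix
P = rng.normal(scale=0.1, size=(num_paragraphs, dim))   # paragraph matrix
U = rng.normal(scale=0.1, size=(V, dim))                # softmax output weights

def pv_dm_step(paragraph_id, context_ids, target_id, lr=0.05):
    """One PV-DM update: average the paragraph and context vectors, then predict the target word."""
    h = (P[paragraph_id] + E[context_ids].sum(axis=0)) / (1 + len(context_ids))
    scores = U @ h
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                 # softmax over the vocabulary (cf. Eq. 5.12)
    loss = -np.log(probs[target_id])
    # Gradients of the cross-entropy loss.
    dscores = probs.copy(); dscores[target_id] -= 1.0
    dh = U.T @ dscores
    # Update the softmax weights, the paragraph vector, and the context word vectors.
    U[:] -= lr * np.outer(dscores, h)
    P[paragraph_id] -= lr * dh / (1 + len(context_ids))
    E[context_ids] -= lr * dh / (1 + len(context_ids))
    return loss

print(pv_dm_step(paragraph_id=0, context_ids=[3, 17, 256, 42], target_id=7))
```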
In this part, we introduce two main kinds of neural networks for document repre-sentation including document-context language model and hierarchical documentautoencoder.
Recurrent architectures can be used to combine local and global information in doc-ument language modeling. The simplest such model would be to train a single RNN,ignoring sentence boundaries as mentioned above; the last hidden state from the pre-vious sentence t − t . In suchan architecture, the length of the RNN is equal to the number of tokens in the docu-ment; in typical genres such as news texts, this means training RNNs from sequencesof several hundred tokens, which introduces two problems: (1) Information decay
In a sentence with thirty tokens (not unusual in news text), the contextual informa-tion from the previous sentence must be propagated through the recurrent dynamicsthirty times before it can reach the last token of the current sentence. Meaningfuldocument-level information is unlikely to survive such a long pipeline. (2)
Learning
It is notoriously difficult to train recurrent architectures that involve many time steps.
In the case of an RNN trained on an entire document, back-propagation would haveto run over hundreds of steps, posing severe numerical challenges.To address these two issues, [28] proposes to use multilevel recurrent structuresto represent documents, thereby successfully efficiently leveraging document-levelcontext in language modeling. They first proposed Context-to-Context Document-Context Language Model (ccDCLM), which assumes that contextual informationfrom previous sentences needs to be able to “short-circuit” the standard RNN, so asto more directly impact the generation of words across longer spans of text. Formally,we have c t − = h t − , l , (5.13)where l is the length of sentence t −
1. The ccDCLM model then creates additionalpaths for this information to impact each hidden representation in the current sentence t . Writing w t , n for the word representation of the n th word in the t th sentence, wehave h t , n = g θ ( h t , n − , f ( w t , n , c t − ), (5.14)where g θ ( · ) is the activation function parameterized by θ and f ( · ) is a function thatcombines the context vector with the input x t , n for the hidden state. Here we simplyconcatenate the representations, f ( x t , n , c t − ) = [ x t , n ; c t − ] . (5.15)The emission probability for y t , n is then computed from h t , n as in the standardRNNLM. The underlying assumption of this model is that contextual informationshould impact the generation of each word in the current sentence. The model,therefore, introduces computational “short-circuits” for cross-sentence information,as illustrated in Fig. 5.4. y t-1,1 x t-1,1 y t-1,2 x t-2,2 y t-1,M-1 x t-1,M-1 y t-1,M x t-1,M y t,1 x t,1 y t,2 x t,2 y t,N-1 x t,N-1 y t,N x t,N Fig. 5.4
The architecture of ccDCLM model
Fig. 5.5 The architecture of coDCLM model
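Before turning to the context-to-output variant, the ccDCLM "short-circuit" of Eqs. 5.13 to 5.15 can be sketched as follows: the previous sentence's final hidden state is concatenated with every word input of the current sentence. The plain tanh recurrent cell standing in for $g_\theta$ and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_word, dim_hidden = 16, 32                            # illustrative sizes

# A plain tanh RNN cell stands in for g_theta; its input is [word ; context] (cf. Eq. 5.15).
W_in = rng.normal(scale=0.1, size=(dim_hidden, dim_word + dim_hidden))
W_rec = rng.normal(scale=0.1, size=(dim_hidden, dim_hidden))

def ccdclm_sentence(word_vectors, context):
    """Run one sentence, injecting the previous sentence's last hidden state at every step."""
    h = np.zeros(dim_hidden)
    states = []
    for w in word_vectors:
        x = np.concatenate([w, context])                 # f(w_{t,n}, c_{t-1}) = [w ; c]
        h = np.tanh(W_in @ x + W_rec @ h)                # h_{t,n} = g([w ; c], h_{t,n-1})
        states.append(h)
    return states

prev_sentence = [rng.normal(size=dim_word) for _ in range(7)]
curr_sentence = [rng.normal(size=dim_word) for _ in range(5)]

# c_{t-1} is the last hidden state of sentence t-1 (cf. Eq. 5.13).
c_prev = ccdclm_sentence(prev_sentence, context=np.zeros(dim_hidden))[-1]
hidden_states = ccdclm_sentence(curr_sentence, context=c_prev)
print(len(hidden_states), hidden_states[-1].shape)
```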
Besides, they also proposed Context-to-Output Document-Context LanguageModel (coDCLM). Rather than incorporating the document context into the recurrentdefinition of the hidden state, the coDCLM model pushes it directly to the output, asillustrated in Fig. 5.5. Let h t , n be the hidden state from a conventional RNNLM ofsentence t , h t , n = g θ ( h t , n − , x t , n ). (5.16)Then, the context vector c t − is directly used in the output layer as y t , n ∼ Softmax ( W h h t , n + W c c t − + b ). (5.17) Reference [33] also proposes hierarchical document autoencoder to represent doc-uments. The model draws on the intuition that just as the juxtaposition of wordscreates a joint meaning of a sentence, the juxtaposition of sentences also creates ajoint meaning of a paragraph or a document.They first obtain representation vectors at the sentence level by putting one layerof LSTM (denoted as LSTM wordencode ) on top of its containing words: h wt ( enc ) = LSTM wordencode ( w t , h vt − ( enc )). (5.18)The vector output at the ending time step is used to represent the entire sentenceas s = h wend s . (5.19)To build representation e D for the current document/paragraph, another layer ofLSTM (denoted as LSTM sentenceencode ) is placed on top of all sentences, computing rep-resentations sequentially for each time step: .4 Distributed Document Representation 105 food any find she hungry was mary.didn’tMary was hungry she didn’t find any foodEncode -WordDecode -WordEncode-SentenceDecode-Sentence . Fig. 5.6
The architecture of hierarchical document autoencoder h st ( enc ) = LSTM sentenceencode ( s , h st − ( enc )). (5.20)Representation h send D computed at the final time step is used to represent the entiredocument: d = h send D .Thus one LSTM operates at the token level, leading to the acquisition of sentence-level representations that are then used as inputs into the second LSTM that acquiresdocument-level representations, in a hierarchical structure.As with encoding, the decoding algorithm operates on a hierarchical structure withtwo layers of LSTMs. LSTM outputs at sentence level for time step t are obtainedby h st ( dec ) = LSTM sentencedecode ( s t , h st − ( dec )). (5.21)The initial time step h s ( d ) = e D , the end-to-end output from the encoding proce-dure h st ( d ) is used as the original input into LSTM worddecode for subsequently predictingtokens within sentence t +
1. LSTM worddecode predicts tokens at each position sequen-tially, the embedding of which is then combined with earlier hidden vectors for thenext time-step prediction until the end s token is predicted. The procedure can besummarized as follows: h wt ( dec ) = LSTM sentencedecode ( w t , h wt − ( dec )), (5.22) P ( w |· ) = Softmax ( w , h wt − ( dec )). (5.23)
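The two-level encoding of Eqs. 5.18 to 5.20 can be sketched as below. For brevity, a plain tanh recurrent cell stands in for the word-level and sentence-level LSTMs, and all dimensions are illustrative assumptions rather than settings from the original model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 24                                                 # shared embedding size (illustrative)

def make_encoder():
    """Return a function that reads a sequence of vectors and outputs its last hidden state."""
    W_x = rng.normal(scale=0.1, size=(dim, dim))
    W_h = rng.normal(scale=0.1, size=(dim, dim))
    def encode(inputs):
        h = np.zeros(dim)
        for x in inputs:
            h = np.tanh(W_x @ x + W_h @ h)
        return h
    return encode

word_encoder = make_encoder()       # stands in for LSTM^word_encode
sentence_encoder = make_encoder()   # stands in for LSTM^sentence_encode

# A document is a list of sentences; a sentence is a list of word vectors.
document = [[rng.normal(size=dim) for _ in range(6)] for _ in range(4)]

# Each sentence is represented by the word-level encoder's final hidden state (cf. Eq. 5.19).
sentence_vectors = [word_encoder(sentence) for sentence in document]

# The sentence-level encoder reads the sentence vectors; its final state represents the document (cf. Eq. 5.20).
doc_vector = sentence_encoder(sentence_vectors)
print(doc_vector.shape)
```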
Fig. 5.7 The architecture of hierarchical document autoencoder with attentions
During decoding, LSTM worddecode generates each word token w sequentially andcombines it with earlier LSTM-outputted hidden vectors. The LSTM hidden vectorcomputed at the final time step is used to represent the current sentence.This is passed to LSTM sentencedecode , combined with h st for the acquisition of h t + ,and outputted to the next time step in sentence decoding. For each time step t ,LSTM sentencedecode has to first decide whether decoding should proceed or come to a fullstop: we add an additional token end D to the vocabulary. Decoding terminates whentoken end D is predicted. Details are shown in Fig. 5.6.Attention models adopt a look-back strategy by linking the current decodingstage with input sentences in an attempt to consider which part of the input is mostresponsible for the current decoding state (Fig. 5.7).Let H = { h s ( e ), h s ( e ), . . . , h sN ( e ) } be the collection of sentence-level hiddenvectors for each sentence from the inputs, outputted from LSTM sentenceencode . Each ele-ment in H contains information about input sequences with a strong focus on theparts surrounding each specific sentence (time step). During decoding, suppose that e st denotes the sentence-level embedding at current step and that h st − ( dec ) denotesthe hidden vector outputted from LSTM sentencedecode at previous time step t −
1. Atten-tion models would first link the current-step decoding information, i.e., h st − ( dec ) which is outputted from LSTM sentencedec with each of the input sentences i ∈ [ , N ] ,characterized by a strength indicator v i : v i = U (cid:8) f ( W · h st − ( dec ) + W · h si ( enc )), (5.24) .4 Distributed Document Representation 107 where W , W ∈ R K × K , U ∈ R K × . v i is then normalized α i = exp ( v i ) (cid:4) j exp ( v j ) . (5.25)The attention vector is then created by averaging weights over all input sentences: m t = N D (cid:2) i = α i h si ( enc ) (5.26) In this section, we will introduce several applications on document level analysisbased on representation learning.
Information retrieval aims to obtain relevant resources from a large-scale collectionof information resources. As shown in Fig. 5.8, given the query “Steve Jobs” as input,the search engine (a typical application of information retrieval) provides relevantweb pages for users. Traditional information retrieval data consists of search queriesand document collections D . And the ground truth is available through explicit humanjudgments or implicit user behavior data such as click-through rate. Fig. 5.8
An example of information retrieval
For the given query q and document d , traditional information retrieval modelsestimate their relevance through lexical matches. Neural information retrieval mod-els pay more attention to garner the query and document relevance from semanticmatches. Both lexical and semantic matches are essential for neural informationretrieval. Thriving from neural network black magic, it helps information retrievalmodels catch more sophisticated matching features and have achieved the state ofthe art in the information retrieval task [17].Current neural ranking models can be categorized into two groups: representation-based and interaction-based [23]. The earlier works mainly focus on representation-based models. They learn good representations and match them in the learned repre-sentation space of queries and documents. Interaction-based methods, on the otherhand, model the query-document matches from the interactions of their terms. The representation-based methods directly match the query and documents by learn-ing two distributed representations, respectively, and then compute the matchingscore based on the similarity between them. In recent years, several deep neuralmodels have been explored based on such Siamese architecture, which can be doneby feedforward layers, convolutional neural networks, or recurrent neural networks.Reference [26] proposes Deep Structured Semantic Models (DSSM) first to hashwords to the letter-trigram-based representation. And then use a multilayer fullyconnected neural network to encode a query (or a document) as a vector. The rel-evance between the query and document can be simply calculated with the cosinesimilarity. Reference [26] trains the model by minimizing the cross-entropy loss onclick-through data where each training sample consists of a query q , a positive doc-ument d + , and a uniformly sampled negative document set D − : L DSSM ( q , d + , D − ) = − log (cid:5) e r · cos ( q , d + ) (cid:4) d ∈ D e r · cos ( q , d ) (cid:6) , (5.27)where D = d + ∪ D − .Furthermore, CDSSM [54] and ARC-I [25] utilize convolutional neural network(CNN), while LSTM-RNN [44] adopts recurrent neural network with Long Short-Term Memory (LSTM) units to represent a sentence better. Reference [53] also comesup with a more sophisticated similarity function by leveraging additional layers ofthe neural network. The interaction-based neural ranking models learn word-level interaction patternsfrom query-document pairs, as shown in Fig. 5.9. And they provide an opportunity tocompare different parts of the query with different parts of the document individually .5 Applications 109 documentinteraction matrix qu e r y neural network Fig. 5.9
The architecture of interaction-based neural ranking models and aggregate the partial evidence of relevance. ARC-II [25] and MatchPyra-mind [45] utilize convolutional neural network to capture complicated patterns fromword-level interactions. The Deep Relevance Matching Model (DRMM) uses pyra-mid pooling (histogram) to summarize the word-level similarities into ranking mod-els [23]. There are also some works establishing position-dependent interactions forranking models [27, 46].Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional versionConv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRMfirst establishes a translation matrix M in which each element M i j is the cosinesimilarity of i th word in q and j th word in d . Then K-NRM utilizes kernels toconvert translation matrix M to ranking features φ( M ) : φ( M ) = n (cid:2) i = log K ( M i ), (5.28) K ( M i ) = { K ( M i ), . . . , K K ( M i ) } . (5.29)Each RBF kernel K k calculates how word pair similarities are distributed: K k ( M i ) = (cid:2) j exp (cid:7) − ( M i j − μ k ) σ K (cid:8) . (5.30)Then the relevance of q and d is calculated by a ranking layer: f ( q , d ) = tanh ( w (cid:8) φ( M ) + b ), (5.31)where w and b are trainable parameters.
Reference [66] trains the model by minimizing pair-wise loss on click-throughdata: L = (cid:2) q (cid:2) d + , d − ∈ D + , − max ( , − f ( q , d + ) + f ( q , d − )). (5.32)For the given query q , D + , − are the pair-wise preferences from the ground truth. d + and d − are two documents such that d + is more relevant with q than d − . Conv-KNRM extends K-NRM to model n -gram semantic matches based on the convolu-tional neural network which can leverage snippet information. Representation-based models and interaction-based models extract match featuresfrom overall and local aspects, respectively. They can also be combined for furtherimprovements [40].Recently, large-scale knowledge graphs such as DBpedia, Yago, and Freebasehave emerged. Knowledge graphs contain human knowledge about real-world enti-ties and become an opportunity for search systems to understand queries and doc-uments better. The emergence of large-scale knowledge graphs has motivated thedevelopment of entity-oriented search, which brings in entities and semantics fromthe knowledge graphs and has dramatically improved the effectiveness of feature-based search systems.Entity-oriented search and neural ranking models push the boundary of match-ing from two different perspectives. Reference [36] incorporates semantics fromknowledge graphs into the neural ranking, such as entity descriptions and entitytypes. This work significantly improves the effectiveness and generalization abilityof interaction-based neural ranking models. However, how to fully leverage semi-structured knowledge graphs and establish semantic relevance between queries anddocuments remains an open question.Information retrieval has been widely used in many natural language processingtasks such as reading comprehension and question answering. Therefore, it is nodoubt that neural information retrieval will lead to a new tendency for these tasks.
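Before moving on to question answering, a minimal sketch of the kernel pooling described in Eqs. 5.28 to 5.31 is given below. The random embeddings, the number of kernels, the kernel width, and the evenly spaced kernel means are illustrative assumptions rather than the settings of the original K-NRM system.

```python
import numpy as np

def knrm_features(q_emb, d_emb, mus, sigma=0.1):
    """Kernel-pooled ranking features phi(M) from a query/document embedding pair."""
    # Translation matrix M: cosine similarity between every query word and document word.
    qn = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    dn = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    M = qn @ dn.T                                             # shape (|q|, |d|)
    # Each RBF kernel summarizes how the similarities in row i are distributed (cf. Eq. 5.30).
    K = np.exp(-((M[:, :, None] - mus) ** 2) / (2 * sigma ** 2)).sum(axis=1)   # (|q|, num_kernels)
    # phi(M) sums the log kernel scores over the query words (cf. Eq. 5.28).
    return np.log(np.maximum(K, 1e-10)).sum(axis=0)

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(3, 50))        # 3 query words, 50-dim embeddings (illustrative)
d_emb = rng.normal(size=(20, 50))       # 20 document words
mus = np.linspace(-0.9, 1.0, 11)        # kernel means, an assumed evenly spaced setting

phi = knrm_features(q_emb, d_emb, mus)
w, b = rng.normal(size=phi.shape), 0.0
score = np.tanh(w @ phi + b)            # ranking layer of Eq. 5.31
print(score)
```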
Question Answering (QA) is one of the most important tasks in NLP and a typical document-level application. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we will introduce the advances in these two tasks, respectively.
As shown in Fig. 5.10, machine reading comprehension aims to determine the answer a to the question q given a passage p . The task could be viewed as a supervisedlearning problem: given a collection of training examples { ( p i , q i , a i ) } ni = , we wantto learn a mapping f ( · ) that takes the passage p i and corresponding question q i asinputs and outputs ˆ a i , where evaluate ( ˆ a i , a i ) is maximized. The evaluation metric istypically correlated with the answer type, which will be discussed in the following.Generally, the current machine reading comprehension task could be divided intofour categories depending on the answer types according to [10], i.e., cloze style,multiple choices, span prediction, and free-form answer.The cloze style task such as CNN/Daily Mail [24] consists of fill-in-the-blanksentences where the question contains a placeholder to be filled in. The answer a iseither chosen from a predefined candidate set | A | or from the vocabulary | V | . Themultiple-choice task such as RACE [30] and
MCTest [50] aims to select the bestanswer from a set of answer choices. It is typical to use accuracy to measure theperformance on these two tasks: the percentage of correctly answered questions inthe whole example set, since the question could be either correctly answered or notfrom the given hypothesized answer set.
Fig. 5.10
An example of machine reading comprehension from SQuAD [49]
The span prediction task such as SQuAD [49] is perhaps the most widely adoptedtask among all, since it takes compromises between flexibility and simplicity. Thetask is to extract a most likely text span from the passage as the answer to the question,which is usually modeled as predicting the start position idx start and end position idx end of the answer span. To evaluate the predicted answer span ˆ a , we typically usetwo evaluation metrics proposed by [49]. Exact match assigns full score 1.0 to thepredicted answer span ˆ a if it exactly equals the ground truth answer a , otherwise 0.0.F1 score measures the degree of overlap between ˆ a and a by computing a harmonicmean of the precision and recall.The free-form answer task such as MS MARCO [43] does not restrict the answerform or length and is also referred to as generative question answering . It is practicalto model the task as a sequence generation problem, where the discrete token-levelprediction was made. Currently, a consensus on what is the ideal evaluation metricshas not been achieved. It is common to adopt standard metrics in machine translationand summarization, including ROUGE [34] and BLEU [57].As a critical component in the question answering system, the surging neural-based machine reading comprehension models have greatly boosted the task ofquestion answering in the last decades.The first attempt [24] to apply neural networks on machine reading comprehensionconstructs bidirectional LSTM reader models along with attention mechanisms. Thework introduces two reader models, i.e., the attentive reader and the impatient reader,as shown in Fig. 5.11. After encoding the passage and the query into hidden statesusing LSTMs, the attentive reader computes a scalar distribution s ( t ) over the passagetokens and uses it to compute the weighted sum of the passage hidden states r . Theimpatient reader extends this idea further by recurrently updating the weighted sumof passage hidden states after it has seen each query token.The attention mechanisms used in reading comprehension could be viewed asa variant of Memory Networks [64]. Memory Networks use long-term memoryunits to store information for inference dynamically. Typically, given an input x , Mary went to England visited England r g u s(1)y(1)s(2)y(2) s(3)y(3)s(4)y(4) (a) attentive reader
Mary went to England visited England r gur r (b) impatient reader
Fig. 5.11
The architecture of bidirectional LSTM reader model.5 Applications 113 the model first converts it into an internal feature representation F ( x ) . Then, themodel can update the designated memory units m i given the new input: m i = g ( m i , F ( x ), m ) , or generate output features o given the new input and the mem-ory states: o = f ( F ( x ), m ) . Finally, the model converts the output into the responsewith the desired format: r = R ( o ) . The key takeaway of Memory Networks is theretaining and updating of some internal memories that captivate global information.We will see how this idea is further extended in some sophisticated models.It is no doubt that the application of attention to machine reading comprehensiongreatly promotes researches in this field. Following [11], the work [24] modifies themethod to compute attention and simplify the prediction layer in the attentive reader.Instead of using tanh ( · ) to compute the relevance between the passage representa-tions { ˜ p i } ni = and the query hidden state q (see Eq. 5.33), Chen et al. use the bilinearterms to directly capture the passage-query alignment (see Eq. 5.34). α i = Softmax i ( tanh ( W ˜ p i + W q )), (5.33) α i = Softmax i ( q (cid:8) W ˜ p i ). (5.34)Most machine reading comprehension models follow the same paradigm to locatethe start and endpoint of the answer span. As shown in Fig. 5.12, while encoding thepassage, the model retains the length of the sequence and encodes the question intoa fixed-length hidden representation q . The question’s hidden vector is then usedas a pointer to scan over the passage representation { p i } ni = and compute scoreson every position in the passage. While maintaining this similar architecture, mostmachine reading comprehension models vary in the interaction methods between thepassage and the question. In the following, we will introduce several classic readingcomprehension architectures that follow this paradigm. Fig. 5.12
The architecture of classic machine reading comprehension models
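Following the paradigm of Fig. 5.12, a minimal sketch of scoring start and end positions with a bilinear term, in the spirit of Eq. 5.34, is shown below. The dimensions, the random parameters, the names `W_start` and `W_end`, and the simple argmax span heuristic are illustrative assumptions rather than parts of any particular published model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, passage_len = 64, 40                      # illustrative sizes

p = rng.normal(size=(passage_len, dim))        # contextual passage representations {p_i}
q = rng.normal(size=dim)                       # fixed-length question representation
W_start = rng.normal(scale=0.1, size=(dim, dim))
W_end = rng.normal(scale=0.1, size=(dim, dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Bilinear scores q^T W p_i over every passage position (cf. Eq. 5.34).
p_start = softmax(p @ (W_start @ q))
p_end = softmax(p @ (W_end @ q))

# Predict the most probable span with start <= end (a simple greedy heuristic).
start = int(np.argmax(p_start))
end = start + int(np.argmax(p_end[start:]))
print(start, end)
```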
First, we introduce BiDAF, which is short for
Bi-Directional Attention Flow [52]. The BiDAF network consists of the token embedding layer, the contextual embedding layer, the bi-directional attention flow layer, the LSTM modeling layer, and the softmax output layer, as shown in Fig. 5.13.

The token embedding layer consists of two levels. First, the character embedding layer encodes each word at the character level by adopting a 1D convolutional neural network (CNN). Specifically, for each word, characters are embedded into fixed-length vectors, which are considered as 1D inputs for the CNN. The outputs are then max-pooled along the embedding dimension to obtain a single fixed-length vector. Second, the word embedding layer uses pretrained word vectors, i.e., GloVe [47], to map each word into a high-dimensional vector directly. Then the concatenation of the two vectors is fed into a two-layer Highway Network [56]. Equations 5.35 and 5.36 show one layer of the highway network used in the paper, where $H_1(\cdot)$ and $H_2(\cdot)$ represent two affine transformations:

$\mathbf{g} = \text{Sigmoid}(H_1(\mathbf{x}))$,   (5.35)

$\mathbf{y} = \mathbf{g} \odot \text{ReLU}(H_2(\mathbf{x})) + (\mathbf{1} - \mathbf{g}) \odot \mathbf{x}$.   (5.36)

Fig. 5.13
The architecture of BiDAF model
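A minimal sketch of the highway layer of Eqs. 5.35 and 5.36 follows; the two affine maps are randomly initialized stand-ins for $H_1$ and $H_2$, and the width is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32                                                   # illustrative width

W1, b1 = rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim)   # H_1: gate transform
W2, b2 = rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim)   # H_2: candidate transform

def highway_layer(x):
    """y = g * ReLU(H_2(x)) + (1 - g) * x, with g = Sigmoid(H_1(x))."""
    g = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))               # transform gate (Eq. 5.35)
    candidate = np.maximum(0.0, W2 @ x + b2)               # ReLU(H_2(x))
    return g * candidate + (1.0 - g) * x                   # Eq. 5.36

x = rng.normal(size=dim)
print(highway_layer(highway_layer(x)).shape)               # BiDAF stacks two such layers
```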
After feeding the context and the query to the token embedding layer, we obtain X ∈ R d × T for the context and Q ∈ R d × J for the query, respectively. Afterward, thecontextual embedding layer, which is a bidirectional LSTM, model the temporalinteraction between words for both the context and the query.Then, come to the attention flow layer. In this layer, the attention dependencyis computed in both directions, i.e., the context-to-query (C2Q) attention and thequery-to-context (Q2C) attention. For both kinds of attention, we first compute asimilarity matrix S ∈ R T × J using the contextual embeddings of the context H andthe query U obtained from the last layer (Eq. 5.37). In the equation, α( · ) computesthe scalar similarity of the given two vectors and m is a trainable weight vector. S t j = α( H : , t , U : , j ) (5.37) α( h , u ) = m (cid:8) [ h ; u ; h (cid:10) u ] , (5.38)where (cid:10) indicates element-wise product.For the C2Q attention, a weighted sum of contextual query embeddings iscomputed given each context word. The attention distribution over the query isobtained by a j = Softmax ( S j , : ) ∈ R J . The final attended query vector is therefore ˜ U : , t = (cid:4) j a t j U : , j for each context word.For the Q2C attention, the context embeddings are merged into a single fixedlength hidden vector ˜ h . The attention distribution over the context is computed by b t = Softmax ( max j S t j ) , and ˜ h = (cid:4) t b t H : , t . Lastly, the merged context embeddingsare tiled T times along the column to produce ˜ H .Finally, the attended outputs are combined to yield G , which is defined by Eq. 5.39 G : , t = φ( H : , t , ˜ U : , t , ˜ H : , t ) (5.39) β( h , ˜ u , ˜ h ) = [ h ; ˜ u ; h (cid:10) ˜ u ; h (cid:10) ˜ h ] . (5.40)Afterward, the LSTM modeling layer takes G as input and encodes it using atwo-layer bidirectional LSTM. The output M ∈ R d × T is combined with G to yieldthe final start and end probability distributions over the passage. P = Softmax ( u (cid:8) [ G ; M ] ), (5.41) P = Softmax ( u (cid:8) [ G ; LSTM ( M ) ] ), (5.42)where u , u are two trainable weight vectors.To train the model, the negative log likelihood loss is adopted and the goal isto maximize the probability of the golden start index idx start and end index idx end being selected by the model,
16 5 Document Representation L = − N N (cid:2) i = (cid:9) log ( P idx istart ) + log ( P idx istart ) (cid:10) . (5.43)Besides BiDAF, where attention dependencies are computed in two directions,we will also briefly introduce other interaction methods between the query and thepassage. The Gated-Attention Reader proposed by [19] adopts the gated attentionmodule, where each token representation of the passage d i is scaled by the attendedquery vector Q after each Bi-GRU layer (Eq. 5.44). α i = Softmax ( Q (cid:8) d i ) (5.44) ˜ q i = Q α i (5.45) x i = d i (cid:10) ˜ q i . (5.46)This gated attention mechanism allows the query to directly interact with the tokenembeddings of the passage at the semantic level. And such layer-wise interactionenables the model to learn conditional token representation given the question atdifferent representation levels.The Attention-over-Attention Reader [16] takes another path to model the inter-action. The attention-over-attention mechanism involves calculating the attentionbetween the passage attention α( t ) and the averaged question attention β after obtain-ing the similarity matrix M ∈ R n × m (Eq. 5.47). This operation is considered to learnthe contributions of individual question words explicitly. α( t ) = Softmax ( M : , t ),β = N N (cid:2) t = Softmax ( M t , : ). (5.47) Open-domain QA (OpenQA) has been first proposed by [21]. The task aims toanswer open-domain questions using external resources such as collections of docu-ments [58], web pages [14, 29], structured knowledge graphs [3, 7] or automaticallyextracted relational triples [20].Recently, with the development of machine reading comprehension techniques[11, 16, 19, 55, 63], researchers attempt to answer open-domain questions via per-forming reading comprehension on plain texts. Reference [12] proposes to employneural-based models to answer open-domain questions. As illustrated in Fig. 5.14,neural-based OpenQA system usually retrieves relevant texts of the question from alarge-scale corpus and then extracts answers from these texts using reading compre-hension models. .5 Applications 117
Question "Who is the CEO of Apple in 2020?" → Document Retriever → Document Reader → Answer: "Tim Cook"
Fig. 5.14
An example of open-domain question answering
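The pipeline in Fig. 5.14 can be sketched as follows. This is a simplified illustration assuming a unigram TF-IDF retriever over a toy corpus and a placeholder reader; a real system would use the full retriever (with bigram hashing) and a trained reading comprehension model such as the DrQA reader described next.

import numpy as np
from collections import Counter

corpus = [
    "Tim Cook is the chief executive officer of Apple .",
    "Apple is a fruit that grows on trees .",
    "The capital of France is Paris .",
]

def tfidf_vectors(texts):
    """Build unigram TF-IDF bag-of-words vectors for a list of texts."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    df = Counter(w for t in texts for w in set(t.lower().split()))
    idf = {w: np.log(len(texts) / df[w]) for w in vocab}
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w, tf in Counter(t.lower().split()).items():
            vecs[i, idx[w]] = tf * idf[w]
    return vecs, idx, idf

def retrieve(question, k=1):
    """Return the k documents most similar to the question."""
    doc_vecs, idx, idf = tfidf_vectors(corpus)
    q_vec = np.zeros(doc_vecs.shape[1])
    for w, tf in Counter(question.lower().split()).items():
        if w in idx:
            q_vec[idx[w]] = tf * idf[w]
    norms = np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q_vec) + 1e-8) + 1e-8
    scores = doc_vecs @ q_vec / norms
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def read(question, passage):
    # Placeholder: a real reader would predict the highest-scoring answer span.
    return passage

question = "Who is the CEO of Apple in 2020 ?"
for passage in retrieve(question):
    print(read(question, passage))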
The DrQA system consists of two components: (1) the document retriever module for finding relevant articles and (2) the document reader model for extracting answers from the given contexts. The document retriever is used as a first quick skim to narrow the search space and focus on documents that are likely to be relevant. The retriever builds TF-IDF weighted bag-of-words vectors for the documents and the questions and computes similarity scores for ranking. To further utilize local word order information, the retriever uses bigram counts with hashing, preserving both speed and memory efficiency. The document reader model takes in the top 5 Wikipedia articles yielded by the document retriever and extracts the final answer to the question. For each article, the document reader predicts an answer span with a confidence score, and the final prediction is made by maximizing the unnormalized exponential of the prediction scores across the documents. Given each document d, the document reader first builds a feature representation d̃_i for each word d_i in the document. The feature representation d̃_i is made up of the following components.
1. Word embeddings: The word embeddings f_emb(d_i) are obtained from large-scale GloVe embeddings pretrained on Wikipedia.
2. Manual features: The manual features f_token(d_i) combine part-of-speech (POS) tags, named entity recognition tags, and normalized term frequencies (TF).
3. Exact match: This feature indicates whether d_i can be exactly matched to a question word in q.
4. Aligned question embeddings: This feature encodes a soft alignment between words in the document and the question in the word embedding space:
f_align(d_i) = Σ_j α_{ij} E(q_j),   (5.48)
α_{ij} = exp(MLP(E(d_i)) · MLP(E(q_j))) / Σ_{j'} exp(MLP(E(d_i)) · MLP(E(q_{j'}))),   (5.49)

where MLP(x) = max(0, Wx + b) and E(q_j) indicates the word embedding of the j-th word in the question. Finally, the feature representation is obtained by concatenating the above features:

d̃_i = (f_emb(d_i), f_token(d_i), f_exact_match(d_i), f_align(d_i)).   (5.50)

Then the feature representation of the document is fed into a multilayer bidirectional LSTM (BiLSTM) to encode the contextual representation:

d_1, ..., d_n = BiLSTM(d̃_1, ..., d̃_n).   (5.51)

For the question, the contextual representation is simply obtained by encoding the word embeddings with a multilayer BiLSTM:

q_1, ..., q_m = BiLSTM(q̃_1, ..., q̃_m).   (5.52)

After that, the contextual representation is aggregated into a fixed-length vector using self-attention:

b_j = exp(u · q_j) / Σ_{j'} exp(u · q_{j'}),   (5.53)
q = Σ_j b_j q_j.   (5.54)

In the answer prediction phase, the start and end probability distributions are calculated following the paradigm described in the Reading Comprehension Model section (Sect. 5.5.2.1):

P_start(i) = exp(d_i^⊤ W_start q) / Σ_{i'} exp(d_{i'}^⊤ W_start q),   (5.55)
P_end(i) = exp(d_i^⊤ W_end q) / Σ_{i'} exp(d_{i'}^⊤ W_end q).   (5.56)

Despite its success, the DrQA system is prone to noise in the retrieved texts, which may hurt its performance. Hence, [15] and [61] attempt to solve the noise problem in DrQA by separating question answering into paragraph selection and answer extraction; both select only the most relevant paragraph among all retrieved paragraphs to extract answers, and thus lose a large amount of rich information contained in the neglected paragraphs. Hence, [62] proposes strength-based and coverage-based re-ranking approaches, which can aggregate the results extracted from each paragraph by an existing DS-QA system to determine the answer better. However, this method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise in distant supervision data because it considers all retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-to-fine denoising OpenQA model, which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from the denoised paragraphs.

In this chapter, we have introduced document representation learning, which encodes the semantic information of a whole document into a real-valued representation vector, providing an effective way for downstream tasks to utilize document information and significantly improving the performance of these tasks. First, we introduced the one-hot representation for documents. Next, we extensively introduced topic models, which represent both words and documents using latent topic distributions. Further, we gave an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduced several typical real-world applications of document representations, including information retrieval and question answering. In the future, for better document representation, some directions require further effort:
(1)
Incorporating External Knowledge.
Current document representation approaches focus on representing documents with the semantic information of the document text itself. However, knowledge bases provide external semantic information for better understanding the real-world entities mentioned in a given document. Researchers have reached a consensus that incorporating the entity semantics of knowledge bases into document representation is a promising path toward better document representation. Some existing work leverages various entity semantics to enhance the semantic information of document representations and achieves better performance in multiple applications such as document ranking [36, 65]. Explicitly modeling structural and textual semantic information, as well as considering entity importance for the given document, also sheds light on a more interpretable and knowledgeable document representation for downstream NLP tasks.
(2)
Considering Document Interactions.
The candidate documents in downstream NLP tasks are usually relevant to each other, which may help to better model document semantic information. There is no doubt that the interactions among documents, whether through implicit semantic relations or explicit links, provide additional semantic signals to enhance document representations. Reference [32] makes a preliminary attempt to use document interactions to extract important words and improve model performance. Nevertheless, how to effectively and explicitly incorporate semantic information from other documents into document representations remains an unsolved problem.
(3)
Pretraining for Document Representation.
Pretraining has shown its effectiveness and thrives on downstream NLP tasks. Existing pretrained language models, such as Word2vec-style word co-occurrence models [38] and BERT-style masked language models [18, 48], focus on representation learning at the sentence level, which does not work well for document-level representation. It is still challenging to model cross-sentence relations, text coherence, and coreference at the document level in document representation learning. Moreover, there are also some methods that leverage useful signals such as anchor-document information to supervise document representation learning [67]. How to pretrain document representation models with efficient and effective strategies remains a critical and challenging problem.
References
1. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012.
2. Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of UAI, 2009.
3. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP, 2013.
4. David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7, 2010.
5. David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006.
6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
7. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
8. Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS, 2009.
9. Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124–150, 2010.
10. Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University, 2018.
11. Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of ACL, 2016.
12. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, 2017.
13. Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: a cache efficient O(1) algorithm for latent Dirichlet allocation. Proceedings of VLDB, 2016.
14. Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question answering sentence selection. In Proceedings of EACL, 2017.
15. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of ACL, 2017.
16. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL, 2017.
17. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018.
18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
19. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of ACL, 2017.
20. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of SIGKDD, 2014.
21. Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Proceedings of IRE-AIEE-ACM, 1961.
22. Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In Proceedings of NeurIPS, 2004.
23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM, 2016.
24. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NeurIPS, 2015.
25. Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NeurIPS, 2014.
26. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of CIKM, 2013.
27. Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. PACRR: A position-aware neural IR model for relevance matching. In Proceedings of EMNLP, 2017.
28. Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.
29. Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS, pages 242–262, 2001.
30. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
31. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014.
32. Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang Xu. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of EMNLP, 2018.
33. Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, 2015.
34. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
35. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-domain question answering. In Proceedings of ACL, 2018.
36. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. In Proceedings of ACL, 2018.
37. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008.
38. T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. Proceedings of NeurIPS, 2013.
39. David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proceedings of UAI, 2008.
40. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of WWW, 2017.
41. David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent Dirichlet allocation. In Proceedings of NeurIPS, 2007.
42. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic models. In Proceedings of SIGKDD, 2006.
43. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
44. Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4):694–707, 2016.
45. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text matching as image recognition. In Proceedings of AAAI, 2016.
46. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. DeepRank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM, 2017.
47. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
49. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
50. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.
51. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of UAI, 2004.
52. Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.
53. Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR, 2015.
54. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM, 2014.
55. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of SIGKDD, 2017.
56. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
57. A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. BLEU+: a tool for fine-grained BLEU computation. 2008.
58. Ellen M Voorhees et al. The TREC-8 question answering track report. In Proceedings of TREC, 1999.
59. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of ICML, 2006.
60. Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings of AISTATS, 2009.
61. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI, 2018.
62. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In Proceedings of ICLR, 2018.
63. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
64. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
65. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document ranking. In Proceedings of SIGIR, 2017.
66. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017.
67. Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.
Chapter 6
Sememe Knowledge Representation
Abstract
Linguistic Knowledge Graphs (e.g., WordNet and HowNet) describe linguistic knowledge in a formal and structured language, which can be easily incorporated into modern natural language processing systems. In this chapter, we focus on research about HowNet. We first briefly introduce the background and basic concepts of HowNet and sememes. Next, we introduce the motivations for sememe representation learning and existing approaches. At the end of this chapter, we review important applications of sememe representation.
In the field of Natural Language Processing (NLP), words are generally the smallest objects of study because they are considered the smallest meaningful units of human languages that can stand by themselves. However, the meanings of words can be further divided into smaller parts. For example, the meaning of man can be considered as the combination of the meanings of human, male, and adult, and the meaning of boy is composed of the meanings of human, male, and child. In linguistics, the minimum indivisible units of meaning, i.e., semantic units, are defined as sememes [8], and some linguists believe that the meanings of all words can be composed from a limited, closed set of sememes. However, sememes are implicit, and as a result it is hard to intuitively define the set of sememes and determine at a glance which sememes a word has. Therefore, some researchers have spent decades sifting sememes from all kinds of dictionaries and linguistic Knowledge Bases (KBs) and annotating words with these selected sememes to construct sememe-based linguistic KBs. WordNet and HowNet [17] are the two most famous such KBs. In this section, we focus on the representation of linguistic knowledge in HowNet.
WordNet is a large lexical database for the English language and can also be viewed as a knowledge graph containing multi-relational data. It was started in 1985 under the direction of George Armitage Miller, a psychology professor in the Cognitive Science Laboratory of Princeton University. Nowadays, WordNet has become the most popular lexical dictionary in the world; it is freely available through the Web and is widely used in NLP applications such as text analysis, information retrieval, and relation extraction. There is also a Global WordNet Association aiming to provide a public and noncommercial platform for WordNets of all languages in the world. Based on meanings, WordNet groups English nouns, verbs, adjectives, and adverbs into synsets (i.e., sets of cognitive synonyms), each of which represents a distinct concept. Each synset possesses a brief description, and in most cases some short sentences function as examples illustrating the use of the words in the synset. Conceptual-semantic and lexical relations link the synsets and words. The main relation among words is synonymy, which indicates that the words share similar meanings and can replace each other in some contexts, while the main relation among synsets is hypernymy/hyponymy (i.e., the IS-A relation), which holds between a more general synset and a more specific one. There are also hierarchical structures for verb synsets, and antonymy describes the relation between adjectives with opposite meanings. To sum up, WordNet's more than 117,000 synsets are linked to one another by these conceptual-semantic and lexical relations.

HowNet was initially designed and constructed by Zhendong Dong and his son Qiang Dong in the 1990s, and it has been frequently updated since it was published in 1999. The sememe set of HowNet was determined by extracting, analyzing, merging, and filtering the semantics of thousands of Chinese characters, and it can also be adjusted or expanded in the subsequent process of annotating words. Each sememe in HowNet is represented by a word or phrase in both Chinese and English, for example, human and ProperName. HowNet also builds a taxonomy for the sememes: every sememe belongs to one of the following types: Thing, Part, Attribute, Time, Space, Attribute Value, or Event. In addition, to depict the semantics of words more precisely, HowNet incorporates relations between sememes, called "dynamic roles", into the sememe annotations of words. Considering polysemy, HowNet differentiates the diverse senses of each word in its sememe annotations, and each sense is also expressed in both Chinese and English.
Fig. 6.1
An example of word annotated with sememes in HowNet
Table 6.1
Statistics of HowNet
Type                    Count
Sense                   229,767
Distinct Chinese word   127,266
Distinct English word   104,025
Sememe                  2,187
An example of sememe annotation for a word is illustrated in Fig. 6.1. We can see from the figure that the word apple has four senses, including apple (computer), apple (phone), apple (fruit), and apple (tree), and each sense is the root node of a "sememe tree" in which each pair of father and son sememe nodes is connected by a labeled relation. Additionally, HowNet annotates the POS tag for each sense and adds a sentiment category as well as some usage examples for certain senses. The latest version of HowNet was published in January 2019, and its statistics are shown in Table 6.1. Since HowNet was published, it has attracted wide attention. People have used HowNet and sememes in various NLP tasks, including word similarity computation [40], word sense disambiguation [70], question classification [62], and sentiment analysis [16, 20]. Among these studies, [40] is one of the most influential works, in which the similarity of two given words is computed by measuring the degree of resemblance of their sememe trees. Recent years have also witnessed works incorporating sememes into neural network models. Reference [49] proposes a novel word representation learning model named SST that reforms Skip-gram [43] by adding contextual attention to the senses of
the target word, which are represented as combinations of the corresponding sememes' embeddings. Experimental results show that SST can not only improve the quality of word embeddings but also learn satisfactory sense embeddings for word sense disambiguation. Reference [23] incorporates sememes into the decoding phase of language modeling, where sememes are predicted first, and then senses and words are predicted in succession. The proposed model improves the perplexity of language modeling and the performance of the downstream headline generation task. Besides, HowNet has also been utilized in lexicon expansion [68], semantic rationality evaluation [41], etc. Considering that human annotation is time-consuming and labor-intensive, some works attempt to employ machine learning methods to predict sememes for new words automatically. Reference [66] first proposes the task and presents two simple but effective models: SPWE, which is based on collaborative filtering, and SPSE, which is based on matrix factorization. Reference [30] further takes the internal information of words into account when predicting sememes and achieves a considerable performance boost, and [38] takes advantage of the definitions of words to predict sememes. As for [56], they propose the task of cross-lingual lexical sememe prediction and present a bilingual word representation learning and alignment-based model, which demonstrates effectiveness in predicting sememes for cross-lingual words.
Word Representation Learning (WRL) is a fundamental and critical step in many NLP tasks such as language modeling [4] and neural machine translation [64]. There has been a lot of research on learning word representations, among which Word2vec [43] achieves a nice balance between effectiveness and efficiency. In Word2vec, each word corresponds to one single embedding, ignoring the polysemy of most words. To address this issue, [29] introduces a multi-prototype model for WRL, conducting unsupervised word sense induction and learning embeddings according to context clusters. Reference [13] further utilizes the synset information in WordNet to guide word sense representation learning. These previous studies demonstrate that word sense disambiguation is critical for WRL, and the sememe annotation of word senses in HowNet can provide the necessary semantic regularization for these tasks [63]. To explore its feasibility, we introduce the Sememe-Encoded Word Representation Learning (SE-WRL) model, which detects word senses and learns representations simultaneously. More specifically, this framework regards each word sense as a combination of its sememes, iteratively performs word sense disambiguation according to the contexts, and learns representations of sememes, senses, and words by extending Skip-gram in Word2vec [43]. In this framework, an attention-based method is proposed to automatically select appropriate word senses according to the contexts. To take full advantage of sememes, three different learning and attention strategies, SSA, SAC, and SAT, are introduced for SE-WRL and described in the following paragraphs.

The Simple Sememe Aggregation model (SSA) is a straightforward idea based on the Skip-gram model. For each word, SSA considers all sememes in all senses of the word together and represents the target word by the average of all its sememe embeddings. Formally, we have

w = (1/m) Σ_{s_i^{(w)} ∈ S^{(w)}} Σ_{x_j^{(s_i)} ∈ X_i^{(w)}} x_j^{(s_i)},   (6.1)

which means the word embedding of w is composed of the average of all its sememe embeddings. Here, S^{(w)} is the sense set of w, X_i^{(w)} is the sememe set of the i-th sense of w, and m stands for the overall number of sememes belonging to w. This model follows the assumption that the semantic meaning of a word is composed of semantic units, i.e., sememes. Compared with the conventional Skip-gram model, since sememes are shared by multiple words, this model can utilize sememe information to encode latent semantic correlations between words. In this case, similar words that share the same sememes may finally obtain similar representations.
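A minimal sketch of the SSA aggregation in Eq. (6.1) is given below; the sememe inventory and the toy sense annotations are invented for illustration, and the embeddings are randomly initialized rather than trained with Skip-gram.

import numpy as np

dim = 4
sememes = ["human", "male", "adult", "child"]
x = {s: np.random.randn(dim) for s in sememes}     # sememe embeddings

# Toy sense inventories: each word maps to a list of senses,
# and each sense is annotated with a list of sememes.
senses = {
    "man": [["human", "male", "adult"]],
    "boy": [["human", "male", "child"]],
}

def ssa_embedding(word):
    """Word embedding as the average of all sememe embeddings of all senses (Eq. 6.1)."""
    all_sememes = [s for sense in senses[word] for s in sense]
    return np.mean([x[s] for s in all_sememes], axis=0)

w_man, w_boy = ssa_embedding("man"), ssa_embedding("boy")
# Words sharing sememes (human, male) obtain correlated representations.
print(np.dot(w_man, w_boy) / (np.linalg.norm(w_man) * np.linalg.norm(w_boy)))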
The SSA model replaces the target word embedding with the aggregated sememe embeddings to encode sememe information into word representation learning. However, each word in the SSA model still has only one single representation in different contexts, which cannot deal with the polysemy of most words. It is intuitive that we should construct distinct embeddings for a target word according to specific contexts, with the help of the word sense annotations in HowNet. To address this issue, the Sememe Attention over Context model (SAC) is proposed. SAC utilizes the attention scheme to automatically select appropriate senses for context words according to the target word. That is, SAC conducts word sense disambiguation for context words to learn better representations of target words. The structure of the SAC model is shown in Fig. 6.2. More specifically, SAC utilizes the original word embedding for the target word w and uses sememe embeddings to represent each context word w_c instead of its original word embedding. Suppose a word typically demonstrates only some specific senses in one sentence; the target word embedding is then employed as attention to select the most appropriate senses to make up the context word embeddings.
Fig. 6.2 The architecture of the SAC model
The context word embedding w_c can be formalized as follows:

w_c = Σ_{j=1}^{|S^{(w_c)}|} Att(s_j^{(w_c)}) s_j^{(w_c)},   (6.2)

where s_j^{(w_c)} stands for the j-th sense embedding of w_c, and Att(s_j^{(w_c)}) represents the attention score of the j-th sense with respect to the target word w, defined as follows:

Att(s_j^{(w_c)}) = exp(w · ŝ_j^{(w_c)}) / Σ_{k=1}^{|S^{(w_c)}|} exp(w · ŝ_k^{(w_c)}).   (6.3)

Note that, when calculating attention, the average of the sememe embeddings is used to represent each sense s_j^{(w_c)}:

ŝ_j^{(w_c)} = (1/|X_j^{(w_c)}|) Σ_{k=1}^{|X_j^{(w_c)}|} x_k^{(s_j)}.   (6.4)

The attention strategy assumes that the more relevant a context word sense embedding is to the target word w, the more this sense should be considered when building the context word embeddings. With the attention scheme, each context word can be represented as a particular distribution over its senses.
This can be regarded as soft WSD, which helps learn better word representations.
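The SAC attention of Eqs. (6.2)-(6.4) can be sketched as below. Sense embeddings are approximated by the averages of (randomly initialized) sememe embeddings, and the target word embedding plays the role of the attention query; the toy sememe annotation is invented for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

dim = 4
rng = np.random.default_rng(0)
w_target = rng.normal(size=dim)                         # target word embedding

# Sememe sets of each sense of one context word (toy annotation).
context_word_senses = [["tool", "compute"], ["fruit", "able"]]
sememe_emb = {s: rng.normal(size=dim)
              for sense in context_word_senses for s in sense}

# Eq. (6.4): each sense is represented by the average of its sememe embeddings.
sense_emb = np.stack([np.mean([sememe_emb[s] for s in sense], axis=0)
                      for sense in context_word_senses])

# Eq. (6.3): attention of the target word over the context word's senses.
att = softmax(sense_emb @ w_target)

# Eq. (6.2): the context word embedding is the attention-weighted sense mixture.
w_context = att @ sense_emb
print(att, w_context.shape)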
Fig. 6.3 The architecture of the SAT model

The Sememe Attention over Context model can flexibly select appropriate senses and sememes for context words according to the target word. The same process can also be applied to select appropriate senses for the target word, by taking the context words as attention. Hence, the Sememe Attention over Target model (SAT) is proposed, which is shown in Fig. 6.3. Different from the SAC model, SAT learns the original word embeddings for context words and sememe embeddings for the target word. SAT then applies the context words to perform attention over the multiple senses of the target word w to build the embedding of w, formalized as follows:

w = Σ_{j=1}^{|S^{(w)}|} Att(s_j^{(w)}) s_j^{(w)},   (6.5)

and the context-based attention is defined as follows:
Att(s_j^{(w)}) = exp(w'_c · ŝ_j^{(w)}) / Σ_{k=1}^{|S^{(w)}|} exp(w'_c · ŝ_k^{(w)}),   (6.6)

where the average of the sememe embeddings ŝ_j^{(w)} is also used to represent each sense s_j^{(w)}. Here, w'_c is the context embedding, consisting of a constrained window of word embeddings in C(w_i). We have

w'_c = (1/(2K')) Σ_{k=i−K'}^{i+K'} w_k,  k ≠ i.   (6.7)

Note that, since the sense selection of the target word is found in experiments to depend on a more limited set of context words for calculating attention, a smaller K' is selected compared with K. Recall that SAC only uses one target word as attention to select the senses of context words, whereas SAT uses several context words together as attention to select the appropriate senses of the target word. Hence, SAT is expected to conduct more reliable WSD and to result in more accurate word representations, which is explored in the experiments.
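Analogously, here is a sketch of the SAT attention (Eqs. 6.5-6.7), where the averaged window of context word embeddings attends over the target word's senses; the window size and the embeddings are illustrative placeholders only.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

dim, K_prime = 4, 2
rng = np.random.default_rng(1)

# Context word embeddings inside the window of size K' on each side (Eq. 6.7).
window = [rng.normal(size=dim) for _ in range(2 * K_prime)]
w_context = np.mean(window, axis=0)

# Target word senses, each represented by its averaged sememe embeddings.
target_sense_emb = np.stack([rng.normal(size=dim) for _ in range(3)])

# Eq. (6.6): context-based attention over the target word's senses.
att = softmax(target_sense_emb @ w_context)

# Eq. (6.5): the target word embedding is the attention-weighted sum of its senses.
w_target = att @ target_sense_emb
print(att, w_target.shape)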
In the previous section, we introduced HowNet and sememe representation. Linguistic knowledge graphs such as HowNet contain rich information that can effectively help downstream applications. Therefore, in this section, we introduce the major applications of sememe representation, including sememe-based word representation, linguistic knowledge graph construction, and language modeling.

Sememe-guided word representation aims at improving word embeddings for sememe prediction by introducing the information of the sememe-based linguistic KBs of the source language. Qi et al. [56] present two methods of sememe-guided word representation.
A simple and intuitive method is to let words with similar sememe annotations tend to have similar word embeddings, which is named the word relation-based approach. To begin with, a synonym list is constructed from the sememe-based linguistic KBs, where words sharing a certain number of sememes are regarded as synonyms. Next, synonyms are forced to have closer word embeddings. Formally, let w_i be the original word embedding of w_i and ŵ_i be its adjusted word embedding, and let Syn(w_i) denote the synonym set of word w_i. Then the loss function is defined as

L_sememe = Σ_{w_i ∈ V} [ α_i ‖w_i − ŵ_i‖² + Σ_{w_j ∈ Syn(w_i)} β_{ij} ‖ŵ_i − ŵ_j‖² ],   (6.8)

where α and β control the relative strengths of the two terms. It should be noted that the idea of forcing similar words to have close word embeddings is similar to the state-of-the-art retrofitting approach [19]. However, the retrofitting approach cannot be applied here because sememe-based linguistic KBs such as HowNet do not directly provide the synonym list it needs.

Simple and effective as the word relation-based approach is, it cannot make full use of the information of sememe-based linguistic KBs because it disregards the complicated relations between sememes and words as well as the relations between different sememes. To address this limitation, the sememe embedding-based approach is proposed, which learns both sememe and word embeddings jointly. In this approach, sememes are also represented with distributed vectors and placed into the same semantic space as words. Similar to SPSE [66], which learns sememe embeddings by decomposing the word-sememe matrix and the sememe-sememe matrix, this method utilizes sememe embeddings as regularizers to learn better word embeddings. Different from SPSE, the model described in [56] does not use pretrained word embeddings; instead, it learns word embeddings and sememe embeddings simultaneously. More specifically, a word-sememe matrix M can be extracted from HowNet, where M_{ij} = 1 if w_i is annotated with sememe x_j, and M_{ij} = 0 otherwise.
By factorizing M, the loss function can be defined as

L_sememe = Σ_{w_i ∈ V, x_j ∈ X} (w_i · x_j + b_i + b'_j − M_{ij})²,   (6.9)

where b_i and b'_j are the biases of w_i and x_j, and X denotes the sememe set. In this approach, word and sememe embeddings are obtained in a unified semantic space. The sememe embeddings carry all the information about the relationships between words and sememes, and they inject this information into the word embeddings. Therefore, the word embeddings are expected to be more suitable for sememe prediction.
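The sememe-embedding regularizer of Eq. (6.9) can be sketched as follows: the loss is computed for randomly initialized word and sememe embeddings over a toy word-sememe matrix; in the actual model these parameters would be optimized jointly with the word embedding objective.

import numpy as np

rng = np.random.default_rng(0)
n_words, n_sememes, dim = 5, 4, 8

M = rng.integers(0, 2, size=(n_words, n_sememes)).astype(float)  # toy word-sememe matrix
W = rng.normal(size=(n_words, dim))        # word embeddings (to be learned)
X = rng.normal(size=(n_sememes, dim))      # sememe embeddings (to be learned)
b_w = np.zeros(n_words)                    # word biases
b_x = np.zeros(n_sememes)                  # sememe biases

def factorization_loss():
    """Squared reconstruction error of Eq. (6.9)."""
    pred = W @ X.T + b_w[:, None] + b_x[None, :]
    return np.sum((pred - M) ** 2)

print(factorization_loss())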
Semantic Compositionality (SC) is defined as the linguistic phenomenon that the meaning of a syntactically complex unit is a function of the meanings of the unit's constituents and their combination rule [50]. Some linguists regard SC as a fundamental truth of semantics [51]. In the field of NLP, SC has proved effective in many tasks, including language modeling [47], sentiment analysis [42, 61], and syntactic parsing [59]. Most literature on SC focuses on using vector-based distributional models of semantics to learn representations of Multiword Expressions (MWEs), i.e., embeddings of phrases or compounds. Reference [46] conducts pioneering work that introduces a general framework to formulate this task:

p = f(w_1, w_2, R, K),   (6.10)

where f is the compositionality function, p denotes the embedding of an MWE, w_1 and w_2 represent the embeddings of the MWE's two constituents, R stands for the combination rule, and K refers to the additional knowledge needed to construct the semantics of the MWE. (This formulation applies to two-word MWEs but can be easily extended to longer MWEs.) Most of the proposed approaches ignore R and K, centering on reforming the compositionality function f [3, 21, 60, 61]. Some try to integrate the combination rule R into SC models [7, 35, 65, 71]. A few works consider the external knowledge K; reference [72] tries to incorporate task-specific knowledge into an LSTM model for sentence-level SC.

Reference [55] proposes a novel sememe-based method to model semantic compositionality. They argue that sememes are beneficial to modeling SC. To verify this, they first design a simple SC degree (SCD) measurement experiment and find that the SCDs of MWEs computed by simple sememe-based formulae are highly correlated with human judgment. This result shows that sememes can finely depict the meanings of MWEs and their constituents and capture the semantic relations between the two sides. Moreover, they propose two sememe-incorporated SC models for learning embeddings of MWEs, namely the Semantic Compositionality with Aggregated Sememe (SCAS) model and the Semantic Compositionality with Mutual Sememe Attention (SCMSA) model. When learning the embedding of an MWE, the SCAS model concatenates the embeddings of the MWE's constituents and their sememes, while the SCMSA model considers the mutual attention between one constituent's sememes and the other constituent. Finally, they integrate the combination rule, i.e., R in Eq. (6.10), into the two models. Their models achieve significant improvements over baseline methods on the MWE similarity computation task and the sememe prediction task. In this section, we focus on the work conducted by [55]. We first introduce the sememe-based SC Degree (SCD) computation formulae and then expand on the sememe-incorporated SC models.
Although SC widely exists in MWEs, not every MWE is fully semantically compositional. In fact, different MWEs show different degrees of SC. Reference [55] believes that sememes can be used to measure SCD conveniently. To this end, based on the assumption that all the sememes of a word accurately depict the word's meaning, they design a set of SCD computation formulae that are consistent with the principle of SCD. The formulae are illustrated in Table 6.2. Four SCDs are defined, denoted by the numbers 3, 2, 1, and 0, where a larger number means a higher SCD. S_p, S_{w_1}, and S_{w_2} represent the sememe sets of an MWE, its first constituent, and its second constituent, respectively. A brief explanation of the SCD computation formulae follows:
(1) For SCD 3, the sememe set of an MWE is identical to the union of the two constituents' sememe sets, which means the meaning of the MWE is exactly the combination of the constituents' meanings. Therefore, the MWE is fully semantically compositional and has the highest SCD.
(2) For SCD 0, an MWE has totally different sememes from its constituents, which means the MWE's meaning cannot be derived from its constituents' meanings. Hence the MWE is completely non-compositional, and its SCD is the lowest.
(3) For SCD 2, the sememe set of an MWE is a proper subset of the union of its constituents' sememe sets, which means the meanings of the constituents cover the MWE's meaning but cannot precisely infer it.
(4) Finally, for SCD 1, an MWE shares some sememes with its constituents, but both the MWE itself and its constituents have some unique sememes.
Table 6.2 gives an example for each SCD, including a Chinese MWE, its two constituents, and their sememes.

To evaluate the sememe-based SCD computation formulae, [55] constructs a human-annotated SCD dataset. They ask several native speakers to label SCDs for 500 Chinese MWEs, choosing among the four degrees. Before labeling an MWE, the annotators are shown the dictionary definitions of both the MWE and its constituents. Each MWE is labeled by 3 annotators, and the average of the 3 SCDs is taken as the MWE's final SCD. Eventually, a dataset containing 500 Chinese MWEs together with their human-annotated SCDs is obtained. They then evaluate the correlation between the SCDs of the MWEs computed by the sememe-based rules and those given by humans, and find that the Pearson correlation coefficient and the Spearman rank correlation coefficient between the two are both high. (In Chinese, most MWEs are words consisting of more than two characters whose constituents are single-morpheme words.)

Table 6.2 Sememe-based semantic compositionality degree computation formulae and examples. Bold sememes of constituents are shared with the corresponding MWE.
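The four SCD rules above reduce to simple set comparisons; the following sketch implements them directly on sememe sets (the example sememe labels are invented for illustration).

def scd(mwe_sememes, w1_sememes, w2_sememes):
    """Semantic compositionality degree from sememe sets, following the rules above."""
    union = w1_sememes | w2_sememes
    if mwe_sememes == union:
        return 3          # fully compositional
    if mwe_sememes < union:
        return 2          # constituents cover the MWE's meaning
    if mwe_sememes & union:
        return 1          # partial overlap
    return 0              # non-compositional

# Toy example: an MWE whose sememes are a proper subset of its constituents' sememes.
print(scd({"human", "occupation"}, {"human", "occupation"}, {"human", "family"}))  # -> 2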
In the following, we first introduce the two basic sememe-incorporated SC models in detail, namely Semantic Compositionality with Aggregated Sememe (SCAS) and Semantic Compositionality with Mutual Sememe Attention (SCMSA). The SCAS model simply concatenates the embeddings of the MWE's constituents and their sememes, while the SCMSA model takes account of the mutual attention between one constituent's sememes and the other constituent. We then describe how to integrate combination rules into the two basic models.

Incorporating Sememes Only. Following the notation of Eq. (6.10), for an MWE p = {w_1, w_2}, its embedding can be represented as

p = f(w_1, w_2, K),   (6.11)

where p, w_1, w_2 ∈ R^d, d is the dimension of the embeddings, and K denotes the sememe knowledge; we assume that only the sememes of w_1 and w_2 are known, since MWEs are normally not in the sememe KBs. X indicates the set of all sememes, and X_w = {x_1, ..., x_{|X_w|}} ⊂ X signifies the sememe set of a constituent w. In addition, x ∈ R^d denotes the embedding of sememe x.

(1) SCAS Model.
The first model we introduce is the SCAS model, which is illustrated in Fig. 6.4. The idea of the SCAS model is straightforward: it simply concatenates the word embedding of a constituent with the aggregation of its sememes' embeddings.
Fig. 6.4 The architecture of the SCAS model
Formally, we have

w'_1 = Σ_{x_i ∈ X_{w_1}} x_i,   w'_2 = Σ_{x_j ∈ X_{w_2}} x_j,   (6.12)

where w'_1 and w'_2 represent the aggregated sememe embeddings of w_1 and w_2, respectively. Then p can be obtained by

p = tanh(W_c [w_1 + w_2; w'_1 + w'_2] + b_c),   (6.13)

where W_c ∈ R^{d×2d} is the composition matrix and b_c ∈ R^d is a bias vector.

(2) SCMSA Model.
The SCAS model simply uses the sum of all the sememes' embeddings of a constituent as the external information. However, a constituent's meaning may vary with the other constituent, and accordingly the sememes of a constituent should have different weights when the constituent is combined with different constituents (there is an example in the case study). Correspondingly, the SCMSA model (Fig. 6.5) adopts a mutual attention mechanism to dynamically endow sememes with weights. Formally, we have

e_2 = tanh(W_a w_2 + b_a),
a_{1,i} = exp(x_i · e_2) / Σ_{x_j ∈ X_{w_1}} exp(x_j · e_2),
w'_1 = Σ_{x_i ∈ X_{w_1}} a_{1,i} x_i,   (6.14)

where W_a ∈ R^{d×d} is the weight matrix and b_a ∈ R^d is a bias vector. Similarly, w'_2 can be calculated. Then Eq. (6.13) is still used to obtain p.

Fig. 6.5 The architecture of the SCMSA model

Integrating Combination Rules. Reference [55] further integrates combination rules into the sememe-incorporated SC models. In other words,

p = f(w_1, w_2, K, R).   (6.15)

We can use totally different composition matrices for MWEs with different combination rules:

W_c = W_c^r,  r ∈ R_s,   (6.16)

where W_c^r ∈ R^{d×2d} and R_s refers to the combination rule set containing syntax rules of MWEs, e.g., adjective-noun and noun-noun. However, there are many different combination rules, and some rules have too few instances to train the corresponding composition matrices with d × 2d parameters. In addition, the composition matrix should contain common compositionality information besides the combination rule-specific compositionality information. Hence, the composition matrix W_c is set to the sum of a low-rank matrix containing combination rule information and a matrix containing common compositionality information:

W_c = U_1^r U_2^r + W_c^c,   (6.17)

where U_1^r ∈ R^{d×d_r}, U_2^r ∈ R^{d_r×2d}, d_r ∈ N_+ is a hyperparameter that may vary with the combination rule, and W_c^c ∈ R^{d×2d}.
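A sketch of the two composition functions: SCAS (Eqs. 6.12-6.13) sums each constituent's sememe embeddings, while SCMSA (Eq. 6.14) weights them by mutual attention against the other constituent. All parameters here are randomly initialized placeholders instead of trained values.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

dim = 4
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=dim), rng.normal(size=dim)        # constituent embeddings
X1 = rng.normal(size=(3, dim))                             # sememe embeddings of w1
X2 = rng.normal(size=(2, dim))                             # sememe embeddings of w2
W_c, b_c = rng.normal(size=(dim, 2 * dim)), np.zeros(dim)  # composition parameters
W_a, b_a = rng.normal(size=(dim, dim)), np.zeros(dim)      # attention parameters

def compose(w1_prime, w2_prime):
    """Eq. (6.13): combine constituents and aggregated sememes into the MWE embedding."""
    return np.tanh(W_c @ np.concatenate([w1 + w2, w1_prime + w2_prime]) + b_c)

# SCAS: unweighted sums of sememe embeddings (Eq. 6.12).
p_scas = compose(X1.sum(axis=0), X2.sum(axis=0))

# SCMSA: each constituent's sememes are weighted by attention to the other constituent (Eq. 6.14).
a1 = softmax(X1 @ np.tanh(W_a @ w2 + b_a))
a2 = softmax(X2 @ np.tanh(W_a @ w1 + b_a))
p_scmsa = compose(a1 @ X1, a2 @ X2)
print(p_scas.shape, p_scmsa.shape)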
Language Modeling (LM) aims to measure the probability of a word sequence, reflecting its fluency and likelihood as a feasible sentence in a human language. Language modeling is an essential component in a wide range of natural language processing (NLP) tasks, such as machine translation [9, 10], speech recognition [34], information retrieval [5, 24, 45, 54], and document summarization [2, 57].

Fig. 6.6 Decoders of (a) a conventional LM and (b) the sememe-driven LM

A probabilistic language model calculates the conditional probability of the next word given its contextual words, which is typically learned from large-scale text corpora. Taking the simplest language model as an example, the n-gram model estimates the conditional probabilities according to maximum likelihood over text corpora [31]. Recent years have witnessed the advances of Recurrent Neural Networks (RNNs) as the state-of-the-art approach for language modeling [44], in which the context is represented as a low-dimensional hidden state to predict the next word (Fig. 6.6). Those conventional language models, including neural models, typically treat words as atomic symbols and model sequential patterns at the word level. However, this assumption does not necessarily hold. Consider the following example sentence, in which we want to predict the word in the blank:

The U.S. trade deficit last year is initially estimated to be 40 billion ____.

People may first realize that a unit should be filled in, then that it should be a currency unit. Based on the country the sentence is talking about, the U.S., one may confirm it should be an
American currency unit and predict the word dollars. Here, unit, currency, and American, which are basic semantic units of the word dollars, are also sememes of the word dollars. However, this process has not been explicitly taken into consideration by conventional language models. That is, although words are atomic language units in most cases, they are not necessarily atomic semantic units for language modeling. Thus, explicit modeling of sememes could improve both the performance and the interpretability of language models. However, as far as we know, few efforts have been devoted to exploring the effectiveness of sememes in language models, especially neural language models. It is nontrivial for neural language models to incorporate discrete sememe knowledge, as it is not compatible with the continuous representations in neural models. In this part, the Sememe-Driven Language Model (SDLM) is introduced to leverage lexical sememe knowledge. To predict the next word, SDLM utilizes a novel
Sememe Predictor . The Sememe Predictor takes the context vector g ∈ R H as input and assigns a weight to each sememe. Assume that given the context w , w , . . . , w t − , the events that word w t contains sememe x k ( k ∈ { , , . . . , K } )are independent, since the sememe is the minimum semantic unit and there is nosemantic overlap between any two different sememes. For simplicity, the superscript t is ignored. The Sememe Predictor is designed as a linear decoder with the sig-moid activation function. Therefore, p k , the probability that the next word containssememe x k , is formulated as p k = P ( x k | g ) = Sigmoid ( g · v k + b k ), (6.18)where v k ∈ R H , b k ∈ R are trainable parameters, and Sigmoid ( · ) denotes the sig-moid activation function. Sense Predictor and Word Predictor . The architecture of the Sense Predictor ismotivated by Product of Experts (PoE) [25]. Each sememe is regarded as an expertthat only makes predictions on the senses connected with it. Let S ( x k ) denote theset of senses that contain sememe x k , the k th expert. Different from conventionalneural language models, which directly use the inner product of the context vector g ∈ R H and the output embedding w ∈ R H for word w to generate the score foreach word, Sense Predictor uses φ ( k ) ( g , w ) to calculate the score given by expert x k . And a bilinear function parameterized with a matrix U k ∈ R H × H is chosen as astraight implementation of φ ( k ) ( · , · ) : φ ( k ) ( g , w ) = g (cid:7) U k w . (6.19)The score of sense s provided by sememe expert x k can be written as φ ( k ) ( g , s ) .Therefore, P ( x k ) ( s | g ) , the probability of sense s given by expert x k , is formulated as P ( x k ) ( s | g ) = exp ( q k C k , s φ ( k ) ( g , s )) (cid:3) s (cid:3) ∈ S ( xk ) exp ( q k C k , s (cid:3) φ ( k ) ( g , s (cid:3) )) , (6.20)where C k , s is a normalization constant because sense s is not connected to all experts(the connections are sparse with approximately λ N edges, λ <
choose either C_{k,s} = 1/|X(s)| (left normalization) or C_{k,s} = 1/√(|X(s)| |S(x_k)|) (symmetric normalization). In the Sense Predictor, q_k can be viewed as a gate that controls the magnitude of the term C_{k,s} φ^(k)(g, s), and thus the flatness of the sense distribution provided by sememe expert x_k. Consider the extreme case when p_k → 0:
the prediction will converge to the discrete uniform distribution. Intuitively, this means that the sememe expert refuses to provide any useful information when it is not likely to be related to the next word. Finally, the predictions can be summarized on sense s by taking the product of the probabilities given by the relevant experts and then normalizing the result; that is to say, P(s | g), the probability of sense s, satisfies

P(s | g) ∝ Π_{x_k ∈ X(s)} P^(x_k)(s | g).   (6.21)

Using Eqs. (6.19) and (6.20), P(s | g) can be formulated as

P(s | g) = exp( Σ_{x_k ∈ X(s)} q_k C_{k,s} g^⊤ U_k s ) / Σ_{s'} exp( Σ_{x_k ∈ X(s')} q_k C_{k,s'} g^⊤ U_k s' ).   (6.22)

It should be emphasized that all the supervision information provided by HowNet is embodied in the connections between the sememe experts and the senses. If the model wants to assign a high probability to sense s, it must assign a high probability to some of its relevant sememes; if the model wants to assign a low probability to sense s, it can assign a low probability to its relevant sememes. Moreover, the prediction made by sememe expert x_k has its own tendency because of its own φ^(k)(·,·). Besides, the sparsity of the connections between experts and senses is also determined by HowNet itself. As illustrated in Fig. 6.7, in the Word Predictor, P(w | g), the probability of word w, is calculated by summing up the probabilities of the corresponding senses s given by the Sense Predictor, that is,

P(w | g) = Σ_{s ∈ S(w)} P(s | g).   (6.23)
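A compact sketch of the sememe-sense-word decoding pipeline (Eqs. 6.18-6.23) is given below. The context vector, the expert matrices, and the tiny sememe/sense/word inventory are invented placeholders, and, as a simplifying assumption, the gate q_k is taken to be the sememe probability p_k itself.

import numpy as np

rng = np.random.default_rng(0)
H, K = 8, 3                                   # hidden size, number of sememes
g = rng.normal(size=H)                        # context vector from the RNN

# Sememe Predictor (Eq. 6.18): one sigmoid unit per sememe.
V = rng.normal(size=(K, H))
b = np.zeros(K)
p = 1.0 / (1.0 + np.exp(-(V @ g + b)))        # p[k] = P(sememe k | context)

# Sparse sememe-sense connections from HowNet (toy): sense -> sememes it contains.
sense_sememes = {0: [0], 1: [0, 1], 2: [2]}
sense_emb = rng.normal(size=(3, H))
U = rng.normal(size=(K, H, H))                # one bilinear expert per sememe (Eq. 6.19)

# Sense Predictor (Eqs. 6.20-6.22): product of experts over connected senses.
log_scores = np.zeros(len(sense_sememes))
for s, sememes in sense_sememes.items():
    for k in sememes:
        C = 1.0 / len(sememes)                # left normalization
        log_scores[s] += p[k] * C * (g @ U[k] @ sense_emb[s])   # gate q_k := p_k (assumption)
P_sense = np.exp(log_scores - log_scores.max())
P_sense /= P_sense.sum()

# Word Predictor (Eq. 6.23): sum sense probabilities per word.
word_senses = {"apple": [0, 1], "pear": [2]}
P_word = {w: P_sense[ids].sum() for w, ids in word_senses.items()}
print(P_word)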
The manual construction of HowNet is time-consuming and labor-intensive; for example, HowNet has been built over more than 10 years by several linguistic experts. However, with the development of communication and technology, new words and phrases keep emerging, and the semantic meanings of existing words also evolve. In this case, sustained manual annotation and updating becomes overwhelming. Moreover, due to the high complexity of the sememe ontology and word meanings, it is also challenging to maintain annotation consistency among experts when they collaboratively annotate lexical sememes.

Fig. 6.7 The architecture of the SDLM model

To address the issues of inflexibility and inconsistency of manual annotation, the automatic lexical sememe prediction task is proposed, which is expected to assist expert annotation and reduce the manual workload. Note that, for simplicity, most works introduced in this part do not consider the complicated hierarchies of word sememes and simply group all annotated sememes of each word as the word's sememe set for learning and prediction. The basic idea of sememe prediction is that words with similar semantic meanings may share overlapping sememes. Hence, the key challenge of sememe prediction is how to represent the semantic meanings of words and sememes so as to model the semantic relatedness between them. In this part, we focus on the sememe prediction work of Xie et al. [66]. In their work, they propose to model the semantics of words and sememes using distributed representation learning [26]. Distributed representation learning aims to encode objects into a low-dimensional semantic space and has shown an impressive capability of modeling the semantics of human languages; for example, word embeddings [43] have been widely studied and utilized in various NLP tasks. As shown in previous work [43], it is effective to measure word similarities using the cosine similarity or Euclidean distance of their word embeddings learned from a large-scale text corpus. Hence, a straightforward method for sememe prediction is
that, given an unlabeled word, we find its most related words in HowNet according to their word embeddings and recommend the annotated sememes of these related words to the given word. This method is intrinsically similar to collaborative filtering [58] in recommender systems, capable of capturing the semantic relatedness between words and sememes based on their annotation co-occurrences. Word embeddings can also be learned with matrix factorization techniques [37]. Inspired by the successful practice of matrix factorization for personalized recommendation [36], a model is proposed that factorizes the word-sememe matrix from HowNet and obtains sememe embeddings. In this way, the relatedness of words and sememes can be measured directly using the dot products of their embeddings, according to which the most related sememes can be recommended to an unlabeled word. The two methods are named Sememe Prediction with Word Embeddings (SPWE) and Sememe Prediction with Sememe Embeddings (SPSE/SPASE), respectively.
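Before detailing each method, here is a minimal sketch of the SPWE idea described next (Eq. 6.24): sememes for a new word are scored by summing the annotations of its embedding-nearest neighbors, down-weighted by a declined confidence factor. The embeddings and the annotation matrix below are toy placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_words, n_sememes, dim, c = 6, 4, 8, 0.8

word_emb = rng.normal(size=(n_words, dim))                       # pretrained word embeddings (e.g., GloVe)
M = rng.integers(0, 2, size=(n_words, n_sememes)).astype(float)  # HowNet word-sememe annotations

def spwe_scores(w_new):
    """Score every sememe for a new word embedding w_new (Eq. 6.24)."""
    sims = word_emb @ w_new / (np.linalg.norm(word_emb, axis=1) * np.linalg.norm(w_new))
    ranks = np.argsort(np.argsort(-sims))                  # descending rank r_i of each word
    weights = sims * (c ** ranks)                          # declined confidence factor c^{r_i}
    return weights @ M                                     # one score per sememe

w_new = rng.normal(size=dim)
print(spwe_scores(w_new))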
Given an unlabeled word, it is straightforward to recommend sememes according to its most related words, assuming that similar words should have similar sememes. This idea is similar to collaborative filtering in personalized recommendation, for in the scenario of sememe prediction, words can be regarded as users and sememes as the items/products to be recommended. Inspired by this, the Sememe Prediction with Word Embeddings (SPWE) model is proposed, which uses similarities of word embeddings to judge user distances.

Formally, the score function P(x_j|w) of sememe x_j given a word w is defined as

\[ P(x_j \mid w) = \sum_{w_i \in V} \cos(\mathbf{w}, \mathbf{w}_i)\, M_{ij}\, c^{r_i}, \tag{6.24} \]

where cos(w, w_i) is the cosine similarity between the word embeddings of w and w_i pretrained by GloVe. M_{ij} indicates the annotation of sememe x_j on word w_i: M_{ij} = 1 if w_i has the sememe x_j in HowNet and M_{ij} = 0 otherwise. The higher the score P(x_j|w) is, the more likely the word w should be recommended with x_j.

Differing from classical collaborative filtering in recommender systems, only the most similar words should be concentrated on when predicting sememes for new words, since irrelevant words have totally different sememes, which may be noise for sememe prediction. To address this problem, a declined confidence factor c^{r_i} is assigned to each word w_i, where r_i is the descending rank of the word similarity cos(w, w_i) and c ∈ (0, 1) is a hyperparameter. In this way, only a few top words that are similar to w have strong influence on predicting sememes.

SPWE only uses word embeddings for word similarities and is simple and effective for sememe prediction. This is because, differing from the noisy and incomplete user-item matrices in most recommender systems, HowNet is carefully annotated by human experts, and thus the word-sememe matrix is of high confidence. Therefore, the word-sememe matrix can be confidently applied to collaboratively recommend reliable sememes based on similar words.
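A minimal sketch of the SPWE scoring rule in Eq. 6.24 follows; the embedding matrices, the word-sememe annotation matrix, and the value of the hyperparameter c are illustrative assumptions rather than the reported configuration.

```python
import numpy as np

def spwe_scores(w_vec, word_embs, M, c=0.8):
    """Score every sememe for an unlabeled word (Eq. 6.24).

    w_vec:     embedding of the unlabeled word, shape (d,)
    word_embs: embeddings of the |V| annotated words, shape (|V|, d)
    M:         word-sememe annotation matrix, shape (|V|, |S|), entries in {0, 1}
    c:         declined confidence factor, c in (0, 1)
    """
    # cosine similarity between the target word and every annotated word
    sims = word_embs @ w_vec / (
        np.linalg.norm(word_embs, axis=1) * np.linalg.norm(w_vec) + 1e-12)
    # r_i: descending rank of each word by similarity (most similar word gets rank 1)
    ranks = np.empty_like(sims)
    ranks[np.argsort(-sims)] = np.arange(1, len(sims) + 1)
    weights = sims * c ** ranks            # cos(w, w_i) * c^{r_i}
    return weights @ M                     # sum over words -> one score per sememe

# toy usage: recommend the top-2 sememes for a random word
rng = np.random.default_rng(0)
scores = spwe_scores(rng.normal(size=50), rng.normal(size=(100, 50)),
                     (rng.random((100, 20)) < 0.1).astype(float))
print(np.argsort(-scores)[:2])
```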
The Sememe Prediction with Word Embeddings model follows the assumption that the sememes of a word can be predicted according to the sememes of its related words. However, simply treating sememes as discrete labels inevitably neglects the latent relations between sememes. To take the latent relations of sememes into consideration, the Sememe Prediction with Sememe Embeddings (SPSE) model is proposed, which projects both words and sememes into the same semantic vector space, learning sememe embeddings according to the co-occurrences of words and sememes in HowNet.

Similar to GloVe [53], which decomposes the co-occurrence matrix of words to learn word embeddings, sememe embeddings can be learned by factorizing the word-sememe matrix and the sememe-sememe matrix simultaneously. These two matrices are both constructed from HowNet. As for word embeddings, similar to SPWE, SPSE uses word embeddings pretrained from a large-scale corpus and fixes them during the factorization of the word-sememe matrix. With matrix factorization, both sememe and word embeddings are encoded into the same low-dimensional semantic space, and the cosine similarity between the normalized embeddings of words and sememes is then computed for sememe prediction.

More specifically, similar to M, a sememe-sememe matrix C can also be extracted, where C_{jk} is defined as the point-wise mutual information C_{jk} = PMI(x_j, x_k) to indicate the correlation between two sememes x_j and x_k. Note that, by factorizing C, two distinct embeddings are obtained for each sememe x_j, denoted as x_j and x̄_j, respectively. The loss function for learning sememe embeddings is defined as follows:

\[ \mathcal{L} = \sum_{w_i \in W,\, x_j \in X} \left( \mathbf{w}_i \cdot (\mathbf{x}_j + \bar{\mathbf{x}}_j) + b_i + b'_j - M_{ij} \right)^2 + \lambda \sum_{x_j, x_k \in X} \left( \mathbf{x}_j \cdot \bar{\mathbf{x}}_k - C_{jk} \right)^2, \tag{6.25} \]

where b_i and b'_j denote the biases of w_i and x_j. The two parts correspond to the losses of factorizing matrices M and C, adjusted by the hyperparameter λ. Since the sememe embeddings are shared by both factorizations, the SPSE model jointly encodes both words and sememes into a unified semantic space.

Since each word is typically annotated with 2-5 sememes in HowNet, most elements in the word-sememe matrix are zeros. If all zero and nonzero elements were treated equally during factorization, the performance would be much worse. To address this issue, different factorization strategies are assigned to zero and nonzero elements: each zero element is factorized only with a small probability, while the nonzero elements are always used.
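The following sketch spells out the SPSE objective of Eq. 6.25 as a single loss evaluation; the dimensions, the treatment of zero entries, and the λ value are illustrative assumptions, not the reported setup.

```python
import numpy as np

def spse_loss(word_embs, X, X_bar, b, b_p, M, C, lam=0.5):
    """SPSE objective (Eq. 6.25): factorize the word-sememe matrix M and the
    sememe-sememe PMI matrix C with shared sememe embeddings.

    word_embs: pretrained (fixed) word embeddings, shape (|W|, d)
    X, X_bar:  the two embeddings of each sememe, shape (|S|, d)
    b, b_p:    word and sememe biases, shapes (|W|,) and (|S|,)
    M:         word-sememe annotation matrix, shape (|W|, |S|)
    C:         sememe-sememe PMI matrix, shape (|S|, |S|)
    """
    # first term: reconstruct M from w_i . (x_j + x_bar_j) + b_i + b'_j
    pred_M = word_embs @ (X + X_bar).T + b[:, None] + b_p[None, :]
    loss_M = ((pred_M - M) ** 2).sum()
    # second term: reconstruct C from x_j . x_bar_k
    loss_C = ((X @ X_bar.T - C) ** 2).sum()
    return loss_M + lam * loss_C

# toy usage with made-up sizes: 6 words, 4 sememes, 8-dim embeddings
rng = np.random.default_rng(0)
print(spse_loss(rng.normal(size=(6, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                np.zeros(6), np.zeros(4), (rng.random((6, 4)) < 0.3).astype(float),
                rng.normal(size=(4, 4))))
```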
In SPSE, sememe embeddings are learned together with word embeddings via matrix factorization in a unified low-dimensional semantic space. Matrix factorization has been verified as an effective approach in personalized recommendation, because it can accurately model the relatedness between users and items and is highly robust to noise in user-item matrices. Using this model, we can flexibly compute the semantic relatedness of words and sememes, which provides an effective tool to manipulate and manage sememes, including but not limited to sememe prediction.
Inspired by the characteristics of sememes, we assume that word embeddings are semantically composed of sememe embeddings. In the word-sememe joint space, we can simply implement semantic composition as an additive operation: each word embedding is expected to be the sum of the embeddings of all its sememes. Following this assumption, the Sememe Prediction with Aggregated Sememe Embeddings (SPASE) model is proposed. SPASE is also based on matrix factorization and is formally denoted as

\[ \mathbf{w}_i = \sum_{x_j \in X_{w_i}} M'_{ij}\, \mathbf{x}_j, \tag{6.26} \]

where X_{w_i} is the sememe set of the word w_i and M'_{ij} represents the weight of sememe x_j for word w_i, which only takes values on the nonzero elements of the word-sememe annotation matrix M. To learn sememe embeddings, we attempt to decompose the word embedding matrix V into M' and the sememe embedding matrix X, with the pretrained word embeddings fixed during training, which can also be written as V = M'X.

The contribution of SPASE is that it complies with the definition of sememes in HowNet, namely that sememes are the semantic components of words. In SPASE, each sememe can be regarded as a tiny semantic unit, and all words can be represented by composing several semantic units, i.e., sememes, which makes up an interesting semantic regularity. However, SPASE is difficult to train because the word embeddings are fixed and the number of words is much larger than the number of sememes. When modeling complex semantic compositions of sememes into words, the representation capability of SPASE may be strongly constrained by the limited parameters of sememe embeddings and the excessive simplification of the additive assumption.

In the previous section, we introduced the automatic lexical sememe prediction methods proposed by Xie et al. [66]. These methods ignore the internal information within words (e.g., the characters in Chinese words), which is also significant for word understanding, especially for words that are of low frequency or do not appear in the corpus at all.
Fig. 6.8 Sememes of the word 铁匠 (ironsmith) in HowNet, where occupation, human, and industrial can be inferred by both external (context) and internal (character) information, while metal is well captured only by the internal information within the character 铁 (iron)

In this section, we introduce the work of Jin et al. [30], which takes Chinese as an example and explores methods of taking full advantage of both the external and internal information of words for sememe prediction.

In Chinese, words are composed of one or multiple characters, and most characters have corresponding semantic meanings. As shown by [67], more than 90% of Chinese characters in modern Chinese corpora are morphemes. Chinese words can be divided into single-morpheme words and compound words, where compound words account for a dominant proportion. The meanings of compound words are closely related to their internal characters, as shown in Fig. 6.8. Taking the compound word 铁匠 (ironsmith) for instance, it consists of two Chinese characters, 铁 (iron) and 匠 (craftsman), and the semantic meaning of 铁匠 can be inferred from the combination of its two characters (iron + craftsman → ironsmith). Even for some single-morpheme words, their semantic meanings may also be deduced from their characters. For example, both characters of the single-morpheme word 徘徊 (hover) represent the meaning of hover or linger. Therefore, it is intuitive to take the internal character information into consideration for sememe prediction.

Reference [30] proposes a novel framework for Character-enhanced Sememe Prediction (CSP), which leverages both internal character information and external context for sememe prediction. CSP predicts the sememe candidates for a target word from its word embedding and the corresponding character embeddings. Specifically, following SPWE and SPSE, which were introduced by [66] to model external information, Sememe Prediction with Word-to-Character Filtering (SPWCF) and Sememe Prediction with Character and Sememe Embeddings (SPCSE) are proposed to model internal character information.

Sememe Prediction with Word-to-Character Filtering. Inspired by collaborative filtering [58], Jin et al. [30] propose to recommend sememes for an unlabeled word according to its similar words based on internal information. Words are considered similar if they contain the same characters at the same positions.
Fig. 6.9 An example of the positions of characters in a word
In Chinese, the meaning of a character may vary according to its position within a word [14]. Three positions within a word are considered: Begin, Middle, and End. For example, as shown in Fig. 6.9, the character at the Begin position of the word 火车站 (railway station) is 火 (fire), while 车 (vehicle) and 站 (station) are at the Middle and End positions, respectively. The character 站 usually means station when it is at the End position, while it usually means stand at the Begin position, as in 站立 (stand), 站岗 (stand guard), and 站起来 (stand up).

Formally, for a word w = c_1 c_2 ... c_{|w|}, we define π_B(w) = {c_1}, π_M(w) = {c_2, ..., c_{|w|-1}}, π_E(w) = {c_{|w|}}, and

\[ P_p(x_j \mid c) \sim \frac{\sum_{w_i \in W \wedge c \in \pi_p(w_i)} M_{ij}}{\sum_{w_i \in W \wedge c \in \pi_p(w_i)} |X_{w_i}|}, \tag{6.27} \]

which represents the score of a sememe x_j given a character c and a position p, where π_p may be π_B, π_M, or π_E. M is the same matrix used in SPWE. Finally, the score function P(x_j|w) of sememe x_j given a word w is defined as

\[ P(x_j \mid w) \sim \sum_{p \in \{B, M, E\}} \sum_{c \in \pi_p(w)} P_p(x_j \mid c). \tag{6.28} \]

SPWCF is a simple and efficient method. It performs well because compositional semantics are pervasive in Chinese compound words, which makes it straightforward and effective to find similar words according to common characters.
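A small sketch of the SPWCF scores in Eqs. 6.27-6.28 is given below; the data structures and the position handling are simplified assumptions for illustration, and ASCII placeholders stand in for real Chinese characters.

```python
from collections import defaultdict

def build_position_scores(words, sememe_sets):
    """Precompute the terms of P_p(x_j | c) ~ sum M_ij / sum |X_{w_i}| (Eq. 6.27).

    words:       list of annotated words (strings of characters)
    sememe_sets: parallel list of sememe sets X_{w_i}
    """
    num = defaultdict(lambda: defaultdict(float))   # (position, char) -> sememe -> sum M_ij
    den = defaultdict(float)                        # (position, char) -> sum |X_{w_i}|
    for w, sememes in zip(words, sememe_sets):
        positions = [('B', w[0]), ('E', w[-1])] + [('M', ch) for ch in w[1:-1]]
        for p, ch in positions:
            den[(p, ch)] += len(sememes)
            for x in sememes:
                num[(p, ch)][x] += 1.0
    return num, den

def spwcf_scores(w, num, den):
    """Score sememes for an unlabeled word w (Eq. 6.28)."""
    scores = defaultdict(float)
    positions = [('B', w[0]), ('E', w[-1])] + [('M', ch) for ch in w[1:-1]]
    for p, ch in positions:
        if den[(p, ch)] > 0:
            for x, n in num[(p, ch)].items():
                scores[x] += n / den[(p, ch)]
    return scores

# toy usage
num, den = build_position_scores(["abc", "abd", "cd"],
                                 [{"s1", "s2"}, {"s1"}, {"s3"}])
print(sorted(spwcf_scores("ab", num, den).items()))
```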
Sememe Prediction with Character and Sememe Embeddings (SPCSE). The Sememe Prediction with Word-to-Character Filtering (SPWCF) method can effectively recommend the sememes that have strong correlations with characters. However, just like SPWE, it ignores the relations between sememes. Hence, inspired by SPSE, Sememe Prediction with Character and Sememe Embeddings (SPCSE) is proposed to take the relations between sememes into account. In SPCSE, the model instead learns the sememe embeddings based on internal character information and then computes the semantic distance between sememes and words for prediction.

Inspired by GloVe [53] and SPSE, matrix factorization is adopted in SPCSE, decomposing the word-sememe matrix and the sememe-sememe matrix simultaneously. Instead of the pretrained word embeddings used in SPSE, pretrained character embeddings are used in SPCSE. Since the ambiguity of characters is stronger than that of words, multiple embeddings are learned for each character [14], and the most representative character and its embedding are selected to represent the word meaning. Because low-frequency characters are much rarer than low-frequency words, and even low-frequency words are usually composed of common characters, it is feasible to use pretrained character embeddings to represent rare words. During the factorization of the word-sememe matrix, the character embeddings are fixed.

Fig. 6.10 An example of adopting multiple-prototype character embeddings. The numbers are the cosine distances. The sememe 金属 (metal) is the closest to one embedding of 铁 (iron)

N_e is set as the number of embeddings for each character, and each character c has N_e embeddings c^1, ..., c^{N_e}. Given a word w and a sememe x_j, the embedding of a character of w that is closest to the sememe embedding under cosine distance is selected as the representation of the word w, as shown in Fig. 6.10. Specifically, given a word w = c_1 ... c_{|w|} and a sememe x_j, we define

\[ k^*, r^* = \arg\min_{k, r} \left[ -\cos\left( \mathbf{c}^r_k,\, \mathbf{x}'_j + \bar{\mathbf{x}}'_j \right) \right], \tag{6.29} \]

where k^* and r^* indicate the indices of the character and its embedding closest to the sememe x_j in the semantic space. With the same word-sememe matrix M and sememe-sememe correlation matrix C as in SPSE, the sememe embeddings are learned with the loss function

\[ \mathcal{L} = \sum_{w_i \in W,\, x_j \in X} \left( \mathbf{c}^{r^*}_{k^*} \cdot \left( \mathbf{x}'_j + \bar{\mathbf{x}}'_j \right) + b^c_{k^*} + b''_j - M_{ij} \right)^2 + \lambda' \sum_{x_j, x_q \in X} \left( \mathbf{x}'_j \cdot \bar{\mathbf{x}}'_q - C_{jq} \right)^2, \tag{6.30} \]

where x'_j and x̄'_j are the sememe embeddings for sememe x_j, and c^{r^*}_{k^*} is the embedding of the character within w_i that is closest to sememe x_j. Note that, as the characters and the words are not embedded into the same semantic space, new sememe embeddings are learned instead of reusing those learned in SPSE; hence, different notations are used for the sake of distinction. b^c_{k^*} and b''_j denote the biases of c_{k^*} and x_j, and λ' is the hyperparameter adjusting the two parts. Finally, the score function of word w = c_1 ... c_{|w|} is defined as

\[ P(x_j \mid w) \sim \mathbf{c}^{r^*}_{k^*} \cdot \left( \mathbf{x}'_j + \bar{\mathbf{x}}'_j \right). \tag{6.31} \]
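The character-selection step in Eq. 6.29 and the score in Eq. 6.31 can be sketched as follows; the multi-prototype embedding tensor and the dimensions are illustrative assumptions, and the bias terms of Eq. 6.30 are omitted.

```python
import numpy as np

def spcse_score(word_char_embs, x_j, x_bar_j):
    """Pick the character prototype closest to the sememe and score it (Eqs. 6.29 and 6.31).

    word_char_embs: embeddings of the word's characters, shape (|w|, N_e, d)
                    -- N_e prototype embeddings per character
    x_j, x_bar_j:   the two embeddings of sememe x_j, each of shape (d,)
    """
    target = x_j + x_bar_j
    flat = word_char_embs.reshape(-1, word_char_embs.shape[-1])   # (|w| * N_e, d)
    cos = flat @ target / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(target) + 1e-12)
    best = int(np.argmax(cos))            # argmin of -cos is argmax of cos
    k_star, r_star = divmod(best, word_char_embs.shape[1])
    score = flat[best] @ target           # Eq. 6.31 (bias terms omitted here)
    return (k_star, r_star), score

# toy usage: a 2-character word with 3 prototypes per character, 16-dim embeddings
rng = np.random.default_rng(1)
print(spcse_score(rng.normal(size=(2, 3, 16)),
                  rng.normal(size=16), rng.normal(size=16)))
```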
Fig. 6.11 An illustration of model ensembling in sememe prediction
Model Ensembling. SPWCF/SPCSE and SPWE/SPSE take different sources of information as input, which means that they have different characteristics: SPWCF/SPCSE only have access to internal information, while SPWE/SPSE can only make use of external information. On the other hand, just like the difference between SPWE and SPSE, SPWCF originates from collaborative filtering, whereas SPCSE uses matrix factorization. All of these methods have in common that they tend to recommend the sememes of similar words, but they diverge in their interpretation of similar.

Therefore, to obtain better prediction performance, it is necessary to combine these models. We denote the ensemble of SPWCF and SPCSE as the internal model, and the ensemble of SPWE and SPSE as the external model. The ensemble of the internal and the external models is the novel framework CSP. In practice, for words with reliable word embeddings, i.e., high-frequency words, we can use the integration of the internal and the external models; for words with extremely low frequencies (e.g., having no reliable word embeddings), we can just use the internal model and ignore the external model, because the external information is noisy in this case. Figure 6.11 shows model ensembling in different scenarios. For the sake of comparison, the integration of SPWCF, SPCSE, SPWE, and SPSE is used as CSP in all experiments, and two models are integrated by simple weighted addition.
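As a concrete illustration of the weighted addition just mentioned, here is a minimal sketch; the weight value and the absence of score normalization are assumptions, not the reported setup.

```python
def ensemble_scores(internal, external, alpha=0.5, use_external=True):
    """Combine internal (SPWCF/SPCSE) and external (SPWE/SPSE) sememe scores
    by simple weighted addition; drop the external part for very rare words.

    internal, external: dicts mapping sememe -> score
    """
    if not use_external:                       # e.g., the word has no reliable embedding
        return dict(internal)
    sememes = set(internal) | set(external)
    return {x: alpha * internal.get(x, 0.0) + (1 - alpha) * external.get(x, 0.0)
            for x in sememes}

# toy usage
print(ensemble_scores({"s1": 0.9, "s2": 0.2}, {"s1": 0.4, "s3": 0.7}))
```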
Most languages do not have sememe-based linguistic KBs such as HowNet, which prevents us from understanding and utilizing human languages to a greater extent. Therefore, it is important to build sememe-based linguistic KBs for various languages.

To address the issue of the high labor cost of manual annotation, Qi et al. [56] propose a new task, cross-lingual lexical sememe prediction (CLSP), which aims to automatically predict lexical sememes for words in other languages. There are two critical challenges for CLSP:

(1) There is not a consistent one-to-one match between words in different languages. For example, the English word "beautiful" can refer to the Chinese words 美丽 or 漂亮. Hence, we cannot simply translate HowNet into another language, and how to recognize the semantic meaning of a word in other languages becomes a critical problem.

(2) Since there is a gap between the semantic meanings of words and sememes, we need to build semantic representations for words and sememes to capture the semantic relatedness between them.

To tackle these challenges, Qi et al. [56] propose a novel model for CLSP, which aims to transfer sememe-based linguistic KBs from a source language to a target language. Their model contains three modules: (1) monolingual word embedding learning, which is intended for learning semantic representations of words for the source and target languages, respectively; (2) cross-lingual word embedding alignment, which aims to bridge the gap between the semantic representations of words in the two languages; and (3) sememe-based word embedding learning, whose objective is to incorporate sememe information into word representations.

They take Chinese as the source language and English as the target language to show the effectiveness of their model. Experimental results show that the proposed model can effectively predict lexical sememes for words of different frequencies in the target language and that, by jointly learning the representations of sememes and of words in the source and target languages, it achieves consistent improvements on two auxiliary experiments, bilingual lexicon induction and monolingual word similarity computation.

The model consists of three parts: monolingual word representation learning, cross-lingual word embedding alignment, and sememe-based word representation learning. Hence, the objective function is defined with three corresponding terms:

\[ \mathcal{L} = \mathcal{L}_{mono} + \mathcal{L}_{cross} + \mathcal{L}_{sememe}. \tag{6.32} \]

Here, the monolingual term L_mono is designed for learning monolingual word embeddings from nonparallel corpora for the source and target languages, respectively. The cross-lingual term L_cross aims to align cross-lingual word embeddings in a unified semantic space. And L_sememe draws sememe information into word representation learning and is conducive to better word embeddings for sememe prediction. In the following paragraphs, we introduce the three parts in detail.

Monolingual Word Representation. Monolingual word representation is responsible for explaining regularities in the monolingual corpora of the source and target languages. Since the two corpora are nonparallel, L_mono comprises two monolingual submodels that are independent of each other:

\[ \mathcal{L}_{mono} = \mathcal{L}^S_{mono} + \mathcal{L}^T_{mono}, \tag{6.33} \]

where the superscripts S and T denote the source and target languages, respectively.
As a common practice, the well-established Skip-gram model is chosen to obtain monolingual word embeddings. The Skip-gram model maximizes the predictive probability of context words conditioned on the centered word. Formally, taking the source side as an example, given a training word sequence {w^S_1, ..., w^S_n}, the Skip-gram model intends to minimize

\[ \mathcal{L}^S_{mono} = -\sum_{c=K+1}^{n-K} \sum_{-K \le k \le K,\, k \ne 0} \log P\big(w^S_{c+k} \mid w^S_c\big), \tag{6.34} \]

where K is the size of the sliding window. P(w^S_{c+k} | w^S_c) stands for the predictive probability of one of the context words conditioned on the centered word w^S_c, formalized by the following softmax function:

\[ P\big(w^S_{c+k} \mid w^S_c\big) = \frac{\exp\big( \mathbf{w}^S_{c+k} \cdot \mathbf{w}^S_c \big)}{\sum_{w^S_s \in V^S} \exp\big( \mathbf{w}^S_s \cdot \mathbf{w}^S_c \big)}, \tag{6.35} \]

in which V^S indicates the word vocabulary of the source language. L^T_mono can be formulated similarly.
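To make Eqs. 6.34-6.35 concrete, here is a minimal sketch of the Skip-gram loss with a full softmax (no negative sampling, which the original implementation may well use); the vocabulary and embedding sizes are arbitrary assumptions.

```python
import numpy as np

def skipgram_loss(corpus_ids, emb, K=2):
    """Negative log-likelihood of Skip-gram with a full softmax (Eqs. 6.34-6.35).

    corpus_ids: word indices of the training sequence, shape (n,)
    emb:        word embedding matrix, shape (|V|, d)
    K:          sliding-window size
    """
    loss = 0.0
    n = len(corpus_ids)
    for c in range(K, n - K):
        center = emb[corpus_ids[c]]
        logits = emb @ center                          # scores against the whole vocabulary
        log_z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
        for k in range(-K, K + 1):
            if k != 0:
                loss -= logits[corpus_ids[c + k]] - log_z   # -log P(w_{c+k} | w_c)
    return loss

# toy usage: vocabulary of 10 words, 8-dim embeddings
rng = np.random.default_rng(0)
print(skipgram_loss(rng.integers(0, 10, size=30), rng.normal(size=(10, 8))))
```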
Cross-lingual Word Embedding Alignment. Cross-lingual word embedding alignment aims to build a unified semantic space for the words of the source and target languages. Inspired by [69], the cross-lingual word embeddings are aligned with the signals of a seed lexicon and self-matching.

Formally, L_cross is composed of two terms, alignment by seed lexicon L_seed and alignment by matching L_match:

\[ \mathcal{L}_{cross} = \lambda_s \mathcal{L}_{seed} + \lambda_m \mathcal{L}_{match}, \tag{6.36} \]

where λ_s and λ_m are hyperparameters controlling the relative weights of the two terms.

(1) Alignment by Seed Lexicon
The seed lexicon term L_seed encourages the word embeddings of translation pairs in a seed lexicon D to be close, which can be achieved via an L2 regularizer:

\[ \mathcal{L}_{seed} = \sum_{\langle w^S_s, w^T_t \rangle \in D} \left\| \mathbf{w}^S_s - \mathbf{w}^T_t \right\|^2, \tag{6.37} \]

in which w^S_s and w^T_t indicate the words of the source and target languages in the seed lexicon, respectively.
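A tiny sketch of the seed-lexicon alignment term of Eq. 6.37 follows; the lexicon and the embedding matrices are made-up placeholders.

```python
import numpy as np

def seed_lexicon_loss(src_emb, tgt_emb, seed_pairs):
    """Sum of squared distances between the embeddings of seed translation pairs (Eq. 6.37).

    src_emb, tgt_emb: embedding matrices of the two languages, shape (|V|, d)
    seed_pairs:       list of (source_word_id, target_word_id) tuples from the seed lexicon D
    """
    return sum(float(np.sum((src_emb[s] - tgt_emb[t]) ** 2)) for s, t in seed_pairs)

# toy usage
rng = np.random.default_rng(0)
print(seed_lexicon_loss(rng.normal(size=(5, 4)), rng.normal(size=(6, 4)), [(0, 1), (2, 3)]))
```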
(2) Alignment by Matching Mechanism

The matching process is founded on the assumption that each target word should be matched to a single source word or a special empty word, and vice versa. The goal of the matching process is to find the matched source (target) word for each target (source) word and maximize the matching probabilities of all the matched word pairs. The loss of this part can be formulated as

\[ \mathcal{L}_{match} = \mathcal{L}^{T \rightarrow S}_{match} + \mathcal{L}^{S \rightarrow T}_{match}, \tag{6.38} \]

where L^{T→S}_match is the term for target-to-source matching and L^{S→T}_match is the term for source-to-target matching.

Next, a detailed explanation of target-to-source matching is given; the source-to-target matching is defined in the same way. A latent variable m_t ∈ {0, 1, ..., |V^S|} (t = 1, 2, ..., |V^T|) is first introduced for each target word w^T_t, where |V^S| and |V^T| indicate the vocabulary sizes of the source and target languages, respectively. Here, m_t specifies the index of the source word that w^T_t matches with, and m_t = 0 means that w^T_t is matched to the empty word. Collecting m = {m_1, m_2, ..., m_{|V^T|}}, we can formalize the target-to-source matching term:

\[ \mathcal{L}^{T \rightarrow S}_{match} = -\log P(C^T \mid C^S) = -\log \sum_{m} P(C^T, m \mid C^S), \tag{6.39} \]

where C^T and C^S denote the target and source corpora, respectively. The matching processes of the target words are simply assumed to be independent of each other. Therefore, we have

\[ P(C^T, m \mid C^S) = \prod_{w^T \in C^T} P(w^T, m \mid C^S) = \prod_{t=1}^{|V^T|} P\big(w^T_t \mid w^S_{m_t}\big)^{c(w^T_t)}, \tag{6.40} \]

where w^S_{m_t} is the source word that w^T_t matches with, and c(w^T_t) is the number of times w^T_t occurs in the target corpus.
Linguistic Inquiry and Word Count (LIWC) [52] has been widely used for computerized text analysis in social science. Not only can LIWC be used to analyze text for classification and prediction, but it has also been used to examine the underlying psychological states of a writer or speaker. In the beginning, LIWC was developed to address content-analytic issues in experimental psychology. Nowadays, there is an increasing number of applications across fields such as computational linguistics [22], demographics [48], health diagnostics [11], and social relationships [32].

Chinese is the most spoken language in the world, but we cannot use the original LIWC to analyze Chinese text. Fortunately, Chinese LIWC [28] has been released to fill the vacancy. In this part, we mainly focus on Chinese LIWC and use LIWC to stand for Chinese LIWC if not otherwise specified.

While LIWC has been used in a variety of fields, its lexicon contains fewer than 7,000 words. This is insufficient because, according to [39], there are at least 56,008 common words in Chinese. Moreover, the LIWC lexicon does not consider emerging words and phrases on the Internet. Therefore, it is reasonable and necessary to expand the LIWC lexicon so that it is more accurate and comprehensive for scientific research. One way to expand the LIWC lexicon is to annotate new words manually. However, this is too time-consuming and often requires language expertise. Hence, expanding the LIWC lexicon automatically is proposed.

In the LIWC lexicon, words are labeled with different categories, and the categories form a certain hierarchy. Therefore, hierarchical classification algorithms can be naturally applied to the LIWC lexicon. Reference [15] proposes Hierarchical SVM (Support Vector Machine), a modified version of SVM based on the hierarchical problem decomposition approach. In [6], the authors present a novel algorithm that can be used on both tree- and Directed Acyclic Graph (DAG)-structured hierarchies. Some recent works [12, 33] attempt to use neural networks for hierarchical classification.

However, these methods are often too generic and do not consider the special properties of words and of the LIWC lexicon. Many words and phrases have multiple meanings and are thereby classified into multiple leaf categories; this is often referred to as polysemy. Additionally, many categories in LIWC are fine-grained, which makes them more difficult to distinguish. To address these issues, we introduce several models that incorporate sememe information when expanding the lexicon, which will be discussed after the introduction of the basic model.
Basic Decoder for Hierarchical Classification. First, we introduce the basic model for Chinese LIWC lexicon expansion. The well-known sequence-to-sequence decoder [64] is exploited for hierarchical classification. The original sequence-to-sequence decoder is trained to predict the next word w_t conditioned on all the previously predicted words {w_1, ..., w_{t-1}}. This is a useful property, since an important difference between flat multilabel classification and hierarchical classification is that there are explicit connections among hierarchical labels. This property is utilized by transforming hierarchical labels into a sequence. Let Y denote the label set and π: Y → Y denote the parent relationship, where π(y) is the parent node of y ∈ Y. Given a word w, its labels form a tree-structured hierarchy. We then choose each path from the root node to a leaf node and transform it into a sequence {y_1, y_2, ..., y_L} with π(y_i) = y_{i-1}, ∀i ∈ [2, L], where L is the number of levels in the hierarchy. In this way, when the model predicts a label y_i, it takes into consideration the probability of the parent label sequence {y_1, ..., y_{i-1}}. Formally, the decoder defines a probability over the label sequence:

\[ P(y_1, y_2, \ldots, y_L) = \prod_{i=1}^{L} P\big(y_i \mid (y_1, \ldots, y_{i-1}), w\big). \tag{6.41} \]

A common choice for the decoder is an LSTM [27], so that each conditional probability is computed as

\[ P\big(y_i \mid (y_1, \ldots, y_{i-1}), w\big) = g(\mathbf{y}_{i-1}, \mathbf{h}_i) = \mathbf{o}_i \odot \tanh(\mathbf{h}_i), \tag{6.42} \]

where

\[ \begin{aligned} \mathbf{h}_i &= \mathbf{f}_i \odot \mathbf{h}_{i-1} + \mathbf{z}_i \odot \tilde{\mathbf{h}}_i, \\ \tilde{\mathbf{h}}_i &= \tanh\big(\mathbf{W}_h [\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_h\big), \\ \mathbf{o}_i &= \mathrm{Sigmoid}\big(\mathbf{W}_o [\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_o\big), \\ \mathbf{z}_i &= \mathrm{Sigmoid}\big(\mathbf{W}_z [\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_z\big), \\ \mathbf{f}_i &= \mathrm{Sigmoid}\big(\mathbf{W}_f [\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_f\big), \end{aligned} \tag{6.43} \]

where ⊙ is element-wise multiplication and h_i is the ith hidden state of the RNN. W_h, W_o, W_z, and W_f are weights, and b_h, b_o, b_z, and b_f are biases. o_i, z_i, and f_i are known as the output gate, input gate, and forget gate, respectively.

To take advantage of word embeddings, the initial state is defined as h_0 = w, where w represents the embedding of the word. In other words, the word embedding is used as the initial state of the decoder.

Specifically, the inputs of the model are word embeddings and label embeddings. First, raw words are transformed into word embeddings by an embedding matrix E ∈ R^{|V|×d_w}, where d_w is the word embedding dimension. Then, at each time step, label embeddings y are fed to the model, which are obtained by a label embedding matrix Y ∈ R^{|Y|×d_y}, where d_y is the label embedding dimension. Here, the word embeddings are pretrained and fixed during training.

Generally speaking, the decoder is expected to decode word labels hierarchically based on word embeddings. At each time step, the decoder predicts the current label depending on the previously predicted labels.
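The following is a minimal numpy sketch of one decoding step following Eqs. 6.42-6.43 (a simplified LSTM without a separate cell state, matching the equations above); the dimensions and parameter initialization are made-up assumptions rather than the authors' configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_step(h_prev, y_prev, params):
    """One step of the basic hierarchical decoder (Eqs. 6.42-6.43).

    h_prev: previous hidden state, shape (d_h,); h_0 is the word embedding
    y_prev: embedding of the previously predicted label, shape (d_y,)
    params: dict with gate weights W_* of shape (d_h, d_h + d_y) and biases b_*
    """
    inp = np.concatenate([h_prev, y_prev])            # [h_{i-1}; y_{i-1}]
    h_tilde = np.tanh(params["W_h"] @ inp + params["b_h"])
    o = sigmoid(params["W_o"] @ inp + params["b_o"])  # output gate
    z = sigmoid(params["W_z"] @ inp + params["b_z"])  # input gate
    f = sigmoid(params["W_f"] @ inp + params["b_f"])  # forget gate
    h = f * h_prev + z * h_tilde
    out = o * np.tanh(h)                              # fed to a softmax over candidate labels
    return h, out

# toy usage: word embedding as the initial state, 8-dim label embeddings
rng = np.random.default_rng(0)
d_h, d_y = 16, 8
params = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_y))
          for k in ["W_h", "W_o", "W_z", "W_f"]}
params.update({b: np.zeros(d_h) for b in ["b_h", "b_o", "b_z", "b_f"]})
h0 = rng.normal(size=d_h)                             # word embedding w
h1, out1 = decoder_step(h0, rng.normal(size=d_y), params)
print(out1.shape)
```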
Hierarchical Decoder with Sememe Attention. The basic decoder uses word embeddings as the initial state and then predicts word labels hierarchically as sequences. However, each word in the basic decoder model has only one representation. This is insufficient because many words are polysemous and many categories in the LIWC lexicon are fine-grained. It is difficult to handle these properties with a single real-valued vector. Therefore, Zeng et al. [68] propose to incorporate sememe information.

Because different sememes represent different meanings of a word, they should have different weights when predicting word labels. Moreover, the same sememe should have different weights in different categories. Take the word apex in Fig. 6.12, for example. The sememe location should have a relatively high weight when the decoder chooses among the subclasses of Relative. When choosing among the subclasses of PersonalConcerns, location should have a lower weight because it represents the relatively irrelevant sense vertex.
Fig. 6.12 Example word apex and its senses and sememes in HowNet annotation
To achieve these goals, an attention mechanism [1] is proposed to incorporate sememe information when decoding the word label sequence. The structure of the model is illustrated in Fig. 6.13.

Similar to the basic decoder approach, word embeddings are used as the initial state of the decoder. The primary difference is that the conditional probability is defined as

\[ P\big(y_i \mid (y_1, \ldots, y_{i-1}), w, \mathbf{c}_i\big) = g\big([\mathbf{y}_{i-1}; \mathbf{c}_i], \mathbf{h}_i\big), \tag{6.44} \]

where c_i is known as the context vector. The context vector c_i depends on a set of sememe embeddings {x_1, ..., x_N}, acquired by a sememe embedding matrix X ∈ R^{|S|×d_s}, where d_s is the sememe embedding dimension.

To be more specific, the context vector c_i is computed as a weighted sum of the sememe embeddings x_j:

\[ \mathbf{c}_i = \sum_{j=1}^{N} \alpha_{ij}\, \mathbf{x}_j. \tag{6.45} \]

The weight α_{ij} of each sememe embedding x_j is defined as

\[ \alpha_{ij} = \frac{\exp\big( \mathbf{v} \cdot \tanh(\mathbf{W}_1 \mathbf{y}_{i-1} + \mathbf{W}_2 \mathbf{x}_j) \big)}{\sum_{k=1}^{N} \exp\big( \mathbf{v} \cdot \tanh(\mathbf{W}_1 \mathbf{y}_{i-1} + \mathbf{W}_2 \mathbf{x}_k) \big)}, \tag{6.46} \]

where v ∈ R^a is a trainable parameter, W_1 ∈ R^{a×d_y} and W_2 ∈ R^{a×d_s} are weight matrices, and a is the number of hidden units in the attention model.
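The context-vector computation of Eqs. 6.45-6.46 can be sketched as follows; the dimensions and the random parameters are illustrative assumptions.

```python
import numpy as np

def sememe_context(y_prev, sememe_embs, v, W1, W2):
    """Attention over a word's sememes (Eqs. 6.45-6.46).

    y_prev:      embedding of the previously predicted label, shape (d_y,)
    sememe_embs: embeddings of the word's N sememes, shape (N, d_s)
    v:           attention vector, shape (a,)
    W1:          shape (a, d_y);  W2: shape (a, d_s)
    """
    # unnormalized attention scores v . tanh(W1 y_{i-1} + W2 x_j) for every sememe
    scores = np.tanh(W1 @ y_prev + sememe_embs @ W2.T) @ v
    scores -= scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()      # Eq. 6.46
    return alpha @ sememe_embs                         # context vector c_i (Eq. 6.45)

# toy usage: 4 sememes, d_y = 8, d_s = 6, a = 10 hidden attention units
rng = np.random.default_rng(0)
c_i = sememe_context(rng.normal(size=8), rng.normal(size=(4, 6)),
                     rng.normal(size=10), rng.normal(size=(10, 8)),
                     rng.normal(size=(10, 6)))
print(c_i.shape)   # (6,) -- concatenated with y_{i-1} as the decoder input (Eq. 6.44)
```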
Fig. 6.13 The architecture of the sememe attention decoder with word embeddings as the initial state

Intuitively, at each time step, the decoder chooses which sememes to pay attention to when predicting the current word label. In this way, different sememes can have different weights, and the same sememe can have different weights in different categories. With the support of sememe attention, the decoder can differentiate multiple meanings in a word as well as the fine-grained categories, and thus can expand a more accurate and comprehensive lexicon.

In this chapter, we first give an introduction to the most well-known sememe knowledge base, HowNet, which uses about 2,000 predefined sememes to annotate over 100,000 Chinese and English words and phrases. Different from other linguistic knowledge bases like WordNet, HowNet is based on the minimum semantic units (sememes) and captures the compositional relations between sememes and words. To learn the representations of sememe knowledge, we elaborate on three models, namely the Simple Sememe Aggregation model (SSA), the Sememe Attention over Context model (SAC), and the Sememe Attention over Target model (SAT). These models not only learn the representations of sememes but also help improve the representations of words. Next, we describe some applications of sememe knowledge, including word
representation, semantic composition, and language modeling. We also detail how to automatically predict sememes for both monolingual and cross-lingual unannotated words.

For further learning of sememe knowledge-based NLP, you can read the book written by the authors of HowNet [18]. You can also find more related papers in this paper list: https://github.com/thunlp/SCPapers. You can use the open-source API OpenHowNet (https://github.com/thunlp/OpenHowNet) to access HowNet data.

In the future, there are some research directions worth exploring:

(1)
Utilizing Structures of Sememe Annotations. The sememe annotations in HowNet are hierarchical, and the sememes annotated to a word are actually organized as a tree. However, existing studies do not yet utilize the structural information of sememes; instead, in current methods, sememes are simply regarded as semantic labels. In fact, the structures of sememes also incorporate abundant semantic information and will be helpful for a deep understanding of lexical semantics. Besides, existing sememe prediction studies only predict unstructured sememes, and it would be an interesting task to conduct structured sememe prediction.

(2)
Leveraging Sememes in Low-data Regimes. One of the most important and typical characteristics of sememes is that a limited number of sememes can represent unlimited semantics, which can play an important and positive role in tackling low-data regimes. In word representation learning, the representations of low-frequency words can be improved by their sememes, which have been well learned from the high-frequency words they annotate. We believe sememes will also be beneficial to other low-data regimes, e.g., NLP tasks for low-resource languages.

(3)
Building Sememe Knowledge Bases for Other Languages. The original HowNet annotates sememes for only two languages: Chinese and English. As far as we know, there are no sememe knowledge bases like HowNet in other languages. Since HowNet and its sememe knowledge have been verified as helpful for better understanding human languages, it will be of great significance to annotate sememes for words and phrases in other languages. In this chapter, we have described a study on cross-lingual sememe prediction, and we think it is promising to make efforts in this direction.
References
1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
2. Michele Banko, Vibhu O Mittal, and Michael J Witbrock. Headline generation based on statistical translation. In Proceedings of ACL, 2000.
3. Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, 2010.
4. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.
5. Adam Berger and John Lafferty. Information retrieval as statistical translation. In Proceedings of SIGIR, 1999.
6. Wei Bi and James T Kwok. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of ICML, 2011.
7. William Blacoe and Mirella Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, 2012.
8. Leonard Bloomfield. A set of postulates for the science of language. Language, 2(3):153-164, 1926.
9. Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of EMNLP, 2007.
10. Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, 1990.
11. Wilma Bucci and Bernard Maskit. Building a weighted dictionary for referential activity. Computing Attitude and Affect in Text, pages 49-60, 2005.
12. Ricardo Cerri, Rodrigo C Barros, and André CPLF De Carvalho. Hierarchical multi-label classification using local neural networks. Journal of Computer and System Sciences, 80(1):39-56, 2014.
13. Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, 2014.
14. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning of character and word embeddings. In Proceedings of IJCAI, 2015.
15. Yangchi Chen, Melba M Crawford, and Joydeep Ghosh. Integrating support vector machines in a hierarchical output space decomposition framework. In Proceedings of IGARSS, 2004.
16. Lei Dang and Lei Zhang. Method of discriminant for Chinese sentence sentiment orientation based on HowNet. Application Research of Computers, 4:43, 2010.
17. Zhendong Dong and Qiang Dong. HowNet - a hybrid language and knowledge resource. In Proceedings of NLP-KE, 2003.
18. Zhendong Dong and Qiang Dong. HowNet and the Computation of Meaning (With CD-Rom). World Scientific, 2006.
19. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT, 2015.
20. Xianghua Fu, Guo Liu, Yanyan Guo, and Zhiqiang Wang. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems, 37:186-195, 2013.
21. Edward Grefenstette and Mehrnoosh Sadrzadeh. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP, 2011.
22. Justin Grimmer and Brandon M Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267-297, 2013.
23. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP, pages 4642-4651, 2018.
24. Djoerd Hiemstra. A linguistically motivated probabilistic model of information retrieval. In Proceedings of TPDL, 1998.
25. G. E. Hinton. Products of experts. In Proceedings of ICANN, 1999.
26. Geoffrey E Hinton. Learning distributed representations of concepts. In Proceedings of CogSci, 1986.
27. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
28. Chin-Lan Huang, CK Chung, Natalie Hui, Yi-Cheng Lin, Yi-Tai Seih, WC Chen, and JW Pennebaker. The development of the Chinese linguistic inquiry and word count dictionary. Chinese Journal of Psychology, 54(2):185-201, 2012.
29. Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, 2012.
30. Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Incorporating Chinese characters of words for lexical sememe prediction. In Proceedings of ACL, 2018.
31. Dan Jurafsky. Speech & Language Processing. 2000.
32. Ewa Kacewicz, James W Pennebaker, Matthew Davis, Moongee Jeon, and Arthur C Graesser. Pronoun use reflects standings in social hierarchies. Journal of Language and Social Psychology, 33(2):125-143, 2014.
33. Sanjeev Kumar Karn, Ulli Waltinger, and Hinrich Schütze. End-to-end trainable attentive decoder for hierarchical entity classification. In Proceedings of EACL, 2017.
34. Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400-401, 1987.
35. Thomas Kober, Julie Weeds, Jeremy Reffin, and David Weir. Improving sparse word representations with distributional inference for semantic composition. In Proceedings of EMNLP, 2016.
36. Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
37. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of NeurIPS, 2014.
38. Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, and Xu Sun. Sememe prediction: Learning semantic knowledge from unstructured textual wiki descriptions. arXiv preprint arXiv:1808.05437, 2018.
39. Xingjian Li et al. Lexicon of common words in contemporary Chinese, 2008.
40. Qun Liu. Word similarity computing based on HowNet. Computational Linguistics and Chinese Language Processing, 7(2):59-76, 2002.
41. Shu Liu, Jingjing Xu, Xuancheng Ren, and Xu Sun. Evaluating semantic rationality of a sentence: A sememe-word-matching neural network based on HowNet. 2019.
42. Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of ACL, 2011.
43. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013.
44. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of InterSpeech, 2010.
45. David RH Miller, Tim Leek, and Richard M Schwartz. A hidden Markov model information retrieval system. In Proceedings of SIGIR, 1999.
46. Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL, 2008.
47. Jeff Mitchell and Mirella Lapata. Language models based on semantic composition. In Proceedings of EMNLP, 2009.
48. Matthew L Newman, Carla J Groom, Lori D Handelman, and James W Pennebaker. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211-236, 2008.
49. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Improved word representation learning with sememes. In Proceedings of ACL, 2017.
50. Francis Jeffry Pelletier. The principle of semantic compositionality. Topoi, 13(1):11-24, 1994.
51. Francis Jeffry Pelletier. Semantic Compositionality, volume 1. 2016.
52. James W Pennebaker, Roger J Booth, and Martha E Francis. Linguistic inquiry and word count: LIWC [computer software]. Austin, TX: liwc.net, 2007.
53. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
54. Jay M Ponte and W Bruce Croft. A language modeling approach to information retrieval. In Proceedings of SIGIR, 1998.
55. Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, and Maosong Sun. Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL, 2019.
56. Fanchao Qi, Yankai Lin, Maosong Sun, Hao Zhu, Ruobing Xie, and Zhiyuan Liu. Cross-lingual lexical sememe prediction. In Proceedings of EMNLP, 2018.
57. Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, 2015.
58. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of WWW, 2001.
59. Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. Parsing with compositional vector grammars. In Proceedings of ACL, 2013.
60. Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CoNLL, 2012.
61. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D. Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, 2013.
62. Jingguang Sun, Dongfeng Cai, Dexin Lv, and Yanju Dong. HowNet based Chinese question automatic classification. Journal of Chinese Information Processing, 21(1):90-95, 2007.
63. Maosong Sun and Xinxiong Chen. Embedding for words and word senses based on human annotated knowledge base: A case study on HowNet. Journal of Chinese Information Processing, 30:1-6, 2016.
64. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proceedings of NeurIPS, 2014.
65. David Weir, Julie Weeds, Jeremy Reffin, and Thomas Kober. Aligning packed dependency trees: a theory of composition for distributional semantics. Computational Linguistics, 42(4):727-761, December 2016.
66. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of IJCAI, 2017.
67. Binyong Yin. Quantitative research on Chinese morphemes. Studies of the Chinese Language, 5:338-347, 1984.
68. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention. In Proceedings of AAAI, 2018.
69. Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. Bilingual lexicon induction from non-parallel data with minimal supervision. In Proceedings of AAAI, 2017.
70. Yuntao Zhang, Ling Gong, and Yongcheng Wang. Chinese word sense disambiguation using HowNet. In Proceedings of ICNC.
71. Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing model for semantic composition. In Proceedings of AAAI, 2015.
72. Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. DAG-structured long short-term memory for semantic compositionality. In Proceedings of NAACL-HLT, 2016.
Open Access
This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 7
World Knowledge Representation
Abstract
World knowledge representation aims to represent the entities and relations in a knowledge graph in a low-dimensional semantic space, and it has been widely used in many knowledge-driven tasks. In this chapter, we first introduce the concept of the knowledge graph. Next, we introduce the motivations and give an overview of the existing approaches for knowledge graph representation. Further, we discuss several advanced approaches that aim to deal with the current challenges of knowledge graph representation. We also review the real-world applications of knowledge graph representation, such as language modeling, question answering, information retrieval, and recommender systems.
Knowledge Graph (KG), also known as Knowledge Base (KB), is a significant multi-relational dataset for modeling concrete entities and abstract concepts in the real world. It provides useful structured information and plays a crucial role in lots of real-world applications such as web search and question answering. It is no exaggeration to say that knowledge graphs teach us how to model the entities as well as the relationships among them in this complicated real world.

To encode knowledge into a real-world application, knowledge graph representation, which represents entities and relations in knowledge graphs with distributed representations, has been proposed and applied to various real-world artificial intelligence fields including question answering, information retrieval, and dialogue systems. That is, knowledge graph representation learning plays a vital role as a bridge between knowledge graphs and knowledge-driven tasks.

In this section, we will introduce the concept of the knowledge graph, several typical knowledge graphs, knowledge graph representation learning, and several typical knowledge-driven tasks.
In ancient times, knowledge was stored and inherited through books and letters written on parchment or bamboo slips. With the Internet thriving in the twenty-first century, a flood of information has poured into the World Wide Web, and knowledge has been transferred to semi-structured textual information on the web. However, due to the information explosion, it is not easy to extract the knowledge we want from the huge, noisy plain text on the Internet. To obtain knowledge effectively, people have noticed that the world is not only made of strings but also made of entities and relations. The Knowledge Graph, which arranges structured multi-relational data of concrete entities and abstract concepts in the real world, has been blooming in recent years and attracts wide attention in both academia and industry.

KGs are usually constructed from existing Semantic Web datasets in the Resource Description Framework (RDF) with the help of manual annotation, while they can also be automatically enriched by extracting knowledge from large plain texts on the Internet. A typical KG usually contains two elements: entities (i.e., concrete entities and abstract concepts in the real world) and relations between entities. It usually represents knowledge with large quantities of triple facts in the form ⟨head entity, relation, tail entity⟩, abridged as ⟨h, r, t⟩. For example, William Shakespeare is a famous English poet and playwright, who is widely regarded as the greatest writer in the English language, and
Romeo and Juliet is one of his masterpieces. In a knowledge graph, we will represent this knowledge as ⟨William Shakespeare, works_written, Romeo and Juliet⟩. Note that in the real world, the same head entity and relation may have multiple tail entities (e.g., William Shakespeare also wrote Hamlet and A Midsummer Night's Dream), and the reverse situation occurs when the tail entity and relation are fixed. It is even possible for both the head entity and the tail entity to be multiple (e.g., in relations like actor_in_movie). However, in a KG, all knowledge can be represented as triple facts regardless of the types of entities and relations. Through these triples, we can generate a huge directed graph, whose nodes correspond to entities and whose edges correspond to relations, to model the real world. With this well-structured, unified knowledge representation, KGs are widely used in a variety of applications to enhance their system performance.

There are several KGs widely utilized nowadays in applications of information retrieval and question answering. In this subsection, we will introduce some famous KGs such as Freebase, DBpedia, YAGO, and WordNet. In fact, there are also lots of comparatively smaller KGs in specific fields of knowledge that power vertical search.
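As a down-to-earth illustration of the ⟨h, r, t⟩ representation described above, here is a minimal sketch of storing triples and answering simple lookups; the class name and the stored facts are just placeholders built from the examples in the text.

```python
from collections import defaultdict

class TripleStore:
    """A toy knowledge graph stored as a set of <head, relation, tail> facts."""

    def __init__(self):
        self.triples = set()
        self.by_head_rel = defaultdict(set)   # (h, r) -> {t}: supports multiple tails

    def add(self, h, r, t):
        self.triples.add((h, r, t))
        self.by_head_rel[(h, r)].add(t)

    def tails(self, h, r):
        return self.by_head_rel[(h, r)]

kg = TripleStore()
kg.add("William Shakespeare", "works_written", "Romeo and Juliet")
kg.add("William Shakespeare", "works_written", "Hamlet")
kg.add("William Shakespeare", "works_written", "A Midsummer Night's Dream")
print(kg.tails("William Shakespeare", "works_written"))
```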
Freebase is one of the most popular knowledge graphs in the world. It is a large community-curated database of well-known people, places, and things,
which is composed of existing databases and contributions from its community members.

Fig. 7.1 An example of search results in Freebase

Freebase was first developed by the American software company Metaweb and ran from March 2007. In July 2010, Metaweb was acquired by Google, and Freebase was used to help power Google's Knowledge Graph. In December 2014, the Freebase team officially announced that the website, as well as the API of Freebase, would be shut down by June 30, 2015, while the data in Freebase would be transferred to Wikidata, another collaboratively edited knowledge base, operated by the Wikimedia Foundation. As of March 24, 2016, Freebase arranged 58,726,427 topics and 3,197,653,841 facts.

Freebase contains well-structured data representing relationships between entities as well as the attributes of entities in the form of triple facts (Fig. 7.1). Data in Freebase was mainly harvested from various sources, including Wikipedia, the Fashion Model Directory, NNDB, MusicBrainz, and so on. Moreover, the community members also contributed a lot to Freebase. Freebase is an open and shared database that aims to construct a global database which encodes the world's knowledge. It announced an open API, an RDF endpoint, and a database dump for its users, for both commercial and noncommercial use. As described by Tim O'Reilly, Freebase is the bridge between the bottom-up vision of Web 2.0 collective intelligence and the more structured world of the Semantic Web.
DBpedia is a crowd-sourced community effort aiming to extract structured contentfrom Wikipedia and make this information accessible on the web. It was started byresearchers at Free University of Berlin, Leipzig University and OpenLink Software,
initially released to the public in January 2007. DBpedia allows users to ask semantic queries associated with Wikipedia resources, even including links to other related datasets, which makes it easier for us to fully utilize the massive amount of information in Wikipedia in a novel and effective way. DBpedia is also an essential part of the Linked Data effort described by Tim Berners-Lee.

The English version of DBpedia describes 4.58 million entities, out of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 411,000 creative works, 251,000 species, 241,000 organizations, and 6,000 diseases. There are also localized versions of DBpedia in 125 languages, which together describe more than 38 million entities.

YAGO, which is short for Yet Another Great Ontology, is a high-quality KG developed by the Max Planck Institute for Computer Science in Saarbrücken, initially released in 2008. Knowledge in YAGO is automatically extracted from Wikipedia, WordNet, and GeoNames, and its accuracy has been manually evaluated, showing a confirmed accuracy of 95%. YAGO is special not only because of the confidence value attached to every fact based on the manual evaluation, but also because YAGO is anchored in space and time, which provides a spatial or temporal dimension to part of its entities.

Currently, YAGO has more than 10 million entities, including persons, organizations, and locations, with over 120 million facts about these entities. YAGO also combines knowledge extracted from the Wikipedias of 10 different languages and classifies entities into approximately 350,000 classes according to the Wikipedia category system and the taxonomy of WordNet. YAGO has also joined the Linked Data project and has been linked to the DBpedia ontology and the SUMO ontology (Fig. 7.2).
Knowledge graphs provide us with a novel way to describe the world with entities and triple facts, and they attract growing attention from researchers. Large KGs such as Freebase, DBpedia, and YAGO have been constructed and widely used in an enormous number of applications such as question answering and Web search.
Fig. 7.2 An example of search results in YAGO
However, with KG size increasing, we are facing two main challenges: data sparsity and computational inefficiency. Data sparsity is a general problem in many fields such as social network analysis and interest mining. This is because there are too many nodes (e.g., users, products, or entities) in a large graph but too few edges (e.g., relationships) between these nodes, since the number of relations of a node is limited in the real world. Computational efficiency is another challenge we need to overcome with the increasing size of knowledge graphs.

To tackle these problems, representation learning is introduced to knowledge representation. Representation learning in KGs aims to project both entities and relations into a low-dimensional continuous vector space to get their distributed representations, whose effectiveness has been confirmed in word representation and social representation. Compared with the traditional one-hot representation, distributed representation has much fewer dimensions and thus lowers the computational complexity. What is more, distributed representation can explicitly show the similarity between entities through distances computed over the low-dimensional embeddings, while all embeddings in one-hot representation are orthogonal, making it difficult to tell the potential relations between entities.

With the advantages above, knowledge graph representation learning is blooming in knowledge applications, significantly improving the ability of KGs on the tasks of knowledge completion, knowledge fusion, and reasoning. It is considered as the bridge between knowledge construction, knowledge graphs, and knowledge-driven applications. Up to now, a large number of methods have been proposed that use distributed representations for modeling knowledge graphs, with the learned knowledge representations widely utilized in various knowledge-driven tasks like question answering, information retrieval, and dialogue systems.
In summary, Knowledge graph Representation Learning (KRL) aims to construct distributed knowledge representations for entities and relations, projecting knowledge into low-dimensional semantic vector spaces. Recent years have witnessed significant advances in knowledge graph representation learning, with a large number of KRL methods proposed to construct knowledge representations, among which the translation-based methods achieve state-of-the-art performance in many KG tasks while striking a good balance between effectiveness and efficiency.

In this section, we will first describe the notations used in KRL. Then, we will introduce TransE, which is the fundamental version of the translation-based methods. Next, we will explore the various extensions of TransE in detail. Finally, we will take a brief look at other representation learning methods utilized in modeling knowledge graphs.
First, we introduce the general notations used in the rest of this section. We use G = (E, R, T) to denote the whole KG, in which E = {e_1, e_2, ..., e_{|E|}} stands for the entity set, R = {r_1, r_2, ..., r_{|R|}} stands for the relation set, and T stands for the triple set. |E| and |R| are the numbers of entities and relations in their overall sets. As stated above, we represent knowledge in the form of a triple fact ⟨h, r, t⟩, where h ∈ E is the head entity, t ∈ E is the tail entity, and r ∈ R is the relation between h and t.

TransE [7] is a translation-based model for learning low-dimensional embeddings of entities and relations. It projects entities as well as relations into the same semantic embedding space and then regards relations as translations in that space. We will first present the motivations of this method, then discuss the details of how knowledge representations are trained under TransE, and finally examine the advantages and disadvantages of TransE for a deeper understanding.
There are three main motivations behind the translation-based knowledge graph representation learning method.

The primary motivation is that it is natural to consider relationships between entities as translating operations. Through distributed representations, entities are projected into a low-dimensional vector space. Intuitively, a reasonable projection should map entities with similar semantic meanings to the same neighborhood, while entities with different meanings should belong to distinct clusters in the vector space. For example, William Shakespeare and Jane Austen may be in the same cluster of writers, while Romeo and Juliet and Pride and Prejudice may be in another cluster of books. In this case, the pairs share the same relation works_written, and the translations from writers to books in the vector space are similar.

The secondary motivation of TransE derives from the breakthrough in word representation by Word2vec [49]. Word2vec proposes two simple models, Skip-gram and CBOW, to learn word embeddings from large-scale corpora, significantly improving the performance in word similarity and analogy tasks. The word embeddings learned by Word2vec show an interesting phenomenon: if two word pairs share the same semantic or syntactic relationship, the subtraction of the embeddings within each word pair will be similar. For instance, we have

w(king) − w(man) ≈ w(queen) − w(woman),   (7.1)

which indicates that the latent semantic relation between king and man, which is similar to the relation between queen and woman, is successfully embedded in the word representations. This approximate relation holds not only for semantic relations but also for syntactic relations. We have

w(bigger) − w(big) ≈ w(smaller) − w(small).   (7.2)

The phenomenon found in word representation strongly implies that there may exist an explicit method to represent relationships between entities as translating operations in vector space.

The last motivation comes from the consideration of computational complexity. On the one hand, a substantial increase in model complexity results in high computational costs and obscures model interpretability. Moreover, a complex model may lead to overfitting. On the other hand, experimental results on model complexity demonstrate that simpler models perform almost as well as more expressive models in most KG applications, provided that the multi-relational dataset is sizeable and contains a relatively large number of relations. As KG size increases, computational complexity becomes the primary challenge in knowledge graph representation. The intuitive assumption of translation leads to a better trade-off between accuracy and efficiency.

As illustrated in Fig. 7.3, TransE projects entities and relations into the same low-dimensional space. All embeddings take values in R^d, where d is a hyperparameter indicating the dimension of embeddings. With the translation assumption, for each triple ⟨h, r, t⟩ in T, we want the summation embedding h + r to be the nearest neighbor of the tail embedding t. The score function of TransE is then defined as follows:
Fig. 7.3
The architecture of the TransE model [47]

E(h, r, t) = ‖h + r − t‖.   (7.3)

More specifically, to learn such embeddings of entities and relations, TransE formalizes a margin-based loss function with negative sampling as the training objective. The pair-wise loss is defined as follows:

L = Σ_{⟨h,r,t⟩∈T} Σ_{⟨h',r',t'⟩∈T^-} max(γ + E(h, r, t) − E(h', r', t'), 0),   (7.4)

in which E(h, r, t) is the score of the energy function for a positive triple (i.e., a triple in T) and E(h', r', t') is that of a negative triple. The energy function E can be measured by either the L1 or the L2 distance. γ > 0 is a margin hyperparameter; a larger γ means a wider gap between the scores of positive triples and the corresponding negative ones. T^- is the negative triple set with respect to T.

Since there are no explicit negative triples in knowledge graphs, we define T^- as follows:

T^- = {⟨h', r, t⟩ | h' ∈ E} ∪ {⟨h, r', t⟩ | r' ∈ R} ∪ {⟨h, r, t'⟩ | t' ∈ E},   ⟨h, r, t⟩ ∈ T,   (7.5)

which means the negative triple set T^- is composed of positive triples ⟨h, r, t⟩ with the head entity, relation, or tail entity randomly replaced by any other entity or relation in the KG. Note that a new triple generated after replacement will not be considered a negative sample if it is already in T.

TransE is optimized using mini-batch stochastic gradient descent (SGD), with entities and relations randomly initialized. Knowledge completion, a link prediction task aiming to predict the third element of a triple (either an entity or a relation) given the other two elements, is designed to evaluate the learned knowledge representations.

TransE is effective and efficient and has shown its power on link prediction. However, it still has several disadvantages and challenges to be further explored.

First, in knowledge completion, we may have multiple correct answers given two elements of a triple. For instance, with the given head entity
William Shakespeare and the relation works_written, we will get a list of masterpieces including Romeo and Juliet, Hamlet, and A Midsummer Night's Dream. These books share the same information about the writer while differing in many other fields such as theme, background, and famous roles in the book. However, with the translation assumption in TransE, every entity has only one embedding in all triples, which significantly limits the ability of TransE in knowledge graph representation. In [7], the authors categorize all relations into four classes, 1-to-1, 1-to-Many, Many-to-1, and Many-to-Many, according to the cardinalities of their head and tail arguments. A relation is considered 1-to-1 if most heads appear with one tail, 1-to-Many if a head can appear with many tails, Many-to-1 if a tail can appear with many heads, and Many-to-Many if multiple heads appear with multiple tails. Statistics demonstrate that the 1-to-Many, Many-to-1, and Many-to-Many relations occupy a large proportion. TransE does well on 1-to-1 relations, but it has issues when dealing with 1-to-Many, Many-to-1, and Many-to-Many relations. Similarly, TransE may also struggle with reflexive relations.

Second, the translating operation is intuitive and effective, but it only considers the simple one-step translation, which may limit the ability to model KGs. Taking entities as nodes and relations as edges, we can construct a huge knowledge graph with the triple facts. However, TransE focuses on minimizing the energy function E(h, r, t) = ‖h + r − t‖, which only utilizes the one-step relation information in knowledge graphs, regardless of the latent relationships located in long-distance paths. For example, if we know the triple facts ⟨The Forbidden City, locate_in, Beijing⟩ and ⟨Beijing, capital_of, China⟩, we can infer that The Forbidden City is located in China. TransE can be further enhanced with the help of such multistep information.

Third, the representation and the dissimilarity function in TransE are oversimplified for the sake of efficiency. Therefore, TransE may not be capable enough of modeling complicated entities and relations in knowledge graphs. There still exist challenges on how to balance effectiveness and efficiency, avoiding both overfitting and underfitting.

Besides the disadvantages and challenges stated above, multisource information such as textual information and hierarchical type/label information is of great significance, which will be further discussed in the following.
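Before turning to the extensions, the following is a minimal numpy sketch of the TransE score and margin-based loss (Eqs. 7.3-7.5); the hyperparameters and the simple corruption scheme are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of the TransE energy and margin-based loss with negative sampling.
import numpy as np

rng = np.random.default_rng(0)
num_entities, num_relations, dim, margin = 100, 20, 50, 1.0

entity_emb = rng.normal(scale=0.1, size=(num_entities, dim))
relation_emb = rng.normal(scale=0.1, size=(num_relations, dim))

def energy(h, r, t, norm=1):
    """E(h, r, t) = ||h + r - t||, using the L1 or L2 distance."""
    return np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t], ord=norm)

def margin_loss(pos_triple, neg_triple):
    """max(gamma + E(pos) - E(neg), 0) for one positive/negative pair."""
    return max(margin + energy(*pos_triple) - energy(*neg_triple), 0.0)

def corrupt(triple):
    """Negative sampling: replace the head or the tail with a random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.integers(num_entities), r, t)
    return (h, r, rng.integers(num_entities))

pos = (3, 5, 7)                      # a hypothetical <h, r, t> index triple
print(margin_loss(pos, corrupt(pos)))
```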
There are many extension methods following TransE that address the challenges above. Specifically, TransH, TransR, TransD, and TranSparse are proposed to solve the challenges of modeling 1-to-Many, Many-to-1, and Many-to-Many relations; PTransE is proposed to encode long-distance information located in multistep paths; and CTransR, TransA, TransG, and KG2E further extend the oversimplified model of TransE. We will discuss these extension methods in detail.
With distributed representation, entities are projected into the semantic vector space, and similar entities tend to be in the same cluster. However, it seems that William Shakespeare should be in the neighborhood of Isaac Newton when talking about nationality, while it should be next to Mark Twain when talking about occupation. To accomplish this, we want entities to show different preferences in different situations, that is, to have multiple representations in different triples.

To address the issue of modeling 1-to-Many, Many-to-1, Many-to-Many, and reflexive relations, TransH [77] enables an entity to have multiple representations when involved in different relations. As illustrated in Fig. 7.4, TransH proposes a relation-specific hyperplane w_r for each relation and judges dissimilarities on the hyperplane instead of in the original vector space of entities. Given a triple ⟨h, r, t⟩, TransH first projects h and t onto the corresponding hyperplane w_r to get the projections h⊥ and t⊥, and the translation vector r is used to connect h⊥ and t⊥ on the hyperplane. The score function is defined as follows:

E(h, r, t) = ‖h⊥ + r − t⊥‖,   (7.6)

in which we have

h⊥ = h − w_r^T h w_r,   t⊥ = t − w_r^T t w_r,   (7.7)

where w_r is the normal vector of the hyperplane and ‖w_r‖ is restricted to 1. As for training, TransH also minimizes a margin-based loss function with negative sampling, similar to TransE, and uses mini-batch SGD to learn the representations.

TransH enables entities to have multiple representations under different relations with the help of hyperplanes, while entities and relations are still restricted to the same semantic vector space, which may limit the ability to model entities and relations. TransR [39] assumes that entities and relations should be arranged in distinct spaces, that is, an entity space for all entities and a relation space for each relation.

Fig. 7.4
The architecture of the TransH model [47]
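As a quick illustration, the following small numpy sketch implements the TransH hyperplane projection and score of Eqs. (7.6)-(7.7); the vectors below are random stand-ins for learned embeddings.

```python
# A small sketch of the TransH hyperplane projection and score.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
h, t, r = rng.normal(size=(3, dim))
w_r = rng.normal(size=dim)
w_r /= np.linalg.norm(w_r)           # ||w_r|| is constrained to 1

def project_to_hyperplane(x, w):
    """x_perp = x - (w^T x) w, i.e., remove the component along the normal w."""
    return x - (w @ x) * w

h_perp = project_to_hyperplane(h, w_r)
t_perp = project_to_hyperplane(t, w_r)
score = np.linalg.norm(h_perp + r - t_perp)   # E(h, r, t) evaluated on the hyperplane
print(score)
```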
As illustrated in Fig. 7.5, for a triple ⟨h, r, t⟩ with h, t ∈ R^k and r ∈ R^d, TransR first projects h and t from the entity space into the relation space corresponding to r. That is to say, every entity has a relation-specific representation for each relation, and the translating operation is performed in that specific relation space. The energy function of TransR is defined as follows:

E(h, r, t) = ‖h_r + r − t_r‖,   (7.8)

where h_r and t_r stand for the relation-specific representations of h and t in the relation space of r. The projection from the entity space to the relation space is

h_r = h M_r,   t_r = t M_r,   (7.9)

where M_r ∈ R^{k×d} is a projection matrix mapping entities from the entity space to the relation space of r. TransR also constrains the norms of the embeddings so that ‖h‖ ≤ 1, ‖t‖ ≤ 1, ‖r‖ ≤ 1, ‖h_r‖ ≤ 1, and ‖t_r‖ ≤ 1. As for training, TransR shares the same margin-based loss function as TransE.

Furthermore, the authors found that some relations in knowledge graphs could be divided into a few sub-relations that convey more precise information. The differences between those sub-relations can be learned from the corresponding entity pairs. For instance, the relation location_contains has head-tail patterns like city-street, country-city, and even country-university, showing different attributes in cognition. With the sub-relations taken into consideration, entities may be projected to more precise positions in the semantic vector space.

Cluster-based TransR (CTransR), an enhanced version of TransR that takes sub-relations into consideration, is then proposed. More specifically, for each relation r, all entity pairs (h, t) are first clustered into several groups. The clustering of entity pairs depends on the subtraction result t − h, in which h and t are pretrained with TransE. Next, a distinct sub-relation vector r_c is learned for each cluster according to the corresponding entity pairs, and the original energy function is modified as
Fig. 7.5
The architecture of the TransR model [47]

E(h, r, t) = ‖h_r + r_c − t_r‖ + α ‖r_c − r‖,   (7.10)

where the term ‖r_c − r‖ requires the sub-relation vector r_c not to be too distant from the unified relation vector r.

TransH and TransR focus on the multiple representations of entities under different relations, improving the performance on knowledge completion and triple classification. However, both models only project entities according to the relations in triples, ignoring the diversity of entities. Moreover, the projection operation with matrix-vector multiplication leads to a higher computational complexity compared to TransE, which is time consuming when applied to large-scale graphs. To address this problem, TransD [32] proposes a novel projection method with a dynamic mapping matrix depending on both the entity and the relation, which takes the diversity of entities as well as relations into consideration.

TransD defines two vectors for each entity and relation, i.e., the original vector that is also used in TransE, TransH, and TransR for the distributed representation of entities and relations, and the projection vector that is used to construct projection matrices mapping entities from the entity space to the relation space. As illustrated in Fig. 7.6, TransD uses h, t, r to represent the original vectors, while h_p, t_p, and r_p represent the projection vectors. There are two projection matrices M_rh, M_rt ∈ R^{m×n} used to project from the entity space to the relation space, and the projection matrices are dynamically constructed as follows:

M_rh = r_p h_p^T + I^{m×n},   M_rt = r_p t_p^T + I^{m×n},   (7.11)

Fig. 7.6
The architecture of the TransD model [47]

which means the projection vectors of the entity and the relation are combined to determine the dynamic projection matrix. The score function is then defined as

E(h, r, t) = ‖M_rh h + r − M_rt t‖.   (7.12)

The projection matrices are initialized with identity matrices, and there are also some normalization constraints as in TransR.

TransD proposes a dynamic method to construct projection matrices with the diversity of both entities and relations taken into consideration, achieving better performance compared to existing methods in link prediction and triple classification. Moreover, it lowers both the computational and the spatial complexity compared to TransR.
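The following brief numpy sketch builds the dynamic projection matrices of Eq. (7.11) and evaluates the TransD score of Eq. (7.12); the dimensions and random vectors are illustrative only.

```python
# A brief sketch of TransD's dynamic projection matrices and score.
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 40                        # entity dimension n, relation dimension m

h, t, h_p, t_p = rng.normal(size=(4, n))
r, r_p = rng.normal(size=(2, m))

I_mn = np.eye(m, n)                  # rectangular "identity" I^{m x n}

# M_rh = r_p h_p^T + I,  M_rt = r_p t_p^T + I  -- one matrix per (entity, relation) pair
M_rh = np.outer(r_p, h_p) + I_mn
M_rt = np.outer(r_p, t_p) + I_mn

score = np.linalg.norm(M_rh @ h + r - M_rt @ t)   # E(h, r, t)
print(score)
```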
The extension methods of TransE stated above focus on the multiple representations of entities in different relations and entity pairs. However, there are still two challenges ignored: (1) heterogeneity: relations in knowledge graphs differ in granularity; some complex relations link many entity pairs, while some relatively simple relations do not; (2) unbalance: some relations have many more links to head entities than to tail entities, and vice versa. The performance will be further improved if we consider these properties rather than merely treating all relations equally.

Existing methods like TransR build projection matrices for each relation, but these projection matrices have the same number of parameters regardless of the variety in the complexity of relations. TranSparse [33] is then proposed to address these issues. The underlying assumption of TranSparse is that complex relations should have more parameters to learn while simple relations should have fewer, where the complexity of a relation is judged from the number of triples or entities linked by the relation. To accomplish this, two models, TranSparse(share) and TranSparse(separate), are proposed to avoid both overfitting and underfitting.

Inspired by TransR, TranSparse(share) builds a projection matrix M_r(θ_r) for each relation r. This projection matrix is sparse, and the sparse degree θ_r mainly depends on the number of entity pairs linked by r. Suppose N_r is the number of linked entity pairs, N*_r is the maximum of N_r over all relations, and θ_min denotes the minimum sparse degree of projection matrices, with 0 ≤ θ_min ≤ 1. The sparse degree of relation r is defined as follows:

θ_r = 1 − (1 − θ_min) N_r / N*_r.   (7.13)

Both head and tail entities share the same sparse projection matrix M_r(θ_r) in translation. The score function is

E(h, r, t) = ‖M_r(θ_r) h + r − M_r(θ_r) t‖.   (7.14)

Differing from TranSparse(share), TranSparse(separate) builds two different sparse matrices M_rh(θ_rh) and M_rt(θ_rt) for head and tail entities. The sparse degree θ_rh (or θ_rt) then depends on the number of head (or tail) entities linked by relation r. We use N_rh (or N_rt) to represent the number of head (or tail) entities, and N*_rh (or N*_rt) to represent the maximum of N_rh (or N_rt) over all relations. θ_min is again the minimum sparse degree of projection matrices, with 0 ≤ θ_min ≤ 1. We have

θ_rh = 1 − (1 − θ_min) N_rh / N*_rh,   θ_rt = 1 − (1 − θ_min) N_rt / N*_rt.   (7.15)

The score function of TranSparse(separate) is

E(h, r, t) = ‖M_rh(θ_rh) h + r − M_rt(θ_rt) t‖.   (7.16)

Through the sparse projection matrices, TranSparse addresses the heterogeneity and the unbalance problems simultaneously.
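The sparse-degree computation of Eq. (7.13) is simple enough to show directly; in the sketch below the per-relation statistics are made-up numbers used only to illustrate how complex relations receive denser (less sparse) projection matrices.

```python
# A minimal sketch of the TranSparse(share) sparse degree (Eq. 7.13).
theta_min = 0.3
pairs_per_relation = {"works_written": 5000, "capital_of": 200}   # hypothetical N_r
n_max = max(pairs_per_relation.values())                          # N*_r

def sparse_degree(n_r, n_max, theta_min):
    """theta_r = 1 - (1 - theta_min) * N_r / N*_r."""
    return 1.0 - (1.0 - theta_min) * n_r / n_max

for rel, n_r in pairs_per_relation.items():
    theta_r = sparse_degree(n_r, n_max, theta_min)
    print(rel, round(theta_r, 3))   # complex relations get a lower sparse degree,
                                    # i.e., a denser projection matrix with more parameters
```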
The extension models of TransE stated above mainly focus on the challenge of multiple representations of entities in different scenarios. However, those extension models only consider the simple one-step paths (i.e., single relations) in the translating operation, ignoring the rich global information located in the whole knowledge graph. Considering multistep relational paths is a potential way to utilize this global information. For instance, if we notice the multistep relational path ⟨The Forbidden City, locate_in, Beijing⟩ → ⟨Beijing, capital_of, China⟩, we can infer with confidence that the triple ⟨The Forbidden City, locate_in, China⟩ may exist. Relational paths provide us with a powerful way to construct better knowledge graph representations and even to gain a better understanding of knowledge reasoning.

There are two main challenges when encoding the information in multistep relational paths. First, how to select reliable and meaningful relational paths among the enormous number of path candidates in KGs, since many relation sequence patterns do not indicate reasonable relationships. Consider the relational path ⟨The Forbidden City, locate_in, Beijing⟩ → ⟨Beijing, held, 2008 Olympic Games⟩: it is hard to describe the relationship between The Forbidden City and 2008 Olympic Games. Second, how to model those meaningful relational paths once we have them, since it is difficult to solve the compositional semantic problem in relational paths.

PTransE [38] is then proposed to model multistep relational paths. To select meaningful relational paths, the authors propose a Path-Constraint Resource Allocation (PCRA) algorithm to judge the reliability of a relation path. Suppose there is a certain amount of information (or resource) in the head entity h that will flow to the tail entity t through certain relational paths. The basic assumption of PCRA is that the reliability of a path ℓ depends on the amount of resource that finally flows from head to tail. Formally, we set ℓ = (r_1, ..., r_l) for a certain path between h and t. The resource travels from h to t, and the path can be represented as S_0/h →(r_1) S_1 →(r_2) ... →(r_l) S_l/t. For an entity m ∈ S_i, its resource amount is defined as follows:

R_ℓ(m) = Σ_{n ∈ S_{i−1}(·, m)} (1 / |S_i(n, ·)|) R_ℓ(n),   (7.17)

where S_{i−1}(·, m) indicates the set of direct predecessors of entity m along relation r_i in S_{i−1}, and S_i(n, ·) indicates the set of direct successors of n ∈ S_{i−1} along relation r_i. Finally, the resource amount of the tail, R_ℓ(t), is used to measure the reliability of ℓ for the given triple ⟨h, ℓ, t⟩.

Once we have estimated the reliabilities and selected the meaningful relational path candidates, the next challenge is how to model the meaning of those multistep paths. PTransE proposes three types of composition operations, namely addition, multiplication, and recurrent neural networks, to obtain the representation l of ℓ = (r_1, ..., r_l) from the representations of its relations. The score function of the path triple ⟨h, ℓ, t⟩ is defined as follows:

E(h, ℓ, t) = ‖l − (t − h)‖ ≈ ‖l − r‖ = E(ℓ, r),   (7.18)

where r indicates the golden relation between h and t. Since PTransE also wants to satisfy the TransE assumption that r ≈ t − h, it directly uses r in training. The optimization objective of PTransE is

L = Σ_{(h,r,t)∈S} [ L(h, r, t) + (1/Z) Σ_{ℓ∈P(h,t)} R(ℓ | h, t) L(ℓ, r) ],   (7.19)
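The following compact sketch illustrates the resource-allocation idea behind Eq. (7.17): one unit of resource starts at the head, is split evenly among successors at every step, and the amount reaching the tail measures path reliability. The toy graph is purely illustrative and not from the book.

```python
# A compact sketch of the Path-Constraint Resource Allocation (PCRA) idea.
from collections import defaultdict

# adjacency[relation][entity] -> list of successor entities (hypothetical KG slice)
adjacency = {
    "locate_in": {"Forbidden_City": ["Beijing"]},
    "capital_of": {"Beijing": ["China"]},
    "held": {"Beijing": ["2008_Olympic_Games", "2014_APEC_Summit"]},
}

def pcra_reliability(head, path, tail):
    """Propagate one unit of resource from `head` along `path`; return the share at `tail`."""
    resource = {head: 1.0}
    for relation in path:
        next_resource = defaultdict(float)
        for node, amount in resource.items():
            successors = adjacency.get(relation, {}).get(node, [])
            for succ in successors:
                next_resource[succ] += amount / len(successors)
        resource = dict(next_resource)
    return resource.get(tail, 0.0)

print(pcra_reliability("Forbidden_City", ["locate_in", "capital_of"], "China"))             # 1.0
print(pcra_reliability("Forbidden_City", ["locate_in", "held"], "2008_Olympic_Games"))      # 0.5
```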
Fig. 7.7
The architecture of the TransA model [47]

where L(h, r, t) is the margin-based loss with E(h, r, t) and L(ℓ, r) is the margin-based loss with E(ℓ, r). The reliability R(ℓ | h, t) of ℓ in (h, ℓ, t) is thus explicitly considered in the overall loss function.

Besides PTransE, similar ideas such as [21, 22] also successfully exploit multistep relational paths on different tasks such as knowledge completion and question answering. These works demonstrate that there is plentiful information located in multistep relational paths, which can significantly improve the performance of knowledge graph representation, and further exploration of more sophisticated models for relational paths remains promising.

TransA [78] is proposed to solve the following problems in TransE and other extensions: (1) TransE and its extensions only consider the Euclidean distance in their energy functions, which is rather inflexible. (2) Existing methods treat each dimension of the semantic vector space identically whatever the triple is, which may introduce errors when calculating dissimilarities. To solve these problems, as illustrated in Fig. 7.7, TransA replaces the inflexible Euclidean distance with an adaptive Mahalanobis distance, which is more adaptive and flexible. The energy function of TransA is as follows:

E(h, r, t) = (|h + r − t|)^T W_r (|h + r − t|),   (7.20)

where W_r is a relation-specific nonnegative symmetric matrix corresponding to the adaptive metric. Note that |h + r − t| stands for a nonnegative vector in which each dimension is the absolute value of the corresponding component of the translation. We have

(|h + r − t|) ≜ (|h_1 + r_1 − t_1|, |h_2 + r_2 − t_2|, ..., |h_n + r_n − t_n|).   (7.21)
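A small numpy sketch of the adaptive-metric energy of Eqs. (7.20)-(7.21) follows; W_r is a hypothetical relation-specific nonnegative symmetric matrix built only for illustration.

```python
# A small sketch of TransA's adaptive Mahalanobis-style energy.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
h, r, t = rng.normal(size=(3, dim))

A = rng.random((dim, dim))
W_r = (A + A.T) / 2.0                 # symmetric matrix with nonnegative entries

def transa_energy(h, r, t, W_r):
    """E(h, r, t) = |h + r - t|^T W_r |h + r - t|."""
    d = np.abs(h + r - t)             # elementwise absolute translation residual
    return float(d @ W_r @ d)

print(transa_energy(h, r, t, W_r))
```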
Existing translation-based models usually consider entities and relations as vectors embedded in low-dimensional semantic spaces. However, as explained above, entities and relations in KGs vary in their granularities. Therefore, the margin in the margin-based loss that is used to distinguish positive triples from negative triples should be more flexible to account for this diversity, and the uncertainties of entities and relations should be taken into consideration.

To solve this, KG2E [30] is proposed, introducing multidimensional Gaussian distributions into KG representations. As illustrated in Fig. 7.8, KG2E represents each entity and relation with a Gaussian distribution. Specifically, the mean vector denotes the entity/relation's central position, and the covariance matrix denotes its uncertainty. To learn the Gaussian distributions for entities and relations, KG2E follows the score-function style of TransE. For a triple ⟨h, r, t⟩, the Gaussian distributions of the entities and the relation are defined as follows:

h ∼ N(μ_h, Σ_h),   t ∼ N(μ_t, Σ_t),   r ∼ N(μ_r, Σ_r).   (7.22)

Note that the covariances are diagonal for the sake of efficiency. KG2E hypothesizes that the head and tail entities are independent given specific relations; the translation h − t can then be defined as

h − t = e ∼ N(μ_h − μ_t, Σ_h + Σ_t).   (7.23)

To measure the dissimilarity between e and r, KG2E proposes two methods, considering both asymmetric similarity and symmetric similarity.

The asymmetric similarity is based on the KL divergence between e and r, which is a straightforward way to measure the similarity between two probability distributions. The energy function is as follows:

E(h, r, t) = D_KL(e ‖ r)
           = ∫_{x∈R^{k_e}} N(x; μ_r, Σ_r) log [N(x; μ_e, Σ_e) / N(x; μ_r, Σ_r)] dx
           = (1/2) { tr(Σ_r^{-1} Σ_e) + (μ_r − μ_e)^T Σ_r^{-1} (μ_r − μ_e) − log [det(Σ_e) / det(Σ_r)] − k_e },   (7.24)

where tr(Σ) indicates the trace of Σ and Σ^{-1} indicates its inverse.
Fig. 7.8
The architecture of the KG2E model [47]
The symmetric similarity is based on the expected likelihood or probability product kernel. KG2E takes the inner product between P_e and P_r as the measure of similarity. The logarithm of the energy function is

E(h, r, t) = ∫_{x∈R^{k_e}} N(x; μ_e, Σ_e) N(x; μ_r, Σ_r) dx = log N(0; μ_e − μ_r, Σ_e + Σ_r)
           = (1/2) { (μ_e − μ_r)^T (Σ_e + Σ_r)^{-1} (μ_e − μ_r) + log det(Σ_e + Σ_r) + k_e log(2π) }.   (7.25)

The optimization objective of KG2E is also margin-based, similar to TransE. For both asymmetric and symmetric similarities, the embeddings are constrained by some regularization to avoid overfitting:

∀ l ∈ E ∪ R,   ‖μ_l‖ ≤ 1,   c_min I ≤ Σ_l ≤ c_max I,   c_min > 0.   (7.26)

Figure 7.8 shows a brief example of representations in KG2E.

We have discussed, in the section on TransR/CTransR, the problem of TransE that some relations in knowledge graphs such as location_contains or has_part may have multiple sub-meanings. Such relations are more likely to be combinations that could be divided into several more precise relations. To address this issue, CTransR is proposed with a preprocessing step that clusters the entity pairs (h, t) for each relation r. TransG [79] also focuses on this issue and handles it more elegantly by introducing a generative model. As illustrated in Fig. 7.9, it assumes that the embeddings of different semantic components should follow a Gaussian mixture model. The generative process is as follows:

1. For each entity e ∈ E, TransG sets a standard normal distribution: μ_e ∼ N(0, I).
2. For a triple ⟨h, r, t⟩, TransG uses the Chinese Restaurant Process to automatically detect semantic components (i.e., sub-meanings of a relation): π_{r,n} ∼ CRP(β).
3. Draw the head embedding from a normal distribution: h ∼ N(μ_h, σ_h^2 I).
4. Draw the tail embedding from a normal distribution: t ∼ N(μ_t, σ_t^2 I).
5. Draw the relation embedding for this semantic component: μ_{r,n} = t − h ∼ N(μ_t − μ_h, (σ_h^2 + σ_t^2) I).

Fig. 7.9
The architecture of the TransG model [47]

Here μ is the mean embedding and σ^2 is the variance. Finally, the score function is

E(h, r, t) ∝ Σ_{n=1}^{N_r} π_{r,n} N(μ_t − μ_h, (σ_h^2 + σ_t^2) I),   (7.27)

in which N_r is the number of semantic components of r, and π_{r,n} is the weight of the n-th component generated by the Chinese Restaurant Process. Figure 7.9 illustrates the advantages of the generative Gaussian mixture model.
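To make the Gaussian machinery above more concrete, here is an illustrative numpy sketch of the KL-based KG2E energy (Eq. 7.24) for the diagonal-covariance case used in the book; all means and variances below are random placeholders rather than trained embeddings.

```python
# An illustrative sketch of the KL-based KG2E energy with diagonal covariances.
import numpy as np

rng = np.random.default_rng(0)
k = 10

mu_h, mu_t, mu_r = rng.normal(size=(3, k))
var_h, var_t, var_r = rng.uniform(0.5, 1.5, size=(3, k))   # diagonal covariance entries

# e = h - t ~ N(mu_h - mu_t, Sigma_h + Sigma_t)
mu_e, var_e = mu_h - mu_t, var_h + var_t

def kl_energy(mu_e, var_e, mu_r, var_r):
    """0.5 * [tr(S_r^-1 S_e) + (mu_r - mu_e)^T S_r^-1 (mu_r - mu_e) - log(det S_e / det S_r) - k]."""
    diff = mu_r - mu_e
    return 0.5 * (np.sum(var_e / var_r)
                  + np.sum(diff ** 2 / var_r)
                  - np.sum(np.log(var_e) - np.log(var_r))
                  - len(mu_e))

print(kl_energy(mu_e, var_e, mu_r, var_r))
```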
KG2E and TransG introduce Gaussian distributions into knowledge graph representation learning, improving flexibility and diversity with various forms of entity and relation representation. However, TransE and most of its extensions view the golden triples as near points in the low-dimensional vector space, following the translation assumption. This point assumption may lead to two problems: an ill-posed algebraic system and an over-strict geometric form. ManifoldE [80] is proposed to address this issue, considering the possible positions of the golden candidates in vector space as a manifold instead of a single point. The overall score function of ManifoldE is defined as follows:

E(h, r, t) = ‖M(h, r, t) − D_r^2‖^2,   (7.28)

in which D_r is a relation-specific manifold parameter indicating the bias. Two kinds of manifolds are proposed in ManifoldE. ManifoldE(Sphere) is a straightforward manifold that requires t to be located on the sphere whose center is h + r and whose radius is D_r. We have

M(h, r, t) = ‖h + r − t‖^2.   (7.29)

The second manifold utilized is the hyperplane, since it is much easier for two hyperplanes to intersect. The function of ManifoldE(Hyperplane) is

M(h, r, t) = (h + r_h)^T (t + r_t),   (7.30)

in which r_h and r_t represent two relation embeddings. This indicates that for a triple ⟨h, r, t⟩, the tail entity t should be located in the hyperplane whose direction is h + r_h, with the bias being D_r. Furthermore, ManifoldE(Hyperplane) can take absolute values in M(h, r, t), i.e., |h + r_h|^T |t + r_t|, to double the number of possible tail solutions. For both manifolds, the authors also apply kernel forms in a Reproducing Kernel Hilbert Space.

Translation-based methods such as TransE are simple but effective, and their power has been consistently verified on various tasks like knowledge graph completion and triple classification, achieving state-of-the-art performance. However, there are also other representation learning methods that perform well on knowledge graph representation. In this part, we will take a brief look at these methods as inspiration.
Structured Embeddings (SE) [8] is a classical representation learning method for KGs. In SE, each entity is projected to a d-dimensional vector space. SE designs two relation-specific matrices M_{r,1}, M_{r,2} ∈ R^{d×d} for each relation r, and projects both head and tail entities with these relation-specific matrices when calculating similarities. The score function of SE is defined as follows:

E(h, r, t) = ‖M_{r,1} h − M_{r,2} t‖,   (7.31)

in which both h and t are transformed into a relation-specific vector space with the projection matrices. The assumption of SE is that the projected head and tail embeddings should be as similar as possible according to the loss function. Different from the translation-based methods, SE models entities as embeddings and relations as projection matrices. In training, SE considers all triples in the training set and minimizes the overall loss function.

Semantic Matching Energy (SME) [5, 6] proposes a more complicated representation learning method. Differing from SE, SME considers both entities and relations as low-dimensional vectors. For a triple ⟨h, r, t⟩, h and r are combined using a projection function g to get a new embedding l_{h,r}, and the same is done with t and r to get l_{t,r}. Next, a point-wise multiplication function is applied to the two combined embeddings l_{h,r} and l_{t,r} to obtain the score of this triple. SME proposes two different projection functions, among which the linear form is

E(h, r, t) = (M_1 h + M_2 r + b_1)^T (M_3 t + M_4 r + b_2),   (7.32)

and the bilinear form is

E(h, r, t) = ((M_1 h) ⊙ (M_2 r) + b_1)^T ((M_3 t) ⊙ (M_4 r) + b_2),   (7.33)

where ⊙ is the element-wise (Hadamard) product, M_1, M_2, M_3, M_4 are weight matrices in the projection functions, and b_1 and b_2 are biases. Bordes et al. [6] build on SME and improve the bilinear form with three-way tensors instead of matrices.

The Latent Factor Model (LFM) is proposed for modeling large multi-relational datasets. LFM is based on a bilinear structure, which models entities as embeddings and relations as matrices. It can share sparse latent factors among different relations, significantly reducing the model and computational complexity. The score function of LFM is defined as follows:

E(h, r, t) = h^T M_r t,   (7.34)

in which M_r is the representation of the relation r. Moreover, [92] proposes the DISTMULT model, which restricts M_r to be a diagonal matrix. This enhanced model not only reduces the parameter number of LFM and thus lowers the model's computational complexity, but also achieves better performance.
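The bilinear score of Eq. (7.34) and its DistMult restriction to a diagonal relation matrix are easy to state in a few lines; in the sketch below all vectors and matrices are random placeholders.

```python
# A tiny sketch of the bilinear LFM score and its DistMult (diagonal) restriction.
import numpy as np

rng = np.random.default_rng(0)
d = 16
h, t = rng.normal(size=(2, d))

M_r = rng.normal(size=(d, d))         # full bilinear relation matrix (LFM)
r_diag = rng.normal(size=d)           # DistMult keeps only a diagonal

lfm_score = float(h @ M_r @ t)                  # E(h, r, t) = h^T M_r t
distmult_score = float(np.sum(h * r_diag * t))  # equivalent to h^T diag(r) t

print(lfm_score, distmult_score)
```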
RESCAL is a knowledge graph representation learning method based on matrix factorization [54, 55]. To represent all triple facts in a knowledge graph, the authors employ a three-way tensor X ∈ R^{d×d×k}, in which d is the number of entities and k is the number of relations. In the three-way tensor X, two modes stand for the head and tail entities while the third mode represents the relations. The entries of X are determined by the existence of the corresponding triple facts. That is, X_{ijm} = 1 if the triple ⟨i-th entity, m-th relation, j-th entity⟩ holds in the training set, and X_{ijm} = 0 otherwise. Writing X = {X_1, ..., X_k}, for each slice X_n we have the following rank-r factorization:

X_n ≈ A R_n A^T,   (7.35)

where A ∈ R^{d×r} stands for the r-dimensional entity representations, and R_n ∈ R^{r×r} represents the interactions of the r latent components for the n-th relation. The assumption behind this factorization is similar to LFM, while RESCAL also optimizes over the nonexistent triples where X_{ijm} = 0. The loss function is

L = [ Σ_n ‖X_n − A R_n A^T‖_F^2 ] + λ [ ‖A‖_F^2 + Σ_n ‖R_n‖_F^2 ],   (7.36)

in which the second term is a regularization term and λ is a hyperparameter.

RESCAL works well with multi-relational data but suffers from high computational complexity. To balance effectiveness and efficiency, Holographic Embeddings (HOLE) are proposed as an enhanced version of RESCAL [53]. HOLE employs an operation named circular correlation to generate compositional representations, which is similar to holographic models of associative memory. The circular correlation ⋆ : R^d × R^d → R^d between two entity embeddings h and t is defined as

[h ⋆ t]_k = Σ_{i=0}^{d−1} h_i t_{(k+i) mod d}.   (7.37)

Figure 7.10a demonstrates a simple instance of this operation. The probability of a triple ⟨h, r, t⟩ is then defined as

Fig. 7.10
The architecture of the RESCAL and HOLE models

P(φ_r(h, t) = 1) = Sigmoid(r^T (h ⋆ t)).   (7.38)

Circular correlation brings in several advantages: (1) unlike other operations such as multiplication or convolution, circular correlation is noncommutative (i.e., h ⋆ t ≠ t ⋆ h), which makes it capable of modeling asymmetric relations in knowledge graphs; (2) circular correlation has lower computational complexity compared to the tensor product in RESCAL. What's more, circular correlation can be further sped up with the help of the Fast Fourier Transform (FFT), which is formalized as follows:

h ⋆ t = F^{-1}( F(h)^* ⊙ F(t) ).   (7.39)
F(·) and F(·)^{-1} represent the FFT and its inverse, F(·)^* denotes the complex conjugate in C^d, and ⊙ stands for the element-wise (Hadamard) product. Thanks to the FFT, the computational complexity of circular correlation is O(d log d), which is much lower than that of the tensor product.
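The equivalence between the direct form of Eq. (7.37) and the FFT form of Eq. (7.39) can be checked numerically; the short numpy sketch below does exactly that with random vectors.

```python
# A short check that circular correlation matches its FFT formulation.
import numpy as np

rng = np.random.default_rng(0)
d = 8
h, t = rng.normal(size=(2, d))

def circular_correlation_naive(h, t):
    """[h * t]_k = sum_i h_i * t_{(k + i) mod d}  (direct O(d^2) form)."""
    d = len(h)
    return np.array([sum(h[i] * t[(k + i) % d] for i in range(d)) for k in range(d)])

def circular_correlation_fft(h, t):
    """Same operation in O(d log d): IFFT( conj(FFT(h)) * FFT(t) )."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))

print(np.allclose(circular_correlation_naive(h, t), circular_correlation_fft(h, t)))  # True
```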
ComplEx [70] employs an eigenvalue decomposition model, which makes use of complex-valued embeddings. The composition of complex embeddings can handle a large variety of binary relations, among them symmetric and antisymmetric relations. Formally, the probability that the fact ⟨h, r, t⟩ is true is modeled as

f_r(h, t) = Sigmoid(X_hrt),   (7.40)

where the target label is 1 when (h, r, t) holds and −1 otherwise. Here, X_hrt is calculated as follows:

X_hrt = Re(⟨r, h, t̄⟩)
      = ⟨Re(r), Re(h), Re(t)⟩ + ⟨Re(r), Im(h), Im(t)⟩ + ⟨Im(r), Re(h), Im(t)⟩ − ⟨Im(r), Im(h), Re(t)⟩,   (7.41)

where ⟨x, y, z⟩ = Σ_i x_i y_i z_i denotes the trilinear dot product, and Re(x) and Im(x) indicate the real part and the imaginary part of x, respectively. In fact, ComplEx can be viewed as an extension of RESCAL that assigns complex embeddings to the entities and relations. Besides, [29] has recently proved that HolE is mathematically equivalent to ComplEx.

ConvE [16] uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs. It is the first nonlinear model that significantly outperforms previous linear models. Specifically, ConvE uses convolutional and fully connected layers to model the interactions between input entities and relations. After that, the obtained features are flattened and transformed through a fully connected layer, and the inner product with all candidate object entity vectors generates a score for each triple. For each triple ⟨h, r, t⟩, ConvE defines its score function as

f_r(h, t) = f( vec( f([h̄; r̄] ∗ ω) ) W ) t,   (7.42)

where ∗ denotes the convolution operator, vec(·) means flattening a matrix into a vector, and r ∈ R^k is a relation parameter depending on r. h̄ and r̄ denote 2D reshapings of h and r, respectively: if h, r ∈ R^k, then h̄, r̄ ∈ R^{k_a × k_b}, where k = k_a k_b. ConvE can be seen as an improvement over HolE: compared with HolE, it learns multiple layers of nonlinear features and is thus theoretically more expressive.

RotatE [67] defines each relation as a rotation from the head entity to the tail entity in the complex vector space. Thus, it is able to model and infer various relation patterns, including symmetry/antisymmetry, inversion, and composition. Formally, the score function of the fact ⟨h, r, t⟩ in RotatE is defined as

f_r(h, t) = ‖h ⊙ r − t‖,   (7.43)

where ⊙ denotes the element-wise (Hadamard) product, h, r, t ∈ C^k, and |r_i| = 1.

Socher et al. [65] propose the Neural Tensor Network (NTN) as well as the Single Layer Model (SLM), of which NTN is an enhanced version. Inspired by previous attempts in KRL, SLM represents both entities and relations as low-dimensional vectors, and also designs relation-specific projection matrices to map entities from the entity space to the relation space. Similar to SE, the score function of SLM is as follows:

E(h, r, t) = r^T tanh(M_{r,1} h + M_{r,2} t),   (7.44)

where h, t ∈ R^d represent the head and tail embeddings, r ∈ R^k represents the relation embedding, and M_{r,1}, M_{r,2} stand for the relation-specific matrices.

Although SLM introduces relation embeddings as well as a nonlinear layer into the score function, its representation capability is still restricted. The Neural Tensor Network is then proposed, with tensors introduced into the SLM framework. Besides the original linear neural network layer that projects entities to the relation space, NTN adds another tensor-based neural layer which combines head and tail embeddings with a relation-specific tensor, as illustrated in Fig. 7.11. The score function of NTN is then defined as follows:

E(h, r, t) = r^T tanh( h^T M⃗_r t + M_{r,1} h + M_{r,2} t + b_r ),   (7.45)
Fig. 7.11
The architecture of the NTN model [47]

where M⃗_r ∈ R^{d×d×k} is a 3-way relation-specific tensor, b_r is the bias, and M_{r,1}, M_{r,2} are the relation-specific matrices as in SLM. Note that SLM is the simplified version of NTN obtained by setting the tensor and the bias to zero.

Besides the improvements in the score function, NTN also attempts to utilize the latent textual information located in entity names and achieves significant improvements. Differing from previous RL models that provide each entity with a single vector, NTN represents each entity as the average of its entity name's word embeddings. For example, the entity Bengal tiger will be represented as the average of the word embeddings of Bengal and tiger. It is apparent that the entity name provides valuable information for understanding an entity, since Bengal tiger may come from Bengal and be related to other tigers. Moreover, the number of words is far smaller than the number of entities. Therefore, using the average word embeddings of entity names also lowers the computational complexity and alleviates the issue of data sparsity.

NTN utilizes tensor-based neural networks to model triple facts and achieves excellent results. However, the complicated method leads to a higher computational complexity compared to other methods, and the vast number of parameters limits its performance on rather sparse and large-scale KGs.
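The following compact numpy sketch evaluates an NTN-style score in the spirit of Eq. (7.45), combining a relation-specific bilinear tensor with a linear layer; all shapes and values are illustrative placeholders.

```python
# A compact sketch of an NTN-style scoring function.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 4                          # entity dimension d, number of tensor slices k

h, t = rng.normal(size=(2, d))
r = rng.normal(size=k)
M_tensor = rng.normal(size=(k, d, d))     # one d x d slice per output dimension
M_r1, M_r2 = rng.normal(size=(2, k, d))
b_r = rng.normal(size=k)

def ntn_score(h, t):
    """E(h, r, t) = r^T tanh(h^T M_tensor t + M_r1 h + M_r2 t + b_r)."""
    bilinear = np.array([h @ M_tensor[i] @ t for i in range(k)])   # h^T M^{[i]} t per slice
    return float(r @ np.tanh(bilinear + M_r1 @ h + M_r2 @ t + b_r))

print(ntn_score(h, t))
```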
NAM [43] adopts multilayer nonlinear activations in a deep neural network to model the conditional probabilities between head and tail entities. NAM studies two model structures, the Deep Neural Network (DNN) and the Relation Modulated Neural Network (RMNN).

NAM-DNN feeds the head entity and relation embeddings into an MLP with L fully connected layers, which is formalized as follows:

z^(l) = Sigmoid(M^(l) z^(l−1) + b^(l)),   l = 1, ..., L,   (7.46)

where z^(0) = [h; r], and M^(l) and b^(l) are the weight matrix and bias vector of the l-th fully connected layer, respectively. Finally, the score function of NAM-DNN is defined as

f_r(h, t) = Sigmoid(t^T z^(L)).   (7.47)

Different from NAM-DNN, NAM-RMNN feeds the relation embedding r into each layer of the deep neural network as follows:

z^(l) = Sigmoid(M^(l) z^(l−1) + B^(l) r),   l = 1, ..., L,   (7.48)

where z^(0) = [h; r], and M^(l) and B^(l) indicate the weight matrices. The score function of NAM-RMNN is defined as

f_r(h, t) = Sigmoid(t^T z^(L) + B^(L+1) r).   (7.49)

We are living in a complicated, pluralistic real world, in which we receive information through all our senses and learn knowledge not only from structured knowledge graphs but also from plain texts, categories, images, and videos. Such cross-modal information is considered multisource information. Besides the structured knowledge graphs that are well utilized in the previous KRL methods, we will introduce other kinds of KRL methods that utilize multisource information:
1. Plain text is one of the most common forms of information we deliver, receive, and analyze every day. There are vast amounts of plain text waiting to be explored, in which significant knowledge that structured knowledge graphs may not include is located. Entity description is a special kind of textual information that describes the corresponding entity within a few sentences or a short paragraph. Usually, entity descriptions are maintained by some knowledge graphs (e.g., Freebase) or can be automatically extracted from huge databases like Wikipedia.

2. Entity type is another important kind of structured information for building knowledge representations. To learn new objects within our prior knowledge systems, human beings tend to systemize those objects into existing categories. An entity type is usually represented with hierarchical structures, which consist of entity subtypes of different granularities. It is natural that entities in the real world usually have multiple entity types. Most of the existing famous knowledge graphs have their own customized hierarchical structures of entity types.

3. Images provide intuitive visual information describing what an entity looks like, which is confirmed to be the most significant information we receive and process every day. The latent information located in images helps a lot, especially when dealing with concrete entities. For instance, we may find out the potential relationship between Cherry and Plum (both are plants belonging to Rosaceae) from their appearances. Images can be downloaded from websites, and there are also substantial image datasets like ImageNet.

Multisource information learning provides a novel way to learn knowledge representations not only from the internal information of structured knowledge graphs but also from the external information of plain texts, hierarchical types, and images. Moreover, the exploration of multisource information learning helps to further understand human cognition with all senses in the real world. The cross-modal representations learned based on knowledge graphs will also reveal possible relationships between different kinds of information.
Textual information is one of the most common and widely used kinds of information these days. Large amounts of plain text are generated on the Web every day and are easy to extract. Words are compressed symbols of our thoughts and can provide connections between entities, which is of great significance in KRL.
Wang et al. [76] attempt to utilize textual information by jointly embedding entities, relations, and words into the same low-dimensional continuous vector space. Their joint model contains three parts, namely, the knowledge model, the text model, and the alignment model. More specifically, the knowledge model is learned from the triple facts in KGs with translation-based models, while the text model is learned from the co-occurrences of words in a large corpus with Skip-gram. As for the alignment model, two methods are proposed, utilizing Wikipedia anchors and entity names, respectively. The main idea of alignment by Wikipedia anchors is to replace a word-word pair (w, v) with the word-entity pair (w, e_v) according to the anchors in Wiki pages, while the main idea of alignment by entity names is to replace the entities in an original triple ⟨h, r, t⟩ with the corresponding entity names, obtaining ⟨w_h, r, t⟩, ⟨h, r, w_t⟩, and ⟨w_h, r, w_t⟩.

Modeling entities and words in the same vector space makes it possible to encode both the information in knowledge graphs and that in plain texts, while the performance of this joint model depends on the completeness of Wikipedia anchors and may suffer from the weak interactions based merely on entity names. To address this issue, [101] proposes a new joint embedding built on [76] and improves the alignment model by taking entity descriptions into consideration, assuming that entities should be similar to all words in their descriptions. These joint models learn knowledge and text embeddings jointly, improving the evaluation performance in both word and knowledge representations.

Fig. 7.12
The architecture of DKRL model
Another way of utilizing textual information is to directly construct knowledge representations from entity descriptions instead of merely considering the alignments. Xie et al. [82] propose Description-embodied Knowledge Graph Representation Learning (DKRL), which provides two kinds of knowledge representations: the first is the structure-based representation h_S and t_S, which directly represents entities as in previous methods, and the second is the description-based representation h_D and t_D, which derives from entity descriptions. The energy function derives from the translation-based framework:

E(h, r, t) = ‖h_S + r − t_S‖ + ‖h_S + r − t_D‖ + ‖h_D + r − t_S‖ + ‖h_D + r − t_D‖.   (7.50)

The description-based representations are constructed via CBOW or CNN encoders that encode rich textual information from plain texts into knowledge representations. The architecture of DKRL is shown in Fig. 7.12.

Compared to conventional translation-based methods, the two types of entity representations in DKRL are constructed with both structural and textual information and thus achieve better performance in knowledge graph completion and type classification. Besides, DKRL can represent an entity even if it is not in the training set, as long as there are a few sentences to describe it. As millions of new entities come up every day, DKRL is capable of handling such zero-shot learning.
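A rough numpy sketch of the DKRL-style energy in Eq. (7.50) follows; the description encoder here is a simple averaged-word-embedding (CBOW-like) stand-in, and all embeddings and word ids are hypothetical.

```python
# A rough sketch combining structure-based and description-based embeddings as in DKRL.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 50, 1000
word_emb = rng.normal(size=(vocab, dim))

def encode_description(word_ids):
    """Description-based embedding as the mean of its word embeddings (CBOW-style)."""
    return word_emb[word_ids].mean(axis=0)

h_S, t_S, r = rng.normal(size=(3, dim))            # structure-based embeddings
h_D = encode_description([3, 17, 256])             # hypothetical description word ids
t_D = encode_description([42, 7])

def dkrl_energy(h_S, h_D, r, t_S, t_D):
    """Sum the four translation energies over all structure/description combinations."""
    return sum(np.linalg.norm(h + r - t)
               for h in (h_S, h_D) for t in (t_S, t_D))

print(dkrl_energy(h_S, h_D, r, t_S, t_D))
```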
Entity types, which serve as category information for entities and are usually arranged in hierarchical structures, can provide structured information to better understand entities in KRL.
Krompaß et al. [36] take type information as type constraints and improve existing methods like RESCAL and TransE via these constraints. It is intuitive that for a particular relation, the head or tail entities should belong to specific types. For example, the head entities of the relation write_books should be humans (or, more precisely, authors), and the tail entities should be books.

Specifically, in RESCAL, the original factorization X_r ≈ A R_r A^T is modified to

X'_r ≈ A_{[head_r, :]} R_r A^T_{[tail_r, :]},   (7.51)

in which head_r and tail_r are the sets of entities fitting the type constraints of the head or the tail, and X'_r is a sparse adjacency matrix of shape |head_r| × |tail_r|. In the enhanced version, only the entities that fit the type constraints are considered during factorization.

In TransE, type constraints are utilized in negative sampling. The margin-based loss functions of translation-based methods need negative instances, which are generated by randomly replacing head or tail entities with other entities in triples. With type constraints, the negative samples are chosen such that

h' ∈ E_{[head_r]} ⊆ E,   t' ∈ E_{[tail_r]} ⊆ E,   (7.52)

where E_{[head_r]} is the subset of entities following the type constraints for the head of relation r, and E_{[tail_r]} is that for the tail.

Considering type information as constraints is simple but effective, while the performance is still limited. Instead of merely viewing type information as type constraints, Xie et al. [83] propose Type-embodied Knowledge Graph Representation Learning (TKRL), utilizing hierarchical-type structures to guide the construction of projection matrices. Inspired by TransR, in which every entity has multiple representations in different scenarios, the energy function of TKRL is defined as follows:

E(h, r, t) = ‖M_rh h + r − M_rt t‖,   (7.53)

Fig. 7.13
The architecture of the TKRL model

in which M_rh and M_rt are two projection matrices for h and t that depend on their corresponding hierarchical types in this triple. Two hierarchical-type encoders are proposed to learn the projection matrices, regarding all subtypes in the hierarchy as projection matrices: the Recursive Hierarchy Encoder (RHE) is based on matrix multiplication, while the Weighted Hierarchy Encoder (WHE) is based on matrix summation:

M^{RHE}_c = Π_{i=1}^{m} M_{c^(i)} = M_{c^(1)} M_{c^(2)} ... M_{c^(m)},   (7.54)

M^{WHE}_c = Σ_{i=1}^{m} β_i M_{c^(i)} = β_1 M_{c^(1)} + ... + β_m M_{c^(m)},   (7.55)

where M_{c^(i)} stands for the projection matrix of the i-th subtype of the hierarchical type c, and β_i is the corresponding weight of that subtype. Figure 7.13 gives a simple illustration of TKRL. Taking RHE as an example, given the entity William Shakespeare, it is first projected to a rather general subtype space like human, and then sequentially projected to more precise subtypes like author or English author. Moreover, TKRL also proposes an enhanced soft-type constraint to alleviate the problems caused by the incompleteness of type information.
Fig. 7.14
Examples of entity images [81]
Images can provide intuitive visual information about their corresponding entities' appearance, which may give significant hints about some latent attributes of entities. For instance, Fig. 7.14 shows example images for the entities Suit of armour and Armet. The left side corresponds to the triple fact ⟨Suit of armour, has_a_part, Armet⟩, and surprisingly, we can infer this knowledge directly from the images.
Xie et al. [81] propose Image-embodied Knowledge Graph Representation Learning(IKRL) to take visual information into consideration when constructing knowledgerepresentations. Inspired by the multiple entity representations in [82], IKRL alsoproposes the image-based representation h I and t I besides the original structure-based representation, and jointly learn both two types of entity representations simul-taneously within the translation-based framework. E ( h , r , t ) = (cid:6) h S + r − t S (cid:6) + (cid:6) h S + r − t I (cid:6) + (cid:6) h I + r − t S (cid:6) + (cid:6) h I + r − t I (cid:6) . (7.56)More specifically, IKRL first constructs the image representations for all entityimages with neural networks, and then project these image representations fromimage space to entity space via a projection matrix. Since most entities may havemultiple images with different qualities, IKRL selects the more informative anddiscriminative images via an attention-based method. The evaluation results ofIKRL not only confirm the significance of visual information in understanding .3 Multisource Knowledge Graph Representation 195 -- ≈ dresser drawer pianoforte keyboard - ≈ - cat (Felidae) tiger toothed whale dolphin w(part_of) w(hypernym) ≈≈ Fig. 7.15
An example of semantic regularities in word space [81] entities but also show the possibility of a joint heterogeneous semantic space.Moreover, the authors also find some interesting semantic regularities such as w ( man ) − w ( king ) ≈ w ( woman ) − w ( queen ) found in word space, which areshown in Fig. 7.15. Typical knowledge graphs store knowledge in the form of triple facts with one relationlinking two entities. Most existing KRL methods only consider the information withintriple facts separately, ignoring the possible interactions and correlations between dif-ferent triples. Logic rules, which are certain kinds of summaries deriving from humanbeings’ prior knowledge, could help us with knowledge inference and reasoning. Forinstance, if we know the triple fact that (cid:2)
Typical knowledge graphs store knowledge in the form of triple facts, with one relation linking two entities. Most existing KRL methods only consider the information within triple facts separately, ignoring the possible interactions and correlations between different triples. Logic rules, which are certain kinds of summaries derived from human prior knowledge, can help with knowledge inference and reasoning. For instance, if we know the triple fact ⟨Beijing, is_capital_of, China⟩, we can easily infer with high confidence that ⟨Beijing, located_in, China⟩, since we know the logic rule is_capital_of ⇒ located_in. Some works focus on introducing logic rules to knowledge acquisition and inference, among which Markov Logic Networks are intuitively utilized to address this challenge [3, 58, 75]. The path-based TransE [38] stated above also implicitly considers the latent logic rules between different relations via relation paths.

KALE is a translation-based KRL method that jointly learns knowledge representations with logic rules [24]. The joint learning consists of two parts, namely triple modeling and rule modeling. For triple modeling, KALE follows the translation assumption with a minor alteration in the scoring function:

$$E(h, r, t) = 1 - \frac{1}{3\sqrt{d}} \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_1, \qquad (7.57)$$

in which d stands for the dimension of the knowledge embeddings. E(h, r, t) takes values in [0, 1] for the convenience of joint learning.

For the newly added rule modeling, KALE uses the t-norm fuzzy logics proposed in [25], which represent the truth value of a complex formula by the truth values of its constituents. Specifically, KALE focuses on two typical types of logic rules. The first is ∀h, t: ⟨h, r1, t⟩ ⇒ ⟨h, r2, t⟩ (e.g., given ⟨Beijing, is_capital_of, China⟩, we can infer ⟨Beijing, located_in, China⟩). KALE represents the scoring function of such a logic rule f via specific t-norm based logical connectives as follows:

$$E(f) = E(h, r_1, t) \cdot E(h, r_2, t) - E(h, r_1, t) + 1. \qquad (7.58)$$

The second is ∀h, e, t: ⟨h, r1, e⟩ ∧ ⟨e, r2, t⟩ ⇒ ⟨h, r3, t⟩ (e.g., given ⟨Tsinghua, located_in, Beijing⟩ and ⟨Beijing, located_in, China⟩, we can infer ⟨Tsinghua, located_in, China⟩). KALE defines the second scoring function as

$$E(f) = E(h, r_1, e) \cdot E(e, r_2, t) \cdot E(h, r_3, t) - E(h, r_1, e) \cdot E(e, r_2, t) + 1. \qquad (7.59)$$

The joint training contains all positive formulae, including triple facts as well as logic rules. Note that, to account for the quality of logic rules, KALE ranks all candidate logic rules by their truth values computed with pretrained TransE embeddings and manually filters the rules ranked at the top.
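The t-norm truth values used by KALE can be computed directly from the embeddings. The sketch below follows Eqs. 7.57-7.59 and assumes L1-norm translation energies with embeddings constrained so that the triple truth value stays in [0, 1]; it is an illustration, not the released KALE code.

```python
import numpy as np

def triple_truth(h, r, t):
    """Truth value of a triple (Eq. 7.57), assuming embeddings bounded so the result lies in [0, 1]."""
    d = h.shape[0]
    return 1.0 - np.linalg.norm(h + r - t, ord=1) / (3.0 * np.sqrt(d))

def implication_truth(i_premise, i_conclusion):
    """t-norm truth of f1 => f2 (Eq. 7.58): I1 * I2 - I1 + 1."""
    return i_premise * i_conclusion - i_premise + 1.0

def chain_rule_truth(i_f1, i_f2, i_f3):
    """t-norm truth of (f1 AND f2) => f3 (Eq. 7.59)."""
    return i_f1 * i_f2 * i_f3 - i_f1 * i_f2 + 1.0
```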
Recent years have witnessed the great thriving of knowledge-driven artificial intelligence, such as QA systems and chatbots. AI agents are expected to accurately and deeply understand user demands, and then appropriately and flexibly give responses and solutions. Such work cannot be done without certain forms of knowledge. To introduce knowledge to AI agents, researchers first extract knowledge from heterogeneous information like plain texts, images, and structured knowledge bases. These various kinds of heterogeneous information are then fused and stored in certain structures like knowledge graphs. Next, the knowledge is projected into a low-dimensional semantic space following some KRL method. Finally, the learned knowledge representations are utilized in various knowledge applications like information retrieval and dialogue systems. Figure 7.16 demonstrates a brief pipeline of knowledge-driven applications from scratch.

Fig. 7.16 An illustration of knowledge-driven applications

From the illustration, we can observe that knowledge graph representation learning is the critical component in the whole pipeline of knowledge-driven applications: it bridges the gap between the knowledge graphs that store knowledge and the applications that use knowledge. Knowledge representations with distributed methods, compared to those with symbolic methods, are able to alleviate data sparsity and to model the similarities between entities and relations. Moreover, embedding-based methods are convenient to use with deep learning methods and naturally fit the combination with heterogeneous information.

In this section, we will introduce possible applications of knowledge representations mainly from two aspects. First, we will introduce the usage of knowledge representations in knowledge-driven applications, and then we will show the power of knowledge representations for knowledge extraction and construction.
Knowledge graph construction aims to build structured knowledge bases by extracting knowledge from heterogeneous sources such as plain texts, existing knowledge bases, and images. Knowledge construction consists of several subtasks like relation extraction and information extraction, and it is the fundamental step in the whole knowledge-driven framework. Recently, automatic knowledge construction has attracted considerable attention, since it is incredibly time consuming and labor intensive to deal with the enormous amount of existing and new information. In the following section, we will introduce some explorations on neural relation extraction, concentrating on the combination with knowledge representations.
Relation extraction focuses on predicting the correct relation between two entities given a short plain text containing the two entities. Generally, all relations to predict are predefined, which is different from open information extraction. Entities are usually marked by named entity recognition systems, extracted according to anchor texts, or automatically annotated via distant supervision [50].
Fig. 7.17 The architecture of the joint representation learning framework for knowledge acquisition
Conventional methods for relation extraction and classification are mainly based on statistical machine learning, which strongly depends on the quality of the extracted features. Zeng et al. [96] first introduce CNNs to relation classification and achieve great improvements. Lin et al. [40] further improve neural relation extraction models with attention-based models over instances.

Han et al. [27, 28] propose a novel joint representation learning framework for knowledge acquisition. The key idea is that the joint model learns knowledge and text representations within a unified semantic space via KG-text alignments. Figure 7.17 shows the brief framework of the KG-text joint model. For the text part, the sentence containing the two entities Mark Twain and Florida is taken as the input of a CNN encoder, and the output of the CNN is considered to be the latent relation place_of_birth expressed by this sentence. For the KG part, entity and relation representations are learned via translation-based methods. The learned representations of the KG and text parts are aligned during training. This work is the first attempt to encode knowledge representations from existing KGs into knowledge construction tasks, and it achieves improvements in both knowledge completion and relation extraction.
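As a rough illustration of the text side of such models, here is a minimal CNN sentence encoder for relation classification in the spirit of [96]; the position features, the attention over instances of [40], and the KG-text alignment loss of the joint framework [27, 28] are omitted, and all dimensions and the relation count are hypothetical.

```python
import torch
import torch.nn as nn

class CNNRelationEncoder(nn.Module):
    """A minimal CNN sentence encoder producing relation logits."""
    def __init__(self, vocab_size, emb_dim=50, hidden=230, num_relations=53):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hidden, num_relations)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))               # (batch, hidden, seq_len)
        s = h.max(dim=2).values                    # max pooling over positions
        return self.classifier(s)                  # relation logits, e.g. place_of_birth

# In the joint framework, the pooled sentence vector s could additionally be
# aligned with the relation embedding r learned on the KG side, so that h + r ≈ t
# holds for the corresponding entities; that alignment term is not shown here.
```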
Fig. 7.18 The architecture of the KNET model
Entity typing is the task of detecting semantic types for a named entity (or entity mention) in plain text. For example, given the sentence "Jordan played 15 seasons in the NBA", entity typing aims to infer that Jordan in this sentence is a person, an athlete, and even a basketball player. Entity typing is important for named entity disambiguation since it can narrow down the range of candidates for an entity mention [10]. Moreover, entity typing also benefits many Natural Language Processing (NLP) tasks such as relation extraction [46], question answering [90], and knowledge base population [9].

Conventional named entity recognition models [69, 73] typically classify entity mentions into a small set of coarse labels (e.g., person, organization, location, and others). Since these entity types are too coarse-grained for many NLP tasks, a number of works [15, 41, 94, 95] have been proposed to introduce a much larger set of fine-grained types, which are typically subtypes of those coarse-grained types. Previous fine-grained entity typing methods usually derive features using NLP tools such as POS tagging and parsing, and thus inevitably suffer from error propagation. Dong et al. [18] make the first attempt to explore deep learning in entity typing; their method only employs word vectors as features, discarding complicated feature engineering. Shimaoka et al. [63] further introduce an attention scheme into neural models for fine-grained entity typing.

Neural models have achieved state-of-the-art performance for fine-grained entity typing. However, these methods face the following nontrivial challenges:

(1) Entity-Context Separation. Existing methods typically encode context words without utilizing the crucial correlations between entity and context. However, it is intuitive that the importance of words in the context for entity typing is significantly influenced by which entity mention we are concerned with. For example, in the sentence "In 1975, Gates and Paul Allen co-founded Microsoft, which became the world's largest PC software company", the word company is much more important for determining the type of Microsoft than for the type of Gates.

(2) Entity-Knowledge Separation. Existing methods only consider the text information of entity mentions for entity typing. In fact, Knowledge Graphs (KGs) provide rich and effective additional information for determining entity types. For example, in the above sentence "In 1975, Gates ... Microsoft ... company", even if we have no type information of Microsoft in the KG, entities similar to Microsoft (such as IBM) will also provide supplementary information.

In order to address the issues of entity-context separation and entity-knowledge separation, we propose Knowledge-guided Attention Neural Entity Typing (KNET). As illustrated in Fig. 7.18, KNET mainly consists of two parts. First, KNET builds a neural network, including a Long Short-Term Memory (LSTM) network and a fully connected layer, to generate context and named entity representations. Second, KNET introduces knowledge attention to emphasize the critical words and improve the quality of context representations. Here we introduce the knowledge attention in detail.

Knowledge graphs provide rich information about entities in the form of triples ⟨h, r, t⟩, where h and t are entities and r is the relation between them. Many KRL works have been devoted to encoding entities and relations into a real-valued semantic vector space based on the triple information in KGs. KRL provides us with an efficient way to exploit KG information for entity typing. KNET employs the most widely used KRL method, TransE, to obtain an entity embedding e for each entity e. In the training scenario, it is known that the entity mention m corresponds to entity e in the KG with embedding e; hence, KNET can directly compute the knowledge attention as follows:

$$\alpha_i^{KA} = f\left(\mathbf{e}\, \mathbf{W}^{KA} \begin{bmatrix} \overrightarrow{\mathbf{h}}_i \\ \overleftarrow{\mathbf{h}}_i \end{bmatrix}\right), \qquad (7.60)$$

where W^{KA} is a bilinear parameter matrix and alpha_i^{KA} is the attention weight for the i-th word.
Knowledge Attention in Testing. The challenge is that, in the testing scenario, we do not know which entity in the KG a given entity mention corresponds to. One solution is to perform entity linking, but it introduces linking errors. Besides, in many cases, KGs may not contain the corresponding entities for many entity mentions. To address this challenge, we build an additional text-based representation for entities in KGs during training. Concretely, for an entity e and its context sentence s, we encode its left and right context into c_l and c_r using a one-directional LSTM, and further learn the text-based representation ê as follows:

$$\hat{\mathbf{e}} = \tanh\left(\mathbf{W}\begin{bmatrix}\mathbf{m}\\ \mathbf{c}_l \\ \mathbf{c}_r\end{bmatrix}\right), \qquad (7.61)$$

where W is a parameter matrix and m is the mention representation. Note that the LSTM used here is different from the one used for context representation, in order to prevent interference. To bridge the text-based and KG-based representations, in the training scenario we simultaneously learn ê by adding a component to the objective function:

$$O_{KG}(\theta) = -\sum_{e} \|\mathbf{e} - \hat{\mathbf{e}}\|. \qquad (7.62)$$

In this way, in the testing scenario, we can directly use Eq. 7.61 to obtain the corresponding entity representation and compute the knowledge attention using Eq. 7.60.

The emergence of large-scale knowledge graphs has motivated the development of entity-oriented search, which utilizes knowledge graphs to improve search engines. Recent progress in entity-oriented search includes better text representations with entity annotations [61, 85], richer ranking features [14], entity-based connections between queries and documents [45, 84], and soft matching of queries and documents through knowledge graph relations or embeddings [19, 88]. These approaches bring in entities and semantics from knowledge graphs and have greatly improved the effectiveness of feature-based search systems.

Another frontier of information retrieval is the development of neural ranking models (neural-IR). Deep learning techniques have been used to learn distributed representations of queries and documents that capture their relevance relations (representation-based) [62], or to model the query-document relevance directly from their word-level interactions (interaction-based) [13, 23, 87]. Neural-IR approaches, especially the interaction-based ones, have greatly improved ranking accuracy when large-scale training data are available [13].

Entity-oriented search and neural-IR push the boundary of search engines from two different aspects. Entity-oriented search incorporates human knowledge from entities and knowledge graph semantics, and has shown promising results in feature-based ranking systems. On the other hand, neural-IR leverages distributed representations and neural networks to learn more sophisticated ranking models from large-scale training data. The Entity-Duet Neural Ranking Model (EDRM), as shown in Fig. 7.19, incorporates entities into interaction-based neural ranking models. EDRM first learns distributed representations of entities using their semantics from knowledge graphs: descriptions and types. Then it follows a recent state-of-the-art entity-oriented search framework, the word-entity duet [86], and matches documents to queries with both bag-of-words and bag-of-entities. Instead of manual features, EDRM uses interaction-based neural models [13] to match the query and documents with word-entity duet representations. As a result, EDRM combines entity-oriented search and interaction-based neural-IR: it brings knowledge graph semantics to neural-IR and enhances entity-oriented search with neural networks.

Fig. 7.19 The architecture of the EDRM model
Given a query q and a document d, interaction-based models first build a word-level translation matrix between q and d. The translation matrix describes word-pair similarities using word correlations, which are captured by word embedding similarities in interaction-based models. Typically, interaction-based ranking models first map each word w in q and d to an L-dimensional embedding v_w:

$$\mathbf{v}_w = \text{Emb}_w(w). \qquad (7.63)$$

They then construct the interaction matrix M based on the query and document embeddings. Each element M_ij in the matrix compares the i-th word in q and the j-th word in d, e.g., using the cosine similarity of word embeddings:

$$M_{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{w_j^d}). \qquad (7.64)$$

With the translation matrix describing the term-level matches between the query and the document, the next step is to calculate the final ranking score from the matrix. Many approaches have been developed in interaction-based neural ranking models, but in general they include a feature extractor on M and then one or several ranking layers that combine the features into the ranking score.

EDRM incorporates the semantic information about an entity from the knowledge graph into its representation. The representation includes three embeddings: entity embedding, description embedding, and type embedding, all in L dimensions, which are combined to generate the semantic representation of the entity.

Entity Embedding uses an L-dimensional embedding layer Emb_e to obtain the entity embedding v_e^{emb} for e:

$$\mathbf{v}_e^{emb} = \text{Emb}_e(e). \qquad (7.65)$$

Description Embedding encodes an entity description that contains m words and explains the entity. EDRM first employs a word embedding layer Emb_v to embed each description word into its vector. It then combines all embeddings in the text into an embedding matrix V. Next, it leverages convolutional filters that slide over the text and compose length-h n-grams as g_e^j:

$$\mathbf{g}_e^j = \text{ReLU}(\mathbf{W}_{CNN} \cdot \mathbf{V}_{j:j+h} + \mathbf{b}_{CNN}), \qquad (7.66)$$

where W_CNN and b_CNN are the two parameters of the convolutional filter. Max pooling is then applied after the convolution layer to generate the description embedding v_e^{des}:

$$\mathbf{v}_e^{des} = \max(\mathbf{g}_e^1, \ldots, \mathbf{g}_e^j, \ldots, \mathbf{g}_e^m). \qquad (7.67)$$

Type Embedding encodes the categories of entities. Each entity e has n types F_e = {f_1, ..., f_j, ..., f_n}. EDRM first obtains the embedding v_{f_j} of type f_j through a type embedding layer Emb_type:

$$\mathbf{v}_{f_j} = \text{Emb}_{type}(f_j). \qquad (7.68)$$

Then EDRM utilizes an attention mechanism to combine the entity types into the type embedding v_e^{type}:

$$\mathbf{v}_e^{type} = \sum_{j=1}^{n} \alpha_j \mathbf{v}_{f_j}, \qquad (7.69)$$

where alpha_j is the attention score, calculated as

$$\alpha_j = \frac{\exp(y_j)}{\sum_{l=1}^{n} \exp(y_l)}, \qquad (7.70)$$

$$y_j = \left(\sum_i \mathbf{W}_{bow} \mathbf{v}_{t_i}\right) \cdot \mathbf{v}_{f_j}, \qquad (7.71)$$

where y_j is the dot product of the query or document representation and the type embedding v_{f_j}. Bag-of-words is used for the query or document encoding, and W_bow is a parameter matrix.
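A minimal sketch of the description and type components of the enriched entity embedding (Eqs. 7.66-7.71): the convolution, pooling, and type attention follow the equations above, while the dimensions, the bag-of-words construction, and the single convolutional filter size are simplifying assumptions of ours.

```python
import torch
import torch.nn as nn

def description_embedding(word_vecs, conv):
    """CNN + max pooling over an entity description (Eqs. 7.66-7.67)."""
    x = word_vecs.t().unsqueeze(0)             # (1, emb_dim, m)
    g = torch.relu(conv(x))                    # n-gram features, (1, hidden, m)
    return g.max(dim=2).values.squeeze(0)      # v_e^des

def type_embedding(type_vecs, bow_vecs, W_bow):
    """Attention over an entity's type embeddings (Eqs. 7.69-7.71): the bag-of-words
    query/document vector, transformed by W_bow, scores each type."""
    query_vec = (bow_vecs @ W_bow.t()).sum(dim=0)   # sum_i W_bow v_{t_i}
    scores = type_vecs @ query_vec                  # y_j for each type
    alpha = torch.softmax(scores, dim=0)
    return alpha @ type_vecs                        # v_e^type

# Toy usage with hypothetical sizes: 20-word description, 3 types, 8 query words.
emb_dim, hidden = 50, 64
conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
v_des = description_embedding(torch.randn(20, emb_dim), conv)
v_type = type_embedding(torch.randn(3, hidden), torch.randn(8, emb_dim),
                        torch.randn(hidden, emb_dim))
```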
Combination. The three embeddings are combined by a linear layer to generate the semantic representation of the entity:

$$\mathbf{v}_e^{sem} = \mathbf{v}_e^{emb} + \mathbf{W}_e [\mathbf{v}_e^{des}; \mathbf{v}_e^{type}] + \mathbf{b}_e, \qquad (7.72)$$

in which W_e is an L × 2L matrix and b_e is an L-dimensional vector.

The word-entity duet [86] is a recently developed framework in entity-oriented search. It utilizes the duet representation of bag-of-words and bag-of-entities to match query q and document d with handcrafted features; this work introduces it to neural-IR. It first constructs bag-of-entities q^e and d^e with entity annotations, as well as bag-of-words q^w and d^w, for q and d. The duet utilizes a four-way interaction: query words to document words (q^w-d^w), query words to document entities (q^w-d^e), query entities to document words (q^e-d^w), and query entities to document entities (q^e-d^e). Instead of features, EDRM uses a translation layer that calculates the similarity between a pair of query-document terms: (v_{w_i^q} or v_{e_i^q}) and (v_{w_j^d} or v_{e_j^d}). It constructs the interaction matrices M = {M_ww, M_we, M_ew, M_ee}, where M_ww, M_we, M_ew, and M_ee denote the interactions of q^w-d^w, q^w-d^e, q^e-d^w, and q^e-d^e, respectively, and the elements in them are the cosine similarities of the corresponding terms:

$$M_{ww}^{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{w_j^d}); \quad M_{ee}^{ij} = \cos(\mathbf{v}_{e_i^q}, \mathbf{v}_{e_j^d}); \quad M_{ew}^{ij} = \cos(\mathbf{v}_{e_i^q}, \mathbf{v}_{w_j^d}); \quad M_{we}^{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{e_j^d}). \qquad (7.73)$$

The final ranking feature Φ(M) is a concatenation of the four cross matches φ(M):

$$\Phi(M) = [\phi(M_{ww}); \phi(M_{we}); \phi(M_{ew}); \phi(M_{ee})], \qquad (7.74)$$

where φ can be any feature function used in interaction-based neural ranking models. The entity duet presents an effective way to cross-match queries and documents in the entity and word spaces; in EDRM, it introduces knowledge graph semantic representations into neural-IR models.

The duet translation matrices provided by EDRM can be plugged into any standard interaction-based neural ranking model such as K-NRM [87] and Conv-KNRM [13]. With sufficient training data, the whole model is optimized end-to-end with back-propagation. During this process, the integration of the knowledge graph semantics (entity embeddings, description embeddings, type embeddings, and matching with entities) is learned jointly with the ranking neural network.
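The four-way duet interactions of Eq. 7.73 reduce to cosine-similarity matrices between sequences of term vectors. Below is a sketch under the assumption that entity annotations are already available as vectors; the feature function phi (e.g., kernel pooling in K-NRM/Conv-KNRM) is left abstract.

```python
import torch
import torch.nn.functional as F

def cross_match(q_vecs, d_vecs):
    """Cosine-similarity interaction matrix between two term-vector sequences (Eq. 7.73)."""
    q = F.normalize(q_vecs, dim=-1)
    d = F.normalize(d_vecs, dim=-1)
    return q @ d.t()

def duet_interactions(q_words, q_ents, d_words, d_ents):
    """The four word/entity duet matrices M_ww, M_we, M_ew, M_ee; some feature
    extractor phi would be applied to each and the results concatenated (Eq. 7.74)."""
    return {
        "ww": cross_match(q_words, d_words),
        "we": cross_match(q_words, d_ents),
        "ew": cross_match(q_ents, d_words),
        "ee": cross_match(q_ents, d_ents),
    }

# Toy usage: a 3-word / 1-entity query against a 10-word / 2-entity document, L = 32.
M = duet_interactions(torch.randn(3, 32), torch.randn(1, 32),
                      torch.randn(10, 32), torch.randn(2, 32))
print({k: v.shape for k, v in M.items()})
```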
Knowledge is important external information for language modeling, because statistical co-occurrences alone cannot instruct the generation of all kinds of knowledge, especially for named entities with low frequencies. Researchers therefore try to incorporate external knowledge into language models for better performance on generation and representation.

Language models aim to learn the probability distribution over sequences of words, which is a classical and essential NLP task that has been widely studied. Recently, sequence-to-sequence neural models (seq2seq) have been blooming and are widely utilized in sequential generative tasks like machine translation [68] and image caption generation [72]. However, most seq2seq models have significant limitations when modeling and using background knowledge.

To address this problem, Ahn et al. [1] propose a Neural Knowledge Language Model (NKLM) that considers knowledge provided by knowledge graphs when generating natural language sequences with RNN language models. The key idea is that NKLM has two ways to generate a word: the first is the same as in conventional seq2seq models, generating a "vocabulary word" according to the softmax probabilities, and the second is to generate a "knowledge word" according to the external knowledge graph.

Specifically, NKLM takes an LSTM as the framework for generating "vocabulary words". For the external knowledge graph information, NKLM denotes the topic knowledge as K = {a_1, ..., a_{|K|}}, in which a_i represents a fact (named a "topic fact" in [1]) that appears in the same triples as a certain entity. At each step t, NKLM takes the "vocabulary word" w^v_{t-1}, the "knowledge word" w^o_{t-1}, and the fact a_{t-1} predicted at step t-1 as inputs, and produces the hidden state h_t. The hidden state h_t is combined with the knowledge context e^k to obtain the fact key k_t via an MLP module, where the knowledge context e^k derives from the mean embeddings of all related facts of the topic. The fact key k_t is then used to extract the most appropriate fact a_t from the corresponding topic knowledge. Finally, the selected fact a_t is combined with the hidden state h_t to predict (1) both the "vocabulary word" w^v_t and the "knowledge word" w^o_t, and (2) which of the two words to emit at this step. The architecture of NKLM is shown in Fig. 7.20.

The NKLM model explores a novel way of combining the symbolic knowledge information in external knowledge graphs with seq2seq language models. However, the topic of knowledge is given when generating natural language, which makes NKLM less practical and scalable for more general free talks. Nevertheless, we still believe that it is promising to encode knowledge into language models with such methods.
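A much-simplified sketch of NKLM's generation step is given below: a fact key produced from the hidden state attends over the topic facts, and a gate decides between emitting a vocabulary word and a knowledge word copied from the selected fact. The module names, the soft fact selection, and the dimensions are our own assumptions and do not reproduce the original implementation.

```python
import torch
import torch.nn as nn

class KnowledgeGate(nn.Module):
    """Toy NKLM-style head on top of an LSTM hidden state."""
    def __init__(self, hidden, fact_dim, vocab_size):
        super().__init__()
        self.fact_key = nn.Linear(hidden, fact_dim)          # MLP producing the fact key k_t
        self.vocab_out = nn.Linear(hidden + fact_dim, vocab_size)
        self.source_gate = nn.Linear(hidden + fact_dim, 2)   # vocabulary word vs. knowledge word

    def forward(self, h_t, fact_embs):                       # fact_embs: (num_facts, fact_dim)
        k_t = self.fact_key(h_t)                             # fact key from the hidden state
        att = torch.softmax(fact_embs @ k_t, dim=0)          # match topic facts against the key
        a_t = att @ fact_embs                                # selected (soft) fact a_t
        state = torch.cat([h_t, a_t], dim=-1)
        vocab_logits = self.vocab_out(state)                 # "vocabulary word" distribution
        source_logits = self.source_gate(state)              # which source to generate from
        return vocab_logits, source_logits, att

gate = KnowledgeGate(hidden=128, fact_dim=64, vocab_size=1000)
vocab_logits, source_logits, att = gate(torch.randn(128), torch.randn(5, 64))
```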
Fig. 7.20 The architecture of the NKLM model

Pretrained language models like BERT [17] have a strong ability to represent language information from text. With rich language representations, pretrained models obtain state-of-the-art results on various NLP applications. However, existing pretrained language models rarely consider incorporating external knowledge to provide related background information for better language understanding. For example, given the sentence "Bob Dylan wrote Blowin' in the Wind and Chronicles: Volume One", without knowing that Blowin' in the Wind and Chronicles: Volume One are a song and a book, respectively, it is difficult to recognize the two occupations of Bob Dylan, i.e., songwriter and writer.

To enhance language representation models with external knowledge, Zhang et al. [100] propose an enhanced language representation model with informative entities (ERNIE). Knowledge Graphs (KGs) are important external knowledge resources, and the authors regard informative entities in KGs as a bridge to enhance language representation with knowledge. ERNIE considers two main challenges for incorporating external knowledge: Structured Knowledge Encoding and Heterogeneous Information Fusion.

For extracting and encoding knowledge information, ERNIE first recognizes named entity mentions in text and then aligns these mentions to their corresponding entities in KGs. Instead of directly using the graph-based facts in KGs, ERNIE encodes the graph structure of KGs with knowledge embedding algorithms like TransE [7], and then takes the informative entity embeddings as input. Based on the alignments between text and KGs, ERNIE integrates entity representations from the knowledge module into the underlying layers of the semantic module.
Fig. 7.21 The architecture of the ERNIE model: (a) model architecture; (b) aggregator
Similar to BERT, ERNIE adopts the masked language model and next sentence prediction as pretraining objectives. Besides, for a better fusion of textual and knowledge features, ERNIE uses a new pretraining objective (a denoising entity auto-encoder) that randomly masks some of the named entity alignments in the input text and trains the model to select appropriate entities from KGs to complete the alignments. Unlike existing pretrained language representation models that only utilize local context to predict tokens, these objectives require ERNIE to aggregate both context and knowledge facts for predicting both tokens and entities, leading to a knowledgeable language representation model.

Figure 7.21 shows the overall architecture. The left part shows that ERNIE consists of two encoders (T-Encoder and K-Encoder), where the T-Encoder is stacked from several classical transformer layers and the K-Encoder is stacked from the new aggregator layers designed for knowledge integration. The right part details the aggregator layer. In the aggregator layer, the input token embeddings and entity embeddings from the preceding aggregator are fed into two multi-head self-attention modules, respectively. Then, the aggregator adopts an information fusion layer for the mutual integration of the token and entity sequences and computes the output embedding for each token and entity.

ERNIE explores how to incorporate knowledge information into language representation models. The experimental results demonstrate that ERNIE has more powerful abilities than BERT in both denoising distantly supervised data and fine-tuning on limited data.
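One aggregator layer of the K-Encoder can be sketched as below, assuming the entity sequence has already been aligned and expanded to the token positions. Residual connections, layer normalization, and ERNIE's exact token-entity alignment handling are omitted, so this is an illustration of the idea rather than the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aggregator(nn.Module):
    """A minimal aggregator layer: separate self-attention over token and entity
    sequences, then a shared information-fusion layer producing new token and
    entity embeddings."""
    def __init__(self, tok_dim=768, ent_dim=100, heads=4):
        super().__init__()
        self.tok_att = nn.MultiheadAttention(tok_dim, heads, batch_first=True)
        self.ent_att = nn.MultiheadAttention(ent_dim, heads, batch_first=True)
        self.fuse = nn.Linear(tok_dim + ent_dim, tok_dim + ent_dim)
        self.tok_out = nn.Linear(tok_dim + ent_dim, tok_dim)
        self.ent_out = nn.Linear(tok_dim + ent_dim, ent_dim)

    def forward(self, tokens, entities):
        # tokens: (batch, n, tok_dim); entities: (batch, n, ent_dim), aligned per token.
        t, _ = self.tok_att(tokens, tokens, tokens)
        e, _ = self.ent_att(entities, entities, entities)
        fused = F.gelu(self.fuse(torch.cat([t, e], dim=-1)))   # information fusion
        return self.tok_out(fused), self.ent_out(fused)

layer = Aggregator()
tok_out, ent_out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 100))
```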
Pretrained language models can perform many tasks without supervised training data, such as reading comprehension, summarization, and translation [60]. However, traditional language models are unable to efficiently model entity names observed in text. To solve this problem, Liu et al. [42] propose a new language model architecture, called the Knowledge-Augmented Language Model (KALM), which uses the entity types of words for better language modeling.

KALM is a language model with the option to generate words from a set of entities in a knowledge database. An individual word can either come from a general word dictionary, as in a traditional language model, or be generated as the name of an entity from the knowledge database. The training objective only supervises the output word and leaves the decision of the word source latent. Entities in the knowledge database are partitioned by type, and the database is used to build the type sets of words. According to the context observed so far, the model decides whether the next word is a general term or a named entity of a given type. Thus, KALM learns to predict whether the observed context is indicative of a named entity and which tokens are likely to be entities of a given type.

Through language modeling, KALM learns a named entity recognizer without any explicit supervision, using only plain text and the potential types of words, and it achieves performance comparable with state-of-the-art supervised methods.
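A toy numerical illustration of this idea (our own, with made-up numbers and type names): the emitted word distribution is a mixture over a general vocabulary distribution and per-type entity distributions, with the source decision kept latent during training.

```python
import numpy as np

def kalm_mixture(p_source, word_dists):
    """p_source gives the probability of each source (index 0 = general vocabulary,
    the rest are entity types); word_dists holds one distribution over the shared
    vocabulary per source; the emitted word distribution is the mixture."""
    p_source = np.asarray(p_source)
    word_dists = np.asarray(word_dists)      # (num_sources, vocab_size)
    return p_source @ word_dists             # marginal distribution over words

p_word = kalm_mixture([0.7, 0.2, 0.1],
                      [[0.5, 0.3, 0.2],      # general vocabulary
                       [0.1, 0.8, 0.1],      # e.g. type "person"
                       [0.2, 0.2, 0.6]])     # e.g. type "location"
print(p_word, p_word.sum())                  # a proper distribution summing to 1
```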
Knowledge enables AI agents to understand, infer, and address user demands, which is essential in most knowledge-driven applications like information retrieval, question answering, and dialogue systems. The behavior of AI agents becomes more reasonable and accurate with the support of knowledge representations. In the following subsections, we introduce the improvements brought by knowledge representation to question answering.
Question answering aims to give correct answers according to users' questions, which requires the capabilities of both natural language understanding of questions and inference for answer selection. Therefore, combining knowledge with question answering is a straightforward application of knowledge representations. Most conventional question answering systems directly utilize knowledge graphs as databases, ignoring the latent relationships between entities and relations. Recently, with the thriving of deep learning, explorations have focused on neural models for understanding questions and even generating answers.

Considering the flexibility and diversity of generated answers in natural language, Yin et al. [93] propose a neural Generative Question Answering model (GENQA), which explores generating answers to simple factoid questions in natural language. Figure 7.22 demonstrates the workflow of GENQA. First, a bidirectional RNN, regarded as the Interpreter, transforms the question q from natural language into a compressed representation H_q. Next, the Enquirer takes H_q as the key to rank the relevant triple facts of q in the knowledge graph and retrieves possible entities into r_q. Finally, the Answerer combines H_q and r_q to generate answers in natural language. Similar to [1], at each step the Answerer first decides whether to generate common words or knowledge words according to a logistic regression model; for common words, the Answerer acts in the same way as RNN decoders, with H_q selected by attention-based methods, while for knowledge words, the Answerer directly generates the entities with higher ranks.

Fig. 7.22 The architecture of the GENQA model

There are gradually more efforts focusing on encoding knowledge representations into knowledge-driven tasks like information retrieval and dialogue systems. However, how to flexibly and effectively combine knowledge with AI agents remains to be explored in the future.

Due to the rapid growth of web information, recommendation systems have been playing an essential role in web applications. A recommendation system aims to predict the "rating" or "preference" that users may give to items. Since KGs can provide rich information, including both structured and unstructured data, recommendation systems have utilized more and more knowledge from KGs to enrich their contexts. Cheekula et al. [11] explore utilizing the hierarchical knowledge from the DBpedia category structure in recommendation systems and employ the spreading activation algorithm to identify entities of interest to the user. Besides, Passant [56] measures the semantic relatedness of artist entities in a KG to build music recommendation systems. However, most of these systems mainly investigate the problem by leveraging the structure of KGs. Recently, with the development of representation
learning, [98] proposes to jointly learn the latent representations in a collaborative filtering recommendation system as well as the entities' representations in KGs.

Besides the tasks stated above, there are gradually more efforts focusing on encoding knowledge graph representations into other tasks such as dialogue systems [37, 103], entity disambiguation [20, 31], knowledge graph alignment [12, 102], dependency parsing [35], etc. Moreover, the idea of KRL has also motivated research on visual relation extraction [2, 99] and social relation extraction [71].
In this chapter, we first introduce the concept of the knowledge graph. A knowledge graph contains both entities and the relationships among them in the form of triple facts, providing an effective way for human beings to learn about and understand the real world. Next, we introduce the motivations of knowledge graph representation, which is a useful and convenient tool for handling large amounts of structured knowledge; it has been widely explored and utilized in multiple knowledge-based tasks and brings significant improvements in performance. We then describe existing approaches for knowledge graph representation. Further, we discuss several advanced approaches that aim to deal with the current challenges of knowledge graph representation. We also review real-world applications of knowledge graph representation such as language modeling, question answering, information retrieval, and recommendation systems.

For further understanding of knowledge graph representation, you can find more related papers in this paper list: https://github.com/thunlp/KRLPapers. There are also some recommended surveys and books, including:

• Bengio et al. Representation learning: A review and new perspectives [4].
• Liu et al. Knowledge representation learning: A review [47].
• Nickel et al. A review of relational machine learning for knowledge graphs [52].
• Wang et al. Knowledge graph embedding: A survey of approaches and applications [74].
• Ji et al. A survey on knowledge graphs: Representation, acquisition and applications [34].

In the future, for better knowledge graph representation, there are some directions requiring further efforts:
(1) Utilizing More Knowledge. Current KRL approaches focus on representing triple-based knowledge from world knowledge graphs such as Freebase, Wikidata, etc. In fact, there are various kinds of knowledge in the real world, such as factual knowledge, event knowledge, and commonsense knowledge. Moreover, knowledge is stored in different formats, such as attributes, quantifiers, and text. Researchers have formed a consensus that utilizing more knowledge is a potential way toward more interpretable and intelligent NLP. Some existing works [44, 82] have made preliminary attempts at utilizing more knowledge in KRL. Beyond these works, is it possible to represent different kinds of knowledge in a unified semantic space that can be easily applied in downstream NLP tasks?
(2) Performing Deep Fusion of Knowledge and Language. There is no doubt that the joint learning of knowledge and language information can further benefit downstream NLP tasks. Existing works [76, 89, 97] have preliminarily verified the effectiveness of joint learning. Recently, ERNIE [100] and KnowBERT [57] further provide a novel perspective on fusing knowledge and language in pretraining. Soares et al. [64] learn relational similarity in text with the guidance of KGs, which is also a pioneer of knowledge fusion. Besides designing novel pretraining objectives, we could also design novel model architectures for downstream tasks that are more suitable for utilizing KRL, such as memory-based models [48, 91] and graph network-based models [66]. Nevertheless, effectively performing the deep fusion of knowledge and language remains an unsolved problem.
(3) Orienting Heterogeneous Modalities. With the fast development of the World Wide Web, the amount of audio, image, and video data on the Web has become larger and larger; these are also important resources for KRL besides texts. Some pioneering works [51, 81] explore learning knowledge representations on multi-modal knowledge graphs, but they are still preliminary attempts. Intuitively, audio and visual knowledge can provide complementary information that benefits related NLP tasks. To the best of our knowledge, there is still a lack of research on applying multi-modal KRL in downstream tasks. How to efficiently and effectively integrate multi-modal knowledge is becoming a critical and challenging problem for KRL.
(4) Exploring Knowledge Reasoning. Most existing KRL methods represent knowledge information in a low-dimensional semantic space, which makes computation over complex knowledge graphs feasible in neural-based NLP models. Although benefiting from the usability of low-dimensional embeddings, KRL cannot perform explainable reasoning such as applying symbolic rules, which is of great importance for downstream NLP tasks. Recently, there has been increasing interest in the combination of embedding methods and symbolic reasoning methods [26, 59], aiming to take advantage of both. Beyond these works, there remain many unsolved problems in developing better knowledge reasoning abilities for KRL.
References
1. Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. A neural knowledge lan-guage model. arXiv preprint arXiv:1608.00318, 2016.2. Stephan Baier, Yunpu Ma, and Volker Tresp. Improving visual relationship detection usingsemantic modeling of scene descriptions. In
Proceedings of ISWC , 2017.3. Islam Beltagy and Raymond J Mooney. Efficient markov logic inference for natural languagesemantics. In
Proceedings of AAAI Workshop , 2014.4. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review andnew perspectives.
TPAMI , 35(8):1798–1828, 2013.5. Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of wordsand meaning representations for open-text semantic parsing. In
Proceedings of AISTATS, 2012.
6. Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A semantic matching energy function for learning with multi-relational data.
Machine Learning , 94(2):233–259,2014.7. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and OksanaYakhnenko. Translating embeddings for modeling multi-relational data. In
Proceedings ofNeurIPS , 2013.8. Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structuredembeddings of knowledge bases. In
Proceedings of AAAI , 2011.9. Andrew Carlson, Justin Betteridge, Richard C Wang, Estevam R Hruschka Jr, and Tom MMitchell. Coupled semi-supervised learning for information extraction. In
Proceedings ofWSDM , 2010.10. Mohamed Chabchoub, Michel Gagnon, and Amal Zouaq. Collective disambiguation andsemantic annotation for entity linking and typing. In
Proceedings of SWEC , 2016.11. Siva Kumar Cheekula, Pavan Kapanipathi, Derek Doran, Prateek Jain, and Amit P Sheth.Entity recommendations using hierarchical knowledge bases. 2015.12. Muhao Chen, Yingtao Tian, Mohan Yang, and Zaniolo Carlo. Multilingual knowledge graphembeddings for cross-lingual knowledge alignment. In
Proceedings of IJCAI , 2017.13. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networksfor soft-matching n-grams in ad-hoc search. In
Proceedings of WSDM , 2018.14. Jeffrey Dalton, Laura Dietz, and James Allan. Entity query feature expansion using knowledgebase links. In
Proceedings of SIGIR , 2014.15. Luciano Del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum. Finet:Context-aware fine-grained named entity typing. In
Proceedings of EMNLP , 2015.16. Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional2d knowledge graph embeddings. In
Proceedings of AAAI , 2018.17. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training ofdeep bidirectional transformers for language understanding. In
Proceedings of NAACL , 2019.18. Li Dong, Furu Wei, Hong Sun, Ming Zhou, and Ke Xu. A hybrid neural model for typeclassification of entity mentions. In
Proceedings of IJCAI , 2015.19. Faezeh Ensan and Ebrahim Bagheri. Document retrieval model through semantic linking. In
Proceedings of WSDM , 2017.20. Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. Entity disambiguation byknowledge and text jointly embedding. In
Proceedings of CoNLL , 2016.21. Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. Composing relationships withtranslations. In
Proceedings of EMNLP , 2015.22. Kelvin Gu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. In
Proceedings of EMNLP , 2015.23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. Semantic matching by non-linearword transportation for information retrieval. In
Proceedings of CIKM , 2016.24. Shu Guo, Quan Wang, Lihong Wang, Bin Wang, and Li Guo. Jointly embedding knowledgegraphs and logical rules. In
Proceedings of EMNLP , 2016.25. Petr Hájek.
Metamathematics of fuzzy logic , volume 4. Springer Science & Business Media,1998.26. Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embeddinglogical queries on knowledge graphs. In
Proceedings of NIPS , 2018.27. Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledgefor knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016.28. Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attentionbetween knowledge graph and text. In
Proceedings of AAAI , pages 4832–4839, 2018.29. Katsuhiko Hayashi and Masashi Shimbo. On the equivalence of holographic and complexembeddings for link prediction. In
Proceedings of ACL , 2017.30. Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. Learning to represent knowledge graphswith gaussian embedding. In
Proceedings of CIKM, 2015.
31. Hongzhao Huang, Larry Heck, and Heng Ji. Leveraging deep neural networks and knowledge graphs for entity disambiguation. arXiv preprint arXiv:1504.07678, 2015.
32. Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In
Proceedings of ACL , 2015.33. Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptivesparse transfer matrix. In
Proceedings of AAAI , 2016.34. Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S Yu. A survey on knowl-edge graphs: Representation, acquisition and applications. arXiv preprint arXiv:2002.00388,2020.35. A-Yeong Kim, Hyun-Je Song, Seong-Bae Park, and Sang-Jo Lee. A re-ranking model fordependency parsing with knowledge graph embeddings. In
Proceedings of IALP , 2015.36. Denis Krompaß, Stephan Baier, and Volker Tresp. Type-constrained representation learningin knowledge graphs. In
Proceedings of ISWC , 2015.37. Phong Le, Marc Dymetman, and Jean-Michel Renders. Lstm-based mixture-of-experts forknowledge-aware dialogues. In
Proceedings of ACL Workshop , 2016.38. Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modelingrelation paths for representation learning of knowledge bases. In
Proceedings of EMNLP ,2015.39. Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relationembeddings for knowledge graph completion. In
Proceedings of AAAI , 2015.40. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relationextraction with selective attention over instances. In
Proceedings of ACL , 2016.41. Xiao Ling and Daniel S Weld. Fine-grained entity recognition. In
Proceedings of AAAI , 2012.42. Angli Liu, Jingfei Du, and Veselin Stoyanov. Knowledge-augmented language model and itsapplication to unsupervised named-entity recognition. In
Proceedings of NAACL , 2019.43. Quan Liu, Hui Jiang, Andrew Evdokimov, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, andYu Hu. Probabilistic reasoning via deep learning: Neural association models. arXiv preprintarXiv:1603.07704, 2016.44. Quan Liu, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. Commonsenseknowledge enhanced embeddings for solving pronoun disambiguation problems in winogradschema challenge. arXiv preprint arXiv:1611.04146, 2016.45. Xitong Liu and Hui Fang. Latent entity space: A novel retrieval approach for entity-bearingqueries.
Information Retrieval Journal , 18(6):473–503, 2015.46. Yang Liu, Kang Liu, Liheng Xu, Jun Zhao, et al. Exploring fine-grained entity type constraintsfor distantly supervised relation extraction. In
Proceedings of COLING , 2014.47. Zhiyuan Liu, Maosong Sun, Yankai Lin, and Ruobing Xie. Knowledge representation learn-ing: a review.
JCRD , 53(2):247–261, 2016.48. Todor Mihaylov and Anette Frank. Knowledgeable reader: Enhancing cloze-style readingcomprehension with external commonsense knowledge. In
Proceedings of ACL , 2018.49. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of wordrepresentations in vector space. In
Proceedings of ICLR , 2013.50. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relationextraction without labeled data. In
Proceedings of ACL-IJCNLP , 2009.51. Hatem Mousselly-Sergieh, Teresa Botschen, Iryna Gurevych, and Stefan Roth. A multimodaltranslation-based approach for knowledge graph representation learning. In
Proceedings ofJCLCS , 2018.52. Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review ofrelational machine learning for knowledge graphs. 2015.53. Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings ofknowledge graphs. In
Proceedings of AAAI , 2016.54. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collectivelearning on multi-relational data. In
Proceedings of ICML , 2011.55. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing yago: scalable machinelearning for linked data. In
Proceedings of WWW, 2012.
56. Alexandre Passant. dbrec: music recommendations using dbpedia. In
Proceedings of ISWC ,2010.57. Matthew E. Peters, Mark Neumann, Robert L Logan, Roy Schwartz, Vidur Joshi, SameerSingh, and Noah A. Smith. Knowledge enhanced contextual word representations. In
Pro-ceedings of EMNLP-IJCNLP , 2019.58. Jay Pujara, Hui Miao, Lise Getoor, and William W Cohen. Knowledge graph identification.In
Proceedings of ISWC , 2013.59. Meng Qu and Jian Tang. Probabilistic logic neural networks for reasoning. In
Proceedings ofNIPS , 2019.60. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.Language models are unsupervised multitask learners.
OpenAI Blog , 1(8), 2019.61. Hadas Raviv, Oren Kurland, and David Carmel. Document retrieval using entity-based lan-guage models. In
Proceedings of SIGIR , 2016.62. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semanticmodel with convolutional-pooling structure for information retrieval. In
Proceedings of CIKM ,2014.63. Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. An attentive neuralarchitecture for fine-grained entity type classification. In
Proceedings of AKBC Workshop ,2016.64. Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching theBlanks: Distributional similarity for relation learning. In
Proceedings of ACL , pages 2895–2905, 2019.65. Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neuraltensor networks for knowledge base completion. In
Proceedings of NeurIPS , 2013.66. Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, andWilliam Cohen. Open domain question answering using early fusion of knowledge bases andtext. In
Proceedings of EMNLP , 2018.67. Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embed-ding by relational rotation in complex space. In
Proceedings of ICLR , 2019.68. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neuralnetworks. In
Proceedings of NeurIPS , 2014.69. Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task:Language-independent named entity recognition. In
Proceedings of HLT-NAACL , 2003.70. Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard.Complex embeddings for simple link prediction. In
Proceedings of ICML , 2016.71. Cunchao Tu, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Transnet: translation-basednetwork representation learning for social relation extraction. In
Proceedings of IJCAI , 2017.72. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neuralimage caption generator. In
Proceedings of CVPR , 2015.73. Nina Wacholder, Yael Ravin, and Misook Choi. Disambiguation of proper names in text. In
Proceedings of ANLP , 1997.74. Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A surveyof approaches and applications.
TKDE , 29(12):2724–2743, 2017.75. Quan Wang, Bin Wang, and Li Guo. Knowledge base completion using embeddings and rules.In
Proceedings of IJCAI , 2015.76. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointlyembedding. In
Proceedings of EMNLP , pages 1591–1601, 2014.77. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding bytranslating on hyperplanes. In
Proceedings of AAAI, 2014.
78. Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. TransA: An adaptive approach for knowledge graph embedding. arXiv preprint arXiv:1509.05490, 2015.
79. Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. TransG: A generative mixture model for knowledge graph embedding. arXiv preprint arXiv:1509.05488, 2015.
80. Han Xiao, Minlie Huang, and Xiaoyan Zhu. From one point to a manifold: Knowledge graph embedding for precise link prediction. In
Proceedings of IJCAI , 2016.81. Ruobing Xie, Zhiyuan Liu, Tat-seng Chua, Huanbo Luan, and Maosong Sun. Image-embodiedknowledge representation learning. In
Proceedings of IJCAI , 2016.82. Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learningof knowledge graphs with entity descriptions. In
Proceedings of AAAI , 2016.83. Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphswith hierarchical types. In
Proceedings of IJCAI , 2016.84. Chenyan Xiong and Jamie Callan. EsdRank: Connecting query and documents through exter-nal semi-structured data. In
Proceedings of CIKM , 2015.85. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Bag-of-entities representation for ranking.In
Proceedings of ICTIR , 2016.86. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for docu-ment ranking. In
Proceedings of SIGIR , 2017.87. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-endneural ad-hoc ranking with kernel pooling. In
Proceedings of SIGIR , 2017.88. Chenyan Xiong, Russell Power, and Jamie Callan. Explicit semantic ranking for academicsearch via knowledge graph embedding. In
Proceedings of WWW , 2017.89. Jiacheng Xu, Xipeng Qiu, Kan Chen, and Xuanjing Huang. Knowledge graph representationwith jointly structural and textual encoding. In
Proceedings of IJCAI , 2017.90. Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, and Gerhard Weikum. Robust questionanswering over the web of linked data. In
Proceedings of CIKM , 2013.91. Bishan Yang and Tom Mitchell. Leveraging knowledge bases in LSTMs for improvingmachine reading. In
Proceedings of ACL , 2017.92. Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entitiesand relations for learning and inference in knowledge bases. In
Proceedings of ICLR , 2015.93. Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. Neural gener-ative question answering. In
Proceedings of IJCAI , 2016.94. Dani Yogatama, Daniel Gillick, and Nevena Lazic. Embedding methods for fine grained entitytype classification. In
Proceedings of ACL , 2015.95. Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum.HYENA: Hierarchical type classification for entity names. In
Proceedings of COLING , 2012.96. Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation classificationvia convolutional deep neural network. In
Proceedings of COLING , 2014.97. Dongxu Zhang, Bin Yuan, Dong Wang, and Rong Liu. Joint semantic relevance learning withtext data and graph knowledge. In
Proceedings of ACL-IJCNLP , 2015.98. Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborativeknowledge base embedding for recommender systems. In
Proceedings of SIGKDD , 2016.99. Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embed-ding network for visual relation detection. In
Proceedings of CVPR , 2017.100. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie:Enhanced language representation with informative entities. In
Proceedings of ACL , 2019.101. Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. Aligning knowledgeand text embeddings by entity descriptions. In
Proceedings of EMNLP , 2015.102. Hao Zhu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Iterative entity alignment via jointknowledge embeddings. In
Proceedings of IJCAI, 2017.
103. Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, and Qiang Yang. Flexible end-to-end dialogue system for knowledge grounded conversation. arXiv preprint arXiv:1709.04264, 2017.
Chapter 8
Network Representation
Abstract
Network representation learning aims to embed the vertices of a network into low-dimensional dense representations, in which similar vertices in the network should have "close" representations (usually measured by the cosine similarity or Euclidean distance of their representations). The representations can be used as features of the vertices and applied to many network analysis tasks. In this chapter, we will introduce the network representation learning algorithms developed over the past decade. Then we will discuss their extensions when applied to various real-world networks. Finally, we will introduce some common evaluation tasks of network representation learning and relevant datasets.
As a natural way to represent objects and their relationships, the network is ubiquitous in our daily lives. The rapid development of social networks like Facebook and Twitter encourages researchers to design effective and efficient algorithms over network structures. A key problem of network study is how to represent the network information properly. Traditional representations of networks are usually high dimensional and sparse, which becomes a weakness when people apply statistical learning to networks. With the development of machine learning, feature learning for vertices in a network is becoming an emerging task. Therefore, network representation learning algorithms turn network information into low-dimensional dense real-valued vectors, which can be used as input for existing machine learning algorithms. For example, the representations of vertices can be fed to a classifier like a Support Vector Machine (SVM) for the vertex classification task. Also, the representations can be used for visualization by taking them as points in Euclidean space. In this section, we will formalize the network representation learning problem.

Denote a network as G = (V, E), where V is the vertex set and E is the edge set. An edge e = (v_i, v_j) ∈ E with v_i, v_j ∈ V is a directed edge from vertex v_i to v_j. The outdegree of vertex v_i is defined as deg_O(v_i) = |{v_j | (v_i, v_j) ∈ E}|. Similarly,
Fig. 8.1
A visualization of vertex embeddings learned by the DeepWalk model [93]

the indegree of vertex v_i is defined as deg_I(v_i) = |{v_j | (v_j, v_i) ∈ E}|. For an undirected network, we have deg(v_i) = deg_O(v_i) = deg_I(v_i). Taking a social network as an example, a vertex represents a user and an edge represents the friendship between two users. The indegree and outdegree represent the number of followers and followees of a user, respectively.

The adjacency matrix A ∈ R^{|V|×|V|} is a matrix where A_{ij} = 1 if (v_i, v_j) ∈ E and A_{ij} = 0 otherwise; for a weighted network, A_{ij} equals the weight of edge (v_i, v_j). The adjacency matrix is a simple and straightforward representation of the network. Each row of the adjacency matrix A denotes the relationship between a vertex and the other vertices and can be seen as the representation of the corresponding vertex.

Though convenient and straightforward, the adjacency matrix representation suffers from the scalability problem. The adjacency matrix A takes |V| × |V| space to store, which is usually unacceptable when |V| grows large. Also, the adjacency matrix is very sparse, which means most of its entries are zeros. The data sparsity makes discrete algorithms applicable, but it is still hard to develop efficient algorithms for statistical learning [93].

Therefore, people came up with the idea of learning low-dimensional dense representations for the vertices in a network. Formally, the goal of network representation learning is to learn a real-valued vector v ∈ R^d for each vertex v ∈ V, where the dimension d is much smaller than the number of vertices |V|. The idea is that similar vertices should have close representations, as shown in Fig. 8.1. Network representation learning can be unsupervised or semi-supervised. The representations are learned automatically without feature engineering and, once learned, can be further used for specific tasks like classification. These representations are low dimensional, which enables efficient algorithms to be designed over the representations without considering the original network structure. We will discuss more details about the evaluation of network representations later in this chapter.
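To make the storage argument concrete, the following sketch (NumPy and SciPy assumed, with a small hypothetical edge list) contrasts the |V| × |V| adjacency representation with a |V| × d embedding matrix; it is an illustration of the problem setup, not any particular algorithm.

```python
import numpy as np
import scipy.sparse as sp

# Toy directed edge list (hypothetical example graph).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
num_vertices = 5

# Sparse adjacency matrix A: |V| x |V|, mostly zeros.
rows, cols = zip(*edges)
A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)),
                  shape=(num_vertices, num_vertices))

# Out- and in-degrees follow directly from A.
out_degree = np.asarray(A.sum(axis=1)).ravel()
in_degree = np.asarray(A.sum(axis=0)).ravel()

# Network representation learning replaces each length-|V| row of A
# with a dense vector of dimension d << |V|.
d = 2
V_embed = np.random.randn(num_vertices, d)  # to be learned by an NRL algorithm

print(A.toarray())         # |V| x |V| sparse structure
print(out_degree, in_degree)
print(V_embed.shape)       # (|V|, d) dense representation
```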
In this section, we will introduce several kinds of network representation learning algorithms in detail.

Spectral clustering based methods are a group of algorithms that compute the first k eigenvectors or singular vectors of an affinity matrix, such as the adjacency or Laplacian matrix of the network. These methods depend heavily on the construction of the affinity matrix, and the evaluation results of different affinity matrices vary a lot. Generally speaking, spectral clustering based methods have a high complexity because the computation of eigenvectors and singular vectors has a nonlinear time complexity. On the other hand, spectral clustering based methods need to keep the affinity matrix in memory during the computation, so the space complexity cannot be ignored either. These disadvantages limit the large-scale and online generalization of these methods. Now we present several algorithms based on spectral clustering.

Locally Linear Embedding (LLE) [98] assumes that the representations of vertices are sampled from a manifold. More specifically, LLE supposes that the representations of a vertex and its neighbors lie in a locally linear patch of the manifold. That is to say, a vertex's representation can be approximated by a linear combination of the representations of its neighbors. LLE uses the linear combination of neighbors to reconstruct the center vertex. Formally, the reconstruction error of all vertices can be expressed as

\mathcal{L}(W, V) = \sum_{i=1}^{|V|} \Big\| v_i - \sum_{j=1}^{|V|} W_{ij} v_j \Big\|^2,   (8.1)

where V ∈ R^{|V|×d} is the vertex embedding matrix and W_{ij} is the contribution coefficient of vertex v_j to v_i. LLE enforces W_{ij} = 0 if v_i and v_j are not connected, i.e., (v_i, v_j) ∉ E. Further, the summation of each row of matrix W is set to 1, i.e., \sum_{j=1}^{|V|} W_{ij} = 1. The optimization is processed alternately over the weight matrix W and the representation V. The optimization over W can be solved as a least-squares problem. The optimization over the representation V leads to the following optimization problem:

\mathcal{L}(W, V) = \sum_{i=1}^{|V|} \Big\| v_i - \sum_{j=1}^{|V|} W_{ij} v_j \Big\|^2,   (8.2)
s.t.

\sum_{i=1}^{|V|} v_i = 0,   (8.3)

and

\frac{1}{|V|-1} \sum_{i=1}^{|V|} v_i^{\top} v_i = I_d,   (8.4)

where I_d denotes the d × d identity matrix. The conditions in Eqs. 8.3 and 8.4 ensure the uniqueness of the solution. The first condition enforces the center of all vertex embeddings to be the zero point, and the second condition guarantees that different coordinates have the same scale, i.e., equal contribution to the reconstruction error.

The optimization problem can be formulated as the computation of eigenvectors of the matrix (I_{|V|} − W^{\top})(I_{|V|} − W), which is an easily solvable eigenvalue problem. More details can be found in the note [22].

The Laplacian Eigenmap [8] algorithm simply follows the idea that the representations of two connected vertices should be close. Specifically, the "closeness" is measured by the square of the Euclidean distance. We use D to denote the diagonal degree matrix, a |V| × |V| diagonal matrix whose ith diagonal entry D_{ii} is the degree of vertex v_i. The Laplacian matrix L of a graph is defined as the difference between the diagonal degree matrix D and the adjacency matrix A, i.e., L = D − A.

The Laplacian Eigenmap algorithm minimizes the following cost function:

\mathcal{L}(V) = \sum_{\{i,j \,\mid\, (v_i, v_j) \in E\}} \| v_i - v_j \|^2,   (8.5)

s.t. \; V^{\top} D V = I_d.   (8.6)

The cost function is the summation of the squared losses of all connected vertex pairs, and the condition prevents the trivial all-zero solution caused by an arbitrary scale. Equation 8.5 can be reformulated in matrix form as

V^{*} = \arg\min_{V^{\top} D V = I_d} \mathrm{tr}(V^{\top} L V).   (8.7)

Algebraic knowledge tells us that the optimal solution V^* of Eq. 8.7 consists of the eigenvectors corresponding to the d smallest nonzero eigenvalues of the Laplacian matrix L. Note that the Laplacian Eigenmap algorithm can be easily generalized to the weighted graph.
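As a concrete illustration of Eq. 8.7, the following sketch (SciPy assumed, with a small hypothetical undirected graph) solves the generalized eigenvalue problem L v = λ D v and keeps the eigenvectors of the d smallest nonzero eigenvalues as vertex embeddings; it is a minimal sketch under these assumptions, not a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical small undirected graph given by its adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # diagonal degree matrix
L = D - A                    # graph Laplacian

# Solve the generalized eigenvalue problem L v = lambda D v (Eq. 8.7).
eigvals, eigvecs = eigh(L, D)

# Skip the trivial eigenvector (eigenvalue ~ 0) and keep the d smallest nonzero ones.
d = 2
V_embed = eigvecs[:, 1:1 + d]
print(V_embed)  # each row is the Laplacian Eigenmap embedding of a vertex
```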
Both LLE and Laplacian Eigenmap have a symmetric cost function, which indicates that neither algorithm can be applied to a directed graph. Directed Graph Embedding (DGE) [17] was proposed to generalize Laplacian Eigenmap. For both directed and undirected graphs, we can define a transition probability matrix P ∈ R^{|V|×|V|}, where P_{ij} denotes the probability that vertex v_i walks to v_j.

Table 8.1 Applicability of LLE, Laplacian Eigenmap, and DGE algorithms on undirected, weighted, and directed graphs

Algorithm           | Undirected | Weighted | Directed
LLE                 | yes        | –        | –
Laplacian Eigenmap  | yes        | yes      | –
DGE                 | yes        | yes      | yes
The transition matrix defines a Markov random walk over the graph. We denote the stationary probability of vertex v_i as π_i, where \sum_i \pi_i = 1. The stationary distribution of a random walk is commonly used in many ranking algorithms such as PageRank. DGE designs a new cost function that emphasizes the important vertices, i.e., those with a higher stationary probability:

\mathcal{L}(V) = \sum_{i=1}^{|V|} \pi_i \sum_{j=1}^{|V|} P_{ij} \| v_i - v_j \|^2.   (8.8)

By denoting M = \mathrm{diag}(\pi_1, \pi_2, \ldots, \pi_{|V|}), the cost function in Eq. 8.8 can be reformulated as

\mathcal{L}(V) = 2\, \mathrm{tr}(V^{\top} B V),   (8.9)

s.t. \; V^{\top} M V = I_d,   (8.10)

where

B = M - \frac{M P + P^{\top} M}{2}.   (8.11)

The condition in Eq. 8.10 is added to remove an arbitrary scaling factor. Similar to Laplacian Eigenmap, the optimization problem can be solved as a generalized eigenvector problem.

For comparison among the above three network embedding algorithms, we summarize their applicability in Table 8.1.

Unlike previous works that minimize the distance between vertex representations, Tang and Liu [112] introduce modularity [85] into the cost function instead. Modularity is a measurement that characterizes how far the graph is away from a uniform random graph. Given graph G = (V, E), we assume that the vertices V are divided into k nonoverlapping communities. By "uniform random graph", we mean that vertices connect to each other based on a uniform distribution given their degrees. The expected number of edges between v_i and v_j is then \deg(v_i)\deg(v_j)/(2|E|). The modularity Q of a graph is defined as
Q = \frac{1}{2|E|} \sum_{i,j} \Big[ A_{ij} - \frac{\deg(v_i)\deg(v_j)}{2|E|} \Big] \delta(v_i, v_j),   (8.12)

where δ(v_i, v_j) = 1 if v_i and v_j belong to the same community and δ(v_i, v_j) = 0 otherwise. The community assignment is then found by maximizing the modularity Q. However, hard clustering by modularity maximization is proved to be NP-hard. Therefore, they relax the problem to a soft case. Let d ∈ Z_+^{|V|} denote the degrees of all vertices and S ∈ {0, 1}^{|V|×k} denote the community indicator matrix, where

S_{ij} = \begin{cases} 1 & \text{if } v_i \text{ belongs to community } j, \\ 0 & \text{otherwise}. \end{cases}   (8.13)

Define the modularity matrix B as

B = A - \frac{\mathbf{d}\mathbf{d}^{\top}}{2|E|},   (8.14)

and the modularity Q can be reformulated as

Q = \frac{1}{2|E|} \mathrm{tr}(S^{\top} B S).   (8.15)

By relaxing S to a continuous matrix, it has been proved that the optimal solution is given by the top-k eigenvectors of the modularity matrix B [84].

As an alternative cost function, Tang and Liu also proposed another algorithm [113] that optimizes the normalized cut of the graph. Similarly, the algorithm turns to the computation of the top-k eigenvectors of the normalized graph Laplacian \tilde{L}:

\tilde{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}.   (8.16)

Then the community indicator matrix is taken as a k-dimensional vertex representation.

To conclude, spectral clustering methods for network representation learning often define a cost function that is linear or quadratic in the vertex embeddings. They then reformulate the cost function in matrix form and figure out that the optimal solutions are the eigenvectors of a particular matrix according to algebraic knowledge. The major drawback of spectral clustering methods is the complexity: the computation of eigenvectors for large-scale matrices is both time and space consuming.

As shown in the previous subsections, accurate computation of the optimal solution, such as eigenvector computation, is not very efficient for large-scale problems. Meanwhile, neural network approaches have proved their effectiveness in many areas such as natural language and image processing. Though the gradient descent method cannot always guarantee an optimal solution of neural network models, the implementation and learning of neural networks are relatively fast, and they usually have good performance. On the other hand, neural network models can let people get rid of feature engineering and are mostly data driven. Thus, the exploration of neural network approaches for representation learning is becoming an emerging task.

DeepWalk [93] proposes a novel approach that introduces deep learning techniques into network representation learning for the first time. The benefits of modeling truncated random walks instead of the adjacency matrix are twofold: first, random walks need only local information and thus enable discrete and online algorithms, while modeling the adjacency matrix may require storing everything in memory and thus be space consuming; second, modeling random walks can alleviate the variance and uncertainty of modeling the original binary adjacency matrix. We will look deeper into DeepWalk in the next subsection.

Unsupervised representation learning algorithms have been widely studied and applied in the natural language processing area. The authors show that the vertex frequency in short random walks also follows the power law, as words in documents do. Showing the connection between vertices and words and between random walks and sentences, the authors adapted the well-known word representation learning algorithm word2vec [80] to vertex representation learning.
Now, we will introduce the DeepWalk algorithm in detail. Given a graph G = (V, E), we denote a random walk starting at vertex v_i as \mathcal{W}_{v_i}. We use \mathcal{W}_{v_i}^k to represent the kth vertex in the random walk \mathcal{W}_{v_i}. The next vertex \mathcal{W}_{v_i}^{k+1} is generated by uniformly random selection from the neighbors of vertex \mathcal{W}_{v_i}^k. Random walk sequences have been used for many network analysis tasks, such as similarity measurement and community detection [2, 32].

DeepWalk follows the idea of language modeling to model short random walk sequences, that is, to estimate the likelihood of observing vertex v_i given all previous vertices in the random walk:

P(v_i \mid (v_1, v_2, \ldots, v_{i-1})).   (8.17)

For vertex representation learning, we turn to predicting vertex v_i given the representations of all previous vertices:

P(v_i \mid (v_1, v_2, \ldots, v_{i-1})).   (8.18)

A relaxation of this formula in language modeling is to use vertex v_i to predict its neighboring vertices v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w}, where w is the window size.
This part of the model is called the Skip-gram model in word embedding learning. The neighboring vertices are also called context vertices of the center vertex. As another simplification, DeepWalk ignores the order and offset of the vertices and thus predicts v_{i-w} and v_{i-1} in the same way. The optimization objective for a single vertex of a random walk can be formulated as

\min_{v} \; -\log P(\{v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w}\} \mid v_i).   (8.19)

Based on the independence assumption, the loss function can be rewritten as

\min_{v} \; \sum_{k=-w, \, k \neq 0}^{w} -\log P(v_{i+k} \mid v_i).   (8.20)

The overall loss function is obtained by summing over every vertex in every random walk.

Now we discuss how to predict a single vertex v_j given the center vertex v_i. In DeepWalk, each vertex v_i has two representations of the same dimension: the vertex representation v_i ∈ R^d and the context representation c_i ∈ R^d. The prediction probability P(v_j | v_i) is defined by a softmax function over all vertices:

P(v_j \mid v_i) = \frac{\exp(v_i \cdot c_j)}{\sum_{k=1}^{|V|} \exp(v_i \cdot c_k)}.   (8.21)

Here we come to the parameter learning phase of DeepWalk. We first present the pseudocode of the DeepWalk framework in Algorithm 8.1.

Algorithm 8.1
DeepWalk algorithm
Input: graph G = (V, E), window size w, embedding size d, walks per vertex n, walk length l
for i = 1, 2, ..., n do
  for v_i ∈ V do
    \mathcal{W}_{v_i} = RandomWalk(G, v_i, l)
    Skip-gram(V, \mathcal{W}_{v_i}, w)
  end for
end for

where RandomWalk(G, v_i, l) generates a random walk rooted at v_i with length l, and the Skip-gram(V, \mathcal{W}_{v_i}, w) function is defined in Algorithm 8.2, in which α_l is the learning rate of stochastic gradient descent.

Note that the parameter updating rule V = V − α_l ∂J/∂V in Skip-gram has a complexity of O(|V|) because, in the computation of the gradient of P(v_k | v_j) (as shown in Eq. 8.21), the denominator has |V| terms to compute. This complexity is unacceptable for large-scale networks.

Algorithm 8.2
Skip-gram(V, \mathcal{W}_{v_i}, w)
for v_j ∈ \mathcal{W}_{v_i} do
  for v_k ∈ \mathcal{W}_{v_i}[j − w : j + w] do
    if v_k ≠ v_j then
      J(V) = − log P(v_k | v_j)
      V = V − α_l ∂J/∂V
    end if
  end for
end for

Table 8.2 Analogy between DeepWalk and word2vec

Method    | Object | Input       | Output
word2vec  | Word   | Sentence    | Word embedding
DeepWalk  | Vertex | Random walk | Vertex embedding
To address this problem, Hierarchical Softmax was proposed as a variant of the original softmax function. The core idea is to map the vertices to a balanced binary tree, where each vertex corresponds to a leaf of the tree. Then the prediction of a vertex turns into the prediction of the path from the root to the corresponding leaf. Assume that the path from the root to vertex v_k is denoted by a sequence of tree nodes b_0, b_1, \ldots, b_{\lceil \log |V| \rceil}; then we have

\log P(v_k \mid v_j) = \sum_{i=1}^{\lceil \log |V| \rceil} \log P(b_i \mid v_j).   (8.22)

A binary decision at a tree node can easily be implemented by a logistic function. Hence, the time complexity reduces from O(|V|) to O(\log |V|). We can further accelerate the algorithm by using Huffman coding to map frequent vertices to tree nodes that are close to the root. We can also use negative sampling, as in word2vec, to replace hierarchical softmax for further speedup.

So far, we have finished the introduction of the DeepWalk algorithm. DeepWalk introduces efficient deep learning techniques into network embedding learning. Table 8.2 gives an analogy between DeepWalk and word2vec. DeepWalk outperforms traditional network representation learning methods on network classification tasks and is also efficient for large-scale networks. Besides, the generation of random walks can be generalized to non-random walks, such as information propagation streams. In the next subsection, we will give a detailed proof to demonstrate the correlation between DeepWalk and matrix factorization.
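To make the pipeline of Algorithms 8.1 and 8.2 concrete, here is a minimal sketch of DeepWalk built on NetworkX for the random walks and gensim's Word2Vec (4.x API) for the Skip-gram step with hierarchical softmax; the library choices, the toy graph, and the hyperparameter values are illustrative assumptions rather than the original implementation.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length):
    """Generate one truncated random walk rooted at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]  # gensim expects string tokens

def deepwalk(graph, walks_per_vertex=10, walk_length=40, window=5, dim=64):
    walks = []
    for _ in range(walks_per_vertex):
        vertices = list(graph.nodes())
        random.shuffle(vertices)
        for v in vertices:
            walks.append(random_walk(graph, v, walk_length))
    # Skip-gram with hierarchical softmax (sg=1, hs=1), as described above.
    model = Word2Vec(walks, vector_size=dim, window=window,
                     sg=1, hs=1, min_count=0, workers=4)
    return {v: model.wv[str(v)] for v in graph.nodes()}

# Toy usage on a small example graph.
G = nx.karate_club_graph()
embeddings = deepwalk(G, walks_per_vertex=5, walk_length=20, dim=16)
print(len(embeddings), embeddings[0].shape)
```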
Perozzi et al. introduced the Skip-gram model into the study of social networks for the first time and designed an algorithm named DeepWalk [93] for learning vertex representations on a graph. In this subsection, we prove that the DeepWalk algorithm with the Skip-gram and softmax model is actually factorizing a matrix M, where each entry M_{ij} is the logarithm of the average probability that vertex v_i randomly walks to vertex v_j within a fixed number of steps. We will explain this later.

Since the Skip-gram model does not consider the offset of context vertices and predicts context vertices independently, we can regard the random walks as a set of vertex-context pairs. The useful information in random walks is the co-occurrence of vertex pairs inside a window. Given a network G = (V, E), we suppose that a vertex-context set D is generated from random walks, where each element of D is a vertex-context pair (v, c). Let V be the set of vertices and V_C be the set of context vertices. In most cases, V = V_C.

Consider a vertex-context pair (v, c): N(v, c) denotes the number of times (v, c) appears in D. N_v = \sum_{c' \in V_C} N(v, c') and N_c = \sum_{v' \in V} N(v', c) denote the numbers of times v and c appear in D, respectively. Note that |D| = \sum_{v' \in V} \sum_{c' \in V_C} N(v', c').

A context vertex c ∈ V_C is represented by a d-dimensional vector c ∈ R^d, and C is a |V_C| × d matrix whose jth row is the vector c_j. Our goal is to figure out a matrix M = V C^{\top}.

Perozzi et al. implemented the DeepWalk algorithm with the Skip-gram and Hierarchical Softmax model. Note that Hierarchical Softmax is a variant of softmax for reducing the training time. In this subsection, we give proofs for both negative sampling and softmax with the Skip-gram model.

Negative sampling approximately maximizes the probability of the softmax function by randomly choosing k negative samples from the context set. Levy and Goldberg showed that the Skip-gram with Negative Sampling model (SGNS) is implicitly factorizing a word-context matrix [69] by assuming that the dimensionality d is sufficiently large; in other words, we can assign each product v · c a value independent of the others.

In the SGNS model, we have

P((v, c) \in D) = \mathrm{Sigmoid}(v \cdot c) = \frac{1}{1 + e^{-v \cdot c}}.   (8.23)

Suppose we choose k negative samples for each vertex-context pair (v, c) according to the distribution P_D(c_N) = \frac{N_{c_N}}{|D|}. Then the objective function of SGNS can be written as

\begin{aligned}
O &= \sum_{v \in V} \sum_{c \in V_C} N(v, c) \big( \log \mathrm{Sigmoid}(v \cdot c) + k \, \mathbb{E}_{c_N \sim P_D} [\log \mathrm{Sigmoid}(-v \cdot c_N)] \big) \\
  &= \sum_{v \in V} \sum_{c \in V_C} N(v, c) \log \mathrm{Sigmoid}(v \cdot c) + k \sum_{v \in V} N_v \sum_{c_N \in V_C} \frac{N_{c_N}}{|D|} \log \mathrm{Sigmoid}(-v \cdot c_N) \\
  &= \sum_{v \in V} \sum_{c \in V_C} \Big( N(v, c) \log \mathrm{Sigmoid}(v \cdot c) + k N_v \frac{N_c}{|D|} \log \mathrm{Sigmoid}(-v \cdot c) \Big).
\end{aligned}   (8.24)

Denote x = v · c. By solving \partial O / \partial x = 0, we have

v \cdot c = x = \log \frac{N(v, c) \, |D|}{N_v N_c} - \log k.   (8.25)

Thus we have M_{ij} = \log \frac{N(v_i, c_j)/|D|}{(N_{v_i}/|D|)(N_{c_j}/|D|)} - \log k. M_{ij} can be interpreted as the Pointwise Mutual Information (PMI) of the vertex-context pair (v_i, c_j) shifted by \log k.

Since both negative sampling and hierarchical softmax are variants of softmax, we pay more attention to the softmax model and give a further discussion on it. We also assume that the values of v · c are independent.

In the softmax model,

P((v, c) \in D) = \frac{e^{v \cdot c}}{\sum_{c' \in V_C} e^{v \cdot c'}}.   (8.26)

And the objective function is

O = \sum_{v \in V} \sum_{c \in V_C} N(v, c) \log \frac{e^{v \cdot c}}{\sum_{c' \in V_C} e^{v \cdot c'}}.   (8.27)

After extracting all terms associated with v · c as O(v, c), we have

O(v, c) = N(v, c) \log \frac{e^{v \cdot c}}{\sum_{c' \in V_C, c' \neq c} e^{v \cdot c'} + e^{v \cdot c}} + \sum_{\tilde{c} \in V_C, \tilde{c} \neq c} N(v, \tilde{c}) \log \frac{e^{v \cdot \tilde{c}}}{\sum_{c' \in V_C, c' \neq c} e^{v \cdot c'} + e^{v \cdot c}}.   (8.28)

Note that O = \frac{1}{|V_C|} \sum_{v \in V} \sum_{c \in V_C} O(v, c). Denote x = v · c. By solving \partial O / \partial x = 0, we have

v \cdot c = x = \log \frac{N(v, c)}{N_v} + b_v,   (8.29)

where b_v can be any real constant since it is canceled when we compute P((v, c) ∈ D). Thus, we have M_{ij} = \log \frac{N(v_i, c_j)}{N_{v_i}} + b_{v_i}. We will discuss what M_{ij} represents in the next section.
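The closed form in Eq. 8.25 is easy to check numerically. The sketch below (a toy example with assumed co-occurrence counts, not the book's code) builds the shifted-PMI matrix from vertex-context counts gathered from random walks.

```python
import numpy as np
from collections import Counter

# Hypothetical vertex-context pairs harvested from random walks within a window.
pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (2, 3), (3, 2)]
k = 5  # number of negative samples

pair_counts = Counter(pairs)
vertex_counts = Counter(v for v, _ in pairs)
context_counts = Counter(c for _, c in pairs)
D = len(pairs)
n = 1 + max(max(v, c) for v, c in pairs)

# Shifted PMI matrix of Eq. 8.25: M_ij = log(N(v,c)|D| / (N_v N_c)) - log k.
M = np.full((n, n), -np.inf)  # pairs never observed stay at -inf
for (v, c), n_vc in pair_counts.items():
    M[v, c] = np.log(n_vc * D / (vertex_counts[v] * context_counts[c])) - np.log(k)

print(np.round(M, 3))
```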
It is clear that the method of sampling vertex-context pairs, i.e., the random walk generation, will affect the matrix M. In this section, we will discuss N_v/|D|, N_c/|D|, and N(v, c)/N_v based on an ideal sampling method for the DeepWalk algorithm. Assume that the graph is connected and undirected, and the window size is w. The sampling algorithm is illustrated in Algorithm 8.3. We can easily generalize this sampling method to the directed graph by only adding (\mathcal{W}_i, \mathcal{W}_j) into D.
Ideal vertex-context pair sampling algorithm
Generate an infinitely long random walk \mathcal{W}.
Denote \mathcal{W}_i as the vertex at position i of \mathcal{W}, where i = 0, 1, 2, ...
for i = 0, 1, 2, ... do
  for j ∈ [i + 1, i + w] do
    add (\mathcal{W}_i, \mathcal{W}_j) into D
    add (\mathcal{W}_j, \mathcal{W}_i) into D
  end for
end for

Each appearance of a vertex will be recorded 2w times in D for an undirected graph and w times for a directed graph. Thus, we can figure out that N_{v_i}/|D| is the frequency with which v_i appears in the random walk, which is exactly the PageRank value of v_i. Also note that \frac{N(v_i, v_j)}{N_{v_i}/w} is the expected number of times that v_j is observed in the left/right w neighbors of v_i.

Denote the transition matrix in the PageRank algorithm as P. More formally, let \deg(v_i) be the degree of vertex v_i; then P_{ij} = 1/\deg(v_i) if (v_i, v_j) ∈ E and P_{ij} = 0 otherwise. We use e_i to denote the |V|-dimensional row vector whose entries are all zero except that the ith entry is 1.

Suppose that we start a random walk from vertex v_i and use e_i to denote the initial state. Then e_i P is the distribution over all vertices whose jth entry is the probability that vertex v_i walks to vertex v_j. Hence, the jth entry of e_i P^w is the probability that vertex v_i walks to vertex v_j in exactly w steps, and [e_i (P + P^2 + \cdots + P^w)]_j is the expected number of times that v_j appears in the right w neighbors of v_i. Hence,

\frac{N(v_i, v_j)}{N_{v_i}/w} = [e_i (P + P^2 + \cdots + P^w)]_j,
\qquad
\frac{N(v_i, v_j)}{N_{v_i}} = \frac{[e_i (P + P^2 + \cdots + P^w)]_j}{w}.   (8.30)

This equality also holds for a directed graph.

By setting b_{v_i} = \log 2w for all i, M_{ij} = \log \frac{N(v_i, v_j)}{N_{v_i}/2w} is the logarithm of the expected number of times that v_j appears in the left/right w neighbors of v_i.

By setting b_{v_i} = 0 for all i, M_{ij} = \log \frac{N(v_i, v_j)}{N_{v_i}} = \log \frac{[e_i (P + P^2 + \cdots + P^w)]_j}{w} is the logarithm of the average probability that vertex v_i randomly walks to vertex v_j within w steps.

So far, we have seen many different network representation learning algorithms, and we can figure out some patterns that network representation methods share. We now move forward and see how these patterns match some recent network embedding algorithms.

Most network representation algorithms try to reconstruct, with vertex embeddings, a data matrix generated from the graph. The simplest such matrix is the adjacency matrix. However, recovering the adjacency matrix may not be the best choice. First, real-world networks are mostly very sparse, which means O(|E|) = O(|V|). Therefore, the adjacency matrix will be very sparse as well. Though the sparseness enables efficient algorithms, it can harm the performance of vertex representation learning because of the deficiency of useful information. Second, the adjacency matrix may be noisy and sensitive: a single missing link can completely change the correlation between two vertices.

Hence people seek an alternative matrix to replace the adjacency matrix, though often implicitly. Take DeepWalk as an example; based on the matrix factorization comprehension of DeepWalk, it models the following matrix:

M = \frac{P + P^2 + \cdots + P^w}{w},   (8.31)

where P_{ij} = 1/\deg(v_i) if (v_i, v_j) ∈ E and P_{ij} = 0 otherwise. Compared with the adjacency matrix A, the matrix M modeled by DeepWalk is much denser. Furthermore, the window size parameter w can adjust the density: a larger window size models a denser matrix but will slow down the algorithm.
Hence, the window size w works as a harmonic factor to balance efficiency and effectiveness. On the other hand, the matrix M can alleviate the noise in the adjacency matrix. Consider two similar vertices v_i and v_j: even if the edge between them is missing, they can still have many co-occurrences by appearing inside the same window of the same random walks.

In real-world applications, direct computation of M may have a high time complexity when the window size w grows. Thus, it is essential to choose a proper w. However, the window size w is a discrete parameter, and the matrix M may go from too sparse to too dense when w changes by 1. Here we can see another benefit of random walks: the random walks used by DeepWalk serve as Monte Carlo simulations that approximate the matrix M. The more random walks you generate, the better you can approximate the matrix.

After we choose a matrix to model, we need to correlate the matrix entries with pairs of vertex representations. There are two widely used measurements for vertex pairs: Euclidean distance and inner product. Assume that we want to model the entry M_{ij} given vertex representations v_i and v_j; we can employ
M_{ij} = f(\| v_i - v_j \|),   (8.32)

M_{ij} = f(v_i \cdot v_j),   (8.33)

where the function f can be any reasonable matching function, such as a sigmoid function or a linear function, for our purpose. In practice, the inner product v_i · v_j is used more widely and corresponds to equivalent matrix factorization methods.

The next phase is to design a proper loss function between M_{ij} and f(v_i · v_j). Several loss functions, such as the square loss and the hinge loss, can be employed. One can also design a generative model and maximize the likelihood of matrix M.

The final step of a network representation learning algorithm is parameter learning. The most frequently used parameter learning method is Stochastic Gradient Descent (SGD). Variants of SGD such as AdaGrad and AdaDelta can make the learning phase converge faster. In the next subsection, we will see some recent network representation learning algorithms that follow DeepWalk. We will find that their models match all of the phases above and make innovations in building the matrix M, modifying the function f, and changing the loss function.
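The following sketch instantiates this framework in the simplest possible way: the implicit DeepWalk matrix of Eq. 8.31 as M, the inner product as f, the square loss, and a truncated SVD as the solver. The toy graph and window size are assumptions for illustration, not part of any published implementation.

```python
import numpy as np

def implicit_deepwalk_matrix(A, window):
    """M = (P + P^2 + ... + P^w)/w from Eq. 8.31, with P the transition matrix."""
    P = A / A.sum(axis=1, keepdims=True)
    M, P_k = np.zeros_like(P), np.eye(len(A))
    for _ in range(window):
        P_k = P_k @ P
        M += P_k
    return M / window

def factorize(M, d):
    """Fit v_i . c_j ~ M_ij under the square loss with a truncated SVD."""
    U, S, Vt = np.linalg.svd(M)
    V_embed = U[:, :d] * np.sqrt(S[:d])        # vertex embeddings
    C_embed = Vt[:d, :].T * np.sqrt(S[:d])     # context embeddings
    return V_embed, C_embed

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = implicit_deepwalk_matrix(A, window=3)
V_embed, C_embed = factorize(M, d=2)
print(np.round(M - V_embed @ C_embed.T, 3))    # reconstruction error
```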
We will focus on two network representation learning algorithms, LINE and GraRep [13, 111], in this subsection. They both follow the framework introduced in the last subsection.

Tang et al. [111] proposed a network embedding model named LINE. The LINE algorithm can handle large-scale networks of arbitrary types: (un)directed or weighted. To model the interaction between vertices, LINE models the first-order proximity, which is represented by observed links, and the second-order proximity, which is determined by shared neighbors rather than direct links between vertices.

Before we introduce the details of the algorithm, we can take one step back and see how the idea works. The modeling of first-order proximity, i.e., observed links, is the modeling of the adjacency matrix. As we said in the last subsection, the adjacency matrix is usually too sparse. Hence the modeling of second-order proximity, i.e., vertices with shared neighbors, serves as complementary information to enrich the adjacency matrix and make it denser. The enumeration of all vertex pairs that have common neighbors is time consuming. Thus, it is necessary to design a sampling phase to handle large-scale networks. The sampling phase works like a Monte Carlo simulation to approximate the ideal matrix.
Now we only have two questions: how to define the first-order and second-order proximities and how to define the loss function. In other words, it is equivalent to defining M and the loss function.

The first-order proximity between vertices u and v is defined as the weight w_{uv} of the edge (u, v). If there is no edge between u and v, their first-order proximity is 0.

The second-order proximity between vertices u and v is defined as the similarity between their neighborhood structures. Let p_u = (w_{u,1}, \ldots, w_{u,|V|}) denote the first-order proximities between vertex u and all other vertices. Then the second-order proximity between u and v is defined as the similarity of p_u and p_v. If they have no shared neighbors, their second-order proximity is zero.

Now we can introduce the LINE model more specifically. The joint probability between v_i and v_j is

p(v_i, v_j) = \frac{1}{1 + \exp(-v_i \cdot v_j)},   (8.34)

where v_i and v_j are d-dimensional row vectors that are the representations of vertices v_i and v_j.

To supervise these probabilities, the empirical probability is defined as \hat{p}(i, j) = \frac{w_{ij}}{W}, where W = \sum_{(v_i, v_j) \in E} w_{ij}. Thus our goal is to find vertex embeddings such that \frac{1}{1 + \exp(-v_i \cdot v_j)} approximates \frac{w_{ij}}{W}. Following the idea in the last subsection, this is equivalent to saying v_i \cdot v_j = M_{ij} = -\log(\frac{W}{w_{ij}} - 1).

The loss function between the joint probability p and its empirical probability \hat{p} is

\mathcal{L} = D_{KL}(\hat{p} \,\|\, p),   (8.35)

where D_{KL}(\cdot \| \cdot) is the KL-divergence between two probability distributions.

On the other hand, we define the probability that vertex v_j appears in v_i's context as

p(v_j \mid v_i) = \frac{\exp(c_j \cdot v_i)}{\sum_{k=1}^{|V|} \exp(c_k \cdot v_i)}.   (8.36)

Similarly, the empirical probability is defined as \hat{p}(v_j \mid v_i) = \frac{w_{ij}}{d_i}, where d_i = \sum_{k} w_{ik}, and the loss function is

\mathcal{L} = \sum_i d_i \, D_{KL}(\hat{p}(\cdot \mid v_i) \,\|\, p(\cdot \mid v_i)).   (8.37)

The first-order and second-order proximity embeddings are trained separately, and we concatenate them after the training phase as the final vertex representations.
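LINE optimizes these objectives with edge sampling and negative sampling; the following is a heavily simplified sketch of the first-order objective (Eqs. 8.34 and 8.35) trained by SGD on a toy weighted edge list. The sampling scheme, hyperparameters, and graph are illustrative assumptions, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (0, 2, 0.5)]  # (u, v, weight), toy graph
num_vertices, dim, lr, neg, epochs = 4, 8, 0.05, 2, 200

V = rng.normal(scale=0.1, size=(num_vertices, dim))  # first-order embeddings
weights = np.array([w for _, _, w in edges])
probs = weights / weights.sum()                      # sample edges by weight

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    u, v, _ = edges[rng.choice(len(edges), p=probs)]
    vu, vv = V[u].copy(), V[v].copy()
    # Positive pair: push sigmoid(v_u . v_v) toward 1.
    g = 1.0 - sigmoid(vu @ vv)
    V[u] += lr * g * vv
    V[v] += lr * g * vu
    # Negative pairs: push sigmoid(v_u . v_n) toward 0.
    for n in rng.integers(0, num_vertices, size=neg):
        if n != v:
            vu, vn = V[u].copy(), V[n].copy()
            g = -sigmoid(vu @ vn)
            V[u] += lr * g * vn
            V[n] += lr * g * vu

print(sigmoid(V[0] @ V[1]), sigmoid(V[0] @ V[3]))  # connected pair vs. distant pair
```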
Now we turn to another network representation learning algorithm, GraRep, which directly follows the matrix factorization form of DeepWalk. Recall that we proved that DeepWalk is actually factorizing a matrix M, where M = \log \frac{A + A^2 + \cdots + A^w}{w}.

The GraRep algorithm can be divided into three steps:
• Compute the k-step transition probability matrix A^k for each k = 1, 2, \ldots, K.
• Obtain each k-step representation.
• Concatenate all k-step representations.

GraRep uses a simple idea in the second step, i.e., SVD decomposition of A^k, to get the embeddings. As K gets larger, the matrix M gets denser and thus yields a better representation. However, this algorithm is not very efficient, especially when K gets large.
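A minimal sketch of these three steps (plain NumPy, a toy adjacency matrix, and a plain SVD of each k-step transition matrix; the log transform and negative-value truncation of the published GraRep are simplified away) could look as follows.

```python
import numpy as np

def grarep(A, K, dim):
    """Toy GraRep: SVD of each k-step transition matrix, then concatenate."""
    P = A / A.sum(axis=1, keepdims=True)     # 1-step transition matrix
    P_k = np.eye(len(A))
    reps = []
    for _ in range(K):
        P_k = P_k @ P                         # k-step transition probabilities
        U, S, _ = np.linalg.svd(P_k)
        reps.append(U[:, :dim] * np.sqrt(S[:dim]))  # k-step representation
    return np.concatenate(reps, axis=1)       # |V| x (K * dim)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(grarep(A, K=3, dim=2).shape)  # (4, 6): concatenation of K step-wise embeddings
```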
Different from previous methods that use a shallow neural network model to characterize the network representations, Structural Deep Network Embedding (SDNE) [125] employs a deeper neural model to capture the nonlinearity in vertex embeddings. As shown in Fig. 8.2, the whole model can be divided into two parts: (1) the first part is supervised by Laplacian Eigenmaps, which models the first-order proximity; (2) the second part is an unsupervised deep neural autoencoder, which characterizes the second-order proximity. Finally, the algorithm takes the intermediate layer shared by the supervised part as the network representation.

First, we give a brief introduction to the deep neural autoencoder. A neural autoencoder requires that the output vector be as similar as possible to the input vector. Generally speaking, the output cannot be identical to the input because the dimension of the intermediate layers of the autoencoder is much smaller than that of the input and output layers. That is to say, a deep autoencoder first compresses the input into a low-dimensional intermediate vector and then tries to reconstruct the original input vector from it. Once the deep autoencoder is trained, we can say that the intermediate layer is a good low-dimensional representation of the original input since we can recover the input vector from it.

More formally, we assume the input vector is x_i. Then the hidden representation of each layer is defined as

y_i^{(1)} = \mathrm{Sigmoid}(W^{(1)} x_i + b^{(1)}),
\qquad
y_i^{(k)} = \mathrm{Sigmoid}(W^{(k)} y_i^{(k-1)} + b^{(k)}), \quad k = 2, \ldots, K,   (8.38)
Fig. 8.2
The architecture of the structural deep network embedding model

where W^{(k)} and b^{(k)} are the weight matrix and bias vector of the kth layer. We assume that the hidden representation of the Kth layer has the minimum dimension. After obtaining y_i^{(K)}, we can get the output \hat{x}_i by reversing the calculation process. The optimization objective of the autoencoder is to minimize the difference between the input vector x_i and the output vector \hat{x}_i:

\mathcal{L}(W, b) = \sum_{i=1}^{n} \| \hat{x}_i - x_i \|^2,   (8.39)

where n is the number of input instances.

Back to the network representation problem, SDNE applies the autoencoder to every vertex. The input vector x_i of each vertex v_i is defined as follows: if vertices v_i and v_j are connected, then the jth entry x_{ij} > 0; otherwise x_{ij} = 0. For an unweighted graph, if (v_i, v_j) ∈ E, then x_{ij} = 1. The intermediate layer y_i^{(K)} can then be seen as the low-dimensional representation of vertex v_i. Also note that there are many more zero entries than positive entries in the input vectors due to the sparsity of real-world networks; therefore, the loss on positive entries should be emphasized, and the final optimization objective of the second-order proximity modeling can be written as

\mathcal{L}_{2nd} = \sum_{i=1}^{|V|} \| (\hat{x}_i - x_i) \odot b_i \|^2,   (8.40)
where \odot denotes element-wise multiplication, b_{ij} = 1 if x_{ij} = 0, and b_{ij} = \beta > 1 if x_{ij} > 0. For the first-order proximity, SDNE uses a Laplacian Eigenmaps-style loss over the intermediate representations:

\mathcal{L}_{1st} = \sum_{i,j=1}^{|V|} x_{ij} \| y_i^{(K)} - y_j^{(K)} \|^2.   (8.41)

Finally, the overall loss function, including a regularization term, is

\mathcal{L} = \mathcal{L}_{2nd} + \alpha \mathcal{L}_{1st} + \lambda \mathcal{L}_{reg},   (8.42)

where \alpha and \lambda are harmonic hyperparameters and the regularization loss \mathcal{L}_{reg} is the sum of squares of all parameters. The model can be optimized by back-propagation in the standard neural network way. After the training process, y_i^{(K)} is taken as the representation of vertex v_i.

Existing network representation learning algorithms mostly focus on undirected graphs. Most of these methods cannot handle directed graphs well because they do not accurately characterize the asymmetric property. High-Order Proximity preserved Embedding (HOPE) [89] is proposed to preserve high-order proximities of large-scale graphs and to capture the asymmetric transitivity. The algorithm further derives a general formulation that covers multiple popular high-order proximity measurements and provides an approximate algorithm with an upper bound on the RMSE (Root Mean Squared Error).

Network embedding assumes that the more and the shorter the paths from v_i to v_j, the more similar their representation vectors should be. In particular, the algorithm assigns two vectors, i.e., a source and a target vector, to each vertex. We denote the adjacency matrix as A and the vertex representations as U = [U_s, U_t], where U_s ∈ R^{|V|×d} and U_t ∈ R^{|V|×d} are the source and target vertex embeddings, respectively. We define a high-order proximity matrix S, where S_{ij} is the proximity between v_i and v_j. Then our goal is to approximate the matrix S with the product of U_s and U_t. The optimization objective can be written as

\min_{U_s, U_t} \| S - U_s U_t^{\top} \|_F^2.   (8.43)

Many high-order proximity measurements that characterize the asymmetric transitivity share a general formulation that can be used to approximate the proximities:

S = M_g^{-1} M_l,   (8.44)

where M_g and M_l are both polynomials of matrices. We take three commonly used high-order proximity measurements to illustrate the formula.

• Katz Index. The Katz Index is a weighted summation over the path set between two vertices. The computation of the Katz Index can be written recurrently:

S := \beta A S + \beta A,   (8.45)

where the decay parameter \beta controls how fast the weight decreases as the length of the paths grows.

• Rooted PageRank. For rooted PageRank, S_{ij} is the probability that a random walk from vertex v_i will locate at v_j in the stable state. The formula can be written as

S := \alpha S P + (1 - \alpha) I,   (8.46)

where \alpha is the probability that the random walk returns to its start point and P is the transition matrix.

• Common Neighbors. S_{ij} is the number of vertices that are the target of an edge from v_i and the source of an edge to v_j. The matrix S can be expressed as

S = A^2.   (8.47)

For the three high-order proximity measurements introduced above, we summarize their equivalent forms S = M_g^{-1} M_l in Table 8.3.

A simple idea for approximating S with a product of matrices is SVD decomposition. However, the direct computation of the SVD decomposition of matrix S has a complexity of O(|V|^3). By writing matrix S as M_g^{-1} M_l, we do not need to compute the matrix S directly. Instead, we can apply a JDGSVD decomposition to M_g and M_l independently and then use the results to derive the decomposition of S.
The complexity reduces to O(|E| d) for each iteration of JDGSVD.

Community Preserving Network Representation. While previous methods aim at preserving the microscopic structure of a network, such as the first- and second-order
Table 8.3
General formulas for high-order proximity measurements

Measurement       | M_g      | M_l
Katz index        | I − βA   | βA
Rooted PageRank   | I − αP   | (1 − α)I
Common neighbors  | I        | A²
proximities, Wang et al. [127] proposed Modularized Nonnegative Matrix Factorization (M-NMF), which encodes the mesoscopic community structure information into the network representations. The basic idea is to take the modularity as part of the optimization function. Recall that the modularity is formulated in Eq. 8.15 and S is the community indicator matrix; the modularity part of the loss function then minimizes −tr(S^{\top} B S).

Similar to previous methods, M-NMF also factorizes an affinity matrix that encodes the first-order and second-order proximities. Specifically, M-NMF takes the adjacency matrix A as the first-order proximity matrix A^{(1)} and computes the cosine similarity of the corresponding rows of A as the second-order proximity matrix A^{(2)}. M-NMF uses a mixture of A^{(1)} and A^{(2)} as the similarity matrix. To conclude, the overall optimization function of M-NMF is

\min_{M, U, S, C} \big\| A^{(1)} + \eta A^{(2)} - M U^{\top} \big\|_F^2 + \alpha \big\| S - U C^{\top} \big\|_F^2 - \beta \, \mathrm{tr}(S^{\top} B S),   (8.48)

where S ∈ R^{|V|×k}, M, U ∈ R^{|V|×m}, C ∈ R^{k×m}, M_{ij}, U_{ij}, S_{ij}, C_{ij} ≥ 0 for all i, j, tr(S^{\top} S) = |V|, and α, β, η > 0 are hyperparameters; \| \cdot \|_F denotes the Frobenius norm. Here the similarity matrix A^{(1)} + \eta A^{(2)} is factorized into two nonnegative matrices M and U, and the community representation matrix C in the second term bridges the matrix factorization part and the modularity part.

A concurrent algorithm, Community-enhanced NRL (CNRL) [116, 117], is a pipeline algorithm that first learns the vertex-community assignment and then reforms the DeepWalk algorithm to incorporate community information. Specifically, in the first phase, CNRL makes an analogy between community detection and topic modeling. CNRL starts by generating random walks and feeds these vertex sequences into the Latent Dirichlet Allocation (LDA) algorithm. By taking a vertex as a word and a topic as a community, CNRL obtains a soft assignment of vertex-community membership. In the second phase, both the embedding of a center vertex and the embedding of its community are used to predict the neighborhood vertices in the random walk sequences. The framework is illustrated in Fig. 8.3.

We now present the network embedding algorithm TADW, which further generalizes the matrix factorization framework to take advantage of text information. Text-Associated DeepWalk (TADW) [136] incorporates the text features of vertices into network representation learning under the framework of matrix factorization. The matrix factorization view of DeepWalk enables the introduction of text information into matrix factorization for network representation learning. Figure 8.4 shows the main idea of TADW: factorize the vertex affinity matrix M ∈ R^{|V|×|V|} into the product of three matrices, W ∈ R^{k×|V|}, H ∈ R^{k×f_t}, and the text feature matrix T ∈ R^{f_t×|V|}. Then TADW concatenates W and HT as the 2k-dimensional representations of vertices.
Fig. 8.3
The architecture of community preserving network embedding model
Fig. 8.4
The architecture of the text-associated DeepWalk model

Then the question is how to build the vertex affinity matrix M and how to extract the text feature matrix T from the text information. Following the proof of the matrix factorization form of DeepWalk, TADW sets the vertex affinity matrix M as a tradeoff between speed and accuracy: it factorizes the matrix M = (A + A^2)/2, where A is the row-normalized adjacency matrix. For the text feature matrix T, TADW first constructs the TF-IDF matrix from the text and then reduces the dimension of the TF-IDF matrix to 200 via SVD decomposition.
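A minimal sketch of this preprocessing (NumPy and scikit-learn assumed, with toy documents and a toy graph, and the 200-dimensional reduction replaced by a small value) might look as follows; it only prepares the inputs to the factorization, not the factorization itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy graph (adjacency) and one short text per vertex (hypothetical data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
texts = ["graph embedding survey", "deep learning on graphs",
         "matrix factorization methods", "text features for networks"]

# Vertex affinity matrix M = (A_hat + A_hat^2) / 2 with row-normalized A_hat.
A_hat = A / A.sum(axis=1, keepdims=True)
M = (A_hat + A_hat @ A_hat) / 2

# Text feature matrix T: TF-IDF reduced by SVD (200 dims in the paper; 2 here).
tfidf = TfidfVectorizer().fit_transform(texts)
T = TruncatedSVD(n_components=2).fit_transform(tfidf).T   # shape f_t x |V|

print(M.shape, T.shape)  # inputs to the factorization M ~ W^T H T (Eq. 8.49 below)
```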
Fig. 8.5
The architecture of TransNet model
Formally, the model of TADW minimizes the following optimization function:

\min_{W, H} \| M - W^{\top} H T \|_F^2 + \lambda (\| W \|_F^2 + \| H \|_F^2),   (8.49)

where \lambda is the regularization factor. The parameters are optimized by updating W and H iteratively via conjugate gradient descent.

TransNet. Most existing NRL methods neglect the semantic information of edges and simplify an edge to a binary or continuous value. The TransNet algorithm [119] considers the label information on the edges instead of the nodes. In particular, TransNet is based on the translation mechanism shown in Fig. 8.5.

In the setting of TransNet, each edge has a number of binary labels on it. The loss function of TransNet then consists of two parts: one part is the translation loss, which measures the distance between u + e and v, where u, e, v stand for the embeddings of the head vertex, the edge, and the tail vertex; the other part is the reconstruction loss of an autoencoder, which encodes the labels of an edge into its embedding e and restores the labels from the embedding. After the learning phase, we can compute an edge embedding by subtracting the two vertex embeddings and use the decoder part of the autoencoder to predict the labels of an unobserved edge.

Semi-supervised Network Representation. In this part, we introduce several semi-supervised network representation learning methods that are applied to heterogeneous networks. All of these methods learn vertex embeddings and their classification labels simultaneously.
(1) LSHM
The first algorithm, LSHM (Latent Space Heterogeneous Model) [52], follows the manifold assumption that two connected nodes tend to have similar node embeddings. Thus, the regularization loss that forces connected nodes to have similar representations can be formulated as

\sum_{i,j} w_{ij} \| v_i - v_j \|^2,   (8.50)

where w_{ij} is the weight of edge (v_i, v_j).

As a semi-supervised representation learning algorithm, LSHM also needs to predict the classification labels for unlabeled vertices. To train the classifiers, LSHM computes the loss over the observed labels as

\sum_{i} \Delta(f_{\theta}(v_i), y_i),   (8.51)

where f_{\theta}(v_i) is the predicted label for vertex v_i, y_i is the observed label of v_i, and \Delta(\cdot, \cdot) is the loss function between the predicted label and the ground truth label. Specifically, f_{\theta}(\cdot) is a linear function and \Delta(\cdot, \cdot) is set to the hinge loss.

Finally, the objective function is

\mathcal{L}(V, \theta) = \sum_{i} \Delta(f_{\theta}(v_i), y_i) + \lambda \sum_{i,j} w_{ij} \| v_i - v_j \|^2,   (8.52)

where \lambda is a harmonic hyperparameter. The algorithm is optimized via stochastic gradient descent.

(2) node2vec. Node2vec [38] modifies DeepWalk by changing the generation of the random walks. As shown in the previous subsections, DeepWalk generates rooted random walks by choosing the next vertex according to a uniform distribution, which could be improved by a well-designed random walk generation strategy.

Node2vec first considers two extreme cases of vertex visiting sequences: Breadth-First Search (BFS) and Depth-First Search (DFS). By restricting the search to nearby nodes, BFS characterizes the nearby neighborhoods of the center vertex and obtains a microscopic view of the neighborhood of every node. Vertices in the sampled neighborhoods of BFS tend to repeat many times, which can reduce the variance in characterizing the distribution of neighboring vertices of the source node. In contrast, the nodes sampled by DFS reflect a macro-view of the neighborhood, which is essential for inferring communities based on homophily.

Node2vec designs a neighborhood sampling strategy that can smoothly interpolate between BFS and DFS. More specifically, consider a random walk that has just traversed edge (t, v) and now stays at vertex v. The walk evaluates the transition
probabilities of the edges (v, x) to decide the next step. Node2vec sets the unnormalized transition probability to \pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}, where

\alpha_{pq}(t, x) =
\begin{cases}
\frac{1}{p} & \text{if } d_{tx} = 0, \\
1 & \text{if } d_{tx} = 1, \\
\frac{1}{q} & \text{if } d_{tx} = 2,
\end{cases}   (8.53)

and d_{tx} denotes the shortest path distance between vertices t and x. Here p and q are parameters that guide the random walk and control how fast the walk explores and leaves the neighborhood of the starting vertex. A low p increases the probability of revisiting a vertex and makes the random walk focus on local neighborhoods, while a low q encourages the random walk to explore farther vertices. After the generation of the random walks, the rest of the algorithm is almost the same as that of DeepWalk.
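The biased second-order walk of Eq. 8.53 is easy to sketch. The following toy implementation (NetworkX assumed, unweighted graph, and without the alias-sampling optimization of the original paper) chooses the next vertex given the previous one.

```python
import random
import networkx as nx

def node2vec_step(graph, prev, curr, p, q):
    """Sample the next vertex of a node2vec walk that came from `prev` to `curr`."""
    neighbors = list(graph.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                      # d_tx = 0: return to the previous vertex
            weights.append(1.0 / p)
        elif graph.has_edge(prev, x):      # d_tx = 1: stay close to the previous vertex
            weights.append(1.0)
        else:                              # d_tx = 2: move farther away
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(graph, start, length, p=1.0, q=1.0):
    walk = [start, random.choice(list(graph.neighbors(start)))]
    while len(walk) < length:
        walk.append(node2vec_step(graph, walk[-2], walk[-1], p, q))
    return walk

G = nx.karate_club_graph()
print(node2vec_walk(G, start=0, length=10, p=0.5, q=2.0))
```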
(3) MMDW. Max-Margin DeepWalk (MMDW) [118] utilizes the max-margin strategy of SVMs to generalize the DeepWalk algorithm to semi-supervised learning. Specifically, MMDW employs the matrix factorization form of DeepWalk proved in TADW [136] and further adds a max-margin constraint, which requires that the embeddings of nodes with different labels be far away from each other. The optimization function can be written as

\min_{X, Y, W, \xi} \mathcal{L} = \min_{X, Y, W, \xi} \mathcal{L}_{DW} + \| W \|_2^2 + C \sum_{i=1}^{T} \xi_i,
\quad
\text{s.t.} \;\; w_{l_i}^{\top} x_i - w_j^{\top} x_i \ge e_{ji} - \xi_i, \;\; \forall i, j,   (8.54)

where W = [w_1, w_2, \ldots, w_m]^{\top} is the weight matrix of the SVM, \xi denotes the slack variables, e_{ji} = 1 if l_i \neq j and e_{ji} = 0 otherwise, and \mathcal{L}_{DW} is the matrix factorization form of the DeepWalk loss function:

\mathcal{L}_{DW} = \| M - X^{\top} Y \|_2^2 + \lambda (\| X \|_2^2 + \| Y \|_2^2),   (8.55)

which was introduced in the previous sections.

Figure 8.6 shows the visualization results of the DeepWalk and MMDW algorithms on the Wiki dataset [103]. We can see that the embeddings of nodes from different classes are more separable with the help of semi-supervised max-margin representation learning.

(4) PTE
Another algorithm, called PTE (Predictive Text Embedding) [110], focuses on text networks such as the bibliography network, where a paper is a vertex and the citation relationships between papers form the edges. PTE considers the network structure together with plain text and observed vertex labels. PTE proposes a semi-supervised framework to learn vertex representations and predict unobserved vertex labels.
Fig. 8.6
A visualization of t-SNE 2D representations on Wiki dataset (left: DeepWalk, right:MMDW) [118]
A text network is divided into three bipartite networks: the word-word, word-document, and word-label networks. We introduce the definitions of the three networks in more detail.

For the word-word network, the weight w_{ij} of the edge between words v_i and v_j is defined as the number of times the two words co-occur in the same context window. For the word-document network, the weight w_{ij} between word v_i and document d_j is defined as the number of times v_i appears in document d_j. For the word-label network, the weight w_{ij} of the edge between word v_i and class c_j is defined as w_{ij} = \sum_{d: l_d = j} n_{di}, where n_{di} is the term frequency of word v_i in document d and l_d is the class label of document d.

Then, following the previous work LINE, given a bipartite network G = (V_A ∪ V_B, E), the conditional probability of generating v_i ∈ V_A from v_j ∈ V_B is defined as

P(v_i \mid v_j) = \frac{\exp(v_j \cdot v_i)}{\sum_{k \in V_A} \exp(v_k \cdot v_i)}.   (8.56)

Similar to the LINE model, the loss function is defined as the KL-divergence between the empirical distribution and the conditional distribution. The optimization objective can be further formulated as

\mathcal{L} = -\sum_{(v_i, v_j) \in E} w_{ij} \log P(v_i \mid v_j).   (8.57)
Then the final objective is obtained by summing over all three bipartite networks:

\mathcal{L}_{pte} = \mathcal{L}_{ww} + \mathcal{L}_{wd} + \mathcal{L}_{wl},   (8.58)

where

\mathcal{L}_{ww} = -\sum_{(v_i, v_j) \in E_{ww}} w_{ij} \log P(v_i \mid v_j),   (8.59)

\mathcal{L}_{wd} = -\sum_{(v_i, d_j) \in E_{wd}} w_{ij} \log P(v_i \mid d_j),   (8.60)

\mathcal{L}_{wl} = -\sum_{(v_i, l_j) \in E_{wl}} w_{ij} \log P(v_i \mid l_j).   (8.61)

The optimization can then be done by stochastic gradient descent.

As shown in the spectral clustering methods, people make efforts to learn a community indicator matrix based on modularity and the normalized graph cut. The continuous community indicator matrix can be seen as a k-dimensional vertex representation, where k is the number of communities. Note that modularity and graph cut are defined for nonoverlapping communities. By choosing an alternative cost function for overlapping communities, the idea can also work for overlapping community detection. In this subsection, we will introduce several community detection algorithms. These algorithms start by learning a k-dimensional nonnegative vertex-community affinity matrix and then derive a hard community assignment for the vertices based on this matrix. Therefore, the key procedure of these algorithms can be regarded as unsupervised k-dimensional nonnegative vertex embedding learning.

BIGCLAM [140] is an overlapping community detection method. It assumes that the matrix F ∈ R^{|V|×k} is the vertex-community affinity matrix, where F_{vc} is the strength between vertex v and community c. The matrix F is nonnegative, and F_{vc} = 0 indicates that vertex v is not affiliated with community c. BIGCLAM then models the probability that vertex v_i connects to v_j given the affinity matrix F. More specifically, given the matrix F, BIGCLAM generates an edge between vertices v_i and v_j with probability

P(v_i, v_j) = 1 - \exp(-F_{v_i} \cdot F_{v_j}),   (8.62)

where F_{v_i} is the row of matrix F corresponding to vertex v_i and can be seen as the representation of v_i. Note that the probability P(v_i, v_j) increases with F_{v_i} \cdot F_{v_j}^{\top} = \sum_{c} F_{v_i, c} F_{v_j, c}, which indicates that the more communities a pair of vertices share, the more likely they are to be connected.

For the case that F_{v_i} \cdot F_{v_j} = 0, BIGCLAM adds a background probability \epsilon = \frac{2|E|}{|V|(|V|-1)} to the pair of vertices to avoid a zero probability.

BIGCLAM then tries to maximize the log-likelihood of the graph G = (V, E):

\mathcal{O}(F) = \sum_{i,j: (v_i, v_j) \in E} \log P(v_i, v_j) + \sum_{i,j: (v_i, v_j) \notin E} \log (1 - P(v_i, v_j)),   (8.63)

which can be reformulated as

\mathcal{O}(F) = \sum_{i,j: (v_i, v_j) \in E} \log \big(1 - \exp(-F_{v_i} \cdot F_{v_j})\big) - \sum_{i,j: (v_i, v_j) \notin E} F_{v_i} \cdot F_{v_j}.   (8.64)

The parameters F are learned by projected gradient descent. Note that the training objective can be regarded as a variant of nonnegative matrix factorization: the maximization of the log-likelihood function is an approximation of the adjacency matrix A by F F^{\top}. Compared with an L2-norm loss function, the gradient of Eq. 8.64 can be computed more efficiently for a sparse matrix A, which is the most common case in real-world datasets.

The model can also be generalized to the asymmetric case [141], that is, replacing Eq. 8.62 by

P(v_i, v_j) = 1 - \exp(-F_{v_i} \cdot H_{v_j}),   (8.65)

where H is another matrix of the same size as F. The generative model can also consider the attributes of vertices by adding attribute terms to Eq. 8.62 [79].

Different from the previous algorithms that focus on machine learning tasks, the algorithms introduced in this subsection are designed for visualization. As networks are a commonly used data structure, the visualization of networks is an important task. The dimension of the vertex representations is usually 2 or 3 so that the graph can be drawn.

Representation learning for network visualization generally follows these aesthetic criteria [30]:
• Distribute the vertices evenly in the frame.
• Minimize edge crossings.
• Make edge lengths uniform.
• Reflect inherent symmetry.
• Conform to the frame.

Following these criteria, graph visualization algorithms build a force-directed graph drawing framework. The basic assumption is that there is a spring between each pair of vertices. The optimization objective is then to minimize the energy of the graph according to Hooke's law:
E = \sum_{i,j} k_{ij} (\| v_i - v_j \| - l_{ij})^2,   (8.66)

where k_{ij} is the spring constant, v_i is the position of vertex v_i, and l_{ij} is the length of the shortest path between vertices v_i and v_j. The intuition is straightforward: close vertices should have close positions in the drawing. Several algorithms have been proposed to improve this framework [34, 54, 60] by changing the setting of the spring constant k_{ij} or the energy function. The parameters can be easily learned via gradient descent.

Yang et al. [137] summarize several existing NRL methods into a unified two-step framework, consisting of proximity matrix construction and dimension reduction. They conclude that an NRL method can be improved by exploring higher order proximities when building the proximity matrix. They then propose the Network Embedding Update (NEU) algorithm, which implicitly approximates higher order proximities with a theoretical approximation bound and can be applied to any NRL method to enhance its performance. NEU yields consistent and significant improvements over several NRL methods with almost negligible extra running time.

The two-step framework is summarized as follows:
Step 1: Proximity Matrix Construction.
Compute a proximity matrix M ∈ R^{|V|×|V|}, which encodes the information of the k-order proximity matrices for k = 1, 2, \ldots, K. For example, M = \frac{1}{K} A + \frac{1}{K} A^2 + \cdots + \frac{1}{K} A^K stands for an average combination of the k-order proximity matrices for k = 1, \ldots, K. The proximity matrix M is usually represented by a polynomial of the normalized adjacency matrix A of degree K, and we denote this polynomial by f(A) ∈ R^{|V|×|V|}. Here the degree K of the polynomial f(A) corresponds to the maximum order of the proximities encoded in the proximity matrix. Note that the storage and computation of the proximity matrix M does not necessarily take O(|V|^2) time, because we only need to save and compute the nonzero entries.

Step 2: Dimension Reduction.
Find a network embedding matrix V ∈ R^{|V|×d} and a context embedding matrix C ∈ R^{|V|×d} so that the product V C^{\top} approximates the proximity matrix M. Different algorithms may employ different distance functions to minimize the distance between M and V C^{\top}. For example, we can naturally use the norm of the matrix M − V C^{\top} as the distance and minimize it.

Spectral Clustering, DeepWalk, and GraRep can all be formalized into this two-step framework. We now focus on the first step and study how to define the right proximity matrix for NRL. We summarize the comparison among Spectral Clustering (SC), DeepWalk, and GraRep in Table 8.4 and draw the following observations.

Table 8.4
Comparisons among three NRL methods

                   SC         DeepWalk                      GraRep
Proximity matrix   L          (1/K) Σ_{k=1}^{K} A^k         A^k, k = 1, ..., K
Computation        Accurate   Approximate                   Accurate
Scalability        Yes        Yes                           No
Performance        Low        Middle                        High
Observation 8.1
Modeling a higher order and accurate proximity matrix can improve the quality of network representations. In other words, NRL can benefit from exploring a polynomial proximity matrix f(A) of a higher degree.
From the development of NRL methods, it can be seen that DeepWalk outperforms Spectral Clustering because DeepWalk considers higher order proximity matrices, which provide information complementary to the lower order ones. GraRep outperforms DeepWalk because GraRep accurately calculates the k-order proximity matrices rather than approximating them by Monte Carlo simulation as DeepWalk does. Observation 8.2
Accurate computation of a high-order proximity matrix is not feasible for large-scale networks.
The major drawback of GraRep is the computational complexity of calculating the accurate k-order proximity matrices. In fact, the computation of a high-order proximity matrix takes O(|V|²) time, and the time complexity of the SVD decomposition also increases as the k-order proximity matrix gets denser when k grows. In summary, a time complexity of O(|V|²) is too expensive to handle large-scale networks.
The first observation motivates exploring higher order proximity matrices in NRL models, but the second observation indicates that accurate inference of higher order proximity matrices is not acceptable. Therefore, it becomes important to learn network embeddings from approximate higher order proximity matrices efficiently. To be more efficient, the network representations that already encode the information of lower order proximity matrices can be used as a basis to avoid repeated computations. The problem is formalized below.
Problem Formalization. Assume that we have the normalized adjacency matrix A as the first-order proximity matrix, a network embedding V, and a context embedding C, where V, C ∈ R^{|V|×d}. Suppose that the embeddings V and C are learned by the above NRL framework, which indicates that the product VC^⊤ approximates a polynomial proximity matrix f(A) of degree K. The goal is to learn better representations V' and C' whose product approximates a polynomial proximity matrix g(A) with a higher degree than f(A). Also, the algorithm should run in time linear in |V|. Note that the lower bound of the time complexity is O(|V|d), which is the size of the embedding matrices.
There is a simple, efficient, and effective iterative updating algorithm to solve the above problem.
Method. Given a hyperparameter λ ∈ (0, 1] and the normalized adjacency matrix A, we update V and C as follows:

V' = V + λ A V,   C' = C + λ A^⊤ C.   (8.67)

The time complexity of computing AV and A^⊤C is O(|V|d) because the matrix A is sparse and has O(|V|) nonzero entries. Thus the overall time complexity of one iteration of operation (Eq. 8.67) is O(|V|d).
Recall that the product of the previous embeddings V and C approximates a polynomial proximity matrix f(A) of degree K. It can be proved that the algorithm learns better embeddings V' and C', where the product V'C'^⊤ approximates a polynomial proximity matrix g(A) of degree K + 2.
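To make the update concrete, the following is a minimal NumPy sketch of the NEU step in Eq. 8.67 applied for several rounds; the function name neu_update, the default λ, and the rounds argument are illustrative assumptions rather than part of the original algorithm description. The normalized adjacency matrix can be passed as a scipy.sparse matrix so that each round stays linear in the number of edges.

```python
import numpy as np

def neu_update(V, C, A_norm, lam=0.5, rounds=1):
    """NEU update (Eq. 8.67): V' = V + lam * A V, C' = C + lam * A^T C.

    V, C   : |V| x d embedding and context matrices produced by any NRL method.
    A_norm : |V| x |V| normalized adjacency matrix (dense or scipy.sparse).
    lam    : weight of the propagated term, lam in (0, 1].
    rounds : number of times the update is repeated.
    """
    for _ in range(rounds):
        V = V + lam * (A_norm @ V)      # propagate embeddings along edges
        C = C + lam * (A_norm.T @ C)
    return V, C
```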
Theorem
Denote the network and context embeddings by V and C, and suppose that the approximation between VC^⊤ and the proximity matrix M = f(A) is bounded by r = ‖f(A) − VC^⊤‖_∞, where f(·) is a polynomial of degree K. Then the product of the updated embeddings V' and C' from Eq. 8.67 approximates the polynomial g(A) = f(A) + 2λ A f(A) + λ² A² f(A) of degree K + 2 with approximation bound r' = (1 + 2λ + λ²) r ≤ 4r. Proof
Assume that S = f(A) − VC^⊤ and thus r = ‖S‖_∞. Then

‖g(A) − V'C'^⊤‖_∞
= ‖g(A) − (V + λAV)(C^⊤ + λC^⊤A)‖_∞
= ‖g(A) − VC^⊤ − λAVC^⊤ − λVC^⊤A − λ²AVC^⊤A‖_∞
= ‖S + λAS + λSA + λ²ASA‖_∞
≤ ‖S‖_∞ + λ‖A‖_∞‖S‖_∞ + λ‖S‖_∞‖A‖_∞ + λ²‖A‖_∞‖S‖_∞‖A‖_∞
= r + 2λr + λ²r,   (8.68)

where the second to last equality replaces g(A) and f(A) − VC^⊤ by the definitions of g(A) and S, and the last equality uses the fact that ‖A‖_∞ = max_i Σ_j |A_{ij}| equals 1 for the normalized adjacency matrix.
In the experimental settings, it is assumed that the weight of lower order proximities should be larger than that of higher order proximities because they are more directly related to the original network. Therefore, given g(A) = f(A) + λ₁ A f(A) + λ₂ A² f(A), we have 1 ≥ λ₁ ≥ λ₂ > 0 with λ₁, λ₂ ∈ (0, 1]. The proof indicates that the updated embeddings can implicitly approximate a polynomial g(A) of 2 more degrees, within 4 times the matrix infinity norm bound of the previous embeddings. □
Algorithm. The update Eq. 8.67 can be further generalized in two directions. First, we can update the embeddings V and C according to Eq. 8.69:

V' = V + λ₁ A V + λ₂ A (A V),   C' = C + λ₁ A^⊤ C + λ₂ A^⊤ (A^⊤ C).   (8.69)

The time complexity is still O(|V|d), but Eq. 8.69 can obtain a higher proximity matrix approximation than Eq. 8.67 in one iteration. More complex update formulas that explore even higher proximities than Eq. 8.69 can also be applied, but Eq. 8.69 is used in the current experiments as a cost-effective choice.
Another direction is that the update equation can be processed for T rounds to obtain a higher proximity approximation. However, the approximation bound would grow exponentially as the number of rounds T grows, and thus the update cannot be done infinitely. Note that the update operations of V and C are completely independent. Therefore, only updating the network embedding V is enough for NRL. The above algorithm (NEU) avoids an accurate computation of high-order proximity matrices but can yield network embeddings that actually approximate high-order proximities. Hence, this algorithm can improve the quality of network embeddings efficiently. Intuitively, Eqs. 8.67 and 8.69 allow the learned embeddings to propagate further to their neighbors, and hence proximities over longer distances between vertices will be embedded.
In this part, we will introduce common applications for network representation learning and their evaluation metrics.
Multi-label classification is the most widely used evaluation task for network representation learning. The representations of vertices are treated as vertex features and fed to classifiers to predict vertex labels. More formally, we assume that there are K labels in total. The vertex-label relationship can be expressed as a binary matrix M ∈ {0, 1}^{|V|×K}, where M_{ij} = 1 if v_i has the j-th label and M_{ij} = 0 otherwise. For the evaluation task, we set a training ratio which indicates what percentage of vertices have observed labels. Then our goal is to predict the labels of the vertices in the test set.
For unsupervised network representation learning algorithms, the labels of the training set are not used for embedding learning. The network representations are fed to classifiers such as SVM or logistic regression, and each label has its own classifier. Semi-supervised learning methods take the observed vertex labels into account during representation learning, and these algorithms come with their own specific classifiers for label prediction.
Once the label prediction is done, we can compute the evaluation metrics. For multiclass classification, we assume that the number of correctly predicted vertices is |V_r|. Then the classification accuracy is defined as the ratio of correctly
predicted vertices, which can be formulated as |V_r| / |V|. For multi-label classification, precision, recall, and F1 are the most popular metrics, which are computed as follows:

Precision = N_{correctly predicted labels} / N_{predicted labels},
Recall = N_{correctly predicted labels} / N_{unobserved labels},   (8.70)
F1-Score = 2 × Precision × Recall / (Precision + Recall).

Link prediction is another important evaluation task for network representation learning because a good network embedding should have the ability to model the affinity between vertices. For evaluation, we randomly pick a subset of edges as the training set and leave the rest as the test set. Cross-validation can also be employed for training and testing. To make link predictions given the vertex representations, we first need to evaluate the strength of a pair of vertices. The strength between two vertices is evaluated by computing the similarity between their representations. This similarity is usually computed by cosine similarity, inner product, or square loss, depending on the algorithm. For example, if an algorithm uses ‖V_i − C_j‖ in its objective function, then the square loss should be used to measure the similarity between vertex representations. After we get the similarities of all unobserved links, we can rank them for link prediction. There are two significant metrics for link prediction: the area under the receiver operating characteristic curve (AUC) and precision.

AUC. The AUC value is the probability that a randomly chosen missing link has a higher score than a randomly chosen nonexistent link. For implementation, we randomly select a missing link and a nonexistent link and compare their similarity scores. Assume that among n independent comparisons there are n' times that the missing link has a higher score and n'' times that they have the same score. Then the AUC value is

AUC = (n' + 0.5 n'') / n.   (8.71)

Note that for a random network representation, the AUC value should be 0.5.

Precision. Given the ranking of all the non-observed links, we take the links with the top-L highest scores as the predicted ones. Assume that L_r of them are indeed missing links; then the precision is defined as L_r / L.

For a network representation based community detection algorithm, we first need to convert the nonnegative vertex representations into hard assignments of communities. Assume that we have a network representation matrix V ∈ R_+^{|V|×k}, where row i of V is the nonnegative embedding of vertex v_i. For community detection, we regard each dimension of the embeddings as a community. That is to say, V_{ij} denotes the affinity between vertex v_i and community c_j. For each column of matrix V, we set a threshold Δ, and the vertices with affinity scores higher than the threshold are considered members of the corresponding community. The threshold can be set in various ways. For example, we can set Δ so that a vertex belongs to a community c if the node is connected to the other members of c with an edge probability higher than 1/N [140]:

1/N ≤ 1 − exp(−Δ²),   (8.72)

which indicates that Δ = √(−log(1 − 1/N)).
For evaluation metrics, we have two choices: modularity and matching score.

Modularity. Recall that the modularity Q of a graph is defined as

Q = (1 / (2|E|)) Σ_{i,j} [A_{ij} − deg(v_i) deg(v_j) / (2|E|)] δ(v_i, v_j),   (8.73)

where δ(v_i, v_j) = 1 if v_i and v_j belong to the same community and δ(v_i, v_j) = 0 otherwise.
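As a concrete illustration of Eq. 8.73, the following NumPy sketch computes the modularity of an undirected graph from its adjacency matrix and a hard community assignment; the function name and input conventions are illustrative assumptions, not taken from any particular reference implementation.

```python
import numpy as np

def modularity(A, communities):
    """Modularity Q (Eq. 8.73) for an undirected, unweighted graph.

    A           : |V| x |V| symmetric 0/1 adjacency matrix.
    communities : length-|V| array; communities[i] is the community id of v_i.
    """
    degrees = A.sum(axis=1)                                  # deg(v_i)
    two_m = A.sum()                                          # 2|E| for an undirected graph
    same = communities[:, None] == communities[None, :]      # delta(v_i, v_j)
    expected = np.outer(degrees, degrees) / two_m            # deg(v_i) deg(v_j) / 2|E|
    return ((A - expected) * same).sum() / two_m

# Example: two triangles joined by a single edge form two clear communities.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # about 0.36, well above 0
```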
Matching Score.
This is a more sophisticated evaluation metric for community detection. To compare a set of ground truth communities C* to a set of detected communities C, we first match each detected community to the most similar ground truth community. On the other side, we also find the most similar detected community for each ground truth community. Then the final performance is evaluated by the average of both sides:

(1 / (2|C*|)) Σ_{c*_i ∈ C*} max_{c_j ∈ C} δ(c*_i, c_j) + (1 / (2|C|)) Σ_{c_j ∈ C} max_{c*_i ∈ C*} δ(c*_i, c_j),   (8.74)

where δ(c*_i, c_j) is a similarity measurement between ground truth community c*_i and detected community c_j, such as the Jaccard similarity. The score is between 0 and 1, where 1 indicates a perfect matching of the ground truth communities.

Recommender systems aim at recommending items (e.g., products, movies, or locations) to users and cover a wide range of applications. In many cases, an application comes with an associated social network between users. Now we will present an example showing how to use the idea of network representation for building recommender systems in location-based social networks.
Fig. 8.7
An illustrative example for the data in LBSNs: a Link connections represent the friendshipbetween users. b A trajectory generated by a user is a sequence of chronologically ordered check-inrecords [138]
The accelerated growth of mobile trajectories in location-based services brings valuable data resources for understanding users' moving behaviors. Apart from recording the trajectory data, another major characteristic of these location-based services is that they also allow the users to connect with whomever they like or are interested in. As shown in Fig. 8.7, a combination of social networking and location-based services is called a Location-Based Social Network (LBSN). As shown in [21], locations that are frequently visited by socially related persons tend to be correlated, which indicates the close association between social connections and the trajectory behaviors of users in LBSNs. In order to better analyze and mine LBSN data, we need a comprehensive view that combines the information from the two aspects, i.e., the social network and the mobile trajectory data.
Specifically, JNTM [138] is proposed to model both social networks and mobile trajectories jointly. The model consists of two components: the construction of social networks and the generation of mobile trajectories. First, JNTM adopts a network embedding method for the construction of social networks, where a network representation can be derived for a user. Second, JNTM considers four factors that influence the generation process of mobile trajectories, namely, user visit preference, influence of friends, short-term sequential contexts, and long-term sequential contexts. JNTM then uses real-valued representations to encode the four factors and sets two different user representations to model the first two factors: a visit interest representation and a network representation. To characterize the last two contexts, JNTM employs RNN and GRU models to capture the sequential relatedness in mobile trajectories at different levels, i.e., short term or long term. Finally, the two components are tied together by sharing the user network representations. The overall model is illustrated in Fig. 8.8.
Fig. 8.8
The architecture of JNTM model
Information diffusion prediction is an important task which studies how information items spread among users. The prediction of information diffusion, also known as cascade prediction, has been studied over a wide range of applications, such as product adoption [67], epidemiology [124], social networks [63], and the spread of news and opinions [68].
As shown in Fig. 8.9, microscopic diffusion prediction aims at guessing the next infected user, while macroscopic diffusion prediction estimates the total number of infected users during the diffusion process. Also, an underlying social graph among users will be available when information diffusion occurs on a social network service. The social graph is considered as an additional structural input for diffusion prediction.
FOREST [139] is the first work to address both microscopic and macroscopic predictions. As shown in Fig. 8.10, FOREST proposes a structural context extraction algorithm, described below.
Fig. 8.9
Illustrative examples for microscopic next infected user prediction (left) and macroscopic cascade size prediction (right) [139]
Fig. 8.10
An illustrative example of structural context extraction of the orange node by neighbor sampling and feature aggregation [139]

The structural context extraction algorithm was originally introduced for accelerating graph convolutional networks [41] and is used to build an RNN-based microscopic cascade model. For each user v, we first sample Z users {u_1, u_2, ..., u_Z} from v and its neighbors N(v). Then we update its feature vector by aggregating the neighborhood features. The updated user feature vector encodes structural information by aggregating features from v's first-order neighbors. The operation can also be applied recursively to explore a larger neighborhood of user v. Empirically, a two-step neighborhood exploration is time efficient and sufficient to give promising results.
FOREST further incorporates the ability of macroscopic prediction, i.e., estimating the eventual size of a cascade, into the model by reinforcement learning. The method can be divided into four steps: (a) encode the observed K users by a microscopic cascade model; (b) enable the microscopic cascade model to predict the size of a cascade by cascade simulations; (c) use the Mean-Square Log-Transformed Error (MSLE) as the supervision signal for macroscopic predictions; and (d) employ a reinforcement learning framework to update parameters through the policy gradient algorithm. The overall workflow is illustrated in Fig. 8.11.
In this section, we will introduce another kind of method for network representation learning, called Graph Neural Networks (GNNs) [101]. These methods aim to utilize neural networks to model graph data and have shown strong capabilities in many applications.

Fig. 8.11
The workflow of adopting microscopic cascade model for macroscopic size predictionby reinforcement learning
Graph Neural Networks (GNNs) are deep learning based methods that operate on the graph domain. Due to their convincing performance and high interpretability, GNNs have recently become a widely applied graph analysis method. In this subsection, we will illustrate the fundamental motivations of graph neural networks.
In recent years, CNNs [65] have made breakthroughs in various machine learning areas, especially computer vision, and started the revolution of deep learning [64]. CNNs are capable of extracting multiscale localized features, and these features are used to generate more expressive representations. Examining CNNs and graphs more closely, we find the keys of CNNs: local connections, shared weights, and the use of multiple layers [64]. These are also of great importance in solving problems in the graph domain, because (1) graphs are the most typical locally connected structures, (2) shared weights reduce the computational cost compared with traditional spectral graph theory [23], and (3) the multilayer structure is the key to dealing with hierarchical patterns and captures features of various sizes. However, CNNs can only operate on regular Euclidean data such as images (2D grids) and text (1D sequences), while these data structures can be regarded as instances of graphs. Therefore, it is natural to seek a generalization of CNNs to graphs. As shown in Fig. 8.12, it is hard to define localized convolutional filters and pooling operators on graphs, which hinders the transfer of CNNs from the Euclidean domain to the non-Euclidean domain.
The other motivation comes from network embedding [12, 24, 37, 42, 149]. In the field of graph analysis, traditional machine learning approaches usually rely on hand-engineered features and are limited by their inflexibility and high cost. Following the idea of representation learning and the success of word embedding [81], DeepWalk [93], which is regarded as the first graph embedding method based on
Fig. 8.12
Left: image in Euclidean space. Right: graph in non-Euclidean space [155]

representation learning, applies the Skip-gram model [81] on the generated random walks. Similar approaches such as node2vec [38], LINE [111], and TADW [136] also achieved breakthroughs. However, these methods suffer from two severe drawbacks [42]. First, no parameters are shared between nodes in the encoder, which leads to computational inefficiency, since the number of parameters grows linearly with the number of nodes. Second, the direct embedding methods lack the ability to generalize, which means they cannot deal with dynamic graphs or generalize to new graphs.
Based on CNNs and network embedding, Graph Neural Networks (GNNs) are proposed to collectively aggregate information from the graph structure. Thus, they can model inputs and/or outputs consisting of elements and their dependencies. Further, graph neural networks can simultaneously model the diffusion process on the graph with an RNN kernel.
In the rest of this section, we will first introduce several typical variants of graph neural networks, such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Recurrent Networks (GRNs). Then we will introduce several extensions to the original model, and finally we will give some examples of applications that utilize graph neural networks.
Graph Convolutional Networks (GCNs) aim to generalize convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and spatial (nonspectral) approaches.
Spectral approaches work with a spectral representation of the graphs.
Spectral Network.
Bruna et al. [11] proposes the spectral network. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. The operation can be defined as the multiplication of a signal x ∈ R^N (a scalar for each node) with a filter g_θ = diag(θ) parameterized by θ ∈ R^N:

g_θ ⋆ x = U g_θ(Λ) U^⊤ x,   (8.75)

where U is the matrix of eigenvectors of the normalized graph Laplacian L = I_N − D^{−1/2} A D^{−1/2} = U Λ U^⊤ (D is the degree matrix and A is the adjacency matrix of the graph), and Λ is the diagonal matrix of its eigenvalues.
This operation results in potentially intense computations and non-spatially localized filters. Henaff et al. [47] attempts to make the spectral filters spatially localized by introducing a parameterization with smooth coefficients. ChebNet.
Hammond et al. [43] suggests that g_θ(Λ) can be approximated by a truncated expansion in terms of Chebyshev polynomials T_k(x) up to the K-th order. Thus, the operation is

g_θ ⋆ x ≈ Σ_{k=0}^{K} θ_k T_k(L̃) x,   (8.76)

with L̃ = (2/λ_max) L − I_N, where λ_max denotes the largest eigenvalue of L. θ ∈ R^K is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined by the recurrence T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x. It can be observed that the operation is K-localized since it is a K-th-order polynomial in the Laplacian. Defferrard et al. [28] proposes ChebNet, which uses this K-localized convolution to define a convolutional neural network and thereby removes the need to compute the eigenvectors of the Laplacian.
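The following is a minimal NumPy sketch of applying a spectral filter through the truncated Chebyshev expansion of Eq. 8.76, using the recurrence above so that no eigendecomposition is needed; the function name and the default λ_max estimate are illustrative assumptions.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Apply a spectral filter via the truncated Chebyshev expansion (Eq. 8.76).

    L       : N x N normalized graph Laplacian.
    x       : length-N signal, one scalar per node.
    theta   : Chebyshev coefficients theta_0, ..., theta_K.
    lam_max : (estimate of) the largest eigenvalue of L.
    """
    N = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(N)   # rescale the spectrum to [-1, 1]
    T_prev, T_curr = x, L_tilde @ x             # T_0(L~)x = x, T_1(L~)x = L~ x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        # recurrence: T_k(L~)x = 2 L~ T_{k-1}(L~)x - T_{k-2}(L~)x
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_curr
    return out
```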
GCN.
Kipf and Welling [59] limits the layer-wise convolution operation to K = 1 and approximates λ_max ≈ 2, which gives

g_{θ'} ⋆ x ≈ θ'_0 x + θ'_1 (L − I_N) x = θ'_0 x − θ'_1 D^{−1/2} A D^{−1/2} x,   (8.77)

with two free parameters θ'_0 and θ'_1. After constraining the number of parameters with θ = θ'_0 = −θ'_1, we obtain the following expression:

g_θ ⋆ x ≈ θ (I_N + D^{−1/2} A D^{−1/2}) x.   (8.78)

Since stacking this operator could lead to numerical instabilities and exploding/vanishing gradients, [59] introduces the renormalization trick:
I_N + D^{−1/2} A D^{−1/2} → D̃^{−1/2} Ã D̃^{−1/2}, with Ã = A + I_N and D̃_{ii} = Σ_j Ã_{ij}. Finally, [59] generalizes the definition to a signal X ∈ R^{N×C} with C input channels and F filters for feature maps as follows:

H = f(D̃^{−1/2} Ã D̃^{−1/2} X W),   (8.79)

where W ∈ R^{C×F} is a matrix of filter parameters, H ∈ R^{N×F} is the convolved signal matrix, and f(·) is the activation function.
The GCN layer can be stacked multiple times, which gives

H^{(t)} = f(D̃^{−1/2} Ã D̃^{−1/2} H^{(t−1)} W^{(t−1)}),   (8.80)

where the superscripts t and t − 1 denote the layer index and H^{(0)} could be X. After L layers, we can use the final embedding matrix H^{(L)} and a readout function to get the final output matrix Z:

Z = Readout(H^{(L)}),   (8.81)

where the readout function can be any machine learning method, such as an MLP. Finally, as a semi-supervised algorithm, GCN uses the feature matrix at the top layer Z, whose dimension equals the total number of labels, to predict the labels of all labeled nodes. The loss function can be written as

L = − Σ_{l ∈ y_L} Σ_f Y_{lf} ln Z_{lf},   (8.82)

where y_L is the set of node indices that have observed labels. Figure 8.13 shows the architecture of GCN.

Fig. 8.13
The architecture of the graph convolutional network model
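A minimal NumPy sketch of one GCN propagation step (Eq. 8.80) with the renormalization trick is given below; the function name, the choice of tanh as the nonlinearity, and the dense-matrix implementation are illustrative assumptions, since practical implementations use sparse operations.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN propagation step (Eq. 8.80) with the renormalization trick.

    A : N x N adjacency matrix (self-loops are added inside this function).
    H : N x C input features for this layer (H^(0) can be the raw feature matrix X).
    W : C x F weight matrix of this layer.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # A~ = A + I_N
    d_tilde = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # D~^{-1/2} A~ D~^{-1/2}
    return activation(A_hat @ H @ W)
```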
In all of the spectral approaches mentioned above, the learned filters depend onthe Laplacian eigenbasis, which depends on the graph structure, that is, a modeltrained on a specific structure could not be directly applied to a graph with a differentstructure.Spatial approaches define convolutions directly on the graph, operating on spa-tially close neighbors. The major challenge of spatial approaches is defining the con-volution operation with differently sized neighborhoods and maintaining the localinvariance of CNNs.
Neural FPs.
Duvenaud et al. [31] uses different weight matrices for nodes with different degrees:

x^{(t)} = h_v^{(t−1)} + Σ_{i=1}^{|N_v|} h_i^{(t−1)},
h_v^{(t)} = f(W_{|N_v|}^{(t)} x^{(t)}),   (8.83)

where W_{|N_v|}^{(t)} is the weight matrix for nodes with degree |N_v| at layer t. The main drawback of the method is that it cannot be applied to large-scale graphs whose nodes can have very large degrees.
In the following description of other models, we use h_v^{(t)} to denote the hidden state of node v at layer t, N_v to denote the neighbor set of node v, and |N_v| to denote the size of that set. DCNN.
Atwood and Towsley [4] proposes the Diffusion-Convolutional Neural Networks (DCNNs). Transition matrices are used to define the neighborhood for nodes in DCNN. For node classification, it has

H = f(W^c ⊙ P* X),   (8.84)

where ⊙ is the element-wise multiplication and X is an N × F matrix of input features. P* is an N × K × N tensor which contains the power series {P, P², ..., P^K} of the matrix P, and P is the degree-normalized transition matrix derived from the graph's adjacency matrix A. Each entity is transformed to a diffusion-convolutional representation, which is a K × F matrix defined by K hops of graph diffusion over F features, and it is then combined with a K × F weight matrix and a nonlinear activation function f. Finally, H (which is N × K × F) denotes the diffusion representations of all nodes in the graph. DGCN.
Zhuang and Ma [158] proposes the Dual Graph Convolutional Network(DGCN) to consider the local consistency and global consistency of graphs jointly. Ituses two convolutional networks to capture the local/global consistency and adoptsan unsupervised loss to ensemble them. The first convolutional network is the sameas Eq. 8.80. And the second network replaces the adjacency matrix with PositivePoint-wise Mutual Information (PPMI) matrix:
H^{(t)} = f(D_P^{−1/2} X_P D_P^{−1/2} H^{(t−1)} W),   (8.85)

where X_P is the PPMI matrix and D_P is the diagonal degree matrix of X_P. GraphSAGE.
Hamilton et al. [41] proposes GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood:

h_{N_v}^{(t)} = AGGREGATE^{(t)}({h_u^{(t−1)}, ∀u ∈ N_v}),
h_v^{(t)} = f(W^{(t)} [h_v^{(t−1)}; h_{N_v}^{(t)}]).   (8.86)

However, [41] does not utilize the full set of neighbors in Eq. 8.86 but a fixed-size set of neighbors obtained by uniform sampling. [41] suggests three aggregator functions (a code sketch of the mean aggregator is given after this list):
• Mean aggregator. It could be viewed as an approximation of the convolutional operation from the transductive GCN framework [59], so that the inductive version of the GCN variant could be derived by

h_v^{(t)} = f(W · MEAN({h_v^{(t−1)}} ∪ {h_u^{(t−1)} | ∀u ∈ N_v})).   (8.87)

The mean aggregator is different from the other aggregators because it does not perform the concatenation of h_v^{(t−1)} and h_{N_v}^{(t)} in Eq. 8.86. It can be viewed as a form of "skip connection" [46] and can achieve better performance.
• LSTM aggregator. Hamilton et al. [41] also uses an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner, so they are not permutation invariant. Hamilton et al. [41] adapts LSTMs to operate on an unordered set by permuting the node's neighbors.
• Pooling aggregator. In the pooling aggregator, each neighbor's hidden state is fed through a fully connected layer, and then a max-pooling operation is applied over the set of the node's neighbors:

h_{N_v}^{(t)} = max({f(W_pool h_u^{(t−1)} + b), ∀u ∈ N_v}).   (8.88)

Note that any symmetric function could be used in place of the max-pooling operation here.
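The following NumPy sketch illustrates one GraphSAGE layer with the mean aggregator and fixed-size neighbor sampling (Eqs. 8.86–8.87, GCN-style variant); the function name, sampling size, and seed handling are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sage_mean_layer(neighbors, H, W, num_samples=5, seed=0):
    """One GraphSAGE layer with the mean aggregator (Eqs. 8.86-8.87).

    neighbors   : dict mapping node id -> list of neighbor ids.
    H           : N x d matrix of hidden states from the previous layer.
    W           : d x d' shared weight matrix.
    num_samples : fixed number of neighbors sampled uniformly per node.
    """
    rng = np.random.default_rng(seed)
    H_new = np.zeros((H.shape[0], W.shape[1]))
    for v, nbrs in neighbors.items():
        if nbrs:
            sampled = list(rng.choice(nbrs, size=min(num_samples, len(nbrs)), replace=False))
        else:
            sampled = []                         # isolated node: use only its own state
        # mean over the node itself and its sampled neighbors, then a shared linear map
        mean_vec = np.mean(np.vstack([H[v]] + [H[u] for u in sampled]), axis=0)
        H_new[v] = np.maximum(mean_vec @ W, 0.0)  # ReLU nonlinearity
    return H_new
```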
Other methods.
There are still many other spatial methods. The PATCHY-SAN model [86] first extracts exactly k nodes for each node and normalizes them; the convolutional operation is then applied to the normalized neighborhood. LGCN [35] leverages CNNs as aggregators: it performs max-pooling on nodes' neighborhood matrices to get the top-k feature elements and then applies a 1-D CNN to compute hidden representations. Monti et al. [82] proposes a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques. The Geodesic CNN (GCNN) [78] and Anisotropic CNN (ACNN) [10] on manifolds, or GCN [59] and DCNN [4] on graphs, could be formulated as particular instances of MoNet. Readers can refer to the original papers for more details.
The attention mechanism has been successfully used in many sequence-based taskssuch as machine translation [5, 36, 121], machine reading [19], etc. Many worksfocus on generalizing the attention mechanism to the graph domain.
GAT.
Velickovic et al. [122] proposes the Graph Attention Network (GAT), which incorporates the attention mechanism into the propagation step. Specifically, it uses the self-attention strategy, and each node's hidden state is computed by attending over its neighbors.
Velickovic et al. [122] defines a single graph attentional layer and constructs arbitrary graph attention networks by stacking this layer. The layer computes the coefficients in the attention mechanism for the node pair (i, j) by

α_{ij} = exp(LeakyReLU(a^⊤ [W h_i^{(t−1)}; W h_j^{(t−1)}])) / Σ_{k ∈ N_i} exp(LeakyReLU(a^⊤ [W h_i^{(t−1)}; W h_k^{(t−1)}])),   (8.89)

where α_{ij} is the attention coefficient of node j to node i, W ∈ R^{F'×F} is the weight matrix of a shared linear transformation applied to every node, and a ∈ R^{2F'} is the weight vector. The coefficients are normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope 0.2) is applied.
Then the final output features of each node can be obtained by (after applying a nonlinearity f)

h_i^{(t)} = f(Σ_{j ∈ N_i} α_{ij} W h_j^{(t−1)}).   (8.90)

Moreover, the layer utilizes multi-head attention, similarly to [121], to stabilize the learning process. It applies K independent attention mechanisms to compute the hidden states and then concatenates their features (or computes their average), resulting in the following two output representations:

h_i^{(t)} = ∥_{k=1}^{K} f(Σ_{j ∈ N_i} α_{ij}^k W^k h_j^{(t−1)}),   (8.91)

h_i^{(t)} = f((1/K) Σ_{k=1}^{K} Σ_{j ∈ N_i} α_{ij}^k W^k h_j^{(t−1)}),   (8.92)
where α_{ij}^k is the normalized attention coefficient computed by the k-th attention mechanism and ∥ is the concatenation operation.
The attention architecture in [122] has several properties: (1) the computation of the node-neighbor pairs is parallelizable, so the operation is efficient; (2) it can deal with nodes that have different degrees by assigning reasonable weights to their neighbors; (3) it can be applied to inductive learning problems easily.
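A minimal single-head NumPy sketch of Eqs. 8.89–8.90 is shown below; the function name, the dense double loop, and the added self-loops (so that every node attends at least to itself) are illustrative assumptions for readability rather than an efficient or official implementation.

```python
import numpy as np

def gat_layer(A, H, W, a, alpha=0.2):
    """Single-head graph attention layer (Eqs. 8.89-8.90).

    A : N x N adjacency matrix; A[i, j] > 0 means j is a neighbor of i.
    H : N x F input features.  W : F x F' shared linear transformation.
    a : length-2F' attention weight vector.  alpha : LeakyReLU negative slope.
    """
    N = A.shape[0]
    A = A + np.eye(N)                      # include self-loops in each neighborhood
    Z = H @ W                              # W h_i for every node
    e = np.full((N, N), -np.inf)           # raw logits; -inf outside the neighborhood
    for i in range(N):
        for j in np.nonzero(A[i])[0]:
            s = np.concatenate([Z[i], Z[j]]) @ a
            e[i, j] = s if s > 0 else alpha * s   # LeakyReLU
    # softmax over each node's neighborhood gives alpha_ij
    e = e - e.max(axis=1, keepdims=True)
    att = np.exp(e)
    att = att / att.sum(axis=1, keepdims=True)
    return np.tanh(att @ Z)                # h_i' = f(sum_j alpha_ij W h_j)
```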
GAAN.
Besides GAT, the Gated Attention Network (GAAN) [150] also uses the multi-head attention mechanism. However, it uses a self-attention mechanism to gather information from different heads to replace the average operation of GAT.
Several works attempt to use gating mechanisms like GRU [20] or LSTM [48] in the propagation step to alleviate the limitations of the vanilla GNN architecture and improve the effectiveness of long-term information propagation across the graph. We call these methods Graph Recurrent Networks (GRNs), and we will introduce some variants of GRNs in this subsection.
GGNN.
Li et al. [72] proposes the Gated Graph Neural Network (GGNN), which uses Gated Recurrent Units (GRU) in the propagation step. It follows the computation steps of recurrent neural networks for a fixed number of L steps and then backpropagates through time to compute gradients.
Specifically, the basic recurrence of the propagation model is

a_v^{(t)} = A_v^⊤ [h_1^{(t−1)} ... h_N^{(t−1)}]^⊤ + b,
z_v^{(t)} = Sigmoid(W^z a_v^{(t)} + U^z h_v^{(t−1)}),
r_v^{(t)} = Sigmoid(W^r a_v^{(t)} + U^r h_v^{(t−1)}),   (8.93)
h̃_v^{(t)} = tanh(W a_v^{(t)} + U(r_v^{(t)} ⊙ h_v^{(t−1)})),
h_v^{(t)} = (1 − z_v^{(t)}) ⊙ h_v^{(t−1)} + z_v^{(t)} ⊙ h̃_v^{(t)}.

The node v first aggregates messages from its neighbors, where A_v is the sub-matrix of the graph adjacency matrix A that denotes the connection of node v with its neighbors. Then the hidden state of the node is updated by the GRU-like function using the information from its neighbors and the hidden state from the previous timestep; a gathers the neighborhood information of node v, and z and r are the update and reset gates.
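A compact NumPy sketch of one GGNN propagation step over all nodes at once, following Eq. 8.93, is given below; it uses a row-vector convention (states as rows of H), and the function and parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, H, Wz, Uz, Wr, Ur, W, U, b):
    """One GGNN propagation step (Eq. 8.93), applied to all nodes simultaneously.

    A : N x N adjacency matrix.       H : N x d hidden states from the previous step.
    Wz, Uz, Wr, Ur, W, U : d x d gate and candidate weight matrices.  b : length-d bias.
    """
    Agg = A @ H + b                           # a_v: aggregated neighbor states plus bias
    Z = sigmoid(Agg @ Wz + H @ Uz)            # update gate z_v
    R = sigmoid(Agg @ Wr + H @ Ur)            # reset gate r_v
    H_cand = np.tanh(Agg @ W + (R * H) @ U)   # candidate state h~_v
    return (1.0 - Z) * H + Z * H_cand         # gated combination of old and new states
```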
LSTMs are also used, similarly to the GRU, in propagation processes based on a tree or a graph. Tree-LSTM.
Tai et al. [109] proposes two extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. As in standard LSTM units, each Tree-LSTM unit (indexed by v) contains input and output gates i_v and o_v, a memory cell c_v, and a hidden state h_v. The Tree-LSTM unit replaces the single forget gate by a forget gate f_{vk} for each child k, allowing node v to select information from its children accordingly. The equations of the Child-Sum Tree-LSTM are

h̃_v^{t−1} = Σ_{k ∈ N_v} h_k^{t−1},
i_v^t = Sigmoid(W^i x_v^t + U^i h̃_v^{t−1} + b^i),
f_{vk}^t = Sigmoid(W^f x_v^t + U^f h_k^{t−1} + b^f),
o_v^t = Sigmoid(W^o x_v^t + U^o h̃_v^{t−1} + b^o),   (8.94)
u_v^t = tanh(W^u x_v^t + U^u h̃_v^{t−1} + b^u),
c_v^t = i_v^t ⊙ u_v^t + Σ_{k ∈ N_v} f_{vk}^t ⊙ c_k^{t−1},
h_v^t = o_v^t ⊙ tanh(c_v^t),

where x_v^t is the input vector at time t in the standard LSTM setting.
In a specific case, if each node's number of children is at most K and these children can be ordered from 1 to K, then the N-ary Tree-LSTM can be applied. For node v, h_{vk}^t and c_{vk}^t denote the hidden state and memory cell of its k-th child at time t, respectively. The transition equations are the following:

i_v^t = Sigmoid(W^i x_v^t + Σ_{l=1}^{K} U_l^i h_{vl}^{t−1} + b^i),
f_{vk}^t = Sigmoid(W^f x_v^t + Σ_{l=1}^{K} U_{kl}^f h_{vl}^{t−1} + b^f),
o_v^t = Sigmoid(W^o x_v^t + Σ_{l=1}^{K} U_l^o h_{vl}^{t−1} + b^o),   (8.95)
u_v^t = tanh(W^u x_v^t + Σ_{l=1}^{K} U_l^u h_{vl}^{t−1} + b^u),
c_v^t = i_v^t ⊙ u_v^t + Σ_{l=1}^{K} f_{vl}^t ⊙ c_{vl}^{t−1},
h_v^t = o_v^t ⊙ tanh(c_v^t).

Compared to the Child-Sum Tree-LSTM, the N-ary Tree-LSTM introduces separate parameters for each child k. These parameters allow the model to learn more fine-grained representations conditioned on each node's children. Graph LSTM.
The two types of Tree-LSTMs can be easily adapted to the graph.The graph-structured LSTM in [148] is an example of the N -ary Tree-LSTM applied
to the graph. However, it is a simplified version, since each node in the graph has at most 2 incoming edges (from its parent and sibling predecessor). Peng et al. [92] proposes another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have their own labels, and [92] utilizes different weight matrices to represent different labels:

i_v^t = Sigmoid(W^i x_v^t + Σ_{k ∈ N_v} U_{m(v,k)}^i h_k^{t−1} + b^i),
f_{vk}^t = Sigmoid(W^f x_v^t + U_{m(v,k)}^f h_k^{t−1} + b^f),
o_v^t = Sigmoid(W^o x_v^t + Σ_{k ∈ N_v} U_{m(v,k)}^o h_k^{t−1} + b^o),   (8.96)
u_v^t = tanh(W^u x_v^t + Σ_{k ∈ N_v} U_{m(v,k)}^u h_k^{t−1} + b^u),
c_v^t = i_v^t ⊙ u_v^t + Σ_{k ∈ N_v} f_{vk}^t ⊙ c_k^{t−1},
h_v^t = o_v^t ⊙ tanh(c_v^t),

where m(v, k) denotes the edge label between nodes v and k.
Besides, [74] proposes a Graph LSTM network to address the semantic object parsing task. It uses a confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing existing LSTMs to graph-structured data but has a specific updating sequence, while the methods mentioned above are agnostic to the order of nodes. Sentence LSTM.
Zhang et al. [152] proposes the Sentence LSTM (S-LSTM) forimproving text encoding. It converts text into a graph and utilizes the Graph LSTMto learn the representation. The S-LSTM shows strong representation power in manyNLP problems.
In this subsection, we will talk about some extensions of graph neural networks.
Many applications unroll or stack the graph neural network layer aiming to achieve better results, as more layers (i.e., k layers) make each node aggregate more information from neighbors k hops away. However, it has been observed in many experiments that deeper models do not improve the performance and can even perform worse [59]. This is mainly because more layers also propagate noisy information from an exponentially increasing number of expanded neighborhood members.
A straightforward method to address the problem, the residual network [45], can be borrowed from the computer vision community. Nevertheless, even with residual connections, GCNs with more layers do not perform as well as the 2-layer GCN on many datasets [59]. Highway Network.
Rahimi et al. [96] borrows ideas from the highway network [159] and uses layer-wise gates to build a Highway GCN. The input of each layer is multiplied by the gating weights and then summed with the output:

T(h^{(t)}) = Sigmoid(W^{(t)} h^{(t)} + b^{(t)}),
h^{(t+1)} = h^{(t+1)} ⊙ T(h^{(t)}) + h^{(t)} ⊙ (1 − T(h^{(t)})).   (8.97)

By adding the highway gates, the performance peaks at four layers in a specific problem discussed in [96]. The Column Network (CLN) proposed in [94] also utilizes the highway network but uses a different function to compute the gating weights.
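A small NumPy sketch of the layer-wise highway gating in Eq. 8.97 follows; the function name and the assumption that the candidate states of layer t+1 are already computed (e.g., by a GCN layer) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_gate(h_prev, h_next, W_gate, b_gate):
    """Layer-wise highway gating (Eq. 8.97).

    h_prev : N x d hidden states of layer t.
    h_next : N x d candidate hidden states of layer t+1 (e.g., a GCN layer output).
    W_gate : d x d gating weights.  b_gate : length-d gating bias.
    """
    T = sigmoid(h_prev @ W_gate + b_gate)    # transform gate T(h^(t))
    return h_next * T + h_prev * (1.0 - T)   # gated mix of new and old states
```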
Jump Knowledge Network.
Xu et al. [134] studies properties and resulting limitations of neighborhood aggregation schemes and proposes the Jump Knowledge Network, which can learn adaptive, structure-aware representations. The Jump Knowledge Network selects, for each node at the last layer, from all of the intermediate representations (which "jump" to the last layer), which enables the model to select effective neighborhood information for each node. Xu et al. [134] uses three approaches, concatenation, max-pooling, and LSTM-attention, in the experiments to aggregate information. The Jump Knowledge Network performs well in experiments on social, bioinformatics, and citation networks. It can also be combined with models like Graph Convolutional Networks, GraphSAGE, and Graph Attention Networks to improve their performance.
In the area of computer vision, a convolutional layer is usually followed by a poolinglayer to get more general features. Similar to these pooling layers, much work focuseson designing hierarchical pooling layers on graphs. Complicated and large-scalegraphs usually carry rich hierarchical structures that are of great importance fornode-level and graph-level classification tasks.To explore such inner features, Edge-Conditioned Convolution (ECC) [106]designs its pooling module with the recursively downsampling operation. The down-sampling method is based on splitting the graph into two components by the sign ofthe largest eigenvector of the Laplacian.
DIFFPOOL [144] proposes a learnable hierarchical clustering module by training an assignment matrix in each layer:

S^{(l)} = Softmax(GNN_{l,pool}(A^{(l)}, V^{(l)})),   (8.98)

where V^{(l)} is the node feature matrix and A^{(l)} is the coarsened adjacency matrix of layer l.

The original graph convolutional neural network has several drawbacks. Specifically, GCN requires the full graph Laplacian, which is computationally expensive for large graphs. Furthermore, the embedding of a node at layer L is computed recursively from the embeddings of all its neighbors at layer L − 1.
Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, so computing the gradient for a single node is costly. Finally, GCN is trained independently for a fixed graph and lacks the ability for inductive learning.
GraphSAGE [41] is a comprehensive improvement of the original GCN. To solve the problems mentioned above, GraphSAGE replaces the full graph Laplacian with learnable aggregation functions, which are crucial for performing message passing and generalizing to unseen nodes. As shown in Eq. 8.86, GraphSAGE first aggregates the neighborhood embeddings, concatenates them with the target node's embedding, and then propagates the result to the next layer. With learned aggregation and propagation functions, GraphSAGE can generate embeddings for unseen nodes. Also, GraphSAGE uses neighbor sampling to alleviate the receptive field expansion.
PinSage [143] proposes an importance-based sampling method. By simulating random walks starting from target nodes, this approach chooses the top T nodes with the highest normalized visit counts.
FastGCN [16] further improves the sampling algorithm. Instead of sampling neighbors for each node, FastGCN directly samples the receptive field for each layer. FastGCN uses importance sampling, where the importance factor is calculated as below:

q(v) ∝ (1/|N_v|) Σ_{u ∈ N_v} 1/|N_u|.   (8.99)
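The following Python sketch computes layer-wise importance-sampling probabilities of the form in Eq. 8.99; the function name, the normalization into a proper distribution, and the handling of isolated nodes are illustrative assumptions added for usability.

```python
import numpy as np

def layer_sampling_probs(neighbors):
    """Importance-sampling probabilities q(v) of the form in Eq. 8.99.

    neighbors : dict mapping node id -> list of neighbor ids (assumed non-empty
                for nodes that should receive positive probability).
    Returns a dict mapping node id -> normalized sampling probability.
    """
    scores = {}
    for v, nbrs in neighbors.items():
        # q(v) proportional to (1/|N_v|) * sum over u in N_v of 1/|N_u|
        scores[v] = (sum(1.0 / len(neighbors[u]) for u in nbrs) / len(nbrs)) if nbrs else 0.0
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

# Nodes for one layer could then be drawn with np.random.choice,
# using these probabilities as the sampling distribution.
```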
Adapt.
In contrast to the fixed sampling methods above, [51] introduces a parameterized and trainable sampler to perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler can find the optimal sampling importance and reduce variance simultaneously.
In the original GNN [101], the input graph consists of nodes with label informationand undirected edges, which is the simplest graph format. However, there are manyvariants of graphs in the world. In the following, we will introduce some methodsdesigned to model different kinds of graphs.
Directed Graphs.
The first variant of the graph is the directed graph. An undirected edge, which can be treated as two directed edges, shows that there is a relation between two nodes. However, directed edges can bring more information than undirected edges. For example, in a knowledge graph where an edge starts from the head entity and ends at the tail entity, the head entity is the parent class of the tail entity, which suggests we should treat the information propagation processes from parent classes and from child classes differently. DGP [55] uses two kinds of weight matrices, W_p and W_c, to incorporate more precise structural information. The propagation rule is as follows:

H^{(t)} = f(D_p^{−1} A_p f(D_c^{−1} A_c H^{(t−1)} W_c) W_p),   (8.100)

where D_p^{−1} A_p and D_c^{−1} A_c are the normalized adjacency matrices for parents and children, respectively. Heterogeneous Graphs.
The second variant of the graph is a heterogeneousgraph, where there are several kinds of nodes. The simplest way to process theheterogeneous graph is to convert the type of each node to a one-hot feature vectorwhich is concatenated with the original feature.What’s more, GraphInception [151] introduces the concept of metapath into thepropagation on the heterogeneous graph. With metapath, we can group the neighborsaccording to their node types and distances. For each neighbor group, GraphInceptiontreats it as a subgraph in a homogeneous graph to do propagation and concatenatesthe propagation results from different homogeneous graphs to do a collective noderepresentation. Recently, [128] proposes the Heterogeneous graph Attention Network(HAN) which utilizes node-level and semantic-level attention. And the model hasthe ability to consider node importance and metapaths simultaneously.
Graphs with Edge Information.
In another variant of graphs, each edge carries additional information such as the weight or the type of the edge. We list two ways to handle this kind of graph.
Firstly, we can convert the graph to a bipartite graph where the original edges also become nodes, and one original edge is split into two new edges, i.e., there are two new edges between the edge node and the begin/end nodes. The encoder of G2S [7] uses the following aggregation function for neighbors:

h_v^{(t)} = f((1/|N_v|) Σ_{u ∈ N_v} W_r (r_v^{(t)} ⊙ h_u^{(t−1)}) + b_r),   (8.101)

where W_r and b_r are the propagation parameters for different types of edges (relations).
Secondly, we can adopt different weight matrices for the propagation of different kinds of edges. When the number of relations is huge, r-GCN [102] introduces two kinds of regularization to reduce the number of parameters needed for modeling the relations: basis decomposition and block-diagonal decomposition. With the basis decomposition, each W_r is defined as follows:

W_r = Σ_{b=1}^{B} α_{rb} M_b.   (8.102)

Here each W_r is a linear combination of basis transformations M_b ∈ R^{d_in × d_out} with coefficients α_{rb}. In the block-diagonal decomposition, r-GCN defines each W_r through the direct sum over a set of low-dimensional matrices, which needs more parameters than the first one.
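A one-line NumPy sketch of the basis decomposition in Eq. 8.102 is shown below; the function name, shapes, and the einsum formulation are illustrative assumptions.

```python
import numpy as np

def relation_weights(alpha, bases):
    """Basis decomposition of relation-specific weights (Eq. 8.102).

    alpha : R x B coefficient matrix, one row of coefficients alpha_rb per relation r.
    bases : B x d_in x d_out stack of shared basis transformations M_b.
    Returns an R x d_in x d_out stack of relation matrices W_r = sum_b alpha_rb M_b.
    """
    # for each relation r, sum the bases weighted by its coefficients
    return np.einsum('rb,bio->rio', alpha, bases)

# Example: 10 relations sharing 3 bases of shape 16 x 16.
rng = np.random.default_rng(0)
W = relation_weights(rng.normal(size=(10, 3)), rng.normal(size=(3, 16, 16)))
print(W.shape)  # (10, 16, 16)
```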
Dynamic Graphs.
Another variant of the graph is the dynamic graph, which has a static graph structure and dynamic input signals. To capture both kinds of information, DCRNN [71] and STGCN [147] first collect spatial information by GNNs and then feed the outputs into a sequence model such as a sequence-to-sequence model or CNNs. In contrast, Structural-RNN [53] and ST-GCN [135] collect spatial and temporal messages at the same time: they extend the static graph structure with temporal connections so that traditional GNNs can be applied on the extended graphs.
Graph neural networks have been explored in a wide range of problem domains acrosssupervised, semi-supervised, unsupervised, and reinforcement learning settings. Inthis section, we simply divide the applications into three scenarios: (1) Structuralscenarios where the data has explicit relational structure, such as physical systems,molecular structures, and knowledge graphs; (2) Nonstructural scenarios where therelational structure is not explicit include image, text, etc; (3) Other applicationscenarios such as generative models and combinatorial optimization problems. Notethat we only list several representative applications instead of providing an exhaustivelist. We further give some examples of GNNs in the task of fact verification andrelation extraction. Figure 8.14 illustrates some application scenarios of graph neuralnetworks.
In the following, we will introduce GNN applications in structural scenarios, where the data are naturally organized in a graph structure. For example, GNNs are widely used in social network prediction [41, 59], traffic prediction [25, 96], recommender systems [120, 143], and graph representation [144]. Specifically, we will discuss
Fig. 8.14
Application scenarios of graph neural networks [155]

how to model real-world physical systems with object-relationship graphs, how to predict the chemical properties of molecules and the biological interaction properties of proteins, and how to apply GNNs to knowledge graphs.
Physics.
Modeling real-world physical systems is one of the most fundamentalaspects of understanding human intelligence. By representing objects as nodes andrelations as edges, we can perform GNN-based reasoning about objects, relations,and physics in a simplified but effective way.Battaglia et al. [6] proposes Interaction Networks to make predictions and infer-ences about various physical systems. Objects and relations are first fed into the modelas input. Then the model considers the interactions and physical dynamics to predictnew states. They separately model relation-centric and object-centric models, mak-ing it easier to generalize across different systems. In CommNet [107], interactionsare not modeled explicitly. Instead, an interaction vector is obtained by averaging allother agents’ hidden vectors. VAIN [49] further introduced attentional methods intothe agent interaction process, which preserves both the complexity advantages andcomputational efficiency as well.
Visual Interaction Networks [132] can make predictions from pixels. It learns astate code from two consecutive input frames for each object. Then, after addingtheir interaction effect by an Interaction Net block, the state decoder converts statecodes to the next step’s state.Sanchez-Gonzalez et al. [99] proposes a Graph Network based model which couldeither perform state prediction or inductive inference. The inference model takespartially observed information as input and constructs a hidden graph for implicitsystem classification.
Molecular Fingerprints.
Molecular fingerprints are feature vectors representing molecules, which are important in computer-aided drug design. Traditional molecular fingerprint discovery relies on hand-crafted heuristic methods, and GNNs can provide more flexible approaches for better fingerprints.
Duvenaud et al. [31] propose neural graph fingerprints (Neural FPs) that calculate substructure feature vectors via GCN and sum them to get the overall representation. The aggregation function is introduced in Eq. 8.83.
Kearnes et al. [56] further explicitly model atoms and atom pairs independently to emphasize atom interactions. It introduces an edge representation e_{uv}^{(t)} instead of an aggregation function, i.e., h_{N_v}^{(t)} = Σ_{u ∈ N_v} e_{uv}^{(t)}. The node update function is

h_v^{(t+1)} = ReLU(W [ReLU(W h_v^{(t)}); h_{N_v}^{(t)}]),   (8.103)

while the edge update function is

e_{uv}^{(t+1)} = ReLU(W [ReLU(W e_{uv}^{(t)}); ReLU(W [h_v^{(t)}; h_u^{(t)}])]).   (8.104)

Protein Interface Prediction.
Fout et al. [33] focuses on the task named proteininterface prediction, which is a challenging problem with critical applications indrug discovery and design. The proposed GCN-based method, respectively, learnsligand and receptor protein residue representation and merges them for pair-wiseclassification.GNN can also be used in biomedical engineering. With Protein-Protein Inter-action Network, [97] leverages graph convolution and relation network for breastcancer subtype classification. Zitnik et al. [160] also suggest a GCN-based modelfor polypharmacy side effects prediction. Their work models the drug and proteininteraction network and separately deals with edges in different types.
Knowledge Graph.
Hamaguchi et al. [40] utilizes GNNs to solve the Out-Of-Knowledge-Base (OOKB) entity problem in Knowledge Base Completion (KBC).The OOKB entities in [40] are directly connected to the existing entities thus theembeddings of OOKB entities can be aggregated from the existing entities. Themethod achieves satisfying performance both in the standard KBC setting and theOOKB setting.Wang et al. [130] utilize GCNs to solve the cross-lingual knowledge graph align-ment problem. The model embeds entities from different languages into a unifiedembedding space and aligns them based on the embedding similarity. .3 Graph Neural Networks 269
In this section we will talk about applications on nonstructural scenarios such asimage, text, programming source code [1, 72], and multi-agent systems [49, 58,107]. We will only give a detailed introduction to the first two scenarios due to thelength limit. Roughly, there are two ways to apply the graph neural networks onnonstructural scenarios: (1) Incorporate structural information from other domainsto improve the performance, for example, using information from knowledge graphsto alleviate the zero-shot problems in image tasks; (2) Infer or assume the relationalstructure in the scenario and then apply the model to solve the problems defined ongraphs, such as the method in [152] which models text as graphs.
Image classification.
Image classification is a fundamental and essential taskin the field of computer vision, which attracts much attention and has many famousdatasets like ImageNet [62]. Recent progress in image classification benefits from bigdata and the strong power of GPU computation, which allows us to train a classifierwithout extracting structural information from images. However, zero-shot and few-shot learning become more and more popular in the field of image classification,because most models can achieve similar performance with enough data. There areseveral works leveraging graph neural networks to incorporate structural informationin image classification.First, knowledge graphs can be used as extra information to guide zero-shot recog-nition classification [55, 129]. Wang et al. [129] builds a knowledge graph whereeach node corresponds to an object category and takes the word embeddings of nodesas input for predicting the classifier of different categories. As the over-smoothingeffect happens with the deep depth of convolution architecture, the 6-layer GCN usedin [129] will wash out much useful information in the representation. To solve thesmoothing problem in the propagation of GCN, [55] uses single-layer GCN with alarger neighborhood, which includes both one-hop and multi-hop nodes in the graph.And it proved effective in building a zero-shot classifier with existing ones.Except for the knowledge graph, the similarity between images in the dataset isalso helpful for few-shot learning [100]. Satorras and Estrach [100] propose to builda weighted fully connected image network based on the similarity and do messagepassing in the graph for few-shot recognition. As most knowledge graphs are largefor reasoning, [77] selects some related entities to build a subgraph based on theresult of object detection and apply GGNN to the extracted graph for prediction.Besides, [66] proposes to construct a new knowledge graph where the entities are allthe categories. And, they defined three types of label relations: super-subordinate,positive correlation, and negative correlation and propagate the confidence of labelsin the graph directly.
Visual reasoning.
Computer vision systems usually need to perform reasoning by incorporating both spatial and semantic information, so it is natural to generate graphs for reasoning tasks.
A typical visual reasoning task is Visual Question Answering (VQA). For this task, [114] constructs an image scene graph and a question syntactic graph, respectively, and then applies GGNN to train the embeddings for predicting the final answer. Besides spatial
connections among objects, [87] builds relational graphs conditioned on the questions. With knowledge graphs, [83, 131] can perform finer relation exploration and a more interpretable reasoning process.
Other applications of visual reasoning include object detection, interaction detection, and region classification. In object detection [39, 50], GNNs are used to calculate RoI features; in interaction detection [53, 95], GNNs are message-passing tools between humans and objects; in region classification [18], GNNs perform reasoning on graphs which connect regions and classes.
Text Classification.
Text classification is an essential and classical problem in natural language processing. The classical GCN models [4, 28, 41, 47, 59, 82] and the GAT model [122] have been applied to solve the problem, but they only use the structural information between documents and do not exploit much text information.

Peng et al. [91] propose a graph-CNN-based deep learning model. It first turns texts into graphs of words, and then the graph convolution operations in [86] are used on the word graph. Zhang et al. [152] propose the Sentence LSTM to encode text. The whole sentence is represented by a single state which contains a global sentence-level state and several substates for individual words, and the global sentence-level representation is used for classification tasks.

These methods either view a document or a sentence as a graph of word nodes or rely on the document citation relation to construct the graph. Yao et al. [142] regard the documents and words as nodes to construct the corpus graph (hence a heterogeneous graph) and use the Text GCN to learn embeddings of words and documents. Sentiment classification can also be regarded as a text classification problem, and a Tree-LSTM approach is proposed by [109].
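As a rough illustration of the corpus-graph idea behind Text GCN [142], the sketch below runs a two-layer GCN over a word-document graph with identity node features and reads out document logits. The construction of the normalized adjacency a_hat (e.g., TF-IDF document-word edges as in the original paper) is assumed to be given and is not shown here; hidden sizes and the training snippet are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGCN(nn.Module):
    """Two-layer GCN over a corpus graph whose nodes are words and documents.

    a_hat: symmetrically normalized adjacency of the word-document graph;
    one-hot (identity) node features are used, as in the Text GCN setting.
    """
    def __init__(self, num_nodes, hidden_dim, num_classes):
        super().__init__()
        self.gc1 = nn.Linear(num_nodes, hidden_dim)   # acts on one-hot node features
        self.gc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, a_hat):
        x = torch.eye(a_hat.size(0), device=a_hat.device)   # identity node features
        h = F.relu(a_hat @ self.gc1(x))
        return a_hat @ self.gc2(h)                           # logits for every node

# Hypothetical usage: train with cross-entropy only on the labeled document nodes.
# logits = TextGCN(num_nodes, 200, num_classes)(a_hat)
# loss = F.cross_entropy(logits[doc_idx], doc_labels)
```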
Sequence Labeling.
As each node in GNNs has its hidden state, we can utilize the hidden state to address the sequence labeling problem if we consider every word in the sentence as a node. Zhang et al. [152] utilize the Sentence LSTM to label the sequence. They conduct experiments on POS-tagging and NER tasks and achieve promising performance.

Semantic role labeling is another sequence labeling task. Marcheggiani and Titov [76] propose a Syntactic GCN to solve the problem. The Syntactic GCN, which operates on a directed graph with labeled edges, is a special variant of the GCN [59]. It uses edge-wise gates that enable the model to regulate the contribution of each dependency edge. The Syntactic GCNs over syntactic dependency trees are used as sentence encoders to learn latent feature representations of words in the sentence. Marcheggiani and Titov [76] also reveal that GCNs and LSTMs are functionally complementary in the task.
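The following is a much-simplified sketch of one Syntactic GCN layer in the spirit of [76]: direction-specific weight matrices (incoming, outgoing, self-loop) and an edge-wise sigmoid gate that scales each dependency edge's contribution. The dependency-label-specific parameters of the original model are omitted, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One simplified Syntactic GCN layer over a dependency tree.

    edges: list of (head, dependent) word indices. Each edge is used in both
    directions with direction-specific weights, and an edge-wise sigmoid gate
    scales its contribution; label-specific biases are omitted for brevity.
    """
    def __init__(self, dim):
        super().__init__()
        self.w = nn.ModuleDict({d: nn.Linear(dim, dim) for d in ("in", "out", "self")})
        self.gate = nn.ModuleDict({d: nn.Linear(dim, 1) for d in ("in", "out", "self")})

    def forward(self, h, edges):
        n = h.size(0)
        # start from the gated self-loop message of every word
        agg = [torch.sigmoid(self.gate["self"](h[i])) * self.w["self"](h[i]) for i in range(n)]
        for head, dep in edges:
            for direction, src, dst in (("out", head, dep), ("in", dep, head)):
                g = torch.sigmoid(self.gate[direction](h[src]))   # edge-wise gate
                agg[dst] = agg[dst] + g * self.w[direction](h[src])
        return torch.relu(torch.stack(agg))

# Hypothetical usage on a 5-word sentence with toy dependency edges.
h = torch.randn(5, 64)
edges = [(1, 0), (1, 2), (2, 3), (3, 4)]
h_new = SyntacticGCNLayer(64)(h, edges)
```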
Besides structural and nonstructural scenarios, there are some other scenarios where graph neural networks play an important role. In this subsection, we will introduce generative graph models and combinatorial optimization with GNNs.
Generative Models.
Generative models for real-world graphs have drawn significant attention for their essential applications, including modeling social interactions, discovering new chemical structures, and constructing knowledge graphs. As deep learning methods have a powerful ability to learn the implicit distribution of graphs, there has been a recent surge in neural graph generative models.

NetGAN [104] is one of the first works to build a neural graph generative model, which generates graphs via random walks. It transforms the problem of graph generation into the problem of walk generation, takes the random walks from a specific graph as input, and trains a walk generative model using the GAN architecture. While the generated graph preserves essential topological properties of the original graph, the number of nodes cannot change during generation and remains the same as in the original graph. GraphRNN [146] generates the adjacency matrix of a graph by generating the adjacency vector of each node step by step, so it can output networks with different numbers of nodes.

Instead of generating the adjacency matrix sequentially, MolGAN [27] predicts a discrete graph structure (the adjacency matrix) at once and utilizes a permutation-invariant discriminator to solve the node permutation problem in the adjacency matrix. Besides, it applies a reward network for RL-based optimization towards desired chemical properties. Furthermore, [75] proposes constrained variational autoencoders to ensure the semantic validity of generated graphs, and GCPN [145] incorporates domain-specific rules through reinforcement learning.

Li et al. [73] propose a model that generates edges and nodes sequentially and utilizes a graph neural network to extract the hidden state of the current graph, which is used to decide the action in the next step during the sequential generative process.
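As a toy illustration of step-by-step adjacency generation in the spirit of GraphRNN [146] (not a faithful reimplementation: the original uses an additional edge-level RNN and a BFS node ordering, both omitted here), the sketch below lets a node-level GRU consume the previous node's adjacency vector and sample edges to earlier nodes.

```python
import torch
import torch.nn as nn

class TinyGraphRNN(nn.Module):
    """Simplified autoregressive graph generator inspired by GraphRNN [146].

    Nodes are generated one by one; at step i the node-level GRU state
    summarizes the graph so far and an MLP outputs edge probabilities to the
    previously generated nodes.
    """
    def __init__(self, max_nodes, hidden_dim=64):
        super().__init__()
        self.max_nodes = max_nodes
        self.node_rnn = nn.GRUCell(max_nodes, hidden_dim)
        self.edge_mlp = nn.Sequential(nn.Linear(hidden_dim, max_nodes), nn.Sigmoid())

    @torch.no_grad()
    def sample(self, num_nodes):
        # num_nodes must not exceed max_nodes
        adj = torch.zeros(num_nodes, num_nodes)
        state = torch.zeros(1, self.node_rnn.hidden_size)
        prev_row = torch.zeros(1, self.max_nodes)      # adjacency vector of previous node
        for i in range(1, num_nodes):
            state = self.node_rnn(prev_row, state)
            probs = self.edge_mlp(state)[0, :i]        # edges to already generated nodes
            row = torch.bernoulli(probs)
            adj[i, :i], adj[:i, i] = row, row          # undirected graph
            prev_row = torch.zeros(1, self.max_nodes)
            prev_row[0, :i] = row
        return adj

# Hypothetical usage: an untrained model simply samples random graphs.
print(TinyGraphRNN(max_nodes=10).sample(num_nodes=6))
```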
Combinatorial Optimization.
Combinatorial optimization problems over graphs are a set of NP-hard problems that attract much attention from scientists in many fields. Some specific problems like the Traveling Salesman Problem (TSP) have various heuristic solutions. Recently, using deep neural networks to solve such problems has become a hotspot, and some of the solutions further leverage graph neural networks because of the graph structure of the problems.

Bello et al. [9] first propose a deep learning approach to tackle TSP. Their method consists of two parts: a Pointer Network [123] for parameterizing rewards and a policy gradient [108] module for training. This work has been proved to be comparable with traditional approaches. However, Pointer Networks are designed for sequential data like texts, while order-invariant encoders are more appropriate for such tasks.

Khalil et al. [57] and Kool and Welling [61] improve the above method by including graph neural networks. The former work first obtains the node embeddings from structure2vec [26] and then feeds them into a Q-learning module for making decisions. The latter builds an attention-based encoder-decoder system. By replacing the reinforcement learning module with an attention-based decoder, it is more efficient to train. These works achieve better performance than previous algorithms, which proves the representation power of graph neural networks.

Nowak et al. [88] focus on the Quadratic Assignment Problem, i.e., measuring the similarity of two graphs. The GNN-based model learns node embeddings for each graph independently and matches them using an attention mechanism. Even in situations where traditional relaxation-based methods may not perform well, this model still shows satisfying performance.
Due to the rapid development of Information Extraction (IE), huge volumes of data have been extracted. How to automatically verify these data becomes a vital problem for various data-driven applications, e.g., knowledge graph completion [126] and open domain question answering [15]. Hence, many recent research efforts have been devoted to Fact Verification (FV), which aims to verify given claims with the evidence retrieved from plain text. More specifically, given a claim, an FV system is asked to label it as "SUPPORTED", "REFUTED", or "NOT ENOUGH INFO", which indicates that the evidence can support, refute, or is not sufficient for the claim. An example of the FV task is shown in Table 8.5.

Existing FV methods formulate FV as a Natural Language Inference (NLI) [3] task. However, they utilize simple evidence combination methods such as concatenating the evidence or just dealing with each evidence-claim pair separately. These methods are unable to grasp sufficient relational and logical information among the evidence. In fact, many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. As shown in Table 8.5, for this particular example, we cannot verify the given claim by checking any piece of evidence in isolation. The claim can be verified only by understanding and reasoning over multiple pieces of evidence.
Table 8.5 A case of a claim that requires integrating multiple pieces of evidence to verify. The representation "{DocName, LineNum}" means the evidence is extracted from the document "DocName" and its line number is LineNum

Claim: Al Jardine is an American rhythm guitarist
Truth evidence: {Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence: {Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band's rhythm guitarist, and for occasionally singing lead vocals on singles such as "Help Me, Rhonda" (1965), "Then I Kissed Her" (1965), and "Come Go with Me" (1978)
(2) Alan Charles Jardine (born September 3, 1942) is an American musician, singer and songwriter who co-founded the Beach Boys
(3) In 2010, Jardine released his debut solo studio album, A Postcard from California
(4) In 1988, Jardine was inducted into the Rock and Roll Hall of Fame as a member of the Beach Boys
(5) Ray Jardine American rock climber, lightweight backpacker, inventor, author, and global adventurer
Label: SUPPORTED
To integrate and reason over information from multiple pieces of evidence, [156] proposes a graph-based evidence aggregating and reasoning (GEAR) framework. Specifically, [156] first builds a fully connected evidence graph and encourages information propagation among the evidence. Then, GEAR aggregates the pieces of evidence and adopts a classifier to decide whether the evidence can support, refute, or is not sufficient for the claim. Intuitively, by sufficiently exchanging and reasoning over evidence information on the evidence graph, the proposed model can make the best use of the information for verifying claims. For example, by delivering the information "Los Angeles County is the most populous county in the USA" to "the Rodney King riots occurred in Los Angeles County" through the evidence graph, the synthesized information can support "The Rodney King riots took place in the most populous county in the USA". Furthermore, GEAR adopts an effective pretrained language representation model, BERT [29], to better grasp both evidence and claim semantics.

Zhou et al. [156] employ a three-step pipeline with components for document retrieval, sentence selection, and claim verification to solve the task. In the document retrieval and sentence selection stages, they simply follow the method from [44], since it achieves the highest score on evidence recall in the former FEVER shared task. They propose the GEAR framework for the final claim verification stage. The full pipeline is illustrated in Fig. 8.15.

Given a claim and the retrieved evidence, GEAR first utilizes a sentence encoder to obtain representations for the claim and the evidence. Then it builds a fully connected evidence graph and uses an Evidence Reasoning Network (ERNet) to propagate information among evidence and reason over the graph. Finally, it utilizes an evidence aggregator to infer the final results.
Sentence Encoder.
Given an input sentence, GEAR employs BERT [29] as the sentence encoder by extracting the final hidden state of the [CLS] token as the representation, where [CLS] is the special classification token in BERT:

\[
\mathbf{e}_i = \mathrm{BERT}(e_i, c), \qquad \mathbf{c} = \mathrm{BERT}(c). \tag{8.105}
\]

Fig. 8.15 The pipeline used in [156]. The GEAR framework is illustrated in the claim verification section
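A minimal sketch of the sentence encoder in Eq. (8.105) using the HuggingFace transformers library is shown below; the model name, example sentences, and preprocessing details are illustrative assumptions rather than GEAR's exact configuration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

claim = "Al Jardine is an American rhythm guitarist."
evidence = "He is best known as the band's rhythm guitarist."

# Evidence-claim pair, encoded as "[CLS] evidence [SEP] claim [SEP]".
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
e_i = outputs.last_hidden_state[:, 0]   # final hidden state of the [CLS] token

# The claim representation c is obtained the same way from the claim alone.
with torch.no_grad():
    c = model(**tokenizer(claim, return_tensors="pt")).last_hidden_state[:, 0]
```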
Evidence Reasoning Network.
To encourage the information propagation among evidence, GEAR builds a fully connected evidence graph where each node indicates a piece of evidence. It also adds a self-loop to every node because each node needs the information from itself in the message propagation process. We use $\mathbf{h}^t = \{\mathbf{h}_1^t, \mathbf{h}_2^t, \ldots, \mathbf{h}_N^t\}$ to represent the hidden states of nodes at layer $t$. The initial hidden state of each evidence node is initialized by the evidence representation: $\mathbf{h}_i^0 = \mathbf{e}_i$.

Inspired by recent work on semi-supervised graph learning and relational reasoning [59, 90, 122], Zhou et al. [156] propose an Evidence Reasoning Network (ERNet) to propagate information among the evidence nodes. It first uses an MLP to compute the attention coefficient between a node $i$ and its neighbor $j$ ($j \in \mathcal{N}_i$):

\[
y_{ij} = \mathbf{W}_1^{(t-1)}\big(\mathrm{ReLU}\big(\mathbf{W}_0^{(t-1)}[\mathbf{h}_i^{(t-1)}; \mathbf{h}_j^{(t-1)}]\big)\big), \tag{8.106}
\]

where $\mathcal{N}_i$ denotes the set of neighbors of node $i$, $\mathbf{W}_0^{(t-1)}$ and $\mathbf{W}_1^{(t-1)}$ are weight matrices, and $[\cdot;\cdot]$ denotes the concatenation operation.

Then, it normalizes the coefficients using the softmax function:

\[
\alpha_{ij} = \mathrm{Softmax}_j(y_{ij}) = \frac{\exp(y_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(y_{ik})}. \tag{8.107}
\]

Finally, the normalized attention coefficients are used to compute a linear combination of the neighbor features, and thus we obtain the features for node $i$ at layer $t$:

\[
\mathbf{h}_i^{(t)} = \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{h}_j^{(t-1)}. \tag{8.108}
\]

By stacking $T$ layers of ERNet, [156] assumes that each piece of evidence can grasp enough information by communicating with other evidence.
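A compact sketch of one ERNet layer implementing Eqs. (8.106)-(8.108) on a fully connected evidence graph with self-loops might look as follows; the hidden size and the stacking loop are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERNetLayer(nn.Module):
    """One evidence-reasoning layer over a fully connected evidence graph (Eqs. 8.106-8.108)."""
    def __init__(self, dim):
        super().__init__()
        self.w0 = nn.Linear(2 * dim, dim)
        self.w1 = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (N, dim) evidence hidden states
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),      # h_i repeated over j
                           h.unsqueeze(0).expand(n, n, -1)],     # h_j repeated over i
                          dim=-1)                                 # [h_i; h_j]
        y = self.w1(F.relu(self.w0(pairs))).squeeze(-1)           # (N, N) coefficients
        alpha = F.softmax(y, dim=-1)                              # normalize over neighbors j
        return alpha @ h                                          # weighted sum of neighbors

# Hypothetical usage: stack T = 2 layers over 5 pieces of evidence.
h = torch.randn(5, 768)
for layer in [ERNetLayer(768) for _ in range(2)]:
    h = layer(h)
```

Stacking several such layers corresponds to repeated rounds of evidence exchange on the fully connected graph.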
Evidence Aggregator.

Zhou et al. [156] employ an evidence aggregator to gather information from different evidence nodes and obtain the final hidden state $\mathbf{o}$. The aggregator may utilize different aggregating strategies, and [156] suggests three aggregators:

Attention Aggregator. Zhou et al. [156] use the representation of the claim $\mathbf{c}$ to attend over the hidden states of evidence and get the final aggregated state $\mathbf{o}$:

\[
y_j = \mathbf{W}_1'\big(\mathrm{ReLU}\big(\mathbf{W}_0'[\mathbf{c}; \mathbf{h}_j^{(T)}]\big)\big), \qquad
\alpha_j = \mathrm{Softmax}(y_j) = \frac{\exp(y_j)}{\sum_{k=1}^{N}\exp(y_k)}, \qquad
\mathbf{o} = \sum_{k=1}^{N} \alpha_k \mathbf{h}_k^{(T)}. \tag{8.109}
\]

Max Aggregator. The max aggregator performs the element-wise max operation among hidden states:

\[
\mathbf{o} = \mathrm{Max}(\mathbf{h}_1^{(T)}, \mathbf{h}_2^{(T)}, \ldots, \mathbf{h}_N^{(T)}). \tag{8.110}
\]

Mean Aggregator. The mean aggregator performs the element-wise mean operation among hidden states:

\[
\mathbf{o} = \mathrm{Mean}(\mathbf{h}_1^{(T)}, \mathbf{h}_2^{(T)}, \ldots, \mathbf{h}_N^{(T)}). \tag{8.111}
\]

Once the final state $\mathbf{o}$ is obtained, GEAR employs a one-layer MLP to get the final prediction $\mathbf{l}$:

\[
\mathbf{l} = \mathrm{Softmax}(\mathrm{ReLU}(\mathbf{W}\mathbf{o} + \mathbf{b})), \tag{8.112}
\]

where $\mathbf{W}$ and $\mathbf{b}$ are parameters.

Zhou et al. [156] conduct experiments on the large-scale benchmark dataset for Fact Extraction and VERification (FEVER) [115]. Experimental results show that the proposed framework outperforms recent state-of-the-art baseline systems. A further case study indicates that the framework can better leverage multi-evidence information and reason over the evidence for FV.

In this chapter, we have introduced network representation learning, which turns the network structure information into the continuous vector space and makes deep learning techniques possible on network data.

Unsupervised network representation learning comes first during the development of NRL. Spectral Clustering, DeepWalk, LINE, GraRep, and other methods utilize the network structure for vertex embedding learning. Afterward, TADW incorporates text information into NRL under the framework of matrix factorization. The NEU algorithm then moves one step forward and proposes a general method to improve the quality of any learned network embeddings. Other unsupervised methods also consider preserving specific properties of the network topology, e.g., community structure and asymmetry.

Recently, semi-supervised NRL algorithms have attracted much attention. This kind of method focuses on a specific task such as classification and uses the labels of the training set to improve the quality of network embeddings. Node2vec, MMDW, and many other methods, including the family of Graph Neural Networks (GNNs), are proposed to this end. Semi-supervised algorithms can achieve better results as they can take advantage of more information from the specific task.

For a further understanding of network representation learning, you can also find more related papers in this paper list: https://github.com/thunlp/GNNPapers. There are also some recommended surveys and books, including the following:
• Cui et al. A survey on network embedding [24].
• Goyal and Ferrara. Graph embedding techniques, applications, and performance: A survey [37].
• Zhang et al. Network representation learning: A survey [149].
• Wu et al. A comprehensive survey on graph neural networks [133].
• Zhou et al. Graph neural networks: A review of methods and applications [155].
• Zhang et al. Deep learning on graphs: A survey [154].

In the future, for better network representation learning, some directions require further efforts:
(1) More Complex and Realistic Networks. An intriguing direction would be representation learning on heterogeneous and dynamic networks, since most real-world network data fall into these categories. The vertices and edges in a heterogeneous network may belong to different types. Networks in real life are also highly dynamic; e.g., the friendship between Facebook users may establish and disappear. These characteristics require researchers to design specific algorithms for them. Network embedding learning on dynamic network structures is, therefore, an important task. There have been some works proposed [14, 105] for much more complex and realistic settings.
(2) Deeper Model Architectures. Conventional deep neural networks can stack hundreds of layers to get better performance because the deeper structure has more parameters and may improve the expressive power significantly. However, NRL and GNN models are usually shallow; in fact, most of them have no more than three layers. Taking GCN as an example, as the experiments in [70] show, stacking multiple GCN layers will result in over-smoothing: the representations of all vertices converge to the same value. Although some researchers have managed to tackle this problem [70, 125] to some extent, it remains a limitation of NRL. Designing deeper model architectures is an exciting challenge for future research and will be a considerable contribution to the understanding of NRL.
(3) Scalability. Scalability determines whether an algorithm can be applied in practice. How to apply NRL methods in real-world web-scale scenarios such as social networks or recommendation systems has been an essential problem for most network embedding algorithms. Scaling up NRL methods, especially GNNs, is difficult because many core steps are computationally expensive in a big-data environment. For example, network data are not regular Euclidean data, and each node has its own neighborhood structure; therefore, batch tricks cannot be easily applied. Moreover, computing the graph Laplacian is also unfeasible when there are millions or even billions of nodes and edges. Several works have proposed solutions to this problem [143, 153, 157], and we are paying close attention to the progress.
References
1. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to representprograms with graphs. In
Proceedings of ICLR , 2018.2. Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors.In
Proceedings of FOCS , 2006.3. Gabor Angeli and Christopher D Manning. Naturalli: Natural logic inference for commonsense reasoning. In
Proceedings of EMNLP , 2014.4. James Atwood and Don Towsley. Diffusion-convolutional neural networks. In
Proceedingsof NeurIPS , 2016.5. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation byjointly learning to align and translate. In
Proceedings of ICLR , 2015.6. Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interactionnetworks for learning about objects, relations and physics. In
Proceedings of NeurIPS , 2016.7. Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-sequence learning using gatedgraph neural networks. In
Proceedings of ACL , 2018.8. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embed-ding and clustering. In
Proceedings of NeurIPS , volume 14, 2001.9. Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural com-binatorial optimization with reinforcement learning. In
Proceedings of ICLR , 2017.10. Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shapecorrespondence with anisotropic convolutional neural networks. In
Proceedings of NeurIPS ,2016.11. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locallyconnected networks on graphs. In
Proceedings of ICLR , 2014.12. Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey ofgraph embedding: Problems, techniques, and applications.
IEEE Transactions on Knowledgeand Data Engineering , 30(9):1616–1637, 2018.13. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations withglobal structural information. In
Proceedings of CIKM , 2015.14. Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang.Heterogeneous network embedding via deep architectures. In
Proceedings of SIGKDD , 2015.15. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answeropen-domain questions. In
Proceedings of the ACL , 2017.16. Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: Fast learning with graph convolutional networksvia importance sampling. In
Proceedings of ICLR , 2018.17. Mo Chen, Qiong Yang, and Xiaoou Tang. Directed graph embedding. In
Proceedings of IJCAI ,2007.18. Xinlei Chen, Lijia Li, Li Feifei, and Abhinav Gupta. Iterative visual reasoning beyond con-volutions. In
Proceedings of CVPR , 2018.19. Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks formachine reading. In
Proceedings of EMNLP , 2016.20. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In
Proceedings of EMNLP , 2014.21. Yoon-Sik Cho, Greg Ver Steeg, and Aram Galstyan. Socially relevant venue clustering fromcheck-in data. In
Proceedings of KDD Workshop , 2013.22. Wojciech Chojnacki and Michael J Brooks. A note on the locally linear embedding algorithm.
International Journal of Pattern Recognition and Artificial Intelligence , 23(08):1739–1752,2009.23. Fan RK Chung and Fan Chung Graham.
Spectral graph theory . Number 92. American Math-ematical Soc., 1997.24. Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding.
IEEE Transactions on Knowledge and Data Engineering, 2018.
25. Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting.
IEEE Transactions on Intelligent Transportation Systems , 2019.26. Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models forstructured data. In
Proceedings of ICML , 2016.27. Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small moleculargraphs. arXiv preprint arXiv:1805.11973, 2018.28. Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural net-works on graphs with fast localized spectral filtering. In
Proceedings of NeurIPS , 2016.29. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training ofdeep bidirectional transformers for language understanding. In
Proceedings of NAACL , 2019.30. Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G Tollis. Algorithms fordrawing graphs: an annotated bibliography.
Computational Geometry , 4(5):235–282, 1994.31. David K Duvenaud, Dougal Maclaurin, Jorge Aguileraiparraguirre, Rafael Gomezbombarelli,Timothy D Hirzel, Alan Aspuruguzik, and Ryan P Adams. Convolutional networks on graphsfor learning molecular fingerprints. In
Proceedings of NeurIPS , 2015.32. Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. Random-walk com-putation of similarities between nodes of a graph with application to collaborative recommen-dation.
IEEE Transactions on Knowledge and Data Engineering , 19(3):355–369, 2007.33. Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction usinggraph convolutional networks. In
Proceedings of NeurIPS , 2017.34. Thomas MJ Fruchterman and Edward M Reingold. Graph drawing by force-directed place-ment.
Software: Practice and Experience , 21(11):1129–1164, 1991.35. Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolu-tional networks. In
Proceedings of SIGKDD , 2018.36. Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. A convolutional encodermodel for neural machine translation. In
Proceedings of ACL , 2017.37. Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and perfor-mance: A survey.
Knowledge-Based Systems , 151:78–94, 2018.38. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In
Pro-ceedings of SIGKDD , 2016.39. Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, and Jifeng Dai. Learning region features forobject detection. In
Proceedings of ECCV , 2018.40. Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge trans-fer for out-of-knowledge-base entities : A graph neural network approach. In
Proceedings ofIJCAI , 2017.41. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on largegraphs. In
Proceedings of NeurIPS , 2017.42. William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Meth-ods and applications.
IEEE Data(base) Engineering Bulletin , 40:52–74, 2017.43. David K Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs viaspectral graph theory.
Applied and Computational Harmonic Analysis , 30(2):129–150, 2011.44. Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz,and Iryna Gurevych. Ukp-athene: Multi-sentence textual entailment for claim verification. In
Proceedings of EMNLP , 2018.45. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for imagerecognition. In
Proceedings of CVPR , 2016.46. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residualnetworks. In
Proceedings of ECCV , 2016.47. Mikael Henaff, Joan Bruna, and Yann Lecun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.48. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
49. Yedid Hoshen. Vain: Attentional multi-agent predictive modeling. In
Proceedings of NeurIPS ,2017.50. Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for objectdetection. In
Proceedings of CVPR , 2018.51. Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fastgraph representation learning. In
Proceedings of NeurIPS , 2018.52. Yann Jacob, Ludovic Denoyer, and Patrick Gallinari. Learning latent representations of nodesfor classifying in heterogeneous social networks. In
Proceedings of WSDM , 2014.53. Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deeplearning on spatio-temporal graphs. In
Proceedings of CVPR , 2016.54. Tomihisa Kamada and Satoru Kawai. An algorithm for drawing general undirected graphs.
Information Processing Letters , 31(1):7–15, 1989.55. Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric PXing. Rethinking knowledge graph propagation for zero-shot learning. In
Proceedings ofCVPR , 2019.56. Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecu-lar graph convolutions: moving beyond fingerprints.
Journal of computer-aided moleculardesign , 30(8):595–608, 2016.57. Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorialoptimization algorithms over graphs. In
Proceedings of NeurIPS , 2017.58. Thomas Kipf, Ethan Fetaya, Kuanchieh Wang, Max Welling, and Richard S Zemel. Neuralrelational inference for interacting systems. In
Proceedings of ICML , 2018.59. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutionalnetworks. In
Proceedings of ICLR , 2017.60. Stephen G Kobourov. Spring embedders and force directed graph drawing algorithms. arXivpreprint arXiv:1201.3011, 2012.61. WWM Kool and M Welling. Attention solves your tsp. arXiv preprint arXiv:1803.08475,2018.62. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deepconvolutional neural networks. In
Proceedings of NeurIPS , 2012.63. Theodoros Lappas, Evimaria Terzi, Dimitrios Gunopulos, and Heikki Mannila. Finding effec-tors in social networks. In
Proceedings of SIGKDD , 2010.64. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
Nature , 521(7553):436,2015.65. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learningapplied to document recognition.
Proceedings of the IEEE , 1998.66. Chungwei Lee, Wei Fang, Chihkuan Yeh, and Yuchiang Frank Wang. Multi-label zero-shotlearning with structured knowledge graphs. In
Proceedings of CVPR , 2018.67. Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. The dynamics of viral marketing.
ACM Transactions on the Web (TWEB) , 1(1):5, 2007.68. Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of thenews cycle. In
Proceedings of SIGKDD , 2009.69. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In
Proceedings of NeurIPS , 2014.70. Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional net-works for semi-supervised learning. In
Proceedings of AAAI , 2018.71. Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neuralnetwork: Data-driven traffic forecasting. In
Proceedings of ICLR , 2018.72. Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S Zemel. Gated graph sequenceneural networks. In
Proceedings of ICLR , 2016.73. Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deepgenerative models of graphs. In
Proceedings of ICLR , 2018.74. Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic objectparsing with graph lstm. In
Proceedings of ECCV, 2016.
75. Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In
Proceedings of NeurIPS , 2018.76. Diego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networksfor semantic role labeling. In
Proceedings of EMNLP , 2017.77. Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Usingknowledge graphs for image classification. In
Proceedings of CVPR , 2017.78. Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesicconvolutional neural networks on riemannian manifolds. In
Proceedings of ICCV workshops ,2015.79. Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In
Proceedings of NeurIPS , 2012.80. T Mikolov and J Dean. Distributed representations of words and phrases and their composi-tionality.
Proceedings of NeurIPS , 2013.81. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of wordrepresentations in vector space. In
Proceedings of ICLR , 2013.82. Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, andMichael M Bronstein. Geometric deep learning on graphs and manifolds using mixture modelcnns. In
Proceedings of CVPR , 2017.83. Medhini Narasimhan, Svetlana Lazebnik, and Alexander Gerhard Schwing. Out of the box:Reasoning with graph convolution nets for factual visual question answering. In
Proceedingsof NeurIPS , 2018.84. Mark EJ Newman. Finding community structure in networks using the eigenvectors of matri-ces.
Physical Review E , 74(3):036104, 2006.85. Mark EJ Newman. Modularity and community structure in networks.
Proceedings of theNational Academy of Sciences , 103(23):8577–8582, 2006.86. Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neuralnetworks for graphs. In
Proceedings of ICML , 2016.87. Will Norcliffebrown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structuresfor interpretable visual question answering. In
Proceedings of NeurIPS , 2018.88. Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. Revised note on learningquadratic assignment with graph neural networks. In
Proceedings of IEEE DSW 2018 , 2018.89. Mingdong Ou, Peng Cui, Jian Pei, and Wenwu Zhu. Asymmetric transitivity preserving graphembedding. In
Proceedings of SIGKDD , 2016.90. Rasmus Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In
Proceedingsof NeurIPS , 2018.91. Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song,and Qiang Yang. Large-scale hierarchical text classification with recursively regularized deepgraph-cnn. In
Proceedings of WWW , 2018.92. Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wentau Yih. Cross-sentence n-ary relation extraction with graph lstms.
Transactions of the Association for Com-putational Linguistics , 5(1):101–115, 2017.93. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social rep-resentations. In
Proceedings of SIGKDD , 2014.94. Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. Column networks for collectiveclassification. In
Proceedings of AAAI , 2017.95. Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Songchun Zhu. Learninghuman-object interactions by graph parsing neural networks. In
Proceedings of ECCV , 2018.96. Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Semi-supervised user geolocation viagraph convolutional networks. In
Proceedings of ACL , 2018.97. Sungmin Rhee, Seokjun Seo, and Sun Kim. Hybrid approach of relation network and localizedgraph convolutional filtering for breast cancer subtype classification. In
Proceedings of IJCAI, 2018.
98. Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
99. Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. In
Proceedings of ICML , 2018.100. Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks.In
Proceedings of ICLR , 2018.101. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar-dini. The graph neural network model.
IEEE TNN 2009 , 20(1):61–80, 2009.102. Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, andMax Welling. Modeling relational data with graph convolutional networks. In
Proceedings ofESWC , 2018.103. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data.
AI magazine , 29(3):93–93, 2008.104. Oleksandr Shchur, Daniel Zugner, Aleksandar Bojchevski, and Stephan Gunnemann. Netgan:Generating graphs via random walks. In
Proceedings of ICML , 2018.105. Chuan Shi, Binbin Hu, Wayne Xin Zhao, and S Yu Philip. Heterogeneous information networkembedding for recommendation.
IEEE Transactions on Knowledge and Data Engineering ,31(2):357–370, 2018.106. Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutionalneural networks on graphs. In
Proceedings of CVPR , 2017.107. Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backprop-agation. In
Proceedings of NeurIPS , 2016.108. Richard S Sutton and Andrew G Barto.
Reinforcement learning: An introduction . MIT press,2018.109. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representa-tions from tree-structured long short-term memory networks. In
Proceedings of ACL , 2015.110. Jian Tang, Meng Qu, and Qiaozhu Mei. Pte: Predictive text embedding through large-scaleheterogeneous text networks. In
Proceedings of SIGKDD , 2015.111. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In
Proceedings of WWW , 2015.112. Lei Tang and Huan Liu. Relational learning via latent social dimensions. In
Proceedings ofSIGKDD , 2009.113. Lei Tang and Huan Liu. Leveraging social media networks for classification.
Data Miningand Knowledge Discovery , 23(3):447–478, 2011.114. Damien Teney, Lingqiao Liu, and Anton Van Den Hengel. Graph-structured representationsfor visual question answering. In
Proceedings of CVPR , 2017.115. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: alarge-scale dataset for fact extraction and verification. In
Proceedings of NAACL-HLT , 2018.116. Cunchao Tu, Hao Wang, Xiangkai Zeng, Zhiyuan Liu, and Maosong Sun. Community-enhanced network representation learning for network analysis. arXiv preprintarXiv:1611.06645, 2016.117. Cunchao Tu, Xiangkai Zeng, Hao Wang, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun,Bo Zhang, and Leyu Lin. A unified framework for community detection and network rep-resentation learning.
IEEE Transactions on Knowledge and Data Engineering (TKDE) ,31(6):1051–1065, 2018.118. Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin deepwalk: Dis-criminative learning of network representation. In
Proceedings of IJCAI , 2016.119. Cunchao Tu, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Transnet: translation-basednetwork representation learning for social relation extraction. In
Proceedings of IJCAI , 2017.120. Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix com-pletion. arXiv preprint arXiv:1706.02263, 2017.121. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez,and Lukasz Kaiser. Attention is all you need. In
Proceedings of NeurIPS , 2017.122. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, andYoshua Bengio. Graph attention networks. In
Proceedings of ICLR, 2018.
123. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In
Proceedings ofNeurIPS , 2015.124. Jacco Wallinga and Peter Teunis. Different epidemic curves for severe acute respiratory syn-drome reveal similar impacts of control measures.
American Journal of epidemiology , 2004.125. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In
Proceedingsof SIGKDD , 2016.126. Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A surveyof approaches and applications.
TKDE , 29(12):2724–2743, 2017.127. Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Communitypreserving network embedding. In
Proceedings of AAAI , 2017.128. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Het-erogeneous graph attention network. In
Proceedings of WWW , 2019.129. Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embed-dings and knowledge graphs. In
Proceedings of CVPR , 2018.130. Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. Cross-lingual knowledge graphalignment via graph convolutional networks. In
Proceedings of EMNLP , 2018.131. Zhouxia Wang, Tianshui Chen, Jimmy S J Ren, Weihao Yu, Hui Cheng, and Liang Lin. Deepreasoning with knowledge graph for social relationship understanding. In
Proceedings ofIJCAI , 2018.132. Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, andAndrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In
Proceedings of NeurIPS , 2017.133. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu.A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.134. Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Kenichi Kawarabayashi, andStefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In
Proceedings of ICML , 2018.135. Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks forskeleton-based action recognition. In
Proceedings of AAAI , 2018.136. Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network repre-sentation learning with rich text information. In
Proceedings of IJCAI , 2015.137. Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhance-ment via high order proximity approximation. In
Proceedings of IJCAI , 2017.138. Cheng Yang, Maosong Sun, Wayne Xin Zhao, Zhiyuan Liu, and Edward Y Chang. A neuralnetwork approach to jointly modeling social networks and mobile trajectories.
ACM Trans-actions on Information Systems (TOIS) , 35(4):36, 2017.139. Cheng Yang, Jian Tang, Maosong Sun, Ganqu Cui, and Liu Zhiyuan. Multi-scale informationdiffusion prediction with reinforced recurrent networks. In
Proceedings of IJCAI , 2019.140. Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegativematrix factorization approach. In
Proceedings of WSDM , 2013.141. Jaewon Yang, Julian McAuley, and Jure Leskovec. Detecting cohesive and 2-mode commu-nities indirected and undirected networks. In
Proceedings of WSDM , 2014.142. Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classifi-cation. In
Proceedings of AAAI , 2019.143. Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and JureLeskovec. Graph convolutional neural networks for web-scale recommender systems. In
Pro-ceedings of SIGKDD , 2018.144. Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec.Hierarchical graph representation learning with differentiable pooling. In
Proceedings ofNeurIPS , 2018.145. Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay S Pande, and Jure Leskovec. Graph convolutionalpolicy network for goal-directed molecular graph generation. In
Proceedings of NeurIPS, 2018.
146. Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In
Proceedings of ICML , 2018.147. Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: Adeep learning framework for traffic forecasting. In
Proceedings of ICLR , 2018.148. Victoria Zayats and Mari Ostendorf. Conversation modeling on reddit using a graph-structuredlstm.
Transactions of the Association for Computational Linguistics , 6:121–132, 2018.149. Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Network representation learning:A survey.
IEEE transactions on Big Data , 2018.150. Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit Yan Yeung. Gaan:Gated attention networks for learning on large and spatiotemporal graphs. In
Proceedings ofUAI , 2018.151. Yizhou Zhang, Yun Xiong, Xiangnan Kong, Shanshan Li, Jinhong Mi, and Yangyong Zhu.Deep collective classification in heterogeneous information networks. In
Proceedings ofWWW , 2018.152. Yue Zhang, Qi Liu, and Linfeng Song. Sentence-state lstm for text representation. In
Pro-ceedings of ACL , 2018.153. Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Maosong Sun, Zhichong Fang, Bo Zhang, andLeyu Lin. Cosine: Compressive network embedding on large-scale information networks.arXiv preprint arXiv:1812.08972, 2018.154. Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprintarXiv:1812.04202, 2018.155. Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, ChangchengLi, and Maosong Sun. Graph neural networks: A review of methods and applications. arXivpreprint arXiv:1812.08434, 2018.156. Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun.GEAR: Graph-based evidence aggregating and reasoning for fact verification. In
Proceedingsof ACL 2019 , 2019.157. Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng Qu. Graphvite: A high-performance cpu-gpu hybrid system for node embedding. In
Proceedings of WWW , 2019.158. Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In
Proceedings of WWW , 2018.159. Julian G Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jurgen Schmidhuber. Recurrenthighway networks. In
Proceedings of ICML , 2016.160. Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effectswith graph convolutional networks.
Intelligent Systems in Molecular Biology , 34(13):258814,2018.
Chapter 9
Cross-Modal Representation
Abstract
Cross-modal representation learning is an essential part of representation learning, which aims to learn latent semantic representations for modalities including texts, audio, images, videos, etc. In this chapter, we first introduce typical cross-modal representation models. After that, we review several real-world applications related to cross-modal representation learning, including image captioning, visual relation detection, and visual question answering.
As introduced in Wikipedia, a modality is the classification of a single independent channel of sensory input/output between a computer and a human. More generally, modalities are different means of information exchange between human beings and the real world. The classification is usually based on the form in which information is presented to a human. Typical modalities in the real world include texts, audio, images, videos, etc.

Cross-modal representation learning is an important part of representation learning. In fact, artificial intelligence is inherently a multi-modal task [30]. Human beings are exposed to multi-modal information every day, and it is normal for us to integrate information from different modalities and make comprehensive judgments. Furthermore, different modalities are not independent; they are more or less correlated. For example, the judgment of a syllable is made not only by the sound we hear but also by the movement of the lips and tongue of the speaker we see. An experiment in [48] shows that a voiced /ba/ with a visual /ga/ is perceived by most people as a /da/. Another example is human beings' ability to consider the 2D image and 3D scan of the same object together and reconstruct its structure: correlations between image and scan can be found based on the fact that a discontinuity of depth in the scan usually indicates a sharp line in the image [52]. Inspired by this, it is natural for us to consider the possibility of combining inputs from multiple modalities in our artificial intelligence systems and generating cross-modal representations.

Ngiam et al. [52] explore the possibility of merging multiple modalities into one learning task. The authors divide a typical machine learning task into three stages: feature learning, supervised learning, and prediction. They further propose four kinds of learning settings for multi-modalities:
(1) Single-modal learning: all stages are done on just one modality.
(2) Multi-modal fusion: all stages are done with all modalities available.
(3) Cross-modal learning: in the feature learning stage, all modalities are available, but in supervised learning and prediction, only one modality is used.
(4) Shared representation learning: in the feature learning stage, all modalities are available. In supervised learning, only one modality is used, and in prediction, a different modality is used.

Experiments show promising results for these multi-modal tasks. When more modalities are provided (as in multi-modal fusion, cross-modal learning, and shared representation learning), the performance of the system is generally better.

In the following part of this chapter, we will first introduce cross-modal representation models, which are fundamental parts of cross-modal representation learning in NLP. Then, we will introduce several critical applications, such as image captioning, visual relationship detection, and visual question answering.
Cross-modal representation learning aims to build embeddings using information from multiple modalities. Existing cross-modal representation models involving the text modality can be generally divided into two categories: (1) some works [30, 77] try to fuse information from different modalities into unified embeddings (e.g., visually grounded word representations); (2) researchers also try to build embeddings for different modalities in a common semantic space, which allows the model to compute cross-modal similarity. Such cross-modal similarity can be further utilized for downstream tasks, such as zero-shot recognition [5, 14, 18, 53, 65] and cross-media retrieval [23, 55]. In this section, we will introduce these two kinds of cross-modal representation models, respectively.
Computing word embeddings is a fundamental task in representation learning for natural language processing. Typical word embedding models (like Word2vec [49]) are trained on a text corpus. These models, while being extremely successful, cannot discover implicit semantic relatedness between words that could be expressed in other modalities. Kottur et al. [30] provide an example: even though eat and stare at seem unrelated from text, images might show that when people are eating something, they also tend to stare at it. This implies that considering other modalities when constructing word embeddings may help capture more implicit semantic relatedness.
Fig. 9.1 The architecture for word embedding with global visual context
Vision, being one of the most critical modalities, has attracted attention from researchers seeking to improve word representation. Several models that incorporate visual information to improve word embeddings have been proposed. We introduce two typical word representation models incorporating visual information in the following.
Xu et al. [77] propose a model that makes a natural attempt to incorporate visual features. It claims that in most word representation models, only local context information (e.g., trying to predict a word using neighboring words and phrases) is considered, while global text information (e.g., the topic of the passage) is often neglected. This model extends a simple local context model by using visual information as global features (see Fig. 9.1).

The input of the model is an image $I$ and a sequence describing it. It is based on a simple local context language model: when we consider a certain word $w_t$ in a sequence, its local feature is the average of the embeddings of words in a window, i.e., $\{w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}\}$. The visual feature is computed directly from the image $I$ using a CNN and then used as the global feature. The local feature and the global feature are then concatenated into a vector $\mathbf{f}$. The predicted probability of a word $w_t$ filling this blank is the softmax-normalized product of $\mathbf{f}$ and the word embedding $\mathbf{w}_t$:

\[
o_{w_t} = \mathbf{w}_t^{\top}\mathbf{f}, \tag{9.1}
\]

\[
P(w_t \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}; I) = \frac{\exp(o_{w_t})}{\sum_i \exp(o_{w_i})}. \tag{9.2}
\]

The model is optimized by maximizing the average log probability:
\[
\mathcal{L} = \frac{1}{T}\sum_{t=k}^{T-k} \log P(w_t \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}; I). \tag{9.3}
\]

The classification error will be back-propagated to the local text vectors (i.e., word embeddings), the visual vector, and all model parameters. This accomplishes joint learning of a set of word embeddings, a language model, and the model used for visual encoding.

Kottur et al. [30] also propose a neural model to capture fine-grained semantics from visual information. Instead of focusing on literal pixels, the abstract scene behind the vision is considered. The model takes a pair of a visual scene and a related word sequence $(I, w)$ as input. At each training step, a window is used upon the word sequence $w$, forming a subsequence $S_w$. All the words in $S_w$ are fed into the input layer using one-hot encoding, and therefore the dimension of the input layer is $|V|$, the size of the vocabulary. The words are then transformed into their embeddings, and the hidden layer is the average of all these embeddings. The size of the hidden layer is $N_H$, which is also the dimension of the word embeddings. The hidden layer and the output layer are connected by a fully connected matrix of dimension $N_H \times N_K$ and a softmax function. The output layer can be regarded as a probability distribution over a discrete-valued function $g(\cdot)$ of the visual scene $I$ (details will be given in the following paragraph). The entire model is optimized by minimizing the objective function:

\[
\mathcal{L} = -\log P(g(I) \mid S_w). \tag{9.4}
\]

The most important part of the model is the function $g(\cdot)$. It maps the visual scene $I$ into the set $\{1, 2, \ldots, N_K\}$, which indicates what kind of abstract scene it is. In practice, it is learned offline using K-means clustering, and each cluster represents the semantics of one kind of visual scene and, consequently, of the word sequence $w$, which is designed to be related to the scene.

Large-scale datasets partially support the success of deep learning methods. Even though the scales of datasets continue to grow larger and more categories are involved, the annotation of datasets is expensive and time-consuming. For many categories, there are very limited or even no instances, which restricts the scalability of recognition systems.
Zero-shot recognition is proposed to solve the problem mentioned above; it aims to classify instances of categories that have not been seen during training. Many works propose to utilize cross-modal representations for zero-shot image classification [5, 14, 18, 53, 65]. Specifically, image representations and category representations are embedded into a common semantic space, where similarities between image and category representations can serve for further classification. For example, in such a common semantic space, the embedding of an image of a cat is expected to be closer to the embedding of the category cat than to the embedding of the category truck.

The challenge of zero-shot learning lies in the absence of instances of unseen categories, which makes it difficult to obtain well-performing classifiers for unseen categories. Frome et al. [18] present a model that utilizes both labeled images and information from large-scale plain text for zero-shot image classification. They try to leverage semantic information from word embeddings and transfer it to image classification systems.

Their model is motivated by the fact that word embeddings incorporate semantic information of concepts or categories, which can potentially be utilized as classifiers of the corresponding categories. Similar categories cluster well in semantic space. For example, in word embedding space, the nearest neighbors of the term tiger shark are similar kinds of sharks, such as bull shark, blacktip shark, sandbar shark, and oceanic whitetip shark. In addition, boundaries between different clusters are clear. The aforementioned properties indicate that word embeddings can be further utilized as classifiers for recognition systems.

Specifically, the model first pretrains word embeddings using the Skip-gram text model on large-scale Wikipedia articles. For visual feature extraction, the model pretrains a deep convolutional neural network for 1,000 object categories on ImageNet. The pretrained word embeddings and the convolutional neural network are used to initialize the proposed Deep Visual-Semantic Embedding model (DeViSE).

To train the proposed model, they replace the softmax layer of the pretrained convolutional neural network with a linear projection layer. The model is trained to predict the word embeddings of categories for images using a hinge ranking loss:

\[
\mathcal{L}(I, y) = \sum_{j \neq y} \max\big[0, \gamma - \mathbf{w}_y \mathbf{M} \mathbf{I} + \mathbf{w}_j \mathbf{M} \mathbf{I}\big], \tag{9.5}
\]

where $\mathbf{w}_y$ and $\mathbf{w}_j$ are the learned word embeddings of the positive label and a sampled negative label, respectively, $\mathbf{I}$ denotes the feature of the image obtained from the convolutional neural network, $\mathbf{M}$ denotes the trainable parameters of the linear projection layer, and $\gamma$ is a hyperparameter of the hinge ranking loss. Given an image, the objective requires the model to produce a higher score for the correct label than for randomly chosen labels, where the score is defined as the dot product of the projected image feature and the word embedding of a term.
At test time, given a test image, the score of each possible category is obtained using the same approach as during training. Note that a crucial difference at test time is that the classifiers (word embeddings) are expanded to all possible categories, including unseen ones. Thus the model is capable of predicting unseen categories. Experimental results show that DeViSE can make zero-shot predictions with more semantically reasonable errors, which means that even if the prediction is not exactly correct, it is semantically related to the ground truth class. However, a drawback is that although the model can utilize semantic information in word embeddings for zero-shot image classification, using word embeddings as classifiers restricts the flexibility of the model, which results in inferior performance on the original 1,000 categories compared to the original softmax classifier.
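To make the ranking objective in Eq. (9.5) concrete, the following is a minimal NumPy sketch of a DeViSE-style hinge ranking loss and zero-shot scoring. The array shapes, function names, and randomly sampled embeddings are illustrative assumptions, not the original implementation.

```python
import numpy as np

def hinge_rank_loss(img_feat, M, word_emb, y, margin=0.1):
    """DeViSE-style hinge ranking loss (Eq. 9.5) for a single image.

    img_feat : (d_img,)  visual feature I from the CNN
    M        : (d_word, d_img) trainable linear projection
    word_emb : (n_labels, d_word) fixed word embeddings of all labels
    y        : int, index of the correct label
    """
    proj = M @ img_feat                      # project the image into word space
    scores = word_emb @ proj                 # dot-product score for every label
    losses = np.maximum(0.0, margin - scores[y] + scores)
    losses[y] = 0.0                          # exclude the positive label itself
    return losses.sum()

def zero_shot_predict(img_feat, M, word_emb_all):
    """At test time the label set may include unseen categories."""
    scores = word_emb_all @ (M @ img_feat)
    return int(np.argmax(scores))

# toy usage with random data
rng = np.random.default_rng(0)
d_img, d_word, n_seen, n_all = 2048, 300, 5, 8
M = rng.normal(size=(d_word, d_img)) * 0.01
img = rng.normal(size=d_img)
emb_seen = rng.normal(size=(n_seen, d_word))
emb_all = rng.normal(size=(n_all, d_word))   # includes unseen categories
print(hinge_rank_loss(img, M, emb_seen, y=2))
print(zero_shot_predict(img, M, emb_all))
```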
Inspired by DeViSE, [53] proposes a model named ConSE that also utilizes semantic information from word embeddings for zero-shot classification. A vital difference from DeViSE is that ConSE obtains the semantic embedding of a test image as a convex combination of the word embeddings of seen categories, where the weights of the combination are determined by the scores of the corresponding categories.

Specifically, they train a deep convolutional neural network on the seen categories. At test time, given a test image $I$ (possibly from unseen categories), they obtain the top $T$ confident predictions over the seen categories, where $T$ is a hyperparameter. Then the semantic embedding $f(I)$ of $I$ is determined by the convex combination of the semantic embeddings of the top $T$ confident categories, which can be formally defined as follows:

$f(I) = \frac{1}{Z} \sum_{t=1}^{T} P(\hat{y}(I, t) \mid I) \cdot w(\hat{y}(I, t))$, (9.6)

where $\hat{y}(I, t)$ is the $t$th most confident training label for $I$, $w(\hat{y}(I, t))$ is the semantic embedding (word embedding) of $\hat{y}(I, t)$, and $Z$ is a normalization factor given by

$Z = \sum_{t=1}^{T} P(\hat{y}(I, t) \mid I)$. (9.7)

After obtaining the semantic embedding $f(I)$, the score of a category $m$ is given by the cosine similarity of $f(I)$ and $w(m)$.

The motivation of ConSE is the assumption that novel categories can be modeled as convex combinations of seen categories. If the model is highly confident about a prediction (i.e., $P(\hat{y}(I, 1) \mid I) \approx 1$), $f(I)$ will be close to $w(\hat{y}(I, 1))$. If the predictions are ambiguous (e.g., $P(\texttt{tiger} \mid I) = 0.5$ and $P(\texttt{lion} \mid I) = 0.5$), $f(I)$ will lie between $w(\texttt{lion})$ and $w(\texttt{tiger})$, and they expect the semantic embedding $f(I) = 0.5\, w(\texttt{lion}) + 0.5\, w(\texttt{tiger})$ to be close to the semantic embedding $w(\texttt{liger})$ (a hybrid cross between lions and tigers).

Although ConSE and DeViSE share many similarities, there are also some crucial differences. DeViSE replaces the softmax layer of the pretrained visual model with a projection layer, while ConSE preserves the softmax layer. ConSE does not need further training and uses a convex combination of semantic embeddings to perform zero-shot classification at test time. Experimental results show that ConSE outperforms DeViSE on unseen categories, indicating better generalization capability. However, the performance of ConSE on seen categories is not as competitive as that of DeViSE and the original softmax classifier.

Socher et al. [65] present a cross-modal representation model for zero-shot recognition. In their model, all word vectors are initialized with pretrained 50-dimensional word vectors and are kept fixed during training. Each image is represented by a vector $I$ constructed by a deep convolutional neural network. They first project an image into the semantic word space by minimizing

$\mathcal{L}(\Theta) = \sum_{y \in Y_s} \sum_{I^{(i)} \in X_y} \left\| w_y - \theta^{(2)} f(\theta^{(1)} I^{(i)}) \right\|^2$, (9.8)

where $Y_s$ denotes the set of image classes seen in the training data, $X_y$ denotes the set of image vectors of class $y$, $w_y$ denotes the word vector of class $y$, and $\Theta = (\theta^{(1)}, \theta^{(2)})$ denotes the parameters of the 2-layer neural network with $f(\cdot) = \tanh(\cdot)$ as the activation function.

They observe that instances from unseen categories are usually outliers of the complete data manifold. Following this observation, they first classify an instance into seen or unseen categories via outlier detection methods.
Then the instance is classified using the corresponding classifiers. Formally, they marginalize over a binary random variable $V \in \{s, u\}$ which denotes whether an instance belongs to the seen or unseen categories, so the probability is given by

$P(y \mid I) = \sum_{V \in \{s, u\}} P(y \mid V, I)\, P(V \mid I)$. (9.9)

For seen image classes, they simply use a softmax classifier to determine $P(y \mid s, I)$, while for unseen classes, they assume an isometric Gaussian distribution around each of the novel class word vectors and assign classes based on their likelihood. To detect novelty, they calculate a Local Outlier Probability by means of the Gaussian error function.
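Looking back at ConSE, the convex-combination step of Eqs. (9.6) and (9.7) can be sketched in a few lines. The softmax classifier is simulated with random probabilities, and all names and dimensions are illustrative assumptions rather than the original model.

```python
import numpy as np

def conse_embedding(probs, label_emb, T=3):
    """Convex combination of the word embeddings of the top-T seen labels.

    probs     : (n_seen,)  softmax probabilities over seen categories
    label_emb : (n_seen, d) word embeddings of the seen categories
    """
    top = np.argsort(probs)[::-1][:T]          # top-T most confident labels
    Z = probs[top].sum()                       # normalization factor (Eq. 9.7)
    f = (probs[top, None] * label_emb[top]).sum(axis=0) / Z   # Eq. 9.6
    return f

def rank_categories(f, cand_emb):
    """Score every candidate category (seen or unseen) by cosine similarity."""
    f = f / np.linalg.norm(f)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    return cand @ f

rng = np.random.default_rng(0)
n_seen, n_cand, d = 10, 15, 300
probs = rng.dirichlet(np.ones(n_seen))        # stand-in for the softmax output
label_emb = rng.normal(size=(n_seen, d))
cand_emb = rng.normal(size=(n_cand, d))       # includes unseen categories
f = conse_embedding(probs, label_emb, T=3)
print(np.argmax(rank_categories(f, cand_emb)))
```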
Learning cross-modal representations from different modalities in a common semantic space makes it easy to compute cross-modal similarities, which can facilitate many important cross-modal tasks, such as cross-media retrieval. With the rapid growth of multimedia data such as text, image, video, and audio on the Internet, the need to retrieve information across different modalities has become stronger. Cross-media retrieval is an important task in the multimedia area, which aims to perform retrieval across different modalities such as text and image. For example, a user may submit an image of a white horse and retrieve relevant information from different modalities, such as textual descriptions of horses, and vice versa.

A significant challenge of cross-modal retrieval is the domain discrepancy between different modalities. Besides, for a specific area of interest, cross-modal data can be insufficient, which limits the performance of existing cross-modal retrieval methods. Many works have focused on these challenges in cross-modal retrieval [23, 24].
Huang et al. [24] present a framework that tries to relieve the cross-modal data sparsity problem via transfer learning. They propose to leverage knowledge from a large-scale single-modal dataset to boost model training on the small-scale dataset. The massive auxiliary dataset is denoted as the source domain, and the small-scale dataset of interest is denoted as the target domain. In their work, they adopt ImageNet [12], a large-scale image database, as the source domain.

Formally, the training set consists of data from the source domain $Src = \{I_p^s, y_p^s\}_{p=1}^{P}$ and the target domain $Tar_{tr} = \{(I_j^s, t_j^s), y_j^s\}_{j=1}^{J}$, where $(I, t)$ is an image/text pair with label $y$. Similarly, the test set can be denoted as $Tar_{te} = \{(I_m^s, t_m^s), y_m^s\}_{m=1}^{M}$. The goal of their model is to transfer knowledge from $Src$ to boost the model performance on $Tar_{te}$ for cross-media retrieval.
Their model consists of a modal-sharing transfer subnetwork and a layer-sharing correlation subnetwork. In the modal-sharing transfer subnetwork, they adopt the convolutional layers of AlexNet [32] to extract image features for the source and target domains, and use word vectors to obtain text features. The image and text features pass through two fully connected layers, where single-modal and cross-modal knowledge transfer are performed.

Single-modal knowledge transfer aims to transfer knowledge from images in the source domain to images in the target domain. The main challenge is the domain discrepancy between the two image datasets. They propose to solve this problem by minimizing the Maximum Mean Discrepancy (MMD) of the image modality between the source and target domains. MMD is calculated in a layer-wise style in the fully connected layers. By minimizing MMD in a reproducing kernel Hilbert space, the image representations from the source and target domains are encouraged to have the same distribution, so knowledge from images in the source domain is expected to transfer to images in the target domain. Besides, the image encoder in the source domain is also fine-tuned by optimizing a softmax loss on labeled image instances.

Cross-modal knowledge transfer aims to transfer knowledge between image and text in the target domain. Text and image representations from an annotated pair in the target domain are encouraged to be close to each other by minimizing their Euclidean distance. The cross-modal transfer loss of image and text representations is also computed in a layer-wise style in the fully connected layers. The domain discrepancy between image and text modalities is expected to be reduced in the high-level layers.

In the layer-sharing correlation subnetwork, representations from the modal-sharing transfer subnetwork in the target domain are fed into shared fully connected layers to obtain the final common representation for both image and text. As the parameters are shared between the two modalities, the last two fully connected layers are expected to capture the cross-modal correlation. Their model also utilizes label information in the target domain by minimizing a softmax loss on labeled image/text pairs. After obtaining the final common representations, cross-media retrieval can be achieved by simply computing the nearest neighbors in the semantic space.

As an extension of [24], [23] also focuses on dealing with domain discrepancy and insufficient cross-modal data for cross-media retrieval in specific areas. Huang and Peng [23] present a framework that transfers knowledge from a large-scale cross-media dataset (source domain) to boost the model performance on another small-scale cross-media dataset (target domain).

A crucial difference from [24] is that the dataset in the source domain also consists of image/text pairs with label annotations instead of the single-modal setting in [24]. Since both domains contain image and text media types, the domain discrepancy comes from the media-level discrepancy within the same media type and the correlation-level discrepancy in image/text correlation patterns between different domains. They propose to transfer intra-media semantic and inter-media correlation knowledge by jointly reducing domain discrepancies on the media level and the correlation level.

To extract distributed features for the different media types, they adopt VGG19 [63] as the image encoder and Word CNN [29] as the text encoder.
The two domains have the same architecture but do not share parameters. The extracted image/text features pass through two fully connected layers, respectively, where the media-level transfer is performed. Similar to [24], they reduce domain discrepancies within the same modalities by minimizing the Maximum Mean Discrepancy (MMD) between the source and target domains. The MMD is computed in a layer-wise style to transfer knowledge within the same modalities. They also minimize the Euclidean distance between image/text representation pairs in both the source and target domains to preserve the semantic information across modalities.
Correlation-level transfer aims to reduce the domain discrepancy in image/text correlation patterns across domains. In both domains, the image and text representations share the last two fully connected layers to obtain the common representation for each domain. They optimize a layer-wise MMD loss between the shared fully connected layers of the different domains for correlation-level knowledge transfer, which encourages the source and target domains to have the same image/text correlation patterns. Finally, both domains are trained with the label information of image/text pairs. Note that the source domain and target domain do not necessarily share the same label set.

In addition, they propose a progressive transfer mechanism, a curriculum learning method aiming to promote the robustness of model training. This is achieved by selecting easy samples for model training in the early period and gradually increasing the difficulty during training. The difficulty of training samples is measured according to the bidirectional cross-media retrieval consistency.
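The layer-wise MMD objective used for knowledge transfer in [23, 24] can be illustrated with a small sketch. The RBF-kernel estimator below is one standard way of computing MMD; the kernel bandwidth and feature dimensions are assumptions for illustration, not the exact configuration of the original models.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix between two sets of layer activations."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X_src, X_tgt, gamma=1.0):
    """Squared Maximum Mean Discrepancy between source and target features."""
    k_ss = rbf_kernel(X_src, X_src, gamma).mean()
    k_tt = rbf_kernel(X_tgt, X_tgt, gamma).mean()
    k_st = rbf_kernel(X_src, X_tgt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# toy usage: activations of one fully connected layer in each domain
rng = np.random.default_rng(0)
src = rng.normal(loc=0.0, size=(64, 256))
tgt = rng.normal(loc=0.5, size=(64, 256))   # shifted distribution
print(mmd2(src, tgt))                        # minimized during transfer training
```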
Image captioning is the task of automatically generating natural language descriptions for images. It is a fundamental task in artificial intelligence, which connects natural language processing and computer vision. Compared with other computer vision tasks, such as image classification and object detection, image captioning is significantly harder for two reasons: first, not only objects but also the relationships between them have to be detected; second, besides basic judgments and classification, natural language sentences have to be generated. Traditional methods for image captioning usually use retrieval models or generation models, whose ability to generalize is weaker than that of recent deep neural network models. In this section, we introduce several typical models of both genres.
The primary pipeline of retrieval models is: (1) represent images and/or sentences using specific features; (2) for new images and/or sentences, search for probable candidates according to the similarity of features.

Linking words to images has a rich history, and [50] (a retrieval model) is the first image annotation system. This paper tries to build a keyword assigning system for images from labeled data. The pipeline is as follows:

(1) Image segmentation. Every image is divided into several parts, using the simplest rectangular division. The reason for doing so is that an image is typically annotated with multiple labels, each of which often corresponds to only a part of it. Segmentation helps reduce noise in labeling.

(2) Feature extraction. Features of every part of the image are extracted.

(3) Clustering. Feature vectors of image segments are divided into several clusters. Each cluster accumulates word frequencies and thereby calculates word likelihoods. Concretely,

$P(w_i \mid c_j) = \frac{P(c_j \mid w_i) P(w_i)}{\sum_k P(c_j \mid w_k) P(w_k)} = \frac{n_{ji}}{N_j}$, (9.10)

where $n_{ji}$ is the number of times word $w_i$ appears in cluster $j$, and $N_j$ is the number of times that all words appear in cluster $j$. The calculation uses frequencies as probabilities.

(4) Inference. For a new image, the model divides it into segments, extracts features for every part, and finally aggregates the keywords assigned to every part to obtain the final prediction.

The key idea of this model is image segmentation. Take a landscape picture with two parts, mountain and sky; both parts will be annotated with both labels. However, if another picture has two parts, mountain and river, the two mountain parts would hopefully fall into the same cluster and be found to share the same label mountain. In this way, labels can be assigned to the correct part of the image, and noise can be alleviated.

Another typical retrieval model is proposed by [17], which can assign a linking score between an image and a sentence. This score is calculated through an intermediate space of meaning. The representation of the meaning space is a triple of the form ⟨object, action, scene⟩. Each slot of the triple has a finite discrete candidate set. The problem of mapping images and sentences into the meaning space is solved with a Markov random field.

Different from the previous model, this system can not only perform image captioning but also do the inverse: given a sentence, the model provides probable associated images. At the inference stage, the image (sentence) is first mapped to the intermediate meaning space, and then we search the pool for the sentence (image) that has the best matching score.

After that, researchers also proposed many retrieval models that consider different characteristics of the images, such as [21, 28, 34].

Different from retrieval-based models, the basic pipeline of generation models is: (1) use computer vision techniques to extract image features; (2) generate sentences from these features using methods such as language models or sentence templates. Kulkarni et al. [33] propose a system that makes a tight connection between the particular image and the sentence generating process. The model uses visual detectors to detect specific objects, as well as attributes of a single object and relationships between multiple objects.
Then it constructs a conditional random field to incorporate unary image potentials and higher order text potentials and thereby predicts labels for the image. The labels predicted by the conditional random field (CRF) are arranged as a triple, e.g., ⟨⟨white, cloud⟩, in, ⟨blue, sky⟩⟩.

Then sentences are generated according to the labels. There are two ways to build a sentence based on the triple skeleton. (1) The first is to use an n-gram language model. For example, when trying to decide whether or not to put a glue word $x$ between a pair of meaningful words (i.e., words inside the triple) $a$ and $b$, the probabilities $\hat{p}(axb)$ and $\hat{p}(ab)$ are compared to make the decision, where $\hat{p}$ is the standard length-normalized probability of the n-gram language model. (2) The second is to use a set of descriptive language templates, which alleviates the problem of grammar mistakes made by the language model.

Further, [16] proposes a novel framework to explicitly represent the relationship between the structure of an image and the structure of its caption sentence. The method, Visual Dependency Representation, detects objects in the image and detects the relationships between these objects based on the proposed Visual Dependency Grammar, which includes eight typical relations such as beside or above. The image can then be arranged as a dependency graph, where nodes are objects and edges are relations. This image dependency graph can be aligned with the syntactic dependency representation of the caption sentence. The paper further provides four templates for generating descriptive sentences from the extracted dependency representation.

Besides these two typical works, there are many other generation models for image captioning, such as [15, 35, 78]. In [33], it was claimed in 2011 that in image captioning tasks:
Natural language generation still remains an open research problem. Most previous work is based on retrieval and summarization. From 2015, inspired by advances in neural language models and neural machine translation, a number of end-to-end neural image captioning models based on the encoder-decoder framework have been proposed. These new models significantly improve the ability to generate natural language descriptions.
Traditional machine translation models typically stitch many subtasks together, such as individual word translation and reordering, to perform sentence and paragraph translation. Recent neural machine translation models, such as [8], use a single encoder-decoder model, which can be conveniently optimized by stochastic gradient descent. The task of image captioning is inherently analogous to machine translation because it can also be regarded as a translation task, where the source "language" is an image. The encoders and decoders used for machine translation are typically RNNs, which are a natural choice for sequences of words. For image captioning, a CNN is chosen as the encoder, and an RNN is still used as the decoder.
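Before turning to concrete models, here is a minimal sketch of such an encoder-decoder captioner: the image feature (assumed to come from a CNN) conditions a recurrent decoder that scores or generates word sequences. For brevity, a plain RNN cell is used instead of the LSTM of the actual models, and all sizes, parameters, and token indices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_img, d_emb, d_hid = 1000, 2048, 256, 256   # vocabulary and layer sizes (assumed)

# parameters of a plain RNN decoder (real systems use an LSTM)
W_img = rng.normal(size=(d_hid, d_img)) * 0.01   # image feature -> initial hidden state
W_emb = rng.normal(size=(V, d_emb)) * 0.01       # word embeddings
W_in  = rng.normal(size=(d_hid, d_emb)) * 0.01
W_rec = rng.normal(size=(d_hid, d_hid)) * 0.01
W_out = rng.normal(size=(V, d_hid)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def caption_log_likelihood(img_feat, words):
    """log P(s | I): condition on the image, then on the previous words."""
    h = np.tanh(W_img @ img_feat)                # image vector conditions the decoder
    logp, prev = 0.0, 0                          # index 0 plays the role of START
    for w in words:
        h = np.tanh(W_in @ W_emb[prev] + W_rec @ h)
        p = softmax(W_out @ h)
        logp += np.log(p[w])
        prev = w
    return logp

def greedy_decode(img_feat, end_token=1, max_len=20):
    """Sampling-style decoding: always pick the most probable next word."""
    h = np.tanh(W_img @ img_feat)
    prev, out = 0, []
    for _ in range(max_len):
        h = np.tanh(W_in @ W_emb[prev] + W_rec @ h)
        w = int(np.argmax(softmax(W_out @ h)))
        if w == end_token:
            break
        out.append(w)
        prev = w
    return out

img = rng.normal(size=d_img)
print(caption_log_likelihood(img, [5, 42, 7, 1]))
print(greedy_decode(img))
```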
Fig. 9.2 The architecture of the encoder-decoder framework for image captioning
Vinyals et al. [70] propose the most typical model using the encoder-decoder framework for image captioning (see Fig. 9.2). Concretely, a CNN model is used to encode the image into a fixed-length vector, which is believed to contain the necessary information for captioning. With this vector, an RNN language model generates the natural language description; this is the decoder. Here, the decoder is similar to the LSTM used for machine translation. The first unit takes the image vector as its input vector, and the subsequent units take the previous word embedding as input. Each unit outputs a vector $o$ and passes a vector to the next unit. $o$ is further fed into a softmax layer, whose output $p$ is the probability of each word in the vocabulary. The ways of dealing with these calculated probabilities differ between training and testing:

Training.
These probabilities $p$ are used to calculate the likelihood of the provided description sentences. Considering the nature of RNNs, it is easy to decompose the joint probability into conditional probabilities:

$\log P(s \mid I) = \sum_{t=1}^{N} \log P(w_t \mid I, w_0, \ldots, w_{t-1})$, (9.11)

where $s = \{w_1, w_2, \ldots, w_N\}$ is the sentence with its words, $w_0$ is a special START token, and $I$ is the image. Stochastic gradient descent can thereby be performed to optimize the model.

Testing.
There are multiple approaches to generate sentences given an image. The first one is called Sampling: at each step, the single word with the highest probability in $p$ is chosen and used as the input of the next unit, until the END token is generated or a maximal length is reached. The second one is called Beam Search: at each step (when the current length of the sentences is $t$), the $k$ best sentences are kept. Each of them generates several new sentences of length $t+1$, and again, only $k$ new sentences are kept. Beam Search provides a better approximation for

$s^* = \arg\max_s \log P(s \mid I)$. (9.12)

The research on image captioning closely follows that on machine translation. Inspired by [6], which uses the attention mechanism in machine translation, [76] introduces visual attention into the encoder-decoder image captioning model. The major bottleneck of [70] is the fact that information from the image is shown to the LSTM decoder only at the first decoding unit, which requires the encoder to squeeze all useful information into one fixed-length vector. In contrast, [76] does not require such compression. The CNN encoder does not produce one vector for the entire image; instead, it produces $L$ region vectors $I_i$, each of which is the representation of a part of the image. At every step of decoding, the inputs include the standard LSTM inputs (i.e., the output and hidden state of the last step, $o_{t-1}$ and $h_{t-1}$) and an input vector $z$ from the encoder. Here, $z$ is the weighted sum of the image vectors $I_i$: $z = \sum_i \alpha_i I_i$, where $\alpha_i$ is a weight computed from $I_i$ and $h_{t-1}$. Throughout the training process, the model learns to focus on parts of the image for generating the next word by producing larger weights $\alpha$ on more relevant parts, as shown in Fig. 9.3.

Fig. 9.3 An example of image captioning with attention mechanism (the example is obtained from the implementation of Yunjey Choi, https://github.com/yunjey/show-attend-and-tell)

While the above paper uses soft attention over the image, [27] makes an explicit alignment between image fragments and sentence fragments before generating a description for the image. In the first stage, the alignment stage, sentence and image fragments are aligned by being mapped into a shared space. Concretely, sentence fragments (i.e., $n$ consecutive words) are encoded using a bidirectional LSTM into embeddings $s$, and image fragments (i.e., parts of the image, as well as the entire image) are encoded using a CNN into embeddings $I$. The similarity score between image $I$ and sentence $s$ is computed as

$\text{sim}(I, s) = \sum_{t \in g_s} \max_{i \in g_I} (0,\ I_i^\top s_t)$, (9.13)

where $g_s$ is the fragment set of sentence $s$, and $g_I$ is the fragment set of image $I$. The alignment is then optimized by minimizing a ranking loss $\mathcal{L}$ for both sentences and images:

$\mathcal{L} = \sum_{I} \Big[ \sum_{s' \neq s_I} \max\big(0,\ \text{sim}(I, s') - \text{sim}(I, s_I) + 1\big) + \sum_{I' \neq I} \max\big(0,\ \text{sim}(I', s_I) - \text{sim}(I, s_I) + 1\big) \Big]$, (9.14)

where $s_I$ denotes the ground-truth description of image $I$, and $s'$ and $I'$ range over mismatched sentences and images. The assumption behind this alignment procedure is similar to [50] (see Sect. 9.3.1): all description sentences are regarded as (possibly noisy) labels for every image fragment, and based on the massive training data, the model is hopefully trained to align caption sentences to their corresponding image fragments. The second stage is similar to the basic model in [70], but the alignment results are used to provide more precise training data.

As mentioned above, [76] gives the decoder the ability to focus attention on different parts of the image for different words. However, there are some nonvisual words in the decoding process. For example, words such as the and of depend more on semantic information than on visual information. Furthermore, words such as phone after cell, or meter after parking, are usually generated by the language model. To prevent the gradient of a nonvisual word from decreasing the effectiveness of visual attention during caption generation, [43] adopts an adaptive attention model with a visual sentinel.
At each time step, the model determines whether it depends on an image region or on the visual sentinel. The adaptive attention model [43] uses attention in the process of generating a word rather than in updating the LSTM state; it utilizes a "visual sentinel" vector $x_t$ and image region vectors $I_i$. Here, $x_t$ is produced by the inputs and states of the LSTM at time step $t$, while $I_i$ is provided by the CNN encoder. The adaptive context vector $\hat{c}_t$ is then the weighted sum of the $L$ image region vectors $I_i$ and the visual sentinel $x_t$:

$\hat{c}_t = \sum_{i=1}^{L} \alpha_i I_i + \alpha_{L+1} x_t$, (9.15)

where the $\alpha_i$ are weights computed from $I_i$, $x_t$, and the LSTM hidden state $h_t$, and we have $\sum_{i=1}^{L+1} \alpha_i = 1$. Finally, the probability of a word in the vocabulary at time $t$ can be calculated in a residual form:

$p_t = \text{Softmax}\big(W_p(\hat{c}_t + h_t)\big)$, (9.16)

where $W_p$ is a learned weight parameter.

Many existing image captioning models with attention allocate attention over the image's regions, whose size is often 7 × 7 or 14 × 14, determined by the last pooling layer in the CNN encoder.
Fig. 9.4 An example of the activated region of a latent channel
Anderson et al. [2] first calculate attention at the level of objects. They employ Faster R-CNN [58], trained on ImageNet [60] and Visual Genome [31], to predict attribute classes, such as an open oven, a green bottle, a floral dress, and so on. After that, they apply attention over the valid bounding boxes to obtain fine-grained attention that helps caption generation.

Besides, [11] rethinks the form of latent states in image captioning, which usually compresses the two-dimensional visual feature maps encoded by the CNN into a one-dimensional vector as the input of the language model. They find that a language model with 2D states can preserve spatial locality, which links the input visual domain and the output linguistic domain, as observed by visualizing the transformation of hidden states. Word embeddings and hidden states in [11] are 3D tensors of size $C \times H \times W$, i.e., $C$ channels, each of size $H \times W$. The encoded feature maps are directly input to the 2D language model instead of going through an average pooling layer. In the 2D language model, the convolution operator takes the place of matrix multiplication in the 1D model, and mean pooling is used to generate the output word probability distribution from the 2D hidden states. Figure 9.4 shows the activated region of a latent channel at the $t$th step. When a threshold is set for the activated regions, it is revealed that particular channels are associated with specific nouns in the decoding process, which helps in better understanding the process of generating captions.

Traditional methods train the captioning model by maximizing the likelihood of training examples, which creates a gap between the optimization objective and the evaluation metrics. To alleviate this problem, [59] uses reinforcement learning to directly maximize the CIDEr metric [69]. CIDEr reflects the diversity of generated captions by giving high weights to low-frequency n-grams in the training set, reflecting that people prefer detailed captions rather than universal ones like a boy is playing a game. To encourage the distinctiveness of captions, [10] adopts contrastive learning. Their model learns to discriminate between the caption of a given image and the caption of a similar image by maximizing the difference between the ground truth positive pair and the mismatched negative pair. Experiments show that contrastive learning increases the diversity of captions significantly.

Furthermore, automatic evaluation metrics, such as BLEU [54], METEOR [13], ROUGE [38], CIDEr [69], SPICE [1], and so on, may neglect some novel expressions restrained by the ground truth captions. To better evaluate the naturalness and diversity of captions, [9] proposes a framework based on Conditional Generative Adversarial Networks, whose generator tries to achieve a higher score from the evaluator, while the evaluator tries to distinguish between the generated caption and human descriptions for a given image, as well as between the given image and a mismatched description. A user study shows that the trained generator can generate more natural and diverse captions than a model trained by maximum likelihood estimation, while the trained evaluator is more consistent with human evaluation.

Besides the works introduced above, there are also many variants of the basic encoder-decoder model, such as [20, 26, 40, 45, 51, 71, 73].
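To make the beam search procedure described earlier in this section (Eq. 9.12) concrete, the following is a minimal sketch of beam search over an arbitrary next-word distribution. The `step` interface and the toy bigram table are illustrative assumptions rather than any particular captioning system.

```python
import numpy as np

def beam_search(step, start_token, end_token, beam_size=3, max_len=20):
    """Keep the beam_size best partial sentences, expanding each at every step.

    step(prefix) must return a vector of log-probabilities over the vocabulary
    for the next word given the current prefix (and, implicitly, the image).
    """
    beams = [([start_token], 0.0)]                 # (word sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            logprobs = step(seq)
            for w in np.argsort(logprobs)[::-1][:beam_size]:
                candidates.append((seq + [int(w)], logp + float(logprobs[w])))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:   # keep only the beam_size best
            (finished if seq[-1] == end_token else beams).append((seq, logp))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda x: x[1])       # approximate arg max of Eq. (9.12)

# toy next-word model: a fixed random bigram table instead of a real decoder
rng = np.random.default_rng(0)
V = 12
table = np.log(rng.dirichlet(np.ones(V), size=V))  # table[prev] = log P(next | prev)
best_seq, best_logp = beam_search(lambda seq: table[seq[-1]], start_token=0, end_token=1)
print(best_seq, best_logp)
```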
Visual relationship detection is the task of detecting objects in an image and understanding the relationships between them. While detecting the objects is typically based on semantic segmentation or object detection methods, such as R-CNN, understanding the relationships is the key challenge of this task. Detecting visual relations with image information alone is intuitive and effective [25, 62, 84], but leveraging information from language can further boost model performance [37, 41, 82].
Lu et al. [41] propose a model that uses language priors to improve performance on infrequent relationships, for which sufficient training instances are hard to obtain from images alone. The overall architecture is shown in Fig. 9.5.

Fig. 9.5 The architecture of visual relationship detection with language priors

They first train a CNN to calculate the unnormalized probability of a relation obtained from the visual inputs by
$P_V(R_{\langle i,j,k \rangle}, \Theta \mid \langle O_1, O_2 \rangle) = P_i(O_1)\big(z_k^\top \text{CNN}(O_1, O_2) + s_k\big) P_j(O_2)$, (9.17)

where $P_i(O_1)$ denotes the probability that bounding box $O_1$ contains entity $i$, $\text{CNN}(O_1, O_2)$ is the joint feature of box $O_1$ and box $O_2$, and $\Theta = \{z_k, s_k\}$ is the set of parameters.

Besides, the language prior is incorporated in this model by calculating the unnormalized probability that the entity pair $\langle i, j \rangle$ has relation $k$:

$P_f(R, W) = r_k^\top [w_i; w_j] + b_k$, (9.18)

where $w_i$ and $w_j$ are the word embeddings of the subject and object text, respectively, and $r_k$ is the learned relational embedding of relation $k$.

Given the probabilities of a relation from the visual and textual inputs, respectively, the authors combine them into an integrated probability of the relation. The final prediction is the one with the maximal integrated probability:

$R^* = \max_{R} P_V(R_{\langle i,j,k \rangle} \mid \langle O_1, O_2 \rangle)\, P_f(R, W)$. (9.19)

The rank of the ground truth relationship $R$ with bounding boxes $O_1$ and $O_2$ is maximized using the following ranking loss function:

$C(\Theta, W) = \sum_{\langle O_1, O_2 \rangle, R} \max\Big\{ 1 - P_V(R, \Theta \mid \langle O_1, O_2 \rangle) P_f(R, W) + \max_{\langle O_1', O_2' \rangle \neq \langle O_1, O_2 \rangle,\ R' \neq R} P_V(R', \Theta \mid \langle O_1', O_2' \rangle) P_f(R', W),\ 0 \Big\}$. (9.20)

In addition to the loss that optimizes the rank of the ground truth relationships, the authors also propose two regularization functions for the language priors. The final loss function of this model is defined as

$\mathcal{L} = C(\Theta, W) + \lambda_1 L(W) + \lambda_2 K(W)$. (9.21)

$K(W)$ is a variance function that makes the $f(\cdot)$ values of similar relationships closer:

$K(W) = \text{Var}\left\{ \frac{\big[P_f(R, W) - P_f(R', W)\big]^2}{d(R, R')} \right\}, \quad \forall R, R'$, (9.22)

where $d(R, R')$ is the sum of the cosine distances (in Word2vec space) between the two objects and the predicates of the two relationships $R$ and $R'$. $L(W)$ is a function that encourages less frequent relations to have a lower $f(\cdot)$ score. When $R$ occurs more frequently than $R'$, we have

$L(W) = \sum_{R, R'} \max\{P_f(R', W) - P_f(R, W) + 1,\ 0\}$. (9.23)
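A small sketch of how the language prior of Eq. (9.18) can be combined with the visual score as in Eq. (9.19) is given below. The embeddings and visual probabilities are random stand-ins, and the variable names are assumptions for illustration, not the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obj, n_rel, d = 5, 7, 300             # object classes, predicates, embedding size
word_emb = rng.normal(size=(n_obj, d))  # word vectors of object classes
r = rng.normal(size=(n_rel, 2 * d)) * 0.01   # relational embeddings r_k
b = np.zeros(n_rel)

def language_score(i, j):
    """Unnormalized language prior P_f(R, W) of Eq. (9.18) for subject i, object j."""
    pair = np.concatenate([word_emb[i], word_emb[j]])
    return r @ pair + b                 # one score per predicate k

def predict_relation(i, j, visual_score):
    """Eq. (9.19): pick the predicate maximizing the product of both scores."""
    return int(np.argmax(visual_score * language_score(i, j)))

visual = rng.random(n_rel)              # stand-in for P_V from the CNN branch
print(predict_relation(0, 3, visual))
```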
Fig. 9.6 The architecture of the VTransE model

Fig. 9.7 An illustration of scene graph generation: (a) a scene; (b) the corresponding scene graph
Inspired by recent progress in knowledge representation learning, [82] proposes VTransE, a visual translation embedding network. Objects and the relationships between objects are modeled as TransE-like [7] vector translations. VTransE first projects the subject and object into the same space as the relation translation vector $r \in \mathbb{R}^r$. The subject and object can be denoted as $x_s, x_o \in \mathbb{R}^M$ in the feature space, where $M \gg r$. Similar to the TransE relationship, VTransE establishes a relationship as

$W_s x_s + r \approx W_o x_o$, (9.24)

where $W_s$ and $W_o$ are projection matrices. The overall architecture is shown in Fig. 9.6.

Li et al. [37] further formulate visual relation detection as a scene graph generation task, where nodes correspond to objects and directed edges correspond to visual relations between objects, as shown in Fig. 9.7. This formulation allows [37] to leverage different levels of context information, such as information from objects, phrases (i.e., ⟨subject, predicate, object⟩ triples),
and region captions, to boost the performance of visual relation detection. Specifically, [37] proposes to construct a graph that aligns these three levels of information and to perform feature refinement via message passing, as shown in Fig. 9.8. By leveraging complementary information from different levels, the performance on the different tasks is expected to be mutually improved.

Fig. 9.8 Dynamic graph construction: (a) the input image; (b) object (bottom), phrase (middle), and caption region (top) proposals; (c) the graph modeling connections between proposals (some phrase boxes are omitted)
Dynamic Graph Construction.
Given an image, they first generate three kinds of proposals that correspond to the three kinds of nodes in the proposed graph structure. The proposals include object proposals, phrase proposals, and region proposals. The object and region proposals are generated using a Region Proposal Network (RPN) [57] trained with ground truth bounding boxes. Given $N$ object proposals, phrase proposals are constructed from the $N(N-1)$ object pairs that fully connect the object proposals with directed edges, where each directed edge represents a potential phrase between an object pair. Each phrase proposal is connected to the corresponding subject and object with two directed edges. A phrase proposal and a region proposal are connected if their overlap exceeds a certain fraction (e.g., 0.7) of the phrase proposal. There are no direct connections between objects and regions, since they can be indirectly connected via phrases.
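A rough sketch of this dynamic graph construction step is shown below, under the assumption that object and region proposals are given as boxes. The 0.7 overlap threshold follows the text, while the box utilities are simplified stand-ins for the actual proposal pipeline.

```python
from itertools import permutations

def union_box(a, b):
    """Smallest box covering both object boxes; boxes are (x1, y1, x2, y2)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def overlap_fraction(phrase, region):
    """Fraction of the phrase box covered by the region box."""
    x1, y1 = max(phrase[0], region[0]), max(phrase[1], region[1])
    x2, y2 = min(phrase[2], region[2]), min(phrase[3], region[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (phrase[2] - phrase[0]) * (phrase[3] - phrase[1])
    return inter / area if area > 0 else 0.0

def build_graph(objects, regions, threshold=0.7):
    """N(N-1) phrase proposals from ordered object pairs, plus phrase-region edges."""
    phrases = {(i, j): union_box(objects[i], objects[j])
               for i, j in permutations(range(len(objects)), 2)}
    edges = [((i, j), k) for (i, j), p in phrases.items()
             for k, reg in enumerate(regions)
             if overlap_fraction(p, reg) > threshold]
    return phrases, edges

objects = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
regions = [(0, 0, 25, 25), (25, 25, 50, 50)]
phrases, edges = build_graph(objects, regions)
print(len(phrases), edges[:3])
```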
Feature Refinement.
After obtaining the graph structure over the different levels of nodes, they perform feature refinement by iterative message passing. The message passing procedure is divided into three parallel stages: object refinement, phrase refinement, and region refinement.

In object feature refinement, the object proposal feature is updated with gated features from adjacent phrases. Given an object $i$, the aggregated feature from phrases linked to object $i$ via subject-predicate edges, $\hat{x}_i^{p \to s}$, can be defined as follows:

$\hat{x}_i^{p \to s} = \frac{1}{\|E_{i,p}\|} \sum_{(i,j) \in E_{s,p}} f_{\langle o,p \rangle}\big(x_i^{(o)}, x_j^{(p)}\big)\, x_j^{(p)}$, (9.25)

where $E_{s,p}$ is the set of subject-predicate connections, and $\|E_{i,p}\|$ denotes the number of phrases connected with object $i$ as subject-predicate pairs. $f_{\langle o,p \rangle}$ is a learnable gate function that controls the weights of information from different sources:

$f_{\langle o,p \rangle}\big(x_i^{(o)}, x_j^{(p)}\big) = \sum_{k=1}^{K} \text{Sigmoid}\big(\omega_{\langle o,p \rangle}^{(k)} \cdot [x_i^{(o)}; x_j^{(p)}]\big)$, (9.26)

where $\omega_{\langle o,p \rangle}^{(k)}$ is a gate template used to calculate the importance of the information from a subject-predicate edge, and $K$ is the number of templates. The aggregated feature from object-predicate edges, $\hat{x}_i^{p \to o}$, can be computed similarly.

After obtaining the information $\hat{x}_i^{p \to s}$ and $\hat{x}_i^{p \to o}$ from adjacent phrases, the object refinement at time step $t$ can be defined as follows:

$x_{i,t+1}^{(o)} = x_{i,t}^{(o)} + f^{(p \to s)}(\hat{x}_i^{p \to s}) + f^{(p \to o)}(\hat{x}_i^{p \to o})$, (9.27)

where $f(\cdot) = W\, \text{ReLU}(\cdot)$, and $W$ is a learnable parameter that is not shared between $f^{(p \to s)}(\cdot)$ and $f^{(p \to o)}(\cdot)$.

The refinement scheme for phrases and regions is similar to that for objects. The only difference lies in the information sources: phrase proposals receive information from adjacent objects and regions, and region proposals receive information from phrases. After feature refinement via iterative message passing, the features of the different levels of nodes can be used for the corresponding tasks. Region features can be used as the initial state of a language model to generate region captions. Phrase features can be used to predict the visual relations between objects, which compose the scene graph of the image.

In comparison with scene graph generation methods that model the dependencies between relation instances by attention mechanisms or message passing, [47] decomposes the scene graph task into a mixture of two phases: extracting primary relations from the input, and completing the scene graph with reasoning. The authors propose a Hybrid Scene Graph generator (HRE) that combines these two phases in a unified framework and generates scene graphs from scratch.

Specifically, HRE first encodes object pairs into representations and then employs a neural relation extractor that resolves primary relations from the inputs and a differentiable inductive logic programming model that iteratively completes the scene graph. As shown in Fig. 9.9, HRE contains two units, a pair selector and a relation predictor, and runs in an iterative way.

At each time step, the pair selector looks at all object pairs $P^-$ that have not yet been associated with a relation and chooses the next pair of entities whose relation is to be determined. The relation predictor utilizes the information contained in all pairs $P^+$ whose relations have been determined, together with the contextual information of the pair, to make the prediction on the relation.
The prediction result is then added to $P^+$ and benefits future predictions.

To encode an object pair into a representation, HRE extends the union box encoder proposed by [41] by adding the object features (what the objects are) and their locations (where the objects are) into the object pair representation, as shown in Fig. 9.10.

Fig. 9.9 The framework of HRE, which detects primary relations from the inputs and iteratively completes the scene graph via inductive logic programming

Fig. 9.10 The object pair encoder of HRE
Relation Predictor.
The relation predictor is composed of two modules: a neural module predicting the relations between entities based on the given context (i.e., a visual image), and a differentiable inductive logic module performing reasoning on $P^+$. Both modules predict the relation score between a pair of objects individually. The relation scores from the two modules are finally integrated by multiplication.

Pair Selector.
The selector works as the predictor's collaborator, with the goal of figuring out the next relation that should be determined. Ideally, the choice $p^*$ made by the selector should satisfy the condition that all relations that will affect the predictor's prediction on $p^*$ have been sent to the predictor ahead of $p^*$. HRE implements the pair selector as a greedy selector which always chooses, from $P^-$, the entity pair for which the relation predictor is most confident in its prediction, and adds it to $P^+$.

It is worth noting that the task of scene graph generation resembles document-level relation extraction in many aspects. Both tasks seek to extract structured graphs consisting of entities and relations. Also, both need to model the complex dependencies between entities and relations in a rich context. We believe both tasks are worth exploring in future research.
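The iterative selection-and-prediction loop of HRE described above can be sketched as follows. The scoring function is a random stand-in for the combined neural and logic modules, and the data structures are illustrative assumptions rather than the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_relation(pair, determined):
    """Stand-in for the relation predictor: a distribution over candidate
    relations for the pair, conditioned on the already determined set P+."""
    logits = rng.normal(size=5)                 # 5 candidate relations (assumed)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def hre_loop(pairs):
    """Greedy selector: repeatedly commit the pair the predictor is most sure about."""
    remaining = list(pairs)                     # P-: pairs without a relation yet
    determined = {}                             # P+: pairs with a predicted relation
    while remaining:
        scored = [(pair, predict_relation(pair, determined)) for pair in remaining]
        pair, probs = max(scored, key=lambda x: x[1].max())
        determined[pair] = int(np.argmax(probs))
        remaining.remove(pair)
    return determined

pairs = [(0, 1), (0, 2), (1, 2)]
print(hre_loop(pairs))
```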
Visual Question Answering (VQA) aims to answer natural language questions about an image, and can be seen as a single turn of dialogue about a picture. In this section, we will introduce widely used VQA datasets and several typical VQA models.
VQA was first proposed in [46]. The authors first propose a single-world approach that models the probability of an answer $a$ given a question $q$ and a world $w$ by

$P(a \mid q, w) = \sum_{z} P(a \mid z, w)\, P(z \mid q)$, (9.28)

where $z$ is a latent variable associated with the question and the world $w$ is a representation of the image. They further extend the single-world approach to a multi-world approach by marginalizing over different segments $s$ of the given image. The probability of an answer $a$ given the question $q$ and the segments $s$ is given by

$P(a \mid q, s) = \sum_{w} \sum_{z} P(a \mid w, z)\, P(w \mid s)\, P(z \mid q)$. (9.29)

They also release the first VQA dataset, named DAQUAR, in their paper.

Besides DAQUAR, researchers have released many VQA datasets with various characteristics. The most widely used dataset was released in [4], where the authors provide cases and experimental evidence to demonstrate that, to answer these questions, a human or an algorithm should use features of the image and external knowledge. Figure 9.11 shows examples from the VQA dataset released in [4]. It is also demonstrated that this problem cannot be solved by converting images to captions and answering questions according to the captions. Experimental results show that the performance of vanilla methods is still far from human performance. In fact, there are also other existing datasets for visual QA, such as Visual7W [85], Visual Madlibs [80], COCO-QA [56], and FM-IQA [19].
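The single-world scoring rule of Eq. (9.28) amounts to a marginalization over the latent variable $z$, which can be sketched with plain matrix operations; the probability tables below are random placeholders rather than learned distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_answers, n_latent = 6, 4

# P(z | q): distribution over latent variables given the question
p_z_given_q = rng.dirichlet(np.ones(n_latent))
# P(a | z, w): answer distribution for each latent variable in world w
p_a_given_zw = rng.dirichlet(np.ones(n_answers), size=n_latent)

# Eq. (9.28): P(a | q, w) = sum_z P(a | z, w) P(z | q)
p_a = p_z_given_q @ p_a_given_zw
print(p_a, p_a.sum())          # a proper distribution over answers
```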
Fig. 9.11 Examples from the VQA dataset. Question: Why are the men jumping? Answer: to catch frisbee. Question: Is the water still? Answer: no. Question: What is the kid doing? Answer: skateboarding. Question: What is hanging on the wall above the headboard? Answer: pictures
Besides, [4, 46] further investigate approaches to solve specific types of questions in VQA. Moreover, [83] proposes an approach to solve yes/no questions. Note that the model is an ensemble of two similar models, a Q-model and a Tuple-model, whose difference is described below. The overall approach can be divided into two steps: (1) language parsing and (2) visual verification. In the former step, they extract ⟨P, R, S⟩ tuples from questions, first by parsing the question and assigning an entity to each word. Then they summarize the parsed sentences by removing stop words, auxiliary verbs, and all words before a nominal subject or passive nominal subject, and further split the summary into PRS arguments according to the part of speech of the phrases. The difference between the Q-model and the Tuple-model is that the Q-model is the one used in their previous work [4], embedding the question into a dense 256-dimensional vector with an LSTM, while the Tuple-model converts the ⟨P, R, S⟩ tuples into 256-dimensional embeddings with an MLP. As for the visual verification step, they use the same image features as in [39], which are encoded into a dense 256-dimensional vector by an inner-product layer followed by a tanh layer. These two vectors are passed through an MLP to produce the final output ("Yes" or "No").

Moreover, [61] proposes a method to calculate the attention $\alpha_j$ from the set of image features $I = (I_1, I_2, \ldots, I_K)$ and the question embedding $q$ by

$\alpha_j = (W_1 I_j + b_1)^\top (W_2 q + b_2)$, (9.30)

where $W_1$, $W_2$, $b_1$, and $b_2$ are trainable parameters.

Attention-based techniques are quite efficient for filtering out noise that is irrelevant to the question. However, some questions are related only to small regions, which encourages researchers to use stacked attention to further filter out noise. We refer readers to Fig. 1b in [79] for an example of stacked attention.

Yang et al. [79] further extend the attention-based model used in [61], which employs LSTMs to predict the answer. They take the question as input and attend to different regions in the image to obtain additional input. The key idea is to gradually filter out noise and pinpoint the regions that are highly relevant to the answer by reasoning through multiple stacked attention layers progressively. The stacked attention is calculated by stacking:

$h_A^k = \tanh\big(W_I^k I \oplus (W_u^k u^{k-1} + b_A^k)\big)$. (9.31)

Note that we denote the addition of a matrix and a vector by $\oplus$; it is performed by adding the vector to each column of the matrix. $u^k$ is a refined query vector that combines information from the question and the image regions, and $u^0$ (i.e., the query before the first attention layer, with $k = 0$) is initialized as the feature vector of the question. $h_A^k$ is then used to compute $p_I^k$, which corresponds to the attention probability of each image region:

$p_I^k = \text{Softmax}(W_P^k h_A^k + b_P^k)$. (9.32)

$u^k$ is then iterated by

$\tilde{I}^k = \sum_i p_i^k I_i$, (9.33)

$u^k = u^{k-1} + \tilde{I}^k$. (9.34)

That is, in every layer, the model progressively uses the combined question and image vector $u^{k-1}$ as the query vector for attending to the image regions and obtaining the new query.
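A minimal NumPy sketch of the stacked attention updates in Eqs. (9.31)-(9.34) is given below; the dimensions and the random parameters are assumptions for illustration, not the original model configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(I, q, params):
    """Stacked attention of Eqs. (9.31)-(9.34).

    I : (n_regions, d) image region features
    q : (d,)           question feature, used as the initial query u^0
    params : list of per-layer tuples (W_I, W_u, b_A, w_P, b_P)
    """
    u = q
    for W_I, W_u, b_A, w_P, b_P in params:
        h = np.tanh(I @ W_I.T + (W_u @ u + b_A))   # Eq. (9.31), broadcast over regions
        p = softmax(h @ w_P + b_P)                 # Eq. (9.32): attention over regions
        I_tilde = p @ I                            # Eq. (9.33): weighted region feature
        u = u + I_tilde                            # Eq. (9.34): refined query
    return u, p

rng = np.random.default_rng(0)
n_regions, d, d_hid, n_layers = 49, 64, 32, 2
I = rng.normal(size=(n_regions, d))
q = rng.normal(size=d)
params = [(rng.normal(size=(d_hid, d)) * 0.1, rng.normal(size=(d_hid, d)) * 0.1,
           np.zeros(d_hid), rng.normal(size=d_hid) * 0.1, 0.0)
          for _ in range(n_layers)]
u, attn = stacked_attention(I, q, params)
print(u.shape, attn.shape)
```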
The above models attend only to the image, but the question should also be attended to. [44] calculates co-attention by

$Z = \tanh(Q^\top W I)$, (9.35)

where $Z_{ij}$ represents the affinity of the $i$th word and the $j$th region. Figure 9.12 shows the hierarchical co-attention model.

Fig. 9.12 The architecture of the hierarchical co-attention model

Another intuitive approach is to use external knowledge from knowledge bases, which helps to better explain the implicit information behind the image. Such an approach was proposed in [75], which first encodes the image into captions and into vectors representing different attributes of the image, which are used to retrieve documents about different parts of the image from knowledge bases. The documents are encoded with doc2vec [36]. The representations of the captions, attributes, and documents are transformed and concatenated to form the initial vector of an LSTM, which is trained in a Seq2seq fashion. Details of the model are shown in Fig. 9.13.

Fig. 9.13 The architecture of VQA incorporating external knowledge bases

Neural Module Network is a framework for constructing deep networks with a dynamic computational structure, which was first proposed in [3]. In such a framework, every input is associated with a layout that provides a template for assembling an instance-specific network from a collection of shallow network fragments called modules.
The proposed method processes the input question in two separate ways: (1) parsing it and laying out several modules, and (2) encoding it with an LSTM. The corresponding picture is processed by the modules laid out according to the question, whose types are predefined: find, transform, combine, describe, and measure. The authors define find to be a transformation from an Image to an Attention map, transform to be a mapping from one Attention to another, combine to be a combination of two Attentions, describe to be a description relying on an Image and an Attention, and measure to be a measurement relying only on an Attention. The model is shown in Fig. 9.14.

Fig. 9.14 The architecture of the neural module network model

A key drawback of [3] is that it relies on a parser to generate the modules. [22] proposes an end-to-end model that generates a sequence of Reverse-Polish expressions to describe the module network, as shown in Fig. 9.15. The overall architecture is shown in Fig. 9.16.

Graph Neural Networks (GNNs) have also been applied to VQA tasks. [68] tries to build graphs over both the scene and the question, and describes a deep neural network that takes advantage of such a structured representation. As shown in Fig. 9.17, the GNN-based VQA model can capture the relationships between words and objects.

Fig. 9.15 The architecture of the Reverse-Polish expression and the corresponding module network model

Fig. 9.16 The architecture of the end-to-end module network model

Fig. 9.17 The architecture of the GNN-based VQA model
In this chapter, we first introduce the concept of cross-modal representation learning. Cross-modal learning is essential since many real-world tasks require the ability to understand information from different modalities, such as text and image. Cross-modal representation learning aims to exploit the links between different modalities and enable better utilization of the information from all of them.
We then overview existing cross-modal representation learning methods for several representative cross-modal tasks, including zero-shot recognition, cross-media retrieval, image captioning, and visual question answering. These cross-modal learning methods either try to fuse information from different modalities into unified embeddings, or try to build embeddings for different modalities in a common semantic space, allowing the model to compute cross-modal similarities. Cross-modal representation learning is drawing more and more attention and can serve as a promising connection between different research areas.

For further understanding of cross-modal representation learning, we recommend the following surveys and books:

• Skocaj et al., Cross-modal learning [64].
• Spence, Crossmodal correspondences: A tutorial review [66].
• Wang et al., A comprehensive survey on cross-modal retrieval [72].

In the future, for better cross-modal representation learning, some directions require further efforts:
(1) Fine-grained Cross-modal Grounding. Cross-modal grounding is a fundamental ability for solving cross-modal tasks, which aims to align semantic units in different modalities. For example, visual grounding aims to ground textual symbols (e.g., words or phrases) into visual objects or regions. Many existing works [27, 74, 76] have been devoted to cross-modal grounding, mainly focusing on coarse-grained semantic unit grounding (e.g., grounding of sentences and images). Better fine-grained cross-modal grounding (e.g., grounding of words and objects) could promote the development of a broad variety of cross-modal tasks.
(2) Cross-modal Reasoning. In addition to recognizing and grounding semantic units in different modalities, understanding and inferring the relationships between semantic units are also crucial to cross-modal tasks. Many existing works [37, 41, 82] have investigated detecting visual relations between objects. However, most visual relations in existing visual relation detection datasets do not require complex reasoning. Some works [81] have made preliminary attempts at cross-modal commonsense reasoning. Inferring the latent semantic relationships in a cross-modal context is critical for cross-modal understanding and modeling.
(3) Utilizing Unsupervised Cross-modal Data. Most current cross-modal learning approaches rely on human-annotated datasets. The scale of such supervised datasets is usually limited, which also limits the capability of data-hungry neural models. With the rapid development of the World Wide Web, cross-modal data on the Web have become larger and larger. Some existing works [42, 67] have leveraged unsupervised cross-modal data for representation learning. They first pretrain cross-modal models on large-scale image-caption pairs and then fine-tune the models on downstream tasks, which shows significant improvement in a broad variety of cross-modal tasks. It is thus promising to better leverage the vast amount of unsupervised cross-modal data for representation learning.
References
1. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In Proceedings of ECCV, 2016.
2. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR, 2018.
3. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of CVPR, pages 39–48, 2016.
4. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of ICCV, 2015.
5. Lei Jimmy Ba, Kevin Swersky, Sanja Fidler, and Ruslan Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of ICCV, 2015.
6. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
7. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of NeurIPS, 2013.
8. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, 2014.
9. Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of ICCV, 2017.
10. Bo Dai and Dahua Lin. Contrastive learning for image captioning. In Proceedings of NeurIPS, 2017.
11. Bo Dai, Deming Ye, and Dahua Lin. Rethinking the form of latent states in image captioning. In Proceedings of ECCV, 2018.
12. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, 2009.
13. Michael J. Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of ACL, 2014.
14. Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of ICCV, 2013.
15. Desmond Elliott and Arjen de Vries. Describing images using inferred visual dependency representations. In Proceedings of ACL, 2015.
16. Desmond Elliott and Frank Keller. Image description using visual dependency representations. In Proceedings of EMNLP, 2013.
17. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of ECCV, 2010.
18. Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Proceedings of NeurIPS, 2013.
19. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In Proceedings of NeurIPS, 2015.
20. Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-captioning: Coarse-to-fine learning for image captioning. In Proceedings of AAAI, 2018.
21. Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Machine Learning Research, 47:853–899, 2013.
22. Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of ICCV, 2017.
23. Xin Huang and Yuxin Peng. Deep cross-media knowledge transfer. In Proceedings of CVPR, 2018.
24. Xin Huang, Yuxin Peng, and Mingkuan Yuan. Cross-modal common representation learning by hybrid transfer network. In Proceedings of IJCAI, 2017.
25. Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3D-based reasoning with blocks, support, and stability. In Proceedings of ICCV, 2013.
26. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of CVPR, 2016.
27. Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, 2015.
28. Andrej Karpathy, Armand Joulin, and Fei Fei F Li. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of NeurIPS, 2014.
29. Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, 2014.
30. Satwik Kottur, Ramakrishna Vedantam, José MF Moura, and Devi Parikh. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proceedings of CVPR, 2016.
31. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
32. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NeurIPS, 2012.
33. Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of CVPR, 2011.
34. Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In Proceedings of ACL, 2012.
35. Polina Kuznetsova, Vicente Ordonez, Tamara L Berg, and Yejin Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014.
36. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014.
37. Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation from objects, phrases and region captions. In Proceedings of ICCV, 2017.
38. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
39. Xiao Lin and Devi Parikh. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Proceedings of CVPR, 2015.
40. Chenxi Liu, Junhua Mao, Fei Sha, and Alan L. Yuille. Attention correctness in neural image captioning. In Proceedings of AAAI, 2017.
41. Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In Proceedings of ECCV, 2016.
42. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of NeurIPS, 2019.
43. Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of CVPR, 2017.
44. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Proceedings of NeurIPS, 2016.
45. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of CVPR, 2018.
46. Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of NeurIPS, 2014.
47. Jiayuan Mao, Yuan Yao, Stefan Heinrich, Tobias Hinz, Cornelius Weber, Stefan Wermter, Zhiyuan Liu, and Maosong Sun. Bootstrapping knowledge graphs from images and text. Frontiers in Neurorobotics, 13:93, 2019.
48. Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 1976.
49. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013.
50. Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In Proceedings of WMISR, 1999.
51. Jonghwan Mun, Minsu Cho, and Bohyung Han. Text-guided attention model for image captioning. In Proceedings of AAAI, 2017.
52. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of ICML, 2011.
53. Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proceedings of ICLR, 2014.
54. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, 2002.
55. Yuxin Peng, Xin Huang, and Yunzhen Zhao. An overview of cross-media retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
56. Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Proceedings of NeurIPS, 2015.
57. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of NeurIPS, 2015.
58. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of NeurIPS, 2015.
59. Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of CVPR, 2017.
60. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
61. Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of CVPR, 2016.
62. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of ECCV, 2012.
63. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
64. Danijel Skocaj, Ales Leonardis, and Geert-Jan M. Kruijff. Cross-Modal Learning, pages 861–864. Boston, MA, 2012.
65. Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proceedings of NeurIPS, 2013.
66. Charles Spence. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011.
67. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of ICCV, 2019.
68. Damien Teney, Lingqiao Liu, and Anton Van Den Hengel. Graph-structured representations for visual question answering. In Proceedings of CVPR, 2017.
69. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of CVPR, 2015.
70. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of CVPR, 2015.
71. Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. Image captioning with deep bidirectional LSTMs. In Proceedings of ACMMM, 2016.
72. Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv:1607.06215, 2016.
73. Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, and Garrison W. Cottrell. Skeleton key: Image captioning by skeleton-attribute decomposition. In Proceedings of CVPR, 2017.
74. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of CVPR, 2019.
75. Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of CVPR, 2016.
76. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, 2015.
77. Ran Xu, Jiasen Lu, Caiming Xiong, Zhi Yang, and Jason J Corso. Improving word representations via global visual context. In Proceedings of NeurIPS Workshop, 2014.
78. Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of EMNLP, 2011.
79. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of CVPR, 2016.
80. Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L Berg. Visual Madlibs: Fill in the blank description generation and question answering. In Proceedings of ICCV, 2015.
81. Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of CVPR, 2019.
82. Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In Proceedings of CVPR, 2017.
83. Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of CVPR, 2016.
84. Bo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning stability and safety. International Journal of Computer Vision, 112(2):221–238, 2015.
85. Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded question answering in images. In Proceedings of CVPR, 2016.
Chapter 10
Resources
Abstract
Deep learning has been shown to be a powerful method for a variety of artificial intelligence tasks, including many critical tasks in NLP. However, training a deep neural network is usually a time-intensive process and requires a large amount of code to build the related models. To alleviate these issues, a number of deep learning frameworks have been developed and released. These frameworks provide the necessary arithmetic operators for constructing neural networks and exploit hardware features such as multi-core CPUs and many-core GPUs to shorten training time. Each framework has its own advantages and disadvantages. In this chapter, we present the features and running performance of these frameworks so that users can select an appropriate framework for their own usage.
In this section, we introduce several typical open-source frameworks for deep learning, including Caffe, Theano, TensorFlow, Torch, PyTorch, Keras, and MXNet. With the rapid development of the deep learning community, these open-source frameworks are updated every day, so the information in this section may not be up to date. This section mainly focuses on the special features of these frameworks to give readers a preliminary understanding of them; for the latest features of these deep learning frameworks, please refer to their official sites.
Caffe (http://caffe.berkeleyvision.org/) is a well-known framework widely used for computer vision tasks. It was created by Yangqing Jia and developed by Berkeley AI Research (BAIR). Caffe uses a layer-wise approach that makes building models easy, and its simple interfaces also make it convenient to fine-tune existing neural networks without writing much code. The underlying design of Caffe targets the fast construction of convolutional neural networks, which makes it efficient and effective.
On the other hand, since ordinary images have a fixed size, the interfaces of Caffe are fixed and hard to extend. It is thus difficult to use Caffe for tasks with a variable input length, such as text, sound, or other time-series data, and recurrent neural networks are not well supported. Although users can easily build an existing network architecture with the layer-wise framework, it is not flexible when dealing with large and complex networks. To design a new layer, users need to implement its underlying code in C/C++ and CUDA.
Theano is the typical framework developed to use symbolic tensor graphs for model specification. Any neural network or other machine learning model can be represented as a symbolic tensor graph, and the forward, backward, and gradient update computations can be derived from the flow between tensors. Hence, Theano provides more flexibility than Caffe's layer-wise approach to building models. In Caffe, defining a new layer that is not already in the existing repository of layers is complicated, since its forward, backward, and gradient update functions must be implemented first. In Theano, you only need to compose basic operators to define a customized layer following the order of operations.
Theano is a platform that is easy to configure compared with other frameworks, and some high-level frameworks, such as Keras, are built on top of Theano, which makes it even easier to use. Theano supports cross-platform configuration well, which means it works not only on Linux but also on Windows. Because of this, many researchers and engineers use Theano to build their models and release these projects, and the rich open resources based on Theano attract even more users.
Though Theano uses Python syntax to define symbolic tensor graphs, its graph processor compiles the graphs into high-performance C++ or CUDA code. Owing to this, Theano can run very fast while keeping the programming simple. One deficiency is that the compilation process is slow and takes time: if a neural network does not need to be trained for several days, Theano may not be a good choice, because compiling too often is tedious. In comparison, later frameworks such as TensorFlow use precompiled packages for symbolic tensor operations, which is a little more relaxing.
Theano has some other serious disadvantages. It does not support many-core GPUs very well, which makes it hard to train big neural networks. Besides the compilation process, importing Theano is also slow, and when you run your code you may be stuck for a long time with a preconfigured device. Contributing improvements to Theano itself is similarly painful. In fact, Theano is no longer maintained, but it is still worth introducing as a landmark in the history of deep learning frameworks that inspired many subsequent ones.
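To make the symbolic-graph workflow concrete, the following minimal sketch (assuming a legacy Theano installation; the variable names are illustrative) defines a small graph from basic operators, takes a symbolic gradient, and compiles it into a callable function. The compilation step discussed above happens inside `theano.function`.

```python
# A minimal Theano sketch: define a symbolic graph, derive a gradient,
# and compile it to fast native code (requires the legacy `theano` package).
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')                              # symbolic input vector
W = theano.shared(np.ones((2, 2)), name='W')    # shared (trainable) parameter
y = T.nnet.sigmoid(T.dot(W, x)).sum()           # a tiny "layer" built from basic operators

grad_W = T.grad(y, W)                           # symbolic gradient w.r.t. the parameter

# Compilation: Theano translates the graph into C/CUDA code here.
f = theano.function(inputs=[x], outputs=[y, grad_W])

out, g = f([0.5, -1.0])
print(out, g)
```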
TensorFlow is mainly developed and used by Google, building on the experience of Theano and DistBelief [1]. TensorFlow and Theano are in fact quite similar to some extent: both allow building a symbolic graph of the neural network architecture via the Python interface. Different from Theano, TensorFlow also allows implementing new operations or machine learning algorithms in C/C++ and Java. With symbolic graphs, automatic differentiation can easily be used to train complicated models. Hence, TensorFlow is more than a deep learning framework; its flexibility enables it to solve various complex computing problems such as reinforcement learning.
In TensorFlow, both code development and deployment are fast and convenient. Trained models can be deployed quickly on a variety of devices, including servers and mobile devices, without implementing separate model setting code or loading a Python/LuaJIT interpreter. Caffe also allows easy deployment of models, but Caffe has trouble running on devices without a GPU, which is the prevalent situation for smartphones. TensorFlow supports model decoding using ARM/NEON instructions and does not need many operations to choose training devices.
TensorBoard, the visualization platform of TensorFlow, presents model architectures in a way that is both appealing and useful. By visualizing the symbolic graph, it becomes easier to find bugs in the source code, whereas debugging models in other deep learning frameworks is relatively bothersome. TensorBoard can also log and generate real-time visualizations of variables during training, which is a pleasant way to monitor the training process.
Though customizing operations in TensorFlow is convenient, it usually changes many function interfaces in every new release, which makes it challenging for developers to keep their code compatible with different TensorFlow versions. Mastering TensorFlow is also not easy. As TensorFlow 2.0 has been released recently, TensorFlow may gradually handle these issues in the predictable future.
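As a contrast to the define-by-run approach discussed for PyTorch below, here is a minimal sketch of the define-and-run style in TensorFlow 1.x: the whole graph is specified first and only then executed inside a session. This is an illustration only (the shapes and names are arbitrary), not code from the book.

```python
# Define-and-run sketch in TensorFlow 1.x style: the graph is fully built
# first, then executed inside a session (requires tensorflow 1.x).
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 4], name='x')   # symbolic input
W = tf.Variable(tf.random_normal([4, 2]), name='W')          # trainable weight
b = tf.Variable(tf.zeros([2]), name='b')
logits = tf.matmul(x, W) + b                                  # a graph node, not a value yet

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Only now are concrete values fed into the fixed graph.
    out = sess.run(logits, feed_dict={x: np.random.randn(3, 4)})
    print(out.shape)  # (3, 2)
```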
Torch (http://torch.ch/) is a computational framework mainly developed and used by Facebook and Twitter. Torch provides an API written in Lua to support the implementation of machine learning algorithms, especially convolutional neural networks. A temporal convolutional layer implemented in Torch can take a variable input length, which is extremely useful for NLP tasks and is not offered by Theano or TensorFlow. Torch also contains a 3D convolutional layer, which can easily be used in video recognition tasks. Besides its various flexible convolutional layers, Torch is light and speedy. These reasons attract many researchers in universities and companies to customize their own deep learning platforms on top of it.
However, the negative aspects of Torch are also apparent. Though Torch is powerful, it is not designed to be widely accessible to the Python-based academic community, and there are no interfaces other than Lua. Lua is a multi-paradigm scripting language that was developed in Brazil in the early 1990s and is not a mainstream programming language, so it takes some time to learn Lua before using Torch to construct models. Different from convolutional neural networks, there is no official support for recurrent neural networks. There are some open resources implementing recurrent neural networks in Torch, but they have not been integrated into the main repository, and it is difficult to judge the effectiveness of these implementations.
Similar to Caffe, Torch is not a framework based on symbolic tensor graphs; it also uses the layer-wise approach. This means that your models in Torch are a graph of layers rather than a graph of mathematical functions. This mechanism is convenient for building a network whose layers are stable and hierarchical, but if you want to design a new connection layer or change an existing neural model, you need a lot of code to implement the new layers with full forward, backward, and gradient update functions. Frameworks based on symbolic tensor graphs, such as Theano and TensorFlow, give more flexibility here. In fact, these issues have been addressed by PyTorch, which we introduce next.
PyTorch (http://pytorch.org/) is a Python package built over Torch, developed by Facebook and other companies. However, it is not just an interface: PyTorch contains many improvements over Torch. The most important one is that PyTorch can use a symbolic graph to define neural networks and then use automatic differentiation following the graph to compute backward passes. Meanwhile, PyTorch keeps some characteristics of the layer-wise approach in Torch, which makes coding with PyTorch easy. Moreover, PyTorch has minimal framework overhead and custom memory allocators for the GPU, which makes it faster and more memory-efficient than Torch.
Compared with other deep learning frameworks, PyTorch has two main advantages. First, most frameworks like TensorFlow are based on static computational graphs (define-and-run), while PyTorch uses dynamic computational graphs (define-by-run). With dynamic computational graphs, you can change the network architecture based on the data flowing through the network. There is a way to do something similar in TensorFlow, but the static computational graph must contain all possible branches in advance, which limits performance. Second, PyTorch is built to be deeply integrated into Python and connects seamlessly with other popular Python packages, such as NumPy, SciPy, and Cython, so it is easy to extend your model when needed.
After Facebook released it, PyTorch drew considerable attention from the deep learning community, and many former Torch users switched to this new package. By now, PyTorch has a thriving community that contributes to its increasing popularity among researchers. It is no exaggeration to say that PyTorch is one of the most popular frameworks at present.
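The define-by-run behavior is easiest to see in code. The sketch below (a minimal illustration, not taken from the book) builds the computation graph on the fly, so ordinary Python control flow can change the architecture depending on the data, and autograd still provides gradients for whatever graph was actually executed.

```python
# Define-by-run sketch in PyTorch: the graph is created while the code runs,
# so data-dependent control flow is just ordinary Python.
import torch

x = torch.randn(5, requires_grad=True)
W = torch.randn(5, 5, requires_grad=True)

h = W @ x
# The network "architecture" can depend on the data flowing through it:
if h.norm() > 1.0:
    h = torch.relu(W @ h)      # apply an extra layer only for large activations

loss = h.sum()
loss.backward()                # autograd traverses whatever graph was actually built
print(x.grad.shape, W.grad.shape)
```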
Keras (https://keras.io/) is a high-level deep learning framework built on top of Theano and TensorFlow. Interestingly, although Keras sits atop Theano and TensorFlow, its interfaces are similar to Torch. Using Keras requires Python code, and there are many detailed documents and examples for a quick start. There is also a very active community of developers who keep Keras updated quickly, so it is a fast-growing framework.
Because Theano and TensorFlow are the backends of Keras, the disadvantages of Keras mostly mirror those of Theano and TensorFlow. With TensorFlow as the backend, it runs even slower than pure TensorFlow code. Because it is a high-level framework, customizing a new neural layer is not easy, though existing layers can be used effortlessly. The package is highly abstracted and hides many training parameters: you cannot touch and change all the details of your own models unless you drop down to Theano, TensorFlow, or PyTorch.
MXNet (http://mxnet.io/) is an effective and efficient open-source machine learning framework, mainly pushed by Amazon. It supports APIs in multiple languages, including C++, Python, R, Scala, Julia, Perl, MATLAB, and JavaScript, some of which can be adopted for Amazon Web Services. Some interfaces of MXNet are also reserved for future mobile devices, just like TensorFlow. MXNet is built on a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly, and a graph optimization layer on top of that makes symbolic execution fast and memory-efficient. The MXNet library is portable and lightweight, and it scales to multiple GPUs and multiple machines. The main problem of MXNet is the lack of detailed and well-organized documentation. Its user group is also smaller than those of other frameworks, especially TensorFlow and PyTorch, so it is more challenging for newcomers to grasp MXNet. MXNet is developing fast, and these problems may be solved in the future.
Word2vec (https://code.google.com/archive/p/word2vec/) is a widely used toolkit for word representation learning, which provides an effective and efficient implementation of the continuous bag-of-words and skip-gram architectures. The word representations learned by Word2vec can be used in many natural language processing tasks; empirically, using pretrained word vectors as model inputs is a good way to enhance model performance.
Word2vec takes a free text corpus as input and constructs the vocabulary from the training data. It then uses simple neural-network-based predictive models to learn a language model, which encodes the co-occurrence information between words into the resulting word representations.
The resulting representations showcase interesting linear substructures of the word vector space. The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. Words frequently appearing together in text have representations that are close in the embedding space. Word2vec also provides a tool to find the closest words for a user-specified word via the learned representations and the distances between them.
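The original Word2vec toolkit is a C command-line program; the sketch below instead uses the gensim re-implementation of the same skip-gram/CBOW training and nearest-neighbor query described above. The toy corpus and parameter values are illustrative, and the `vector_size` argument name follows gensim 4.x.

```python
# Skip-gram training and nearest-neighbor queries with gensim's Word2Vec,
# which re-implements the approach of the original Word2vec toolkit.
from gensim.models import Word2Vec

# A toy corpus: in practice this would be a large tokenized text collection.
sentences = [
    ["natural", "language", "processing", "studies", "text"],
    ["word", "vectors", "encode", "co-occurrence", "information"],
    ["language", "models", "predict", "the", "next", "word"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of word vectors (gensim >= 4.0)
    window=5,          # context window size
    min_count=1,       # keep all words in this toy example
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

# Query the closest words to a target word by cosine similarity.
print(model.wv.most_similar("language", topn=3))
```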
GloVe (https://nlp.stanford.edu/projects/glove/) is a widely used toolkit that supports an unsupervised learning method for word representation learning. Similar to Word2vec, GloVe also trains on a text corpus and captures aggregated global word-word co-occurrence information for word embeddings. However, unlike Word2vec, GloVe uses count-based models instead of predictive models.
The GloVe model first builds a global word-word co-occurrence matrix, which records how frequently words co-occur with one another in a given text. Word representations are then trained on the nonzero entries of this matrix. Constructing the matrix requires a full pass over the corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost; subsequent training iterations are much faster because the number of nonzero matrix entries is typically much smaller than the total number of words in the corpus.
OpenKE [2] (https://github.com/thunlp/OpenKE) is an open-source toolkit for knowledge embedding (KE), which provides a unified framework and various fundamental KE models. OpenKE prioritizes operational efficiency to support quick model validation and large-scale knowledge representation learning. Meanwhile, OpenKE maintains sufficient modularity and extensibility to incorporate new models easily. Besides the toolkit, the embeddings of some existing large-scale knowledge graphs pretrained by OpenKE are also available. The toolkit, documentation, and pretrained embeddings are all released at http://openke.thunlp.org/.
Compared with other implementations, OpenKE has five advantages. First, OpenKE implements nine classical knowledge embedding algorithms, including RESCAL, TransE, TransH, TransR, TransD, ComplEx, DistMult, HolE, and Analogy, which are verified to be effective and stable. Second, OpenKE shows high performance due to memory optimization, multi-threading acceleration, and GPU learning; it supports multiple computing devices and provides interfaces to control CPU/GPU modes. Third, system encapsulation makes it easy to train and test KE models: users only need to set hyperparameters via the platform interfaces to construct KE models. Fourth, it is easy to construct new KE models: each specific model is implemented by inheriting the base class and designing its own scoring function and loss function. Fifth, besides the toolkit, OpenKE also provides pretrained embeddings of some existing large-scale knowledge graphs, which can be directly applied in many applications, including information retrieval, personalized recommendation, and question answering.
Scikit-kge (https://github.com/mnick/scikit-kge) is an open-source Python library for knowledge representation learning. The library provides different building blocks to train and develop models for knowledge graph embeddings. The primary purpose of Scikit-kge is to compute knowledge graph embeddings for the HolE method, but it also provides several other methods: besides HolE, RESCAL, TransE, TransR, and ER-MLP can also be trained in Scikit-kge. The library contains several parameter update methods, not only basic SGD but also AdaGrad, and it implements different negative sampling strategies to select negative samples.
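The "scoring function plus loss function" design mentioned above can be made concrete with TransE, whose score for a triple (h, r, t) reflects how close h + r is to t in the embedding space. The sketch below is a standalone PyTorch illustration of that idea, not code from OpenKE or Scikit-kge.

```python
# A standalone TransE-style scoring sketch (not the OpenKE API): entities and
# relations share one embedding space, and a triple (h, r, t) is scored by the
# (negative) distance between h + r and t.
import torch
import torch.nn as nn

class TransEScorer(nn.Module):
    def __init__(self, num_entities, num_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def forward(self, h, r, t):
        # Lower distance = more plausible triple; return negative distance as the score.
        return -torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=2, dim=-1)

scorer = TransEScorer(num_entities=1000, num_relations=100)
h = torch.tensor([0, 1]); r = torch.tensor([3, 3]); t = torch.tensor([42, 7])
print(scorer(h, r, t))   # one score per triple; train with a margin-based ranking loss
```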
OpenNE (https://github.com/thunlp/OpenNE) is an open-source standard NE/NRL (network representation learning) training and testing framework. It unifies the input and output interfaces of different NE models and provides scalable options for each model. Moreover, typical NE models under this framework are implemented with TensorFlow, which enables them to be trained with GPUs. The implemented or adapted models include DeepWalk, LINE, node2vec, GraRep, TADW, GCN, HOPE, GF, SDNE, and LE. The framework also provides classification and embedding visualization modules for evaluating the results of NRL.
GEM (Graph Embedding Methods, https://github.com/palash1992/GEM) is a Python package that offers a general framework for graph embedding methods. It implements many state-of-the-art embedding techniques, including Locally Linear Embedding, Laplacian Eigenmaps, Graph Factorization, High-Order Proximity preserved Embedding (HOPE), Structural Deep Network Embedding (SDNE), and node2vec. Furthermore, the framework implements several functions to evaluate the quality of the obtained embeddings, including graph reconstruction, link prediction, visualization, and node classification. For faster execution, a C++ backend is integrated using Boost for the supported methods.
GraphVite (https://graphvite.io/) is a general and high-performance graph embedding system for various applications, including node embedding, knowledge graph embedding, and visualization of graph-structured high-dimensional data. GraphVite provides a complete pipeline for users to implement and evaluate graph embedding models. For reproducibility, the system integrates several commonly used models and benchmarks, and users can also develop their own models with its flexible interface. Additionally, for semantic tasks, GraphVite releases a set of pretrained knowledge graph embedding models to enhance language understanding. There are two core advantages of GraphVite over other toolkits: fast and large-scale training. GraphVite accelerates graph embedding with multiple CPUs and GPUs; it takes around one minute to learn node embeddings for graphs with one million nodes. Moreover, GraphVite is designed to be scalable: even with limited memory, it can process node embedding tasks on billion-scale graphs.
CogDL (http://keg.cs.tsinghua.edu.cn/cogdl/index.html) is another graph representation learning toolkit that allows researchers and developers to easily train and evaluate baseline or custom models for node classification, link prediction, and other tasks on graphs. It provides implementations of many popular models, including both non-GNN models and GNN-based ones. CogDL benefits from several unique techniques. First, utilizing sparse matrix operations, CogDL is capable of performing fast network embedding on large-scale networks. Second, CogDL is able to deal with different types of graph structures: attributed, multiplex, and heterogeneous networks. Third, CogDL supports parallel training: with different seeds and different models, it performs training on multiple GPUs and reports the result table automatically. Finally, CogDL is extensible; new datasets, models, and tasks can be added without difficulty.
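Several of the toolkits above implement DeepWalk, which treats truncated random walks over the graph as sentences and feeds them to a skip-gram model. The sketch below illustrates that idea with networkx and gensim; the graph, walk lengths, and parameters are illustrative, and this is not the OpenNE or CogDL implementation.

```python
# A DeepWalk-style sketch: random walks over a graph are treated as sentences
# and embedded with skip-gram (not the OpenNE/CogDL implementation).
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()   # a small example graph shipped with networkx

def random_walk(graph, start, length=10):
    walk = [start]
    while len(walk) < length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]   # Word2Vec expects string tokens

# Generate several walks from every node; together they act as the "corpus".
walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]

model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1)
node_vector = model.wv["0"]          # embedding of node 0
print(node_vector.shape, model.wv.most_similar("0", topn=3))
```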
OpenNRE [3] (https://github.com/thunlp/OpenNRE) is an open-source framework for neural relation extraction, which aims to make it easy to build relation extraction (RE) models. Compared with other implementations, OpenNRE has four advantages. First, OpenNRE implements various state-of-the-art RE models, covering the attention mechanism, adversarial learning, and reinforcement learning. Second, OpenNRE enjoys great system encapsulation: it divides the relation extraction pipeline into four parts, namely embedding, encoder, selector (for distant supervision), and classifier, and implements several methods for each part. This encapsulation makes it easy to train and test models by changing hyperparameters or specifying model architectures via Python arguments. Third, OpenNRE is extensible: users can construct new RE models by choosing specific blocks from the four parts mentioned above and combining them freely, with only a few lines of code. Fourth, the framework implements efficient multi-GPU learning.
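As a quick illustration of the encapsulation described above, the sketch below follows the usage pattern documented in the OpenNRE repository: load a pretrained sentence-level model and run inference on a sentence with two marked entities. The exact model name and argument format may differ across OpenNRE versions, so treat this as an assumption to be checked against the current README.

```python
# Illustrative OpenNRE usage (check the repository README for the exact API in
# your installed version; the model name below is an example).
import opennre

# Load a pretrained sentence-level relation extraction model.
model = opennre.get_model('wiki80_cnn_softmax')

# Predict the relation between two entity mentions, given by character spans.
result = model.infer({
    'text': 'Bill Gates founded Microsoft in 1975.',
    'h': {'pos': (0, 10)},    # head entity span: "Bill Gates"
    't': {'pos': (19, 28)},   # tail entity span: "Microsoft"
})
print(result)   # (relation name, confidence score)
```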
References
1. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Proceedings of NeurIPS, pages 1223–1231, 2012.
2. Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. OpenKE: An open toolkit for knowledge embedding. In Proceedings of EMNLP: System Demonstrations, pages 139–144, 2018.
3. Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of EMNLP: System Demonstrations, pages 169–174, 2019.
Chapter 11
Outlook
Abstract
The aforementioned representation learning models and methods have shown their effectiveness in various NLP scenarios and tasks. With the rapid growth of data scales and the development of computation devices, there are also new challenges and opportunities for the next stage of research on deep learning techniques. In this last chapter, we look into the future directions of representation learning techniques for NLP. To be more specific, we consider the following directions: using more unsupervised data, utilizing fewer labeled data, employing deeper neural architectures, improving model interpretability, and fusing the advances of other areas.
We have used ten chapters to introduce the advances of representation learning for NLP, covering both multi-grained language entries, including words, phrases, sentences, and documents, and closely related objects, including world knowledge, sememe knowledge, networks, and cross-modal data. The models and methods of representation learning for NLP mentioned above have shown their effectiveness in various NLP scenarios and tasks.
As shown by the unsatisfactory performance of most NLP systems in open domains, and by the recent great advances of pre-trained language models, representation learning for NLP is still far from perfect. With the rapid growth of data scales and the development of computation devices, we are facing new challenges and opportunities for the next stage of research on representation learning and deep learning techniques. In this last chapter, we look into future research and exploration directions of representation learning techniques for NLP. Since we have summarized the future work of each individual part in the summary section of each previous chapter, here we focus on discussing the general and important issues that should be addressed by representation learning for NLP.
For general representation learning for NLP, we conclude with the following directions: using more unsupervised data, utilizing fewer labeled data, employing deeper neural architectures, improving model interpretability, and fusing the advances from other areas.
The rapid development of Internet technology and the popularization of information digitization have brought massive text data for NLP research and applications. For example, the whole corpus of Wikipedia (http://en.wikipedia.org/wiki) already contains more than 50 million articles (including 6 million articles in English) and is growing rapidly every day through collaborative work all over the world. The amount of user-generated content on many social platforms such as Twitter, Weibo, and Facebook also increases quickly with billions of users. It is worth considering these massive text data for learning better NLP models. However, due to the expensive cost of expert annotation, it is impossible to label such massive amounts of data for specific NLP tasks.
Hence, an essential direction of NLP is how to take better advantage of unlabeled data for efficient unsupervised representation learning. Though without labeled annotations, unsupervised data can help initialize the randomized neural network parameters and thus improve the performance on downstream NLP tasks. This line of work usually employs a pipeline strategy: first pretrain the model parameters and then fine-tune these parameters on specific downstream NLP tasks. Recurrent language models [7], word embeddings [6], and pre-trained language models (PLMs) such as BERT [3] all utilize unsupervised plain text to pretrain neural parameters and then benefit downstream supervised tasks via fine-tuning.
Current state-of-the-art PLMs can still only learn from limited plain text due to limited learning efficiency and computation power. Moreover, there are various types of large-scale data online with abundant informative signals and labels, such as HTML tags, anchor text, keywords, document meta-information, and other structured and semi-structured data. How to take full advantage of large-scale Web text data has not been extensively studied. In the future, with better computation devices (e.g., GPUs) and data resources, we expect more advanced methods to be developed to utilize more unsupervised data.
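The pretrain-then-fine-tune pipeline described above can be sketched with the Hugging Face transformers library; this is a minimal illustration, not code from the book, and the checkpoint name, task, and toy data are assumptions for the example. Parameters pretrained on unsupervised text are loaded and then updated on a small labeled dataset.

```python
# A minimal pretrain-then-fine-tune sketch with Hugging Face transformers:
# load parameters pretrained on unlabeled text, then fine-tune on labeled pairs.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a delightful movie", "a complete waste of time"]   # toy labeled data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # one fine-tuning step on the toy batch
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```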
As NLP technologies become more powerful, people can explore more complicated and fine-grained problems. Taking text classification as an example, early work targeted flat classification with limited categories, and now researchers are more interested in classification with hierarchical structure and a large number of classes. However, when a problem gets more complicated, it requires more knowledge from experts to annotate training instances for fine-grained tasks, which increases the cost of data labeling.
Therefore, we expect models or systems that can be developed efficiently with (very) few labeled data. When each class has only one or a few labeled instances, the problem becomes a one/few-shot learning problem. The few-shot learning problem is derived from computer vision and has also been studied in NLP recently. For example, researchers have explored few-shot relation extraction [5], where each relation has a few labeled instances, and low-resource machine translation [11], where the size of the parallel corpus is limited.
A promising approach to few-shot learning is to compare the semantic similarity between the test instance and the labeled ones (i.e., the support set) and then make the prediction; a minimal sketch of this idea follows this section. The idea is similar to k-nearest neighbor classification (kNN) [10]. Since the key is to represent the semantic meaning of each instance for measuring semantic similarity, it has been verified that language models pretrained on unsupervised data and fine-tuned on the target few-shot domain are very effective for few-shot learning.
Another approach to few-shot learning is to transfer models from related domains into the target domain with the few-shot problem [2]. This is usually named transfer learning or domain adaptation. For these methods, representation learning can also help the transfer or adaptation process by learning joint representations of both domains.
In the future, one may go beyond the abovementioned frameworks and design more appropriate methods according to the characteristics of NLP tasks and problems. The goal is to develop effective NLP methods with as little annotated data in the target domain as possible, by better utilizing unsupervised data that are much cheaper to obtain from the Web and existing supervised data from other domains. The exploration of the few-shot learning problem in NLP will help us develop data-efficient methods for language learning.
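The nearest-neighbor view of few-shot classification can be written down directly. The snippet below is a minimal illustration: `encode` is a hypothetical placeholder standing in for any sentence encoder (for example a fine-tuned pretrained language model), and the support examples and labels are invented for the sketch.

```python
# Nearest-neighbor few-shot classification over instance embeddings.
# `encode` is a placeholder for any sentence encoder (e.g., a fine-tuned PLM).
import numpy as np

def encode(text: str) -> np.ndarray:
    # Hypothetical encoder: replace with real sentence embeddings.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Support set: one or a few labeled examples per class.
support = [("the plot is wonderful", "positive"),
           ("utterly boring and slow", "negative")]

def predict(query: str) -> str:
    q = encode(query)
    # Assign the label of the most similar support instance (1-nearest neighbor).
    best = max(support, key=lambda pair: cosine(q, encode(pair[0])))
    return best[1]

print(predict("a wonderful little film"))
```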
As the amount of available text data rapidly increases, the size of the training corpus for NLP tasks grows as well. With more training data, a natural way to boost model performance is to employ deeper neural architectures for modeling. Intuitively, deeper neural models with more sophisticated architectures and more parameters can better fit the increasing amounts of data. Another motivation for using deeper architectures comes from the development of computation devices (e.g., GPUs). Current state-of-the-art methods are usually a compromise between efficiency and effectiveness; as computation devices become faster, the time/space complexities of complicated models become acceptable, which motivates researchers to design more complex but effective models. To summarize, employing deeper neural architectures will be one of the definite orientations for representation learning in NLP.
Very deep neural network architectures have been widely used in computer vision. For example, the well-known VGG network [8], proposed for the famous ImageNet contest, has 16 convolutional and fully connected layers. In NLP, the depths of neural architectures were relatively shallow until the Transformer [9] structure was proposed. Specifically, compared with word embeddings [6], which are based on shallow models, the state-of-the-art pre-trained language model BERT [3] can be regarded as a giant model that stacks 12 self-attention layers, each with 12 attention heads in its base configuration. BERT has demonstrated its effectiveness in a number of NLP tasks. Besides the well-designed model architecture and training objectives, the success of BERT also benefits from TPUs, which are among the most powerful devices for parallel computation; in contrast, it might take months or years for a single CPU to finish the training process of BERT. As these computation devices become popular, we can expect more deep neural architectures to be developed for NLP as well.
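To illustrate how such deep self-attention stacks are assembled in practice, the following minimal PyTorch sketch (an illustration, not the original BERT code) stacks 12 Transformer encoder layers with 12 attention heads each, matching BERT-base's depth and head count but nothing else.

```python
# Stacking self-attention layers with PyTorch's built-in Transformer modules.
# This mirrors BERT-base's depth (12 layers, 12 heads) but is not BERT itself.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
encoder = nn.TransformerEncoder(layer, num_layers=12)

# A batch of 2 sequences, 16 tokens each, already embedded into 768 dimensions.
tokens = torch.randn(16, 2, 768)     # (sequence length, batch, hidden size)
hidden = encoder(tokens)
print(hidden.shape)                  # torch.Size([16, 2, 768])
```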
Model transparency and interpretability are hot topics in artificial intelligence and machine learning. Human-interpretable predictions are very important for decision-critical applications related to ethics, privacy, and safety. However, neural network models and deep learning techniques lack this kind of transparency and are thus often treated as black boxes.
Most NLP techniques based on neural networks and distributed representations are also hard to interpret, except for the attention mechanism, where the attention weights can be read as the importance of the corresponding inputs. To employ representation learning techniques in decision-critical applications, there is a need to improve the interpretability and transparency of current representation learning and neural network models.
A recent survey [1] classifies interpretable machine learning methods into two main categories: interpretable models and post hoc explainability techniques. Models that are understandable by themselves, such as linear models, decision trees, and rule-based systems, are called interpretable (transparent) models. In most other cases, however, we have to probe the model with a second one to obtain explanations, namely post hoc explainability techniques. In NLP, there has been some research on visualizing neural models, such as neural machine translation [4], for interpretable explanations, but the understanding of most neural models remains unsolved. We look forward to more studies on improving model interpretability to facilitate the extensive use of representation learning methods for NLP.
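To make the attention-based interpretation concrete, the short sketch below (an illustration, not from the book) computes scaled dot-product attention weights with a softmax; because the weights are nonnegative and sum to one over the inputs, they can be read as importance scores for each input position.

```python
# Scaled dot-product attention weights as input-importance scores.
import torch
import torch.nn.functional as F

d = 8
query = torch.randn(1, d)        # one query vector
keys = torch.randn(5, d)         # five input positions

scores = query @ keys.T / d**0.5         # similarity of the query to each input
weights = F.softmax(scores, dim=-1)      # nonnegative, sums to 1 over the inputs

# Each weight can be interpreted as how much the corresponding input
# contributes to the attention output for this query.
print(weights, weights.sum())
```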
During the development of deep learning techniques, mutual learning between different research areas has never stopped. For example, Word2vec, published in 2013, aims to learn word embeddings from large-scale text corpora and can be regarded as a milestone of representation learning for NLP. In 2014, the idea of Word2vec was adopted for learning node embeddings in a network/graph by treating random walks over the network as sentences, which became DeepWalk; the analogical reasoning phenomenon learned by Word2vec, i.e., king − man = queen − woman, also inspired the representation learning of world knowledge, known as TransE. Meanwhile, graph convolutional networks were first proposed for semi-supervised graph learning in 2016 and have recently been widely applied to many NLP tasks such as relation extraction and text classification. Another example is the Transformer model, which was first proposed for neural machine translation and was then transferred to computer vision, data mining, and many other areas.
Such fusion also appears between quite distant disciplines. We should recall again that the idea of distributed representation, proposed in the 1980s, was inspired by the neural computation scheme of humans and other animals, and it took about 40 years for the development of distributed representation and deep learning to come to fruition. In fact, many ideas, such as convolution in CNNs and the attention mechanism, are inspired by the computation schemes of human cognition.
Therefore, an intriguing direction of representation learning for NLP is to fuse the advances from other areas, including not only closely related areas in AI such as machine learning, computer vision, and data mining, but also, to some extent, more distant areas such as linguistics, brain science, psychology, and sociology. This line of work requires researchers to have sufficient knowledge of other fields.
References
1. Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
2. Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In Proceedings of ICLR, 2019.
3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
4. Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. Visualizing and understanding neural machine translation. In Proceedings of ACL, 2017.
5. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP, 2018.
6. T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.
7. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of InterSpeech, 2010.
8. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
9. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez, and Lukasz Kaiser. Attention is all you need. In Proceedings of NeurIPS, 2017.
10. Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.
11. Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, 2016.