Beyond Social Media Analytics: Understanding Human Behaviour and Deep Emotion using Self Structuring Incremental Machine Learning

Abstract

This thesis develops a conceptual framework that treats social data as the surface layer of a hierarchy of human social behaviours, needs and cognition, and uses it to transform social data into representations that preserve social behaviours and their causalities. Based on this framework, two platforms were built to capture insights from fast-paced and slow-paced social data. For fast-paced social data, a self-structuring and incremental learning technique was developed to automatically capture salient topics and their dynamics over time. An event detection technique was developed to automatically monitor the identified topic pathways for significant fluctuations in social behaviours using multiple indicators such as volume and sentiment. This platform is demonstrated using two large datasets with over 1 million tweets. The separated topic pathways were representative of the key topics of each entity and coherent against topic coherence measures. Identified events were validated against contemporary events reported in news. For slow-paced social data, a suite of new machine learning and natural language processing techniques was developed to automatically capture self-disclosed information about individuals, such as demographics, emotions and timelines of personal events. This platform was trialled on a large text corpus of over 4 million posts collected from online support groups. It was further extended to transform prostate cancer related online support group discussions into a multidimensional representation, which was used to investigate the self-disclosed quality of life of patients (and partners) against time, demographics and clinical factors. The capabilities of this extended platform have been demonstrated using a text corpus collected from 10 prostate cancer online support groups, comprising 609,960 prostate cancer discussions and 22,233 patients.


Beyond Social Media Analytics: Understanding Human Behaviour and Deep Emotion using Self Structuring Incremental Machine Learning

by Tharindu Rukshan Bandaragoda
MPhil (Monash University, 2015)
BSc. Eng. (Honours) (University of Moratuwa, 2011)

A thesis submitted in total fulfilment of the requirements for the degree of

Doctor of Philosophy

Supervisor: Prof. Damminda Alahakoon
Associate Supervisor: Dr Daswin de Silva

Research Centre for Data Analytics and Cognition
La Trobe University
Victoria, Australia

February 2020

Copyright by Tharindu Rukshan Bandaragoda 2020

To Amma and Thaththa

Abstract

Recent advances in mobile and social media technologies have led to a rapid increase in online social interactions among individuals from all walks of life. For instance, by 2019, 48% of the world population were active on at least one online social media platform. Social media platforms archive all aspects of online social interactions. The data representing social interactions (or social data) can be used to analyse social behaviours and underlying causalities. Unlike the data collected conventionally from censuses or controlled experiments, this data is neither retrospectively collected nor controlled, and is accumulated in large volumes, leading to more accurate insights.

Existing techniques for gaining insights from social data are mainly adaptations of standard machine learning and natural language processing techniques. The performance of such techniques on social data is suboptimal as social data is highly unstructured, due to brevity and out-of-vocabulary terms, while also dynamic and bursty. Specialised techniques are required to handle issues related to the unstructured and dynamic nature of these data streams. This thesis aims to develop such techniques that can be used to gain in-depth insights from social data.

In the first phase of the thesis, a conceptual framework has been developed considering social data as representing the surface layer of a hierarchy of human social behaviours, needs and cognition. This framework is subsequently employed to transform social data into representations that preserve social behaviours and their causalities. In the second phase, existing machine learning and natural language processing techniques have been extended to overcome the challenges of social data. Two platforms were built to capture insights from fast-paced and slow-paced social data. For fast-paced social data, a self-structuring and incremental learning technique was developed to automatically capture salient topics and corresponding dynamics over time. An event detection technique was developed to automatically monitor those identified topic pathways for significant fluctuations in social behaviours using multiple indicators such as volume and sentiment. The capabilities of this platform are demonstrated using two large datasets with over 1 million tweets each. The separated topic pathways were representative of the key topics/discussions of each entity and coherent against topic coherence measures. The identified events were validated against contemporary events reported in news.

Secondly, for the slow-paced social data, a suite of new machine learning and natural language processing techniques was developed to automatically capture self-disclosed information of the individuals such as demographics, emotions and timeline of personal events of significance. This platform was trialled on a large text corpus of over four million posts collected from online support groups. This was further extended to transform prostate cancer related online support group discussions into a multidimensional representation, which was used to investigate the self-disclosed quality of life of patients (and partners) against time, demographics and clinical factors. The capabilities of this extended platform have been demonstrated using a text corpus collected from 10 prostate cancer online support groups, comprising 609,960 prostate cancer discussions and 22,233 patients.

Statement of Authorship

Except where reference is made in the text of the thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis accepted for the award of any other degree or diploma. No other person’s work has been used without due acknowledgement in the main text of the thesis. This thesis has not been submitted for the award of any degree or diploma in any other tertiary institution.

Tharindu Rukshan Bandaragoda
February 25, 2020

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my primary supervisor Prof. Damminda Alahakoon for his continuous support and guidance throughout my PhD journey. Thank you for all the inspiration, motivation, enthusiasm and immense wisdom shared with me. I could not have imagined having a better supervisor and mentor to support my PhD journey.

Secondly, I would like to thank my co-supervisor Dr Daswin De Silva for his continuous support, guidance and feedback. Thank you for always encouraging me to write and publish, and for keeping my research journey on track until the very end. Also, thank you for all the collaboration opportunities provided.

This work was supported by an Australian Government Research Training Program Scholarship, a La Trobe University Full Fee Research Scholarship and a La Trobe University Postgraduate Scholarship. Also, I would like to thank the Data to Decisions Cooperative Research Centre (D2D CRC) for providing the top-up scholarships.

I would like to thank Dr Buddhi Jayathilaka for introducing me to this research group and providing guidance and support during the initial phase of my PhD. Moreover, I would like to take this opportunity to thank Dr Weranja Ranasinghe, A/Prof. Nathan Lawrentschuk and Prof. Damien Bolton for research collaboration and for providing clinical expertise to explain the analytical findings of my research. Also, I would like to thank Dr Evgeny Osipov and Luleå University of Technology, Sweden for research collaboration and for supporting my research visit.

A big thank you to La Trobe Business School and its support staff for all kinds of support.

I would like to thank all my colleagues at the Centre for Data Analytics and Cognition for all the support extended and the cherished memories.

I would like to express my sincere gratitude to my parents, who always encouraged me to take this PhD journey. Thank you amma and thaththa for all the support and encouragement throughout. None of this would be possible without you.

I was so fortunate to have the unconditional love and support of my wife Chammi throughout my PhD journey. Thank you for all your motivation, encouragement and tolerance. Also, I would like to thank my little daughter, Amelie, for her cuddles and smiles that kept me going during the latter part of my PhD.

Finally, I would like to thank everyone who supported me during my PhD journey.

Tharindu Rukshan Bandaragoda

La Trobe University
Victoria, Australia
February 2020

Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
[...] text
4.4.1 Handling dynamic vocabularies
4.4.2 Addressing the brevity issue
4.4.3 Identifying new topic pathways
4.5 Event detection
4.5.1 Volume based event indicator
4.5.2 Sentiment based event indicator
4.6 Illustrative Example
4.7 Experiments
4.7.1 Datasets
4.7.2 Pre-processing
4.7.3 Feature extraction
4.7.4 Topic separation
4.7.5 Evolution of topic pathways
4.7.6 Emergence of new topic pathways
4.7.7 Automatic event detection
4.8 Chapter Summary
[...]

List of Tables

[...] for p-value presented in the last column is from chi-square tests that evaluated the statistical significance across the three modalities.
6.4 Frequently-used emotion terms for three selected cohorts: (i) patients aged < 40, (ii) patients aged > 70 and (iii) partners of patients.
6.5 The decision factors employed for the treatment decision against the primary treatment modality of the patient.
6.6 The decision factors employed for the treatment decision against the decision making behaviour group.

List of Figures

[...] image source: Wikipedia
3.5 Accumulation of the traces of social behaviours in online social media platforms as social data.
3.6 The proposed self-structuring artificial intelligence framework for generating insights from social data in online social media platforms.
4.1 High-level design of the proposed platform for topic separation and event detection from a social text stream
4.2 Two node expansions scenarios
4.3 Layered structure of IKASL algorithm
4.4 Local term frequency of terms A and B across social media message batches B, ..., B. τ_V is the local term frequency threshold that determines inclusion or exclusion in the respective vocabulary
4.5 An illustrative example showing topic pathway separation and event detection
4.6 Volume (%) of tweets in six prominent topic pathways of TP_Microsoft, which represents Microsoft-Google relationship. Note that most frequent terms Microsoft and Google were removed from word clouds for improved clarity.
4.10 A word cloud and a word frequency graph generated for the topic pathway TP_Microsoft. Note that most frequent terms Microsoft and Google were removed from word clouds for improved clarity.
4.11 Four new topic pathways emerged in [...]

List of Publications

Journals

1. T. Bandaragoda, D. De Silva, and D. Alahakoon. Automatic event detection in microblogs using incremental machine learning. Journal of the Association for Information Science and Technology, 68(10):2394–2411, 2017.
2. T. Bandaragoda, D. De Silva, D. Alahakoon, W. Ranasinghe, and D. Bolton. Text mining for personalized knowledge extraction from online support groups. Journal of the Association for Information Science and Technology, 69(12):1446–1459, 2018.
3. T. Bandaragoda, W. Ranasinghe, A. Adikari, D. de Silva, N. Lawrentschuk, D. Alahakoon, R. Persad, and D. Bolton. The patient-reported information multidimensional exploration (PRIME) framework for investigating emotions and other factors of prostate cancer patients with low intermediate risk based on online cancer support group discussions. Annals of Surgical Oncology, 25(6):1737–1745, 2018.
4. W. Ranasinghe, D. de Silva, T. Bandaragoda, A. Adikari, D. Alahakoon, R. Persad, N. Lawrentschuk, and D. Bolton. Robotic-assisted vs. open radical prostatectomy: A machine learning framework for intelligent analysis of patient-reported outcomes from online cancer support groups. Urologic Oncology: Seminars and Original Investigations, 36(12):529.e1–529.e9, 2018.
5. D. De Silva, W. Ranasinghe, T. Bandaragoda, A. Adikari, N. Mills, L. Iddamalgoda, D. Alahakoon, N. Lawrentschuk, R. Persad, E. Osipov, et al. Machine learning to support social media empowered patients in cancer care and cancer treatment decisions. PLoS One, 13(10):e0205855, 2018.
6. W. Ranasinghe, T. Bandaragoda, D. De Silva, and D. Alahakoon. A novel framework for automated, intelligent extraction and analysis of online support group discussions for cancer related outcomes. BJU International, 120:59–61, 2017.
7. D. Nallaperuma, R. Nawaratne, T. Bandaragoda, A. Adikari, S. Nguyen, T. Kempitiya, D. De Silva, D. Alahakoon, and D. Pothuhera. Online incremental machine learning platform for big data-driven smart traffic management. IEEE Transactions on Intelligent Transportation Systems, 20(12):4679–4690, 2019.
8. M. Sherwood, M. Lordanic, T. Bandaragoda, E. Sherry, and D. Alahakoon. A new league, new coverage? Comparing tweets and media coverage from the first season of AFLW. Media International Australia, 172(1):114–130, 2019.
9. T. Bandaragoda, A. Adikari, R. Nawaratne, D. Nallaperuma, A. K. Luhach, T. Kempitiya, S. Nguyen, D. Alahakoon, and D. De Silva. Artificial intelligence based commuter behaviour profiling framework using Internet of things for real-time decision-making. Neural Computing and Applications, 1-15, 2020.

Conference proceedings

1. T. Bandaragoda, D. De Silva, D. Kleyko, E. Osipov, U. Wiklund, and D. Alahakoon. Trajectory clustering of road traffic in urban environments using incremental machine learning in combination with hyperdimensional computing. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 1664–1670. IEEE, 2019.
2. R. Nawaratne, T. Bandaragoda, A. Adikari, D. Alahakoon, D. De Silva, and X. Yu. Incremental knowledge acquisition and self-learning for autonomous video surveillance. In IECON 2017-43rd Annual Conference of the IEEE Industrial Electronics Society, pages 4790–4795. IEEE, 2017.

Chapter 1

Introduction

Man is by nature a social animal; an individual who is unsocial naturally and not accidentally is either beneath our notice or more than human. Society is something that precedes the individual. Anyone who either cannot lead the common life or is so self-sufficient as not to need to, and therefore does not partake of society, is either a beast or a god.

Aristotle, Politics (~384 BC)

This chapter initiates the thesis by postulating an expansive context for its research questions, objectives and contribution. From the early origins of human societies to contemporary habituation of digital environments, the manifestation of social behaviours and interactions has eventuated in diverse schools of thought and distinct domains of knowledge. The persistence of social data in digital environments has significantly transformed this scholarly pursuit. This thesis is an endorsement, empowerment and advancement of that pursuit. Let us begin.

Humans are social animals. Social interactions continue to be paramount for modern humans, just as they were the determinant of survival against natural selection for our first ancestors, the Hominin hunter-gatherer groups (1.8-1.3 million years ago) (Darwin, 1859). Over the course of human civilisation, the level of sophistication of social interactions, as well as the number of social interactions among humans, has advanced from simple survival to complex social needs. In modern humans, social interactions are controlled by cognitive processes in the brain, which perceive the social environment (social actions of others) based on sensory information and existing knowledge. Evolution has led to biological adaptations in the brain, sensory and motor neurons to optimise our social interactions in the natural world (Kock, 2004, 2005). For instance, human ears are more sensitive to human voices than any other acoustic stimuli (Nass and Brave, 2005) and the human brain has a high volume ratio for the neocortex (relative to the entire brain) in order to handle the computational demand of social interactions (Dunbar, 1998). Moreover, a study on cognitive skills among chimpanzees, orangutans, and two-year-old children uncovered that although there is no significant difference in cognitive skills for physical activities, human children possess superior cognitive skills for social activities (Herrmann et al., 2007).

Social behaviour is goal-directed action undertaken to achieve a perceived goal (Ajzen, 1985). Social behaviours are formally defined as abstractions of different types of human social interactions that are executed to achieve similar perceived goals (Ajzen, 1985). Some examples of social behaviours are interpersonal communication, self-disclosure, cooperation and social comparison. For instance, expressing emotions and discussing political views are distinct social interactions. However, both interactions disclose different types of information, i.e., emotions and political views, about oneself to others in society. Therefore, such interactions can be abstracted as the behaviour of self-disclosure. Developing an intricate understanding of social behaviours is crucial for all domains of knowledge and academic disciplines. This extends across humanities, social sciences and biological sciences more directly, and indirectly into physical sciences, formal sciences and applied sciences.

Conventional studies of human social behaviours rely on surface data (limited information on a large number of individuals) or deep data (extensive information on few individuals) (Manovich, 2011). Surface data is population-level data collected using censuses, which collect data about individuals, households and businesses. Such datasets are well structured and often analysed using statistical methods. Deep data is collected from a small group of individuals using controlled and natural experiments (Neuman, 2013). Such datasets are detailed and unstructured, and are analysed using qualitative methods. Both these data collection approaches have merits and have long served as the key approaches to study social behaviours. However, population-level surface data is often too shallow, with few attributes about each individual, and thus leads to over-generalised insights that are only valid at the population level. Also, such studies are costly and time consuming, hence not frequent (e.g., the US census is every 10 years). Deep data from social experiments is limited to few individuals and mostly obtained using controlled environments. As pointed out by Meshi et al. (2015), the major issues of deep data are lack of generalisability to a larger population, lack of external validity due to differences between the controlled experiment environment and the natural environment, and various biases in data collection such as recall bias.

Given these inherent limitations in conventional methods of studying human social behaviours, it is timely that recent technological advancements have collectively led to the invention and development of new digital environments for enriched social interactions, which also encapsulate the full spectrum of social behaviours. It is opportune that each interaction and behaviour is digitally represented, captured and archived to facilitate much more meticulous approaches to this field of study.

1.2 Social media

These (established and emerging) digital environments are widely referred to as social media or social media platforms. Technological developments in affordability, availability [...] of mobile and internet technologies (Kemp, 2019). A recent survey of US adults (Cole et al., 2017) shows that the average weekly internet usage has risen from 9.4 hours in 2000 to 23.6 hours in 2016 (Figure 1.1.b). This increase was mainly due to domestic internet use for leisure and social activities, which has risen five-fold over the past 16 years (Cole et al., 2017). Although the functionality of modern online social media platforms is diverse, and often specialised to meet the demands of different socio-demographic segments, they are in general web and mobile based platforms that enable users to create, share and co-create multimodal (text, image, audio and video) user generated content as well as engage in online social discussions with other members on the platform (Kaplan and Haenlein, 2010; Kietzmann et al., 2011).

A vibrant history

Usenet (an online discussion system started in 1979) and Bulletin Board Systems (a dial-up internet based computer resource sharing system introduced in the late 1970s) are the likely precursors of modern day online social media platforms. The first recognisable online social media platform (SixDegrees.com) launched in 1997 (Boyd and Ellison, 2007) and allowed users to create profiles and friend lists and to browse the profiles of their friends. As people started to flock to the internet in the early 2000s, there was a rapid increase in the launch of diverse online social media platforms. As shown in Figure 1.1.a, in 2019, 3.73 billion of the world's 7.73 billion people were active users of at least one social media platform, which accounts for 48% of the world population. Figure 1.1.c depicts the worldwide Monthly Active User (MAU) counts in October 2019 of the top social media platforms. It shows that Facebook has 2.414 billion MAU, which means 31% of the world population are active Facebook users. Similarly, the video sharing platform YouTube has 2.0 billion MAU (25.9% of world population). The most popular messenger applications are WhatsApp (1.6 billion MAU), FB Messenger (1.3 billion MAU), and WeChat (1.1 billion MAU). Popular microblogging platforms are QQ (808 million MAU), Sina-Weibo (486 million MAU) and Twitter (330 million MAU). This massive scale of active users highlights the popularity of online social media platforms and the increasing need for online social interactions across the world. Moreover, Figure 1.1.d shows that all those online social media platforms were launched in the last 15 years and gained a massive number of active users within a very short period of time. For instance, TikTok, a video sharing platform launched in 2017, had reached 500 million MAU by the end of 2019. Furthermore, as shown in Figure 1.1.e, most of those active users engage with the social media platforms on a daily basis. For instance, in the US, the percentage of daily active users is close to 75% for Facebook, over 60% for Instagram, over 50% for YouTube and over 40% for Twitter. This phenomenon highlights that most of the active users use social media platforms frequently to participate in various online social activities.
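The population shares quoted above follow directly from the reported user counts. As a quick arithmetic check, the short Python sketch below recomputes them from the figures cited in this section. The variable and function names are illustrative only and do not come from the thesis.

```python
# Figures quoted in the text, in billions (world population and monthly active users).
WORLD_POPULATION_BN = 7.73

reported_users_bn = {
    "any social media platform": 3.73,
    "Facebook": 2.414,
    "YouTube": 2.0,
}


def share_of_world(users_bn: float, population_bn: float = WORLD_POPULATION_BN) -> float:
    """Return the user count as a percentage of the world population."""
    return 100.0 * users_bn / population_bn


# Percentage of the world population for each reported user count.
shares = {name: share_of_world(count) for name, count in reported_users_bn.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of world population")
```

Rounded to the precision used in the text, the computed shares agree with the quoted 48%, 31% and 25.9%.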

A diversity of social behaviours

The social interactions on these online social media platforms are highly diverse, representing the variety of interests of the users. In addition to those common online social [...] geographic proximity where the individuals have to be in close proximity, (ii) synchronicity where interactions have to be timely, and (iii) accessibility where some individuals and groups are not accessible to others (McFarland and Ployhart, 2015).

However, online social media are lean mediums in contrast to the rich medium of face-to-face communication, which allows the transmission of multiple communication cues (Daft et al., 1987). In particular, non-verbal communication cues are not transferable through most of the online social media platforms, except for those that allow video based interactions. This claim is disputed by other studies which argue that without non-verbal communication cues, humans are capable of adapting to utilise the available cues in leaner mediums more effectively with prolonged use (Walther, 1992; Kock, 2004; Tidwell and Walther, 2002). For instance, in text-based mediums emoticons are used to express specific emotions and elongated terms (e.g., ‘soooo’) are used to emphasise the meaning of the base term.

Although a variety of social interactions (that represent a number of social behaviours) take place on online social media platforms, the baseline profile of salient social interactions is generally shaped by the characteristics of each social media platform. Such characteristics are sometimes intentional, designed to encourage certain types of interactions, or may have emerged as a consequence of certain types of user interactions occurring more frequently than others. For instance, mainstream social media platforms such as Twitter are fast-paced social media platforms that are used for rapid dissemination of current and trending information through the social network using functions such as re-tweeting and sharing (Kwak et al., 2010; Lotan et al., 2011; Lovejoy et al., 2012), which may even spread faster than seismic waves during an earthquake (Sakaki et al., 2010a). Due to this focus on information dissemination rather than continuing discussions, the ties between dyads are relatively weak (unless the users are known to each other by other means), and often lack homophily, reciprocity and interpersonal self-disclosure. Also, the emotions expressed in tweets are often shallow and intense, clearly leaning towards being positive or negative. In contrast, forums and online support groups, such as online health support groups, are slow-paced with relatively smaller groups where the participants have a high degree of homophily (e.g., the patients of a specific medical condition). These social media platforms are used to seek and provide informational and emotional support (Preece, 1999). Thus, users tend to publish longer posts with a high degree of self-disclosure and express complex emotions.

Digital representations of social data

Social data can be broadly defined as any form of information that represents and characterises social interactions. Every social interaction releases information into the environment in which it occurs, which can be explicit or implicit as well as intentional or unintentional to the participants of the interaction. For instance, in a face-to-face conversation in the physical environment, some of the information that characterises the interaction are the theme of the conversation, duration, location, emotions/opinions expressed, facial expressions of the participants, tone variation of the conversation, and the appearance of participants (e.g., what they wear). In addition, there can be pre- and post-interaction information that characterises what leads to the interaction and what follows.

Social data from social interactions in the physical world (mainly face-to-face communications) is seldom recorded. The interactions in the physical environment are generated [...] tweets archived in Twitter.

1.3 Motivation

Building upon this digest of social interactions, social behaviours, digital environments, and the social data accumulating on social media platforms, it is now pertinent to delineate the motivation of this thesis.

Conventional methods of studying human social behaviours are impacted by the challenges of collecting relevant data in sufficient volumes. A distinctive opportunity lies within the massive amount of social data accumulated in online social media platforms. As noted earlier, social data are the archived records of online social interactions, and thus provide an unparalleled aperture into understanding various human social behaviours as well as underlying causalities. This data is contributed by a large number of individuals, often in the scale of millions, and is thus on par with or even larger than the population-level datasets collected in censuses. Also, prolonged social interactions of individuals in online social media platforms generate sufficient information to conduct an in-depth analysis of their social behaviour.

Existing computational approaches to using these large volumes of social data are mainly adaptations of standard machine learning and natural language processing techniques. Table 1.1 shows some example uses of social data in different applications alongside the techniques employed and social media platforms used for data collection. These applications mainly use sentiment analysis techniques (Liu and Zhang, 2012) to evaluate public opinion, topic extraction techniques (Aggarwal and Zhai, 2012b) to identify salient discussion topics and entity extraction techniques (Ritter et al., 2011) to identify related entities (e.g., people, brands, locations) involved.

CHAPTER 1.

These adaptations of standard machine learning and natural language processing techniques demonstrate two high-level fundamental limitations when utilised for the exploration and understanding of social behaviours. The first is the constrained focus on basic insights of social data, when there are deep insights on human social behaviour that are encapsulated but largely untapped due to the challenging nature of pre-processing, analysis and synthesis. Examples are the shallow focus on surface-level meaning, frequency, and basic sentiment, and the lack of consideration for deeper insights such as emotions, psycholinguistics, and personal traits, which provide a better representation of the underlying social behaviours in social data. The second is sub-optimal computational performance (Eisenstein, 2013) due to the highly unstructured nature of social data (Hu and Liu, 2012). The dynamic and bursty nature of social data (Barabasi, 2005) requires techniques that are time sensitive and can learn incrementally from a dynamic data stream, in contrast to conventional machine learning techniques, which are mostly designed for static datasets.

Table 1.1: Applications of social data based on recent literature.

| Application | Technique (social media platform) |
| --- | --- |
| Natural disaster detection | Sakaki et al. (2010a) - event detection (Twitter); Middleton et al. (2014) - event detection, clustering (Twitter); Abel et al. (2012) - event detection, entity extraction (Twitter) |
| Planned protest prediction | Ramakrishnan et al. (2014) - event detection, entity extraction (multiple sources); Becker et al. (2012) - event detection, entity extraction (multiple sources) |
| First story detection | Osborne et al. (2012) - event detection, entity extraction (Twitter, Wikipedia); Lau, Collier and Baldwin (2012a) - topic extraction (Twitter) |
| Traffic incident detection | Pan et al. (2013) - anomaly detection, graph analysis (Weibo); D'Andrea et al. (2015) - event detection, classification (Twitter) |
| Sales forecasting | Rui et al. (2013) - sentiment analysis, regression (Twitter); Mishne et al. (2006) - sentiment analysis (Blogs) |
| Stock market prediction | Si et al. (2013) - sentiment analysis, topic extraction (Twitter); Das and Chen (2007) - sentiment analysis, classification (stock message boards) |
| Public opinion prediction | Tumasjan et al. (2010) - sentiment analysis (Twitter); Bollen, Mao and Pepe (2011) - sentiment analysis (Twitter) |
| Public health insights | Huber et al. (2017) - thematic analysis (online support groups); Paul and Dredze (2011) - topic extraction (Twitter) |

On motivation, this thesis is firstly inspired by the unprecedented opportunity to comprehensively study human social behaviours represented in social data accumulated on online digital social media platforms. Secondly, it is stimulated by the limitations of current computational and artificial intelligence approaches in addressing the scientific, [...]

1.4 Research objectives

This thesis develops an understanding of social behaviours through the study of online social data using novel computational and artificial intelligence approaches that model, transform, analyse and generate insights on social behaviours and underlying causalities. The research objectives are:

The first objective is to explore existing theories on human social behaviour, social needs and cognition, and to develop a conceptual framework to understand social behaviours through representative online social data. This conceptual framework considers social data as representing the surface layer of a hierarchy of human social behaviour, needs and cognition.

The second objective is to advance new self-structuring artificial intelligence approaches to address the challenges and limitations of existing machine learning and natural language processing techniques in the study of social behaviours based on online social data. This exploration will focus on self-structuring and incremental learning techniques to represent, adapt to and evolve with text based social data streams, and automatically monitor those structures for changes in social behaviours and causes.

The third objective is to transform the aforementioned approaches into technology platforms that capture insights from text based social data of different online social media streams and use such insights to gain descriptive understandings and predictive insights for decision-making.

Based on the above research objectives, the following research questions will be addressed in this thesis.

1. What are the limitations of existing artificial intelligence algorithms and natural language processing techniques in the study of social interactions and social behaviours using representative online social data in digital environments?
2. How can theories of social behaviour from social sciences contribute towards a conceptual model of enhanced understanding of social interactions in digital environments, as well as the representative online social data?
3. How can new incremental machine learning algorithms, founded on the principles of self-structuring artificial intelligence, address the challenges of using social data to understand social behaviours?
4. How can the research contributions that address research questions 2 and 3 be formulated into a technology platform that delivers actionable insights for societal advancement?

Based on the above research objectives and research questions, this thesis yields the following contributions as the outcomes:

1. A comprehensive investigation on current state-of-the-art machine learning and natural language processing techniques employed to generate insights from social data, followed by an exhaustive analysis of their limitations due to distinct challenges present in social data.
2. A conceptual framework based on existing social theories to explicate the complex relationships between social behaviours and online social data.
3. Development of two novel self-structuring artificial intelligence algorithms for generating insights from online social data. (1) A new unsupervised, self-structuring and incremental learning technique to structure a dynamic and diverse text based social data stream to automatically capture salient topics and their dynamics over time. (2) An automated, intelligent event detector to monitor the identified topic pathways for significant fluctuations in social behaviours using multiple indicators such as volume and sentiment to detect social events of interest.
4. Demonstration of the two novel self-structuring artificial intelligence algorithms on two large microblogging data streams.
5. Transformation of the two new algorithms, along with an ensemble of related machine learning and natural language processing techniques, into a technology platform. This platform was further adapted for slow-paced online social data, to automatically capture and incrementally learn self-disclosed information.
6. Extending the above mentioned platform into an application domain with high social impact, patient-centred healthcare. The platform was demonstrated to better facilitate the diverse information needs of its stakeholder groups: consumers, researchers and health professionals.
7. Further materialising the patient-centred healthcare technology platform for prostate cancer related online support groups, with real-life outcomes that have advanced clinical knowledge of patient needs and expectations. This was demonstrated using a text corpus collected from 10 prostate cancer online support groups comprising 609,960 prostate cancer discussions and 22,233 participants.

1.7 Thesis outline

Figure 1.2 shows how this thesis is organised.

Chapter two discusses the key literature on machine learning and natural language processing techniques that are used to capture insights from textual sources, which includes text clustering, topic modelling, event detection, sentiment and emotion capturing, self-structuring and incremental learning. Subsequently, it presents the major challenges such techniques would encounter when applied to social data.

Chapter three presents the proposed conceptual framework for understanding the underlying mechanisms that generate social data, using the existing physiological and social theories on cognition, social needs and social behaviour as the foundation. This conceptual framework is then extended to a platform that can be employed with artificial intelligence techniques to represent, self-structure and learn insights from social data.

Chapter four presents two novel techniques: (i) a self-structuring and incremental learning technique that is capable of automatically structuring a text based social data stream into topic pathways over time, and (ii) an automatic event detection technique that monitors the above topic pathways for behavioural changes using multiple change detectors. These techniques were trialled using two large Twitter datasets and demonstrated the capability to identify salient topic pathways, track their evolution over time, detect new topic pathways, and detect significant events based on sentiment and volume based event indicators.

Chapter five presents the development of an information structuring platform for online support group platforms to better facilitate the diverse information needs of its stakeholder groups: consumers, researchers and health professionals. It presents the details of a set of machine learning and natural language processing techniques to capture self-disclosed information such as demographics, emotions and psycholinguistics. The capabilities of this platform are demonstrated using a large text corpus extracted from two large online health support groups.

Chapter six presents the application of the techniques developed in chapters four and five to gain insights from prostate cancer related online support group discussions. In addition, it discusses the development of a specialised set of techniques to capture treatment type, side effects, treatment decision making behaviour and numeric pathological information specific to prostate cancer from the self-disclosed free text. It then showcases the capabilities of the captured multidimensional representation (prostate cancer related information, topics/topic pathways, demographics, emotions) to analyse the self-disclosed quality of life against the other factors.

Chapter seven concludes the thesis by providing a summary of the work presented in the preceding chapters, how the research questions formulated above were addressed in the body of work, and ideas for future research based on this thesis.

Chapter 2

Literature review and reflections

If I have seen further it is only by standing on the shoulders of giants.

Sir Isaac Newton (1675)

This chapter reviews existing machine learning and natural language processing techniques that are relevant to harnessing insights from social data. It also discusses the limitations and gaps of such techniques in effectively capturing insights from social data.

This chapter is organised as follows: Section 2.1 reports on manual assessment of social data and its limitations; Section 2.2 reviews the techniques for topic extraction and text clustering; Section 2.3 reviews event detection from text data, focussing on both specified and unspecified event detection techniques; Section 2.4 reviews emotion theories, models and sentiment/emotion extraction techniques from text; Section 2.5 discusses self-structuring techniques, covering biological inspiration and its computational adaptations; Section 2.6 reviews techniques developed for unsupervised incremental learning; and Section 2.7 discusses the challenges and limitations of existing techniques in relation to gaining insights from social data.

Social research using social data is a relatively emerging field of study that developed in conjunction with the rapid growth of social media platforms and their use. During the initial stages, social data was analysed manually by researchers. Such attempts include (i) manually coding social messages based on the topic discussed in the content, (ii) manually assessing the sentiment polarity and strength, and (iii) manually inferring socio-demographic information about the authors of social data.

The key advantage of manual assessment of social data is that it does not require challenging technical skills. Also, humans are skilled at decoding complex social expressions such as sarcasm, which are often present in social data. However, personal judgement on social data varies across assessors, as such judgements are impacted by the personal experience of each assessor. Such variations specifically occur in the assessment of sentiment or emotions expressed in social data. These variations were often minimised by using multiple assessors to assess the same social data and producing an aggregated assessment. Also, statistical agreement measures such as the kappa coefficient (κ) are used to evaluate inter-assessor agreement.

However, the key issue with the manual assessment of social data is the associated human cost, which limits its applicability to small social datasets, often in the scale of hundreds of data points. Crowdsourced manual assessment platforms such as Amazon Mechanical Turk (Buhrmester et al., 2011) make relatively inexpensive manual assessment possible. However, even such approaches can only assess social datasets in the scale of thousands. One of the key uses of manual assessment is the generation of reference assessment results, often known as gold standard records, which serve as the reference social data points for training and evaluating machine learning models that automatically learn the underlying patterns of inferences made by human assessors.

Topic extraction (or mining) is one of the key areas in natural language processing; it attempts to organise a large set of documents (e.g., news articles or tweets) into semantically meaningful groups, which are often denoted as topics or themes. Each identified topic is represented as a set of keywords that semantically represents that topic. Topic extraction is mainly handled as an unsupervised task, as in most scenarios the underlying topics in a document corpus are not known a priori. Two key unsupervised approaches for topic extraction are (i) text clustering and (ii) topic modelling.

Clustering is the classic data mining task of finding groups of similar records in the data based on a similarity measure. Similarly, text clustering intends to separate the text corpus into groups with similar meaning (Aggarwal and Zhai, 2012a).
It is assumed that a set of documents clustered together contains a common topic/theme/subject which is represented by the salient terms of that cluster. Text clustering techniques have been applied to organise documents (Cutting et al., 1992; Kohonen et al., 2000), organise web search results (Zamir et al., 1997), and generate an abstractive summary of a given corpus (Schütze and Silverstein, 1997).

Text clustering consists of two key phases (Larsen and Aone, 1999). First, the documents are converted into representative feature vectors, and then clustering techniques are employed to cluster those feature vectors into groups based on an appropriate distance function. These two phases are required because the clustering techniques are designed for datasets with real-valued features. Hence, the text corpus has to be transformed into representative real-valued feature vectors.

Document feature vectors are often obtained based on the bag-of-words (BOW) concept (Jurafsky, 2000), which simplifies a document as a bag of words, relaxing word order based relationships. Based on BOW, documents are transformed into feature vectors using term frequency inverse document frequency (tf-idf) (Salton and Buckley, 1988a; Sparck Jones, 1972), which gives a high feature value to terms that occur more often in a given document (proportional to term frequency) and less often across the corpus (inverse document frequency). The size of the vocabulary often grows with the corpus size, resulting in very high-dimensional feature vectors.
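The bag-of-words and tf-idf transformation described above can be illustrated with a minimal sketch. It assumes whitespace tokenisation, length-normalised term frequency and an unsmoothed logarithmic inverse document frequency; this is only one common tf-idf variant, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document (assumes non-empty documents)."""
    tokenised = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Weight = (term count / document length) * log(N / document frequency).
        vectors.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

# Toy corpus (illustrative only).
docs = ["flood warning river", "river flood flood damage", "election result warning"]
vectors = tf_idf(docs)
```

In this toy corpus, "damage" appears in a single document and therefore receives a higher weight in the second document than "river", which appears in two documents with the same within-document frequency.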
To limit this dimensionality, document frequency based pruning techniques are used to remove less informative terms, which are either common terms among most of the documents or rare terms that exist in a small set of documents (Salton et al., 1975).

The techniques employed in text clustering are either extended or adapted from the clustering techniques used for quantitative data such as k-means (Lloyd, 1982), self-organising maps (SOM) (Kohonen, 1990), and hierarchical clustering (Ward Jr, 1963). Scatter/Gather (Cutting et al., 1992) is one of the well-known text clustering algorithms; it uses hierarchical clustering to determine the cluster centres and further refines them using the k-means clustering technique. Larsen and Aone (1999) proposed a variant of k-means which uses a damping technique during the centre adjustment that improves the cluster quality. Steinbach et al. (2000) proposed a bisecting k-means variant that iteratively splits the largest cluster (initially the entire dataset) into two clusters using k-means until k clusters are generated. Kohonen et al. (2000) employed self-organising maps to cluster a large collection of patent abstracts, using randomly projected document vectors to reduce dimensions.

Topic modelling is another approach widely employed to extract salient topics from a document corpus. It employs a probabilistic generative model to represent documents in a corpus. Topic models estimate probabilities for each document being generated by the identified topics, where the most salient topics of a document can be chosen based on a threshold. This means a document could belong to multiple topic clusters. This is in contrast to the previously discussed clustering approaches, which exclusively partition a set of documents into topic clusters, resulting in a single topic for a document.
Having multiple topics is more intuitive for larger documents, which may contain multiple topics.

Topic models assume that the corpus is generated by $N$ latent topics, and each document $d_i$ has a probability $P(t_j \mid d_i)$ of belonging to each topic $t_j$. Similarly, the probability of each term $w_k$ in the corpus vocabulary $V$ being part of a topic $t_j$ is given by $P(w_k \mid t_j)$. Neither the topic-document probability $P(t_j \mid d_i)$ nor the term-topic probability $P(w_k \mid t_j)$ can be directly estimated, as topics are latent; the only relationship that can be estimated directly is the term-document probability $P(w_k \mid d_i)$ for a term $w_k$ and document $d_i$, which is given by the term frequency of $w_k$ in $d_i$ divided by the number of terms in $d_i$. Therefore, the topic-document and term-topic probabilities are estimated based on their relationship with the term-document probability given below:

$$P(w_k \mid d_i) = \sum_{j=1}^{N} P(w_k \mid t_j) \times P(t_j \mid d_i)$$

Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999, 2001) is one of the initial generative topic models, which uses the expectation maximization (EM) algorithm to estimate the probabilities of the topic-document and term-topic relationships. The probabilities are randomly initialised, and the expectation (E) and maximization (M) steps are then conducted iteratively. The E-step estimates the posterior probabilities of the latent variables given their current values and the term-document probabilities. The M-step then estimates new values for the probabilities based on the probabilities of the latent variables estimated in the E-step. These two steps are carried out alternately until the likelihood converges to a local maximum. The key parameter of PLSA is the number of topics $N$ for which the generative model is optimised. The key limitation of PLSA is its tendency to overfit when estimating a large number of parameters.

The most popular variant of probabilistic topic models is the Latent Dirichlet Allocation (LDA) (Blei et al., 2003) algorithm.
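The alternation of E-steps and M-steps described for PLSA can be sketched as follows. This is a minimal illustration under stated assumptions rather than Hofmann's implementation: it works on a small dense document-term count matrix, runs a fixed number of iterations instead of testing the likelihood for convergence, and the names `plsa` and `term_given_doc` are hypothetical.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of equal-length lists)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalise(row):
        total = sum(row)
        return [value / total for value in row]

    # Random initialisation of P(w_k | t_j) and P(t_j | d_i).
    p_w_t = [normalise([rng.random() + 1e-3 for _ in range(n_terms)]) for _ in range(n_topics)]
    p_t_d = [normalise([rng.random() + 1e-3 for _ in range(n_topics)]) for _ in range(n_docs)]

    for _ in range(n_iter):
        # Small floor keeps every probability strictly positive.
        new_w_t = [[1e-12] * n_terms for _ in range(n_topics)]
        new_t_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for k in range(n_terms):
                if counts[i][k] == 0:
                    continue
                # E-step: posterior over topics for this (document, term) pair.
                joint = [p_w_t[j][k] * p_t_d[i][j] for j in range(n_topics)]
                total = sum(joint)
                for j in range(n_topics):
                    expected = counts[i][k] * joint[j] / total
                    # Accumulate expected counts for the M-step.
                    new_w_t[j][k] += expected
                    new_t_d[i][j] += expected
        # M-step: renormalise expected counts into probabilities.
        p_w_t = [normalise(row) for row in new_w_t]
        p_t_d = [normalise(row) for row in new_t_d]
    return p_w_t, p_t_d

def term_given_doc(p_w_t, p_t_d, i, k):
    """Mixture P(w_k | d_i) = sum_j P(w_k | t_j) * P(t_j | d_i)."""
    return sum(p_w_t[j][k] * p_t_d[i][j] for j in range(len(p_w_t)))

# Toy count matrix: 4 documents over a 4-term vocabulary (illustrative only).
counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
p_w_t, p_t_d = plsa(counts, n_topics=2)
```

Because EM only reaches a local maximum, different seeds may yield different topic assignments; the sketch guarantees only that each estimated distribution is properly normalised.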
LDA closely follows PLSA but is a complete generative model, which models the topic-document and term-topic probabilities using Dirichlet distributions as priors with hyperparameters α and β respectively. α controls the mixture of topics in a document, and higher values lead to higher topic-document probabilities, i.e., more topics are assigned to a document. β controls the term-topic probabilities, and higher values lead to higher term-topic probabilities, i.e., more terms are assigned to a topic. In contrast to PLSA, LDA has fewer parameters and is thus less susceptible to the overfitting problem. Moreover, LDA is capable of estimating the topic distribution of a new document that was not present during learning.

The two key challenges for topic extraction techniques on social data are high sparsity and time sensitivity. Text data is often sparse; however, social data is overly sparse, mainly due to brevity and diversity. Social media posts are often left brief, as this allows authors to construct them efficiently and with less concern about their flow. On the other hand, social data is constructed by many individuals who use diverse writing styles and discuss diverse topics, resulting in a large vocabulary across a corpus of social data. Both these factors contribute to sparse document feature vectors (from the vector space model) containing only a handful of activated features. This increased sparsity makes topic extraction techniques susceptible to noise, degrading their performance.

The topics discussed in social data change over time as the interests of the authors change. Existing topics drop in popularity while new unseen topics emerge over time. Therefore, the topic mixture changes dynamically over time. However, both text clustering and topic modelling techniques assume a static mixture of topics and thus have to be extended to facilitate changes in topic mixture.

[...] biterm topic model, which builds a generative model to represent the topic to word-pair (biterm) relationship.
The assumption here is that if two words co-occur frequently, it is likely that those two words are part of the same topic. This model is claimed to be more robust to brevity in social data, as it does not model the topic-document relationship, which tends to suffer more from sparsity issues.

Mehrotra et al. (2013) attempted different aggregation strategies, such as aggregating tweets based on hashtags, authors and time windows (aggregating hourly tweets). They empirically showed that hashtag based aggregation is the most effective for LDA based topic extraction. However, the granularity of hashtag based aggregation varies widely based on the usage of each hashtag. For example, [...] topic over time (ToT), which is an extension to the LDA topic modelling algorithm that considers time in the generative model. The time-topic relationship is added to the generative model of LDA alongside the document-topic and term-topic relationships. The time-topic relationship is modelled using a beta distribution as the prior, in contrast to the document-topic and term-topic relationships, which are modelled using Dirichlet priors. However, in ToT, time is employed only to get a better fit to a fixed topic mixture, and topic trends such as topic evolution and new topics are not captured. Moreover, it is not trained in an incremental fashion for use on a text stream.

AlSumait et al. (2008) and Lau, Collier and Baldwin (2012b) present online versions of LDA (OLDA), which are incremental versions of LDA that facilitate topic mining from a text stream. These online variants take time-sliced input of a text stream and incrementally update an LDA topic model at each time slice. They use the LDA model generated from the text corpus at $t-1$ [...] $t$. This process generates an evolutionary matrix over each topic to capture the evolution of that topic over time. Based on this evolutionary matrix, the topics that are significantly different at $t$ in contrast to $t-1$ [...]

An event can be broadly defined as an occurrence that can be bounded by space and time (Allan, Papka and Lavrenko, 1998). Events could happen anytime and anywhere, and could impact a single person, a handful of individuals, or a large number of individuals (e.g., natural disaster, election). Events alter the behaviour of the individuals impacted.

An event in a social media platform is due to either a real world event (e.g., natural disaster, election) discussed in the social media platform, or it can originate from the social media platform itself without a link to a real world event (e.g., a controversial tweet from an influential author). An event often results in a change of online social behaviour of the impacted individuals for the duration of the event (e.g., change of discussion topics, changes of emotions). Such changes due to an event often happen in near real time as a result of the frequent use and availability of mobile devices. For instance, during natural disasters such as earthquakes and floods, related discussions instantly start to flow in social media platforms with firsthand individual accounts, opinions, emotions and arguments (Java et al., 2007; Zhao and Rosson, 2009).

Identifying these events and analysing the related online social behaviour yields valuable information for individuals, corporations, and law enforcement agencies. Corporations increasingly monitor social media platforms to understand customer opinions, concerns and sentiment on their product portfolio, and to provide insights for business decisions (Jansen et al., 2009; Pak and Paroubek, 2010).
Social media platforms can be further monitored to identify breaking news on events reported in real time by firsthand individual accounts (Phuvipadawat and Murata, 2010; Sankaranarayanan et al., 2009). Also, social media platforms are monitored for events that may lead to public unrest (Ramakrishnan et al., 2014).

Event detection techniques can be broadly separated into unspecified and specified event detection (Atefeh and Khreich, 2015), based on the event type. Unspecified event detection is employed when prior information about the event is not available, and specified event detection is employed when details about the event being looked for are partially known.

Unspecified events are events with no prior information available. Such events can range from extreme events such as natural disasters or terrorist acts, to unplanned political events such as a controversial statement from a key politician, and unplanned corporate events such as a product malfunction or service outage. Since no prior information is available [...] bursty terms from tweets and subsequently clustered the identified terms into topics/events. Fung et al. (2005) modelled the individual word distributions over time as binomial distributions and identified the bursty terms. Those terms with sufficiently overlapping active time windows are grouped together and identified as events. He et al. (2007) applied the Discrete Fourier Transform (DFT) and searched for bursty terms in the frequency domain, where they appear as spikes in the frequency spectrum. However, the time window of the event cannot be retrieved from the frequency domain. Weng et al. (2011) developed Event Detection with Clustering of Wavelet-based Signals (EDCoW), which applies the Wavelet Transform to word distributions, filters out the trivial terms based on autocorrelation, and then employs a graph clustering technique to group the event related terms into events. Sankaranarayanan et al. (2009) trained a Naive Bayes classifier on term vectors of a hand-labelled tweet dataset to classify tweets as news or junk. The identified news tweets are then clustered online to identify the salient topics among the news tweets.

Although volume based change signals are widely employed for event detection, there are other changes that could happen to a social data stream due to an event. An event can significantly alter the opinion or emotions expressed in a discussion (Tan et al., 2014; Thelwall et al., 2011). For example, a social media post by a politician on a certain issue could lead to a backlash with heavy criticism, resulting in a significant increase in negative opinion and expression of emotions such as anger or sarcasm. Detecting such opinion or emotion related signals supplements the volume based event detections and provides more detailed insights about the genre of the event.

Paltoglou (2016) developed a sentiment based event detection technique that monitored the temporal variation of sentiment polarity of tweets with particular keywords (e.g., hashtags) and identified instances with significant changes of sentiment as events that are associated with that keyword. The events captured by this technique were shown to be comparable to the events captured by a frequency based approach. One key limitation of this technique is the assumption that there is a one-to-one mapping between keywords and topics, whereas in social data streams there can be multiple keywords associated with a topic and vice versa.

Specified event detection is the detection of known events and their related information such as location, time, participants and outcomes (Atefeh and Khreich, 2015). In contrast to unspecified events, the key terms that denote or are associated with the event(s) are a priori information and can be employed to detect them. Hence, in contrast to the unsupervised change detection techniques used for unspecified events, specified events are mainly detected using supervised techniques, which use either an engineered feature dictionary or labelled datasets to train a classifier.
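For contrast with the supervised route, the unsupervised change detection idea used for unspecified events can be sketched as flagging time windows whose indicator value deviates strongly from a trailing baseline. This is a generic z-score illustration and not the method of any work cited above; the function name `detect_changes`, the window size, the threshold and the toy series are all assumptions. The same function applies to a volume signal or to a mean sentiment signal.

```python
from statistics import mean, pstdev

def detect_changes(series, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    events = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            sigma = 1e-9  # flat history: treat any change as large
        if abs(series[t] - mu) / sigma > threshold:
            events.append(t)
    return events

# Toy hourly indicators for one keyword or topic (illustrative values only).
hourly_volume = [10, 12, 11, 9, 10, 11, 48, 12, 10]
hourly_sentiment = [0.2, 0.1, 0.2, 0.15, 0.2, -0.6, 0.1]

volume_events = detect_changes(hourly_volume)        # [6]: the spike to 48
sentiment_events = detect_changes(hourly_sentiment)  # [5]: the drop to -0.6
```

A trailing window is used so that the detector can run on a stream; one consequence is that a spike inflates the baseline variance for the following windows, which makes events that occur immediately after it harder to flag.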
Also, unlike the change detection techniques, which require a substantial change in the social data stream from large scale events, specified event detection techniques can be employed to detect events at different scales, from large events such as riots and earthquakes to individual events such as the report of an illness in a personal social data stream.

Specified event detection on different types of text sources has been studied in different application domains. For instance, in the financial domain, financial documents and business news are mined for finance related events such as catastrophic financial events (Cecchini et al., 2010), mergers and acquisitions (Lau, Liao, Wong and Chiu, 2012), and competitor relations (Ma et al., 2011). Similarly, social data streams are mined for mentions of relevant entities (e.g., organisation names, brands, movie titles) and their associated mood (or sentiment), which can be used to determine market volatility such as stock price movements (Mao et al., 2012; Bollen, Mao and Zeng, 2011) and movie revenues (Rui et al., 2013; Asur and Huberman, 2010). In the health domain, electronic health records (EHR) such as admission notes, treatment plans, patient summaries, radiology reports, pathology reports and discharge notes are mined using natural language processing techniques to capture different clinical events of patients such as diagnoses, treatments, symptoms and side effects (Murdoch and Detsky, 2013; Meystre et al., 2008). These events are used with machine learning techniques to gain further insights. For example, association rules such as symptom-disease, drug-disease (Chen et al., 2008), drug-side effects (adverse drug [...]

Identifying documents that contain event information

The techniques that are mainly employed to extract event related documents are (i) thesaurus based techniques and (ii) classification model based techniques.

Thesaurus based techniques assume that the presence of certain terms in documents indicates that such documents contain information about the events of interest. Therefore, a thesaurus is engineered with event related terms and used to identify the documents which contain such terms. For instance, the EMBERS planned protest model (Ramakrishnan et al., 2014) employs a multi-lingual lexicon of planned protest related terms (e.g., "plan to strike" in English or "preparación huelga" in Spanish) to capture the documents (e.g., social media posts or web pages) that discuss planned protests. These thesauruses are developed based on domain knowledge and are often based on knowledge bases or glossaries that exist in each domain. For example, in the health domain, specified event detection techniques often employ health related knowledge bases such as the UMLS Metathesaurus (Bodenreider, 2004), which contains multiple health thesauruses including SNOMED CT, which contains normalised health concepts, synonyms and descriptions, and RxNORM (Nelson et al., 2011), which is a drug thesaurus covering all drugs approved in the USA.

The classification model based techniques use a labelled document set, where positives discuss the event of interest and negatives do not, to train a machine learning classification model. The documents are first feature transformed in order to convert text into the real-valued features that are needed for machine learning techniques. Generally employed feature transformation techniques include one-hot encoding, which transforms the presence/absence of a term into a binary feature, and the term frequency inverse document frequency (tf-idf) (Salton and Buckley, 1988a; Sparck Jones, 1972) technique discussed in Section 2.2.1.
However, recent advances in semantic text representations have led to the development of word-embedding techniques (Mikolov, Sutskever, Chen, Corrado and Dean, 2013) that are often used as a feature transformation in deep neural network based techniques. The feature transformed positive and negative samples are then used to train a classification model which learns to differentiate positives from negatives. This trained model is then used to identify the event related documents from an unlabelled document collection or stream. For instance, Sakaki et al. (2010b) trained an SVM (Support Vector Machine) (Cortes and Vapnik, 1995) based classifier using feature transformed tweets, where positives are tweets that refer to an actual earthquake. This technique is mainly employed when reliable domain knowledge bases do not exist for the events of interest.
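The classification model based pipeline described above (feature transformation, training on labelled positives and negatives, then prediction on unseen documents) can be sketched in a few lines. To stay self-contained, this sketch swaps in a simple perceptron where Sakaki et al. (2010b) used an SVM, and uses one-hot encoding as the feature transformation; the function names and the four toy training documents are hypothetical.

```python
def build_vocabulary(documents):
    """Map each distinct term in the training documents to a feature index."""
    terms = sorted({term for doc in documents for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}

def one_hot(document, vocabulary):
    """Binary presence/absence vector; out-of-vocabulary terms are ignored."""
    vector = [0] * len(vocabulary)
    for term in document.lower().split():
        if term in vocabulary:
            vector[vocabulary[term]] = 1
    return vector

def train_perceptron(vectors, labels, epochs=20):
    """Train a linear classifier; labels are +1 (event) or -1 (no event)."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
    return weights, bias

def predict(document, vocabulary, weights, bias):
    x = one_hot(document, vocabulary)
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else -1

# Toy labelled set: positives refer to an actual earthquake (illustrative only).
train_docs = ["earthquake shaking right now", "strong earthquake just hit",
              "earthquake movie was great", "watching a great movie now"]
train_labels = [1, 1, -1, -1]

vocabulary = build_vocabulary(train_docs)
weights, bias = train_perceptron([one_hot(d, vocabulary) for d in train_docs], train_labels)
```

After training, `predict("earthquake hit just now", vocabulary, weights, bias)` returns 1 for this toy data, while `predict("great movie", vocabulary, weights, bias)` returns -1. The example also shows why labelled data matters here: the term "earthquake" alone does not separate the two classes.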

Identifying relevant entities of an event

Once an event is identified, the next step is to capture more information about the event. This task is generally achieved by capturing the relevant named entities that are related to the identified event.

Named entities refer to terms or phrases that consistently stand for the same referent. This definition is derived from the notion of a rigid designator (Kripke, 1972): a designator is rigid when it denotes the same thing in all possible worlds. The relevant named entities differ across application domains. For instance, in a public event, the named entities are often the location, time and people/organisations involved, whereas in the biomedical/health domain the named entities could be health related entities such as drugs, illnesses, symptoms, treatments and genes.

Recognition of named entities is achieved through detection of entity terms (words or phrases) and subsequently classifying them into meaningful categories. The recognition part is often handled using text chunking, regular expressions and noun phrase extraction techniques (Nadeau and Sekine, 2007) to extract potential entity phrases.

The categorisation of the identified entities is achieved either based on relevant ontologies or using supervised learning techniques. The ontologies contain term dictionaries for high-level categories, and the identified entities present in those dictionaries are assigned to the relevant categories. Those ontologies are often engineered for the requirements of different application domains. WordNet (Miller, 1995) and YAGO (Mahdisoltani et al., 2013) are popular examples of such ontologies available for general purposes, which contain a hierarchy of high-level categories for each entity in the ontology. Similarly, there are specialised domain specific ontologies, e.g., the UMLS Metathesaurus (Bodenreider, 2004) for the health domain. These ontologies are employed by state-of-the-art health text-processing tools such as Apache cTAKES (Savova et al., 2010) and MetaMap (Aronson et al., 2010).

[...] person, organisation, location, time/date and money. These categories are formulated for information extraction tasks from news articles. This categorisation has been followed by most of the current state-of-the-art NER tools, including Stanford NLP (Finkel et al., 2005) and Open NLP (Baldridge, 2005).

2.4 Emotion extraction from text

Emotions are an important aspect of human social life and social behaviour. In psychology, emotions are defined as a complex state of mind that influences thought processes and behaviour. Emotions play an integral part in human social interactions, especially in the interpersonal communication process. In face-to-face communication, humans express emotions verbally (emotional words and tone) as well as through non-verbal cues such as facial expressions and hand gestures. In written communication, the expression of emotions is limited to the use of emotional terms (words, phrases and emoticons).

Understanding the emotions conveyed is an integral part of understanding the communicated message during the communication process. The emotions conveyed in a message supplement its discussion topic by adding the opinions/feelings of the communicator towards that discussion topic. Such expression of emotion is ubiquitous in social data during social conversations, where individuals freely express their opinions and feelings on different discussion topics. Capturing the emotions encapsulated in social data provides invaluable insights about the conversations that happen in social media platforms, which is important for understanding the opinions/feelings of individuals towards different topics.

In contrast to topic extraction and event detection, techniques for emotion extraction from text are still in the early stages.
This is mainly because emotion extraction from text only became relevant and useful recently with the rise of social media platforms, while topic extraction and event detection techniques started to develop much earlier to capture topics/events from news articles.

The rest of this section is organised as follows: firstly, emotion theories and the relevant emotion models, which are used as the basis for emotion extraction techniques, are discussed. Subsequently, sentiment extraction techniques (two dimensional emotion extraction) are discussed and finally, multi-granular emotion extraction techniques are discussed.


The key theories of emotion can be categorised into physiological and cognitive. Physiological theories suggest that emotions are due to physical changes in the body. A prominent physiological theory is the James-Lange theory of emotions (Cannon, 1927), which was formulated by two 19th-century scholars, William James and Carl Lange. This theory states that emotions are nothing more than conscious feelings about bodily changes. This idea that emotions occur solely due to physiological reactions to events has been criticised by others. Another similar theory is the Cannon-Bard theory (Cannon, 1927), which states that after an event, physiological reactions and emotions occur at the same time. The two-factor theory (Schachter and Singer, 1962) is a more recent theory of a similar genre, which states that emotions are generated by a two factor process. First, the physiological event occurs, and then the individual refers to their past experience or the immediate environment to find emotional cues that provide an emotion label for that event. A key commonality in those theories is that the stimulus for emotions is physical changes.

Cognitive theories, on the other hand, argue that emotions occur due to conscious cognitive activities such as thoughts, judgements, and evaluations. One of the key theories in this line of thought is the Lazarus cognitive theory (Lazarus and Lazarus, 1991) proposed by Richard Lazarus. This theory states that emotions are determined by the cognitive appraisal of an event or stimulus in the environment. It argues that the quality and intensity of the emotions are controlled by the cognitive process. Therefore, cognitive appraisal mediates the effect of the event in the environment. Since the cognitive appraisal process is personalised (based on the individual's past experience), the same event often yields different emotional responses from different individuals.
Figure 2.1 presents how emotions are generated following an event in the environment based on the different theories of emotion discussed above.

The term emotion is used alongside and interchangeably with the terms sentiment, affect, and feeling. Shouse (2005) points out that affect is the most general and abstract of these terms. Affect represents the intensity of experience as either positive or negative. It is often not conscious and not explainable using language (Munezero et al., 2014). In contrast, feelings are conscious and labelled sensations that are checked against past experience. Hence, feelings are personal to each individual, as the labelling of sensations is based on the past experience of the individual. Moreover, Shouse (2005) states that emotions are the display or projection of feelings. Friedenberg and Silverman (2011) state that emotions are short episodes that involve both brain and body in responding to an event of interest. Sentiment is defined as the positive or negative mental predisposition of an individual towards an object (e.g., another person, event). Sentiment is often formed when an individual constantly perceives or thinks about an object, which leads to building up a dispositional idea about the object.

Characterising or assessing emotions is a challenging task. This is because, firstly, as discussed in the above theories, emotions originate from a cognitive process, which makes emotions [...]

[...] n dimensional space.

2. Discrete models, which model all emotions into a number of basic categories.

Dimensional models argue that emotions are overlapping and interrelated affective states generated by a common neurophysiological system (Posner et al., 2005). Therefore, all human emotions can be represented in a conceptual continuous n dimensional space (often two or three dimensional). Each of these dimensions is modelled to represent a different aspect of the underlying neurophysiological system.
The key dimensional models are (i) the positive activation negative activation (PANA) model (Watson and Tellegen, 1985), (ii) the Vector model (Bradley et al., 1992) and (iii) the Circumplex model (Russell, 1980).

The PANA model (Watson and Tellegen, 1985) represents emotions in two dimensions, where the horizontal dimension is the strength of positive affect and the vertical dimension is the strength of negative affect. Among those models, the most popular dimensional model for emotions is the Circumplex model (Russell, 1980). It argues that all emotions originate from two neurophysiological systems, valence (pleasant to unpleasant continuum) and arousal (active to passive continuum). According to the Circumplex model, each emotion can be represented as a linear combination of valence and arousal. For instance, excited is a pleasant and active emotion while relaxed is a pleasant and passive emotion. Figure 2.2 shows a graphical representation of the Circumplex model and how different emotions can be represented along its valence and arousal dimensions.

Figure 2.2: A graphical representation of the Circumplex model consisting of valence and arousal dimensions. Multiple emotions are placed in the axis system based on their valence and arousal. Image adapted from Posner et al. (2005).

In contrast to dimensional models, discrete emotion models argue that human emotional expression consists of a set of distinct basic emotions. These basic emotions are common to all humans and even recognisable across different cultures. The hypothesis of distinct basic emotions stems from Darwin (1872): different emotions are distinct neuropsychological phenomena shaped by the natural selection process to provide organised physiological and cognitive responses to challenges and opportunities in the environment (Plutchik, 1980a).
In other words, emotions are formulated as part of the evolution process and thus, it is argued that there should be a set of basic emotions common to all humans.

This hypothesis of a universal set of basic emotions was initially tested by Darwin (1872), who showed an array of photographs with different facial expressions to people and asked them to recognise the expressions. This experiment was advanced by Ekman and Oster (1979), who showed photographs of Europeans with different facial expressions to two isolated tribal communities (who had never met Europeans) in Borneo and New Guinea. People (including children) from the tribal communities were able to understand the emotions expressed in the facial expressions of the photographs with significant accuracy. Also, photographs of facial expressions taken from the tribes were recognised by US college students with a similar level of success. These experiments led to the assertion that there exists a basic set of discrete emotions present in all humans. Ekman (1992) further argues that basic emotions are distinguishable as they have distinct universal signals, such as facial expressions common across all humans, and also often [...]

[...] Plutchik's emotion wheel.

Figure 2.3: Plutchik's wheel of emotions adopted from (Plutchik, 2001)

The next sections discuss how those emotion models are utilised to detect sentiment and emotions from text.
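As a minimal illustration of the dimensional view discussed in this section, the sketch below places an emotion in a valence-arousal plane and names the quadrant it falls in. The `Emotion` class, the [-1, 1] scale and the coordinates are assumptions made for this example and are not values taken from Russell (1980); only the qualitative placement of excited and relaxed follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Emotion:
    """A point in a two dimensional valence-arousal space (hypothetical [-1, 1] scale)."""
    name: str
    valence: float  # negative = unpleasant, positive = pleasant
    arousal: float  # negative = passive, positive = active


def quadrant(emotion: Emotion) -> str:
    """Describe which quadrant of the valence-arousal plane an emotion falls in."""
    pleasantness = "pleasant" if emotion.valence >= 0 else "unpleasant"
    activation = "active" if emotion.arousal >= 0 else "passive"
    return f"{pleasantness} and {activation}"


# Illustrative coordinates only; the numbers are assumptions.
excited = Emotion("excited", valence=0.7, arousal=0.7)
relaxed = Emotion("relaxed", valence=0.7, arousal=-0.6)

print(excited.name, "->", quadrant(excited))
print(relaxed.name, "->", quadrant(relaxed))
```

A discrete model would instead replace the two continuous coordinates with a label drawn from a fixed set of basic emotions.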


As discussed in Section 2.4.1, sentiment is often defined as affect (or emotion) towards an object or topic. Sentiment detection (or opinion mining) is the computerised analysis of discourse to capture the emotions/opinions expressed towards an object (or topic) by the author. Most sentiment extraction techniques provide a positive or negative intensity score for a given text based on the emotional expression in that text.

Sentiment detection techniques are often considered the computational implementation of the Circumplex model (Russell, 1980) discussed in the previous section, which models emotions in a two dimensional space as linear combinations of valence and arousal.

Most of the early sentiment detection techniques were limited to detecting the valence (positive/negative) dimension of the text (Pang et al., 2008). Those techniques treat the task as binary classification and classify a given text as positive or negative based on the sentiment expressed in it. The key issue with these techniques is that they do not capture the arousal, or strength, of the sentiment expressed. For instance, both “food is good” and “food is exceptional” would simply be rated as positive, overlooking that the latter is relatively more positive.

Recent sentiment detection techniques attempt to capture both valence (positive/negative) and arousal (strength of the sentiment) and provide a polarised sentiment score (Mohammad, 2016), where the polarity represents valence and the magnitude represents the amount of arousal.
These approaches consider the problem as either a regression problem or a multi-class classification problem (sentiment on a Likert-type scale).

Sentiment detection techniques can be broadly categorised into two approaches (Medhat et al., 2014): (i) rule based classifiers based on sentiment lexicons and (ii) machine learning models learned from a text corpus with human annotated sentiment.

Rule based classifiers depend on a sentiment lexicon, which is a dictionary of lexical features (e.g., words, phrases) that have a semantic orientation as either positive or negative. Some lexicons only provide a list of positive and negative terms. Two prominent such lexicons are General Inquirer (Stone et al., 1966) and LIWC (Pennebaker et al., 2001). General Inquirer (Stone et al., 1966) is the oldest sentiment lexicon still in use; it was manually constructed by social scientists, political scientists, and psychologists to capture different aspects of text messages. It is a thesaurus of more than 11,000 terms in 183 categories, in which the terms categorised as positive and negative can be used as a sentiment lexicon. LIWC (Pennebaker et al., 2001), which stands for Linguistic Inquiry and Word Count, is a thesaurus constructed by sociologists and linguists. It contains more than 4,500 terms organised into 76 categories, in which the categories positive emotion and negative emotion can be used as a sentiment lexicon.

There are also sentiment lexicons with both polarity and valence scores for each term. SentiWordNet (Esuli and Sebastiani, 2006) is a sentiment lexicon created by annotating over 115,000 synsets in the WordNet (Miller, 1995) lexical database. Each synset is annotated with three numeric scores indicating its positivity, negativity, and objectivity, where the three scores sum to 1.00 for each synset. Affective Norms for English Words (ANEW) (Bradley and Lang, 1999) is another lexicon with valence scores.

[...]

1. [...] really increases the valence of good when used together (e.g., good vs really good). Similarly, somewhat decreases the valence (e.g., good vs somewhat good). A dictionary of degree modifiers was employed for this rule.
2. Negation: negation terms like not and hardly reverse the polarity of a sentiment term (e.g., not good). A negation term dictionary is used to capture such occurrences.
3. Capitalisation: increases the valence of a term (e.g., good vs GOOD).
4. Exclamation mark: increases the valence (e.g., good vs good!).

Similarly, the sentiment lexicon in SentiStrength (Thelwall et al., 2010, 2012) is constructed based on General Inquirer (Stone et al., 1966) and LIWC (Pennebaker et al., 2001), as well as a set of slang, emoticons and idioms commonly used in social data. The initial valence score for each term was determined by a pool of raters. Moreover, since SentiStrength is mainly optimised for the brevity of social data, the valence scores were further refined for that domain using a supervised machine learning technique based on a corpus of sentiment labelled MySpace posts.
Similar to the previous technique, SentiStrength uses rules such as negation, degree modifiers, punctuation, capitalisation and repeated letters (good vs goooood) to alter the sentiment scores.

The rule based classifier approach involves layers of engineering effort, which include the construction of sentiment lexicons, assigning valence scores to each sentiment term and setting up linguistic rules to alter the sentiment score of sentiment terms. This process results in a set of sentiment scores for a given text based on its content, which are often aggregated to produce the final sentiment score of that text.

In contrast, the techniques that train a machine learning model mainly depend on a sufficiently large text corpus with sentiment labels/scores. Using the sentiment labels/scores as the target, machine learning models are trained on the features extracted from the text corpus. The feature vectors are often constructed from the words and n-grams in the document using the vector-space model. In addition, the syntactic structure of the document is often captured by using part-of-speech tags and punctuation as features. Moreover, classifiers designed for social data often capture features that represent hashtags, user mentions, and emoticons.

In an early work, Pang et al. (2002) employed several machine learning techniques to train a sentiment classifier on a labelled movie review dataset using unigram, bigram and POS features. Labels were automatically generated as positive, neutral and negative based on the star rating provided with a movie review. It was found that the Support Vector Machine (SVM) technique performs better than the other techniques. Other contemporary works employed similar machine learning techniques such as Naive Bayes (Melville et al., 2009) and Random Forest (Da Silva et al., 2014). Some approaches used feature selection techniques such as PCA and information gain (Riloff et al., 2006) to reduce sparsity by selecting important features.

Socher et al. (2013) built the Stanford Sentiment Treebank, a corpus of movie reviews (a subset of Pang et al. (2002)) with fully labelled parse trees. Such parse trees can be used to capture the context of the sentiment language, in contrast to assessing the sentiment of individual words or phrases. Socher et al. (2013) further present a recursive neural network (RNN) based sentiment classifier, learned from the above corpus, which represents sentiment arousal and valence. It was shown that the RNN is capable of capturing the context specifics of sentiment.

Lexicon based approaches are more generalisable to different application domains, as they are based on sentiment lexicons constructed from emotional terms that are generally employed in any domain. However, such approaches may not be sensitive to sentiment expressions that are specific to a particular domain. For instance, some sentiment lexicons do not include sentiment terms that are widely used in social data (e.g., LOL) but are not part of the standard English vocabulary. In contrast, machine learning models learn the specific sentiment expressions encapsulated in their training corpus. Such techniques also often capture more complex sentiment expressions that may span multiple words, and semantic patterns that are used to alter the sentiment expression. Hence, machine learning models learned from a sufficiently large corpus often yield better performance on a similar corpus than lexicon based methods. However, they are less generalisable to text corpora in a different domain, as the patterns learned by the machine learning model may not be generally applicable.

There are attempts at hybrid approaches, which employ both sentiment lexicons and machine learning models to build better sentiment detection techniques. Lexicon based [...]
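The lexicon and rule based approach described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited tools: the tiny lexicon, the modifier weights, the capitalisation and exclamation boosts, the two-token look-back window and the function name `score_text` are all hypothetical choices made for the example.

```python
import re

# Hypothetical toy lexicon: term -> valence score (positive or negative).
LEXICON = {"good": 2.0, "exceptional": 3.5, "bad": -2.0, "terrible": -3.5}
# Hypothetical degree modifiers: multiplier applied to the following sentiment term.
DEGREE_MODIFIERS = {"really": 1.5, "very": 1.5, "somewhat": 0.5}
# Hypothetical negation terms: reverse the polarity of the following sentiment term.
NEGATIONS = {"not", "hardly", "never"}
CAPS_BOOST = 1.25         # assumed boost for fully capitalised terms
EXCLAMATION_BOOST = 1.25  # assumed boost when the term is followed by "!"


def score_text(text: str) -> float:
    """Sum the rule-adjusted valence scores of all lexicon terms in the text."""
    tokens = re.findall(r"[A-Za-z']+|!", text)
    total = 0.0
    for i, token in enumerate(tokens):
        word = token.lower()
        if word not in LEXICON:
            continue
        score = LEXICON[word]
        # Degree modifier and negation rules: inspect the two preceding tokens.
        for prev in tokens[max(0, i - 2):i]:
            prev = prev.lower()
            if prev in DEGREE_MODIFIERS:
                score *= DEGREE_MODIFIERS[prev]
            elif prev in NEGATIONS:
                score = -score
        # Capitalisation rule: a fully capitalised term is treated as more intense.
        if token.isupper() and len(token) > 1:
            score *= CAPS_BOOST
        # Exclamation rule: "!" directly after the term increases intensity.
        if i + 1 < len(tokens) and tokens[i + 1] == "!":
            score *= EXCLAMATION_BOOST
        total += score
    return total


for example in ["food is good", "food is really good", "food is not good",
                "food is GOOD", "food is good!"]:
    print(f"{example!r}: {score_text(example):+.2f}")
```

The per-term scores are simply summed here; other aggregation schemes (for example averaging or normalising to a fixed range) are equally plausible and would only change the last step.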

Emotion detection from text is a relatively new avenue of research in contrast to sentiment detection, mainly due to its higher complexity and the lack of resources such as emotion labelled corpora and emotion lexicons. In fact, sentiment detection is a simplified version of emotion detection which assesses emotion in only one or two dimensions, while emotion detection generally denotes assessing emotions with increased granularity (more than two dimensions). Because of this close resemblance, from the technical aspect, emotion detection techniques follow the same approaches taken by sentiment detection techniques.

Current state-of-the-art emotion detection techniques are extensions of sentiment detection techniques. For instance, rule based emotion detection techniques were developed by extending rule based sentiment detection techniques, simply replacing the sentiment lexicon with an emotion lexicon. Similarly, machine learning models designed for sentiment detection were trained on emotion labelled text corpora to detect emotion.

Emotion lexicons have been constructed at different granularities, which often adhere to the different emotion models discussed in Section 2.4.2. Most of these emotion lexicons are based on the Ekman (1992) emotion model of six basic emotions or the Plutchik (1980a) emotion model of eight basic emotions. The NRC Emotion Lexicon (Mohammad and Turney, 2013) is a prominent emotion lexicon which has emotion terms related to the eight Plutchik emotions: joy, sadness, fear, anger, anticipation, trust, surprise, and disgust. It has close to 14,000 emotion terms generated using crowdsourcing techniques.

Similarly, emotion labelled datasets were created mainly based on the emotion models of Ekman (1992) or Plutchik (1980a). Alm et al. (2005) annotated sentences from children’s stories with eight emotions (the six basic emotions of Ekman (1992), and positively and negatively surprised). Similarly, Strapparava and Mihalcea (2007) annotated news headlines based on the six basic emotions of Ekman (1992). Brooks et al. (2013) annotated 27,344 chat messages based on 13 emotions (eight basic emotions and five secondary emotions from Plutchik (1980a)). Unlike sentiment annotation, these emotion annotations are multi-label, where each text can be labelled as containing multiple emotions. These annotated corpora are used (by the original authors and others) to train machine learning models that detect emotions as multi-class classification problems.
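Replacing the sentiment lexicon with an emotion lexicon, as described above, gives a simple multi-label detector. The sketch below is illustrative only: the handful of term-to-emotion entries is a hypothetical stand-in in the style of a Plutchik based lexicon, not entries taken from the NRC Emotion Lexicon, and `detect_emotions` is an assumed helper name.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: term -> set of Plutchik emotions it signals.
EMOTION_LEXICON = {
    "happy": {"joy"},
    "celebrate": {"joy", "anticipation"},
    "afraid": {"fear"},
    "furious": {"anger"},
    "loss": {"sadness"},
    "unexpected": {"surprise"},
    "reliable": {"trust"},
    "rotten": {"disgust"},
}


def detect_emotions(text: str, min_count: int = 1) -> dict:
    """Return every emotion whose lexicon terms occur at least min_count times."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for emotion in EMOTION_LEXICON.get(word, ()):
            counts[emotion] += 1
    # Multi-label output: a single text may carry several emotions at once.
    return {emotion: n for emotion, n in counts.items() if n >= min_count}


print(detect_emotions("I was afraid of the loss, but we will celebrate anyway"))
```

Because a term may map to several emotions and a text may contain several terms, the result is naturally multi-label, in line with the annotation schemes described above.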

Self structuring or self organising techniques automatically learn a structure from a given dataset and then map each data point to the most similar structural element. This structure can be used to develop a coherent grouping of the dataset. Such techniques are essential for learning the underlying structure of social data.

Self structuring techniques were mainly inspired by the self-structuring capabilities of cells (neurones) in the human cortex. Apart from passing information to upper layers, those cells self-structure horizontally as well. This self-structuring happens through short-range excitatory interactions between neighbouring cells and inhibitory interactions between distant cells (Ratliff, 1965). These inhibitory and excitatory actions lead to competition and correlative learning, which is the basis for self-structuring. Topographically ordered maps are observed in many parts of the cortex (Kertesz, 1983), including the visual cortex (Van Essen, 1985), auditory cortex (Reale and Imig, 1980) and somatosensory cortex (Kaas et al., 1979). In these topographically ordered maps, different sensory inputs are mapped to different parts of the cortex. Although the primary structure of the cortex is determined before birth, these subsequent topographical orderings are due to self-structuring based on sensory reception.

This self-structuring learning mechanism in the cortex has been studied and theorised in many research works. One of the most pioneering and prominent works is Hebbian theory (Hebb, 2005), which states that “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A’s efficiency as one of the cells firing B, is increased”. This theory explains that the synaptic strength between neurone cells strengthens over time if the activation of one cell repeatedly and persistently leads to the activation of another cell. This can be stated mathematically as: the synaptic weight between an input cell and an output cell is proportional to the correlation between the input and the associated output. Similar prominent theories are proposed elsewhere.
Marr’s theory of the cerebellar cortex (Marr and Thach, 1991) states that the cerebellum is an associative memory which maps the state of the body, learned from cutaneous and proprioceptive receptors, into motor commands. Malsburg’s theory of self-organisation in the visual cortex (Von der Malsburg, 1973) states that retinotopic organisation (a mapping from the retina to the visual cortex) is learned through [...]
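The Hebbian statement above, that the synaptic weight follows the correlation between input and output, is commonly written as a weight update rule. The following is a standard textbook formulation given for illustration; the symbols $\eta$ (learning rate), $x_i$ (activity of input cell $i$) and $y_j$ (activity of output cell $j$) are notation assumed here rather than taken from the source.

```latex
\Delta w_{ij} = \eta \, x_i \, y_j
```

Accumulated over many presentations, the change in $w_{ij}$ is proportional to the average product $\langle x_i y_j \rangle$, which is the sense in which the weight reflects the correlation between the input and the associated output.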

The biological process of self-structuring in the brain has inspired the design of several unsupervised computational techniques, such as the Willshaw-Malsburg Neural Network Model (Willshaw and Von Der Malsburg, 1976) and Kohonen’s Self Organising Map (SOM) (Kohonen, 1990, 1997). However, the simplicity, computational efficiency and scalability of the Self Organising Map are unparalleled among such techniques, which has led SOM to be the only successful computational technique inspired by the biological self-organising process.

Kohonen’s Self Organising Map (SOM) (Kohonen, 1990, 1997) follows the Hebbian learning rule, which emulates the self-structuring process in the brain, but is simplified to reduce the computational complexity by replacing the pre- and post-synaptic layers with a computationally efficient two (or one) dimensional map structure. As illustrated in Figure 2.4, this map self-learns a structure from a high dimensional input space to a low dimensional map space while preserving the topological relations that exist in the high dimensional input space.

Figure 2.4: Self Organising Map (SOM) creates a low dimensional structure from a high dimensional input space while preserving the topological relations that exist in the high dimensional input space. Source: (Haykin, 1994)

The learned SOM structure consists of nodes, each denoted by a vector which has the same dimensionality as the input space. As shown in Figure 2.4, once learned from the input data, each node $I(x)$ in the SOM structure maps to a region $x$ in the input space and represents that region in the SOM map. Since SOM is learned while preserving the topological relations in the input space, neighbouring nodes in the learned structure map to adjacent regions in the input space.

The SOM structure is initialised as a fixed 2-dimensional grid of nodes, where each node is randomly initialised with a $d$ dimensional vector (where $d$ is the dimensionality of the input space). Once initialised, SOM uses a combination of competition, cooperation, and
adaptation phases to self learn its topology preserving low-dimensional structure from the input space.

The competitive phase is used to find the winning node, or best matching unit, for each input. The winning node $N_w$ for each input is selected using a similarity function which assesses all nodes against the given input and selects the node closest to it:

$$sim(X, N_w) \geq sim(X, N), \quad \forall N \in \{N\}$$

where $\{N\}$ includes all nodes in the SOM structure. Euclidean distance is often used as this similarity function, selecting the node with the lowest Euclidean distance to the input. This competitive process closely resembles the process in the brain where an input neurone excites an output neurone.

Next is the cooperative phase, which emulates the lateral interactions among neighbouring neurones, where a fired neurone excites its adjacent neurones more than the distant ones. In SOM, the topographic neighbourhood of a node is determined using a neighbourhood function $N(d_{i,j}, t)$ that decays with the lateral distance $d_{i,j}$ between winning node $i$ and neighbouring node $j$. Also, to reach convergence, the neighbourhood function often decays with the number of iterations $t$, so that winning nodes excite a wider neighbourhood during the initial stages and a smaller neighbourhood in later stages. A Gaussian function is often used as the neighbourhood function $N$, as it peaks at the centre and decays exponentially with distance from the centre.

Finally, the adaptive phase updates the weights of the winning node and its neighbourhood based on the input vector. The weight vectors of the winning node and its neighbourhood are updated to reduce their distance to the given input. Let $w_{j,1}, \ldots, w_{j,d}$ be the weight vector (in a $d$ dimensional space) of SOM node $j$ and $x_1, \ldots, x_d$ be the input vector. The weight update in each dimension $k$ at iteration $t$ is determined as follows:

$$w_{j,k}(t) = w_{j,k}(t-1) + \alpha(t) \times N(d_{i,j}, t) \times \left(x_k - w_{j,k}(t-1)\right)$$

where $\alpha(t)$ is the learning rate, which determines the amount of learning captured into the winning node at each iteration. $N(d_{i,j}, t)$ is the neighbourhood function, where the winning node is $i$ and $N(d_{i,i}, t) = 1$. In SOM, the learning rate is a decaying function of the number of iterations ($t$), leading to substantial adjustments during the initial iterations and minor adjustments in later iterations. This weight adapting process emulates Hebbian theory (Hebb, 2005) by adapting the weights of the winning neurone to reduce the distance between the input and the winning neurone, so that the efficiency of exciting that neurone with the same (or a similar) input increases.

Observations from the input space are used iteratively in the SOM learning process. During the initial iterations, with a higher learning rate and a bigger neighbourhood, rapid self-structuring of the SOM grid takes place. Over the iterations the learning rate drops and the neighbourhood size diminishes, leading to a smooth convergence of the SOM grid to the learned topological structure. The number of iterations required for convergence depends on the complexity of the underlying structure of the input space. Convergence in SOM can [...] $\sum_{i=1}^{d} |x_i - w_i|$. Quantization error rapidly drops during the initial iterations and converges to a minimum value as SOM converges. The learning can be stopped when the quantization error does not show significant improvement over iterations.

Self organising map has been used in many application domains in both academia and industry over the last few decades (Oja et al., 2003; Kohonen, 2013). It has frequently been used as a tool for exploratory analysis of large datasets, as it projects high dimensional input data onto a two dimensional grid which can be easily visualised to understand the underlying patterns that exist in a dataset.
In addition, it has been used in applications such as dimensionality reduction, clustering, classification, anomaly detection, and information organisation/retrieval.

A key limitation of the self organising map is its fixed grid size, which needs to be pre-determined. The SOM grid size determines the number of nodes trained, which is indicative of the learning capacity of SOM. Using a smaller grid on a large complex dataset would result in over-generalisation of the learned structure, where some nodes may represent more than one distinct pattern. On the other hand, using an unnecessarily large grid is expensive in terms of run time and memory usage.

The key solution to this issue is to use an expandable grid which dynamically expands during the training phase if more nodes are required to represent the underlying structure of the input dataset. Several extended versions of the self organising map have been developed that start with a small number of nodes and dynamically expand during training based on error metrics that indicate more nodes are required.

Growing Grid (Fritzke, 1995a) is a dynamically expanding version that starts with 4 nodes. In every iteration, it updates the error of the winning node based on its distance to the input. Expansion of the grid happens after every $k \times m \times \lambda$ iterations ($k$ and $m$ are the dimensions of the grid), where it expands the grid based on the node $q$ with the highest accumulated error. A new column or row is added to the grid between $q$ and its furthermost connected node. This addition of an entire column or row may lead to adding too many nodes if trained on a large dataset (a large number of iterations).

Growing Neural Gas (GNG) (Fritzke, 1995b) creates a dynamically growing graph (instead of a grid). It also keeps track of the accumulated error and, after $\lambda$ iterations, selects the node with the highest error for further growth.
Similar to Growing Grid, a new node is added in between the node with the highest error and its furthest neighbour.

Growing Self Organising Map (GSOM) (Alahakoon et al., 2000) is another dynamically growing version of SOM. Similar to the above methods, it keeps track of the error accumulated in winning nodes, but unlike them it triggers growth when the largest accumulated error exceeds a threshold. Growth happens from the node with the largest error, and nodes are added to all its remaining edges in the grid. Unlike Growing Grid, GSOM does not maintain a rectangular shape of the structure, and nodes are only added to the impacted node.
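The competition, cooperation and adaptation phases of the basic fixed-grid SOM described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not Kohonen's reference implementation: the exponential decay schedule, the default parameter values and the function names are assumptions made for the example, and the quantization error is taken as the mean over inputs of $\sum_k |x_k - w_k|$ to the best matching unit, which is one reading of the formula fragment that survives in the text.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competition: return the grid position of the node closest to x (Euclidean)."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows=4, cols=4, iterations=500, alpha0=0.5, sigma0=2.0, seed=0):
    """Train a fixed rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Every grid position (r, c) starts with a random d dimensional weight vector.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        x = rng.choice(data)
        # Assumed schedule: learning rate and neighbourhood width decay exponentially.
        decay = math.exp(-3.0 * t / iterations)
        alpha, sigma = alpha0 * decay, sigma0 * decay
        winner = best_matching_unit(weights, x)
        for node, w in weights.items():
            # Cooperation: Gaussian neighbourhood of the lateral grid distance,
            # equal to 1 for the winner itself.
            d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            # Adaptation: w(t) = w(t-1) + alpha(t) * N(d, t) * (x - w(t-1)).
            for k in range(dim):
                w[k] += alpha * h * (x[k] - w[k])
    return weights


def quantization_error(weights, data):
    """Mean over inputs of sum_k |x_k - w_k| to the best matching unit."""
    total = 0.0
    for x in data:
        w = weights[best_matching_unit(weights, x)]
        total += sum(abs(xk - wk) for xk, wk in zip(x, w))
    return total / len(data)


# Hypothetical toy data: two well separated groups of points in the unit square.
rng = random.Random(1)
data = ([[rng.gauss(0.2, 0.03), rng.gauss(0.2, 0.03)] for _ in range(50)]
        + [[rng.gauss(0.8, 0.03), rng.gauss(0.8, 0.03)] for _ in range(50)])
before = quantization_error(train_som(data, iterations=0), data)
after = quantization_error(train_som(data, iterations=500), data)
print(f"quantization error before: {before:.3f}, after: {after:.3f}")
```

The grid size (`rows`, `cols`) is fixed before training, which is exactly the limitation that the growing variants above address by adding nodes when the accumulated error indicates that more are needed.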

The self structuring techniques were mainly designed for conventional datasets, where they expect a real-valued dense feature matrix. Extended versions are required to handle social data, which are often unstructured, sparse and high-dimensional.

Incremental learning is the paradigm of learning in which existing learned knowledge is incrementally extended and updated as new data comes in. Incremental learning techniques are machine learning techniques that are capable of learning knowledge incrementally from new data. Such techniques are mostly applied to scenarios where the entire dataset is not known a priori but becomes available over time. A naive approach for such scenarios is to discard the previously learned knowledge/model and re-learn from scratch using both old and new data. However, such an approach is highly resource consuming in terms of both processing (to process the entire dataset) and memory (the old data has to be kept for retraining). Incremental learning techniques overcome these challenges by learning only from new data and updating the existing model to reflect the knowledge in the new data.

Incremental learning is a must for social data, as it is a constantly evolving data stream. New, unseen patterns will always appear in social data streams. Also, previously seen patterns may reappear later, so learning techniques have to keep the previously learned knowledge intact rather than discarding it.

Polikar et al. (2001) specified four key characteristics of an incremental learning technique as follows:

1. It should learn additional information from new data.
2. It should not require access to the past data that it has already processed.
3. It should not suffer from catastrophic forgetting, and thus should preserve the previously acquired knowledge.
4. It should be able to accommodate new classes that may be introduced with new data.

As mentioned in the first characteristic, incremental learning techniques should capture any additional, previously unseen information from the new data. The second characteristic delineates that learning from new data should not require access to previously processed data. This characteristic enables such techniques to scale to big data streams where retaining old data is not feasible.
Some techniques, however, may keep samples of previous data that represent each class or group seen in the data. The third characteristic is related to a phenomenon called catastrophic forgetting (French, 1999), where models completely discard previously learned knowledge when new knowledge is presented. The need to learn new knowledge while preserving the existing knowledge is known as the stability-plasticity dilemma (Carpenter and Grossberg, 1988). It is a dilemma because incremental learning techniques have to be sufficiently stable to be resilient to the noise in new data while flexible enough to learn new knowledge encapsulated in new data (Gama ...

2.6. UNSUPERVISED INCREMENTAL LEARNING TECHNIQUES

... supervised incremental learning; however, unsupervised incremental learning techniques are relatively fewer. Unsupervised incremental learning techniques maintain a dynamically evolving cluster structure reflecting the dynamics of the data stream, where new clusters may appear and others may disappear. Such incremental clustering techniques were mainly adaptations of the standard clustering techniques.

Leader (Späth, 1980) is one of the earliest partition-based incremental clustering techniques. It employs a user-defined threshold to partition data into clusters. Every new input is assigned to the closest cluster if its distance to the cluster centroid is less than the given threshold; otherwise a new cluster centroid (leader) is added based on that input. This approach is simple but unstable, as it depends on a user-defined threshold for separating clusters. Another technique is the single pass k-means (Farnstrom et al., 2000), which is an incremental learning adaptation of k-means. It keeps a fixed-size buffer and, when the buffer is filled by incoming data, runs k-means to identify k clusters. After running k-means, only the cluster centroids are retained in the buffer, and when it is full again, a weighted k-means is run using the new data and the existing centroids.
This technique incrementally updates clusters as new data arrives; however, the fixed k value limits the capturing of newly emerging clusters.

Aggarwal et al. (2003) introduced CluStream, a stream clustering framework which separates incremental clustering into two phases: online micro-clustering and off-line macro-clustering. Micro-clusters are groups of data points which keep statistical information about the data and time stamps. A new data point is absorbed into an existing micro-cluster if its distance to the centroid falls within that micro-cluster's boundary. Only a fixed number of micro-clusters are maintained, and when that number is exceeded, older clusters are merged into others. In off-line macro-clustering, k-means runs over the set of micro-cluster centroids, formulating the final clusters.

Furao and Hasegawa (2006) introduced the self-organizing incremental neural network (SOINN), which is an incremental clustering technique based on self-organising maps (Kohonen et al., 2000). SOINN uses a two-layer network where the first layer represents the density distribution of inputs and the second layer separates clusters by detecting low-density areas in the first layer. For each input presented, the first and second winners are identified from the first layer, and if the distance to those exceeds a threshold, then the new input is added as a new node (between-class insertion). Otherwise, standard SOM weight updates occur, updating the first winner and its neighbours (within-class insertion).

Another recent technique is Incremental Knowledge Acquisition and Self-Learning (IKASL) (De Silva and Alahakoon, 2010; De Silva, 2010), which self-learns a layered structure across time, generalising the knowledge embodied in data. Each layer learns from a buffered batch of data using the GSOM self-structuring technique (Alahakoon et al., 2000). IKASL preserves the acquired knowledge in a generalised form; therefore, it does not require access to the past data that it has already processed. The generalised version of the acquired knowledge from each layer (n) is used as the basis for the knowledge acquisition of the subsequent layer (n+1), thus it avoids catastrophic forgetting of past knowledge. Moreover, while using the past acquired knowledge as the base, it incrementally acquires new knowledge that is embodied in the upcoming data.

Similar to the self-structuring techniques discussed in the previous section, unsupervised incremental learning techniques were mainly designed for conventional datasets, where they expect a real-valued dense feature matrix. Extended versions are required to handle social data, which are often unstructured, sparse and high-dimensional. Also, existing techniques expect all features to be known in advance, which is not the case in social data, as new features will appear and existing features may disappear over time.
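As a concrete illustration of the simplest technique reviewed above, the Leader scheme can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the original formulation: Euclidean distance, fixed leaders (the first point of each cluster stays its representative), and the names `leader_cluster` and `threshold` are illustrative choices that do not come from the reviewed work.

```python
import math

def leader_cluster(points, threshold):
    """Single-pass Leader-style clustering (illustrative sketch).

    Each point joins the nearest existing leader if it lies closer than
    `threshold`; otherwise the point itself becomes a new leader.
    """
    leaders = []  # one representative point per cluster
    labels = []   # cluster index assigned to each input point
    for p in points:
        best_idx, best_dist = None, float("inf")
        for i, leader in enumerate(leaders):
            d = math.dist(p, leader)  # Euclidean distance (assumption)
            if d < best_dist:
                best_idx, best_dist = i, d
        if best_idx is not None and best_dist < threshold:
            labels.append(best_idx)
        else:
            leaders.append(tuple(p))
            labels.append(len(leaders) - 1)
    return leaders, labels

# Hypothetical two-dimensional stream with two visible groups.
stream = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)]
leaders, labels = leader_cluster(stream, threshold=1.0)
tight_leaders, _ = leader_cluster(stream, threshold=0.1)
```

With `threshold=1.0` the five points fall into two clusters, while `threshold=0.1` turns every point into its own leader, which illustrates the sensitivity to the user-defined threshold noted above. Past points are never revisited, so the sketch also satisfies the "no access to past data" characteristic.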

Previous sections discussed the existing machine learning and natural language processing techniques that are used to gain insights from social data generated in online social media platforms. As discussed within those sections, gaining insights from social data is highly challenging. Such challenges come from several avenues: (i) the scale of data generation, (ii) time sensitivity, (iii) diversity and (iv) the unstructured nature of data.

As discussed in the previous chapter, over the past decade, online social media platforms have been increasingly embraced by hundreds of millions of people worldwide. Moreover, users are more frequently active on major social media platforms, mainly due to the availability of mobile devices.

This large number of active users on prominent social media platforms continuously generates large volumes of new social data, resulting in a high-velocity data stream. The analysis of data streams of this scale requires specialised computational approaches that are optimised to handle large volumes of data.

As discussed in previous sections, supervised machine learning techniques were mainly employed in classification tasks such as sentiment, topic and event classification. Such tasks were carried out using two types of techniques: (i) a dictionary-based approach, where a term dictionary is engineered to represent the concepts of each class to be classified, and (ii) a model-based approach, where a labelled training dataset is employed to learn a predictive model using supervised machine learning techniques. The first approach requires a term dictionary or a lexicon and the second approach requires a labelled dataset with relevant classes.

Since social data is not structured with any taxonomy or explicitly labelled, it does not carry the labels that are needed for such supervised or semi-supervised learning tasks. There are a handful of human-labelled social data samples; however, such samples are disproportionate and not representative enough when considering the scale and variation in social data.

There are two key approaches to handle this issue. The first is to use unsupervised learning techniques, which do not require any form of supervision. Unsupervised techniques leverage underlying patterns in data to separate them into groups based on similarity. The current use of unsupervised techniques on text (mainly topic extraction) is covered in Section 2.2.1.

Another approach to overcome this issue is to use folksonomies or user-generated tags as a proxy for labels. For example, (i) 20 Newsgroups (Joachims, 1996), which is used for topic classification, is taken from a news discussion forum where the news category is used as the label, and (ii) the Movie Review Dataset (Pang and Lee, 2005; Pang et al., 2008), used for sentiment classification, is a set of user-created movie reviews from the movie review website Rotten Tomatoes, in which the labels were obtained from the user-provided star rating for the movie.

Social data can be considered as a near real-time data stream which is updated by freshly generated content from its users. For instance, Facebook users frequently publish status updates about how they feel and Twitter users frequently tweet their opinions about their current issues of interest.

Moreover, as pointed out by Hu and Liu (2012), these social data streams are not uniformly distributed, but bursty in nature. These bursts are mostly due to current events of interest that have disrupted (Sakaki et al., 2010b; Zhou and Chen, 2014) or captured the attention (Becker et al., 2011) of a significant swath of the human population.
Hence, it is apparent that social data is tightly coupled with the time of its publication and that time sensitivity needs to be considered for any analytic task.

Most of the conventional machine learning algorithms discussed above require the dataset to be known a priori, which is not possible in social data streams, as new patterns appear in the data streams as they progress over time. Therefore, machine learning algorithms need to be extended to incrementally learn from the data stream as new patterns appear over time.
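The combination of bursty arrival and the need for incremental updates can be illustrated with a small sketch. It is not a technique from this thesis or the cited studies: it assumes that the stream has already been reduced to post counts per time interval, it uses Welford's online algorithm to maintain a running mean and variance without storing past counts, and the names `RunningStats`, `flag_bursts`, `z_threshold` and `warmup` are hypothetical.

```python
import math

class RunningStats:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; zero until two values have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def flag_bursts(counts, z_threshold=3.0, warmup=5):
    """Return indices whose count lies far above the running mean so far."""
    stats = RunningStats()
    bursts = []
    for t, c in enumerate(counts):
        # Compare against statistics of earlier intervals only.
        if stats.n >= warmup and stats.std > 0:
            if (c - stats.mean) / stats.std > z_threshold:
                bursts.append(t)
        stats.update(c)
    return bursts

# Hypothetical per-interval post counts with one spike at index 6.
counts = [10, 12, 11, 9, 10, 11, 60, 12, 10]
burst_indices = flag_bursts(counts)
```

Only three numbers are kept between intervals, so no past data is retained. One design caveat: the spike itself is folded into the running statistics, which inflates the variance and makes later bursts harder to flag; a more careful version might exclude or down-weight flagged intervals.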

Diversity in social data is an important feature that is often overlooked in other studies. Social data is diverse for a multitude of reasons. It contains traces of multiple social behaviours. Also, even the same behaviour could be actioned in different depths and intensities. Moreover, at the highest granularity, contrasting differences among individuals may appear due to individuality resulting from differences in socio-demographics. All these reasons induce diversity in social data. For instance, social data consists of diverse linguistic patterns, a large number of discussion topics and differences in the expression of emotion. Eisenstein et al. (2014) found that distinct linguistic patterns exist in Twitter among groups of similar geographic proximity as well as similar socio-demographics. This diversity in social data acts as noise on the patterns that exist in social data. Hence, machine learning techniques have to first reduce the noise due to diversity by separating diverse social data into coherent groups, which can then be used to extract insights.

Social data mainly consists of unstructured text posted by users in online social media platforms. These text corpora of social data are substantially different from professional discourse (found in news articles and other formal documents) in many aspects, such as brevity, lack of syntactic structure and use of out-of-vocabulary terms.

Brevity

The short length of social data is partially due to restrictions imposed by some social media platforms (e.g., a tweet is limited to 140 characters). Even in platforms without such restrictions, users prefer brevity since it allows them to express themselves more efficiently. This brevity is mostly achieved by using shortened forms of terms (e.g., u for you) and relaxation of grammaticality (Baldwin et al., 2013). This brief nature of social data results in highly sparse feature representations that are challenging for conventional natural language processing techniques.

Lack of syntactic structure

Discourse in documents often follows a certain syntactic structure, which is driven by a set of language rules that govern the formation of clauses, phrases and sentences. Such rules were often established through repeated documentation over time. However, the discourse in social data only loosely follows such syntactic structures. This is mainly because, although it is a written language, social data is the outcome of social conversations in which individuals prefer to relax such syntactic structures.

The differences discussed above between social data and formal discourse are highly challenging for most of the conventional natural language processing techniques, which rely on such language rules to capture different constructs from the discourse such as nouns, noun phrases, co-references etc. Since those elementary constructs are then fed into other techniques to capture complex insights such as sentiment, the errors due to the lack of syntactic structure often impact most of the conventional natural language processing techniques.

Out of vocabulary terms

Users tend to coin new terms during social conversations in online social media platforms. Such terms are mostly developed as social tagging or folksonomy, highlighting important aspects, i.e., topics or emotions, in the conversation text. Hashtags used in Twitter are a good example of such terms, where users create hashtags to represent certain events or topics (e.g., ). Another type of new term is constructed by repeating certain characters of standard terms (e.g., cooolll), which is mostly used to emphasise the expressed emotion (Brody and Diakopoulos, 2011). The use of such out-of-vocabulary terms is a challenge, as such terms are often not included in the thesauruses that are often used by natural language processing techniques to derive certain properties of each word (e.g., sentiment dictionaries). Filtering out such terms is not an option, as they are tightly coupled to the intended meaning of the post.

From the above examples, it is apparent that the as-is application to social data of natural language techniques designed for standard discourse would yield sub-optimal outcomes (Baldwin et al., 2013). Such techniques have to be significantly extended to capture the different aspects of social data discussed above.
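The handling of elongated terms and hashtags described above can be sketched as follows. This is a hedged illustration, not the approach taken in this thesis or in the cited studies: the tiny vocabulary is hypothetical, the helper names (`squeeze`, `build_index`, `normalise`) are invented for the example, and matching is done by collapsing every run of repeated characters to a single character and comparing the resulting keys.

```python
import re

_RUN = re.compile(r"(.)\1+")          # any run of a repeated character
_ELONGATED = re.compile(r"(.)\1{2,}")  # the same character three or more times

def squeeze(token):
    """Collapse every run of a repeated character to one character."""
    return _RUN.sub(r"\1", token)

def build_index(vocabulary):
    """Map squeezed keys to the vocabulary words that produce them."""
    index = {}
    for word in vocabulary:
        index.setdefault(squeeze(word), []).append(word)
    return index

def normalise(token, vocabulary, index):
    """Return (normalised token, category) for a single token."""
    token = token.lower()
    if token in vocabulary:
        return token, "in-vocabulary"
    if token.startswith("#"):
        # Hashtags are kept, not filtered, as they carry meaning.
        return token, "hashtag"
    if _ELONGATED.search(token):
        candidates = index.get(squeeze(token), [])
        if len(candidates) == 1:  # only accept an unambiguous match
            return candidates[0], "elongated"
    return token, "out-of-vocabulary"

# Hypothetical mini-vocabulary standing in for a real lexicon.
vocabulary = {"cool", "you", "so"}
index = build_index(vocabulary)
results = [normalise(t, vocabulary, index) for t in ["cooolll", "u", "#SomeTag", "you"]]
```

A known limitation of the squeezed-key idea is collisions: words such as "good" and "god" share the key "god", which is why the sketch only accepts unambiguous matches. Shortened forms such as "u" remain out-of-vocabulary here and would need a separate mapping.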

This chapter has conducted a comprehensive literature review of current state-of-the-art machine learning and natural language processing techniques employed to generate insights from social data. It focused on three types of insights often captured from social data: topics, events and emotions. Topic capturing includes topic modelling to model topic distributions and text clustering to capture topically coherent clusters. Event detection is twofold: specified event detection is used to capture events with known prior information and unspecified event detection is used to capture events with unknown or partially known information. The emotion detection approaches use emotion models based on physiological and cognitive theories of emotion.

As highlighted in the review, most of the techniques employed are adaptations and extensions of machine learning and natural language processing algorithms developed for conventional text datasets such as news articles. However, as discussed in Section 2.7, social data comes with formidable challenges that are less prevalent in conventional text datasets, due to its scale of data generation, time sensitivity, unstructured nature and diversity. Hence this approach leads to suboptimal outcomes.

The next chapter proposes an alternative to this conventional approach of social media analytics. The proposed novel approach goes beyond considering social data as just another data source by considering social data as traces of online human social interactions. It enables social data to be represented by the underlying drives of human social interactions, which can then be transformed to generate more meaningful insights.

Chapter 3

The Intellection of a Conceptual Framework

Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.

Marie Curie

The previous chapter presented current work in machine learning and natural language processing techniques that are being used to process, analyse and learn insights from social data. As shown, the majority of such approaches for social media analytics were adaptations and extensions of machine learning and natural language processing methods developed for conventional text datasets. Although such approaches generate useful results, they tend to overlook the human behaviours, emotions and thought processes which are embedded in social data. Therefore a wealth of rich information which can be used to infer the potential causality of events and behaviours lies unused. This thesis proposes a novel approach that goes beyond conventional social media analytics. It considers the data generated in social media platforms, a.k.a. social data, as the traces of human social interactions online, which can thus be better represented based on theories of social behaviour and can then be used by machine learning and natural language processing techniques to generate more meaningful insights. This chapter presents the theoretical foundations for this novel approach, which is materialised in subsequent chapters.

This chapter begins with a discussion on the importance of social data in advancing an understanding of social behaviours (Section 3.1). In Section 3.2, behavioural theories from the social sciences are used to propose a new multi-layered conceptual framework that addresses this need. Section 3.3 concludes the chapter, delineating the materialisation of this conceptual framework based on the paradigms of self-structuring artificial intelligence, which also lays the foundation for the rest of the thesis.


Social data is generated by humans through their social actions on online social media platforms in cyberspace. In fact, social data can be considered as archived traces of human social actions/interactions in cyberspace. For instance, tweets are the outcome of social conversations and the expression of opinion that happen on the Twitter online social media platform.

Social data and online social media platforms are relatively new paradigms which have only been in existence for the past two decades. However, the online social actions that generate social data are in fact adaptations of human social actions in the physical world. For example, the computer-mediated communication techniques used in online social conversations are adaptations of face-to-face conversations in the physical world. Social actions have been vital for the survival of human hunter-gatherers since the early ages of human civilisation. However, social actions and the exchange of social information increased significantly with the use of language, as it enables individuals to convey complex expressions and have lengthy conversations.

Online social media platforms are also digital adaptations of social media platforms in the physical world. The first form of a social media platform could simply be a campfire around which hunter-gatherer groups gathered at night. The relaxing, warm and safe environment around a campfire leads to social chit-chat, in contrast to work-related discussions during the daytime (Wiessner, 2014), leading to increased social interactions among the group. Similar social media platforms were present in coffee-houses (mainly in Europe) and tea-houses (mainly in Asia), which served as centres of social interaction (Standage, 2013) where people chatted with other visitors while sipping tea or coffee for hours.
These interactions, although seemingly casual, were important for forming ties with individuals for economic or political benefit.

These social actions and social media platforms in the physical world have been studied for decades and theorised by scientists across many disciplines, who have developed an in-depth understanding of social actions and their underlying causalities. Such causalities have been abstracted into multiple layers of the human decision-making process, where abstract social actions are represented as a handful of social behaviours driven by social needs perceived through human cognition. This understanding has developed incrementally over many decades and collectively through many theories, which provide explanations of social actions and their causalities based upon cognition, social needs and social behaviours. Considering that online social behaviours are adaptations of social behaviours in the physical world and that the causalities that drive such behaviours are the same, this research proposes to incorporate those theories to enhance the computational techniques (e.g., machine learning techniques, natural language processing techniques) that capture insights from social data in online social media platforms. For instance, the sharing of different types of personal information (e.g., demographics, emotions) can be better understood in terms of the social behaviour called self disclosure, while providing advice in online support groups can be understood in terms of altruism. Such incorporation would enable the computational techniques to effectively capture patterns from social data as well as meaningfully aggregate such patterns based on the underlying causalities.

Moreover, capturing insights into human behaviours and their causalities from social data would enable social studies on human behaviour to use the traces of human behaviours encapsulated in the massive volumes of social data accumulated in online social media platforms.
Conventionally, such studies on human behavioural causalities are conducted using controlled social experiments and natural experiments (observational studies) (Neuman, 2013). Controlled experiments are designed to study the cause and effect of a certain condition on individuals against a control group, while in natural experiments the behaviour of a group of individuals is observed in their natural environment. In both types of studies the outcomes are acquired from the individuals either by recording their behaviour or by using interviews/survey instruments.

Meshi et al. (2015) point out several key advantages of the use of social data to understand human behavioural causalities, in contrast to the use of conventional social experiments.

1. External validity:

It is the generalisability of the causal inference (Drost et al., 2011) from the experimental setting to the natural setting. External validity is one of the key challenges of controlled social experiments, mainly because human behaviour is inherently complex and irregular. It is difficult to exclusively isolate the cause and effect of a certain behaviour in a controlled experiment. Therefore, the causal inference obtained from experimental settings may differ in a natural setting. In contrast, archived social data consists of retrospective accounts of social actions which occurred in a natural setting. Thus, causal inferences obtained using social data do not have the bias of any experimental setting.

2. Less recall bias:

Recall bias is the error caused by the human recollection of past experiences. Self experiences are stored in episodic memory as emotional and contextual details of the experience. However, episodic memory is gradually forgotten with the passage of time. Kahneman and Riis (2012); Kahneman et al. (1999) argue that the recollection of an experience is biased towards the most intense aspects of that experience. Therefore, a retrospective report of an experience is different from one reported shortly afterwards (Robinson and Clore, 2002). This recall bias affects conventional studies, since the data collection often happens periodically and the participants are asked to recollect their accounts of experiences that happened some time back. In contrast, social data is often recorded in near real-time, mainly due to the ease of access to online social media platforms using mobile devices. Therefore, social data is more immune to recall bias.

3. Large sample size:

Conventional social studies were often conducted using a handful of individuals, mainly due to the associated cost. Since human behaviour is so diverse across individuals, the causal inference learned from a small sample could be over-fitted to that selected cohort. On the other hand, the social data of millions of individuals, who are from diverse socio-economic backgrounds, are already being archived in online social media platforms. Hence, social data enables social studies using large and diverse cohorts.

In a nutshell, social data can be considered as traces of online social interactions, which enables the use of theories on social behaviours to better understand social data. This understanding can be leveraged to develop alternative approaches that better transform social data into insights representative of underlying social behaviours. These two approaches to transforming social data are depicted in Figure 3.1, which shows a. the conventional approach of applying machine learning techniques on social data to produce data-driven insights, and b. the alternative approach of combining machine learning techniques with social theories to transform social data into insights representative of underlying social behaviours. Furthermore, Figure 3.1.c shows that the insights representative of underlying social behaviours learned from the proposed approach can be leveraged by the social sciences to better understand human social behaviours in general and especially in online social media platforms. However, this feedback loop to improve the understanding of human social behaviours is beyond the scope of this thesis, which is focused on building technical capabilities for harnessing insights from social data.

Figure 3.1: The approaches of transforming social data: a. the conventional approach of using machine learning techniques on social data which yields data-driven insights, b. combining machine learning techniques with social theories to transform social data into insights representative of underlying social behaviours, c. using insights representative of underlying social behaviours to better understand human social behaviours.

The new approach specified in Figure 3.1.b is further elaborated in Figure 3.2. Social science theories can be used to build latent representations from social data.
Such representations can be based on internal or psychological aspects such as emotions, as well as on external aspects such as topics of conversation, behaviours or events. Subsequently, machine learning and natural language processing techniques can be used to transform those latent representations into insights.

3.2. THE PROPOSED CONCEPTUAL FRAMEWORK

Figure 3.3 presents the proposed conceptual framework that can develop a latent representation of social data. The following subsections describe the layers of the conceptual framework and supporting theories from the social sciences.

Human cognition is the mental process of acquiring or perceiving knowledge from thought, experience and the senses. Cognition is thought to be physically based in the neural networks of the human brain, within its over 100 billion neuron cells and their synaptic interconnections. In fact, it is argued that cognition represents the collective functionality of all neural networks.

The cognitive process of acquiring or perceiving knowledge identifies behaviours that lead to positive or desired outcomes. This cognitive process dictates the execution of such desired behaviours by stimulating the motor neurons. These characteristics of the cognitive process lead to the cognitive theory of human needs (Rosenfeld et al., 1992), which states that, from a cognitive perspective, a need is an abstraction of the desired outcome obtained using a set of similar behaviours. Therefore, needs originate during the cognitive process of perceiving knowledge.


Figure 3.3: The conceptual framework for understanding social data using social theories of human social behaviours, social needs and cognition as the foundation.

Based on this understanding of how cognition formulates human needs, this section discusses prominent human needs that involve social aspects, i.e., the needs that have an emphasis on social interactions with other individuals or groups.

Evolutionary and biological perspective

It is argued in Darwinian evolution theory (Darwin, 1859) that social needs developed as a survival trait in human hunter-gatherer groups. Since humans do not possess exceptional features such as teeth, strength or speed compared to other hominoid relatives such as gorillas or chimpanzees, their survival against the selective pressure of natural selection is probably due to the behavioural adaptation of belonging to a group. Although humans are now ecologically dominant and no longer under threat from a predatory species, effective social interactions remain a key factor to survive, or rather thrive, among others in modern complex society.

A similar idea is proposed by the social brain hypothesis, where Dunbar (1998) states that human intelligence initially evolved as a behavioural adaptation to survive as groups. The larger brains in humans are a result of the increased computational demands of living in large social groups. In fact it is found that the optimum group size of primate social ...

Psychological perspective

One of the earliest mentions of social needs appears in the pyramid of needs presented in Maslow's theory of motivation (Maslow, 1943, 1962), which presents a hierarchy of human needs. The hierarchical formulation represents that humans look to satisfy needs in a particular layer only when the needs in the respective lower layers are satisfied. In this pyramid of needs, socially relevant needs are placed third and fourth, after physiological and safety needs. The third layer captures the human need to feel a sense of belongingness among other individuals or social groups. Such relationships can range from close intimate relationships to work-related companionship. The fourth layer is self-esteem, which is a sense of being respected and valued by others. It includes acquiring social status, fame or recognition from the groups of which the individual is a member.

Alderfer (1969) formulated the Existence, Relatedness and Growth theory (ERG) by simplifying Maslow's pyramid of needs. In ERG, relatedness is the desire to establish and positively maintain interpersonal relationships with other individuals. Such relationships could serve the need for belongingness as well as the need for self-esteem, thus relatedness is similar to the third and fourth layers of the pyramid of needs. However, in contrast, ERG is not hierarchical and expects all three needs to exist simultaneously.

The Three Needs theory (McClelland, 1967, 1987) describes three types of social needs: the need for achievement, affiliation and power. The achievement need requires working on doable tasks and receiving frequent feedback from others on the accomplishments. Affiliation is the need to establish and maintain social relationships and the feeling of belonging to social groups.
The power need is the need to have power over others, social status and recognition.

These different psychological theories of needs highlight the key social need as being to belong to social groups and maintain social relationships with other individuals; within those relationships there exists the need to receive feedback and recognition of achievement as well as the need to influence others.

In summary, social needs are a set of human needs that often need to be nurtured by others in society. Some needs, such as love and belongingness, are often provided by those who are intimate with the individual (often the family and friends of that individual). In contrast, social needs such as achievement, affiliation and power are related to receiving a form of recognition from society. Such needs are often satisfied by being part of different social groups.


Social capital looks at social needs from the economic perspective, as a capital that each individual should possess to survive and thrive in modern society. Social capital describes the resources embedded in social groups as a form of capital, i.e. an asset that can be economically useful and can be invested for future economic gain. Unlike other forms of capital, which are measurable, social capital is more conceptual and less measurable.

Over the past decade, many sociologists have provided slightly contrasting definitions of social capital and its utility (Woolcock and Narayan, 2000; Lin, 2002; Adler and Kwon, 2002; Putnam, 2001; Nahapiet and Ghoshal, 2000). Adler and Kwon (2002) define social capital as the goodwill (trust, reciprocity and sympathy) extended to an individual due to his social relations with individuals or groups, which will be available as information, influence, and solidarity.

Nahapiet and Ghoshal (2000) describe three types of social capital based on the form of existence: structural, cognitive, and relational. The structural dimension is based on the structure of the social groups, which includes the roles, rules and procedures of the social group that determine the approachability of the resources in the group. The relational dimension is based on the quality and nature of the relationships, which determine the trust, obligations, expectations and sanctions (Putnam, 2001). The cognitive dimension is based on the shared reality (Bourdieu, 2011), which determines the shared values, beliefs and attitudes.

Another dimension of social capital is often defined based on the type of the relationship, which was initially defined as bonding or internal and bridging or external social capital (Putnam and Feldstein, 2009; Putnam, 2001; Adler and Kwon, 2002), based on the work on strong and weak ties by Granovetter (1973). Bonding social capital arises from the relationships between people with similar characteristics, i.e., homophily.
Such individuals often have strong relationships with each other and pursue collective goals; bonding capital thereby facilitates material, emotional and safety support. In contrast, bridging social capital arises from relationships between different social groups or individuals with different social identities. Bridging capital often operates on trust and reciprocity, and it provides access to important resources that are not available through bonding capital (Granovetter, 1973).

Later, Szreter and Woolcock (2004) and Woolcock et al. (2001) introduced linking social capital, which is argued to be a special form of bridging social capital in which the relationships and interactions occur across power or authority gradients of society. Linking social capital mainly refers to relationships with those who have decision-making power over the individual; such capital thus provides access to services, jobs and resources (Szreter and Woolcock, 2004).

Social capital needs maintenance: individuals have to periodically renew their social links with others to keep such links effective (Adler and Kwon, 2002). Moreover, leaving a social group (e.g., leaving an organisation) reduces the social capital associated with that group (Glaeser et al., 2002).

[...] weak ties with friends, colleagues and others with similar interests. Hence, most of the online social capital gains are in the form of bridging social capital (Resnick et al., 2001). However, some of the weak ties formed online may become stronger over time, leading to bonding social capital. Also, social media platforms help to maintain bonding capital when individuals are geographically distanced (Burke et al., 2011).
Ellison et al. (2007) also provide evidence that online social media platforms enable users with low self-esteem to achieve bonding social capital and improve their psychological well-being.

Table 3.1 summarises the key theories discussed in relation to social needs.

Table 3.1: A summary of the key theories related to social needs discussed in this section.

Darwinian evolution theory (Darwin, 1859): Social needs developed to support belonging to a group as a survival trait against the selective pressure of natural selection.
Social brain (Dunbar, 1998; Dunbar and Shultz, 2007): Human intelligence evolves as a behavioural adaptation to survive as groups, and larger brains are a result of the increased computational demands of living in large social groups.
Attachment theory (Bretherton, 1992; Bowlby, 1969): Human social needs originate from the filial bonds developed between an infant and close caregivers.
Theory of motivation (pyramid of needs) (Maslow, 1943; Maslow, 1962): Human needs are hierarchical; higher order needs are reached only when lower order needs are satisfied.
Three Needs theory (McClelland, 1967; McClelland, 1987): Humans have three types of social needs: the need for achievement, affiliation and power.
Social capital (Woolcock and Narayan, 2000; Adler and Kwon, 2002; Putnam, 2001): Social capital describes resources embedded in social groups as a form of asset that can be economically useful and can be invested for future economic gain.

Behaviour is defined as an intentional act to achieve a perceived goal (Ajzen, 1985). This goal-oriented nature of behaviour arises from having perceived needs which individuals plan to satisfy through their own behaviour. The theory of planned behaviour (Ajzen, 1985, 1991) states that the execution of a behaviour is guided by the individual's perceived attitude towards that behaviour, the perceived subjective norms of that behaviour and, finally, the perceived ability to control its execution.

Building on the general theory of behaviour, social behaviour can be defined as an intentional act that involves interactions between one or more individuals to achieve the perceived social needs of those involved. Therefore, social behaviours are intentional acts of achieving the social needs discussed in the previous section. Among social behaviours, communication, or interpersonal communication, is the foundational social behaviour that is essential for any form of human social interaction. Other prominent social behaviours include self-disclosure, cooperation and social comparison. All these social behaviours are observed in the natural world as well as in online social media platforms. However, differences in the online social environments have led each social behaviour to undergo several adaptations from the natural environment to online social environments. The following subsections discuss the above-mentioned social behaviours in terms of their theoretical perspective, the social needs they attempt to achieve and their different adaptations for online social media platforms.

Interpersonal communication

Interpersonal communication, or simply communication, is the act of transmitting information among individuals or groups. Any form of social interaction between humans involves passing information from an individual to others (intentionally or unintentionally). Therefore, interpersonal communication can be considered a foundational social behaviour: it is itself a social behaviour and is also required for all other social behaviours. The key dimensions of communication are the sender, the channel, the message, and the receiver (Shannon, 1948; Schramm, 1954). The sender encodes the desired information in a message and sends it to the receiver using an appropriate communication channel. The receiver decodes the message to receive the information. The encoding and decoding schemes have to be the same for the receiver to get the intended message, while differences in encoding and decoding may lead to unintended interpretations of the message received (Schramm, 1954).

Although communication happens among all living organisms, this discussion is exclusive to interpersonal communication, which occurs between individuals or groups of humans. Conventionally, interpersonal communication happens face-to-face (F2F), where individuals see each other and exchange information using a combination of verbal (using natural language) and non-verbal (e.g., body language, facial expression) communication. Verbal communication is often less ambiguous, as language rules and the meanings of words and phrases are well defined and common knowledge to all communicators. In contrast, the meaning of non-verbal communication often varies based on different attributes (e.g., past experience) of the receiver.

Interpersonal communication as a social behaviour can be explained using the uncertainty reduction theory (Berger and Calabrese, 1975; Berger and Bradac, 1982).
It states that during social interactions, individuals have a need to reduce the uncertainty about others, and thus use interpersonal communication as a means of acquiring more information [...] cognitive, which are the uncertainties on the values, ideas and attitudes of each other; and (ii) behavioural, which are the uncertainties on the behaviours of each other in a given situation. Both verbal and non-verbal communication lead to an exchange of information that reduces such uncertainties and improves trust. This reduction of uncertainty and improvement of trust is essential to initiate and establish social relationships with society. Effective and frequent communication with close ones (family and friends) is essential to maintain intimate relationships and achieve the social needs of love and belongingness (bonding social capital). Similarly, communication with different social groups and broader society is the key to establishing affiliations and thereby improving bridging social capital. Also, interpersonal communication is used as a way of influencing others, which yields influencing power. Therefore, interpersonal communication is a fundamental behaviour for achieving most of the social needs.

Computer mediated communication

Computer mediated communication (CMC) is the process by which two or more individuals communicate using a computer platform. It is the communication process used in all online social media platforms. In its early days it was primarily text based; however, recent CMC platforms allow a range of audio and video features to enhance the communication experience. Interaction in CMC can be either synchronous (e.g., Internet Relay Chat, Skype, Facebook Messenger) or asynchronous (e.g., email). Modern online social media platforms support both approaches. For instance, on the Facebook platform, posts and status updates are mostly used for asynchronous communication, while Facebook Messenger can be used for synchronous communication.

Most of the earlier research looked down upon CMC as an inferior communication channel in contrast to face-to-face (F2F) communication. One of the prominent theories in that line of thought is the Media Richness Theory by Daft and Lengel (1986) and Trevino et al. (1987), which defines the richness of information as "the ability of information to change understanding within a time interval". The theory delineates that the richness of a communication medium is its capability to effectively transmit information to alter the understanding of an individual or a group (Daft et al., 1987; Daft and Lengel, 1986; Trevino et al., 1987). The theory further states that the richness of a medium is a function of the following capabilities of that medium: (i) transmit multiple communication cues, (ii) provide rapid responses, (iii) have a personal focus and (iv) support natural/conversational language for communication.
These four capabilities of a rich medium are formulated considering F2F communication as the baseline standard, and any other medium is evaluated based on how closely it can resemble F2F communication.

One of the key criticisms of media richness theory is that it focuses on the communication medium objectively and does not account for the different contexts/topics of the communication or the differences in the human actors involved in the communication process (Kock, 2005; Kahai and Cooper, 2003). In order to accommodate those features, Carlson and Zmud (1999) proposed Channel Expansion Theory, which states that an individual's perception of the richness of a particular communication channel depends on their relevant experience of using that channel and their familiarity with the topic of discussion. Hence, frequent and prolonged use of a particular CMC channel in a particular context makes it richer for the associated individuals. Tidwell and Walther (2002) report that such improvements in the perceived richness of a CMC channel may result in similar or greater richness than F2F communication.

Another key theory that discusses the CMC medium is the Media Naturalness Theory, which was proposed by Kock (2004, 2005) as a psychobiological model. It argues that humans (hominoids) have conducted F2F communication inside social groups as an act of survival. Hence, evolutionary pressures over a long time period would have developed adaptations in both the brain (e.g., sensory and motor neurons) and the body (e.g., facial muscles, visual and auditory organs) to make F2F communication cognitively efficient and less ambiguous, and to produce better physiological arousal. For example, human ears are more sensitive to human voices than to any other acoustic stimuli (Nass and Brave, 2005).

Kock (2004, 2005) further argues that, due to the adaptive capability of the human brain, with frequent and prolonged use of a leaner CMC medium the brain is capable of adapting to use such a medium naturally, with less cognitive effort, less ambiguity and increased physiological arousal. This phenomenon aligns with Channel Expansion Theory (Tidwell and Walther, 2002), which states similar outcomes with frequent and prolonged use of a CMC medium.

Several such channel expansion attempts have been clearly visible across the recent history of social media platforms, such as emoticons and hashtags. Emoticons, short for emotion icons, are pictorial representations of facial expressions using a combination of characters.
The most famous emoticons, the smiley face :-) and the frowning face :-(, were introduced to online communication in 1982 as a suggestion to mark statements as jokes or serious statements. The use of emoticons has since spread across all text based communication channels such as email, SMS and all social media platforms. A large set of emoticons is currently in use, extended to multiple cultural variants such as Japanese kaomoji. Also, emojis are pictorial versions of emoticons that often provide a more lively impression of facial expression. The main use of emoticons/emojis is to reduce the ambiguity of discourse that results from not having facial expressions and to improve the social presence in text based CMC (Walther, 1992; Derks et al., 2007). Therefore, emoticons/emojis expand the CMC channel by providing a means to transmit facial expressions.

Another prominent channel expansion attempt is social tagging or folksonomy, which is the practice of adding user generated tags to content published online. Such content published on social media platforms is mostly unstructured text, published in a less organised manner without adhering to any form of taxonomy. Hence it is a daunting task to find content or discussions that are relevant to a given topic of interest. It also limits the formation of online social groups with similar interests. Social tagging is one of the solutions adopted by the online community to mitigate this limitation. Social tags start with [...] in 2007 has led to its widespread use across most of the prominent social media platforms. As pointed out in (Zappavigna, 2015), social tags serve different purposes such as: (i) highlighting the semantic domain of the post (e.g., [...]

Self-disclosure

Self-disclosure is another social behaviour, which is the act of disclosing the self to an individual or a group (Jourard and Lasakow, 1958). Such disclosure can be intentional or unintentional, voluntary or by request, and verbal or non-verbal. The disclosure includes any type of information about the self, such as socio-demographic information, emotions, opinions and needs (Cozby, 1973). Self-disclosure is a key social behaviour that reportedly accounts for around 30-40% of everyday human conversations (Dunbar et al., 1997). No other species, including other primates, engages in such a high level of self-disclosure, which suggests that humans may have an intrinsic motivation to self-disclose personal information. In fact, recent studies on the brain reveal that self-disclosure activates the mesolimbic dopamine system, which relates to reward and pleasure, indicating that self-disclosure is a rewarding experience for an individual (Tamir and Mitchell, 2012).

Self-disclosure is often studied in long term strategic relationships such as romantic relationships (Sprecher and Hendrick, 2004; Derlaga and Berg, 1987; Greene et al., 2006) and therapeutic (therapist-patient) relationships (Derlaga and Berg, 1987; Knox et al., 1997; Hill and Knox, 2001). In addition, there are studies on spontaneous self-disclosure to a complete stranger, which is commonly known as the 'stranger on the train' phenomenon (Rubin, 1975).

There are several theories that describe the self-disclosure process and its objectives. Among them, social penetration theory (SPT), proposed by Altman and Taylor (1973), is the most widely regarded. SPT was developed to explain how interpersonal communication develops from shallow to intimate through the reciprocal exchange of self-disclosed information. SPT highlights two key dimensions of self-disclosure: breadth and depth. Breadth is the variety of topics in self-disclosure, which could include topics related to profession, political affiliation etc.
Depth is how much an individual reveals on a certain topic, where some disclosures are surface information while others disclose deeper information about the individual. SPT describes this multi-dimensional and multi-layered nature of personality using the 'onion metaphor' (Figure 3.4), where each layer of the onion represents a different layer of personality. The outermost layer contains more visible and public information such as socio-demographics, while the inner layers progressively contain more intimate information such as values, attitudes, emotions and goals. SPT states that the self-disclosure process is similar to peeling the onion layer by layer: it begins by disclosing more visible information and then progressively discloses more intimate information as the interpersonal relationship grows.

Figure 3.4: The onion-like nature of personality (Altman and Taylor, 1973), where outer layers contain more visible information and inner layers contain more intimate information. Image source: Wikipedia.

Self-disclosure serves multiple social needs discussed in Section 3.2.2, including affiliation, intimacy, belongingness and recognition. Self-disclosure is a key behaviour required in forming and maintaining any form of affiliation. Different topics are self-disclosed in different affiliations. For example, work related information such as skills and qualifications could be disclosed in professional affiliations, while a favourite sports team or player could be disclosed in affiliations related to fan clubs. The depth of self-disclosure is often shallow for affiliations. However, with reciprocity and over time, self-disclosure goes deeper in some affiliations, turning them into more intimate relationships (close friends or family), which often serve the social needs of intimacy and belongingness. Also, self-disclosing information about achievements is essential to get recognition from individuals and social groups.

Self-disclosure is a prominently observed social behaviour on social media platforms as well. It is reported that self-disclosure sometimes amounts to around 80% of the content on online social media platforms (Naaman et al., 2010). This is known as the online disinhibition effect (Suler, 2004), where individuals self-disclose more frequently and intensely in online communication than in face-to-face communication. Suler (2004) identified a few factors that lead to this online disinhibition effect: dissociative anonymity, invisibility, asynchronicity, empathy deficit, dissociative imagination, and minimization of authority. Anonymity and invisibility are similar to the stranger on the train phenomenon, where individuals feel safe to self-disclose more on online social media platforms when their identity is not linked, or only loosely linked, to the actual person. Similarly, the asynchronous nature of online social media platforms lets individuals self-disclose without worrying about an immediate response from others.

Cooperation and altruism

Cooperation is the behaviour of working as a group to attain a goal that benefits everyone involved. Altruism is defined as an act which only benefits the recipient while incurring some cost to the actor (Hamilton, 1964). Both cooperation and altruism exist in the animal world in tasks such as hunting and bringing up offspring; however, their scale and scope in human society are unmatched by any other species. Humans have the capacity to cooperate as large groups of genetically unrelated individuals (Boyd and Richerson, 2009) and the capability to cooperate in complex tasks.

In the social brain hypothesis (Dunbar and Shultz, 2007) it is argued that the increased cooperation and altruism in humans is an evolutionary trait for survival as complex social groups. Another argument is that cultural and social norms (Fehr and Fischbacher, 2004), supporting prosocial emotion, and the establishment of a legal system to punish those who violate such norms (Boyd et al., 2003, 2010) have contributed to increased cooperation and altruism.

Cooperation and altruism are often practised as a way to develop affiliations with social groups. They enable social interactions with the group members, which often improve and establish interpersonal relationships with the other group members and thus improve the bridging social capital of the individual. Such behaviours also lead to achieving recognition from the affiliated social groups. Also, cooperation and altruism with social groups enable the individual to take a leading role and influence others.

Behaviours related to cooperation and altruism are often observed on online social media platforms as well. Individuals form online social groups based on different interests and perform cooperative tasks. In contrast to social groups in the physical world, online social groups are not bound by any geographic boundaries. Hence, online social media platforms enable individuals to perform cooperative tasks beyond geographical boundaries.

Large social media platforms such as Facebook and Twitter often become the birthplace of social movements (Golder and Macy, 2014), which originate as people discussing or sharing their views on certain socio-political issues. Others with similar views join those discussions. Some of those discussions transition into social movements either facilitating or demanding a corrective action for the issues. The Tunisian and Egyptian revolutions, which overthrew the then governments, are prominent examples of such social movements. Those social movements prominently used Twitter for organising protests/demonstrations (e.g., logistic coordination) and spreading news and updates to both local and global audiences in near real-time (Lotan et al., 2011).

Another avenue for cooperative and altruistic behaviours is the online platforms for knowledge sharing, where individuals from across the globe contribute voluntarily by publishing user generated content. Such content is mostly based on their personal or professional experience. One of the well-known examples of such projects is Wikipedia, which is a massive online encyclopaedia contributed to by thousands of volunteers with different areas of expertise (Golder and Macy, 2014).
In Wikipedia, the community is delegated multiple roles covering content generation, editing, validation and administration, which drive content generation as well as the integrity of the content.

Similarly, in other virtual communities such as self-help groups, cooperation and altruism are displayed when individuals seek support and others respond by providing informational and emotional support based on their personal experiences (Ziebland and Wyke, 2012). Unlike Wikipedia, most other virtual communities do not have set roles. However, each piece of information is either supported or opposed by other participants, leading to a community validated response.

Cooperative acts on online social media platforms primarily provide a sense of affiliation to the members of that online community, especially when such contributions are commended by other members of the community. Moreover, online knowledge sharing communities have different approaches to providing social recognition to the contributors. For example, some platforms maintain contributor ratings, either granted automatically based on the number of articles/posts or provided by the readers based on the usefulness of the contributions. In addition, the highest contributors receive privileged administration and moderation roles to assess the suitability of user content and user behaviour in the community. They have the power to remove inappropriate content as well as suspend users for inappropriate behaviour. Such roles provide a sense of power and influence over others in that online community.

Social comparison

Social comparison is another key behaviour, which is an act of self-evaluation of an individual's attributes such as opinions, abilities, appearance, wealth, performance etc. against those of other individuals (Suls et al., 2002). Such comparisons enable the individual to understand their self-worth as well as reduce any uncertainties about the compared attributes. Festinger (1954) proposed the initial social comparison theory, which denotes social comparison as a behaviour that happens against individuals similar to oneself. People use social comparison for self-enhancement and to gain self-esteem.

Social comparisons can be broadly categorised as upward (comparing against someone better) and downward (comparing against someone worse) (Collins, 1996). In upward social comparisons, the comparison happens against individuals who are doing better in the compared aspects; such comparisons yield motivation and inspiration for self-improvement. However, they may lead to poor self-evaluation and negativity as well. In contrast, downward comparisons happen against individuals who are doing worse in the compared aspects. Such comparisons improve self-worth by comparing to others who are worse off (Taylor and Lobel, 1989; Wills, 1981).

Traditionally, social comparison happens among family, friends and co-workers whom an individual often encounters. In contrast, there are significantly more opportunities for social comparison on online social media platforms. This is because most of those platforms allow and encourage users to publish a user profile where users can self-create their own persona. For instance, on Facebook there are millions of user profiles freely accessible to anyone, which contain personal information as well as collections of photos/videos that provide a sneak peek into the lifestyle of each individual. Similarly, professional social media platforms such as LinkedIn contain comprehensive work related profiles.
Therefore, social comparison is significantly higher on online social media platforms (Vogel et al., 2014), where individuals compare their offline self to the online self of others as perceived from the information presented on those platforms. However, most of the profiles on online social media platforms are built using selected content to attractively present the ideal self of the individuals. For example, people only publish their best looking photos and highlights of their life, and avoid publishing any negative looking content (Walther, 2007; Vogel et al., 2014). Because of this selective self-presentation, social comparison on online social media platforms is not like for like, but rather puts the offline self at a disadvantage. This phenomenon leads to excessive upward comparisons, where the more realistic offline self gets compared to the meticulously presented online self of others, and often results in negative self-evaluations, diminished self-worth and even depressive symptoms (Vogel et al., 2014; Nesi and Prinstein, 2015). Chou and Edge (2012) report that the impact of such comparison is higher when comparing against the online self of people never or less frequently met in person, and lower for people encountered frequently.

Table 3.2 provides a summary of the key theories discussed in this section. In summary, the social behaviours discussed above are directed at achieving different social needs. Interpersonal communication is the foundational behaviour that is required to enable any form of social interaction with others, while the other social behaviours discussed utilise communication. Social behaviours are observed in both the physical world and the online environment (online social media platforms); however, some differences exist due to the differences in the environments. Table 3.3 summarises such differences of social behaviours in physical and online social environments.

Table 3.2: A summary of the key theories related to social behaviour discussed in this section.

Theory of planned behaviour (Ajzen, 1985; Ajzen, 1991): A behaviour is an intentional act to achieve goal(s) perceived based on the needs of individuals.
Uncertainty reduction theory (Berger and Calabrese, 1975; Berger and Bradac, 1982): During social interactions, individuals have a need to reduce the uncertainty about others, and thus use interpersonal communication as a means of acquiring more information.
Media richness theory (Daft and Lengel, 1986; Daft et al., 1987): The richness of a communication medium is its capability to effectively transmit information to alter the understanding of an individual or a group. A rich medium can transmit multiple communication cues, provide rapid responses, have a personal focus and support natural language for communication.
Channel expansion theory (Carlson and Zmud, 1999; Tidwell and Walther, 2002): The perceived richness of a communication channel depends on the experience of using that channel. Frequent and prolonged use of a channel makes it richer for the associated individuals.
Social penetration theory (Altman and Taylor, 1973): Self-disclosure is like peeling an onion layer by layer: it begins by disclosing more visible information and then progressively discloses deeper information as the relationship grows.
Social comparison theory (Festinger, 1954): Individuals compare themselves to others as a form of self-evaluation to reduce uncertainty about their own opinions and abilities.

As shown in the conceptual model in Figure 3.3, the social behaviours discussed above are mainly abstractions of human actions that originate to achieve the perceived social needs. Those abstracted behaviours are executed in a multitude of ways as social actions, depending on different functional and environmental aspects, which will be discussed in the following section.

Social action can be defined as an execution of a particular social behaviour detailed in the previous section. A social action for a particular behaviour is a specialised execution of that behaviour considering different functional and environmental aspects. Functional [...]


Social data is the archived accounts of social actions. Before the era of digitisation, accounts of social actions existed only in the memory of the participants and were forgotten over time, unless they were recorded on paper or dramatised as folk stories, which only happened to a handful of important social actions.

The development of digitisation and digital storage techniques has enabled the archival of social actions in the physical world in digitised format. Initially, digitisation techniques were expensive, devices were bulky, and technical expertise was required. Thus, such techniques and devices were mostly possessed by professionals and only used to record a few significant social actions. However, over the last few decades, (i) digital storage techniques have become cheaper and more accessible, and (ii) digitisation devices have become cheaper, more compact and easier to use. For instance, cloud storage has become cheaper and globally accessible through internet services. Also, smartphones packed with multiple digitisation techniques (e.g., text, voice recording, camera) have become affordable to the general public. The widespread availability of these techniques has enabled the general public to record and archive social actions in digitised form.

Accumulation of social data

Figure 3.5: Accumulation of the traces of social behaviours in online social media platforms as social data.

3.3. AN EMBODIMENT IN SELF-STRUCTURING ARTIFICIAL INTELLIGENCE

As discussed in Chapter 2, there have been attempts to use qualitative methods as well as conventional machine learning approaches to transform social data into patterns that provide data-driven insights into social behaviours. However, such attempts yield suboptimal results due to a multitude of challenges present in social data, which are discussed in Section 2.7. This thesis proposed a novel approach, the multi-layered conceptual model in Section 3.2. It is now pertinent to delineate the embodiment of this model, based on the paradigm of self-structuring artificial intelligence.

Figure 3.6 shows the key elements of the proposed framework, which consists of (i) representing social data, based on existing social sciences theories, as useful latent representations such as emotions, topics and events, (ii) unsupervised self-structuring to automatically structure representations of social data into semantically coherent groups, (iii) incremental learning from social data streams to learn temporal changes, and (iv) natural language processing to extract semantic information about individuals and groups.

The unsupervised self-structuring machine learning techniques are employed to automatically generate representative structures from the feature transformed social data, based on the underlying similarities of the social behaviours present. Such structuring of social data highlights the common patterns of social behaviours among different social groups. These unsupervised algorithms are extended from conventional machine learning algorithms to suit the challenges of social data, such as its high velocity, diversity and unstructured nature.

Incremental learning techniques help to capture temporal patterns which occur in social data streams at different granularities. For instance, at the individual level, changes are related to internal individual behavioural changes over time. Similarly, at the group level, changes in group behaviour are often driven by external factors such as events.
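
As an illustration only, the combination of unsupervised self-structuring and incremental learning described above can be sketched as a one-pass clustering routine. This is not the algorithm developed in this thesis; it is a minimal stand-in, assuming that posts are represented as bag-of-words vectors and compared by cosine similarity. The names `IncrementalClusterer`, `vectorise` and `cosine`, and the threshold value, are hypothetical.

```python
from collections import Counter
import math


def vectorise(text):
    """Turn a post into a sparse bag-of-words vector (token -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class IncrementalClusterer:
    """Hypothetical one-pass clusterer, not the thesis algorithm.

    Self-structuring: a new cluster is created whenever a post is not
    similar enough to any existing cluster.
    Incremental: each post is seen once and only updates one centroid.
    """

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # summed term counts per cluster
        self.sizes = []             # number of posts per cluster

    def add(self, text):
        """Assign one post to a cluster and return the cluster index."""
        vec = vectorise(text)
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar cluster: grow the structure.
            self.centroids.append(Counter(vec))
            self.sizes.append(1)
            return len(self.centroids) - 1
        # Otherwise fold the post into the closest existing cluster.
        self.centroids[best].update(vec)
        self.sizes[best] += 1
        return best


if __name__ == "__main__":
    clusterer = IncrementalClusterer(threshold=0.3)
    stream = [
        "flood warning river rising",
        "river flood warning issued",
        "new phone launch today",
        "phone launch event today",
    ]
    for post in stream:
        print(clusterer.add(post), post)
```

Summed term counts are used as centroids because cosine similarity ignores vector length, so the sum points in the same direction as the mean. A fixed threshold is the simplest possible growth rule; any practical system would need a more careful criterion and a richer text representation.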


Figure 3.6: The proposed self-structuring artificial intelligence framework for generating insights from social data in online social media platforms.

The above framework is developed and implemented in the forthcoming chapters of this thesis, which describe the implementation of parts of the framework as well as its application to two social data sources with distinct characteristics.

Chapter four presents an implementation of the above framework for fast-paced social media platforms such as Twitter, focusing mainly on the external aspects in Figure 3.6. Such platforms are characterised by their fast-paced nature, where discussions are used to rapidly disseminate information across the participants. The participants discuss topics related to recent events and express their opinion/sentiment. However, discussion topics change frequently. An unsupervised self-structuring and incremental learning algorithm is developed that automatically structures the social data into coherent clusters based on the semantic similarity that exists in social data, and an event detection algorithm is developed to capture temporal changes in various behavioural change indicators.

Chapter five presents an implementation of the above framework for online support groups (OSG), which are slow-paced and show contrasting characteristics to Twitter. OSG discussions focus prominently on the internal or psychological aspects, such as emotions, specified in Figure 3.6. OSG discussions are tightly coupled to the main theme of the OSG, which is often a disease condition, and participants engage in longer discussions. Because of higher homophily and stronger ties, people tend to engage in deeper self-disclosure such as the expression of deeper emotions. Multi-granular emotional expression was structured

3.4. CHAPTER SUMMARY

This chapter developed a layered conceptual model for interpreting social data generated in online social media platforms and a self-structuring artificial intelligence framework to capture insights from social data.

The conceptual model was developed after an extensive analysis of existing theories of cognition, social needs and social behaviour. It depicts social data as generated by social interactions in online social media platforms, where such interactions reflect different social behaviours driven by different social needs. It gives social data a deeper meaning as archived traces of online social behaviours rather than just a corpus of unstructured text. In addition, four key social behaviours were conceptualised based on existing literature, and a comprehensive comparison was provided that contrasts how each behaviour is exhibited in the physical world and in online social media platforms.

The proposed self-structuring artificial intelligence framework is designed to overcome the challenges of social data highlighted in Chapter 2. It consists of four key elements: (i) representing social data as useful latent representations based on the above conceptual model, (ii) unsupervised self-structuring to automatically structure such representations into semantically coherent groups, (iii) incremental learning to learn temporal changes, and (iv) natural language processing to extract semantic information. The next chapter presents an algorithmic development of this self-structuring artificial intelligence framework to capture insights from fast-paced social media platforms.

Chapter 4

Ideation to creation: A new self-structuring artificial intelligence algorithm

No event can be judged outside of the era and the circumstances in which it tookplace.

Fidel Castro

This chapter presents the algorithmic development of the self-structuring artificial intelligence conceptual model proposed in Section 3.3. As discussed in previous chapters, social data accumulated on online social media platforms is a representation of social interactions and behaviours of individuals from diverse socio-demographic backgrounds.

This chapter is organised as follows. Section 4.1 describes the high-level design of the proposed algorithm. Section 4.2 focuses on incremental learning for topic separation and Section 4.3 presents IKASL, a recent unsupervised incremental learning algorithm that forms the foundation for the proposed algorithm. Section 4.4 presents the first technique of the proposed algorithm, for handling the complexities of social text streams, and Section 4.5 presents the second component, the event detection technique. Section 4.6 demonstrates the full algorithm using an illustrated example and Section 4.7 presents an empirical evaluation using two Twitter datasets.

Given the diversity of social data generated by distinctive social media platforms, this development focuses on fast-paced online social media platforms such as Twitter, which is a high volume/velocity text stream contributed by a diverse set of participants from all over the world. Also, social media messages are relatively short and highly unstructured.

Due to this fast-paced nature, such social media platforms are mainly employed for rapid dissemination of current and trending information through the social network to a large number of individuals (Kwak et al., 2010; Lotan et al., 2011; Lovejoy et al., 2012). Due to this rapid dissemination of information, any trending topic captures the attention of a large number of individuals quickly, leading to a surge in discussions related to that topic. After a peak, such trending topics often rapidly lose interest and discussions may switch to a different trending topic. As the primary intention is information dissemination rather than continuing discussion, the ties between dyads are relatively weak and often lack homophily and reciprocity. Because of weak ties and lack of homophily, as discussed in Section 3.2.4, self-disclosures tend to be limited to revealing surface information about oneself. Also, the emotions expressed in tweets are often superficial and intense, with a clear direction towards positive or negative sentiment.

The developed algorithm consists of two techniques designed based on the above characteristics of the platforms. The first is a new unsupervised incremental machine learning technique that automatically captures discussions from a text stream of a fast-paced social media platform. This technique self-structures social text into coherent clusters which are indicative of distinct topics, and extends those clusters across time into pathways, a.k.a. topic pathways, to capture changes in those discussions over time due to changes in trending topics and the appearance of new unseen topics.

The composition of a social text stream evolves continuously over time, which includes changes in the focus of the topics, the emergence of new topics and the cessation of certain topics. Considering a social text stream as a whole makes it difficult to capture those interesting dynamics of the discussion topics. Therefore, it is important to separate the social text stream into dynamic topics that evolve over time, i.e., topic pathways. This topic pathway separation reduces noise and separates social media messages in a meaningful way so that they are categorised into topic pathways based on their topical similarity.
The proposed technique is designed to achieve this task by generating a self-adaptive structure that automatically separates the social text stream into topic pathways.

The second technique is a multi-faceted event detection technique developed to monitor topic pathways for significant changes in social behaviours over time and identify such changes as potential events of interest. Due to the characteristics discussed above, changes in social behaviours appear mainly as changes in self-disclosure, that is, changes in discussion topics and changes in emotional expression. Therefore, the detection technique was developed to capture those changes.

These new techniques were designed to effectively handle the challenges related to the unstructured nature of social data outlined in Section 2.7. They are unsupervised and capable of incremental learning, and thus overcome issues related to unlabelled data and time sensitivity.

This algorithm was trialled using two large datasets of tweets, one on a public figure and the other on an international organisation. The results highlighted the capability to automatically structure tweets semantically and temporally, yielding coherent and meaningful topic pathways. Also, the events automatically detected by the detection algorithm were relevant and meaningful.

4.1. THE DESIGN OF THE ALGORITHM

Figure 4.1 presents the high-level design of the proposed platform. A social text stream is denoted as a continuous stream of messages over time $\langle m_1, m_2, \ldots \rangle$, where each message $m$ consists of a short text and a time-stamp of its published time. The stream is sampled at a time period $\Delta t$ based on the time-stamp of the social media messages, producing a sequence of batches of social media messages $B_i$. Each batch is a set of social media messages collected for the time period $\Delta t$.
$$B_i = \{ m \}_{t = i \times \Delta t}^{t = (i+1) \times \Delta t} \quad (4.1)$$

The granularity of the time period $\Delta t$ (e.g., hourly, daily, weekly) can be adjusted to suit the velocity of the social text stream and to match other application-specific requirements.

Each batch is preprocessed to reduce noise due to the unstructured nature of social text, and a feature vector $v$ is extracted from each social text message $m$. Technical details of preprocessing and feature extraction are discussed in Section 4.7. The feature vectors of each batch are employed by the topic pathway separator to separate the social media messages into topics and topic pathways over time.

A topic pathway is a series of social media messages that discuss a distinct topic. It stretches over time across several batches of social media messages. A topic segment is the part of a topic pathway that belongs to a particular batch. Thus, each topic pathway is a chain of topic segments. A topic pathway $D$ can be depicted as follows:

$$D = D_1 \rightarrow D_2 \rightarrow \ldots D_i \ldots \rightarrow D_n$$

where $D_i = \{ m \}$ is a topic segment and $D_i \subset B_i$.

Each topic segment is a set of social media messages that are semantically similar to each other. Hence, technically, forming topic segments can be considered a clustering problem within each batch. A chain of such semantically similar clusters across batches forms a topic pathway. Therefore, this problem can be reduced to an unsupervised incremental clustering task that builds a set of semantically similar cluster chains across batches of social media messages.

As discussed in Section 2.3, an event can be broadly defined as an occurrence that can be bounded by space and time (Allan, Papka and Lavrenko, 1998). Events can happen anytime and anywhere, and can impact a single person, a handful of individuals, or a large number of individuals (e.g., natural disaster, election).
Since the interest here is to capture any event that has caught attention in online social discussions, the specifics of a particular event are not known in advance; thus an unspecified event detection technique (see Section 2.3.1) was designed for this task. This unspecified event detection technique looks for changes in the online social behaviour of the impacted individuals for the duration of the event (e.g., change of discussion topics, changes of emotions). Such changes in human social behaviours related to an event appear as sudden bursts (Barabasi, 2005) in the social text stream in between moderate or low levels of activity. However, such bursts in a particular topic are often diluted inside the diverse social text stream and can go unnoticed because the dynamics of other topics act as noise.

As a solution, we propose to detect events in the separated topic pathways, because each topic pathway is a coherent stream of social media messages that continues a discussion about a certain topic. Hence, a burst of activity related to a particular topic is easily detectable in the relevant topic pathway, as it is not (or only minimally) affected by the noise of other topics. The proposed event detection algorithm monitors each topic pathway for topic-specific events and detects such events using a multi-faceted event detector that looks for multiple event indicators.

The following sections describe the topic separation and event detection components of the proposed platform in detail.
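As an illustration of the batching in Equation 4.1, the following minimal Python sketch groups time-stamped messages into batches. The function name `batch_messages`, the `(timestamp, text)` tuple layout and the half-open interval convention (a message on the upper boundary falls into the next batch) are assumptions made for this example; the thesis does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_messages(messages, start, delta_t):
    """Group (timestamp, text) messages into batches B_i (cf. Eq. 4.1).

    A message falls in batch i when
    start + i * delta_t <= timestamp < start + (i + 1) * delta_t.
    The half-open interval is an assumption of this sketch.
    """
    batches = defaultdict(list)
    for timestamp, text in messages:
        if timestamp < start:
            continue  # skip messages published before the stream start
        index = (timestamp - start) // delta_t  # integer batch index i
        batches[index].append((timestamp, text))
    return dict(batches)


# Hypothetical stream with an hourly sampling period.
start = datetime(2020, 1, 1)
messages = [
    (datetime(2020, 1, 1, 0, 30), "first message"),
    (datetime(2020, 1, 1, 0, 59), "second message"),
    (datetime(2020, 1, 1, 2, 5), "third message"),
]
batches = batch_messages(messages, start, timedelta(hours=1))
# Batch 0 holds two messages, batch 1 is empty (absent), batch 2 holds one.
```

Changing `delta_t` to a daily or weekly `timedelta` adjusts the granularity in the way described above.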

As discussed above, topic pathway separation is technically an unsupervised incremental learning task that builds a self-adaptive structure consisting of a set of cluster chains over time across the batches of data.

As discussed in Section 2.6, unsupervised incremental learning is a challenging task. It has to learn previously unseen patterns from new incoming data while preserving the knowledge acquired previously (no catastrophic forgetting). Also, it should not require access to the past data that it has already processed.

Due to this challenging nature of unsupervised incremental learning, there are only a handful of algorithms in the literature that satisfy the above three characteristics (see Section 2.6 for details). Out of those, the recently published Incremental Knowledge Acquisition and Self-Learning (IKASL) algorithm (De Silva and Alahakoon, 2010; De Silva, 2010) is selected as the base for this work.

In IKASL, learning occurs as a structure of layers where each layer is learned from a new batch of data. In each layer, the incoming data is self-structured using the unsupervised self-learning mechanism introduced in the GSOM algorithm (Alahakoon et al., 2000). GSOM employs the brain-inspired Hebbian learning rule (Kohonen, 1990) to learn the underlying structure from a dataset.

IKASL preserves the acquired knowledge in a generalised form; therefore, it does not require access to past data that it has already processed. The generalised version of the acquired knowledge from each layer (n) is used as the basis for knowledge acquisition in the subsequent layer (n+1), thus avoiding catastrophic forgetting of past knowledge. Moreover, while using the past acquired knowledge as the base, it incrementally acquires new knowledge that is encapsulated in the incoming data. The following section elaborates the functional details of the IKASL algorithm.

4.3. IKASL ALGORITHM

This section describes the functional details of the IKASL algorithm.
Table 4.1 states the symbolic notation that will be used in this section to detail the IKASL algorithm.

Table 4.1: Notations used for the functional details of IKASL algorithm

| Symbol | Description |
| --- | --- |
| $B_i$ | $i$th batch of social media messages |
| $d$ | number of dimensions in (i) the feature vector of a social text message, and (ii) the weight vector of a feature map node |
| $LE_i$ | $i$th learning phase |
| $GE_i$ | $i$th generalisation phase |
| $n_j$ | $j$th feature map node |
| $TE_j$ | total quantisation error of node $j$ |
| $C_i$ | cluster representation vector learned during $GE_i$ |
| $NN_q$ | set of $N_q$ and the neighbouring nodes of $N_q$ |

The IKASL algorithm functions in three key phases: initialisation, learning and generalisation, which are delineated below.

IKASL is initialised in a similar manner to GSOM, using a map of four starting nodes $\{n\}$. Each node $n$ is initialised with a weight vector $w$ of size $d$. Initial values for the weight vector can be selected based on two different lines of approach:

• random initialisation: these approaches include completely random initialisation, initialisation using randomly selected data points, and random sampling from a statistical distribution.
• data distribution based initialisation: these approaches include the use of K-means clustering to initialise weights to identified cluster centres, and heuristic methods such as principal component analysis to assign optimal values to weights.

Each approach comes with pros and cons; however, the gains are often marginal and only affect the rate of convergence. Interested readers can refer to Attik et al. (2005) for further details.

For the sake of simplicity, IKASL weight vectors are initialised by sampling from a uniform random distribution in the range [0, 1].

The learning phase $LE_i$ happens in each layer of IKASL, in which it self-learns a topology-preserving two-dimensional feature map from the data batch $B_i$. This self-structuring function employs the GSOM (Alahakoon et al., 2000) algorithm to learn the feature map. The GSOM algorithm is an enhanced version of the Self Organising Map (SOM) (Kohonen, 1990) discussed in Section 2.5.2.

Similar to SOM, GSOM uses the concept of competitive learning, where a set of nodes (neurones) in a feature map compete for the ownership of input data vectors. The winner, or best matching unit, is the node that is most similar to the input.

Once the winner is selected, the brain-inspired Hebbian learning rule (Hebb, 1949) is used to update the weights of the winning node and its neighbourhood so that their distance to the particular input is reduced. This process iteratively learns the topology-preserving feature map. One of the key limitations of SOM is that the number of nodes in the feature map needs to be pre-determined, which is often done experimentally. GSOM overcomes this issue by starting with a small number of nodes in the feature map and using a node expansion function to dynamically expand the feature map during the training phase. This self-structuring phase is executed in two stages: (i) self-organisation with node expansion and (ii) smoothing. The details are described below.
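The competitive learning step summarised above can be sketched in a few lines of Python. This is a simplified illustration rather than the GSOM implementation: it assumes Euclidean distance as the similarity measure, a fixed 2 x 2 grid given as a hypothetical `neighbours` dictionary, and a constant learning rate `alpha`; node expansion and the shrinking neighbourhood are omitted.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def competitive_step(weights, v, alpha, neighbours):
    """Select the winner for input v and pull it and its neighbours towards v.

    weights    : list of node weight vectors (updated in place)
    neighbours : dict mapping a node index to its neighbouring node indices
    Returns the index of the winning node (best matching unit).
    """
    winner = min(range(len(weights)), key=lambda j: distance(weights[j], v))
    for j in [winner] + neighbours[winner]:
        # w <- w + alpha * (v - w): reduces the distance to the input
        weights[j] = [w + alpha * (x - w) for w, x in zip(weights[j], v)]
    return winner


random.seed(0)
d = 3
# Four starting nodes with weights sampled uniformly from [0, 1].
weights = [[random.uniform(0.0, 1.0) for _ in range(d)] for _ in range(4)]
neighbours = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}  # 2 x 2 grid
v = [0.9, 0.1, 0.5]

before = [distance(w, v) for w in weights]
winner = competitive_step(weights, v, alpha=0.3, neighbours=neighbours)
after = [distance(w, v) for w in weights]
```

With `alpha = 0.3`, the distance of every updated node to the input shrinks by a factor of 0.7, while the node outside the winner's neighbourhood is left unchanged.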

Self structuring and node expansion

This stage presents the inputs in a randomised order and learns the feature map using competitive learning and the Hebbian learning rule (Hebb, 2005). The node map is dynamically expanded based on the quantisation error accumulated in individual nodes.

• A random input vector $v$ is selected from the dataset and presented to the network of nodes in GSOM.
• The nodes compete for the ownership of the input $v$, and the node $n_q$ that is closest to the input vector $v$ is selected as the winner: $sim(v, n_q) \geq sim(v, n), \forall n \in \{n\}$. Note that similarity is measured using a designated distance function.
• The feature map updates the weight vectors of the winning node and its neighbouring nodes $N(n_q, i)$ to reduce the distance between the nodes and the input vector $v$. The weight vector $w_j$ of each selected node $n_j$ is updated as follows:

$$w_j(k, i+1) = \begin{cases} w_j(k, i) + \alpha(i) \times (v(k) - w_j(k, i)), & \text{if } n_j \in N(n_q, i) \\ w_j(k, i), & \text{otherwise} \end{cases} \quad (4.2)$$

  Here $i$ is the iteration and $k = 1, 2, \ldots, d$; the learning rate $\alpha(i)$ is bounded above. The learning rate $\alpha$ is a tuning parameter that determines the amount of weight correction in each selected node due to the input $v$. It starts with a higher value and gradually decreases as $i \to \infty$. $N(n_q, i)$ is the neighbourhood of $n_q$ that receives weight updates. This process simulates the neuro-biological phenomenon of lateral excitation, which happens in the neocortex of the brain, where the activation of a neurone excites the neurones that are connected to it (Dalva and Katz, 1994). The size of the neighbourhood $N(n_q, i)$ gradually decreases as $i \to \infty$ and eventually reduces to $\{n_q\}$.
• The winning node updates its quantisation error based on the distance between the node and the input $v$: $QE_{n_q} = QE_{n_q} + |v - w_q|$. This accumulation of quantisation error drives the node expansion described next.
• IKASL uses the node expansion method proposed in GSOM (Alahakoon et al., 2000), which uses the accumulated quantisation error $QE_{n_q}$ to trigger expansion of the network from that node. When $QE_{n_q} > \tau_Q$, node expansion is triggered on $n_q$ and new nodes are added to the vacant spaces around $n_q$ in the grid formation, as shown in Figure 4.2.

Figure 4.2: Two node expansion scenarios

Smoothing

This stage further fine-tunes the weights of the node map by re-feeding the inputs. It uses the same competitive learning and Hebbian learning rule based approach to fine-tune the node weights. However, in contrast to the previous stage, this stage uses a small learning rate $\alpha$ and a small neighbourhood in order to avoid substantial weight changes due to inputs. Moreover, this stage uses the fixed node map learned in the previous stage and does not perform node expansion.

The generalisation phase $GE_i$ extracts a summarised representation, as a set of cluster representation vectors $\{C\}$, from the topology-preserving node map learned in the corresponding self-structuring phase $LE_i$. Nodes that are able to claim a significant proportion of inputs are identified as hit nodes. The cluster representation vectors $\{C\}$ are learned by aggregating the weights of each hit node and the weights of its neighbourhood. This phase is described below:

• Input vectors are re-fed into the node network and the nodes claim inputs based on similarity.
• The hit count $h(n)$ of each node $n$ is the number of inputs claimed by that node:

$$h(n_j) = |\{ v \in B_i : n_j = \arg\max_{n \in \{n\}} (sim(n, v)) \}| \quad (4.3)$$

• Nodes that claim more than a $\tau_H$ proportion of the inputs are marked as hit nodes, where the hit threshold $\tau_H$ is a tunable parameter. Note that the hit threshold $\tau_H \in [0, 1]$ controls the granularity of the summarised representation, where a higher $\tau_H$ results in fewer hit nodes and thus a more generalised representation.
• The cluster representation vector $C_h$ is formed by aggregating the weights of an identified hit node $n_h$ and its neighbourhood $N_h(n_h)$:

$$C_h = aggregation(n_h, N_h(n_h))$$

  Note that IKASL employs a fuzzy integral as the aggregation function. However, any suitable aggregation function can be used for this task.

After the first layer, the learning phase $LE_i$ is based on the set of cluster representation vectors $\{C\}_{i-1}$ learned during the previous generalisation phase $GE_{i-1}$.
Each cluster representation vector becomes the weights of the starting node of a separate feature map in $LE_i$. Therefore, $|\{C\}_{i-1}|$ feature maps are learned in $LE_i$.

Firstly, the feature vectors are assigned to their closest cluster representation vector based on the employed similarity function. The assigned feature vectors are then used to learn the respective feature maps initialised from that cluster representation vector.

Figure 4.3: Layered structure of IKASL algorithm

$LE_1$ is initialised using the cluster representation vectors $\{C\}_0$ of layer 0. Hence, previously learned knowledge from $B_0$ is preserved and used as the basis for the acquisition of new knowledge. The new knowledge from batch $B_1$ of data is acquired by the learned feature maps in $LE_1$. Subsequently, generalisation layer $GE_1$ generates a summarised representation of the acquired knowledge as a set of cluster representation vectors $\{C\}_1$.

The feature vectors derived from batches of social media messages are used to self-structure the adaptive structure that consists of chains of cluster representation vectors. The messages of each batch are assigned to the most similar cluster representation vector generated from the same batch, thus forming clusters of social media messages. Chains of cluster representation vectors form chains of social media message clusters that become the topic pathways.

IKASL addresses the issue of time sensitivity (see Section 2.7) in social media messages by sampling them using their published time and incrementally learning topic pathways that capture the dynamic changes of each topic. The learning of IKASL has linear complexity in the number of social media messages; hence it is scalable to large volumes of social text data. Moreover, the acquired knowledge is stored in a summarised form as cluster representation vectors, which makes it scalable in memory as well.

While IKASL addresses several issues related to topic pathway extraction from a social text stream, there are several identified issues that it does not address. Hence, an extended IKASL algorithm (IKASL$_{text}$) is proposed to address those issues. The issues and their resolutions in IKASL$_{text}$ are presented in the next section.

4.4. IKASL$_{text}$: Extended IKASL for topic separation from a social text stream

This section describes three key extensions included in IKASL$_{text}$, the extended IKASL algorithm for topic separation from a social text stream, which is optimised to handle the challenges in social text.

Section 4.4.1 describes the technique developed to overcome the issue of dynamic vocabularies, Section 4.4.2 describes the technique to handle the brevity of social media messages, and Section 4.4.3 presents the technique to detect new topic pathways and improve coherence in existing topic pathways.

In online social media platforms, users often coin new words that are not found in standard vocabulary (Eisenstein, 2013). As discussed in Section 2.7, some of these terms are tag words coined to represent certain events or topics (e.g., hashtags in Twitter). Such terms are important for aggregating the social media messages of that topic/event and for understanding the evolution of that topic/event. Often these new terms appear as bursts, where usage peaks rapidly, subsequently declines, and the term is seldom used afterwards.

The conventional approach would be to use a static global vocabulary of significant terms in the dataset to generate feature vectors from social media messages. However, there are two key issues associated with this conventional approach.

1. The tag words that are used in social data streams are not known a priori (Mahendiran et al., 2014); thus it is not possible to create a global static vocabulary that includes all possible terms.
2. Considering a large number of terms as features significantly increases the sparsity of the feature vectors.

In order to resolve these issues of a global vocabulary, a new method is developed to facilitate dynamic vocabularies, which can incorporate new significant terms (e.g., hashtags) into the incremental learning task as they appear in new batches of data. A vocabulary $V_i$ comprising a set of significant terms is learned from each batch $B_i$ of social media messages. This vocabulary $V_i$ is local to batch $B_i$ and is used for the self-structuring stage of that batch. Therefore, the cluster representation vectors $\{C\}_i$ learnt from the batch $B_i$ are based on features extracted using $V_i$.

The candidate terms for the vocabulary $V_i$ are selected based on local term-frequency (Salton and Buckley, 1988b), i.e., the frequency of the term in batch $B_i$: the terms that have a local term-frequency higher than $\tau_V$ in $B_i$ are included in $V_i$. $\tau_V$ is a tuning parameter; setting a higher value can miss informative terms, and setting a lower value can increase the sparsity of the feature representation of social media messages.

The key challenge of this approach is how to measure similarity between feature vectors based on two distinct vocabularies.
This is because in IKASL, the cluster representation vectors $\{C\}_i$ learned from the batch $B_i$ are employed as the basis for the self-structuring stage of the subsequent batch $B_{i+1}$, in which $V_{i+1}$ might contain terms that are not in $V_i$.

In order to solve this problem, only the intersection between the two vocabularies is considered in similarity calculations. For example, let $C_i$ be any cluster representation vector from stage $i$ and $v_j$ be any input vector from data batch $B_j$ (where $j > i$); the cosine similarity between $C_i$ and $v_j$ is defined as follows:

$$sim(C_i, v_j) = \left[ \frac{C_i \cdot v_j}{\|C_i\| \times \|v_j\|} \right]_{V_i \cap V_j} \quad (4.4)$$

where $V_i \cap V_j \neq \emptyset$. Note that cosine similarity is a popular similarity measure for sparse feature vectors and is often used with bag-of-words text representations.

Figure 4.4 shows the local term frequency plot of two terms in the social data stream. Term A maintains a local term frequency higher than $\tau_V$ throughout; hence it is included in the vocabulary of all batches shown. In contrast, term B appears as a burst, reaches its peak local term frequency and gradually declines subsequently. Its local term frequency is greater than $\tau_V$ only in batches 3 to 7; hence it is included in vocabularies $V_3, \ldots, V_7$.

Figure 4.4: Local term frequency of terms A and B across social media message batches. $\tau_V$ is the local term frequency threshold that determines inclusion or exclusion in the respective vocabulary.

This method overcomes the problem of new terms, as it uses a dynamic vocabulary and incorporates new significant terms into the feature space as they appear in new batches of social media messages. As for term B shown in Figure 4.4, dynamic vocabularies include the new term as a feature when it is significant and relevant. Thereby, the self-structuring and generalisation stages for the relevant batches consider the dynamics of the particular new term for topic pathway separation.
Moreover, term B is automatically excluded from the feature space when its usage drops below the threshold $\tau_V$ and it becomes less relevant. This subsequent exclusion reduces the sparsity of the feature vectors and thereby improves the clustering results.

The short length of social media messages is imposed by some social media platforms (e.g., a tweet is limited to 140 characters). Even on platforms without such restrictions, users prefer brevity since it allows them to express themselves more efficiently. This results in fewer words in social media messages. For example, tweets are on average 11 words long (O'Connor et al., 2010).

The feature vectors of the social media messages are formulated using the bag-of-words technique, where each feature is a word in the text corpus of that batch of social media messages. The feature corresponding to each word present in a social media message is assigned the feature value of that word, while the other features are zero (not activated). Because of the brevity, these feature vectors have only a handful of non-zero features.

During the self-structuring phase (Section 4.3.2), the activated features of a feature vector activate the corresponding weights of the winning node. Throughout the self-structuring phase, different sets of weights of the node get activated based on the feature vectors that the node claims as the winner. However, since the winner is determined by similarity, which is computed using the activated features of the node, the already activated features often get further strengthened during the self-structuring phase. These activated weights of the node represent the acquired knowledge of that node. Because of the brevity issue, often only a few feature weights are activated in each node after the self-structuring phase.

When the nodes have only a few activated weights and the others are close to zero, the hit node and its neighbourhood share few activated features. Hence, the existing summation-based aggregation function used in the generalisation phase (see Section 4.3.3) would diminish the strength of the non-overlapping weights, and the relevant acquired knowledge would be lost.

This issue is inherent in summation-based aggregation methods, which favour frequently activated features, while features activated in one or a few of the considered nodes are discriminated against (Murray and Perronnin, 2014). In order to minimise this issue, we adapted an aggregation method known as max-pooling, which is popular in Bag of Visual Words (BoVW) based techniques in visual recognition tasks (Peng et al., 2015). In fact, Boureau et al. (2010) have theoretically shown that max-pooling performs better than average-pooling (averaging-based aggregation) on features with a low probability of activation.

Max-pooling takes the maximum feature value for each feature from the hit node and its neighbourhood. In a $d$-dimensional feature space, a cluster representation vector is calculated from the neighbourhood $N(n_h)$ of a hit node $n_h$ (the neighbourhood includes the hit node) as follows:

$$\forall k \in 1, \ldots, d, \quad C(k) = \max_{n \in N(n_h)} (n(k)) \quad (4.5)$$

This approach does not average out activated features that exist in only a few nodes against the near-zero values in other nodes. Intuitively, in the context of bag-of-words feature extraction, max-pooling aggregates the significant words learned by each node considered and includes all of them as significant words in the respective cluster representation vector.
The impact of the dynamic and time sensitive nature of social data streams extends beyond the granularity of new terms into the much broader concept of new topic pathways. New topic pathways emerge when there is substantial public interest in a new discussion topic, i.e., social media buzz. Such emerging topic pathways are often linked to real world events that have captured the attention of a significant portion of the general public. Hence, it is important to capture newly emerging topic pathways which are only loosely related to existing pathways.

However, the original IKASL algorithm links clusters to previous layers, so all new clusters are formed with a cluster representation vector learned from previous layers as the base. Hence, when a new topic emerges, the relevant social media messages would map into an already existing topic pathway and would only be seen as incremental changes to that pathway. Such an approach would hide the new topic as well as decrease the coherence of the existing topic pathways.

[...] cluster representation vector based on similarity would assign less similar feature vectors to respective topic pathways. Hence, the knowledge acquisition process from the new unseen messages can increase the variance within topic pathways, decreasing their topic coherence over time.

In order to overcome these two issues, we employed a similarity threshold based method, which is often used in high-dimensional clustering algorithms (Ester et al., 1996; Ertöz et al., 2003) to filter out noise feature vectors that are only loosely related to any existing cluster. However, instead of filtering out such feature vectors as noise, our proposed technique identifies any potential new topics encapsulated in them. Therefore, in the proposed method, if the similarity with the closest cluster representation vector is less than the similarity threshold $\tau_{topic}$, then such feature vectors are used to learn a new feature map $G_{new}$ that is randomly initialised as in LE. Thus, for a feature vector $v$, the feature map $G$ is selected as below:

$$G = \begin{cases} G_j, & \text{if } C_j = \arg\max_{C \in \{C\}} sim(C, v) \text{ and } sim(C_j, v) \geq \tau_{topic} \\ G_{new}, & \text{otherwise} \end{cases} \tag{4.6}$$

$\tau_{topic}$ is the topic pathway coherence threshold that represents the expected coherence of the identified topics. It correlates positively with both the topic coherence and the number of topic pathways: high $\tau_{topic}$ values result in more topic pathways with increased coherence, while low $\tau_{topic}$ values result in fewer topic pathways with less coherent content, some of which might be only loosely related to the particular topic.

This method improves the coherence of existing topic pathways, as only the feature vectors that are similar to the respective cluster representation vector are used in the self structuring stage to incrementally learn dynamic changes of that topic pathway.
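The threshold based selection of a feature map can be sketched as follows. This is a minimal illustration rather than the thesis implementation: cosine similarity is assumed for $sim$ (the similarity measure is not specified at this point), and the function names, the example vectors and the threshold value are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_feature_map(
    v: Sequence[float],
    cluster_vectors: Sequence[Sequence[float]],
    tau_topic: float,
) -> Optional[int]:
    """Return the index j of the existing feature map G_j, or None to signal G_new."""
    best_j: Optional[int] = None
    best_sim = -1.0
    for j, c in enumerate(cluster_vectors):
        s = cosine_similarity(c, v)
        if s > best_sim:
            best_j, best_sim = j, s
    # Only sufficiently similar vectors extend an existing topic pathway.
    if best_j is not None and best_sim >= tau_topic:
        return best_j
    return None


# Hypothetical cluster representation vectors from the previous layer.
clusters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(select_feature_map([0.9, 0.1, 0.0], clusters, tau_topic=0.5))  # 0
print(select_feature_map([0.0, 0.0, 1.0], clusters, tau_topic=0.5))  # None
```

A return value of `None` stands for the randomly initialised feature map $G_{new}$; raising `tau_topic` sends more feature vectors down that branch.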
In contrast to this incremental learning of existing pathways, the generation of a new feature map leads to the formulation of cluster representation vectors that are not influenced by the knowledge acquired in previous layers and thus become the basis for new topic pathways.

It should be noted that the proposed technique separates topic pathways autonomously. The coherence of these topic pathways can be evaluated using topic coherence measures (Röder et al., 2015). The details of topic coherence are further discussed in the experiments section, alongside results from experiments.

This extended IKASL algorithm has the same time complexity as the vanilla IKASL algorithm. It primarily depends on the number of inputs $n$ and the number of nodes in the GSOM $k$. Each training step requires a nearest neighbour search between the input and the GSOM nodes, leading to a complexity of $O(nk)$. However, the expansion of the GSOM slows over training iterations and eventually saturates. Thus, when the input data set is sufficiently large ($n \gg k$), it can be assumed that the time complexity of the IKASL algorithm is linear in the input size, i.e., $O(n)$. This is a very useful property of IKASL, which enables its use on very large datasets.

As mentioned in Section 4.1, an event can be characterised as a sudden burst of activity in a certain topic. In recent literature, as discussed in Section 2.3.1, detecting such a burst of activity in a topic is often simplified to detecting bursty key words or tag words (hashtags in Twitter) (Weng et al., 2011), with the hypothesis that such bursty key words are indicative of an event related to the topic inferred by the meaning of those key words.
In contrast, Paltoglou (2016) recently introduced a sentiment based event detection technique in which significant changes of sentiment are used as event indicators, with comparable results.

As discussed in Section 4.1, it can be hypothesised that an event in a social data stream creates a disruption to related topic pathways, so that the characteristics of social behaviours in the affected topic pathways change substantially. Considering the nature of fast-paced social media platforms, such changes are reflected in various aspects of the topic pathway, such as a change of volume or a change of sentiment.

Hence, in the proposed event detection approach we employed a multi-faceted event detector. It looks for different event indications such as volume changes ($I_V$), positive sentiment changes ($I_{PS}$) and negative sentiment changes ($I_{NS}$). In addition to those generic event indicators, domain specific event indicators ($I_{DSE}$), such as (i) the volume of disaster related words, e.g., shaking (Sakaki et al., 2010b), and (ii) the volume of civil unrest related words, e.g., protest (Ramakrishnan et al., 2014), can also be incorporated into the event detection module.

The proposed technique linearly combines the event indicators to obtain the final event score $I$. Let $D_{ij}$ be a topic segment of topic pathway $D_j$; the final event score of $D_{ij}$ is defined as follows:

$$I(D_{ij}) = r_V \times I_V(D_{ij}) + r_{PS} \times I_{PS}(D_{ij}) + r_{NS} \times I_{NS}(D_{ij}) + \dots + r_{DSE} \times I_{DSE}(D_{ij}) \tag{4.7}$$

where $r_V, r_{PS}, r_{NS}, \dots, r_{DSE} \in [0, 1]$ and $\sum r = 1$. $r_k$ represents the sensitivity of the final event score to the event indicator $I_k$: a high or low value of $r_k$ increases or decreases the impact of $I_k$ on the final event score accordingly. The $r$ values can be set empirically based on the importance of certain indicators to suit different applications. It is also possible to learn the $r$ values by training a classifier using pre-labelled records.

$I$ can take values between 0 and $\infty$, and its value is proportionate to the significance of the event. Potential events of interest can be identified by applying an event threshold $\tau_e$, where topic segments with $I(D_{ij}) \geq \tau_e$ are flagged as events.

The following subsections describe the volume based and sentiment based event indicators respectively.

The volume based event indicator looks for a significant increase in the volume of a topic pathway. Such an increase is an indication of increased interest in that topic pathway, which can be due to a potential event relevant to that topic pathway.

In order to capture significant increases in volume, we set $I_V$ as the ratio between the proportion of messages in a topic segment and the moving average of the proportion of messages in that topic pathway:

$$I_V(D_{ij}) = \frac{VP(D_{ij}) \times W}{\sum_{x=i-W}^{i-1} VP(D_{xj})} \tag{4.8}$$

where $W$ is the window size for the moving average.

The volume proportion of a topic segment $VP(D_{ij})$ is determined as:

$$VP(D_{ij}) = \frac{|D_{ij}|}{|B_i|} \tag{4.9}$$

The volume proportion is more robust than the absolute volume (number of social media messages) of a topic segment. This is because the absolute volume of a topic segment is biased by the volume of the social text message batch, which varies significantly over time despite the fixed sampling time $\Delta t$. These changes in batch volume are often due to seasonal variations which depend on the hour of day (daytime vs night-time) and day of week (weekdays vs weekends). In contrast, volume proportion is more resilient to such bias as it normalises the seasonal variations.
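The volume based indicator and the linear combination into an event score can be sketched as below. This is a minimal illustration under stated assumptions: the moving average is taken over the $W$ topic segments that precede the current one, the function names are hypothetical, and the segment sizes, sentiment indicator values, weights and threshold are invented for the example.

```python
from typing import Dict, Sequence


def volume_proportion(segment_size: int, batch_size: int) -> float:
    """Share of a batch that falls into one topic segment."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return segment_size / batch_size


def ratio_to_moving_average(current: float, history: Sequence[float], window: int) -> float:
    """Ratio of the current value to the mean of the last `window` earlier values."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` previous topic segments")
    total = sum(history[-window:])
    if total == 0:
        # Assumption: no recent activity gives an unbounded ratio for any activity now.
        return float("inf") if current > 0 else 0.0
    return current * window / total


def event_score(indicators: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of event indicators; the weights lie in [0, 1] and sum to 1."""
    if set(indicators) != set(weights):
        raise ValueError("indicators and weights must use the same keys")
    if any(not 0.0 <= r <= 1.0 for r in weights.values()):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[k] * indicators[k] for k in weights)


# Hypothetical pathway: two earlier segments hold 25% of their batches, the current one 50%.
previous = [volume_proportion(25, 100), volume_proportion(50, 200)]
current = volume_proportion(50, 100)
i_v = ratio_to_moving_average(current, previous, window=2)  # 2.0

indicators = {"V": i_v, "PS": 1.0, "NS": 1.0}   # sentiment indicators assumed unchanged
weights = {"V": 0.5, "PS": 0.25, "NS": 0.25}    # illustrative r values
score = event_score(indicators, weights)        # 1.5
tau_e = 1.2                                     # illustrative event threshold
is_event = score >= tau_e

print(i_v, score, is_event)
```

Because the second earlier batch is twice as large as the first, its absolute volume doubles while its volume proportion stays at 0.25, which is the robustness property described above.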
Sentiment based event indicators capture the changes of public opinion (Thelwall et al., 2011). The concept that events are associated with changes of positive and negative sentiment strength has been successfully used in a recent event detection task on a social text stream (Paltoglou, 2016). The hypothesis here is that an event can alter the existing level of public opinion of relevant topic pathways, and such changes can be detected by measuring the sentiment changes of topic pathways over time.

As discussed in Section 2.4, sentiment analysis on social media messages is a challenging task, mainly because of their inherent unstructured nature (see Section 2.7), where authors loosely follow the grammatical rules of language (Baldwin et al., 2013) and heavily use emoticons to express sentiment. Moreover, word lengthening (e.g., sorrrry) is widely used to emphasise the sentiment of statements (Brody and Diakopoulos, 2011). Recent research reports that such variations have resulted in significant performance degradation in conventional sentiment analysis tools (Eisenstein, 2013).

Therefore, it is important to use sentiment tools that are designed to handle the above mentioned characteristics of social text. In Section 2.4.3, we investigated the techniques of several state-of-the-art sentiment analysis tools (Esuli and Sebastiani, 2006; Socher et al., 2013; Thelwall et al., 2012), and found that SentiStrength (Thelwall et al., 2012) is well suited for our requirement, mainly because it has been specifically engineered to capture sentiment related features in social text. It employs a word list with sentiment strength and polarity to derive both positive and negative sentiment strengths. It considers emoticons, boosting words and negation words to strengthen or weaken the sentiment value. In addition, it considers repeated letters in a word (e.g., sorrrry) as an indication of added emphasis on the sentiment of that word.

SentiStrength provides positive sentiment values ranging from 1 to 4 and negative sentiment values ranging from -1 to -4. Two sentiment based event indicators are designed, utilising the positive and negative sentiment values to capture variations in positive and negative sentiment respectively. The sentiment value of a topic segment is determined by averaging the sentiment values of the social media messages that belong to that topic segment. The variation of sentiment in a topic segment $D_{ij}$ is captured by comparing its sentiment to the moving average of sentiment in that topic pathway $D_j$. The positive sentiment event indicator $I_{PS}$ and the negative sentiment event indicator $I_{NS}$ are defined as follows:

$$I_{PS}(D_{ij}) = \frac{avgPositiveSentiment(D_{ij}) \times W}{\sum_{x=i-W}^{i-1} avgPositiveSentiment(D_{xj})} \tag{4.10}$$

$$I_{NS}(D_{ij}) = \frac{avgNegativeSentiment(D_{ij}) \times W}{\sum_{x=i-W}^{i-1} avgNegativeSentiment(D_{xj})} \tag{4.11}$$

where $W$ is the window size for the moving average.

This section demonstrates the functionality of the proposed platform for separating topic pathways and event detection using the illustrative example shown in Figure 4.5.

As shown in Figure 4.5, social media messages are sampled into batches and the extracted feature vectors are fed to the topic separation technique. Once the cluster representation vectors $\{C\}_i$ are learned, social media messages in $B_i$ are mapped to their closest cluster representation vector $C_{ij} \in \{C\}_i$.
The topic segment $D_{ij}$ is formed by the set of social media messages that are mapped to the cluster representation vector $C_{ij}$. As shown in Figure 4.5, for example, C j is the closest for m , m ∈ B , thus D j = { m , m } . C j is learned based on C j (the feature map that is used to derive C j is initialised using C j ), and evolves over time by acquiring relevant new knowledge from B . Hence C j shares similarity with C j , C j shares similarity with C j , and C j shares similarity with C j . Because of these relationships among the cluster representation vectors, the topic segments formed based on them also contain semantically similar social media messages. Therefore, such topic segments are linked to form a topic pathway D j = D j → D j → D j → D j that contains semantically similar social media messages over time.

[...] text facilitates new topic pathways; hence, as shown in Figure 4.5, such new topic pathways can be spawned from any batch.

The identified topic pathways are continuously evaluated by the proposed event detection algorithm. It collects information about event indicators and aggregates them into an event score $I$. As shown, the event indicators of D j are determined based on changes in volume proportion, positive sentiment and negative sentiment compared to the previous topic segments D j and D j (moving average window of 2). If I ( D j ) ≥ $\tau_e$ then D j contains information about a potential event that is related to the topic pathway $D_j$.

The next section presents the results of an empirical evaluation of the proposed topic separation and event detection techniques using two Twitter datasets.

This section demonstrates the core capabilities of the proposed platform for topic separation and event detection. Twitter was used as the experimental online social media platform due to its popularity and rich API. The first subsection describes the datasets and the next two subsections describe the preprocessing and feature extraction steps. Subsequently, comprehensive accounts of the core capabilities (topic pathway separation, evolution of topic pathways, emergence of new topic pathways and automatic event detection) are presented in separate subsections.

This section describes the two Twitter datasets collected for the experiments.

Dataset Obama: President Obama maintains a diverse range of public relations, both locally and internationally, and he is the most followed political figure on Twitter with approximately 77.8 million followers. Analysing tweets about him is therefore a challenging task. Separating these into topic pathways would showcase the capabilities of the proposed technique and its value over existing techniques. [...] for the time duration 01/12/2014 to 19/04/2015 (20 weeks). 4,230,985 tweets were collected which contain 173,903 [...] out of which 57,998 are hashtags.

Dataset Microsoft: Microsoft Corporation has a diverse online social presence due to a large portfolio of technology-centric products and services that are consumed by a variety of end-users. This diversity is a fitting testbed to further investigate the separation of a tweet stream into different topic pathways on products, services and competitors. [...] for the time duration 08/12/2014 to 28/06/2015 (30 weeks). 1,953,243 tweets were collected which contain 84,758 [...] out of which 29,655 are hashtags.

The characteristics of the two datasets are summarised in Table 4.2.

Pre-processing is carried out on text data to reduce noise. This step is important as it helps to reduce the sparseness of text data. Note that informal language is widely used in tweets; they are therefore often noisier than standard text documents, and hence a series of pre-processing steps is necessary.

Duplicate removal: tweets that are duplicated within the same batch are often spam or advertisements generated by automatic bots (Wang, 2010). Duplicated tweets affect the topic pathway separation process because they tend to form separate pathways. Such pathways are not informative for understanding public opinion, as they are published for advertising purposes and can confuse attempts at event understanding and detection.

Stopword removal: stopwords are common words in text that are often less informative (e.g., 'a', 'the'). It is a common text processing practice to remove those stopwords before extracting features. In addition to those, tweets also have Twitter specific stopwords such as 'rt' (a tag word used for retweets). Web links in tweets were also removed, as they do not directly relate to the social expression. In addition, a set of domain specific stopwords was identified and incorporated into the standard stopword list. For example, in dataset Obama, words such as 'obama', 'barack obama', 'president obama' and 'POTUS' were filtered out.

Twitter specific tag removal: tweets also contain user mentions, i.e. '@username', and hashtags '

In this step, representative features were extracted from the pre-processed tweets. We employed the bag of words model, in which terms (words or phrases) are considered as features of the tweets and feature scores are determined based on the frequency of each term. We use noun phrases from tweets as the features, which are extracted using the JATE toolkit (Zhang et al., 2008). Noun phrases that exist in less than 0.5% of tweets in each batch (vocabulary threshold $\tau_V$ discussed in Section 4.4.1) were omitted to reduce the sparsity of the feature space.

We employed inverse document frequency (idf) as the representative feature value of a term. $idf_w$ of term $w$ is given as follows:

$$idf_w = \log\left(\frac{N}{df_w}\right)$$

where $N$ is the number of tweets in that batch and $df_w$ (document frequency) is the number of tweets in that batch in which the term $w$ appears.

This subsection demonstrates the first core capability: topic pathway separation based on incremental learning.

Pathway separation occurs as the algorithm incrementally learns from the Twitter stream. Some pathways remain prominent, continuing to receive a significant number of tweets across each topic segment, while others lose popularity and their number of tweets decreases over time.
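Returning to the feature extraction step, the vocabulary threshold and the idf weighting can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: it assumes the terms of each tweet have already been extracted (noun phrase extraction with JATE is not reproduced), it assumes the natural logarithm, and the function names, the toy batch and the threshold value are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def idf_weights(batch: Sequence[Iterable[str]], tau_v: float = 0.005) -> Dict[str, float]:
    """idf of every term that appears in at least a fraction tau_v of the tweets in a batch."""
    n = len(batch)
    if n == 0:
        return {}
    # Document frequency: the number of tweets that contain the term at least once.
    df = Counter(term for tweet in batch for term in set(tweet))
    return {term: math.log(n / count) for term, count in df.items() if count / n >= tau_v}


def feature_vector(tweet_terms: Iterable[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse bag of words vector: retained terms of the tweet mapped to their idf value."""
    return {term: idf[term] for term in set(tweet_terms) if term in idf}


# Hypothetical batch of four tweets, each already reduced to its extracted terms.
batch = [
    ["iran", "nuclear deal"],
    ["iran"],
    ["putin"],
    ["iran", "putin"],
]

# A large threshold is used here only so that the tiny batch shows the filtering.
idf = idf_weights(batch, tau_v=0.3)
print(sorted(idf))                      # ['iran', 'putin']
print(feature_vector(batch[0], idf))    # only 'iran' survives in the first tweet
```

Terms below the threshold never enter the feature space, so a short tweet activates only the few retained terms it contains.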

Six prominent (high volume) topic pathways were identified in dataset Obama ($TP^{x}_{Obama}$, where x is 1-6). Each topic pathway consists of a group of frequent terms that signify the topic of that pathway. These terms are more frequent among the tweets of that pathway compared to the tweets of other pathways.

Table 4.3 presents the top 10 frequent terms of these pathways. Column 3 of this table summarises the frequent terms into a focus area.

Table 4.3: Frequent terms in the prominent topic pathways of $TP_{Obama}$

| Topic pathway | Frequent terms | Focus area |
|---|---|---|
| $TP^{1}_{Obama}$ | putin, russia, ukraine, economy, iran, respect, west, merkel, war, head | Relations with Russia and Ukraine |
| $TP^{2}_{Obama}$ | republican, party, question, tcot, congress, world, politics, cromnibu, gop, american people | Internal Politics relevant to Republican party |
| $TP^{3}_{Obama}$ | iran, israel, sanction, nuke, netanyahu, congress, tcot, irandeal, war, nuclear weapon, nuclear deal | Relations with Iran and Israel |
| $TP^{4}_{Obama}$ | nation, liberal, racism, reason, voter, democrat, consequence, tcot, gruber, pelosi | Internal Politics relevant to Democratic party |
| $TP^{5}_{Obama}$ | obama news, cbs news, obama video, obamacare, sore throat, washington dc, gop, castro, cnn obama, summit | News about Obama |
| $TP^{6}_{Obama}$ | war, afghanistan, isis, power, congress, protection siyra, iranian dissident, iraq | Terrorism, mainly ISIS |

It should be noted that some of the pathways focus on internal political issues ($TP^{2}_{Obama}$, $TP^{4}_{Obama}$) while others are on international political issues ($TP^{1}_{Obama}$, $TP^{3}_{Obama}$, $TP^{6}_{Obama}$). $TP^{5}_{Obama}$ covers discussions about news reports on Obama.

In order to distinguish the salient terms that define a pathway, it is necessary to investigate their presence and prominence in other topic pathways. Table 4.4 tabulates this investigation. It presents the distribution of the top five frequent terms of each prominent topic pathway across the six prominent topic pathways. The rows represent the top five frequent terms of each prominent topic pathway and the columns represent the six topic pathways.

This table highlights that the top five terms of the six prominent pathways have a significant presence in the relevant pathway compared to other pathways. For example, in pathway $TP^{1}_{Obama}$, the top three words putin, russia, and ukraine have more than 80% of their occurrences within that pathway. The term iran has a low percentage in $TP^{1}_{Obama}$ because it has a larger presence in $TP^{3}_{Obama}$.

Table 4.4: Distribution (%) of the top five frequent terms of each prominent topic pathway across the six prominent topic pathways of dataset Obama

| Topic pathway | Term | $TP^{1}_{Obama}$ | $TP^{2}_{Obama}$ | $TP^{3}_{Obama}$ | $TP^{4}_{Obama}$ | $TP^{5}_{Obama}$ | $TP^{6}_{Obama}$ |
|---|---|---|---|---|---|---|---|
| $TP^{1}_{Obama}$ | putin | 97.9 | 0.3 | 0.5 | 0.0 | 0.3 | 0.0 |
| | russia | 85.2 | 0.7 | 9.3 | 0.2 | 0.2 | 1.7 |
| | ukraine | 84.6 | 1.6 | 1.3 | 0.3 | 3.1 | 3.5 |
| | economy | 54.7 | 3.4 | 1.5 | 1.5 | 5.9 | 0.5 |
| | iran | 2.2 | 2.0 | 69.8 | 0.4 | 1.7 | 3.8 |
| $TP^{2}_{Obama}$ | republican | 1.0 | 81.4 | 1.5 | 2.0 | 2.8 | 0.5 |
| | party | 1.7 | 68.5 | 1.1 | 1.7 | 1.7 | 4.4 |
| | question | 3.1 | 67.3 | 4.3 | 0.0 | 0.6 | 7.4 |
| | tcot | 4.1 | 16.8 | 14.3 | 7.6 | 2.3 | 9.2 |
| | congress | 1.5 | 11.1 | 14.7 | 1.2 | 9.0 | 9.4 |
| $TP^{3}_{Obama}$ | iran | 2.2 | 2.0 | 69.8 | 0.4 | 1.7 | 3.8 |
| | israel | 1.8 | 3.6 | 53.9 | 1.8 | 1.7 | 2.7 |
| | sanction | 5.8 | 2.6 | 57.6 | 0.6 | 1.2 | 1.2 |
| | nuke | 2.8 | 1.2 | 70.6 | 0.0 | 0.4 | 0.4 |
| | netanyahu | 1.6 | 3.2 | 16.9 | 0.7 | 3.0 | 1.9 |
| $TP^{4}_{Obama}$ | nation | 0.3 | 1.6 | 0.6 | 90 | 0.9 | 0.9 |
| | liberal | 0.0 | 0.0 | 0.8 | 98.4 | 0.0 | 0.4 |
| | racism | 0.0 | 0.5 | 0.0 | 97.5 | 0.0 | 0.0 |
| | reason | 0.6 | 2.2 | 2.8 | 75 | 0.0 | 3.3 |
| | voter | 0.0 | 0.9 | 0.0 | 96.4 | 0.0 | 0.0 |
| $TP^{5}_{Obama}$ | obama news | 0.5 | 0.2 | 0.6 | 1 | 90.4 | 0.4 |
| | cbs news | 0.0 | 0.0 | 0.8 | 0.0 | 90.5 | 0.8 |
| | obama video | 2.7 | 6.7 | 0.9 | 1.3 | 49.3 | 1.8 |
| | obamacare | 2.5 | 5.7 | 1.3 | 3.2 | 69 | 1.3 |
| | washingtondc | 0.0 | 0.0 | 0.9 | 11.4 | 86 | 0.9 |
| $TP^{6}_{Obama}$ | isis | 1.4 | 1.7 | 1.6 | 0.8 | 2 | 81.4 |
| | war | 7.6 | 2.8 | 10.7 | 2.2 | 1.6 | 56.4 |
| | iraq | 2.3 | 1.9 | 9.1 | 1.1 | 0.4 | 74.9 |
| | siyra | 7 | 19.9 | 21.6 | 0.0 | 0.6 | 37.4 |
| | power | 5.8 | 7.7 | 7.1 | 8.3 | 1.3 | 43.6 |

Figure 4.6 shows the tweet volume of these topic pathways. It shows that different topic pathways peaked at different time periods. For example, in the latter part, the topic pathway $TP^{3}_{Obama}$ becomes popular, reflecting the incidents related to the nuclear deal between USA and Iran.

Figure 4.6: Volume (%) of tweets in six prominent topic pathways of $TP_{Obama}$

Seven prominent topic pathways were identified in dataset Microsoft.

Table 4.5: Frequent terms in the prominent topic pathways of $TP_{Microsoft}$

| Topic pathway | Frequent terms | Focus area |
|---|---|---|
| $TP^{1}_{Microsoft}$ | Google, android, amazon, apple, cyanogen, ibm, work, sony, app, window | Relations with Google |
| $TP^{2}_{Microsoft}$ | Xbox, ps4, update, windows 10, xboxone, game, Cortana, sony, bitcoin, microsoft xbox | Microsoft Xbox related products and services |
| $TP^{3}_{Microsoft}$ | microsoft office, ipad, android, mac, android tablet gmail, windowsphone, google, onenote, microsoft tech, office | Microsoft Office and its compatibility in different devices |
| $TP^{4}_{Microsoft}$ | Windows, version, windows 10, os news, future, microsoft windows, upgrade, microsoft windows10, Week, update | Microsoft Windows |
| $TP^{5}_{Microsoft}$ | Facebook, bing, google, microsoft bing, zdnet, Apple, favour, monumental deal, search result, virtual reality | Relations with Facebook |
| $TP^{6}_{Microsoft}$ | windows phone, app, office, android, bitcoin, onedrive, update, bitcoin payment, tablet, ipad | Windows Phone |
| $TP^{7}_{Microsoft}$ | Apple, google, cloud, amazon, Samsung, microsoft store, bigdata, fight, device, patent | Relations with Apple |

The topic pathway that focuses on Windows has the highest volume in most of the weeks, as the Windows operating system ($TP^{4}_{Microsoft}$) is the most used product of Microsoft. Similarly, the topic pathway that focuses on relations with Google ($TP^{1}_{Microsoft}$) has the highest number of tweets among the pathways that focus on relations with competitors, since Microsoft and Google are direct competitors in several product areas including operating systems, mobile devices, search engines and virtual reality technology.

Topic Coherence

It is important to evaluate the quality of the automatically identified and learned topics based on their interpretability to a human analyst. There have been recent developments in quantitative metrics to evaluate the coherence of learned topics (Röder et al., 2015).

Mimno et al. (2011) have demonstrated such a metric based on the principle that significant terms belonging to a topic are likely to co-occur in the same documents. This topic coherence metric is defined as follows:

$$C(T; V^{T}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v^{T}_{m}, v^{T}_{l}) + 1}{D(v^{T}_{l})}$$

where $V^{T} = (v^{T}_{1}, \dots, v^{T}_{M})$ is the list of the $M$ terms that are most probable (frequent) in topic $T$, $D(v^{T}_{l})$ is the document frequency of the term $v^{T}_{l}$, and $D(v^{T}_{m}, v^{T}_{l})$ is the document frequency of the co-occurrence of terms $v^{T}_{l}$ and $v^{T}_{m}$.

This metric aggregates the term co-occurrence scores of a topic $T$ for the $M$ frequent terms of that topic. Mimno et al. have empirically verified that this metric correlates with the judgement of human analysts.

We have adopted this metric to measure and evaluate topic coherence within individual pathways. Our experiments measure the topic coherence scores of each topic for $M \in [2,$ [...]

Figure 4.8: Topic coherence scores for prominent topic pathways and whole dataset (for baseline) in (a) [...]

This subsection demonstrates the second core capability of the proposed algorithm: capturing the evolution of topic pathways and temporal changes in topic segments within a topic pathway. In order to demonstrate this capability, we selected the topic pathway $TP^{1}_{Microsoft}$, which captures the relationship between Microsoft and Google. It is common knowledge that both these organisations are dynamic and innovative in the technology domain; Twitter activity associated with both entities is therefore equally dynamic, with new subtopics (i.e., short-term topics within the particular topic pathway) being discussed. We demonstrate that the selected topic pathway $TP^{1}_{Microsoft}$ is capable of identifying each new subtopic as it emerges without a loss of focus on the overarching theme of the Microsoft-Google relationship. Thereby, the entire topic pathway maintains focus on the Microsoft-Google relationship, while each segment in the pathway focuses on different subtopics associated with this relationship as they emerge over time. The proposed algorithm's autonomous capability to capture the evolution of a topic pathway in this manner provides an analyst with unique insights into short-term changes within a long-term trend. In this instance, the long-term trend is the relationship with Google and the short-term changes are the subtopics associated with both entities.

We selected nine topic segments from pathway $TP^{1}_{Microsoft}$ for this demonstration. Figure 4.9 presents word clouds for these nine topic segments. Table 4.6 further examines the content of each of the nine topic segments. Word clouds were selected for their intuitive value in visualising fluctuations in the usage and frequency of terms. The primary observation here is that each word cloud has a unique set of frequent terms which are indicative of different subtopics.

In Table 4.6, each row represents a topic segment. The first column is a label to identify each segment (i.e., the first row in Table 4.6, Segment 08/12/2014, corresponds to the first word cloud in Figure 4.9, also labelled Segment 08/12/2014). The second column denotes representative frequent terms for the corresponding segment and the third column denotes the subtopic for each segment. It should be highlighted that, side by side, the second and third columns specify the relationship between the top frequent terms and the subtopic. Although the subtopics change per segment, the main theme (or focus) of the pathway remains the same. The fourth column presents a robust evaluation of the subtopic of each segment, in the form of the titles of corresponding news articles published in mainstream online social media platforms during the same period of time.

Figure 4.9: Word clouds of the nine topic segments selected from $TP^{1}_{Microsoft}$, which represents the Microsoft-Google relationship. Note that the most frequent terms Microsoft and Google were removed from the word clouds for improved clarity.

For example, in Segment 08/12/2014, the discussion is about Microsoft providing MSN apps for Android and iOS platforms. The second column presents representative frequent terms from this segment, the third column presents the subtopic and the fourth column validates both terms and the subtopic with the title of the corresponding news article. Some representative frequent terms are common across consecutive segments, which indicates that some subtopics are discussed over several weeks. For example, frequent terms in week 22/12/2014 such as 'hotel' and 'wifi' appear in the word cloud of week 29/12/2014, but with a relatively lower frequency. Similarly, 'human' and '100 gb' appear in the consecutive segments Segment 09/02/2015 and Segment 16/02/2015.

Representative frequent terms within a segment are related to each other based on the subtopic. Representative frequent terms across segments are related to each other based on the topic pathway evolution. As noted, this topic pathway focuses on the Microsoft-Google relationship, and each segment represents a different subtopic (or phase) of this relationship. Top terms across segments are able to capture this evolving topic pathway, despite differences in subtopics in each segment.

Figure 4.10 presents a contrasting view. It illustrates outcomes from a conventional analysis of a social media platform, which analyses all tweets corresponding to the topic pathway as a single dataset. It only provides a concise view of what is discussed. Unlike with Figure 4.9 and Table 4.6, it is impossible to capture the evolution of a topic pathway or short-term changes in topic segments. The most frequent term (apart from 'Google') is 'android', as it is one of the main competing products for the Windows operating system. Other frequent terms include 'amazon', 'ibm' and 'apple', which are competitors in the same space as Microsoft and Google. Insights gained by an analyst will be severely limited if Twitter activity is explored in this manner.

Table 4.6: Further examination of topic segments illustrated as word clouds in Figure 4.9

| Segment | Representative frequent terms | Subtopic(s) of the segment | Relevant News Article(s) |
|---|---|---|---|
| Segment 08/12/2014 | android, ios, msn app, sport, weather, offering news | Microsoft rolls out a set of MSN applications such as Weather, News and Sports for iOS and Android devices | MSN consumer apps arrive on Android, iOS, Amazon - CNET (11/12/2014) |
| Segment 22/12/2014 | (i) microsoft battle, personal wifi, marriott, hotel (ii) santa | (i) Google and Microsoft step in to oppose the attempts of Marriott Hotels to block personal Wi-Fi inside the hotel (ii) Google and Microsoft provide interactive features to track Santa's journey during Christmas | (i) Microsoft, Google join opposition to hotels' wi-fi blocking - ZDNET (23/12/2014) (ii) Tracking Santa with help from Microsoft, Google - CNET (22/12/2014) |
| Segment 29/12/2014 | windows vulnerability, no fix, bug, windows bug | Google has openly published a Windows 8.1 vulnerability that allows users to get administrator privileges | Google discloses unpatched Windows vulnerability - ZDNET (31/12/2014) |
| Segment 12/01/2015 | criticism, windows, windows bug, microsoft fume, endpoint risk | Microsoft criticised Google for releasing information about security vulnerabilities to the public | Microsoft slams Google for spilling the beans on Windows 8.1 security flaw - ZDNET (12/01/2015) |
| Segment 26/01/2015 | android, 70 mil, cyanogen, android startup, 70 million, invest | Microsoft is taking part in a $ [...] | [...] |

Other topic pathways of both datasets show similar characteristics. Due to space restrictions, details of frequent terms for four prominent topic pathways (two each from the two datasets) are provided in Table 4.7 and Table 4.8. Note that the identifying terms in each pathway ('xbox' in $TP^{2}_{Microsoft}$; 'facebook' in $TP^{5}_{Microsoft}$; 'russia' and 'ukraine' in $TP^{1}_{Obama}$; and 'obama news' in $TP^{5}_{Obama}$) have been excluded to highlight subsequent frequent terms.

TP Microsoft. Note that the most frequent terms, Microsoft and Google, were removed from the word clouds for improved clarity.

Table 4.7: Two topic pathways from

Topic pathway | Frequent terms of topic pathway | Frequent terms of top five most voluminous topic segments
TP Microsoft | update, xboxone, game, windows 10, sony, bitcoin, plan, lumia, ps4, kinect, windows, playstation | bitcoin, windows10, november, sony, playstation | 15/12/2014: game, pandora, apple tv, app, chromecast, xboxone | sony, christmas, ddos attack, playstation, lizardsquad | plan, week, update, ps4, xboxone, lumia | xboxone, ps4, 10 april, playstation4, sony
TP Microsoft | bing, google, zdnet, apple, favour, monumental deal, search result, ganging, yahoo, sense, future, search | bing, search tool, safe preference, monday, graph search, 240 million, built-in search | bing, google, facebooks split, zdnet keyevent, amber alert, motley fool | google, apple, ganging, bing, worse, microsoft service, hotmail, msn, yahoo search | google, future, lead, bing, voice communication, sweden, eurovision, winner | monumental deal, virtual reality, alibaba, sense, google, oculus rift, investor, gamescom

TP Microsoft globally contains frequent words that are indicative of different products related to ‘xbox’, like ‘xboxone’, ‘kinect’ and ‘windows 10’. Also, the word ‘sony’ denotes a competing manufacturer, while ‘ps4’ and ‘playstation’ denote competing products. Notable subtopics include (i) Segment 08/12/2014: Microsoft accepts ‘bitcoin’ as a valid payment method to pay for Xbox games; (ii) Segment 22/12/2014: Xbox and Sony PS4 services are down due to a ‘ddos attack’ by a hacking group called ‘lizardsquad’ on Christmas day.

TP Microsoft has frequent terms ‘bing’, ‘zdnet’ and ‘monumental deal’ that are frequent terms of different segments. In addition, there are other organisation names such as ‘apple’, ‘google’ and ‘yahoo’. Notable subtopics include (i) Segment 08/12/2014: Facebook drops Microsoft Bing in favour of its own search tool; (ii) Segment 08/06/2015: Facebook and Microsoft jointly working to make the Oculus Rift virtual reality headset work with Windows 10 and Xbox.

Table 4.8: Two topic pathways from

Topic pathway | Frequent terms of topic pathway | Frequent terms of top five most voluminous topic segments
TP Obama | putin, economy, iran, respect, war, cuba, west, india, fight, china, weapon, merkel, pressure, europe, sanction, leader | putin, cuba economy, sanction, west, work, crimea, problem, foxnews | economy, dow, gas, foxnews, putin, cuba, head, golf, hawaii | putin, wwiii, merkel, kiev, arms, fight, Moscow, weapon, china, Kerry, respect | putin, weapon, war, merkel, poroshenko, fighting, hollande, minsk summit | putin, merkel, hollande, respect, allies, weapon, sanction, shot, war, west
TP Obama | cbs news, obama video, obamacare, sore throat, gop, cuba, castro, acid reflux, summit, congress, prayer breakfast, iran, hospital, seattle, cnn, abc news | troop, 6year, hawaii, apron, hacking row, cbs news, obamacare, afghanistan | community college, france, seattle, malia obama, union tour, congress, housing move | 1 obamacare, cameron, methane emission, iran, middle class, obama video | slavery, prayer breakfast, isis, crusade, obamas, comparison christianity | castro, cuba, summit, stage, saturday, panama, abc news, historic meeting

TP Obama contains opinions about Obama's handling of the Russia and Ukraine issues. Frequent terms of this pathway contain names such as ‘putin’ (President of Russia) and ‘merkel’ (Chancellor of Germany). Also, the word ‘sanction’ is frequent, showing that people talk about sanctions on Russia. Subtopics include (i) Segment 15/12/2014: Obama's support for new U.S. sanctions on Russia; (ii) Segment 09/02/2015: ahead of the Minsk summit, Obama threatened the Russian President with serious consequences for involvement in the Ukrainian conflict.

TP Obama contains news tweets about Obama often tagged as ‘ […] Segment 05/01/2015: Obama proposes free two years at community colleges, Malia Obama supports anti-cop rap group, Obama to tout Arizona housing initiative; (ii) Segment 12/01/2015: Obama to call for tax increase on

This subsection demonstrates the third core capability of the proposed algorithm: the detection of emerging topic pathways in the Twitter stream. These new topic pathways are loosely related to previously identified pathways. Such new topics can be internal (e.g., new product or service launch) or external (e.g., interaction with a new organisation) to the business entity. Subsequently, some of them fade away as people lose interest, while some topic pathways prevail due to continuing interest.

Figure 4.11: Four new topic pathways emerged in

TPN Microsoft: (frequent terms - windows 10, internet explorer, version, pirate, upgrade) Initiation of this new topic pathway coincides with the week in which Microsoft announced that there would be a new web browser in the Windows 10 operating system. Since then it has had a significant number of tweets that mainly discuss new features in Windows 10. It is interesting to note that this pathway was separated as a new topic despite there being a topic pathway focused on Windows in general. We hypothesise that this is due to differentiating terms (e.g., windows 10, release, future, upgrade). Unlike other new topic pathways, this pathway retains its popularity throughout the time span of the dataset.

TPN Microsoft: (frequent terms - microsoft hololen, world, hologram, future, google glass) This pathway was initiated when Microsoft introduced its virtual reality glasses ‘Hololens’ during the ‘Microsoft Consumer Preview’ on 21/01/2015. After the initial peak it declined to a few tweets a week, except for the week of 27/04/2015, in which the capabilities of Hololens were demonstrated at the Microsoft Build Developer Conference.

TPN Microsoft: (frequent terms - linux, service, windows, python, bigdata) This pathway was initiated when Microsoft announced the incorporation of the Linux operating system in the big data services provided on its Microsoft Azure cloud platform. This move was widely discussed, as Linux is a long-term direct competitor to Windows and this was the first time that Microsoft incorporated Linux as part of its product/service suite.

TPN Microsoft: (frequent terms - salesforce, talk, msft, price, 55b) This pathway is focused on Microsoft's attempted acquisition of the company ‘Salesforce’. It created hype on Twitter as the news emerged and progressed for a couple of weeks. However, it later declined in popularity when the acquisition failed due to the hefty price tag of 55 billion USD (‘55b’) demanded by Salesforce management.

Figure 4.12: Topic coherence of newly emerged topic pathways.

Topic Coherence

Figure 4.12 presents the topic coherence metric for the above-mentioned topic pathways plotted alongside the seven original topic pathways of the […] TPN Microsoft) have higher topic coherence scores than the other prominent pathways. This is because they are focused on topics which emerged as a result of particular events and thus contain more closely related terms that are uniquely used to discuss them. TPN Microsoft has frequent terms shared with the topic pathway focused on Windows in general (TP Microsoft), which results in relatively low coherence scores.

Detection of these new topic pathways highlights the capability of our proposed algorithm to identify previously unseen topics at any stage of the social text stream. This capability is due to the third extension of the extended IKASL text algorithm described in Section 4.4.3, which facilitates learning a new randomly initialised feature map from social media messages that are less similar to the existing topic pathways and spawning new topic pathways when there is a burst of social media messages on a new topic.
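To make the spawning behaviour described above more concrete, the following is a minimal Python sketch and not the extended IKASL implementation itself. It replaces the randomly initialised feature map with a plain cosine similarity between a message and per-pathway term counts. The function names, the `sim_threshold` and `burst_size` parameters and their default values are all assumptions introduced for illustration only.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_messages(messages, pathways, sim_threshold=0.2, burst_size=3):
    """Route tokenised messages to existing pathways or to an unmatched pool.

    messages: list of token lists for one time segment.
    pathways: dict mapping pathway name -> Counter of frequent terms.
    sim_threshold, burst_size: hypothetical values, not taken from the thesis.
    Returns (assignments, unmatched, spawn_new), where spawn_new is True when
    the unmatched pool is large enough to count as a burst on a new topic.
    """
    assignments = {name: [] for name in pathways}
    unmatched = []
    for tokens in messages:
        vec = Counter(tokens)
        best_name, best_sim = None, 0.0
        for name, profile in pathways.items():
            sim = cosine(vec, profile)
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_name is not None and best_sim >= sim_threshold:
            assignments[best_name].append(tokens)
        else:
            # Less similar to every existing pathway: candidate for a new one.
            unmatched.append(tokens)
    spawn_new = len(unmatched) >= burst_size
    return assignments, unmatched, spawn_new
```

In this simplified view, messages about ‘hololens’ that share no terms with the existing pathways accumulate in the unmatched pool, and a new pathway would be spawned once the pool reaches the burst size.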

The aim of this section is to demonstrate the fourth capability: automatic detection of events from topic pathways using event indicators.

As previously shown in Equation 4.7, the event score is determined by combining three event indicators for each topic segment in the topic pathways. The r parameters in Equation 4.7 decide the impact of each event indicator on the final event score. However, we have observed that the frequency and intensity of changes in volume are higher than those of positive and negative sentiment. In order to compensate for this behaviour, r_V is set to a lower value than the others, as follows: r_V = 0.[…], r_PS = 0.45 and r_NS = 0.45. We also employed W = 2 for the calculations of I_V, I_PS and I_NS. τ_e is set to 1.[…] I in the two data sets. The frequent terms that are related to an event are derived by capturing the frequent terms that appear in the topic segment containing the event but do not appear in the frequent terms of the W previous topic segments of the same pathway. We analysed online news articles from major news agencies to verify the significance of the events detected by the algorithm.

The above tabulated results demonstrate and verify the authenticity of events detected by the proposed algorithm. It is interesting to note that these events were captured exclusively based on social expressions on Twitter. It is also interesting to observe that different events were triggered by high values from different indicators. Volume-based indicators are high in the first event of Table 4.9, which is about an event where Microsoft demonstrated the capabilities of its virtual reality product Hololens. As shown in Figure 4.11, a new topic pathway was spawned when Hololens was first introduced by Microsoft on 21/01/2015.

Another example where I_V is high is the third event in Table 4.9, which is due to Microsoft's statement that Windows 10 would be a free upgrade for all users (including those with pirated copies). This statement attracted attention on online social media platforms as it was a surprise move by Microsoft (quite often aggressive about software piracy). This peak in volume can also be seen in Figure 4.11 for TPN Microsoft.

There are several events in both tables that are significantly high in negative sentiment event indicator score. Event 2 in Table 4.9 is the highest, triggered by the event where Xbox Live online services were offline due to a hacker attack on Christmas day.
This event caused issues for the online gaming community, who took to Twitter to criticise Microsoft for its lack of security and sluggish response.

Event 2 in Table 4.10 is highly negative, which was due to, firstly, people condemning the release of nine suspects held in Guantanamo Bay and, secondly, the Charlie Hebdo shooting in the same week. Social commentary combined both incidents, stating that the release of terror suspects can lead to more attacks. This incident is identified from the topic pathway focused on Iran/Israel relations, which is loosely related to the event since most of the released suspects are from the Middle East.

Events with high positive sentiment are comparatively rare. This is largely because the online social media platform Twitter is used to express negative sentiment more often than positive sentiment. It also supports the finding that popular events are often associated with negative sentiment (Thelwall et al., 2011).

This subsection demonstrates the final capability of the proposed platform: automated event detection from social text streams. The incorporation of several event indicators allows us to capture events across diverse domains without having to develop tailored event capture mechanisms for individual applications.

Table 4.9: Top 10 events detected based on event score I for
Columns: Segment; Topic pathway; I; Event indicator scores (weighted I_V, I_PS and I_NS); Frequent terms related to the event; Verification evidence based on news articles.
Topic pathway entries: TPN Microsoft, TP Microsoft, TPN Microsoft, -, TP Microsoft, -, TP Microsoft, TP Microsoft, -, TPN Microsoft

4.8 Chapter Summary

This chapter presented an algorithmic development of the conceptual model proposed in Chapter 3. It consists of two novel algorithms designed to capture insights from fast-paced social data streams. The first is a new unsupervised incremental machine learning algorithm developed by extending the GSOM self-structuring algorithm and the IKASL incremental learning algorithm. It automatically structures a social data stream into topics and extends those across time into topic pathways. It also captures changes in topics over time, as well as distinct new topics as new topic pathways at different points in time. This algorithm was designed to overcome the challenges present in social data with respect to its brevity, unstructuredness and diversity. The second algorithm is a multi-faceted event detection algorithm developed to monitor topic pathways for significant changes in online social behaviours over time and to identify such changes as potential events of interest. The changes in social behaviours were identified using automatic event indicators such as changes in volume, positive sentiment and negative sentiment.

These techniques were trialled using two large Twitter datasets containing 6 months of tweets on two entities: a politician and an organisation. As shown in the experiment results, the topic pathway separation algorithm successfully captured the key topic pathways representing ongoing discussions related to the respective entities. Also, shifts in the discussions represented by new key terms were successfully learned and associated with the relevant topic pathway. Moreover, new distinct topics were automatically captured as new topic pathways. The event detection algorithm monitors those topic pathways and automatically captures significant changes in human behaviour using changes in volume and sentiment. Those captured events were aligned with contemporary news articles that discussed those events.

The next chapter extends the above algorithms into a technology platform to capture insights from online support group discussions.
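As an illustration of the event detection step summarised above, the following Python sketch shows one way the event score and the event-related frequent terms could be computed. It assumes that Equation 4.7 combines the three indicators as a weighted sum, which is an assumption about its exact form. It uses r_PS = r_NS = 0.45 and W = 2 as stated in Section 4.7, and leaves r_V as a required argument. The function names are hypothetical, and the value `r_v=0.1` in the usage lines is only a placeholder.

```python
def event_score(i_v, i_ps, i_ns, r_v, r_ps=0.45, r_ns=0.45):
    """Combine volume, positive and negative sentiment indicators.

    Assumes a weighted-sum form for the event score; r_v must be supplied
    by the caller because it is not fixed here.
    """
    return r_v * i_v + r_ps * i_ps + r_ns * i_ns


def event_terms(segment_terms, current_index, window=2):
    """Frequent terms of the current segment that are absent from the
    frequent terms of the `window` previous segments of the same pathway.

    segment_terms: list of frequent-term lists, ordered by time.
    """
    start = max(0, current_index - window)
    previous = set()
    for terms in segment_terms[start:current_index]:
        previous.update(terms)
    return [t for t in segment_terms[current_index] if t not in previous]


# Illustrative usage with made-up indicator values and a placeholder r_v.
score = event_score(i_v=1.0, i_ps=2.0, i_ns=3.0, r_v=0.1)
pathway = [
    ["windows", "update"],
    ["windows", "xbox"],
    ["windows", "hololens", "hologram"],
]
new_terms = event_terms(pathway, current_index=2)
```

Keeping r_V below the two sentiment weights mirrors the stated intention of damping the more volatile volume indicator; a segment would then be flagged as an event when its score exceeds the threshold τ_e.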

Table 4.10: Two topic pathways from
Columns: Segment; Topic pathway; I; Event indicator scores (weighted I_V, I_PS and I_NS); Frequent terms related to the event; Verification evidence based on news articles.
Topic pathway entries: -, TP Obama, TP Obama, TP Obama, TP Obama, -, -, TP Obama, TP Obama, TP Obama

Chapter 5

Expatiation from algorithm to technology platform

I alone cannot change the world, but I can cast a stone across the waters to create many ripples - Mother Teresa

This chapter presents the advancement of the conceptual model for social behaviour understanding proposed in Chapter 3, and the new incremental machine learning algorithms, based on the principles of self-structuring artificial intelligence, proposed in Chapter 4, into a functional technology platform that delivers on the ambition of common good and contribution to human society. More specifically, the focus of social good is on patient-centred healthcare and the role of online support groups (OSG).

OSG are social forums that provide peer support, mainly for physical and mental health related issues. Unlike fast-paced mainstream social media platforms, OSG are relatively low-volume social data streams: the bigger OSG generate a few hundred social media messages daily, compared to millions on Twitter. The participants of these platforms engage in longer discussions, and such discussions are mainly focused around the main theme of the OSG. Engaging in longer discussions leads to the formation of stronger ties between OSG members, who support each other to cope with the hardships of their disease conditions. Because of these stronger ties, participants tend to frequently make deeper self-disclosures, which include information about their disease conditions and deeper emotions related to their wellbeing.

The proposed technology platform operates in multiple stages to better facilitate the different information needs of its stakeholder groups: consumers, researchers and health professionals. It extracts information encapsulated in free-text discussions of OSG using a suite of machine learning, natural language processing and emotion analysis techniques, and automatically generates individual profiles for each OSG user, which enriches the information retrieval process.

The chapter is organised as follows: Section 5.1 describes an OSG, its purpose and functions, and Section 5.2 defines the stakeholder groups of OSG. Section 5.3 delineates limitations of OSG and Section 5.4 presents the proposed technology platform that addresses such limitations and advances the role of OSG in understanding social behaviours. Finally, Section 5.7 presents a statistical evaluation of the technology platform, followed by Section 5.8, which demonstrates its capabilities on two high-volume OSG.

Health-related information seeking has become one of the major uses of the internet. Surveys reported that in 2010, 80% of internet users (59% of all adults) in the United States had looked online for health-related information such as specific diseases and treatment options (Fox, 2011b). Moreover, 34% of internet users have read online resources that contain consumer-generated content about health issues, and 18% of internet users have searched the internet to find others with similar diseases or health concerns (Fox, 2011b, 2013).

One of the growing sources of consumer-generated online health information is Online Support Groups (OSG), which are also known as online health communities and health forums. OSG are virtual communities that are established to discuss or exchange knowledge about health-related topics. OSG are formed on many different forms of online social media platforms such as discussion forums, blogs, wikis, groups in mainstream social media platforms (e.g., Facebook groups) and Q/A websites (e.g., Yahoo! Answers). While there are a few professional-led OSG, most OSG are peer-to-peer and moderated by a few self-appointed senior users of the OSG. It is reported that seeking such peer-to-peer support is prominent among patients living with chronic illnesses such as high blood pressure, diabetes, heart or lung problems and cancer (Fox, 2011a).

OSG discussions are organised as discussion threads, where each thread starts with a question, comment or experience about the corresponding individual's health concerns. Other OSG members (patients, partners of patients or other caregivers) respond to such concerns and thereby create discussion threads.

Figure 5.1.a shows the home page of a popular OSG, patient.info. It contains a free-text search and a list of high-level topics (e.g., Mental health and Women's health). Figure 5.1.b shows a sample OSG thread where the initiating author starts with the title ‘stomach pain’ and a description of his experience, which includes demographics of the author as well as the symptoms that he encountered. Figure 5.1.b also shows four subsequent posts (out of the six replies in the thread), in which the first, second and fourth seem to be advice and the third is further information from the initiating author.

5.2 Users of Online Support Groups

OSG are contributed to by consumers (e.g., patients or caregivers), health professionals and health researchers, who have been denoted as the three main user groups of Medicine 2.0 applications (Eysenbach, 2008). Figure 5.2 highlights the activities and interactions of these user groups.

http://answers.yahoo.com
http://patient.info/forums

OSG are mainly used by patients and caregivers to fulfil some of their social needs discussed in Section 3.2.2, such as belongingness/affiliation, achievement and power over others. These were achieved using social behaviours such as self-disclosure and social comparison.

Functionally, motivations to participate can be categorised into three: (i) seek informational support, (ii) seek emotional support and (iii) provide support.

Figure 5.2: Stakeholders of Online Support Groups (OSG).

Seeking informational support

Informational support is sought for information related to a particular health condition. Individuals first self-disclose information related to the health condition of interest and then either ask specific questions or ask for similar experiences. The objective of seeking informational support is two-fold. The first is to gain knowledge about the health condition. The second is to compare (social comparison) their situation to similar situations of others. Such comparisons enable individuals to get an understanding of their situation compared to others and also, as discussed in Section 3.2.4, downward social comparisons reduce distress related to the health conditions by knowing that others have similar or even worse conditions/sufferings.

Informational support seeking happens at three different stages of the healthcare process: initial diagnosis, treatment decision making and post-treatment quality of life (Hu et al., 2012; Ziebland and Wyke, 2012).

Information seeking in initial diagnosis is mainly to re-affirm the diagnosis made by the clinician. It is reported that comparison of one's own symptoms and disease conditions against others with a similar diagnosis helps to boost confidence in the diagnosis as well as in the clinician (Ziebland and Wyke, 2012; Mazanderani et al., 2012).

The treatment decision making process is another key stage where people seek informational support (Huber et al., 2011; Zhang, Bantum, Owen, Bakken and Elhadad, 2017;

Seeking emotional support

Emotional support is sought for emotional issues related to disease conditions. Emotional support attempts to satisfy the social need of belonging by providing an understanding, supportive and empathetic audience to whom users can express their emotional issues related to the disease conditions. OSG members provide supportive, encouraging and positive responses to such issues and thereby reduce the emotional burden of the disease conditions (Tanis, 2008; Bar-Lev, 2008).

Another key problem specifically faced by people with chronic illnesses is the disconnection or isolation from the community (work or living) of which they were once part (Williams, 1984; Bury, 1982). Some individuals may even feel embarrassed and stigmatised by their illness and afraid to face their living community. For such individuals, contributing to OSG provides a sense of belonging to a community that consists of others in similar circumstances, which helps to reduce the feelings of anxiety and stress resulting from isolation from the community (Bar-Lev, 2008; Harvey et al., 2007).

Moreover, it is reported that individuals with chronic health issues reduce their anxiety by comparing their situation to others in similar circumstances (Frost and Massagli, 2008). Bender (2011) explains this phenomenon using the theory of social comparison processes (Festinger, 1954), which states that during uncertain situations individuals tend to seek others in similar circumstances to compare their behaviours and abilities.

Providing support

One of the vital elements of the OSG ecosystem is the voluntary support provided by peer OSG users. Chiu et al. (2006) argue that willingness to share knowledge to support peers is key to the fostering of virtual communities such as OSG. Providing support satisfies needs such as a sense of belongingness to the community as well as a sense of achievement for supporting others. Also, it may give a sense of power over others through influencing them. The following are the ways in which OSG users provide peer support.

• Answer direct questions: Most of the direct questions are on informational support, discussed in Section 5.2.1. Users answer such direct information requests based on their experience with similar health concerns. A study on users in a diabetes OSG (Yan Zhang, 2016) found that users often answer those questions because they are confident about their understanding of the illness. Also, answering questions made the users feel proud or empowered as a contributor and a mentor to the OSG (Yan Zhang, 2016; Kummervold et al., 2002).

• Self-disclosure: Users often self-disclose their story of coping with health concerns as support for other users facing similar situations. Such disclosures help to find important clues to cope with the situation, and they also serve as emotional support for the receiving users, as they tend to feel that they belong to a social group with similar health concerns (Høybye et al., 2005; Ziebland and Wyke, 2012).

Support groups (both face-to-face and online) often have different levels of professional support, which vary from professional-led to patient-led (Shepherd et al., 1999). Professional-led support groups are often small closed groups (by invitation only) which facilitate support for certain patient groups of a particular healthcare organisation. Such groups are mediated and quality controlled by a board of health professionals. While such quality control reduces the risk of spreading misinformation, researchers have found that excessive professional involvement tends to reduce patients supporting each other (Kurtz, 1990). In addition to mediation, professional therapists are often involved in OSG to provide emotional support to patients and caregivers.

Most of the current popular OSG are patient-led, with no designated professional body to mediate or quality control. However, it is observed that health professionals voluntarily join the OSG as members and provide their opinions as informational and emotional support to the other OSG members. In addition, some health professionals refer to relevant OSG discussions to understand patient concerns related to disease conditions.

OSG contain unsolicited accounts of first-person experiences, which are a rich source of information for medical researchers studying different aspects of patient experience (Robinson, 2001). There are two main avenues by which OSG are used by researchers.

In the first approach, medical researchers use OSG to recruit potential candidates to administer interviews or questionnaires for various research studies. OSG provide access to user groups aggregated by similar health concerns, which researchers use to send invitations to participate in various studies. One of the key advantages of OSG is that they provide access to concentrated groups that are otherwise difficult to reach with conventional methods, which can be due to either the rareness of the inclusion criteria or the socially stigmatised nature of the illness (Wright, 2006). Moreover, OSG allow researchers to reach out to a large group of potential candidates within a short period of time. This approach uses OSG only as a recruitment platform for conventional survey-based methods, without paying attention to its peer-to-peer support ecosystem or the content of the discussions.

In the second approach, medical researchers retrospectively analyse OSG discussions of selected groups with specific selection criteria. This approach is less intervening

5.3 Limitations in Finding Relevant Information

As discussed in Section 5.2.1, OSG play an important role as an online platform that enables users to engage with virtual peers who have (or had) similar health concerns. One of the key elements of these engagements is finding users with similar health concerns. Based on a survey of diabetes OSG users, Yan Zhang (2016) points out that users look for similar or related users mainly based on similar illness conditions and comparable demographics. In the existing OSG infrastructure, users achieve this task by manually scrutinising the posts in OSG threads for clues about the illness conditions and demographics of the other users.

Moreover, as shown in Section 5.2.3, researchers look for OSG users with particular selection criteria (e.g., a high-risk prostate cancer patient who has gone through robotic-assisted laparoscopic radical prostatectomy for prostate removal), where such users were either contacted to administer questionnaires or their OSG discussions were analysed thematically, depending on the research methodology. Such a selection process is also carried out by manually going through the posts of OSG users while evaluating each user's eligibility based on the self-stated information in OSG posts.

However, with the widespread use of the internet and the popularity of online health information seeking across individuals from all walks of life, OSG participation is growing rapidly. Table 5.1 presents several usage statistics of three large OSG. It shows that those OSG consist of over 100,000 registered users (healthboards has over 1,000,000 registered users). Moreover, these OSG consist of over 100,000 discussion threads and more than 1,000,000 posts. Therefore, it is impossible to manually scrutinise these colossal text corpora or go through a large number of users to find the information that either users or researchers are looking for.

Table 5.1: Population and volume statistics of three large OSG. Note that this information is as of 02/04/2018.
Columns: OSG; registered users; threads; posts
Rows: healthboards; healingwell; patient.info

[…] patient.info, healingwell, and healthboards respectively. As shown, the topic hierarchies in OSG are fairly shallow, often limited to two levels. Hence, narrowing down by topic can still leave a large number of posts for manual analysis. Moreover, the topic structures of the OSG are user generated, often by the moderators as per popular requests from users. Therefore, the topic hierarchy is developed incrementally, and thus can be overlapping and sometimes confusing. For example, in healthboards (Figure 5.3.b), the second-level topics under ‘Digestive & Bowel’ can be overlapping, as ‘IBS’ can fall under both ‘Digestive Disorders’ and ‘Bowel Disorders’.

Low precision: in an information retrieval task, precision is the fraction of the retrieved documents that are relevant (Manning et al., 2008). Topic-based navigation ends up with a substantial number of OSG posts that are of little relevance to the health concerns of the users. Also, as demonstrated in the above example, free-text search pulls out OSG posts that have one or a few relevant terms, which may not be relevant. For example, there may be OSG posts with the terms ‘old’ and ‘woman’ but with different symptom mentions of disease conditions, and thus not relevant to the health concerns in the query.

Low reliability in recall: in an information retrieval task, recall is the fraction of relevant documents that are retrieved (Manning et al., 2008). The topic-based navigation approach loses recall when users post relevant content under different topics, which can happen due to confusing topic structures or users misunderstanding the scope of a topic. Free-text search loses recall mainly because it is not capable of identifying the different types of information present in the query or OSG posts. For example, OSG posts from a user of age 41 with similar health concerns are a very close match for the above given query, but will not be given a higher relevance in the free-text search. Moreover, users only mention a portion of information in a single OSG post. For example, a user might have mentioned relevant health concerns in one OSG post and relevant demographics in another, but the free-text search cannot automatically aggregate them and show them as a relevant result.
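The two measures defined above can be computed directly from the set of retrieved posts and the set of relevant posts. The following Python sketch is illustrative only; the function name and the post identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    Either measure is reported as 0.0 when its denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: a free-text search returns four posts, two of which
# are relevant, while one relevant post is missed altogether.
precision, recall = precision_recall(
    retrieved=["p1", "p2", "p3", "p4"],
    relevant=["p2", "p4", "p7"],
)
```

In this example half of the retrieved posts are relevant (precision 0.5) and two of the three relevant posts are found (recall about 0.67), which corresponds to the two failure modes described above.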


The above issues are mainly due to insensitivity towards different types of information (e.g., demographic, clinical). These types of information are crucial to understanding the context of OSG posts as well as the search criteria. Moreover, the information encapsulated in OSG posts needs to be aggregated by author to get a more comprehensive picture of each author. The next section proposes an information structuring model which can overcome the limitations of the existing information retrieval capabilities in OSG.

5.4 Proposed Technology Platform

Figure 5.4: The proposed multi-stage self-structuring AI platform and its benefits for the three stakeholder groups in OSG.

Figure 5.4 presents the proposed multi-stage self-structuring AI platform, which is an implementation of the self-structuring artificial intelligence platform proposed in Section 3.3. It employs machine learning and natural language based information extraction methods to transform the free-text corpus of discussion threads into a structured multi-dimensional information database pivoted around each user, to better facilitate the information needs of the three groups of stakeholders in OSG. The stages of this framework are as follows:

i. The OSG posts are aggregated by the author ID of each OSG user. As discussed, OSG users include bits and pieces of information in different posts, posted across multiple discussion threads. Therefore, aggregating the posts by each author enables more comprehensive information extraction from each user.

ii. Natural language processing and machine learning based techniques are employed, with the support of clinical ontologies and an emotional thesaurus, to extract and infer information about each OSG user based on the free-text content of their aggregated OSG posts.
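Stage i above can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures, not the platform's implementation: each post is taken to be a dictionary with the hypothetical keys `author_id`, `posted_at` (an ISO date string) and `text`.

```python
from collections import defaultdict


def aggregate_posts_by_author(posts):
    """Group post texts by author ID, in chronological order.

    posts: iterable of dicts with the assumed keys 'author_id',
           'posted_at' (ISO date string) and 'text'.
    Returns a dict mapping author_id -> one concatenated text per author.
    """
    by_author = defaultdict(list)
    for post in sorted(posts, key=lambda p: p["posted_at"]):
        by_author[post["author_id"]].append(post["text"])
    return {author: " ".join(texts) for author, texts in by_author.items()}


# Hypothetical posts: one author mentions demographics in one thread and
# symptoms in another; aggregation brings both into a single profile text.
posts = [
    {"author_id": "u1", "posted_at": "2018-03-02", "text": "The pain started last week."},
    {"author_id": "u2", "posted_at": "2018-03-01", "text": "My mother was diagnosed recently."},
    {"author_id": "u1", "posted_at": "2018-02-20", "text": "I am a 41 year old woman."},
]
profiles = aggregate_posts_by_author(posts)
```

Aggregating before extraction means that the later stages operate on one document per user, so information scattered across several threads can be read together.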


This section presents the information retrieval techniques developed for stages ii and iii of the proposed information structuring platform discussed in the previous section (see Figure 5.4). This section is organised as follows: Section 5.5.1 presents narrative type extraction, Section 5.5.2 presents age extraction, Section 5.5.3 presents gender extraction, Section 5.5.4 presents medical concept extraction and Section 5.5.5 presents emotion extraction.

Online social platforms such as OSG are used to seek, provide and exchange information (Zhang et al., 2015; Oh, 2012). As discussed in Section 5.2.1, individuals share similar experiences and also provide advice to their peers in the OSG.

Differentiating posts as experience sharing or advice is an important task. Experience sharing posts are important as they contain self-disclosed information on patient factors (Sadovykh et al., 2015), which includes demographic, clinical and emotional information during various stages of the patient timeline. This information can be employed for patient profiling. On the other hand, advice can be used to understand the nature of peer advice in OSG.

The OSG community consists of not only patients but also a significant proportion of caregivers who post on behalf of the patients they look after. Hence, the expressions of experience in OSG can be further categorised as an expression of one's own experience or an expression of the experience of someone else (often a family member, close relative or friend). Therefore, in this work the posts are grouped into three narrative type categories: (i) experience: first-person, (ii) experience: second-person, and (iii) advice.

Detecting experience-expressing posts has been studied in various application domains. Park et al. (2010) developed a classifier based on a set of linguistic features to detect sentences that discuss human activities in weblogs. Liu et al. (2015) developed a similar classifier to extract sentences that express experience in online health forums (OSG) using linguistic features such as pronouns, verb tense and modality of the verb.

Nguyen (2012) employed both bag-of-words (BoW) and linguistic features to detect experience-expressing posts and identified that the most discriminative BoW features are ‘I’ and ‘my’ and the most discriminative linguistic feature is first-person pronouns (‘I’ and ‘my’). This is because users frequently use first-person pronouns in expressions of experience.

In this work, the narrative type is determined based on the main subject of each OSG post. The key intuition for this approach is that different types of nouns and pronouns are used as the subject in different narrative types. For example, the first column of Table 5.2 shows three sample posts taken from an OSG. The first post is a patient expressing his experience, and the main subject is a first-person pronoun. In the second post, a son is writing about his mother, so the main subjects are mother and third-person pronoun […]

1. experience first-person: if the main subject is a first-person pronoun (e.g., I, my).
2. experience second-person: if the main subject represents a relationship (e.g., mother, friend, son) to the narrator of the OSG post.
3. advice: if the main subject is a second-person pronoun (e.g., you).

The key challenge of this approach is that OSG posts sometimes contain several nouns and pronouns as the subject of different sentences. For example, patients expressing their experience might mention their family history using words such as mother and third-person pronouns like her. Moreover, third-person pronouns are often used to mention other people as well (e.g., I went to see GP and he prescribed ...).

Table 5.2: Examples for narrative type resolution from OSG posts

| Post | Pronoun resolved post | Human mention nouns | Narrative type |
|---|---|---|---|
| My doctor ... yesterday. He ... daily. I took it ... and ... I felt ... Is this normal ...? I was ... because ... I haven’t ... and I was ... with my fiance. I got paranoid about her ... even though I knew ... she ... | My doctor ... yesterday. <doctor> ... daily. I took it ... and ... I felt ... Is this normal ...? I was ... because ... I haven’t ... and I was ... my fiance. I got paranoid about <fiance> ... even though I knew ... <fiance> ... | I: 7; doctor: 2; fiance: 2; prominent noun: ‘I’ | experience: first-person |
| My mother is diagnosed ... She is suffering with ... I know that ... her heart. An endocrinologist suggested her to ... Also, he recommended her to ... But now she had ... She is suffering ... | My mother is diagnosed ... <mother> is suffering with ... I know ... <mother> heart. An endocrinologist suggested <mother> to ... Also, <endocrinologist> recommended <mother> to ... But now <mother> had ... <mother> is suffering ... | mother: 5; endocrinologist: 2; I: 1; prominent noun: ‘mother’ | experience: second-person |
| You should consider ... such ... . You may even want to ... . Try to ..., but if you’ve ... your other options you may ... | You should consider ... such ... . You may even want to ... . Try to ..., but if you’ve ... your other options you may ... | you: 5; prominent noun: ‘you’ | advice |

In order to overcome these ambiguities, the gender-specific third-person pronouns in OSG posts are first replaced with the relevant nouns (e.g., she → mother). For this task, we employed a pronominal anaphora resolution (Mitkov, 2014) algorithm which predicts the relevant antecedents of third-person pronouns.

Pronominal anaphora resolution was initially attempted using the JavaRAP tool (Qiu et al., 2004), which implements an algorithm presented by Lappin and Leass (1994). However, this approach relies on parsing text to identify its structural elements and grammatical roles, which was found to be less effective for OSG posts because, as discussed in Section 2.7, they are not written conforming to the formal rules of grammar. Hence, a relatively simple rule-based pronoun resolution algorithm, CogNIAC (Baldwin, 1997), was selected. CogNIAC uses a set of rules to assign the most probable antecedent to each pronoun.

For this task, a partial implementation of CogNIAC is used to resolve pronouns to their antecedents. Note that the pronoun ‘it’ is ignored as it is irrelevant to the given task. The second column of Table 5.2 shows the pronoun resolved output, where each pronoun is replaced with the relevant noun.

Once the pronouns are resolved it is easier to determine the key narrative type of the OSG post. For example, pronoun resolution in the second OSG post in Table 5.2 has aggregated ‘mother’ and ‘she’, which makes the subject ‘mother’ more evident and shows that the narrative of that post is experience: second-person. Column three of Table 5.2 shows a list of human subjects identified from the sample posts and column four shows the predicted narrative type.

Note that the pronoun resolved output of the OSG posts is further used in Section 5.5.3 for determining the gender of the patient.
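The prominent-subject rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a post whose pronouns have already been resolved, the term lists are small hypothetical examples rather than the lexicons used in this work, only ‘I’ is counted as a first-person subject (matching the counts in Table 5.2), and `narrative_type` is a name introduced here.

```python
import re
from collections import Counter

# Illustrative term lists (assumptions, not the lexicons used in the thesis).
FIRST_PERSON = {"i"}
SECOND_PERSON = {"you"}
RELATIONSHIPS = {"mother", "father", "son", "daughter", "wife",
                 "husband", "friend", "fiance"}


def narrative_type(resolved_post: str) -> str:
    """Return the narrative type implied by the most frequent subject term."""
    tokens = re.findall(r"[a-z]+", resolved_post.lower())
    tracked = FIRST_PERSON | SECOND_PERSON | RELATIONSHIPS
    counts = Counter(tok for tok in tokens if tok in tracked)
    if not counts:
        return "unknown"  # no tracked subject term found
    prominent, _ = counts.most_common(1)[0]
    if prominent in FIRST_PERSON:
        return "experience: first-person"
    if prominent in RELATIONSHIPS:
        return "experience: second-person"
    return "advice"


print(narrative_type("My mother is diagnosed. <mother> is suffering. I know <mother> heart."))
```

Counting terms only after pronoun resolution is what lets ‘mother’ outweigh the single ‘I’ in the second sample post.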

Age is an important feature for patient profiling because medical information such as symptoms is interpreted differently for different age groups. Patients often mention their age within the text of the forum post. However, such mentions are highly diverse, unstructured and often expressed using shortened forms of English terms. Some sample age mentions are shown in column 1 of Table 5.3. Also, they need to be differentiated from other number mentions such as duration, dose, weight etc.

Table 5.3: Positive and negative samples of age related phrases found in forum posts

| Age phrases (positive examples) | Non-age phrases (negative examples) |
|---|---|
| when i was in my 20s | My SVT started about 10 years ago |
| In my 41 years of life | I had a 48 hr monitor |
| I am 42 yrs old | I am using atenolol 50 mg |
| my 25 year old son | doc prescribes 10 tablets 1 twice a day |
| operated on at 10 days old | Currently I weigh 16 3/4 stone |

Previous literature on age extraction from forums is limited, mainly due to its challenging nature. Liu et al. (2014) focused on standard age mentions such as ‘I am 35 years old’. Kim et al. (2013) employed regular expressions such as age+number to extract age. These approaches achieve higher precision but, as shown in Table 5.3, age mentions are diverse, so these approaches would overlook a significant number of age mentions, resulting in low recall.

Zhu et al. (2012) look for clue words such as years, old, aged etc. that appear within a two-word distance of each two-digit number mention, and such mentions are extracted as the age of the patient. In addition, they look for clue words such as teenager, toddler and child to approximate the age of the patient. This approach would result in high recall; however, [...]

Potential pattern identiﬁcation

In this initial step, a regular expression based pattern is used to identify text chunks that are potential age mentions.

Firstly, each post is broken into sentences using the sentence splitter in the OpenNLP tool (Baldridge, 2005). Subsequently, the regular expression (\d{1,2}[a-zA-Z]{0,5}) is used to capture two-digit numbers mentioned in the text. Note that the above regular expression is designed to accommodate numbers that co-exist with several letters, such as 20s and 34yrs.

A text chunk of five words before the detected two-digit number and five words after it is extracted as a possible age mention. Note that the start and end of sentences are padded with symbols to accommodate age mentions at the edges of a sentence.

Let the set of OSG posts of a user $A$ be $\{p\}_A$. For each post $p$, a set of text chunks $\{t\}$ with possible age mentions is identified ($\{p\}_A \leftarrow \{t\}_A$).

Feature extraction

In this step the text chunks extracted in the previous step are transformed into feature vectors ($\{t\}_A \leftarrow \{f\}_A$). These features are extracted to differentiate age mentions from others (e.g., drug dose, clinical factors).

As shown in Figure 5.6, the text chunk is first divided into three segments (L, M, and R) and different features are extracted separately from each segment.
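The chunk identification and segmentation steps can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes whitespace tokenisation, applies the regular expression to whole tokens, uses a hypothetical `<pad>` symbol for sentence padding, and takes segment M to be the matched number token with L and R as the five words on either side (an interpretation of Figure 5.6). The function name `candidate_chunks` is introduced here.

```python
import re

# The pattern from the text, applied to whole tokens (an assumption).
AGE_NUMBER = re.compile(r"\d{1,2}[a-zA-Z]{0,5}")
PAD = "<pad>"   # hypothetical padding symbol
WINDOW = 5      # five words before and after the number


def candidate_chunks(sentence: str) -> list:
    """Return one {L, M, R} chunk per candidate number in the sentence."""
    words = sentence.split()
    padded = [PAD] * WINDOW + words + [PAD] * WINDOW
    chunks = []
    for i, word in enumerate(words):
        token = word.strip(".,;:!?()\"'")
        if AGE_NUMBER.fullmatch(token):
            j = i + WINDOW  # position of the token in the padded list
            chunks.append({
                "L": padded[j - WINDOW:j],
                "M": token,
                "R": padded[j + 1:j + 1 + WINDOW],
            })
    return chunks


print(candidate_chunks("I am 42 yrs old"))
```

Padding both ends keeps every chunk the same length, so a mention at the start or end of a sentence still yields complete L and R segments.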

Figure 5.6: A text chunk and its segments L, M, and R

We engineered 29 features that are capable of differentiating age related text chunks from others. Table 5.4 presents a sample of the extracted features. Note that the features that are extracted from multiple segments are included in the feature vector as multiple features (one for each segment).

Table 5.4: A sample of features extracted from a text chunk

| Feature | Applied segments | Representing terms |
|---|---|---|
| First-person pronouns | L | I, Im, Iam |
| Possessive pronouns | L | my, our |
| Family relationship mentions | L, R | mother, mom, father, dad, brother, sister, son, daughter |
| State of being verbs | L | am, is, was, were, will be |
| Age related prepositions | L | at, until, around, under, about, abt |
| Old mentions | M, R | old, yo, y/o |
| Year mentions | M, R | year, years, yo, yrs, y/o, s |
| Time mentions | M, R | days, hours, hrs, minutes, min, weeks, wks |
| Dose mentions | M, R | mg, dose, tablets, mgs, micrograms, ug |

Most of these features are constructed to capture age mentions. For example, first-person pronouns are often used before an age mention. Some other features, such as dose and time mentions, are added to differentiate other key classes such as dosage (e.g., atenolol 50 mg) and duration (e.g., 10 years ago).
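As an illustration of how the features in Table 5.4 could be turned into a feature vector, the sketch below builds one binary feature per (feature, segment) pair. It covers only the sample features listed in the table, not the full set of 29, and makes several assumptions: a chunk is a dictionary with segments L, M and R, the M segment is reduced to the letters attached to the number (e.g., ‘s’ in ‘20s’), and the two-word term ‘will be’ is omitted because matching is done on single words. The names `FEATURES` and `extract_features` are introduced here.

```python
import re

# (segments the feature applies to, representing terms), following Table 5.4.
FEATURES = {
    "first_person_pronoun": (("L",), {"i", "im", "iam"}),
    "possessive_pronoun": (("L",), {"my", "our"}),
    "family_relationship": (("L", "R"), {"mother", "mom", "father", "dad",
                                         "brother", "sister", "son", "daughter"}),
    "state_of_being_verb": (("L",), {"am", "is", "was", "were"}),
    "age_preposition": (("L",), {"at", "until", "around", "under", "about", "abt"}),
    "old_mention": (("M", "R"), {"old", "yo", "y/o"}),
    "year_mention": (("M", "R"), {"year", "years", "yo", "yrs", "y/o", "s"}),
    "time_mention": (("M", "R"), {"days", "hours", "hrs", "minutes", "min",
                                  "weeks", "wks"}),
    "dose_mention": (("M", "R"), {"mg", "dose", "tablets", "mgs", "micrograms", "ug"}),
}


def extract_features(chunk: dict) -> dict:
    """Map a chunk {L, M, R} to binary features, one per feature and segment."""
    segments = {
        "L": [w.lower().strip(".,;:!?") for w in chunk["L"]],
        # Keep only the letters attached to the number, e.g. "20s" -> "s".
        "M": [re.sub(r"^\d+", "", chunk["M"].lower())],
        "R": [w.lower().strip(".,;:!?") for w in chunk["R"]],
    }
    return {
        f"{name}_{seg}": int(any(w in terms for w in segments[seg]))
        for name, (applied, terms) in FEATURES.items()
        for seg in applied
    }


chunk = {"L": ["I", "am"], "M": "42", "R": ["yrs", "old"]}
print(extract_features(chunk))
```

A feature applied to two segments appears twice in the output (for example `year_mention_M` and `year_mention_R`), which mirrors the note above that multi-segment features are included once per segment.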

Classifier

Once the features were extracted from the text chunks, we employed a classifier to identify the chunks that are most likely to be age mentions and to assign a confidence probability. To train this classifier we manually labelled 2,212 text chunks identified in the first step of this process. We labelled each chunk as ‘age’ or ‘other’ based on whether it is an actual age mention or not. The labelled dataset contains 1,186 ‘age’ and 1,026 ‘other’ samples.

The usual techniques for training a predictive classifier are Naïve Bayes, Support Vector Machines (SVM), Random Forest, and Multilayer Perceptron (MLP). We excluded MLP as the dataset is small, which could lead to overfitted MLP models. Naïve Bayes, SVM and Random Forest classifiers are trained using the feature vectors of the above labelled dataset. The classifier implementations in the WEKA (Hall et al., 2009) data mining toolkit are employed for this task. Each classifier is evaluated using 5-fold cross validation and [...] ($c(t)$) for each text chunk to be an age mention, and based on that, the text chunks that have higher confidence of being an age mention are selected for each author $A$:

$$\{t_a\}_A = \{\, t_a : t \in \{t\}_A \text{ and } c(t) > \tau \,\} \qquad (5.1)$$

where $\tau$ is a confidence threshold, which is set to 0.7.

Aggregate and resolve age

This step aims to resolve the age of each individual author from the selected high-confidence age mention text chunks $\{t_a\}_A$ from the previous step. Note that the numerical value of an age mention might differ from the actual age due to (i) an age mention of a past event, (ii) an age mention of a different person, or (iii) a misclassified text chunk. Hence, a weighted majority voting technique is employed to resolve the most probable age of each user $A$.

First, the numerical value of each age mention ($D(t_a)$) is determined. Note that age mentions are parsed to get the age value normalised to years (e.g., 6 months → 0.5). Next, the tense and subject type of each age mention are determined. Text chunks in the present tense are marked as present, and those in other tenses, or where the tense is not available, are marked as other. The subject type of the age mention is determined based on the subject in the chunk. Text chunks with first-person subjects are marked as first-person, and those with other subject types, or where the subject is not available, are marked as other.

The rationale for these two parameters is that if the chunk is in the present tense then it is more likely to be the current age of the author. Also, if the subject is first-person then it is more likely to be about the author. Therefore, a weight boost of 0.25 is added if the age mention is in the present tense and a weight boost of 0.25 is added if it has a first-person subject. The aggregated confidence value $C(a)$ is determined as follows:

$$C(a) = \sum_{t_a \in \{t_a\}_A} I\big(a \le D(t_a) \le a + \Delta a\big) \times [\ldots]^{-(D(t_a) - a)} \times W(t_a) \qquad (5.2)$$

where $I$ is the indicator function and the weight $W(t_a)$ is determined as $1 + 0.25 \times I(t_a \text{ present tense}) + 0.25 \times I(t_a \text{ first-person})$.

Equation 5.2 is formulated in a way that an age value $D(t_a)$ contributes to the confidence of $a$ to $a + \Delta a$. This is because users often contribute to a particular OSG over a period of time. Hence, an age mention of a user can be different in older posts. For example, a user might mention an age of ‘56’ two years ago and now mention that he is ‘58’. In such cases, the age mention of ‘56’ should contribute to the age value of 58. Note that $\Delta a$ is set to 2, hypothesising that a user often contributes to an OSG for a period of 2 years.

The aggregated confidence value is determined for all the age values $\{a\}_A = \{ D(t_a) \;\forall\; t_a \in \{t_a\}_A \}$ mentioned for the user $A$, and the age value of the user is resolved based on the highest aggregated confidence value:

$$a_{resolved} = \arg\max_{\{a\}_A} \big(C(a)\big) \qquad (5.3)$$

Gender is another important piece of demographic information for personalised retrieval. Some symptoms can relate to different diagnoses depending on whether the patient is male or female. Gender is sometimes explicitly self-disclosed (e.g., ‘I’m a female’), and in some cases gender can be inferred based on gender mentions (e.g., ‘my mother’) or gender-specific medical term mentions (e.g., ‘pregnant’). However, similar to age extraction, this task is challenging due to the unstructured and diverse nature of gender mentions in OSG posts.

Cheng et al. (2011) focused on the language style of the text to predict the gender of the author. The idea is that males and females often follow different language styles for written communication. However, the accuracy of this approach is likely to be low as OSG attract information seekers and providers with very diverse language styles from across the world. Another approach is to predict the gender by looking at the first name of the author (Herdadelen and Baroni, 2011). It keeps two lists of male and female first names and resolves the gender based on them. This approach does not work in our scenario as many people do not use their real name in OSG; they often use a nickname or a part of their name.

Zhu et al. (2012) look for gender clues: (i) gender specifying words such as ‘men’, ‘woman’, etc. and (ii) gender-specific medical terms such as ‘hot flashes’, ‘prostate cancer’, etc. It is not mentioned how they resolve the gender if terms related to both genders appear in the posts of a patient. Another key issue is to resolve whether the gender clues are about that patient.

We extended the above approach of using gender clues to resolve gender; however, our technique is designed to be more robust to noisy gender cues, and is also capable of fusing multiple gender cues to resolve the most probable gender of each author. Similar to Zhu et al. (2012), this technique is twofold: (i) the first approach extracts gender-specific narrations and processes them to resolve the gender, and (ii) the second approach looks for gender-specific medical terms (body parts, illnesses, symptoms). The results from both approaches are fused using a weighted majority voting method to resolve the highest-confidence gender of each author. This technique is presented in Figure 5.7 and described in detail below.

As shown in Figure 5.7, first the posts which are marked as advice by the narrative type identification are filtered out. This step is taken because advice posts often contain gender cues that are related to the receiver of the advice rather than the author of the post.

Narration based gender resolution

The gender is often explicitly or implicitly mentioned in the narration of OSG posts. However, gender specified in narrations can be ambiguous since such terms may refer to an actor other than the patient. The proposed technique is developed to overcome such ambiguities.

Firstly, the narration type is identified based on the technique described in Section 5.5.1. If the narration type is a first-person narration (e.g., ‘I’, ‘me’), then the narrator is the actual patient. Therefore, the narrator gender is resolved and assigned to the patient. The gender of the narrator is resolved using two types of gender mentions found in OSG posts:

1. Direct gender mentions: these directly reveal the gender of the narrator (e.g., ‘a male’, ‘a man’, ‘a mother’, ‘a widow’). Those mentions have to exist within close proximity of a first-person pronoun, i.e., ‘I’, in a sentence to verify that they are about the patient.
2. Indirect gender mentions: the narrator gender can be inferred from gender-specific relationships. For example, if OSG posts talk about ‘my wife’ or ‘my fiance’, that implies that the narrator is a male.

If the narration type is a second-person narration, then the narrator is a caregiver (e.g., partner, parent, child, friend) who is posting on behalf of the patient. In such cases, the relationship of the patient to the narrator can be used to resolve the gender most of the time. For example, the patient is a male if ‘he’ is the husband, son, uncle etc. of the narrator, and a female if ‘she’ is the daughter, wife, mother, etc. of the narrator.

However, sometimes the patient's relationship to the narrator is gender-neutral (e.g., partner, friend). In that case, the pronoun resolution process that is used to identify the narration type in Section 5.5.1 is examined to check if gender-specific pronouns were resolved to that noun (e.g., he → friend), and the gender is assigned accordingly. Table 5.5 shows three examples of the narration based gender resolution process.
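The narration based rules above can be sketched as follows. This is a simplified illustration under several assumptions: the cue lists contain only the example terms given in the text, ‘close proximity’ is approximated by a window of a few tokens after ‘I’, and the pronoun resolution output is assumed to be available as a mapping from a noun to the pronoun resolved to it. All names (`narration_gender`, `DIRECT`, `INDIRECT`, `RELATION`) are introduced here for illustration.

```python
import re

# Cue lists built from the examples in the text (illustrative, not exhaustive).
DIRECT = {"male": "male", "man": "male", "female": "female",
          "mother": "female", "widow": "female"}         # "I am a <term>"
INDIRECT = {"wife": "male", "fiance": "male"}            # "my <term>" -> narrator
RELATION = {"husband": "male", "son": "male", "uncle": "male",
            "daughter": "female", "wife": "female", "mother": "female"}
PRONOUN = {"he": "male", "she": "female"}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def resolve_first_person(tokens, window=3):
    """Look for a direct mention near 'I' or an indirect 'my <relation>' cue."""
    for i, tok in enumerate(tokens):
        if tok in ("i", "i'm", "im"):
            nearby = tokens[i + 1:i + 2 + window]
            for article, noun in zip(nearby, nearby[1:]):
                if article in ("a", "an") and noun in DIRECT:
                    return DIRECT[noun]
        if tok == "my" and i + 1 < len(tokens) and tokens[i + 1] in INDIRECT:
            return INDIRECT[tokens[i + 1]]
    return None


def resolve_second_person(subject, resolved_pronouns):
    """Use the relationship term, else the pronoun resolved to a neutral noun."""
    if subject in RELATION:
        return RELATION[subject]
    return PRONOUN.get(resolved_pronouns.get(subject))


def narration_gender(narrative_type, tokens, subject=None, resolved_pronouns=None):
    if narrative_type == "first-person":
        return resolve_first_person(tokens)
    if narrative_type == "second-person":
        return resolve_second_person(subject, resolved_pronouns or {})
    return None  # advice posts carry no reliable cue about the author


print(narration_gender("second-person", [], subject="friend",
                       resolved_pronouns={"friend": "he"}))
```

The last call corresponds to the gender-neutral case, where the gender is taken from the pronoun that was resolved to ‘friend’.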

Table 5.5: Few examples of narration based gender resolution process

| Post | Pronoun resolved post | Narrative type | Inference of gender |
|---|---|---|---|
| My doctor ... yesterday. He ... daily. I took it ... and ... I felt ... Is this normal ...? I was ... because ... I haven’t ... and I was ... with my fiance. I got paranoid about her ... even though I knew ... she ... | My doctor ... yesterday. <doctor> ... daily. I took it ... and ... I felt ... Is this normal ...? I was ... because ... I haven’t ... and I was ... my fiance. I got paranoid about <fiance> ... even though I knew ... <fiance> ... | first-person | Gender cue: fiance; Patient: male |
| My mother is diagnosed ... She is suffering with ... I know that ... her heart. An endocrinologist suggested her to ... Also, he recommended her to ... But now she had ... She is suffering ... | My mother is diagnosed ... <mother> is suffering with ... I know ... <mother> heart. An endocrinologist suggested <mother> to ... Also, <endocrinologist> recommended <mother> to ... But now <mother> had ... <mother> is suffering ... | second-person | Gender cue: mother; Patient: female |
| A friend ... when he .... He suddenly ... hurt his back. He fell down .... Despite ... him to his feet and made him walk to ... away. | A friend ... when <friend> .... <friend> suddenly ... hurt <friend> back. <friend> fell down .... Despite ... <friend> to <friend> feet and made <friend> walk to ... away. | second-person | Gender cue: he (resolved to friend); Patient: male |

Medical concept based gender resolution

There are medical concepts (e.g., body parts, symptoms, illnesses, procedures) that are often limited to a single gender (e.g., prostate cancer, pregnancy), which can be used to infer the gender of the patient. In this approach it is hypothesised that discussion of such a medical concept indicates that the patient is of the respective gender. For this approach it is immaterial to know the narration type, because the medical concept is about the patient; hence this approach is more straightforward than the narration based approach.

A list of seed medical terms (words or phrases) was constructed for each gender. These seed lists were populated based on the gender related concept terms from the UMLS Metathesaurus (Bodenreider, 2004). However, patients sometimes use lay terms for those medical concepts, which are not included in clinical ontologies such as the UMLS Metathesaurus. Therefore, a set of samples from the men's and women's sections of OSG was examined to include such lay terms of medical concepts that are deterministic of gender.

If the above formulated gender-specific terms are present in OSG posts, the relevant gender is assigned to the patient that is mentioned in that post.

Aggregate and resolve gender

Gender suggestions based on each post are aggregated to resolve the patient gender of each OSG user. The aggregation process is similar to the age resolution process specified in Section 5.5.2, where a medical concept based gender suggestion is given a weight of 1.0 and a narration based gender suggestion is given a weight of 2.0. Note that narration based [...]

Medical concept extraction from text is a further formidable task. Natural language processing (NLP) tools are used to extract terms from text that can be mapped to the medical concepts found in medical thesauruses such as the UMLS Metathesaurus (Bodenreider, 2004). There are several state-of-the-art tools that extract medical concepts from text, such as MedLEE (Friedman et al., 1994), cTAKES (Savova et al., 2010), and MetaMap (Aronson and Lang, 2010).

Gupta et al. (2014) show that better precision and recall can be achieved by developing a tool to support the special characteristics of patient-authored text. Consumer health vocabularies extracted from community generated corpora (Vydiswaran et al., 2014) are employed to identify the relevant medical terms, and a different stack of NLP processes is followed to extract key phrases from the text. However, as noted by the authors, the existing system is limited to certain subcategories of the OSG (Asthma and ENT). Therefore, for our task we adhere to the well-established medical concept extraction tools.

We employed the cTAKES (Savova et al., 2010) tool for our medical concept extraction task. It identifies noun phrases in the text and then conducts a dictionary look-up in the SNOMED CT and RxNORM (Nelson et al., 2011) medical concept databases. Each identified term is then mapped into five semantic types: disorder/disease, sign/symptoms, procedures, anatomical sites, and medications. In the proposed method, we subject each OSG post to this process and the identified medical concepts are extracted. As pointed out by Gupta et al. (2014), some ambiguities exist in the identified concepts. For example, today is mapped to a drug named Today and web is mapped to the disorder congenital webbing. However, both of these words are found frequently in OSG posts, mostly referring to their usual meanings. To overcome this issue, we constructed a list of terms that are often mapped incorrectly and filtered out the identified concepts based on such terms.
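The final filtering step can be illustrated with a short sketch. The `Concept` record and the `AMBIGUOUS_TERMS` list are hypothetical stand-ins: the record is not the cTAKES output format, and the list holds only the two examples mentioned above rather than the full list constructed in this work.

```python
from typing import List, NamedTuple


class Concept(NamedTuple):
    """Hypothetical record for one extracted concept (not the cTAKES format)."""
    text: str            # surface term found in the post
    semantic_type: str   # one of the five semantic types
    preferred_name: str  # concept the term was mapped to


# Terms that are often mapped incorrectly; only the two examples from the text.
AMBIGUOUS_TERMS = {"today", "web"}


def filter_concepts(concepts: List[Concept]) -> List[Concept]:
    """Drop concepts whose surface term is on the ambiguous-term list."""
    return [c for c in concepts if c.text.lower() not in AMBIGUOUS_TERMS]


extracted = [
    Concept("Today", "medications", "Today"),
    Concept("atenolol", "medications", "atenolol"),
    Concept("web", "disorder/disease", "congenital webbing"),
]
print([c.text for c in filter_concepts(extracted)])
```

Filtering on the surface term rather than on the mapped concept means that a genuine mention of the same concept under a different term is still kept.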

As discussed in Section 2.4, emotions are an important element of social text which self-disclose the mental state of the author. In the context of OSG discussions, it can be hypothesised that the emotions expressed in OSG discussions are indicative of the mental state of OSG users (patients/caregivers), which can be used to infer their emotional well-being. Moreover, since an OSG user posts at different points in time, emotional fluctuations over time are indicative of their emotional journey during their participation in the OSG.

Analysing emotions and emotion fluctuations over time of an OSG user is important for medical researchers to understand the emotional well-being of patients/caregivers with a particular disease condition, as well as to compare and contrast the emotional well-being of patient cohorts with different disease conditions or other clinical factors (e.g., having undergone different treatments).

As discussed in Section 2.4.2, multiple psychological emotional models have been proposed to represent the emotional state, ranging from the two-dimensional valence-arousal model (Russell, 1980) to multi-dimensional models such as the emotion wheel (Plutchik, 1991). While such models serve as the theoretical basis for emotion representation, computational implementations are required to capture the emotional expression from discourse. For example, sentiment analysis techniques are the computational implementation of the valence-arousal model (Mohammad, 2016), which provide a signed real value as the sentiment score, where the sign (positive/negative) represents the valence and the absolute value of the score represents arousal.

Although sentiment analysis techniques are relatively mature and commonly used for capturing emotions, we found that the two-dimensional model is too coarse-grained for representing the complex emotional states of OSG users. Therefore, we have developed a computational technique based on the Emotion Wheel (Plutchik, 1980b, 1991) to capture a multi-dimensional representation of emotions from OSG discourse.

As discussed in Section 2.4.2, the Emotion Wheel has eight primary emotions (joy, trust, surprise, sadness, disgust, anger, anticipation and fear) and a further eight secondary emotions which are derived using combinations of primary emotions (e.g., love: joy + trust). These 16 emotions (primary and secondary) specified in the Emotion Wheel are incorporated as the emotional dimensions in the proposed computational model. Note that several modifications are applied to the secondary emotions to suit the healthcare domain. The emotional intensity of each emotion is determined based on the proportion of relevant emotional terms present in each OSG post, resulting in a 16-dimensional real-valued emotion vector for each OSG post.

Figure 5.8: The proposed technique for emotion extraction

Figure 5.8 presents the proposed technique for emotion extraction. The relevant terms for each emotion are obtained using a two-step process.

Table 5.6

| Emotion | Emotional terms from thesaurus | Emotional terms from word-embedding |
|---|---|---|
| Positive emotions | | |
| Happy | happy, great, joyous, glad, delighted | fab, chuffed, terrific, great news, looking forward, heart warming, uplifting, upbeat |
| Good | good, pleased, comfortable, relaxed, content | comfy, nice, chill, chipper, ok, okay, clear headed, cool |
| Alive | alive, playful, energetic, spirited, animated | chatty, perky, sociable, vibrant, vivacious, witty, easy going, peppy |
| Love | love, attracted, warm, passionate, affectionate | romantic, cuddly, compassionate, intimate, adore, supportive, caring |
| Positive | positive, eager, keen, bold, brave | smart, ambitious, proactive, cynical, insistent, willing, upbeat |
| Open | open, understanding, accepting, satisfied, receptive | open minded, empathetic, cooperative, accommodating, approachable, forgiving, attuned, rational |
| Interested | interested, fascinated, inquisitive, curious, intrigued | keen, impressed, cautious, leery, eager, intuitive, savvy, thoughtful |
| Strong | strong, certain, dynamic, sure, tenacious | resilient, independent, adamant, fierce, self reliant, decisive, fighter, pragmatic |
| Negative emotions | | |
| Sad | sad, tearful, grief, sorrowful, grief | heart break, teary, lonely, weepy, crying, despairing, hurtful |
| Afraid | afraid, fearful, terrified, panic, worry | petrified, freaking out, apprehensive, dread, obsess, fret, nervous wreck |
| Hurt | hurt, deprived, pained, dejected, agonised | traumatised, bruised, shattered, ached, exhausted, cramped, numb, fatigued, strained |
| Angry | angry, annoy, provoke, aggressive, enraged | agitated, hostile, pissed off, argumentative, aggressive, rude, paranoid, ticked off, lashing out |
| Depressed | depressed, disappointed, miserable, despair, powerless | despondent, distraught, suicidal, unloved, worthless, emotionally drained, snappy |
| Helpless | helpless, incapable, alone, vulnerable, fatigued | insecure, tired, hopeless, powerless, defeated, overwhelmed, listless, incapacitated |
| Confused | confused, upset, doubtful, uncertain, hesitant | unsure, perplexed, wary, leery, freaked out, iffy, bummed, taken aback |
| Indifferent | indifferent, insensitive, dull, reserved, lifeless | grumpy, apathetic, blunt, ignorant, emotionless, callous, crass, standoffish |


Expanding a seed list of lexicons is a laborious task which is often achieved using crowdsourcing techniques such as Amazon Mechanical Turk (Buhrmester et al., 2011). However, recent research reports techniques (Hamilton et al., 2016; Fast et al., 2016) to expand a seed term list using a semi-supervised approach based on the word-embedding (Mikolov, Sutskever, Chen, Corrado and Dean, 2013) technique. Word-embeddings learn dense vector representations of words and phrases while automatically preserving the semantic relationships that exist in the text corpus by incorporating such relations into the vector space of the word-embedding. This enables the use of linear algebra to capture different semantic relationships within word-vectors in the word-embedding. The famous example by Mikolov, Yih and Zweig (2013) shows that the vector arithmetic of word vectors ‘King - Man + Woman’ results in a word vector similar to the word vector of ‘Queen’.

Developing such a word-embedding using an OSG post corpus makes it possible to capture terms used by the OSG users that are semantically similar to the seed emotional terms. We have developed a word-embedding from a large text corpus collected from two large OSG, patient.info and healingwell, mentioned in Table 5.1, which contained a total of 4,795,428 OSG posts. This corpus is preprocessed to remove any URLs, converted to lower case and then separated into sentences using the Python NLTK sentence tokenizer, which resulted in a text corpus of 36,222,536 sentences. This text corpus is employed to train a 200-dimensional word-embedding using the Word2Vec technique with the skip-gram model (Mikolov, Sutskever, Chen, Corrado and Dean, 2013) and negative sampling (Mnih and Teh, 2012). Note that common phrases of up to three terms are tagged and used as phrases during the training of the word-embedding. We employed the Python Gensim (Řehůřek and Sojka, 2010) library for this implementation. The resulting word-embedding contains 312,196 unique terms (words or phrases).

Once the word-embedding is trained, the most similar terms for each seed term in the emotion thesaurus are identified using a nearest neighbour search in the embedding space using cosine similarity. These identified terms are semantically similar to the seed emotion terms, but while some of them have the same emotional sense as the seed term, others may not. For example, the top nearest neighbours of sorrowful include sadness, sincerity, joyful, and deeply saddened, in which joyful is semantically similar but has the opposite emotional sense. Therefore, we manually inspected the selected terms and filtered out the terms that do not align with the emotional sense of the respective seed term. The third column of Table 5.6 presents a sample of emotional terms captured using the above technique. Note that this emotion term identification is a one-time task and the final emotion terms list is added to the emotion term thesaurus.

Intensity modifier terms are a set of terms that increase or decrease the intensity of the emotional term. For example, the term ‘very’ increases the intensity of the emotion ‘good’ when used together, whereas the term ‘kind of’ decreases the intensity of the emotion ‘okay’ when used together. Moreover, some terms completely negate the emotion, e.g., ‘not okay’ negates the emotion expressed by ‘okay’. A thesaurus of such terms is often used in rule based sentiment analysis tools such as SentiStrength (Thelwall et al., 2012) [...] $E_P$ calculation of a particular OSG post $P$.

Algorithm 1: Determine emotion intensity vector

Input: P - OSG post, T_E - emotion thesaurus, T_m - intensity modifier thesaurus
Output: E(P) - emotion vector of OSG post P

    E(P) ← ∅
    P ← Preprocess(P)        // remove URLs, numbers, punctuations
    {w}_P ← Tokenize(P)      // tokenise the post into words
    for i ← 1 to 16 do
        T_Ei ← T_E[i]
        e_i ← 0
        w_prev ← null
        foreach w in {w}_P do
            // lookup w in emotion i term list
            if w ∈ T_Ei then
                e_i ← e_i + 1
                // lookup w_prev in intensity modifier thesaurus
                if w_prev ≠ null and w_prev ∈ T_m then
                    e_i ← e_i + T_m(w_prev)
            w_prev ← w
        e_i ← e_i ÷ |{w}_P|
        E(P) ← E(P) ∪ e_i
    return E(P)

This section describes the automatic construction of user timelines from their self-disclosed narratives in OSG posts. A timeline consists of time-sensitive events organised in chronological order. Apart from demographics, which are not time sensitive, all other information extracted in the previous section can be employed in the construction of the user timeline. However, this work is limited to timeline construction based on clinical events and emotional events (emotional trends) only.
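For reference, Algorithm 1 (the emotion intensity vector of the previous section) can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation used in this work: the thesauri are passed as plain dictionaries, the modifier weights in the example are made-up values, lower-casing is added to the preprocessing, and only single-word modifiers directly preceding an emotion term are handled (so a phrase such as ‘kind of’ is not covered). The function name `emotion_vector` is introduced here.

```python
import re


def emotion_vector(post: str, emotion_thesaurus: dict, modifier_thesaurus: dict) -> dict:
    """Return the intensity of each emotion as a proportion of the post's words."""
    # Preprocess: remove URLs, then numbers and punctuation.
    text = re.sub(r"https?://\S+", " ", post.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    words = text.split()

    vector = {}
    for emotion, terms in emotion_thesaurus.items():
        score = 0.0
        prev = None
        for word in words:
            if word in terms:
                score += 1
                # Adjust by the modifier (intensifier or negation) before the term.
                if prev is not None and prev in modifier_thesaurus:
                    score += modifier_thesaurus[prev]
            prev = word
        # Normalise by post length; an empty post gets zero intensity.
        vector[emotion] = score / len(words) if words else 0.0
    return vector


# Hypothetical thesauri for illustration only.
emotions = {"happy": {"happy", "glad"}, "sad": {"sad", "tearful"}}
modifiers = {"very": 0.5, "not": -1.0}
print(emotion_vector("I am very happy today", emotions, modifiers))
```

With these example weights a negation such as ‘not’ cancels the count of the term that follows it, so ‘not sad’ contributes nothing to the sad dimension.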

CHAPTER 5.

A timeline of events helps to examine user behaviour over time as well as temporal associations between multiple events. For example, such a timeline is an important resource for a health professional to identify the temporal ordering of clinical events over the disease progression of a patient. Such insights can be employed during prognosis as well as therapy planning (Augusto, 2005). Medical researchers can also use such information to investigate the progression of certain illnesses across different patient cohorts in the population. Having emotions alongside clinical events enables medical researchers to analyse the associations between clinical events and emotions, and to gain insights into the emotional burden of a disease condition over time.

A clinical event is defined in the literature as a clinically relevant symptom, state, perception, procedure or occurrence (Tao et al., 2010a,b). Clinical event extraction is a specified event extraction task, as event-related information such as the relevant key terms is known in advance. As discussed in Section 2.3.2, specified event detection uses an engineered feature dictionary or labelled datasets to train a classifier to identify documents/texts that contain relevant event information, and then captures the relevant named entities (such as event time) that are related to the identified event.

Existing work on clinical event extraction is mostly developed for the clinical narratives in Electronic Health Records (EHR) and discharge summaries. There is recent interest in such techniques due to the availability of several human-annotated datasets, such as the datasets released for the i2b2 challenge (Sun et al., 2013) and SemEval-2015 Task-6 (Bethard et al., 2015). Tao et al. (2010b) developed CNTRO (Clinical Narrative Temporal Relation Ontology), which uses an ontology-based approach to model time-sensitive information in clinical narratives. For each identified event it extracts temporal information such as time and duration by parsing the time-indicative terms/phrases (e.g., ‘2 week’ for a duration of two weeks) using time ontologies (Hobbs and Pan, 2006).

These existing techniques are specifically designed for narratives generated by health professionals, which often adhere to certain guidelines and professional practices. Therefore, the vocabulary, the reporting of clinical events and the expression of temporality are fairly uniform across those narratives. On the other hand, OSG content is user generated (patient/caregiver). As discussed in Section 2.7, such user-generated content is highly diverse and contains different language patterns prevalent among different socio-geographic groups (Eisenstein et al., 2014).
Therefore, the existing techniques need to be extended to facilitate timeline extraction from OSG.

Wen and Rose (2012) developed such an extended technique to construct clinical timelines, i.e., cancer trajectories, from the OSG discussions in breast cancer support groups. They engineered a consumer health vocabulary of breast cancer related keywords, which can be phrases or certain abbreviations (e.g., diagnosed: dx, dx’d). From the OSG posts, the sentences containing keywords from this vocabulary are extracted and further parsed to extract the temporal expression related to the time of that event. Time is inferred using both specific time expressions (e.g., 4th May 2018) and relative time expressions (e.g., … → postdate - 7 days). This approach is further extended by Naik et al. (2017), who use a semi-automated approach for engineering relevant keyword dictionaries and employ HeidelTime (Strötgen and Gertz, 2010) for time resolution.

We employed the approach of Wen and Rose (2012) for clinical event extraction from OSG. The medical concept extraction discussed in Section 5.5.4 already extracts clinical concepts such as symptoms and procedures based on the terms found in medical thesauruses. However, as discussed in Section 5.5.4, such thesauruses are optimised for the vocabulary of clinical discourse used by health professionals and therefore often lack the terms used by general health consumers, i.e., patients or caregivers. This issue highlights the need for a consumer health vocabulary to more effectively extract clinical concepts from consumer-generated discourse such as OSG.

Developing a generic consumer health vocabulary is a laborious task which is beyond the scope of this work. However, we have developed a consumer health vocabulary for several key cancer related clinical concepts, in order to demonstrate its utility as well as for the further application of this work in the cancer domain (presented in Chapter 6).
We have selected five key clinical events of a cancer patient as follows:

1. diagnosis
2. medical tests (e.g., biopsies, other pathology tests)
3. surgery
4. radiotherapy
5. recurrence

A consumer health vocabulary was engineered based on the terms (words/phrases) used by OSG users that are indicative of the above designated event types. In order to capture such terms from a very large text corpus, we employed the same approach used in Section 5.5.5 for expanding the emotion thesaurus, which is a semi-supervised approach based on the word-embedding technique (Mikolov, Sutskever, Chen, Corrado and Dean, 2013). As with the emotion term capturing, a word-embedding trained on the OSG post corpus makes it possible to capture terms used by OSG users that are semantically similar to the clinical terms indicative of the above designated event types.

We employed the same word-embedding developed in Section 5.5.5 and looked up semantically similar terms for a given term in the embedding space using cosine similarity. For example, the most similar terms of ‘radiotherapy’, ordered by cosine similarity score, are ‘radiation’, ‘chemotherapy’, ‘hormone therapy’, ‘radio therapy’, ‘salvage radiation’, ‘external beam radiation’, ‘chemo’, ‘brachytherapy’, ‘external radiation’, ‘brachy’ and ‘ebrt’. Radiation and radio therapy are synonyms, while salvage radiation, external beam radiation and brachytherapy are different variants of radiotherapy. EBRT is an abbreviation for external beam radiation therapy, while chemo and brachy are shortened forms of chemotherapy and brachytherapy respectively.
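The nearest neighbour lookup described above can be illustrated with a small pure-Python sketch. The three-dimensional vectors below are made up for the example and carry no relation to the trained embedding; `cosine` and `most_similar` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(seed, embedding, top_n=3):
    """Rank every other term by cosine similarity to the seed term."""
    seed_vec = embedding[seed]
    scored = [(term, cosine(seed_vec, vec))
              for term, vec in embedding.items() if term != seed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up toy vectors, for illustration only.
TOY_EMBEDDING = {
    "radiotherapy": [0.9, 0.1, 0.0],
    "radiation": [0.8, 0.2, 0.0],
    "brachy": [0.7, 0.0, 0.3],
    "hormone therapy": [0.5, 0.5, 0.1],
    "heartburn": [0.0, 0.1, 0.9],
}

neighbours = most_similar("radiotherapy", TOY_EMBEDDING)
```

The ranked candidates are only a starting point: a term can score highly because it is used in similar contexts without being a synonym or variant, which is why a manual curation step follows.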

The inclusion of the phrase ‘hormone therapy’ as a similar term for ‘radiotherapy’ is interesting, as it is neither a synonym nor a variant but is simply used in a semantically similar manner in OSG discussions. This inclusion highlights the fact that the candidate terms selected using word-embeddings need to be further validated for their meaning by an expert. Therefore, we enlisted the support of a cancer surgeon to further curate the automatically generated term sets of each event category. Table 5.7 shows a sample of terms from this curated consumer health vocabulary for each event category delineated above.

Table 5.7: A sample of representative terms of each clinical event type. Note that some of the terms are common misspellings.

Clinical event | Representative terms
Diagnosis | diagnosed, dxed, diagonsed, diagnosedwith, diag, dx, clinically diagnosed, officially diagnosed
Medical tests | biopsy, biopsies, colonscopy, biop, fna, endoscopy, scope, mammogram, mpmri, cystoscopy
Surgery | surgery, operation, op, sugery, surgury, opp, corrective surgery, ops, surgical procedure, opperation, keyhole surgery, bunion surgery
Radiotherapy | radiotherapy, radiation, chemotherapy, radio therapy, salvage radiation, external beam radiation, chemo, brachytherapy, external radiation, brachy
Recurrence | recurrence, reoccurrence, recurrance, bcr, biochemical recurrence, biochemical failure, local recurrence, biochemical relapse, secondary cancers

The identified event sentences are then further processed to identify any temporal expression mentioned in free-text form. We employed SUTime (Chang and Manning, 2012) to detect such temporal expressions. SUTime is a rule-based classifier which contains a set of hand-crafted rules and a dictionary of time-sensitive terms. It first identifies the time-related terms, then extends them to chunks and applies the rules to resolve the time.
Note that we extended the time-sensitive term list of SUTime by including some of the relevant terms identified using the word-embedding created above. For example, ‘weeks’ is often written as ‘wks’ or ‘wk’, and ‘months’ as ‘mnths’. Such additions improve the recall of temporal expressions. SUTime resolves both specific time expressions and relative time expressions (e.g., ‘2 wks ago’ → ‘OFFSET P-2W’). A relative time expression is then resolved against the post date of the OSG post.

OSG users often mention the same event multiple times in different OSG posts. Such duplicate mentions can resolve to slightly different specific times, mainly due to differences in the granularity of the temporal expression. In such instances, mentions of the same clinical event are aggregated and the representative time is taken as the median time of the respective cluster.

… t (e.g., week, month, etc.). The emotions of the OSG posts in a single time-bin are aggregated to determine the emotional expression over that time period. We hypothesise that aggregating emotional expressions over a time period provides a less biased account of the emotional state of the user over that time, as the averaging smooths out frequent fluctuations in the emotions but retains macro emotional trends that correlate with the clinical events.

The emotions are captured as a 16-dimensional vector for each OSG post based on the technique introduced in Section 5.5.5. The emotional representation of user U for the time period from t to t + ∆t is obtained by averaging the emotion vectors of the OSG posts whose time stamps fall within that time period.

This section evaluates the three knowledge extraction modules described in Section 5.5: (i) narrative type classification, (ii) age extraction, and (iii) gender extraction. We obtained the services of qualified domain experts for manual classification of the test datasets.
Narrative type classification is evaluated using a set of posts labelled as advice or experience. Age and gender resolution are evaluated using a set of labelled OSG post authors, based on their published posts.

Narrative type classification is evaluated using 500 posts labelled by domain experts as experience or advice. Note that we ignored the sub-classification of experience (first person or second person) for this evaluation, because second person experiences are relatively rare in the dataset. It is evaluated as a classification problem where the two classes are Experience and Advice. Table 5.8 presents the evaluation results.

Table 5.8: Evaluation results of the narrative type extraction technique

Label | Number of posts | Precision | Recall
Experience | 329 | 0.92 | 0.96
Advice | 171 | 0.91 | 0.81
Combined | 500 | 0.92 | 0.89

The results show that both experience and advice are identified with a precision above 0.9. Recall of advice is relatively low, mainly because some advising posts are mixed with the author's own experience and are therefore hard to identify as advice.
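As a reading aid for Table 5.8, the short sketch below checks how the Combined row relates to the per-class rows. It is an inference from the reported numbers, not something the text states: an unweighted (macro) mean of the two classes gives 0.915 and 0.885, which agree with the reported 0.92 and 0.89 after rounding, whereas weighting recall by the number of posts per class would give roughly 0.91.

```python
def macro_average(values):
    """Unweighted mean across classes."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean across classes weighted by the number of posts in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Per-class figures copied from Table 5.8.
per_class = {
    "Experience": {"posts": 329, "precision": 0.92, "recall": 0.96},
    "Advice": {"posts": 171, "precision": 0.91, "recall": 0.81},
}

precisions = [row["precision"] for row in per_class.values()]
recalls = [row["recall"] for row in per_class.values()]
supports = [row["posts"] for row in per_class.values()]

macro_precision = macro_average(precisions)           # about 0.915
macro_recall = macro_average(recalls)                 # about 0.885
weighted_recall = weighted_average(recalls, supports)  # about 0.909
```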


Age and gender resolution was evaluated using a set of 300 labelled author profiles. The posts of each author were examined to identify the age or gender of the author, if such information is present. Each author profile is annotated with the identified age and gender. The labelled data is then compared to the output of the age and gender resolution modules. We employed precision and recall statistics for this evaluation. Note that for some authors age is mentioned in incrementing values as a result of prolonged contribution to the OSG over several years. Therefore, age resolution is considered correct if it falls within two integer values of the labelled age.

Similar to the previous evaluation, the performance of the age and gender resolution modules is evaluated as a classification problem. For gender, the classes are Female, Male, and Unknown, and for age, the classes are Age mentioned and Age not mentioned. Note that, in the Age mentioned class, the classifier has to correctly resolve the age value (within two integer values of the labelled age value) in order to count as a true positive.

Table 5.9 and Table 5.10 present the age and gender classification results respectively.

Table 5.9: Evaluation results of the age extraction technique

Label | Number of profiles | Precision | Recall
Age mentioned | 131 | 0.78 | 0.89
Age not mentioned | 169 | 0.95 | 0.84
Combined | 300 | 0.86 | 0.87

Table 5.10: Evaluation results of the gender extraction technique

Label | Number of profiles | Precision | Recall
Female | 109 | 0.91 | 0.87
Male | 35 | 0.90 | 0.77
Unknown | 156 | 0.87 | 0.94
Combined | 300 | 0.90 | 0.86

Both age and gender resolution have average precision and recall over 0.85. Precision in the Age mentioned class is relatively low because, when a profile does not have actual age mentions, the classifier tends to pick up low-confidence age mentions that are often age mentions in past incidents or age mentions about other people. The same issue resulted in a relatively low recall in the Age not mentioned class as well.

Recall for Male is relatively low. Most of those misses are classified as Unknown, as the classifier misses the gender-specific clues. This is mainly because males tend to expose very few clues about their gender compared to females.

This section demonstrates the capabilities of the proposed platform in addressing the information needs of patients (end-users) and researchers. Patient information needs are based on finding similar cases that are more relevant to them (e.g., patients with similar …

The popular OSG patient.info (http://patient.info/forums) and healingwell mentioned in Table 5.1 were selected for data collection because of their high volume of posts and large number of participants.

A web scraping tool was developed to automatically traverse each topic and collect all the threads. From each thread, the title, first post and subsequent reply posts are collected. Each OSG post is collected with its time-stamp and author id. As shown in Figure 5.4, the posts are then aggregated by author id. We filtered out users with fewer than five posts in the OSG, because the information exposed by such users is often not sufficient to understand their demographic and clinical factors. The final dataset contains 4,469,107 posts from 79,829 OSG users, of which 1,662,312 posts (47,712 users) are from patient.info and 2,806,795 posts (32,117 users) are from healingwell.

The collected OSG posts are stored as JSON documents using a setup of the open source search platform Elasticsearch. Elasticsearch is employed as both the document store and the search platform since it handles both full-text and structured search. Elasticsearch is a distributable full-text search engine, designed to scale to very large datasets (Kononenko et al., 2014).

The information extraction techniques developed in Section 5.5 are employed to enrich the extracted OSG posts.

The narrative type is identified individually for each post, where each OSG post is categorised as experience: first person, experience: second person, or advice.

The age and gender are resolved for each author. All the posts of an author are aggregated using the author id associated with each post. These aggregated posts are then processed to resolve the age and gender of each author.

Table 5.11 provides a summary of the demographics identified. The proposed demographic extraction modules managed to resolve both age and gender for 42% of users, who account for 84.4% of the posts in the collected dataset. This is a 12 percentage-point increase over the 30% success reported by Cho et al. (2013).

Table 5.11: Counts and percentages of extracted age, gender and narrative type

 | Users (percentage) | OSG posts (percentage)
Total | 79,829 (100) | 4,469,107 (100)
Age resolved | 48,126 (60.3) | 4,045,781 (90.6)
<20 | 10,380 (13.0) | 824,405 (18.5)
21-30 | 10,303 (12.9) | 664,950 (14.9)
31-40 | 6,849 (8.6) | 534,157 (12.0)
41-50 | 6,939 (8.7) | 569,000 (12.7)
51-60 | 6,581 (8.2) | 732,864 (16.4)
61-70 | 4,536 (5.7) | 519,317 (11.6)
>70 | 2,538 (3.2) | 201,088 (4.5)
Gender resolved | 44,700 (56.0) | 3,963,523 (88.7)
female | 32,268 (40.4) | 2,943,776 (65.9)
male | 12,432 (15.6) | 1,019,747 (22.8)
Age and gender resolved | 33,501 (42.0) | 3,770,618 (84.4)

Among the gender resolved users, 72.2% are female whereas only 27.8% are male. The gender resolved OSG posts are skewed further: 74.3% are from females and 25.7% from males. We investigated this further in the research literature. Kummervold et al. (2002) report that women participate more in OSG than men. Li et al. (2015) state that women tend to self-disclose more information than men. Therefore, we assume that there is less male participation in OSG and that even the participating males tend to expose less information from which their gender can be identified.

One of the key drivers for participating in OSG is the ability to engage with individuals who have similar conditions (Tanis, 2008). In fact, Fox (2011c) reports that in 2011, 20% of adult internet users had used the internet to find individuals with similar health concerns. As discussed in Section 5.4, stage V of the proposed framework provides personalised (relevant and reliable) information in response to patient search queries based on their medical and demographic information.

We first employed the medical information extracted from the query to identify the posts that are experiences and contain matching medical information. Such posts are then ranked based on the demographic similarity of the author to the demographics of the patient. For this ranking, we use a custom relevance measure based on age and gender.

Let the age and gender of an information seeking patient $P_i$ be $P_{ia}$ and $P_{ig}$ respectively. Using the same notation, let the age and gender of an existing patient $P_e$ be $P_{ea}$ and $P_{eg}$. The relevance measure $r_E$ is defined as follows:

\[ r_E = W_a \times \exp\left(-\frac{|P_{ia} - P_{ea}|}{\sigma_a}\right) + W_g \times I(P_{ig} = P_{eg}) \tag{5.4} \]

where $W_a$ and $W_g$ are the weights for age and gender respectively. $I$ is the indicator function, which is 1 if $P_{ig}$ equals $P_{eg}$ and 0 otherwise. The weight of age is associated with a Gaussian decay function which is 1 if $P_{ia}$ equals $P_{ea}$. $\sigma_a$ is used to control the granularity of age matching, where smaller values make the decay function decrease rapidly with the age difference and vice versa. $W_a$ and $W_g$ are set to 0 if the patient does not provide their demographic details. The OSG posts are ranked based on this relevance measure and presented to the patient.

In this patient use case, we demonstrate how personalised retrieval can provide relevant and reliable information to the patient compared to the existing full-text search. We use the same query: “I’m a 40 year old woman taking Nexium for heartburn”.
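For the demographics in this query (age 40, female), the ranking by the relevance measure in Equation 5.4 can be sketched as below. This is a minimal sketch under stated assumptions: the exponent is taken to be the absolute age difference divided by σ_a, as the printed formula suggests; the weights (1.0 each), σ_a = 5 and the author records are made-up values; and an author whose age or gender is unresolved is assumed to contribute 0 for that term.

```python
import math


def relevance(seeker, author, w_age=1.0, w_gender=1.0, sigma_age=5.0):
    """Demographic relevance of an existing author to an information seeker."""
    age_term = 0.0
    if seeker.get("age") is not None and author.get("age") is not None:
        # Equals w_age when the ages match and decays as they diverge.
        age_term = w_age * math.exp(-abs(seeker["age"] - author["age"]) / sigma_age)
    gender_term = 0.0
    if seeker.get("gender") is not None and seeker.get("gender") == author.get("gender"):
        gender_term = w_gender
    return age_term + gender_term


seeker = {"age": 40, "gender": "female"}

# Hypothetical authors of posts that already matched the medical information.
authors = [
    ("a1", {"age": 50, "gender": "female"}),
    ("a2", {"age": 40, "gender": "female"}),
    ("a3", {"age": 41, "gender": "male"}),
    ("a4", {"age": 43, "gender": "female"}),
]

ranked = sorted(authors, key=lambda item: relevance(seeker, item[1]), reverse=True)
```

With these settings a same-gender author ten years older still outranks an opposite-gender author one year older; lowering `w_gender` or raising `sigma_age` shifts that balance.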
Contextual information is initially extracted from the query and used to identify medical information (symptom: heartburn, medication: Nexium) and demographic information (age: 40, gender: female). The most relevant experiences are retrieved from the database using the method described above.

For comparison, we employed two approaches to full-text search: (i) the default search in the OSG, and (ii) a search for the key terms combined with the Boolean operator AND (which retrieves the posts that contain all the search terms). Excerpts from the top five results of our method and of the two full-text search approaches are provided in Table 5.12. The key terms that are relevant to the query are highlighted in each excerpt. Note that some experiences retrieved by the proposed method do not have age or gender mentioned in the post itself. This is because age and gender are resolved for each author using all posts by that author, so the age or gender of that author is inferred from other posts and not from the retrieved post.

The above results show that the proposed method retrieves more relevant posts for the given query. Instead of taking the key terms of the query as-is, the proposed method identifies that the patient is female and that her age is 40. Therefore, it retrieves similar experiences from females who are aged close to 40. Also, the posts do not need to have age and gender mentions in the post itself, as these were resolved for each author.

In comparison, the second query (full-text search in the OSG) attempts direct string matching with the search terms and retrieves partially matched results that contain any (unknown) combination of search terms. It is apparent that the last match of this query is irrelevant, because it matches only ‘woman’ and ‘40’ but has neither the symptom ‘heartburn’ nor the medication ‘Nexium’. The third query has a very low recall with only one retrieved post, as it is rare to have all four terms in a single post.
Also, it is clearly noticeable that the matched term ‘40’ is not an age mention.

Table 5.12: Excerpts from the top five results obtained using the three search approaches (including the proposed method) to retrieve similar experiences for the query “I’m a 40 year old woman taking Nexium for heartburn”

Querying method: The proposed structured search. Breaks the query into the following structure: symptom: heartburn; medication: Nexium; age: 40; gender: female.
Excerpts from top five results:
- (female, 40): How I cured my gastritis ... side effect of fish oil is heartburn ...my doctor said I could try Nexium as well I decided not to... My husband made gluten-free banana bread... I am 40 ...
- (female, 40): I had a very severe attack during a 24 hour ph probe test. I used to have heartburn ... I am on nexium, ranitadine and donperidone... I too am 40 years old but I feel 80 ...
- (female, 43): My ...it’s been in a long time and the heartburn seems to be easing off... inhibitors are medication for reducing acid in your tummy like nexium and zoton... I’m only 43 and feel my life is...
- (female, 50): I’m a 50yo Aussie female with Barrett’s... I still have heartburn if I eat/drink the wrong things or forget to take my medication for a few hours... I was on Nexium 40 forever until I saw a different GP...
- (female, 30): i am bloated and get heartburn all the time...iam 30yrs old wiv 4kids...take them today along with nexium. Nexium in my opinion is of no use at all...

Querying method: Full-text search using ‘heartburn Nexium woman 40’ (searched in the actual forum search of the patient.info website).
Excerpts from top five results:
- Nexium and side effect, anyone here while taking Nexium suffer diarrhea...
- I was taking another PPI tablet for heartburn for about 2 years (Nexium)
- Constant burping ...no pain or heartburn. I have been taking nexium for the past two weeks to relieve constant burping...
- just wondering if ... if you have been on Nexium for a long time? Women would be the ones who might find their iron levels low...
- I am a 40 year old woman with no notable health problems aside from the nephrotic syndrome. At 10 years old I was diagnosed with primary focal segmental glomerulosclerosis

Querying method: Full-text search using ‘heartburn AND Nexium AND woman AND 40’ (searched within the collected posts using Elasticsearch).
Excerpts from top five results:
- the meds slowly over two years improved symtoms, during the symptom phase l started with different pain, chronic heartburn, leading to gall bladder removal...bought nexium as ld read theyre same as omp, they are expensive...they use 40-80gm.... but tomorrow will try cabbage juice, just told woman on cfs site who has chronic nausea to try it

Medical research is often conducted using small samples of patients due to the associated cost (both time and money) of such research. On the other hand, such information is accumulated in OSG, crowd-sourced by real patients. These untapped resources are inaccessible to researchers due to inherent noise, their unstructured nature and the diversity of information representation. Researchers have to attempt the formidable task of executing full-text queries and manually extracting information from the resulting posts.

As discussed in Section 5.4, stages III and IV of the proposed framework build a structured layer on top of the unstructured text of OSG posts, which can be utilised for OSG analytics. As shown in stage III of Figure 5.4, each OSG user can be represented using demographic, clinical and emotional dimensions, which enables researchers to conduct OSG analytics and gain insights. It provides unprecedented access to OSG data from different viewpoints.

In order to demonstrate the OSG analytics capability, we performed several analyses on patients who report the symptom heartburn. Note that this attempt is solely to showcase the analytical capabilities and is not comprehensive medical research on heartburn.

Dimensional analysis:

In this analysis, we combine the age and gender dimensions and present the demographic distribution of patients who report the symptom heartburn.

Association mining:

The OSG analytics layer is also useful for association mining. It can be used to analyse associations between different symptoms in order to identify co-existing symptoms. Table 5.13 presents the top five other symptoms that co-exist with the symptom heartburn in different age groups.

Table 5.13: Top five symptoms that co-exist with the symptom ‘heartburn’ in different age groups

Age group | Top five symptoms that co-exist with ‘heartburn’
<20 | less sleep, depressed, tiredness, anxiety, stress, living alone
21 to 40 | reflux, anxiety, less sleep, nausea, stress
41 to 60 | anxiety, stress, depression, reflux, indigestion
61 to 80 | reflux, indigestion, anxiety, constipation, less sleep
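A simple way to produce a table of this kind is to count, per age group, which other symptoms appear in the profiles of users who report the target symptom. The sketch below assumes each user is already reduced to an age group and a set of extracted symptoms; the records and the function name are hypothetical, and plain co-occurrence counting is one possible reading of "association mining" here, not necessarily the method used in the thesis.

```python
from collections import Counter, defaultdict


def co_occurring_by_group(records, target, top_n=5):
    """Count symptoms that co-occur with `target`, separately per age group."""
    counts = defaultdict(Counter)
    for age_group, symptoms in records:
        if target in symptoms:
            counts[age_group].update(s for s in symptoms if s != target)
    return {group: counter.most_common(top_n) for group, counter in counts.items()}


# Hypothetical (age group, symptom set) records, for illustration only.
records = [
    ("21 to 40", {"heartburn", "reflux", "anxiety"}),
    ("21 to 40", {"heartburn", "reflux"}),
    ("21 to 40", {"reflux", "nausea"}),  # no heartburn, so ignored
    ("41 to 60", {"heartburn", "anxiety", "stress"}),
    ("41 to 60", {"heartburn", "anxiety"}),
]

top = co_occurring_by_group(records, "heartburn", top_n=2)
```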

Temporal analysis:

The Date dimension can be used to perform temporal analysis to identify seasonal patterns in the OSG. Figure 5.10 shows the temporal distribution of the post counts that report the symptom heartburn, drawn for each month over a period of three years. It shows that over the three-year period reported heartburns are relatively high during March and April.
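The monthly aggregation behind such a plot can be sketched as follows, assuming the posts that mention the symptom have already been selected and each carries its post date. The dates are made up for illustration and the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_counts(post_dates):
    """Number of posts per (year, month), i.e. one point per plotted month."""
    return Counter((d.year, d.month) for d in post_dates)


def calendar_month_totals(post_dates):
    """Number of posts per calendar month, pooled across years."""
    return Counter(d.month for d in post_dates)


# Hypothetical post dates of posts that mention 'heartburn'.
dates = [
    date(2015, 3, 2), date(2015, 3, 18), date(2015, 7, 9),
    date(2016, 3, 30), date(2016, 4, 11), date(2017, 4, 5),
]

per_month = monthly_counts(dates)
pooled = calendar_month_totals(dates)
```

Pooling across years is a quick way to check whether a peak recurs in the same calendar months rather than in a single year.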

Figure 5.10: Temporal distribution of the posts that mention symptom ‘heartburn’ over a three-year period

This chapter presents a multi-stage information structuring platform for online support groups which serves the information needs of OSG stakeholders: consumers, researchers and health professionals. The platform processes unstructured OSG messages using a suite of machine learning and natural language processing techniques to retrieve age, gender, narrative type, medical entities, emotions and the patient timeline. The age, gender, and narrative type extraction modules show high precision and recall when evaluated with labelled data sets. The demonstration shows that the extracted attributes provide relevant and reliable information to consumers. Also, the aggregated information is useful for research purposes. The next chapter extends this platform to structure OSG related to prostate cancer.

Chapter 6

A digital health platform

When you come to the end of your rope, tie a knot and hang on.

Franklin D. Roosevelt

This chapter employs the information extraction and structuring platform presented in the previous chapter to analyse the free-text discussions in Online Support Groups (OSG) related to prostate cancer and to conduct a multi-dimensional exploration of the emotional expression of various groups and the underlying causes of such expressions. This analysis identifies the key groups that are in need of support as well as the aspects on which such support needs to be delivered. As the number of cancer survivors is increasing globally within finite healthcare systems, this understanding is pivotal to optimising care schemes and to providing care to vulnerable groups.

The platform presented in Chapter 5 is designed for OSG on any health topic and extracts demographic, clinical and emotional information. In this chapter, this platform is further extended to capture prostate cancer specific clinical and decision making related information. The extended platform is applied to a large corpus of prostate cancer related OSG discussions collected from ten active OSG, and the collected insights are presented as two case studies: one explores the groups that are in need of more psychological support and the other investigates different decision making behaviours and decision factors.

The rest of the chapter is organised as follows. Section 6.1 briefs on cancer incidence, the unmet needs of cancer survivorship and OSG as a potential resource to understand such needs. Section 6.2 justifies the selection of prostate cancer for the analysis due to its higher survivorship and unique characteristics related to treatment choices and side-effects. Section 6.3 presents the extended platform to analyse prostate cancer OSG discussions. Section 6.4 discusses the prostate cancer OSG data collection. Sections 6.5 and 6.6 present the two case studies on the emotional expression of different groups and treatment decision making. Section 6.7 concludes the chapter with a discussion.

CHAPTER 6.

Cancer is a diverse group of diseases in which cells in some part of the body begin to multiply abnormally. In contrast to normal cells, cancer cells do not die when they should and new cells are produced when they are not needed, thereby developing a lump of cancer tissue, i.e., a tumour. This abnormal and uncontrolled growth of cancer cells damages nearby tissue and also reduces the supply of nutrients to nearby normal cells. Such damage hinders the functionality of the affected body part, which may eventually fail to function, causing death. Also, cancer cells are malignant, which means that they can spread into other body parts through the blood or lymph systems.

According to the Global Cancer Observatory, in 2018 there were an estimated 18.1 million new cancer cases diagnosed and 9.6 million cancer related deaths worldwide (Bray et al., 2018). In the US, the American Cancer Society estimated that there were 1.7 million new cancer cases diagnosed and 0.6 million cancer related deaths in 2018 (American Cancer Society, 2018; Siegel et al., 2018). Cancer incidence rates are rising due to population growth and increasing life expectancy, as well as lifestyle changes (e.g., obesity, lack of exercise) in developing countries that accompany socioeconomic development (Bray et al., 2018).

Although cancer incidence rates are increasing, cancer related deaths are decreasing due to advances in treatment options as well as in screening and early detection programs (Bray et al., 2018). Therefore, cancer, which was once uniformly fatal within a short period of time, has been transformed into a diverse set of diseases with different survival rates. Long-term survival is possible in the majority of major cancer types when diagnosed early.
In the US alone, it is estimated that by 2018 there were 15.5 million cancer survivors, a number expected to rise to 20.3 million (10 million males and 10.3 million females) in 2026 (Miller et al., 2016).

Besides the cancer patient/survivor, partners, family and friends often provide the necessary care during the primary or adjuvant treatments as well as during the long-term management of cancer as a chronic disease. It is reported that in 2015, at least 2.8 million Americans provided care to a cancer patient (National Alliance for Caregiving, 2016).

The transformation of most cancers from a fatal to a chronic disease and the growing number of cancer survivors have become a key challenge to modern health care systems as they struggle to understand and allocate resources to support the key needs of cancer patients and survivors.

Cancer patients are increasingly involved in treatment related decision making and are in need of better support in selecting treatment options. Cancer survivors, although relieved of the primary treatment, are unprepared to live with cancer and its treatment related chronic outcomes (e.g., changes to body image, infertility, stroke, intimacy issues), and also often worry about cancer recurrence (Council et al., 2005; Alfano and Rowland, 2006). Moreover, cancer caregivers report that long-term caring of the …

Although many studies have led to advances in cancer prevention, screening and treatment, which address cancer clinically, the studies on the social and psychological burden of cancer have been limited.
This lack of studies is one of the key reasons for the lack of resource allocation in the current health care system to provide the support required to overcome the social and psychological challenges of cancer patients, survivors and caregivers.

Studies on the social and psychological burden of cancer are often carried out as randomised controlled trials and cohort studies, in which social and psychological well-being is assessed using survey instruments either designed for general quality of life or specialised for certain moods such as depression and anxiety (Groth-Marnat, 2009). Examples of generic instruments are the Rotterdam Symptom Check List (RSCL) (De Haes et al., 1990), the Functional Assessment of Cancer Therapy-General (FACT-G) (Cella et al., 1993) and the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire (EORTC-QLQ-C30) (Aaronson et al., 1993). Depression and anxiety specific instruments include the Distress Thermometer (DT) (Roth et al., 1998), the Hospital Anxiety and Depression Scale (HADS) (Zigmond and Snaith, 1983) and the Beck Depression Inventory (BDI) (Beck et al., 1996).

These studies require the individuals (patients, survivors, caregivers) to self-report their social and psychological well-being from time to time, and correlate these reports with their clinical outcomes from cancer and cancer treatments, thereby developing an understanding of the social and psychological issues at different time points of the cancer journey for different cohorts of individuals. However, such long-term analysis is cumbersome and expensive, and often suffers from high drop-out rates, as individuals are less keen to support studies that stretch over a long period of time. Moreover, as discussed in Section 3.1, such studies are often subject to generalisability issues due to small sample sizes and to various recall biases.

CHAPTER 6.

As discussed in Section 5.2.1, OSG are a form of social media platform that provides anonymous, comfortable virtual spaces for patients, survivors and carers to share experiences, seek advice, express emotions and provide emotional support. Participants with similar experiences provide each other with informational and emotional support.

As pointed out in Section 5.2.3, OSG contain unsolicited, self-reported first-person accounts of experiences at different time points of the illness journey. These self-reported expressions are scattered across the OSG as parts of different discussions. However, once aggregated, they provide significant insights about the patient/caregiver and their illness journey over time. This resource is seen as instrumental in addressing the limitations and challenges of the current approach to understanding the unmet issues of cancer. The free-text corpus of OSG discussions can be mined to uncover the needs and challenges of cancer patients, survivors and caregivers based on the issues that they discuss at different stages of their cancer journey. The emotions expressed when discussing the issues can be employed as a proxy for quality of life to understand the severity of the issues. The individuals can be categorised based on their demographics (age, gender), role (patient, caregiver) and clinical factors (e.g., treatment, cancer aggressiveness). The issues expressed by different categories of individuals can be prioritised based on their emotional expression and employed in the decision making process to identify the pressing issues and the groups most in need of support, so that support can be delivered optimally with the available resources.

This approach requires the relevant information about the individuals to be extracted from the free-text OSG discussions, which is challenging due to the noisy and unstructured nature of OSG content.

To validate the premise that OSG can be better utilised to gain insights about the diverse cancer care needs of patients, survivors and caregivers, prostate cancer has been selected as a use case. The next section provides an overview of prostate cancer and justifies its use as a case study to analyse the needs and challenges of prostate cancer patients, survivors and caregivers, as discussed in prostate cancer OSG.

6.2. PROSTATE CANCER AND RELATED OSG USAGE

Prostate cancer is the development of cancer in the prostate gland, a part of the male reproductive system. It is often slow growing and less fatal than other types of cancer, but it may spread to other parts of the body (e.g., bones, lymph nodes). The key incidence factors of prostate cancer are age (mainly males >50) and family history.

It is the second most frequent cancer among men (after lung cancer) and the fifth leading cause of cancer mortality among men (Bray et al., 2018). In 2018 there were an estimated [...], and 95% (2014) in Australia. This increased survival is mainly due to the introduction of early detection schemes such as testing for abnormal prostate-specific antigen (PSA) levels (Kvåle et al., 2007), so that cancer can be removed or contained during its early stages. In the US, it is estimated that there are over 3.3 million men with a history of prostate cancer (Miller et al., 2016).

PSA and Gleason score are the key determinants of the prostate cancer stage. PSA, which stands for prostate-specific antigen, is a protein made in the prostate gland. A higher-than-normal PSA level in the blood is an indication of prostate cancer, although such an elevation can have various other causes, which may lead to misdiagnosis (false positives). Due to the simplicity of the procedure, the PSA test is often used as a routine test in older adults, with abnormalities referred for further tests.

The Gleason grading system measures the development of cancer by analysing the morphology of cancer cells and grading them 1-5, with 5 being the most aggressive (Gleason et al., 1974). The final score is determined by combining the grade of the most common cell pattern and that of the non-dominant cell pattern with the highest grade. The final score ranges between 2 and 10 and is indicative of the growth of cancer, with 10 being the most aggressive.

There are three major treatment options for prostate cancer:

1. Surgery: a surgical procedure known as radical prostatectomy that completely removes the prostate gland. It is the most widely used treatment for prostate cancer and is often administered as open, laparoscopic or robot-assisted surgery.

2. Radiation therapy: uses ionizing radiation to destroy the malignant cells of the cancer. It is used as a primary treatment or as a secondary treatment following surgery to remove any remaining cancer cells.

3. Active surveillance: for men with a less aggressive (low-risk) cancer or older individuals who have other serious conditions that prevent them from receiving a curative treatment. It involves continuous monitoring of the cancer for progression, with the intent of switching to a curative treatment if the cancer becomes aggressive.

https://prostate-cancer.canceraustralia.gov.au/statistics

These treatments lead to different outcomes in terms of curing (or containing) cancer as well as the subsequent treatment related side effects (Albertsen, 2016; Cooperberg and Carroll, 2015). Therefore, patients are eager to choose the best treatment for them, which is often selected based on multiple factors such as the stage of the cancer, age, other medical conditions, financial conditions and the personal preference of the individual (Berry et al., 2003; Xu, Dailey, Eggly, Neale and Schwartz, 2011). In fact, there have been reports of decision regret among post-treatment survivors (Clark et al., 2001; Gwede et al., 2005), mainly attributed to treatment related side effects. This complexity leads patients to seek multiple opinions on the treatment options from clinicians, family, friends and other prostate cancer survivors (Denberg et al., 2006; Berry et al., 2003). Patients also acquire more information by reading relevant books, magazines and web pages (Berry et al., 2003) as well as by interacting online with other individuals in OSG (Huber et al., 2011; Ihrig et al., 2011a).

The treatments of prostate cancer come with chronic side effects, mainly urinary, sexual and bowel impairments. Among them, urinary side effects such as urinary incontinence and the sexual side effect of erectile dysfunction are intimate in nature; thus patients are reluctant, or rather embarrassed, to self-disclose and talk about these issues with their physicians, family and friends (Weber et al., 2000). This issue is more prevalent among men, as they are less eager to seek support for social and psychological issues (Wang et al., 2013; Lintz et al., 2003) and to discuss intimate health issues with physicians or peers (Lintz et al., 2003; Weber et al., 2000). In contrast, online social media platforms such as OSG allow patients to communicate under a pseudonym without revealing their true identity. As discussed in Section 3.2.4, this anonymity leads to the online disinhibition effect (Suler, 2004), where individuals are comfortable self-disclosing more frequently and intensely in online communication than in face-to-face communication. Therefore, prostate cancer patients (or survivors) are more comfortable with, and more likely to use, OSG to discuss their intimate side effect related problems (Nanton et al., 2018).

Discussions related to prostate cancer take place either in OSG dedicated to prostate cancer (e.g., Prostatecanceruk) or in generic OSG with a specific section for prostate cancer (e.g., the Healingwell sub-forum on Prostate Cancer).

The high incidence and survival rates discussed above result globally in a large population of prostate cancer patients, survivors and caregivers participating in prostate cancer OSG, especially given that prostate cancer is prominent in developed countries, where most patients are frequent users of the internet and online social media platforms. Therefore, a large volume of prostate cancer related discussions has accumulated on OSG platforms.

Participants use prostate cancer OSG at four key stages of their cancer journey: diagnosis, pre-treatment, post-treatment side effects and recurrence. These stages have been identified as those most in need of psychological support (Weis, 2003) due to significant transitions in quality of life as well as the significant decisions [...]

6.3. PRIME

This section presents the proposed Patient Reported Information Multidimensional Exploration (PRIME) platform, which was developed to automatically analyse a large collection of cancer related OSG posts and extract demographic, clinical and emotional factors with their associated temporality. These factors were employed for automated investigation of patient behaviours, clinical factors and patient emotions across the temporalities of diagnosis, treatment and recovery. More specifically, we focus on the automated multi-granular extraction, analysis, classification and aggregation of decision-making behaviours, decision factors, the temporality of patient interactions, the temporality of clinical information and side effects, and the trajectory of positive and negative emotions, in the context of decision groups, demographics and treatment type.

Figure 6.2: Patient Reported Information Multidimensional Exploration (PRIME) platform, which contains a suite of machine learning and natural language processing based information extraction modules to process cancer related OSG data.

As shown in Figure 6.2, the PRIME framework functions in seven stages (S1-S7) to capture multiple layers of information from free-text OSG posts. As discussed in Section 5.1, OSG are organised as discussion threads, where each thread is initiated by a user (often a patient, partner of a patient or a caregiver) with a particular question or health concern. Other participants respond with answers, suggestions, accounts of their own experience or follow-up questions for further information. This process creates a discussion thread in OSG.

The patient behaviour in an OSG over time can be represented as $t_x (P_1 C_1 \times P_y C_z)$, where $t_x$ denotes time, $P_y$ denotes a contributing patient, $C_z$ denotes an OSG thread and $P_y C_z$ denotes an OSG post by patient $P_y$ as a contribution to the discussion in thread $C_z$. In this naturally occurring order of OSG discussions, the posts by a single patient $P_y$ are scattered over multiple discussions ($C_1, \ldots, C_z$). Hence, as the initial step, Stage 1 collocates the OSG conversations of a single patient and orders them chronologically by timestamp.

The timestamp of an OSG post gives its published time. However, as different cancer patients start their cancer journey at different times, published time cannot be used to compare and contrast patients. Therefore, the time of each OSG post is normalised relative to a key cancer related event (e.g., diagnosis, treatment) of each patient. This event was experimentally set to the treatment event, as it was found that many patients join OSG close to their treatment. Note that the cancer events and their temporalities are automatically extracted during the construction of the patient timeline, which will be discussed later in Stage 3.

Stages 2-4 extract various aspects of the patient profile such as demographics, clinical information and treatment decision making.
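The Stage 1 steps described above (collocating the posts of each patient, ordering them chronologically, and normalising time relative to the treatment event) can be illustrated with a minimal Python sketch. It is not the PRIME implementation: the tuple layout, the function names `months_between` and `build_timelines`, and the use of whole-month offsets are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def months_between(start: datetime, end: datetime) -> int:
    """Whole-month offset from start to end (negative if end precedes start)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def build_timelines(posts, treatment_dates):
    """Collocate posts per patient, sort them by time, and attach a month
    offset relative to that patient's treatment date.

    posts: iterable of (patient_id, thread_id, timestamp, text) tuples
           (hypothetical layout).
    treatment_dates: dict mapping patient_id -> datetime of the treatment event.
    """
    by_patient = defaultdict(list)
    for patient_id, thread_id, timestamp, text in posts:
        by_patient[patient_id].append((timestamp, thread_id, text))

    timelines = {}
    for patient_id, items in by_patient.items():
        if patient_id not in treatment_dates:
            continue  # no reference event, so the posts cannot be normalised
        t0 = treatment_dates[patient_id]
        items.sort(key=lambda item: item[0])  # chronological order
        timelines[patient_id] = [
            (months_between(t0, ts), thread_id, text)
            for ts, thread_id, text in items
        ]
    return timelines
```

With offsets expressed in months relative to treatment, posts from patients who were treated in different calendar years become directly comparable.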

Capturing demographics

Stage 2 is based on the natural language processing and machine learning based information retrieval techniques presented in Section 5.5, and extracts age, gender and narrative type. Based on whether the narrative type is first-person or second-person, the role of the participant is identified as patient or caregiver; female caregivers are predominantly the partners of patients. If the gender cannot be inferred with sufficient confidence, the individual is marked as a male patient, as prostate cancer is a male-only cancer. Patients are grouped by age into five age groups (<40, 40-50, 50-60, 60-70 and >70) to be in line with standard prostate cancer cohort studies.
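As a rough illustration of how the Stage 2 outputs could be turned into a participant profile, the following Python sketch applies the age grouping and the fallback rule for uncertain gender. The function names, the `min_confidence` value of 0.8 and the handling of boundary ages (an age equal to a boundary is placed in the lower group) are assumptions, not details given in the text.

```python
from bisect import bisect_left

# Upper bounds of the first four age groups; labels follow the text.
AGE_UPPER_BOUNDS = [40, 50, 60, 70]
AGE_LABELS = ["<40", "40-50", "50-60", "60-70", ">70"]


def age_group(age: int) -> str:
    """Map an age to one of the five age groups.

    bisect_left returns the index of the first bound >= age, so an age equal
    to a bound falls into the lower group (an assumption of this sketch).
    """
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]


def participant_profile(age, narrative, gender=None,
                        gender_confidence=0.0, min_confidence=0.8):
    """Combine age, narrative type and gender into a profile.

    narrative: "first-person" or "second-person".
    min_confidence: hypothetical threshold for trusting the inferred gender.
    """
    role = "patient" if narrative == "first-person" else "caregiver"
    if gender is None or gender_confidence < min_confidence:
        # Fallback described in the text: treat the individual as a male patient.
        gender, role = "male", "patient"
    return {"age_group": age_group(age), "gender": gender, "role": role}
```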

Capturing clinical information

Stage 3 enriches patient profiles with clinical information, which is important for categorising patients based on their treatment type and stage of cancer.

The treatment type that the patient has undergone is often disclosed during discussions about treatment choices and their respective consequences. As mentioned in Section 6.2, the multi-modality of treatment options makes the selection of treatment a complex process. Hence, it is important to compare and contrast the outcomes of patients across the different modalities.

The treatment information is extracted using the clinical event extraction technique (part of the clinical timeline extraction) presented in Section 5.6, which is a specified event detection technique (see Section 2.3.2 on specified event detection). It extracts the OSG posts that mention any of the treatment modalities and then determines the temporality of that event based on the explicitly mentioned time (e.g., ‘2 weeks ago’) and the implicit time given by the timestamp of the OSG post. Note that if the patient has undergone multiple treatments during his cancer journey (e.g., surgery and then radiotherapy), the first treatment is considered to represent his modality, as it is the primary treatment in his cancer journey.

In OSG, participants disclose their PSA and Gleason scores as a means of informing the OSG community of their stage of cancer. However, such mentions employ diverse narrative styles. For example, the Gleason score is mentioned as ‘gleason’, ‘gleson’, ‘gleeson’, ‘g’ or ‘gs’, followed by the value either as a total score (e.g., ‘7’) or as a combination of the two components (e.g., ‘4+3’). In Stage 3, association rules and extracts from clinical ontologies are employed to develop regular expressions that identify these mentions of Gleason and PSA and subsequently capture the numerical details of the Gleason and PSA scores.
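To make the Gleason extraction concrete, a hand-written regular expression covering the spelling variants and the two value formats mentioned above might look like the sketch below. The pattern is an illustrative assumption and is not the set of expressions derived in PRIME from association rules and clinical ontologies; the optional connector words (`of`, `is`, `was`) are likewise hypothetical.

```python
import re

# Keyword variants listed in the text, an optional "score" and connector word,
# then either "a+b" components (each 1-5) or a total score (2-10).
GLEASON_RE = re.compile(
    r"\b(?:gleason|gleson|gleeson|gs|g)\b\s*(?:score)?\s*(?:of|is|was|:|=)?\s*"
    r"(?:([1-5])\s*\+\s*([1-5])|(10|[2-9]))\b",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return a list of (total, primary, secondary) tuples found in text.

    primary and secondary are None when only a total score was mentioned.
    """
    results = []
    for match in GLEASON_RE.finditer(text):
        primary, secondary, total = match.groups()
        if primary is not None:
            results.append((int(primary) + int(secondary),
                            int(primary), int(secondary)))
        else:
            results.append((int(total), None, None))
    return results
```

A single-letter keyword such as ‘g’ is prone to false matches in free text, so a practical system would probably combine such a pattern with additional context checks.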

Capturing treatment decision making behaviour and decision factors

Stage 4 infers the decision making behaviour of each patient and the corresponding decision factors considered in the decision making. As discussed in Section 6.2, the multiple modalities of prostate cancer treatment make treatment decision making a complicated process. Due to this complexity, patients are often eager to get involved in the decision making process to select what is best for them. Information about this decision making process is often posted in OSG with the intention of receiving peer validation from those who have been in similar circumstances.

These decision making behaviours are captured based on established patient decision making models (Charles et al., 1999; Emanuel and Emanuel, 1992; Veatch, 1972), which highlight different types of decision making behaviour depending on the degree of involvement of the clinician and the patient. The three prominent categories of decision making behaviour are as follows:

1. Paternalistic: also known as priestly, this is the traditional behaviour in clinical decision making, where the clinician assumes a dominant role in making the decision based on his expertise and experience. In this approach, the patient delegates the decision making authority on the treatment to the clinician and provides consent to the recommendations of the clinician, assuming that the clinician would use his expertise [...]

2. Autonomous: also known as the informative or consumer model, this is the decision making behaviour where the patient assumes the dominant role. In this behaviour, the clinician provides the relevant information, lets the patient make the decision and subsequently executes the selected treatment option. In this approach, it is assumed that a patient with sufficient knowledge is capable of selecting the treatment option that best suits his condition.

3. Shared: also known as deliberative, this is the decision making behaviour where the patient and the clinician jointly work to find the best treatment option by weighing all options using multiple aspects such as clinical, social and financial.

These decision making behaviours are reflected in self-disclosed statements that announce the treatment decision of the individual (see the third OSG post in Figure 6.1). To capture such statements, a set of template patterns was engineered to capture sentences stating either that the individual has taken the decision (Autonomous) or that the treatment option was recommended by a clinician (Paternalistic). The template patterns are as follows:

• Autonomous template ($T_A$): [...] * [...] * [...]
• Paternalistic template ($T_P$): [...] * [...] * [...]

Note that * denotes zero or multiple words in between, and upper case terms are template terms, each of which stands for a set of synonym terms (words or phrases). Table 6.1 shows a selected sample of the terms employed as candidates for each template term.

Table 6.1: A sample of candidate terms for each template term used in the template sentences of decision making behaviour. Note that some of the terms are abbreviations or shortened versions of the actual terms or phrases.

| Template term | Sample candidate terms |
|---|---|
| DECIDE | decided, chosen, wind up going, made the call, settled, opted, went for, took the option, end up |
| RECOMMEND | recommend, recommended, prescribe, prescribed, advised, advise, endorse, endorsed, advocate |
| DOCTOR | doctor, doc, surgeon, urologist, uro, specialist, consultant, radiologist, oncologist, radiotherapist |
| TREATMENT | surgery, davinci, da vinci, robotic, prostatectomy, ralp, rrp, lrp, rpp, key hole, open op; radiation, imrt, brachytherapy, radiotherapy, seed therapy, brachy, seed implant, ebrt; surveillance, AS, watch and wait |

Both templates $T_A$ and $T_P$ were used to capture the matching sentences in the OSG posts of each individual. The proportion $P(T_X) = T_X / (T_A + T_P)$ (where $X$ is $A$ or $P$) of matching sentences in the OSG posts of a user is used to determine the decision making behaviour of that user as follows:

1. Paternalistic: if $P(T_P) \geq 0.25$ and $P(T_A) < 0.25$
2. Autonomous: if $P(T_P) < 0.25$ and $P(T_A) \geq 0.25$
3. Shared: if $P(T_P) \geq 0.25$ and $P(T_A) \geq 0.25$

[...] $T_P$ or $T_A$, his decision making behaviour is categorised as Paternalistic or Autonomous respectively. However, if he has both $T_P$ and $T_A$ significantly ($\geq 0.25$) [...]

From Stage 5 onwards, the PRIME framework incorporates the time dimension of OSG discussions and patient interactions. The patient timeline extraction technique presented in Section 5.6 is adapted with prostate cancer specifics to automatically generate a clinical event and emotion timeline for each individual, based on the self-disclosed side effects captured in Stage 5 and the positive/negative emotions captured in Stages 6-7.

Stage 5 captures the self-disclosure of side effects and groups them into four key categories: urinary, sexual, bowel and other. Urinary, sexual and bowel are the key side effect categories of prostate cancer treatments (Hamdy et al., 2016; Donovan et al., 2016), and other represents the miscellaneous side effects that arise from prostate cancer and its treatments.

The relevant side effects for each category were selected with the support of clinicians, and a thesaurus of relevant terms (words/phrases) for each side effect was engineered based on the medical concepts found in the UMLS Metathesaurus (Bodenreider, 2004). However, as discussed in Section 5.5.4, health consumers (patients and caregivers) discuss side effects using everyday lay language (e.g., urinary incontinence described as leakage, leak, drip). Therefore, the above thesaurus is extended by adding consumer health [...]

[...] Stages 5-6 the positive and negative emotions were captured using the emotion timeline generation technique presented in Section 5.6.2. It provides a 16-dimensional real-valued emotion vector for each OSG post, comprising 8 positive and 8 negative emotions extracted based on the psychological emotion model, the Emotion Wheel (Plutchik, 1980b, 1991). The value of each emotion represents the strength of that emotion expressed in the respective OSG post.

Each patient timeline is time-normalised with the treatment month captured in Stage 4 as $t_0$. The events (side effects and emotions) are aggregated monthly based on the reported timestamp, and the timeline is generated from three months pre-treatment ($t_{-3}$) to 12 months post-treatment ($t_{12}$) based on the available information.

The PRIME platform has been implemented in Java and is available at https://github.com/tharindurb/PRIME.

The following sections empirically evaluate the PRIME platform using a large corpus of prostate cancer related OSG discussions collected from 10 active OSG. The evaluation is carried out as two case studies: the first evaluates the emotional expression of those who have undergone treatment for low-intermediate risk prostate cancer, and the second evaluates decision making behaviours and their associated decision factors and emotional expression.

6.4. PROSTATE RELATED OSG DATA COLLECTION

The social data related to online prostate cancer discussions were collected from high volume, active OSG on prostate cancer. An OSG is considered active if it has at least 100 new conversations per week. From these active OSG, conversations were selected using the specific topic ‘prostate cancer’. Table 6.2 presents the ten OSG selected for the data collection and their URLs. Note that some of these OSG are dedicated to prostate cancer discussions (e.g., Prostatecanceruk), while others are generic OSG with a specific section for prostate cancer discussions (e.g., the Healingwell sub-forum on Prostate Cancer).

The identified discussions were extracted using a set of web scrapers developed for each OSG platform. The collected dataset contains 609,960 conversations from 22,233 OSG users, comprising a text corpus of 93,606,581 word tokens. Table 6.2 shows the number of users and number of posts collected from each OSG platform.
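Returning to the Stage 4 decision rule of Section 6.3, the proportion $P(T_X) = T_X / (T_A + T_P)$ and the 0.25 threshold can be sketched in Python as follows. The threshold rule follows the text; everything about the template matching is an assumption made for illustration: the term order inside each template (`DECIDE * TREATMENT` and `DOCTOR * RECOMMEND * TREATMENT`) is hypothetical, the term lists are a subset of Table 6.1, and the function names are invented for this sketch.

```python
import re

# Subset of the sample candidate terms in Table 6.1.
TERMS = {
    "DECIDE": ["decided", "chosen", "opted", "went for", "settled"],
    "RECOMMEND": ["recommended", "recommend", "advised", "prescribed"],
    "DOCTOR": ["doctor", "doc", "surgeon", "urologist", "uro", "oncologist"],
    "TREATMENT": ["surgery", "prostatectomy", "da vinci", "radiation",
                  "brachytherapy", "radiotherapy", "surveillance",
                  "watch and wait"],
}


def template_regex(*template_terms):
    """Build a regex in which '.*?' plays the role of '*' between terms."""
    parts = []
    for name in template_terms:
        # Longest alternatives first so that 'urologist' wins over 'uro'.
        ordered = sorted(TERMS[name], key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        parts.append(rf"\b(?:{alternatives})\b")
    return re.compile(r".*?".join(parts), re.IGNORECASE)


# ASSUMED term orders; the actual templates are not recoverable from the text.
T_A = template_regex("DECIDE", "TREATMENT")
T_P = template_regex("DOCTOR", "RECOMMEND", "TREATMENT")


def classify_counts(t_a: int, t_p: int, threshold: float = 0.25):
    """Apply the proportion rule to the two template match counts."""
    total = t_a + t_p
    if total == 0:
        return None  # no decision statements found for this user
    p_a, p_p = t_a / total, t_p / total
    if p_a >= threshold and p_p >= threshold:
        return "Shared"
    return "Paternalistic" if p_p >= threshold else "Autonomous"


def classify_user(sentences, threshold: float = 0.25):
    """Count template matches over a user's sentences and classify."""
    t_a = sum(1 for s in sentences if T_A.search(s))
    t_p = sum(1 for s in sentences if T_P.search(s))
    return classify_counts(t_a, t_p, threshold)
```

Because the two proportions sum to one, a user is labelled Shared only when neither template accounts for more than 75% of the matched sentences.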

Table 6.2: Population and volume statistics of the ten selected OSG on prostate cancer discussions.

| OSG | URL | Users | Posts |
|---|---|---|---|
| Healingwell | healingwell.com/community | 6,829 | 401,325 |
| Cancer Survivors Network | csn.cancer.org/forum | 2,775 | 33,166 |
| Cancerforums | cancerforums.net | 2,761 | 52,840 |
| Cancercompass | cancercompass.com | 2,524 | 12,141 |
| Healthboards | healthboards.com/boards | 1,938 | 17,144 |
| Patientinfo | patient.info/forums | 1,727 | 33,304 |
| Macmillanuk | community.macmillan.org.uk | 1,178 | 9,210 |
| Prostatecanceruk | community.prostatecanceruk.org | 1,119 | 28,041 |
| Prostatecancerinfolink | prostatecancerinfolink.ning.com/forum | 862 | 7,187 |
| Ustoo | inspire.com/groups/us-too-prostate-cancer | 519 | 15,602 |

This study has been conducted under ethics approval from the La Trobe University Research Ethics Committee. All patient-reported data used in this study are non-identifying and publicly available from the corresponding OSG. The OSG do not provide access to identifying information of patients, and PRIME does not process any identifying information. This work only publishes aggregates of the analysed data, which cannot be reverse engineered by any means for any form of re-identification.

6.5. ANALYSIS OF THE QUALITY OF LIFE OF PROSTATE CANCER PATIENTS

As discussed in Section 6.2, the treatment of low-intermediate risk clinically localised prostate cancer (PCa) is becoming increasingly complex due to the comparable cure rates of the available treatment options: radical prostatectomy (RP), external beam radiation therapy (EBRT) and active surveillance (AS). Therefore, in addition to the tumour characteristics, significant emphasis is placed on matching the side effect profile of each treatment option to the patient, based on their preference, anxiety levels and the experience of the medical professional (Zeliadt et al., 2006).

Recent studies (Barocas et al., 2017; Chen et al., 2017; Donovan et al., 2016) compared patient quality of life (QoL) in men randomised to AS, RP and EBRT, demonstrating that post-treatment QoL mirrored reported changes in function, yet no significant differences were observed among the groups in measures of anxiety, depression and general health-related or cancer-related QoL. While these study settings are controlled for many variables, they depend on questionnaires completed in a trial setting and may not accurately capture the real-life issues experienced by patients in diverse circumstances undergoing different treatment options for localised prostate cancer (Bowling, 1995; Carr and Higginson, 2001).

In contrast, a complex phenomenon such as QoL is effectively communicated as free flowing text containing expressions of emotion rather than as predetermined responses in a questionnaire (Carr and Higginson, 2001; Seale et al., 2010). This case study provides a complementary approach to conventional studies: it employs the emotional expression extracted by PRIME as a proxy for patient reported QoL, and compares and contrasts the expression of emotions across the three modalities and other demographic factors. Note that the time factor is not considered in this study; thus, the side effects and emotions captured in the patient timeline are aggregated.

Following the completion of the automated extraction of demographics, clinical information and expressions of emotion, inclusion criteria were set in order to focus the analysis on a specific cohort. This study focuses on those who have undergone treatment (RP, EBRT, AS) for low-intermediate risk PCa. Low risk PCa (Gleason ≤ [...]

The selected participant cohort has RP as the most common modality with 4,241 (69.7%) participants, followed by EBRT with 1,528 (25.1%) and AS with 315 (5.2%). Table 6.3 presents demographic and clinical information of participants (patients or partners of patients) in total and for each modality. The percentages in each column are determined relative to the total participants in each cohort (total or each modality). The p-value presented is from chi-square tests that evaluated the statistical significance across the three modalities.

Table 6.3 presents a comparison of different participant characteristics across the three modalities. Among the participants who selected AS, 85% have a Gleason score of 6 or less, while only 15% have 7. This is because AS is preferred for patients who are low-risk (have a slow growing cancer). The age distributions of EBRT and AS are skewed towards older participants, as some are not suitable for the surgical procedures of RP due to their age. The partner cohort makes up 13% of RP and 14% of EBRT participants, and is lowest in AS at 9%.

Side effect results show that those undergoing RP had comparatively high urinary symptoms and sexual side effects compared to the other groups, while those who had EBRT had comparatively high bowel symptoms. These findings on side effects are corroborated by key outcomes from a large randomised controlled study and two large population based prospective cohort studies (Barocas et al., 2017; Chen et al., 2017; Donovan et al., 2016).
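To illustrate the kind of chi-square comparison across the three modalities mentioned above, the following Python sketch computes a Pearson chi-square statistic for a contingency table, using the participant role counts from Table 6.3 as input. It is an illustration only: the exact test set-up used for the p-values in Table 6.3 is not described in the text, so the result need not reproduce them. The function names are hypothetical, and the closed-form p-value applies only to 2 degrees of freedom.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_two_dof(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with exactly
    2 degrees of freedom, for which the survival function is exp(-x/2)."""
    return math.exp(-stat / 2.0)


# Participant role by modality (RP, EBRT, AS), counts from Table 6.3.
ROLE_BY_MODALITY = [
    [3702, 1319, 288],  # patients
    [539, 209, 27],     # partners
]
stat, dof = chi_square_statistic(ROLE_BY_MODALITY)
p_value = p_value_two_dof(stat) if dof == 2 else None
```

For tables with other degrees of freedom, a statistics library would be needed to obtain the p-value.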

Table 6.3: Demographics and clinical characteristics of the identified low-intermediate risk prostate cancer patients and partners of patients from the ten selected OSG. Note that the percentages in each column are determined relative to the total participants in each cohort (total or each modality). The p-value presented in the last column is from chi-square tests that evaluated the statistical significance across the three modalities.

| Characteristic | RP, n (% in RP) | EBRT, n (% in EBRT) | AS, n (% in AS) | Total, n (% in Total) | p-value |
|---|---|---|---|---|---|
| Total participants | 4241 | 1528 | 315 | 6084 | |
| Gleason score | | | | | |
| <=6 | 2543 (60) | 831 (54) | 269 (85) | 3643 (60) | < [...] |
| [...] | [...] | [...] | [...] | [...] | < [...] |
| Age | | | | | |
| <40 | 110 (3) | 37 (2) | 8 (2) | 155 (3) | 0.937 |
| 41-50 | 688 (16) | 144 (9) | 32 (10) | 864 (14) | < [...] |
| [...] | [...] | [...] | [...] | [...] | < [...] |
| [...] | [...] | [...] | [...] | [...] | < [...] |
| >70 | 197 (5) | 192 (13) | 39 (12) | 428 (7) | < [...] |
| Participant role | | | | | |
| Patient | 3702 (87) | 1319 (86) | 288 (91) | 5309 (87) | 0.676 |
| Partner | 539 (13) | 209 (14) | 27 (9) | 775 (13) | 0.069 |
| Side effects | | | | | |
| Urinary symptoms | 2229 (53) | 625 (41) | 75 (24) | 2929 (48) | < [...] |
| [...] | [...] | [...] | [...] | [...] | < [...] |

Figure 6.3a depicts the positive emotion categories for the five key age groups. The categories open, happy, positive and good are consistently close to 20%, whereas alive is close to 15% and interested is close to 10%. The cohort aged <40 has average or above-average emotion levels across all positive emotions.

Figure 6.3b presents the negative emotions of the same age groups. In general, negative emotions are expressed less than positive emotions. The expression of the emotion afraid is significantly high for individuals aged <40, at 10% higher than other age [...] depressed and helpless. However, the emotion angry is expressed less by the <40 group than by all other [...] sad and confused are expressed consistently across all age groups.

Figure 6.4 presents positive and negative emotions by modality. The positive emotion open is the most significant, while alive and interested are the least significant. Figure 6.4b depicts negative emotions by modality; afraid is the most significant, on average 10% higher across all modalities. Noteworthy fluctuations are observed for AS, where sad, depressed and helpless are significantly low compared to the other two modalities.

Figure 6.5 presents positive and negative emotions by patients and partners. Both groups were more open, and less alive and interested. Partners were slightly more positive, good and alive. On average, partners expressed more negative emotions than patients; afraid is expressed 7% more by partners than by patients.

Figure 6.5b presents partner negative emotions by treatment modality. Partners of participants under active surveillance are more afraid, more angry and less hurt than partners in the other two modalities.

Figure 6.4: Positive and negative emotions by the treatment modality (RP, EBRT and AS).

Table 6.4 shows the frequently used emotion terms of the three selected cohorts (<40, >70, partners), which were determined based on the percentage increase in the frequency of each emotion term in the selected cohort compared to all participants.

Table 6.4: Frequently-used emotion terms for three selected cohorts: (i) patients aged <40, (ii) patients aged >70 and (iii) partners of patients.

| Group | Terms expressing positive emotions | Terms expressing negative emotions |
|---|---|---|
| Patients aged <40 | [...] | [...] |
| Patients aged >70 | [...] | [...] |
| Partners of patients | [...] | [...] |

The cohort aged <40 has expressed significantly high positive and negative emotions compared to other age groups. Aligning with studies that report high levels of psychological distress in young cancer patients (Compas et al., 1999; Mosher and Danoff-Burg, 2005), this group demonstrates significantly high negative emotions and emotion terms (Table 6.4). Positive categories (happy, positive) and positive emotion terms (Table 6.4) are also high in this group. This has been noted as post-traumatic growth in trials on patients with other types of cancer (Blank and Bellizzi, 2008; Manne et al., 2004; Pudrovska, 2010). As such, this cohort appears to be an ideal beneficiary of OSG and could gain from additional health care resources and discussions focused on treatment decision making and cultivating positivity.

Emotions of patients and partners differ significantly in several categories. Partners express higher levels of afraid and positive. These differences can be explained by (1) female partners being more emotionally expressive than males (Kring and Gordon, 1998) and thereby actively using OSG to share emotional experiences (Seale, 2006), and (2) caregiving partners reporting more emotional distress and anxiety than the patient (Given et al., 2001; Northouse et al., 2007). Negative terms used significantly by partners, e.g., ‘tearful’, ‘frightened’, ‘miserable’, further confirm this. While the cancer care process is directed towards the patient, with regular interactions with healthcare professionals and support groups, the partner's burden of caregiving and anxiety about potentially losing their loved one are often overlooked, highlighting the need for better support for partners of patients.

This study reports that significantly higher and more negative emotions were expressed by younger age cohorts and by partners of patients, who may benefit from increased psychological support from healthcare providers.
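The selection rule described for Table 6.4 (ranking emotion terms by the percentage increase of their frequency in a cohort relative to all participants) can be illustrated with a minimal sketch. The function name, the use of relative frequencies, and the toy term lists are assumptions made for illustration; the thesis does not specify these details here.

```python
from collections import Counter


def percentage_increase(cohort_terms, all_terms):
    """Rank terms by percentage increase of relative frequency in a cohort.

    cohort_terms: emotion terms used by the selected cohort (hypothetical input)
    all_terms:    emotion terms used by all participants, cohort included
    Returns a list of (term, percentage_increase), highest first.
    """
    cohort_counts = Counter(cohort_terms)
    all_counts = Counter(all_terms)
    cohort_total = sum(cohort_counts.values())
    all_total = sum(all_counts.values())

    scores = {}
    for term, count in cohort_counts.items():
        if all_counts[term] == 0:
            continue  # no baseline frequency to compare against
        cohort_rate = count / cohort_total
        overall_rate = all_counts[term] / all_total
        scores[term] = 100.0 * (cohort_rate - overall_rate) / overall_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data only: 100 term mentions overall, 20 of them from partners.
all_mentions = ["afraid"] * 50 + ["tearful"] * 10 + ["hopeful"] * 40
partner_mentions = ["afraid"] * 10 + ["tearful"] * 6 + ["hopeful"] * 4
ranking = percentage_increase(partner_mentions, all_mentions)
```

In this toy example 'tearful' ranks first because its share among partner mentions (30%) is three times its overall share (10%), whereas 'afraid' has the same share in both and therefore shows no increase.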

As discussed in Section 6.2.3, treatment decision making in prostate cancer is a complex process due to the availability of multiple treatment options and comparable outcomes. Therefore, the selection of treatment is based not only on clinical factors such as the stage of cancer, age and other medical conditions, but also on other factors such as financial conditions and the personal preference of the individual (Berry et al., 2003; Xu, Dailey, Eggly, Neale and Schwartz, 2011). Moreover, as discussed in Section 6.3.2, over the last few decades patients have assumed a more dominant role in decision making, moving from clinician-dominated paternalistic behaviour to more patient-involved autonomous and shared decision making behaviours.

The PRIME framework has captured the patient decision making behaviours and the associated decision factors mentioned in the OSG by the participants, which enables a large scale analysis of decision making behaviours and key decision factors across different cohorts based on demographic and clinical factors.

In this case study, the inclusion criterion is set to include patients who self-disclosed their primary PCa treatment and discussed the decision-making process that led to that treatment. A total of 6,457 participants matched this criterion, which is 29.0% of the total profiles extracted. Among them, 420 (6.5%) have shown paternalistic, 3883 (60.1%) autonomous and 2154 (33.4%) shared decision making behaviours. This distribution shows that OSG participants predominantly get involved in the treatment decision making process (autonomous or shared). Note that there may be a representation bias against paternalistic individuals, as they are more reliant on clinical sources and comparatively less likely to participate in OSG discussions.

Figure 6.6 presents a comparison of the distribution of decision behaviour groups across treatment modality, age group and Gleason score. The three Gleason score groups (<=6, 7 and >7) are based on the D’Amico risk classification (D’Amico et al., 1998) as low-risk, intermediate-risk and high-risk respectively.

The results highlight that the autonomous group dominates each cohort, followed by the shared group, leaving less than 10% for the paternalistic group. This shows that patient involvement in decision making is significant across all the cohorts. Active Surveillance (AS) has the lowest percentage in the autonomous group, but the highest in paternalistic and shared. The paternalistic group has a fairly uniform distribution across the age groups. The percentage of the shared group is highest among <40 and the autonomous group is highest among >70, which shows that young individuals are more keen to work jointly with the clinician to make the treatment decision, while older individuals are more keen to take the decision themselves. The shared group is highest among Gleason < [...]

[...] Paternalistic and autonomous groups reduce activity soon afterwards, but the shared group participated more consistently in OSG discussions throughout the next 12 months. This trend is reflected in Figure 6.7b as well, where the average number of posts by an active participant is significantly higher for the shared group. Figure 6.7c shows that in all groups the proportion of advice posts increases over time, which shows that the participants gradually become support providers to others in need. However, the shared group, with a consistently high proportion of advice, is shown to be more collaborative in OSG discussions. This transformation of the initial information/support seeker into an information/support provider happens as the participants gain more relevant knowledge and wisdom (often by receiving support from peers), which is very important for the ecosystem of any support group as it provides a continuous supply of information and support for those in need (Ziebland and Wyke, 2012).


Figure 6.7: The OSG participation activities of the three decision making behaviour groups over the patient trajectory from 3 months pre-treatment to 12 months post-treatment. Note that an active participant during a particular month is an individual who has participated in OSG discussions during that month. The OSG posts are classified as an expression of experience or providing advice by the classifier presented in Section 5.5.1.
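The three participation measures plotted in Figure 6.7 (active participants per month, average posts per active participant, and the proportion of advice posts) can be sketched as follows. This is an illustrative sketch only: the record layout `(participant_id, month_offset, label)` and the function name are hypothetical, and the labels are assumed to come from an upstream experience/advice classifier such as the one referenced in the caption.

```python
from collections import defaultdict


def monthly_activity(posts):
    """Summarise OSG activity per month relative to treatment.

    posts: iterable of (participant_id, month_offset, label), where
           month_offset is e.g. -3..12 and label is "experience" or "advice".
    Returns {month: (active_participants, avg_posts_per_active, advice_share)}.
    """
    participants = defaultdict(set)   # month -> ids that posted in that month
    post_counts = defaultdict(int)    # month -> number of posts
    advice_counts = defaultdict(int)  # month -> number of advice posts

    for participant_id, month, label in posts:
        participants[month].add(participant_id)
        post_counts[month] += 1
        if label == "advice":
            advice_counts[month] += 1

    summary = {}
    for month in sorted(post_counts):
        active = len(participants[month])
        total = post_counts[month]
        summary[month] = (active, total / active, advice_counts[month] / total)
    return summary


# Toy records: two participants before treatment, one in the treatment month.
example = [
    ("a", -1, "experience"),
    ("a", -1, "experience"),
    ("b", -1, "advice"),
    ("a", 0, "advice"),
]
activity = monthly_activity(example)
```

Computing this summary separately for the paternalistic, autonomous and shared groups would give one series per group for each of the three panels.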

It is important to understand the decision factors that influenced the treatment decisions of the individuals. Stage 4 of PRIME, delineated in Section 6.3.2, extracted these self-disclosed decision factors alongside the selected modality and the decision making behaviour. This section analyses such decision factors against the selected modality and the decision making behaviour shown by the participant. [...] The p-value in the last column of both tables denotes the statistical significance of the mentioned decision factors among the cohorts.

Table 6.5: The decision factors employed for the treatment decision against the primary treatment modality of the patient.

                          RP              EBRT              AS              Total              p value
                          n (% in RP)     n (% in EBRT)     n (% in AS)     n (% in Total)
Doctor skill discussed    2617 (66.76)    1392 (61.78)      190 (66.9)      4199 (65.03)       0.06
Surgeon mentioned         1975 (50.38)    880 (39.06)       141 (49.65)     2996 (46.4)
[...]

[...] the shared group actively utilised the OSG discussions in the treatment decision making process to weigh the available options against their outcomes. Thus, they tend to talk more about the decision factors. In contrast, paternalistic and autonomous groups either less comprehensive on [...]

Table 6.6: The decision factors employed for the treatment decision against the decision making behaviour group.

                          Paternalistic              Autonomous              Shared              Total              p value
                          n (% in Paternalistic)     n (% in Autonomous)     n (% in Shared)     n (% in Total)
Doctor skill discussed    200 (47.62)                2162 (55.68)            1837 (85.28)        4199 (65.03)
[...]

It is important to analyse the patient timeline to understand the temporalities of the clinical events and emotions of the different behaviour groups. Such analysis makes it possible to compare and contrast the characteristics of the behaviour groups.

Figure 6.8 presents the aggregated positive and negative emotion timelines of the three decision behaviour groups. Note that the aggregated emotions are obtained by accumulating all positive emotions (e.g., open, alive) into an aggregated positive emotion and accumulating all negative emotions (e.g., afraid, hurt) into an aggregated negative emotion.

Figure 6.8: The positive and negative emotion trajectories of the three decision making behaviour groups over the patient trajectory from 3 months pre-treatment to 12 months post-treatment.

As shown in Figure 6.8, both positive and negative emotions peaked in the treatment month, which is due to the elevated emotional state associated with a traumatic event like surgery. The shared group is the most positive and the autonomous group is the most negative during the treatment month. After the treatment, there is a rapid drop in positive and negative emotions across all groups, which stabilised in subsequent months. However, the shared group is consistently more expressive in positive emotions than the other two groups. The paternalistic group has shown high variance in the latter part of the timeline in both positive and negative emotions. This seems to be due to the drop in active paternalistic participants shown in Figure 6.7a, which leaves the averaged emotion values affected by a few emotional individuals.

Figure 6.9: The self-disclosed side effect timeline aggregated monthly across the three decision making behaviour groups from the 1st month post-treatment to 12 months post-treatment.

Although treatment decision making does not affect the occurrence of treatment-related side effects, it affects the awareness and acceptance of them. Figure 6.9 presents the self-disclosed side effect timeline aggregated monthly across the three decision making behaviour groups. It highlights that the shared group reports more on side effects than the other two groups across all side effect categories.
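The aggregation behind the emotion timelines of Figure 6.8 (accumulating individual positive emotions into one positive score and individual negative emotions into one negative score, per month) might look like the sketch below. The category sets use only emotion names mentioned in this chapter and may be incomplete; the per-post score dictionaries, the averaging over posts and the function name are assumptions for illustration, not the thesis implementation.

```python
from collections import defaultdict

# Emotion categories named in this chapter; the full category lists are assumed.
POSITIVE = {"open", "alive", "interested", "happy", "good", "positive"}
NEGATIVE = {"afraid", "hurt", "angry", "sad", "depressed", "helpless", "confused"}


def aggregate_emotions(posts):
    """Average aggregated positive/negative emotion per month.

    posts: iterable of (month_offset, {emotion_category: score}).
    Returns {month: (mean_positive, mean_negative)}, ordered by month.
    """
    # month -> [sum of positive scores, sum of negative scores, post count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for month, scores in posts:
        bucket = sums[month]
        bucket[0] += sum(v for name, v in scores.items() if name in POSITIVE)
        bucket[1] += sum(v for name, v in scores.items() if name in NEGATIVE)
        bucket[2] += 1
    return {
        month: (pos / count, neg / count)
        for month, (pos, neg, count) in sorted(sums.items())
    }


# Toy scores: two posts in the treatment month (0), one post a month later.
timeline = aggregate_emotions([
    (0, {"open": 0.5, "afraid": 0.25}),
    (0, {"alive": 0.5, "hurt": 0.25}),
    (1, {"open": 0.25}),
])
```

Because each monthly value is an average over the posts available in that month, a month with very few posts is dominated by a handful of individuals, which is consistent with the high variance noted for the paternalistic group late in the timeline.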

Shared and autonomous groups initially reported more urinary side effects (Figure 6.9a), which gradually declined over time, in contrast to the paternalistic group, which fluctuates over time. Shared and autonomous groups consistently reported sexual side effects (Figure 6.9b), while the paternalistic group shows increased reporting over time, reaching a level equivalent to that of the autonomous group by month 12. The bowel side effects (Figure 6.9c) are the least mentioned across all groups as they are relatively less prominent.

The significantly higher reporting of side effects by the shared group is mainly due to their openness to discussing such issues and their high collaborativeness in OSG. Similarly, the paternalistic group is less open and collaborative in discussing side effects in OSG. However, it is interesting to observe the increase in the reporting of sexual side effects by the paternalistic [...]

[...] shared group has been shown to discuss significantly more decision factors, indicating that they are more aware of decision factors and effectively use OSG discussions in the decision making process. The shared group reported more side effects. However, they have demonstrated a better quality of life (increased positive and decreased negative emotions) over time than the other two groups. These characteristics of the shared group align with two coping strategies for traumatic events such as cancer. The first is that being more informed about the expected outcomes leads to an informed decision that develops acceptance of the outcomes of the treatment. The second is that discussing a traumatic event among individuals with similar or worse outcomes leads to positive social comparisons (see Section 3.2.4), highlighting that the individual is not alone, which reduces emotional distress (Taylor and Lobel, 1989).

These results also highlight the importance of shared decision-making. It is argued that patients should be provided with the necessary tools to gather information and to know their decision options, scenarios and consequences for shared decision-making to be effective (Elwyn et al., 2012). The significance of emotional support that allows patients to freely express values and preferences and ask questions without clinician obstruction is also highlighted (Stacey et al., 2011). The proliferation of OSG is a clear indication that patients and carers are bridging this gap by seeking (and providing) this service outside healthcare providers and institutions. Further, OSG provide information, decision options and emotional support with the added advantage of a geographically dispersed community of individuals who are undergoing or have undergone similar circumstances.

The above two case studies have shown the wealth of information encapsulated in OSG discussions. The first case study investigated the positive and negative emotional expression of different groups and highlighted several highly emotionally expressive groups as more in need of support. The second case study explored different decision making behaviours and evaluated their decision factors, emotional expression and reporting of side effects. It identified that the shared group reports better emotional expression than the other groups, which could be attributed to their extensive consideration of decision factors and active role in decision making. The shared group is also pivotal for the continuation of OSG, as its members are more likely to stay longer with the OSG and become information/support providers over time. These insights help to shape the optimum delivery of necessary care to vulnerable prostate cancer patients/survivors and caregivers.

With the increasing presence of social media empowered patients, OSG make a paramount contribution as a medium for information exchange and emotional support. This is seen to be instrumental in addressing the ‘out of sight, out of mind’ dilemma that arises due to periodic or occasional clinician consultations during the cancer journey. Participants who have undergone similar circumstances (e.g., similar treatment) are willing to share their experience and offer advice and emotional support to those who seek it in OSG. Hence, the medical support network for care in any type of cancer should look into relevant OSG platforms in order to understand unmet needs and provide optimal and individualised care that is clinically appropriate for patients with cancer.

As explicated in this study, the PRIME platform provides the required functionality to automatically analyse OSG discussions and transform them into a multi-dimensional information source which can be explored to understand the unmet needs of prostate cancer patients. PRIME can be readily extended to any form of cancer by adapting the clinical information extraction layers of the platform to incorporate the clinical information of the respective cancer.

This chapter extends the multi-stage information structuring platform (presented in the previous chapter) to capture insights specific to prostate cancer related online support groups (OSG). It structures and transforms prostate cancer related online support group discussions into a multidimensional representation based on demographics, emotions and clinical factors. This representation is employed to investigate the self-disclosed quality of life against time, demographics and clinical factors. Moreover, it was used to assess different decision making behaviours and decision factors related to treatment decision making and their associated emotions over time pre- and post-treatment. These investigations have provided insights into the emotional expression of different groups and highlight several highly emotionally expressive groups as more in need of support. They also reveal differences in decision making behaviours and decision factors across distinct decision making groups. These insights help to shape the optimum delivery of necessary care to vulnerable prostate cancer patients/survivors and caregivers.

Chapter 7

Conclusion

Begin thus from the first act, and proceed; and, in conclusion, at the ill which thou hast done, be troubled, and rejoice for the good.

Pythagoras

This chapter concludes the thesis by presenting a summary of the research contributions, as well as directions for future work. Section 7.1 provides a summary of conceptual, technical and application contributions, Section 7.2 describes how those contributions have addressed the key research questions, and Section 7.3 concludes this chapter by providing future research directions arising from this thesis.

This thesis is an exploration to develop techniques that can be used to gain deeper insights from social data, which has yielded theoretical, technical and application contributions.

As the conceptual contribution, a comprehensive literature study has been conducted in Chapter 2 on current state-of-the-art machine learning and natural language processing techniques employed to generate insights from social data. This study is focused on three types of insights often captured from social data: topics, events and emotions. Section 2.2 investigates current techniques used for topic extraction, which include topic modelling and text clustering. Section 2.3 assesses two types of event detection techniques: specified event detection techniques, which capture known event types using event related taxonomies, and unspecified event detection techniques, which capture unknown/unforeseen event types using various change indicators. Section 2.4 first reviews theories of emotions and emotional models developed in social sciences and subsequently reviews how those models are incorporated into computational techniques to extract emotions from text. Finally, Section 2.7 provides an exhaustive analysis of the challenges and limitations of existing techniques in relation to gaining insights from social data.

As the second theoretical contribution, this thesis developed a layered conceptual framework for interpreting data generated in online social media platforms, a.k.a. social data. This framework is presented in Chapter 3. It is built upon existing theories of cognition, social needs and social behaviour, and depicts that social data is generated by social interactions happening in online social media platforms, and that such interactions reflect different social behaviours driven by different social needs. This conceptual framework provides a deeper meaning to social data as archived traces of online social behaviours rather than just a corpus of unstructured text. In addition, four key social behaviours were conceptualised based on existing literature, and a comprehensive comparison is provided that contrasts the exhibition of each behaviour in the physical world and in online social media platforms.

On the technical front, two suites of algorithms were developed to harness insights from different types of online social media platforms. The first suite consists of two unsupervised incremental learning algorithms presented in Chapter 4. These algorithms were designed to capture insights from fast-paced social data streams. The first is a new unsupervised incremental machine learning algorithm that automatically structures a social data stream into topics and extends those across time into topic pathways. This algorithm is developed based on the GSOM self-structuring algorithm and the IKASL incremental learning algorithm. It captures changes in topics over time as well as distinct new topics as new topic pathways at different points in time. It was also designed to overcome the challenges present in social data with respect to its brevity, unstructuredness and diversity. The second algorithm is a multi-faceted event detection algorithm developed to monitor topic pathways for significant changes in online social behaviours over time, and to identify such changes as potential events of interest. The changes in social behaviours were identified using automatic event indicators such as changes in volume, positive sentiment and negative sentiment. These algorithms are presented in the journal publication entitled Automatic event detection in microblogs using incremental machine learning (Bandaragoda et al., 2017).

The second technical contribution is another set of algorithms presented in Chapter 5, built with a suite of machine learning and natural language processing algorithms developed to structure slow-paced social data streams. These algorithms are specifically developed for online support groups, which are social discussions related to health issues by patients and caregivers. These algorithms extract demographics, deeper emotions and clinical events from users' self-disclosed information encapsulated in unstructured text. Subsequently, the extracted information is further structured by formulating patient event timelines based on clinical events and emotions with their associated time. This extracted information is used to build a multi-dimensional database that can be used to query relevant and reliable information for different use cases. Moreover, multiple aggregates of information can be obtained for research purposes. These algorithms are presented in the journal publications Text mining for personalized knowledge extraction from online support groups (Bandaragoda, De Silva, Alahakoon, Ranasinghe and Bolton, 2018) and Machine learning to support social media empowered patients in cancer care and cancer treatment decisions (De Silva et al., 2018).

As application contributions, the above developed algorithms have been successfully trialled with social data streams from two online social media platforms. The suite of algorithms developed for fast-paced text streams has been tested with two Twitter datasets containing 6 months of tweets on two entities: a politician and an organisation. [...] Automatic event detection in microblogs using incremental machine learning (Bandaragoda et al., 2017).

The second application contribution is based on the second suite of algorithms developed to structure slow-paced social data streams. It was applied to a large text corpus collected from two large online support groups (OSG), which consists of over 4 million posts. It successfully structures those posts pivoted by user id, automatically building user profiles and user timelines from the extracted information. The use of these algorithms is demonstrated using two use cases. The first shows how it can be used by patients/caregivers to look for relevant and reliable information based on different criteria, better than the standard methods available. The researcher use case demonstrates that researchers can use this structure to capture aggregated information on different study cohorts, which can be used to generate population level insights. These results are published in the journal publication entitled Text mining for personalized knowledge extraction from online support groups (Bandaragoda, De Silva, Alahakoon, Ranasinghe and Bolton, 2018).

The third application contribution is presented in Chapter 6, which is an extension of the researcher use case discussed above. In this application, the platform developed to structure slow-paced social data streams is extended to capture insights specific to prostate cancer related online support groups (OSG). It structures and transforms prostate cancer related online support group discussions into a multidimensional representation based on demographics, emotions and clinical factors, and investigates the self-disclosed quality of life against time, demographics and clinical factors. Moreover, it assessed different decision making behaviours and decision factors related to treatment decision making and their associated emotions over time pre- and post-treatment. These investigations have provided insights into the emotional expression of different groups and highlight several highly emotionally expressive groups as more in need of support. They also reveal differences in decision making behaviours and decision factors across distinct decision making groups. These insights help to shape the optimum delivery of necessary care to vulnerable prostate cancer patients/survivors and caregivers. These investigations were carried out in collaboration with researchers from a key urology institute and the results were published in three journal papers: (i) The patient-reported information multidimensional exploration (PRIME) framework for investigating emotions and other factors of prostate cancer patients with low intermediate risk based on online cancer support group discussions (Bandaragoda, Ranasinghe, Adikari, de Silva, Lawrentschuk, Alahakoon, Persad and Bolton, 2018), (ii) Machine learning to support social media empowered patients in cancer care and cancer treatment decisions (De Silva et al., 2018), and (iii) Robotic-assisted vs. open radical prostatectomy: A machine learning framework for intelligent analysis of patient-reported outcomes from online cancer support groups (Ranasinghe et al., 2018).

This section describes how the above contributions have addressed the research questions delineated in Chapter 1.

1. What are the limitations of existing artificial intelligence algorithms and natural language processing techniques in the study of social interactions and social behaviours using representative online social data in digital environments?

This research question investigates applications of existing techniques on social data and consolidates key challenges and limitations faced in such applications. Existing machine learning and natural language processing techniques applied to social data are assessed in Chapter 2. This assessment captures key applications such as topic extraction, event detection, emotion extraction, self-structuring and incremental learning. It discusses different supervised and unsupervised state-of-the-art techniques and their limitations when applied to social data. Subsequently, Section 2.7 consolidates the key challenges of social data leading to the limitations in existing techniques.

2. How can theories of social behaviour from social sciences contribute towards a conceptual model of enhanced understanding of social interactions in digital environments, as well as the representative online social data?

This research question is formulated to investigate how existing social theories on human social behaviour, social needs and cognition can be applied to social data. In order to address this research question, a conceptual framework is developed in Section 3.2 based on existing theories in social sciences to consider social data as representing the surface layer of a hierarchy of human social behaviour, social needs and cognition. This model assumes that social data is generated as a result of social interactions that happen in online social media platforms, that social behaviours are abstractions of social interactions, and that cognition drives social interactions to achieve social needs. Section 3.2 further explains four key behaviours that are present in most social interactions and compares those behaviours in natural and digital environments. Based on this conceptual framework, a platform is proposed in Section 3.3 which provides structure to unstructured social data for machine processing and also serves as a means for coupling established social theories with the new forms of digital social behaviour and interactions.

3. How can new incremental machine learning algorithms, founded on the principles of self-structuring artificial intelligence, address the challenges of using social data to understand social behaviours?

[...]

4. How can the algorithms developed in 3 be advanced into technology platforms that deliver actionable insights for societal advancement?

This research question is formulated to investigate how the developed algorithms can be employed to develop technology platforms that are capable of generating actionable insights which can be used for societal advancement. These technology platforms were specifically designed to accommodate the distinct and contrasting characteristics/challenges of different online social media platforms. As detailed in the previous section as application contributions, this thesis presents three applications where the developed platforms are employed to generate valuable insights. In these applications, large text corpora collected from different online social media platforms were processed using the developed platforms. In the first application, two datasets collected from Twitter are used to extract topic pathways and significant events using the platform developed in Chapter 4. The second application is structuring information present as self-disclosed unstructured text in online support groups. This structured information provides relevant and reliable information to other users, and aggregates pivoted by different variables to health researchers. Chapter 6 expands the researcher application to gain insights from prostate cancer related online support groups.
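The idea of monitoring a topic pathway for significant fluctuations using several indicators (volume, positive sentiment, negative sentiment) can be illustrated with a deliberately simple sketch. The rule used here, flagging a time window whose value deviates from the mean of the preceding windows by more than a fixed number of standard deviations, is an assumption chosen for illustration and is not the event detection algorithm of Chapter 4; the function name, parameters and toy series are likewise hypothetical.

```python
import math
from statistics import mean, pstdev


def detect_events(indicators, history=4, threshold=3.0):
    """Flag windows where any indicator fluctuates strongly.

    indicators: {indicator_name: [value per time window]} for one topic pathway.
    history:    number of preceding windows used as the baseline.
    threshold:  deviation, in baseline standard deviations, that counts as an event.
    Returns a sorted list of (window_index, indicator_name).
    """
    events = []
    for name, series in indicators.items():
        for t in range(history, len(series)):
            baseline = series[t - history:t]
            mu, sigma = mean(baseline), pstdev(baseline)
            if sigma == 0:
                # Flat baseline: any departure from it counts as a fluctuation.
                deviates = not math.isclose(series[t], mu)
            else:
                deviates = abs(series[t] - mu) / sigma > threshold
            if deviates:
                events.append((t, name))
    return sorted(events)


# Toy pathway: a volume spike in window 4, sentiment unchanged throughout.
pathway = {
    "volume": [100, 102, 98, 100, 300, 101],
    "negative_sentiment": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
}
flagged = detect_events(pathway)
```

In practice each flagged window would still need to be checked against external evidence, in the way the thesis validates detected events against contemporary news reports.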

The platforms and techniques developed in this thesis have addressed numerous problems of gaining insights from social data accumulated in online social media platforms, and have been successfully applied to different applications. However, considering the scale of data and the amount of insights encapsulated, gaining insights from social data on underlying social behaviours is still in its infancy. As discussed, swathes of social data accumulate in social media platforms every day, and people from all walks of life are rapidly embracing online social media platforms to perform online social interactions. Hence, there exist substantial challenges to overcome and massive potential to gain.

One of the key avenues for future research is the predictive aspect, where predictive techniques can be integrated into the developed self-learning platforms to predict future dynamics of human behaviours. These predictions can be individualistic, where predictions are generated for an individual based on different layers of the conceptual model presented in this thesis. For instance, predictions can be generated in relation to the cognitive process of an individual, to predict the state of mind and the transition of emotions over time based on past indicators of such transitions captured from social data. Such predictions are invaluable for managing chronic mental disorders such as depression and anxiety, which spread through society at epidemic levels. A predictive model trained to capture warning signs of mental disorders from social data/social interactions can be deployed to social media platforms to identify individuals who are in need of support.

Predictions can also be made at group or population level to predict different group dynamics of behaviour. For instance, the event detection technique developed in Chapter 4 can be extended to event prediction, where predictive models are trained on transitions of event indicators to learn whether or when a discussion gains popularity, and to generate predictions of future popularity. Such predictions are useful in many applications, such as law enforcement, to predict potentially disruptive movements, and marketing, to predict the gain of a marketing campaign.

Another direction for future work is how to extend this work into a fully automated system that can self-adapt and scale to generate insights from social media relevant to any domain. The self-structuring AI framework presented in Chapter 3 is generalisable to any domain. Also, the unsupervised self-structuring algorithms presented in Chapters 4 and 5 are scalable and applicable to any domain. However, some of the natural language processing and machine learning algorithms developed to gather insights are specifically developed and fine-tuned for the applications presented in the thesis. As future work, this [...]

CHAPTER 7. eferences

Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N. J., Fil-iberti, A., Flechtner, H., Fleishman, S. B., de Haes, J. C. et al. (1993). The europeanorganization for research and treatment of cancer qlq-c30: a quality-of-life instrumentfor use in international clinical trials in oncology,

JNCI: Journal of the National CancerInstitute (5): 365–376.Abel, F., Hauﬀ, C., Houben, G.-J., Stronkman, R. and Tao, K. (2012). Twitcident: Fight-ing Fire with Information from Social Web Streams, Proceedings of the 21st InternationalConference on World Wide Web , ACM, pp. 305–308.Adelman, R. D., Tmanova, L. L., Delgado, D., Dion, S. and Lachs, M. S. (2014). Caregiverburden: a clinical review,

Jama (10): 1052–1060.Adler, P. S. and Kwon, S.-W. (2002). Social capital: Prospects for a new concept,

TheAcademy of Managemenet Review : 17–40.Aggarwal, C. C., Han, J., Wang, J. and Yu, P. S. (2003). A framework for clusteringevolving data streams, Proceedings of the 29th international conference on Very largedata bases-Volume 29 , VLDB Endowment, pp. 81–92.Aggarwal, C. C., Hinneburg, A. and Keim, D. A. (2001). On the Surprising Behaviorof Distance Metrics in High Dimensional Space,

International conference on databasetheory , Springer, Berlin, Heidelberg, pp. 420–434.Aggarwal, C. C. and Zhai, C. (2012a).

A Survey of Text Clustering Algorithms .Aggarwal, C. C. and Zhai, C. (2012b).

Mining text data , Springer Science & BusinessMedia.Aiken, L. H., Clarke, S. P., Sloane, D. M., Sochalski, J. and Silber, J. H. (2002). Hos-pital nurse staﬃng and patient mortality, nurse burnout, and job dissatisfaction,

Jama (16): 1987–1993.Ajzen, I. (1985). From intentions to actions: A theory of planned behavior,

Action control ,Springer, pp. 11–39.Ajzen, I. (1991). The theory of planned behavior,

Organizational behavior and humandecision processes (2): 179–211. 17374 REFERENCES

Alahakoon, D., Halgamuge, S. K. and Srinivasan, B. (2000). Dynamic Self-OrganizingMaps with Controlled Growth for Knoledge Discovery,

IEEE Transactions on NeuralNetworks (3): 601—-614.Albertsen, P. (2016). Who Should Consider Active Surveillance?, The Journal of Urology (6): 1604–1605.Alderfer, C. P. (1969). An empirical test of a new theory of human needs,

Organizationalbehavior and human performance (2): 142–175.Alfano, C. M. and Rowland, J. H. (2006). Recovery issues in cancer survivorship: a newchallenge for supportive care, The Cancer Journal (5): 432–443.Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y. et al. (1998). Topicdetection and tracking pilot study: Final report, Proceedings of the DARPA broadcastnews transcription and understanding workshop , Vol. 1998, Citeseer, pp. 194–218.Allan, J., Papka, R. and Lavrenko, V. (1998). On-line new event detection and tracking,

Proceedings of the 21st annual international ACM SIGIR conference on Research anddevelopment in information retrieval , ACM, pp. 37–45.Alm, C. O., Roth, D. and Sproat, R. (2005). Emotions from text: machine learningfor text-based emotion prediction,

Proceedings of the conference on human languagetechnology and empirical methods in natural language processing , Association for Com-putational Linguistics, pp. 579–586.AlSumait, L., Barbar´a, D. and Domeniconi, C. (2008). On-line lda: Adaptive topic modelsfor mining text streams with applications to topic detection and tracking,

Data Mining,2008. ICDM’08. Eighth IEEE International Conference on , IEEE, pp. 3–12.Altman, I. and Taylor, D. A. (1973).

Social penetration: The development of interpersonalrelationships. , Holt, Rinehart & Winston.American Cancer Society (2018). Cancer facts & ﬁgures 2018,

Technical report .Aronson, A. R. and Lang, F.-M. (2010). An overview of MetaMap: historical perspec-tive and recent advances.,

Journal of the American Medical Informatics Association :JAMIA (3): 229–36.Aronson, A. R., Lang, F.-M., Aronson, A., Aronson, A., Rindﬂesch, T., Browne, A., Aron-son, A., Rindﬂesch, T., Rindﬂesch, T., Aronson, A., Rindﬂesch, T., Aronson, A., Aron-son, A., McCray, A., Aronson, A., Browne, A., Humphrey, S., Rogers, W., Kilicoglu, H.,Aronson, A., Masseroli, M., Kilicoglu, H., Lang, F., Fiszman, M., Demner-Fushman,D., Lang, F., Ahlers, C., Fiszman, M., Demner-Fushman, D., Demner-Fushman, D.,Humphrey, S., Ide, N., Weeber, M., Klein, H., Aronson, A., Aronson, A., Bodenreider,O., Chang, H., Kim, G., Aronson, A., Mork, J., Aronson, A., Mork, J., Gay, C., Gay,C., Kayaalp, M., Aronson, A., Neveol, A., Shooshan, S., Humphrey, S., Butte, A., Ko-hane, I., Hsieh, Y., Hardardottir, G., Brennan, P., Baud, R., Ruch, P., Gaudinat, A., EFERENCES

Journalof the American Medical Informatics Association : JAMIA (3): 229–36.Asur, S. and Huberman, B. A. (2010). Predicting the future with social media, pp. 492–499.Atefeh, F. and Khreich, W. (2015). A survey of techniques for event detection in twitter, Computational Intelligence (1): 132–164.Attik, M., Bougrain, L. and Alexandre, F. (2005). Self-organizing map initialization, Lecture Notes in Computer Science (including subseries Lecture Notes in Artiﬁcial In-telligence and Lecture Notes in Bioinformatics) , Vol. 3696 LNCS, pp. 357–362.Augusto, J. C. (2005). Temporal reasoning for decision support in medicine,

ArtiﬁcialIntelligence in Medicine (1): 1–24.Baldridge, J. (2005). The opennlp project, URL: http://opennlp. apache. org/index.html,(accessed 2 February 2012) p. 1.Baldwin, B. (1997). CogNIAC: High precision coreference with limited knowledge and lin-guistic resources,

Proceedings of the ACL Workshop on Operational Factors in Practical,Robust Anaphora Resolution for Unrestricted Text, Madrid, Spain, July 1997 pp. 38–45.Baldwin, T., Cook, P., Lui, M., Mackinlay, A. and Wang, L. (2013). How Noisy SocialMedia Text , How Diﬀrnt Social Media Sources ?,

Proc. IJCNLP 2013 , number October,pp. 356–364.Bandaragoda, T. R., De Silva, D. and Alahakoon, D. (2017). Automatic event detec-tion in microblogs using incremental machine learning,

Journal of the Association forInformation Science and Technology (10): 2394–2411.Bandaragoda, T. R., De Silva, D., Alahakoon, D., Ranasinghe, W. and Bolton, D. (2018).Text mining for personalized knowledge extraction from online support groups, Journalof the Association for Information Science and Technology (12): 1446–1459.Bandaragoda, T., Ranasinghe, W., Adikari, A., de Silva, D., Lawrentschuk, N., Alahakoon,D., Persad, R. and Bolton, D. (2018). The Patient-Reported Information Multidimen-sional Exploration (PRIME) Framework for investigating emotions and other factors76 REFERENCES of prostate cancer patients with low intermediate risk based on online cancer supportgroup discussions,

Annals of surgical oncology (6): 1737–1745.Banerjee, S., Ramanathan, K. and Gupta, A. (2007). Clustering short texts usingwikipedia, Proceedings of the 30th annual international ACM SIGIR conference on Re-search and development in information retrieval - SIGIR ’07 , ACM Press, New York,New York, USA, p. 787.Bar-Lev, S. (2008). ”We are here to give you emotional support”: Performing emotionsin an online HIV/AIDS support group,

Qualitative Health Research (4): 509–521.Barabasi, A.-L. (2005). The origin of bursts and heavy tails in human dynamics, Nature (7039): 207.Barocas, D. A., Alvarez, J., Resnick, M. J., Koyama, T., Hoﬀman, K. E., Tyson, M. D.,Conwill, R., Mccollum, D., Cooperberg, M. R., Goodman, M., Greenﬁeld, S., Hamilton,A. S., Hashibe, M., Kaplan, S. H., Paddock, L. E., Stroup, A. M., Wu, X.-C. andPenson, D. F. (2017). Association Between Radiation Therapy, Surgery, or Observationfor Localized Prostate Cancer and Patient-Reported Outcomes After 3 Years,

JAMA (11): 1126–1140.Bazarova, N. N. and Choi, Y. H. (2014). Self-disclosure in social media: Extending thefunctional approach to disclosure motivations and characteristics on social network sites,

Journal of Communication (4): 635–657.Beck, A. T., Steer, R. A. and Brown, G. K. (1996). Beck depression inventory-ii, SanAntonio (2): 490–498.Becker, H., Iter, D., Naaman, M. and Gravano, L. (2012). Identifying content for plannedevents across social media sites, Proceedings of the ﬁfth ACM international conferenceon Web search and data mining , ACM, pp. 533–542.Becker, H., Naaman, M. and Gravano, L. (2011). Beyond Trending Topics: Real-WorldEvent Identiﬁcation on Twitter.,

Proceedings of the Fifth International Conference onWeblogs and Social Media (ICWSM) , pp. 1–17.Bender, J. L. (2011).

The Web of Care : A Multi-Method Study Examining the Role ofOnline Communities as a Source of Peer-to-Peer Supportive Care for Breast CancerSurvivors By The Web of Care : A Multi-Method Study Examining the Role of OnlineCommunities as a Source of Peer- , PhD thesis.

URL: https://tspace.library.utoronto.ca/handle/1807/31690

Berger, C. R. and Bradac, J. J. (1982).

Language and social knowledge: Uncertainty ininterpersonal relations , Vol. 2, Hodder Education.Berger, C. R. and Calabrese, R. J. (1975). Some explorations in initital interaction andbeyod: Toward a developmental theory of interpersonal communication,

Human Com-munciation Research (2): 99–112. EFERENCES

Urologic Oncology: Seminars and Original Investigations (2): 93–100.Bethard, S., Derczynski, L., Savova, G., Pustejovsky, J. and Verhagen, M. (2015).SemEval-2015 Task 6: Clinical TempEval, Proceedings of the 9th International Work-shop on Semantic Evaluation (SemEval 2015) , pp. 806–814.Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U., Beeri, C. and Buneman, P. (1999).When Is ”Nearest Neighbor” Meaningful?,

Database TheoryICDT’99 : 217–235.Blank, T. O. and Bellizzi, K. M. (2008). A gerontologic perspective on cancer and aging.

URL: http://doi.wiley.com/10.1002/cncr.23444

Blank, T. O., Schmidt, S. D., Vangsness, S. A., Monteiro, A. K. and Santagata, P. V.(2010). Diﬀerences among breast and prostate cancer online support groups,

Computersin Human Behavior (6): 1400–1404.Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent dirichlet allocation, Journal ofmachine Learning research (Jan): 993–1022.Bodenreider, O. (2004). The Uniﬁed Medical Language System (UMLS): integratingbiomedical terminology, Nucleic Acids Research (90001): 267D–270.Bollen, J., Mao, H. and Pepe, A. (2011). Modeling public mood and emotion: Twit-ter sentiment and socio-economic phenomena, Fifth International AAAI Conference onWeblogs and Social Media .Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market,

Journalof computational science (1): 1–8.Bourdieu, P. (2011). The forms of capital.(1986), Cultural theory: An anthology : 81–93.Boureau, Y.-L., Ponce, J. and LeCun, Y. (2010). A theoretical analysis of feature poolingin visual recognition, International Conference on Machine Learning (ICML) , pp. 111–118.Bowlby, J. (1969).

Attachment, Vol. 1 of Attachment and loss , New York: Basic Books.Bowling, A. (1995). What things are important in people’s lives? A survey of the public’sjudgements to inform scales of health related quality of life,

Social Science and Medicine (10): 1447–1462.Boyd, D. and Ellison, N. B. (2007). Social Network Sites: Deﬁnition, History, and Schol-arship, Journal of Computer-Mediated Communication : 210–230.Boyd, R., Gintis, H. and Bowles, S. (2010). Coordinated punishment of defectors sustainscooperation and can proliferate when rare, Science (5978): 617–620.78

REFERENCES

Boyd, R., Gintis, H., Bowles, S. and Richerson, P. J. (2003). The evolution of altruisticpunishment,

Proceedings of the National Academy of Sciences (6): 3531–3535.Boyd, R. and Richerson, P. J. (2009). Culture and the evolution of human coopera-tion,

Philosophical Transactions of the Royal Society of London B: Biological Sciences (1533): 3281–3288.Bradley, M. M., Greenwald, M. K., Petry, M. C. and Lang, P. J. (1992). Remembering pic-tures: pleasure and arousal in memory.,

Journal of experimental psychology: Learning,Memory, and Cognition (2): 379.Bradley, M. M. and Lang, P. J. (1999). Aﬀective norms for english words (anew): Instruc-tion manual and aﬀective ratings, Technical report , Citeseer.Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A. and Jemal, A. (2018).Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwidefor 36 cancers in 185 countries,

CA: a cancer journal for clinicians .Bretherton, I. (1992). The origins of attachment theory: John bowlby and maryainsworth.,

Developmental psychology (5): 759.Brody, S. and Diakopoulos, N. (2011). Cooooooooooooooollllllllllllll !!!!!!!!!!!!!! Using WordLengthening to Detect Sentiment in Microblogs *** Preprint Version, ComputationalLinguistics (2010): 562–570.Brooks, M., Kuksenok, K., Torkildson, M. K., Perry, D., Robinson, J. J., Scott, T. J.,Anicello, O., Zukowski, A., Harris, P. and Aragon, C. R. (2013). Statistical aﬀect de-tection in collaborative chat,

Proceedings of the 2013 conference on Computer supportedcooperative work , ACM, pp. 317–328.Buhrmester, M., Kwang, T. and Gosling, S. D. (2011). Amazon’s mechanical turk: Anew source of inexpensive, yet high-quality, data?,

Perspectives on psychological science (1): 3–5.Burke, M., Kraut, R. and Marlow, C. (2011). Social capital on facebook: Diﬀerentiatinguses and users, Proceedings of the SIGCHI conference on human factors in computingsystems , ACM, pp. 571–580.Bury, M. (1982). Chronic illness as biographical disruption,

Sociology of health & illness (2): 167–182.Cannon, W. B. (1927). The james-lange theory of emotions: A critical examination andan alternative theory, The American journal of psychology (1/4): 106–124.Carlson, J. R. and Zmud, R. W. (1999). Channel expansion theory and the experientialnature of media richness perceptions, Academy of Management Journal (2): 153–170.Carpenter, G. A. and Grossberg, S. (1988). The ART of Adaptive Pattern Recognitionby a Self-Organizing Neural Network, Computer (3): 77–88. EFERENCES

BMJ(Clinical research ed.) (7298): 1357–60.Cecchini, M., Aytug, H., Koehler, G. J. and Pathak, P. (2010). Making words work: Usingﬁnancial text as a predictor of ﬁnancial events,

Decision Support Systems (1): 164–175.Cella, D. F., Tulsky, D. S., Gray, G., Saraﬁan, B., Linn, E., Bonomi, A., Silberman, M.,Yellen, S. B., Winicour, P., Brannon, J. et al. (1993). The functional assessment ofcancer therapy scale: development and validation of the general measure, J Clin Oncol (3): 570–579.Chang, A. X. and Manning, C. D. (2012). SUTime: A library for recognizing and normal-izing time expressions., LREC , number iii, pp. 3735–3740.Charles, C., Gafni, A. and Whelan, T. (1999). Decision-making in the physician-patientencounter: Revisiting the shared treatment decision-making model,

Social Science andMedicine (5): 651–661.Chen, E. S., Hripcsak, G., Xu, H., Markatou, M. and Friedman, C. (2008). Automatedacquisition of disease–drug knowledge from biomedical and clinical documents: an initialstudy, Journal of the American Medical Informatics Association (1): 87–98.Chen, R. C., Basak, R., Meyer, A.-M., Kuo, T.-M., Carpenter, W. R., Agans, R. P.,Broughman, J. R., Reeve, B. B., Nielsen, M. E., Usinger, D. S., Spearman, K. C.,Walden, S., Kaleel, D., Anderson, M., St¨urmer, T. and Godley, P. A. (2017). AssociationBetween Choice of Radical Prostatectomy, External Beam Radiotherapy, Brachyther-apy, or Active Surveillance and Patient-Reported Quality of Life Among Men WithLocalized Prostate Cancer, JAMA (11): 1141–1150.Cheng, N., Chandramouli, R. and Subbalakshmi, K. (2011). Author gender identiﬁcationfrom text,

Digital Investigation (1): 78–88.Chikersal, P., Poria, S. and Cambria, E. (2015). Sentu: sentiment analysis of tweetsby combining a rule-based classiﬁer with supervised learning, Proceedings of the 9thInternational Workshop on Semantic Evaluation (SemEval 2015) , pp. 647–651.Chiu, C., Hsu, M. and Wang, E. (2006). Understanding knowledge sharing in virtualcommunities: An integration of social capital and social cognitive theories,

DecisionSupport Systems (3): 1872–1888.Cho, J. H., Liao, V. Q., Jiang, Y. and Schatz, B. R. (2013). Aggregating Personal HealthMessages for Scalable Comparative Eﬀectiveness Research, Proceedings of the Interna-tional Conference on Bioinformatics, Computational Biology and Biomedical Informat-ics - BCB’13 , number September 2013, pp. 907–916.Chou, H.-T. G. and Edge, N. (2012). they are happier and having better lives than i am:the impact of using facebook on perceptions of others’ lives,

Cyberpsychology, Behavior,and Social Networking (2): 117–121.80 REFERENCES

Clark, J. A., Wray, N. P. and Ashton, C. M. (2001). Living with treatment decisions:regrets and quality of life among men treated for metastatic prostate cancer,

Journal ofClinical Oncology (1): 72–80.Cole, J. et al. (2017). Surveying the digital future, UCLA Center for CommunicationPolicy, University of California–Los Angeles .Collins, R. L. (1996). For better or worse: The impact of upward social comparison onself-evaluations.,

Psychological bulletin (1): 51.Compas, B. E., Stoll, M. F., Thomsen, A. H., Oppedisano, G., EppingJordan, J. E. andKrag, D. N. (1999). Adjustment to breast cancer: agerelated diﬀerences in coping andemotional distress,

Breast Cancer Research and Treatment (3): 195–203.Cooperberg, M. R. and Carroll, P. R. (2015). Trends in management for patients withlocalized prostate cancer, 1990-2013, Jama (1): 80–82.Cortes, C. and Vapnik, V. (1995). Support-vector networks,

Machine learning (3): 273–297.Council, N. R. et al. (2005). From cancer patient to cancer survivor: lost in transition ,National Academies Press.Cozby, P. C. (1973). Self-disclosure: a literature review.,

Psychological bulletin (2): 73.Cutting, D. R., Karger, D. R., Pedersen, J. O. and Tukey, J. W. (1992). Scatter/gather: Acluster-based approach to browsing large document collections, Proceedings of the 15thannual international ACM SIGIR conference on Research and development in informa-tion retrieval , ACM, pp. 318–329.Da Silva, N. F., Hruschka, E. R. and Hruschka Jr, E. R. (2014). Tweet sentiment analysiswith classiﬁer ensembles,

Decision Support Systems : 170–179.Daft, R. L. and Lengel, R. (1986). Organizational information requirements, media rich-ness and structural design, Management Science (5): 554–571.Daft, R. L., Lengel, R. H. and Trevino, L. K. (1987). Message equivocality, media selection,and manager performance: Implications for information systems, MIS quarterly pp. 355–366.Dalva, M. and Katz, L. (1994). Rearrangements of synaptic connections in visual cortexrevealed by laser photostimulation,

Science (5169): 255–258.D’Amico, A., Whittington, R., Malkowicz, S. B., Schultz, D., Blank, K., Broderick, G. A.,Tomaszewski, J. E., Renshaw, A. A., Kaplan, I., Beard, C. J. and Wein, A. (1998).Biochemical outcome after radical prostatectomy, external beam radiation therapy, orinterstitial radiation therapy for clinically localized prostate cancer.,

Jama (11): 969–74.

EFERENCES

IEEE transactions on intelligent transportationsystems (4): 2269–2283.Darwin, C. (1859). On the origin of species by means of natural selection, Murray, London .Darwin, C. (1872). The expression of the emotions in man and animal.Das, S. R. and Chen, M. Y. (2007). Yahoo! for amazon: Sentiment extraction from smalltalk on the web,

Management science (9): 1375–1388.Davis, K. (2012). Friendship 2.0: Adolescents’ experiences of belonging and self-disclosureonline, Journal of adolescence (6): 1527–1536.De Haes, J., Van Knippenberg, F. and Neijt, J. (1990). Measuring psychological andphysical distress in cancer patients: structure and application of the rotterdam symptomchecklist, British journal of cancer (6): 1034.De Silva, D. (2010). A Cognitive Approach to Autonomous Incremental Learning , PhDthesis, Monash University.De Silva, D. and Alahakoon, D. (2010). Incremental knowledge acquisition and self learn-ing from text,

Proceedings of the International Joint Conference on Neural Networks(IJCNN) , pp. 1—-8.De Silva, D., Ranasinghe, W., Bandaragoda, T., Adikari, A., Mills, N., Iddamalgoda,L., Alahakoon, D., Lawrentschuk, N., Persad, R., Osipov, E. et al. (2018). Machinelearning to support social media empowered patients in cancer care and cancer treatmentdecisions,

PloS one (10): e0205855.Demner-Fushman, D., Chapman, W. W. and McDonald, C. J. (2009). What can naturallanguage processing do for clinical decision support?, Journal of biomedical informatics (5): 760–772.Denberg, T. D., Melhado, T. V. and Steiner, J. F. (2006). Patient treatment preferences inlocalized prostate carcinoma: The inﬂuence of emotion, misconception, and anecdote, Cancer (3): 620–630.Derks, D., Bos, A. E. and Von Grumbkow, J. (2007). Emoticons and social interaction onthe internet: the importance of social context,

Computers in human behavior (1): 842–849.Derlaga, V. J. and Berg, J. H. (1987). Self-disclosure: Theory, research, and therapy ,Springer Science & Business Media.Donovan, J. L., Hamdy, F. C., Lane, J. A., Mason, M., Metcalfe, C., Walsh, E., Blazeby,J. M., Peters, T. J., Holding, P., Bonnington, S., Lennon, T., Bradshaw, L., Cooper,D., Herbert, P., Howson, J., Jones, A., Lyons, N., Salter, E., Thompson, P., Tidball,82

REFERENCES

S., Blaikie, J., Gray, C., Bollina, P., Catto, J., Doble, A., Doherty, A., Gillatt, D.,Kockelbergh, R., Kynaston, H., Paul, A., Powell, P., Prescott, S., Rosario, D. J., Rowe,E., Davis, M., Turner, E. L., Martin, R. M. and Neal, D. E. (2016). Patient-ReportedOutcomes after Monitoring, Surgery, or Radiotherapy for Prostate Cancer,

New EnglandJournal of Medicine (15): 1425–1437.Drost, E. A. et al. (2011). Validity and reliability in social science research,

EducationResearch and perspectives (1): 105.Dunbar, R. I. (1998). The social brain hypothesis, Evolutionary Anthropology (5): 178–190.Dunbar, R. I. M. (1993). Coevolution of neocortical size, group size and language inhumans, Behavioral and Brain Sciences (04): 681.Dunbar, R. I. M. and Shultz, S. (2007). Evolution in the Social Brain, Science (5843): 1344–1347.Dunbar, R. I., Marriott, A. and Duncan, N. D. (1997). Human conversational behavior,

Human nature (3): 231–246.Eisenstein, J. (2013). What to do about bad language on the internet, NAACL-HLT 2013 ,pp. 359–369.Eisenstein, J., O’Connor, B., Smith, N. A. and Xing, E. P. (2014). Diﬀusion of lexicalchange in social media,

PLoS ONE (11).Ekman, P. (1992). An argument for basic emotions, Cognition & emotion (3-4): 169–200.Ekman, P. and Oster, H. (1979). Facial expressions of emotion, Annual review of psychol-ogy (1): 527–554.Ellison, N. B., Steinﬁeld, C. and Lampe, C. (2007). The beneﬁts of facebook friends: socialcapital and college students use of online social network sites, Journal of Computer-Mediated Communication (4): 1143–1168.Elwyn, G., Frosch, D., Thomson, R., Joseph-Williams, N., Lloyd, A., Kinnersley, P.,Cording, E., Tomson, D., Dodd, C., Rollnick, S. et al. (2012). Shared decision making:a model for clinical practice, Journal of general internal medicine (10): 1361–1367.Emanuel, E. J. and Emanuel, L. L. (1992). Four models of the physician-patient relation-ship, Jama (16): 2221–2226.Ert¨oz, L., Steinbach, M. and Kumar, V. (2003). Finding Clusters of Diﬀerent Sizes,Shapes, and Densities in Noisy, High Dimensional Data,

Proceedings of Second SIAMInternational Conference on Data Mining , pp. 47–59.Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996). A Density-Based Algorithm forDiscovering Clusters in Large Spatial Databases with Noise,

KDD , pp. 226–231.

EFERENCES

LREC , Vol. 6, Citeseer, pp. 417–422.Eysenbach, G. (2008). Medicine 2.0: Social networking, collaboration, participation, apo-mediation, and openness.

URL:

Farnstrom, F., Lewis, J. and Elkan, C. (2000). Scalability for clustering algorithms revis-ited,

SIGKDD explorations (1): 51–57.Fast, E., Chen, B. and Bernstein, M. (2016). Empath: Understanding Topic Signalsin Large-Scale Text, Proceedings of the 2016 CHI Conference on Human Factors inComputing Systems - CHI ’16 , ACM Press, New York, New York, USA, pp. 4647–4657.Fehr, E. and Fischbacher, U. (2004). Social norms and human cooperation,

Trends incognitive sciences (4): 185–190.Festinger, L. (1954). A theory of social comparison processes, Human relations (2): 117–140.Finkel, J. R., Grenager, T. and Manning, C. (2005). Incorporating non-local informationinto information extraction systems by gibbs sampling, Proceedings of the 43rd annualmeeting on association for computational linguistics , Association for ComputationalLinguistics, pp. 363–370.Fox, S. (2011a). Peer-to-peer Health Care — Pew Research Center.

URL:

Fox, S. (2011b).

The social life of health information, 2011 , Pew Internet & American LifeProject Washington, DC.Fox, S. (2011c).

The Social Life of Health Information , Pew Internet \ & American LifeProject Washington, DC.Fox, S. (2013). After dr google: peer-to-peer health care, Pediatrics (Supplement4): S224–S225.French, R. M. (1999). Catastrophic forgetting in connectionist networks,

Trends in cogni-tive sciences (4): 128–135.Friedenberg, J. and Silverman, G. (2011). Cognitive science: An introduction to the studyof mind , Sage.Friedman, C., Alderson, P. O., Austin, J. H., Cimino, J. J. and Johnson, S. B. (1994). Ageneral natural-language text processor for clinical radiology,

Journal of the AmericanMedical Informatics Association (2): 161–174.Fritzke, B. (1995a). Growing grida self-organizing network with constant neighborhoodrange and adaptation strength, Neural processing letters (5): 9–13.84 REFERENCES

Fritzke, B. (1995b). A growing neural gas network learns topologies,

Advances in neuralinformation processing systems , pp. 625–632.Frost, J. H. and Massagli, M. P. (2008). Social uses of personal health information withinPatientsLikeMe, an online patient community: What can happen when patients haveaccess to one another’s data,

Journal of Medical Internet Research (3): 15–3.Fung, G. P. C., Yu, J. X., Yu, P. S. and Lu, H. (2005). Parameter free bursty eventsdetection in text streams, Proceedings of the 31st international conference on Very largedata bases , VLDB Endowment, pp. 181–192.Furao, S. and Hasegawa, O. (2006). An incremental network for on-line unsupervisedclassiﬁcation and topology learning,

Neural Networks (1): 90–106.Gama, J., ˇZliobait˙e, I., Bifet, A., Pechenizkiy, M. and Bouchachia, A. (2014). A surveyon concept drift adaptation, ACM computing surveys (CSUR) (4): 44.Given, B. A., Given, C. W. and Kozachik, S. (2001). Family support in advanced cancer, CA: a cancer journal for clinicians (4): 213–231.Glaeser, E. L., Laibson, D., Sacerdote, B., The, S., Journal, E. and Nov, F. (2002). AnEconomic Approach to Social Capital, The Economic Journal (483): F437–F458.Gleason, D. F., Mellinger, G. T., Arduino, L. J., Bailar III, J. C., Becker III, L. E.,Berman III, H. I., Bischoﬀ, A. J., Byar, D. P., Blackard, C. E., Doe, R. P. et al. (1974).Prediction of prognosis for prostatic adenocarcinoma by combined histological gradingand clinical staging,

The Journal of urology (1): 58–64.Golder, S. A. and Macy, M. W. (2014). Digital footprints: Opportunities and challengesfor online social research,

Annual Review of Sociology : 129–152.Gooden, R. J. and Wineﬁeld, H. R. (2007). Breast and prostate cancer online discussionboards: A thematic analysis of gender diﬀerences and similarities, Journal of HealthPsychology (1): 103–114.Granovetter, M. (1973). The Strength of Weak Ties, The American Journal of Sociology (6): 1360–1380.Greene, K., Derlega, V. J. and Mathews, A. (2006). Self-disclosure in personal relation-ships, The Cambridge handbook of personal relationships pp. 409–427.Grishman, R. and Sundheim, B. (1996). Message understanding conference-6: A briefhistory,

COLING 1996 Volume 1: The 16th International Conference on ComputationalLinguistics , Vol. 1.Groth-Marnat, G. (2009).

Handbook of psychological assessment , John Wiley & Sons.Gupta, S., Maclean, D. L., Heer, J. and Manning, C. D. (2014). Induced lexico-syntacticpatterns improve information extraction from online medical forums,

Journal of theAmerican Medical Informatics Association (5): 902—-909. EFERENCES

Cancer (7): 1381–1390.Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H. (2009).The WEKA data mining software,

ACM SIGKDD Explorations Newsletter (1): 10.Hamdy, F. C., Donovan, J. L., Lane, J. A., Mason, M., Metcalfe, C., Holding, P., Davis,M., Peters, T. J., Turner, E. L., Martin, R. M., Oxley, J., Robinson, M., Staﬀurth,J., Walsh, E., Bollina, P., Catto, J., Doble, A., Doherty, A., Gillatt, D., Kockelbergh,R., Kynaston, H., Paul, A., Powell, P., Prescott, S., Rosario, D. J., Rowe, E. andNeal, D. E. (2016). 10-Year Outcomes after Monitoring, Surgery, or Radiotherapy forLocalized Prostate Cancer, New England Journal of Medicine pp. 1–10.Hamilton, W. D. (1964). The genetical evolution of social behaviour. ii,

Journal of theo-retical biology (1): 17–52.Hamilton, W. L., Clark, K., Leskovec, J. and Jurafsky, D. (2016). Inducing Domain-Speciﬁc Sentiment Lexicons from Unlabeled Corpora.Harvey, K. J., Brown, B., Crawford, P., Macfarlane, A. and McPherson, A. (2007). ’AmI normal?’ Teenagers, sexual health and the internet, Social Science and Medicine (4): 771–781.Haykin, S. (1994). Neural networks: a comprehensive foundation , Prentice Hall PTR.He, Q., Chang, K. and Lim, E.-P. (2007). Analyzing feature trajectories for event detection,

Proceedings of the 30th annual international ACM SIGIR conference on Research anddevelopment in information retrieval , ACM, pp. 207–214.Hebb, D. O. (1949).

The Organization of Behavior .Hebb, D. O. (2005).

The organization of behavior: A neuropsychological theory , PsychologyPress.Herdadelen, A. and Baroni, M. (2011). Stereotypical gender actions can be extractedfrom web text,

Journal of the American Society for Information Science and Technology (9): 1741–1749.

Herrmann, E., Call, J., Hernández-Lloreda, M. V., Hare, B. and Tomasello, M. (2007). Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis, science (5843): 1360–1366.

Hill, C. E. and Knox, S. (2001). Self-disclosure, Psychotherapy: Theory, Research, Practice, Training (4): 413.

Hobbs, J. R. and Pan, F. (2006). Time ontology in owl, W3C working draft: 133.

Hofmann, T. (1999). Probabilistic latent semantic indexing, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp. 50–57.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis, Machine learning (1-2): 177–196.

Hong, L. and Davison, B. (2010). Empirical study of topic modeling in twitter, Proceedings of the First Workshop on Social Media Analytics, pp. 80–88.

Høybye, M. T., Johansen, C. and Tjørnhøj-Thomsen, T. (2005). Online interaction. Effects of storytelling in an internet breast cancer support group, Psycho-Oncology (3): 211–220.

Hu, X., Bell, R. A., Kravitz, R. L. and Orrange, S. (2012). The Prepared Patient: Information Seeking of Online Support Group Members Before Their Medical Appointments, Journal of Health Communication (8): 960–978.

Hu, X. and Liu, H. (2012). Text analytics in social media, in C. C. Aggarwal and C. Zhai (eds), Mining Text Data, Springer US, Boston, MA, pp. 385–414.

Hu, X., Sun, N., Zhang, C. and Chua, T.-S. (2009). Exploiting internal and external semantics for the clustering of short texts using world knowledge, Proceedings of the 18th ACM conference on Information and knowledge management, ACM, pp. 919–928.

Huber, J., Ihrig, A., Peters, T., Huber, C. G., Kessler, A., Hadaschik, B., Pahernik, S. and Hohenfellner, M. (2011). Decision-making in localized prostate cancer: Lessons learned from an online support group, BJU International (10): 1570–1575.

Huber, J., Maatz, P., Muck, T., Keck, B., Friederich, H. C., Herzog, W. and Ihrig, A. (2017). The effect of an online support group on patients treatment decisions for localized prostate cancer: An online survey, Urologic Oncology: Seminars and Original Investigations (2): 37.e19–37.e28.

Hutto, C. J. and Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text, Eighth International AAAI Conference on Weblogs and . . . , San Francisco, CA, USA, pp. 216–225.

Ihrig, A., Keller, M., Hartmann, M., Debus, J., Pfitzenmaier, J., Hadaschik, B., Hohenfellner, M., Herzog, W. and Huber, J. (2011a). Treatment decision-making in localized prostate cancer: why patients chose either radical prostatectomy or external beam radiation therapy, BJU international (8): 1274–1278.

Ihrig, A., Keller, M., Hartmann, M., Debus, J., Pfitzenmaier, J., Hadaschik, B., Hohenfellner, M., Herzog, W. and Huber, J. (2011b). Treatment decision-making in localized prostate cancer: Why patients chose either radical prostatectomy or external beam radiation therapy, BJU International (8): 1274–1278.


Psychological review (1): 68.

Jansen, B. J., Zhang, M., Sobel, K. and Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth, Journal of the American Society for Information Science and Technology (11): 2169–2188.

Java, A., Song, X., Finin, T. and Tseng, B. (2007). Why We Twitter: Understanding Microblogging Usage and Communities, Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis - WebKDD/SNA-KDD ’07, ACM Press, New York, New York, USA, pp. 56–65.

Jayles, B., Kim, H.-r., Escobedo, R., Cezera, S., Blanchet, A., Kameda, T., Sire, C. and Theraulaz, G. (2017). How social information can improve estimation accuracy in human groups, Proceedings of the National Academy of Sciences (47): 201703695.

Jensen, P. B., Jensen, L. J. and Brunak, S. (2012). Mining electronic health records: towards better research applications and clinical care, Nature Reviews Genetics (6): 395–405.

Joachims, T. (1996). A probabilistic analysis of the rocchio algorithm with tfidf for text categorization, Technical report, Carnegie-mellon univ pittsburgh pa dept of computer science.

Jourard, S. M. and Lasakow, P. (1958). Some factors in self-disclosure, The Journal of Abnormal and Social Psychology (1): 91.

Jurafsky, D. (2000). Speech & language processing, Pearson Education India.

Kaas, J. H., Nelson, R. J., Sur, M., Lin, C.-S. and Merzenich, M. M. (1979). Multiple representations of the body within the primary somatosensory cortex of primates, Science (4392): 521–523.

Kahai, S. S. and Cooper, R. B. (2003). Exploring the core concepts of media richness theory: The impact of cue multiplicity and feedback immediacy on decision quality, Journal of Management Information Systems (1): 263–299.

Kahneman, D. and Riis, J. (2012). Living, and thinking about it: Two perspectives on life, The Science of Well-Being.

Kahneman, D. et al. (1999). Objective happiness, Well-being: The foundations of hedonic psychology: 25.

Kaplan, A. M. and Haenlein, M. (2010). Users of the world, unite! The challenges and opportunities of Social Media, Business Horizons (1): 59–68.

Kemp, S. (2019). 2019 Q4 Global Digital Statshot Report.

Kertesz, A. (1983). Localization in neuropsychology, Academic Pr.


Kietzmann, J. H., Hermkens, K., McCarthy, I. P. and Silvestre, B. S. (2011). Social media? Get serious! Understanding the functional building blocks of social media, Business Horizons (3): 241–251.

Kim, M. Y., Xu, Y., Zaiane, O. and Goebel, R. (2013). Patient information extraction in noisy tele-health texts, Proceedings - 2013 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2013, IEEE, pp. 326–329.

Kim, Y. and Schulz, R. (2008). Family caregivers’ strains: comparative analysis of cancer caregiving with dementia, diabetes, and frail elderly caregiving, Journal of Aging and Health (5): 483–503.

Kleinberg, J. (2003). Bursty and hierarchical structure in streams, Vol. 7, Springer, pp. 373–397.

Knox, S., Hess, S. A., Petersen, D. A. and Hill, C. E. (1997). A qualitative analysis of client perceptions of the effects of helpful therapist self-disclosure in long-term therapy, Journal of counseling psychology (3): 274.

Kock, N. (2004). The Psychobiological Model: Towards a New Theory of Computer-Mediated Communication Based on Darwinian Evolution, Organization Science (3): 327–348.

Kock, N. (2005). Media richness or media naturalness? The evolution of our biological communication apparatus and its influence on our behavior toward e-communication tools, IEEE Transactions on Professional Communication (2): 117–130.

Kohonen, T. (1990). The self-organizing map, Proceedings of the IEEE (9): 1464–1480.

Kohonen, T. (1997). Exploration of very large databases by self-organizing maps, Proceedings of International Conference on Neural Networks (ICNN’97), Vol. 1, IEEE, pp. PL1–PL6.

Kohonen, T. (2013). Essentials of the self-organizing map, Neural networks: 52–65.

Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V. and Saarela, A. (2000). Self organization of a massive document collection, IEEE transactions on neural networks (3): 574–585.

Kononenko, O., Baysal, O., Holmes, R. and Godfrey, M. W. (2014). Mining modern repositories with elasticsearch, Proceedings of the 11th Working Conference on Mining Software Repositories - MSR 2014, pp. 328–331.

Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukophadhyay, T. and Scherlis, W. (1998). Internet paradox: A social technology that reduces social involvement and psychological well-being?, American psychologist (9): 1017.

Kring, A. M. and Gordon, A. H. (1998). Sex differences in emotion: Expression, experience, and physiology, Journal of Personality and Social Psychology (3): 686–703.

Semantics of natural language, Springer, pp. 253–355.

Kummervold, P. E., Gammon, D., Bergvik, S., Johnsen, J. A. K., Hasvold, T. and Rosenvinge, J. H. (2002). Social support in a wired world: Use of online mental health forums in Norway, Nordic Journal of Psychiatry (1): 59–65.

Kurtz, L. F. (1990). The self-help movement: Review of the past decade of research, Social Work with Groups (3): 101–115.

Kvåle, R., Auvinen, A., Adami, H.-O., Klint, Å., Hernes, E., Møller, B., Pukkala, E., Storm, H. H., Tryggvadottir, L., Tretli, S. et al. (2007). Interpreting trends in prostate cancer incidence and mortality in the five nordic countries, Journal of the National Cancer Institute (24): 1881–1887.

Kwak, H., Lee, C., Park, H. and Moon, S. (2010). What is Twitter, a social network or a news media?, Proceedings of the 19th international conference on World wide web, pp. 591–600.

Lafferty, J. D., McCallum, A. and Pereira, F. C. N. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 282–289.

Lapidot-Lefler, N. and Barak, A. (2012). Effects of anonymity, invisibility, and lack of eye-contact on toxic online disinhibition, Computers in human behavior (2): 434–443.

Lappin, S. and Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution, Computational Linguistics (4): 535–561.

Larsen, B. and Aone, C. (1999). Fast and effective text mining using linear-time document clustering, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 16–22.

Lau, J. H., Collier, N. and Baldwin, T. (2012a). On-line trend analysis with topic models: Proceedings of COLING 2012, The COLING 2012 Organizing Committee, pp. 1519–1534.

Lau, J. H., Collier, N. and Baldwin, T. (2012b). On-line trend analysis with topic models: Proceedings of COLING 2012, pp. 1519–1534.

Lau, R. Y., Liao, S. S., Wong, K.-F. and Chiu, D. K. (2012). Web 2.0 environmental scanning and adaptive decision support for business mergers and acquisitions, MIS quarterly (4): 1239–1268.

Lazarus, R. S. and Lazarus, R. S. (1991). Emotion and adaptation, Oxford University Press on Demand.


Li, Y., Qian, M., Jin, D., Hui, P. and Vasilakos, A. V. (2015). Revealing the efficiency of information diffusion in online social networks of microblog, Information Sciences: 383–389.

Lin, N. (2002). Social capital: A theory of social structure and action, Vol. 19, Cambridge university press.

Lintz, K., Moynihan, C., Steginga, S., Norman, A., Eeles, R., Huddart, R., Dearnaley, D. and Watson, M. (2003). Prostate cancer patients’ support and psychological care needs: survey from a non-surgical oncology clinic, Psycho-Oncology: Journal of the Psychological, Social and Behavioral Dimensions of Cancer (8): 769–783.

Liu, B. and Zhang, L. (2012). A survey of opinion mining and sentiment analysis, Mining text data, Springer, pp. 415–463.

Liu, L., Wu, J., Li, P. and Li, Q. (2015). A social-media-based approach to predicting stock comovement, Expert Systems with Applications (8): 3893–3901.

Liu, Y., Xu, S., Yoon, H.-J. and Tourassi, G. (2014). Extracting patient demographics and personal medical information from online health forums, AMIA Annual Symposium Proceedings, AMIA Symposium, Vol. 2014, pp. 1825–1834.

Lloyd, S. (1982). Least squares quantization in pcm, IEEE transactions on information theory (2): 129–137.

Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I. and Boyd, D. (2011). The Arab Spring - The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions. URL: http://ijoc.org/index.php/ijoc/article/view/1246

Lovejoy, K., Waters, R. D. and Saxton, G. D. (2012). Engaging stakeholders through twitter: How nonprofit organizations are getting more out of 140 characters or less, Public Relations Review (2): 313–318.

Luxford, K., Safran, D. G. and Delbanco, T. (2011). Promoting patient-centered care: a qualitative study of facilitators and barriers in healthcare organizations with a reputation for improving the patient experience, International Journal for Quality in Health Care (5): 510–515.

Ma, Z., Pant, G. and Sheng, O. R. (2011). Mining competitor relationships from online news: A network-based approach, Electronic Commerce Research and Applications (4): 418–427.

Mahdisoltani, F., Biega, J. and Suchanek, F. M. (2013). YAGO3: A knowledge base from multilingual wikipedias.

Mahendiran, A., Wang, W., Arredondo, J., Lira, S., Huang, B., Getoor, L., Mares, D. and Ramakrishnan, N. (2014). Discovering Evolving Political Vocabulary in Social Media, Behavior, Economic and Social Computing (BESC), 2014 International Conference on, IEEE, pp. 1–7.

Manne, S., Ostroff, J., Winkel, G., Goldstein, L., Fox, K. and Grana, G. (2004). Posttraumatic growth after breast cancer: patient, partner, and couple perspectives, Psychosomatic medicine (17): 442–454.

Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA.

Manovich, L. (2011). Trending: The promises and the challenges of big social data, Debates in the digital humanities: 460–475.

Mao, Y., Wei, W., Wang, B. and Liu, B. (2012). Correlating s&p 500 stocks with twitter data, Proceedings of the first ACM international workshop on hot topics on interdisciplinary social networks research, ACM, pp. 69–72.

Marr, D. and Thach, W. T. (1991). A theory of cerebellar cortex, From the Retina to the Neocortex, Springer, pp. 11–50.

Maslow, A. (1962). Toward a psychology of being, Vol. 214, J. Wiley & Sons.

Maslow, A. H. (1943). A theory of human motivation, Psychological Review (4): 370–396.

Mathioudakis, M. and Koudas, N. (2010). Twittermonitor: trend detection over the twitter stream, SIGMOD ’10 Proceedings of the 2010 ACM SIGMOD International Conference on Management of data pp. 1155–1158.

Mazanderani, F., Locock, L. and Powell, J. (2012). Being differently the same: The mediation of identity tensions in the sharing of illness experiences, Social Science and Medicine (4): 546–553.

McClelland, D. C. (1967). Achieving society, Vol. 92051, Simon and Schuster.

McClelland, D. C. (1987). Human motivation, CUP Archive.

McFarland, L. A. and Ployhart, R. E. (2015). Social media: A contextual framework to guide research and practice, Journal of Applied Psychology (6): 1653–1677.

Medhat, W., Hassan, A. and Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey, Ain Shams engineering journal (4): 1093–1113.

Mehrotra, R., Sanner, S., Buntine, W. and Xie, L. (2013). Improving LDA topic models for microblogs via tweet pooling and automatic labeling, Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’13, ACM Press, New York, New York, USA, pp. 889–892.


Melville, P., Gryc, W. and Lawrence, R. D. (2009). Sentiment analysis of blogs by combining lexical knowledge with text classification, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 1275–1284.

Meshi, D., Tamir, D. I. and Heekeren, H. R. (2015). The Emerging Neuroscience of Social Media.

URL:

Meystre, S. M., Savova, G. K., Kipper-Schuler, K. C. and Hurdle, J. F. (2008). Extracting information from textual documents in the electronic health record: a review of recent research, Yearbook of medical informatics (01): 128–144.

Middleton, S. E., Middleton, L. and Modafferi, S. (2014). Real-time crisis mapping of natural disasters using social media, IEEE Intelligent Systems (2): 9–17.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality, in C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger (eds), Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pp. 3111–3119.

Mikolov, T., Yih, W.-t. and Zweig, G. (2013). Linguistic regularities in continuous space word representations, Proceedings of NAACL-HLT, number June, pp. 746–751.

Miller, G. A. (1995). WordNet: a lexical database for English, Communications of the ACM (11): 39–41.

Miller, K., Siegel, R. and Jemal, A. (2016). Cancer treatment & survivorship facts & figures 2016-2017, Atlanta, GA: American Cancer Society.

Mimno, D., Wallach, H. M., Talley, E., Leenders, M. and McCallum, A. (2011). Optimizing semantic coherence in topic models, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, number 2, Association for Computational Linguistics, pp. 262–272.

Miotto, R., Li, L., Kidd, B. A. and Dudley, J. T. (2016). Deep patient: an unsupervised representation to predict the future of patients from the electronic health records, Scientific reports: 26094.

Mishne, G., Glance, N. S. et al. (2006). Predicting movie sales from blogger sentiment, AAAI spring symposium: computational approaches to analyzing weblogs, pp. 155–158.

Mishra, M. V., Bennett, M., Vincent, A., Lee, O. T., Lallas, C. D., Trabulsi, E. J., Gomella, L. G., Dicker, A. P. and Showalter, T. N. (2013). Identifying barriers to patient acceptance of active surveillance: content analysis of online patient communications, PloS one (9): e68563.

Mitkov, R. (2014). Anaphora Resolution, Taylor and Francis.


Proceedings of the 29th International Conference on Machine Learning (ICML’12), pp. 1751–1758.

Mohammad, S. M. (2016). Sentiment Analysis: Detecting Valence, Emotions, and Other Affected States from Text, Emotion Measurement, pp. 201–237.

Mohammad, S. M. and Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon, Computational Intelligence (3): 436–465.

Mosher, C. E. and Danoff-Burg, S. (2005). A review of age differences in psychological adjustment to breast cancer, Journal of psychosocial oncology (2-3): 101–14.

Munezero, M. D., Montero, C. S., Sutinen, E. and Pajunen, J. (2014). Are they different? affect, feeling, emotion, sentiment, and opinion detection in text, IEEE transactions on affective computing (2): 101–111.

Murdoch, T. B. and Detsky, A. S. (2013). The inevitable application of big data to health care, JAMA (13): 1351–1352.

Murray, N. and Perronnin, F. (2014). Generalized Max Pooling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2473–2480.

Naaman, M., Boase, J. and Lai, C.-H. (2010). Is it really about me?: message content in social awareness streams, Proceedings of the 2010 ACM conference on Computer supported cooperative work, ACM, pp. 189–192.

Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification, Lingvisticae Investigationes (1): 3–26.

Nahapiet, J. and Ghoshal, S. (2000). Social capital, intellectual capital, and the organizational advantage, Knowledge and social capital, Elsevier, pp. 119–157.

Naik, A., Bogart, C. and Rose, C. (2017). Extracting personal medical events for user timeline construction using minimal supervision, BioNLP 2017 pp. 356–364.

Nanton, V., Appleton, R., Loew, J., Ahmed, N., Ahmedzai, S. and Dale, J. (2018). Men don’t talk about their health, but will they chat? the potential of online holistic needs assessment in prostate cancer, BJU international (4): 494–496.

Nass, C. and Brave, S. (2005). Wired for speech: How voice activates and advances the human-computer relationship, MIT press.

National Alliance for Caregiving (2016). Cancer caregiving in the us: An intense, episodic, and challenging care experience.

Nelson, S. J., Zeng, K., Kilbourne, J., Powell, T. and Moore, R. (2011). Normalized names for clinical drugs: RxNorm at 6 years, Journal of the American Medical Informatics Association: JAMIA (4): 441–8.

Nesi, J. and Prinstein, M. J. (2015). Using social media for social comparison and feedback-seeking: Gender and popularity moderate associations with depressive symptoms, Journal of abnormal child psychology (8): 1427–1438.

Neuman, W. L. (2013). Social research methods: Qualitative and quantitative approaches, Pearson education.

Nguyen, Q. (2012). Detecting Experience Revealing Sentences in Product Reviews, PhD thesis, University of Amsterdam. URL: https://esc.fnwi.uva.nl/thesis/centraal/files/f1006996408.pdf

Nie, N. H., Hillygus, D. S. and Erbring, L. (2002). Internet use, interpersonal relations, and sociability, The Internet in everyday life pp. 215–243.

Niolon, R. (n.d.). List of Feeling Words.

URL:

Northouse, L. L., Katapodi, M. C., Song, L., Zhang, L. and Mood, D. W. (2010). Interventions with family caregivers of cancer patients: meta-analysis of randomized trials, CA: a cancer journal for clinicians (5): 317–339.

Northouse, L. L., Mood, D. W., Montie, J. E., Sandler, H. M., Forman, J. D., Hussain, M., Pienta, K. J., Smith, D. C., Sanda, M. G. and Kershaw, T. (2007). Living with prostate cancer: Patients’ and spouses’ psychosocial status and quality of life, Journal of Clinical Oncology (27): 4171–4177.

O’Connor, B., Balasubramanyan, R., Routledge, B. R. and Smith, N. A. (2010). From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, ICWSM, pp. 122–129.

Oh, S. (2012). The characteristics and motivations of health answerers for sharing information, knowledge, and experiences in online environments, Journal of the American Society for Information Science and Technology (3): 543–557.

Oja, M., Kaski, S. and Kohonen, T. (2003). Bibliography of self-organizing map (som) papers: 1998-2001 addendum, Neural computing surveys (1): 1–156.

Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C. and Ounis, I. (2012). Bieber no more: First story detection using twitter and wikipedia, Sigir 2012 workshop on time-aware information access.

Pak, A. and Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Lrec pp. 1320–1326.

Paltoglou, G. (2016). Sentiment-based event detection in Twitter, Journal of the Association for Information Science and Technology (7): 1576–1587.

Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, pp. 344–353.

Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of the 43rd annual meeting on association for computational linguistics, Association for Computational Linguistics, pp. 115–124.

Pang, B., Lee, L. and Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP ’02, Vol. 10, Association for Computational Linguistics, Morristown, NJ, USA, pp. 79–86.

Pang, B., Lee, L. et al. (2008). Opinion mining and sentiment analysis, Foundations and Trends® in Information Retrieval (1–2): 1–135.

Park, K. C., Jeong, Y. and Myaeng, S.-H. H. (2010). Detecting Experiences from Weblogs, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (July): 1464–1472.

Paul, M. J. and Dredze, M. (2011). You are what you tweet: Analyzing twitter for public health, Fifth International AAAI Conference on Weblogs and Social Media.

Peng, X., Wang, L., Wang, X. and Qiao, Y. (2015). Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice, Computer Vision and Image Understanding: 109–125.

Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic inquiry and word count: Liwc 2001, Mahway: Lawrence Erlbaum Associates (2001): 2001.

Petrović, S., Osborne, M. and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and twitter, Proceedings of the 2012 conference of the north american chapter of the association for computational linguistics: Human language technologies, Association for Computational Linguistics, pp. 338–346.

Pew Research Center (2019). Social media fact sheet.

URL:

Phuvipadawat, S. and Murata, T. (2010). Breaking news detection and tracking in Twitter, Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2010 pp. 120–123.

Plutchik, R. (1980a). Emotion: A Psychoevolutionary Synthesis, Harpercollins College Division.

Plutchik, R. (1980b). Emotion: A psychoevolutionary synthesis, Harpercollins College Division.


Plutchik, R. (1991). The emotions, University Press of America.

Plutchik, R. (2001). The nature of emotions, American scientist (4): 344–350.

Polikar, R., Udpa, L., Udpa, S. S. and Honavar, V. (2001). Learn++: An incremental learning algorithm for supervised neural networks, IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews (4): 497–508.

Popescu, A.-M. and Pennacchiotti, M. (2010). Detecting controversial events from twitter, Proceedings of the 19th ACM international conference on Information and knowledge management, ACM, pp. 1873–1876.

Posner, J., Russell, J. A. and Peterson, B. S. (2005). The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology, Development and psychopathology (3): 715–734.

Preece, J. (1999). Empathic communities: Balancing emotional and factual communication, Interacting with computers (1): 63–77.

Pudrovska, T. (2010). What makes you stronger: age and cohort differences in personal growth after cancer, Journal of health and social behavior (3): 260–273.

Putnam, R. D. (2001). Bowling alone: The collapse and revival of American community, Simon and Schuster.

Putnam, R. D. and Feldstein, L. (2009). Better together: Restoring the American community, Simon and Schuster.

Qiu, L., Kan, M.-Y. and Chua, T.-S. (2004). A Public Reference Implementation of the RAP Anaphora Resolution Algorithm, Proceedings of the 4th Language Resources and Evaluation Conference (LREC 2004) (Lrec): 291–294.

Rajkomar, A., Oren, E., Chen, K., Dai, A. M., Hajaj, N., Hardt, M., Liu, P. J., Liu, X., Marcus, J., Sun, M. et al. (2018). Scalable and accurate deep learning with electronic health records, NPJ Digital Medicine (1): 18.

Ramakrishnan, N., Korkmaz, G., Kuhlman, C., Marathe, A., Zhao, L., Hua, T., Chen, F., Lu, C. T., Huang, B., Srinivasan, A., Trinh, K., Butler, P., Getoor, L., Katz, G., Doyle, A., Ackermann, C., Zavorin, I., Ford, J., Summers, K., Fayed, Y., Arredondo, J., Gupta, D., Muthiah, S., Mares, D., Self, N., Khandpur, R., Saraf, P., Wang, W., Cadena, J. and Vullikanti, A. (2014). ’Beating the news’ with EMBERS, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, number February, pp. 1799–1808.

Ranasinghe, W., de Silva, D., Bandaragoda, T., Adikari, A., Alahakoon, D., Persad, R., Lawrentschuk, N. and Bolton, D. (2018). Robotic-assisted vs. open radical prostatectomy: A machine learning framework for intelligent analysis of patient-reported outcomes from online cancer support groups, Urologic Oncology: Seminars and Original Investigations (12): 529–e1.

Mach bands: quantitative studies on neural networks, Vol. 2, Holden-Day, San Francisco London Amsterdam.

Reale, R. A. and Imig, T. J. (1980). Tonotopic organization in auditory cortex of the cat, Journal of Comparative Neurology (2): 265–291.

Řehůřek, R. and Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, pp. 45–50. http://is.muni.cz/publication/884893/en.

Resnick, P. et al. (2001). Beyond bowling together: Sociotechnical capital, HCI in the New Millennium: 247–272.

Riloff, E., Patwardhan, S. and Wiebe, J. (2006). Feature subsumption for opinion analysis, Proceedings of the 2006 conference on empirical methods in natural language processing, Association for Computational Linguistics, pp. 440–448.

Ritter, A., Clark, S., Etzioni, O. et al. (2011). Named entity recognition in tweets: an experimental study, Proceedings of the conference on empirical methods in natural language processing, Association for Computational Linguistics, pp. 1524–1534.

Robinson, K. M. (2001). Unsolicited narratives from the internet: A rich source of qualitative data, Qualitative Health Research (5): 706–714.

Robinson, M. D. and Clore, G. L. (2002). Belief and feeling: evidence for an accessibility model of emotional self-report, Psychological bulletin (6): 934.

Röder, M., Both, A. and Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures, Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM ’15, ACM Press, New York, New York, USA, pp. 399–408.

Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P. and Steyvers, M. (2010). Learning author-topic models from text corpora, ACM Transactions on Information Systems (TOIS) (1): 4.

Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P. (2004). The author-topic model for authors and documents, Proceedings of the 20th conference on Uncertainty in artificial intelligence, AUAI Press, pp. 487–494.

Rosenfeld, P., Culbertson, A. L. and Magnusson, P. (1992). Human needs: A literature review and cognitive life span model, Technical report, NAVY PERSONNEL RESEARCH AND DEVELOPMENT CENTER SAN DIEGO CA.

Roth, A. J., Kornblith, A. B., Batel-Copel, L., Peabody, E., Scher, H. I. and Holland, J. C. (1998). Rapid screening for psychologic distress in men with prostate carcinoma: a pilot study, Cancer: Interdisciplinary International Journal of the American Cancer Society (10): 1904–1908.

Rubin, Z. (1975). Disclosing oneself to a stranger: Reciprocity and its limits, Journal of Experimental Social Psychology (3): 233–260.

Rui, H., Liu, Y. and Whinston, A. (2013). Whose and what chatter matters? the effect of tweets on movie sales, Decision Support Systems (4): 863–870.

Russell, J. A. (1980). A circumplex model of affect, Journal of personality and social psychology (6): 1161.

Sadovykh, V., Sundaram, D. and Piramuthu, S. (2015). Do decision-making structure and sequence exist in health online social networks?, Decision Support Systems: 102–120.

Sakaki, T., Okazaki, M. and Matsuo, Y. (2010a). Earthquake shakes Twitter users: real-time event detection by social sensors, WWW ’10: Proceedings of the 19th international conference on World wide web p. 851.

Sakaki, T., Okazaki, M. and Matsuo, Y. (2010b). Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, Proceedings of the 19th international conference on World wide web (WWW ’10), pp. 851–860.

Salton, G. and Buckley, C. (1988a). Term-weighting approaches in automatic text retrieval, Information processing & management (5): 513–523.

Salton, G. and Buckley, C. (1988b). Term-weighting approaches in automatic text retrieval, Information Processing and Management (5): 513–523.

Salton, G., Wong, A. and Yang, C.-S. (1975). A vector space model for automatic indexing, Communications of the ACM (11): 613–620.

Sankaranarayanan, J., Teitler, B. E., Samet, H., Lieberman, M. D. and Sperling, J. (2009). TwitterStand: News in Tweets, Information Storage and Retrieval: 42–51.

Savova, G. K., Masanz, J. J., Ogren, P. V., Zheng, J., Sohn, S., Kipper-Schuler, K. C. and Chute, C. G. (2010). Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications, Journal of the American Medical Informatics Association (5): 507–513.

Schachter, S. and Singer, J. (1962). Cognitive, social, and physiological determinants of emotional state, Psychological review (5): 379.

Schacter, D. L., Norman, K. A. and Koutstaal, W. (1998). The cognitive neuroscience of constructive memory, Annual review of psychology (1): 289–318.

Schram, W. E. (1954). The process and effects of mass communication.

Schramm, W. (1954). How communication works, The process and effects of mass communication pp. 3–26.

Schütze, H. and Silverstein, C. (1997). Projections for efficient document clustering, ACM SIGIR Forum, Vol. 31, ACM, pp. 74–81.


Health (London, England : 1997) (3): 345–360.

Seale, C., Charteris-Black, J., MacFarlane, A. and McPherson, A. (2010). Interviews and Internet Forums: A Comparison of Two Sources of Qualitative Data, Qualitative Health Research (5): 595–606.

Shannon, C. E. (1948). A mathematical theory of communication, The Bell System Technical Journal (3): 379–423.

Shepherd, M. D., Schoenberg, M., Slavich, S., Wituk, S., Warren, M. and Meissen, G. (1999). Continuum of professional involvement in self-help groups, Journal of Community Psychology (1): 39–53.

Shouse, E. (2005). Feeling, emotion, affect, M/c journal (6): 26.

Si, J., Mukherjee, A., Liu, B., Li, Q., Li, H. and Deng, X. (2013). Exploiting topic based twitter sentiment for stock prediction, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 24–29.

Siegel, R. L., Miller, K. D. and Jemal, A. (2018). Cancer statistics, 2018, CA: A Cancer Journal for Clinicians (1): 7–30.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank, Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642.

Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval, Journal of documentation (1): 11–21.

Späth, H. (1980). Cluster analysis algorithms for data reduction and classification of objects, Horwood.

Sprecher, S. and Hendrick, S. S. (2004). Self-disclosure in intimate relationships: Associations with individual and relationship characteristics over time, Journal of Social and Clinical Psychology (6): 857–877.

Stacey, D., Cl, B., Mj, B., Nf, C., Kb, E., Lyddiatt, A., Légaré, F., Thomson, R., Stacey, D., Bennett, C. L., Barry, M. J., Col, N. F., Eden, K. B., Holmes-rovner, M. and Llewellyn, H. (2011). Decision aids for people facing health treatment or screening decisions, Cochrane Database of Systematic Reviews (10).

Standage, T. (2013). Writing on the wall: Social media-The first 2,000 years, Bloomsbury Publishing USA.

Steinbach, M., Karypis, G., Kumar, V. et al. (2000). A comparison of document clustering techniques, KDD workshop on text mining, Vol. 400, Boston, pp. 525–526.

REFERENCES

Stone, P. J., Dunphy, D. C. and Smith, M. S. (1966). The General Inquirer: A ComputerApproach to Content Analysis.Strapparava, C. and Mihalcea, R. (2007). Semeval-2007 task 14: Aﬀective text,

Proceedingsof the Fourth International Workshop on Semantic Evaluations (SemEval-2007) , pp. 70–74.Str¨otgen, J. and Gertz, M. (2010). Heideltime: High quality rule-based extraction andnormalization of temporal expressions,

Proceedings of the 5th International Workshopon . . . pp. 321–324.Suler, J. (2004). The online disinhibition eﬀect,

Cyberpsychology & behavior (3): 321–326.Suls, J., Martin, R. and Wheeler, L. (2002). Social comparison: Why, with whom, andwith what eﬀect?, Current directions in psychological science (5): 159–163.Sun, W., Rumshisky, A. and Uzuner, O. (2013). Evaluating temporal relations in clinicaltext: 2012 i2b2 Challenge. URL: https://academic.oup.com/jamia/article-lookup/doi/10.1136/amiajnl-2013-001628

Szreter, S. and Woolcock, M. (2004). Health by association? Social capital, social theory,and the political economy of public health.

URL: https://watermark.silverchair.com/dyh013.pdf ?token=AQECAHi208BE49Ooan9kkhW Ercy7Dm3ZL 9Cf3qfKAc485ysgAAAb0wggG5BgkqhkiG9w0BBwagggGqMIIBpgIBADCCAZ8GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMumxZ-DyVSAERJaTRAgEQgIIBcNMIDJ2AJmTXIRrRteDw-bTee6smqZHVePY7MnsXJa5F8esU

Tamir, D. I. and Mitchell, J. P. (2012). Disclosing information about the self is intrinsicallyrewarding,

Proceedings of the National Academy of Sciences (21): 8038–8043.Tan, S., Li, Y., Sun, H., Guan, Z., Yan, X., Bu, J., Chen, C. and He, X. (2014). Interpretingthe public sentiment variations on twitter,

IEEE transactions on knowledge and dataengineering (5): 1158–1170.Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L. and Su, Z. (2008). Arnetminer: extrac-tion and mining of academic social networks, Proceedings of the 14th ACM SIGKDDinternational conference on Knowledge discovery and data mining , ACM, pp. 990–998.Tanis, M. (2008). Health-Related On-Line Forums: What’s the Big Attraction?,

Journalof Health Communication (7): 698–714.Tao, C., Wei, W.-Q., Solbrig, H. R., Savova, G. and Chute, C. G. (2010a). A Semantic WebOntology for Temporal Relation Inferencing in Clinical Narratives., AMIA ... AnnualSymposium proceedings / AMIA Symposium. AMIA Symposium : 787–791.Tao, C., Wei, W.-Q., Solbrig, H. R., Savova, G. and Chute, C. G. (2010b). A Semantic WebOntology for Temporal Relation Inferencing in Clinical Narratives.,

AMIA ... AnnualSymposium proceedings / AMIA Symposium. AMIA Symposium : 787–791.

EFERENCES

Psychological review (4): 569.Thelwall, M., Buckley, K. and Paltoglou, G. (2011). Sentiment in Twitter Events, Journalof the American Society for Information Science and Technology (2): 406–418.Thelwall, M., Buckley, K. and Paltoglou, G. (2012). Sentiment Strength Detection for theSocial Web, Journal of the American Society for Information Science and Technology (1): 163–173.Thelwall, M., Buckley, K., Paltoglou, G., Cai, D. and Kappas, A. (2010). Sentimentstrength detection in short informal text, Journal of the American Society for Informa-tion Science and Technology (12): 2544–2558.Tidwell, L. C. and Walther, J. B. (2002). Computer-Mediated Communication eﬀectson self-disclosure, impressions, and interpersonal evaluations, Human CommunicationResearch (3): 317–348.Tomkins, S. S. (1984). Aﬀect theory, Approaches to emotion (163-195).Trevino, L. K., Lengel, R. H. and Daft, R. L. (1987). Media symbolism, media richness, andmedia choice in organizations: A symbolic interactionist perspective,

Communicationresearch (5): 553–574.Tumasjan, A., Sprenger, T., Sandner, P. and Welpe, I. (2010). Predicting elections withTwitter: What 140 characters reveal about political sentiment, Proceedings of the FourthInternational AAAI Conference on Weblogs and Social Media pp. 178–185.Valenzuela, S., Park, N. and Kee, K. F. (2009). Is there social capital in a social net-work site?: Facebook use and college students’ life satisfaction, trust, and participation,

Journal of computer-mediated communication (4): 875–901.Van Essen, D. C. (1985). Functional organization of primate visual cortex., The cerebralcortex .Veatch, R. M. (1972). Models for ethical medicine in a revolutionary age: What physician-patient roles foster the most ethical relationship?,

Hastings Center Report (3): 5–7.Vogel, E. A., Rose, J. P., Roberts, L. R. and Eckles, K. (2014). Social comparison, socialmedia, and self-esteem., Psychology of Popular Media Culture (4): 206.Von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striatecortex, Kybernetik (2): 85–100.Vydiswaran, V. G. V., Mei, Q., Hanauer, D. A. and Zheng, K. (2014). Mining Con-sumer Health Vocabulary from Community-Generated Text, AMIA Annual SymposiumProceedings pp. 1150–1159.Wallsten, S. (2013). What are we not doing when we’re online,

Technical report , NationalBureau of Economic Research.02

REFERENCES

Walther, J. B. (1992). Interpersonal Eﬀects in Computer-Mediated Interaction: A Rela-tional Perspective,

Communication Research (1): 52–90.Walther, J. B. (2007). Selective self-presentation in computer-mediated communication:Hyperpersonal dimensions of technology, language, and cognition, Computers in HumanBehavior (5): 2538–2557.Wang, A. H. (2010). Don’t follow me: Spam detection in Twitter.Wang, X., Hripcsak, G., Markatou, M. and Friedman, C. (2009). Active computerizedpharmacovigilance using natural language processing, statistics, and electronic healthrecords: a feasibility study, Journal of the American Medical Informatics Association (3): 328–337.Wang, X. and McCallum, A. (2006). Topics over time: a non-markov continuous-timemodel of topical trends, Proceedings of the 12th ACM SIGKDD international conferenceon Knowledge discovery and data mining , ACM, pp. 424–433.Wang, Y., Hunt, K., Nazareth, I., Freemantle, N. and Petersen, I. (2013). Do men consultless than women? an analysis of routinely collected uk general practice data,

BMJ open (8): e003320.Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function, Journalof the American statistical association (301): 236–244.Watson, D. and Tellegen, A. (1985). Toward a consensual structure of mood., Psychologicalbulletin (2): 219.Weber, B. A., Roberts, B. L. and McDougall Jr, G. J. (2000). Exploring the eﬃcacy ofsupport groups for men with prostate cancer, Geriatric Nursing (5): 250–253.Weis, J. (2003). Support groups for cancer patients. URL: https://link.springer.com/content/pdf/10.1007%2Fs00520-003-0536-7.pdf

Wellman, B., Haase, A. Q., Witte, J. and Hampton, K. (2001). Does the internet increase,decrease, or supplement social capital? social networks, participation, and communitycommitment,

American behavioral scientist (3): 436–455.Wen, M. and Rose, C. P. (2012). Understanding participant behavior trajectories inonline health support groups using automatic extraction methods, Proceedings of the17th ACM international conference on Supporting group work - GROUP ’12 , p. 179.Weng, J., Yao, Y., Leonardi, E., Lee, F. and Lee, B.-s. (2011). Event Detection in Twitter,

ICWSM (98): 401–408.Wiessner, P. W. (2014). Embers of society: Firelight talk among the ju/hoansi bushmen, Proceedings of the National Academy of Sciences (39): 14027–14035.Williams, G. (1984). The genesis of chronic illness: narrative reconstruction,

Sociology ofHealth & Illness (2): 175–200. EFERENCES

Psychologicalbulletin (2): 245.Willshaw, D. J. and Von Der Malsburg, C. (1976). How patterned neural connections canbe set up by self-organization, Proceedings of the Royal Society of London. Series B.Biological Sciences (1117): 431–445.Winzelberg, A. J., Classen, C., Alpers, G. W., Roberts, H., Koopman, C., Adams, R. E.,Ernst, H., Dev, P. and Taylor, C. B. (2003). Evaluation of an internet support groupfor women with primary breast cancer,

Cancer (5): 1164–1173.Woolcock, M. and Narayan, D. (2000). Social capital: Implications for development theory,research, and policy, The world bank research observer (2): 225–249.Woolcock, M. et al. (2001). The place of social capital in understanding social and economicoutcomes, Canadian journal of policy research (1): 11–17.Wright, K. B. (2006). Researching Internet-Based Populations: Advantages and Disadvan-tages of Online Survey Research, Online Questionnaire Authoring Software Packages,and Web Survey Services, Journal of Computer-Mediated Communication (3): 00–00.Xu, J., Dailey, R. K., Eggly, S., Neale, A. V. and Schwartz, K. L. (2011). Mens perspectiveson selecting their prostate cancer treatment, Journal of the National Medical Association (6): 468.Xu, Z., Ru, L., Xiang, L. and Yang, Q. (2011). Discovering User Interest on Twitter witha Modiﬁed Author-Topic Model, , Vol. 1, IEEE, pp. 422–429.Yan, X., Guo, J., Lan, Y. and Cheng, X. (2013). A biterm topic model for short texts,

Proceedings of the 22nd international conference on World Wide Web , ACM, pp. 1445–1456.Yan Zhang (2016). Understanding the Sustained Use of Online Health Communities Froma Self-Determination Perspective,

Journal of the Association for Information Scienceand Technology (12): 2842–2857.Yang, Y., Pierce, T. and Carbonell, J. (1998). A study of retrospective and on-lineevent detection, Proceedings of the 21st annual international ACM SIGIR conferenceon Research and development in information retrieval , ACM, pp. 28–36.Zamir, O., Etzioni, O., Madani, O. and Karp, R. M. (1997). Fast and intuitive clusteringof web documents.,

KDD , Vol. 97, pp. 287–290.Zappavigna, M. (2015). Searchable talk: the linguistic functions of hashtags,

Social Semi-otics (3): 274–291.Zeliadt, S. B., Ramsey, S. D., Penson, D. F., Hall, I. J., Ekwueme, D. U., Stroud, L. andLee, J. W. (2006). Why do men choose one treatment over another? A review of patient04 REFERENCES decision making for localized prostate cancer.

URL: http://doi.wiley.com/10.1002/cncr.21822

Zhang, S., Bantum, E. O., Owen, J., Bakken, S. and Elhadad, N. (2017). Online cancercommunities as informatics intervention for social support: Conceptualization, char-acterization, and impact,

Journal of the American Medical Informatics Association (2): 451–459.Zhang, S., Kang, T., Qiu, L., Zhang, W., Yu, Y. and Elhadad, N. (2017). CataloguingTreatments Discussed and Used in Online Autism Communities, Proceedings of the 26thInternational Conference on World Wide Web - WWW ’17 , Vol. 2017, ACM Press, NewYork, New York, USA, pp. 123–131.Zhang, T., Cho, J. H. D. and Zhai, C. (2015). Understanding User Intents in OnlineHealth Forums,

IEEE Journal of Biomedical and Health Informatics (4): 1392–1398.Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F. (2008). A comparative evaluationof term recognition algorithms, Proceedings of the Sixth International Conference onLanguage Resources and Evaluation (LREC 2008) , pp. 2108–2113.Zhao, D. and Rosson, M. B. (2009). How and Why People Twitter: The Role that Micro-blogging Plays in Informal Communication at Work,

Proceedings of the ACM 2009international conference on Supporting group work , ACM Press, New York, New York,USA, pp. 243—-252.Zhou, X. and Chen, L. (2014). Event detection over twitter social media streams,

VLDBJournal (3): 381–400.Zhu, H., Ni, Y., Peng, C., Qiu, Z. and Cao, F. (2012). Automatic extracting of patient-related attributes: Disease, age, gender and race, Studies in Health Technology andInformatics , Vol. 180, pp. 589–593.Ziebland, S. and Wyke, S. (2012). Health and Illness in a Connected World: Howmight sharing experienes on the Internet aﬀect people’s health,

The Millbank Quar-terly (2): 219–49.Zigmond, A. S. and Snaith, R. P. (1983). The hospital anxiety and depression scale, Actapsychiatrica scandinavica67

Submitted on 5 Sep 2020

arXiv.org Original Source
