"Is depression related to cannabis?": A knowledge-infused model for Entity and Relation Extraction with Limited Supervision
Kaushik Roy, Usha Lokala, Vedant Khandelwal and Amit Sheth (Artificial Intelligence Institute, University of South Carolina, Columbia)
Abstract
With strong marketing advocacy of the benefits of cannabis use for improved mental health, cannabis legalization is a priority among legislators. However, preliminary scientific research does not conclusively associate cannabis with improved mental health. In this study, we explore the relationship between depression and consumption of cannabis in a targeted social media corpus involving personal use of cannabis with the intent to derive its potential mental health benefit. We use tweets that contain an association among three categories annotated by domain experts: Reason, Effect, and Addiction. State-of-the-art Natural Language Processing techniques fall short in extracting these relationships between cannabis phrases and depression indicators. We seek to address this limitation by using domain knowledge, specifically the Drug Abuse Ontology for addiction, augmented with Diagnostic and Statistical Manual of Mental Disorders lexicons for mental health. Because of the lack of annotations due to the limited availability of the domain experts' time, we use supervised contrastive learning in conjunction with GPT-3, trained on a vast corpus, to achieve improved performance even with limited supervision. Experimental results show that our method extracts cannabis-depression relationships significantly better than the state-of-the-art relation extractor. High-quality annotations can be provided using a nearest-neighbor approach on the learned representations, which the scientific community can use to better understand the association between cannabis and depression.
Keywords
Mental Health, Depression, Cannabis Crisis, Legalization, knowledge infusion, Relation Extraction
1. Introduction
Many states in the US have legalized the medical use of cannabis for therapeutic relief in those affected by mental illness [1, 2, 3]. The use of cannabis for depression, however, is not yet authorized [4]. Depression is ubiquitous among the US population, and some even use cannabis to self-treat their depression [5, 6]. Therefore, scientific research that can help understand the association between depression and cannabis consumption is of the utmost need, given the fast-increasing cases of depression in the US and consequent cannabis consumption [7].

Twitter can provide crucial contextual knowledge in understanding the usage patterns of cannabis consumption concerning depression [8, 9]. Conversations on social media such as Twitter provide unique insights, as tweets are often unfiltered and honest in disclosing consumption patterns due to the anonymity and private space afforded to users. For now, even with several platforms available to aid the analysis of depression concerning cannabis consumption, this understanding remains
In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021), Stanford University, Palo Alto, California, USA, March 22-24, 2021. [email protected] (K. Roy); [email protected] (U. Lokala); [email protected] (V. Khandelwal); [email protected] (A. Sheth). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
ambiguous [10, 11, 12]. Still, encouragingly, there is support in the literature to show that cannabis use can be potentially associated with depressive patterns. Hence, we aim to extract this association as one of three relationships annotated by domain experts: Reason, Effect, and Addiction (Table 1).

Relationship | Tweet
Reason | "Not saying im cured, but i feel less depressed lately, could be my ..."

Table 1: Cannabis-depression tweets and their relationships. Here the text in blue and red represents the cannabis and depression entities, respectively.
This paper studies mental health and its relationship with cannabis usage, which is a significant research problem. The study will help address several health challenges, such as the investigation of cannabis "for the treatment of depression," "as a reason for depression," or "as an addictive phenomenon that accompanies depression."
Extracting relationships between any concepts/slang terms/synonyms/street names related to 'cannabis' and 'depression' from text is a tough problem. This task is challenging for traditional Natural Language Processing (NLP) because of the immense variability with which tweets mentioning depression and cannabis are described. Here, we make use of the Drug Abuse Ontology (DAO) [13, 14], a domain-specific hierarchical framework containing 315 entities (814 instances) and 31 relations defining concepts about drug abuse. The ontology has been used in prior work to analyze the effects of cannabis [15, 16, 17]. The DAO was augmented using the Patient Health Questionnaire 9th edition (PHQ-9), Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), International Classification of Diseases 10th edition (ICD-10), Medical Subject Headings (MeSH) terms, and Diagnostic and Statistical Manual of Mental Disorders (DSM-5) categories to infuse knowledge of mental health-related phrases in association with drugs such as cannabis [18, 19]. Some of the triples extracted from the DAO are as follows:
(1) SCRA → subClassOf → Cannabis;
(2) Cannabis_Resin → has_slang_term → Kiff;
(3) Marijuana_Flower → type → Natural_Cannabis.

For entity and relationship extraction (RE), previous approaches generally adopt deep learning models [20, 21]. However, these models require a high volume of annotated data and are hence unsuitable for our setting. Several pre-trained language representation models have recently advanced the state of the art in various NLP tasks across various benchmarks [22, 23]; GPT-3 [24] and BERT [25] are such language models [26]. Language models benefit from the abundant knowledge they are trained on and, with minimal fine-tuning, can tremendously help downstream tasks under limited supervision. Hence, we exploit the representation from GPT-3 and employ supervised contrastive learning to deal with limited supervision in terms of quality annotations for the data. We propose a knowledge-infused deep learning framework based on GPT-3 and the domain-specific DAO ontology to extract entities and their relationships. We then further enhance the utilization of limited supervision through supervised contrastive learning. It is well known that deep learning requires many examples to generalize. Metric learning frameworks such as Siamese networks have previously shown how limited supervision can be leveraged through contrastive learning with a triplet loss [27]. Combinatorially, this method increases the number of training examples from order n to nC3 (the number of 3-element subsets of n examples), which helps with generalizability. The technique can also exploit the learned metric-space representations to provide high-quality annotations over unlabeled data. Therefore, the combination of knowledge infusion [28, 29, 30], pre-trained GPT-3, and supervised contrastive learning presents a very effective way to handle limited supervision.
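The combinatorial gain from triplet construction can be sketched as follows. This is an illustrative enumeration with toy labels, not the authors' sampling code; the exact count depends on how anchors, positives, and negatives are drawn.

```python
from itertools import combinations

def make_triplets(examples):
    """Build (anchor, positive, negative) triplets from labeled examples.

    examples: list of (text, label) pairs. Each triplet pairs an anchor with
    a same-label positive and a different-label negative, multiplying the
    effective number of training instances.
    """
    triplets = []
    for (a_text, a_lab), (p_text, _) in (
        pair for pair in combinations(examples, 2) if pair[0][1] == pair[1][1]
    ):
        for n_text, n_lab in examples:
            if n_lab != a_lab:
                triplets.append((a_text, p_text, n_text))
    return triplets

# Toy labeled tweets (hypothetical placeholders):
data = [("tweet1", "Reason"), ("tweet2", "Reason"),
        ("tweet3", "Effect"), ("tweet4", "Addiction")]
print(len(make_triplets(data)))  # → 2
```

Even with only one same-label pair, two negatives already yield two triplets; with thousands of labeled tweets the number of usable triplets grows far faster than the number of raw annotations.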
The proposed model has two modules: (1) Phrase Extraction and Matching Module, which utilizes the DAO ontology augmented with the PHQ-9, SNOMED-CT, ICD-10, MeSH terms, and DSM-5 lexicons to map the input word sequence to the entities mentioned in the ontology by computing the cosine similarity between the entity names (obtained from the DAO) and every n-gram token of the input sentence. This step identifies the depression and cannabis phrases in the sentence. Distributed representations of the depression and cannabis phrases obtained from GPT-3 are used to learn contextualized syntactic and semantic information that complement each other. (2) Supervised Contrastive Learning Module, which uses a triplet loss to learn a representation space for the cannabis and depression phrases through supervised contrastive learning. Phrases with the correct relationship are trained to be closer in the learned representation space, and phrases with incorrect relationships to be far apart.

Contributions:
(1) In collaboration with domain experts who provide limited supervision on real-world data extracted from Twitter, we learn a representation space to label the relationships between cannabis and depression entities and generate a cannabis-depression relationship dataset.
(2) We propose a knowledge-infused neural model to extract cannabis/depression entities and predict the relationship between those entities. We exploit the domain-specific DAO ontology, which provides better coverage in entity extraction.
(3) We use GPT-3 representations in a supervised contrastive learning approach to learn a representation space for the different cannabis and depression phrase relationships under limited supervision.
(4) We evaluate our proposed model on a real-world Twitter dataset. The experimental results show that our model significantly outperforms state-of-the-art relation extraction techniques by more than 11 percentage points on the F1 score.

Concretely, our pipeline includes:
(1) Semantic filtering: we use the DAO and DSM-5 to extract contextual phrases expressed implicitly in the user tweet mentioning depression and cannabis. This is required for noise-free domain adaptation of the model, as evident in our results.
(2) A weak supervision pipeline to label the remaining 7,000 tweets with the three relationships (Reason, Effect, Addiction).
(3) A domain-specific distance metric between the phrases, leveraging pre-trained GPT-3 embeddings of the extracted phrases and their relationship, in a supervised contrastive loss training setup.
2. Related Works
We discuss recent existing work based on the techniques and their application to health. Standard DL approaches based on Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have been proposed for RE [31, 32, 33]. Hybrid models that combine CNN and LSTM have also been proposed [34]. More recently, Graph Convolutional Networks (GCNs) have been utilized to leverage additional structural information from dependency trees for the RE task [35]. [36] guide the structure of the GCN by modifying the attention mechanism. Adversarial training has also been explored to jointly extract entities and their relationships [37]. Due to the variability in the specification of entities and relationships in natural language, [38, 39] have exploited entity position information in their DL frameworks. Models based on the popular BERT language model, BioBERT [40], SciBERT [41], and XLNet [42] have demonstrated state-of-the-art RE performance. Task-specific adaptations of BERT have been used to enhance RE in Shi and Lin [43] and Xue et al. [44]. Wang et al. [45] augment the BERT model with a structured prediction layer to predict multiple relations in one pass. In all the approaches discussed so far, knowledge has not been a component of the architecture [46].

Chan and Roth [47] show the importance of using knowledge to improve RE on sentences, reporting an improvement of 3.9% F1 score by incorporating knowledge into an Integer Linear Programming (ILP) formulation. Wen et al. [48] use the attention weights between entities to guide traversal paths in a knowledge graph to assist RE. Distiawan et al. [49] use knowledge graph TransE embeddings in their approach to improve performance. Other prominent work utilizing knowledge graphs for relation extraction includes [50, 51, 52].

These methods, however, do not consider a setting in which the availability of high-quality annotated data is scarce.
We use knowledge to extract relevant parts of the sentence [53, 54] and pre-trained GPT-3 representations trained over a massive corpus, in conjunction with supervised contrastive learning, to achieve a high degree of sample efficiency with limited supervision.
3. Our Approach
The dataset used in our study consists of 11,000 tweets collected using the Twitris API from Jan 2017 to Feb 2019, determined by three substance use epidemiologists to be a period of heightened cannabis consumption. The experts annotated 3,000 tweets (due to time constraints) with one of three relationships that they considered essential to identify: "Reason," "Effect," and "Addiction." The annotation had a Cohen's kappa agreement of 0.8. Examples of each of these relationships are shown in Table 1.
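A Cohen's kappa of 0.8 indicates strong inter-annotator agreement. For reference, the statistic can be computed from two annotators' labels as below; this is a self-contained sketch with made-up toy annotations, not the study's actual data:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: chance-corrected agreement between two annotators
    labeling the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected agreement under chance, from each annotator's label marginals.
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in set(ann1) | set(ann2))
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over six tweets:
a1 = ["Reason", "Reason", "Effect", "Addiction", "Effect", "Reason"]
a2 = ["Reason", "Effect", "Effect", "Addiction", "Effect", "Reason"]
print(round(cohens_kappa(a1, a2), 2))  # → 0.74
```

Perfect agreement gives kappa 1.0, and agreement no better than chance gives 0.0, which is why 0.8 is conventionally read as strong agreement.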
Under this method, we exploit the domain-specific knowledge base to replace phrases in the social media text with knowledge base concepts. Phrase Extraction and Matching is performed in several steps:

• Depression and Cannabis Lexicon: We exploit the state-of-the-art Drug Abuse Ontology (DAO) to extract various medical entities and slang terms related to cannabis and depression. We further expand the entities using terms extracted from the PHQ-9, SNOMED-CT, ICD-10, MeSH terms, and the Diagnostic and Statistical Manual of Mental Disorders (DSM-5).

• Extracting N-Grams from Tweets: N-grams are extracted from the tweets to better capture the context of each term by taking into consideration the words around it. For example, from the tweet "whole world emotionally depressed everybody needs smoke blunt to relax We living nigga", we obtain n-grams such as "whole world", "emotionally depressed", "depressed everybody", "need smoke", "need smoke blunt", "living nigga".

• GPT-3: Generative Pre-Trained Transformer 3 is an autoregressive language model. We use GPT-3 to generate the embeddings of the extracted n-grams and of the cannabis and depression phrases because of the vast dataset it is trained on, which provides phrase embeddings based on its global understanding.

• Cosine Similarity: A measure of similarity between two non-zero vectors in the same vector space. This metric is often used to obtain the semantic similarity between two phrase embeddings in the same vector space.

• Phrase Matching: We use the cosine similarity metric to measure the semantic similarity between phrases, with a threshold of 0.75. Once a phrase pair's similarity meets or exceeds the threshold, the original n-gram in the tweet text is replaced by the matched cannabis/depression phrase. The above steps are repeated for all tweets. For example, we obtain "emotionally depressed" as the depression phrase, whereas "need smoke blunt" is found to be the cannabis phrase.

Figure 1: The Phrase Extraction and Mapping pipeline, where the DAO is used to extract GPT-3 representations of the related phrases.
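The matching steps above can be sketched end-to-end. Since GPT-3 embeddings cannot be reproduced here, a toy character-trigram count vector stands in for the phrase embedding; the `embed` function, lexicon, and tweet are illustrative assumptions, while the 0.75 threshold and n-gram-vs-lexicon cosine matching follow the description in the text:

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy stand-in for a GPT-3 phrase embedding: character-trigram counts."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ngrams(tokens, n_max=3):
    """All contiguous n-grams of length 1..n_max."""
    return [" ".join(tokens[i:j]) for i in range(len(tokens))
            for j in range(i + 1, min(i + n_max, len(tokens)) + 1)]

def match_phrases(tweet, lexicon, threshold=0.75):
    """Return lexicon phrases whose embedding is cosine-similar (>= threshold)
    to some n-gram of the tweet, as in the Phrase Matching step."""
    lex_vecs = {p: embed(p) for p in lexicon}
    hits = set()
    for g in ngrams(tweet.split()):
        gv = embed(g)
        for p, pv in lex_vecs.items():
            if cosine(gv, pv) >= threshold:
                hits.add(p)
    return hits

lexicon = {"emotionally depressed", "smoke blunt"}  # toy DAO/DSM-5 entries
tweet = "whole world emotionally depressed everybody need smoke blunt relax"
print(match_phrases(tweet, lexicon))
```

With real GPT-3 vectors the same loop applies unchanged; only `embed` would be swapped for calls to the language model.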
The proposed model architecture is divided into two sub-parts: (A) a Transformer block and (B) a loss function. The input tweet is first sent through a block of 12 Transformers, and the resulting embedding is passed to a triplet loss function. The label associated with the input tweet is used to extract a positive sample (a tweet with the same label) and a negative sample (a tweet with a different label). These positive and negative samples are sent through the same block of 12 Transformers to obtain their embeddings, which are also passed to the triplet loss function. Under this loss, the function drives the cosine similarity between the tweet and its negative sample toward 0, while driving the cosine similarity between the tweet and its positive sample toward 1.
CoSim(A, N) − CoSim(A, P) + α ≤ 0  (1)

where A is the anchor (the initial data point), P is a positive data point of the same class as the anchor, and N is a negative data point of a different class from the anchor. CoSim(X, Y) is the cosine similarity between the two data points, and α is the margin. For the example shown in Section 3.2, if we consider the anchor sample to be "whole world emotionally depressed everybody needs smoke blunt relax We living nigga", the corresponding positive sample is "Depressionarmy weed amp sleep I awake I depressed" and the negative sample is "This weird rant like weed makes anxiety depression worse Im soooo sick ppl like jus".

Figure 2: Supervised Contrastive Learning pipeline using triplet loss.
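The constraint above becomes trainable as a hinge loss that penalizes violations. The sketch below assumes a margin of 0.4 and plain Python lists for embeddings; neither the margin value nor the embedding dimension is specified by the paper:

```python
from math import sqrt

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def triplet_loss(anchor, positive, negative, margin=0.4):
    """Hinge form of the cosine triplet objective: penalize whenever
    CoSim(A, N) - CoSim(A, P) + margin > 0, i.e. whenever the negative is
    not at least `margin` less similar to the anchor than the positive."""
    return max(0.0, cos_sim(anchor, negative) - cos_sim(anchor, positive) + margin)

a = [1.0, 0.0]   # anchor embedding (toy 2-D vectors)
p = [0.9, 0.1]   # same-class phrase, nearly aligned with the anchor
n = [0.0, 1.0]   # different-class phrase, orthogonal to the anchor
print(triplet_loss(a, p, n))  # → 0.0 (constraint satisfied, no penalty)
```

When the positive and negative are swapped, the constraint is violated and the loss turns positive, which is exactly the gradient signal that pulls same-relationship phrases together in the learned space.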
4. Experimental Setup and Results
In this section, we discuss the results on the cannabis-depression RE task and provide a technical interpretation of the results.
The dataset utilized in our experiment is described in Section 3. We use Precision, Recall, and F1 score as metrics to compare our proposed methodology with the state-of-the-art relation extractors. As baselines we use BERT, BioBERT, and the following variations:

• BERT_PE: We extend BERT with position embeddings, where the position data (the relative distance of each word from the cannabis and depression entities) is obtained via the domain-specific knowledge resource.

• BERT_PE+PA: This adds a position-aware attention mechanism on top of BERT_PE.

Table 2 summarizes the performance of our model against the baselines. Our proposed methodology outperforms all the state-of-the-art models on the given metrics. Compared to the worst-performing baseline, BioBERT, our model achieves an absolute F1 improvement of 12.32%; compared to the best-performing baseline, BERT_PE+PA, it gives an improvement of 11.28% in F1 score. This comparison shows that contrastive learning combined with knowledge-infused learning performs better at relation classification.
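The relative-distance feature behind BERT_PE can be illustrated as follows; the example sentence, entity indices, and function name are hypothetical, since the paper does not give the baseline's exact featurization:

```python
def relative_positions(tokens, entity_idx):
    """Relative distance of each token from a given entity position, the raw
    signal from which position embeddings are built in a BERT_PE-style model."""
    return [i - entity_idx for i in range(len(tokens))]

tokens = "weed makes my depression worse".split()
cannabis_idx, depression_idx = 0, 3   # indices located via the DAO lexicon (toy example)
print(relative_positions(tokens, cannabis_idx))    # → [0, 1, 2, 3, 4]
print(relative_positions(tokens, depression_idx))  # → [-3, -2, -1, 0, 1]
```

Each token thus carries two extra integers (distance to the cannabis entity and to the depression entity), which are looked up in learned embedding tables and concatenated with the BERT token embeddings.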
Method | Precision | Recall | F1-Score
BERT | 64.49 | 63.22 | 63.85
BioBERT | 63.97 | 62.15 | 63.06
BERT_PE | — | — | —
BERT_PE+PA | — | — | —
Proposed Model | — | — | —

Table 2: Performance comparison of our proposed model with baseline methods.

4.2. Ablation Study
We performed an ablation study by removing one component at a time from the proposed model and evaluating its performance to understand the impact of each component. Based on the study, we found that when we remove the contrastive loss function from our learning approach, the model performance drops significantly: by 6.46% F1 score, 6.53% recall, and 6.4% precision. This significant decrease shows that making the embeddings of two samples of the same class similar, and those of different classes dissimilar, brings a great advantage to the training of the model. The contrastive loss function lets us learn representations in which samples of the same class lie closer to each other in vector space, and hence allows us to generate labels for the unlabelled data in the dataset (discussed later in this section).

We also observe that the domain-specific knowledge resource, with contextualized embeddings trained over a large corpus (GPT-3), is very important. When we additionally remove this second component from our model, we see a total decrease of 9.01% F1 score, 8.92% precision, and 9.11% recall in the proposed model's performance. This component was largely responsible for removing ambiguity in the data using phrases from human-curated domain-specific knowledge bases (such as the DAO, DSM-5, SNOMED-CT, and others). The contextualized embeddings also let us consider a global representation of the entities present in the dataset, further improving the model's performance.
Model | Precision | Recall | F1-Score
Proposed Model | — | — | —
− contrastive loss | ↓ 6.40 | ↓ 6.53 | ↓ 6.46
− contrastive loss and knowledge infusion | ↓ 8.92 | ↓ 9.11 | ↓ 9.01

Table 3: Ablation study over the proposed model to evaluate the effect of the contrastive learning loss and knowledge infusion in determining the relationship between cannabis and depression.
This shows that every component of the proposed model is necessary for the best-performing results.
After training the model, we annotate the unlabelled data in the dataset by classifying it among the three relationships. We parse the unlabelled tweets through the first module to extract the phrases from the knowledge bases using contextualized embeddings. The embeddings are then pushed through the proposed model architecture to obtain a representation of each tweet. This representation is used to cluster the tweet data points, and the label of each unlabelled tweet is determined by the majority label of the data points around it. The clusters after labeling the unlabelled tweets are shown in Figure 3. Some examples from each cluster are as follows:

• Reason: "Depressionarmy weed amp sleep I awake I depressed", "mood smoke blunt except the fact I depressed", "weed hits ya RIGHT depression", "I smoked weed drank alcohol drowning sorrows away", "whole world emotionally depressed everybody need smoke blunt relax We livin nigga"

• Effect: "marijuana bad marijuana makes feel depressed low mmk", "Unemployed stoners are the most depressed on the planet", "guess depression took long time discover marijuana makes VERY DEPRESSED alcohol doesnt help either", "This weird rant like weed makes anxiety depression worse Im soooo sick ppl like jus", "waking weed naps nigga feeling depressed hell"

• Addiction: "I feel like weed calm someone suffer depression anxiety psychosis predisposed either", "Small trigger warning Blaine suffers anxiety depression occasionally smoke pot", "need blunt accompany depression", "This bot crippling depression ok weed lol", "Violate blunt distraction possibly despair bask commitment This would never happen"
Figure 3: The depression and cannabis phrases grouped according to the relationships of Reason, Effect, and Addiction. Purple: Reason; green: Effect; blue (with a little yellow below): Addiction.
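The majority-vote labeling of unlabelled tweets described above can be approximated by a k-nearest-neighbour vote in the learned representation space; the 2-D embeddings, labels, and choice of k below are toy assumptions for illustration:

```python
from collections import Counter
from math import sqrt

def euclid(u, v):
    """Euclidean distance between two dense vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_label(query, labeled, k=3):
    """Label an unlabelled tweet embedding by majority vote among its k
    nearest labelled neighbours in the learned representation space."""
    nearest = sorted(labeled, key=lambda item: euclid(query, item[0]))[:k]
    votes = Counter(lab for _, lab in nearest)
    return votes.most_common(1)[0][0]

# Toy learned embeddings of expert-labeled tweets:
labeled = [([0.0, 0.1], "Reason"), ([0.1, 0.0], "Reason"),
           ([1.0, 1.0], "Effect"), ([0.9, 1.1], "Effect"),
           ([2.0, 0.0], "Addiction")]
print(knn_label([0.05, 0.05], labeled))  # → Reason
```

Because the contrastive loss pulls same-relationship tweets together, the neighbourhood vote is reliable, which is how the remaining 7,000 unlabelled tweets can be annotated from the 3,000 expert labels.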
5. Reproducibility
From this study, we will deliver the high-quality annotated dataset of 3,000 tweets along with the fully annotated dataset (labeled by our method) of 11,000 tweets, which will be made publicly available to support research on the psychological impact of cannabis and depression during COVID-19, as cannabis use related to depression is seeing a rise once more. The trained model will also be shared for reproducibility of the results and annotation of tweets. Unfortunately, we cannot release the code used for training at this time, as Microsoft recently acquired an exclusive license to the GPT-3 model. Therefore, to use the learning method proposed in this paper, GPT-3 will need to be substituted with an alternative language model such as BERT or GPT-2.
6. Conclusion
In this study, we present a method to determine the relationship between depression and cannabis consumption. We motivate the necessity of understanding this issue by the rapid increase in cases of depression in the US and across the world, and the subsequent increase in cannabis consumption. We utilize tweets to understand the relationship, as tweets are typically unfiltered expressions of simple usage patterns among cannabis users who use it in association with their depressive moods or disorder. We present a knowledge-aware method that determines the relationship significantly better than the state of the art, show the quality of the learned relationships through visualization of t-SNE-based clusters, and annotate the unlabeled parts of the dataset. We show by training on this new dataset (human-labeled and estimated labels) that the model's prediction quality is improved. We present this high-quality dataset for utilization by the broader scientific community to better understand the relationship between depression and cannabis consumption.

https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/
7. Broader Impact
Although we develop our method specifically to handle relationship extraction between depression and cannabis consumption, more generally we develop a domain-knowledge-infused relationship extraction mechanism that uses state-of-the-art language models and few-shot machine learning techniques (contrastive loss) to achieve efficient, knowledge-guided extraction. We see improved quality in the results over transformer models. We believe that for applications with real-life consequences such as these, it is crucial to infuse domain knowledge, as a human would, combined with the language understanding obtained from language models, to identify relationships efficiently. Humans can typically learn from very few examples. Motivated by this and by the lack of available examples, we developed our relation extraction method. We hope our significantly improved results will encourage scientists to further explore domain knowledge infusion in application settings that demand highly specialized domain expertise.
References

[26] C. Lin, T. Miller, D. Dligach, S. Bethard, G. Savova, A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, 2019, pp. 65-71.
[27] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84-92.
[28] A. Sheth, M. Gaur, U. Kursuncu, R. Wickramarachchi, Shades of knowledge-infused learning for enhancing deep learning, IEEE Internet Computing 23 (2019) 54-63.
[29] U. Kursuncu, M. Gaur, A. Sheth, Knowledge infused learning (K-IL): Towards deep incorporation of knowledge in deep learning, arXiv preprint arXiv:1912.00512 (2019).
[30] M. Gaur, U. Kursuncu, A. Sheth, R. Wickramarachchi, S. Yadav, Knowledge-infused deep learning, in: Proceedings of the 31st ACM Conference on Hypertext and Social Media, 2020, pp. 309-310.
[31] C. Liu, W. Sun, W. Chao, W. Che, Convolution neural network for relation extraction, in: International Conference on Advanced Data Mining and Applications, Springer, 2013, pp. 231-242.
[32] M. Miwa, M. Bansal, End-to-end relation extraction using LSTMs on sequences and tree structures, arXiv preprint arXiv:1601.00770 (2016).
[33] S. Yadav, A. Ekbal, S. Saha, A. Kumar, P. Bhattacharyya, Feature assisted stacked attentive shortest dependency path based Bi-LSTM model for protein-protein interaction, Knowledge-Based Systems 166 (2019) 18-29.
[34] D. Liang, W. Xu, Y. Zhao, Combining word-level and character-level representations for relation classification of informal text, in: Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017, pp. 43-47.
[35] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, K. Weinberger, Simplifying graph convolutional networks, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research.