Back to Prior Knowledge: Joint Event Causality Extraction via Convolutional Semantic Infusion
Zijian Wang, Hao Wang∗, Xiangfeng Luo∗, and Jianqi Gao
School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
{zijianwang, wang-hao, luoxf, gjqss}@shu.edu.cn

Abstract.
Joint event and causality extraction is a challenging yet essential task in information retrieval and data mining. Recently, pre-trained language models (e.g., BERT) yield state-of-the-art results and dominate a variety of NLP tasks. However, these models are incapable of imposing external knowledge in domain-specific extraction. Considering that the prior knowledge of frequent n-grams representing cause/effect events may benefit both event and causality extraction, in this paper we propose convolutional knowledge infusion for frequent n-grams with different window lengths within a joint extraction framework. Knowledge infusion during convolutional filter initialization not only helps the model capture both intra-event (i.e., features within an event cluster) and inter-event (i.e., associations across event clusters) features but also boosts training convergence. Experimental results on the benchmark datasets show that our model significantly outperforms the strong BERT+CSNN baseline.
Keywords:
Causality extraction · Prior knowledge · Semantic infusion.
Introduction

Joint event and causality extraction from natural language text is a challenging task in knowledge discovery [1], discourse understanding [12], and machine comprehension [19]. Formally, joint event causality extraction is defined as the procedure of extracting an event triplet consisting of a cause event e1, an effect event e2, and the underlying cause-effect relation between them from raw text. Figure 1 gives an example from the financial domain.
[Figure 1 shows an input sentence and its tags: "原材料价格上涨，导致公司部分产品毛利率下降。" is labelled with B-C/I-C over the cause "原材料价格上涨" (raw material prices rise), B-E/I-E over the effect "毛利率下降" (gross profit rate down), and O elsewhere, together with the extracted causal relation from cause to effect.]
Fig. 1: An example of joint event causality extraction from raw financial text. It extracts cause and effect events with the underlying cause-effect relation simultaneously.

Existing approaches typically cast this task as sequence tagging [10]. Among these methods, pre-trained language models [4] (e.g., BERT) dominate the state-of-the-art results on a wide range of NLP tasks. BERT provides a masked language model (MLM) supporting fine-tuning to achieve better performance on a new dataset while reducing serious feature engineering efforts. However, there are still drawbacks when applying such pre-trained language models to event causality extraction:

– The small size of available datasets is the main bottleneck of fine-tuning BERT-based models to satisfactory performance.
– Even if BERT trained on a large-scale corpus contains commonsense knowledge, it may not be sufficient for specific domains such as finance.
– Domain-specific frequency analysis of n-grams should be essential prior knowledge for recognizing events. In contrast, these important hints have not been fully emphasized in current neural architectures.

To tackle these issues, in this paper we propose a novel joint method for event causality extraction. We first pass the input text through the BERT encoding layer to generate token representations. Secondly, to infuse knowledge such as frequent cause/effect n-grams on different scales, we utilize multiple convolutional filters simultaneously (i.e., infusing intra-event knowledge). The weights of these filters are manually initialized with the centroid vectors of the cause/effect n-gram clusters. After that, we link the cause and effect events using the key-query attention mechanism to alleviate incorrect cause-effect pair candidates (i.e., infusing inter-event knowledge). Finally, we predict target labels given the contextual representation by combining a bidirectional long short-term memory (LSTM) network with a conditional random field (CRF). Empirical results show that our model, fusing the advantages of both intra- and inter-event knowledge, significantly improves event causality extraction and obtains state-of-the-art performance.

The contributions of this paper can be summarized as follows:

1. We propose a novel joint framework of event causality extraction based on recent advances in pre-trained language models, taking account of frequent domain-specific and event-relevant n-grams given statistical analysis.
2. This framework allows incorporating intra-n-gram knowledge of frequent n-grams into the deep neural architecture to filter potential cause or effect event mentions in the text during extraction.
3. Our approach also considers inter-n-gram knowledge of cause-effect co-occurrence. We adopt a query-key attention mechanism to extract cause-effect pairs in a pairwise manner.
Related Work

The methods for event or causality extraction in the literature fall into three categories: rule-based, machine learning, and neural network approaches. The rule-based methods employ linguistic resources or NLP toolkits to perform pattern matching. These methods often have low adaptability across domains and require extensive in-domain knowledge to deal with the domain generalisation problem. The second category is mainly based on machine learning techniques, requiring considerable human effort and additional time cost in feature engineering; these methods heavily rely on manual selection of textual feature sets. The third category depends on neural networks. In this section, we survey these methods and point out their remaining problems.
Numerous rule-based methods have been dedicated to event causality mining. Early works predominantly perform syntactic pattern matching, where the patterns or templates are handcrafted for domain-specific texts. For instance, Grishman et al. [6] perform syntactic and semantic analysis to extract temporal and causal relation networks from texts. Kontos et al. [11] match expository text with structural patterns to detect causal events. Girju et al. [5] validate acquired patterns in a semi-supervised way, checking whether a causal relationship is expressed based on constraints on nouns and verbs. As a result, those methods cannot generalize to a variety of domains.
The paradigm of automatic causal extraction dates back to machine learning techniques using trigger words combined with decision trees [17] to extract causal relations. Sorgente et al. [19] first extract candidate causal event pairs with pre-defined templates and then use a Bayesian classifier to filter non-causal pairs. Zhao et al. [22] compute the similarity of syntactic dependency structures to integrate causal connectives. Indeed, these methods suffer from data sparsity and require professional annotators.
Owing to powerful deep neural representations, neural networks can effectively extract implicit causal relations, and in recent years the adoption of deep learning techniques for causality extraction has become a popular choice. Methods of event extraction can be roughly divided into template argument filling approaches and sequence labelling approaches. For example, Chen et al. [3] propose to process extraction as a serial execution of relation extraction and event extraction. This method splits the task into two phases: the event elements are first extracted and then put into a pre-defined template consisting of event elements. By sharing features during extraction, it effectively avoids missing vital information.

Other models follow a sequence labelling manner. Fu et al. [9] propose to treat causality extraction as a sequence labelling problem. Martinez et al. [16] employ an LSTM model to perform contextual reasoning for predicting event relations. Jin et al. [10] propose a cascaded network model capturing local phrasal features as well as non-local dependencies to help cause and effect extraction. Despite their success, those methods ignore the nature of domain-specific extraction: on the one hand, an n-gram representing a cause or effect event should frequently appear in the text; on the other hand, the events involved in causality have a higher probability of co-occurrence.
Methodology

Figure 2 shows the overall architecture of the model. It is composed of four modules: 1) we first pass the input sentence through the BERT encoding layer and the alignment layer successively to output the BERT representation for each token; 2) a convolutional knowledge infusion layer is created to capture useful n-gram patterns and focus on domain-relevant n-grams from the beginning of the training phase; 3) we characterize inter-associations among cause and effect using a key-query attention mechanism; 4) temporal dependencies are established using a BiLSTM combined with a CRF layer, which significantly improves the performance of causality extraction.

[Figure 2 depicts the pipeline: BERT encoding, N-gram mapping (data-rank clustering of high-frequency n-grams), convolution infusion with partly fixed and partly learnable filters, query-key scaled dot-product attention, concatenation of contextual representations, and a BiLSTM+CRF tagger, illustrated on the example "成本上升导致利润减少" (cost increase leads to profit decrease).]

Fig. 2: Illustration of the model architecture. Intra- and inter-event features are extracted using the convolutional knowledge infusion and key-query attention mechanisms. The model extracts cause and effect events associated with an event causal relation at the same time.
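To make the four modules concrete, the following is a minimal PyTorch-style sketch of the pipeline. It is not the authors' released code; the hidden size, filter counts, window sizes, and head count are illustrative assumptions, and the CRF layer is only indicated by the emission scores it would consume.

```python
# Minimal sketch of the four-module pipeline (not the authors' code).
# The BERT encoder is abstracted away: `forward` takes pre-computed BERT states.
import torch
import torch.nn as nn

class CISAN(nn.Module):
    def __init__(self, hidden=768, n_filters=100, windows=(2, 3, 4),
                 n_tags=5, lstm_hidden=256, n_heads=4):
        super().__init__()
        # 2) convolutional knowledge infusion: one Conv1d per window size;
        #    their weights can later be partially overwritten with n-gram
        #    cluster centroids (see the filter-initialization sketch below).
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, w, padding="same") for w in windows])
        feat = n_filters * len(windows)
        # 3) key-query (multi-head) attention over the convolved features
        self.attn = nn.MultiheadAttention(feat, n_heads, batch_first=True)
        # 4) BiLSTM over the concatenated CNN and attention features;
        #    a linear layer produces per-token tag scores fed to a CRF.
        self.lstm = nn.LSTM(feat * 2, lstm_hidden, bidirectional=True,
                            batch_first=True)
        self.emissions = nn.Linear(2 * lstm_hidden, n_tags)

    def forward(self, bert_states):            # (batch, seq_len, hidden)
        x = bert_states.transpose(1, 2)        # Conv1d expects (B, C, L)
        conv = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        conv = conv.transpose(1, 2)            # back to (B, L, feat)
        att, _ = self.attn(conv, conv, conv)   # query-key attention
        ctx, _ = self.lstm(torch.cat([conv, att], dim=-1))
        return self.emissions(ctx)             # CRF decoding is applied on top

# Usage with dummy "BERT" states (a real run would use a BERT encoder):
scores = CISAN()(torch.randn(1, 20, 768))
print(scores.shape)                            # torch.Size([1, 20, 5])
```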
BERT Encoding: BERT is a pre-trained language model consisting of multiple layers of bidirectional Transformers, designed to jointly condition on both left and right context. It supports fine-tuning for a wide range of tasks without substantial task-specific architecture modifications, so we omit an exhaustive description of its architecture here. In our model, we use the BERT encoder to model sentences containing the event mentions by computing a context-aware representation for each token. We take the packed sequence [CLS, S, SEP] as the input, where "[CLS]" is inserted as the first token of each input sequence and "[SEP]" denotes the end of the sentence. For each token s_i in S, the BERT input can be represented as

s_i = [s_i^{tok} ⊕ s_i^{pos} ⊕ s_i^{seg}],   (1)

where s_i^{tok}, s_i^{pos}, and s_i^{seg} represent the token, position, and segment embeddings of s_i, respectively. In this regard, a sentence is expressed as a matrix H ∈ R^{l×e}, where l is the length of the sentence, e is the size of the embedding dimension, and h_i is the vector standing for the embedding of the i-th token. The context-aware representation of each token is taken from the hidden states of the last BERT layer:

h_i^N = BERT([CLS, s_1, ..., s_i, ..., s_l, SEP]),   (2)

where h_i^N denotes the last-layer hidden state for token s_i.
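As a concrete illustration of Eqs. (1)-(2), the snippet below obtains last-layer token states with the HuggingFace transformers library; the bert-base-chinese checkpoint is an assumption, since the paper only states that BERT is used.

```python
# Sketch: context-aware token representations from a BERT encoder
# (checkpoint choice is an assumption; the paper only says "BERT").
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

sentence = "原材料价格上涨，导致公司部分产品毛利率下降。"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP]
with torch.no_grad():
    outputs = model(**inputs)

h = outputs.last_hidden_state   # shape (1, seq_len, 768): one h_i per token
print(h.shape)
```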
Convolutional Semantic Infusion: In natural language processing, the convolution operation can be understood as a sliding window that continuously captures semantic features of a specific length in a sentence. The convolution operation extracts semantic features similar to the convolution filter (i.e., vectors similar to the filter receive a higher weight). In our model, the convolution layer aims to enhance event-relevant features by embedding causal patterns into the feature maps. Considering that word embeddings can be initialized with pre-trained word vectors, it should likewise be possible to initialize a convolutional filter (i.e., kernel) with a vector related to frequent event n-grams. Therefore, inspired by [13], we run unsupervised clustering to group n-grams mentioning similar cause or effect events into the same clusters, so that each cluster elegantly represents a cause/effect semantic. We then compute the centroid vector of each cluster over all cause/effect event n-grams it contains and initialize the convolution filters with these vectors. We only fix part of the weights in each filter using the cluster semantic vector, which allows our model to learn additional features by itself during training. The details are given below.
Ngram Collection:
Sentences can be segmented into particular chunks according to the size of the convolution window; for example, "我喜欢苹果" (I like apples) is split into four bigrams "我喜, 喜欢, 欢苹, 苹果" when n = 2 and step = 1. This view has an advantage over standard word representations in capturing ampler semantic information, which motivates us to recognize the event phrase boundary in this example.

Texts from different domains often have different typical lengths for event n-grams, which also governs the convolutional window size. At the same time, the n-gram length is vital prior knowledge in event extraction, yet it is often ignored by conventional character- or word-based methods, resulting in errors of event boundary recognition. We therefore count the event lengths in the dataset and find that the events of the most frequent length account for 27% of all examples in the Financial dataset. Motivated by this observation, we combine the n-gram length feature with the n-gram clusters by applying convolution filters with various window sizes.

Intuitively, "profit drop" should be more important than a location n-gram like "New York" in most cases in the financial domain. Thus, we distill the relevant n-grams in a cluster with the centroid vector. In practice, we collect n-grams only from the training data. For its simplicity and effectiveness, we rank the n-grams using Naive Bayes scores. The ranking score r of an n-gram w is calculated as

r = [(p_c^w + b) / ‖p_c‖] / [(p_e^w + b) / ‖p_e‖],   (3)

where c denotes the cause event and e the effect event, p_c^w is the number of sentences that contain the n-gram w in class c, ‖p_c‖ is the number of n-grams related to cause c, p_e^w is the number of sentences that contain the n-gram w related to e, ‖p_e‖ is the number of n-grams in e, and b is a smoothing parameter. We select the top n% of n-gram vectors by scoring n-grams with Formula 3.
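The sketch below illustrates n-gram collection and the Naive Bayes ranking of Formula 3 on character n-grams; the counting conventions, the smoothing value b, and the top-percentage cutoff are assumptions.

```python
# Sketch: collect character n-grams and rank them with the score of Eq. (3).
# The smoothing value b and the exact counting conventions are assumptions.
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def rank_ngrams(cause_spans, effect_spans, n=2, b=1.0, top_percent=20):
    p_c = Counter(g for span in cause_spans for g in set(char_ngrams(span, n)))
    p_e = Counter(g for span in effect_spans for g in set(char_ngrams(span, n)))
    norm_c, norm_e = max(len(p_c), 1), max(len(p_e), 1)
    scores = {g: ((p_c[g] + b) / norm_c) / ((p_e[g] + b) / norm_e)
              for g in set(p_c) | set(p_e)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, len(ranked) * top_percent // 100)]

# Toy usage: cause-side n-grams such as "上涨" should rank high.
print(rank_ngrams(["原材料价格上涨", "成本上升"], ["毛利率下降", "利润减少"]))
```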
Filter Initialization: First of all, we embed each n-gram into a vector representation using BERT. Since the number of filters in a CNN is insufficient to account for all selected n-grams, we only extract cluster vectors to represent the generalized cause/effect features: we perform k-means clustering and define each cluster vector as the centroid of its cluster. Considering that non-causal n-grams also exist in a sentence, we purposely leave some weights blank; that is, we do not fully fill the convolution filters with centroid vectors, and the remaining weights are randomly initialized. The convolution output for window size n is

c_i^n = f(W · h_{i:i+n−1} + b),   (4)

where W is the weight of the initialized filter, b is a bias term, and f is a nonlinear function such as sigmoid or ReLU. As a single cluster is often not enough to describe the overall information, we cluster at different length scales. To capture the features of causal events at different scales, we use parallel convolution operations with varying windows; for example, window sizes n_1, n_2, n_3 are chosen when there are three convolution layers. When the length equals 4, the model tends to extract events such as "股价下跌" (share prices fell) and "销量减少" (sales reduced) in the Financial dataset. The convolution output is computed as the concatenation of the sub-spaces:

c_i = [c_i^{n_1} ⊕ c_i^{n_2} ⊕ c_i^{n_3}].   (5)
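A minimal sketch of this initialization step is given below, under the assumption that k-means centroids of n-gram embeddings overwrite a subset of Conv1d filters while the remaining filters keep their random initialization; the number of seeded filters and whether they stay frozen during training are assumptions, since the paper only states that part of the filter weights is fixed.

```python
# Sketch: seed part of a Conv1d's filters with k-means centroids of n-gram
# embeddings (the Eq. 4/5 setting). How many filters are fixed, and whether
# they stay frozen, is an assumption.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def infuse_filters(conv, ngram_vecs, n_fixed):
    """conv: nn.Conv1d(hidden, n_filters, window); ngram_vecs: (N, hidden)."""
    n_filters, hidden, window = conv.weight.shape
    centroids = KMeans(n_clusters=n_fixed, n_init=10).fit(ngram_vecs).cluster_centers_
    with torch.no_grad():
        for k, c in enumerate(centroids):
            # Tile the centroid across the window positions of filter k;
            # the remaining (n_filters - n_fixed) filters keep random init.
            conv.weight[k] = torch.tensor(c, dtype=torch.float32)[:, None].repeat(1, window)
    return conv

# Toy usage with random "n-gram embeddings" standing in for BERT vectors.
conv = nn.Conv1d(768, 100, kernel_size=2, padding="same")
vecs = np.random.randn(500, 768).astype("float32")
infuse_filters(conv, vecs, n_fixed=40)
```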
Query-Key Attention: Query-key attention is an attention mechanism that computes the representation of a sequence from the sequence itself. It has been successfully applied in many NLP tasks, such as machine translation and language understanding. In our model, the query-key attention mechanism is employed to mine the inter-associations between cause and effect events. After the multi-scale convolution operation, we obtain a feature map for each token. Instead of max-pooling, we use multi-head attention to check whether a cause-effect relation exists. Following Vaswani et al. [21], we split the encoded representations of the sequence into homogeneous sub-vectors, called heads. The input consists of queries Q ∈ R^{t×d}, keys K ∈ R^{t×d}, and values V ∈ R^{t×d}, where d is the dimension size:

Attention(Q, K, V) = softmax(QK^T / √d) V,   (6)

H_i = Attention(QW_i^Q, KW_i^K, VW_i^V),   (7)

H_head = [H_1 ⊕ H_2 ⊕ ··· ⊕ H_h] W,   (8)

where h is the number of heads and the query, key, and value matrices of each head have dimension d/h. We perform the attention in parallel and concatenate the output values of the h heads; the parameter matrices of the i-th linear projections are W_i^Q ∈ R^{n×(d/h)}, W_i^K ∈ R^{n×(d/h)}, and W_i^V ∈ R^{n×(d/h)}. The outputs of the CNN and the attention structure are then concatenated to yield token-wise contextual representations.

BiLSTM and CRF Layers: Long short-term memory (LSTM) is a particular recurrent neural network (RNN) that overcomes the vanishing and exploding gradient problems of traditional RNNs. Through its specially designed gate structure, the model can selectively keep context information. Considering that text has obvious temporal characteristics, we use a BiLSTM to model the temporal dependencies of tokens. A conditional random field (CRF) then obtains the globally optimal tag chain for the input sequence, taking the correlation between neighbouring tags into consideration. For the sentence S = {s_1, s_2, ..., s_n} with a tag sequence y = {y_1, y_2, ..., y_n}, the CRF scores the output as

score(S, y) = Σ_{i=1}^{n+1} A_{y_{i−1}, y_i} + Σ_{i=1}^{n} P_{i, y_i},   (9)

where A is the transition score matrix and P contains the per-token emission scores.
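The following numpy sketch computes the path score of Eq. (9); the handling of the implicit start and stop transitions (the i = 1 and i = n + 1 terms) via dedicated start/stop tags is an assumption.

```python
# Sketch of the CRF path score in Eq. (9): transition scores A plus
# per-token emission scores P. The start/end transitions implied by the
# i = 1 .. n+1 sum are handled with dedicated start/stop tags (assumption).
import numpy as np

def crf_path_score(P, y, A, start_tag, stop_tag):
    """P: (n, n_tags) emissions, y: length-n tag path, A: (n_tags, n_tags)."""
    path = [start_tag] + list(y) + [stop_tag]
    trans = sum(A[path[i], path[i + 1]] for i in range(len(path) - 1))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit

# Toy usage: 3 tokens, tag set {O, B-C, I-C, B-E, I-E, <s>, </s>}.
rng = np.random.default_rng(0)
P, A = rng.normal(size=(3, 7)), rng.normal(size=(7, 7))
print(crf_path_score(P, [1, 2, 0], A, start_tag=5, stop_tag=6))
```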
Training Objective: The goal of training is to minimize the following loss:

E = log Σ_{ỹ∈Y} exp score(S, ỹ) − score(S, y),   (10)

where Y is the set of all possible tagging sequences for an input sentence.

Experiments

Datasets: We conduct our experiments on three datasets. The first is the Chinese Emergency Corpus (CEC, https://github.com/shijiebei2009/CEC-Corpus), a publicly available event ontology corpus covering six event categories: outbreak, earthquake, fire, traffic accident, terrorist attack, and food poisoning. We extract 1,026 sentences mentioning events and causality from this corpus. Since few datasets are publicly available, for a fair comparison we also build an in-house dataset called "Financial" from Chinese web encyclopedias such as Jinrongjie and Hexun; it contains a large number of financial articles, including cause-effect mentions, and is divided into a training set (1,900 instances), a validation set (200 instances), and a test set (170 instances). Because few English datasets are publicly available for event causality extraction, we re-annotated the SemEval-2010 Task 8 dataset [7] and evaluate our model on it, obtaining 1,003 causality instances.

Table 1: Statistical details of the datasets, including training, development, and test sets.

Statistics | CEC | Financial | SemEval2010
Average sentence length | 31.14 | 57.94 | 18.54
Mean distance between causal events | 10.24 | 13.49 | 5.33
Mode of cause event length (proportion) | 2 (41%) | 4 (31%) | 1 (85%)
Mode of effect event length (proportion) | 4 (27%) | 4 (23%) | 1 (91%)
Average cause event length | 4.03 | 6.06 | 0.96
Average effect event length | 5.41 | 6.46 | 0.98

Table 2: Average F1-scores (%) of joint event causality extraction using different models. Boldface indicates scores better than the baseline system. The lengths of high-frequency cause/effect events are 2, 4, and 1 for the CEC, Financial, and SemEval2010 datasets, respectively.
Model | CEC | Financial | SemEval2010
IDCNN+CRF [20] | 68.26 | 71.81 | 68.59
BiLSTM+CRF [8] | 68.74 | 74.75 | 73.20
CNN+BiLSTM+CRF [15] | 71.68 | 74.31 | 74.20
CSNN [10] | 70.61 | 74.59 | 73.71
BERT+CSNN (baseline) | 74.61 | 76.23 | 75.69
CISAN | 72.49 | 75.99 | 74.20
BERT+CISAN (unigram) | | |
BERT+CISAN (bigram) | | |
BERT+CISAN (trigram) | 74.45 | |
BERT+CISAN (quagram) | | |
Experimental Settings: We use the pre-trained uncased BERT model and set the model hyper-parameters according to [4]. For all datasets, we set the maximum sentence length to 100, the batch size to 8, the number of training epochs to 100, and the learning rate for Adam to 1 × −. To prevent over-fitting, we set the dropout rate to 0.5. It is worth mentioning that the n-gram lengths are assigned according to our preliminary statistics. For n-gram clustering, we set the number of clusters equal to the dimension of the convolution filters; both are 100.
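For reference, the stated hyper-parameters can be collected into a single configuration sketch; the Adam learning-rate exponent is not recoverable from the text, so it is left unset rather than guessed.

```python
# Hyper-parameters as stated in the paper, gathered into one place.
# The Adam learning-rate exponent is unreadable in the source and is left as
# None rather than guessed.
CONFIG = {
    "max_sentence_length": 100,
    "batch_size": 8,
    "adam_learning_rate": None,   # "1 x 10^-?" in the paper; exponent unknown
    "epochs": 100,
    "dropout": 0.5,
    "n_clusters": 100,            # equal to the convolution filter dimension
    "conv_filter_dim": 100,
}
```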
We compare our model with previous models and conduct additional ablation experiments to verify its effectiveness. Table 2 reports the results on the CEC, Financial, and SemEval2010 datasets. Following previous work, we use the standard F1-score as the evaluation metric.
IDCNN+CRF [20]: A modified CNN model using iterated dilated convolutional neural networks, which permits fixed-depth convolutions to run in parallel across documents for speed while retaining accuracy comparable to BiLSTM-CRF.
BiLSTM+CRF : This is a classic sequential labeling model [8], which mod-els context information using Bi-LSTM and learns sequential tags using CRF.
CNN+BiLSTM+CRF : This model [2] uses CNN to enhance BiLSTM+CRFto capture local n-gram dependencies.
CSNN: This variation [10] modifies the basic BiLSTM+CRF architecture with an additional CNN layer to capture features inside n-grams and a self-attention mechanism to capture features across n-grams.
CISAN : Our model with convolutional semantic-infused filters and key-query attention, using GloVe word embeddings.
BERT+CSNN: BERT serves as the encoder for CSNN, replacing the GloVe embedding layer.

BERT+CISAN: Our model with convolutional semantic-infused filters and key-query attention, incorporated with the BERT encoder.

[Figure 3 plots the F1-score against the training epoch on the Financial test set for Our model, BERT+CSNN, CSNN, CNN+BiLSTM+CRF, and BiLSTM+CRF.]
Fig. 3: F1-score on the test set of Financial w.r.t. training epoch.

[Figure 4 plots F1-scores (roughly 70 to 80) on CEC, Financial, and SemEval for Our model, -BERT, -CI, -Attention, and the baseline.]
Fig. 4: Ablation analysis of our model on the three datasets.
To verify that BERT and the convolutional semantic infusion both contribute to extraction, we construct another strong baseline, BERT+CSNN. BERT+CISAN produces a decent F1 score that is 1.32%, 0.86%, and 1.96% higher than BERT+CSNN on CEC, Financial, and SemEval2010, respectively, showing the effectiveness of the n-gram knowledge combined with the pre-trained model.

As described above, the n-gram length serves as a determining hyper-parameter that reflects the frequency information of event length in our model. To examine this hypothesis, we assign different length values and investigate their influence on the results. Table 2 indicates that the best lengths of high-frequency cause/effect n-grams for the CEC, Financial, and SemEval2010 datasets are 2, 4, and 1, respectively. Intuitively, causal events in Financial are very likely to be described as phrases of length 4, such as "股价下跌" (share prices fell) and "销量减少" (sales reduced); we think the convolution is more sensitive to such events. When our method uses n-grams of length 4, we obtain the best results compared to other lengths, which demonstrates the effectiveness of the semantic infusion.

Comparison w.r.t. epoch.
We further compare the convergence speed of the models. As shown in Figure 3, the non-BERT models start with a low F1 score at the very beginning, whereas our model already reaches a promising score within the first 10 epochs. Moreover, compared with the strong BERT-based baseline, our model also obtains a clear advantage. This suggests that the prior knowledge helps the model surpass the others within a few training epochs.
Ablation experiments.
We verify whether each component has a positive effect by removing the BERT, CI (convolution infusion), and attention layers in turn. The results on the three datasets are shown in Figure 4. The contribution of the BERT layer is the most significant, because BERT is sufficiently pre-trained and enriched with semantic knowledge. Moreover, the model gains further improvements from the CI and attention layers: the CI layer, which incorporates knowledge related to cause/effect events, improves the F1 score, and the attention layer is also effective at removing noise. In conclusion, all these layers jointly contribute to obtaining state-of-the-art performance.
Conclusion

This paper introduces a novel n-gram-based neural solution for joint causal relation and event extraction. The model looks up high-frequency cause/effect n-grams in the sentence by partially encoding the causal features into convolution filters. In practice, we model the associations of n-gram pairs to find potential cause-effect pairs. As a result, our proposal achieves significant improvements over the baselines. This work shows the feasibility of further exploiting intra- and inter-event knowledge around n-grams to guide a neural extractor. In the future, a potential direction is to adapt our method to few-shot learning for coarse-to-fine causality extraction. We are also interested in extracting multiple cause-effect pairs rather than only one cause-effect event pair at a time.
References
1. Asghar, N.: Automatic extraction of causal relations from natural language texts: a comprehensive survey. arXiv preprint arXiv:1605.07895 (2016)
2. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 221–230 (2017)
3. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 167–176. Association for Computational Linguistics (Jul 2015)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Girju, R., Moldovan, D.I., et al.: Text mining for causal relations. In: FLAIRS Conference. pp. 360–364 (2002)
6. Grishman, R.: Domain modeling for language analysis. Tech. rep., New York University, NY (1988)
7. Hendrickx, I., Kim, S.N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., Szpakowicz, S.: SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the 5th International Workshop on Semantic Evaluation. pp. 33–38. Association for Computational Linguistics (2010)
8. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
9. Jian, F., Zong-Tian, L., Wei, L., Wen, Z.: Event causal relation extraction based on cascaded conditional random fields. Pattern Recognition and Artificial Intelligence (4), 567–573 (2011)
10. Jin, X., Wang, X., Luo, X., Huang, S., Gu, S.: Inter-sentence and implicit causality extraction from Chinese corpus. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 739–751. Springer (2020)
11. Kontos, J., Sidiropoulou, M.: On the acquisition of causal knowledge from scientific texts with attribute grammars. International Journal of Applied Expert Systems (1), 31–48 (1991)
12. Li, P., Mao, K.: Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications, 512–523 (2019)
13. Li, S., Zhao, Z., Liu, T., Hu, R., Du, X.: Initializing convolutional filters with semantic features for text classification. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1884–1889 (2017)
14. Lin, Y., Shen, S., Liu, Z., Luan, H., Sun, M.: Neural relation extraction with selective attention over instances. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2124–2133 (2016)
15. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016)
16. Martínez, E., Shwartz, V., Gurevych, I., Dagan, I.: Neural disambiguation of causal lexical markers based on context. In: IWCS 2017, 12th International Conference on Computational Semantics, Short papers (2017)
17. Riaz, M., Girju, R.: Another look at causality: Discovering scenario-specific contingency relationships with no supervision. In: 2010 IEEE Fourth International Conference on Semantic Computing. pp. 361–368. IEEE (2010)
18. Shen, Y., Huang, X.J.: Attention-based convolutional neural network for semantic relation extraction. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 2526–2536 (2016)
19. Sorgente, A., Vettigli, G., Mele, F.: Automatic extraction of cause-effect relations in natural language text. DART@AI*IA, 37–48 (2013)
20. Strubell, E., Verga, P., Belanger, D., McCallum, A.: Fast and accurate entity recognition with iterated dilated convolutions. arXiv preprint arXiv:1702.02098 (2017)
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
22. Zhao, S., Liu, T., Zhao, S., Chen, Y., Nie, J.Y.: Event causality extraction based on connectives analysis. Neurocomputing 173