Back to Prior Knowledge: Joint Event Causality Extraction via Convolutional Semantic Infusion
Zijian Wang, Hao Wang∗, Xiangfeng Luo∗, and Jianqi Gao
School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
{zijianwang, wang-hao, luoxf, gjqss}@shu.edu.cn

Abstract.
Joint event and causality extraction is a challenging yet essential task in information retrieval and data mining. Recently, pre-trained language models (e.g., BERT) yield state-of-the-art results and dominate a variety of NLP tasks. However, these models are incapable of imposing external knowledge in domain-specific extraction. Considering that the prior knowledge of frequent n-grams representing cause/effect events may benefit both event and causality extraction, in this paper we propose convolutional knowledge infusion for frequent n-grams with different window lengths within a joint extraction framework. Knowledge infusion during convolutional filter initialization not only helps the model capture both intra-event (i.e., features within an event cluster) and inter-event (i.e., associations across event clusters) features but also boosts training convergence. Experimental results on the benchmark datasets show that our model significantly outperforms the strong BERT+CSNN baseline.
Keywords:
Causality extraction · Prior knowledge · Semantic infusion.
Introduction

Joint event and causality extraction from natural language text is a challenging task in knowledge discovery [1], discourse understanding [12], and machine comprehension [19]. Formally, joint event causality extraction is defined as the procedure of extracting an event triplet consisting of a cause event e1, an effect event e2, and the underlying cause-effect relation between them from raw text. Figure 1 gives an example from the financial domain.
[Figure 1 shows an input sentence and its tags: "原材料价格上涨，导致公司部分产品毛利率下降。" is labelled with B-C/I-C over the cause "原材料价格上涨" (raw material prices rise), B-E/I-E over the effect "毛利率下降" (gross profit rate down), and O elsewhere, together with the extracted causal relation from cause to effect.]
Fig. 1: An example of joint event causality extraction from raw financial text. It extracts cause and effect events with the underlying cause-effect relation simultaneously.

Existing approaches typically cast this task as sequence tagging [10]. Among these methods, pre-trained language models [4] (e.g., BERT) dominate the state-of-the-art results on a wide range of NLP tasks. BERT provides a masked language model (MLM) supporting fine-tuning to achieve better performance on a new dataset while reducing serious feature engineering efforts. However, there are still drawbacks when applying such pre-trained language models to event causality extraction:

– The small size of available datasets is the main bottleneck of fine-tuning BERT-based models to satisfactory performance.
– Even if BERT trained on a large-scale corpus contains commonsense knowledge, it may not be sufficient for specific domains such as finance.
– Domain-specific frequency analysis of n-grams should be essential prior knowledge for recognizing events. In contrast, these important hints have not been fully emphasized in current neural architectures.

To tackle these issues, in this paper we propose a novel joint method for event causality extraction. We first pass the input text through the BERT encoding layer to generate token representations. Secondly, to infuse knowledge such as frequent cause/effect n-grams on different scales, we utilize multiple convolutional filters simultaneously (i.e., infusing intra-event knowledge). The weights of these filters are manually initialized with the centroid vectors of the cause/effect n-gram clusters. After that, we link the cause and effect events using the key-query attention mechanism to alleviate incorrect cause-effect pair candidates (i.e., infusing inter-event knowledge). Finally, we predict target labels given the contextual representation by combining a bidirectional long short-term memory (LSTM) network with a conditional random field (CRF). Empirical results show that our model, fusing the advantages of both intra- and inter-event knowledge, significantly improves event causality extraction and obtains state-of-the-art performance.

The contributions of this paper can be summarized as follows:

1. We propose a novel joint framework of event causality extraction based on recent advances in pre-trained language models, taking account of frequent domain-specific and event-relevant n-grams given statistical analysis.
2. This framework allows incorporating intra-n-gram knowledge of frequent n-grams into the deep neural architecture to filter potential cause or effect event mentions in the text during extraction.
3. Our approach also considers inter-n-gram knowledge of cause-effect co-occurrence. We adopt a query-key attention mechanism to extract cause-effect pairs in a pairwise manner.
Related Work

The methods for event or causality extraction in the literature fall into three categories: rule-based, machine learning, and neural network approaches. The rule-based methods employ linguistic resources or NLP toolkits to perform pattern matching. These methods often have low adaptability across domains and require extensive in-domain knowledge to deal with the domain generalisation problem. The second category is mainly based on machine learning techniques, requiring considerable human effort and additional time cost in feature engineering; these methods heavily rely on manual selection of textual feature sets. The third category depends on neural networks. In this section, we survey these methods and point out their remaining problems.
Numerous rule-based methods have been dedicated to event causality mining. Early works predominantly perform syntactic pattern matching, where the patterns or templates are handcrafted for domain-specific texts. For instance, Grishman et al. [6] perform syntactic and semantic analysis to extract temporal and causal relation networks from texts. Kontos et al. [11] match expository text with structural patterns to detect causal events. Girju et al. [5] validate acquired patterns in a semi-supervised way, checking whether a causal relationship is expressed based on constraints on nouns and verbs. As a result, those methods cannot generalize to a variety of domains.
The paradigm of automatic causal extraction dates back to machine learning techniques using trigger words combined with decision trees [17] to extract causal relations. Sorgente et al. [19] first extract candidate causal event pairs with pre-defined templates and then use a Bayesian classifier to filter non-causal pairs. Zhao et al. [22] compute the similarity of syntactic dependency structures to integrate causal connectives. Indeed, these methods suffer from data sparsity and require professional annotators.
Owing to powerful deep neural representations, neural networks can effectively extract implicit causal relations, and in recent years the adoption of deep learning techniques for causality extraction has become a popular choice. Methods of event extraction can be roughly divided into template argument filling approaches and sequence labelling approaches. For example, Chen et al. [3] propose to process extraction as a serial execution of relation extraction and event extraction. This method splits the task into two phases: the event elements are first extracted and then put into a pre-defined template consisting of event elements. By sharing features during extraction, it effectively avoids missing vital information.

Other models follow a sequence labelling manner. Fu et al. [9] propose to treat causality extraction as a sequence labelling problem. Martinez et al. [16] employ an LSTM model to perform contextual reasoning for predicting event relations. Jin et al. [10] propose a cascaded network model capturing local phrasal features as well as non-local dependencies to help cause and effect extraction. Despite their success, those methods ignore the nature of domain-specific extraction: on the one hand, an n-gram representing a cause or effect event should frequently appear in the text; on the other hand, the events involved in causality have a higher probability of co-occurrence.
Methodology

Figure 2 shows the overall architecture of the model. It is composed of four modules: 1) we first pass the input sentence through the BERT encoding layer and the alignment layer successively to output the BERT representation for each token; 2) a convolutional knowledge infusion layer is created to capture useful n-gram patterns and focus on domain-relevant n-grams from the beginning of the training phase; 3) we characterize inter-associations among cause and effect using a key-query attention mechanism; 4) temporal dependencies are established using a BiLSTM combined with a CRF layer, which significantly improves the performance of causality extraction.

[Figure 2 depicts the pipeline: BERT encoding, N-gram mapping (data-rank clustering of high-frequency n-grams), convolution infusion with partly fixed and partly learnable filters, query-key scaled dot-product attention, concatenation of contextual representations, and a BiLSTM+CRF tagger, illustrated on the example "成本上升导致利润减少" (cost increase leads to profit decrease).]

Fig. 2: Illustration of the model architecture. Intra- and inter-event features are extracted using the convolutional knowledge infusion and key-query attention mechanisms. The model extracts cause and effect events associated with an event causal relation at the same time.
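To make the four modules concrete, the following is a minimal PyTorch-style sketch of the pipeline. It is not the authors' released code; the hidden size, filter counts, window sizes, and head count are illustrative assumptions, and the CRF layer is only indicated by the emission scores it would consume.

```python
# Minimal sketch of the four-module pipeline (not the authors' code).
# The BERT encoder is abstracted away: `forward` takes pre-computed BERT states.
import torch
import torch.nn as nn

class CISAN(nn.Module):
    def __init__(self, hidden=768, n_filters=100, windows=(2, 3, 4),
                 n_tags=5, lstm_hidden=256, n_heads=4):
        super().__init__()
        # 2) convolutional knowledge infusion: one Conv1d per window size;
        #    their weights can later be partially overwritten with n-gram
        #    cluster centroids (see the filter-initialization sketch below).
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, w, padding="same") for w in windows])
        feat = n_filters * len(windows)
        # 3) key-query (multi-head) attention over the convolved features
        self.attn = nn.MultiheadAttention(feat, n_heads, batch_first=True)
        # 4) BiLSTM over the concatenated CNN and attention features;
        #    a linear layer produces per-token tag scores fed to a CRF.
        self.lstm = nn.LSTM(feat * 2, lstm_hidden, bidirectional=True,
                            batch_first=True)
        self.emissions = nn.Linear(2 * lstm_hidden, n_tags)

    def forward(self, bert_states):            # (batch, seq_len, hidden)
        x = bert_states.transpose(1, 2)        # Conv1d expects (B, C, L)
        conv = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        conv = conv.transpose(1, 2)            # back to (B, L, feat)
        att, _ = self.attn(conv, conv, conv)   # query-key attention
        ctx, _ = self.lstm(torch.cat([conv, att], dim=-1))
        return self.emissions(ctx)             # CRF decoding is applied on top

# Usage with dummy "BERT" states (a real run would use a BERT encoder):
scores = CISAN()(torch.randn(1, 20, 768))
print(scores.shape)                            # torch.Size([1, 20, 5])
```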
BERT Encoding: BERT is a pre-trained language model consisting of multiple layers of bidirectional Transformers, designed to jointly condition on both left and right context. It supports fine-tuning for a wide range of tasks without substantial task-specific architecture modifications, so we omit an exhaustive description of its architecture here. In our model, we use the BERT encoder to model sentences containing the event mentions by computing a context-aware representation for each token. We take the packed sequence [CLS, S, SEP] as the input, where "[CLS]" is inserted as the first token of each input sequence and "[SEP]" denotes the end of the sentence. For each token s_i in S, the BERT input can be represented as

s_i = [s_i^{tok} ⊕ s_i^{pos} ⊕ s_i^{seg}],   (1)

where s_i^{tok}, s_i^{pos}, and s_i^{seg} represent the token, position, and segment embeddings of s_i, respectively. In this regard, a sentence is expressed as a matrix H ∈ R^{l×e}, where l is the length of the sentence, e is the size of the embedding dimension, and h_i is the vector standing for the embedding of the i-th token. The context-aware representation of each token is taken from the hidden states of the last BERT layer:

h_i^N = BERT([CLS, s_1, ..., s_i, ..., s_l, SEP]),   (2)

where h_i^N denotes the last-layer hidden state for token s_i.
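As a concrete illustration of Eqs. (1)-(2), the snippet below obtains last-layer token states with the HuggingFace transformers library; the bert-base-chinese checkpoint is an assumption, since the paper only states that BERT is used.

```python
# Sketch: context-aware token representations from a BERT encoder
# (checkpoint choice is an assumption; the paper only says "BERT").
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

sentence = "原材料价格上涨，导致公司部分产品毛利率下降。"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP]
with torch.no_grad():
    outputs = model(**inputs)

h = outputs.last_hidden_state   # shape (1, seq_len, 768): one h_i per token
print(h.shape)
```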
Convolutional Semantic Infusion: In natural language processing, the convolution operation can be understood as a sliding window that continuously captures semantic features of a specific length in a sentence. The convolution operation extracts semantic features similar to the convolution filter (i.e., vectors similar to the filter receive a higher weight). In our model, the convolution layer aims to enhance event-relevant features by embedding causal patterns into the feature maps. Considering that word embeddings can be initialized with pre-trained word vectors, it should likewise be possible to initialize a convolutional filter (i.e., kernel) with a vector related to frequent event n-grams. Therefore, inspired by [13], we run unsupervised clustering to group n-grams mentioning similar cause or effect events into the same clusters, so that each cluster elegantly represents a cause/effect semantic. We then compute the centroid vector of each cluster over all cause/effect event n-grams it contains and initialize the convolution filters with these vectors. We only fix part of the weights in each filter using the cluster semantic vector, which allows our model to learn additional features by itself during training. The details are given below.
Ngram Collection:
Sentences can be segmented into particular chunks according to the size of the convolution window; for example, "我喜欢苹果" (I like apples) is split into four bigrams "我喜, 喜欢, 欢苹, 苹果" when n = 2 and step = 1. This view has an advantage over standard word representations in capturing ampler semantic information, which motivates us to recognize the event phrase boundary in this example.

Texts from different domains often have different typical lengths for event n-grams, which also governs the convolutional window size. At the same time, the n-gram length is vital prior knowledge in event extraction, yet it is often ignored by conventional character- or word-based methods, resulting in errors of event boundary recognition. We therefore count the event lengths in the dataset and find that the events of the most frequent length account for 27% of all examples in the Financial dataset. Motivated by this observation, we combine the n-gram length feature with the n-gram clusters by applying convolution filters with various window sizes.

Intuitively, "profit drop" should be more important than a location n-gram like "New York" in most cases in the financial domain. Thus, we distill the relevant n-grams in a cluster with the centroid vector. In practice, we collect n-grams only from the training data. For its simplicity and effectiveness, we rank the n-grams using Naive Bayes scores. The ranking score r of an n-gram w is calculated as

r = [(p_c^w + b) / ‖p_c‖] / [(p_e^w + b) / ‖p_e‖],   (3)

where c denotes the cause event and e the effect event, p_c^w is the number of sentences that contain the n-gram w in class c, ‖p_c‖ is the number of n-grams related to cause c, p_e^w is the number of sentences that contain the n-gram w related to e, ‖p_e‖ is the number of n-grams in e, and b is a smoothing parameter. We select the top n% of n-gram vectors by scoring n-grams with Formula 3.
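The sketch below illustrates n-gram collection and the Naive Bayes ranking of Formula 3 on character n-grams; the counting conventions, the smoothing value b, and the top-percentage cutoff are assumptions.

```python
# Sketch: collect character n-grams and rank them with the score of Eq. (3).
# The smoothing value b and the exact counting conventions are assumptions.
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def rank_ngrams(cause_spans, effect_spans, n=2, b=1.0, top_percent=20):
    p_c = Counter(g for span in cause_spans for g in set(char_ngrams(span, n)))
    p_e = Counter(g for span in effect_spans for g in set(char_ngrams(span, n)))
    norm_c, norm_e = max(len(p_c), 1), max(len(p_e), 1)
    scores = {g: ((p_c[g] + b) / norm_c) / ((p_e[g] + b) / norm_e)
              for g in set(p_c) | set(p_e)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, len(ranked) * top_percent // 100)]

# Toy usage: cause-side n-grams such as "上涨" should rank high.
print(rank_ngrams(["原材料价格上涨", "成本上升"], ["毛利率下降", "利润减少"]))
```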
Filter Initialization: First of all, we embed each n-gram into a vector representation using BERT. Since the number of filters in a CNN is insufficient to account for all selected n-grams, we only extract cluster vectors to represent the generalized cause/effect features: we perform k-means clustering and define each cluster vector as the centroid of its cluster. Considering that non-causal n-grams also exist in a sentence, we purposely leave some weights blank; that is, we do not fully fill the convolution filters with centroid vectors, and the remaining weights are randomly initialized. The convolution output for window size n is

c_i^n = f(W · h_{i:i+n−1} + b),   (4)

where W is the weight of the initialized filter, b is a bias term, and f is a nonlinear function such as sigmoid or ReLU. As a single cluster is often not enough to describe the overall information, we cluster at different length scales. To capture the features of causal events at different scales, we use parallel convolution operations with varying windows; for example, window sizes n_1, n_2, n_3 are chosen when there are three convolution layers. When the length equals 4, the model tends to extract events such as "股价下跌" (share prices fell) and "销量减少" (sales reduced) in the Financial dataset. The convolution output is computed as the concatenation of the sub-spaces:

c_i = [c_i^{n_1} ⊕ c_i^{n_2} ⊕ c_i^{n_3}].   (5)
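A minimal sketch of this initialization step is given below, under the assumption that k-means centroids of n-gram embeddings overwrite a subset of Conv1d filters while the remaining filters keep their random initialization; the number of seeded filters and whether they stay frozen during training are assumptions, since the paper only states that part of the filter weights is fixed.

```python
# Sketch: seed part of a Conv1d's filters with k-means centroids of n-gram
# embeddings (the Eq. 4/5 setting). How many filters are fixed, and whether
# they stay frozen, is an assumption.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def infuse_filters(conv, ngram_vecs, n_fixed):
    """conv: nn.Conv1d(hidden, n_filters, window); ngram_vecs: (N, hidden)."""
    n_filters, hidden, window = conv.weight.shape
    centroids = KMeans(n_clusters=n_fixed, n_init=10).fit(ngram_vecs).cluster_centers_
    with torch.no_grad():
        for k, c in enumerate(centroids):
            # Tile the centroid across the window positions of filter k;
            # the remaining (n_filters - n_fixed) filters keep random init.
            conv.weight[k] = torch.tensor(c, dtype=torch.float32)[:, None].repeat(1, window)
    return conv

# Toy usage with random "n-gram embeddings" standing in for BERT vectors.
conv = nn.Conv1d(768, 100, kernel_size=2, padding="same")
vecs = np.random.randn(500, 768).astype("float32")
infuse_filters(conv, vecs, n_fixed=40)
```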
Query-Key Attention: Query-key attention is an attention mechanism that computes the representation of a sequence from the sequence itself. It has been successfully applied in many NLP tasks, such as machine translation and language understanding. In our model, the query-key attention mechanism is employed to mine the inter-associations between cause and effect events. After the multi-scale convolution operation, we obtain a feature map for each token. Instead of max-pooling, we use multi-head attention to check whether a cause-effect relation exists. Following Vaswani et al. [21], we split the encoded representations of the sequence into homogeneous sub-vectors, called heads. The input consists of queries Q ∈ R^{t×d}, keys K ∈ R^{t×d}, and values V ∈ R^{t×d}, where d is the dimension size:

Attention(Q, K, V) = softmax(QK^T / √d) V,   (6)

H_i = Attention(QW_i^Q, KW_i^K, VW_i^V),   (7)

H_head = [H_1 ⊕ H_2 ⊕ ··· ⊕ H_h] W,   (8)

where h is the number of heads and the query, key, and value matrices of each head have dimension d/h. We perform the attention in parallel and concatenate the output values of the h heads; the parameter matrices of the i-th linear projections are W_i^Q ∈ R^{n×(d/h)}, W_i^K ∈ R^{n×(d/h)}, and W_i^V ∈ R^{n×(d/h)}. The outputs of the CNN and the attention structure are then concatenated to yield token-wise contextual representations.

BiLSTM and CRF Layers: Long short-term memory (LSTM) is a particular recurrent neural network (RNN) that overcomes the vanishing and exploding gradient problems of traditional RNNs. Through its specially designed gate structure, the model can selectively keep context information. Considering that text has obvious temporal characteristics, we use a BiLSTM to model the temporal dependencies of tokens. A conditional random field (CRF) then obtains the globally optimal tag chain for the input sequence, taking the correlation between neighbouring tags into consideration. For the sentence S = {s_1, s_2, ..., s_n} with a tag sequence y = {y_1, y_2, ..., y_n}, the CRF scores the output as

score(S, y) = Σ_{i=1}^{n+1} A_{y_{i−1}, y_i} + Σ_{i=1}^{n} P_{i, y_i},   (9)

where A is the transition score matrix and P contains the per-token emission scores.
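The following numpy sketch computes the path score of Eq. (9); the handling of the implicit start and stop transitions (the i = 1 and i = n + 1 terms) via dedicated start/stop tags is an assumption.

```python
# Sketch of the CRF path score in Eq. (9): transition scores A plus
# per-token emission scores P. The start/end transitions implied by the
# i = 1 .. n+1 sum are handled with dedicated start/stop tags (assumption).
import numpy as np

def crf_path_score(P, y, A, start_tag, stop_tag):
    """P: (n, n_tags) emissions, y: length-n tag path, A: (n_tags, n_tags)."""
    path = [start_tag] + list(y) + [stop_tag]
    trans = sum(A[path[i], path[i + 1]] for i in range(len(path) - 1))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit

# Toy usage: 3 tokens, tag set {O, B-C, I-C, B-E, I-E, <s>, </s>}.
rng = np.random.default_rng(0)
P, A = rng.normal(size=(3, 7)), rng.normal(size=(7, 7))
print(crf_path_score(P, [1, 2, 0], A, start_tag=5, stop_tag=6))
```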
Training Objective: The goal of training is to minimize the following loss:

E = log Σ_{ỹ∈Y} exp score(S, ỹ) − score(S, y),   (10)

where Y is the set of all possible tagging sequences for an input sentence.

Experiments

Datasets: We conduct our experiments on three datasets. The first is the Chinese Emergency Corpus (CEC, https://github.com/shijiebei2009/CEC-Corpus), a publicly available event ontology corpus covering six event categories: outbreak, earthquake, fire, traffic accident, terrorist attack, and food poisoning. We extract 1,026 sentences mentioning events and causality from this corpus. Since few datasets are publicly available, for a fair comparison we also build an in-house dataset called "Financial" from Chinese web encyclopedias such as Jinrongjie and Hexun; it contains a large number of financial articles, including cause-effect mentions, and is divided into a training set (1,900 instances), a validation set (200 instances), and a test set (170 instances). Because few English datasets are publicly available for event causality extraction, we re-annotated the SemEval-2010 Task 8 dataset [7] and evaluate our model on it, obtaining 1,003 causality instances.

Table 1: Statistical details of the datasets, including training, development, and test sets.

Statistics | CEC | Financial | SemEval2010
Average sentence length | 31.14 | 57.94 | 18.54
Mean distance between causal events | 10.24 | 13.49 | 5.33
Mode of cause event length (proportion) | 2 (41%) | 4 (31%) | 1 (85%)
Mode of effect event length (proportion) | 4 (27%) | 4 (23%) | 1 (91%)
Average cause event length | 4.03 | 6.06 | 0.96
Average effect event length | 5.41 | 6.46 | 0.98

Table 2: Average F1-scores (%) of joint event causality extraction using different models. Boldface indicates scores better than the baseline system. The lengths of high-frequency cause/effect events are 2, 4, and 1 for the CEC, Financial, and SemEval2010 datasets, respectively.
Model | CEC | Financial | SemEval2010
IDCNN+CRF [20] | 68.26 | 71.81 | 68.59
BiLSTM+CRF [8] | 68.74 | 74.75 | 73.20
CNN+BiLSTM+CRF [15] | 71.68 | 74.31 | 74.20
CSNN [10] | 70.61 | 74.59 | 73.71
BERT+CSNN (baseline) | 74.61 | 76.23 | 75.69
CISAN | 72.49 | 75.99 | 74.20
BERT+CISAN (unigram) | | |
BERT+CISAN (bigram) | | |
BERT+CISAN (trigram) | 74.45 | |
BERT+CISAN (quagram) | | |
Experimental Settings: We use the pre-trained uncased BERT model and set the model hyper-parameters according to [4]. For all datasets, we set the maximum sentence length to 100, the batch size to 8, the number of training epochs to 100, and the learning rate for Adam to 1 × −. To prevent over-fitting, we set the dropout rate to 0.5. It is worth mentioning that the n-gram lengths are assigned according to our preliminary statistics. For n-gram clustering, we set the number of clusters equal to the dimension of the convolution filters; both are 100.
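For reference, the stated hyper-parameters can be collected into a single configuration sketch; the Adam learning-rate exponent is not recoverable from the text, so it is left unset rather than guessed.

```python
# Hyper-parameters as stated in the paper, gathered into one place.
# The Adam learning-rate exponent is unreadable in the source and is left as
# None rather than guessed.
CONFIG = {
    "max_sentence_length": 100,
    "batch_size": 8,
    "adam_learning_rate": None,   # "1 x 10^-?" in the paper; exponent unknown
    "epochs": 100,
    "dropout": 0.5,
    "n_clusters": 100,            # equal to the convolution filter dimension
    "conv_filter_dim": 100,
}
```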
We compare our model with previous models and conduct additional ablation experiments to verify its effectiveness. Table 2 reports the results on the CEC, Financial, and SemEval2010 datasets. Following previous work, we use the standard F1-score as the evaluation metric.
IDCNN+CRF [20]: A modified CNN model using iterated dilated convolutional neural networks, which permits fixed-depth convolutions to run in parallel across documents for speed while retaining accuracy comparable to BiLSTM-CRF.
BiLSTM+CRF : This is a classic sequential labeling model [8], which mod-els context information using Bi-LSTM and learns sequential tags using CRF.
CNN+BiLSTM+CRF : This model [2] uses CNN to enhance BiLSTM+CRFto capture local n-gram dependencies.
CSNN: This variation [10] modifies the basic BiLSTM+CRF architecture with an additional CNN layer to capture features inside n-grams and a self-attention mechanism to capture features across n-grams.
CISAN : Our model with convolutional semantic-infused filters and key-query attention, using GloVe word embeddings.
BERT+CSNN: BERT serves as the encoder for CSNN, replacing the GloVe embedding layer.

BERT+CISAN: Our model with convolutional semantic-infused filters and key-query attention, incorporated with the BERT encoder.

[Figure 3 plots the F1-score against the training epoch on the Financial test set for Our model, BERT+CSNN, CSNN, CNN+BiLSTM+CRF, and BiLSTM+CRF.]
Fig. 3: F1-score on the test set of Financial w.r.t. training epoch.

[Figure 4 plots F1-scores (roughly 70 to 80) on CEC, Financial, and SemEval for Our model, -BERT, -CI, -Attention, and the baseline.]
Fig. 4: Ablation analysis of our model on the three datasets.
To verify that BERT and the convolutional semantic infusion both contribute to extraction, we construct another strong baseline, BERT+CSNN. BERT+CISAN produces a decent F1 score that is 1.32%, 0.86%, and 1.96% higher than BERT+CSNN on CEC, Financial, and SemEval2010, respectively, showing the effectiveness of the n-gram knowledge combined with the pre-trained model.

As described above, the n-gram length serves as a determining hyper-parameter that reflects the frequency information of event length in our model. To examine this hypothesis, we assign different length values and investigate their influence on the results. Table 2 indicates that the best lengths of high-frequency cause/effect n-grams for the CEC, Financial, and SemEval2010 datasets are 2, 4, and 1, respectively. Intuitively, causal events in Financial are very likely to be described as phrases of length 4, such as "股价下跌" (share prices fell) and "销量减少" (sales reduced); we think the convolution is more sensitive to such events. When our method uses n-grams of length 4, we obtain the best results compared to other lengths, which demonstrates the effectiveness of the semantic infusion.

Comparison w.r.t. epoch.
We further compare the convergence speed of the models. As shown in Figure 3, the non-BERT models start with a low F1 score at the very beginning, whereas our model already reaches a promising score within the first 10 epochs. Moreover, compared with the strong BERT-based baseline, our model also obtains a clear advantage. This suggests that the prior knowledge helps the model surpass the others within a few training epochs.
Ablation experiments.
We verify whether each component has a positive effect by removing the BERT, CI (convolution infusion), and attention layers in turn. The results on the three datasets are shown in Figure 4. The contribution of the BERT layer is the most significant, because BERT is sufficiently pre-trained and enriched with semantic knowledge. Moreover, the model gains further improvements from the CI and attention layers: the CI layer, which incorporates knowledge related to cause/effect events, improves the F1 score, and the attention layer is also effective at removing noise. In conclusion, all these layers jointly contribute to obtaining state-of-the-art performance.
Conclusion

This paper introduces a novel n-gram-based neural solution for joint causal relation and event extraction. The model looks up high-frequency cause/effect n-grams in the sentence by partially encoding the causal features into convolution filters. In practice, we model the associations of n-gram pairs to find potential cause-effect pairs. As a result, our proposal achieves significant improvements over the baselines. This work shows the feasibility of further exploiting intra- and inter-event knowledge around n-grams to guide a neural extractor. In the future, a potential direction is to adapt our method to few-shot learning for coarse-to-fine causality extraction. We are also interested in extracting multiple cause-effect pairs rather than only one cause-effect event pair at a time.
References
1. Asghar, N.: Automatic extraction of causal relations from natural language texts: a comprehensive survey. arXiv preprint arXiv:1605.07895 (2016)
2. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 221–230 (2017)
3. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 167–176. Association for Computational Linguistics (Jul 2015)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Girju, R., Moldovan, D.I., et al.: Text mining for causal relations. In: FLAIRS Conference. pp. 360–364 (2002)
6. Grishman, R.: Domain modeling for language analysis. Tech. rep., New York University, NY (1988)
7. Hendrickx, I., Kim, S.N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., Szpakowicz, S.: SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the 5th International Workshop on Semantic Evaluation. pp. 33–38. Association for Computational Linguistics (2010)
8. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
9. Jian, F., Zong-Tian, L., Wei, L., Wen, Z.: Event causal relation extraction based on cascaded conditional random fields. Pattern Recognition and Artificial Intelligence (4), 567–573 (2011)
10. Jin, X., Wang, X., Luo, X., Huang, S., Gu, S.: Inter-sentence and implicit causality extraction from Chinese corpus. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 739–751. Springer (2020)
11. Kontos, J., Sidiropoulou, M.: On the acquisition of causal knowledge from scientific texts with attribute grammars. International Journal of Applied Expert Systems (1), 31–48 (1991)
12. Li, P., Mao, K.: Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications, 512–523 (2019)
13. Li, S., Zhao, Z., Liu, T., Hu, R., Du, X.: Initializing convolutional filters with semantic features for text classification. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1884–1889 (2017)
14. Lin, Y., Shen, S., Liu, Z., Luan, H., Sun, M.: Neural relation extraction with selective attention over instances. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2124–2133 (2016)
15. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016)
16. Martínez, E., Shwartz, V., Gurevych, I., Dagan, I.: Neural disambiguation of causal lexical markers based on context. In: IWCS 2017, 12th International Conference on Computational Semantics, Short papers (2017)
17. Riaz, M., Girju, R.: Another look at causality: Discovering scenario-specific contingency relationships with no supervision. In: 2010 IEEE Fourth International Conference on Semantic Computing. pp. 361–368. IEEE (2010)
18. Shen, Y., Huang, X.J.: Attention-based convolutional neural network for semantic relation extraction. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 2526–2536 (2016)
19. Sorgente, A., Vettigli, G., Mele, F.: Automatic extraction of cause-effect relations in natural language text. DART@AI*IA, 37–48 (2013)
20. Strubell, E., Verga, P., Belanger, D., McCallum, A.: Fast and accurate entity recognition with iterated dilated convolutions. arXiv preprint arXiv:1702.02098 (2017)
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
22. Zhao, S., Liu, T., Zhao, S., Chen, Y., Nie, J.Y.: Event causality extraction based on connectives analysis. Neurocomputing 173