Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition
Myeongjun Jang, Pilsung Kang

Abstract
Sentence embedding is an important research topic in natural language processing. It is essential to generate a good embedding vector that fully reflects the semantic meaning of a sentence in order to achieve an enhanced performance for various natural language processing tasks, such as machine translation and document classification. Thus far, various sentence embedding models have been proposed, and their feasibility has been demonstrated through good performances on tasks following embedding, such as sentiment analysis and sentence classification. However, because the performances of sentence classification and sentiment analysis can be enhanced by using a simple sentence representation method, it is not sufficient to claim that these models fully reflect the meanings of sentences based on good performances for such tasks. In this paper, inspired by human language recognition, we propose the following concept of semantic coherence, which should be satisfied for a good sentence embedding method: similar sentences should be located close to each other in the embedding space. Then, we propose the Paraphrase-Thought (P-thought) model to pursue semantic coherence as much as possible. Experimental results on two paraphrase identification datasets (MS COCO and STS benchmark) show that the P-thought models outperform the benchmarked sentence embedding methods.
Keywords: Sentence embedding, Recurrent neural network, Paraphrase, Semantic coherence, Natural language processing

School of Industrial Management Engineering, Korea University, Seoul, South Korea. Correspondence to: Myeongjun Jang
1. Introduction
Sentence embedding, which transforms sentences into low-dimensional vector values reflecting their meanings, is a highly important task in natural language processing (NLP). By mapping unstructured text data into a certain form of structured representation, the embedding vector can enhance the performances of various NLP tasks, such as machine translation (Artetxe et al., 2017; Lee et al., 2016; Zhao & Zhang, 2016), document classification (Conneau et al., 2017b; Zhou et al., 2016), and sentence matching (Wan et al., 2016). As sentence embedding plays an important role in NLP, various methods (Kiros et al., 2015; Pagliardini et al., 2017; Hill et al., 2016; Arora et al., 2017; Conneau et al., 2017a; Chen, 2017) have been proposed since the advent of the Doc2vec method (Le & Mikolov, 2014). Typically, these methods exhibit better performances than benchmarked embedding methods for common NLP tasks, such as document classification or sentiment analysis. However, this is not a direct evaluation of how well semantic meanings are preserved by the proposed embedding method.

Indirect methods for evaluating sentence embedding are not sufficient to evaluate the main property of sentence embedding techniques, i.e., how well semantic relationships between sentences are preserved. Iyyer et al. (2015) showed that it is possible to achieve a fairly good performance in document classification using a simple document representation vector, i.e., an average of word vectors in the document. Even for classic document representation methods, in which word sequences or semantic relationships between words are not considered, e.g., bag of words (BoW) or term frequency-inverse document frequency (TF-IDF), highly accurate classification results can be achieved using a Naïve Bayesian classifier (Soumya George & Joseph, 2014). This means that a good performance on a classification task can be achieved without the use of embedding vectors.
In other words, a good classification performance for common NLP tasks using a certain type of sentence embedding method does not guarantee that the embedding method can successfully preserve the semantic relationship between sentences. In this paper, in order to overcome the limitations of indirect sentence embedding evaluation strategies, we propose the following concept of semantic coherence, which should be satisfied by a good sentence embedding method: sentences having similar meanings should be placed close to each other in the embedding space. Then, we propose a new sentence embedding model named Paraphrase-Thought (P-thought), which can maximally pursue semantic coherence during training. The P-thought model is designed as a dual generation model, which receives a single sentence as input and generates both the input sentence and its paraphrase sentence simultaneously. The proposed P-thought model is evaluated through a task of measuring the semantic coherence and the STS Benchmark task. Experimental results show that the proposed P-thought model yields a better performance than benchmarked models in both tasks.

The remainder of this paper is organized as follows. In Section 2, we briefly review previous research on sentence embedding. In Section 3, we propose the concept of semantic coherence and a new metric: paraphrase coherence (P-coherence). In Section 4, we describe the structure of the P-thought model. In Section 5, experimental settings are described for each task, followed by results and discussions. In Section 6, we conclude the present work with some discussion of future research directions.
2. Related work
Recent work on sentence embedding ranges from simple extensions of the word embedding vector (Le & Mikolov, 2014; Arora et al., 2017; Pagliardini et al., 2017; Chen, 2017; Wieting et al., 2015) to neural network models specialized for handling a sequence of words appearing in a sentence (Kiros et al., 2015; Conneau et al., 2017a). The Distributed Bag of Words version of Paragraph Vector (PV-DBOW) and Distributed Memory Model of Paragraph Vectors (PV-DM) methods, which were proposed in Doc2vec (Le & Mikolov, 2014), learn sentence vectors based on the same principle: maximizing the probability of predicting words in the same sentence. Arora et al. (2017) proposed a model that computes the sentence embedding vector as a weighted average of word embedding vectors in a sentence. By re-weighting the weights of words in a sentence, the authors achieved an improved performance in textual similarity tasks, and outperformed a complex model based on a recurrent neural network (RNN). Unlike in Doc2vec (Le & Mikolov, 2014), in Doc2vecC (Chen, 2017), the sentence embedding vector is defined as a simple average of word embedding vectors. The idea behind Doc2vecC, i.e., using an average of word embedding vectors to represent the global context of the sentence, had already been proposed by Huang et al. (2012). In addition, Doc2vecC applies a corruption mechanism that randomly removes words from a sentence and generates a sentence embedding vector with the remaining words. This simple idea significantly reduced the total amount of training time. Similar to previous methods, Sent2vec (Pagliardini et al., 2017) defines the sentence vector as an average of word embedding vectors.
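The word-vector-averaging idea shared by these models can be sketched in a few lines. The toy vocabulary and two-dimensional vectors below are hypothetical placeholders, not values from any pretrained model:

```python
# A sentence vector as the plain average of its word vectors.
# The toy vocabulary and 2-d vectors below are hypothetical.
word_vectors = {
    "a":    [0.1, 0.3],
    "dog":  [0.8, 0.2],
    "runs": [0.4, 0.6],
}

def sentence_vector(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens; zero vector if none."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

sv = sentence_vector(["a", "dog", "runs"], word_vectors)
```

Methods such as SIF re-weight each word vector before averaging, and Doc2vecC averages only the words that survive its corruption step, but the basic composition is this simple mean.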
However, unlike other models using word embedding vectors of single words (i.e., uni-grams), it considers n-gram vectors in addition to uni-gram vectors when training the sentence embedding model.

The Skip-thought model (Kiros et al., 2015), which has a sequence-to-sequence (Seq2Seq) structure, is an extension of the Skip-gram model (Mikolov et al., 2013b), where the basic unit for network learning is a sentence instead of a word. Similar to the Skip-gram model, which learns word embedding vectors by training the network to predict the surrounding words when the center word is given, Skip-thought is trained to encode the input sentence and generate its preceding and following sentences. By using the generated sentence vectors as the input of a simple linear model, Skip-thought exhibited an improved performance for document classification and sentiment analysis. Inspired by previous results in computer vision, where many models are pretrained based on ImageNet (Deng et al., 2009), Conneau et al. (2017a) conducted research on whether supervised learning tasks are helpful for learning sentence embedding vectors. Through experiments, Conneau et al. (2017a) claimed that sentence embedding vectors generated from a model that is trained on a natural language inference (NLI) task yield a state-of-the-art performance when leveraged in other NLP tasks. In particular, they found that a model with a bi-directional long short-term memory (LSTM) structure and max pooling trained on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), named InferSent, exhibited the best performance.
3. Semantic coherence
Although two sentences may employ different words or different structures, people will recognize them as the same sentence as long as the implied semantic meanings are highly similar. Consider the following two sentences:

• Sentence 1: Jang was caught by professor Kang while playing the computer game in the lab.
• Sentence 2: Professor Kang came to the lab and witnessed Jang playing the computer game.

Although these two sentences exhibit a clear difference with respect to both the sentence structure and word usage, people can immediately perceive that they convey the same meaning. Hence, a good sentence embedding approach should satisfy the property that if two sentences have different structures but convey the same meaning (i.e., paraphrase sentences), then they should have the same, or at least similar, embedding vectors. Based on this, we define semantic coherence as follows.
Definition 1.
The degree of semantic coherence of a sentence embedding model is proportional to the similarity between the representation vectors of paraphrase sentences generated by the model.

If the representation vectors of paraphrase sentences are located close to each other in the embedding space, this implies that there is little difference between their vector values. Thus, when the representation vector value of a sentence is given, it should be possible to generate the given sentence and its paraphrase sentences. Consequently, we can derive the following hypothesis.
Hypothesis 1. If it is possible to generate an input sentence and its paraphrase sentence simultaneously from the vector value of the input sentence, then the sentence embedding model can enhance the semantic coherence.

In this study, we propose a new sentence embedding model to satisfy the above hypothesis.
To evaluate semantic coherence, we should measure the densities of paraphrase sentences. This requires multiple pairs of paraphrase sentences that share the same meaning. Thus, previous metrics that simply calculate the matching degree of two sentences are insufficient.

In this study, inspired by topic coherence, which is used to determine the optimal number of topics in topic modeling, we propose a new evaluation metric called paraphrase coherence (P-coherence) to measure the semantic coherence. Topic coherence measures how effectively the highly weighted top k words of a topic satisfy coherence (Newman et al., 2010); topic coherence is computed as follows:

Topic-coherence = \sum_{i<j} PMI(w_i, w_j),  PMI(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)},   (2)

where p(w_i, w_j) is the probability of words w_i and w_j appearing together in a randomly selected document, and p(w_i) and p(w_j) are the marginal probabilities that the words w_i and w_j appear in the randomly selected document, respectively.

Unlike in topic coherence, which defines the probability of two words appearing together based on a simple word count, we should consider the relationship between two sentences by leveraging their representation vector values. Hence, we replace the co-occurrence probability p(w_i, w_j) in topic coherence with the dot product of two sentence representation vectors, because the dot product of two vectors is widely used as an unnormalized probability in many studies (Karpathy et al., 2014; Karpathy & Fei-Fei, 2015). Next, we replace the marginal probability for word occurrence in topic coherence with the L2-norm of the sentence vector, derived from the dot product. As a result, the score between two sentences takes a value between 0 and 1: the higher the score value, the stronger is the relationship between the two sentences.
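The normalized dot product described above is cosine similarity, and P-coherence averages it over all sentence pairs in a paraphrase set. A minimal sketch, where the four two-dimensional vectors are hypothetical stand-ins for real sentence embeddings:

```python
import math
from itertools import combinations

def score(sv_i, sv_j):
    """Normalized dot product (cosine similarity) of two sentence vectors."""
    dot = sum(a * b for a, b in zip(sv_i, sv_j))
    norm_i = math.sqrt(sum(a * a for a in sv_i))
    norm_j = math.sqrt(sum(b * b for b in sv_j))
    return dot / (norm_i * norm_j)

def p_coherence(paraphrase_set):
    """Average pairwise score over all sentence pairs in one paraphrase set."""
    pairs = list(combinations(paraphrase_set, 2))
    return sum(score(a, b) for a, b in pairs) / len(pairs)

# Four hypothetical embeddings of paraphrases of the same sentence;
# four sentences give C(4, 2) = 6 pairs to average.
U = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [1.0, 0.1]]
```

Because the vectors in a paraphrase set point in nearly the same direction, `p_coherence(U)` is close to 1; unrelated vectors would pull the average down.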
The equation representing the proposed score is as follows:

Score(sv_i, sv_j) = \frac{sv_i \cdot sv_j}{\|sv_i\| \|sv_j\|},   (3)

where sv_i and sv_j are the representation vectors of the sentences i and j, respectively. Finally, P-coherence is defined as the average score of all pairs of paraphrase sentences:

P-coherence(U_k) = Average\left( \frac{sv_i \cdot sv_j}{\|sv_i\| \|sv_j\|} \right), sv_i, sv_j \in U_k, k = 1, ..., N,   (4)

where U_k is the k-th set of paraphrase sentences, and N is the number of paraphrase sets. For instance, if there are four paraphrase sentences for each paraphrase set, then the P-coherence for each paraphrase set is calculated as the average score of \binom{4}{2} = 6 sentence pairs. The total P-coherence is the average P-coherence over the paraphrase sets:

P-coherence_{Total} = \frac{1}{N} \sum_{k=1}^{N} P-coherence(U_k).   (5)

4. Paraphrase thought

Assume that a sentence tuple (s, p) is given, where p is the paraphrase sentence of the sentence s. Let x_t be the t-th word of the sentence s and y_t be the t-th word of the sentence p. To maximize the semantic coherence defined above, it should be possible to generate both the sentence itself and its paraphrase sentence from the representation vector of an input sentence. Therefore, the proposed P-thought model is designed as a dual generation model, which generates both s and p simultaneously when the sentence tuple (s, p) is given.

We employed a Seq2Seq structure with a gated recurrent unit (GRU) (Cho et al., 2014) cell for the P-thought model. The encoder transforms the sequence of words of an input sentence into a fixed-size representation vector, whereas the decoder generates the target sentence based on the given sentence representation vector. The proposed P-thought model has two decoders. When the input sentence is given, the first decoder, named auto-decoder, generates the input sentence as it is.
The second decoder, named paraphrase-decoder, generates the paraphrase sentence of the input sentence.

Similar to other sequence learning tasks in NLP, the purpose of the P-thought model is to minimize the negative log likelihoods of the two decoders. Furthermore, according to Hypothesis 1, the P-thought model should satisfy the condition that it can encode the sentence s and generate the sentences s and p simultaneously when the sentence pair (s, p) is given. This condition can be written as follows:

P(s) P(s|s; θ_ss) = P(s) P(p|s; θ_sp),   (6)

where P(s) is the marginal probability of the input sentence s, and θ_ss and θ_sp are the parameters of the auto-decoder and paraphrase-decoder, respectively. Thus, similarly to the work of Xia et al. (2017), the problem can be formulated as the following multi-objective optimization problem:

Objective 1: \min_{θ_ss} l_A(f(s; θ_ss), s),
Objective 2: \min_{θ_sp} l_P(f(s; θ_sp), p),
s.t. P(s) P(s|s; θ_ss) = P(s) P(p|s; θ_sp),   (7)

where l_A and l_P are the negative log likelihoods of the auto-decoder and paraphrase-decoder, respectively. In this case, the constraint term can be rewritten as follows:

−\log P(s|s; θ_ss) = −\log P(p|s; θ_sp).   (8)

The left and right terms of the transformed equation represent the negative log likelihoods of the auto-decoder and paraphrase-decoder, respectively. Hence, the constraint term can be written as follows:

l_A(f(s; θ_ss), s) = l_P(f(s; θ_sp), p).   (9)

By introducing the Lagrange multiplier, the multi-objective optimization problem is transformed into the following minimization problem:

\min L = l_A(f(s; θ_ss), s) + l_P(f(s; θ_sp), p) − λ (l_A(f(s; θ_ss), s) − l_P(f(s; θ_sp), p))
       = (1 − λ) l_A(f(s; θ_ss), s) + (1 + λ) l_P(f(s; θ_sp), p), where λ ≠ 0.   (10)

In this case, a value of λ > 1 or λ < −1 leads to maximizing the negative log likelihood of the auto-decoder or that of the paraphrase-decoder, respectively. To avoid this problem, the allowable range for λ is set to −1 < λ < 0 or 0 < λ < 1. Under this condition, minimizing L is equivalent to minimizing L′:

\min L′ = l_A(f(s; θ_ss), s) + α l_P(f(s; θ_sp), p), where α = \frac{1 + λ}{1 − λ},   (11)

where α is the hyperparameter of the P-thought model. It should be greater than 1 or take a value between 0 and 1, because 0 < λ < 1 or −1 < λ < 0, respectively. However, it is desirable to set the α-value to greater than 1, considering that auto decoding is a trivial copying task, which is much easier than paraphrase generation. Experimental results also demonstrated that the performance is degraded for small α values. Thus, the objective of the P-thought model is the sum of the negative log likelihood of the auto-decoder with that of the paraphrase-decoder with a higher weight:

Loss = l_auto(f(s; θ_ss), s) + α l_para(f(s; θ_sp), p), where α > 1.   (12)

Figure 1. Three encoder structures of the P-thought model: (a) one-layer Bi-RNN encoder, (b) two-layer forward RNN, (c) two-layer Bi-RNN.

The number of unique words appearing in our training dataset is only about 35,000, which is considerably fewer than the number of words in the English language. This may be problematic, in that many words are treated as out of vocabulary after model training.
To solve this problem, motivated by the idea of cross-lingual embedding (Mikolov et al., 2013a), Skip-thought attempts to learn a matrix that maps the words of a pretrained word2vec model (Mikolov et al., 2013b) to one of the 20,000 words in their training dataset. However, this approach suffers from the problem that a word can be mapped to another word whose actual meaning is significantly different, only because it has a high similarity with the original word in the embedding space. For example, the word 'endogenous' was mapped to the word 'neuronal,' despite the semantic differences.

We extracted the vector values of words that appear in our training dataset from the pretrained GloVe vectors (Pennington et al., 2014) to resolve the problem described above. In the pretrained GloVe vectors, the semantic relationships between words are reflected in the geometrical structures between word vectors. Therefore, even when vector values of words that are unused during training occur, the information loss can be reduced because the geometric relationships between word vectors are well preserved if the model is sufficiently trained. By using this method, we are able to handle 2.1 million words without the effort of training an extra mapping matrix.

Figure 2. P-thought model structure

Table 1. Training data description

                               2014-Validation  2017-Training  Total
No. of unique images           40,504           118,284        123,287
No. of unique captions         202,654          591,753        593,968
No. of unique sentence pairs   811,426          2,368,926      2,467,293
No. of unique words            -                -              34,826

5. Experiments

We used the captions of the MS-COCO dataset (Lin et al., 2014) to train the P-thought model. This dataset has been employed in various paraphrase generation studies (Prakash et al., 2016; Gupta et al., 2017). The MS-COCO dataset has more than five captions for each image, which allows us to generate more than _5P_2 = 20 unique sentence pairs per image.
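The pair construction from captions can be sketched with ordered permutations: with five captions per image, every ordered pairing of two distinct captions yields one (s, p) training tuple. The caption strings below are hypothetical placeholders:

```python
from itertools import permutations

# Each MS-COCO image comes with (at least) five captions describing the
# same scene; every ordered (input, paraphrase) pairing of distinct
# captions gives one training tuple (s, p).
captions = ["cap_1", "cap_2", "cap_3", "cap_4", "cap_5"]  # hypothetical
pairs = list(permutations(captions, 2))  # 5P2 = 5 * 4 = 20 ordered pairs
```

Ordered pairs are appropriate here because each tuple distinguishes the encoded input sentence s from its generation target p.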
For training, we used the 2014-Validation and 2017-Training datasets. Descriptions of these datasets are provided in Table 1. Simple tokenizing was performed as text preprocessing for the captions.

We employed three different encoder structures, as shown in Figure 1, to investigate the model performances according to different levels of model complexity. The first encoder structure has one layer with a bi-directional RNN (Bi-RNN). The sentence embedding vector of this encoder structure consists of the concatenated values of the final state values of the forward and backward RNNs. The second encoder structure contains two layers, with only a forward RNN. The sentence embedding vector is generated by concatenating the final states of both layers. The third encoder structure contains two layers of Bi-RNN. The sentence embedding vector is generated from the concatenated values of the final states of the second layer's forward and backward RNNs. The overall structure of the P-thought model, including the decoder part, is illustrated in Figure 2. These three models were trained under the same conditions. The number of hidden units is set to 1,200, which results in 2,400-dimensional sentence embedding vectors after concatenation.

Table 2.

Table 3. Experimental results for evaluating the P-coherence

Model                                          P-coherence
PV-DBOW (Le & Mikolov, 2014)                   0.0099
Uni-skip (Kiros et al., 2015)                  0.5328
Bi-skip (Kiros et al., 2015)                   0.5155
Combine-skip (Kiros et al., 2015)              0.5209
SIF (Arora et al., 2017)                       0.4205
Sent2vec Wiki-uni (Pagliardini et al., 2017)   0.4279
Sent2vec Wiki-bi (Pagliardini et al., 2017)    0.4553
InferSent (Conneau et al., 2017a)              0.7454
P-thought (one layer-Bi RNN)                   0.7432
P-thought (two layers-Forward RNN)
P-thought (two layers-Bi RNN)
We employed Xavier initialization (Glorot & Bengio, 2010), and gradient computations and weight updates were performed with a mini-batch size of 128. All models were trained for four epochs using the Adam optimizer (Kingma & Ba, 2014).

Table 4. Extracted sentences for visualization

Group 1 (▲)
1) Bedroom scene with a bookcase, blue comforter and window.
2) A bedroom with a bookshelf full of books.
3) This room has a bed with blue sheets and a large bookcase.
4) A bed and a mirror in a small room.
5) A bed room with a neatly made bed a window and a book shelf

Group 2 (♦)
1) A male tennis player in white shorts is playing tennis.
2) This woman has just returned a volley in tennis.
3) A man holding a tennis racket playing tennis.
4) The man balances on one leg after serving a tennis ball.
5) Someone playing in a tennis tournament with a crowd looking on.

Group 3 (■)
1) A woman holding a Hello Kitty phone on her hand.
2) A woman holds up her phone in front of her face.
3) A woman in white shirt holding up a cellphone.
4) A woman checking her cell phone with a hello kitty case.
5) The Asian girl is holding her Miss Kitty phone.
Group 4 (×)
1) A plate of food which includes onions, tomato, lettuce, sauce, fries, and a sandwich.
2) A sandwich, french fries, bowl of ketchup, onion slice, lettuce slice, tomato slice, and knife sit on the white plate.
3) Partially eaten hamburger on a plate with fries and condiments.
4) A grilled chicken sandwich sits beside french fries made with real potatoes.
5) A sandwich on a sesame seed bun next to a pile of french fries and a cup of ketchup

Group 5 (▼)
1) Decorated coffee cup and knife sitting on a patterned surface.
2) A large knife is sitting in front of a mug has a skull and crossbones.
3) A white mug showing pirate skull and bones and a large knife on a counter top.
4) There is a white coffee cup with a skull and bones on it next to a knife.
5) A close up of a knife and a cup on a surface

To measure the P-coherence, we used the 2017-Validation dataset from the MS-COCO caption dataset, which has no overlap with the training dataset. A description of the dataset used for evaluating the P-coherence is provided in Table 2. We selected PV-DBOW, Skip-thought, SIF, Sent2vec, and InferSent as benchmark models. In the case of PV-DBOW, we employed the datasets used for both training P-thought and evaluating the P-coherence to learn the sentence vectors. For the remaining models, we used the publicly available pretrained models.

Figure 3. Scatter plots of the five paraphrase sentence groups represented by each sentence embedding method: (a) Uni-skip, (b) SIF, (c) Sent2vec (Wiki-uni), (d) Sent2vec (Wiki-bi), (e) InferSent, (f) P-thought (One-Bi), (g) P-thought (Two-Forward), (h) P-thought (Two-Bi)

The experimental results are summarized in Table 3. It can be observed that the P-thought models with relatively complex encoder structures outperformed the other benchmarked models.
In the case of P-thought with a one-layer Bi-RNN, the P-coherence value is comparable to that of InferSent, and superior to the other benchmarked models. Among the benchmarked models, InferSent yielded a significantly higher P-coherence value than the other models, which implies that InferSent preserved the semantic coherence when learning the sentence representation vectors.

In addition to the quantitative evaluation provided in Table 3, we reduced the generated sentence vectors to two-dimensional vectors using t-SNE (Maaten & Hinton, 2008) and created scatter plots to qualitatively investigate how effectively the paraphrase sentences satisfied coherence. For the sake of visualization, we extracted the paraphrase sentences for five images and marked them with different colors and shapes. The extracted paraphrase sentences are presented in Table 4, and the scatter plots are given in Figure 3. It can easily be observed that paraphrase sentence vectors learned by the models with high P-coherence values (P-thought and InferSent) are more concentrated than those of the other models.

We also carried out the STS Benchmark task (Cer et al., 2017) to evaluate how well the models preserve the meanings of sentences through a more generally conducted task. The dataset for this task consists of sentence pairs with human-rated similarity scores; a description is provided in Table 5.

Table 5. STS Benchmark task dataset description (Train / Dev / Test / Total)

Given the representation vectors u and v of a sentence pair, the component-wise product u · v and the absolute difference |u − v| are computed and concatenated to be used as an input. As the target, the human-rated similarity score y is transformed as follows. Let r^T = [1, ..., 5] denote a vector that takes integer values between 1 and 5. The target y is transformed to the distribution d using the equation below:

d_i = \begin{cases} y − \lfloor y \rfloor, & \text{if } i = \lfloor y \rfloor + 1, \\ \lfloor y \rfloor + 1 − y, & \text{if } i = \lfloor y \rfloor, \\ 0, & \text{otherwise}. \end{cases}   (13)

Finally, we trained a logistic regression model that predicts the transformed target d from the sentence pair representations of the training dataset.
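The target transformation of Eq. (13) can be sketched as a small function that spreads a score y ∈ [1, 5] over the two adjacent integer bins; this is a sketch of the standard similarity-to-distribution encoding, not the authors' code:

```python
import math

def to_distribution(y, k=5):
    """Spread a similarity score y in [1, k] over k bins for r = 1..k."""
    d = [0.0] * k
    floor_y = math.floor(y)
    if floor_y >= k:                  # y == k sits entirely in the top bin
        d[k - 1] = 1.0
        return d
    d[floor_y] = y - floor_y          # bin for r = floor(y) + 1
    d[floor_y - 1] = floor_y + 1 - y  # bin for r = floor(y)
    return d

# Example: y = 3.4 puts mass 0.6 on r = 3 and 0.4 on r = 4.
```

By construction d sums to 1, so the expected value r · d recovers the original score y.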
The results for the STS Benchmark test dataset are summarized in Table 6. Figure 4 presents a scatter plot of the results for the proposed models and the target y.

Table 6. Experimental results for the STS Benchmark task

Model                                                Pearson correlation
PV-DBOW (Le & Mikolov, 2014) (Lau & Baldwin, 2016)   0.649
Skip-thought (Kiros et al., 2015)                    0.721
SIF (Arora et al., 2017)                             0.720
Sent2vec (Pagliardini et al., 2017)                  0.755
InferSent (Conneau et al., 2017a)                    0.758
P-thought (one layer-Bi RNN)
P-thought (two layers-Forward RNN)
P-thought (two layers-Bi RNN)

The experimental results show that the P-thought models of all three levels outperformed the benchmarked models. An interesting observation is that the performances of the P-thought models for the STS Benchmark task are inversely proportional to the model complexity: the simplest model (one-layer Bi-RNN) yielded the highest correlation value, while the most complex model (two-layer Bi-RNN) resulted in the lowest correlation value among the three P-thought models. This observation is exactly the opposite of the result for the MS-COCO dataset, where the more complex the P-thought model, the higher is the P-coherence score. One possible reason for this reversed performance is that the MS-COCO caption dataset used for the model training only contains around 600,000 sentences, which is far fewer than the training datasets for general sequence learning tasks in the NLP field. Hence, a more complex structure is more likely to overfit the training dataset. This problem can be alleviated by obtaining more paraphrase sentence pairs.

6. Conclusion

Sentence embedding is one of the most important text processing techniques in NLP. To date, various sentence embedding models have been proposed and have yielded good performances in document classification and sentiment analysis tasks.
However, the fundamental ability of sentence embedding methods, i.e., how effectively the meanings of the original sentences are preserved in the embedded vectors, cannot be fully evaluated through such indirect methods.

Figure 4. Scatter plots for the STS Benchmark task results: (a) one-layer Bi-RNN, (b) two-layer forward RNN, (c) two-layer Bi-RNN

In this study, under the proposition that a good sentence embedding method should act similarly to human language recognition, we suggested the concept of semantic coherence and proposed a model named P-thought that aims to maximize the semantic coherence by designing the model to have a dual generation structure. The proposed model was evaluated based on the MS-COCO caption and STS Benchmark datasets. Experimental results showed that the P-thought models yielded better performances than the benchmarked models for both tasks. Based on the scatter plots in the two-dimensional space reduced by t-SNE, it can clearly be observed that the paraphrase sentences are more concentrated for the P-thought models than those using other sentence embedding methods.

The main limitation of the current work is that there are insufficient paraphrase sentences for training the models. P-thought models with more complex encoder structures tend to overfit the MS-COCO datasets. Although this problem can be resolved by acquiring more paraphrase sentences, it is not easy in practice to obtain a large number of paraphrase sentences. Therefore, similar to the approaches that have achieved good performances in machine translation by employing semi-supervised learning or unsupervised learning (Cheng et al., 2016; Artetxe et al., 2017; Lample et al., 2017), an approach to improve the performances of the proposed models using only minimal paraphrase data should be developed.

References

Arora, Sanjeev, Liang, Yingyu, and Ma, Tengyu. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations, 2017.
Artetxe, Mikel, Labaka, Gorka, Agirre, Eneko, and Cho, Kyunghyun. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

Bowman, Samuel R, Angeli, Gabor, Potts, Christopher, and Manning, Christopher D. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.

Cer, Daniel, Diab, Mona, Agirre, Eneko, Lopez-Gazpio, Inigo, and Specia, Lucia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

Chen, Minmin. Efficient vector representation for documents through corruption. arXiv preprint arXiv:1707.02377, 2017.

Cheng, Yong, Xu, Wei, He, Zhongjun, He, Wei, Wu, Hua, Sun, Maosong, and Liu, Yang. Semi-supervised learning for neural machine translation. arXiv preprint arXiv:1606.04596, 2016.

Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Conneau, Alexis, Kiela, Douwe, Schwenk, Holger, Barrault, Loic, and Bordes, Antoine. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017a.

Conneau, Alexis, Schwenk, Holger, Barrault, Loïc, and Lecun, Yann. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1107–1116, 2017b.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Glorot, Xavier and Bengio, Yoshua.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Gupta, Ankush, Agarwal, Arvind, Singh, Prawaan, and Rai, Piyush. A deep generative framework for paraphrase generation. arXiv preprint arXiv:1709.05074, 2017.

Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.

Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Association for Computational Linguistics, 2012.

Iyyer, Mohit, Manjunatha, Varun, Boyd-Graber, Jordan, and Daumé III, Hal. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 1681–1691, 2015.

Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.

Karpathy, Andrej, Joulin, Armand, and Fei-Fei, Li F. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pp. 1889–1897, 2014.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R, Zemel, Richard, Urtasun, Raquel, Torralba, Antonio, and Fidler, Sanja. Skip-thought vectors. In Advances in neural information processing systems, pp.
3294–3302, 2015.

Lample, Guillaume, Denoyer, Ludovic, and Ranzato, Marc'Aurelio. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

Lau, Jey Han and Baldwin, Timothy. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.

Le, Quoc and Mikolov, Tomas. Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196, 2014.

Lee, Jason, Cho, Kyunghyun, and Hofmann, Thomas. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017, 2016.

Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C Lawrence. Microsoft COCO: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.

Maaten, Laurens van der and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.

Mikolov, Tomas, Le, Quoc V, and Sutskever, Ilya. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013b.

Newman, David, Lau, Jey Han, Grieser, Karl, and Baldwin, Timothy. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108. Association for Computational Linguistics, 2010.

Pagliardini, Matteo, Gupta, Prakhar, and Jaggi, Martin. Unsupervised learning of sentence embeddings using compositional n-gram features.
arXiv preprint arXiv:1703.02507, 2017.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.

Prakash, Aaditya, Hasan, Sadid A, Lee, Kathy, Datla, Vivek, Qadir, Ashequl, Liu, Joey, and Farri, Oladimeji. Neural paraphrase generation with stacked residual LSTM networks. arXiv preprint arXiv:1610.03098, 2016.

Röder, Michael, Both, Andreas, and Hinneburg, Alexander. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining, pp. 399–408. ACM, 2015.

Soumya George, K and Joseph, Shibily. Text classification by augmenting bag of words (BOW) representation with co-occurrence feature. IOSR Journal of Computer Engineering (IOSR-JCE), 16:34–38, 2014.

Wan, Shengxian, Lan, Yanyan, Guo, Jiafeng, Xu, Jun, Pang, Liang, and Cheng, Xueqi. A deep architecture for semantic matching with multiple positional sentence representations. In AAAI, volume 16, pp. 2835–2841, 2016.

Wieting, John, Bansal, Mohit, Gimpel, Kevin, and Livescu, Karen. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.

Xia, Yingce, Qin, Tao, Chen, Wei, Bian, Jiang, Yu, Nenghai, and Liu, Tie-Yan. Dual supervised learning. arXiv preprint arXiv:1707.00415, 2017.

Zhao, Shenjian and Zhang, Zhihua. Deep character-level neural machine translation by learning morphology. 2016.

Zhou, Peng, Qi, Zhenyu, Zheng, Suncong, Xu, Jiaming, Bao, Hongyun, and Xu, Bo. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.