Product Title Generation for Conversational Systems using BERT
Mansi Ranjit Mane, Shashank Kedia, Aditya Mantha, Stephen Guo, Kannan Achan
Walmart Labs, Sunnyvale, CA, USA
{mansi.mane, shashank.kedia, aditya.mantha, sguo, kachan}@walmartlabs.com

Abstract
Through recent advancements in speech technology and the introduction of smart devices, such as Amazon Alexa and Google Home, increasing numbers of users are interacting with applications through voice. E-commerce companies typically display short product titles on their webpages, either human-curated or algorithmically generated, when brevity is required, but these titles are dissimilar from natural spoken language. For example, "Lucky Charms Gluten Free Breakfast Cereal, 20.5 oz" is acceptable to display on a webpage, but "a 20.5 ounce box of lucky charms gluten free cereal" is easier to comprehend over a conversational system. As compared to display devices, where images and detailed product information can be presented to users, short titles for products are necessary when interfacing with voice assistants. We propose a sequence-to-sequence approach using BERT to generate short, natural, spoken language titles from input web titles. Our extensive experiments on a real-world industry dataset and human evaluation of model outputs demonstrate that BERT summarization outperforms comparable baseline models.
Introduction

Smartphones and voice-activated smart speakers, such as Amazon Alexa, Google Home, and Apple Siri, have led to increased adoption of voice-enabled shopping experiences. In such experiences, reducing user friction and saving time is key, especially for low-consideration purchases and repeat grocery purchases. Display-based experiences typically utilize a short product title when presenting a product, but these short titles do not naturally fit in a typical conversational flow. For example, a display-based experience showing "Jergens Natural Glow Daily Moisturizer, Medium to Tan, 7.5 oz" as a short title may be acceptable, but such a title is not suitable for voice-based applications because it is not a naturally spoken title. At the same time, display-based experiences have the added benefit of being able to show a large amount of additional metadata, which is not possible in conversational systems. The product title for a conversational system needs to encapsulate the important information in a succinct, grammatically correct natural language sentence, such that it fits naturally into the conversational flow during the dialogue rounds between the user and the conversational system. E-commerce companies can have millions to billions of products in their ever-changing product catalogs, so it is typically prohibitively expensive to manually annotate naturally spoken titles for all products. In such industry settings, it is ideal to have a model that can generate naturally spoken titles for an evolving catalog. The primary goal of this work is to examine the methods and challenges involved in converting short product titles into naturally spoken language that is grammatically correct.

This problem is similar to the text summarization task, which is well studied in natural language processing (NLP). However, applying it to an e-commerce application is not a trivial task. Text summarization approaches can be classified into two broad sub-categories: extractive text summarization and abstractive text summarization. Extractive approaches usually try to extract a few sentences (or keywords) from lengthy documents (Dorr et al., 2003; Neto et al., 2002; Nallapati et al., 2016b). To generate natural language titles from web titles, the model should be able to generate conjunctions, articles, etc. at the appropriate positions. Abstractive text summarization attempts to understand the content of the document and produce summaries which may contain novel words or phrases. Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) based sequence-to-sequence (seq2seq) models (Sutskever et al., 2014) and recently developed attention models (Vaswani et al., 2017) have been shown to perform well on abstractive summarization tasks (Nallapati et al., 2016a). However, these models tend to generate repetitive words, which can cause a negative customer experience in voice e-commerce applications. For traditional text summarization tasks we can introduce novel words which are not part of the input sequence, but in an e-commerce setting, paraphrasing of factual details such as brand or quantity may imply a different product. In addition, new products are added to e-commerce product catalogs continuously, which can introduce out-of-vocabulary words; the summarization model should be able to generalize to these words.

In this paper, we investigate the application of text summarization techniques to voice-based e-commerce applications.
Our major contributions are:
1. We adapt different state-of-the-art NLP models to a real-world e-commerce dataset with limited labels.
2. We perform extensive evaluation of these models on established evaluation metrics, as well as metrics relevant to our application. We also perform human judgement evaluation.

The following sections provide a summary of related work, followed by a description of the methods applied to convert web-based short product titles (sequences of words in English) into more naturally spoken summary titles (sequences of words in English) for voice-based applications. In our problem setting, we are more interested in building an abstractive text summarization model that can generate novel words in the decoded summary. Section 4 provides the salient features of the dataset and implementation details of the methods described earlier. This is followed by a discussion of the observed results and a conclusion.

Related Work

Text summarization is a long-studied problem in natural language processing. With the advent of deep learning based approaches, seq2seq models have proven highly successful in abstractive text summarization. Some of these models and relevant developments in the field are mentioned below.

Ptr-Net (See et al., 2017) builds on the seq2seq model with attention for the summarization task. It uses the concept of the pointer network introduced in Vinyals et al. (2015) to decide which words from the main text should be copied directly to the summary. This helps to preserve important factual information from the input text and also assists in handling out-of-vocabulary words. The Ptr-Net model also adds a coverage loss, which examines the difference between the attention distributions of previously generated words and the current attention, in an attempt to fix the issue of word repetition, a persistent problem in seq2seq models. Gehrmann et al. (2018) try to improve the fluency of the generated text through various constraints applied during model training. Soft constraints on the size of the text are used to constrain the length of generated descriptions, while constraints on the output probability distribution of words ameliorate word repetition.

Developments in language models have subsequently led to increased use of pre-trained models such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019), which are trained on huge text corpora and are used to generate the word embeddings for input texts. While Khandelwal et al. (2019) examine the feasibility of pre-trained language models in a low-data setting and move away from a seq2seq framework, Liu and Lapata (2019) use BERT in a seq2seq model to summarize data. Details of this model are discussed further in Section 3. In our study, this is the primary model adapted for our use case.

Text summarization finds natural application in e-commerce, where products may have a long description but only the salient features of a particular product are what the end user is interested in. Increased interaction with mobile devices and voice-based interfaces such as Amazon Alexa and Google Home presents new challenges, as the product title now needs to be succinct. There has been development in rule-based methods, as in Camargo de Souza et al. (2018). Deep learning based methods find natural application here and attempt to use multi-modal information (images and text) to generate product titles. Chen et al. (2019) generate personalized product titles utilizing user personas and an external knowledge base.
Mathur et al. (2018) generate titles in different languages for the same product. Sun et al. (2018) develop further on the work of Ptr-Net, using a separate encoder network for important attributes such as quantity and brand, and then using two pointers to decide where to copy data from. Zhang et al. (2019) attempt a novel method of generating short descriptions for the e-commerce use case using multimodal information. A word sequence representing the description and the image of a product in the catalog are chosen as descriptors for the product and are used to generate short titles for the product. An adversarial network based approach is then used to decide whether a generated title is machine or human generated, to improve the quality of generated titles and make them more human-like.
Methodology

In this section, we formulate the problem of automatic natural language title generation and discuss various approaches that can be used to solve it.
Problem Definition:
The goal of this task is to build a system that can automatically generate natural language product titles which are easily interpretable in a voice-enabled shopping experience. Given a short web title w represented as a sequence of words w = {w_1, w_2, ..., w_n}, the goal of the system is to generate the corresponding natural language title y = {y_1, y_2, ..., y_m}. In the following sub-sections, we discuss the models which we apply to the automatic natural language title generation task.

Seq2seq with Attention: Consider a sequence of input tokens w_i fed into an encoder (LSTM) producing a sequence of encoder hidden states h_i. The decoder receives the word embeddings of the previous words and has a decoder state s_t at time step t. The attention distribution (Bahdanau et al., 2014) a^t is computed as:

e_{ti} = v^T \tanh(W_h h_i + W_s s_t + b_{attn})   (1)
a^t = \mathrm{softmax}(e^t)   (2)

where v, W_h, W_s, and b_{attn} are learnable parameters. The attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector h^*_t:

h^*_t = \sum_i a^t_i h_i   (3)

The context vector h^*_t and decoder state s_t are fed through linear layers to obtain the vocabulary distribution P_{vocab}(w). The network is trained end-to-end using the negative log-likelihood of the target word w^*_t at each timestep:

loss_t = -\log P_{vocab}(w^*_t)   (4)

Pointer-Generator Network (Ptr-Net): Ptr-Net (See et al., 2017) is a hybrid between the baseline seq2seq with attention and a pointer network. In this model, the generation probability p_{gen} (the probability of generating a new word in the output) depends on the context vector h^*_t and the attention distribution a^t, and is computed as:

p_{gen} = \sigma(w_{h^*}^T h^*_t + w_s^T s_t + w_x^T x_t + b_{ptr})   (5)

where w_{h^*}, w_s, w_x, and b_{ptr} are learnable parameters. p_{gen} is used to choose between generating a word from the vocabulary by sampling from P_{vocab}, or copying a word from the input sequence by sampling from the attention distribution a^t. The final probability distribution over the vocabulary is:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a^t_i   (6)

The model is trained end-to-end, similar to seq2seq with attention, using the negative log-likelihood of the target word w^*_t as the loss function.

Transformer: Transformers (Vaswani et al., 2017) are attention-based models, where the relationship between a given word w and its context is modeled through multi-head attention. Each layer in a transformer consists of multi-head attention (MHAtt) followed by layer normalization (LN) and a feed-forward network (FFN), as shown in Equations 7 and 8:

\tilde{h}^l = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1}))   (7)
h^l = \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l))   (8)

The final layer representation from the encoder is given to the decoder, and the decoder is trained using the negative log-likelihood of the target word w^*_t.
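To make Equations (1) through (6) concrete, the sketch below computes one pointer-generator decoding step in PyTorch. This is only an illustrative sketch, not the authors' implementation (they report using the See et al. (2017) codebase); the tensor shapes, the params dictionary, and the output projection W_out/b_out feeding P_vocab are assumptions introduced for this example.

```python
# Hedged sketch of one pointer-generator decoding step (Eqs. 1-6).
import torch
import torch.nn.functional as F

def pointer_generator_step(h, s_t, x_t, src_ids, params):
    """
    h:       (src_len, hidden)  encoder hidden states h_i
    s_t:     (hidden,)          decoder state at step t
    x_t:     (emb,)             decoder input embedding at step t
    src_ids: (src_len,)         vocabulary ids of the source tokens w_i (LongTensor)
    params:  dict of learnable tensors (W_h, W_s, b_attn, v, W_out, b_out, ...)
    """
    # Eqs. (1)-(2): attention distribution over source positions
    e_t = torch.tanh(h @ params["W_h"] + s_t @ params["W_s"] + params["b_attn"]) @ params["v"]
    a_t = F.softmax(e_t, dim=0)                                   # (src_len,)

    # Eq. (3): context vector as attention-weighted sum of encoder states
    h_star = a_t @ h                                              # (hidden,)

    # Vocabulary distribution P_vocab from the context vector and decoder state
    # (the "linear layers" of the text, here a single assumed projection W_out/b_out)
    p_vocab = F.softmax(torch.cat([h_star, s_t]) @ params["W_out"] + params["b_out"], dim=0)

    # Eq. (5): generation probability p_gen
    p_gen = torch.sigmoid(h_star @ params["w_h"] + s_t @ params["w_s"]
                          + x_t @ params["w_x"] + params["b_ptr"])

    # Eq. (6): mix generating from the vocabulary with copying from the source;
    # attention mass of repeated source tokens is accumulated per vocabulary id.
    p_final = (p_gen * p_vocab).index_add(0, src_ids, (1.0 - p_gen) * a_t)
    return p_final                                                # (vocab_size,)
```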
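Equations (7) and (8) correspond to a standard residual transformer layer. The following minimal PyTorch sketch mirrors that structure; the hidden size, head count, and feed-forward width shown match the decoder configuration reported in the implementation details (768 units, 8 heads assumed, 2,048-dimensional feed-forward), but the class itself is an illustration, not the paper's code.

```python
# Hedged sketch of a single transformer layer as written in Eqs. (7)-(8).
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # h_prev: (batch, seq_len, d_model), the representation h^{l-1}
        # Eq. (7): h~^l = LN(h^{l-1} + MHAtt(h^{l-1}))
        attn_out, _ = self.attn(h_prev, h_prev, h_prev)
        h_tilde = self.ln1(h_prev + attn_out)
        # Eq. (8): h^l = LN(h~^l + FFN(h~^l))
        return self.ln2(h_tilde + self.ffn(h_tilde))
```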
BERT Abstractive Summarization: We use BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) to encode the web titles. BERT is trained on a large text corpus using unsupervised tasks (masked language modelling and next sentence prediction). BERT uses token, segment, and position embeddings to represent input tokens. Segment embeddings are used in pairwise tasks to differentiate between segments (e.g., question and answer in SQuAD tasks). This input representation is then fed into multiple transformer layers.

Figure 1: Architecture for the BERT summarization model. The input document at the top is the sequence of words from the web title.

Liu and Lapata (2019) modify the BERT formulation for the text summarization task. A [CLS] token is used at the start of a sentence, and the representation of this token is used as the sentence representation. E_A and E_B are used as segment embeddings for odd and even sentences, respectively. Position embeddings for sentences longer than 500 words are learnt as model parameters. We adopt this format to represent our data. For our use case, web product titles are one sentence long, hence the segment embedding E_B is not used. We insert [CLS] and [SEP] tokens into the input web title w = [w_1, w_2, ..., w_n] as shown in Figure 1. The input representation [x_1, x_2, ..., x_n] for the transformer is then prepared by adding the position (E_{pos}), token (E_{token}), and segment (E_A) embeddings for the corresponding words:

x_i = E_{token} + E_{pos} + E_A   (9)

The encoder then transforms this input representation using transformer layers which apply the transformation shown in Equation 7, where h^0 is the input representation x_i. The continuous representation from the final layer of BERT, [z_1, z_2, ..., z_n], is then given to the decoder. While pre-trained BERT is used as the encoder, a randomly initialized 8-layer transformer is used as the decoder. We train the decoder to generate summaries from the ground-truth labels using the abstractive summarization framework of See et al. (2017), without the coverage and copy mechanisms. We use the Adam optimizer with the following learning rate schedules for the encoder and decoder (Liu and Lapata, 2019):

lr_e = \tilde{lr}_e \cdot \min(step^{-0.5}, step \cdot warmup_e^{-1.5})   (10)
lr_d = \tilde{lr}_d \cdot \min(step^{-0.5}, step \cdot warmup_d^{-1.5})   (11)

We set \tilde{lr}_e = 2e-3, \tilde{lr}_d = 0.1, warmup_e = 20,000, and warmup_d = 10,000.
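For concreteness, a small sketch of the warm-up schedules in Equations (10) and (11) is shown below. The peak learning rates and warm-up steps used here are the defaults from Liu and Lapata (2019), which the partially legible values in the text appear to match; treat them as assumptions rather than confirmed settings.

```python
# Hedged sketch of the Noam-style warmup schedules in Eqs. (10)-(11).
def encoder_lr(step: int, base_lr: float = 2e-3, warmup: int = 20000) -> float:
    # lr_e = lr~_e * min(step^-0.5, step * warmup_e^-1.5)
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

def decoder_lr(step: int, base_lr: float = 0.1, warmup: int = 10000) -> float:
    # lr_d = lr~_d * min(step^-0.5, step * warmup_d^-1.5)
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Example: the rate rises during warm-up, then decays with the inverse square root of the step.
if __name__ == "__main__":
    for step in (1000, 20000, 35000):
        print(step, encoder_lr(step), decoder_lr(step))
```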
Dataset

We use a proprietary dataset from one of the largest e-commerce retailers in the world. Our dataset consists of only 19,269 pairs of web product titles along with their corresponding voice titles. Details of the dataset are provided in Table 2. The web product titles are either entered by merchants or algorithmically generated for certain categories while publishing products on the e-commerce website. The voice titles for the corresponding web titles are manually created by human annotators through a crowdsourcing platform. Some examples of web titles and corresponding voice titles are shown in Table 1.

Web titles (input sequence of words) | Voice titles (output sequence of words)
El Monterey Beef & Cheese Burritos 8 ct bag | a family size pack of 8 El Monterey Frozen Beef And Cheese Burritos
Paas Magical Color Cup Egg Decorating Kit | a pack of Paas Magical Color Cup Egg Decorating Kit
Wonderful Roasted & Salted Pistachios 8 oz. Bag | an 8 ounce bag of Wonderful Roasted And Salted Pistachios

Table 1: Examples of input web titles (left) and desired output voice titles (right).

There are certain key differences in the characteristics of web titles and voice titles. Important distinctions are listed below:
• Web titles often contain abbreviations for units of measurement for succinctness; e.g., Row 3 in Table 1 mentions "8 oz. Bag". However, the voice title should contain the corresponding natural language word "ounce".
• Web titles may or may not contain articles, but voice titles need to have grammatically correct articles, conjunctions, etc. For example, refer to Row 1 in Table 1.
• Web titles sometimes contain specific product attributes, such as brand or quantity. These product attributes may have altered positions in the voice title, but the attribute phrase needs to be retained exactly in its entirety.
• As shown in Table 2, the average voice title length is 11.39 tokens. Voice titles need to be short and succinct, as this information is spoken through a voice device to the end user.

avg. web title length | 15.3352
avg. voice title length | 11.3886

Table 2: Dataset statistics.

Implementation Details

We use the PyTorch 'bert-base-uncased' version of BERT for the encoder, along with its subword tokenizer (https://github.com/nlpyang/PreSumm/). In the decoder, the transformer has 768 hidden units and 8 layers, while the feed-forward layers have size 2,048. The learning rate schedule is as described in Section 3, with a batch size of 256, and the model was trained for 35,000 steps. We used beam search with a length penalty α for decoding; decoding continues until the end-of-sequence token is emitted. We also block repeated trigrams (Liu and Lapata, 2019). We use a minimum length of 4 and a maximum length of 50 for decoding. A checkpoint model was saved every 2,000 steps, with the best-performing checkpoint on validation data being used to report performance on the test data.

We compare the BERT model with seq2seq, Ptr-Net, Ptr-Net + Coverage, and the Transformer model as baselines. For seq2seq, Ptr-Net, and Ptr-Net + Coverage, the implementation of See et al. (2017) was used to generate the results (https://github.com/abisee/pointer-generator). The implementation details of the various baselines are provided below:

• seq2seq: The Stanford CoreNLP PTBTokenizer is used to tokenize the data, which is converted into story format as in the popular CNN/Daily Mail dataset (Nallapati et al., 2016a) for text summarization. We use the authors' default parameters and change the maximum encoder length to 50 and the maximum decoder length to 35, as these are the corresponding maximum lengths of the augmented web titles and voice titles, respectively. When decoding test data, beam search with beam size 4 is used to generate the predicted title. A minimum length of 5 is set for the predicted title.
• Ptr-Net: The pointer network uses the same parameters as the seq2seq model, and the validation set is used to identify the optimal training checkpoint for the model.
• Ptr-Net + Coverage: We use the authors' default implementation and train the model in a two-step training process. First, the pointer network is trained without any coverage loss. Using the validation loss, the best model is extracted. We then add the coverage loss term and train the model again, using the previously mentioned best model as warm-up.
• Transformer: We use a 6-layer transformer encoder with hidden size 512 and a 2,048-dimensional feed-forward layer. For the decoder, we use the same configuration as for BERT. The learning rate and other hyper-parameters are obtained from Liu and Lapata (2019).
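The trigram-blocking rule mentioned in the decoding setup above (Liu and Lapata, 2019) can be sketched as a simple check applied when extending a beam hypothesis: a candidate token is pruned if appending it would repeat a trigram already present in the hypothesis. The function and variable names below are illustrative assumptions, and the exact integration into beam search is not specified by the paper.

```python
# Hedged sketch of trigram blocking during beam-search decoding.
def repeats_trigram(tokens, next_token):
    """Return True if appending next_token would recreate an existing trigram."""
    candidate = tokens + [next_token]
    if len(candidate) < 3:
        return False
    new_trigram = tuple(candidate[-3:])
    seen = {tuple(candidate[i:i + 3]) for i in range(len(candidate) - 3)}
    return new_trigram in seen

# Example: extending a hypothesis with a token that recreates an earlier
# trigram ("a pack of") would be blocked.
hyp = "a pack of 6 hostess donettes a pack".split()
print(repeats_trigram(hyp, "of"))   # True
```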
Evaluation Metrics

We use the following metrics to evaluate the proposed model and the baselines:
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation): measures overlap between the candidate summary and the ground truth using precision, recall, and F1 scores (Lin, 2004). However, ROUGE does not give a clear idea about repetitions or duplicates in the generated summary. We report F1 ROUGE scores at 1, 2, and L (longest common subsequence).
  – ROUGE-1 refers to the overlap of unigrams.
  – ROUGE-2 refers to the overlap of bigrams.
  – ROUGE-L takes sentence-level structural similarity into account and identifies the longest co-occurring in-sequence n-grams.
• Avg. duplicate unigrams: the number of duplicate one-grams between the ground truth summary and the candidate summary.
• Human evaluation: three human annotators were provided 100 random samples of model-output titles and were asked to rate each title on a scale of 1–5, based on product relevance (i.e., whether the product is the same), grammatical correctness, and correct preservation of important attributes (such as brand name, quantity, and unit of measurement). The average score of the annotators is taken as the judgement score for the model. This evaluation was performed only for the Transformer and BERT abstractive model outputs.
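For concreteness, here is a minimal sketch of two of the automatic metrics: a unigram-overlap ROUGE-1 F1 and a duplicate-unigram count. The paper does not specify its ROUGE toolkit settings or the exact counting convention for the duplicate-unigram metric, so the definitions below (lowercased whitespace tokens, repeats counted within a title) are assumptions made for illustration.

```python
# Hedged sketch of ROUGE-1 F1 and a duplicate-unigram count.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def duplicate_unigrams(candidate: str) -> int:
    counts = Counter(candidate.lower().split())
    return sum(c - 1 for c in counts.values() if c > 1)

pred = "a pack of 6 hostess donettes frosted mini donuts"
truth = "a pack of 6 hostess donettes mini donuts"
print(rouge1_f1(pred, truth), duplicate_unigrams(pred))
```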
Results and Discussion

Method | R-1 | R-2 | R-L | avg. duplicate unigrams
seq2seq + attention | 0.7951 | 0.6607 | 0.7883 | 0.2571
Ptr-Net | 0.8965 | 0.8053 | 0.8956 | 0.2398
Ptr-Net with Coverage | 0.8917 | 0.8042 | 0.8800 | 0.3041
Transformer | | | |
BERT Abstractive | | | |

Table 3: Evaluation results. R-1, R-2, and R-L denote the ROUGE-1, ROUGE-2, and ROUGE-L metrics.

Table 3 provides a summary of model performance on the different evaluation metrics. We observe that the Transformer and BERT models outperform the seq2seq-based approaches in terms of both ROUGE and the average duplicate unigram metric. While the Transformer model does have a better ROUGE score than the BERT model, BERT produces fewer repeated words in its output, which has a greater qualitative impact on title quality. Since BERT is pre-trained on a large corpus, it should be able to generalize better, especially in low-data scenarios. Human evaluation of the generated titles reinforces this, showing that BERT performs better than the Transformer-based model on output quality.

  | Web title | Ground truth | Model prediction
  | Lucky Charms Gluten Free ... | a box of Lucky Charms Gluten Free Cereal | a box of lucky charms gluten free cereal
4 | Hostess Donettes Frosted Mini Donuts, 6 ct, 3 oz | a pack of 6 hostess donettes mini donuts | a pack of 6 hostess donettes frosted mini donuts
5 | ... | a ... of pork butt | a tray of pork butt steaks
6 | Pork Cube Steaks, Tray | a 12 ounce tray of ground pork | a tray of pork cubes

Table 4: BERT Summarization - Good Model Predictions.
  | Web title | Ground truth | Model prediction
1 | ... Chocolate ... Fridge Pack | a 12 count pack of yoo hoo chocolate milk fridge pack | a pack of 12 yoo hoo chocolate bar
2 | ... produce baby food blend | ... | produce produce produce baby food blend
3 | Diet 7UP, ..., 6 pack | a 6 pack of .5 liter diet 7 up | ...
4 | Garlic, each (1 bulb) | garlic sold individually | a pound of garlic sold individually
5 | Harvestland Chicken Breast, ... Tray | a tray of Perdue Harvestland Free Range Chicken Breasts | a tray of perdue harvestland free range chicken breast

Table 5: BERT Summarization - Bad Model Predictions.

Table 4 and Table 5 list examples from the test dataset where the BERT abstractive model performs well and poorly, respectively. The corresponding ground truth and web titles are provided for comparison. The model produces repeated words in certain cases, for example Row 2 of Table 5. Given that most of the training data consists of products with ounce and pound as the units of measurement, liter is incorrectly converted to fluid ounce by the model in Row 3 of Table 5, and pound is added to a single bulb of garlic in Row 4 of Table 5. However, in some cases the model clearly does better than the ground truth and even fixes incorrect quantities in the ground truth (rows 3-6 in Table 4). The model is able to add important attributes, such as frosted in row 4 of Table 4. From row 2 of Table 4 we can see that the model is able to maintain brand names such as Great Value as-is and to provide the correct measurement units for different products, such as a pack of 6. Thus the model is able to fulfill the requirements of preserving quantity and brand between web and voice titles. Overall, we observe that the BERT-based model performs better both quantitatively and qualitatively, maintaining factual details in the output title and reducing repeated words in the output.

Conclusion

In this paper, we studied the problem of generating succinct, grammatically correct voice titles for products in a large e-commerce catalog with limited labels. We evaluated four different baselines and demonstrated that BERT summarization can generate good titles, as measured by ROUGE metrics and human evaluation, even when data is extremely limited. Generating personalized titles for different user segments based on rich user metadata, and incorporating web data with additional product attributes that may be product dependent, are some directions in which to extend this work.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate.

José G. Camargo de Souza, Michael Kozielski, Prashant Mathur, Ernie Chang, Marco Guerini, Matteo Negri, Marco Turchi, and Evgeny Matusov. 2018. Generating e-commerce product titles and predicting their quality. In Proceedings of the 11th International Conference on Natural Language Generation, pages 233-243, Tilburg University, The Netherlands, November. Association for Computational Linguistics.

Qibin Chen, Junyang Lin, Yichang Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Towards knowledge-based personalized product description generation in e-commerce. arXiv preprint arXiv:1903.12457.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge Trimmer: A parse-and-trim approach to headline generation. Technical report, University of Maryland, College Park, Institute for Advanced Computer Studies.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780, November.

Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. arXiv preprint arXiv:1905.08836.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

Prashant Mathur, Nicola Ueffing, and Gregor Leusch. 2018. Multi-lingual neural title generation for e-commerce browse pages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 162-169, New Orleans, Louisiana, June. Association for Computational Linguistics.

Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016a. Sequence-to-sequence RNNs for text summarization. CoRR, abs/1602.06023.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016b. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. CoRR, abs/1611.04230.

Joel Larocca Neto, Alex A. Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Guilherme Bittencourt and Geber L. Ramalho, editors, Advances in Artificial Intelligence, pages 205-215, Berlin, Heidelberg. Springer Berlin Heidelberg.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Fei Sun, Peng Jiang, Hanxiao Sun, Changhua Pei, Wenwu Ou, and Xiaobo Wang. 2018. Multi-source pointer network for product title summarization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 7-16.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks.

Jian-Guo Zhang, Pengcheng Zou, Zhao Li, Yao Wan, Xiuming Pan, Yu Gong, and Philip S. Yu. 2019. Multi-modal generative adversarial network for short product title generation in mobile e-commerce. arXiv preprint arXiv:1904.01735.