Cross-Modal Alignment with Mixture Experts Neural Network for Intra-City Retail Recommendation
Po Li, Lei Li, Yan Fu, Jun Rong, Yu Zhang
Alibaba Group, Hangzhou, China
{poli.lp, zy223687, leili.ll, yanfu.fy, junrong.rj}@alibaba-inc.com

Abstract
In this paper, we introduce the Cross-modal Alignment with mixture experts Neural Network (CameNN) recommendation model for the intra-city retail industry, which provides fresh food and grocery retailing with delivery within 5 hours, a service whose demand surged with the outbreak of the Coronavirus disease (COVID-19) pandemic around the world. Most food and grocery stores are small businesses without extra labor to type or maintain commodities' property information carefully in the merchant inventory system, so inaccurate text information and vague image data are common issues in this industry. Prior Conversion Rate (CVR) prediction models, from FM-series models and Wide & Deep models to state-of-the-art Transformer models, mainly focus on how to learn more effective feature interactions. When they consume such misleading image and text data, these CVR models cannot achieve the satisfying performance they reach in other industries where data is accurate, and numerous frustrating bad cases occur in our practical recommendation. To this end, we propose CameNN, a multi-task model with three tasks: an Image to Text Alignment (ITA) task, a Text to Image Alignment (TIA) task and a CVR prediction task. We use pre-trained BERT to generate text embeddings and pre-trained InceptionV4 to generate image patch embeddings (each image is split into small patches with the same number of pixels, and each patch is treated as an image token). A transformer encoder is applied as the shared-bottom layer to learn the shared interactions of all input features. Next, a mixture of transformer experts (MoE) layer models different aspects of the tasks, with softmax gating networks that learn the weight of each transformer expert's output and choose only a subset of experts conditioned on the input. On top of the MoE layer, we deploy a transformer layer for each task as a task tower to learn task-specific information.
On a real-world intra-city dataset, experiments demonstrate that CameNN outperforms baselines and achieves significant improvements on image and text representations. In practice, we applied CameNN to CVR prediction in our intra-city recommender system, one of the leading intra-city platforms operated in China.
Introduction

With the outbreak of the Coronavirus disease (COVID-19) pandemic around the world, this newly discovered coronavirus has caused more than half a million deaths and tens of millions of confirmed cases. According to a WHO report (Organization 2020), keeping 'physical distancing' is one of the most effective measures to prevent the spread of COVID-19. The COVID-19 pandemic has changed the life of most people on this planet: many companies require their staff to work from home instead of the office, and people tend to buy daily fresh food and groceries through online stores instead of physical stores. Intra-city online retailing services, which deliver daily necessities within 5 hours or even less than 30 minutes, help people to prepare fresh food and buy daily necessities conveniently and in time. However, most grocery or fresh food stores are family-owned businesses with little internet retailing experience or knowledge, so intra-city online retailing platforms that provide services such as online stores and intra-city delivery for these businesses have arisen. A personalized goods recommender system plays a vital role in guiding customers to the right commodities. But since most physical grocery stores are small businesses without extra staff to type commodities' properties into the online inventory system carefully, much of the information posted on the online platform is incorrect, such as placing the item 'Cauliflower' in the category 'Fruit', uploading vague or wrong images for an item, or even describing 'Strawberries' with the item name 'Berries'. Such misleading item data, consumed as inputs, result in poor performance of item recommendation models. It is impossible to rectify millions of wrong records manually. One mitigation is to reduce the wrong data at the origin source, which is time consuming.
Another approach, designing a multi-task model that automatically rectifies wrong data and achieves better commodity recommendation, should be explored.
The recommender system on our platform mainly consists of a two-stage pipeline: matching and ranking. Stage one is matching, which selects a set of items according to users' profiles and behaviors. Then, models predict the Conversion Rate (CVR) for each item and rank the items by certain rules. CVR prediction is the task of predicting the probability of a user purchasing the recommended candidate items given the user's historical behaviors. In practice, models used for CVR prediction are similar to models for Click-Through Rate (CTR) prediction, which predicts a user's click probability on candidate items based on the user's historical behaviors. The factorization machine (FM) (Steffen 2010) is a typical CTR model designed to extend logistic regression (LR) by including second-order terms that allow pairwise interactions between variables. Further extensions of FMs include field information (field-aware factorization machines (FFM); (Juan et al. 2016)), attention mechanisms (attentional FM; (Xiao et al. 2017)), and even deep components (DeepFM; (Guo et al. 2017)). However, other than DeepFM, most FM-family models fail to include higher-order terms, and the choice of second-order terms requires domain expertise. Recently, deep learning models such as Wide & Deep (Heng-Tze et al. 2016), xDeepFM (Lian et al. 2018), the deep interest network (DIN) (Guorui et al. 2018) and the Behavior Sequence Transformer (BST) (Qiwei et al. 2019) have been developed to learn higher-order feature interactions. These deep models have greater capacity for modeling user preference and capturing user behaviors. However, facing the issues mentioned above in the intra-city retailing industry, those typical CTR models do not perform well, as they suffer from the poor quality of text and image data.
Numerous frustrating bad cases occur in our practical implementation. In this paper, we propose a multi-task recommendation model named Cross-modal Alignment with mixture experts Neural Network (CameNN) to solve the above problems. Three tasks, namely an Image to Text Alignment (ITA) task, a Text to Image Alignment (TIA) task and a CVR prediction task, are introduced in CameNN. Inspired by FashionBERT (Dehong et al. 2020), we split each image into small patches with the same number of pixels and treat each patch as an image token. Meanwhile, texts are tokenized according to (Yonghui et al. 2016) into token sequences using the standard BERT vocabulary. The pre-trained Chinese version of BERT (Jacob et al. 2018) and pre-trained InceptionV4 (Christian et al. 2016) generate the text representation and image representation, respectively. A shared-bottom layer then models the input features' shared interactions across tasks. Next, a mixture of transformer experts (MoE) layer models different aspects of the tasks, with softmax gating networks designed to learn the weights of the transformer experts and choose a subset of experts conditioned on the inputs. On top of the MoE layer, we deploy a transformer layer for each task as a task tower to learn task-specific information. The transformer encoder is the basic layer of the shared-bottom layer, the expert layers and the task tower layers. Experiments on a dataset from a leading intra-city retailing platform operated in China demonstrate that CameNN outperforms baselines on the CVR prediction task and achieves significant improvements in rectifying image and text representations. In our real-world online application, CameNN does help to improve the CVR of item recommendation. To summarize, our main contributions are as follows:
• We describe the issues that intra-city retail recommendation faces and propose CameNN to address these difficulties.
• We present CameNN to conduct both cross-modal (image and text) alignment and customer Conversion Rate (CVR) prediction on real intra-city retail industry data, showing that CameNN outperforms baseline models on the CVR task, the TIA task and the ITA task.
• We conduct an ablation study to show the benefit of CameNN for text and image data alignment.
• We show a successful and efficient large-scale online application of CameNN that improves the CVR of item recommendation.
Related Work

Traditional non-linear CTR models such as factorization machines (FMs) (Steffen 2010) have been proven effective in recommendation systems, but their modeling capacity is limited by their low complexity. To extend the FM model, many efforts have been made: Field-aware FM (FFM) (Juan et al. 2016) was proposed to learn different interactions with features from different fields, and Attentional Factorization Machines (AFM) (Xiao et al. 2017) were designed to use an attention network (Bahdanau, Cho, and Bengio 2014) to learn the importance of each feature interaction. However, all these linear extensions of FM still focus on modeling second-order feature interactions and are impractical for real-world non-linear structured data. In recent years, benefiting from the boom in neural networks, models that learn high-order feature interactions have significantly improved the performance of CTR prediction. For example, the Wide & Deep model (Heng-Tze et al. 2016) is designed to learn high-order feature interactions. Taking advantage of the higher-order feature learning of the Wide & Deep model and the second-order factorization power of FM, an integration of the two has been developed (Guo et al. 2017). Furthermore, the extreme deep factorization machine (xDeepFM) (Lian et al. 2018) was established to exploit the modeling power of feed-forward neural networks. Since the ability to handle sequential data is required to improve CTR prediction, the deep interest network (DIN) (Guorui et al. 2018) and the Behavior Sequence Transformer (Qiwei et al. 2019) were proposed to capture the dependencies between users' sequential behaviors that may reflect the interests behind historical behaviors. But the above-mentioned models cannot achieve good performance on CVR prediction when consuming misleading item data. For multi-task learning, a shared-bottom model structure was proposed by Caruana (Caruana 1993), with bottom hidden layers shared across tasks.
However, task differences may cause optimization conflicts in this structure, since all tasks share the same set of parameters in the shared-bottom layers. To avoid sharing the same parameters, Duong et al. (Long et al. 2015) add L2 constraints between the two sets of parameters of two tasks. Meanwhile, Misra et al. adapt the cross-stitch network (Ishan et al. 2016) to learn a unique combination of task-specific embeddings, and Yang et al. (Yongxin and Timothy 2016) implement a tensor model to generate task-specific hidden parameters. All of these models require huge amounts of data to train and are inefficient for large-scale implementation. Ma et al. (Jiaqi et al. 2018) proposed the Multi-gate Mixture-of-Experts (MMoE) model, using softmax gating networks and a mixture of identical multilayer perceptron (MLP) experts with ReLU activations to accomplish multiple tasks more efficiently. To empower the model with the ability to model users' sequential behaviors, Zhen et al. (Zhen et al. 2020) developed the Multitask Mixture of Sequential Experts (MoSE) by replacing all the functional ReLU MLP layers in MMoE with LSTM layers (Hochreiter and Schmidhuber 1997), which learn better sequential representations. However, MoSE does not explicitly handle cross-modal data such as image or text inputs. All the aforementioned models need major modifications to tackle our issues in intra-city item recommendation, which inspires us to develop CameNN to address this problem.
Model

In this section, we give an overview of our proposed CameNN and describe in detail how we use mixture experts to align text and image features and improve performance on customer Conversion Rate (CVR) prediction. The overall structure of CameNN is depicted in Figure 1. CameNN is composed of six parts: a text representation net, an image representation net, a shared-bottom layer, softmax gating networks, a mixture of experts and task tower nets.
We use three categories of features: Other Features (including user profile features, item features excluding texts and images, context features and cross features), the User Behavior Sequence (the list of items the user bought, with the items' text and image information) and Target Item Features (the target item's text and image information). The Other Features are encoded into a one-hot vector $x_o$, which is passed through an embedding layer to obtain a low-dimensional dense embedding $E_{Other}$ with dimension size $n_E$.

The BERT (Jacob et al. 2018) language model has been shown to be effective for a variety of natural language processing tasks. Using the standard preprocessing method of BERT, we tokenize an item text $x_{txt}$ according to (Yonghui et al. 2016) into a token sequence, adopting the standard BERT vocabulary. We then apply the pre-trained Chinese version of BERT, with 12 layers, 768 hidden units, 12 heads and 110M parameters, to generate token embeddings with dimension size $n_E$. For segmentation, we use a 'T' token for text tokens to differentiate text and image features, as shown in Figure 2. Special [CLS] tokens are placed in the first position of each item, and [SEP] tokens between text tokens and image patch tokens. Next, similar to BERT, a position embedding represents the sequence position information. The final text input representation is the sum of the token embedding, the position embedding and the segmentation embedding. Denoting an item's text embedding by $e_{txt}$, its generation is:

$$e_{txt} = f_{BERT}(x_{txt}) + e_{poi_t} + e_{seg_t} \quad (1)$$

where $e_{txt}, e_{poi_t}, e_{seg_t} \in \mathbb{R}^{n_E \times n_T}$, $n_E$ is the embedding dimension size and $n_T$ is the number of tokens in the item text. $e_{poi_t}$ is the text token position embedding and $e_{seg_t}$ is the text segmentation embedding.
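Equation (1) is an element-wise sum of three same-shaped matrices. A minimal NumPy sketch follows; the random matrices stand in for the pre-trained BERT outputs and the learned position/segment tables, which are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_E, n_T = 8, 5  # embedding dimension, number of text tokens

# Stand-ins for f_BERT(x_txt), e_poi_t and e_seg_t (real values come from learned models).
token_emb = rng.normal(size=(n_E, n_T))                  # BERT token embeddings
pos_emb = rng.normal(size=(n_E, n_T))                    # position embeddings
seg_emb = np.tile(rng.normal(size=(n_E, 1)), (1, n_T))   # one 'T' segment vector, broadcast

e_txt = token_emb + pos_emb + seg_emb  # Eq. (1): element-wise sum
assert e_txt.shape == (n_E, n_T)
```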
The state-of-the-art method to extract semantic information from images is RoI detection, such as (Girshick 2013), using the detected RoIs as 'image tokens' as shown in Figure 2. In the intra-city domain, item images are small and lack detectable RoIs, so the RoI method is not effective in this industry. We instead apply another pioneering approach (Dehong et al. 2020) and extract image patches as image tokens. To keep rawer pixel features than object-level RoIs, we split each image $x_{img}$ into small patches with the same number of pixels and treat each patch as an image token. The patches are arranged in natural order as a sequence of non-repeated patches. Next, we use the pre-trained image model InceptionV4 (Christian et al. 2016) to generate patch embeddings with embedding dimension size $n_E$; it could be replaced by any other pre-trained image model such as VGG-16 (Karen and Andrew 2015) or ResNeXt (Saining et al. 2017). The segmentation token 'I' is used for image patch tokens. Placing the patches in natural order, we represent their spatial information with position embeddings. Finally, image patches are represented as the sum of the patch embedding, the segmentation embedding and the position embedding:

$$e_{img} = f_{InceptionV4}(x_{img}) + e_{poi_p} + e_{seg_p} \quad (2)$$

where $e_{img}, e_{poi_p}, e_{seg_p} \in \mathbb{R}^{n_E \times n_P}$, $n_E$ is the embedding dimension size and $n_P$ is the number of patches split from the item image. $e_{poi_p}$ is the image patch position embedding and $e_{seg_p}$ is the image segmentation embedding.

The input features $E_{Input}$ are generated as follows:

$$E_{Input} = \mathrm{Concat}(E_{Other}, E_{User}, E_{Target})$$
$$E_{User} = [E_1, \ldots, E_i, \ldots, E_N], \quad E_{Target} = [e_{cls}, e_{txt}, e_{sep}, e_{img}]$$
$$E_i = [e_{cls,i}, e_{txt,i}, e_{sep,i}, e_{img,i}] \quad (3)$$

where $e_{cls,i}, e_{txt,i}, e_{sep,i}, e_{img,i}$ respectively represent the special [CLS] token embedding, the text embedding, the [SEP] token embedding and the image embedding of the i-th item in the user behavior sequence, and N denotes the number of items in the user behavior sequence.

Figure 1: Our CameNN overview structure. The shared-bottom transformer layer, transformer experts and transformer towers are transformer encoders.

The user behavior sequence, consisting of items bought by the user in time order, represents the user's behavior preference over items. Other features are also used, such as user profile features (e.g. the user's age, gender and so on), the item's features excluding image and text information (e.g. item weight, item price, item location), cross features and context features. We concatenate all the above-mentioned other features, user behavior sequence features and target item features as the inputs of the multi-task framework. Since the input features contain sequential data, we choose the multi-head transformer encoder to consume them, as it learns explicit and effective sequential representations of the data. The structure of the transformer encoder is shown in Figure 1. In addition to the fact that the transformer has proved a natural fit for sequential data (Vaswani et al. 2017), a recent application (Qiwei et al. 2019) shows that the transformer encoder also models high-order interactions between features well in recommender systems.
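The patch tokenization described above can be sketched with a reshape. The function name `image_to_patch_tokens` and the 3×3 grid mirror the 9-patch setting reported in the implementation details, but the shapes are illustrative assumptions:

```python
import numpy as np

def image_to_patch_tokens(img, grid=3):
    """Split an (H, W, C) image into grid x grid equal patches, in row-major order."""
    H, W, C = img.shape
    ph, pw = H // grid, W // grid
    img = img[: ph * grid, : pw * grid]  # drop remainder pixels so the grid divides evenly
    # (grid, ph, grid, pw, C) -> (grid, grid, ph, pw, C) -> (grid*grid, ph, pw, C)
    patches = img.reshape(grid, ph, grid, pw, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid * grid, ph, pw, C)
```

Each returned patch is then treated as one image token and fed to the pre-trained image encoder.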
Inspired by the work of Zhen et al. (Zhen et al. 2020), we apply softmax gating networks to learn the weight of each transformer expert's output and choose only a subset of experts conditioned on the input data for each task. The gating networks can model the relationships between tasks by using different gates to separate the overlaps between tasks. For less related tasks, smaller weights are given to shared experts, which means the tasks will utilize different, more independent experts instead.
Developed from the MoE model, which was introduced by Robert et al. (Robert A et al. 1991) as an ensemble learning approach over multiple individual models, Eigen et al. (David, MarcAurelio, and Ilya 2014) use the same structure as the MoE model, turning it into an MoE layer that consumes the previous layer's outputs as inputs. We use transformer encoders as the experts of the MoE layer to handle sequential data as well as to model feature interactions better. Each expert can learn different aspects of each task. Coordinating with the gating networks, the MoE layer achieves conditional computation by activating only a subset of experts for a task.
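A minimal sketch of the gate-plus-experts computation, assuming a linear gate followed by softmax and top-k masking; the experts here are stand-in callables rather than transformer encoders, and the names `moe_output` and `top_k` are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_output(x, experts, gate_weights, top_k=2):
    """Weight expert outputs with a softmax gate, activating only the top-k experts."""
    logits = gate_weights @ x            # one logit per expert
    w = softmax(logits)
    keep = np.argsort(w)[-top_k:]        # indices of the top-k experts
    mask = np.zeros_like(w)
    mask[keep] = w[keep]
    mask /= mask.sum()                   # renormalise over the kept experts
    return sum(mask[j] * experts[j](x) for j in keep)
```

Only the kept experts are evaluated, which is the conditional-computation benefit described above.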
To decouple the optimization of multiple tasks, a separate transformer encoder tower is used for each task. The tower layer also learns task-specific information.

Figure 2: We use the pre-trained BERT-base Chinese model to generate text embeddings and pre-trained InceptionV4 to generate image patch embeddings. The image is a strawberry but the text information is vague.
Given T tasks, the above-mentioned procedure for the k-th task can be formulated as:

$$y^k = f^k_{TSR}(X^k), \quad X^k = \sum_{j=1}^{M} g^k_j(E_{Input}) \cdot f^j_{TSR}(X), \quad X = f_{TSR}(E_{Input}) \quad (4)$$

where $g^k(x) = \mathrm{Softmax}(x)$, $f_{TSR}$ denotes a transformer encoder and M is the number of transformer experts.

• Task 1: Image to Text Alignment (ITA) task
Given the incorrect or vague text data issue mentioned above, we exploit image data to align the vague text data. In this task, the output at the special [CLS] token is fed to a binary classifier that predicts whether the text data complies with the image data. For a positive training example, we take the image feature and the text feature from the same item. On the contrary, in a negative case, the text feature is chosen randomly from other items. Binary cross-entropy loss is used to optimize the ITA objective.

• Task 2: Text to Image Alignment (TIA) task
Conversely, we implement the TIA task to match a vague item image using text data. Similar to the ITA task, we feed the output at the special [CLS] token into a binary classifier. A positive example is the same as in the ITA task: the image and text come from the same item. For negative training data, we randomly choose the image feature from other items while the text feature comes from the correct item. As a binary classification task, it uses binary cross-entropy as the objective loss function.

• Task 3: Conversion Rate (CVR) prediction task
Our main task is to predict the customer CVR, which indicates whether the user will purchase the target item. In a positive training example, the target item has been bought by the user; in a negative example, the user did not buy the item. CVR prediction is also a binary classification problem, so we use binary cross-entropy loss for optimization.
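All three tasks are optimized with binary cross-entropy. A minimal sketch follows; clipping by `eps` is a numerical-stability assumption of this illustration, not a detail stated in the paper:

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy, the objective shared by the ITA, TIA and CVR tasks."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```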
Experiments

In this section, we conduct experiments on real industrial datasets and describe the main results.
Dataset: Our real-world dataset is collected from a leading online intra-city retailing platform that mainly operates in China. Information and entities that would reveal the identity of either a shop or a consumer have been carefully removed. We adopt real data from our platform database, which contains data from ten million users and property information for two hundred million commodities. The user data mainly contains user profiles and user behavior information. For the commodities' properties, we mainly parse the item image, item name and item text description. For the CVR prediction task, we collect 1,257,642 positive samples and 5,030,568 negative samples. Since our dataset comes from the real world, typos in textual data and blurred image data are common, and our CameNN model is designed to work in this scenario. For all three tasks, we use 75% of the dataset for training and 25% for testing.
Evaluation Tasks and Metrics: We introduce three tasks, the ITA task, the TIA task and the Conversion Rate (CVR) prediction task, to evaluate our CameNN model. All the data mentioned above are used for training and validating these three tasks. Since the ITA and TIA tasks are matching tasks, we use Accuracy to assess performance on them. For the CVR prediction task, where the network is trained as a binary classification estimator, we choose the Area under the ROC Curve (AUC) as the metric to measure predictive quality.

Table 1: Experimental CVR prediction results of CameNN and baseline models.
Methods    Offline AUC (mean ± std)
FM         0.7032 ± …
…          … ± …
CameNN     0.7549 ± …
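The AUC metric used for the CVR task can be computed with the rank-sum (Mann-Whitney) formulation. This sketch assumes no tied scores; a tie-aware implementation would average ranks within ties:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via ranks: P(score of a random positive > score of a random negative)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```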
Implementation Details: The dataset is split into two parts, with 75% used for training and 25% for testing. We use the Chinese pre-trained BERT with 12 layers, 768 hidden units, 12 heads and 110M parameters to generate the text token embeddings. For image patch feature generation, we adopt an InceptionV4 model pre-trained on the ImageNet dataset with num_classes equal to 1001. We set 50 as the maximum text sequence length and 9 (3×3 patches) as the maximum patch sequence length. Our experiments are conducted with TensorFlow and trained on 4 NVIDIA TITAN-V GPUs. The transformer encoders used in CameNN contain 8 heads and 1 transformer encoder block. All experimental models are trained with the Adam (Kingma and Ba 2014) optimizer (learning rate …, β1 = 0.95, β2 = 0.999, weight decay 1e-4). Early stopping is also implemented to avoid over-fitting.

To explore the performance of our model on CVR prediction, we implemented state-of-the-art deep learning models in this area as baselines on the same dataset.
• FM (Steffen 2010): FM takes advantage of the factorization mechanism to model second-order feature interactions.
• FFM (Juan et al. 2016): FFM models fine-grained interactions between features from different fields.
• Wide&Deep (Heng-Tze et al. 2016): Wide&Deep jointly trains feed-forward neural networks with embeddings and a linear model with feature transformations.
• DeepFM (Guo et al. 2017): DeepFM combines the power of factorization machines for recommendation and deep learning for feature learning in a single neural network.
• AFM (Xiao et al. 2017): AFM extends FM with an attention mechanism that determines the different importance of second-order combinatorial features, capturing second-order feature interactions.
• xDeepFM (Lian et al. 2018): xDeepFM uses a Compressed Interaction Network to take the outer product of the stacked feature matrix at the vector-wise level.
• DIN (Guorui et al. 2018): With an attention mechanism for users' behavior sequences, DIN tries to capture the different similarities between previously clicked items and the target item.
• BST (Qiwei et al. 2019): BST takes advantage of the Transformer's powerful ability to capture sequential relations, so the model learns deeper representations for items in users' behavior sequences.

Table 1 shows that, compared against all the baselines mentioned above, our model achieves the best performance on the chosen metric, AUC, for the CVR prediction task.

Table 2: Experimental TIA and ITA task results of CameNN and FashionBERT.
Methods        TIA Accuracy    ITA Accuracy
FashionBERT    85.13%          85.71%
CameNN         85.08%          86.23%

Figure 3: text1: Juicy red sweet berries 500g. Locally farm grown. No farm chemicals used. text5: Fresh carrots from local farm. No chemicals.
1) ITA and TIA tasks study
FashionBERT (Dehong et al. 2020) introduces an innovative method for text and image matching and cross-modal retrieval, consisting of a pre-trained BERT (Jacob et al. 2018) as the matching backbone network and an adaptive loss to trade off multiple tasks. To evaluate the performance of CameNN on the ITA and TIA tasks, we run a series of experiments comparing it with FashionBERT. As described above, we use accuracy as the metric to assess performance on the ITA and TIA tasks. Table 2 shows that CameNN achieves performance evenly matched with FashionBERT on the image and text alignment tasks. In addition, CameNN is more flexible for less related multi-task settings.

Figure 4: Text and image embedding similarity before the TIA and ITA tasks.

Figure 5: Text and image embedding similarity after the TIA and ITA tasks.
2) Visualization of embedding similarity after alignment tasks
Figure 3 shows item 1, a typical case of vague text information: the image of strawberries is explicit and clear, while the corresponding text is ambiguous, with only 'berries' to define the commodity. On the other hand, for item 5 in Figure 3, the text describes the properties of the carrots unequivocally while the image is indeterminable. To visualize the impact of the alignment tasks on item text and image embeddings, we choose 8 items as samples. Each item consists of two parts: a text embedding and an image embedding. We calculate the similarity between each item's image embedding and its corresponding text embedding before and after the ITA and TIA tasks. In the heatmaps illustrated in Figure 4 and Figure 5, the similarity between the text and image embeddings from the same item increases after the alignment tasks.
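Heatmaps of this kind can be produced with cosine similarity between L2-normalized embeddings. The function below is an illustrative sketch; the paper does not state its exact similarity measure:

```python
import numpy as np

def text_image_similarity(text_emb, img_emb):
    """Cosine-similarity heatmap: rows index item text embeddings, columns item image embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    return t @ v.T  # (n_items, n_items) matrix; the diagonal pairs each item with itself
```

After alignment training, the diagonal entries (same-item text/image pairs) are expected to grow relative to the off-diagonal ones.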
3) Multi-task frame study

MMoE (Jiaqi et al. 2018): MMoE adapts Mixture-of-Experts (MoE) and softmax gating networks to implement multi-task learning by sharing subsets of experts across tasks. All the experts are fully connected layers with ReLU activation.

Table 3: Practical CVR gain results of CameNN and baseline models.
Methods    CVR Gain
xDeepFM    -4.20% ± …
…          … ± …
CameNN     0.63% ± …

Table 4: Experimental CVR prediction results of CameNN and multi-task frame baselines.
Methods       Offline AUC (mean ± std)
MMoE frame    0.7203 ± …
MoSE frame    … ± …
CameNN        0.7549 ± …
MoSE (Zhen et al. 2020): MoSE applies the same frame as MMoE but replaces the ReLU experts and the ReLU shared-bottom layer with LSTM layers (Hochreiter and Schmidhuber 1997) to capture sequential features.

In CameNN, we use the transformer encoder as the functional layer of the experts and the shared-bottom layer. Comparing with MMoE and MoSE, Table 4 shows that CameNN outperforms both on the CVR prediction task.
To demonstrate the effectiveness of CameNN in practical application, we use the first seven days of data as training data and the eighth day as testing data. We then run an online A/B test to compare CameNN against only the three baseline models that achieve the top three offline AUC scores, in order to limit the impact on the real-world business, and measure online CVR gain. As illustrated in Table 3, CameNN increases the online CVR by 0.63%, while the other baselines do not improve performance.
Conclusion

In this paper, we propose a multi-task recommendation model named Cross-modal Alignment with mixture experts Neural Network (CameNN) to model CVR prediction and image-text alignment. CameNN achieves significant improvement on CVR prediction in the intra-city retail industry. In particular, the ITA and TIA tasks help rectify incorrect image and text data of items, which reduces the misleading noise of dirty data. We adopt the transformer encoder as the basic block of the shared-bottom layer, the mixture-of-experts layer and the task tower layers to handle sequential data, benefiting from softmax gating networks that achieve conditional computation by activating only a subset of experts for each task. We show a successful and efficient large-scale online application of CameNN that improves the CVR of item recommendation. In the future, we will explore more potential tasks in this model.

References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Caruana, R. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning.
Christian, S.; Vincent, V.; Sergey, I.; and Alex, A. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261.
David, E.; MarcAurelio, R.; and Ilya, S. 2014. Learning Factored Representations in a Deep Mixture of Experts. arXiv preprint arXiv:1312.4314.
Dehong, G.; Linbo, J.; Ben, C.; Minghui, Q.; Peng, L.; Yi, W.; Yi, H.; and Hao, W. 2020. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
Girshick, R. 2013. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision.
Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: a factorization-machine based neural network for CTR prediction.
Guorui, Z.; Chengru, S.; Xiaoqiang, Z.; Ying, F.; Han, Z.; Xiao, M.; Yanghui, Y.; Junqi, J.; Han, L.; and Kun, G. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Heng-Tze, C.; Levent, K.; Jeremiah, H.; Tal, S.; Tushar, C.; Hrishi, A.; Glen, A.; Greg, C.; Wei, C.; Mustafa, I.; et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.
Juan, Y.; et al. 2016. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, 43–50.
Karen, S.; and Andrew, Z. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lian, J.; Zhou, X.; Zhang, F.; Chen, Z.; Xie, X.; and Sun, G. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1754–1763.
Long, D.; Trevor, C.; Steven, B.; and Paul, C. 2015. Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser.
Organization, W. H. 2020. Considerations for public health and social measures in the workplace in the context of COVID-19. Considerations in adjusting public health and social measures in the context of COVID-19.
Robert A, J.; et al. 1991. Adaptive mixtures of local experts. Neural Computation 3, 79–87.
Steffen, R. 2010. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, 995–1000. IEEE.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.