Distillation based Multi-task Learning: A Candidate Generation Model for Improving Reading Duration
Zhong Zhao, Yanmei Fu, Hanming Liang, Li Ma, Guangyao Zhao, Hongwei Jiang
Tencent Inc., Shenzhen, China
{zhongzhao,friedafu,meeloliang,listma,issaczhao,rockyjiang}@tencent.com
ABSTRACT
In feeds recommendation, the first step is candidate generation. Most candidate generation models are based on CTR estimation, which does not consider the user's satisfaction with the clicked item. Items with low quality but attractive titles (i.e., click baits) may be recommended to the user, which worsens the user experience. One solution to this problem is to model the click and the reading duration simultaneously under the multi-task learning (MTL) framework. There are two challenges in this modeling. The first is how to deal with the zero duration of the negative samples, which does not necessarily indicate dislike. The second is how to perform multi-task learning in a candidate generation model with a double tower structure that can only model a single task. In this paper, we propose a distillation based multi-task learning (DMTL) approach to tackle these two challenges. We model duration by considering its dependency on click in the MTL, and then transfer the knowledge learned by the MTL teacher model to the student candidate generation model by distillation. Experiments conducted on a dataset gathered from traffic logs of Tencent Kandian's recommender system show that the proposed approach outperforms the competitors significantly in modeling duration, which demonstrates the effectiveness of the proposed candidate generation model.
CCS CONCEPTS
• Information systems → Learning to rank.

KEYWORDS
multi-task learning, knowledge distillation, candidate generation, duration modeling, recommender system
ACM Reference Format:
Zhong Zhao, Yanmei Fu, Hanming Liang, Li Ma, Guangyao Zhao, Hongwei Jiang. 2021. Distillation based Multi-task Learning: A Candidate Generation Model for Improving Reading Duration. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11–15, 2021, Montreal, Quebec, Canada. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/1122445.1122456
SIGIR '21, July 11–15, 2021, Montreal, Quebec, Canada
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
Click-through rate (CTR) estimation is a widely adopted method for ranking in many recommender systems, and many deep learning based models have been proposed to estimate CTR in recent years [1, 3, 7, 9]. In feeds recommendation, the predicted CTR (pCTR) can reflect how likely the user is to click the item, but not how likely the user is to like the item after clicking and reading the content. For example, an item with low quality and an attractive title (i.e., a click bait) usually gets a high pCTR even though users never like it. Therefore, modeling CTR alone cannot ensure the user's satisfaction with the clicked item. To improve the user experience, reading duration should be modeled as well, which is of great importance in industrial applications such as feeds recommendation.

In this paper, we focus on reading duration modeling and its application to large-scale candidate generation for feeds recommendation. There are two main challenges in our real practice. The first is how to deal with the zero duration of the negative samples. The negative samples get zero duration simply because they are un-clicked, which does not necessarily indicate that the user dislikes the item. This is quite different from the positive samples with short duration, which do indicate dislike. Directly using the zero duration as the target for modeling may lead to inaccurate estimation. The second challenge is caused by the first. To solve the first challenge, multi-task learning is employed. However, it is difficult to perform multi-task learning in the candidate generation model. Most deep learning based candidate generation models adopt the double tower structure [2, 4]. They have a user tower and an item tower for computing the user vector and the item vector respectively, and use the user-item inner product as the similarity metric for ANN search to generate candidate items in a very efficient way.
Since the inner product can only model one single task, multi-task learning is difficult to apply directly to the candidate generation model. To our knowledge, few papers discuss duration modeling. In practice, the commonly used method is to model duration by regression as a single task, in which the duration of all negative samples is set to zero and squared loss is used. As mentioned before, fitting the duration of negative samples to zero treats the dislikes (short duration) and the un-clicks (zero duration, but not necessarily dislike) similarly, which may mislead the model training.

To tackle the challenges stated above, we propose a distillation based multi-task learning approach, which we refer to as DMTL, to model reading duration for candidate generation. We overcome the problems of the existing duration models by considering duration's dependency on click within a multi-task learning framework which simultaneously models CTR and CTCVR for the click task and the duration task respectively. Then, we use the distillation technique to transfer the knowledge learned by the multi-task model to the double tower candidate generation model, which gives the candidate generation model the ability to model reading duration while keeping its high efficiency in candidate generation.

To evaluate the performance of the proposed approach, we conducted experiments on a dataset gathered from the traffic logs of Tencent Kandian's recommender system. The results of the offline and online experiments show that the proposed approach outperforms the competing duration models significantly, which demonstrates the effectiveness of the proposed approach in modeling reading duration for candidate generation.
2 PROPOSED APPROACH
The purpose of candidate generation is to select hundreds or thousands of items that are relevant to the user's interests from the whole item corpus, which may contain millions or even billions of items. In this paper, the proposed DMTL improves the quality of candidate generation by modeling click and reading duration simultaneously, rather than modeling click only. For the click task, positive samples are clicked impressions, and negative samples are randomly selected from all items according to their frequency of being clicked. This is quite different from a ranking model, which uses clicked impressions as positive samples and un-clicked impressions as negative samples. For the duration task, positive samples are clicked impressions with duration of more than 50 seconds (i.e., the median of all durations), and the rest are negative samples.

Let $u_i$ and $v_i$ be the user features and item features respectively, both of which are usually concatenations of embeddings of multiple fields, and let $x_i$ be the concatenation of $u_i$, $v_i$ and other dense features. Let $y_i$ be the label of the click task, with $y_i = 1$ indicating the item is clicked and $y_i = 0$ indicating the item is randomly selected. Let $z_i$ be the label of the duration task, with $z_i = 1$ indicating the item has been read for more than 50s, and $z_i = 0$ otherwise. The duration modeling problem can be formulated as estimating the probability of $z_i = 1$ given $x_i$, i.e., $p(z_i = 1 | x_i)$. As mentioned before, $z_i$ depends on $y_i$, since $y_i = 0$ forces $z_i = 0$. To better model this probability, we make use of the dependency between click and reading. Specifically, $p(z_i = 1 | x_i)$ can be rewritten as

$$p(z_i = 1 | x_i) = p(y_i = 1 | x_i) \, p(z_i = 1 | y_i = 1, x_i) \quad (1)$$

where $p(y_i = 1 | x_i)$ is the predicted click-through rate (pCTR), $p(z_i = 1 | y_i = 1, x_i)$ is the predicted conversion rate (pCVR), and $p(z_i = 1 | x_i)$ is the predicted click-through and conversion rate (pCTCVR).
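To make the factorization concrete, here is a minimal sketch (ours, not the paper's code) that computes pCTCVR from hypothetical pCTR and pCVR values per Eq. (1):

```python
import numpy as np

# Eq. (1): p(z=1|x) = p(y=1|x) * p(z=1|y=1,x), i.e. pCTCVR = pCTR * pCVR.
# The probabilities below are made-up values for three user-item pairs.
pctr = np.array([0.10, 0.40, 0.05])   # p(click)
pcvr = np.array([0.50, 0.20, 0.90])   # p(long read | click)
pctcvr = pctr * pcvr                  # p(click AND long read): [0.05, 0.08, 0.045]
```

Note that pCTCVR can never exceed pCTR: a long read requires a click first, which is exactly the dependency the decomposition encodes.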
To reduce the influence of selection bias and data sparsity when modeling duration, we adopt the approach proposed in ESMM [6], which fits CTR and CTCVR simultaneously under the multi-task learning framework. In our model, the click task and the duration task fit the CTR and the CTCVR respectively by minimizing the binary cross entropy.

We employ the multi-task learning framework MMoE [5, 8] to model CTR and CVR. Let $f_k$ be the $k$-th expert network, which is usually a DNN, and $f_k(x_i)$ be the output vector of the $k$-th expert. For the modeling of CTR, the gate is computed as $g_c(x_i) = [g_{c1}(x_i), \cdots, g_{cK}(x_i)]$, where $g_c(\cdot)$ is the gate function defined as $g_c(x_i) = softmax(W_c x_i)$ with $W_c$ being a trainable matrix, $K$ is the number of experts, and $g_{ck}(x_i)$ is the $k$-th element of $g_c(x_i)$. The output of the experts for modeling CTR is computed as

$$e_c(x_i) = \sum_{k=1}^{K} g_{ck}(x_i) f_k(x_i) \quad (2)$$

For the modeling of CVR, the gate function $g_d(\cdot)$ with trainable parameter matrix $W_d$ is defined similarly, and the output of the experts for modeling CVR is computed as

$$e_d(x_i) = \sum_{k=1}^{K} g_{dk}(x_i) f_k(x_i) \quad (3)$$

where $g_{dk}(x_i)$ is the $k$-th element of $g_d(x_i)$. The pCTR and pCVR for sample $x_i$ are modeled as

$$p_{ctr}(x_i, \theta_t) = sigmoid(h_c(e_c(x_i))) \quad (4)$$

$$p_{cvr}(x_i, \theta_t) = sigmoid(h_d(e_d(x_i))) \quad (5)$$

where $h_c(\cdot)$ and $h_d(\cdot)$ are DNNs that map $e_c(x_i)$ and $e_d(x_i)$ to the logits of pCTR and pCVR respectively, and $\theta_t$ denotes all trainable parameters in the above formulations.
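The MMoE forward pass of Eqs. (2)–(5) can be sketched in NumPy as follows. For brevity the experts and task heads are linear maps rather than the DNNs used in the paper, and all weights are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, K, H = 8, 2, 4                     # input dim, num experts, expert output dim
x = rng.normal(size=(3, D))           # batch of 3 samples x_i

W_experts = rng.normal(size=(K, D, H))                       # linear stand-ins for f_k
W_c, W_d = rng.normal(size=(D, K)), rng.normal(size=(D, K))  # gate weights
w_hc, w_hd = rng.normal(size=H), rng.normal(size=H)          # linear stand-ins for h_c, h_d

f = np.einsum('nd,kdh->nkh', x, W_experts)   # expert outputs f_k(x_i)
g_c = softmax(x @ W_c)                       # click gate g_c(x_i)
g_d = softmax(x @ W_d)                       # duration gate g_d(x_i)
e_c = np.einsum('nk,nkh->nh', g_c, f)        # Eq. (2): gated mixture for CTR
e_d = np.einsum('nk,nkh->nh', g_d, f)        # Eq. (3): gated mixture for CVR
pctr = sigmoid(e_c @ w_hc)                   # Eq. (4)
pcvr = sigmoid(e_d @ w_hd)                   # Eq. (5)
pctcvr = pctr * pcvr                         # Eq. (6)
```

Both tasks share the same experts `f`; only the gates and heads are task-specific, which is the point of the multi-gate mixture-of-experts design.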
According to (1), (4) and (5), the pCTCVR can be written as

$$p_{ctcvr}(x_i, \theta_t) = p_{ctr}(x_i, \theta_t) \, p_{cvr}(x_i, \theta_t) \quad (6)$$

As modeling reading duration amounts to fitting the CTCVR, the loss of the duration task is the following binary cross entropy

$$L_d(\theta_t) = -\sum_{i=1}^{N} z_i \log p_{ctcvr}(x_i, \theta_t) + (1 - z_i) \log(1 - p_{ctcvr}(x_i, \theta_t)) \quad (7)$$

Equations (6) and (7) model the dependency between click and reading by introducing $p_{ctr}(x_i, \theta_t)$ to compute $p_{ctcvr}(x_i, \theta_t)$. However, only fitting $p_{ctcvr}(x_i, \theta_t)$ to the CTCVR cannot ensure that $p_{ctr}(x_i, \theta_t)$ fits the CTR. Therefore, we need an auxiliary task to make sure that $p_{ctr}(x_i, \theta_t)$ fits the CTR. The loss function of this auxiliary click task is formulated as

$$L_c(\theta_t) = -\sum_{i=1}^{N} y_i \log p_{ctr}(x_i, \theta_t) + (1 - y_i) \log(1 - p_{ctr}(x_i, \theta_t)) \quad (8)$$

By summing (7) and (8), we get the multi-task learning loss function for the duration model as

$$L_{teacher}(\theta_t) = w_1 L_d(\theta_t) + w_2 L_c(\theta_t) \quad (9)$$

where $w_1$ and $w_2$ are the weights for each loss.

In most deep learning based candidate generation models, the double tower structure is employed to compute user vectors and item vectors, where the item vectors are used to build the item index. For a given user vector, the user-item inner product is used as the similarity for ANN search in the item index, and the top-k items are returned as candidate items. However, such candidate generation models are unable to model duration by multi-task learning, since the inner product can only model one task. To give the candidate generation model the extra ability of modeling reading duration within its highly efficient double tower structure, we use the distillation technique to transfer the knowledge learned by the MTL model described above to the candidate generation model.
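The teacher objective of Eqs. (7)–(9) is just two binary cross entropies, one on pCTCVR against the duration label and one on pCTR against the click label. The following sketch uses illustrative placeholder probabilities and labels, not model outputs:

```python
import numpy as np

def bce(p, t):
    # Mean binary cross entropy; eps guards against log(0).
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

pctr = np.array([0.8, 0.3, 0.6, 0.1])     # hypothetical pCTR outputs
pcvr = np.array([0.7, 0.5, 0.2, 0.4])     # hypothetical pCVR outputs
y    = np.array([1.0, 0.0, 1.0, 0.0])     # click labels y_i
z    = np.array([1.0, 0.0, 0.0, 0.0])     # long-read labels z_i (z=1 implies y=1)
w1 = w2 = 1.0                             # loss weights w_1, w_2

# Eq. (7) on pCTCVR = pCTR * pCVR, plus Eq. (8) on pCTR, combined per Eq. (9).
loss_teacher = w1 * bce(pctr * pcvr, z) + w2 * bce(pctr, y)
```

Note that pCVR itself is never compared to a label: it is supervised only through the pCTCVR product, which is the ESMM trick for avoiding selection bias over the clicked subset.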
Figure 1: Network structure of the proposed distillation based multi-task learning model. The teacher model (left) is a multi-task learning model which models reading duration. It considers the dependency of click and reading by minimizing the CTR loss and the CTCVR loss simultaneously. The student model (right) is a candidate generation model with double tower structure. Knowledge of the teacher model is transferred to the student model by distillation so that the student model can obtain the ability to model reading duration while keeping its high efficiency for candidate generation.
The proposed candidate generation model uses the double tower structure and computes the user vector and the item vector by DNNs. Let $R(u_i)$ and $S(v_i)$ be the user vector and the item vector respectively, where $R(\cdot)$ and $S(\cdot)$ are the DNNs that map input embeddings to output vectors. Given $R(u_i)$ and $S(v_i)$, the CTCVR predicted by the candidate generation model can be formulated as

$$p(z_i = 1 | R(u_i), S(v_i), \theta_s) = sigmoid(R(u_i)^T S(v_i)) \quad (10)$$

where $R(u_i)^T S(v_i)$ is the inner product of $R(u_i)$ and $S(v_i)$, and $\theta_s$ denotes the trainable parameters in $R(\cdot)$ and $S(\cdot)$. We expect $p(z_i = 1 | R(u_i), S(v_i), \theta_s)$ to be as close to $p_{ctcvr}(x_i, \theta_t)$ as possible, so that we can use (10) to estimate the CTCVR accurately while keeping the high efficiency of the candidate generation model. To this end, we treat the multi-task learning model (9) as the teacher model and the double tower candidate generation model (10) as the student model, and use distillation to transfer knowledge from the teacher model to the student model. The loss of the distillation can be formulated as the following KL-divergence

$$L_{student}(\theta_s) = \sum_{i=1}^{N} p_{ctcvr}(x_i, \theta_t) \log \frac{p_{ctcvr}(x_i, \theta_t)}{p(z_i = 1 | R(u_i), S(v_i), \theta_s)} + (1 - p_{ctcvr}(x_i, \theta_t)) \log \frac{1 - p_{ctcvr}(x_i, \theta_t)}{1 - p(z_i = 1 | R(u_i), S(v_i), \theta_s)} \quad (11)$$

By summing the loss of the teacher model and the student model, we obtain the loss of the distillation based multi-task learning model as follows

$$L(\theta_t, \theta_s) = L_{teacher}(\theta_t) + L_{student}(\theta_s) \quad (12)$$

To prevent the teacher model from being influenced by the student model during training, the parameters of the student model are separated from those of the teacher model, and the pCTCVR of the teacher model is frozen when computing $L_{student}(\theta_s)$. Therefore, minimizing loss (12) is equivalent to minimizing the teacher loss and the student loss alternately.
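A sketch of the student side, with random placeholder vectors standing in for tower outputs: the inner product of Eq. (10) is squashed to a probability and pulled toward a frozen teacher pCTCVR with the binary KL divergence of Eq. (11):

```python
import numpy as np

eps = 1e-7
rng = np.random.default_rng(1)

R_u = rng.normal(size=(4, 16))    # user vectors R(u_i), placeholder tower outputs
S_v = rng.normal(size=(4, 16))    # item vectors S(v_i), placeholder tower outputs

# Eq. (10): student pCTCVR from the user-item inner product.
p_student = 1.0 / (1.0 + np.exp(-np.sum(R_u * S_v, axis=1)))

# Frozen teacher pCTCVR (hypothetical values; no gradient flows to the teacher).
p_teacher = np.array([0.6, 0.1, 0.3, 0.8])

# Eq. (11): binary KL divergence between teacher and student distributions.
kl = (p_teacher * np.log((p_teacher + eps) / (p_student + eps))
      + (1 - p_teacher) * np.log((1 - p_teacher + eps) / (1 - p_student + eps)))
loss_student = kl.sum()
```

Because the teacher probabilities are treated as constants, minimizing this loss only moves the student towers, matching the alternating optimization described above.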
In the inference stage, we only use the student model to compute user vectors and item vectors: the item vectors are used to build the index, the user vector is used as the query, and ANN search is performed to fetch the top-k candidate items from the index for the user. The network structure and the training/serving framework of the proposed model are illustrated in Figure 1.

3 EXPERIMENTS
Experiments are conducted on a dataset gathered from the traffic logs of Tencent Kandian's feeds recommender system. The dataset has billions of training samples and millions of test samples. For each user, the positive samples are the clicked impressions, and the negative samples are randomly selected from all items according to their frequency of being clicked. Each clicked impression has a positive reading duration, and each randomly selected item has a zero duration.
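For context, the serving path described for the student model (offline item index, top-k inner-product search per user query) can be sketched with brute-force search standing in for the ANN index; all vectors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
item_vecs = rng.normal(size=(1000, 32))   # S(v_j) for the whole corpus, built offline
user_vec = rng.normal(size=32)            # R(u) computed for one request

scores = item_vecs @ user_vec             # inner-product similarity to every item
k = 5
topk = np.argpartition(-scores, k)[:k]    # k best candidates, unordered
topk = topk[np.argsort(-scores[topk])]    # sort the k candidates by score, best first
```

In production an approximate index (e.g. a quantized or graph-based ANN structure) replaces the exhaustive matrix product, which is what keeps double tower retrieval cheap at corpus scale.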
We compare the proposed DMTL to the existing candidate generation models. The competitors are listed as follows.
DSSM-Regression: user vectors and item vectors are computed by DNNs, and their inner product regresses the reading duration with squared loss. The duration of the negative samples is treated as zero for training.

DSSM-Classification: user vectors and item vectors are computed by DNNs, and their inner product is used to compute the binary cross entropy for training a classification model, where positive samples are clicked impressions with reading duration of more than 50s and the rest are negative samples.

DSSM-Click: user vectors and item vectors are computed by DNNs, and their inner product is used to compute the binary cross entropy for training a classification model, where positive samples are clicked impressions and the rest are negative samples.
We compare all approaches by evaluating their performance on a binary classification task, in which positive samples are clicked impressions with duration larger than 50s and the rest are negative samples. The area under the ROC curve (AUC) is used as the evaluation metric; a higher AUC indicates a stronger ability to model reading duration.
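The AUC used here can be illustrated with a small rank-based implementation (a sketch for intuition, not the evaluation code used in the paper): it is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties broken arbitrarily.

```python
import numpy as np

def auc(scores, labels):
    # Rank every sample by score, then count how often positives outrank
    # negatives via the rank-sum identity (Mann-Whitney U statistic).
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
perfect = auc(scores, labels)   # 1.0: every positive outranks every negative
```

A model with no ranking ability scores 0.5 on this metric, which is why the gaps between the DSSM variants in Table 1 are meaningful.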
For the teacher model, the experts are DNNs with hidden layer size × × and the number of experts is 2. The DNN for each task is × . For the student model, the hidden layer size for each tower is × × . For both models, the embedding size for the categorical variables is 30.

Table 1 shows the performance of each candidate generation model. The regression model performs the worst among all models, which may be due to its directly fitting the large number of zero durations. The classification model is a little better than the regression model but still performs much worse than the proposed DMTL. This is due to the lack of modeling the dependency between click and reading, which leads to confusion between un-clicks and short durations. Compared to the duration DSSM, the click DSSM performs much better, which indicates the importance of click to duration. As the occurrence of duration depends on click, modeling duration without considering the dependency on click may miss a lot of important information during training. Among all models, the proposed DMTL achieves the highest AUC and significantly outperforms the competing methods. The improvement is attributed to the knowledge distilled from the teacher model, which models duration in a more reasonable way that considers the dependency on click.
Table 1: Offline performance of different duration models.

models                  AUC
DSSM-Regression         0.7374
DSSM-Classification     0.7562
DSSM-Click              0.9359
DMTL
Online A/B tests are also conducted to compare the proposed DMTL to the competing candidate generation models. For the online experiment, we only change the candidate generation step by using different candidate generation models and keep all other steps unchanged. The online evaluation metric is the average reading duration, defined as $T/M$, where $T$ is the total sum of all reading durations (in seconds) and $M$ is the total number of impressions. Table 2 shows the online performance of the different models. DSSM-Regression and DSSM-Classification perform much worse than DSSM-Click and DMTL, which is consistent with the result of the offline experiment. Directly modeling duration without modeling its preceding step (click) may lead to inaccurate estimation, which causes the returned candidate items to be less relevant to the user's interests. The proposed DMTL overcomes this problem, and thus achieves the best performance among the competing methods.

Table 2: Online performance of different duration models.

models                  average reading duration (s)
DSSM-Regression         2.73
DSSM-Classification     1.74
DSSM-Click              3.93
DMTL
4 CONCLUSION
In this paper, we proposed a distillation based multi-task learning approach for modeling reading duration in the candidate generation stage. The teacher model is a multi-task learning model which models reading duration by using ESMM to consider the dependency between click and reading. The student model is a DSSM candidate generation model with double tower structure. Knowledge distillation is employed to give the DSSM the ability to model reading duration while keeping its high efficiency in candidate generation. Offline and online experiments conducted on a real-world dataset demonstrated the effectiveness of the proposed approach in modeling reading duration for candidate generation. The proposed approach can easily be generalized to other scenarios in which there are multiple tasks with dependency. In the future, we will study how to develop multi-task learning models for the case when only part of the tasks are related, and how to fuse the output scores for distillation.
REFERENCES
[1] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[2] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[3] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[4] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[5] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1930–1939.
[6] Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
[7] Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H. Chi. 2020. DCN-M: Improved Deep & Cross Network for Feature Cross Learning in Web-scale Learning to Rank Systems. arXiv preprint arXiv:2008.13535 (2020).
[8] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems. 43–51.
[9] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.