Decomposed Generation Networks with Structure Prediction for Recipe Generation from Food Images
Hao Wang, Guosheng Lin, Steven C. H. Hoi,
Fellow, IEEE and Chunyan Miao
Abstract—Recipe generation from food images and ingredients is a challenging task, which requires interpreting information from another modality. Different from the image captioning task, where the captions usually have one sentence, cooking instructions contain multiple sentences and have obvious structures. To help the model capture the recipe structure and avoid missing cooking details, we propose a novel framework, Decomposed Generation Networks (DGN) with structure prediction, to obtain more structured and complete recipe generation outputs. Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase. Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure. Extensive experiments on the challenging large-scale Recipe1M dataset validate the effectiveness of our proposed model DGN, which improves the performance over the state-of-the-art results.
Index Terms —Structure Learning, Text Generation, Image-to-Text.
I. INTRODUCTION
Since food is closely related to people's daily life, food-related research, such as food image recognition [1], [2], cross-modal food retrieval [3], [4], [5] and recipe generation [6], [7], [8], has attracted great interest recently. From a technical perspective, jointly understanding multi-modal food data [3], including food images and recipes, remains an open research task. In this paper, we approach the problem of generating cooking instructions (recipes) conditioned on food images and ingredients.

Cooking instructions are a kind of procedural text, constructed step by step in a fairly regular format. For example, as shown in Figure 1, the cooking instructions are composed of several sentences, and each sentence starts with a verb in most cases. Apart from dividing the cooking instructions by sentences, we may also split them into more general phases, which represent the global structures of the cooking recipes. When people start cooking, they may first decompose the cooking procedure into some basic phases, e.g. pre-process the ingredients, cook the main dish, etc., and then focus on the details, such as determining which ingredients to use. While this coarse-to-fine reasoning is trivial for humans, most algorithms do not have the capacity to reason about the phase information contained in a static food image [6]. Therefore, it is important to guide the model to be aware of the global structure of the recipe during generation; otherwise the generation outputs can hardly cover all the cooking details [7].

Hao Wang, Guosheng Lin and Chunyan Miao are with the School of Computer Science and Engineering, Nanyang Technological University; e-mail: {hao005,gslin,ascymiao}@ntu.edu.sg. Steven C. H. Hoi is with Singapore Management University; e-mail: [email protected].

Recently, several food datasets have been proposed for recipe generation, such as YouCook2 [9], Storyboarding [8] and Recipe1M [3]. The first two datasets both include an image sequence along with its corresponding textual descriptions; the image sequence is a concise series of frames unfolded from cooking videos, so the model can obtain explicit instruction structures from it. By contrast, Recipe1M remains more challenging, since it only contains static cooked food images. It is hard to obtain large-scale instructional video data in the real world, and sometimes we want to know the exact recipe for a cooked food image. Therefore, we believe that generating cooking instructions from one single food image is of more value than producing instructions from an image sequence.

Given the previously stated reasons, we choose the large-scale Recipe1M dataset [3] to implement our methods. Our goal is to capture the global structure of the recipe and to generate the instructions from one single image with a list of ingredients. The basic idea is that we (i) assemble some consecutive steps to form a phase, (ii) assign suitable sub-generators to produce certain instruction phases, and (iii) concatenate the phases together to form the final recipes. We propose a novel framework of
Decomposed Generation Networks (DGN) with global structure prediction, to achieve this coarse-to-fine reasoning. Figure 2 shows the pipeline of the framework. Specifically, DGN is composed of two components, i.e. the global structure prediction component and the sub-generator output component. To obtain the global structure of the cooking instruction, we input image and ingredient representations into the global structure prediction component, and get the sub-generator selections as well as their orders. Then, in the sub-generator output component, we adopt an attention mechanism to get the phase-aware features. The phase-aware features are designed for different sub-generators and help the sub-generators produce better instruction phases.

We have conducted extensive experiments on the large-scale Recipe1M dataset, and evaluated the recipe generation results with different evaluation metrics. We find that our proposed model DGN outperforms the state-of-the-art methods.

II. RELATED WORK
A. Food Computing
Our work is closely related to food computing [10], which utilizes computational methods to analyze food data, including food images and recipes.
With the development of social media and mobile devices, more and more food data become available on the Internet. The UEC Food100 dataset [1] and ETHZ Food-101 dataset [2] were proposed for the food recognition task; these two datasets are restricted in the variety of data types and only contain food images of different categories. The YouCook2 dataset was proposed by Zhou et al. [9], which contains cooking video data; they focused on generating cooking instruction steps from video segments in YouCook2. The later work [8] proposed a new food dataset, Storyboarding, where each food data item has multiple images aligned with instruction steps; in their work, they proposed to utilize a scaffolding structure for the model representations. Besides, Bosselut et al. [6] generated recipes based on text, reasoning about causal effects that are not mentioned in the surface strings; they achieved this with memory architectures through dynamic entity tracking and obtained a better understanding of procedural text.

In order to better model the relationship between recipes and food images, Recipe1M [3] has been proposed to provide richer food image, cooking instruction, ingredient, and semantic food-class information. Recipe1M contains large amounts of image-recipe pairs, which can be applied to the cross-modal food retrieval task [3], [4], [5] and the recipe generation task [7]. Salvador et al. [7] focused more on the ingredient prediction task. For instruction generation, they generated the whole cooking instructions from given food images and ingredients through a single decoder directly, which may cause some cooking details to be missing in some cases.

It is worth noting that, to the best of our knowledge, [7] is the only prior work on the recipe generation task for the Recipe1M dataset. Our DGN approach improves the recipe generation performance by introducing the decomposing idea into the generation process, and the proposed method can therefore be applied to many general models. We will demonstrate the details in Section IV.

Fig. 1. Illustration of the Decomposed Generation Networks (DGN) for recipe generation. Instead of producing instructions directly from the image and ingredient embedding [7], we first predict the instruction structure and choose different generators to match the cooking phases. We then combine the outputs of the selected sub-generators to get the final generated recipes.
B. Text Generation
Text generation is a widely researched task, which can take various input types as source information. Machine translation [11], [12] is one of the representative text-based generation tasks, in which the decoder takes text in one language as input and outputs sentences in another language. Image-based text generation involves both vision and language, such as image captioning [13], [14], [15] and visual question answering [16], [17]. To be specific, image captioning aims to generate suitable descriptions for given images, and the goal of visual question answering is to answer questions accompanied by an image and text. In this paper, we address the challenging recipe generation problem, which produces a long procedural text conditioned on the image and text (ingredients).

Text generation related tasks have been accelerated by new state-of-the-art models such as the Transformer [12] and BERT [18], which are attention-based. Many recent works achieve superior performance with attention-based models [19], [20], [21]. In our work, we compare the results of using the pre-trained BERT [18] and a normal embedding layer [7] as the ingredient encoder.
C. Neural Module Networks
The idea of using neural module networks to decompose neural models has been proposed for several language-vision tasks, such as visual question answering [22], image captioning [19] and visual reasoning [23]. Neural module networks have good capabilities to capture structured knowledge representations of input images or sentences. In general, since image layouts and questions are obviously structured, much prior research [22], [19], [23] focused on constructing better encoders with neural modules. To produce a coherent story for an image in MS COCO [24], Krause et al. [25] decomposed both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to generate topic vectors with their corresponding sentences, but they generated different paragraph parts with the same decoder.

In food data [3], the cooking instructions tend to be very structured as well. To generate recipes with better structures, we employ different sub-generators to produce different phases of the cooking instructions.

III. METHOD
A. Overview
In Figure 2, we show the training flow of DGN. It is observed that cooking instructions have obvious structures and clear formats; most cooking instruction sentences in the Recipe1M dataset [3] start with a verb, e.g. heat, combine, pierce, etc. However, how to automatically divide recipes into phases remains an open NLP problem. Therefore, we use a pre-defined rule to segment the recipes. Specifically, we split each instruction into up to three phases and try to ensure each phase has an equal number of sentences, where one or more cooking steps (sentences) map to one phase. This recipe segmentation rule is based on the intuition that having more recipe phases would result in looser cooking step clustering and consequently fail to form the hierarchy between cooking phases and steps. In the example shown in Figure 2, the recipe for the roasted chicken has five steps in total, which are mapped to three phases.

After we obtain the phase segmentation of the recipes, we need to determine which sub-generators will be selected to generate the corresponding phases. We use k-means clustering to assign pseudo labels to each recipe phase. Specifically, we first extract all the verbs in the recipes with spaCy [26], a Natural Language Processing (NLP) tool. Then, we compute the mean verb representation of each phase, which is regarded as the representation of that phase. After that, we use k-means clustering to obtain pseudo labels for the phases, which indicate the selections of sub-generators; a code sketch of this pre-processing is given at the end of this subsection. The number of sub-generator categories N is a hyper-parameter; we experiment with different N and show the results in Table III. The pseudo labels Generators = {g_1, ..., g_k} represent the different sub-generator selections.

Figure 2 provides an overview of our proposed model, which is composed of the global structure prediction component and the sub-generator output component. Our model takes food images and their corresponding ingredients as input. It uses several sub-generators for different recipe phases, allowing the sub-generators to focus on different clustered recipe phases. ResNet-50 [27] pretrained on ImageNet [28] and the BERT [18] model implemented by [29] are used to encode food images and ingredients respectively, giving the global image and ingredient representations F_img and F_ingr. These global representations are fed into the global structure prediction component to decide which sub-generators will be selected as well as their orders. To enable interactions among sub-generators, the global structure prediction component also produces a P-dimensional phase vector F_phase for each sub-generator. We then split the target instructions into phases and assign a different one-hot position vector v_p to each phase, which is transformed into a P-dimensional position representation F_pos through a linear layer. With the encoded features F_img, F_ingr, F_phase and F_pos, we fuse them together and obtain the phase-aware features r_i ∈ R^P for sub-generator g_i.
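As a concrete illustration of the pre-processing described above (recipe segmentation and pseudo-label assignment), the following is a minimal sketch under our assumptions; it is not the authors' released code, and the spaCy model name, the equal-size splitting rule and the variable names are illustrative choices.

```python
import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_md")  # assumed model: one with word vectors is needed for token.vector

def split_into_phases(sentences, max_phases=3):
    """Group instruction sentences into at most `max_phases` phases of roughly equal size."""
    n_phases = min(max_phases, len(sentences))
    return [list(chunk) for chunk in np.array_split(sentences, n_phases)]

def phase_representation(phase_sentences):
    """Mean word vector of all verbs in a phase (zero vector if no verb is found)."""
    doc = nlp(" ".join(phase_sentences))
    verbs = [token.vector for token in doc if token.pos_ == "VERB"]
    return np.mean(verbs, axis=0) if verbs else np.zeros(nlp.vocab.vectors_length)

def assign_pseudo_labels(all_recipes, n_generators=3):
    """Cluster every phase in the corpus; the cluster id is the phase's sub-generator pseudo label."""
    phases = [p for recipe in all_recipes for p in split_into_phases(recipe)]
    features = np.stack([phase_representation(p) for p in phases])
    labels = KMeans(n_clusters=n_generators, random_state=0).fit_predict(features)
    return phases, labels
```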
B. Global Structure Prediction Component

Since the cooking instructions are divided into phases, the global structure prediction component not only needs to decide which generators are selected for each phase, but is also required to predict the order of the chosen sub-generators. To achieve this goal, we stack transformer blocks [12] to construct the global structure prediction component. The last transformer block is followed by a linear layer and a softmax activation to produce the prediction for each step. We set the hidden size H = 512, the number of heads n_head = 8 and the number of stacked layers n_layer = 4, and generate the sub-generator label sequence {y_1, ..., y_k}.

To be specific, each transformer block contains two sub-layers with layer normalization, where the first one employs the multi-head self-attention mechanism and the second one attends to the model's conditional inputs to enhance the self-attention output. The attention outputs can be computed as [12]

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V,    (1)

where the input consists of queries Q and keys K of dimension d_k, and values V of dimension d_v. We also adopt the multi-head attention mechanism [12], which linearly maps Q, K and V with different learned projections. The projected results are processed in parallel and concatenated to obtain better output values.
Fig. 2. Decomposed Generation Networks with global structure prediction (DGN): we take food images and the corresponding ingredients as model inputs, and obtain the image and ingredient embeddings F_img and F_ingr through a pre-trained CNN image model and the language model BERT respectively. After that, the model is split into two branches, i.e. the global structure prediction component and the sub-generator output component, both constructed with transformer blocks. The global structure prediction component produces the sub-generator selections and their orders for the following branch. The sub-generator output component fuses F_img, F_ingr, the position representations F_pos and the phase vector F_phase to obtain the input of each sub-generator, and produces the different phases of the recipe.
The multi-head attention is defined as

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h) W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),    (2)

where the projections are parameter matrices W_i^{Q} \in R^{d_{model} \times d_k}, W_i^{K} \in R^{d_{model} \times d_k}, W_i^{V} \in R^{d_{model} \times d_v} and W^{O} \in R^{h d_v \times d_{model}}.

We take the global context vectors {F_img, F_ingr} and the target recipe phase labels g = {[START], g_1, ..., g_k} as inputs when training the model. We first map the discrete labels to a sequence of continuous representations Z. The model generates the output sequence {y_1, ..., y_k} one element at a time. The target sequence embedding Z is first fed into the model and processed with multi-head self-attention layers:

H^{attn}_{self} = \mathrm{MultiHead}(Z, Z, Z),    (3)

We further concatenate the context vectors {F_img, F_ingr} to obtain the conditional vector F_kv, which is attended to refine the previous self-attention outputs H^attn_self:

H^{attn}_{cond} = \mathrm{MultiHead}(H^{attn}_{self}, F_{kv}, F_{kv}),    (4)

H^attn_cond is the final attention output of each phase, which is also used as the phase vector F_phase for the sub-generator output component. We transform H^attn_cond into H^attn_cond' with a linear layer for output token generation. The dimension of H^attn_cond' is identical to the number of sub-generator categories N, and the probabilities of the generated tokens are p_gen = softmax(H^attn_cond'). Therefore, the final output tokens of the global structure prediction component are y_i = argmax(p_gen). We train the global structure prediction component with the cross-entropy loss L_pre:

L_{pre} = \sum_{i=1}^{S} \ell_{cross-entropy}(p^{gen}_{i}, g_i),    (5)

where S is the number of instruction phases.
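A minimal PyTorch sketch of this component is given below, using the library's built-in transformer decoder layers as a stand-in for the blocks described above (self-attention over the label embeddings followed by attention over the conditional vector F_kv). The class and variable names are ours, and treating F_img and F_ingr as a two-token memory is an interpretation rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GlobalStructurePredictor(nn.Module):
    """Predicts the sub-generator label sequence {y_1, ..., y_k} from (F_img, F_ingr)."""

    def __init__(self, n_generators, d_model=512, n_head=8, n_layer=4):
        super().__init__()
        self.label_emb = nn.Embedding(n_generators + 1, d_model)  # +1 for the [START] label (assumed)
        layer = nn.TransformerDecoderLayer(d_model, n_head, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layer)
        self.out = nn.Linear(d_model, n_generators)

    def forward(self, label_seq, f_img, f_ingr):
        # Conditional vector F_kv: image and ingredient features stacked as a 2-token memory.
        f_kv = torch.stack([f_img, f_ingr], dim=1)                 # (B, 2, d_model)
        z = self.label_emb(label_seq)                              # (B, S, d_model), input of Eq. (3)
        causal = torch.triu(torch.full((z.size(1), z.size(1)), float("-inf"),
                                       device=z.device), diagonal=1)
        h = self.decoder(z, f_kv, tgt_mask=causal)                 # H^attn_cond, reused as F_phase
        return self.out(h), h                                      # logits for p_gen, phase vectors

# Training with Eq. (5): feed g[:, :-1], predict g[:, 1:].
# logits, _ = model(g[:, :-1], f_img, f_ingr)
# loss_pre = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)), g[:, 1:].reshape(-1))
```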
C. Sub-Generator Output Component

The sub-generator output component uses the different sub-generators predicted by the global structure prediction component to produce the individual phases of the recipe, and concatenates them together to form the final cooking instruction. We stack transformer blocks to construct the generators; some of the blocks are shared across sub-generators, and the rest are independent blocks belonging to each generator. The reason for using shared blocks is that the model may overfit to the limited training data and fail to generalize if we adopt entirely independent blocks for each sub-generator.

We utilize each predicted sub-generator to produce one recipe phase, which requires each generator input to be sufficiently discriminative and informative. Therefore, we incorporate rich sources of feature representations, i.e. the food image features F_img, the ingredient features F_ingr, the position representations F_pos and the phase vector F_phase (H^attn_cond) produced by the global structure prediction component. F_img provides the model with generation content from the food image, which belongs to a different modality, and F_ingr indicates the ingredients contained in the recipe, which can be reused in the generated cooking instructions. To make the model aware of the generation phase, we fuse in the recipe phase position representations F_pos. F_phase is incorporated to enhance the interactions among different sub-generators and helps the model adapt to different generation phases.

The above four representations are fused together to obtain the phase-aware features r = ⟨F_img, F_ingr, F_pos, F_phase⟩, which are the inputs of the sub-generators. We adopt two different ways to achieve this. The first one simply concatenates these representations, giving r_cat. In the second way, we use an attention mechanism to make F_img and F_ingr attend to the concatenated embedding cat(F_pos, F_phase). Specifically, we apply a projection matrix to cat(F_pos, F_phase) to obtain the attention maps for F_img and F_ingr; the image and ingredient attention outputs can be formulated as

F^{attn}_{img} = \mathrm{softmax}(W(\mathrm{cat}(F_{pos}, F_{phase}))) \, F_{img},
F^{attn}_{ingr} = \mathrm{softmax}(W(\mathrm{cat}(F_{pos}, F_{phase}))) \, F_{ingr},    (6)

The final attended phase-aware feature r_attn is the concatenation of F^attn_img and F^attn_ingr. We also apply an additional position classifier with loss L_pos on r to ensure that it contains phase position information.

We also need to input the target instruction captions t = {[START], t_1, t_2, ..., t_m} for training the Transformer [12] generators, and map them to a continuous representation C. As described in Section III-B, we utilize the attention mechanism within the transformer blocks:

F^{attn}_{self} = \mathrm{MultiHead}(C, C, C),    (7)
F^{attn}_{cond} = \mathrm{MultiHead}(F^{attn}_{self}, r, r),    (8)

We use F^attn_cond to generate the tokens through a linear layer and a softmax activation, obtaining the output probabilities p_token over the candidate tokens. For each sub-generator, we compute the training loss as

L_{gen} = \sum_{i=1}^{M} \ell_{cross-entropy}(p^{token}_{i}, t_i),    (9)

where M is the number of tokens in the target phase.
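The attention-based fusion of Eq. (6) can be sketched as below. This is a minimal interpretation in which the softmax output acts as per-dimension weights on the feature vectors and each modality has its own projection; whether the authors share a single projection matrix or attend over spatial locations instead is not specified, so those choices are assumptions.

```python
import torch
import torch.nn as nn

class PhaseAwareFusion(nn.Module):
    """Builds r_attn = [F_img^attn ; F_ingr^attn] from (F_img, F_ingr, F_pos, F_phase), cf. Eq. (6)."""

    def __init__(self, d_feat=512, d_phase=512):
        super().__init__()
        # cat(F_pos, F_phase) has dimension 2 * d_phase (both are P-dimensional in the paper).
        self.w_img = nn.Linear(2 * d_phase, d_feat)   # projection for the image attention map
        self.w_ingr = nn.Linear(2 * d_phase, d_feat)  # projection for the ingredient attention map

    def forward(self, f_img, f_ingr, f_pos, f_phase):
        ctx = torch.cat([f_pos, f_phase], dim=-1)         # cat(F_pos, F_phase)
        a_img = torch.softmax(self.w_img(ctx), dim=-1)    # attention weights for F_img
        a_ingr = torch.softmax(self.w_ingr(ctx), dim=-1)  # attention weights for F_ingr
        return torch.cat([a_img * f_img, a_ingr * f_ingr], dim=-1)  # phase-aware features r_attn
```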
D. Training and Inference

The food images, the ingredients and the target instruction captions are taken as the training inputs of the model. We have three loss functions in total, i.e. the global structure prediction loss L_pre, the sub-generator output loss L_gen and the position classification loss L_pos, and the training loss can be formulated as

L = \lambda_1 L_{pre} + \lambda_2 L_{gen} + \lambda_3 L_{pos},    (10)

The Transformer model [12] is auto-regressive, utilizing the previously generated tokens as additional input while generating the next one [12]. Therefore, during inference, we first feed the model with the [START] token instead of the whole target instruction captions, and the model then outputs the following tokens incrementally. We run the global structure prediction component first; according to the predicted sub-generator sequence, we utilize the chosen generator for each recipe phase.
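Putting the two components together, inference proceeds in two stages, sketched below with hypothetical module interfaces; structure_predictor, fuse, sub_generators and the special-token handling are placeholder names for illustration, not the released API.

```python
import torch

@torch.no_grad()
def generate_recipe(f_img, f_ingr, structure_predictor, fuse, sub_generators,
                    start_id, end_id, max_words, max_phases=3):
    """Two-stage greedy decoding: predict the phase structure, then decode each phase."""
    # Stage 1: greedily predict the sub-generator label sequence and the per-phase vectors F_phase.
    labels, phase_vectors = structure_predictor.greedy_decode(f_img, f_ingr, max_len=max_phases)

    recipe_tokens = []
    for position, (label, f_phase) in enumerate(zip(labels, phase_vectors)):
        r = fuse(f_img, f_ingr, position, f_phase)         # phase-aware features for this phase
        generator = sub_generators[label]                  # sub-generator chosen in stage 1
        # Stage 2: greedy token-by-token decoding of this phase, starting from [START].
        tokens = [start_id]
        while len(tokens) < max_words and tokens[-1] != end_id:
            logits = generator(torch.tensor([tokens]), r)  # (1, len(tokens), vocab_size)
            tokens.append(int(logits[0, -1].argmax()))
        recipe_tokens.extend(tokens[1:])                   # concatenate the generated phases
    return recipe_tokens
```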
IV. EXPERIMENTS

A. Dataset and Evaluation Metrics
Dataset.
We use the official training, validation and test split provided with Recipe1M [3], [7]. These recipes are scraped from cooking websites, and each of them contains a food image, a list of ingredients and the cooking instructions. Since the Recipe1M data is uploaded by users, there is large variance and noise across the food images and recipes. Evaluation Metrics.
We adopt three different metrics for evaluation, i.e. perplexity, BLEU [30] and ROUGE [31]. The prior work [7] only used perplexity, which measures how well the probability distribution of the learned words matches that of the input instructions. BLEU scores are based on an average of unigram, bigram, trigram and 4-gram precision; however, BLEU fails to consider sentence structures [32]. In other words, BLEU cannot evaluate the performance of our global structure prediction component. ROUGE is a modification of BLEU that focuses on recall rather than precision, i.e. it looks at how many n-grams of the reference text show up in the outputs, rather than the reverse. Therefore, ROUGE can reflect the influence of the proposed global structure prediction component, which is discussed in Section IV-E.
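For reference, corpus-level versions of these metrics can be computed with standard packages, as in the sketch below; this is our own illustrative evaluation snippet assuming the nltk and rouge-score packages, not the authors' scoring script.

```python
import math
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

def evaluate(references, hypotheses, mean_token_nll=None):
    """references / hypotheses: lists of recipe strings; mean_token_nll: average per-token negative log-likelihood."""
    # BLEU: combines 1- to 4-gram precisions (nltk's default uniform weights).
    bleu = corpus_bleu([[ref.split()] for ref in references],
                       [hyp.split() for hyp in hypotheses])
    # ROUGE-L: longest-common-subsequence F-measure, averaged over the corpus.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for ref, hyp in zip(references, hypotheses)) / len(references)
    # Perplexity: exponential of the average per-token cross-entropy of the language model.
    perplexity = math.exp(mean_token_nll) if mean_token_nll is not None else None
    return {"bleu": 100 * bleu, "rougeL": 100 * rouge_l, "perplexity": perplexity}
```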
B. Implementation Details
We utilize ResNet-50 [27] pretrained on ImageNet [28] as the image encoder, which takes resized food images as input. The ingredient encoder is BERT [18], short for Bidirectional Encoder Representations from Transformers, a pretrained language model implemented by [29] and one of the state-of-the-art NLP models. Following the prior work setting [7], we adopt the output of the last convolutional layer of ResNet-50 as the image feature representations. The prior work [7] used a fixed number of ingredients per recipe for embedding; since the BERT tokenizer [18] may split one word into several tokens, we instead cap the maximum number of ingredient tokens. The output embedding of the BERT model is mapped to the same dimension as the image features. For the cooking instruction generators, different sub-generators share several transformer blocks, and each of them has additional independent transformer blocks with multi-head attention. To align with [7] and achieve a fair comparison, we generate instructions up to the same maximum number of words. In all the experiments, we use greedy search for recipe generation.

Regarding the phase number setting of each cooking instruction, we experiment with different numbers and observe that splitting each instruction into up to three phases gives the best trade-off. Since the number of cooking steps varies widely across recipes, splitting a recipe into too many phases means one phase may only contain one step, which fails to capture the global structure information. Therefore, we assume each instruction has at most three phases.

In all the experiments, we fix the weights of the image encoder for faster training, and instead of using the predicted ingredients as conditional generator inputs [7], we take the ground-truth ingredients and images as input for a fair comparison. We set λ_1, λ_2 and λ_3 in Eq. 10 based on empirical observations on the validation set. The model is optimized with Adam [33], with the learning rate decayed every epoch, and is trained until convergence. We implement the proposed methods with PyTorch [34].

Fig. 3. The comparison of the baseline model and our proposed DGN. DGN can be applied to different backbone networks.
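As a concrete illustration of the encoder setup described in this subsection, the following is a minimal sketch (not the authors' code) of extracting last-convolutional-layer image features with a pretrained ResNet-50 and encoding an ingredient list with a pretrained BERT via the HuggingFace transformers library; the checkpoint names, the mean pooling and the projection dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizer

# Image encoder: ResNet-50 pretrained on ImageNet, keeping the last convolutional feature map.
resnet = models.resnet50(pretrained=True)
image_encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
for param in image_encoder.parameters():
    param.requires_grad = False                               # the image encoder is kept frozen

# Ingredient encoder: pretrained BERT (fine-tuned during training in the paper).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

# Projections to a shared hidden size (the value 512 is an illustrative assumption).
proj_img, proj_ingr = nn.Linear(2048, 512), nn.Linear(768, 512)

def encode(images, ingredient_strings):
    """images: (B, 3, H, W) tensor; ingredient_strings: list of B comma-joined ingredient lists."""
    feat_map = image_encoder(images)                    # (B, 2048, H/32, W/32)
    f_img = proj_img(feat_map.flatten(2).mean(-1))      # mean-pool the spatial grid (assumed pooling)
    tokens = tokenizer(ingredient_strings, padding=True, truncation=True, return_tensors="pt")
    f_ingr = proj_ingr(bert(**tokens).last_hidden_state.mean(1))  # mean over ingredient tokens
    return f_img, f_ingr
```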
C. Baselines

To the best of our knowledge, [7] is the only prior work on the recipe generation task for the Recipe1M dataset, where the whole cooking instructions are generated from the cooked food images through stacked transformer blocks. By contrast, our proposed DGN extends the text generation process with an additional branch, which first predicts the structure of the recipe and then utilizes the chosen sub-generator for each phase of the generation. In other words, DGN can be applied to different backbone networks. We compare the baseline models and the proposed DGN in Figure 3.

To fully demonstrate the efficacy of DGN, we experiment with two different ingredient encoders as baselines. The first one comes from the prior work [7], which adopted a single word embedding layer, trained from scratch, to encode the ingredients. For comparison, BERT [18] is utilized as the second ingredient encoder and is fine-tuned during training. Note that the two baseline models both use ResNet-50 as the image encoder and only differ in their ingredient encoders.

D. Main Results
We show our main results of generating cooking instructions in Table I, evaluated across three language metrics: perplexity, BLEU [30] and ROUGE-L [31]. Generally, models with and without DGN have an obvious performance gap. Simply using one word embedding layer as the ingredient encoder performs poorly, achieving the lowest scores across all the metrics. When we replace the embedding layer with the state-of-the-art pretrained language model, BERT, the performance reasonably improves, which highlights the significance of the pretrained model.

We then incrementally add the DGN branch to the two different backbone networks. To be specific, we experiment with two ways to construct the phase-aware features r, i.e. DGN (cat), where r is formed by the concatenation of the four representations, and DGN (attn), in which we construct image and ingredient features with the attention mechanism and then concatenate them to form r. First, we add DGN (cat) to the baseline models; this approach achieves a clear BLEU improvement over the baseline with the embedding layer and also improves over the state-of-the-art language model BERT, which indicates that our DGN idea is promising and can extend to general models. We further adopt DGN (attn) for recipe generation; the performance improves further, illustrating the usefulness of enhancing the inputs of the generators. In general, our full model, BERT + DGN (attn), obtains the best results among all methods on every metric consistently and achieves state-of-the-art performance.
E. Ablation Studies
The ablative influence of image and ingredient as input.
To demonstrate the necessity of using both the image and the ingredients as input, we train the model with each input separately. We show the ablation studies in Table II, where we use a plain transformer for generation instead of DGN. It can be observed that the ingredient information helps more on recipe generation, since ingredients can be directly reflected in the recipes. The model with both image and ingredients as input performs better than either single-modality input.
The impact of the sub-generator category number N. After we obtain the representation of each instruction phase, we adopt k-means clustering to obtain the phase labels, which indicate the sub-generator selections. These labels are then used to train the global structure prediction component. We show the experiment results in Table III, where the first row shows the results of the BERT baseline model and the last four rows are all implemented with BERT + DGN (attn). When N = 1, we compare the results of the first and second rows: the first row uses the concatenated representations of image and ingredient features, while the second row takes the enhanced phase-aware features r as input, indicating the efficacy of the phase-aware features. Besides, the model with N = 1 has inferior performance compared with the model with N = 3, illustrating that a single generator struggles to fit data from different phases. When N = 5, the model obtains evaluation results similar to N = 1. The reason that the model with N = 5 performs worse than the model with N = 3 may be that the model does not have enough data for training, due to the larger number of splits of the training data. Therefore, we set the hyper-parameter N to 3.
TABLE I
Main results. Evaluation of DGN performance against different settings. We first show the results of two ingredient encoders, where the first one adopts the word embedding layer to encode the ingredients, while the second one, BERT, uses a pretrained language model. DGN is added to the baseline models as an additional branch, where we show the results of different construction ways of the phase-aware features r: DGN (cat) uses the concatenation of the provided representations for the sub-generator inputs, and DGN (attn) adopts the attention mechanism to enhance the representations. We evaluate the models with perplexity (lower is better), BLEU (higher is better) and ROUGE-L (higher is better). We find the proposed DGN improves the performance across all the metrics.

Methods        Ingredient Encoder   Perplexity   BLEU    ROUGE-L
Baseline [7]   Embedding Layer
DGN (cat)      Embedding Layer
DGN (attn)     Embedding Layer
Baseline [18]  BERT
DGN (cat)      BERT
DGN (attn)     BERT                 6.59         11.83   36.6
TABLE II
The ablative influence of image and ingredient as input. The model is evaluated by perplexity (lower is better), BLEU (higher is better) and ROUGE-L (higher is better).

Input                  Perplexity   BLEU   ROUGE-L
Only Image
Only Ingredient
Image and Ingredient   7.52         9.29   34.8
TABLE III
The impact of the sub-generator category number N. The model is evaluated by perplexity (lower is better), BLEU (higher is better) and ROUGE-L (higher is better).

N   Methods   Perplexity   BLEU   ROUGE-L
1   BERT
TABLE IV
The impacts of DGN on the average length and vocabulary size of generated recipes. The results demonstrate that the proposed DGN increases the average length and diversity of the generated cooking instructions.

Methods               Average Length   Vocab Size
Baseline [7]          69.9             3657
Baseline [18]         66.9             4521
DGN (Baseline [7])
DGN (Baseline [18])
Ground Truth

The impacts of DGN on the average length and vocabulary size of generated recipes.
In order to further demonstrate the effectiveness of the proposed DGN from other aspects, we perform some language analysis on the generated outputs in Table IV. Our DGN approach generates text whose average length is closest to the ground-truth recipes, which are crawled from websites and written by humans. The models without DGN generate relatively short cooking instructions, which supports our earlier assumption that using one single generator results in some cooking details being missed. We also show some qualitative results in Figure 4. To evaluate the diversity of the recipes, we compute the vocabulary sizes of the generations and the ground truth, which indicate the number of unique words that appear in the text. According to the results, DGN (BERT) is the most diverse method apart from the ground truth, but there still remain large gaps between the diversity of generated text and human-written text.
The effect of global structure prediction.
The global structure prediction component is the first and basic part of our proposed DGN model, which outputs the sub-generator selections and their orders for the subsequent generation. We compare the text generated with the predicted orders against that generated with random orders. We adopt the ROUGE-L metric for this evaluation, since BLEU focuses on n-gram precision and cannot reflect the impact of different orders of the recipe phases, while ROUGE-L considers both recall and precision and is thus sensitive to the order of the recipe phases. The output with random orders obtains a clearly lower ROUGE-L score than the output with the predicted orders.
Fig. 4. Analysis of recipes generated by different models. We show the generated results conditioned on five different food images, namely pizza, beef stew, spicy red beans, apple crumble and peach and nut cake. The left column shows the conditional food images, and the right three columns show the true cooking instructions, the baseline BERT generations and the DGN generated recipes. Words with a yellow background represent the matching parts between the raw recipes and the generated recipes. In the DGN generations, we mark the recipe phases with numbers in red.

F. Qualitative Results
We present some qualitative results from our proposed model together with the ground-truth cooking instructions for comparison in Figure 4. In the left column, we show the conditional food images, which correspond to pizza, beef stew, spicy red beans, apple crumble and peach and nut cake respectively. In the right three columns, we list the true recipes, the recipes generated by BERT and those generated by our proposed model DGN, which uses the attended features. We indicate the recipe phases with red numbers in the DGN generations, and words with a yellow background mark the matching parts between the raw recipes and the generated recipes.

The notable properties of the DGN generations are their average length and their ability to capture rich cooking details. First of all, we can see that DGN generates longer recipe outputs than BERT, with a length similar to the true recipes. Besides, it is observed that the phase orders predicted by the global structure prediction component make sense in the shown cases: the first instruction phase gives some instructions on pre-processing the ingredients, the middle instruction phase tends to describe the details of the main dish cooking, and the last phase often contains some concluding work of cooking.

Generally, DGN generates more cooking instruction steps matching the ground-truth recipes than BERT. Looking at the details, the DGN generated instructions include the ingredients used in the true recipes. Specifically, in the top row, the generated text covers the ingredients pepperoni, cheese, vegetables, etc. Compared with the BERT outputs, DGN generates similar sentences at the beginning, but provides more details; for example, in the instruction generation for beef stew, both BERT and DGN output the sentence "Add onions and allspice.", while DGN further generates the tip "Cook, stir occasionally, until onions are soft.".

It is also worth noting that some of the predicted numbers are not precise enough; for example, in the third generated phase of beef stew, the generation output turns out to be "cook ... for 8-10 hours", which is not aligned with common sense.

V. CONCLUSION
In this paper, we have proposed to make the generated cooking instructions more structured and complete, i.e. to decompose the recipe generation process. In particular, we present a novel framework, DGN, for recipe generation that leverages the compositional structures of cooking instructions. Specifically, we first predict the global structures of the instructions based on the conditional food images and ingredients, and determine the sub-generator selections and their orders. Then we construct novel phase-aware features as the input of the chosen sub-generators and adopt them to produce the instruction phases, which are concatenated together to obtain the whole cooking instructions. Experimentally, we have demonstrated the advantages of our approach over traditional methods, which use a single decoder to generate the long cooking instructions. We conducted extensive experiments with ablation studies, and achieved state-of-the-art recipe generation results across different metrics on the Recipe1M dataset.

ACKNOWLEDGMENT
This research is supported, in part, by the National Research Foundation (NRF), Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-003) and under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore. This research is also supported, in part, by the Singapore Ministry of Health under its National Innovation Challenge on Active and Confident Ageing (NIC Project No. MOH/NIC/COG04/2017 and MOH/NIC/HAIG03/2017), and the MOE Tier-1 research grants: RG28/18 (S) and RG22/19 (S).
REFERENCES

[1] Y. Matsuda, H. Hoashi, and K. Yanai, "Recognition of multiple-food images by detecting candidate regions," in Multimedia and Expo (ICME), 2012 IEEE International Conference on. IEEE, 2012, pp. 25–30.
[2] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – mining discriminative components with random forests," in European Conference on Computer Vision. Springer, 2014, pp. 446–461.
[3] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba, "Learning cross-modal embeddings for cooking recipes and food images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3020–3028.
[4] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord, "Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings," in ACM SIGIR, 2018.
[5] H. Wang, D. Sahoo, C. Liu, E.-P. Lim, and S. C. Hoi, "Learning cross-modal embeddings with adversarial networks for cooking recipes and food images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11572–11581.
[6] A. Bosselut, O. Levy, A. Holtzman, C. Ennis, D. Fox, and Y. Choi, "Simulating action dynamics with neural process networks," arXiv preprint arXiv:1711.05313, 2017.
[7] A. Salvador, M. Drozdzal, X. Giro-i-Nieto, and A. Romero, "Inverse cooking: Recipe generation from food images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10453–10462.
[8] K. Chandu, E. Nyberg, and A. W. Black, "Storyboarding of recipes: Grounded contextual generation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6040–6046.
[9] L. Zhou, C. Xu, and J. J. Corso, "Towards automatic learning of procedures from web instructional videos," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[10] W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain, "A survey on food computing," arXiv preprint arXiv:1808.07202, 2018.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[13] N. Xu, H. Zhang, A.-A. Liu, W. Nie, Y. Su, J. Nie, and Y. Zhang, "Multi-level policy and reward-based deep reinforcement learning framework for image captioning," IEEE Transactions on Multimedia, vol. 22, no. 5, pp. 1372–1383, 2019.
[14] M. Yang, W. Zhao, W. Xu, Y. Feng, Z. Zhao, X. Chen, and K. Lei, "Multitask learning for cross-domain image captioning," IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 1047–1061, 2018.
[15] X. Xiao, L. Wang, K. Ding, S. Xiang, and C. Pan, "Deep hierarchical encoder–decoder network for image captioning," IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2942–2956, 2019.
[16] J. Yu, W. Zhang, Y. Lu, Z. Qin, Y. Hu, J. Tan, and Q. Wu, "Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval," IEEE Transactions on Multimedia, 2020.
[17] Z. Huasong, J. Chen, C. Shen, H. Zhang, J. Huang, and X.-S. Hua, "Self-adaptive neural module transformer for visual question answering," IEEE Transactions on Multimedia, 2020.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[19] X. Yang, H. Zhang, and J. Cai, "Learning to collocate neural modules for image captioning," arXiv preprint arXiv:1904.08608, 2019.
[20] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[21] Z.-J. Zha, D. Liu, H. Zhang, Y. Zhang, and F. Wu, "Context-aware visual policy network for fine-grained image captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[22] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Neural module networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 39–48.
[23] D. A. Hudson and C. D. Manning, "Compositional attention networks for machine reasoning," arXiv preprint arXiv:1803.03067, 2018.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[25] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei, "A hierarchical approach for generating descriptive image paragraphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 317–325.
[26] M. Honnibal and I. Montani, "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing," 2017, to appear.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[29] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, "HuggingFace's Transformers: State-of-the-art natural language processing," ArXiv, vol. abs/1910.03771, 2019.
[30] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[31] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[32] C. Callison-Burch, M. Osborne, and P. Koehn, "Re-evaluating the role of BLEU in machine translation research," in EACL, 2006.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.