A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Frank F. Xu∗, Lei Ji, Botian Shi, Junyi Du, Graham Neubig, Yonatan Bisk, Nan Duan
Carnegie Mellon University; Microsoft Research, Asia; Beijing Institute of Technology; University of Southern California
{fangzhex,gneubig,ybisk}@cs.cmu.edu, {leiji,nanduan}@microsoft.com

Abstract
Procedural knowledge, which we define as concrete information about the sequence of actions that go into performing a particular procedure, plays an important role in understanding real-world tasks and actions. Humans often learn this knowledge from instructional text and video, and in this paper we aim to perform automatic extraction of this knowledge in a similar way. As a concrete step in this direction, we propose the new task of inferring procedures in a structured form (a data structure containing verbs and arguments) from multimodal instructional video contents and their corresponding transcripts. We first create a manually annotated, large evaluation dataset including over 350 instructional cooking videos along with over 15,000 English sentences in transcripts spanning 89 recipes. We conduct analysis of the challenges posed by this task and dataset with experiments with unsupervised segmentation, semantic role labeling, and visual action detection based baselines. The dataset and code will be publicly available at https://github.com/frankxu2004/cooking-procedural-extraction.

1 Introduction

Instructional videos are a convenient way to learn a new skill. Although learning from video seems natural to humans, it requires identifying and understanding procedures and grounding them to the real world. In this paper, we propose a new task and dataset for extracting procedural knowledge into a fine-grained structured representation from multimodal information contained in a large-scale archive of open-domain narrative videos with transcripts.

∗ Work done during the first author's internship at Microsoft Research, Asia.
Figure 1: An example of extracting procedures for the task "Making Clam Chowder".

While there is a significant amount of related work (summarized in Section 7), to our knowledge there is no dataset similar in scope, with previous attempts focusing only on a single modality (e.g. text only (Kiddon et al., 2015) or video only (Zhukov et al., 2019; Alayrac et al., 2016)), using closed-domain taxonomies (Tang et al., 2019), or lacking structure in the procedural representation (Zhou et al., 2018a).

In our task, given a narrative video, say a cooking video on YouTube about making clam chowder as shown in Figure 1, our goal is to extract a series of tuples representing the procedure, e.g. (heat, cast iron skillet), (fry, bacon, with heated skillet), etc. We created a manually annotated, large dataset for evaluation of the task, including over 350 instructional cooking videos along with over 15,000 English sentences in the transcripts, spanning over 89 recipe types. This verb-argument structure using arbitrary textual phrases is motivated by open information extraction (Schmitz et al., 2012; Fader et al., 2011), but focuses on procedures rather than entity-entity relations.

This task is challenging with respect to both video and language understanding. For video, it requires understanding of video contents, with a special focus on actions and procedures. For language, it requires understanding of oral narratives, including understanding of predicate-argument structure and coreference. In many cases it is necessary for both modalities to work together, such as when resolving null arguments necessitates the use of objects or actions detected from video contents in addition to transcripts. For example, the cooking video host may say "just a pinch of salt in" while adding some salt into a boiling pot of soup, in which case inferring the action "add" and its argument "pot" requires visual understanding.

Along with the novel task and dataset, we propose several baseline approaches that extract structure in a pipelined fashion. These methods first identify key clips/sentences using video and transcript information with unsupervised and supervised multimodal methods, then extract procedure tuples from the utterances and/or video of these key clips. On the utterance side, we utilize an existing state-of-the-art semantic role labeling model (Shi and Lin, 2019), with the intuition that semantic role labeling captures the verb-argument structures of a sentence, which are directly related to procedures and actions. On the video side, similarly, we utilize an existing state-of-the-art video action/object recognition model trained in kitchen settings to further augment utterance-only extraction results. The results are far from perfect, demonstrating that the proposed task is challenging and that structuring procedures requires more than just state-of-the-art semantic parsing or video action recognition.

We show a concrete example of our procedural knowledge extraction task in Figure 1. Our ultimate goal is to automatically map unstructured instructional video (clips and utterances) to structured procedures, defining what actions should be performed on which objects, with what arguments, and in what order. We define the input to such an extraction system:

• Task R, e.g. "Create Chicken Parmesan", and an instructional video V_R describing the procedure to achieve task R, e.g. a video titled "Chicken Parmesan - Let's Cook with ModernMom".
• A sequence of n sentences T_R = {t_1, t_2, ..., t_n} representing video V_R's corresponding transcript. According to the time stamps of the transcript sentences, the video is also segmented into n clips V_R = {v_1, v_2, ..., v_n} so as to align with the sentences in the transcript T_R.

The output will be:

• A sequence of m procedure tuples S_R = {s_1, s_2, ..., s_m} describing the key steps to achieve task R according to instructional video V_R.

• An identified list of key video clips and corresponding sentences V'_R ⊆ V_R, to which the procedures in S_R are grounded.

Each procedural tuple s_j = (verb, arg_1, ..., arg_k) ∈ S_R consists of a verb phrase and its arguments. Only the "verb" field is required, and thus the tuple size ranges from 1 to k+1. All fields can be either a word or a phrase.

Not every clip/sentence describes procedures, as most videos include an intro, an outro, non-procedural narration, or off-topic chit-chat. Key clips V'_R are clips associated with one or more procedures in S_R, with some clips/sentences associated with multiple procedure tuples. Conversely, each procedure tuple is associated with only a single clip/sentence.
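To make the output representation concrete, the following is a minimal sketch of the data structures implied by this definition, assuming a simple Python encoding; the class and field names are ours, not part of the dataset release.

```python
# A minimal sketch of the output data structure defined above: a procedure
# tuple with a required verb and any number of argument phrases, grounded to
# a single key clip. Field names are ours, chosen to mirror the notation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ProcedureTuple:
    verb: str                       # required, e.g. "fry"
    args: Tuple[str, ...] = ()      # optional phrases, e.g. ("bacon", "with heated skillet")
    clip_index: int = -1            # index of the single key clip this tuple is grounded to

@dataclass
class StructuredProcedure:
    task: str                                                    # e.g. "Making Clam Chowder"
    tuples: List[ProcedureTuple] = field(default_factory=list)   # S_R, ordered key steps
    key_clip_indices: List[int] = field(default_factory=list)    # V'_R, subset of clip indices

example = StructuredProcedure(
    task="Making Clam Chowder",
    tuples=[ProcedureTuple("heat", ("cast iron skillet",), clip_index=5),
            ProcedureTuple("fry", ("bacon", "with heated skillet"), clip_index=5)],
    key_clip_indices=[5],
)
```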
While others have looked at creating related datasets, they fall short on key dimensions which we remedy in our work. Specifically, in Table 1 we compare to AllRecipes (Kiddon et al., 2015) (AR), YouCook2 (Zhou et al., 2018b) (YC2), CrossTask (Zhukov et al., 2019) (CT), COIN (Tang et al., 2019), How2 (Sanabria et al., 2018), HAKE (Li et al., 2019), and TACOS (Regneri et al., 2013). Additional details about all datasets are included in Appendix A. (A common dataset we do not include here is HowTo100M (Miech et al., 2019), as it does not contain any annotations.) In summary, none have both structured and open extraction annotations for the procedural knowledge extraction task, since most focus on either video summarization/captioning or action localization/classification tasks.

Table 1: Comparison to current datasets (Ours, AR, YC2, CT, COIN, How2, HAKE, TACOS) along six dimensions: general domain, multimodal input, use of transcripts, use of noisy text, open extraction, and structured format.

Table 2: Statistics of annotated verbs and arguments in procedures.

Figure 2: Annotation interface.
To address the limitations of existing datasets, we created our own evaluation dataset by annotating structured procedural knowledge given the video and transcript. Native English speakers annotated four videos per recipe type (e.g. clam chowder, pizza margherita, etc.) in the YouCook2 dataset into the structured form presented in § 2: procedure tuples s_j and a series of short sentences describing the procedure.

Figure 2 shows the user interface of the annotation tool. The process is divided into 3 questions per clip:

Q1: Determine if the video clip is a key step, namely if: (1) the clip or transcript contains at least one action; (2) the action is required for accomplishing the task (i.e. not a self introduction); and (3) if a clip duplicates a previous key clip, choose the one with clearer visual and textual signals (e.g. without coreference, etc.).

Figure 3: Most frequent verbs (upper) and arguments (lower).
Q2: For each key video clip, annotate the key procedural tuples. We have annotators indicate which actions are both seen and mentioned by the instructor in the video. The actions should correspond to a verb and its arguments from the original transcript, except in the case of ellipsis or coreference, where they have to refer to earlier phrases based on the visual scene.

Q3: Construct a short fluent sentence from the annotated tuples for the given video clip.

We have two expert annotators and a professional labeling supervisor for quality control and deciding the final annotations. To improve the data quality, the supervisor reviewed all labeling results and applied several heuristic rules to find anomalous records for further correction. The heuristic is to check for annotated verbs/arguments that are not found in the corresponding transcript text. Among these anomalies, the supervisor checks the conflicts between the two annotators. 25% of all annotations were modified as a result. On average, annotators completed task Q1 at 240 sentences (clips) per hour and tasks Q2 and Q3 combined at 40 sentences per hour. For Q1, we observe an inter-annotator agreement with Cohen's Kappa of 0.83. (We use the Jaccard ratio between the annotated tokens of two annotators for Q2's agreement. Verb annotations have a higher agreement at 0.77 than that of arguments at 0.72.) Examples are shown in Table 3.

Table 3: Annotations of structured procedures and summaries. Coreference and ellipsis are marked with italics and are resolved into the referred phrases, which are also linked back in the annotations.

Transcript: "so we've placed the dough directly into the caputo flour that we import from italy."
  Summary: place dough in caputo flour. Tuple: (place, dough, caputo flour)
Transcript: "we just give (ellipsis) a squish with our palm and make it flat in the center."
  Summaries: squish dough with palm; flatten center of dough. Tuples: (squish, dough, with palm), (flatten, center of dough)
Transcript: "so will have to rotate it every thirty to forty five seconds ..."
  Summary: rotate pizza every 30-45 seconds. Tuple: (rotate, pizza, every 30-45 seconds)
Figure 4: Extraction pipeline.
Overall, the dataset contains 356 videos with 15,523 video clips/sentences, among which 3,569 clips are labeled as key steps. Sentences average 16.3 tokens, and the language style is oral English. For the structured procedural annotations, there are 347 unique verbs and 1,237 unique objects in all. Statistics are shown in Table 2. Figure 3 lists the most commonly appearing verbs and entities. The action add is most frequently performed, and the entities salt and onions are the most popular ingredients. In nearly 30% of annotations, some verbs and arguments cannot be directly found in the transcript. An example is "(add) some salt into the pot", and we refer to this variety of absence as ellipsis. Arguments not mentioned explicitly are mainly due to (1) pronoun references, e.g. "put it (fish) in the pan"; and (2) ellipsis, where the arguments are absent from the oral language, e.g. "put the mixture inside", where the argument "oven" is omitted. The details can be found in Table 2. The coreference and ellipsis phenomena add difficulty to our task, and indicate the utility of using multimodal information from the video signal and contextual procedural knowledge for inference.
In this and the following section, we describe our two-step pipeline for procedural knowledge extraction (also shown in Figure 4). This section describes the first stage of determining which clips are "key clips" that contribute to the description of the procedure. We describe several key clip selection models, which consume the transcript and/or the video within the clip and decide whether it is a key clip or not.
Given our unsupervised setting, we first examine two heuristic parsing-based methods that focus on the transcript only, one based on semantic role labeling (SRL) and the other based on the unsupervised segmentation model of Kiddon et al. (2015). Before introducing the heuristic baselines, we note that having a lexicon of domain-specific actions is useful, e.g. for filtering pretrained model outputs, or for providing priors to the unsupervised model described later. In our cooking domain, these actions can be expected to consist mostly of verbs related to cooking actions and procedures. Observing recipe datasets such as AllRecipes (Kiddon et al., 2015) or WikiHow (Miech et al., 2019; Zhukov et al., 2019), we find that they usually use imperative and concise sentences for procedures, where the first word is usually the action verb, e.g. "add" in add some salt into the pot. We thus construct a cooking lexicon by aggregating the verbs that frequently appear as the first word in AllRecipes, keeping those with frequency over a threshold of 5. We further filter out words that have no verb synsets in WordNet (Miller, 1995). Finally, we manually filter out noisy or overly general verbs like "go". Note that when applying to other domains, the lexicon can be built following a similar process of first finding a domain-specific corpus with simple and formal instructions, and then obtaining the lexicon by aggregation and filtering.
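As an illustration of the lexicon-construction process just described, here is a minimal sketch assuming recipe steps are available as plain strings; it uses NLTK's WordNet interface for the verb-synset check, and the function name, toy inputs, and the small manual stop list are ours.

```python
# A minimal sketch of the cooking-lexicon construction described above.
# Assumptions (not from the paper): recipes are given as a list of step
# strings; the frequency threshold of 5 follows the text; the final manual
# filtering step is represented by a small hand-written stop list.
from collections import Counter

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def build_cooking_lexicon(recipe_steps, min_freq=5, manual_stoplist=("go",)):
    """Aggregate frequent first-word verbs from imperative recipe steps."""
    counts = Counter()
    for step in recipe_steps:
        tokens = step.strip().lower().split()
        if tokens:
            counts[tokens[0]] += 1  # first token of an imperative step is usually the action verb

    lexicon = set()
    for word, freq in counts.items():
        if freq < min_freq:
            continue
        if not wn.synsets(word, pos=wn.VERB):  # keep only words with a verb synset in WordNet
            continue
        if word in manual_stoplist:
            continue
        lexicon.add(word)
    return lexicon

# Toy usage with stand-ins for AllRecipes step sentences:
steps = ["Add some salt into the pot.", "Stir until combined.", "Add the onions."]
print(build_cooking_lexicon(steps, min_freq=1))
```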
Semantic role labeling baselines.
One intuitive trigger in the transcript for deciding whether a sentence is a key step is the action words, i.e. the verbs. In order to identify these action words we use semantic role labeling (Gildea and Jurafsky, 2002), which analyzes natural language sentences to extract information about "who did what to whom, when, where and how?" The output is in the form of predicates and their respective arguments that act as semantic roles, where the verb acts as the root (head) of the parse. We run a strong semantic role labeling model (Shi and Lin, 2019) included in the AllenNLP toolkit (Gardner et al., 2018) on each sentence in the transcript. From the output we get a set of verbs for each sentence. Because not all verbs in all sentences represent actual key actions for the procedure, we additionally filter the verbs with the heuristically created cooking lexicon above, counting a clip as a key clip only if at least one of the SRL-detected verbs is included in the lexicon.
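A minimal sketch of this heuristic is shown below, assuming the AllenNLP SRL predictor from allennlp-models; the exact model archive URL is illustrative and may differ from the checkpoint used in the paper.

```python
# A minimal sketch of the SRL-based key-clip heuristic. The model archive
# path below is an assumption on our part, not necessarily the one used here.
from allennlp.predictors.predictor import Predictor

SRL_MODEL = ("https://storage.googleapis.com/allennlp-public-models/"
             "structured-prediction-srl-bert.2020.12.15.tar.gz")
predictor = Predictor.from_path(SRL_MODEL)

def is_key_clip(sentence, cooking_lexicon):
    """A clip counts as a key clip if any SRL-detected verb is in the cooking lexicon."""
    output = predictor.predict(sentence=sentence)
    detected_verbs = {v["verb"].lower() for v in output["verbs"]}
    return bool(detected_verbs & cooking_lexicon)

print(is_key_clip("i'm going to put some bacon in there and fry it up",
                  cooking_lexicon={"put", "fry", "add"}))
```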
Unsupervised recipe segmentation baseline (Kiddon et al., 2015).
The second baseline is based on the outputs of the unsupervised recipe sentence segmentation model of Kiddon et al. (2015). Briefly speaking, the model is a generative probabilistic model in which verbs and arguments, together with their numbers, are modeled as latent variables. It uses a bigram model for string selection. It is trained on the whole transcript corpus of the YouCook2 videos iteratively for 15 epochs using a hard EM approach, before the performance starts to converge. The counts of verbs in the cooking lexicon created above are used as priors for this model.

Neural key clip selection model.

Next, we implement a supervised neural network based model that incorporates visual information, which we have posited before may be useful in the face of incomplete verbal utterances. We first extract the features of the sentence and of each video frame using pretrained feature extractors, respectively. Then we perform attention (Bahdanau et al., 2014) between the sentence representation and the frame features and fuse the two modalities to predict whether the clip is a key clip. The model is trained on a general-domain instructional key clip selection dataset with no overlap with ours, and our annotated dataset is used for evaluation only. Additional details about the model and training dataset are included in Appendix B. (The SRL model, demonstrated at https://demo.allennlp.org/semantic-role-labeling, is used in this stage only as a verb identifier, with other output information used in stage 2.)

With the identified key clips and corresponding transcript sentences, we proceed to the second stage, which performs clip/sentence-level procedural knowledge extraction from key clips. In this stage, extraction is done only from the clips identified in the first stage as key clips.
We first present two baselines to extract structured procedures using transcripts only, similarly to the key-clip identification methods described in § 4.

Semantic role labeling.
For the first baseline, we use the same pretrained SRL model introduced in § 4. For example, given the sentence "you're ready to add a variety of bell peppers" from the transcript, the output from the SRL model contains two parses with two predicates, "are" and "add", where only the latter is actually part of the procedure. To deal with this issue, we first perform filtering similar to that used in stage 1, removing parses with predicates (verbs) outside of the domain-specific action lexicon created in § 4. In addition, not all arguments within a parse contribute to the procedure. For example, in the parse "[ARG0: I] [V: add] [ARG1: a lot of pepper] [ARGM-CAU: because I love it]", arguments such as ARG0 and ARGM-CAU are clearly not contributing to the procedure, so we remove them. We provide a complete list of the filtered argument types in Appendix C.
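The following is a minimal sketch of this tuple-extraction step, reusing the hypothetical predictor and cooking_lexicon from the earlier sketches. The BIO-tag parsing and the filtered label set follow the description here and in Appendix C, with label spellings as they appear in AllenNLP's SRL output; the helper name is ours.

```python
# A minimal sketch of turning SRL output into procedure tuples, reusing the
# `predictor` object from the previous sketch. Label names follow AllenNLP's
# SRL output format (ARGM-*), mirroring the filtered types listed in Appendix C.
FILTERED_ARGS = {"ARG0", "ARGM-MOD", "ARGM-CAU", "ARGM-NEG",
                 "ARGM-DIS", "ARGM-REC", "ARGM-PNC", "ARGM-PRP", "ARGM-EXT"}

def srl_to_tuples(sentence, cooking_lexicon):
    """Extract (verb, arg, ...) tuples from AllenNLP SRL BIO tags."""
    output = predictor.predict(sentence=sentence)
    tuples = []
    for parse in output["verbs"]:
        verb = parse["verb"].lower()
        if verb not in cooking_lexicon:          # stage-1-style predicate filtering
            continue
        spans, current_label, current_tokens = {}, None, []
        for word, tag in zip(output["words"], parse["tags"]):
            if tag.startswith("B-"):
                if current_label:
                    spans.setdefault(current_label, []).append(" ".join(current_tokens))
                current_label, current_tokens = tag[2:], [word]
            elif tag.startswith("I-"):
                current_tokens.append(word)
            else:                                 # an "O" tag closes any open span
                if current_label:
                    spans.setdefault(current_label, []).append(" ".join(current_tokens))
                current_label, current_tokens = None, []
        if current_label:
            spans.setdefault(current_label, []).append(" ".join(current_tokens))
        args = []
        for label, span_list in spans.items():
            if label == "V" or label in FILTERED_ARGS or label.startswith("R-"):
                continue                          # drop the predicate span and filtered argument types
            args.extend(span_list)
        tuples.append((verb, *args))
    return tuples
```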
Unsupervised recipe segmentation (Kiddon et al., 2015).

The second baseline uses the same trained segmentation model as in § 4, directly taking its segmented verbs and arguments as procedure tuples.

Visual action and object detection.

We also examine a baseline that utilizes two forms of visual information in videos: actions and objects. We predict both verbs and nouns of a given video clip via a state-of-the-art action detection model, TSM (Lin et al., 2019), trained on the EpicKitchen (Damen et al., 2018a) dataset (https://epic-kitchens.github.io/2019). For each video, we extract 5-second video segments and feed them into the action detection model. The outputs of the model are in a predefined set of labels of verbs (actions) and nouns (objects). (Notably, this contrasts with our setting of attempting to recognize into an open label set, which upper-bounds the accuracy of any model with a limited label set.) We directly combine the outputs from the model on each video segment, then aggregate and align them with key clips/sentences through timestamps in the video, forming the final output.
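As a sketch of the aggregation and alignment step only (not the TSM detector itself), the snippet below merges per-segment verb/noun detections into clip-level sets by timestamp overlap; the dictionary layout of detections and clips is an assumption on our part.

```python
# A minimal sketch of aggregating per-segment visual detections and aligning
# them to key clips by timestamp. Detection outputs and clip boundaries are
# plain dicts; the field names are ours, not from TSM or EpicKitchens.
def align_detections_to_clips(detections, key_clips):
    """detections: [{"start": s, "end": e, "verbs": [...], "nouns": [...]}, ...]
    key_clips:  [{"clip_id": i, "start": s, "end": e}, ...] from transcript timestamps."""
    per_clip = {clip["clip_id"]: {"verbs": set(), "nouns": set()} for clip in key_clips}
    for det in detections:
        for clip in key_clips:
            # a 5-second detection segment contributes to a clip if they overlap in time
            overlap = min(det["end"], clip["end"]) - max(det["start"], clip["start"])
            if overlap > 0:
                per_clip[clip["clip_id"]]["verbs"].update(det["verbs"])
                per_clip[clip["clip_id"]]["nouns"].update(det["nouns"])
    return per_clip
```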
Finally, to take advantage of the fact that utterances and video provide complementary views, we perform multimodal fusion of the results of both of these model varieties. We first adopt a naive method of fusion by taking the union of the result sets from the best-performing utterance-only model and the visual detection model. However, we found in evaluations that this degrades performance, partly due to the differences in video data distribution and domain, as well as the limitation of the predefined set of verbs and nouns in the EpicKitchen dataset. To tackle the limitation of the label set, we compare an "oracle" version by first expanding the predefined verbs and nouns in the EpicKitchen dataset with synonyms and 1-hop siblings of their synsets in WordNet. With these, the visual detection results are expanded as above, and we filter them with the ground truth annotations (oracle) before they are combined with the utterance model predictions.
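The label-set expansion for the oracle variant can be sketched with NLTK's WordNet interface as below; interpreting "1-hop siblings" as co-hyponyms under a shared hypernym is our reading of the text, and the exact expansion used in the paper may differ.

```python
# A minimal sketch of expanding a closed label vocabulary with WordNet
# synonyms and 1-hop sibling lemmas, as used for the "oracle" fusion variant.
from nltk.corpus import wordnet as wn

def expand_label(label, pos=wn.NOUN):
    expanded = {label}
    for synset in wn.synsets(label, pos=pos):
        expanded.update(l.name().replace("_", " ") for l in synset.lemmas())  # synonyms
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():                               # 1-hop siblings
                expanded.update(l.name().replace("_", " ") for l in sibling.lemmas())
    return expanded

print(sorted(expand_label("pan"))[:10])
```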
We provide evaluation results on our annotated dataset for both of the two stages: key clip selection and structured procedural extraction. Besides quantitative and qualitative evaluations, we also analyze the key challenges of this task.

Table 4: Key clip selection results.
In this section, we evaluate the results of the key clip selection described in § 4. We evaluate using the accuracy, precision, recall, and F1 score for the binary classification problem of whether a given clip in the video is a key clip. The results are shown in Table 4. We compare parsing-based heuristic models and supervised neural models, with ablations (model details in Appendix B). From the experimental results in Table 4, we can see that:

1. Unsupervised heuristic methods perform worse than neural models with training data. This is despite the fact that the dataset used for training the neural models has a different data distribution and domain from the test set.

2. Among heuristic methods, pretrained SRL is better than Kiddon et al. (2015), even though the latter is trained on transcript text from YouCook2 videos. One possible reason for this is that the unsupervised segmentation method was specially designed for recipe texts, which are mostly simple, concise, and imperative sentences found in recipe books, while our transcript text is full of noise and tends to have longer, more complicated, oral-style English.

3. Post-processing significantly improves the SRL model, showing that filtering unrelated arguments and incorporating the cooking lexicon helps, especially with reducing false positives.

4. Among the neural method ablations, the model using only visual features performs worse than that using only text features. The best model for identifying key clips among the proposed baselines uses both visual and text information in the neural model.
Table 5: Clip/sentence-level structured procedure extraction results for verbs and arguments.
Besides the quantitative evaluation, we analyzed the key clip identification results and made a number of observations. First, background introductions, advertisements for the YouTube channel, etc. can be relatively well classified due to major differences, both visually and textually, from procedural clips. Second, alignment and grounding between the visual and textual domains is crucial for key clip prediction, yet challenging. For example, the clip with the transcript sentence "add more pepper according to your liking" is identified as a key clip. However, it is in fact merely a suggestion made by the speaker about an imaginary scenario, rather than a real action performed, and thus should not be regarded as a key procedure.
In this stage, we perform key clip-level evaluation for structured procedural knowledge extraction by matching the ground truth and predicted structures with both exact match and two fuzzy scoring strategies. To better show how stage 1 performance affects the whole pipeline, we evaluate on both ground truth (oracle) and predicted key clips. Similarly to the evaluation of key clip selection, we compare the parsing-based methods (§ 5) with the visual and multimodal ones. Precision (P) is the proportion of predicted verbs or arguments that are correct, i.e. TP / #predicted, where TP is the number of true positives. Recall (R) is the proportion of correct verbs or arguments which are predicted by a model, i.e. TP / #gold. The key here is how to calculate TP, and we propose 3 methods: exact match, fuzzy matching, and partial fuzzy matching. The first is straightforward: we count a true positive if and only if the predicted phrase is an exact string match to one of the gold phrases. However, because our task lies in the realm of open phrase extraction without predefined labels, it is unfairly strict to count only exact string matches as TP. Also, by design, the gold extraction results cannot always be found in the original transcript sentence (refer to § 3). For the second metric, "fuzzy", we leverage edit distance to enable fuzzy matching and assign a "soft" score for TP. In some cases, two strings of quite different lengths will hurt the fuzzy score due to the nature of edit distance, even though one string is a substring of the other. To get around this, we propose a third metric, "partial fuzzy", which takes the score of the best matching substring with the length of the shorter string in the comparison (see Appendix D). Note that this third metric is biased towards shorter, correct phrases, and thus we should take a holistic view of all 3 metrics during evaluation. Details of the two fuzzy metrics are described in Appendix D. Table 5 illustrates the evaluation results:

1. Argument extraction is much more challenging compared to verb extraction, according to the results: arguments contain more complex types of phrases (e.g. objects, location, time, etc.) and are longer in length. It is hard to identify complex arguments with our current heuristic or unsupervised baselines, hence the need for better supervised or semi-supervised models.

2. Heuristic SRL methods perform better than the unsupervised segmentation model even though the latter is trained on our corpus. This demonstrates the generality of SRL models, but the heuristics applied at the output of SRL models still improve performance by reducing false positives.

3. The visual-only method performs the worst, mainly because of the domain gap between visual detection model outputs and our annotated verbs and arguments. Other reasons include: the closed label set predefined in EpicKitchen; challenges in transferring from closed to open extraction; different video data distributions between EpicKitchen (for training) and our dataset (YouCook2, for testing); and the limited performance of the video detection model itself.

4. Naive multimodal fusion leads to a performance drop to below the utterance-only model. However, filtering visual outputs with the oracle annotations before merging with the utterance-only output outperforms single-modality models. This indicates a path forward for fusion strategies, though it is not sufficient for handling the complexity of arguments. To get a phrase for open extraction, we need more than just object detection.

There are two key challenges we see moving forward:
Verb extraction:
We find that verb ellipsis is common in transcripts. The transcript text contains sentences where the key action "verbs" do not have verb part-of-speech in the sentence. For example, in the sentence "give it a flip ..." with the annotation ("flip", "pancake"), the model detects "give" as the verb rather than "flip". Currently all our baselines are highly reliant on a curated lexicon for verb selection, and thus such cases get filtered out. How to deal with such cases involving general verbs like make, give, do remains challenging and requires extracting from the contexts.

Argument extraction:

Speech-to-text errors are intrinsic to automatically acquired transcripts and cause problems during parsing that cascade. Examples are "add flour" being recognized as "add flower" and "sriracha sauce" being recognized as "sarrah cha sauce", causing wrong extraction outputs. Coreference and ellipsis are also challenging and hurt current benchmark performance, as our baselines do not tackle any of these explicitly. Visual coreference and language grounding (Huang et al., 2018, 2017) provide a feasible way for us to tackle these cases in the future.
7 Related Work

Text-based procedural knowledge extraction.
Procedural text understanding and knowledge extraction (Chu et al., 2017; Park and Motahari Nezhad, 2018; Kiddon et al., 2015; Jermsurawong and Habash, 2015; Liu et al., 2016; Long et al., 2016; Maeta et al., 2015; Malmaud et al., 2014; Artzi and Zettlemoyer, 2013; Kuehne et al., 2017) has been studied for years on step-wise textual data such as WikiHow. Chu et al. (2017) extracted open-domain knowledge from how-to communities. Recently, Zhukov et al. (2019) also studied adopting well-written how-to data as weak supervision for instructional video understanding. Unlike existing work on action graph/dependency extraction (Kiddon et al., 2015; Jermsurawong and Habash, 2015), our approach differs as we extract knowledge from the visual signals and transcripts directly, not from formal imperative recipe texts.
Instructional video understanding.
Unlike existing tasks for learning from instructional video (Zhou et al., 2018c; Tang et al., 2019; Alayrac et al., 2016; Song et al., 2015; Sener et al., 2015; Huang et al., 2016; Sun et al., 2019b,a; Plummer et al., 2017; Shi et al., 2019; Palaskar et al., 2019), visual-linguistic reference resolution (Huang et al., 2018, 2017), visual planning (Chang et al., 2019), joint learning of objects and actions (Zhukov et al., 2019; Richard et al., 2018; Gao et al., 2017; Damen et al., 2018b), pretraining joint embeddings of high-level sentences with video clips (Sun et al., 2019b; Miech et al., 2019), and multimodal reading comprehension with RecipeQA (Yagcioglu et al., 2018), our task proposal requires explicit structured knowledge extraction.
Visual procedure learning.
In addition to the closely related work (§ 3), there is a wide literature (Zhou et al., 2018b,c; Alayrac et al., 2016; Ushiku et al., 2017; Nishimura et al., 2019; Tang et al., 2019; Huang et al., 2016; Shi et al., 2019) that aims to predict dense procedural captions given the video, which are the most similar works to ours. Zhou et al. (2018c) extracted temporal procedures and then generated captioning for each procedure. Sanabria et al. (2018) propose multimodal abstractive summarization for how-to videos with either human-labeled or speech-to-text transcripts. Alayrac et al. (2016) also introduce an unsupervised step learning method from instructional videos. Inspired by cross-task sharing (Zhukov et al., 2019), which is a weakly supervised method to learn shared actions between tasks, fine-grained actions and entities are important for sharing similar knowledge between various tasks. We focus on structured knowledge of fine-grained actions and entities. Visual-linguistic coreference resolution (Huang et al., 2018, 2017) is among the open challenges for our proposed task.
We propose a multimodal open procedural knowledge extraction task, present a new evaluation dataset, produce benchmarks with various methods, and analyze the difficulties of the task. We also investigate the limits of existing methods and the many open challenges for procedural knowledge acquisition, including: testing supervised settings (e.g. through cross-validation); better dealing with cases of coreference and ellipsis in visually grounded language; exploiting cross-modal information with more robust models using unsupervised or semi-supervised learning paradigms; constructing action graphs with dependencies between procedures to enable reasoning and machine execution; and incorporating human-in-the-loop teaching in automatic procedural knowledge learning.
References
Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4575–4583.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Xavier Carreras and Lluís Màrquez. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, pages 89–97.

Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. 2019. Procedure planning in instructional videos. ArXiv, abs/1907.01172.

Cuong Xuan Chu, Niket Tandon, and Gerhard Weikum. 2017. Distilling task knowledge from how-to communities. In Proceedings of the 26th International Conference on World Wide Web, pages 805–814. International World Wide Web Conferences Steering Committee.

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2018a. Scaling egocentric vision: The EPIC-Kitchens dataset. In European Conference on Computer Vision (ECCV).

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2018b. Scaling egocentric vision: The EPIC-Kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics.

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding "it": Weakly-supervised, reference-aware visual grounding in instructional videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2016. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, pages 137–153. Springer.

De-An Huang, Joseph J Lim, Li Fei-Fei, and Juan Carlos Niebles. 2017. Unsupervised visual-linguistic reference resolution in instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2183–2192.

Jermsak Jermsurawong and Nizar Habash. 2015. Predicting the structure of cooking recipes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 781–786.

Chloé Kiddon, Ganesa Thandavam Ponnuraj, Luke S. Zettlemoyer, and Yejin Choi. 2015. Mise en place: Unsupervised interpretation of instructional recipes. In EMNLP.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hilde Kuehne, Alexander Richard, and Juergen Gall. 2017. Weakly supervised learning of actions from transcripts. Computer Vision and Image Understanding, 163:78–89.

Yong-Lu Li, Liang Xu, Xijie Huang, Xinpeng Liu, Ze Ma, Mingyang Chen, Shiyi Wang, Hao-Shu Fang, and Cewu Lu. 2019. HAKE: Human activity knowledge engine. arXiv preprint arXiv:1904.06539.

Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision.

Changsong Liu, Shaohua Yang, Sari Saba-Sadiya, Nishant Shukla, Yunzhong He, Song-Chun Zhu, and Joyce Chai. 2016. Jointly learning grounded task structures from language instruction and visual demonstration. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1482–1492.

Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. arXiv preprint arXiv:1606.05378.

Hirokuni Maeta, Tetsuro Sasada, and Shinsuke Mori. 2015. A framework for procedural text understanding. In Proceedings of the 14th International Conference on Parsing Technologies, pages 50–60.

Jonathan Malmaud, Earl Wagner, Nancy Chang, and Kevin Murphy. 2014. Cooking with semantics. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 33–38.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. arXiv:1906.03327.

George A Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38.

Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, and Shinsuke Mori. 2019. Frame selection for producing recipe with pictures from an execution video of a recipe. In Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities, pages 9–16. ACM.

Shruti Palaskar, Jindrich Libovický, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for How2 videos. arXiv preprint arXiv:1906.07901.

Hogun Park and Hamid Reza Motahari Nezhad. 2018. Learning procedures from text: Codifying how-to procedures in deep neural networks. In Companion Proceedings of the The Web Conference 2018, pages 351–358. International World Wide Web Conferences Steering Committee.

Bryan A Plummer, Matthew Brown, and Svetlana Lazebnik. 2017. Enhancing video summarization via vision-language embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5781–5789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.

Alexander Richard, Hilde Kuehne, and Juergen Gall. 2018. Action sets: Weakly supervised action segmentation without ordering constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5996.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347.

Erik F Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, et al. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534. Association for Computational Linguistics.

Ozan Sener, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2015. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision, pages 4480–4488.

Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, and Ming Zhou. 2019. Dense procedure captioning in narrated instructional videos. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 6382–6391.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.

Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5179–5187.

Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Learning video representations using contrastive bidirectional transformer.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1207–1216.

Atsushi Ushiku, Hayato Hashimoto, Atsushi Hashimoto, and Shinsuke Mori. 2017. Procedural text generation from an execution video. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 326–335, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. RecipeQA: A challenge dataset for multimodal comprehension of cooking recipes. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1358–1368, Brussels, Belgium. Association for Computational Linguistics.

Luowei Zhou, Nathan Louis, and Jason J Corso. 2018a. Weakly-supervised video object grounding from text by loss weighting and object interaction. arXiv preprint arXiv:1805.02834.

Luowei Zhou, Chenliang Xu, and Jason J Corso. 2018b. Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.

Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. 2018c. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748.

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. arXiv preprint arXiv:1903.08225.

A Comparison with existing datasets
There are publicly available datasets related to understanding instructional videos:

• AllRecipes (Kiddon et al., 2015) (AR). The authors collected 2,456 recipes from the AllRecipes website. The sentences in the dataset are mostly simple imperative English describing concise steps to make a given dish, where the first word is usually the verb describing the action. The ingredient list information is also available. In contrast, our task seeks to extract procedural information from noisier, oral, and error-prone language in a real-life video context.

• YouCook2 (Zhou et al., 2018b) (YC2) (http://youcook2.eecs.umich.edu/). The procedure steps for each video are annotated with temporal boundaries in the video and described by human-written imperative English sentences. However, this dataset does not contain more fine-grained annotations in a structured form.

• HowTo100M (Miech et al., 2019). This is a large-scale how-to video dataset, searched on YouTube using the task taxonomy of WikiHow as a source. However, it does not contain any annotations, although the domain is more general.

• CrossTask (Zhukov et al., 2019) (CT) (https://github.com/DmZhukov/CrossTask). Based on HowTo100M, this dataset is used for weakly supervised learning, with 18 tasks fully labeled and 65 related tasks unlabeled. Although the dataset is annotated in a structured way by separating verbs and objects, the label space is closed with predefined sets of verbs and objects. The dataset also does not allow multiple verbs or objects to be extracted for a single segment.

• COIN (Tang et al., 2019) (https://coin-dataset.github.io/). This contains instructional (how-to) videos in a closed taxonomy of tasks and steps. The authors annotated time spans of steps in a video with pre-defined steps; however, the biggest drawback is that it is unstructured and closed domain.

• How2 (Sanabria et al., 2018) (https://github.com/srvk/how2-dataset). This dataset annotates ground truth transcript text to help abstractive summarization, a very different task than ours of structured data extraction.

• HAKE (Li et al., 2019) (http://hake-mvig.cn). Human Activity Knowledge Engine (HAKE) is a large-scale knowledge base of human activities, built upon existing activity datasets, and supplies human instance action labels and corresponding body-part-level atomic action labels. However, HAKE uses closed activity and part state classes. It also does not contain videos of activities accompanied by narrative transcripts.

• TACOS (Regneri et al., 2013). This dataset considers the problem of grounding sentences describing actions in visual information extracted from videos in kitchen settings. The dataset contains expert annotations of low-level activity tags, with a total of 60 different activity labels with numerous associated objects, and sequences of NL sentences describing actions in the kitchen videos. This dataset also does not support open extraction, and the videos are provided with human-annotated caption sentences, rather than noisy transcript texts.

B Neural Selection Model
Figure 5 presents the overall detailed structure of the neural selection model for combining utterance and video information for key clip selection.
Sentence token encoding
Each input clip is accompanied by a sentence S = {t_1, ..., t_k} which has k tokens. We use a pre-trained BERT (Devlin et al., 2018) model as the encoder and extract the sentence representation s.

Video frame features
For each clip we uniformly sample T = 10 frames and use an ImageNet-pretrained (Deng et al., 2009) ResNet50 (He et al., 2016) to extract the feature vector of each frame as X = {x_1, ..., x_T}.

Figure 5: Neural key clip selection model.
Attention-based frame encoding
To model the interaction between the encoded sentence and the feature of each frame, we adopt an attention-based method. We first calculate the attention weights a_s by a tensor product of the sentence feature s with each video frame feature x_i, followed by a softmax layer. Then we perform a weighted sum over all frame features to get Attn(s, X).

Visual-utterance fusion

Finally, we fuse the extracted transcript features s with the attended video features Attn(s, X) by a tensor product and flatten the result into a vector. Then we use a non-linear activation layer to map these features into a real number, which represents the probability of the clip being a key clip.
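A minimal PyTorch sketch of this attention-and-fusion computation is given below. The projection dimensions and the concrete form of the tensor product (an outer product that is then flattened) are assumptions on our part; the text above only fixes the overall structure.

```python
# A minimal sketch of the attention-based key clip classifier described above.
# Feature dimensions and the outer-product fusion are assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn

class KeyClipClassifier(nn.Module):
    def __init__(self, text_dim=768, frame_dim=2048, proj_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.frame_proj = nn.Linear(frame_dim, proj_dim)
        self.classifier = nn.Sequential(
            nn.Linear(proj_dim * proj_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, s, X):
        # s: (batch, text_dim) BERT sentence feature; X: (batch, T, frame_dim) ResNet50 frame features
        s_p = self.text_proj(s)                                                     # (batch, proj_dim)
        X_p = self.frame_proj(X)                                                    # (batch, T, proj_dim)
        attn = torch.softmax(torch.bmm(X_p, s_p.unsqueeze(2)).squeeze(2), dim=1)    # attention weights a_s
        attended = torch.bmm(attn.unsqueeze(1), X_p).squeeze(1)                     # Attn(s, X), weighted sum
        fused = torch.bmm(s_p.unsqueeze(2), attended.unsqueeze(1)).flatten(1)       # outer product, flattened
        return torch.sigmoid(self.classifier(fused)).squeeze(1)                     # key clip probability

model = KeyClipClassifier()
prob = model(torch.randn(4, 768), torch.randn(4, 10, 2048))
```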
Experiment details

In the presented experiments, we use a pre-trained BERT (Devlin et al., 2018) model to extract the continuous representation of each sentence (via https://github.com/hanxiao/bert-as-service). During fine-tuning, the model is optimized by the Adam optimizer (Kingma and Ba, 2014) with a starting learning rate of e−. The model is trained in a supervised fashion with a separate key clip/sentence classification dataset that is not related to YouCook2. This auxiliary dataset will also be publicly released. All of its videos are general-domain instructional videos harvested from YouTube. Human annotators labeled whether a given video clip-sentence pair is a key clip. In the end, we have 1,034 videos (40,146 pairs) for training the classification model. We split this dataset into two subsets of 772 videos (28,519 pairs) and 312 videos (11,627 pairs) for training and validation (hyper-parameter tuning), respectively. The testing set is our proposed dataset with key clips and sentences annotated (see § 3).

C SRL Argument Filtering
The argument types that we deem not to contribute procedural knowledge for completing the task, and which we filter out, include: ARG0 (usually refers to the subject, typically a person), AM-MOD (modal verb), AM-CAU (cause), AM-NEG (negation marker), AM-DIS (discourse marker), AM-REC (reciprocal), AM-PNC/PRP (purpose), AM-EXT (extent), and R-ARG* (in-sentence references).
D Fuzzy Matching and Partial Fuzzy Matching
Fuzzy matching
Denote the Levenshtein distance between string a and string b as d(a, b). We then define a normalized pairwise score between 0 and 1 as s(a, b) = 1 − d(a, b)/max{|a|, |b|}. Given a set of n predicted phrases X = {x_1, ..., x_n} and a set of m ground truth phrases G = {g_1, ..., g_m}, we find a set of min(n, m) string pairs between the predicted X and the ground truth G, M = {(x_i, g_j)}, that maximizes the sum of scores Σ_{(x_i, g_j) ∈ M} s(x_i, g_j). This assignment problem can be solved efficiently with the Kuhn-Munkres algorithm (Munkres, 1957; http://software.clapper.org/munkres/). Since this fuzzy pairwise score is normalized, it can be regarded as a soft version of the true positive count, TP = max Σ_{(x_i, g_j) ∈ M} s(x_i, g_j).

Partial Fuzzy matching
The only difference from "fuzzy" matching is that the scoring function now follows the "best partial" heuristic: assuming the shorter string a has length |a| and the longer string b has length |b| (with |a| < |b|), we calculate the score between the shorter string and its best "fuzzy"-matching length-|a| substring of b,

s(a, b) = max over {t : t a substring of b, |t| = |a|} of (1 − d(a, t)/|a|).

Both fuzzy metric implementations are based on FuzzyWuzzy.
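The soft-TP computation under these definitions can be sketched as follows, using SciPy's Hungarian-algorithm routine in place of the munkres package and FuzzyWuzzy's ratio/partial_ratio as approximations of the normalized fuzzy and partial fuzzy scores; the helper names and toy phrases are ours.

```python
# A minimal sketch of the soft-TP computation: normalized fuzzy scores plus
# an optimal assignment between predicted and gold phrases. FuzzyWuzzy's
# ratio/partial_ratio stand in for the normalized edit-distance scores.
import numpy as np
from scipy.optimize import linear_sum_assignment
from fuzzywuzzy import fuzz

def soft_true_positives(predicted, gold, partial=False):
    """Return the soft TP count between two phrase lists."""
    scorer = fuzz.partial_ratio if partial else fuzz.ratio
    scores = np.array([[scorer(p, g) / 100.0 for g in gold] for p in predicted])
    rows, cols = linear_sum_assignment(-scores)      # negate to maximize the total score
    return scores[rows, cols].sum()

pred = ["add pepper", "cast iron skillet"]
gold = ["a lot of pepper", "cast-iron skillet", "heated skillet"]
tp = soft_true_positives(pred, gold, partial=True)
precision, recall = tp / len(pred), tp / len(gold)
print(round(precision, 2), round(recall, 2))
```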