End-to-End Fine-Grained Action Segmentation and Recognition Using Conditional Random Field Models and Discriminative Sparse Coding
Effrosyni Mavroudi†, Divya Bhaskara†‡, Shahin Sefati†♮, Haider Ali†, and René Vidal†
†Johns Hopkins University, ‡University of Virginia, ♮Comcast AI Research
Abstract
Fine-grained action segmentation and recognition is an important yet challenging task. Given a long, untrimmed sequence of kinematic data, the task is to classify the action at each time frame and segment the time series into the correct sequence of actions. In this paper, we propose a novel framework that combines a temporal Conditional Random Field (CRF) model with a powerful frame-level representation based on discriminative sparse coding. We introduce an end-to-end algorithm for jointly learning the weights of the CRF model, which include action classification and action transition costs, as well as an overcomplete dictionary of mid-level action primitives. This results in a CRF model that is driven by sparse coding features obtained using a discriminative dictionary that is shared among different actions and adapted to the task of structured output learning. We evaluate our method on three surgical tasks using kinematic data from the JIGSAWS dataset, as well as on a food preparation task using accelerometer data from the 50 Salads dataset. Our results show that the proposed method performs on par with or better than state-of-the-art methods.
1. Introduction
Temporal segmentation and recognition of complex activities in long continuous recordings is a useful, yet challenging task. Examples of complex activities comprised of fine-grained goal-driven actions that follow a grammar are surgical procedures [9], food preparation [31] and assembly tasks [35]. For instance, in the medical field there is a need to better train surgeons in performing surgical procedures using new technologies such as the daVinci robot. One possible approach is to use machine learning and computer vision techniques to automatically determine the skill level of the surgeon from kinematic data of the surgeon's performance recorded by the robot [9]. Such an approach typically requires an accurate classification of the surgical gesture at each time frame [3] and a segmentation of the surgical task into the correct sequence of gestures [34]. Another example of a complex activity with goal-driven fine-grained actions following a grammar is cooking. Although the actions performed while preparing a recipe and their relative ordering can vary, there are still temporal relations among them. For instance, the action stir milk usually happens after pour milk, or the action fry egg usually follows the action crack egg. Robots equipped with the ability to automatically recognize actions during food preparation could assist individuals with cognitive impairments in their daily activities by providing prompts and instructions. However, the task of fine-grained action segmentation and recognition is challenging due to the subtle differences between actions, the variability in the duration and style of execution among users, and the variability in the relative ordering of actions.

Existing approaches to fine-grained action segmentation and recognition use a temporal model to capture the temporal evolution and ordering of actions, such as Hidden Markov Models (HMMs) [13, 32], Conditional Random Fields (CRFs) [16, 17], Markov semi-Markov Conditional Random Fields (MsM-CRFs) [34], Recurrent Neural Networks [8, 28] and Temporal Convolutional Networks (TCNs) [15]. However, such models cannot capture subtle differences between actions without a powerful, discriminative and robust representation of frames or short temporal segments. Sparse coding has emerged as a powerful signal representation in which the raw data in a certain time frame is represented as a linear combination of a small number of basis elements from an overcomplete dictionary. The coefficients of this linear combination are called sparse codes and are used as a new representation for temporal modeling. However, since the dictionary is typically learned in an unsupervised manner by minimizing a regularized reconstruction error [1], the resulting representation may not be discriminative for a given learning task. Task-driven discriminative dictionary learning addresses this issue by coupling dictionary and classifier learning [24]. For example, Sefati et al. [30] propose an approach to fine-grained action recognition called Shared Discriminative Sparse Dictionary Learning (SDSDL), where sparse codes are extracted at each time frame and a frame feature is computed by average pooling the sparse codes over a short temporal window surrounding the frame. The dictionary is jointly learned with the per-frame classifier parameters, resulting in a discriminative mid-level representation that is shared across all actions/gestures.
However, their approach lacks a temporal model, which is crucial for modeling temporal dependencies. Although prior work [38] has combined discriminative dictionary learning with CRFs for the purpose of saliency detection, such work is not directly applicable to fine-grained action recognition.

In this work we propose a joint model for fine-grained action recognition and segmentation that integrates a CRF for temporal modeling and discriminative sparse coding for frame-wise action representation. The proposed CRF models the temporal structure of long untrimmed activities via unary potentials that represent the cost of assigning an action label to a frame-wise representation of an action obtained via discriminative sparse coding, and pairwise potentials that capture the transitions between actions and encourage smoothness of the predicted label sequence. The parameters of the combined model are trained jointly in an end-to-end manner using a max-margin approach. Our experiments show competitive performance on the task of fine-grained action recognition, especially in the regime of limited training data. In summary, the contributions of this paper are three-fold:

1. We propose a novel framework for fine-grained action segmentation and recognition which uses a CRF model whose target variables (action labels per time step) are conditioned on sparse codes.

2. We introduce an algorithm for training our model in an end-to-end fashion. In particular, we jointly learn a task-specific discriminative dictionary and the CRF unary and pairwise weights using Stochastic Gradient Descent (SGD).

3. We evaluate our model on two public datasets focused on goal-driven complex activities comprised of fine-grained actions. In particular, we use robot kinematic data from the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [9] dataset and evaluate our method on three surgical tasks. We also experiment with accelerometer data from the 50 Salads [31] dataset for recognizing actions that are labeled at two levels of granularity. Results show that our method performs on par with most state-of-the-art methods.
2. Related Work
The task of fine-grained action segmentation and recognition has recently received increased attention due to the release of datasets such as MPII Cooking [29], JIGSAWS [9] and 50 Salads [31]. In this section, we briefly review some of the main existing approaches for tackling this problem. In addition, we briefly discuss existing work on discriminative dictionary learning. Note that since the focus of this paper is fine-grained action recognition from kinematic data, we do not discuss approaches for feature extraction or object parsing from video data.
Fine-grained action recognition from kinematic data.
A straightforward approach to action segmentation and classification is the use of overlapping temporal windows in conjunction with temporal segment classifiers and non-maximum suppression (e.g., [29, 25]). However, this approach does not exploit long-range temporal dependencies.

Recently, deep learning approaches have started to emerge in the field. For instance, in [8] a recurrent neural network (a Long Short-Term Memory network, LSTM) is applied to kinematic data, while in [15] a Temporal Convolutional Network composed of 1D convolutions, non-linearities and pooling/upsampling layers is introduced. Although these models yield promising results, they do not explicitly model correlations and dependencies among action labels.

Another line of work, including our proposed method, takes into account the fact that the action segmentation and classification problem is a structured output prediction problem due to the temporal structure of the sequence of action labels, and thus employs structured temporal models such as HMMs and their extensions [32, 13, 14]. Among them, the work that is most related to ours is Sparse-HMMs [32], which combines dictionary learning with HMMs. However, a Sparse-HMM is a generative model in which a separate dictionary is learned for each action class. In this work we use a CRF, which is a discriminative model, and we learn a dictionary that is shared among all action classes. Discriminative models like CRFs [16, 17] and semi-Markov CRFs [34] have gained popularity since they allow for flexible energy functions. Other types of temporal models include the duration model and language model recently proposed in [27] for modeling action durations and context. The input to these temporal models is either the kinematic data themselves or features extracted from them. For instance, in the Latent Convolutional Skip-Chain CRF (LC-SC-CRF) [17] the responses to convolutional filters, which capture latent action primitives, are used as features.
Discriminative Dictionary Learning.
Task-driven discriminative dictionary learning was introduced in the seminal work of Mairal et al. [24] and couples the process of dictionary learning and classifier training, thus incorporating supervised learning into sparse coding. Since then, discriminative dictionary learning has enjoyed many successes in diverse areas such as handwritten digit classification [22, 39], face recognition [10, 39, 26], object category recognition [10, 26, 5], scene classification [5, 19, 26], and action classification [26].

The closest work to ours is the Shared Discriminative Sparse Dictionary Learning (SDSDL) proposed by Sefati et al. [30], where sparse codes are used as frame features and a discriminative dictionary is jointly learned with per-frame action classifiers for the task of surgical task segmentation. Our work builds on top of this model by replacing the per-frame classifiers, which compute independent predictions per frame, with a structured output temporal model (a CRF), which takes into account the temporal dependencies between actions. While prior work has considered joint dictionary and CRF learning [33, 37, 38] for the tasks of semantic segmentation and saliency estimation, our work differs from these previous approaches in three key aspects. First, to the best of our knowledge, we are the first to apply joint dictionary and CRF learning to the task of action segmentation and classification. Second, we learn unary CRF classifiers and pairwise transition scores, while in [33] only two scalar variables encoding the relative weight between the unary and pairwise potentials are learned. Third, we use local temporal average-pooling of sparse codes as a feature extraction process for capturing local temporal context, instead of the raw sparse codes used in [37, 38].
3. Technical Approach
In this section, we introduce our temporal CRF model and frame-wise representation based on sparse coding, and describe our algorithm for training our model. Figure 1 illustrates the key components of our model.
Frame-wise representation.
Let X = {x_t}_{t=1}^T be a sequence of length T, with x_t ∈ R^p being the input at time t (e.g., the robot's joint positions and velocities). Our goal is to compactly represent each x_t as a linear combination of a small number of atomic motions using an overcomplete dictionary of representative atomic motions Ψ ∈ R^{p×m}, i.e., x_t ≈ Ψ u_t, where u_t ∈ R^m is the vector of sparse coefficients obtained for frame t. Such sparse codes can be obtained by considering the following optimization problem:

\min_{\{u_t\}_{t=1}^T} \frac{1}{T} \sum_{t=1}^{T} \|x_t - \Psi u_t\|_2^2 + \lambda_u \|u_t\|_1,   (1)

where λ_u is a regularization parameter controlling the trade-off between reconstruction error and sparsity of the coefficients. Problem (1) is a standard Lasso regression and can be efficiently solved using existing sparse coding algorithms [23]. After computing sparse codes u_t for each time step of the input sequence, we follow the approach proposed in [30] to compute feature vectors {z_t}_{t=1}^T. Namely, we initially split the positive and negative components of the sparse codes and stack them on top of each other. This step yields a vector a_t ∈ R^D, D = 2m, which is given by:

a_t = \begin{bmatrix} \max(0, u_t) \\ \min(0, u_t) \end{bmatrix}.   (2)

This is a common practice [6, 4], which allows the classification layer to assign different weights to positive and negative responses. Second, we compute a feature vector z_t ∈ R^D for each frame by average-pooling the vectors a_t in a temporal window T_t surrounding frame t, i.e.:

z_t = \frac{1}{L} \sum_{j \in \mathcal{T}_t} a_j, \qquad \mathcal{T}_t \doteq \left\{ t - \left\lfloor \tfrac{L}{2} \right\rfloor, \ldots, t + \left\lfloor \tfrac{L}{2} \right\rfloor \right\},   (3)

where L is the length of the temporal window centered at frame t. This feature vector captures local temporal context.

Figure 1: Overview of our framework. Given an input time series X, we first extract sparse codes U for each timestep using a dictionary Ψ. Sparse codes are then average pooled in short temporal windows, yielding feature vectors Z per timestep. These feature vectors are then given as inputs to a Linear Chain CRF with weights W. Trainable parameters Ψ and W are shown in light pink boxes.
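To make the frame-wise pipeline concrete, the following sketch computes sparse codes and then applies the sign-splitting and average-pooling of Eqs. (2)-(3). It is illustrative only: the paper solves the Lasso with the SPAMS toolbox [23], whereas here scikit-learn's Lasso stands in for it (its objective rescales the data term, so its alpha is not numerically identical to λ_u), and the dictionary Psi, the window length L and the regularization weight are placeholder values.

```python
# A minimal sketch of Eqs. (1)-(3), assuming a given dictionary Psi.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_codes(X, Psi, lam=0.15):
    """X: (T, p) frames, Psi: (p, m) dictionary -> (T, m) codes u_t (Eq. 1)."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
    return np.stack([lasso.fit(Psi, x).coef_.copy() for x in X])

def frame_features(U, L=31):
    """Sign-split (Eq. 2) and average-pool over a length-L window (Eq. 3)."""
    A = np.concatenate([np.maximum(0, U), np.minimum(0, U)], axis=1)  # a_t
    half = L // 2
    # Boundary frames are pooled over truncated windows in this sketch.
    return np.stack([A[max(0, t - half):t + half + 1].mean(axis=0)
                     for t in range(len(A))])
```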
Temporal model.

Let Z = {z_t}_{t=1}^T be a sequence of length T, with z_t being the feature vector representing the input at time t, and let Y = {y_t}_{t=1}^T be the corresponding sequence of action labels per frame, y_t ∈ {1, ..., N_c}, with N_c being the number of action classes. Let G = {V, E} be the graph whose nodes correspond to different frames (|V| = T) and whose edges connect every d frames (with d = 1 corresponding to consecutive frames). Our CRF models the conditional distribution of labels given the input features with a Gibbs distribution of the form P(Y | Z) ∝ exp E(Z, Y), where the energy E(Z, Y) is factorized into a sum of potential functions defined on cliques of order less than or equal to two. Formally, the energy function can be written as:

E(Z, Y) = \sum_{t=1}^{T} U_{y_t}^\top z_t + \sum_{t=1}^{T-d} P_{y_t, y_{t+d}},   (4)

where the first term is the unary potential, which models the score of assigning label y_t to frame t described by feature z_t, while the second term is called the pairwise potential and models the score of assigning labels y_t and y_{t+d} to frames t and t + d, respectively (d is a parameter called the skip length, and a CRF with d > 1 is called a Skip-Chain CRF (SC-CRF) [16, 17]). U_{y_t} ∈ R^D is a linear unary classifier corresponding to action class y_t and P ∈ R^{N_c × N_c} is the pairwise transition matrix. Note that there exist different variants of this model. For instance, one can use precomputed unary and pairwise potentials and learn two scalar coefficients that encode the relative weights of the two terms [33].

We now show how this energy can be written as a linear function with respect to a parameter vector W ∈ R^{N_c D + N_c^2}. The unary term can be rewritten as follows:

\sum_{t=1}^{T} U_{y_t}^\top z_t = \begin{bmatrix} U_1^\top, \ldots, U_{N_c}^\top \end{bmatrix} \begin{bmatrix} \sum_{t=1}^{T} z_t \, \delta(y_t = 1) \\ \vdots \\ \sum_{t=1}^{T} z_t \, \delta(y_t = N_c) \end{bmatrix} = W_U^\top \Phi_U(Z, Y),   (5)

where W_U and Φ_U(Z, Y) ∈ R^{N_c D} are, respectively, the unary CRF weights and the unary joint feature. Similarly, the pairwise term can be written as:

\begin{bmatrix} P_{11}, \ldots, P_{N_c N_c} \end{bmatrix} \begin{bmatrix} \sum_{t=1}^{T-d} \delta(y_t = 1)\,\delta(y_{t+d} = 1) \\ \vdots \\ \sum_{t=1}^{T-d} \delta(y_t = N_c)\,\delta(y_{t+d} = N_c) \end{bmatrix} = W_P^\top \Phi_P(Y),   (6)

where W_P, Φ_P(Y) ∈ R^{N_c^2} are the pairwise CRF weights and pairwise joint feature. Therefore, the overall energy function can be written as:

E(Z, Y) = \begin{bmatrix} W_U \\ W_P \end{bmatrix}^\top \begin{bmatrix} \Phi_U(Z, Y) \\ \Phi_P(Y) \end{bmatrix} = W^\top \Phi(Z, Y),   (7)

where W is the vector of CRF weights and Φ(Z, Y) the joint feature [11]. At this point, we should emphasize that the feature vectors Z = [z_1, ..., z_T] are constructed by local average pooling of the sparse codes and are therefore implicitly dependent on the input data X and the dictionary Ψ. For the rest of this manuscript, we will denote this dependency by substituting Z with the notation Z(X, Ψ). So our energy can be rewritten as:

E(Z(X, Ψ), Y) = W^\top \Phi(Z(X, Ψ), Y).   (8)

It should now be clear that if Ψ is fixed, then the energy is linear with respect to the parameter vector W, as in a standard CRF model. However, if Ψ is a parameter that needs to be learned, then the energy function is nonlinear with respect to (W, Ψ) and thus training is not straightforward.
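As a sanity check on the notation, the energy of Eq. (4) can be computed directly. The following minimal sketch assumes zero-based labels (the text uses {1, ..., N_c}) and NumPy arrays with Z of shape (T, D), W_U of shape (N_c, D) and P of shape (N_c, N_c):

```python
# Direct transcription of Eq. (4): unary scores plus skip-d transitions.
import numpy as np

def energy(Z, Y, W_U, P, d=1):
    unary = sum(float(W_U[Y[t]] @ Z[t]) for t in range(len(Y)))
    pairwise = sum(float(P[Y[t], Y[t + d]]) for t in range(len(Y) - d))
    return unary + pairwise
```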
The training problem is addressed next. Let {X^n}_{n=1}^{N_s} be N_s training sequences with associated label sequences {Y^n}_{n=1}^{N_s}. We formulate the training problem as one of minimizing the following regularized loss:

J(W, Ψ) = \frac{1}{2} \|W\|_F^2 + \frac{C}{N_s} \sum_{n=1}^{N_s} \Big( \max_{Y} \big[ \Delta(Y^n, Y) + \langle W, \Phi(Z^n(X^n, \Psi), Y) \rangle \big] - \langle W, \Phi(Z^n(X^n, \Psi), Y^n) \rangle \Big),   (9)

where C is a regularization parameter controlling the regularization of the CRF weights, \Delta(\hat{Y}, Y) = \sum_{t=1}^{T} \delta(\hat{y}_t \neq y_t) is the Hamming loss between two sequences of labels \hat{Y} and Y, and Z^n is the matrix of feature vectors extracted from the frames of input sequence X^n, i.e., Z^n = [z_1^n, ..., z_T^n]. This max-margin formulation performs regularized empirical risk minimization and bounds the Hamming loss from above. We use a Stochastic Gradient Descent algorithm for minimizing the objective function in Eq. (9). Our algorithm is based on the task-driven dictionary learning approach developed by Mairal et al. [24]. Notice that, although the sparse coefficients are computed by minimizing a non-differentiable objective function (Eq. 1), J(W, Ψ) is differentiable and its gradient can be computed [22]. In particular, the function relating the sparse codes u_t and the dictionary is differentiable almost everywhere, except at the points where the set of non-zero elements of u_t (called the support set and denoted by S_t) changes. Assuming that the perturbations of the dictionary atoms are small enough that the support set stays the same, we can compute the gradient of the non-zero coefficients with respect to the columns of Ψ indexed by S_t, denoted as Ψ_{S_t}, as follows [33]:

\frac{\partial u_t(k)}{\partial \Psi_{S_t}} = (x_t - \Psi_{S_t} (u_t)_{S_t}) (A_t^{-1})_{[k]} - (\Psi_{S_t} A_t^{-\top})_{\langle k \rangle} (u_t)_{S_t}^\top,   (10)

where k ∈ S_t, (u_t)_{S_t} denotes the sub-vector of u_t with entries in S_t, A_t = Ψ_{S_t}^\top Ψ_{S_t}, and the subscripts [k] and ⟨k⟩ denote, respectively, the k-th row and column of the corresponding matrix.

Given the dictionary and CRF weights computed at the (i−1)-th iteration, the main steps of our iterative algorithm at the i-th iteration are as follows (a code sketch of the key steps is given after the list):

1. Randomly select a training sequence (X^i, Y^i).

2. Compute sparse codes u_t with Eq. 1 and feature vectors z_t with Eq. 3 using dictionary Ψ^{(i−1)}.

3. Find the sequence \hat{Y} that yields the most violated constraint by solving the loss-augmented inference problem

\hat{Y} = \operatorname*{argmax}_{Y} \; \Delta(Y^i, Y) + \big\langle W^{(i-1)}, \Phi(Z^i(X^i, \Psi^{(i-1)}), Y) \big\rangle   (11)

using the Viterbi algorithm (see [17] for details regarding inference when using a SC-CRF with d > 1).

4. Compute the gradient with respect to the CRF parameters W:

\frac{\partial J}{\partial W} = W^{(i-1)} + C \Big( \Phi(Z^i(X^i, \Psi^{(i-1)}), \hat{Y}) - \Phi(Z^i(X^i, \Psi^{(i-1)}), Y^i) \Big).   (12)
5. Compute the gradient with respect to the dictionary Ψ using the chain rule:

\frac{\partial J}{\partial \Psi} = \sum_{t=1}^{T} \left( \frac{\partial J}{\partial z_t} \right)^{\top} \frac{\partial z_t}{\partial \Psi} = \sum_{t=1}^{T} \left( \frac{\partial J}{\partial z_t} \right)^{\top} \frac{1}{L} \sum_{j \in \mathcal{T}_t} \frac{\partial a_j}{\partial \Psi} = \frac{1}{L} \sum_{t=1}^{T} \sum_{j \in \mathcal{T}_t} \left[ (x_j - \Psi_{S_j}^{(i-1)} (u_j)_{S_j}) \left( A_j^{-1} \Big( \frac{\partial J}{\partial z_t} \Big)_{\tilde{S}_j} \right)^{\top} - \Psi_{S_j}^{(i-1)} A_j^{-\top} (u_j)_{S_j} \Big( \frac{\partial J}{\partial z_t} \Big)_{\tilde{S}_j}^{\top} \right],   (13)

where ∂J/∂z_t = U_{\hat{y}_t} − U_{y_t} ∈ R^D, S_j is the set of indices corresponding to the non-zero entries of the vector u_j, \tilde{S}_j is the set of indices corresponding to the non-zero entries of the vector a_j, A_j = Ψ_{S_j}^\top Ψ_{S_j}, Ψ_{S_j} denotes the active columns of the dictionary indexed by S_j, (u_j)_{S_j} denotes the non-zero entries of vector u_j, and (∂J/∂z_t)_{\tilde{S}_j} denotes the entries of the partial derivative corresponding to the non-zero entries of vector a_j.

6. Update W, Ψ using stochastic gradient descent.

7. Normalize the dictionary atoms to have unit ℓ2 norm. This step prevents the columns of Ψ from becoming arbitrarily large, which would result in arbitrarily small sparse coefficients.
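The sketch below illustrates steps 3, 4 and 6 above for the special case d = 1, i.e., a Linear Chain CRF: loss-augmented Viterbi decoding of Eq. (11) followed by a stochastic update of the CRF weights with the gradient of Eq. (12). It is not the authors' implementation: labels are zero-based, shapes (Z of size (T, D), W_U of size (N_c, D), P of size (N_c, N_c)), the learning rate and the omission of momentum and of the dictionary gradient of Eq. (13) are all simplifying assumptions made for brevity.

```python
import numpy as np

def loss_augmented_viterbi(Z, y_true, W_U, P):
    """Solve Eq. (11): argmax_Y Delta(Y_true, Y) + E(Z, Y) for d = 1."""
    T, N_c = Z.shape[0], W_U.shape[0]
    # Unary scores plus the Hamming loss (adds 1 for every mislabeled frame).
    scores = Z @ W_U.T + (y_true[:, None] != np.arange(N_c))
    dp, back = scores[0].copy(), np.zeros((T, N_c), dtype=int)
    for t in range(1, T):
        cand = dp[:, None] + P               # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + scores[t]
    y_hat = np.empty(T, dtype=int)
    y_hat[-1] = int(dp.argmax())
    for t in range(T - 1, 0, -1):            # backtrace
        y_hat[t - 1] = back[t, y_hat[t]]
    return y_hat

def sgd_step(Z, y_true, W_U, P, C=1.0, lr=1e-3):
    """One stochastic update of the CRF weights using Eq. (12)."""
    y_hat = loss_augmented_viterbi(Z, y_true, W_U, P)
    gU, gP = W_U.copy(), P.copy()            # gradient of (1/2)||W||^2
    for t in range(len(y_true)):
        gU[y_hat[t]] += C * Z[t]             # +C * Phi_U(Z, y_hat)
        gU[y_true[t]] -= C * Z[t]            # -C * Phi_U(Z, y_true)
    for t in range(len(y_true) - 1):
        gP[y_hat[t], y_hat[t + 1]] += C      # +C * Phi_P(y_hat)
        gP[y_true[t], y_true[t + 1]] -= C    # -C * Phi_P(y_true)
    return W_U - lr * gU, P - lr * gP
```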
4. Experiments
We evaluate our method on two public datasets for fine-grained action segmentation and recognition: JIGSAWS [9] and 50 Salads [31]. First, we report our results on each dataset and compare them with the state of the art. Next, we examine the effect of different model components.
JHU-ISI Gesture and Skill Assessment (JIGSAWS) [9].
This dataset provides kinematic data of the right and left manipulators of the master and slave da Vinci surgical robot, recorded at 30 Hz during the execution of three surgical tasks (Suturing (SU), Knot-tying (KT) and Needle-passing (NP)) by surgeons with varying skill levels. In particular, the kinematic data include positions, orientations, velocities, etc. (76 variables in total), and there are 8 surgeons performing a total of 39, 36 and 26 trials for the Suturing, Knot-tying and Needle-passing surgical tasks, respectively. This dataset is challenging due to the significant variability in the execution of tasks by surgeons of different skill levels and the subtle differences between fine-grained actions. There are 10, 6 and 8 different action classes for the Suturing, Knot-tying and Needle-passing tasks, respectively. Examples of action classes are orienting needle, reaching for needle with right hand, pulling suture with left hand, and making C loop. We evaluate our method using the standard Leave-One-User-Out (LOUO) and Leave-One-Supertrial-Out (LOSO) cross-validation setups [2].
50 Salads [31].
This dataset provides data recorded by 10 accelerometers attached to kitchen tools, such as a knife, peeler, oil bottle, etc., during the preparation of a salad by 25 users. This dataset features annotations at four levels of granularity, out of which we use the eval and mid granularities. The former consists of 10 actions that can be reasonably recognized based on the utilization of accelerometer-equipped objects, such as add oil, cut, peel, etc., while the latter consists of 18 mid-level actions, such as cut tomato and peel cucumber. Both granularities include a background class. We evaluate our method using the ground truth labels and the 5-fold cross-validation setup proposed by the authors of [18, 15].

In summary, these two datasets provide kinematic/sensor data recorded during the execution of long goal-driven complex activities, which are comprised of multiple fine-grained action instances following a grammar. Hence, they are suitable for evaluating our method, which was designed for kinematic data and features a temporal model that is able to capture action transitions. Other datasets collected for action segmentation with available skeleton data, such as CAD-120 [12], Composable Activities [20], Watch-n-Patch [36] and OAD [7], have a mean number of 3 to 12 action instances per sequence [21], while, for example, the Suturing task in the JIGSAWS dataset features an average of 20 action instances per sequence, ranging from 17 to 37, and is therefore more challenging for comparing temporal models. Recently, the PKU-MMD dataset [21] was proposed, which is of larger scale and also contains around 20 action instances per sequence. However, the actions in this dataset are not fine-grained (e.g., hand waving, hugging, etc.).

Table 1: Average per-frame action recognition accuracy for surgical task segmentation and recognition on the JIGSAWS dataset [9]. The results are averaged over three random runs, with the standard deviation reported in parentheses. Best results are shown in bold, while second best results are denoted in italics. * Our results are not directly comparable with those of [8], since they were using data downsampled in time (5Hz). For a fair comparison, results for LSTM and BiLSTM on non-downsampled data (30Hz) were obtained using the code and default parameters publicly available from the authors [8]. ** Our results are not directly comparable with those of LC-SC-CRF [17], where the authors were using both kinematic data as well as the distance from the tools to the closest object in the scene from the video.

Method             | SU (LOSO)    | KT (LOSO) | NP (LOSO) | SU (LOUO)    | KT (LOUO)    | NP (LOUO)
GMM-HMM [2]        | 82.22        | 80.95     | 70.55     | 73.95        | 72.47        | 64.13
KSVD-SHMM [32, 2]  | 83.40        | 83.54     | 73.09     | 73.45        | 74.89        | 62.78
MsM-CRF [34, 2]    | 81.99        | 79.26     | 72.44     | 67.84        | 44.68        | 63.28
SC-CRF-SL [16, 2]  | 85.18        |           |           |              |              |
SDSDL [30]         |              |           |           |              | -            | -
BiLSTM (30Hz) [8]* | -            | -         | -         | 80.15        | -            | -
TCN [18]           | -            | -         | -         | 79.6         | -            | -
LC-SC-CRF [17]**   | -            | -         | -         |              | -            | -
Ours               | 86.21 (0.34) | (0.08)    | (0.12)    | 78.16 (0.42) | 76.68 (1.20) | (0.06)

Implementation details.

Input data are normalized to have zero mean and unit standard deviation. We apply PCA to the robot kinematic data of the JIGSAWS dataset to reduce their dimension, following the setup of [30]. The dictionary is initialized using the SPAMS dictionary learning toolbox [23] and the CRF parameters are initialized to zero. We use mini-batch Stochastic Gradient Descent with momentum.
We also reduce the learning rate by one half every few epochs and train our models for a fixed number of epochs. Parameters such as the regularization cost C, the initial learning rate η, the temporal window size for average-pooling L, the Lasso regularization parameter λ_u, the skip-chain length d and the dictionary size m vary with each dataset, surgical task or granularity. The window size L was fixed to a single value per dataset (one for JIGSAWS and one for 50 Salads), while the dictionary size m, λ_u, C, η and d were chosen via cross-validation over small sets of candidate values. To perform cross-validation we generate five random splits of the available sequences of each dataset task/granularity. Note that since both datasets have a fixed test setup, with all users appearing in the test set exactly once, it is not clear how to use them for hyperparameter selection without inadvertently training on the test set. Hence, we randomly crop a temporal segment from each of the videos, instead of using the whole sequences, for cross-validation, in order to avoid using the exact same video sequences which will be used for evaluating our method. The length of these segments is a fixed fraction of the original sequence length. Furthermore, we select m, λ_u and d by using the initialized dictionary Ψ and learning the weights of a SC-CRF, while we choose C and η by jointly learning the dictionary and the SC-CRF weights.
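A hedged sketch of the random-crop cross-validation protocol just described; the crop fraction, the grid contents and the train_and_score callback (train on the cropped sequences of the training split, return validation accuracy) are hypothetical stand-ins, since the concrete values are dataset-specific and not recoverable here:

```python
# Hypothetical sketch of hyperparameter selection with random temporal crops.
import numpy as np

rng = np.random.default_rng(0)

def random_crop(X, Y, frac=0.6):
    """Crop a random contiguous segment covering `frac` of the sequence."""
    T = len(Y)
    Lc = max(1, int(frac * T))
    s = int(rng.integers(0, T - Lc + 1))
    return X[s:s + Lc], Y[s:s + Lc]

def select_hyperparams(seqs, labs, grid, train_and_score, n_splits=5):
    """Evaluate each configuration on n_splits random splits of cropped data."""
    means = []
    for params in grid:
        scores = [train_and_score([random_crop(X, Y) for X, Y in zip(seqs, labs)],
                                  params, seed=split)
                  for split in range(n_splits)]
        means.append(np.mean(scores))
    return grid[int(np.argmax(means))]

# Example grid (placeholder values, not the paper's):
grid = [dict(m=m, lam=lam, d=d)
        for m in (100, 200) for lam in (0.1, 0.3) for d in (1, 10)]
```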
Overall performance.

We first compare our method with state-of-the-art methods on the JIGSAWS and 50 Salads datasets. The per-frame action recognition accuracies of all the compared methods on JIGSAWS are summarized in Table 1. It can be seen that our method yields the best or second best performance for all tasks on both the LOSO and LOUO setups, except for Suturing LOUO, where LC-SC-CRF achieves a higher per-frame action recognition accuracy. However, their result is not directly comparable to ours, since they employ additional video-based features. Also note that in [16] they use a SC-CRF with an additional pairwise term (skip-length data potentials), which is not incorporated in our model and could potentially improve our results. However, it is worth noting that our method achieves comparable performance to deep recurrent models such as LSTMs [8] and the newly proposed TCN [18], which possibly capture complex temporal patterns, such as action compositions, action durations, and long-range temporal dependencies. Furthermore, our method consistently improves over SDSDL [30], which was based on joint sparse dictionary and linear SVM learning, as well as a temporal smoothing of the results using the Viterbi algorithm with precomputed action transition probabilities.

Table 2 summarizes our results on the 50 Salads dataset under two granularities. Although the modality used in this dataset is different (accelerometer data), it can be seen that our method is very competitive among all the compared methods, even with respect to methods relying on powerful deep temporal models such as LSTMs.

Table 2: Results for action segmentation and recognition on the 50 Salads dataset using the eval and mid granularities. Results are averaged over three random runs, with the standard deviation reported in parentheses. Best results are shown in bold, while second best results are denoted in italics. * LC-SC-CRF [17] was evaluated on the mid granularity with smoothed out short interstitial background segments [18].

Method         | eval   | mid
LC-SC-CRF [17] | 77.8   | *
LSTM [18]      | 73.3   | -
TCN [18]       |        | -
Ours           | (0.11) | 56.72 (0.72)
Ablative analysis.
In Tables 4 and 3 we analyze the contribution of the key components of our method, namely the contribution of a) using sparse features (Eq. 3) obtained from an unsupervised dictionary in conjunction with a Linear Chain CRF, b) substituting the Linear Chain CRF with a Skip Chain CRF (SC-CRF), and c) jointly learning the dictionary used in sparse coding and the CRF unary and pairwise weights. As expected, using sparse features instead of the raw kinematic features consistently boosts performance across all tasks on JIGSAWS. Similarly, sparse coding of accelerometer data improves performance on 50 Salads and, notably, this improvement is larger in the case of fine-grained activities (mid granularity). Furthermore, using a SC-CRF further boosts performance, as expected, since it is more suitable for capturing action-to-action transition probabilities, in contrast to the Linear Chain CRF, which captures frame-to-frame action transition probabilities.

Table 3: Analysis of the contribution to recognition performance from each model component on the 50 Salads dataset. Results are averaged over three random runs, with the standard deviation reported in parentheses. raw + CRF: kinematic data as input to a CRF; SF + CRF: sparse features z as input to a CRF; SF + SC-CRF: sparse features z as input to a SC-CRF; SDL + SC-CRF: joint dictionary and SC-CRF learning.

Method       | eval         | mid
raw + CRF    | 71.81 (0.55) | 44.83 (0.73)
SF + CRF     | 76.65 (0.19) | 52.63 (0.23)
SF + SC-CRF  | 80.24 (0.20) | (0.08)
SDL + SC-CRF | (0.11)       | 56.72 (0.72)

It is, however, surprising that learning a discriminative dictionary jointly with the CRF weights does not significantly improve performance, yielding only a marginal improvement. Further investigating this result, we computed additional metrics for evaluating the segmentation quality on the JIGSAWS dataset. In particular, we report the edit score [17], a metric measuring how well the model predicts the ordering of action segments, and the segmental F1@10 score as defined in [15]. As can be seen in Table 5, performance is similar across all metrics for both the unsupervised and the discriminative dictionary, except for a consistent improvement in Needle Passing. One possible explanation could be that the computation of features based on average pooling of sparse codes in a temporal window might reduce the impact of the discriminatively trained dictionary. However, repeating the experiment on JIGSAWS (Suturing LOSO) without average temporal pooling leads to the same behavior, i.e., a dictionary learned via unsupervised training and used with a SC-CRF yields nearly the same per-frame accuracy as a dictionary jointly trained with the SC-CRF. Our findings could be attributed to the limited training data. They also seem to corroborate the conclusions drawn by Coates et al. [6], who have experimentally observed that the superior performance of sparse coding, especially when training samples are limited, arises from its non-linear encoding scheme and not from the basis functions that it uses.
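For reference, below is a sketch of the two segmentation metrics just mentioned, written from their published definitions, the edit score of [17] (a normalized Levenshtein distance between the segment label sequences) and the segmental F1@k of [15] (a predicted segment counts as a true positive if its IoU with an unmatched ground-truth segment of the same class is at least k), rather than from the authors' evaluation code; tie-breaking details may differ from the official implementations.

```python
import numpy as np

def segments(y):
    """Collapse per-frame labels into (label, start, end) runs."""
    segs, start = [], 0
    for t in range(1, len(y) + 1):
        if t == len(y) or y[t] != y[start]:
            segs.append((y[start], start, t))
            start = t
    return segs

def edit_score(y_pred, y_true):
    """100 * (1 - normalized Levenshtein distance) between segment labels."""
    a = [s[0] for s in segments(y_pred)]
    b = [s[0] for s in segments(y_true)]
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0], D[0, :] = np.arange(len(a) + 1), np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return 100.0 * (1.0 - D[-1, -1] / max(len(a), len(b)))

def segmental_f1(y_pred, y_true, k=0.10):
    """Segmental F1@k: greedy one-to-one matching of same-class segments."""
    pred, true = segments(y_pred), segments(y_true)
    matched, tp = [False] * len(true), 0
    for lbl, s, e in pred:
        best_iou, best_j = 0.0, -1
        for j, (l2, s2, e2) in enumerate(true):
            if l2 != lbl or matched[j]:
                continue
            iou = max(0, min(e, e2) - max(s, s2)) / (max(e, e2) - min(s, s2))
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= k:
            tp, matched[best_j] = tp + 1, True
    fp, fn = len(pred) - tp, len(true) - tp
    return 100.0 * 2 * tp / max(2 * tp + fp + fn, 1)
```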
Qualitative results.

In Fig. 2 we show examples of ground truth segmentations and predictions for selected test sequences from JIGSAWS Suturing. As can be seen, the LOUO setup is more challenging, since the model is asked to recognize actions performed by a user it has not seen before, and, in addition, there is great variability in experience and style among surgeons. In all cases our model outputs smooth predictions, without significant over-segmentation.
Table 4: Analysis of the contribution to recognition performance from each model component on the JIGSAWS dataset. Results are averaged over three random runs, with the standard deviation reported in parentheses. raw + CRF: kinematic data as input to a Linear Chain CRF; SF + CRF: sparse features z as input to a CRF; SF + SC-CRF: sparse features z as input to a SC-CRF; SDL + SC-CRF: joint dictionary and SC-CRF learning.

Method       | SU (LOSO)    | KT (LOSO)    | NP (LOSO)    | SU (LOUO)    | KT (LOUO)    | NP (LOUO)
raw + CRF    | 79.57 (0.04) | 76.39 (0.09) | 66.24 (0.10) | 71.77 (0.05) | 69.63 (0.06) | 59.47 (0.18)
SF + CRF     | 85.70 (0.01) | 82.06 (0.03) | 71.72 (0.07) | 76.64 (0.05) | 73.58 (0.07) | 60.59 (0.19)
SF + SC-CRF  | (0.03)       | 83.71 (0.03) | 74.63 (0.02) | (0.05)       | (0.14)       | 65.75 (0.12)
SDL + SC-CRF | 86.21 (0.34) | (0.07)       | (0.12)       | 78.16 (0.42) | 76.68 (1.20) | (0.06)

Table 5: Comparison of the unsupervised and supervised dictionaries used for sparse coding on the JIGSAWS dataset. Metrics reported are accuracy / edit score / segmental F1@10 score. Results are from a single random run. SF + SC-CRF: sparse features z obtained from an unsupervised dictionary as input to a SC-CRF; SDL + SC-CRF: sparse features z from a discriminative dictionary learned jointly with a SC-CRF.

Method       | SU (LOSO)             | KT (LOSO)             | NP (LOSO) | SU (LOUO)       | KT (LOUO) | NP (LOUO)
SF + SC-CRF  | / / 87.46             | 74.62 / 73.05 / 76.01 | / /       | / 63.61 / 71.38 |           | 65.81 / 55.45 / 62.30
SDL + SC-CRF | 85.90 / 75.45 / 83.47 | / 82.82 /             | / /       | / /             | / /       | / /

Figure 2: Qualitative examples of ground truth (GT) and predicted temporal segmentations, before (Pred) and after (Pred + med) median filtering, on JIGSAWS data: (a), (b) Suturing LOSO; (c), (d) Suturing LOUO. Each color denotes a different action class. (Best viewed in color.)

5. Conclusion

We have presented a novel end-to-end learning framework for fine-grained action segmentation and recognition, which combines features based on sparse coding with a Linear Chain CRF model. We also proposed a max-margin approach for jointly learning the sparse dictionary and the CRF weights, resulting in a dictionary adapted to the task of action segmentation and recognition. Experimental evaluation on two datasets showed that our method performs on par with or outperforms most of the state-of-the-art methods. Given the recent success of deep convolutional networks (CNNs), future work will explore using deep features as inputs to the temporal model and jointly learning the CNN and CRF parameters in a unified framework.
Acknowledgements.
We would like to thank Colin Lea and Lingling Tao for their insightful comments and for their help with the JIGSAWS dataset, and Vicente Ordóñez for his useful feedback during this research collaboration. This work was supported by NIH grant R01HD87133.

References

[1] M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
[2] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. Béjar, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager. A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Transactions on Biomedical Engineering, 2017.
[3] B. Béjar, L. Zappella, and R. Vidal. Surgical gesture classification from video data. In Medical Image Computing and Computer Assisted Intervention, pages 34–41, 2012.
[4] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In IEEE Conference on Computer Vision and Pattern Recognition, pages 660–667, 2013.
[5] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2559–2566. IEEE, 2010.
[6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 921–928, 2011.
[7] R. De Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuytelaars. Online action detection. In European Conference on Computer Vision, pages 269–284. Springer, 2016.
[8] R. DiPietro, C. Lea, A. Malpani, N. Ahmidi, S. S. Vedula, G. I. Lee, M. R. Lee, and G. D. Hager. Recognizing surgical activities with recurrent neural networks. In Medical Image Computing and Computer Assisted Intervention, pages 551–558. Springer, 2016.
[9] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, C. C. G. Chen, R. Vidal, S. Khudanpur, and G. D. Hager. JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In Fifth Workshop on Modeling and Monitoring of Computer Assisted Interventions (M2CAI), 2014.
[10] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1697–1704, 2011.
[11] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[12] H. S. Koppula, R. Gupta, and A. Saxena. Learning human activities and object affordances from RGB-D videos. International Journal of Robotics Research, 2013.
[13] H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In IEEE Conference on Computer Vision and Pattern Recognition, pages 780–787, 2014.
[14] H. Kuehne, J. Gall, and T. Serre. An end-to-end generative framework for video segmentation and recognition. In IEEE Winter Applications of Computer Vision Conference, Lake Placid, Mar 2016.
[15] C. Lea, M. Flynn, R. Vidal, A. Reiter, and G. Hager. Temporal convolutional networks for action segmentation and detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[16] C. Lea, G. D. Hager, and R. Vidal. An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In IEEE Winter Conference on Applications of Computer Vision, pages 1123–1129, 2015.
[17] C. Lea, R. Vidal, and G. D. Hager. Learning convolutional action primitives for fine-grained action recognition. In IEEE International Conference on Robotics and Automation, 2016.
[18] C. Lea, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks: A unified approach to action segmentation. In Workshop on Brave New Ideas on Motion Representation, 2016.
[19] X.-C. Lian, Z. Li, B.-L. Lu, and L. Zhang. Max-margin dictionary learning for multiclass image categorization. In European Conference on Computer Vision, pages 157–170, 2010.
[20] I. Lillo, A. Soto, and J. C. Niebles. Discriminative hierarchical modeling of spatio-temporally composable human activities. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[21] C. Liu, Y. Hu, Y. Li, S. Song, and J. Liu. PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. CoRR, 2017.
[22] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, 2012.
[23] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
[24] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach. Supervised dictionary learning. In Neural Information Processing Systems, pages 1033–1040, 2009.
[25] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In IEEE International Conference on Computer Vision, pages 1817–1824, 2013.
[26] Y. Quan, Y. Xu, Y. Sun, Y. Huang, and H. Ji. Sparse coding for classification via discrimination ensemble. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5839–5847, 2016.
[27] A. Richard and J. Gall. Temporal action detection using a statistical language model. In IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
[28] A. Richard, H. Kuehne, and J. Gall. Weakly supervised action learning with RNN based fine-to-coarse modeling. CoRR, abs/1703.08132, 2017.
[29] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[30] S. Sefati, N. J. Cowan, and R. Vidal. Learning shared, discriminative dictionaries for surgical gesture segmentation and classification. In MICCAI 6th Workshop on Modeling and Monitoring of Computer Assisted Interventions (M2CAI), Munich, Germany, 2015.
[31] S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 729–738. ACM, 2013.
[32] L. Tao, E. Elhamifar, S. Khudanpur, G. Hager, and R. Vidal. Sparse hidden Markov models for surgical gesture classification and skill evaluation. In Information Processing in Computer-Assisted Interventions, 2012.
[33] L. Tao, F. Porikli, and R. Vidal. Sparse dictionaries for semantic segmentation. In European Conference on Computer Vision, 2014.
[34] L. Tao, L. Zappella, G. Hager, and R. Vidal. Segmentation and recognition of surgical gestures from kinematic and video data. In Medical Image Computing and Computer Assisted Intervention, 2013.
[35] N. N. Vo and A. F. Bobick. From stochastic grammar to Bayes network: Probabilistic parsing of complex activity. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2641–2648, 2014.
[36] C. Wu, J. Zhang, S. Savarese, and A. Saxena. Watch-n-Patch: Unsupervised understanding of actions and relations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4362–4370, 2015.
[37] J. Yang and M.-H. Yang. Top-down visual saliency via joint CRF and dictionary learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[38] J. Yang and M.-H. Yang. Top-down visual saliency via joint CRF and dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):576–588, 2017.
[39] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.