Multi-modal Fusion for Single-Stage Continuous Gesture Recognition
Harshala Gammulle, Member, IEEE, Simon Denman, Member, IEEE, Sridha Sridharan, Life Senior Member, IEEE, and Clinton Fookes, Senior Member, IEEE
Abstract—Gesture recognition is a much studied research area which has myriad real-world applications including robotics and human-machine interaction. Current gesture recognition methods have heavily focused on isolated gestures, and existing continuous gesture recognition methods are limited by a two-stage approach where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition model that can detect and classify multiple gestures in a single video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation stage to detect individual gestures. To enable this, we introduce a multi-modal fusion mechanism to support the integration of important information that flows from multi-modal inputs, and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features respectively. To further enhance performance we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction. We demonstrate the utility of our proposed framework, which can handle variable-length input videos, and outperforms the state-of-the-art on two challenging datasets, EgoGesture and IPN Hand. Furthermore, ablative experiments show the importance of different components of the proposed framework.
Index Terms—Gesture Recognition, Spatio-temporal Representation Learning, Temporal Convolution Networks.
I. INTRODUCTION

THE computer-aided recognition of gestures has a vast number of applications including human-computer interaction, robotics, sign language recognition, gaming and virtual reality control. Due to its diverse applications, gesture recognition has gained much attention in the computer vision domain.

Most gesture recognition approaches are based on recognising isolated gestures [1]–[3], where an input video is manually segmented into clips that each contain a single isolated gesture. In a real-world scenario where gestures are performed continuously, methods based on isolated gestures are not directly applicable, and thus do not translate to a natural setting. As such, recent approaches [4]–[6] aim to recognise gestures in the continuous original (i.e. unsegmented) video where multiple gesture categories, including both gestures and non-gesture actions, are included. These continuous gesture recognition approaches are formulated in two ways: two-stage [5]–[7] and single-stage [8] methods. The two-stage approach is built around using two models: one model to perform gesture detection (also known as gesture spotting), and another for gesture classification. In [6] the authors proposed a two-stage method where gestures are first detected by a shallow 3D-CNN, and when a gesture is detected it activates a deep 3D-CNN classification model. Another work [5] proposed utilising a Bidirectional Long Short-Term Memory (Bi-LSTM) to detect gestures, while the authors use a combination of two 3D Convolution Neural Networks (3D-CNN) and a Long Short-Term Memory (LSTM) network to process multi-modal inputs for gesture classification.

H. Gammulle, S. Denman, S. Sridharan and C. Fookes are with the Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Lab, Queensland University of Technology, Brisbane, Australia. E-mail: [email protected]. Manuscript received.
Fig. 1. Single-stage Continuous Gesture Recognition: The model is fed with a multi-modal (RGB, depth, etc.) feature sequence and the ground truth label sequence. The ground truth can belong to a particular gesture or the non-gesture (BG) class. During training, using the ground truth the model learns to map from the input frames to the corresponding gesture class.
Single-stage approaches originate from the action recognition domain [9], [10], where frames that do not contain an action are labelled 'background' (similar to the non-gesture class). In contrast to two-stage methods, single-stage methods use only a single model which directly performs the gesture classification. Fig. 1 illustrates the typical structure of a single-stage approach where recognition is performed by considering all the gesture classes together with the non-gesture class. In addition to being simpler than two-stage methods, single-stage methods avoid the potential issue of errors being propagated between stages. For example, in a two-stage method, if the detector makes an error estimating the start or end of a gesture sequence, this error is propagated through to the classification process. Hence, in two-stage methods, classifier performance is highly dependent on the robustness of the detector, further increasing the appeal of single-stage methods. However, we observe that two-stage methods are the popular choice among researchers when performing continuous gesture recognition. This is largely due to the challenges that a single network must address when performing both gesture localisation and recognition concurrently.

Several gesture recognition approaches have also exploited multi-modal data and have shown improved results through fusion [1], [5]. In [1], the authors introduce a simple neural network module to fuse features from two modes for the two-stage gesture recognition task. However, for continuous gesture recognition, any fusion scheme must consider that the input sequence includes multiple gestures that evolve temporally. Hence, using a simple attention layer to fuse domains restricts the learning capacity as the model attention is applied to the complete sequence, ignoring the fact that there are multiple gesture sub-sequences, and potentially leading to some individual gestures being suppressed.

In this paper we propose a novel single-stage method for continuous gesture recognition. By using a single-stage approach we expect the classification model to learn natural transitions between gestures and non-gestures. However, directly learning the gestures from a continuous unsegmented video is much more challenging as it requires the model to detect the transitions between gesture classes and recognise gestures/non-gestures simultaneously. To improve performance we consider multiple modalities and introduce a novel fusion module that extracts multiple feature sub-sequences from the multi-modal input streams, considering their temporal order. The proposed fusion module preserves this temporal order and enables the learning of discriminative feature vectors from the available modalities. To aid model learning, we propose a novel mid-point based loss function, and perform additional experiments to prove the effectiveness of the proposed loss.

Figure 2 illustrates the architecture of our proposed framework. In the first stage of the model, semantic features from each mode are extracted via a feature extractor, and the extracted features are passed through a Unimodal Feature Mapping (UFM) block. We maintain separate UFM blocks for each stream. The outputs of all UFM blocks are used by the proposed fusion module, which learns multi-modal spatio-temporal relationships to support the recognition process.
The output of the fusion module is passed through the Multi-modal Feature Mapping (MFM) block which performs the final classification. The model is explicitly designed to handle variable-length video sequences. Through evaluations on two continuous gesture datasets, EgoGesture [11] and IPN Hand [4], we show that our proposed method achieves state-of-the-art results. We also perform extensive ablation evaluations, demonstrating the effectiveness of each component of the model. Furthermore, we illustrate the scalability of our proposed fusion model by performing continuous gesture recognition with two and three modes.

In summary, our contributions are as follows:
• We propose a single-stage continuous gesture recognition model, which utilises only a single network to achieve gesture detection and classification directly.
• We introduce a novel temporal multi-modal feature fusion mechanism, which preserves the temporal order of the inputs in the fusion process and supports the final classification task.
• We introduce a novel mid-point based loss function, which encourages smooth transitions between different gesture classes and enhances learning.
• We demonstrate that our proposed framework is able to handle variable-length input videos, and significantly outperforms the state-of-the-art results on two challenging datasets.
• Through extensive ablation evaluations, we show the effectiveness of each of the proposed novelties in our framework.

II. RELATED WORKS
Gesture recognition has been an extensively studied area in computer vision as it facilitates multiple applications. Early approaches used handcrafted feature recognition systems [12]–[15]. For example, in [12] the authors proposed a spatio-temporal feature named Mixed Features around Sparse Keypoints (MFSK), which is extracted from RGB-D data. In [13] the authors propose to extract a visual representation for hand motions using motion divergence fields. Other methods are based on extracting Random Occupancy Pattern (ROP) features [16], Super Normal Vectors (SNV) [15], and improved dense trajectories [17]. However, these hand-crafted feature methods rely solely on human domain knowledge and risk failing to capture necessary information that may greatly contribute towards correct recognition.

Subsequently, attention has shifted to deep network-based approaches [18]–[22] due to their ability to learn task-specific features automatically, without being totally reliant on the domain knowledge of the researcher. As such, most recent gesture recognition methods use deep networks [1], [4], [6], [23]–[25] and have demonstrated superior results to their hand-crafted counterparts.

Deep learning methods have considered gesture recognition in two ways: isolated gesture recognition [1], [26], [27]; and continuous gesture recognition [6], [24]. Isolated gesture recognition uses segmented gesture videos containing a single gesture per video, and is a naive and simplified way of performing gesture recognition which does not reflect the overall real-world challenge gesture recognition poses. In [26], the authors proposed three variants of 3D CNNs which learn spatio-temporal information through their hierarchical structure to recognise isolated gestures. [1] proposed a fusion unit to integrate and learn information that flows through two uni-modal CNN models to support isolated gesture recognition. [28] introduced a multi-modal training/uni-modal testing approach where the authors embed the knowledge from individual 3D-CNN networks, forcing them to collaborate and learn a common semantic representation to recognise isolated gestures. However, the simplicity of isolated gesture recognition methods prevents their direct application to real-world tasks, as the input video contains multiple sequential gestures that must first be segmented.
Fig. 2. Proposed single-stage framework: The data from each mode is passed through a pre-trained feature extractor and subsequently through separate Unimodal Feature Mapping (UFM) blocks. The output of each UFM block is fused by the proposed fusion block which learns discriminative features from each mode, considering the temporal order of the data. This aids the final gesture classification which is performed by the Multi-modal Feature Mapping (MFM) block.

Hence, focus has turned to developing methods for the recognition of gestures in unsegmented (i.e. continuous) video streams [6]. In an unsegmented video, as there are sub-sequences containing both gestures and non-gestures, a typical model first detects gesture regions (also known as gesture spotting [11]) prior to recognising each of these gestures. [6] formulated a two-stage framework to carry out the detection and classification of continuous gestures, where their detection method first performs gesture detection and the classification model is activated only when a gesture is detected. However, these two-stage methods require two separate networks to perform gesture detection and classification respectively.

To the best of our knowledge, [8], [11] are the only existing single-stage continuous gesture recognition methods. In [8] the authors employ an RNN to predict the gesture labels for the input frame sequence. In [29] the authors utilise a C3D model to classify the continuous gestures. The model sequentially slides over the input video and outputs a single gesture class representing the gesture within that input window, including the non-gesture class. They further improve the gesture prediction method by employing a Spatio-Temporal Transfer Module (STTM) [29] and an LSTM network, where the LSTM predicts the gesture labels based on the C3D features. However, these methods fail to achieve comparable accuracies when evaluated against two-stage methods. We believe this is due to the simplistic nature of the architectures, which cannot handle the complexities within the single-stage framework.

Action recognition is a related problem domain where significant developments have occurred compared to the gesture recognition domain. Similar to continuous gesture recognition, the task of continuous action recognition (also known as temporal action segmentation) has been investigated using various strategies [9], [10], [30], [31]. However, most temporal action segmentation methods are single-stage methods where detection and classification are performed by a single network. Single-stage methods offer advantages over two-stage methods in that there is only a single model and errors from the first stage (the detector) are not propagated to the second stage (the classifier). Furthermore, a single-stage model can learn not only a single gesture sequence, but also leverage information on how different types of gestures are sequentially related. [9] introduced Temporal Convolution Networks (TCN) that utilise a hierarchy of temporal convolutions. In [10], the authors extended the ideas of [9] and introduced a multi-stage model for action segmentation, where each stage is composed of a set of dilated temporal convolutions that generate predictions at each stage.

Motivated by previous action segmentation methods [9], [10], we propose a single-stage method for continuous gesture recognition.
Although multi-modal fusion or information sharing methods have been proposed for gesture recognition in [1], [28], these fusion strategies have limited applicability to the single-stage paradigm. When sequences are segmented, all frames are part of the same gesture; hence, a simple attention or concatenation of the features can produce good results as all the information relates to the one gesture. In contrast, in a single-stage model the input to the classifier contains multiple gesture sequences and non-gesture frames. Hence, the fusion strategy should understand how these sub-sequences are temporally related and filter out the most relevant information considering this temporal order. To this end, we introduce a fusion mechanism that preserves this temporal accordance and that can be applied to two or more modalities for continuous gesture recognition.

III. METHOD
We introduce a novel framework to support multi-modal single-stage video-based classification tasks such as gesture recognition. In the introduced framework, the videos from each mode are first passed through feature extractors, and the extracted deep features are subsequently passed through individual unimodal networks which we term Unimodal Feature Mapping (UFM) blocks. The output feature vector of each UFM block is used by the proposed fusion block, which learns a discriminative feature vector that is input to the Multi-modal Feature Mapping (MFM) block to perform classification. Figure 2 illustrates the overall model architecture.

Our framework can be used with segmented or unsegmented videos, which may be composed of one or more gesture classes, and supports the fusion of any number of modalities greater than one. Each feature stream that belongs to a specific mode is required to pass through that mode's UFM block prior to fusion.

The task our approach seeks to solve can be defined as follows: given a sequence of video frames $X^i = \{x^i_1, x^i_2, \ldots, x^i_T\}$, where $i = 1, 2, \ldots, M$ ($M$ is the number of modalities), we aim to infer the gesture class label for each time step $t$ (i.e. $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T$).

In the following sections, we provide a detailed description of the models and the proposed loss formulation.

A. Unimodal Feature Mapping (UFM) Block
Video frames for a given modality are passed through a feature extractor (each mode has its own feature extractor to learn a mode-specific representation), and the extracted features are the input to the UFM block. Through the UFM block, we capture salient features related to a specific modality and learn a feature vector suitable for feature fusion. As shown in Figure 2, this uni-modal network is composed of temporal convolution layers and multiple dilated residual blocks, where each dilated residual block is composed of a dilated convolution layer followed by a non-linear activation function and a $1 \times 1$ convolution-BatchNorm-ReLU [32] layer. We take inspiration from [10], where the authors utilise residual connections to facilitate gradient flow. As in [10], [33], we use a dilation factor that is doubled at each layer, and each layer is composed of an equal number of convolutional filters.
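To make the layer structure concrete, the following is a minimal PyTorch sketch of a dilated residual layer and a UFM-style stack built from it. The module names, channel width, kernel size of 3 and dropout rate are our own assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """Dilated temporal conv -> ReLU -> 1x1 conv, with a residual connection."""
    def __init__(self, channels, dilation, dropout=0.25):
        super().__init__()
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):               # x: (batch, channels, T)
        out = torch.relu(self.dilated_conv(x))
        out = self.dropout(self.conv_1x1(out))
        return x + out                  # residual connection eases gradient flow

class UFMBlock(nn.Module):
    """Maps per-frame deep features of one modality to a d-dimensional sequence."""
    def __init__(self, in_dim, d=64, num_layers=12):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, d, kernel_size=1)
        # the dilation factor doubles at every layer, as in MS-TCN / WaveNet
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(d, dilation=2 ** i) for i in range(num_layers)])

    def forward(self, x):               # x: (batch, in_dim, T) extracted features
        out = self.conv_in(x)
        for layer in self.layers:
            out = layer(out)
        return out                      # (batch, d, T)
```

Because every layer preserves the sequence length, the same block can be applied to videos of any duration, which is what allows the framework to accept variable-length inputs.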
B. Fusion Block

The output vectors of the UFM blocks are passed through the fusion block, which extracts temporal features from the uni-modal sequences, considering their temporal accordance with the current time step. Feature fusion is performed using the attention level parameter. This parameter defines the feature units that should be selected from the output vector of each UFM block at a given time. An illustration is given in Figure 3.
1) Attention Level Parameter ($A$): Let $V^1_t, V^2_t, \ldots, V^M_t$ be the output feature vectors from the UFM blocks representing the $M$ modalities, where $t = 1, 2, \ldots, T$. By considering the value set for the parameter $A$, the algorithm decides which feature units from each vector should be selected for the fusion at time $t$. This selection criterion is defined based on the fact that the multi-modal feature streams are synchronised and the features from the temporal neighbours at a particular timestamp should carry knowledge informative for the gesture class of that frame, while distant temporal neighbours do not carry helpful information (as they are likely from different gesture classes). Based on whether $A$ is even or odd, we calculate the position increment ($i_{inc}$) and decrement ($i_{dec}$) values as shown below. Here, $i_{inc}$ defines the number of units ahead we should consider during the fusion, while $i_{dec}$ defines the number of units behind that should be selected.

if $A$ is even (i.e. $A \bmod 2 = 0$) then $i_{inc} = A/2$ and $i_{dec} = (A-2)/2$; else if $A$ is odd (i.e. $A \bmod 2 = 1$) then $i_{inc} = i_{dec} = (A-1)/2$.

Once $i_{inc}$ and $i_{dec}$ are calculated, at time $t$ the units from $t - i_{dec}$ to $t + i_{inc}$ are selected from each feature vector. This sub-feature vector is given by,

$S^i_t = [V^i_{t - i_{dec}}, \ldots, V^i_t, \ldots, V^i_{t + i_{inc}}]$,   (1)

where $i = 1, 2, \ldots, M$. As shown in Figure 4, when the attention level is 4, four feature units (from $t-1$ to $t+2$) are selected from each vector from the UFM.
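The window selection above can be expressed directly over a UFM output tensor. The sketch below is our own illustration of that selection rule; in particular, the replication padding used at the sequence boundaries is an assumption, as the paper does not state how edge frames are treated.

```python
import torch

def window_offsets(attention_level: int):
    """Number of feature units taken after (i_inc) and before (i_dec) time t."""
    if attention_level % 2 == 0:
        i_inc = attention_level // 2
        i_dec = (attention_level - 2) // 2
    else:
        i_inc = i_dec = (attention_level - 1) // 2
    return i_inc, i_dec

def select_subsequences(V, attention_level: int):
    """Collect S_t for every time step from a UFM output V of shape (batch, d, T).

    Returns a tensor of shape (batch, d, T, attention_level), where
    windows[..., t, :] = [V_{t-i_dec}, ..., V_t, ..., V_{t+i_inc}].
    """
    i_inc, i_dec = window_offsets(attention_level)
    padded = torch.nn.functional.pad(V, (i_dec, i_inc), mode='replicate')
    windows = padded.unfold(dimension=2, size=attention_level, step=1)
    return windows
```

For example, `window_offsets(4)` returns `(2, 1)`, matching the $t-1$ to $t+2$ selection shown in Figure 4, and `window_offsets(5)` returns `(2, 2)`, matching Figure 3.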
2) Feature Enhancer (FE):
At each time step $t$, the feature enhancer receives the computed sub-vectors $S^i_t$ from each UFM, where $i$ indicates the modality, and concatenates these sub-vectors, generating an augmented vector $\eta_t$,

$\eta_t = [S^1_t, \ldots, S^i_t, \ldots, S^M_t]$.   (2)

If each feature unit is of dimension $d$ and the attention level is $A$, then $\eta_t$ will have shape $(d, A \times M)$. We then utilise the proposed Feature Enhancer (FE) block, which is inspired by the squeeze-and-excitation block architecture introduced in [34], to allow the model to identify informative features from the fused multi-modal features, enhancing relevant feature units and suppressing uninformative features. However, the squeeze-and-excitation block of [34] considers the overall 2D/3D CNN layer output and enhances features considering their distribution across channels. In contrast, we propose to enhance features within the sub-feature vectors for each $t$. Through the FE block, features from each sub-feature vector are enhanced by explicitly modelling the inter-dependencies between channels, further supporting the multi-modal fusion. To exploit the sub-feature dependencies we first perform global average pooling to retrieve informative information within each of the $d$ channels of the sub-feature vector. This can be defined by,

$z_t = F_{GAP}(\eta_t(a, m)) = \frac{1}{A \times M} \sum_{a=1}^{A} \sum_{m=1}^{M} \eta_t(a, m)$.   (3)

Then a gating mechanism implemented using sigmoid activations is applied to filter out the informative channels within $d$ such that,

$\beta_t = \sigma(W_2 \, \mathrm{ReLU}(W_1 z_t))$.   (4)

The resultant augmented feature vector, $\beta_t$, is also of shape $(d, A \times M)$; however, the informative information within is enhanced by considering all the modalities.

Fig. 3. Illustration of the proposed attention level parameter, $A$, and the associated attention scheme. This parameter determines the number of temporal neighbours that a particular frame is associated with, controlling the information flow to the fusion module. For instance, if $A = 5$, two neighbouring feature units surrounding the current time step $t$ in each direction (i.e. from $t-2$ to $t$ and from $t$ to $t+2$) are selected and processed.
Fig. 4. Illustration of the fusion process when $A = 4$. Features surrounding the current time step $t$ are passed to the proposed fusion block from each modality, which first concatenates them (see Sec. III-B2). It then passes the concatenated feature vector through the proposed feature enhancement function, which identifies salient feature values in the concatenated vector to enhance, and components to suppress. Utilising this scheme we identify the most informative feature units from the local temporal window for decision making at the current time step.
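A squeeze-and-excitation style reading of Eqs. (3)-(4) can be sketched as below. The channel reduction ratio and the way the sigmoid gate is broadcast back over the fused vector (so that the enhanced $\beta_t$ keeps the $(d, A \times M)$ shape described in the text) are our assumptions; the paper does not report these details.

```python
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    """SE-style gating over the fused sub-vectors (a sketch of Eqs. (3)-(4))."""
    def __init__(self, d, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(d, d // reduction)   # W1
        self.fc2 = nn.Linear(d // reduction, d)   # W2

    def forward(self, eta_t):
        # eta_t: (batch, d, A*M) concatenation of the per-modality sub-vectors S^i_t
        z_t = eta_t.mean(dim=2)                                      # Eq. (3): global average pooling
        gate = torch.sigmoid(self.fc2(torch.relu(self.fc1(z_t))))    # Eq. (4): channel gates in [0, 1]
        beta_t = eta_t * gate.unsqueeze(-1)                          # re-weight channels, keep (batch, d, A*M)
        return beta_t
```

The reduction ratio of 4 is a typical squeeze-and-excitation choice rather than a value reported in the paper.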
C. Multi-modal Feature Mapping (MFM) Block

The MFM block learns to generate the final gesture classification for the corresponding frame $t$ utilising the fused feature vector $\beta_t$. Similar to the UFM, the MFM utilises a series of temporal convolution layers and multiple dilated residual blocks to operate over the fused feature vector sequence, $\beta = [\beta_1, \ldots, \beta_T]$. By considering the sequential relationships it generates the frame-wise gesture classifications, $\hat{y}_1, \ldots, \hat{y}_T$. This can be written as,

$\hat{y}_1, \ldots, \hat{y}_T = F_{MFM}([\beta_1, \ldots, \beta_T])$.   (5)
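Structurally the MFM mirrors the UFM, ending in a per-frame classification head. A brief sketch, reusing the hypothetical DilatedResidualLayer from the earlier UFM example, is given below; flattening the fused $(d, A \times M)$ vector per frame before the input convolution is our own assumption.

```python
import torch.nn as nn

class MFMBlock(nn.Module):
    """Frame-wise classifier over the fused sequence (sketch; reuses DilatedResidualLayer)."""
    def __init__(self, fused_dim, num_classes, d=64, num_layers=10):
        super().__init__()
        self.conv_in = nn.Conv1d(fused_dim, d, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(d, dilation=2 ** i) for i in range(num_layers)])
        self.conv_out = nn.Conv1d(d, num_classes, kernel_size=1)

    def forward(self, beta):            # beta: (batch, fused_dim, T), fused_dim = d*A*M flattened per frame
        out = self.conv_in(beta)
        for layer in self.layers:
            out = layer(out)
        return self.conv_out(out)       # per-frame class logits: (batch, num_classes, T)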
D. Loss Formulation

As the classification loss we utilise the cross-entropy loss, which is defined as,

$\mathcal{L}_{ce} = \frac{1}{T} \sum_t -\log(\hat{y}_t)$.   (6)

However, only using the frame-wise classification loss to learn gesture segmentation is insufficient and can lead to over-segmentation errors, even while maintaining high frame-wise accuracy. Hence, we also use the smoothing loss introduced by [10]. This smoothing loss uses the truncated mean squared error over the frame-wise log probabilities. The smoothing loss can be defined as,

$\mathcal{L}_{sm} = \frac{1}{T \times C} \sum_{t,c} \tilde{\Delta}_{t,c}$,   (7)

where,

$\tilde{\Delta}_{t,c} = \begin{cases} \Delta_{t,c}, & \text{if } \Delta_{t,c} \leq \tau \\ \tau, & \text{otherwise} \end{cases}$   (8)

and,

$\Delta_{t,c} = |\log \hat{y}_{t,c} - \log \hat{y}_{t-1,c}|$.   (9)

Here, $T$, $C$ and $\hat{y}_{t,c}$ denote the number of frames per sequence, the number of classes and the probability of class $c$ at time $t$, respectively. However, the smoothing loss calculation only takes the predicted sequence into account, without considering its corresponding ground truth sequence. Furthermore, we observe that it discourages the transition of gestures. Considering this limitation, we propose a novel loss function which we term the mid-point smoothing loss.

Motivated by median filtering in signal denoising, we define our mid-point smoothing loss to encourage smooth predictions. However, instead of merely smoothing the predictions, we propose to calculate the distance between the smoothed ground truth and predictions, incorporating a smoothing effect when calculating the loss. Let $w$ represent a sliding window with $N$ elements. First, we obtain the ground truth gesture class at the mid-point within the window $w$,

$\bar{y} = F_{mid\text{-}point}(y_n)$,   (10)

where $n \in w$. Similarly, we obtain the predicted gesture class at the mid-point using,

$\tilde{y} = F_{mid\text{-}point}(\hat{y}_n)$.   (11)
Then we define our mid-point smoothing loss,

$\mathcal{L}_{mid} = \sum_{w \in T} \| \bar{y} - \tilde{y} \|$.   (12)

As we are operating over the smoothed ground truth and predicted sequences instead of the raw sequences, we observe that this loss component encourages smooth alignment between the ground truth and the predictions.

Finally, all three loss functions are summed to form the final loss,

$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{sm} + \lambda_2 \mathcal{L}_{mid}$,   (13)

where $\lambda_1$ and $\lambda_2$ are model hyper-parameters that determine the contribution of the different losses.
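As a rough sketch of how the three terms in Eq. (13) could be combined, the following assumes per-frame logits and hard ground-truth labels. The window length, the truncation threshold, the loss weights and the use of softmax probabilities (rather than hard class labels) in the mid-point term are all placeholder assumptions, not values or choices taken from the paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, tau=4.0, lambda_sm=0.15, lambda_mid=1.0, window=9):
    """logits: (batch, C, T) frame-wise class scores; labels: (batch, T) gesture indices."""
    log_probs = F.log_softmax(logits, dim=1)

    # Eq. (6): frame-wise cross entropy
    l_ce = F.cross_entropy(logits, labels)

    # Eqs. (7)-(9): truncated MSE over adjacent-frame log-probabilities, as in [10]
    delta = (log_probs[:, :, 1:] - log_probs[:, :, :-1].detach()).abs()
    l_sm = torch.clamp(delta, max=tau).pow(2).mean()

    # Eqs. (10)-(12): compare prediction and ground truth at the centre of each
    # sliding window (sketch; window size and distance choice are assumptions)
    probs = F.softmax(logits, dim=1)                                 # (batch, C, T)
    one_hot = F.one_hot(labels, num_classes=logits.size(1)).permute(0, 2, 1).float()
    mid = window // 2
    pred_mid = probs.unfold(2, window, 1)[:, :, :, mid]              # (batch, C, T-window+1)
    gt_mid = one_hot.unfold(2, window, 1)[:, :, :, mid]
    l_mid = (pred_mid - gt_mid).norm(dim=1).mean()                   # Eq. (12)

    # Eq. (13): weighted sum of the three terms
    return l_ce + lambda_sm * l_sm + lambda_mid * l_mid
```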
E. Implementation Details

The pre-trained features are extracted using a ResNet50 network, where features are taken from the flattened layer. The UFM block is composed of $k = 12$ dilated residual layers, while the MFM block contains $k' = 10$ dilated residual layers. Similar to [10], we double the dilation factor at each layer. We use the Adam optimiser with a learning rate of 0.0005. The implementation of the proposed framework is completed using PyTorch [35].
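Putting the earlier sketches together, a two-modality (RGB and depth) forward pass and optimiser setup might look as follows. The feature dimension of 2048, the attention level of 8, the 84 classes and the learning rate follow the paper; everything else (UFMBlock, FeatureEnhancer, MFMBlock, select_subsequences, the channel width and the reshaping glue code) is our own assumed scaffolding.

```python
import torch

# Assemble the sketched components for a two-modality setup (RGB + depth).
A, d, num_classes = 8, 64, 84
ufm_rgb = UFMBlock(2048, d, num_layers=12)
ufm_depth = UFMBlock(2048, d, num_layers=12)
enhancer = FeatureEnhancer(d)
mfm = MFMBlock(fused_dim=d * A * 2, num_classes=num_classes, d=d, num_layers=10)

params = (list(ufm_rgb.parameters()) + list(ufm_depth.parameters()) +
          list(enhancer.parameters()) + list(mfm.parameters()))
optimiser = torch.optim.Adam(params, lr=0.0005)

def forward(rgb_feats, depth_feats):
    # rgb_feats, depth_feats: (batch, 2048, T) pre-extracted ResNet50 features
    windows = [select_subsequences(ufm(x), A)                 # (batch, d, T, A) each
               for ufm, x in ((ufm_rgb, rgb_feats), (ufm_depth, depth_feats))]
    eta = torch.cat(windows, dim=3)                           # (batch, d, T, A*M)
    b, _, T, _ = eta.shape
    beta = enhancer(eta.permute(0, 2, 1, 3).reshape(b * T, d, -1))   # per-frame enhancement
    beta = beta.reshape(b, T, -1).permute(0, 2, 1)            # (batch, d*A*M, T)
    return mfm(beta)                                          # frame-wise logits
```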
IV. EXPERIMENTS

A. Datasets
We evaluate our model on two challenging public datasets, EgoGesture [11] and IPN Hand [4]. Both datasets are comprised of unsegmented videos containing more than one gesture per video.
EgoGesture [11] is the largest egocentric gesture dataset available for segmented and unsegmented (continuous) gesture classification, and is composed of static and dynamic gestures. The dataset contains various challenging scenarios including a static subject with a dynamic background, a walking subject with a dynamic background, cluttered backgrounds, and subjects facing strong sunlight. In our work we utilise the unsegmented continuous gesture data, which is more challenging as it requires us to segment and recognise the gestures in a single pass. The dataset consists of 84 classes (83 gesture classes and the non-gesture class) recorded in 6 diverse indoor and outdoor settings. The dataset contains 1,239, 411 and 431 videos for training, validation and testing purposes respectively. The dataset provides RGB and depth videos.
IPN Hand [4] is a recently released dataset that supports continuous gesture recognition. The dataset contains videos based on 13 static/dynamic gesture classes and a non-gesture class. The gestures are performed by 50 distinct subjects in 28 diverse scenes. The videos are collected under extreme illumination conditions, and with static and dynamic backgrounds. The dataset contains a total of 4,218 gesture instances and 800,491 RGB frames. Compared to other publicly available hand gesture datasets, IPN Hand includes the largest number of continuous gestures per video, and has the most rapid transitions between gestures [4]. We utilise the IPN Hand dataset specifically as it is provided with multiple modalities: RGB, optical flow and hand segmentation data; which enables us to demonstrate the scalability (Sec. IV-C4) of the proposed framework.
B. Evaluation Metrics
Mean Jaccard Index (MJI): To enable state-of-the-art comparisons, we utilise the MJI to evaluate the model on the EgoGesture dataset, as suggested in [11], [36]. For a given input, the Jaccard index measures the average relative overlap between the ground truth and the predicted class label sequence. The Jaccard index for the $i$-th class is calculated using,

$J_{s,i} = \frac{G_{s,i} \cap P_{s,i}}{G_{s,i} \cup P_{s,i}}$,   (14)

where $G_{s,i}$ and $P_{s,i}$ represent the ground truth and predictions of the $i$-th class label for sequence $s$ respectively. Then the Jaccard index for the sequence can be computed by,

$J_s = \frac{1}{l_s} \sum_{i=1}^{L} J_{s,i}$,   (15)

where $L$ is the number of available gesture classes and $l_s$ represents the number of unique true labels. Then, the final mean Jaccard index over all $n$ testing sequences is calculated as,

$\bar{J}_s = \frac{1}{n} \sum_{j=1}^{n} J_{s,j}$.   (16)

Levenshtein Accuracy (LA): In order to evaluate on the IPN Hand dataset we use the Levenshtein accuracy metric used by [4]. The Levenshtein accuracy is calculated by estimating the Levenshtein distance between the ground truth and the predicted sequences. The Levenshtein distance counts the number of item-level changes needed to transform one sequence into the other. After obtaining the Levenshtein distance $l_d$, the Levenshtein accuracy is calculated by averaging this distance over the number of true target classes ($T_p$), subtracting the average value from 1 (to obtain the closeness), and multiplying by 100 to obtain a percentage,

$LA = \left(1 - \frac{l_d}{T_p}\right) \times 100$.   (17)

C. Evaluations

1) Selection of Parameter Value, $A$: In order to decide on the attention level parameter, $A$, we evaluated the model on the EgoGesture dataset while increasing the value of $A$. Figure 5 illustrates the impact of the attention level parameter on the Jaccard index score for the EgoGesture validation set. Note that $A = 1$ is a simple concatenation of the features. As an attention level of 8 produces the best results, we use $A = 8$ for the rest of the evaluations.
Fig. 5. The change in Jaccard index score as the attention level parameter, $A$, is varied.
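As a concrete illustration of the two metrics defined in Sec. IV-B, the sketch below computes the per-sequence Jaccard index and the Levenshtein accuracy from label sequences. This is our own illustrative implementation, not the official evaluation code of [4] or [11].

```python
import numpy as np

def jaccard_index(gt, pred, num_classes):
    """Per-sequence Jaccard index (Eqs. (14)-(15)) for frame-wise label arrays."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    scores = []
    for c in range(num_classes):
        union = np.logical_or(gt == c, pred == c).sum()
        if union > 0:                                   # class appears in gt or prediction
            scores.append(np.logical_and(gt == c, pred == c).sum() / union)
    # average over the number of unique true labels, as in Eq. (15)
    return np.sum(scores) / max(len(np.unique(gt)), 1)

def levenshtein_accuracy(gt_gestures, pred_gestures):
    """Levenshtein accuracy (Eq. (17)) between two gesture-label sequences."""
    m, n = len(gt_gestures), len(pred_gestures)
    dist = np.zeros((m + 1, n + 1), dtype=int)
    dist[:, 0], dist[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if gt_gestures[i - 1] == pred_gestures[j - 1] else 1
            dist[i, j] = min(dist[i - 1, j] + 1,         # deletion
                             dist[i, j - 1] + 1,         # insertion
                             dist[i - 1, j - 1] + cost)  # substitution
    return (1 - dist[m, n] / max(m, 1)) * 100            # averaged over the true target count T_p
```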
2) State-of-the-art Comparison:
In Table I, we compare the performance of the proposed method with the current state-of-the-art methods on the EgoGesture dataset. It should be noted that the comparison is done using continuous video streams containing multiple gestures in each video (not pre-segmented videos containing a single gesture per video). For the evaluations, the Jaccard index (as described in Sec. IV-B) is used. Similar to the original work [11], we use two settings when evaluating the results. In both settings we keep the sliding window (sw) length at 16, while we consider strides of 16 (l=16, s=16) and 8 (l=16, s=8). In [11], the authors obtain the label predictions by considering the class probabilities of each clip predicted by the C3D softmax layer. Here, the sliding window is applied over the whole sequence to generate a video clip. For the setting where the sliding windows overlap (i.e. l=16, s=8), frame label predictions are obtained by accumulating the classification scores of two overlapped windows and the most likely class is chosen as the predicted label for each frame.

Other than utilising the C3D model, the authors in [11] further improved the gesture prediction method by employing a Spatio-Temporal Transfer Module (STTM) [29] (denoted sw+C3D+STTM) and an LSTM network (denoted LSTM+C3D). In LSTM+C3D, the class labels of each frame are predicted by an LSTM based on the C3D features extracted at the current time slice. An LSTM with a hidden feature dimension of 256 is used. The authors gained better results by employing the LSTM network. However, in both settings, our proposed multi-modal fusion method is able to outperform the current state-of-the-art methods for the EgoGesture dataset by a considerable margin. We also obtained a 1.9% gain in MJI (see Sec. IV-B) using the second setting (i.e. l=16, s=8), where the sliding windows overlap.

TABLE I
Comparison of our proposed method with the state-of-the-art methods on the EgoGesture dataset. Results are shown using the MJI metric (see Sec. IV-B).

Method                        | MJI
sw+C3D (l=16, s=16) [11]      | 0.618
sw+C3D (l=16, s=8) [11]       | 0.698
sw+C3D+STTM (l=16, s=8) [11]  | 0.709
LSTM+C3D (l=16, s=8) [11]     | 0.718
Proposed (l=16, s=16)         | 0.784
Proposed (l=16, s=8)          |

We also evaluate our proposed model on the IPN Hand dataset [4], and Table II includes a comparison of our proposed fusion model with the state-of-the-art. The results use the Levenshtein accuracy metric (see Sec. IV-B). In the original work [4], the authors performed continuous gesture recognition using a two-stage approach where, at the first stage, a separate detection model is used to detect gestures within a sequence. For this task, binary classification is carried out to separate gestures from non-gestures using a ResLight-10 [6] model. In the second stage the detected gesture is classified by the classification model (ResNet50 or ResNetXt-101). For the overall process in [4], the authors considered different combinations of data modalities such as RGB-Flow and RGB-Seg, where 'Flow' and 'Seg' refer to optical flow and semantic segmentation respectively. The authors gained the highest classification results for the ResNetXt-101 with RGB-Flow data.

In contrast to the two-stage approach introduced in [4], we use a single-stage method which directly predicts the sequence of gesture class labels for the entire frame sequence of the video. Even though such a direct approach is challenging and requires a high level of discriminating ability within the model to separate multiple gesture and non-gesture classes, our fusion model outperforms the state-of-the-art results on the IPN Hand dataset by a significant margin. In Sec. IV-C4 we further evaluate the model using the three available modalities of RGB, flow and semantic segmentation outputs, illustrating the scalability of the proposed framework.
TABLE II
Comparison of our proposed method with the state-of-the-art method on the IPN Hand dataset. The results are shown in terms of Levenshtein accuracy (see Sec. IV-B).

Method            | Modality  | Results
ResNet50 [4]      | RGB-Seg   | 33.27
ResNet50 [4]      | RGB-Flow  | 39.47
ResNetXt-101 [4]  | RGB-Seg   | 39.01
ResNetXt-101 [4]  | RGB-Flow  | 41.47
Proposed          | RGB-Flow  |
In addition to the quantitative results, we provide qualitative results (in Fig. 6 and Fig. 7), where we visualise the temporal gesture predictions generated by the proposed method for different frame sequences from the EgoGesture and IPN Hand datasets respectively. As shown in Fig. 6, even with the higher number of gesture classes (84 classes including the non-gesture class) in the EgoGesture dataset, the model is able to detect the gestures well. We noticed that in only a few cases the gesture 'snap fingers' (in Fig. 6(b), bottom) is poorly detected and the model seems to struggle to locate the actual gesture class. Apart from a small number of instances where the model faces difficulties predicting the gesture class, the model learned the non-gesture to gesture transitions well.

Fig. 6. Qualitative results of the proposed model's predictions on the EgoGesture dataset.

Similarly, in the predictions obtained for the IPN Hand dataset shown in Fig. 7, the model shows good performance even in the presence of the rapid action changes in the dataset. According to the publishers of the dataset, the IPN Hand dataset contains the largest number of continuous gestures per video, which is understandable when comparing the timelines shown in Fig. 6 and Fig. 7.
3) Impact of Loss Formulation:
We investigate the impact of our proposed loss formulation, which enhances the overall learning of the introduced model. In Table III, the MJI on the EgoGesture dataset obtained with different loss formulations is shown. In the table, $\mathcal{L}_{ce}$, $\mathcal{L}_{sm}$ and $\mathcal{L}_{mid}$ represent the cross-entropy loss (Eq. 6), the smoothing loss (Eq. 7) and the mid-point smoothing loss (Eq. 12) respectively. Note that the combination of all three losses is the loss used by the proposed approach (Eq. 13).

TABLE III
The impact of different loss formulations: we compare the MJI on the EgoGesture dataset with different loss formulations. Here, $\mathcal{L}_{ce}$, $\mathcal{L}_{sm}$ and $\mathcal{L}_{mid}$ represent the cross-entropy loss (Eq. 6), the smoothing loss (Eq. 7) and the mid-point smoothing loss (Eq. 12) respectively.

Loss                                                       | MJI
$\mathcal{L}_{ce}$                                         |
$\mathcal{L}_{ce} + \mathcal{L}_{sm}$                      |
$\mathcal{L}_{ce} + \mathcal{L}_{mid}$                     |
$\mathcal{L}_{ce} + \mathcal{L}_{mid} + \mathcal{L}_{sm}$  |

From Table III we observe that both losses, $\mathcal{L}_{mid}$ and $\mathcal{L}_{sm}$, contribute to improving on the cross-entropy loss, with the proposed mid-point based smoothing mechanism showing a slightly higher improvement. However, we observe a significant improvement when utilising all the losses together, illustrating the importance of the mid-point based comparison of different predicted and ground-truth windows.
4) Scalability of the Fusion Block:
In order to illustrate the scalability of the proposed framework and fusion mechanism to different numbers of modalities, we make use of a third modality: the segmentation maps which are provided in the IPN Hand dataset. To make the feature extraction from the hand segmentation maps more meaningful, we use the Pix2Pix GAN introduced in [32]. We first train the GAN to generate hand segmentation maps that are similar to the ones provided with the IPN Hand dataset. We set the number of filters in the generator and the discriminator to 8 and train the GAN following the original work. After GAN training, we use the trained generator model for feature extraction of the third modality, where features are obtained from the bottleneck layer of the generator. These extracted feature vectors (with a total dimension of 2592) are fed to a third UFM model alongside the UFM models used for the RGB and optical flow based feature vectors, as per the model evaluated in Table II. It should be noted that having varying feature vector dimensions (i.e. 2592 for segmentation map inputs and 2048 for RGB and optical flow features) does not affect the fusion, as the UFM block maps the feature vectors to the same dimensionality at the output head, which is the input to the fusion block. As expected, with the use of three modalities we were able to improve the overall Levenshtein accuracy by 1.8% over the setting with only two modalities, achieving a Levenshtein accuracy of 69.92% with the three feature modalities. With this evaluation we illustrate that the proposed method can seamlessly be extended to fuse data from a different number of modalities with different feature dimensions. Despite the challenge this brings, our method has been able to successfully extract salient features to support the decision making process.

We use the implementation provided at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.

Fig. 7. Qualitative results of the proposed model's predictions on the IPN Hand dataset.
5) Ablation Experiment:
In order to demonstrate the importance of the proposed fusion mechanism we conducted an ablation experiment, in which we gradually remove components of the proposed framework and re-train and test the models on the EgoGesture dataset. In Table IV, we report the evaluation results for five ablation models using the MJI metric. The ablation models are formulated as follows.

• Only RGB: A single UFM block is utilised for the RGB stream and the output of the uni-modal block is passed directly through the MFM block (see Fig. 8 (a)). Here, the MFM block works as a second uni-modal network as only a single modality is used. Therefore, the proposed feature enhancer (FE) is not used.
• Only Depth: The model is similar to that of the 'Only RGB' ablation model (see Fig. 8 (a)). However, instead of the RGB input stream, the depth input stream is used.
• Simple Fusion: Our proposed framework is utilised without the introduced fusion block. The fusion is performed by concatenating the RGB and depth modalities along the sequence. The model architecture is further illustrated in Fig. 8 (b).
• Proposed w/o FE: The proposed framework without the feature enhancer (FE) module is utilised.

Fig. 8. Ablation models: (a) The models that utilise a single modality (Only RGB / Only Depth) are composed of a single UFM and MFM block, where the MFM block works as a second uni-modal network. (b) The Simple Fusion model is formulated by performing concatenation along the sequence of the RGB and depth modalities instead of utilising the proposed fusion block.
TABLE IV
Evaluation results for the ablation models using the EgoGesture dataset.

Model            | MJI
Only RGB         | 0.697
Only Depth       | 0.741
Simple Fusion    | 0.755
Proposed w/o FE  | 0.792
Proposed         |
A key observation based on the results presented in Table IV is that naive concatenation of multi-modal features does not generate helpful information for continuous gesture recognition. We observe a performance drop of approximately 5% when simple concatenation is applied in comparison to the proposed approach, and only a slight improvement for naive concatenation over the best individual model (depth). The proposed temporal fusion strategy as well as the feature enhancement block have clearly contributed to the superior results that we achieve.

V. CONCLUSION
We propose a single-stage continuous gesture recognition method with a novel fusion mechanism to perform multi-modal feature fusion. The proposed framework can be applied to varying length gesture videos, and is able to perform gesture detection and classification directly in a single step without the help of an additional detector model. The proposed fusion model is introduced to handle multiple modalities without a restriction on the number of modes, and further experiments demonstrate the scalability of the fusion method and show how the multiple streams complement the overall gesture recognition process. With the proposed loss formulation, our introduced single-stage continuous gesture recognition framework learns the gesture transitions with considerable accuracy, even with the rapid gesture transitions of the IPN Hand dataset. The ablation experiment further highlights the importance of the components of the proposed method, which outperformed the state-of-the-art systems on both datasets by a significant margin. Our model has applications to multiple real-world domains that require classification on continuous data, while the fusion model is applicable to other fusion problems where video or signal inputs are present, and can be used with or without the UFM or MFM blocks.

ACKNOWLEDGMENT
The research presented in this paper was supported by an Australian Research Council (ARC) Discovery grant DP170100632.

REFERENCES
[1] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, "MMTM: Multimodal transfer module for CNN fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 289–13 299.
[2] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 1–7.
[3] P. Molchanov, S. Gupta, K. Kim, and K. Pulli, "Multi-sensor system for driver's hand-gesture recognition," in , vol. 1, 2015, pp. 1–8.
[4] G. Benitez-Garcia, J. Olivares-Mercado, G. Sanchez-Perez, and K. Yanai, "IPN Hand: A video dataset and benchmark for real-time continuous hand gesture recognition," arXiv preprint arXiv:2005.02134, 2020.
[5] N. N. Hoang, G.-S. Lee, S.-H. Kim, and H.-J. Yang, "Continuous hand gesture spotting and classification using 3D finger joints information," in , 2019, pp. 539–543.
[6] O. Köpüklü, A. Gunduz, N. Kose, and G. Rigoll, "Real-time hand gesture detection and classification using convolutional neural networks," in , 2019, pp. 1–8.
[7] G. Zhu, L. Zhang, P. Shen, J. Song, S. A. A. Shah, and M. Bennamoun, "Continuous gesture segmentation and recognition using 3DCNN and convolutional LSTM," IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 1011–1021, 2018.
[8] P. Gupta, K. Kautz et al., "Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks," in CVPR, vol. 1, no. 2, 2016, p. 3.
[9] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
[10] Y. A. Farha and J. Gall, "MS-TCN: Multi-stage temporal convolutional network for action segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3575–3584.
[11] Y. Zhang, C. Cao, J. Cheng, and H. Lu, "EgoGesture: A new dataset and benchmark for egocentric hand gesture recognition," IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1038–1050, 2018.
[12] J. Wan, G. Guo, and S. Z. Li, "Explore efficient local features from RGB-D data for one-shot learning gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1626–1639, 2015.
[13] X. Shen, G. Hua, L. Williams, and Y. Wu, "Dynamic hand gesture recognition: An exemplar-based approach from motion divergence fields," Image and Vision Computing, vol. 30, no. 3, pp. 227–235, 2012.
[14] H. Trinh, Q. Fan, P. Gabbur, and S. Pankanti, "Hand tracking by binary quadratic programming and its application to retail activity recognition," in , 2012, pp. 1902–1909.
[15] X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 804–811.
[16] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, "Robust 3D action recognition with random occupancy patterns," in European Conference on Computer Vision, 2012, pp. 872–885.
[17] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3551–3558.
[18] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[19] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, "Predicting the future: A jointly learnt model for action anticipation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5562–5571.
[20] Z. Teng, J. Xing, Q. Wang, B. Zhang, and J. Fan, "Deep spatial and temporal network for robust visual object tracking," IEEE Transactions on Image Processing, vol. 29, pp. 1762–1775, 2019.
[21] Z. Shou, D. Wang, and S.-F. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.
[22] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[23] P. Zhang, J. Xue, C. Lan, W. Zeng, Z. Gao, and N. Zheng, "EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks," IEEE Transactions on Image Processing, vol. 29, pp. 1061–1073, 2019.
[24] Z. Liu, X. Chai, Z. Liu, and X. Chen, "Continuous gesture recognition with hand-oriented spatiotemporal feature," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 3056–3064.
[25] L. Li, S. Qin, Z. Lu, K. Xu, and Z. Hu, "One-shot learning gesture recognition based on joint training of 3D ResNet and memory module," Multimedia Tools and Applications, vol. 79, no. 9, pp. 6727–6757, 2020.
[26] Y. Zhang, L. Shi, Y. Wu, K. Cheng, J. Cheng, and H. Lu, "Gesture recognition based on deep deformable 3D convolutional neural networks," Pattern Recognition, vol. 107, p. 107416, 2020.
[27] B. Su, J. Zhou, X. Ding, and Y. Wu, "Unsupervised hierarchical dynamic parsing and encoding for action recognition," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5784–5799, 2017.
[28] M. Abavisani, H. R. V. Joze, and V. M. Patel, "Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1165–1174.
[29] C. Cao, Y. Zhang, Y. Wu, H. Lu, and J. Cheng, "Egocentric gesture recognition using recurrent 3D convolutional neural networks with spatiotemporal transformer modules," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3763–3771.
[30] H. Gammulle, T. Fernando, S. Denman, S. Sridharan, and C. Fookes, "Coupled generative adversarial network for continuous fine-grained action segmentation," in , 2019, pp. 200–209.
[31] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, "Fine-grained action segmentation using the semi-supervised action GAN," Pattern Recognition, vol. 98, p. 107039, 2020.
[32] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[33] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," ISCA Speech Synthesis Workshop (SSW), 2016.
[34] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[36] J. Wan, Y. Zhao, S. Zhou, I. Guyon, S. Escalera, and S. Z. Li, "ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition," in