Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition
Bowen Shi, Karen Livescu
Toyota Technological Institute at Chicago
{bshi, klivescu}@ttic.edu

ABSTRACT
We address the problem of automatic American Sign Language fingerspelling recognition from video. Prior work has largely relied on frame-level labels, hand-crafted features, or other constraints, and has been hampered by the scarcity of data for this task. We introduce a model for fingerspelling recognition that addresses these issues. The model consists of an auto-encoder-based feature extractor and an attention-based neural encoder-decoder, which are trained jointly. The model receives a sequence of image frames and outputs the fingerspelled word, without relying on any frame-level training labels or hand-crafted features. In addition, the auto-encoder subcomponent makes it possible to leverage unlabeled data to improve the feature learning. The model achieves 11.6% and 4.4% absolute letter accuracy improvements, respectively, in signer-independent and signer-adapted fingerspelling recognition over previous approaches that required frame-level training labels.
Index Terms: American Sign Language, fingerspelling recognition, end-to-end neural network, auto-encoder
1. INTRODUCTION
Automatic recognition of sign language from video couldenable a variety of services, such as search and retrievalfor Deaf social and news media (e.g., deafvideo.tv,aslized.org ). Sign language recognition involves anumber of challenges. For example, sign languages eachhave their own grammatical structure with no built-in writ-ten form; “transcription” of sign language with a writtenlanguage is therefore a translation task. In addition, signlanguages often involve the simultaneous use of handshape,arm movement, and facial expressions, whose related com-puter vision problems of articulated pose estimation and handtracking still remain largely unsolved. Rather than treatingthe problem as a computer vision task, many researchers havetherefore chosen to address it as a linguistic task, with speechrecognition-like approaches.In this paper, we focus on recognition of fingerspelling,a part of ASL in which words are spelled out letter by letter(using the English alphabet) and each letter is represented by a distinct handshape. Fingerspelling accounts for 12 - 35%of ASL [1] and is mainly used for lexical items that do nothave their own ASL signs. Fingerspelled words are typicallynames, technical words, or words borrowed from another lan-guage, which makes its lexicon huge. Recognizing finger-spelling has great practical importance because fingerspelledwords are often some of the most important context words.One problem in fingerspelling recognition is that rela-tively little curated labeled data exists, and even less datalabeled at the frame level. Recent work has obtained encour-aging results using models based on neural network classifierstrained with frame-level labels [2]. One goal of our work isto eliminate the need for frame labels. In addition, most priorwork has used hand-engineered image features, which are notoptimized for the task. A second goal is to develop end-to-end models that learn the image representation. Finally, whilelabeled fingerspelling data is scarce, unlabeled fingerspellingor other hand gesture data is more plentiful. Our final goal isto study whether such unlabeled data can be used to improverecognition performance.We propose a model that jointly learns image frame fea-tures and sequence prediction with no frame-level labels. Themodel is composed of a feature learner and an attention-basedneural encoder-decoder. The feature learner is based on anauto-encoder, enabling us to use unlabeled data of hand im-ages (from both sign language video and other types of ges-ture video) in addition to transcribed data. We compare ourapproach experimentally to prior work and study the effectof model differences and of training with external unlabeleddata. Compared to the best prior results on this task, we obtain11.6% and 4.4% improvement respectively in signer-adaptedand signer-independent letter error rates.
2. RELATED WORK
Automatic sign language recognition can be approached similarly to speech recognition, with signs being treated analogously to words or phones. Most previous work has used approaches based on hidden Markov models (HMMs) [3, 4, 5, 6, 7]. This work has been supported by the collection of several sign language video corpora, such as RWTH-PHOENIX-Weather [8, 9], containing 6,861 German Sign Language sentences, and the American Sign Language Lexicon Video Dataset (ASLLVD [10, 11, 12]), containing video recordings of almost 3000 isolated signs.

Despite the importance of fingerspelling in spontaneous sign language, there has been relatively little work explicitly addressing fingerspelling recognition, and most of it has focused on restricted settings. One typical restriction is the size of the lexicon. When the lexicon is fixed to a small size (20-100 words), excellent recognition accuracy has been achieved [13, 14, 15], but this restriction is impractical. For ASL fingerspelling, the largest available open-vocabulary dataset to our knowledge is the TTIC/UChicago Fingerspelling Video Dataset (Chicago-FSVid), containing 2400 open-domain word instances produced by 4 signers [2], which we use here. Another important restriction is the signer identity. In the signer-dependent setting, letter error rates below 10% can be achieved for unconstrained (lexicon-free) recognition on the Chicago-FSVid dataset [16, 17, 2]; but the error rate goes above 50% in the signer-independent setting and is around 28% after word-level (sequence-level) adaptation [2]. Large accuracy gaps between signer-dependent and signer-independent recognition have also been observed for general sign language recognition beyond fingerspelling [7].

The best-performing prior approaches for open-vocabulary fingerspelling recognition have been based on HMMs or segmental conditional random fields (SCRFs) using deep neural network (DNN) frame classifiers to define features [2]. This prior work has largely relied on frame-level labels for training data, but these are hard to obtain. In addition, because of the scarcity of data, prior work has largely relied on human-engineered image features, such as histograms of oriented gradients (HOG) [18], as the initial image representation. Our goal here is to move away from some of the restrictions imposed in prior work. To our knowledge, this paper represents the first use of end-to-end neural models for fingerspelling recognition without any hand-crafted features or frame labels, as well as the first use of external unlabeled video data to address the lack of labeled data.
3. METHODS
Fingerspelling recognition from raw image frames, like many sequence prediction problems, can be treated conceptually as the following task: $(x_1, x_2, \ldots, x_S) \rightarrow (z_1, z_2, \ldots, z_S) \rightarrow (y_1, y_2, \ldots, y_T)$, where $\{x_i\}$ and $\{z_i\}$ ($1 \le i \le S$) are raw image frames and image features, respectively, and $\{y_j\}$ ($1 \le j \le T$) are predicted letters. Our model is composed of two main parts (which can be trained separately or jointly): a feature extractor trained as an auto-encoder (AE) and an attention-based encoder-decoder for sequence prediction (see Figure 1). The attention-based model maps from $(z_1, z_2, \ldots, z_S)$ to $(y_1, y_2, \ldots, y_T)$ and is similar to recent sequence-to-sequence models for speech recognition [19] and machine translation [20].

Fig. 1. Structure of the proposed model (blue region: auto-encoder, $\oplus$: concatenation). The decoder component of the auto-encoder (the blue box on the right) is used only at training time.

For the feature extractor, we consider three types of auto-encoders:

Vanilla Auto-Encoder (AE) [21]: A feedforward neural network consisting of an encoder that maps the input (image) $x \in \mathbb{R}^{d_x}$ to a latent variable $z \in \mathbb{R}^{d_z}$, where $d_z < d_x$, and a decoder that maps $z \in \mathbb{R}^{d_z}$ to an output $\tilde{x} \in \mathbb{R}^{d_x}$. The objective is to minimize the reconstruction error $L(x) = \|x - \tilde{x}\|^2$ while keeping $d_z$ small. In our models we use multi-layer perceptrons (MLPs) for both encoder and decoder.

Denoising Auto-Encoder (DAE) [22]: An extension of the vanilla auto-encoder where the input $x$ at training time is a corrupted version of the original input $x'$. The training loss of the DAE is $L(x; x') = \|x' - \tilde{x}\|^2$.

Variational Auto-Encoder (VAE) [23, 24]: Unlike the vanilla and denoising auto-encoders, a variational auto-encoder models the joint distribution of the input $x$ and latent variable $z$: $p_\theta(x, z) = p_\theta(x|z)\,p_\theta(z)$. VAEs are trained by optimizing a variational lower bound on the likelihood $p(x)$:

$L(x) = -D_{KL}[q_\phi(z|x)\,\|\,p_\theta(z)] + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$   (1)

The two terms are the (negated) KL divergence between $q_\phi(z|x)$ and $p_\theta(z)$ and a reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. The prior $p_\theta(z)$ is typically assumed to be a centered isotropic multivariate Gaussian distribution $\mathcal{N}(\mathbf{0}, I)$, and the posterior $q_\phi(z|x)$ and conditional distribution $p_\theta(x|z)$ are assumed to be multivariate Gaussians with diagonal covariance, $\mathcal{N}(\boldsymbol{\mu}_z, \boldsymbol{\sigma}_z^2 I)$ and $\mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 I)$. Under these assumptions, the KL divergence can be computed as

$D_{KL}[q_\phi(z|x)\,\|\,p_\theta(z)] = -\frac{1}{2} \sum_{d=1}^{D} \left(1 + \log(\sigma_d^2) - \mu_d^2 - \sigma_d^2\right)$   (2)

where $\boldsymbol{\mu}_z = (\mu_1, \ldots, \mu_D)$ and $\boldsymbol{\sigma}_z = (\sigma_1, \ldots, \sigma_D)$ are approximated as the outputs of an MLP taking $x$ as input. Similarly to the AE and DAE, we use an MLP to model $\boldsymbol{\mu}_x$ and $\boldsymbol{\sigma}_x$. The loss of the VAE, the negative of the bound in (1), can thus be rewritten as

$L(x) = -\frac{1}{2} \sum_{d=1}^{D} \left(1 + \log(\sigma_d^2) - \mu_d^2 - \sigma_d^2\right) - \frac{1}{L} \sum_{l=1}^{L} \log \mathcal{N}\big(x;\, \boldsymbol{\mu}_x^{(l)}, \boldsymbol{\sigma}_x^{(l)2} I\big)$   (3)

where $L$ is the number of samples used to approximate the expectation in (1) (in practice we set $L = 1$ as in prior work [23]). $\boldsymbol{\mu}_z$ is the feature vector $z$, and $\boldsymbol{\mu}_x$ serves the role of the reconstructed input $\tilde{x}$ in Figure 1.
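To make the VAE loss concrete, the following is a minimal PyTorch-style sketch of the feature extractor. The hidden width (800) and latent size (100) match the model details given in Section 4, but the unit output variance of $p_\theta(x|z)$ (which reduces the reconstruction term to a squared error up to constants) is a simplifying assumption of this sketch, not a statement of the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEFeatures(nn.Module):
    """Variational auto-encoder used as a per-frame feature extractor."""
    def __init__(self, d_x, d_z=100, d_h=800):
        super().__init__()
        # Encoder MLP producing the mean and log-variance of q(z|x).
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_h), nn.ReLU())
        self.mu_z = nn.Linear(d_h, d_z)
        self.logvar_z = nn.Linear(d_h, d_z)
        # Decoder MLP producing the mean of p(x|z); unit output variance
        # is assumed here to keep the sketch simple.
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_x))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_z(h), self.logvar_z(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_rec = self.dec(z)
        # Closed-form KL divergence of Equation (2).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        # Reconstruction term: squared error stands in for -log N(x; ...)
        # up to additive constants under the unit-variance assumption.
        rec = F.mse_loss(x_rec, x, reduction='none').sum(dim=-1)
        loss = (kl + rec).mean()       # negative ELBO, Equation (3)
        return mu, loss                # mu serves as the frame feature z
```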
RNN encoder-decoder: The latent variable sequence output by the auto-encoding module is fed into a long short-term memory (LSTM [25]) recurrent neural network (RNN) for encoding: $(z_1, z_2, \ldots, z_S) \rightarrow (h_1, h_2, \ldots, h_S)$. The LSTM states are fed into an RNN decoder that outputs the final letter sequence $(y_1, y_2, \ldots, y_T)$. Attention [26] weights are applied to $(h_1, h_2, \ldots, h_S)$ during decoding in order to focus on certain chunks of image frames. If the hidden state of the decoder LSTM at time step $t$ is $d_t$, the probability of outputting letter $y_t$, $p(y_t \mid y_{1:t-1}, z_{1:S})$, is given by

$\alpha_{it} = \mathrm{softmax}_i\big(v^\top \tanh(W_h h_i + W_d d_t)\big)$
$d'_t = \sum_{i=1}^{S} \alpha_{it} h_i$
$p(y_t \mid y_{1:t-1}, z_{1:S}) = \mathrm{softmax}(W_o [d_t; d'_t] + b_o)$   (4)

and $d_t$ is given by the standard LSTM update equation [25]. The loss for the complete model is a multitask loss:

$L(x_{1:S}, y_{1:T}) = -\frac{1}{T} \sum_{j=1}^{T} \log p(y_j \mid y_{1:j-1}, z_{1:S}) + \frac{\lambda_{ae}}{S} \sum_{i=1}^{S} L_{ae}(x_i)$   (5)

where $L_{ae}(\cdot)$ is one of the losses of the AE, DAE, or VAE, and $\lambda_{ae}$ sets the relative weight of the feature extraction loss vs. the prediction loss.
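For readers who prefer code, here is a minimal sketch of one decoding step of Equation (4). The weight names follow the equation; the tensor shapes noted in comments are illustrative assumptions, not a prescription of the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_step(h, d_t, W_h, W_d, v, W_o, b_o):
    """One decoding step of Equation (4).
    h:   (S, d_enc) encoder states h_1..h_S.
    d_t: (d_dec,)   decoder LSTM hidden state at step t.
    W_h: (d_att, d_enc), W_d: (d_att, d_dec), v: (d_att,),
    W_o: (n_letters, d_dec + d_enc), b_o: (n_letters,)."""
    # Alignment scores over the S input positions, then alpha_{it}.
    scores = torch.tanh(h @ W_h.T + d_t @ W_d.T) @ v     # (S,)
    alpha = F.softmax(scores, dim=0)
    # Context vector d'_t: attention-weighted sum of encoder states.
    d_prime = alpha @ h                                  # (d_enc,)
    # Output distribution over letters from the concatenation [d_t; d'_t].
    logits = torch.cat([d_t, d_prime]) @ W_o.T + b_o
    return F.log_softmax(logits, dim=0), alpha

# Multitask objective of Equation (5), given per-step log-probs and the
# per-frame auto-encoder losses:  loss = nll / T + lam_ae * ae_loss / S
```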
4. EXPERIMENTS

Data and experimental setup:
We use the TTIC/UChicago ASL Fingerspelling Video Dataset (Chicago-FSVid), which includes 4 native signers each fingerspelling 600 word instances, consisting of 2 repetitions of a 300-word list containing common English words, foreign words, and names. We follow the same preprocessing steps as in [2], consisting of hand detection and segmentation, producing 347,962 frames of hand regions. In addition, we also collect extra unlabeled handshape data consisting of 65,774 ASL fingerspelling frames from the data of [27] and 63,175 hand gesture frames from [28]. We chose these external data sets because they provide hand bounding boxes; obtaining additional data from video data sets without bounding boxes is possible (and is the subject of future work), but would require hand tracking or detection. Despite the smaller amount of external data, and although it is noisier than the Chicago-FSVid dataset (it includes diverse backgrounds), it provides examples of many additional individuals' hands, which is helpful for signer-independent recognition. All image frames are scaled to a fixed resolution before being fed into the network.

Our experiments are done in three settings: signer-dependent (SD), signer-independent (SI), and signer-adapted (SA). We use the same setup as in [16, 17, 2], reviewed here for completeness. For the SD case, models are trained and tested on a single signer's data. The data for each signer is divided into 10 subsets for k-fold experiments. 80%, 10%, and 10% of the data are respectively used as train, validation, and test sets in each fold. 8 of the 10 possible folds are used (reserving 20% of the data for adaptation), and the reported result is the average letter error rate (LER) over the test sets in those 8 folds. For the SI case, we train on three signers' data and test on the fourth. For the SA case, the model is warm-started from a signer-independent model and fine-tuned with 20% of the target signer's data.
10% of the target signer's data is used for hyperparameter tuning, and test results are reported on the rest. Previous work has considered two types of adaptation, using either frame-level labels (alignments) for the adaptation data or only word-level labels; here we only consider word-level adaptation.
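Since the letter error rate (LER) is the evaluation metric throughout, here is a minimal sketch of how it can be computed: the Levenshtein edit distance between hypothesis and reference letter sequences, normalized by reference length. That this is exactly the normalization used in [2] is an assumption of the sketch.

```python
def letter_error_rate(hyp: str, ref: str) -> float:
    """Levenshtein distance between hypothesis and reference letter
    sequences, normalized by the reference length."""
    S, T = len(ref), len(hyp)
    d = [[0] * (T + 1) for _ in range(S + 1)]
    for i in range(S + 1):          # deleting all reference letters
        d[i][0] = i
    for j in range(T + 1):          # inserting all hypothesis letters
        d[0][j] = j
    for i in range(1, S + 1):
        for j in range(1, T + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[S][T] / max(S, 1)

# e.g. letter_error_rate("FIRSWIUO", "FIREWIRE") -> 0.375 (3 edits / 8 letters)
```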
Model details:
The auto-encoder consists of a 2-layer MLP encoder and a 2-layer MLP decoder with 800 ReLUs in each hidden layer, with the dimensionality of the latent variable $z$ fixed at 100. Weights are initialized with Xavier initialization [29]. Dropout is added between layers at a rate of 0.8 (throughout, the dropout rate refers to the probability of retaining a unit). For the sequence encoder and decoder, we use a one-layer LSTM RNN with hidden dimensionality of 128 and letter embedding dimensionality of 128. We use the Adam optimizer [30] with initial learning rate 0.001, which is decayed by a factor of 0.9 when the held-out accuracy stops increasing. Beam search is used for decoding; the effect of beam width will be discussed later. The default value for $\lambda_{ae}$ in the multitask loss function (Equation 5) is 1, but it can be tuned. The model is trained first with the unlabeled data, using only the auto-encoder loss, and then with the labeled data using the multitask loss. We also experimented with iteratively feeding unlabeled and labeled data, but this produced worse performance. Note that the recognition models do not use knowledge of the word list. For adaptation, we warm-start from the SI model; of the multiple approaches compared in previous work on signer adaptation [17, 2], this was the most successful one.

Fig. 2. Attention visualization for the example word "LIBYA". Colors correspond to the attention weights $\alpha_{it}$ in Equation 4, where $i$ and $t$ are the column and row index, respectively. Lighter color corresponds to higher value. At the top are subsampled image frames for this word; frames with a plus (+) are the ones with the highest attention weights, which are also the most canonical handshapes in this example. (Alignments between image frames and attention weights are imperfect due to frame subsampling effects.)

Table 1. Letter error rates (%) of different models. SD: signer-dependent, SI: signer-independent, SA: signer-adapted. Model names with an asterisk (*) and a plus (+) use extra unlabeled hand image data and augmented data, respectively. Best prior results are obtained with SCRFs (a = 2-pass SCRF, b = rescoring SCRF, c = first-pass SCRF).

We compare the performance of our approach with the best prior published results on this dataset, obtained with various types of SCRFs and detailed in [2]. These prior approaches are trained with frame-level labels. In addition to the results in [2], we consider the following extra baselines.
Baseline 1 (HOG + enc-dec): We use a classic hand-engineered image descriptor, the histogram of oriented gradients (HOG [18]), and directly feed it into the attention encoder-decoder. We use the same HOG feature vector as in [2]. This baseline allows us to compare engineered features with features learned by a neural network.
Baseline 2 (CNN + enc-dec, DNN + enc-dec): A CNN or DNN frame classifier is trained using frame letter labels, and its output (pre-softmax layer) is used as the feature input $z$ to the attention encoder-decoder. The classifier network is not updated during encoder-decoder training. This baseline tests whether frame-level label information is beneficial for the neural encoder-decoder. The input for both the CNN and the DNN is the image pixels concatenated over a 21-frame window (10 before and 10 following the current frame). The DNNs have three hidden layers of sizes 2000, 2000, and 512 (a sketch of this variant is given below, after the baseline list). Dropout is added between layers at a rate of 0.6. The CNNs are composed of (in order) 2 convolutional layers, 1 max-pooling layer, 2 convolutional layers, 1 max-pooling layer, 3 fully connected layers, and 1 softmax layer. The stride in all convolutional layers is 1, and max-pooling is done with stride 2. The fully connected layers are of sizes 2000, 2000, and 512. Dropout at a rate of 0.75 and 0.5 is used for the convolutional and fully connected layers, respectively. The fully connected layers in both the CNN and the DNN have rectified linear unit (ReLU) [31] activation functions. Training is done via stochastic gradient descent with initial learning rate 0.01, which is decayed by a factor of 0.8 when the validation accuracy decreases after the first several epochs. The network structural parameters (number and type of layers, number of units, etc.) are tuned according to the validation error, and the above architectures are the best ones in our tuning.

Baseline 3 (E2E CNN/DNN + enc-dec): End-to-end version of CNN/DNN + enc-dec. In this baseline, the CNN/DNN parameters are learned jointly with the encoder-decoder and no frame labels are used.
Baseline 4 (AE/DAE/VAE + enc-dec): Separate training of the auto-encoder and encoder-decoder modules, each with its own loss. Baselines 3 and 4 are used to study the effectiveness of end-to-end training.

Fig. 3. Visualization via a 2-D t-SNE embedding [32] of image frame features extracted by the end-to-end VAE model and the CNN classifier for the example word "KERUL" in the signer-dependent (SD) and signer-independent (SI) settings.
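As promised above, here is a sketch of Baseline 2's DNN variant using the stated layer sizes. Because the paper's dropout rate is a retention probability, PyTorch's drop probability is 1 - 0.6 = 0.4; the input dimensionality depends on the image resolution, which we leave as a parameter.

```python
import torch.nn as nn

def dnn_frame_classifier(d_in: int, n_letters: int) -> nn.Sequential:
    """Baseline 2's DNN frame classifier (hidden sizes 2000, 2000, 512).
    d_in is the flattened 21-frame pixel window; the final pre-softmax
    layer's activations are the feature input z to the encoder-decoder."""
    return nn.Sequential(
        nn.Linear(d_in, 2000), nn.ReLU(), nn.Dropout(p=0.4),
        nn.Linear(2000, 2000), nn.ReLU(), nn.Dropout(p=0.4),
        nn.Linear(2000, 512), nn.ReLU(), nn.Dropout(p=0.4),
        nn.Linear(512, n_letters),  # softmax applied in the loss
    )
```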
The overall results are shown in Table 1. Our main findings are as follows:
Best-performing model:
The proposed end-to-end model, when using a VAE and the external unlabeled data (line 16), achieves the best results in the signer-independent (SI) and signer-adapted (SA) cases, improving over the previous best published results by 11.6% and 4.4% absolute, respectively. In all of our end-to-end models (lines 11-16), the VAE outperforms the AE and DAE. In the signer-dependent case, our best model is 0.5% behind the best published SCRF result, presumably because our model is more data-hungry and the SD condition has the least training data.
Encoder-decoders vs. prior approaches:
More generally, models based on RNN encoder-decoders (lines 2-16) often outperform prior approaches (line 1) in the SI and SA settings but do somewhat worse in the signer-dependent case. We visualize the attention weights in Figure 2. The frame corresponding to the canonical handshape often has the highest attention weight. The alignment between the decoder output and image frames is generally monotonic, though we do not use any location-based priors.
The effect of end-to-end training:
We measure the effect of end-to-end training vs. using frame labels by comparing the separately trained CNN/DNN + enc-dec (lines 3-4) with their end-to-end counterparts (lines 6-7), as well as separately trained AEs (lines 8-10) vs. their E2E counterparts (lines 11-13). We find that separate training of a frame classifier can improve the error rate in the signer-dependent setting, but in the other two settings, end-to-end models trained without frame labels consistently outperform their separately trained counterparts. Features learned by a frame classifier seem not to generalize well across signers. The non-end-to-end AE-based models do much worse than their E2E counterparts, presumably because the feature extractor does not get any supervisory signal. We visually compare the features of each image frame trained through an end-to-end model vs. a frame classifier via t-SNE [32] embeddings (Figure 3). We find that both feature types show good separation in the SD setting, but in the SI setting the end-to-end VAE encoder-decoder has much clearer clusters corresponding to letters.

Fig. 4. Comparison of different models with and without extra data (*: with external data, +: with augmented data).
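An embedding like the one in Figure 3 can be produced with scikit-learn's t-SNE; in this sketch, the feature file name and the perplexity value are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

# feats: (num_frames, d_z) array of per-frame features from either the
# end-to-end VAE encoder or the CNN classifier (hypothetical file name).
feats = np.load("frame_features.npy")
emb2d = TSNE(n_components=2, perplexity=30.0).fit_transform(feats)
# emb2d can then be scatter-plotted, colored by each frame's letter label.
```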
Does external unlabeled data help?
The extra data gives a consistent improvement for all three auto-encoding models in all settings (lines 11-13 vs. 14-16, and Figure 4). The average accuracy improvements for the three settings are respectively 2.1%, 0.6%, and 0.7%. The SI and SA improvements are smallest for the best (VAE-based) model, but the overall consistent trend suggests that we may be able to further improve results with even more external data. The improvement is largest in the SD setting, perhaps due to the relatively larger amount of extra data compared to the labeled training data.
Would data augmentation have the same effect as external data?
We compare the extra-data scheme to classic data augmentation techniques [33], which involve adding replicates of the original training data with geometric transformations applied. We perform the following transformations: scaling by a ratio of 0.8, translation in a random direction by 10 pixels, and rotation of the original image by a random angle of up to 30 degrees, both clockwise and counterclockwise. We generate augmented data of roughly the same size as the external data (960 words and 168,950 frames) and then train the CNN/DNN + enc-dec model (with frame labels). The results (Figure 4, and Table 1 line 5 vs. 3) show that data augmentation hurts performance in the SD and SA settings and achieves a 0.3% improvement in the SI setting. We hypothesize that the extra unlabeled hand data provides a richer set of examples than do the geometric transformations of the augmented data.
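A minimal Pillow-based sketch of these transformations follows. Generating one replica per transformation is our reading of the text, not a confirmed detail of the paper's augmentation pipeline.

```python
import math
import random
from PIL import Image

def make_replicas(img: Image.Image) -> list:
    """Geometric augmentations described above: scaling by 0.8, a
    10-pixel translation in a random direction, and rotation by a
    random angle of up to 30 degrees in either direction."""
    w, h = img.size
    # Scale by a ratio of 0.8.
    scaled = img.resize((int(0.8 * w), int(0.8 * h)))
    # Translate 10 pixels in a uniformly random direction.
    theta = random.uniform(0.0, 2.0 * math.pi)
    dx, dy = int(10 * math.cos(theta)), int(10 * math.sin(theta))
    shifted = img.rotate(0, translate=(dx, dy))
    # Rotate by a random angle in [-30, 30] degrees.
    rotated = img.rotate(random.uniform(-30.0, 30.0))
    return [scaled, shifted, rotated]
```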
Effect of beam width:
We analyze the influence of beam width on error rates, shown in Figure 6. Beam search is important in the SD setting. In this setting, the main errors are substitutions among similar letter handshapes (like e and o), as seen from the confusion matrix in Figure 5. Using a wider beam can help catch such near-miss errors. However, in the SI and SA settings, there are much more extreme differences between the predicted and ground-truth words, evidenced by the large number of deletion errors in Figure 5. It is therefore hard to increase accuracy through beam search. Some examples of predicted words are listed in Table 2; a minimal beam-search sketch follows the table.

Fig. 5. Letter confusion matrix under the three settings (from left to right: signer-dependent (SD), signer-independent (SI), and signer-adapted (SA)). The color in each cell corresponds to the empirical probability of predicting a hypothesized letter (horizontal axis) given a certain ground-truth letter (vertical axis). The diagonal in each matrix has been removed for visual clarity.

Fig. 6. Letter error rate (%) with different beam widths in the signer-dependent (SD), signer-independent (SI), and signer-adapted (SA) settings.
        B=1        B=3        B=5        Ground truth
SD      FIRSWIUO   FIREWIUE   FIREWIRE   FIREWIRE
        NOTEBEEK   NOTEBOOK   NOTEBOOK   NOTEBOOK
SI      AAQANNIS   AOQAMIT    AOQUNIR    TANZANIA
        POPLDCE    POPULCE    POPULOE    SPRUCE

Table 2. Example outputs with different beam sizes (B) in the signer-dependent and signer-independent settings.
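The sketch below shows a generic, library-free beam search of the kind used for decoding. The step_fn interface is hypothetical; it stands in for one application of the attention decoder's softmax in Equation (4), and the start/end symbols are illustrative.

```python
def beam_search(step_fn, start_state, beam_width=5, max_len=20, eos="</s>"):
    """Generic beam search over a letter decoder. `step_fn(state, letter)`
    is assumed to return (next_state, {letter: log_prob}) for one step."""
    beams = [([], start_state, 0.0)]     # (letters, state, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for letters, state, score in beams:
            prev = letters[-1] if letters else "<s>"
            next_state, logprobs = step_fn(state, prev)
            for letter, lp in logprobs.items():
                candidates.append((letters + [letter], next_state, score + lp))
        # Keep the top-scoring hypotheses; set completed ones aside.
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[0][-1] == eos else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)               # fall back to unfinished beams
    return max(finished, key=lambda c: c[2])[0]
```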
5. CONCLUSION
We have introduced an end-to-end model for ASL fingerspelling recognition that jointly learns an auto-encoder-based feature extractor and an RNN encoder-decoder for sequence prediction. The auto-encoder module enables us to use unlabeled data to augment feature learning. We find that these end-to-end models consistently improve accuracy in the signer-independent and signer-adapted settings, and the use of external unlabeled data further slightly improves the results. Although our model does not improve over the best previous (SCRF-based) approach in the signer-dependent case, this prior work required frame labels for training while our approach does not. Future work includes collecting data "in the wild" (online) and harvesting even more unlabeled data.
Acknowledgements
We are grateful to Greg Shakhnarovich and Hao Tang for helpful suggestions and discussions. This research was funded by NSF grant 1433485.

6. REFERENCES

[1] C. Padden and D. C. Gunsauls, "How the alphabet came to be used in a sign language," Sign Language Studies, vol. 4, no. 1, pp. 10–33, 2003.
[2] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, D. Brentari, and K. Livescu, "Lexicon-free fingerspelling recognition from video: data, models, and signer adaptation," Computer Speech and Language, pp. 209–232, November 2017.
[3] T. Starner, J. Weaver, and A. Pentland, "Real-time American Sign Language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1998.
[4] C. Vogler and D. Metaxas, "Parallel hidden Markov models for American Sign Language recognition," in ICCV, 1999.
[5] K. Grobel and M. Assan, "Isolated sign language recognition using hidden Markov models," in International Conference on System Man and Cybernetics, 1997.
[6] P. Dreuw, D. Rybach, T. Deselaers, M. Zahedi, and H. Ney, "Speech recognition techniques for a sign language recognition system," in Interspeech, 2007.
[7] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Computer Vision and Image Understanding, vol. 141, pp. 108–125, 2015.
[8] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, "RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus," Language Resources and Evaluation, pp. 3785–3789, 2012.
[9] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, "Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather," Computer Vision and Image Understanding, vol. 141, pp. 108–125, 2015.
[10] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, A. Thangali, H. Wang, and Q. Yuan, "Large lexicon project: American Sign Language video corpus and sign language indexing/retrieval algorithms," in Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, 2010.
[11] C. Neidle and C. Vogler, "A new web interface to facilitate access to corpora: Development of the ASLLRP data access interface (DAI)," in LREC Workshop on the Representation and Processing of Sign Language: Interactions between Corpus and Lexicon.
[13] ICIP, 2006.
[14] S. Liwicki and M. Everingham, "Automatic recognition of fingerspelled words in British Sign Language," 2009.
[15] S. Ricco and C. Tomasi, "Fingerspelling recognition through classification of letter-to-letter transitions," in ACCV, 2009.
[16] T. Kim, G. Shakhnarovich, and K. Livescu, "Fingerspelling recognition with semi-Markov conditional random fields," in ICCV, 2013.
[17] T. Kim, W. Wang, H. Tang, and K. Livescu, "Signer-independent fingerspelling recognition with deep neural network adaptation," in ICASSP, 2016.
[18] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[19] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[20] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[21] P. Baldi, "Autoencoders, unsupervised learning and deep architectures," in International Conference on Unsupervised and Transfer Learning Workshop, 2011.
[22] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, Dec. 2010.
[23] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.
[24] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in ICML, 2014.
[25] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Nov. 1997.
[26] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, "Grammar as a foreign language," in NIPS 28, pp. 2773–2781, 2015.
[27] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL fingerspelling recognition," in Proceedings of the 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision (jointly with ICCV), 2011.
[28] T.-K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8), pp. 1415–1428, 2009.
[29] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, 2010.
[30] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[31] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
[32] L. van der Maaten and G. E. Hinton, "Visualizing high-dimensional data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[33] A. G. Howard, "Some improvements on deep convolutional neural network based image classification."