
A Study on Fusion Framework for Air-writing Recognition Based on Spatial and Temporal Hand Trajectory Modeling


Abstract


Air-writing refers to writing alphabetic or numeric gestures by hand or finger movement in free space. It has attracted attention since it offers an alternative means of communication. Although air-writing recognition has been studied for more than three decades, building a robust system is still challenging. The primary objective of this work is to study a fusion framework for air-writing recognition using a vision-based sensor. We employ a fusion scheme that models air-writing with temporal features augmented by spatial features.

We address air-writing recognition in two categories: motion character recognition and motion word recognition. The underlying assumption for motion characters is that each gesture is correctly spotted, so no segmentation process is required. In contrast, motion word gestures are captured from the user as a continuous motion stream with no marker indicating the writing and non-writing parts; moreover, there are ligatures between the characters in a motion word.

For learning motion characters, we model the air-writing using spatial features augmented with an image-like feature. The proposed structure comprises three main parts: a CNN part, an RNN part, and a fusion part. The CNN part consists of three convolution layers and two subsampling layers; its convolution layers extract information from the image-like feature. For the RNN part, two structures were considered: a Bidirectional Long Short-Term Memory (BLSTM) network and a simplified Bidirectional Recurrent Neural Network (simplified BRNN). To obtain useful information from the temporal features, the BLSTM was deployed in the RNN part. The outputs of the CNN and RNN parts are combined before being fed into the fusion part.

In the first experiment, the performance of the proposed structure was compared with three baseline references: the CNN, the BLSTM, and Yang's work. The results confirm that the fusion scheme outperforms all of the references. In the second experiment, the effect of the recurrent units was examined by varying the number of BLSTM units in the RNN part; the optimum numbers of BLSTM units are 15 and 25 for the numeric and alphabet gestures, respectively. The experimental results also showed that the execution time of the fusion structure is high due to the complexity of the BLSTM unit. In the third experiment, the simplified BRNN was therefore considered; compared with the previous experiment, the execution time of the simplified BRNN unit is reduced by half while the accuracy drops only insignificantly. In the last experiment, we demonstrated that the hand-position feature (RNN part) and the image-like feature (CNN part) are adequate inputs for the fusion network.

For learning motion words, a deep recurrent neural network was studied. In the output layer of the proposed structure, the Connectionist Temporal Classification (CTC) loss was adopted; its main advantage is that no predefined alignment is needed to create the training set. The features that we studied are the hand-position feature and the path-signature feature. In the preprocessing stage, we employ a sliding-window technique to segment a long sequence of motion gestures into small pieces; each piece of motion is then used to generate the hand-position feature and the path-signature feature.
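The following is a minimal PyTorch sketch of the CNN-RNN fusion structure for motion character recognition described above. Only the overall layout follows the thesis (three convolution layers, two subsampling layers, a BLSTM RNN part, and a fusion part that combines the two outputs); the filter counts, the 28x28 image-like input, the hand-position feature dimension, and the class count are assumed placeholders, not values taken from the thesis.

```python
# Sketch of the CNN + BLSTM fusion structure; sizes marked below are assumptions.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, traj_dim=2, rnn_units=15, num_classes=26):
        super().__init__()
        # CNN part: three convolution layers with two subsampling (pooling) layers,
        # applied to the image-like feature rendered from the hand trajectory.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # subsampling layer 1
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # subsampling layer 2
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # RNN part: a BLSTM over the temporal hand-position feature.
        self.rnn = nn.LSTM(traj_dim, rnn_units, batch_first=True, bidirectional=True)
        # Fusion part: CNN and RNN outputs are concatenated and classified.
        cnn_out = 32 * 7 * 7                      # assumes a 28x28 image-like input
        self.fusion = nn.Sequential(
            nn.Linear(cnn_out + 2 * rnn_units, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, image_like, trajectory):
        # image_like: (batch, 1, 28, 28); trajectory: (batch, time, traj_dim)
        spatial = self.cnn(image_like)
        _, (hidden, _) = self.rnn(trajectory)
        temporal = torch.cat([hidden[0], hidden[1]], dim=1)  # forward + backward states
        return self.fusion(torch.cat([spatial, temporal], dim=1))

# Example forward pass with random data.
net = FusionNet()
logits = net(torch.randn(4, 1, 28, 28), torch.randn(4, 60, 2))
print(logits.shape)  # torch.Size([4, 26])
```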
When using the sliding-window technique, the most critical parameter is the window size. Because the output of the fusion structure predicts the characters in a word, the window size should be set so that no more than one character is captured at a time. To examine the performance of the proposed structure, two public datasets were studied: a palm-writing dataset and a finger-writing dataset. Each dataset was analyzed to obtain the writing duration per character, which was used to set the maximum window size. The shortest writing duration per character is 0.88 seconds in the palm-writing dataset and 1.38 seconds in the finger-writing dataset. From the experiments, the appropriate window sizes for the palm-writing and finger-writing datasets are determined as 0.5 seconds and 0.25 seconds, respectively. The best recognition accuracies on the palm-writing and finger-writing datasets are 86.90% and 75.81%, respectively. We also confirmed that the required prediction times per word are 3.91 milliseconds on the palm-writing dataset and 6.37 milliseconds on the finger-writing dataset. These results confirm that the proposed algorithm can be executed in real time.
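Below is a small sketch of the motion-word preprocessing and decoding pipeline described above: a sliding window cuts the continuous hand-position stream into short pieces, and a standard best-path (greedy) CTC decode collapses the per-frame predictions into a word. Only the window sizes (0.5 s and 0.25 s) come from the thesis; the 60 Hz frame rate, the stride, the alphabet, and the random logits are assumptions used to make the example self-contained.

```python
# Sliding-window segmentation plus greedy CTC decoding; parameters other than
# the window sizes are illustrative assumptions, not values from the thesis.
import numpy as np

def sliding_windows(stream, window_sec=0.5, stride_sec=0.1, fps=60):
    """Cut a (time, 2) hand-position stream into overlapping windows."""
    win = int(window_sec * fps)      # e.g. 0.5 s -> 30 frames at 60 Hz (assumed rate)
    hop = int(stride_sec * fps)
    return [stream[i:i + win] for i in range(0, len(stream) - win + 1, hop)]

def ctc_greedy_decode(frame_logits, blank=0, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Collapse repeated labels and drop blanks (standard CTC best-path decoding)."""
    best = frame_logits.argmax(axis=1)
    chars, prev = [], blank
    for k in best:
        if k != blank and k != prev:
            chars.append(alphabet[k - 1])
        prev = k
    return "".join(chars)

# Toy usage: a fake 3-second stream at 60 Hz and random per-window logits.
stream = np.random.randn(180, 2)
windows = sliding_windows(stream, window_sec=0.5)
logits = np.random.randn(len(windows), 27)       # 26 letters + CTC blank
print(len(windows), ctc_greedy_decode(logits))
```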

DOI 10.18910/73479
Language English
