Skeleton-Based Typing Style Learning For Person Identification
Lior Gelberg, David Mendelovic, and Dan Raviv
Abstract—We present a novel architecture for person identification based on typing style, constructed of an adaptive non-local spatio-temporal graph convolutional network. Since typing-style dynamics convey meaningful information that can be useful for person identification, we extract the joints' positions and then learn their movements' dynamics. Our non-local approach increases our model's robustness to noisy input data, while analyzing joints' locations instead of RGB data provides remarkable robustness to alternating environmental conditions, e.g., lighting, noise, etc. We further present two new datasets for the typing-style-based person identification task, together with an extensive evaluation that displays our model's superior discriminative and generalization abilities when compared with state-of-the-art skeleton-based models.
Index Terms—Computer vision, motion recognition, style recognition, person identification, gesture recognition.
1 MOTIVATION
User identification and continuous user identification are some of the most challenging open problems we face today, more than ever in the working-from-home lifestyle brought on by the COVID-19 pandemic. The ability to learn a style instead of a secret passphrase opens up a hatch towards the next level of person identification, as style is constructed from a person's set of motions and their relations. Therefore, analyzing a person's style, rather than relying on their appearance (or some other easily fooled characteristic), can increase the level of security in numerous real-world applications, e.g., VPN, online education, finance, etc. Furthermore, utilizing a person's style can increase robustness to changing environmental conditions, as a person's style is indifferent to most scene properties.

Here we focus on a typical daily task - typing - as a method for identification, and we present a substantial amount of experiments supporting typing style as a strong indicator of a person's identity. Moreover, our suggested approach makes forgery highly complicated, as typing someone's password is insufficient; typing it in a similar style is needed. Therefore, typing style's remarkable discriminative abilities and high endurance to forgery can offer an elegant and natural solution for both person identification and continuous person identification tasks.
2 INTRODUCTION
Biometrics are the physical and behavioral characteristics that make each one of us unique. Therefore, this kind of characteristic is a natural choice for verifying a person's identity. Unlike passwords or keys, biometrics cannot be lost or stolen, and in the absence of physical damage, they offer a reliable way to verify someone's identity. Physiological biometrics involves biological input or the measurement of other unique characteristics of the body. Such methods are fingerprint [1], blood vessel patterns in the retina [2], and face geometry [3], [4]. Unlike physiological characteristics, behavioral characteristics encompass both physiological and psychological state. Human behavior is revealed as motion patterns, whose analysis forms the basis for dynamic biometrics.

• L. Gelberg, D. Mendelovic, and D. Raviv are with the Department of Electrical Engineering, Tel-Aviv University, Tel-Aviv, Israel. E-mail: [email protected]

Motion analysis is drawing increasing attention due to the substantial improvement in performance it provides in a variety of tasks [5], [6], [7], [8], [9]. Motion patterns convey meaningful information relevant to several applications such as surveillance, gesture recognition, action recognition, and many more. These patterns can indicate the type of action within a set of frames, even manifesting a person's mood, intention, or identity.

Deep learning methods are the main contributors to the performance gain in analyzing and understanding motion that we have witnessed during recent years. Specifically, spatio-temporal convolutional neural networks that can learn to detect motion and extract high-level features from these patterns have become common approaches in various tasks. Among them is video action classification (VAC), in which, given a video of a person performing some action, the model needs to predict the type of action in the video. In this work, we take VAC one step further: instead of trying to predict the action occurring in the input video, we eliminate all action classes and introduce a single action - typing. Now, given a set of videos containing hands typing a sentence, we classify the videos according to the person who is typing the sentence.

Over time, researchers in the VAC field presented various approaches, where some use RGB-based 2D or 3D convolutions [5], [10], [11] while others focus on skeleton-based spatio-temporal analysis [12], [13], [14]. The skeleton-based approach proved its efficiency in cases where the videos are taken under uncontrolled scene properties or in the presence of a background that changes frequently. The skeleton data is captured either by using a depth sensor that provides joint (x, y, z) locations or by using a pose estimator such as [15], which extracts the skeleton data from the RGB frames. The joint locations are then forwarded to the model that performs the action classification.
Fig. 1. t-SNE on late features of 7 out of the 60 people appearing in the dataset, where some videos went through data augmentation to simulate changing environmental conditions. Given a video of a person typing a sentence, our model can classify the person according to their unique dynamics, i.e., typing style, with high accuracy, regardless of scene properties (e.g., lighting, noise, etc.). The model generalizes the typing style to other sentences, which it never saw during training, even when it trains on one sentence type alone, while our non-local approach provides remarkable robustness to noisy data resulting from joint-detector failures. Best viewed in color.
Recent works in the field of skeleton-based VAC use spatio-temporal graph convolutional network (GCN) architectures, as graph-based networks are the most suitable for skeleton analysis, since a GCN can learn the dependencies between correlated joints. Since Kipf and Welling introduced the GCN in their work [16], other works such as [17] presented adapted versions of the GCN applied to action classification. These adaptations include spatio-temporal GCNs that perform an analysis of both the space and time domains, as well as adaptive graphs that use a data-driven learnable adjacency matrix. Recently, a two-stream approach [18], [19] that uses both joints and bones data has been gaining attention. Bones data is a differential version of the joints-locations data, since it is constructed from subtractions between linked joints. The bones vector contains each bone's length and direction, so analyzing this data is somewhat similar to how a human analyzes motion. Furthermore, bones can offer new, correlated yet complementary data to the joints' locations. When combining both joints and bones, the model is provided with much more informative input data, enabling it to learn meaningful information that could not be achieved with a one-stream approach alone.

Even though VAC is a task highly correlated with ours, there are some critical differences. The full-body skeleton is a large structure. Its long-ranged joint relations are less distinct than those that appear in a human hand, which has strong dependencies between the different joints due to its biomechanical structure. These dependencies cause each joint's movement to affect other joints as well, even those on other fingers. Thus, when using a GCN containing a fixed adjacency matrix, we limit our model to a set of pre-defined connections and do not allow it to learn the relations between joints that are not directly connected. Furthermore, the hand's long-ranged dependencies, which convey meaningful information, tend to be weaker than the close-ranged ones, and unless these connections are amplified, we lose essential information. Our constructed modules are designed to increase vertices' and edges' inter (non-local) connections, allowing our model to learn non-trivial dependencies and to extract motion patterns from several scales in time, which we refer to as style.

In practice, we use a learnable additive adjacency matrix and a non-local operation that increases the long-range dependencies in the layer's unique graph. The spatial non-local operation enables the GCN unit to permute forward better spatial features, and the temporal non-local operation provides the model with a new order of information by generating the inter-joint relations in time. Now, each joint interacts with all other joints from different times as well. These dependencies in time help the model gain information regarding the hand and finger posture along time and the typing division among the different fingers. We further apply a learnable downsampler unit that learns to sum each channel's information into a single value while causing minimal information loss. As a result, the refined features resulting from the long-ranged dependencies can be reflected as much as possible in the model's final prediction layer. Also, we follow the two-stream approach and apply bones data to a second stream of our model. We train both streams jointly and let the data dictate the relationship between the two streams, i.e., we apply learnable scalars that set each stream's contribution.
The final model is evaluated on two newly presented datasets gathered for the task of typing-style learning for person identification (person-id). Since this work offers a new task, we present comprehensive comparisons with state-of-the-art skeleton-based VAC models to prove our model's superiority. The main contributions of our work are four-fold:

1) We develop a spatio-temporal graph convolutional network
(StyleNet) for the task of typing-style learning, which outperforms all compared models in every experiment performed under controlled environmental conditions.
2) We present substantially better robustness to challenging environmental conditions and noisy input data than all compared state-of-the-art VAC models.
3) We introduce two new datasets for the typing-style-learning-for-person-id task.
4) We introduce an innovative perspective on person-id based on joints' locations while typing a sentence.
3 BACKGROUND
AI methods entering the game allow for higher accuracy in various tasks, moving from axiomatic methods towards data-driven approaches. These models focus on the detection of minor changes that were missed earlier, by examining dramatically more data. The improvement of hardware has allowed us to train deeper networks in a reasonable time and to classify in real time using these complex models. This paper's related works can refer to biometric-based person identification, VAC, gait recognition, and gesture recognition. We consider style learning as a biometric-based identification method, and VAC as the motivation for our suggested task. Hence, we discuss these two as works related to ours.
Numerous person-identification methods using different techniques and inputs were presented over the years. Ratha et al. [20] presented work on fingerprints that uses the delta and core point patterns and ridge-density analysis to classify an individual. [21], [22], [23] studied the use of keystroke dynamics, while others used different biometrics, including face recognition [24], [25], iris scans [26], and gait analysis [27]. Identifying a person by their hands was studied by Fong et al. [28], who suggested a classification method based on geometric measurements of the user's stationary hand gestures of hand sign language. Roth et al. [29] presented an online user verification based on hand geometry and angle through time. Unlike [29], our method does not treat the hand as one segment but as a deformable part model, by analyzing each of the hand joints' relations in space and time. Furthermore, our method is more flexible, since it is not based on handcrafted features and does not require a gallery video to calculate a distance for its decision.
VAC methods have been going through a significant paradigm shift in recent years. This shift involves moving from hand-designed features [30], [31], [32], [33], [34], [35] to deep neural network approaches that learn features and classify them in an end-to-end manner. Simonyan and Zisserman [5] designed a two-stream CNN that utilizes RGB video and optical flow to capture motion information. Carreira and Zisserman [10] proposed to inflate 2D convolution layers pre-trained on ImageNet to 3D, while Diba et al. [11] presented an inflated DenseNet [36] model with temporal transition layers for capturing different temporal depths. Wang et al. [37] proposed non-local neural networks to capture long-range dependencies in videos.

A different approach for VAC is the skeleton-based method, which uses a GCN and joints' locations as input instead of the RGB video. Yan et al. [17] presented a spatio-temporal graph convolutional network that directly models the skeleton data as the graph structure. Shi et al. [18] presented an adaptive-graph two-stream model that uses both joint coordinates and bone vectors for action classification, based on the work of [38], which introduced adaptive graph learning.

Inspired by the works presented above, this work follows skeleton-based methods for the task of person-id based on typing style. Unlike full-body analysis, hand typing-style analysis has higher discriminative requirements, which can be fulfilled by better analysis of the hand's global features, such as the hand's posture and the fingers' intra-relationships as well as inter-relationships in space and time. We claim that all skeleton-based methods presented earlier in this section fail to fully fulfill these discriminative requirements. Therefore, we propose a new architecture that aggregates non-locality with spatio-temporal graph convolution layers. Overall, we explored person-id on seen and unseen sentences under different scenarios.
4 STYLENET

The human hand is made from joints and bones that dictate its movements. Therefore, to analyze the hand's movements, a graph convolutional network (GCN) is the preferred choice of deep neural network architecture. A GCN can implement the essential joint links, sustain the hand's joint hierarchy, and ignore links that do not exist.

Fig. 2. Left to right - adjacency matrix of the 1st, 2nd, and 3rd subset, respectively. Right - the hand as a graph. Each circle denotes a joint, and each blue line is a bone connecting two linked joints, i.e., each joint is a vertex, and bones are links in the graph. The black X marks the center of gravity. The gray blob is the subset $B_i$ of joint $v_i$ and its immediate neighbors. The green joint is $v_i$, the joint in red is the immediate neighbor of $v_i$ that is closer to the center of gravity, and the joint in purple is the immediate neighbor of $v_i$ that is farther from the center of gravity.

Fig. 3. Diagram of our spatial non-local GCN unit. Blue rectangles denote trainable parameters. $\otimes$ denotes matrix multiplication and $\oplus$ denotes element-wise summation. The residual block exists only when the unit's $Ch_{in} \neq Ch_{out}$. This unit is repeated $K_v$ times according to the number of subsets; therefore, $F^S_{out} = \sum_{k=1}^{K_v} f^S_{out_k}$.

Motivated by [17], we first formulate the graph convolutional operation on vertex $v_i$ as

$$f^S_{out}(v_i) = \sum_{v_j \in B_i} \frac{1}{Z_{ij}} f^S_{in}(v_j) \cdot w(l_i(v_j)), \quad (1)$$

where $f^S_{in}$ is the input feature map and the superscript $S$ refers to the spatial domain. $v$ is a vertex in the graph, and $B_i$ is the convolution field of view, which includes all immediate neighbors $v_j$ of the target vertex $v_i$. $w$ is a weighting function that operates according to a mapping function $l_i$. We followed the partition strategy introduced in [16] and construct the mapping function $l_i$ as follows: given a hand center of gravity (shown in Figure 2), for each vertex $v_i$ we define a set $B_i$ that includes all immediate neighbors $v_j$ of $v_i$. $B_i$ is divided into 3 subsets, where $B_i^1$ is the target vertex $v_i$, $B_i^2$ is the subset of vertices in $B_i$ that are closer to the center of gravity, and $B_i^3$ is the subset that contains all vertices in $B_i$ that are farther from the center of gravity. According to this partition strategy, each $v_j \in B_i$ is mapped by $l_i$ to its matching subset. $Z_{ij}$ is the cardinality of the subset $B_i^k$ that contains $v_j$. We follow the method of [16], [39] for graph convolution using polynomial parametrization and define a normalized adjacency matrix of the hand's joints by

$$\tilde{A} = \Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}}, \quad (2)$$

where $I$ is the identity matrix representing self-connections, $A$ is the adjacency matrix representing the connections between joints, and $\Lambda$ is the normalization matrix, with $\Lambda_{ii} = \sum_j A_{ij}$. Therefore, $\tilde{A}$ is the normalized adjacency matrix, whose non-diagonal elements, i.e., $\tilde{A}_{ij}$ with $i \neq j$, indicate whether vertex $v_j$ is connected to vertex $v_i$. Using eq. 1 and eq. 2, we define our spatial non-local graph convolutional operation (Figure 3) as

$$F^S_{out} = \sum_{k=1}^{K_v} W_k f^S_{in} D_k, \quad (3)$$

where $K_v$ is the total number of subsets, equal to 3 in our case, $W_k$ is a set of learned parameters, and $f^S_{in}$ is the input feature map. Inspired by [37], we construct $D_k$ by

$$D_k = \hat{W}^S_k \left( \left( \Theta^S_k(\hat{A}_k) \right)^T \cdot \Phi^S_k(\hat{A}_k) \cdot G^S_k(\hat{A}_k) \right) + \hat{A}_k, \quad (4)$$

where the superscript $S$ denotes the spatial domain, and $\Phi^S_k$, $\Theta^S_k$, and $G^S_k$ are trainable 1D convolutions.
These convolutions operate on the graph and embed their input into a lower-dimensional space, where an affinity between every two features is calculated. $\hat{W}^S_k$ is a trainable 1D convolution used to re-project the features to the higher-dimensional space of $\hat{A}_k$. We use eq. 4 to apply self-attention on the input signal and enhance the meaningful connections between the features of its input $\hat{A}_k$, especially the long-range ones. To construct the input signal $\hat{A}_k$, we adopt a similar approach to [18] and define $\hat{A}_k$ to be

$$\hat{A}_k = \tilde{A}_k + B_k + C_k, \quad (5)$$

where $\tilde{A}_k$ is the normalized adjacency matrix of subset $k$ according to eq. 2. This matrix is used for extracting only the vertices directly connected in a certain subset of the graph. $B_k$ is an adjacency matrix with the same size as $\tilde{A}$, initialized to zeros. Unlike $\tilde{A}_k$, $B_k$ is learnable and optimized along with all other trainable parameters of the model. $B_k$ is dictated by the training data, and therefore it can increase the model's flexibility and make it more suitable for a specific given task. $C_k$ is the sample's unique graph, constructed by the normalized embedded Gaussian that calculates the similarity between all vertex pairs according to

$$C_k = \mathrm{softmax}\left( (W_{\theta k} f^S_{in})^T \, W_{\phi k} f^S_{in} \right), \quad (6)$$

where $W_{\theta k}$ and $W_{\phi k}$ are trainable parameters that embed the input features into a lower-dimensional space, softmax is used for normalizing the similarity operation's output, and the superscript $S$ denotes the spatial domain. $C_k$ is somewhat related to $D_k$ in the way they are both constructed. The main difference is that $C_k$ is generated from the input features alone, while $D_k$ is generated using the input features, the learned adjacency matrix $B_k$, and the normalized adjacency matrix $\tilde{A}_k$. We apply the non-local operation on the addition of $\tilde{A}_k$, $B_k$, and $C_k$ to exploit the information from all three matrices. This information enables the spatial block to permute more meaningful information forward, which contributes to the model's discriminative ability.

To better exploit the time domain, we place a temporal unit after each spatial GCN unit for better processing of longitudinal information. We define $X = Conv(F^S_{out})$, where $Conv$ is a 2D convolution and $F^S_{out}$ is the spatial unit's output. A temporal non-local operation is applied on $X$ according to

$$F^{\tilde{T}}_{out} = W^{\tilde{T}}\left( \left( \Theta^{\tilde{T}}(X)^T \cdot \Phi^{\tilde{T}}(X) \right) \cdot G^{\tilde{T}}(X) \right) + X, \quad (7)$$
where $\tilde{T}$ denotes the temporal domain. Unlike the spatial non-local operation, here $\Phi^{\tilde{T}}$, $\Theta^{\tilde{T}}$, and $G^{\tilde{T}}$ are trainable 2D convolutions, since they process the temporal domain and are not part of the graph. These convolutions are used to embed their input into a lower-dimensional space. Similarly, $W^{\tilde{T}}$ is a trainable 2D convolution used to re-project the features to the higher-dimensional space of $X$. The temporal non-local operation is used for two reasons: first, to better utilize the temporal information regarding the same joint at different points in time; second, to construct the temporal relations between the different joints through the temporal domain.

Fig. 4. Single-stream StyleNet architecture. The input consists of the 21 coordinates of the hand's joints, where for each joint we provide a 2D location and a confidence level of its location per frame. The blue lines represent the joints' spatial connections, while the green lines represent the joints' temporal connections. (N, Ch, T, V), placed under the layers, denotes the batch size, the number of channels, the temporal domain length, and the joint's index (a vertex in the graph), respectively. As for the fully connected layers, N denotes the batch size and C is the dataset's number of classes.

We further apply a downsampling unit before the classification layer. This unit receives the last temporal unit's output and downsamples each channel into a single value instead of using max or mean pooling. It is constructed from [fully-connected, batch-normalization, fully-connected] layers and is shared among all channels. The benefit of using this sampling method is that it enables our model to learn to summarize each channel into a single value while minimizing the loss of essential features.
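The temporal non-local operation of eq. (7) and the channel downsampling unit can be sketched in PyTorch as below. This is a minimal sketch, not the authors' released code: the embedding width, the affinity normalization, the hidden width of the per-channel summarizer, and all module names are our assumptions.

```python
import torch
import torch.nn as nn

class TemporalNonLocal(nn.Module):
    """Sketch of the temporal non-local operation of eq. (7): every joint at
    every frame attends to all joints at all frames of the feature map X,
    which has shape (N, C, T, V). The C // 2 embedding size is an assumption."""
    def __init__(self, channels):
        super().__init__()
        c_e = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, c_e, kernel_size=1)
        self.phi = nn.Conv2d(channels, c_e, kernel_size=1)
        self.g = nn.Conv2d(channels, c_e, kernel_size=1)
        self.w = nn.Conv2d(c_e, channels, kernel_size=1)  # re-projection, as W in eq. (7)

    def forward(self, x):
        n, c, t, v = x.shape
        theta = self.theta(x).reshape(n, -1, t * v)           # (N, Ce, TV)
        phi = self.phi(x).reshape(n, -1, t * v)               # (N, Ce, TV)
        g = self.g(x).reshape(n, -1, t * v)                   # (N, Ce, TV)
        affinity = torch.bmm(theta.transpose(1, 2), phi)       # (N, TV, TV) joint-to-joint relations in time
        affinity = affinity / (t * v)                          # scaling for stability; this choice is an assumption
        y = torch.bmm(g, affinity.transpose(1, 2))             # (N, Ce, TV)
        y = self.w(y.reshape(n, -1, t, v))                     # back to (N, C, T, V)
        return y + x                                           # residual connection, as in eq. (7)

class ChannelDownsampler(nn.Module):
    """Sketch of the downsampling unit: an FC-BN-FC block shared among all
    channels that summarizes each channel's T*V values into a single value."""
    def __init__(self, t, v, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t * v, hidden),
            nn.BatchNorm1d(hidden),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                                      # x: (N, C, T, V)
        n, c, t, v = x.shape
        out = self.net(x.reshape(n * c, t * v))                # same weights applied to every channel
        return out.reshape(n, c)                               # (N, C), forwarded to the classifier
```

In this reading, the learned summarizer replaces global pooling, so channels that carry the refined long-range features are not averaged away before the final fully connected layer.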
Encouraged by the work of Shi et al. [18], we adopt their two-stream approach and introduce
StyleNet. This ensemble model consists of one stream that operates on the joints' locations and another that operates on the bone vectors. The final prediction is constructed according to

$$\mathrm{prediction} = \alpha \cdot \mathrm{Output}_{Joints} + \beta \cdot \mathrm{Output}_{Bones}, \quad (8)$$

where both $\alpha$ and $\beta$ are trainable parameters that decide each stream's weight for the final prediction. This weighting method increases the model's flexibility, since the training data itself determines the weight of each stream. We construct the bones data by subtracting pairs of joints' coordinates that are tied by a connection in the graph. Therefore, the bones data is a differential version of the joints data, i.e., the high frequencies of the joints data. As deep neural networks find it hard to cope with high frequencies, providing a second order of data constructed from these frequencies enables the model to utilize the unique clues hidden in the high frequencies and increase its discriminative ability accordingly.
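As a minimal illustration of eq. (8), the two stream outputs can be combined with learnable scalars; the class and argument names below are our own, and the streams are assumed to be two single-stream StyleNet-like networks returning class logits.

```python
import torch
import torch.nn as nn

class TwoStreamEnsemble(nn.Module):
    """Sketch of eq. (8): a learnable weighted sum of the joints-stream and
    bones-stream class scores, trained jointly with both streams."""
    def __init__(self, joint_stream, bone_stream):
        super().__init__()
        self.joint_stream = joint_stream      # consumes joints (x, y, score) sequences
        self.bone_stream = bone_stream        # consumes bone vectors and bone scores
        # alpha and beta are initialized to 1, following the implementation details below.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, joints, bones):
        return self.alpha * self.joint_stream(joints) + self.beta * self.bone_stream(bones)
```

Under this reading, a single classification loss on the combined prediction lets the data decide how much each stream contributes, which is the stated motivation for the learnable scalars.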
5 IMPLEMENTATION DETAILS

We used the YOLOv3 [40] object detector for localizing the hand in the input frame. For the joint detector, we used the Convolutional Pose Machine [41] (CPM). This model outputs a belief map of each joint's location, where each belief map denotes a specific joint. The joint's location is given by a Gaussian whose σ and peak value are set according to the model's confidence, i.e., a small σ with a large peak value if the model is very confident in the location of the joint, and a large σ with a small peak value otherwise. In that manner, the CPM model can predict a location for a joint even when the joint is entirely or partially occluded in a given frame: it predicts the joint's location according to the hand's context and decreases its belief score in exchange. This kind of method can help with cases of hidden joints, since StyleNet can utilize the joint's score as an indicator of the reliability of the data related to that joint.
Pre-process pipeline:
We implemented our models for the pre-processing using the TensorFlow framework. An input frame was given to the hand localizer, which outputs the bounding-box coordinates of the hand in the given frame. We cropped the hand-centered frame according to the given bounding box and resized the cropped frame with respect to the aspect ratio. The resized frame is given to the joint detector, which produces belief maps in return. The belief maps are resized back to fit the original frame size with respect to the translation of the bounding box produced by the hand localizer. Finally, argmax is applied to each belief map to locate the joints' coordinates. We repeat this process for the entire dataset to produce the joints-locations matrix, which consists of all 21 joints' locations and belief scores per frame.
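The last step of the pipeline, reading a joint's (x, y) location and confidence out of its belief map, can be sketched as follows; the array shapes and the use of the belief map's peak value as the confidence score are our assumptions.

```python
import numpy as np

def joints_from_belief_maps(belief_maps):
    """belief_maps: array of shape (21, H, W), one map per hand joint,
    already resized back to the original frame's coordinate system.

    Returns an array of shape (21, 3) holding (x, y, score) per joint,
    where the score is the peak value of the joint's belief map."""
    joints = np.zeros((belief_maps.shape[0], 3), dtype=np.float32)
    for j, bmap in enumerate(belief_maps):
        y, x = np.unravel_index(np.argmax(bmap), bmap.shape)  # peak location of this joint's map
        joints[j] = (x, y, bmap[y, x])
    return joints
```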
StyleNet: We implemented StyleNet using the PyTorch framework. We defined $A$, the adjacency matrix of the hand's joints, and normalized it according to eq. 2, where $\Lambda^{ii}_k = \sum_j A^{ij}_k + \sigma$ and σ, equal to 0.001, is used to avoid empty rows. For each video, we sample a total of 32 matrices, where each matrix refers to a certain frame and comprises the frame's 21 (x, y) joints' locations and their belief scores. We created the bone data by subtracting the (x, y) coordinates of each neighboring joints pair to extract the bone vectors, while we multiplied the two neighboring joints' belief scores to produce a bone belief score. Our model (Figure 4) follows the AGCN [18] architecture, where each layer is constructed from a spatial GCN unit that processes the joints' or bones' intra-frame relations and a temporal unit that processes the temporal inter-frame relations. The model's 8th GCN unit is modified according to eq. 3 to improve the long-range dependencies of the spatial feature maps before expanding the number of feature-map channels. We also modify the 10th TCN unit according to eq. 7 to improve the long-range dependencies between the different frames. The downsampling unit is applied after the 10th TCN unit for better downsampling of the final feature maps before forwarding them to the classification layer.
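The two data-preparation steps described above can be sketched as follows: normalizing the adjacency matrix as in eq. (2) with the σ = 0.001 stabilizer, and building bone vectors and bone scores from the joint data. The edge list passed in is illustrative; only the computation pattern is taken from the text.

```python
import numpy as np

def normalize_adjacency(A, sigma=1e-3):
    """Eq. (2): A_tilde = Lambda^{-1/2} (A + I) Lambda^{-1/2},
    with Lambda_ii = sum_j A_ij + sigma (sigma avoids empty rows)."""
    lam = A.sum(axis=1) + sigma
    d = np.diag(lam ** -0.5)
    return d @ (A + np.eye(A.shape[0])) @ d

def bones_from_joints(joints, edges):
    """joints: (T, 21, 3) array of per-frame (x, y, score).
    edges: list of (child, parent) joint-index pairs along the hand skeleton
    (the exact pairing is an assumption).

    Returns a (T, len(edges), 3) array where each bone is the coordinate
    difference of its two joints and its score is the product of their scores."""
    bones = np.zeros((joints.shape[0], len(edges), 3), dtype=np.float32)
    for b, (child, parent) in enumerate(edges):
        bones[:, b, :2] = joints[:, child, :2] - joints[:, parent, :2]   # bone vector (length and direction)
        bones[:, b, 2] = joints[:, child, 2] * joints[:, parent, 2]      # bone belief score
    return bones
```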
Pre-process:
We used a YOLOv3 model pre-trained on the COCO dataset [42]. To train the model for our task, we created a single "hand" label and used the Hands dataset [43], which contains ∼ k training and ∼ . k validation images labeled with hands' bounding-box locations. We used the Adam optimizer with an initial learning rate of 1e-3 and ran our training with a batch size of 16 for 150 epochs. We trained the CPM model using trained weights [44] as an initial starting point. We used 1256 random frames from our dataset, labeled with their joints' locations. The training data consists of 1100 frames, and 156 frames were used for validation. Data augmentation was applied during training to prevent overfitting. We used the Adam optimizer with an initial learning rate of 1e-3 and a batch size of 16 for a total of 960 epochs.
StyleNet:
We used a batch size of 32, where each sampled video consists of 32 sampled frames from the entire video. We used the Adam optimizer with an initial learning rate of 1e-3, a momentum of 0.9, and a weight decay of 1e-5. Both stream weights were initialized to 1. A dropout rate of 0.3 was applied to increase the model's generalization ability. We trained the model for 100 epochs and decreased the learning rate by a factor of 10 after 40, 70, and 90 epochs. No data augmentation was needed due to the natural augmentation of the data resulting from the sampling of the video.
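For reference, the optimizer and schedule described above map to roughly the following PyTorch setup; interpreting the stated 0.9 momentum as Adam's first beta is our assumption.

```python
import torch

def build_optimizer_and_scheduler(model):
    # Adam with lr 1e-3 and weight decay 1e-5; the 0.9 "momentum" is read here as beta1.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-5)
    # Learning rate divided by 10 after epochs 40, 70, and 90 (100 epochs total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[40, 70, 90], gamma=0.1)
    return optimizer, scheduler
```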
6 EXPERIMENTS
Since there is no dataset for the suggested task, we created two datasets for the evaluation of our model. We compared our model with skeleton-based action classification models using the new datasets under various test cases, simulating the user identification and continuous user identification tasks. In 6.1 we present our new datasets, and our main experiments' results are presented in 6.2 and 6.3. We further compare our model under challenging scenarios such as noisy input data (6.4) and present our chosen skeleton-based approach's superiority over the RGB modality in 6.5. In 6.6, we provide an additional comparison between the models using 3D input data taken from the
How We Type dataset [47].

In all experiments, we split our data between train, validation, and test sets randomly, according to the experiment's settings, for an accurate evaluation of the models. Each input video consists of 32 sampled frames from the entire video. We tested each trained model tens of times and set its accuracy according to all tests' mean accuracy. It is crucial to evaluate each trained model several times, since we sample only 32 frames and do not use the entire video.
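The 32-frame sampling used for both training and evaluation can be sketched as below. Segment-wise random sampling is our assumption about how the frames are drawn; it is one common strategy and explains why repeated evaluations (and the "natural augmentation" mentioned above) are meaningful.

```python
import numpy as np

def sample_frames(num_frames_in_video, num_samples=32, rng=None):
    """Pick 32 frame indices from a full video: split the video into 32 equal
    temporal segments and draw one frame from each, so every run sees a
    slightly different subset of the video."""
    rng = rng or np.random.default_rng()
    bounds = np.linspace(0, num_frames_in_video, num_samples + 1)
    return np.array([rng.integers(int(lo), max(int(hi), int(lo) + 1))
                     for lo, hi in zip(bounds[:-1], bounds[1:])])
```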
We present two new datasets created for the typing-style-learning-for-person-identification task. The datasets were recorded using a simple RGB camera, at 100 fps for one dataset and 80 fps for the other. No special lighting was used, and the camera's position remained fixed through all videos. No jewelry or any other unique clues appear in the videos. Both men and women, as well as right- and left-handed participants, appear in the datasets. All participants were asked to type the sentences with their dominant hand only.

One dataset consists of 1600 videos of 80 participants. Each participant typed two different sentences, while each sentence was repeated ten times. This setting's main purpose is simulating a scenario where a small number of different sentences, as well as many repetitions of each sentence, are provided. As each person encounters a changing level of concentration, typing mistakes, distractions, and accumulated fatigue, the variety in the typing style of each participant is revealed among a large number of repetitions of each sentence. Therefore, this dataset deals with classification of a person under intra-sentence varying typing style, i.e., changing motion patterns of the same sentence, and inter-person changing levels of typing consistency. Additionally, this dataset can suggest a scenario where a model learns on one sentence and needs to infer on another sentence it never saw during training.

The other dataset consists of 1800 videos of 60 participants. Each participant typed ten different sentences, while each sentence was repeated three times. Unlike the first setting, this setting's purpose is simulating a scenario where a large number of different sentences, as well as a small number of repetitions of each sentence, are provided. The large abundance of different sentences, i.e., different motion patterns, reveals each participant's unique typing style, while the small number of repetitions supports each participant's variance in typing style. Therefore, this dataset deals with classification of a person under inter-sentence varying motion patterns, and in order for the model to generalize well to sentences it never saw during training, it must learn to classify each person by their unique typing style, i.e., learn to classify the different people according to their unique typing style.

We labeled 1167 random frames from our data with their corresponding joints' locations to train a joint detector.

In this experiment, we simulate a test case of continuous user identification by testing our model's ability to infer on
unseen sentences, i.e., different motion patterns. We split our data by sentence type and let the model train on a certain set of sentences, while testing is performed on a different set of sentences that the model never saw during training, i.e., different types of sentences the user typed. Therefore, to perform well, the model must learn the unique motion style of each person.

The experiment is performed on the ten-sentence dataset in the following manner: we split our data in three ways, where in each split a different number of sentences is given for training. We randomly split our data by sentences into train, validation, and test sets according to the split settings. We applied the same division to all other models for a legitimate comparison. For the two-sentence dataset, we randomized the train sentence, and the other sentence was divided between validation and test, where two repetitions were used for validation and eight for test. Results for this experiment on the two datasets appear in Tables 1 and 2, respectively. Our model outperforms all other compared models by an increasing margin as fewer training sentences are provided, which indicates our model's superior generalization ability.

TABLE 1
Test accuracy of user classification on unseen sentences. [α, β, γ] denotes the number of sentences for train, validation, and test, respectively.

Model          [4,2,4] Acc(%)   [3,2,5] Acc(%)   [2,2,6] Acc(%)
HCN [14]       91.98            84.16            79.53
STGCN [17]     97.09            97.21            94.94
3sARGCN [45]   95.8             93.6             91.35
PBGCN [46]     98.9             98.6             96.94
2sAGCN [18]    99.04            98.82            97.97
StyleNet
In this experiment, we simulate a test case of user identification (access control by a password sentence) by testing our model's ability to infer on the same movement patterns, i.e., sentences, it saw during training, given other repetitions of these patterns. We use a large number of sentence repetitions to test the robustness to the variance in typing style, by simulating a scenario where a small number of different motion patterns, i.e., sentence types, is given along with a substantial variance in these patterns resulting from a large number of repetitions.

This experiment is performed by dividing the ten repetitions of each sentence as follows: five for train, one for validation, and four for test. We trained each model on the train set and tested its accuracy on the seen sentences but unseen repetitions.

According to the experiment's results, which appear in Table 3, it is clear that this specific task is not complex and can be addressed by other methods. However, it proves that our model's extra complexity does not harm the performance in the simpler "password sentence" use cases.
The skeleton-based approach is dependent on a reliable joints detector that extracts each joint's location from each input frame. To challenge our model, we experimented with
a scenario similar to 6.2 (the more challenging task, simulating continuous user identification), where during inference the joints detector randomly fails and provides noisy data, i.e., incorrect joint locations.

We performed this experiment by training all models as usual, while during test time we randomly zeroed the (x, y, score) data of a joint. The number of joints that are zeroed is drawn uniformly among [0, 1, 2], while the decision of which joint's values to zero is random but weighted by each joint's tendency to be occluded, e.g., the tip of the thumb's joint has a higher probability of being drawn than any of the ring finger's joints, which tend less to be occluded while typing.

According to the experiment's results in Table 4, our model is much more robust to noisy data. The non-local approach helps the model rely less on a particular joint and provides a more global analysis of each person's typing style, which increases the model's robustness in cases of noisy data.

TABLE 2
Test accuracy of user classification in the unseen-sentence experiment. Each model's training set is constructed from one sentence, while the validation and test sets are constructed from the other sentence, which did not appear in the training.

Model          Acc(%)
HCN [14]       94.18
STGCN [17]     93.59
3sARGCN [45]   91.08
PBGCN [46]     95.98
2sAGCN [18]    96.88
StyleNet

TABLE 3
Test accuracy of user classification in the seen-sentences experiment. The training set includes five repetitions of both sentences, while the validation and test sets include one and four repetitions of both sentences, respectively.

Model          Acc(%)
HCN [14]       99.66
STGCN [17]     99.64
3sARGCN [45]   99.44
PBGCN [46]     99.84
2sAGCN [18]    99.85
StyleNet

TABLE 4
Test accuracy for the noisy data experiment. Training was conducted as usual, but during test time we randomly zeroed joint (x, y, score) data to simulate a situation where the data is noisy or some joint's location is missing. [α, β, γ] denotes the number of sentences given for train, validation, and test, respectively.

Model          [4,2,4] Acc(%)   [3,2,5] Acc(%)   [2,2,6] Acc(%)
HCN [14]       57.87            53.46            45.06
STGCN [17]     70.03            68.3             60.61
3sARGCN [45]   71.36            69.35            67.92
PBGCN [46]     83.96            82.75            80.4
2sAGCN [18]    73.33            71.34            68.83
StyleNet
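The joint-dropout noise model used in this experiment can be sketched as below. The occlusion-tendency weights are placeholders, and zeroing a joint across all frames of a sample is our assumption; only the overall zeroing scheme follows the text.

```python
import numpy as np

def corrupt_joints(sample, occlusion_weights, rng=None):
    """sample: (T, 21, 3) array of (x, y, score) per frame.
    occlusion_weights: length-21 vector of relative occlusion tendencies
    (placeholder values, e.g. higher for the thumb tip than the ring finger)."""
    rng = rng or np.random.default_rng()
    p = np.asarray(occlusion_weights, dtype=np.float64)
    p = p / p.sum()
    num_dropped = rng.integers(0, 3)                        # 0, 1 or 2 joints, drawn uniformly
    dropped = rng.choice(21, size=num_dropped, replace=False, p=p)
    corrupted = sample.copy()
    corrupted[:, dropped, :] = 0.0                          # zero x, y and score of the chosen joints
    return corrupted
```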
TABLE 5
Test accuracy for the uncontrolled-environment experiment. RGB models were trained with data augmentation, while during test time a different set of augmentations was applied. [α, β, γ] denotes the number of sentences for train, validation, and test, respectively; env. denotes environment.

Model      [4,2,4] Controlled env. / Uncontrolled env.   [3,2,5] Controlled env. / Uncontrolled env.   [2,2,6] Controlled env. / Uncontrolled env.
I3D [10]   99.68 / 63.12                                 99.75 / 59.16
TABLE 6
Test accuracy of user classification on unseen sentences on How We Type using 3D input data. [α, β, γ] denotes the number of sentences for train, validation, and test, respectively.

Model          [5,10,35] Acc(%)   [10,10,30] Acc(%)   [15,10,25] Acc(%)   [20,10,20] Acc(%)   [25,10,15] Acc(%)
HCN [14]       92.46              96.27               97.3                98.32               98.82
STGCN [17]     95.72              97.92               98.24               98.79               98.96
3sARGCN [45]   94.7               97.76               98.08               98.56               98.89
PBGCN [46]     98.51              99.07               99.48

TABLE 7
Test accuracy of user classification on unseen sentences when using 3D or 2D input data. [α, β, γ] denotes the number of sentences for train, validation, and test, respectively.

Model          [5,10,35] Acc(%)   [10,10,30] Acc(%)   [15,10,25] Acc(%)   [20,10,20] Acc(%)   [25,10,15] Acc(%)
StyleNet 2D    99.41              99.47
In this experiment, we compared our method with VAC RGB-based methods in an uncontrolled-environment scenario. Even though RGB-based methods perform well in a controlled environment, their performance tends to decrease severely under alternating scene properties such as lighting and noise. Although data augmentation can increase these methods' robustness to challenging environmental conditions, it is impossible to simulate all possible scenarios; therefore, an RGB-based approach tends to fail in the wild. We thus explored our method's robustness under challenging environmental conditions to verify the skeleton-based approach's superiority in the task of typing-style learning for person identification. We performed this experiment in a similar manner to 6.2, but with some differences. We trained each model using data augmentation techniques such as scaling, lighting, and noise. Later, during test time, we applied different data augmentations, e.g., different lighting and noise models than those used during training, to the input videos.

Results for this experiment appear in Table 5. While all the compared methods achieved a high accuracy rate under a controlled environment, their accuracy rate dropped in the uncontrolled-environment scenario. Our method's performance did not change except for a slight decline of less than 0.5% in its accuracy rate. It is much easier to train a joint detector to operate in an uncontrolled environment, since it locates the joints by the input image and the hand context altogether. Unlike the image appearance, the hand context does not depend on the environment. Therefore, the joint localizer can better maintain its performance under varying conditions, making our pipeline resilient to this scenario.
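The kind of train-time augmentation applied to the RGB baselines (scaling, lighting, noise) can be expressed as below; the specific transforms and magnitudes are our assumptions, and at test time a different set of lighting and noise parameters is drawn, as described above.

```python
import torch
from torchvision import transforms

def rgb_train_augmentations():
    """Illustrative augmentation pipeline for the RGB baselines:
    random scaling, lighting changes, and additive Gaussian noise."""
    return transforms.Compose([
        transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),    # scaling
        transforms.ColorJitter(brightness=0.4, contrast=0.4),     # lighting
        transforms.ToTensor(),
        transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0.0, 1.0)),  # noise
    ])
```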
We conducted an experiment that evaluates our model using 3D input and the trade-off between 3D and 2D input data. We used the How We Type dataset [47], which contains the 3D coordinates of 52 joints from both hands for a total of 30 different persons, where each person typed 50 sentences. Overall, we tested five different splits of the data, where each split contains a different number of training sentences. We randomly divided the data between training, validation, and test in a similar manner to 6.2, according to the partitioning setting of each split. We repeated this scheme several times for an accurate assessment of the model's performance.

Fig. 5. Adjacency matrices of two hands: (a) 1st subset, (b) 2nd subset, (c) 3rd subset. Each matrix is built by diagonally concatenating two replicas of its one-hand version.
We used 21 out of the 26 joints of each hand, for consistency with all other experiments, and followed the [17] partition strategy mentioned earlier in the paper. Figure 5 contains the adjusted adjacency matrices that enable our model to learn the unique dependencies between the joints of both hands. When we tested our model with 3D coordinates as input, the z-axis data replaced the score input. Therefore, each frame's data consists of the 42 (x, y, z) coordinates of the joints from both hands.

The results for this experiment appear in Table 6, where we can see that even though our model trained on only 10% of the entire data, it achieved a high accuracy rate and outperformed all other models. Results for the trade-off between 2D and 3D input data appear in Table 7. According to the results, we can see that our model achieves similar performance when provided with either 2D or 3D input data. Unlike other tasks, where the model benefits from the 3rd dimension, it seems unneeded in this task.
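The two-hand adjacency matrices of Fig. 5 are built by placing two copies of the one-hand matrix on the diagonal, which can be written as a small helper; the function name is our own.

```python
import numpy as np

def two_hand_adjacency(A_one_hand):
    """Diagonally concatenate two replicas of a one-hand adjacency matrix
    (21 x 21) into a 42 x 42 two-hand matrix, as in Fig. 5. Joints of
    different hands are not directly connected in this construction."""
    n = A_one_hand.shape[0]
    A_two = np.zeros((2 * n, 2 * n), dtype=A_one_hand.dtype)
    A_two[:n, :n] = A_one_hand    # left-hand block
    A_two[n:, n:] = A_one_hand    # right-hand block
    return A_two
```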
7 ABLATION STUDY

We conducted an ablation study to examine the effectiveness of our added blocks. We performed this experiment in the same manner as 6.2, as this scenario offers a more challenging test case in which the true value of our comprised modules can manifest. The models' training was conducted as described in Section 5.3.

According to the results reported in Table 8, we can see that each added block improves the accuracy rate when compared with the baseline. The most significant improvement was achieved when all the blocks were added together. On a broader note, applying [17], [18], or any other variant of these methods to a small deformable structure will bias it toward close-ranged dependencies (due to the softmax normalization constructing $C_k$). As the close- and long-range concept is no longer applicable in our task (moving only one of the hand's joints is almost impossible), these models achieve results inferior to our model, which focuses on non-local spatial and temporal connectivity. Specifically, it constructs a new order of information: each joint can interact with all relevant (by attention) joints from all time steps, helping our model extract more meaningful motion patterns in space and time.

TABLE 8
Test accuracy of user classification on unseen sentences when adding each module to our baseline. [α, β, γ] denotes the number of sentences for train, validation, and test, respectively. NL denotes non-local, TNL denotes the temporal non-local unit, and SNL denotes the spatial non-local unit.

Model                    [4,2,4] Acc(%)   [3,2,5] Acc(%)   [2,2,6] Acc(%)
AGCN [18]                99.04            98.82            97.97
W/ downsample unit       99.47            99.39            98.72
W/ downsample + TNL      99.62            99.59            99.13
W/ downsample + SNL      99.76            99.71            99.28
StyleNet
8 CONCLUSIONS
We introduced
StyleNet, a novel architecture for skeleton-based typing-style person identification. Motivated by [37], we redesigned the spatial-temporal relationships, allowing for a better longitudinal understanding of actions.
StyleNet was evaluated on the two newly presented datasets; it outperformed all compared skeleton-based action classification models by a large margin when tested in the presence of noisy data, and it also outperformed them when tested under controlled conditions.

REFERENCES

[1] D. Isenor and S. G. Zaky, "Fingerprint identification using graph matching," Pattern Recognition, vol. 19, no. 2, pp. 113–122, 1986.
[2] W. Barkhoda, F. Akhlaqian, M. D. Amiri, and M. S. Nouroozzadeh, "Retina identification based on the pattern of blood vessels using fuzzy logic," EURASIP Journal on Advances in Signal Processing, vol. 2011, no. 1, p. 113, 2011.
[3] R. Brunelli and T. Poggio, "Face recognition: Features versus templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042–1052, 1993.
[4] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[5] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[6] C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," in Advances in Neural Information Processing Systems, 2016, pp. 64–72.
[7] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, "Jointly attentive spatial-temporal pooling networks for video-based person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4733–4742.
[8] O. Kopuklu, N. Kose, and G. Rigoll, "Motion fused frames: Data level fusion strategy for hand gesture recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2103–2111.
[9] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[10] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[11] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, and L. Van Gool, "Temporal 3d convnets: New architecture and transfer learning for video classification," arXiv preprint arXiv:1711.08200, 2017.
[12] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3d skeletons as points in a lie group," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
[13] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
[14] C. Li, Q. Zhong, D. Xie, and S. Pu, "Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation," arXiv preprint arXiv:1804.06055, 2018.
[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields," arXiv preprint arXiv:1812.08008, 2018.
[16] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations (ICLR), 2017.
[17] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-stream adaptive graph convolutional networks for skeleton-based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12026–12035.
[19] H. Wang, F. He, Z. Peng, Y. Yang, T. Shao, K. Zhou, and D. Hogg, "Smart: Skeletal motion action recognition attack," arXiv preprint arXiv:1911.07107, 2019.
[20] N. K. Ratha, K. Karu, S. Chen, and A. K. Jain, "A real-time matching system for large fingerprint databases," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 799–813, 1996.
[21] R. Joyce and G. Gupta, "Identity authentication based on keystroke latencies," Communications of the ACM, vol. 33, no. 2, pp. 168–176, 1990.
[22] D. Mahar, R. Napier, M. Wagner, W. Laverty, R. Henderson, and M. Hiron, "Optimizing digraph-latency based biometric typist verification systems: inter and intra typist differences in digraph latency distributions," International Journal of Human-Computer Studies, vol. 43, no. 4, pp. 579–592, 1995.
[23] F. Monrose and A. Rubin, "Authentication via keystroke dynamics," in Proceedings of the 4th ACM Conference on Computer and Communications Security, 1997, pp. 48–56.
[24] W. Konen and E. Schulze-Kruger, "Zn-face: A system for access control using automated face recognition," in Proceedings of the International Workshop on Automated Face- and Gesture-Recognition. Citeseer, 1995.
[25] W. Zhao, A. Krishnaswamy, R. Chellappa, D. L. Swets, and J. Weng, "Discriminant analysis of principal components for face recognition," in Face Recognition. Springer, 1998, pp. 73–85.
[26] R. P. Wildes, "Iris recognition: an emerging biometric technology," Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363, 1997.
[27] C. BenAbdelkader, R. Cutler, and L. Davis, "Stride and cadence as a biometric in automatic person identification and verification," in Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2002, pp. 372–377.
[28] S. Fong, Y. Zhuang, and I. Fister, "A biometric authentication model using hand gesture images," Biomedical Engineering Online, vol. 12, no. 1, p. 111, 2013.
[29] J. Roth, X. Liu, and D. Metaxas, "On continuous user authentication via typing behavior," IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4611–4624, 2014.
[30] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1. Prague, 2004, pp. 1–2.
[31] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," IEEE, 2008, pp. 1–8.
[32] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in CVPR 2011. IEEE, 2011, pp. 3169–3176.
[33] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2011.
[34] M. Raptis, I. Kokkinos, and S. Soatto, "Discovering discriminative action parts from mid-level video representations," IEEE, 2012, pp. 1242–1249.
[35] A. Jain, A. Gupta, M. Rodriguez, and L. S. Davis, "Representing videos using mid-level discriminative patches," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2571–2578.
[36] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[37] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[38] R. Li, S. Wang, F. Zhu, and J. Huang, "Adaptive graph convolutional neural networks," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[39] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[40] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[41] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[42] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[43] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC, vol. 40. Citeseer, 2011, pp. 75-1.
[44] T. Ho, "Implementation of convolutional pose machines tensorflow," https://github.com/timctho/convolutional-pose-machines-tensorflow.
[45] Y.-F. Song, Z. Zhang, and L. Wang, "Richly activated graph convolutional network for action recognition with incomplete skeletons," IEEE, 2019, pp. 1–5.
[46] K. Thakkar and P. Narayanan, "Part-based graph convolutional network for action recognition," arXiv preprint arXiv:1809.04983, 2018.
[47] A. M. Feit, D. Weir, and A. Oulasvirta, "How we type: Movement strategies and performance in everyday typing," in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ser. CHI '16. New York, NY, USA: ACM, 2016, pp. 4262–4273. [Online]. Available: http://doi.acm.org/10.1145/2858036.2858233
Lior Gelberg is a Ph.D. candidate in the Electrical Engineering department at Tel-Aviv University. He received his BSc in 2016 and his MSc in 2020, both in Electrical Engineering from Tel-Aviv University. His main areas of interest are motion and gesture recognition, video analysis, and forensics.
David Mendelovic is a professor of Electrical Engineering at Tel-Aviv University. He held a post-doctoral position at the University of Erlangen-Nurnberg, Bavaria, as a MINERVA Postdoctoral Fellow. He is the author of more than 130 refereed journal articles and numerous conference presentations. His academic interests include optical information processing, signal and image processing, diffractive optics, holography, temporal optics, and optoelectronic and optically interconnected systems.