Motion Similarity Modeling -- A State of the Art Report
Technical Report VR-TR-001
Anna Sebernegg, Peter Kán, Hannes Kaufmann
Virtual and Augmented Reality Group
Institute of Visual Computing and Human-Centered Technology
Vienna University of Technology
August 14, 2020
Abstract
The analysis of human motion opens up a wide range of possibilities, such as realistic training simulations or authentic motions in robotics or animation. One of the problems underlying motion analysis is the meaningful comparison of actions based on similarity measures. Since motion analysis is application-dependent, it is essential to find the appropriate motion similarity method for the particular use case. This state of the art report provides an overview of human motion analysis and different similarity modeling methods, while mainly focusing on approaches that work with 3D motion data. The survey summarises various similarity aspects and features of motion and describes approaches to measuring the similarity between two actions.

Chapter 1 Introduction

Advances in motion capture technologies to digitize human motion have created a foundation for motion analysis. One main underlying problem of motion analysis is the meaningful comparison of motions by using similarity measures [Röd06]. Similarity measures used for the detection and analysis of human motion depend on the context and goal of the intended application [Röd06]. Consequently, it is essential to consider which method and similarity model is suitable.

Related work in the area of motion similarity proposed different similarity models along with algorithms for applications in various fields. In the medical field, for instance, it can be utilized to track and analyze the process of rehabilitation treatments more precisely and further to determine their success [FBSL12, ZLER14, CLZ+].

Chapter 2 Motion

Valčík defines motion as a set of trajectories or as a sequence of poses [Val16]. Natural human motion depends on the three spatial dimensions as well as on time and is influenced by internal and external factors [Röd06].
The motion of the human skeleton can be described by the following three types of rigid motion, as described by Reyes [Rey16]:

• Linear
• Angular
• General

Linear motion describes a translation in a particular direction over time, while angular motion describes a rotation around a single axis. General motion is the combination of both linear and angular motion.
There are several different technologies for recording full-body motion in order to perform human motion analysis, including inertial and optical sensors [Rey16, Röd06]. Primarily appearance-based and model-based approaches are used for capturing motion data, as described by Valčík et al. [Val16]. Appearance-based methods work directly with recorded sequences, like video sequences from a single video camera, from which it is possible to extract, for example, silhouettes as data representations [Val16]. Appearance-based methods usually utilize heuristic assumptions to establish feature correspondence between successive frames [AC99]. Model-based methods, on the other hand, apply predefined models such as skeleton models or volumetric models, which makes feature correspondence and body structure recovery easier to establish, as the tracked data only needs to be matched to the model [AC99]. Because of the complexity of human motion, due to the vast number of degrees of freedom and the limited number of recordable points, the predefined models representing the human body are often simplified, typically by using a skeleton model with restricted sets of joints connected by rigid bones [Röd06]. The capturing of motion results in motion capture data (MoCap data). MoCap data formats are not consistently standardized, and thus various formats, like the proprietary BVH or CSM files, exist [Val16].

2.2 3D Skeleton Representation of MoCap Data
Motion data can be fitted into a skeletal model by transforming the data into a joint chain [Wan16]. As mentioned above, the 3D skeleton representation uses an additional abstraction over the recorded data and provides position, view, and scale invariance due to the known perspective. However, calculations based on skeleton data can be computationally intensive if, for example, all joints per frame have to be traversed to derive more features [Val16]. The data representation may also lack smoothness due to motion reconstruction. Therefore, filtering techniques are used to reduce outliers and noise in motion data [Wan16]. The number of joints and limbs used in a 3D skeleton data set is also dependent upon the application. Figure 2.1 shows one possible 3D skeleton data structure.

Figure 2.1: 3D skeletal kinematic chain model. The joints are represented as dots and labeled. The lines connecting the joints represent rigid bones.

Chapter 3 Similarity Aspects of Motion
Whether two human motions are considered similar is derived from the context of the intended application, since similarity itself has no unambiguous definition [Röd06]. Various factors can play a role in whether a motion is interpreted as similar to another or not. In some applications, only the rough course of motion may be of interest, while in others, even subtle nuances are viewed as differences. As a result, a central task of motion comparison or analysis is the design of a suitable similarity model; for instance, the selected or designed similarity model should be invariant to arbitrary or tampering aspects [MR08]. Some of the following similarity aspects may influence the design decisions of a similarity model.
In many cases, transformations relative to the origin are not considered a factor of difference in similarity models; accordingly, two motions are considered similar if the only differences are global transformations such as absolute positions in time or space [Röd06]. Global transformations refer not only to translations in global space but also to rotations about an axis of the global coordinate system or the overall scale/speed of the actor [MR08].
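To make a similarity model invariant to such global transformations, poses are often normalized before comparison. The following sketch removes global translation and the rotation about the vertical axis from one pose; the joint indices and the y-up coordinate convention are illustrative assumptions, not a fixed skeleton standard.

```python
import numpy as np

def normalize_pose(joints, root=0, left_hip=1, right_hip=2):
    """Remove global translation and the rotation about the vertical (y) axis.

    joints: (J, 3) array of joint positions. The joint indices are
    hypothetical; real skeletons define their own layout.
    """
    # Translate so the root joint sits at the origin.
    centered = joints - joints[root]
    # Estimate the facing direction from the hip axis in the ground plane.
    hip = centered[right_hip] - centered[left_hip]
    angle = np.arctan2(hip[2], hip[0])       # rotation about the y axis
    c, s = np.cos(angle), np.sin(angle)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return centered @ rot_y.T
```

After this step, two poses that differ only by a rigid transform in the ground plane compare as identical.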
Motion content and style are abstract terms that appear in the literature in various forms. Müller et al. refer to motion style as the person's individual characteristics/personalized aspects of motion or their performance and emotional expressiveness. In their paper, they give the example of different walking styles: a walk can be performed in various ways, for example, by tiptoeing, limping, or marching. Different moods can influence the performed motion as well; for example, a walk can seem cheerful or angry. In contrast, motion content is only related to the semantics of motion and thus is close to the raw data [MR08]. Lee and Elgammal define motion style in the context of gait recognition as the "time-invariant personalized style of the gait which can be used for identification", and motion content as a "time-dependent factor representing different body poses during the gait cycle" [LE04]. For some applications, it may be interesting to separate the content and style of a motion, e.g., if the aim is to identify related motions by content independent of motion style. This then has to be taken into account when designing the similarity model, for example by utilizing concepts like logically similar motion detection, where qualitative features are used to cope with the significant numerical differences in 3D positions or joint angles that can arise through different styles, as described by Müller et al. [MR08]. Lee and Elgammal separated the gait style with a bilinear model to utilize it for gait recognition, and Davis and Gao presented an approach for modeling and recognizing different action styles [LE04, DG03]. Müller et al. give a further, more detailed overview of how the concepts of motion content and style are treated in the literature [MR08].
Logical and numerical similarity are concepts closely related to motion content and style. Kovar et al. define logical similarity of motions as "variations of the same action or sequence of actions" [KG04]. Therefore, logically similar motions share the same action pattern [CSLL12]. According to Müller et al., the variation of logically similar actions can be influenced by both the spatial and the temporal domain. For example, two walking sequences can be logically similar even when they contain significant spatial and temporal differences [MR08]. Logical similarity focuses on the content of actions while masking out factors of the individual motion style [Röd06]. As an example, different variations of a walking action, resulting from the individual styles of the performers, can still be perceived as logically similar in an application, since all of them can be classified as locomotion. Motions are numerically similar if the corresponding numerical values, such as skeleton poses, are approximately the same [KG04]. Algorithms developed for numerically similar motion detection are usually based on numerical/quantitative features that are semantically closer to the raw MoCap data than qualitative descriptions [MR08].

It is important to note that logical and numerical similarity do not imply each other. Kovar et al., for example, state that action sequences which are referred to as logically similar may be numerically dissimilar and vice versa [KG04]. Their publication "Automated extraction and parameterization of motions in large data sets" also gives examples that highlight this concept.
Partial similarity is the case when certain body parts are moving similarly, while other parts of the body move in different ways [Röd06]. Sometimes partial similarities are more important than the overall similarity of a motion. Integrating extracted features from irrelevant body parts into the similarity measure can then affect the results negatively [CSLL12]. Müller et al. use a set of Boolean geometry features to express the relations between body parts [MRC05]. Chen et al. describe a partial similarity motion retrieval based on geometric features [CSLL12]. As the number of relative geometry features for human motion is vast, they utilized AdaBoost for the selection of effective features [CSLL12]. Jang et al. synthesize new human body motions by combining different partial motions. They analyze the similarity of partial motions to choose the most natural-looking combinations [JLLL08]. Rule-based approaches, as described by Zhao et al., could perhaps also be used to calculate partial similarity [ZLER14].

Chapter 4 Human Motion Features
Human motion features, as discussed by Valčík et al., describe different characteristics of human motion and are an abstraction of MoCap data to enable further processing [Val16]. MoCap data contains information about the position, orientation, and movement of a person in 3D space, along with noise. Depending on the problem definition and the similarity model, some of this information may be irrelevant or even counterproductive for further calculations, for example in cases where semantic/logical similarity is required regardless of the actual global position or orientation of a person. Therefore, human motion features are derived, which focus only on specific information to leave out unsuitable or misleading data. The selection and combination of these features then depend on the requirements of the given problem [Val16]. It should also be noted that each movement that needs to be distinguished within an application requires an explicit representation by the selected criteria [KSL+].

4.1 Anthropometric Features

Anthropometric features are quantitative measurements that describe body dimensions of the recorded person, for instance, body height and width, lengths of particular bones such as the arm length, as well as joint rotation limits, and therefore do not correspond to motion. Some anthropometric information related to the human bone structure can be extracted from 3D skeleton MoCap data. Anthropometric features have minimal use in appearance-based approaches, as opposed to skeleton models, where several features can be derived [Val16]. Subject-specific features may be helpful in surveillance tasks, such as the recognition and identification of humans, but can be misleading in applications where only the similarity factor between two movements is required. In this case, it is necessary to normalize the 3D skeleton data, as addressed by [VW18]. As an example, the mass of skeleton segments and their center of mass is used by Krüger et al.
together with translation features for comparing motions [KTMW08].
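Extracting bone lengths and normalizing away overall body size can be sketched as follows; the parent table and joint count are illustrative assumptions rather than a standard skeleton definition.

```python
import numpy as np

# Hypothetical parent table for a small kinematic chain; index 0 is the root.
PARENTS = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}

def bone_lengths(joints, parents=PARENTS):
    """Anthropometric features: length of each child-to-parent bone segment."""
    return {child: float(np.linalg.norm(joints[child] - joints[parent]))
            for child, parent in parents.items()}

def normalize_skeleton_scale(joints, parents=PARENTS):
    """Scale the skeleton so its total bone length is 1, removing
    subject-specific body size before similarity computation."""
    total = sum(bone_lengths(joints, parents).values())
    return (joints - joints[0]) / total
```

With this normalization, the same pose performed by a tall and a short subject yields identical feature values.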
4.2 Pose Features

Pose features describe characteristics of single poses and are extracted from each static posture or single frame independently [ANK19]. Pose features are neither influenced by the speed of an action nor by its surrounding frames [Val16]. Their time invariance can make pose similarity calculations (e.g., comparing individual key poses) less complicated than the analysis of the overall action, which often involves the analysis of entire pose sequences or the additional use of extra features. Pose features are used for action recognition tasks, as described by Agahian et al. [ANK19], as well as for motion-comparison tasks, as outlined by Monash et al., where a user has to try to mimic a recorded gesture displayed on a screen [Mon12].
4.2.1 Joint Rotations

Joint angle rotations are measured at each joint in a 3D skeleton model, and it is important to note that their definition depends on the coordinate system used [Val16]. Typically, local or absolute rotations are used. Local rotations describe the rotation of an object relative to its parent. Absolute rotations, on the other hand, describe the rotation of an object relative to the global coordinate system or the coordinate system of the skeletal root. One can transform local rotations to absolute rotations by hierarchically traversing the skeleton from the root and chaining the corresponding rotations. In other words, the local coordinate systems are aligned with the global coordinate system in the case of the absolute description and rotated relative to the parent in the case of the local variant [Val16]. Rotations can also have different mathematical representations, such as Euler angles, quaternions, rotation matrices, and spherical coordinate systems, where each description has its own distinct properties [Huy09]. Furthermore, not all skeleton models are represented by joint rotations; sometimes joint positions are used. Forward and inverse kinematics can be used in such cases to convert between the two systems [Val16]. One major drawback when comparing motions based only on relative joint rotations is that rotations in some body parts have a more significant overall effect on a pose or action than others. For example, a small rotation in the shoulder can lead to much larger changes than one in the wrist [Röd06].

4.2.2 Distance-Based Pose Features
Distance-based pose features are based on the joint positions and measure either the distance between two arbitrary joints or between a joint and a defined edge or plane. The planes for the joint-to-plane distances can be defined either relative to the subject or absolutely. Anthropometric features influence distances between two joints; therefore, normalized 3D skeletons are usually used in these cases [Val16]. An example of joint-to-joint distance-based features, as well as of joint-to-plane distance measurement (in this particular case, the global floor is used as an absolute plane), is given by Ijjina et al., where the features are applied for action recognition [IM14]. Two examples of joint-to-relative-plane distances are given in Müller et al. and Müller and Röder [MRC05, MR06].
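Both feature types reduce to elementary geometry. A minimal sketch, with joint indices and the floor plane as placeholder assumptions:

```python
import numpy as np

def joint_to_joint(joints, a, b):
    """Distance between two joints of a (J, 3) pose, e.g. hand to hand."""
    return float(np.linalg.norm(joints[a] - joints[b]))

def joint_to_plane(joints, j, point, normal):
    """Signed distance of joint j to a plane given by a point and a normal.
    point=(0, 0, 0) and normal=(0, 1, 0) give the height above a global floor."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return float(np.dot(joints[j] - np.asarray(point, dtype=float), n))
```

A subject-relative plane is obtained the same way by deriving `point` and `normal` from body joints instead of global constants.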
4.2.3 Relational Features

Müller et al. propose relational features as geometric relations between specified body parts or points included in 3D skeleton data [MRC05]. They provide a semantic representation of motion and are invariant "to various kinds of spatial deformations of poses" [Röd06]. Previously discussed features, such as joint positions, are numerical and thus a quantitative description of motion. Numerical features can be sensitive to pose deformations; for this reason, they are not always suitable for logical similarity detection [TCKL13]. Relational features, on the other hand, are a qualitative description that defines individual or common sequential characteristics of logically similar motions and have the (for logical similarity favorable) property of being invariant to local deformations [MR08]. In Müller et al., a set of Boolean geometric features is used [MRC05]. Figure 4.1 shows some possible Boolean expressions, such as 'Is the right foot in front of the left foot?' or 'Is the left arm bending?'. In the paper "Efficient motion search in large motion capture", written by Yi, more generalized variations of relational features are described, where only features of bones that count as dominant in a specific motion are considered [Yi06].

Figure 4.1: Examples of relational features as Boolean expressions, as described by Röder et al. [Röd06].
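Such Boolean expressions reduce to simple geometric tests. A sketch of two of them; the facing vector and the bend threshold are assumptions for illustration, not values from the cited papers:

```python
import numpy as np

def in_front_of(query, anchor, direction):
    """Boolean relational feature: does `query` lie on the positive side of
    the plane through `anchor` whose normal is `direction`?"""
    return bool(np.dot(np.asarray(query, float) - np.asarray(anchor, float),
                       np.asarray(direction, float)) > 0.0)

def arm_bent(shoulder, elbow, wrist, threshold_deg=120.0):
    """'Is the arm bending?': true when the elbow angle falls below a threshold."""
    u = np.asarray(shoulder, float) - np.asarray(elbow, float)
    v = np.asarray(wrist, float) - np.asarray(elbow, float)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return bool(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))) < threshold_deg)
```

'Is the right foot in front of the left foot?' then becomes `in_front_of(right_foot, left_foot, facing_direction)`, which stays true across stylistic variations of the same step.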
The extracted silhouettes from appearance-based models can directly be used as pose features, either as the silhouette region or its contour [Val16]. Phillips et al. proposed a baseline algorithm for gait analysis that utilizes the silhouette as a feature [PSR+].

4.3 Transition Features

Transition features, as defined by Valčík et al., describe characteristics of transformations/displacements between two or more sequential poses, such as the displacement of joints. Just like pose features, they are bound to one pose; however, precedent and subsequent poses in a sequence affect the computations as well. A typical transition feature is the instantaneous velocity [Val16].

For transition features, it must be considered that joint motion cannot only be triggered by the joint itself but also by parented joints, as stated by Kamel et al. For example, our fingers can move separately from the rest of our bodies, but they can also move due to the movement of our wrist, elbow, shoulder, or the whole body. When walking, each joint in our body moves along, even if some parts of the body are held steady. Moreover, most transition features are not explicit; for example, joint position trajectories do not hold information about the direction of movement [KSL+]. Kamel et al. further distinguish between local and global motion. They describe the local motion of a joint as the displacement relative to a parental joint, while global motion describes the displacement according to a fixed joint (root) or the global coordinate system. The benefit of using a local coordinate system is that it provides information about the influence of the parent joints, and therefore it is possible to distinguish whether a joint moved by itself or was (partly) moved by one of its parents [KSL+].

Instantaneous velocity describes the change of joint position or joint rotation between two sample points (i.e., the movement of a joint between two frames).
Both the absolute and the relative velocity apply as transition features, though only the relative velocity describes the relation between velocities of the joints. Kamel et al. present a motion quantification and subsequent similarity evaluation using both translational and angular velocities from 3D skeletal data, with the distinction between local and global motion [KSL+].

Another transition feature is the instantaneous joint acceleration. The acceleration is derived from velocity and therefore has similar characteristics as velocity features; the quality of this feature has, for example, a similar dependency on noise. In Krüger et al., the acceleration of the center of mass is used to detect so-called non-contact phases, since the acceleration then corresponds to the acceleration due to gravity [KTMW08]. As mentioned above, acceleration is also a feature utilized in Moencks et al. [MDSRK19].
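Both features follow from finite differences over a sampled joint trajectory. A minimal sketch, where the frame spacing `dt` is assumed to come from the recording:

```python
import numpy as np

def instantaneous_velocity(trajectory, dt):
    """Per-frame velocity of one joint from its (T, 3) position trajectory,
    approximated as the displacement between successive frames over dt."""
    return np.diff(np.asarray(trajectory, float), axis=0) / dt

def instantaneous_acceleration(trajectory, dt):
    """Acceleration as the finite difference of the instantaneous velocity;
    like velocity, it inherits the noise of the underlying positions."""
    return np.diff(instantaneous_velocity(trajectory, dt), axis=0) / dt
```

Because each differencing step amplifies measurement noise, filtering the positions beforehand is common practice.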
The mentioned transition features can be classified as kinematic properties. Kinetic properties, like forces, are used in combination with joint angles for measuring motion similarity by Yang et al. [Y+].

4.4 Action Features

Action features describe the characteristics of a complete semantic or logical action. Therefore, the motion sequence has to be analyzed to extract actions beforehand. This extraction is either done by user input or automatically based on other features [Val16]. For instance, an action could be delimited by two key poses. Examples of action features are the duration of an action, the total displacement of joints, periodicity patterns such as walk cycles and rhythms of motion, as well as the average velocity and average acceleration [Val16].

The main difference to the instantaneous velocity described in Section 4.3 is that the average velocity not only represents a short momentum between two frames but averages it over a whole sequence or total action. The average velocity, therefore, is the total displacement divided by the total time of the action.

In other publications, statistical descriptions, such as the mean, median, mode, standard deviation, or minimum or maximum values of, e.g., acquired joint angle velocity, are utilized as action features [Val16]. An example is given by Ball et al., where the k-means algorithm is used for gait recognition [BRRV12]. Other widely adopted features are trajectories in two- or three-dimensional space [Rey16]. Trajectories describe how a coordinate or value evolves and can be displayed graphically as a curve diagram. Typical joint trajectories, for example, represent the path a joint follows through space as a function of time. Joint-angle trajectories, as utilized by Zhao et al.
as well as by Tanawongsuwan and Bobick, represent the change of joint angles over time [TB01, ZHD+].

Chapter 5 Human Motion Comparison

Motion similarity analysis is application-oriented; therefore, different approaches were developed over time to meet the particular requirements [TLKS08]. The similarity between poses and actions can be measured by using distance metrics or learning methods [Gav99]. Additional tasks that build on the measured similarity, such as action recognition, depend heavily on the accuracy of the distance metric or learning process [Por04]. This section gives an overview of a selection of approaches and focuses mainly on the different choices of feature vectors, preprocessing of the data, and similarity measurements.
5.1 Dimensionality Reduction

The analysis of human motion and the calculation of similarity are complicated by the high dimensionality and complexity of the motion data [MBR17, WSP09]. Since it is not easy to work with motion data in its raw form, different approaches are utilized to simplify the data by dimensionality reduction and filtering [FF05]. Two popular methods, Principal Component Analysis and the Self-Organizing Map, are described below.
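The core projection step shared by such reduction methods can be sketched numerically, assuming each pose is flattened into one row of a data matrix:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (one flattened pose per row) onto the first k
    principal components, found via singular value decomposition."""
    mean = X.mean(axis=0)
    Xc = X - mean                            # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components, mean
```

`reduced @ components + mean` approximately reconstructs the original data whenever the first k components capture most of its variance.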
Principal Component Analysis (PCA) is a linear method for simplifying a multivariate dataset by reducing data dimensionality [LC11]. PCA and other projection methods attempt to find the best approximating subspace for the data set onto which the data can be projected (typically in terms of variance) [WSP09]. The execution of a PCA on a given dataset leads to a vector space of equal dimensions where each axis in space represents a principal component vector. Points in space are then weighted combinations of principal components [FF05]. Data reduction is obtained by using only a subset of the principal components [FF05, LC11]. Several authors, such as Röder [Röd06], Agahian et al. [ANK19], or Witte et al. [WSP09], apply PCA before the comparison of motion data to achieve dimensionality reduction. K. Forbes and E. Fiume used a weighted PCA-based pose representation for pose-to-pose distance calculations to find similar motions in a database [FF05].

A Self-Organizing Map (SOM) is a neural network that is trained by unsupervised learning and produces a low-dimensional and discrete representation of the input space. Therefore, SOMs are used for dimension reduction [HW10, SKK04]. Sakamoto et al. map large data sets onto a two-dimensional discrete space with the help of SOMs [SKK04]. Huang and Wu also utilize this technology for their human action recognition approach to reduce data dimensionality as well as to cluster feature data [HW10]. They use sequences of human silhouettes as the primary feature and extract key poses through the trained SOM. As a similarity measure, they utilize the Euclidean distance and later the longest common subsequence (LCS) method for comparing trajectories [HW10].

5.2 Local Similarity Measures

Single frames can hold a lot of information if they contain an expressive pose, such as a kick or hit of a tennis player. Local similarity measures compare such individual poses and, therefore, do not take temporal aspects into account [Röd06].
They can be used for comparing key poses or are incorporated into more complex, time-dependent similarity models. Popular pose distance functions are, for example, the Manhattan distance (L1), the Euclidean distance (L2), and the cosine distance [Val16]. Chan et al., for example, utilize the cosine similarity as a local similarity measure for each pair of joint angles in a frame [CLTK07]. They proposed an immersive performance training tool in which trainees have to imitate a simulated trainer. The trainees' movement was compared and analyzed by posture matching in order to give them feedback [CLTK07]. In addition to the drawback of needing to define thresholds as error tolerance, this method also introduces the problem of not being invariant to global rotations if absolute instead of relative angles are used. Other distance metrics for rotations can be utilized if joint rotations are represented as unit quaternions, such as the total weighted quaternion distance or the geodesic distance [Röd06]. The following distance functions for rotations are analyzed in more detail by Huynh [Huy09]:

The Euclidean distance applies the Pythagorean theorem to p dimensions. For 2D, this distance measure can be interpreted as the length of the chord between two Euler angles on the unit circle. Euler angles are not unique, i.e., several Euler angles can represent the same rotation; however, their Euclidean distance may result in a non-zero value, which does not reflect the actual distance between them. The Euclidean distance between Euler angles can also lead to large distance values between nearby rotations, while two distant rotations may lead to smaller values. Therefore, the paper does not recommend this metric for calculating the difference between two rotations [Huy09]. Switonski et al. stated that distance functions based on quaternions allow a more efficient assessment of rotation similarities in comparison to Euler angles [SMJ+].

Unit quaternions give a more flexible representation of rotations. Unlike Euler angles, their values do not depend on the order of rotations about the three principal axes and do not suffer from the gimbal lock problem.
They are well suited for interpolations as well [SMJ+]. Equation 5.1 defines the distance between two rotations as the Euclidean distance between two unit quaternions q1 and q2. Since a unit quaternion q and its negation −q denote the same rotation, the minimum operator is required [Hug14].

Φ(q1, q2) = min(||q1 − q2||, ||q1 + q2||).    (5.1)

A similar metric for calculating the distance between unit quaternions is given by the inner product, as denoted in Equation 5.2 [Huy09]. This metric is, for example, used by Wunsch et al. for 3D object pose estimation [WWH97].

Φ(q1, q2) = min(arccos(dot(q1, q2)), π − arccos(dot(q1, q2))).    (5.2)

The metric can alternatively be replaced by the following computationally more efficient functions:

Φ(q1, q2) = arccos(|dot(q1, q2)|).    (5.3)

Φ(q1, q2) = 1 − |dot(q1, q2)|.    (5.4)

Distance functions can be based on matrix representations as well. One drawback is, however, that they are usually more computationally expensive. The metric given as Equation 5.5 determines the amount of rotation required to align a rotation matrix R1 with R2. Among the distance metrics described by Huynh, it is the most computationally expensive one; however, the computation can be significantly reduced if unit quaternions instead of matrices are used [Huy09].

Φ(R1, R2) = ||I − R1 R2^T||_F.    (5.5)

The metric denoted in Equation 5.6 gives a geodesic on the unit sphere, which can be interpreted as the shortest curve between two rotations lying on the surface of the unit sphere. As shown by Huynh, this metric has a linear relationship with Equation 5.3 and therefore can be calculated more simply based on unit quaternions.

Φ(R1, R2) = ||log(R1 R2^T)||.    (5.6)

Arikan and Forsyth give a more complex example of the usage of a local similarity measure by utilizing multiple features like joint positions, velocities, and accelerations [AF02].
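The quaternion metrics translate directly into code. A sketch of Equations 5.1, 5.3, and 5.4, with quaternions represented as 4-vectors assumed to be of unit length:

```python
import numpy as np

def quat_dist_euclidean(q1, q2):
    """Eq. 5.1: minimum of the Euclidean distances to q2 and -q2, since a
    unit quaternion and its negation denote the same rotation."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    return float(min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2)))

def quat_dist_arccos(q1, q2):
    """Eq. 5.3: arccos of the absolute inner product (in radians)."""
    d = abs(float(np.dot(q1, q2)))
    return float(np.arccos(np.clip(d, -1.0, 1.0)))

def quat_dist_inner(q1, q2):
    """Eq. 5.4: the computationally cheaper 1 - |dot(q1, q2)| variant."""
    return 1.0 - abs(float(np.dot(q1, q2)))
```

The linear relationship between Equation 5.3 and the geodesic metric of Equation 5.6 is what makes the quaternion form the cheaper substitute for the matrix computation.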
Many similarity metrics, such as the one proposed by Arikan and Forsyth, utilize different joint weights to control the relevance of individual joints [YG05]. One major drawback of such metrics is the need to define the optimal attribute weights for the given problem definition and the dependence of the quality of the results on the selected weights [YG05].

Kovar et al. employ a model-based approach working on point clouds to measure the similarity between two frames [KGP08]. The point cloud distance metric is defined as the minimum difference between two point sets (a squared form of the L2 distance) [Val16]. Each point cloud is the composition of smaller point clouds that represent the pose at each frame in a defined window of neighboring frames [KGP08]. Drawbacks of the point cloud approach described by Kovar et al. concern coordinate invariance and efficiency [YG05].

5.3 Global Similarity Measures

In image-based human pose comparison, the similarity can be assessed through conventional approaches such as measuring the distance between joint positions or rotations [CJTC+]. Forbes and Fiume use a weighted PCA-based pose representation in combination with a Euclidean distance metric [FF05]. The PCA trajectories are then compared by using the Dynamic Time Warping (Dtw) distance [Röd06]. A similar algorithm for constructing a content-based human motion retrieval system, based on SOM trajectory clusters, is proposed by Chiu et al. [CCW+04, Röd06].
Dynamic Time Warping
Dynamic Time Warping (DTW) determines the optimal alignment between two given temporal sequences and can be used to measure the similarity between them, as it allows the comparison of two time-series sequences with varying lengths and speeds [Rey16]. Given two time series X = (x_1, x_2, ..., x_N), N ∈ N, and Y = (y_1, y_2, ..., y_M), M ∈ N, with equidistant points in time and sequence lengths N and M, DTW starts by calculating the local cost matrix C ∈ R^(N×M) with entries c_(i,j) = ||x_i − y_j||, i ∈ [N], j ∈ [M], which represents all pairwise distances between X and Y [Sen08]. The algorithm then finds an alignment, or warping path, from the first cell [1, 1] to the last cell [N, M] that runs through the low-cost areas of the local cost matrix. Finding the optimal warping path Ω of minimum total cost, as shown in Figure 5.1, would require testing every possible warping path between X and Y, which would be computationally expensive. Therefore, DTW utilizes dynamic programming to build an accumulated cost matrix D in a recursive fashion; the warping path Ω can then be found by following a greedy backtracking strategy. This optimization leads to a total complexity of O(NM) [Sen08, FBM+18, YCWJ19]. The warping path can then be employed to align the two series in time.

Similarity models using the default DTW implementation on high-dimensional data have, however, two main disadvantages: the quadratic complexity, and the limited ability to capture the semantic relationship between two sequences because the temporal context is disregarded [CJTC+18]. One alternative is Iterative Multi-scale Dynamic Time Warping (IMDTW), which, in contrast to basic DTW, does not require quadratic runtime and memory space [KTMW08]. Reyes refers to FastDTW as a less computationally intensive alternative to DTW, as it is linear in both time and space complexity [Rey16]. Other methods with linear complexity are Uniform Time Warping (UTW) and its extension, Interpolated Uniform Time Warping (IUTW) [Val16]. Further examples of DTW-based similarity models are given by [CCW+04, FF05].

Figure 5.1: Dynamic time warping. Illustration by Yang et al. [YCWJ19].
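To make the recursion concrete, the following minimal NumPy sketch (an illustration of the standard O(NM) dynamic program, not code from any of the surveyed systems) builds the local cost matrix C, fills the accumulated cost matrix D, and backtracks greedily to recover the warping path:

```python
import numpy as np

def dtw(x, y):
    """DTW between two sequences of feature vectors.

    Returns the total alignment cost and the optimal warping path as a
    list of index pairs (i, j) into x and y.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat scalar series as 1-D feature vectors
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)

    # Local cost matrix: pairwise Euclidean distances c[i, j] = ||x_i - y_j||.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # Accumulated cost matrix D, filled recursively (1-based with inf border).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # step in x only
                acc[i, j - 1],      # step in y only
                acc[i - 1, j - 1],  # diagonal step
            )

    # Greedy backtracking from cell [N, M] to cell [1, 1].
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        steps = {(i - 1, j): acc[i - 1, j],
                 (i, j - 1): acc[i, j - 1],
                 (i - 1, j - 1): acc[i - 1, j - 1]}
        i, j = min(steps, key=steps.get)
    path.append((0, 0))
    return acc[n, m], path[::-1]
```

For example, dtw([1, 2, 3], [1, 2, 2, 3]) yields a total cost of 0, since the repeated value in the longer series is absorbed by the warping; this is exactly the invariance to speed and length differences that makes DTW attractive for motion data.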
Hidden Markov Models
Another method for matching time-varying data and for action recognition is the Hidden Markov Model (HMM), a statistical model based on hidden states that represent an activity [AR11]. In contrast to DTW, HMMs involve a training and a classification stage [Gav99]. HMMs also disregard the temporal context by assuming that observations are temporally independent [Wan16]. HMMs are used by Yamato et al. for the matching of human motion [YOI92], while Mandery et al. utilized them to reduce the feature space for whole-body human action recognition tasks [MPBA16]. Porikli proposed HMM-based distance metrics to determine the similarity between trajectories [Por04].

DTW and HMMs are widely adopted solutions for the analysis of time-varying data, but other methods for time alignment are used as well. Valčík proposed Uniform Scaling (US), Scaled and Warp Matching (a combination of US and DTW), and Move Split Merge [Val16]. D. M. Gavrila referenced approaches based on Neural Networks (NN) and also mentioned the possibility of disregarding the time component of human motion data by using representations in different spaces, such as a phase space [Gav99]. In "Content-based retrieval for human motion data", template matching is discussed as a time-alignment method in which input patterns are compared with pre-stored patterns in a database [CCW+04]. Müller et al. achieve spatio-temporal invariance in their proposed concept by working with geometric features and adaptive temporal segmentation [MRC05].

Trajectory-based approaches interpret an activity as a set of space-time trajectories. The similarity is then measured between the trajectories to achieve motion analysis and action recognition [AR11, ANK19]. Trajectories can be captured through motion capture techniques with markers or through the space-time movement of 3D skeleton data. The whole set of joint trajectories then represents the full-body motion [Rey16]. Several approaches use trajectories themselves as the motion representation or as further human motion features, as discussed in Section 4 [AR11]. Trajectory-based approaches can achieve a detailed analysis of human motion and are in many cases view-invariant [AR11]. However, they depend on well-reconstructed or well-captured data. Trajectories can also be used to extract further features [Por04, AR11]. Reyes, for example, proposed a human motion model where trajectories are mapped into chain codes by using orthogonal changes of direction. Chain codes allow data reduction and the use of string-matching algorithms [Rey16]. A simple metric to compare a pair of trajectories is the mean of coordinate distances (Cartesian distance, L-norm, ...), which can be enhanced by further statistical values such as the median, variance, minimum, or maximum distance [Por04].
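This mean-of-distances metric and its statistical enhancements can be written down directly. The sketch below is an illustrative implementation (not Porikli's original code) and assumes two trajectories sampled at the same frame times:

```python
import numpy as np

def trajectory_distance_stats(t1, t2):
    """Compare two trajectories of equal duration by frame-wise
    coordinate distances, summarized with simple statistics.

    t1, t2: arrays of shape (T, 3) holding per-frame joint positions.
    """
    t1, t2 = np.asarray(t1, dtype=float), np.asarray(t2, dtype=float)
    if t1.shape != t2.shape:
        raise ValueError("frame-wise metrics require equal-length trajectories")

    # Euclidean (Cartesian) distance between corresponding frames.
    d = np.linalg.norm(t1 - t2, axis=1)

    return {
        "mean": float(d.mean()),
        "median": float(np.median(d)),
        "variance": float(d.var()),
        "min": float(d.min()),
        "max": float(d.max()),
    }
```

The requirement that both trajectories have the same shape makes the limitation discussed next explicit: without normalization, such metrics only apply to trajectories of equal duration.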
As mentioned by Porikli, these trajectory distance metrics have the disadvantage that they depend on mutual coordinate correspondences and are therefore limited to comparing trajectories of equal duration (unless they are normalized) [Por04]. The same paper notes that normalization destroys the temporal properties of the trajectory. Porikli therefore proposed an HMM-based distance metric to compare trajectories with different temporal properties [Por04]. Another approach that solves an action recognition problem by comparing trajectories, as mentioned before, is proposed by Huang et al. [HW10].

Similarity models are often template-based and focus on gesture or action recognition. Rule-based approaches, on the other hand, are primarily used to analyze the correctness of motion, for example in the context of rehabilitation exercise monitoring. Rule-based similarity models define a ground truth by a set of rules and compare the observed motion against these rules instead of comparing it with a previously recorded reference motion. The use of predefined rules can simplify the implementation of real-time feedback to the patient [ZLER14].
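A rule-based similarity model of this kind can be sketched as follows. The joint-angle names and thresholds below are purely hypothetical; in a real system they would come from a therapist-defined ground truth rather than a recorded reference motion:

```python
# Hypothetical rule set: rule name -> (joint-angle key, allowed range in degrees).
RULES = {
    "knee_extended": ("knee_angle", (160.0, 180.0)),
    "back_upright": ("trunk_tilt", (0.0, 15.0)),
}

def check_pose(pose):
    """Return the names of all rules violated by one observed pose.

    pose: dict mapping joint-angle keys to measured values in degrees.
    """
    return [name
            for name, (key, (lo, hi)) in RULES.items()
            if not lo <= pose[key] <= hi]

def exercise_feedback(frames):
    """Per-frame rule violations, usable for real-time patient feedback."""
    return [check_pose(pose) for pose in frames]
```

Because each frame is checked independently against the rules, feedback can be produced as soon as a frame arrives, which is what makes this formulation convenient for real-time monitoring.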
There also exist approaches that measure the similarity of 3D models or single poses by comparing skeleton graphs or other tree-based representations. Brennecke et al. [BI04], for instance, employ graph similarity for 3D shape matching, while Chen et al. proposed a novel skeleton-tree representation matching approach [CHL+17].

Chapter 6

Conclusion

This state of the art report provides a brief overview of multiple approaches to human motion analysis and similarity modeling. It must be noted, however, that only a small insight into the vast number of methods is given. In recent years, many papers on human motion analysis have been released to address problems such as real-time action recognition, motion retrieval, or monitoring the correct execution of an action. For all these tasks, the similarity between actions, or at least between individual poses, must be measured. The proposed approaches to similarity modeling are diverse and focus on different aspects. They are based on various distinct definitions of similarity, depending on their area of application, and utilize different combinations of features. What the models have in common, however, is the high-dimensional and spatiotemporally varying MoCap data on which they are built. Multiple approaches therefore utilize techniques for dimensionality reduction and time alignment, such as the widely adopted Principal Component Analysis or Dynamic Time Warping. When employing such techniques, however, it must be ensured that they do not lead to information loss or high runtime. As a consequence, numerous approaches adopt popular and promising algorithms, such as Dynamic Time Warping, and modify them for their specific purpose. Most of the discussed approaches are template-based, where an observed motion is compared with a pre-recorded one. However, rule-based methods, or a combination of both, could also be of interest for applications in which feedback is provided on the execution of an action.

List of acronyms
DTW Dynamic Time Warping
HMM Hidden Markov Model
IMDTW Iterative Multi-scale Dynamic Time Warping
NN Neural Networks
PCA Principal Component Analysis
SOM Self-Organizing Map
US Uniform Scaling
UTW Uniform Time Warping
IUTW Interpolated Uniform Time Warping

Bibliography

[AC99] Jake K. Aggarwal and Quin Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440, 1999.
[AF02] Okan Arikan and David A. Forsyth. Interactive motion generation from examples. ACM Transactions on Graphics (TOG), 21(3):483–490, 2002.
[ANK19] Saeid Agahian, Farhood Negin, and Cemal Köse. An efficient human action recognition framework with pose-based spatiotemporal features. Engineering Science and Technology, an International Journal, 2019.
[AR11] Jake K. Aggarwal and Michael S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):1–43, 2011.
[BI04] Angela Brennecke and Tobias Isenberg. 3D shape matching using skeleton graphs. In SimVis, pages 299–310. Citeseer, 2004.
[BRRV12] Adrian Ball, David Rye, Fabio Ramos, and Mari Velonaki. Unsupervised clustering of people from 'skeleton' data. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 225–226, 2012.
[BWP08] Matthew Brodie, Alan Walmsley, and Wyatt Page. Fusion motion capture: a prototype system using inertial measurement units and GPS for the biomechanical analysis of ski racing. Sports Technology, 1(1):17–28, 2008.
[CCD+03] Philo Tan Chua, Rebecca Crivella, Bo Daly, Ning Hu, Russ Schaaf, David Ventura, Todd Camill, Jessica Hodgins, and Randy Pausch. Training for physical tasks in virtual environments: Tai chi. In IEEE Virtual Reality, 2003. Proceedings., pages 87–94. IEEE, 2003.
[CCW+04] Chih-Yi Chiu, Shih-Pin Chao, Ming-Yang Wu, Shi-Nine Yang, and Hsin-Chih Lin. Content-based retrieval for human motion data. Journal of Visual Communication and Image Representation, 15(3):446–466, 2004.
[CHL+17] Xin Chen, Jingbin Hao, Hao Liu, Zhengtong Han, and Shengping Ye. Research on similarity measurements of 3D models based on skeleton trees. Computers, 6(2):17, 2017.
[CJTC+18] Huseyin Coskun, David Joseph Tan, Sailesh Conjeti, Nassir Navab, and Federico Tombari. Human motion analysis with deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 667–683, 2018.
[CLTK07] Jacky Chan, Howard Leung, Kai Tai Tang, and Taku Komura. Immersive performance training tools using motion capture technology. In Proceedings of the First International Conference on Immersive Telecommunications, page 7. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2007.
[CLZ+12] Chien-Yen Chang, Belinda Lange, Mi Zhang, Sebastian Koenig, Phil Requejo, Noom Somboon, Alexander A. Sawchuk, and Albert A. Rizzo. Towards pervasive physical rehabilitation using Microsoft Kinect. In , pages 159–162. IEEE, 2012.
[CSC18] Rafael M. O. Cruz, Robert Sabourin, and George D. C. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195–216, 2018.
[CSLL12] Songle Chen, Zhengxing Sun, Yi Li, and Qian Li. Partial similarity human motion retrieval based on relative geometry features. In , pages 298–303. IEEE, 2012.
[DG03] James W. Davis and Hui Gao. An expressive three-mode principal components model of human action style. Image and Vision Computing, 21(11):1001–1016, 2003.
[FBM+18] Duarte Folgado, Marília Barandas, Ricardo Matias, Rodrigo Martins, Miguel Carvalho, and Hugo Gamboa. Time alignment measurement for time series. Pattern Recognition, 81:268–279, 2018.
[FBSL12] Adso Fernández-Baena, Antonio Susín, and Xavier Lligadas. Biomechanical validation of upper-body and lower-body joint movements of Kinect motion capture data for rehabilitation treatments. In , pages 656–661. IEEE, 2012.
[FF05] Kevin Forbes and Eugene Fiume. An efficient search algorithm for motion data using weighted PCA. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 67–76, 2005.
[Gav99] Dariu M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.
[HDIE13] He He, Hal Daumé III, and Jason Eisner. Dynamic feature selection for dependency parsing. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1455–1464, 2013.
[Hug14] David J. Huggins. Comparing distance metrics for rotation using the k-nearest neighbors algorithm for entropy estimation. Journal of Computational Chemistry, 35(5):377–385, 2014.
[Huy09] Du Q. Huynh. Metrics for 3D rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision, 35(2):155–164, 2009.
[HW10] Wei Huang and Q. M. Jonathan Wu. Human action recognition based on self-organizing map. In , pages 2130–2133. IEEE, 2010.
[IM14] Earnest Paul Ijjina and C. Krishna Mohan. Human action recognition based on MoCap information using convolution neural networks. In , pages 159–164. IEEE, 2014.
[JLLL08] Won-Seob Jang, Won-Kyu Lee, In-Kwon Lee, and Jehee Lee. Enriching a motion database by analogous combination of partial human motions. The Visual Computer, 24(4):271–280, 2008.
[KG04] Lucas Kovar and Michael Gleicher. Automated extraction and parameterization of motions in large data sets. ACM Transactions on Graphics (ToG), 23(3):559–568, 2004.
[KGP08] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM SIGGRAPH 2008 Classes, pages 1–10. 2008.
[KSL+19] Aouaidjia Kamel, Bin Sheng, Ping Li, Jinman Kim, and David Dagan Feng. Efficient body motion quantification and similarity evaluation using 3-D joints skeleton coordinates. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
[KTMW08] Björn Krüger, Jochen Tautges, Meinard Müller, and Andreas Weber. Multi-mode tensor representation of motion data. JVRB – Journal of Virtual Reality and Broadcasting, 5(5), 2008.
[LC11] Vittorio Lippi and Giacomo Ceccarelli. Can principal component analysis be applied in real time to reduce the dimension of human motion signals? In BIO Web of Conferences, volume 1, page 00055. EDP Sciences, 2011.
[LE04] Chan-Su Lee and Ahmed Elgammal. Gait style and gait content: bilinear models for gait recognition using gait re-sampling. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings., pages 147–152. IEEE, 2004.
[MBR17] Julieta Martinez, Michael J. Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
[MDSRK19] Mirco Moencks, Varuna De Silva, Jamie Roche, and Ahmet Kondoz. Adaptive feature processing for robust human activity recognition on a novel multi-modal dataset. arXiv preprint arXiv:1901.02858, 2019.
[Mon12] H. Ali Monash. Motion comparison using Microsoft Kinect. Computer science project, Monash University, 2012.
[MPBA16] Christian Mandery, Matthias Plappert, Júlia Borràs, and Tamim Asfour. Dimensionality reduction for whole-body human motion recognition. In , pages 355–362. IEEE, 2016.
[MR06] Meinard Müller and Tido Röder. Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 137–146, 2006.
[MR08] Meinard Müller and Tido Röder. A relational approach to content-based analysis of motion capture data. In Human Motion, pages 477–506. Springer, 2008.
[MRC05] Meinard Müller, Tido Röder, and Michael Clausen. Efficient content-based retrieval of motion capture data. In ACM SIGGRAPH 2005 Papers, pages 677–685. 2005.
[NB13] Sven Nomm and Kirill Buhhalko. Monitoring of the human motor functions rehabilitation by neural networks based system with Kinect sensor. IFAC Proceedings Volumes, 46(15):249–253, 2013.
[Por04] Fatih Porikli. Trajectory distance metric using hidden Markov model based representation. In IEEE European Conference on Computer Vision, PETS Workshop, volume 3. Citeseer, 2004.
[PSR+02] P. Jonathon Phillips, Sudeep Sarkar, Isidro Robledo, Patrick Grother, and Kevin Bowyer. Baseline results for the challenge problem of HumanID using gait analysis. In Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, pages 137–142. IEEE, 2002.
[PWW15] German I. Parisi, Cornelius Weber, and Stefan Wermter. Self-organizing neural integration of pose-motion features for human action recognition. Frontiers in Neurorobotics, 9:3, 2015.
[Rey16] Francisco Javier Torres Reyes. Human motion: analysis of similarity and dissimilarity using orthogonal changes of direction on given trajectories. University of Colorado at Colorado Springs, 2016.
[Röd06] Tido Röder. Similarity, retrieval, and classification of motion capture data. PhD thesis, 2006.
[Sen08] Pavel Senin. Dynamic time warping algorithm review. Information and Computer Science Department, University of Hawaii at Manoa, Honolulu, USA, 855(1-23):40, 2008.
[SKK04] Yasuhiko Sakamoto, Shigeru Kuriyama, and Toyohisa Kaneko. Motion map: image-based retrieval and segmentation of motion data. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 259–266, 2004.
[SMJ+12] Adam Switonski, Agnieszka Michalczuk, Henryk Josinski, Andrzej Polanski, and Konrad Wojciechowski. Dynamic time warping in gait classification of motion capture data. In Proceedings of World Academy of Science, Engineering and Technology, number 71, page 53. World Academy of Science, Engineering and Technology (WASET), 2012.
[TB01] Rawesak Tanawongsuwan and Aaron Bobick. Gait recognition from time-normalized joint-angle trajectories in the walking plane. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 2, pages II–II. IEEE, 2001.
[TCKL13] Tran Thang Thanh, Fan Chen, Kazunori Kotani, and Bac Le. Automatic extraction of semantic action features. In , pages 148–155. IEEE, 2013.
[TLKS08] Jeff K. T. Tang, Howard Leung, Taku Komura, and Hubert P. H. Shum. Emulating human perception of motion similarity. Computer Animation and Virtual Worlds, 19(3-4):211–221, 2008.
[Val16] Jakub Valčík. Similarity models for human motion data. Ph.D. dissertation, 2016.
[VW18] Jan P. Vox and Frank Wallhoff. Preprocessing and normalization of 3D-skeleton-data for human motion recognition. In , pages 279–282. IEEE, 2018.
[Wan16] Qifei Wang. A survey of visual analysis of human motion and its applications. arXiv preprint arXiv:1608.00700, 2016.
[WSP09] K. Witte, H. Schobesberger, and C. Peham. Motion pattern analysis of gait in horseback riding by means of principal component analysis. Human Movement Science, 28(3):394–405, 2009.
[WTNH03] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1505–1518, 2003.
[WWH97] Patrick Wunsch, Stefan Winkler, and Gerd Hirzinger. Real-time pose estimation of 3D objects from camera images using neural networks. In Proceedings of International Conference on Robotics and Automation, volume 4, pages 3232–3237. IEEE, 1997.
[Y+08] Yi-Ting Yang et al. Human recognition based on kinematics and kinetics of gait. In The 13th International Conference on Biomedical Engineering, Suntec Singapore International Convention & Exhibition Centre, Suntec, Singapore, 2008.
[YCWJ19] Chan-Yun Yang, Pei-Yu Chen, Te-Jen Wen, and Gene Eu Jan. IMU consensus exception detection with dynamic time warping – a comparative approach. Sensors, 19(10):2237, 2019.
[YG05] Herb Yang and Tong Guan. Motion similarity analysis and evaluation of motion capture data. 2005.
[Yi06] L. T. Yi. Efficient motion search in large motion capture database. In ISVC, pages 151–160, 2006.
[YOI92] Junji Yamato, Jun Ohya, and Kenichiro Ishii. Recognizing human action in time-sequential images using hidden Markov model. In CVPR, volume 92, pages 379–385, 1992.
[ZHD+04] Xiaojun Zhao, Qiang Huang, Peng Du, Dongming Wen, and Kejie Li. Humanoid kinematics mapping and similarity evaluation based on human motion capture. In International Conference on Information Acquisition, 2004. Proceedings., pages 426–431. IEEE, 2004.
[ZHPL04] Xiaojun Zhao, Qiang Huang, Zhaoqin Peng, and Kejie Li. Kinematics mapping and similarity evaluation of humanoid motion based on human motion capture. In , volume 1, pages 840–845. IEEE, 2004.
[ZLER14] Wenbing Zhao, Roanna Lun, Deborah D. Espy, and M. Ann Reinthal. Rule based realtime motion assessment for rehabilitation exercises. In 2014 IEEE Symposium on Computational Intelligence in Healthcare and e-health (CICARE).