Spatially Encoding Temporal Correlations to Classify Temporal Data Using Convolutional Neural Networks
Zhiguang Wang a,∗, Tim Oates a

a Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, 21228 Maryland, United States
Abstract
We propose an off-line approach to explicitly encode temporal patterns spatially as different types of images, namely, Gramian Angular Fields and Markov Transition Fields. This enables the use of techniques from computer vision for feature learning and classification. We used Tiled Convolutional Neural Networks to learn high-level features from individual GAF, MTF, and GAF-MTF images on 12 benchmark time series datasets and two real spatial-temporal trajectory datasets. The classification results of our approach are competitive with state-of-the-art approaches on both types of data. An analysis of the features and weights learned by the CNNs explains why the approach works.
Keywords:
Time-series, Trajectory, Classification, Gramian Angular Field, Markov Transition Field, Convolutional Neural Networks
1. Introduction
The problem of temporal data classification has attracted great interest recently, finding applications in domains as diverse as medicine, finance, entertainment, and industry. However, learning the complicated temporal correlations in complex dynamic systems is still a challenging problem. Inspired by recent successes of deep learning in computer vision, we consider the problem of encoding temporal information spatially as images to allow machines to "visually" recognize and classify temporal data, especially time series data.

Preliminary versions of parts of this paper appear in the Twenty-Ninth AAAI workshop proceedings on Trajectory-based Behaviour Analytics. ∗ Corresponding author.
Email address: [email protected] (Zhiguang Wang)
Preprint submitted to Journal of Computer and Systems Sciences, September 25, 2015

Recognition tasks in speech and audio have been well studied. Researchers have achieved success using combinations of HMMs with acoustic models based on Gaussian Mixture Models (GMMs) [1, 2]. An alternative approach is to use deep neural networks to produce posterior probabilities over HMM states. Deep learning has become increasingly popular since the introduction of effective ways to train multiple hidden layers [3] and has been proposed as a replacement for GMMs to model acoustic data in speech recognition tasks [4]. These Deep Neural Network - Hidden Markov Model hybrid systems (DNN-HMM) achieved remarkable performance in a variety of speech recognition tasks [5, 6, 7]. Such success stems from learning distributed representations via a deeply layered structure and from unsupervised pretraining by stacking single-layer Restricted Boltzmann Machines (RBMs).

Another deep learning architecture used in computer vision is the convolutional neural network (CNN) [8]. CNNs exploit translational invariance within their structures by extracting features through receptive fields [9] and learn with weight sharing. CNNs are the state-of-the-art approach in various image recognition and computer vision tasks [10, 11, 12]. Since unsupervised pretraining has been shown to improve performance [13], sparse coding and Topographic Independent Component Analysis (TICA) are integrated as unsupervised pretraining approaches to learn more diverse features with complex invariances [14, 15].

CNNs were proposed for speech processing because of their invariance to shifts in time and frequency [16].
Recently, CNNs have been shown to further improve hybrid model performance by applying convolution and max-pooling in the frequency domain on the TIMIT phone recognition task [17]. A heterogeneous pooling approach proved to be beneficial for training acoustic invariance [18]. Further exploration with limited weight sharing and a weighted softmax pooling layer has been proposed to optimize CNN structures for speech recognition tasks [19].

However, except for audio and speech data, relatively little work has explored feature learning in the context of typical time series analysis tasks with current deep learning architectures. [20] explores supervised feature learning with CNNs to classify multi-channel time series on two datasets. They extracted subsequences with sliding windows and compared their results to Dynamic Time Warping (DTW) with a 1-Nearest-Neighbor classifier (1NN-DTW). Our motivation is to explore a novel framework to encode time series as images and thus to take advantage of the success of deep learning architectures in computer vision to learn features and identify structure in time series. Unlike speech recognition systems, in which acoustic/speech data input is typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) [21], typical time series data are not likely to benefit from transformations applied to speech or acoustic data.

In this work, we propose two types of representations for explicitly encoding the temporal patterns in time series as images. We test our approach on twelve time series datasets produced from 2D shapes, physiological surveillance, industry, and other domains. Two real spatial-temporal trajectory datasets are also considered in experiments to demonstrate the performance of our approach. We applied deep Convolutional Neural Networks with a pretraining stage that exploits local orthogonality by Topographic ICA [15] to "visually" inspect and classify time series.
We report our classification performance on GAF and MTF separately, and on GAF-MTF, which results from combining the GAF and MTF representations into a single image. By comparing our results with the current best hand-crafted representation and classification methods on both time series and trajectory data, we show that our approach in practice achieves performance competitive with the state of the art with only cursory exploration of hyperparameters. In addition to exploring the high-level features learned by Tiled CNNs, we provide an in-depth analysis in terms of the duality between time series and images. This helps us to more precisely identify the reasons why our approaches work well.

2. Motivation

Learning the (long) temporal correlations that are often embedded in time series remains a major challenge in time series analysis and modeling. Most real-world data has a temporal component, whether it is measurements of natural (weather, sound) or man-made (stock market, robotics) phenomena. Traditional approaches for modeling and representing time-series data fall into three categories. In time series learning problems, non-data-adaptive models, such as the Discrete Fourier Transformation (DFT) [23], Discrete Wavelet Transformation (DWT) [24], and Discrete Cosine Transformation (DCT) [25], compute the transformation with an algorithm that is invariant with respect to the data, capturing the intrinsic temporal correlation with different basis functions. Meanwhile, researchers have explored model-based approaches, such as Auto-Regressive Moving Average models (ARMA) [26] and Hidden Markov Models (HMMs) [27], in which the underlying data is assumed to fit a specific type of model that explicitly captures the temporal patterns.
The estimated parameters can then be used as features for classification or regression. However, more complex, high-dimensional, and noisy real-world time-series data are often difficult to model because the dynamics are either too complex or unknown. Traditional methods, which contain a small number of non-linear operations, might not have the capacity to accurately model such complex systems.

If implicitly learning the complex temporal correlation is difficult, how about reformulating the data to explicitly or even visually encode the temporal dependency, allowing the algorithms to learn more easily? Reformulating the features of time series as visual clues has attracted much attention in computer science and physics. A typical example from speech recognition is that acoustic/speech data input is represented by MFCCs or PLPs to explicitly capture the temporal and frequency information. Recently, researchers have tried to build different network structures from time series for visual inspection or for designing distance measures. Recurrence Networks were proposed to analyze the structural properties of time series from complex systems [28, 29]. They build adjacency matrices from predefined recurrence functions to interpret the time series as complex networks. Silva et al. extended the recurrence plot paradigm for time series classification using compression distance [30]. Another way to build a weighted adjacency matrix is to extract the transition dynamics from the first-order Markov matrix [31]. Although these maps demonstrate distinct topological properties among different time series, it remains unclear how these topological properties relate to the original time series, since they have no exact inverse operations. One of our contributions is to propose a set of off-line algorithms to encode the complex correlations in time series into images for visual inspection and classification.
The proposed encoding functions have exact or approximate inverse maps, making such transformations more interpretable.
3. Encoding Methods
We first introduce our two frameworks for encoding time series data as images. The first type of image is the Gramian Angular Field (GAF), in which we represent the time series in a polar coordinate system instead of the typical Cartesian coordinates. In the Gramian matrix, each element is actually the cosine of the summation of a pair of temporal values. Inspired by previous work on the duality between time series and complex networks [31], the main idea of the second framework, the Markov Transition Field (MTF), is to build the Markov matrix of quantile bins after discretization and encode the dynamic transition probability in a quasi-Gramian matrix.
Given a time series X = {x_1, x_2, ..., x_n} of n real-valued observations, we rescale X so that all values fall in the interval [−1, 1] or [0, 1] by:

    x̃_i = [(x_i − max(X)) + (x_i − min(X))] / [max(X) − min(X)]        (1)

or

    x̃_i = [x_i − min(X)] / [max(X) − min(X)]                           (2)

Thus we can represent the rescaled time series X̃ in polar coordinates by encoding the value as the angular cosine and the time stamp as the radius, with the equations below:

    φ_i = arccos(x̃_i),  −1 ≤ x̃_i ≤ 1,  x̃_i ∈ X̃
    r_i = t_i / N,       t_i ∈ ℕ                                        (3)

where t_i is the time stamp and N is a constant factor that regularizes the span of the polar coordinate system. This polar-coordinate-based representation is a novel way to understand time series. As time increases, corresponding values warp among different angular points on the spanning circles, like water rippling. The encoding map of Eq. (3) has two important properties. First, it is bijective, as cos(φ) is monotonic when φ ∈ [0, π]. Given a time series, the proposed map produces one and only one result in the polar coordinate system, with a unique inverse function. Second, as opposed to Cartesian coordinates, polar coordinates preserve absolute temporal relations. In Cartesian coordinates, with the area defined by S_{i,j} = ∫_{x(i)}^{x(j)} f(x(t)) dx(t), we have S_{i,i+k} = S_{j,j+k} if f(x(t)) has the same values on [i, i+k] and [j, j+k]. However, in polar coordinates, if the area is defined as S′_{i,j} = ∫_{φ(i)}^{φ(j)} r[φ(t)] dφ(t), then S′_{i,i+k} ≠ S′_{j,j+k}. That is, the corresponding area from time stamp i to time stamp j depends not only on the time interval |i − j| but also on the absolute values of i and j.

After transforming the rescaled time series into the polar coordinate system, we can easily exploit the angular perspective by considering the trigonometric sum between each pair of points to identify the temporal correlation within different time intervals. The GAF is defined as follows:

    G = [ cos(φ_1 + φ_1)  ⋯  cos(φ_1 + φ_n) ]
        [ cos(φ_2 + φ_1)  ⋯  cos(φ_2 + φ_n) ]
        [       ⋮          ⋱        ⋮       ]
        [ cos(φ_n + φ_1)  ⋯  cos(φ_n + φ_n) ]                           (4)

      = X̃′ · X̃ − √(I − X̃²)′ · √(I − X̃²)                                (5)

where I is the unit row vector [1, 1, ..., 1]. Defining the inner product ⟨x, y⟩ = x·y − √(1 − x²)·√(1 − y²), G is a Gramian matrix:

    G = [ ⟨x̃_1, x̃_1⟩  ⋯  ⟨x̃_1, x̃_n⟩ ]
        [      ⋮        ⋱       ⋮     ]
        [ ⟨x̃_n, x̃_1⟩  ⋯  ⟨x̃_n, x̃_n⟩ ]                                  (6)

The GAF has several advantages. It provides a way to preserve temporal dependency: as time increases, the position moves from top-left to bottom-right in the Gramian matrix. The GAF contains temporal correlations, as G_{i,j | |i−j|=k} represents the relative correlation by superposition of directions with respect to time interval k. The main diagonal G_{i,i} is the special case when k = 0, which contains the original value/angular information. Using the main diagonal, we can approximately reconstruct the time series from the high-level features learned by the deep neural network. The GAF images may be large because the size of the Gramian matrix is n × n when the length of the raw time series is n. To reduce the size of the GAF images, we apply Piecewise Aggregate Approximation [32] to smooth the time series while keeping the overall trends. The full procedure for generating the GAF is illustrated in Figure 1.

Through the polar coordinate system, GAFs actually represent the mutual correlations between each pair of points/phases by the superposition of nonlinear cosine functions. Different types of time series always have their specific patterns embedded along the time and frequency dimensions. After the feature reformulation process by GAF, most distinguishing patterns are enhanced even for visual inspection by humans (Figure 2).

We propose a framework similar to [31] for encoding dynamical transition statistics. We develop that idea by representing the Markov transition probabilities sequentially to preserve information in the temporal dimension.
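The rescaling, polar mapping, and Gramian construction above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (2)–(5), not the authors' released implementation; `paa` is the simple bin-mean smoothing described above:

```python
import numpy as np

def paa(x, n_bins):
    """Piecewise Aggregate Approximation: mean of adjacent, non-overlapping windows."""
    return x[: len(x) // n_bins * n_bins].reshape(n_bins, -1).mean(axis=1)

def gaf(x, n_bins=None):
    """Gramian Angular Field of a 1-D series, following Eqs. (2)-(5)."""
    x = np.asarray(x, dtype=float)
    if n_bins is not None:
        x = paa(x, n_bins)                      # optional PAA smoothing
    x = (x - x.min()) / (x.max() - x.min())     # Eq. (2): rescale into [0, 1]
    phi = np.arccos(x)                          # Eq. (3): value -> angle
    return np.cos(phi[:, None] + phi[None, :])  # Eq. (4): G_ij = cos(phi_i + phi_j)

G = gaf(np.sin(np.linspace(0, 4 * np.pi, 128)), n_bins=32)
print(G.shape)  # (32, 32)
```

Because cos(φ_i + φ_j) = x̃_i x̃_j − √(1 − x̃_i²)·√(1 − x̃_j²), the matrix produced this way agrees term by term with the outer-product form of Eq. (5).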
Figure 1: Illustration of the proposed encoding map of the Gramian Angular Field. X is a typical time series from the 'SwedishLeaf' dataset. After X is rescaled by Equation (1) or (2) and optionally smoothed by PAA, we transform it to a polar coordinate system by Equation (3) and finally calculate its GAF image with Equation (5). In this example, we build the GAF without PAA smoothing, so the GAF has a high resolution of 128 × 128.

Given a time series X, we identify its Q quantile bins and assign each x_i to its corresponding bin q_j (j ∈ [1, Q]). Thus we construct a Q × Q weighted adjacency matrix W by counting transitions among quantile bins in the manner of a first-order Markov chain along the time axis. w_{i,j} is the frequency with which a point in quantile q_j is followed by a point in quantile q_i. After normalization by Σ_j w_{ij} = 1, W is the Markov transition matrix:

    W = [ w_{11} = P(x_t ∈ q_1 | x_{t−1} ∈ q_1)  ⋯  w_{1Q} = P(x_t ∈ q_1 | x_{t−1} ∈ q_Q) ]
        [                 ⋮                       ⋱                   ⋮                   ]
        [ w_{Q1} = P(x_t ∈ q_Q | x_{t−1} ∈ q_1)  ⋯  w_{QQ} = P(x_t ∈ q_Q | x_{t−1} ∈ q_Q) ]   (7)

W is insensitive to the distribution of X and to the temporal dependency on the time steps t_i. However, getting rid of the temporal dependency results in too much information loss in the matrix W. To overcome this drawback, we define the Markov Transition Field (MTF) as follows:

    M = [ w_{ij | x_1 ∈ q_i, x_1 ∈ q_j}  ⋯  w_{ij | x_1 ∈ q_i, x_n ∈ q_j} ]
        [              ⋮                  ⋱              ⋮                 ]
        [ w_{ij | x_n ∈ q_i, x_1 ∈ q_j}  ⋯  w_{ij | x_n ∈ q_i, x_n ∈ q_j} ]                   (8)

Figure 2: Examples of GAF images on the 'Coffee', 'Gun-Point', 'Adiac' and '50Words' datasets.

We build the Q × Q Markov transition matrix W by dividing the data (magnitude) into Q quantile bins. The quantile bins that contain the data at time steps i and j (temporal axis) are q_i and q_j (q ∈ [1, Q]). M_{ij} in the MTF denotes the transition probability q_i → q_j.
That is, we spread out matrix W, which contains the transition probabilities on the magnitude axis, into the MTF matrix by considering the temporal positions.

By assigning the probability from the quantile at time step i to the quantile at time step j at each pixel M_{ij}, the MTF M actually encodes multi-span transition probabilities of the time series. M_{i,j | |i−j|=k} denotes the transition probability between the points with time interval k. For example, M_{ij | j−i=1} illustrates the transition process along the time axis with a skip step. The main diagonal M_{ii}, which is the special case when k = 0, captures the probability from each quantile to itself (the self-transition probability) at time step i. To make the image size manageable for more efficient computation, we reduce the MTF by averaging the pixels in each non-overlapping m × m patch with the blurring kernel {1/m²}_{m×m}. That is, we aggregate the transition probabilities in each subsequence of length m together. Figure 3 shows the procedure to encode a time series to its MTF.

Figure 3: Illustration of the proposed encoding map of a Markov Transition Field. X is a typical time series from the 'ECG' dataset. X is first discretized into Q quantile bins. Then we calculate its Markov transition matrix W and finally build its MTF M by Equation (8). We reduce the image size from 96 × 96 to 48 × 48 by averaging the pixels in each non-overlapping 2 × 2 patch.

By scattering the first-order transition probabilities into the temporally ordered matrix, MTFs encode the transition dynamics between different time lags k. We assume that different types of time series have their specific transition dynamics embedded in the temporal and frequency domains. After the feature reformulation process by MTF, most transition dynamics are extracted and become explicitly obvious for visual inspection (Figure 4).
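Under the same caveat as before (a minimal NumPy sketch, not the authors' code), the MTF construction — quantile binning, the first-order transition matrix of Eq. (7), the field of Eq. (8), and the m × m average-pooling reduction — can be written as:

```python
import numpy as np

def mtf(x, n_quantiles=8, out_size=None):
    """Markov Transition Field of a 1-D series, following Eqs. (7)-(8)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Assign each point to one of Q quantile bins (values 0..Q-1).
    edges = np.quantile(x, np.linspace(0, 1, n_quantiles + 1)[1:-1])
    bins = np.digitize(x, edges)
    # First-order Markov transition matrix along the time axis; rows are
    # normalized to sum to 1 (the indexing convention may be transposed
    # relative to Eq. (7), which does not affect the resulting image's content).
    W = np.zeros((n_quantiles, n_quantiles))
    for a, b in zip(bins[:-1], bins[1:]):
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)
    # Eq. (8): spread W over temporal positions, M_ij = W[bin(i), bin(j)].
    M = W[bins[:, None], bins[None, :]]
    if out_size is not None:                 # blur with the {1/m^2}_{m x m} kernel
        m = n // out_size
        M = M[: out_size * m, : out_size * m]
        M = M.reshape(out_size, m, out_size, m).mean(axis=(1, 3))
    return M

M = mtf(np.sin(np.linspace(0, 6 * np.pi, 96)), n_quantiles=4, out_size=48)
print(M.shape)  # (48, 48)
```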
Figure 4: Examples of MTF images on the 'OSUleaf', 'fish', 'ECG' and 'Faceall' datasets.
4. Tiled Convolutional Neural Networks
Tiled Convolutional Neural Networks [15] are a variation of Convolutional Neural Networks. They use tiles and multiple feature maps to learn invariant features. Tiles are parameterized by a tile size k to control the distance over which weights are shared. By producing multiple feature maps, Tiled CNNs learn overcomplete representations through unsupervised pretraining with Topographic ICA (TICA).

A typical TICA network is actually a double-stage optimization procedure with square and square-root nonlinearities in the two stages, respectively. In the first stage, the weight matrix W is learned while the matrix V is hard-coded to represent the topographic structure of units. More precisely, given a sequence of inputs {x^(h)}, the activation of each unit in the second stage is

    f_i(x^(h); W, V) = √( Σ_{k=1}^{p} V_{ik} ( Σ_{j=1}^{q} W_{kj} x_j^(h) )² )

TICA learns the weight matrix W in the second stage by solving:

    minimize_W  Σ_{h=1}^{n} Σ_{i=1}^{p} f_i(x^(h); W, V)
    subject to  W Wᵀ = I                                                (9)

where W ∈ ℝ^{p×q} and V ∈ ℝ^{p×p}; p is the number of hidden units in a layer and q is the size of the input. V is a logical matrix (V_{ij} = 1 or 0) that encodes the topographic structure of the hidden units by a contiguous 3 × 3 block. The orthogonality constraint W Wᵀ = I provides diversity among the learned features.

Algorithm 1 Unsupervised pretraining with TICA [15]
Require: {x^(t)}_{t=1}^{T}, v, s, W, V as input
Ensure: W as output
repeat
    f_old ← Σ_{t=1}^{T} Σ_{i=1}^{m} √( Σ_{k=1}^{m} V_{ik} ( Σ_{j=1}^{n} W_{kj} x_j^(t) )² )
    g ← ∂f_old / ∂W,  f_new ← +∞,  α ← 1
    while f_new > f_old do
        W_new ← W − αg
        W_new ← localize(W_new, s)
        W_new ← tieWeights(W_new, k)
        W_new ← orthogonalizeLocalRF(W_new)
        W_new ← tieWeights(W_new, k)
        f_new ← Σ_{t=1}^{T} Σ_{i=1}^{m} √( Σ_{k=1}^{m} V_{ik} ( Σ_{j=1}^{n} (W_new)_{kj} x_j^(t) )² )
        α ← 0.5α
    end while
    W ← W_new
until convergence

The pretraining algorithm (Algorithm 1) is based on gradient descent on the TICA objective function in Equation (9). The inner loop is a simple implementation of backtracking line search. The orthogonalizeLocalRF(W_new) function only orthogonalizes the weights that have completely overlapping receptive fields. Weight-tying is applied by averaging each set of tied weights. The algorithm is trained by batch projected gradient descent. Other unsupervised feature learning algorithms, such as RBMs and autoencoders [33], require more parameter tuning, especially during optimization. In contrast, pretraining with TICA usually requires little tuning of optimization parameters, because the tractable objective function of TICA makes it easy to monitor convergence.
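A rough executable sketch of the idea behind Algorithm 1 follows. Two assumptions are made for brevity: the paper's localize / tieWeights / orthogonalizeLocalRF steps are collapsed into a single global symmetric orthogonalization W ← (W Wᵀ)^(−1/2) W, and V is a hypothetical ring topography; this illustrates backtracking gradient descent on the TICA objective, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, T = 16, 32, 200                        # hidden units, input dim, samples
X = rng.standard_normal((q, T))

# Hypothetical ring topography: each unit pools itself and its two neighbors.
V = np.zeros((p, p))
for i in range(p):
    for d in (-1, 0, 1):
        V[i, (i + d) % p] = 1.0

def orth(W):
    """Project onto the constraint set W W^T = I (symmetric orthogonalization)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def tica_obj(W):
    Z = W @ X                                # first stage: linear filters
    return np.sqrt(V @ Z**2 + 1e-9).sum()    # second stage: pooled square root

def tica_grad(W):
    Z = W @ X
    U = np.sqrt(V @ Z**2 + 1e-9)
    return ((V.T @ (1.0 / U)) * Z) @ X.T     # d/dW of sum sqrt(V @ Z^2)

W = orth(rng.standard_normal((p, q)))
f_init = tica_obj(W)
for _ in range(20):                          # outer loop of Algorithm 1
    f_old, g, alpha = tica_obj(W), tica_grad(W), 1.0
    while True:                              # backtracking line search
        W_new = orth(W - alpha * g)
        if tica_obj(W_new) < f_old or alpha < 1e-8:
            break
        alpha *= 0.5
    W = W_new
```

After the loop, W still satisfies W Wᵀ = I (up to floating-point error) and the objective has decreased from its random initialization.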
Figure 5: Structure of the tiled convolutional neural network. We fix the size of the receptive fields to 8 × 8, the number of feature maps to l = 6, the tiling size to k = 2, and the pooling size to 3 × 3.

Neither GAF nor MTF images are natural images; they have no natural concepts such as "edges" and "angles". Thus, we propose to exploit the benefits of unsupervised pretraining with TICA to learn many diverse features with local orthogonality. In [15], the authors empirically demonstrate that Tiled CNNs perform well with limited labeled data, because the partial weight tying requires fewer parameters and reduces the need for a large amount of labeled data. Our data from the UCR Time Series Repository [34] tends to have few instances (e.g., the 'yoga' dataset has 300 labeled instances in the training set and 3,000 unlabeled instances in the test set), so Tiled CNNs are suitable for our learning task. Moreover, Tiled CNNs achieve good performance on large datasets (such as NORB and CIFAR).

Typically, Tiled CNNs are trained with two hyperparameters: the tiling size k and the number of feature maps l. In our experiments, we directly fixed the network structures without tuning these hyperparameters in loops. Our experimental settings follow the default deep network structures and parameters in [15]. Tiled CNNs with such configurations are reported to achieve the best performance on the NORB image classification benchmark. Although tuning the parameters would surely enhance performance, doing so might cloud our understanding of the power of the representation. Another consideration is computational efficiency.

Table 1: Summary statistics of 12 standard datasets

    DATASET      CLASS  TRAIN  TEST   LENGTH
    50words      50     450    455    270
    Adiac        37     390    391    176
    Beef         5      30     30     470
    Coffee       2      28     28     286
    ECG200       2      100    100    96
    FaceAll      14     560    1,690  131
    Lightning2   2      60     61     637
    Lightning7   7      70     73     319
    OliveOil     4      30     30     570
    OSULeaf      6      200    242    427
    SwedishLeaf  15     500    625    128
    Yoga         2      300    3,000  426
All of the experiments on the 12 datasets could be done in one day on a laptop with an Intel i7-3630QM CPU and 8GB of memory (our experimental platform). Thus, the results in this paper are a preliminary lower bound on the potential best performance. Thoroughly exploring network structures and parameters will be addressed in future work. The structure and parameters of the Tiled CNN used in this paper are illustrated in Figure 5.

5. Experiments on Time Series Data
We apply Tiled CNNs to classify GAF and MTF representations on twelve tough datasets, on which the classification error rate is above 0.1.

In our experiments, the size of the GAF image is regulated by the number of PAA bins S_GAF. Given a time series X of size n, we divide the time series into S_GAF adjacent, non-overlapping windows along the time axis and extract the mean of each bin. This enables us to construct the smaller GAF matrix G of size S_GAF × S_GAF. The MTF requires the time series to be discretized into Q quantile bins to calculate the Q × Q Markov transition matrix, from which we construct the raw MTF image M of size n × n. Before classification, we shrink the MTF image to size S_MTF × S_MTF with the blurring kernel {1/m²}_{m×m}, where m = ⌈n / S_MTF⌉. The Tiled CNN is trained over small candidate sets of image sizes {S_GAF, S_MTF} and quantile sizes Q. At the last layer of the Tiled CNN, we use a linear soft-margin SVM [36] and select C by 5-fold cross-validation on the training set.

For each image size S_GAF or S_MTF and quantile size Q, we pretrain the Tiled CNN with the full unlabeled dataset (both training and test sets, with no labels) to learn the initial weights W through TICA. Then we train the SVM at the last layer, selecting the penalty factor C by cross-validation. Finally, we classify the test set using the hyperparameters {S, Q, C} with the lowest error rate on the training set. If two or more models tie, we prefer the larger S and Q, because a larger S helps preserve more information through the PAA procedure and a larger Q encodes the dynamic transition statistics in more detail. Our model selection approach provides generalization without being overly expensive computationally.

Table 2: Tiled CNN error rates on the training and test sets

    DATASET      GAF             MTF
                 TRAIN   TEST    TRAIN   TEST
    50words      0.338   0.310   0.442   0.426
    adiac        0.321   0.284   0.638   0.665
    beef         0.633   0.4     0.533   0.233
    coffee       0       0       0       0
    ECG200       0.16    0.11    0.15    0.21
    faceall      0.121   0.244   0.102   0.259
    lighting2    0.2     0.18    0.167   0.361
    lighting7    0.329   0.397   0.386   0.411
    oliveoil     0.2     0.2     0.033   0.3
    OSULeaf      0.415   0.463   0.43    0.483
    SwedishLeaf  0.134   0.104   0.206   0.176
    yoga         0.183   0.177   0.193   0.243
We use Tiled CNNs to classify the GAF and MTF representations separately on the 12 datasets. The training and test error rates are shown in Table 2. Generally, our approach is not prone to overfitting, as seen from the relatively small differences between training and test set errors. One exception is the 'Olive Oil' dataset with the MTF approach, where the test error is significantly higher.

In addition to this slight risk of overfitting, MTF generally has higher error rates than GAF. This is most likely because of the uncertainty in the inverse image of MTF. Note that the encoding functions from time series to GAF and MTF are both surjections. The map functions of GAF and MTF each produce only one image with fixed S and Q for a given time series X. Because they are both surjective mapping functions, the inverse image of the map is not fixed. As shown in a later section, we can approximately reconstruct the raw time series from the GAF, but it is very hard to even roughly recover the signal from the MTF. GAF has smaller uncertainty in the inverse image of its mapping function, because the randomness comes only from the ambiguity of cos(φ) when φ ∈ [0, π]. MTF, on the other hand, has a much larger inverse image space, which results in large variation when we try to recover the signal. Although the MTF encodes the transition dynamics, which are important features of time series, these features alone seem insufficient for recognition/classification tasks.

Note that at each pixel, G_{ij} denotes the superposition of the directions at t_i and t_j, while M_{ij} is the transition probability from the quantile at t_i to the quantile at t_j. GAF encodes static information, while MTF depicts information about dynamics. From this point of view, we consider them as two "orthogonal" channels, like different colors in the RGB image space. Thus, we can combine GAF and MTF images of the same size (i.e., S_GAF = S_MTF) to construct a double-channel image (GAF-MTF).
Since GAF-MTF combines both the static and dynamic statistics embedded in the raw time series, we posit that it will improve classification performance. In the next experiment, we pretrained and fine-tuned the Tiled CNN on the compound GAF-MTF images, and we report the classification error rates on the test sets.

Table 3 compares the classification error rate of our approach with previously published results of five competing methods: two state-of-the-art 1NN classifiers based on Euclidean distance and DTW, the recently proposed Fast-Shapelet-based classifier [37], the classifier based on Bag-of-Patterns (BoP) [35, 22], and the most recent SAX-VSM approach [38]. Our approach outperforms 1NN-Euclidean, Fast-Shapelets, and BoP, and is competitive with 1NN-DTW and SAX-VSM.

Table 3: Summary of the error rates from 6 recently published best results (1NN-Euclidean, 1NN-DTW, Fast-Shapelet, BoP, SAX-VSM, RPCD) and our GAF-MTF approach. Superscript symbols mark the domain of each dataset: figure shapes (2D), physiological surveillance, industry, and all remaining temporal signals, respectively.

In addition, by comparing the results in Table 3 and Table 2, we verified our assumption that the combined GAF-MTF images have greater expressive power than GAF or MTF alone for classification. GAF-MTF achieves a lower test error rate on ten of the twelve datasets (the exceptions being the 'Adiac' and 'Beef' datasets). On the 'Olive Oil' dataset, the training error rate is 6.67% and the test error rate is 16.67%.

Figure 6: (a) Original GAF and its six learned feature maps before the SVM layer in the Tiled CNN (top left), and (b) raw time series and approximate reconstructions based on the main diagonal of the six feature maps (top right), on the '50Words' dataset; (c) original MTF and its six learned feature maps before the SVM layer in the Tiled CNN (bottom left), and (d) the curve of self-transition probability along the time axis (main diagonal of the MTF) and approximate reconstructions based on the main diagonal of the six feature maps (bottom right), on the 'SwedishLeaf' dataset.

In contrast to the cases in which CNNs are applied to natural image recognition tasks, neither GAF nor MTF has a natural interpretation in terms of visual concepts (e.g., "edges" or "angles"). In this section, we analyze the features and weights learned by the Tiled CNNs to explain why our approach works.

As mentioned earlier, the mapping function from time series to GAF is surjective, and the uncertainty in its inverse image comes from the ambiguity of cos(φ) when φ ∈ [0, π]. The main diagonal of the GAF, i.e., {G_ii} = {cos(2φ_i)}, allows us to approximately reconstruct the original time series, ignoring the signs, by

    cos(φ) = √( (cos(2φ) + 1) / 2 )                                     (10)

The MTF has much larger uncertainty in its inverse image, making it hard to reconstruct the raw data from the MTF alone. However, the diagonal {M_{ij | |i−j|=k}} represents the transition probability among the quantiles in temporal order, considering the time interval k. We construct the self-transition probability along the time axis from the main diagonal of the MTF, as we do for the GAF. Although such reconstructions capture the morphology of the raw time series less accurately, they provide another perspective on how Tiled CNNs capture the transition dynamics embedded in the MTF.

Figure 6 illustrates the reconstruction results from the six feature maps learned before the last SVM layer on the GAF and MTF.
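The diagonal-based reconstruction of Eq. (10) is easy to check numerically. A minimal sketch (recovering the rescaled series from the main diagonal of an exact GAF; the signs are the part the map forgets, which is invisible here because the series is rescaled into [0, 1]):

```python
import numpy as np

x = np.sin(np.linspace(0, 4 * np.pi, 64))
x = (x - x.min()) / (x.max() - x.min())    # rescale into [0, 1], Eq. (2)
phi = np.arccos(x)
G = np.cos(phi[:, None] + phi[None, :])    # GAF, so diag(G)_i = cos(2 * phi_i)

# Eq. (10): cos(phi) = sqrt((cos(2*phi) + 1) / 2), up to sign.
x_rec = np.sqrt((np.diag(G) + 1.0) / 2.0)
print(np.allclose(x_rec, x))               # True
```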
The Tiled CNN extracts the color patch, which is essentially an adaptive moving average that enhances several receptive fields within the nonlinear units through different trained weights. It is not a simple moving average but a synthetic integration that considers the 2D temporal dependencies among different time intervals, a benefit of the Gramian matrix structure that helps preserve the temporal information. By observing the rough orthogonal reconstructions from each layer of the feature maps, we can clearly see that the CNNs extract multi-frequency dependencies through the convolution and pooling architecture on the GAF and MTF images. Different feature maps preserve the overall trend while addressing more details in different subphases. As shown in Figures 6(b) and 6(d), the high-level feature maps learned by the Tiled CNN are equivalent to a multi-frequency approximator of the original curve.

Figure 7 demonstrates the learned sparse weight matrix W with the constraint W Wᵀ = I, which makes effective use of local orthogonality. The TICA pretraining provides the built-in advantage that the objective function with respect to the parameter space is not likely to be ill-conditioned, since W Wᵀ = I. As shown in Figure 7 (right), the weight matrix W is quasi-orthogonal and approaches 0 without very large magnitudes. This implies that the condition number of W approaches 1 and helps the system to be well-conditioned.

Figure 7: Learned sparse weights W for the last SVM layer in the Tiled CNN (left) and its orthogonality constraint W Wᵀ = I (right).
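The link between row-orthogonality and conditioning is easy to verify numerically; a small sketch (an arbitrary random matrix projected onto the constraint set, not the learned weights themselves):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 64))

# Nearest matrix with W W^T = I: keep the singular vectors, set singular values to 1.
U, _, Vt = np.linalg.svd(A, full_matrices=False)
W = U @ Vt

print(np.allclose(W @ W.T, np.eye(16)))   # True
print(round(np.linalg.cond(W), 6))        # 1.0 (all singular values equal 1)
```

Since every singular value of such a W is exactly 1, its condition number (the ratio of largest to smallest singular value) is 1, which is the claim made above about well-conditioning.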
6. Experiments on Trajectory Data
We have demonstrated the effectiveness of GAF and MTF on benchmark time series datasets as diverse as shape, physiological surveillance and industry data from the UCR time series repository. In this section we describe an application of our approaches to classifying spatial-temporal trajectory data. Trajectory data is complex because patterns of movement are often driven by unperceived goals and constrained by an unknown environment.

To compare our results with other benchmark approaches, including the seminal work of [39], we run experiments on two benchmark datasets, the animal movement dataset (Animal) and the hurricane track dataset (Hurricane) (Figure 8). Both datasets have trajectories of unequal length. For the 'Animal' dataset, the x and y coordinates are extracted from animal movements observed in June 1995. It is divided into three classes by species, elk, deer, and cattle, as shown in Figure 8 (a). The numbers of trajectories (points) are 38 (7117), 30 (4333), and 34 (3540), respectively. In the 'Hurricane' dataset, the latitude and longitude are extracted from Atlantic hurricanes for the years 1950 through
Figure 8: Overview of the trajectories and the RB-TB features [39] learned on (a) the animal tracking data (red: elk; blue: deer; black: cattle) (left) and (b) the hurricane data over the Gulf of Mexico (red: Category 2; blue: Category 3), where stronger hurricanes tend to go further than weaker ones (right).
Spatial-temporal trajectory data is commonly multi-dimensional. We use Hilbert Space Filling Curves (SFC) to transform a trajectory into a time series while preserving the spatial-temporal information.

Table 4: Summary statistics of the two trajectory datasets.

Dataset          Classes  Training Size  Testing Size  Min Length  Max Length
Animal Tracking  3        80             18            10          291
Hurricane        2        112            21            11          108

Figure 9: (a) Hilbert space filling curves in 2-dimensional space (left); (b) an example of the transformation from a 2-dimensional trajectory to a 1-dimensional time series using an SFC of order 2 (right).

Space filling curves have been studied by mathematicians since the late 19th century, when the first graphical representation was proposed by David Hilbert in 1891 [40]. Space filling curves provide a linear mapping from a multi-dimensional space to a 1-dimensional space. This mapping can be thought of as dividing the D-dimensional space into D-dimensional hypercubes with a line passing through each hypercube. Recently, space filling curve based approaches have been shown to preserve the locality between objects of the multidimensional space in the linear space, and have thus been applied to tasks such as clustering [41], high dimensional outlier detection [42], trajectory querying [43] and classification [44]. Figure 9 (a) shows SFC examples. Basically, the SFC of order 1 divides the square into 4 areas. For the Hilbert curve of order 2, each sub-area of the order-1 curve is further divided into 4 sub-areas. This process continues as the order of the SFC increases, so the number of sub-areas of a 2-dimensional SFC is 4^order. To convert 2-dimensional data points to 1-dimensional points, each sub-area is integer numbered from 0 to 4^order − 1.

The parameter settings are the same as in the previous experiments on the UCR datasets (Section 5). The optimal SFC order is selected together with the other parameters through 5-fold cross validation.

Note that both trajectory datasets have quite small sample sizes and varying lengths. When the trajectory length (and hence the length of the time series produced by the SFC) is smaller than the image size S, we uniformly duplicate each point of the time series in temporal order to stretch the sequence to length S.
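The 2D-to-1D Hilbert mapping can be sketched with the classic bit-manipulation conversion (a generic textbook version; the paper's exact implementation may differ):

```python
def hilbert_index(order, x, y):
    """Map a cell (x, y) of a 2**order x 2**order grid to its position
    along the Hilbert curve (classic bit-twiddling conversion)."""
    n = 2 ** order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the recursion lines up.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# A toy 2-D trajectory quantized onto a 4x4 grid (order 2) becomes a
# 1-D sequence of curve positions, i.e. a time series.
trajectory = [(0, 0), (1, 0), (1, 1), (3, 2)]
series = [hilbert_index(2, x, y) for x, y in trajectory]
```

Nearby grid cells tend to receive nearby indices, which is the locality property the SFC transformation relies on.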
If the difference between the length of a time series and S is smaller than the original time series length, the interpolation strategy changes to random duplication instead of following the temporal order.

Both the 'Animal' and 'Hurricane' datasets have been used in previous research [39, 44] to achieve state-of-the-art classification accuracy. TraClass [39] gives two algorithms, a trajectory-based (TB-Only) approach and a region-based + trajectory-based (RB-TB) approach, based on the features used for classification on these datasets. The authors carefully designed a hierarchy of features by partitioning trajectories and exploring two types of clustering. In [44], the author used an SFC transformation to linearly map the trajectory data to time series and classified the sequences based on symbolic discretization under a multiple normal distribution assumption.

After transforming the 2D trajectory data to time series using the SFC, we generate the corresponding GAF and MTF images, as shown in Figure 10. However, we found significant overfitting with CNNs even when using 5-fold cross validation. This is probably because both the sample size and the time series length of the trajectory datasets are too small to avoid overfitting in neural networks.
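One plausible reading of the duplication-based stretching described above is sketched below (the exact scheme is not fully specified in the text; `stretch_uniform` and `stretch_random` are hypothetical names):

```python
import numpy as np

def stretch_uniform(series, S):
    """Stretch a short series to length S by duplicating points
    uniformly in temporal order (hypothetical sketch of the strategy)."""
    n = len(series)
    # Evenly spaced fractional indices, floored, repeat each point ~S/n times.
    idx = np.floor(np.linspace(0, n - 1e-9, S)).astype(int)
    return np.asarray(series)[idx]

def stretch_random(series, S, seed=None):
    """When S - n < n, duplicate randomly chosen points instead,
    keeping the original temporal order."""
    rng = np.random.default_rng(seed)
    n = len(series)
    extra = rng.integers(0, n, size=S - n)
    idx = np.sort(np.concatenate([np.arange(n), extra]))
    return np.asarray(series)[idx]
```

For example, `stretch_uniform([1, 2, 3], 6)` duplicates each point once, giving `[1, 1, 2, 2, 3, 3]`.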
Figure 10: Examples of GAF and MTF images generated from the time series of the 'Animal' and 'Hurricane' datasets: (a) GAF of 'Animal'; (b) GAF of 'Hurricane'; (c) MTF of 'Animal'; (d) MTF of 'Hurricane'. The time series are produced from the raw 2D trajectories using the SFC.
Previous work has discussed overfitting during cross validation and proposed potential techniques to address this problem [45, 46]. Here, we apply a simple and straightforward hyperparameter selection approach to reduce classifier variance. For a given set of hyperparameters {S, Q, SFC_order}, after cross validation with different C values of the linear SVM, we compute the mean and standard deviation to get the 3σ lower bound over all C by

score_3σ = mean(Accuracy) − 3 × STD(Accuracy)    (11)

By selecting the other hyperparameters {S, Q, SFC_order} with the best statistical lower bound on the classifier performance over C, the optimal hyperparameters have lower variance while preserving low bias. Using this hyperparameter selection approach, the classification results are reported in Table 5. We perform better than the TB-Only method on both datasets and almost as well as the RB-TB method on the 'Hurricane' dataset. However, both the RB-TB and NDist methods outperform ours on the 'Animal' dataset.

Table 5: Classification accuracy (%) for the TB-Only and RB-TB methods, the multiple normal distribution based symbolic distance (NDist), and our algorithm (GAF-MTF).

Dataset          TB-Only  RB-TB  NDist  GAF-MTF
Animal Tracking  50       83.3   83.3   72.2
Hurricane        65.4     73.1   52.3   71.42

As shown in Figure 8, both region and trajectory based features are useful for classification. For the 'Hurricane' dataset, direction based features are more useful than region based features. Direction based features are quite easy to capture with our approach, as the GAF effectively calculates the pairwise direction fields at each point of the trajectory data. For the 'Animal' dataset, region is very important, as shown in Figure 8 (a). Elk, deer and cattle are almost separable using location alone, as their regions are clearly located at the left, top right and bottom right, respectively.
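The 3σ score of Eq. (11) can be sketched as follows (the accuracy values are hypothetical, standing in for cross-validation results over a grid of SVM C values):

```python
import numpy as np

def score_3sigma(accuracies):
    """Eq. (11): mean accuracy minus three standard deviations,
    a statistical lower bound on classifier performance over C."""
    acc = np.asarray(accuracies, dtype=float)
    return acc.mean() - 3.0 * acc.std()

# A setting with high but unstable accuracy over C scores lower than a
# slightly worse but stable one, which is the variance-reduction effect
# the selection rule is after.
unstable = [0.95, 0.60, 0.90, 0.55]   # mean 0.75, large spread
stable   = [0.78, 0.80, 0.79, 0.77]   # mean 0.785, tiny spread

assert score_3sigma(stable) > score_3sigma(unstable)
```

The hyperparameter set {S, Q, SFC_order} with the highest such score is then selected.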
When transforming the trajectory data into time series using the SFC, two close regions might be mapped to different sub-areas with different SFC indexes. When the indexes of two close regions are also near each other, this can be handled by CNNs through their ability to capture features under small shifts. However, CNNs are not good at discriminating similar images with large shifts from each other. Thus, when the region information is preserved only as large shifts of specific patterns in the time series produced by the SFC, CNNs might have difficulty capturing the region information.

Although our approach does not outperform the other benchmark methods on both trajectory datasets, we provide a more general framework for encoding spatial-temporal patterns for classification tasks. Instead of relying on complicated hand-tuned features, our approach can be applied to a variety of time series and trajectory data. When the region of the trajectory is not significantly important or the direction feature dominates, our general methods work quite well. On large datasets where the volume of time series/trajectory data is big, our deep neural network based approach will greatly benefit from the large sample size in both the feature learning and classification tasks.
7. Conclusions and Future Work
This paper proposed an off-line approach to spatially encoding temporal patterns for classification using convolutional neural networks. We created a pipeline for converting trajectory and time series data into novel representations, GAF and MTF images, and extracted high-level features from these using CNNs. The features were subsequently used for classification. We demonstrated that our approach yields competitive results compared to state-of-the-art methods while searching a relatively small parameter space. We found that GAF-MTF multi-channel images scale to larger numbers of quasi-orthogonal features that yield more comprehensive images. Our analysis of the high-level features learned by the CNNs suggested that Tiled CNNs work like multi-frequency moving averages that benefit from the 2D temporal dependency preserved by the Gramian matrix.

Important future work will involve applying our method to massive amounts of data and searching a more complete parameter space to solve real world problems. We are also quite interested in how different deep learning architectures perform on GAF and MTF images generated from large datasets. Another interesting future direction is to model time series through GAF and MTF images. We aim to apply the learned time series models in regression/imputation and anomaly detection tasks. To extend our methods to streaming data, we plan to design online learning approaches with recurrent network structures to represent, learn and model temporal data in real time.
References

Author Biography
Zhiguang Wang received his B.S. degree in Mathematics and Applied Mathematics from Fudan University, Shanghai, China, in 2012 and enrolled in the Ph.D. program thereafter. He received the first prize in the Chinese National Mathematical Olympiad (senior division) in 2007 and an NSF travel award for IJCAI in 2015. He is currently a Ph.D. candidate in the Department of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County. His research interests are in the areas of artificial intelligence, machine learning theory, deep neural networks, and non-convex optimization, with emphasis on mathematical modeling, time series analysis and pattern recognition in streaming data.