Handwriting Prediction Considering Inter-Class Bifurcation Structures
Masaki Yamagata, Hideaki Hayashi, and Seiichi Uchida
Department of Advanced Information Technology, Kyushu University
Fukuoka, [email protected], {hayashi, uchida}@ait.kyushu-u.ac.jp

Abstract—Temporal prediction is still a difficult task due to the chaotic behavior, non-Markovian characteristics, and non-stationary noise of temporal signals. Handwriting prediction is also challenging because of uncertainty arising from inter-class bifurcation structures, in addition to the above problems. For example, the classes ‘0’ and ‘6’ are very similar in terms of their beginning parts; therefore, it is nearly impossible to predict their subsequent parts from the beginning part. In other words, ‘0’ and ‘6’ have a bifurcation structure due to ambiguity between classes, and we cannot make a long-term prediction in this context. In this paper, we propose a temporal prediction model that can deal with this bifurcation structure. Specifically, the proposed model learns the bifurcation structure explicitly as a Gaussian mixture model (GMM) for each class, as well as the posterior probability of the classes. The final prediction result is represented as the weighted sum of GMMs using the class probabilities as weights. When multiple classes have large weights, the model can handle a bifurcation and thus avoid an inaccurate prediction. The proposed model is formulated as a neural network including long short-term memories and is thus trained in an end-to-end manner. The proposed model was evaluated on the UNIPEN online handwritten character dataset, and the results show that the model can catch and deal with the bifurcation structures.
Index Terms—temporal prediction, class-guided prediction, probabilistic prediction, handwriting, class ambiguity
I. INTRODUCTION
Temporal prediction is an important and still challenging task in various applications [1]–[4]. This difficulty comes from various factors that cause uncertainty. Many attempts have been made in the literature to improve the accuracy of prediction by appropriately modeling the uncertainty [1], [5]. Despite those attempts, perfect prediction is theoretically impossible when the target temporal signals have a bifurcation. This is the case in which there are two (or more) very different distributions in the future, and we are not sure which distribution will be taken. As a simple but informative example, let us consider the trajectories of writing the digits ‘0’ and ‘6.’ Their beginning parts are nearly identical, and therefore it is impossible to predict, at their beginning part, which of the two will be subsequently written. This example suggests that perfect handwriting prediction is theoretically impossible.

Even though perfect prediction is impossible, a prediction model that can learn the underlying bifurcation structure automatically is very useful in several respects. First, we can give an accurate prediction result until the bifurcation point. Second, we can generate multiple prediction results beyond the bifurcation point, if necessary. Third, if we can know there is no bifurcation after a certain point, we can determine a unique prediction result with high confidence.

In this paper, we propose a class-guided prediction (CGP) model and apply it to a handwritten digit prediction task.
In this application, “class” means the ten digit classes from ‘0’ to ‘9.’ By explicitly incorporating the class information during the training of the prediction model, we can build a model that deals with the inter-class bifurcation, such as the above example of ‘0’ and ‘6.’

Roughly speaking, our CGP model is derived by factorizing the prediction task into a coordinate prediction module based on class-wise GMMs and a class probability module. These modules are realized as neural networks and trained simultaneously in an end-to-end manner. The outputs of those modules are the parameters of the GMMs and the class probability distribution. By using those outputs, we can provide not only the prediction result but also the degree of uncertainty caused by the inter-class bifurcation.

Note that handwritten digit trajectories are simple but very suitable for observing the prediction performance of the model, although our prediction model can be applied to any temporal patterns. In particular, handwritten digit trajectories have ten predefined classes, finite temporal lengths, and several bifurcation structures. Even from an application viewpoint, it is still useful to examine the possibility of realizing early classification; if the model tells us that there is no inter-class bifurcation at the current point, we can determine the recognition result by the class with the highest probability without waiting for the end of the pattern.

Our main contributions are summarized as follows:
• To the best of our knowledge, this is the first study that applies a class-guided prediction model to handwriting trajectories in the presence of ambiguities caused by, especially, inter-class bifurcations.
• The proposed model can provide a probabilistic prediction while evaluating the uncertainty caused by inter-class bifurcation. This ability is useful for future applications of the proposed model, such as early classification.
• The proposed model gives an end-to-end training framework for individual class probability estimation and trajectory prediction.

II. RELATED WORK
A. Models for online handwriting generation
Studies on the generation of online handwriting can be divided into two approaches: one is based on the motor model and the other on the stochastic model. In the motor model-based approach, the movement of the hand while writing is formulated using differential equations based on kinematic theory. For example, Plamondon and Maarse regarded the generation of handwriting as a product of motor behavior and expressed the writing process using a hand movement model [6]. Plamondon and Guerfali proposed a model of handwriting generation that parameterizes the generation process of handwritten characters using delta-lognormal theory [7]. Although this approach can express the generation process of handwriting in an interpretable way, it requires strong assumptions, such as a simplification of the generation process. The stochastic model-based approach expresses the process of handwriting generation using a stochastic model, which allows for flexible modeling. In [8], the strokes of the pen were generated stochastically using a Bayesian network. Graves used a mixture density network (MDN) to stochastically generate and predict handwritten characters [9]. An attempt was also made to generate handwritten characters using a spiking neural network in [10].
B. Prediction model considering bifurcation
Several methods for bifurcation-aware prediction have also been proposed. In [11], [12], a Markov decision process is used to predict human behavior while considering the bifurcation. There are several attempts to predict time series by considering class information and bifurcation simultaneously. Pool et al. proposed a prediction model for autonomous driving that can represent bifurcation by combining linear dynamical systems for each type of behavior of autonomous cars [13]. Deo and Trivedi proposed a model that classifies the trajectory patterns of automobiles into six classes and predicts trajectories according to the estimated class probabilities [1], [14]. Tang and Salakhutdinov proposed a model for multimodal prediction without explicit labeling of tracking patterns [15]. Makansi et al. proposed a two-stage model based on the MDN, with hypothesis sampling by the Winner-Takes-All loss and distribution fitting, to avoid mode collapse [16].

Our trial can be differentiated from these works at the following points. First, in our model, the prediction task is factorized into a class-wise coordinate prediction module and a single class probability module, and both modules are trained in an end-to-end framework (this implies that our model is solving a multi-task training problem). Second, compared to the tasks of the past attempts (e.g., car trajectory prediction), handwriting trajectories are very different in many aspects. In particular, we need to deal with nonstationarity because handwriting trajectories show different characteristics at each time point. Third, our main focus is not only to obtain more accurate predictions but also to analyze the inter-class bifurcation structures underlying the specific target, i.e., handwriting. We therefore show many analysis results to understand these structures.
[Fig. 1: Structure of the proposed CGP model. A three-layer LSTM with skip connections extracts features from the input $\boldsymbol{x}^{(1:t)}$; a coordinate prediction head outputs the GMM parameters $(\boldsymbol{\pi}_c^{(t)}, \boldsymbol{\mu}_c^{(t)}, \boldsymbol{\sigma}_c^{(t)}, \boldsymbol{\rho}_c^{(t)})$ of $P_{\theta_1^{(t)}}(\boldsymbol{x}^{(t+1)} \mid c, \boldsymbol{x}^{(1:t)})$ for each class $c$, and a class probability prediction head outputs $p_1^{(t)}, \ldots, p_K^{(t)}$ of $P_{\theta_2^{(t)}}(c \mid \boldsymbol{x}^{(1:t)})$.]

III. CLASS-GUIDED PREDICTION
A. Formulation of class-guided prediction
The proposed CGP model provides the distribution of the next pen-tip coordinate $\boldsymbol{x}^{(t+1)} \in \mathbb{R}^2$ as its prediction result. Specifically, given a sequence $\boldsymbol{x}^{(1:t)}$, the CGP model provides the posterior distribution $P(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(1:t)})$. Instead of predicting a single pen-tip coordinate, the distribution estimation is suitable for dealing with bifurcations by providing multiple possible trajectories as random samples from the distribution. In addition, it is also useful for understanding how the prediction uncertainty increases for a longer-term prediction.

The key idea of the proposed model is to factorize the distribution $P(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(1:t)})$ as the class-weighted sum of the class-conditional distributions of coordinates:

$$P(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(1:t)}) = \sum_{c=1}^{K} P_{\theta_1^{(t)}}(\boldsymbol{x}^{(t+1)} \mid c, \boldsymbol{x}^{(1:t)})\, P_{\theta_2^{(t)}}(c \mid \boldsymbol{x}^{(1:t)}), \qquad (1)$$

where $K$ is the number of classes and $\theta_1^{(t)}$ and $\theta_2^{(t)}$ are distribution parameters at $t$. Those parameters are time-variant because of the nonstationarity of handwriting trajectories.

With this class-wise factorization, we can expect that the model acquires the inter-class bifurcation structure (like ‘0’ and ‘6’) more explicitly. If we try to model $P(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(1:t)})$ without class-wise factorization by, say, a single Gaussian mixture model such as the MDN, there is a risk that the inter-class bifurcation parts are mixed up into a single Gaussian component with a large covariance. For example, one of the Gaussian components for the ending parts of ‘0’ and ‘6’ might have a large covariance to cover both ending directions. In contrast, with the above factorization, each distribution $P_{\theta_1^{(t)}}(\boldsymbol{x}^{(t+1)} \mid c, \boldsymbol{x}^{(1:t)})$ is trained within each class and not disturbed by other classes. Therefore, we have $K$ sharper distributions with fewer overlaps; namely, we can catch the inter-class bifurcation structure more explicitly.

B. Prediction model by neural networks
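Before describing the network modules, the factorization in (1) can be made concrete with a small numerical sketch. The density values, class count, and component parameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def bivariate_normal_pdf(x, mu, sigma, rho):
    """Density of a 2-D Gaussian given per-axis std devs and correlation."""
    dx = (x[0] - mu[0]) / sigma[0]
    dy = (x[1] - mu[1]) / sigma[1]
    z = dx**2 - 2.0 * rho * dx * dy + dy**2
    norm = 2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(1.0 - rho**2)
    return np.exp(-z / (2.0 * (1.0 - rho**2))) / norm

def class_conditional_gmm_pdf(x, pi, mu, sigma, rho):
    """P(x^{(t+1)} | c, x^{(1:t)}) as an M-component Gaussian mixture."""
    return sum(pi[m] * bivariate_normal_pdf(x, mu[m], sigma[m], rho[m])
               for m in range(len(pi)))

def cgp_pdf(x, class_probs, gmm_params):
    """Eq. (1): class-weighted sum of class-conditional GMM densities."""
    return sum(p_c * class_conditional_gmm_pdf(x, *params)
               for p_c, params in zip(class_probs, gmm_params))
```

When two classes (e.g., ‘0’ and ‘6’) both carry a large weight $p_c^{(t)}$, the resulting density is bimodal, which is exactly the inter-class bifurcation the model is meant to capture.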
Fig. 1 shows the structure of the proposed CGP model. The CGP model consists of three network modules. The first module is prepared for feature extraction from $\boldsymbol{x}^{(1:t)}$ and composed of three layers of LSTM with skip connections between them. The skip connections mitigate the vanishing gradient problem and facilitate the learning of deep neural networks [9].

The second and third modules are prepared for the factorization of (1). The second module is the coordinate prediction module for estimating $\theta_1^{(t)}$ and is composed of a single fully-connected (FC) layer. The third module is the class probability prediction module for $\theta_2^{(t)}$ and is also composed of a single FC layer. The approach in which a neural network outputs the distribution parameters (of, especially, a mixture distribution) is inspired by the MDN [17].

In this paper, a Gaussian mixture distribution is used to represent $P_{\theta_1^{(t)}}(\boldsymbol{x}^{(t+1)} \mid c, \boldsymbol{x}^{(1:t)})$. By preparing a mixture distribution for each class, its parameter set becomes

$$\theta_1^{(t)} = \left\{ \left( \pi_{c,m}^{(t)}, \boldsymbol{\mu}_{c,m}^{(t)}, \boldsymbol{\sigma}_{c,m}^{(t)}, \rho_{c,m}^{(t)} \right) \mid m = 1, \ldots, M,\; c = 1, \ldots, K \right\},$$

where $M$ is the number of Gaussian components per class, $\pi_{c,m}^{(t)}$ is the mixture coefficient, $\boldsymbol{\mu}_{c,m}^{(t)}$ is the mean, $\boldsymbol{\sigma}_{c,m}^{(t)}$ is the standard deviation, and $\rho_{c,m}^{(t)}$ is the correlation coefficient. These parameters are estimated as the output of the coordinate prediction module for the input $\boldsymbol{x}^{(1:t)}$.

For the class probability $P_{\theta_2^{(t)}}(c \mid \boldsymbol{x}^{(1:t)})$, a categorical distribution is used, whose parameter set is

$$\theta_2^{(t)} = \{ p_c^{(t)} \mid c = 1, \ldots, K \}.$$

Like $\theta_1^{(t)}$, these parameters are estimated as the outputs of the class probability prediction module for the input $\boldsymbol{x}^{(1:t)}$.

C. Training the model
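Before the losses are introduced, note that the raw FC outputs are unconstrained real numbers and must be mapped into valid distribution parameters. The numpy sketch below illustrates one such mapping; the 6-values-per-component packing is an assumption for illustration, not the paper's exact layout:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def to_gmm_params(raw, K, M):
    """Map raw coordinate-head outputs, shape (K*M*6,), to valid GMM
    parameters: per component, 1 mixture logit, 2 means, 2 log-stds,
    and 1 correlation logit."""
    raw = np.asarray(raw, dtype=float).reshape(K, M, 6)
    pi = softmax(raw[..., 0], axis=-1)   # sum_m pi_{c,m} = 1 for each class
    mu = raw[..., 1:3]                   # means are unconstrained
    sigma = np.exp(raw[..., 3:5])        # exp makes the std devs positive
    rho = np.tanh(raw[..., 5])           # tanh keeps correlation in (-1, 1)
    return pi, mu, sigma, rho

def to_class_probs(raw):
    """Map raw class-head outputs, shape (K,), to a categorical distribution."""
    return softmax(np.asarray(raw, dtype=float))
```

These are the normalization, range-extension, and range-limitation operations that make the heads' outputs usable as probability-distribution parameters.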
The trainable weights of the CGP model, namely, the weights of the LSTM layers and FC layers, are trained with a given training dataset. Given a set of $N$ training sequences and corresponding class labels $\{\boldsymbol{x}_n^{(1:T)}, \boldsymbol{y}_n\}_{n=1}^{N}$, where $\boldsymbol{y}_n = (y_{n,c})_{c=1,\ldots,K}$ is the class label encoded as a one-hot vector, the CGP model is trained by minimizing the following loss function $L$:

$$L = \sum_{n=1}^{N} \sum_{t=1}^{T} \left( L_{\mathrm{coord}}^{(n,t)} + L_{\mathrm{class}}^{(n,t)} \right),$$

where $L_{\mathrm{coord}}^{(n,t)}$ is the negative log-likelihood loss for the coordinate prediction module and evaluates the likelihood of $\boldsymbol{x}^{(t+1)}$ under the current mixture distribution:

$$L_{\mathrm{coord}}^{(n,t)} = -\log \sum_{c=1}^{K} y_{n,c} \sum_{j=1}^{M} \pi_{c,j}^{(t)} \,\mathcal{N}\!\left( \boldsymbol{x}^{(t+1)} \mid \boldsymbol{\mu}_{c,j}^{(t)}, \boldsymbol{\sigma}_{c,j}^{(t)}, \rho_{c,j}^{(t)} \right),$$

and $L_{\mathrm{class}}^{(n,t)}$ is the cross-entropy loss for the class probability prediction module and evaluates how well the module outputs correct class probabilities:

$$L_{\mathrm{class}}^{(n,t)} = -\sum_{c=1}^{K} y_{n,c} \log p_c^{(t)}.$$

The parameters $\boldsymbol{\sigma}_{c,m}^{(t)} = (\sigma_1, \sigma_2)$ and $\rho_{c,m}^{(t)}$ specify the covariance matrix as

$$\begin{pmatrix} \sigma_1^2 & \rho_{c,m}^{(t)} \sigma_1 \sigma_2 \\ \rho_{c,m}^{(t)} \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}.$$

Several post-operations are applied to the network outputs to make them valid as the parameters of a probability distribution: specifically, the normalization to make $\sum_m \pi_{c,m}^{(t)} = 1$, the range extension $\exp \sigma_{c,m}^{(t)} \to \sigma_{c,m}^{(t)}$, and the range limitation $\tanh \rho_{c,m}^{(t)} \to \rho_{c,m}^{(t)}$. The outputs are also normalized so that $\sum_c p_c^{(t)} = 1$.

Since the entire model is formulated using only differentiable operations, its weights can be updated through backpropagation in an end-to-end manner.
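As a sanity check of the two loss terms, the sketch below evaluates $L_{\mathrm{coord}}$ and $L_{\mathrm{class}}$ for a single time step with toy parameters; the helper function and array shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def bivariate_normal_pdf(x, mu, sigma, rho):
    """Density of a 2-D Gaussian given per-axis std devs and correlation."""
    dx = (x[0] - mu[0]) / sigma[0]
    dy = (x[1] - mu[1]) / sigma[1]
    z = dx**2 - 2.0 * rho * dx * dy + dy**2
    norm = 2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(1.0 - rho**2)
    return np.exp(-z / (2.0 * (1.0 - rho**2))) / norm

def coord_loss(x_next, y_onehot, pi, mu, sigma, rho):
    """Negative log-likelihood of the next coordinate; the one-hot label
    y_onehot selects the labeled class's Gaussian mixture."""
    K, M = pi.shape
    lik = sum(y_onehot[c] * pi[c, j] *
              bivariate_normal_pdf(x_next, mu[c, j], sigma[c, j], rho[c, j])
              for c in range(K) for j in range(M))
    return -np.log(lik)

def class_loss(y_onehot, p):
    """Cross-entropy between the one-hot label and predicted class probs."""
    return -np.sum(y_onehot * np.log(p))
```

Summing `coord_loss` and `class_loss` over all sequences and time steps gives the total loss $L$ above.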
D. Stochastic prediction by the model
A two-step sampling procedure is employed to obtain the coordinate at $t+1$. In the first step, a class $\tilde{c}$ at $t$ is sampled according to the class probability distribution $P_{\theta_2^{(t)}}(c \mid \boldsymbol{x}^{(1:t)})$. In the second step, the predicted coordinate $\boldsymbol{x}^{(t+1)}$ is sampled from $P_{\theta_1^{(t)}}(\boldsymbol{x}^{(t+1)} \mid \tilde{c}, \boldsymbol{x}^{(1:t)})$. For predicting the coordinate at $t + \Delta t$, this two-step sampling procedure is repeated $\Delta t - 1$ more times by concatenating the predicted results and $\boldsymbol{x}^{(1:t)}$.

IV. EXPERIMENT
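The repeated two-step sampling of Section III-D can be sketched as follows; the parameter containers and the stand-in `predict_params` forward pass are illustrative assumptions:

```python
import numpy as np

def sample_next(rng, class_probs, gmm_params):
    """Two-step sampling: draw a class c~, then draw x^{(t+1)} from the
    class-conditional GMM of c~."""
    c = rng.choice(len(class_probs), p=class_probs)
    pi, mu, sigma, rho = gmm_params[c]
    m = rng.choice(len(pi), p=pi)
    sx, sy = sigma[m]
    cov = [[sx**2, rho[m] * sx * sy],
           [rho[m] * sx * sy, sy**2]]
    return rng.multivariate_normal(mu[m], cov)

def rollout(rng, seq, predict_params, dt):
    """Predict up to t + dt by feeding each sampled coordinate back into
    the model; `predict_params(seq) -> (class_probs, gmm_params)` stands
    in for a forward pass of a trained CGP model."""
    seq = list(seq)
    for _ in range(dt):
        class_probs, gmm_params = predict_params(seq)
        seq.append(sample_next(rng, class_probs, gmm_params))
    return seq
```

Because a fresh class is drawn at every step, repeated rollouts naturally spread over the branches of a bifurcation in proportion to the class probabilities.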
A. Experimental setups
To evaluate the validity of the CGP, we conducted an experiment to predict handwriting. We used the UNIPEN database (Train-R01/V07, 1a) [18], which contains online handwritten digits as sequences of two-dimensional coordinates. We normalized these sequences into the range [0, 127] for each dimension and to a time length of 50. Instead of directly inputting the coordinate sequences into the model, we converted each coordinate into a relative coordinate, that is, the difference from the coordinate one time step prior to the current time point. Since sequences of relative coordinates encode the relationships between two temporally adjacent coordinates, the model is expected to catch the movements of the handwriting more easily than when absolute coordinate sequences are used. This dataset is slightly class-imbalanced and contains approximately 1,300 sequences for each class. We randomly divided the dataset into 70% for the training set, 10% for the validation set, and 20% for the test set.

For comparison, we used deterministic prediction using LSTMs (D-LSTM), a model combining LSTMs and an MDN (hereafter referred to simply as the MDN), and the 1-nearest neighbor (1-NN). In the CGP model and the MDN, which output mixture distributions, the total number of components was unified to 40, i.e., four components were assigned to each digit class in the CGP model. The numbers of training epochs for the D-LSTM, MDN, and CGP were determined based on the criterion that the validation loss did not decrease for 10 epochs. We used the mean squared error as the loss function for the D-LSTM.

In the MDN, the predicted trajectory at $t+1$ was obtained by sampling $\boldsymbol{x}^{(t+1)}$ from $P(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(1:t)})$. As in the CGP model, the trajectory at $t + \Delta t$ ($\Delta t \geq 2$) was predicted by the process of Section III-D.

B. Qualitative evaluation

We investigated the characteristics of the CGP model by qualitatively evaluating its representative results. Fig. 2 shows an example of the predictions made by the CGP model.
Note that this figure was drawn by overlaying the results of 100 samplings, and the subsequent figures are drawn in the same way. These examples confirm that the CGP model can predict the trajectories of multiple classes. For example, in the panel at the top-left corner, most of the predicted trajectories branch to classes ‘0’ and ‘6’ because there is still the possibility of being in either class at this time point.

Figs. 3(a) and (b) show changes in the predictions made by the CGP model when the length of the input was increased. In both cases, the predictions contained the possibility of the bifurcation into either ‘2’ or ‘3’ in the early stages. As the length of the input increases, the number of predictions for the correct class increases in both cases. Therefore, the CGP model made predictions based on an ambiguous class prediction in the early stages, and the certainty of its predictions increased with the input length. This result is reasonable because the longer the input length was, the more confident the prediction was.

Fig. 4 shows an example of a prediction by the CGP model with inputs up to $t = 20$ while varying $\Delta t$, where the pen-up for ‘5’ was performed. Although accurate prediction is difficult because the stroke moves drastically when a pen-up occurs, it was confirmed that the pen-up was predicted at an approximately accurate timing. Furthermore, the destination after the pen-up was a natural starting position for the subsequent stroke. Incidentally, this example contains strokes for ‘4’ in the early stages of the prediction, and even in this case, the pen-up for ‘4’ was correctly predicted.

Fig. 5 shows a comparison of the results predicted by the CGP model, MDN, D-LSTM, and 1-NN. Since the MDN is a probabilistic model like the CGP, the results of 100 samplings are drawn. The spread of the predicted lines is proportional to the variance of the predicted distribution. In the prediction for ‘6’ by the CGP model (top of Fig. 5(a)), a branch to ‘0’ and ‘6’ is apparent.
The lines predicted by the MDN for ‘0’ and ‘6’ were not clearly separated, and the result appeared to be a unimodal prediction with a large variance rather than a bimodal prediction. This is because the role of each component was not clearly specified in the MDN. Therefore, the distribution predicted by the MDN was ambiguous, having a large variance with its mean between ‘0’ and ‘6.’ In Figs. 5(c) and 5(d), the predictions of the D-LSTM and 1-NN deviate from the true values. In particular, for the prediction of ‘3’ (bottom of Figs. 5(c) and 5(d)), it seems that the D-LSTM and 1-NN recognized the class as ‘2’ and subsequently predicted the corresponding trajectory. Since the D-LSTM and 1-NN can only make deterministic predictions, their performance worsened when a class was incorrectly predicted.

C. Quantitative evaluation
We quantitatively evaluated the performance of the CGP model using two criteria: the root mean squared error (RMSE) and the negative log-likelihood (NLL). We sampled 20 times from the predictive distribution for each test datum and calculated both criteria.

In the evaluation based on the RMSE, we used two types of class-conditional RMSEs (RMSE 2 and RMSE 3), in addition to the ordinary RMSE (RMSE 1). This is because it is inadvisable to calculate errors over all predictions at a given time point when there is a possibility of branching. For example, as shown in Fig. 5, when there is a possibility of predicting both the correct class ‘6’ and the incorrect class ‘0,’ it is preferable to calculate the error only for the correct class. We defined RMSE 2 as the RMSE for the coordinates sampled from the majority class distribution, and RMSE 3 as the RMSE for the coordinates sampled from the correct class distribution.

Fig. 6(a) shows the RMSE values for each model. There is no significant difference between the errors of the CGP model and the other models in terms of RMSE 1. When compared in terms of RMSE 2, the error of the CGP model is slightly lower than those of the others for large values of $\Delta t$. When compared in terms of RMSE 3, the error of the CGP after $\Delta t = 9$ was remarkably lower than those of the other models. These results suggest that the CGP model can make more accurate predictions when it can correctly predict the class of the input time series.

Fig. 6(b) shows the results of the evaluation using the NLL. The log-likelihood measures how well the predicted distribution fits the real values, and its negative value is often used to assess density estimation: the smaller the NLL, the better the distribution. In Fig. 6(b), the MDN shows a smaller NLL than the CGP model.
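The three RMSE variants can be sketched as follows from a set of sampled coordinates and their sampled class labels; the function name and array shapes are assumptions, not the authors' evaluation code:

```python
import numpy as np

def rmse(pred, true):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

def rmse_variants(samples, labels, true_next, correct_class):
    """samples: (S, 2) sampled coordinates; labels: (S,) sampled class ids.
    RMSE 1 uses all samples, RMSE 2 only samples of the majority class,
    and RMSE 3 only samples of the correct class."""
    samples = np.asarray(samples)
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    majority = values[counts.argmax()]
    rmse1 = rmse(samples, true_next)
    rmse2 = rmse(samples[labels == majority], true_next)
    rmse3 = rmse(samples[labels == correct_class], true_next)
    return rmse1, rmse2, rmse3
```

RMSE 2 and RMSE 3 thus exclude samples that followed a different branch of a bifurcation, which is why they isolate the prediction quality from the class-ambiguity effect.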
As $\Delta t$ increased, the gap became more prominent. These results confirm that the MDN generated a predicted distribution that fits the real values well, whereas the CGP model also demonstrated competitive results. A possible explanation for this is the difference in the variance of the distributions for long-term predictions. As shown in Fig. 5, the MDN tended to make predictions with a larger variance than the CGP model. If the variance of each component is large, the likelihood tends to become large even if the mean of each component is far from the true value. For these reasons, we consider that the MDN ostensibly showed better NLL values despite its ambiguous predictions.

D. Bifurcation structure of characters
Fig. 7 shows changes in RMSE 1 according to the input time $t$ for each class and each prediction width $\Delta t$. Roughly speaking, the RMSE tended to decrease over time for all classes. However, for some classes, the time variation of the RMSE had a specific pattern rather than a monotonic decrease. For example, in class ‘5,’ the RMSE increased rapidly around $t = 25$ and then decreased around $t = 35$. Moreover, this class-specific pattern was the same regardless of the prediction width.

This pattern of time variation in the RMSE for each class is related to the bifurcation in the handwriting prediction. For example, ‘0’ and ‘6’ had a similar trajectory until the middle stage of the writing process, but bifurcate after that. This resulted in a decrease in the RMSE up to the bifurcation point of the character (around $t = 18$) because a common prediction could be made. However, the RMSE increased because it was not possible to determine whether the class was ‘0’ or ‘6’ when predicting beyond the bifurcation point. As the input series became longer and exceeded $t = 35$, the prediction of the character class became clearer, resulting in a decrease in the RMSE. That is, there were similarities in the time variation of the RMSE among classes that have common parts in the handwriting trajectory, which was influenced by the bifurcation structure of the characters.

The frequency of sampled class labels according to the input time $t$ is shown in Fig. 8. Note that the figures are drawn for each correct class with the prediction width fixed to $\Delta t = 10$, and that a particularly confusing pair (‘0’ and ‘6’) and triplet (‘2,’ ‘3,’ and ‘7’) were selected. For all cases, the frequency of multiple classes was high in the early stage of the prediction, which means that the class predictions were ambiguous. Until the middle stage of prediction, the frequency of classes with a common part in the trajectory, e.g., ‘0’ and ‘6,’ increased, while the frequencies of the other classes decreased. This indicates that the number of possible character classes of the input series became limited as the time length of the input increased. The state in which the frequency of these few classes was high persisted because it was not possible to determine the class until the bifurcation point. Once the input time had exceeded $t = 23$, the frequency of incorrect classes tended to decrease. The classes ‘0’ and ‘6’ in Fig. 7 show that the RMSE decreased at around $t = 35$, which is consistent with the result of the selected class frequency. The graphs of classes ‘2,’ ‘3,’ and ‘7’ in Fig. 8 show that the prediction had three classes of possibilities, which means that ‘2,’ ‘3,’ and ‘7’ each had a three-way bifurcated structure.

[Fig. 2: Example of the predictions made by the CGP model. The blue dots are the inputted coordinate sequences, and the gray dots are the subsequent true coordinate sequences. The red dots represent the final time point of the coordinate sequences. The colored lines are the class-conditional trajectories predicted by the CGP model.]

[Fig. 3: Predictions for ‘2’ and ‘3’ while increasing input length $t$: (a) correct class ‘2,’ (b) correct class ‘3.’ The colors represent the same items as in Fig. 2.]

[Fig. 4: Trajectories predicted by the CGP model for ‘5’ while varying $\Delta t$ with fixed inputs up to $t = 20$. The colored lines have the same meanings as in Fig. 2.]

[Fig. 5: Comparison of the results of prediction by (a) the CGP model, (b) the MDN, (c) the D-LSTM, and (d) the 1-NN. The figures in the top row show predictions for ‘6’ and those in the bottom row show those for ‘3.’]

[Fig. 6: Performance of each model: (a) RMSE, (b) negative log-likelihood.]

[Fig. 7: Changes in RMSE for each class.]

[Fig. 8: Frequency of the selected classes in sampling ($\Delta t = 10$). The colored lines have the same meanings as in Fig. 2.]

V. CONCLUSION
In this paper, we proposed a temporal prediction model that can handle the bifurcation structures related to class information. By combining class prediction and class-conditional coordinate prediction, the proposed class-guided prediction (CGP) model learns the bifurcation structure explicitly. In experiments using the UNIPEN online handwritten character dataset, we verified that the CGP model can represent and handle the bifurcation structure of handwritten characters and can predict their trajectories with a smaller error.

In future work, we will utilize the learned bifurcation structure for several applications and scientific investigations. For example, we can realize early classification by knowing that there is no bifurcation after the current time point [19], [20]. It is also possible to combine our framework with other temporal pattern generation models. In fact, the ability to predict future uncertainty will be useful for reinforcement learning-based generative models, since those models rely on future reward prediction.

ACKNOWLEDGMENT

This work was partially supported by JSPS KAKENHI Grant Number JP17H06100 and JST ACT-I Grant Number JPMJPR18UO.
REFERENCES

[1] N. Deo and M. M. Trivedi, “Convolutional Social Pooling for Vehicle Trajectory Prediction,” in Proc. CVPR Workshops, 2018.
[2] K. Tastambekov, S. Puechmorel, D. Delahaye, and C. Rabut, “Aircraft Trajectory Forecasting Using Local Functional Regression in Sobolev Space,” Transportation Research Part C: Emerging Technologies, vol. 39, pp. 1–22, 2014.
[3] L. Suganthi and A. A. Samuel, “Energy Models for Demand Forecasting – A Review,” Renewable and Sustainable Energy Reviews, vol. 16, no. 2, pp. 1223–1240, 2012.
[4] X. Feng, Q. Li, Y. Zhu, J. Hou, L. Jin, and J. Wang, “Artificial Neural Networks Forecasting of PM2.5 Pollution Using Air Mass Trajectory Based Geographic Model and Wavelet Transformation,” Atmospheric Environment, vol. 107, pp. 118–128, 2015.
[5] A. Kendall and Y. Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?” in Proc. NIPS, 2017.
[6] R. Plamondon and F. J. Maarse, “An Evaluation of Motor Models of Handwriting,” IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1060–1072, 1989.
[7] R. Plamondon and W. Guerfali, “The Generation of Handwriting with Delta-Lognormal Synergies,” Biological Cybernetics, vol. 78, pp. 119–132, 1998.
[8] H. Choi, S.-J. Cho, and J. H. Kim, “Generation of Handwritten Characters with Bayesian Network Based On-line Handwriting Recognizers,” in Proc. ICDAR, 2003.
[9] A. Graves, “Generating Sequences with Recurrent Neural Networks,” arXiv preprint arXiv:1308.0850, 2013.
[10] M. Ltaief, H. Bezine, and A. M. Alimi, “A Spiking Moter-Model for Online Handwriting Movements Generation,” in Proc. ICFHR, 2016.
[11] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity Forecasting,” in Proc. ECCV, 2012.
[12] N. Lee and K. M. Kitani, “Predicting Wide Receiver Trajectories in American Football,” in Proc. WACV, 2016.
[13] E. A. I. Pool, J. F. Kooij, and D. M. Gavrila, “Using Road Topology to Improve Cyclist Path Prediction,” in Proc. Intelligent Vehicles Symposium, 2017.
[14] N. Deo and M. M. Trivedi, “Multi-Modal Trajectory Prediction of Surrounding Vehicles with Maneuver based LSTMs,” in Proc. Intelligent Vehicles Symposium, 2018.
[15] C. Tang and R. R. Salakhutdinov, “Multiple Futures Prediction,” in Proc. NIPS, 2019.
[16] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction,” in Proc. CVPR, 2019.
[17] C. Bishop, “Mixture Density Networks,” Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[18] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet, “UNIPEN Project of On-line Data Exchange and Recognizer Benchmarks,” in Proc. ICPR, 1994.
[19] M. Weber, M. Liwicki, D. Stricker, C. Scholzel, and S. Uchida, “LSTM-Based Early Recognition of Motion Patterns,” in Proc. ICPR, 2014.
[20] Z. Chen, E. Anquetil, C. Viard-Gaudin, and H. Mouchere, “Early Recognition of Handwritten Gestures Based on Multi-Classifier Reject Option,” in