Hierarchical Learning Using Deep Optimum-Path Forest
Luis C. S. Afonso, Clayton R. Pereira, Silke A. T. Weber, Christian Hook, Alexandre X. Falcão, João P. Papa
Luis C. S. Afonso (a), Clayton R. Pereira (b), Silke A. T. Weber (c), Christian Hook (d), Alexandre X. Falcão (e), João P. Papa (b)

(a) UFSCar - Federal University of São Carlos, Department of Computing, São Carlos, Brazil ([email protected])
(b) UNESP - São Paulo State University, School of Sciences, Bauru, Brazil ({clayton.pereira,joao.papa}@unesp.br)
(c) UNESP - São Paulo State University, Medical School, Botucatu, Brazil ([email protected])
(d) Ostbayerische Technische Hochschule, Regensburg, Germany ([email protected])
(e) UNICAMP - University of Campinas, Institute of Computing, Campinas, Brazil ([email protected])
Abstract
Bag-of-Visual-Words (BoVW) and deep learning techniques have been widely used in several domains, which include computer-assisted medical diagnoses. In this work, we are interested in developing tools for the automatic identification of Parkinson's disease using machine learning and the concept of BoVW. The proposed approach concerns a hierarchical-based learning technique to design visual dictionaries through the Deep Optimum-Path Forest classifier. The proposed method was evaluated on six datasets derived from data collected from individuals when performing handwriting exams. Experimental results showed the potential of the technique, with robust recognition rates.
Key words:
Parkinson's disease, Optimum-Path Forest, Handwriting Dynamics, Hierarchical Representation
1. Introduction
Image and signal classification problems have been widely studied in the past decades by the machine learning and computer vision research communities. More recently, considerable effort has been devoted to deep learning (DL) techniques.
Preprint submitted to Journal of LaTeX Templates, February 19, 2021.

Despite the fact that DL-driven approaches are known to be quite useful in generalizing over a number of problems, they still cannot deal well with some simple problems [1]. Also, specific neural architectures need to be designed to cope with signal classification problems, since most of the models available in the literature are developed to handle image-based applications only.

The well-known Bag-of-Visual-Words (BoVW) [2] paradigm has been consistently employed and enhanced over the years to address both image- and signal-based classification problems. In a nutshell, the idea consists in extracting information (e.g., visual words/key points/descriptors) from the data and further using it to compose a dictionary (i.e., a bag) that can be employed to compute new representations for given data. Applications in medical data vary from X-ray categorization to histopathology image classification [3, 4, 5], among others.

Computer-assisted Parkinson's disease (PD) identification is another research area that can benefit from automated diagnosis and the BoVW paradigm. Such illness is known to be neurodegenerative, it has no cure, and its main symptoms include freezing of gait, tremors, and speech alterations, to name a few. In this context, a considerable number of works that deal with automated PD diagnosis can be found in the literature. Spadotto et al. [6], for instance, introduced the Optimum-Path Forest (OPF) [7, 8, 9] for PD identification from speech signals. Later on, they employed evolutionary optimization techniques to select the most relevant features to deal with the same problem [10]. Sama et al. [11] and Bächlin et al. [12] explored wearable accelerometers to detect the freezing of gait and to provide assistance as soon as the condition is detected. Rigas et al.
[13] investigated an automated method that estimates the type and severity of tremors based on data acquired from accelerometers attached to specific positions on a patient's body. The estimations are used to assess both resting and action tremors.

Other works used images to cope with PD recognition automatically. Pereira et al. [14] proposed to extract features from handwriting exams that were further digitized to fulfill the aims of the work. They used the HandPD dataset, which comprises exams performed by healthy individuals and PD patients, to detect subtle tremors when drawing spirals and meanders on a piece of paper. Since the exams were conducted using a pen equipped with sensors, the same group of authors further proposed to use the signals obtained from the pen as a means to perform automatic PD recognition [15]. Very recently, Afonso et al. [16] introduced the concept of "deep recurrence plots" for the identification of Parkinson's disease, where the idea is to employ recurrence plots [17] to model the time dependency of the signals acquired during the exam.

Afonso et al. [18] also proposed a BoVW-based model to learn representations from signals (i.e., the same ones used in the works mentioned earlier) to be further used to cope with the problem of Parkinson's disease identification. The proposed approach first extracts key points (descriptors) from the signal, which are then clustered using the unsupervised OPF technique [19]. The idea behind the clustering is to select only the most informative ones, which will compose the final dictionary. The results showed that OPF could build more informative dictionaries than other clustering algorithms. The OPF is a framework for the design of classifiers based on graph partition, where each node stands for a dataset sample, and an adjacency relation connects them for the further application of a reward-competition approach that ends up partitioning the dataset into optimum-path trees (OPTs).
Such OPTs can be either unlabeled clusters (unsupervised problems) or labeled trees (supervised and semi-supervised [20] problems), and their roots stand for the so-called "prototypes".

In this paper, we extend the work of Afonso et al. [18] by proposing a hierarchical-based learning methodology to design visual dictionaries. The proposed approach makes use of the Deep OPF classifier [21], which aims at performing different levels of clustering to learn and encode distinct information at each phase. We show results that outperform the ones obtained by Afonso
et al. [18] in the context of computer-assisted Parkinson's disease identification using signals derived from handwriting exams.

The remainder of this paper is organized as follows. Section 2 presents the theoretical background on OPF and Deep OPF. Sections 3 and 4 describe how deep representations are learned through Deep OPF and the proposed approach, respectively. Section 5 presents the experimental results, and Section 6 states conclusions and future work.
2. Optimum-Path Forest Clustering
The fundamental problem in unsupervised learning is to identify clusters in an unlabeled dataset Z, such that samples from the same cluster should share some level of similarity. Many methods were proposed where the learning problem is addressed from different perspectives, such as data clustering and density estimation, just to mention a few [22]. The Optimum-Path Forest handles unsupervised learning under a data clustering perspective through graph partitioning [19]. Briefly, the partition task is performed as a competitive process ruled by a set of key samples s ∈ Z, called prototypes, that conquer the remaining samples by offering them optimum-cost paths. As a result, one obtains a collection of trees (a forest) rooted at the prototypes, in which each tree represents a different cluster.

Suppose that a graph (Z, A_k) can be derived from Z through a k-nearest neighbors adjacency relation A_k. Each n-dimensional sample x ∈ Z is represented as a graph node, and the connection (edge) between two nodes s and t is weighted by some distance or similarity metric d(s, t) based on their feature vectors. Also, each node s is weighted by a probability density function (pdf) defined as follows:

\rho(s) = \frac{1}{\sqrt{2\pi}\,\sigma\,|\mathcal{A}(s)|} \sum_{\forall t \in \mathcal{A}(s)} \exp\left(\frac{-d^2(s,t)}{2\sigma^2}\right),   (1)

in which σ = d_f/3, and d_f is the length of the longest edge in the graph (Z, A_k). The choice of this parameter considers all nodes for density computation, since a Gaussian function covers most samples within d(s, t) ∈ [0, 3σ].

The most common method for probability density estimation is the Parzen window provided by Equation 1, which is based on the isotropic Gaussian kernel when the arcs are defined by (s, t) ∈ A_k if d(s, t) ≤ d_f. However, issues related to differences in scale and sample concentration arise in the application of such an approach.
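To make the density estimation concrete, the following self-contained sketch (not the authors' code; the helper names `euclidean` and `knn_density` are illustrative) builds the k-nearest-neighbor arcs of a toy 2-D dataset and evaluates Equation (1) with σ = d_f/3:

```python
# Sketch, assuming Euclidean distance and sigma = d_f / 3 as in Equation (1).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_density(samples, k):
    """Parzen-window density rho(s) over the k-NN arcs of each sample."""
    n = len(samples)
    neighbours = []
    for i in range(n):
        dists = sorted((euclidean(samples[i], samples[j]), j)
                       for j in range(n) if j != i)
        neighbours.append(dists[:k])
    # d_f: the longest arc in the graph; the kernel then covers d in [0, 3*sigma]
    d_f = max(d for nb in neighbours for d, _ in nb)
    sigma = d_f / 3.0
    rho = []
    for nb in neighbours:
        acc = sum(math.exp(-d * d / (2.0 * sigma * sigma)) for d, _ in nb)
        rho.append(acc / (math.sqrt(2.0 * math.pi) * sigma * len(nb)))
    return rho

# two dense points far from a sparser pair: the dense region gets higher density
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
dens = knn_density(points, k=2)
```

Samples inside the tight three-point cluster receive a higher density than the isolated pair, which is what later turns them into strong candidates for prototypes.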
To overcome the mentioned problems, Comaniciu [23] proposed adopting adaptive choices for d_f according to the region in the feature space. The method consists in selecting the best number of k-nearest neighbors within [1, k_max], such that 1 ≤ k_max ≤ |Z|. In a similar way, Rocha et al. [19] proposed to select the value k ∈ [1, k_max] that minimizes the graph-cut measure of Shi and Malik [24] computed for each (Z, A_k).

As aforementioned, the graph partitioning task is performed in a competitive fashion, where the prototype nodes try to conquer the non-prototype ones by offering them optimum-path costs. A path π_t in (Z, A_k) can be defined as a sequence of adjacent nodes starting at a root node S(t) and ending at a sample t, where S stands for the set of root nodes. Also, let π_t = ⟨t⟩ be a trivial path, and π_s · ⟨s, t⟩ the concatenation of π_s and the arc (s, t). The Optimum-Path Forest makes use of a smooth function that assigns a value f(π_t) to each path π_t. A path is said to be optimum if f(π_t) ≥ f(τ_t), being τ_t any other path with terminus at t. The smooth function employed by OPF is defined as follows:

f(\langle t \rangle) = \begin{cases} \rho(t) & \text{if } t \in \mathcal{S} \\ \rho(t) - \delta & \text{otherwise,} \end{cases}
f(\pi_s \cdot \langle s, t \rangle) = \min\{f(\pi_s), \rho(t)\},   (2)

for \delta = \min_{\forall (s,t) \in \mathcal{A}_k \,|\, \rho(t) \neq \rho(s)} |\rho(t) - \rho(s)|. Notice that high values of δ reduce the number of maxima. In summary, the OPF algorithm maximizes f(π_t) such that the optimum paths form an optimum-path forest, i.e., a predecessor map P with no cycles that assigns to each sample t ∉ S its predecessor P(t) in the optimum path from S, or a marker nil when t ∈ S.
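The competition described by Equation (2) can be sketched as a max-heap variant of Dijkstra's algorithm. This is an illustrative simplification rather than LibOPF: it uses directed k-NN arcs, all names are ours, and the density and δ follow the definitions above:

```python
# Sketch of OPF clustering (Eqs. 1-2): density maxima become prototypes and
# conquer neighbours along paths that maximize the minimum density.
import heapq
import math

def opf_cluster(samples, k):
    """Return, for each sample, the index of the prototype (root) that conquered it."""
    n = len(samples)
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    knn = []
    for i in range(n):
        ds = sorted((dist(samples[i], samples[j]), j) for j in range(n) if j != i)
        knn.append(ds[:k])
    d_f = max(d for nb in knn for d, _ in nb)
    sigma = d_f / 3.0
    rho = [sum(math.exp(-d * d / (2 * sigma * sigma)) for d, _ in nb)
           / (math.sqrt(2 * math.pi) * sigma * len(nb)) for nb in knn]
    # delta: smallest non-zero density difference along an arc
    delta = min(abs(rho[i] - rho[j])
                for i in range(n) for _, j in knn[i] if rho[i] != rho[j])
    V = [r - delta for r in rho]      # trivial-path costs f(<t>)
    P = [None] * n                    # predecessor map
    root = list(range(n))
    done = [False] * n
    heap = [(-v, i) for i, v in enumerate(V)]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue
        done[s] = True
        if P[s] is None:              # density maximum: s becomes a prototype
            V[s] = rho[s]
        for _, t in knn[s]:
            if done[t]:
                continue
            tmp = min(V[s], rho[t])   # smooth path-cost of Equation (2)
            if tmp > V[t]:
                V[t], P[t], root[t] = tmp, s, root[s]
                heapq.heappush(heap, (-tmp, t))
    return root

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
roots = opf_cluster(points, k=2)      # two clusters emerge, one root each
```

Each returned root index identifies one optimum-path tree; on the toy data the three left points share one prototype and the two right points share another.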
2.1. Deep Optimum-Path Forest

The OPF has the very interesting characteristic of identifying clusters on-the-fly, which is very useful for applications where the number of groups is unknown. However, the absence of an OPF-based approach capable of computing a specific number of clusters also becomes a bottleneck. One solution is to play with the parameter k_max by setting different values until the desired number of clusters is reached. Besides being costly, this operation does not guarantee that such a condition is satisfied.

Based on hierarchical clustering, Afonso et al. [21] proposed a multi-layered OPF-based clustering algorithm. The main idea is to build a model comprised of a user-defined number of layers so as to obtain the desired number of clusters in the last layer. In their method, each layer is responsible for computing an optimum-path forest. The first layer takes as input the original dataset and clusters it following the OPF algorithm. The roots (prototypes) of the resulting forest are used as input by the following layer. The process is repeated until the last layer is reached. The usage of prototypes as the most representative samples is supported by the studies of Castelo and Calderón-Ruiz [25] and Afonso et al. [26]. Prototypes are located in the regions of highest density and, therefore, are suitable to represent the samples of their clusters [27].

Let S_i be the set of prototypes at layer L_i, i = 1, 2, ..., l, in which l stands for the number of layers. Since each root will be the maximum of a pdf (Equation 1), we have a set of samples that fall in the same optimum-path tree and are represented by the very same prototype (root of that tree) in the next layer. In summary, the higher the number of layers, the fewer prototypes (clusters) one shall have, i.e., |S_1| ≥ |S_2| ≥ ... ≥ |S_l| ≥ 1.
Therefore, at layer l, one shall find a single cluster when l → ∞. Figure 1 displays the OPF-based architecture for deep-driven feature space representation, hereinafter called dOPF.

In the example, layer L_1 computed four clusters, i.e., optimum-path trees, rooted at the black-filled nodes (prototypes). The resulting set of prototypes S_1 is used as input by the following layer L_2. As one can observe at layer L_2, a few samples become prototypes once more, resulting in the set S_2. The process described above is performed until the last layer L_l is reached. As the number of layers increases, the number of clusters computed by the last layer decreases, thus reducing to a single cluster at the coarsest level. This process can be interrupted as soon as the number of desired clusters (or a number close to it) is met.

Figure 1: Architecture of an l-layered dOPF. Adapted from Afonso et al. [18].
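The layer loop just described can be sketched as follows. In the paper each layer runs OPF clustering; here a toy radius-based clusterer stands in for it, so the names (`deep_clustering`, `toy_cluster`) and the radii are illustrative:

```python
# Sketch of the dOPF layer loop: each layer clusters its input and forwards
# only the prototypes (roots) to the next layer, so |S_1| >= |S_2| >= ... >= |S_l|.
def deep_clustering(samples, layer_fns):
    """Apply one clustering function per layer; return the prototypes of every layer."""
    prototypes_per_layer = []
    current = samples
    for fn in layer_fns:
        current = fn(current)          # the roots of layer i feed layer i+1
        prototypes_per_layer.append(current)
    return prototypes_per_layer

def toy_cluster(points, radius):
    """Stand-in clusterer: keep one representative per group closer than `radius`."""
    reps = []
    for p in points:
        if all(abs(p - r) > radius for r in reps):
            reps.append(p)
    return reps

data = [0.0, 0.2, 0.4, 5.0, 5.3, 9.9]
layers = deep_clustering(data, [lambda p: toy_cluster(p, 1.0),
                                lambda p: toy_cluster(p, 6.0),
                                lambda p: toy_cluster(p, 20.0)])
sizes = [len(s) for s in layers]       # strictly shrinking: 3, 2, 1
```

The prototype sets shrink layer by layer until a single cluster remains, mirroring the coarsening shown in Figure 1.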
3. Deep-based Representations through Optimum-Path Forest
Deep-based representations are commonly employed in image classification applications, but they are not restricted to such tasks. These representations are obtained through deep learning architectures, which are characterized by models comprised of many layers. Such a model allows learning numerous features from the data as it flows through the layers. One of the most common models is the Convolutional Neural Network, which applies a series of convolutional kernels to the data, each of them responsible for learning different information. The dOPF follows the same idea by learning multiple representations, each of them the outcome of a clustering process at a different layer.

In the context of Bag-of-Visual-Words using dOPF, the final bag could be a coarser model if only the outcome of the last layer were used to compose it (i.e., only the prototypes of the last layer comprise the bag), as proposed by Afonso et al. [21]. However, an enriched model could be accomplished by adding information computed by the intermediate layers as well. As a comparison, the idea of using intermediate representations is similar to using the features learned by the many hidden layers of a deep-learning model. Each layer can learn more complex features and, therefore, more robust representations.

As mentioned earlier, we propose to extend the work of Afonso et al. [18] by employing hierarchical learning in the context of BoVW, hereinafter called hOPF (hierarchical OPF). The proposed approach provides a more complex and more robust dictionary, being such representation the collection of selected visual words computed by all layers. Figure 2 illustrates both dictionary learning methods, i.e., dOPF and hOPF.
Figure 2: The main difference between (a) dOPF and (b) hOPF concerns the usage (or not) of features learned in the intermediate layers.
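The two dictionary policies reduce to a one-line difference, sketched below with hypothetical prototype sets S_1, S_2, S_3 (the function names are ours):

```python
# dOPF keeps only the words (prototypes) of the last layer;
# hOPF concatenates the words selected by every layer.
def dopf_dictionary(layer_words):
    return list(layer_words[-1])                        # size |S_l|

def hopf_dictionary(layer_words):
    return [w for layer in layer_words for w in layer]  # size |S_1| + ... + |S_l|

# hypothetical visual words found at three layers
layers = [["w1", "w2", "w3", "w4"], ["w2", "w4"], ["w4"]]
small = dopf_dictionary(layers)    # coarse dictionary, 1 word
big = hopf_dictionary(layers)      # enriched dictionary, 4 + 2 + 1 = 7 words
```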
Although dOPF provides a simpler and coarser representation by using only the features learned in the last layer, hOPF outputs a more complex and robust representation that stands for the concatenation of the features learned by all layers. In the context of BoVW, the resulting dictionary generated by hOPF will be of size |S_1| + |S_2| + ... + |S_l|.

4. Proposed Approach

This section describes the steps employed in the assessment of dOPF and hOPF as visual dictionary learning methods for BoVW in the context of automatic Parkinson's disease identification, as illustrated in Figure 3. The workflow indicated by the light blue arrow concerns the training phase. The first step computes the local descriptors from the training signals for further clustering. The most representative samples from each cluster compose the dictionary, which is used for quantization (i.e., the flow indicated by the purple arrow) of both training and testing signals. The outcome of such a process is the new representation of each sample. Similarly, testing signals have their local descriptors extracted and quantized (i.e., the flow indicated by the yellow arrows). The final step is to perform training and classification using the newly computed representations.
Figure 3: Proposed approach based on BoVW and dOPF for computer-aided PD diagnosis.
4.1. Data acquisition

The experimental data were collected from a series of tasks performed by individuals using a smart pen. The tasks exercise different hand movements that enable capturing the handwriting dynamics for further analysis. Furthermore, the exercises were elaborated in such a way that they are supposed not to be trivial to PD patients. All hand motion is captured by the smart pen, which contains sensors that provide information on finger grip, the axial pressure of the ink refill, tilt, and acceleration in the x, y, and z directions.

Figure 4 illustrates the set of six tasks employed to evaluate the hand movements and to support the detection of anomalies. The set of six tasks stands for a sample, and an individual may have more than one sample assigned to it (i.e., the individual had more than one appointment). In the first task (exam (a) in Figure 4), the individual is asked to draw a circle 12 times continuously. In the second task, the individual performs the circle-drawing movement in the air 12 times continuously (exam (b) in Figure 4). The third and fourth tasks also concern drawing activities. In exam (c) in Figure 4, four spirals are drawn over a guideline from the inner to the outer part. Exam (d) in Figure 4 comprises the drawing of a meander, also four times and from the inner to the outer part. Last but not least, the fifth and sixth tasks are known as the diadochokinesis test and are used to evaluate the wrist movement of the right and left hands.

Figure 4: Tasks performed for the assessment of hand movements.

4.2. Local descriptor extraction

The local descriptors are extracted from the recorded signals in a sliding-window fashion that goes through each of the six signals. The descriptors are computed using a single-level Discrete Wavelet Transform (DWT) applied to each segment delimited by the sliding window.
Each time segment of the signal is in fact represented by the concatenation of the resulting DWT coefficients from six sliding windows (i.e., one sliding window applied to each signal), as depicted in Figure 5. Notice that all sliding windows comprise the same portion of time (i.e., the same initial and final times) as they go through the signals, and the DWT is computed independently for each window. Moreover, the window length and shift are user-defined. The experiments used windows of 150 ms in length and a stride of 100 ms, which showed the best results when compared with a window of size 100 ms and stride of 50 ms, and a window of size 200 ms and stride of 150 ms. Moreover, each segment is represented by a descriptor, and the longer the signal, the higher the number of descriptors representing the input signal.
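The descriptor-extraction step can be sketched as follows, under stated assumptions: the paper does not name the mother wavelet, so a Haar DWT stands in for it, and the window/stride are given in samples rather than milliseconds; all names are illustrative:

```python
# Sketch: time-aligned sliding windows over six channels, single-level DWT per
# window, concatenated into one descriptor per time segment.
import math

def haar_dwt(segment):
    """Single-level Haar DWT: approximation coefficients followed by details."""
    approx, detail = [], []
    for i in range(0, len(segment) - 1, 2):
        a, b = segment[i], segment[i + 1]
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx + detail

def extract_descriptors(signals, window, stride):
    """One descriptor per window position: the DWTs of all channels concatenated."""
    length = min(len(s) for s in signals)
    descriptors = []
    for start in range(0, length - window + 1, stride):
        desc = []
        for s in signals:               # same initial/final time on every channel
            desc.extend(haar_dwt(s[start:start + window]))
        descriptors.append(desc)
    return descriptors

# six toy channels of 12 samples; window=6, stride=4 gives 2 window positions
sigs = [[float(i + c) for i in range(12)] for c in range(6)]
descs = extract_descriptors(sigs, window=6, stride=4)
```

Each descriptor here is 6 channels x 6 DWT coefficients = 36-dimensional; longer signals simply yield more descriptors, as stated above.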
The dictionary is formulated in a straightforward way by selecting the most representative "words" (descriptors) among the set computed in the previous step, and it is further used to compute a new sample representation. The most representative words are usually selected by a clustering algorithm, where each centroid becomes a "word" of the dictionary. Therefore, the dictionary size is defined by the number of clusters. Since it has some impact on the accuracy rate, it is common to use different sizes for the dictionary to balance computational cost and accuracy. As mentioned in Section 2.1, the prototypes are very suitable to represent the samples of their trees (i.e., prototypes are equivalent to cluster centroids), thus being good representatives to compose the dictionaries. This work employs the Optimum-Path Forest clustering algorithm to select the most representative words, i.e., the prototypes.

Figure 5: Local feature extraction.
A signal can be represented by its set of descriptors, which can range from dozens to thousands (i.e., a descriptor can be computed for each segment of the original signal, and the number of segments varies according to its length and the stride). However, a few of these descriptors might be variations of one another or only represent noisy information. Moreover, machine learning techniques cannot be directly applied to the sets of descriptors, since their dimension is not the same for all signals. Therefore, quantization is performed so that signals can be mapped into the same feature space. The outcome of the process is a histogram whose length equals the size of the previously formulated dictionary, where each bin stores the frequency of its closest word (descriptor) in the input signal. Finally, any machine learning technique can be applied for classification purposes using the histograms as input.
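The quantization step described above can be sketched in a few lines (the names `quantize` and the L1 distance are illustrative choices, not the paper's):

```python
# Sketch: map a variable-length descriptor set to a fixed-length word histogram
# by assigning each descriptor to its nearest dictionary word.
def quantize(descriptors, dictionary, dist):
    hist = [0] * len(dictionary)
    for d in descriptors:
        nearest = min(range(len(dictionary)), key=lambda i: dist(d, dictionary[i]))
        hist[nearest] += 1
    return hist

l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
words = [[0.0, 0.0], [10.0, 10.0]]          # a two-word toy dictionary
h = quantize([[0.1, 0.2], [9.8, 10.1], [10.0, 9.9]], words, l1)
# -> [1, 2]: one descriptor near the first word, two near the second
```

Because every signal is reduced to a histogram of the same length as the dictionary, samples of different durations end up in a common feature space.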
5. Experiments and Results

In this section, we provide details concerning the experiments carried out on the assessment of deep-based dictionaries in the context of automatic Parkinson's disease identification. The experiments were divided into two parts: (i) the first one evaluated and compared dOPF-based dictionaries against the traditional OPF-based bags and the traditional BoVW method that computes the bags using the well-known k-means (Section 5.1); and (ii) the second part provides a comparison of the proposed approach (i.e., hOPF) with the method presented in the work of Afonso et al. [21] (Section 5.2). Additionally, the second set of experiments includes the hOPF performance evaluation using compressed versions of the learned representations. For that purpose, we applied the Restricted Boltzmann Machine (RBM) [28] to provide different compression levels. Notice that both experiments used data collected from 66 exams (35 healthy individuals and 31 PD patients), and the output of the protocol discussed in the previous section results in six different datasets, one for each task. The following sections describe the particularities of each experiment and the results using the proposed methodology.

5.1. dOPF-based dictionaries

This experiment aimed at evaluating the clustering quality of dOPF (our implementation is available at https://github.com/jppbsi/LibOPF), k-means, and OPF through the accuracy rate obtained in the classification phase. The dOPF used in this work comprises an architecture with four layers, with the values of k_max empirically set as follows: 100 for the first layer, 1% of the number of clusters computed in the previous layer for the second layer, and 10% of the number of clusters computed in their respective antecessor layers for the third and fourth layers. The parameter k for k-means was always set as the number of clusters found by the fourth (last) layer of the dOPF approach
to allow a fair comparison. Regarding the OPF algorithm, the values for k_max were empirically set as 2,500 for the Spiral and Meander datasets, and as 1,… for the remaining ones.

Table 1: Number of descriptors extracted from the training set and number of words computed by each technique.

dataset (task)     # descriptors   dOPF (layers 1-4)            k-means   OPF
Circ-A (exam a)    18,000          5,682 - 2,584 - 228 - 68     68        693
Circ-B (exam b)    11,898          538 - 376 - 43 - 17          17        33
Spiral (exam c)    46,637          12,118 - 3,951 - 370 - 92    92        1,424
Meander (exam d)   41,094          10,865 - 3,937 - 429 - 99    99        1,591
Dia-A (exam e)     14,608          666 - 480 - 95 - 47          47        80
Dia-B (exam f)     13,947          657 - 394 - 78 - 27          27        70
The clustering quality was assessed under a hold-out procedure with 15 runs, the training and testing sets being randomly partitioned in each run, always with 50% of the dataset each. The classification step employed three classifiers for comparison purposes: the Bayesian Classifier (BC), supervised OPF (sOPF, our implementation is available at https://github.com/LibOPF/LibOPF), and SVM with a Radial Basis Function kernel and fine-tuned parameters (SVM-RBF) [29].

Tables 2a-2f present the mean recognition rates concerning all six exams, the accuracy being computed according to Papa et al. [7], which considers unbalanced datasets. The best results (i.e., bold values) are defined according to the Wilcoxon signed-rank test [30] with a significance of 0.05, which pointed out two [dictionary learner, classifier] pairs as the best, one of them [k-means, BC], with accuracies near 81% and 83%, respectively. Comparing those recognition rates against some previous works [14], dOPF showed significant gains, ranging from 10% to 30%.

Table 2: Overall accuracies for the (a) Circ-A, (b) Circ-B, (c) Spiral, (d) Meander, (e) Dia-A, and (f) Dia-B datasets. (Rows: dOPF, k-means, OPF; columns: BC, sOPF, SVM-RBF.)

Concerning the best accuracies regarding each exam, dOPF obtained very suitable results, being more accurate than naïve OPF in most cases. Supervised OPF obtained good results as well, but SVM-RBF achieved the best recognition rates in a few more situations. Additionally, we also evaluated the accuracy per class for all situations, as presented in Tables 3-8, whose best results are also highlighted considering the Wilcoxon signed-rank test. The best results for each class are in bold, and the best among all datasets is underlined. Actually, the main improvement concerns the accuracy for the identification of healthy individuals, since Pereira et al. [15] obtained recognition rates near 50% over the Meander and Spiral datasets for the control class. The dOPF increased not only the global accuracy with respect to the work by Pereira et al. [15], but also the specificity and sensitivity in most cases. Also, the Circ-A dataset provided two out of the five best results, thus showing itself as a good alternative for Parkinson's disease identification.
Tables 3-8: Average accuracy rate per class (HC and PD) on the Circ-A, Circ-B, Spiral, Meander, Dia-A, and Dia-B datasets, respectively. (Rows: dOPF, k-means, OPF; columns: BC, sOPF, and SVM-RBF, each split into HC and PD.)

Table 9 presents the mean computational load required by each technique for dictionary learning. Notice that the computational burden for dOPF considers the four layers. In this context, k-means figured as the fastest one due to its simplicity. If one considers dOPF and OPF only, we can observe that the former is about 78 times faster on the Circ-B dataset, which is quite effective. The lowest gains can be observed on both the Meander and Spiral datasets. The small differences come from the fact that the value used for k_max in both situations is small, thus justifying the fact that the dictionaries computed on these datasets have a very high dimension when compared to the others.

Table 9: Dictionary learning computational load [s] required by each technique.

dataset (task)   dOPF        k-means   OPF
Circ-A (a)       968.167     37.008    49,087.137
Circ-B (b)       419.498     13.113    32,777.539
Spiral (c)       6,063.205   239.859   6,643.906
Meander (d)      5,003.233   208.443   5,168.819
Dia-A (e)        613.109     19.878    41,189.133
Dia-B (f)        569.053     11.025    39,367.844

5.2. Multi-scale deep-based representations
As aforementioned, this round of experiments aimed at providing a performance comparison between dOPF and hOPF. To fulfill that purpose, the quality of the dictionaries provided by both techniques was compared using the protocol described in Section 5.1. As more visual words are added, hOPF dictionaries provide higher-dimensional representations. Hence, an additional experiment evaluated the quality of compressed representations computed by an RBM. Representation sizes of 25% (hOPF-25), 50% (hOPF-50), and 75% (hOPF-75) of the original one (hOPF) were used. Figure 6 illustrates the workflow of representation compression.
Figure 6: The Deep OPF block represents the workflow depicted in Figure 3. The outcome of such a process is used as input to an RBM that outputs a compressed representation used by the classifiers.
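The compression idea can be sketched with a minimal Bernoulli RBM trained by one-step contrastive divergence (CD-1). This is purely illustrative: the class name, sizes, and hyperparameters below are ours, and the paper's actual RBM setup may differ:

```python
# Sketch: an RBM whose hidden-unit activations serve as a compressed code for
# a (binarized) BoVW histogram, e.g. 8 visible units -> 2 hidden units (25%).
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    def __init__(self, n_vis, n_hid, seed=0):
        rnd = random.Random(seed)
        self.W = [[rnd.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
        self.b = [0.0] * n_vis        # visible biases
        self.c = [0.0] * n_hid        # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def train(self, data, epochs=20, lr=0.1, seed=1):
        """CD-1: one Gibbs step from the data, then a gradient-like update."""
        rnd = random.Random(seed)
        for _ in range(epochs):
            for v0 in data:
                ph0 = self.hidden_probs(v0)
                h0 = [1.0 if rnd.random() < p else 0.0 for p in ph0]
                v1 = self.visible_probs(h0)
                ph1 = self.hidden_probs(v1)
                for i in range(len(v0)):
                    for j in range(len(ph0)):
                        self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                for i in range(len(v0)):
                    self.b[i] += lr * (v0[i] - v1[i])
                for j in range(len(ph0)):
                    self.c[j] += lr * (ph0[j] - ph1[j])

rbm = TinyRBM(n_vis=8, n_hid=2)                 # 8-d input -> 2-d code
data = [[1, 0, 1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1, 0, 1]]
rbm.train(data)
codes = [rbm.hidden_probs(v) for v in data]     # compressed representations
```

The hidden-unit probabilities play the role of the hOPF-25/50/75 representations above: a fixed-size, lower-dimensional code fed to the classifiers.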
Tables 10-15 provide the overall accuracy rates concerning each dataset. The accuracy rate was computed using the same formulation as in Section 5.1, and the best results (i.e., bolded ones) were determined by the Wilcoxon signed-rank test with a significance of 0.05, with improvements ranging up to 8.83% (SVM-RBF).

An interesting aspect to be highlighted is the fact that the compressed representations computed by the RBM also figured among the best results, even the most compressed ones (hOPF-25). The representations hOPF-50 and hOPF-75 achieved the best performance among the compressed versions, with best results in 11 out of 18 scenarios against 6 out of 18 for hOPF-25.

Concerning the classifiers employed in this work, a situation similar to the one illustrated in Section 5.1 can be observed. The classifiers obtained good results, with SVM-RBF being statistically the best technique in most cases. The highest accuracy among all datasets was achieved by the pair [hOPF, SVM-RBF], with about 85%.

Tables 10-15: Overall accuracies for the Circ-A, Circ-B, Spiral, Meander, Dia-A, and Dia-B datasets, respectively. (Rows: BC, sOPF, SVM-RBF; columns: dOPF, hOPF, hOPF-25, hOPF-50, hOPF-75.)

We also investigated the accuracy rates in each class, as shown in Tables 16-21. The representations learned by the hierarchical approach also figured among the best results in many situations. Once again, the more significant improvements (i.e., compared to dOPF) can be observed in the HC class in almost all scenarios, such as the ones in the Circ-B and Dia-B datasets (i.e., the greatest ones).
Compressed representations also presented competitive results, especially the most compressed one (hOPF-25), as one can observe in the Circ-A and Circ-B datasets for the HC class, and in the Dia-A dataset for the PD class.
Tables 16-21: Average accuracy rate for each class (HC and PD) on the Circ-A, Circ-B, Spiral, Meander, Dia-A, and Dia-B datasets, respectively. (Rows: dOPF, hOPF, hOPF-25, hOPF-50, hOPF-75; columns: BC, sOPF, and SVM-RBF, each split into HC and PD.)

Since the difference between dOPF and hOPF relies only on whether the visual words selected in the intermediate layers are used in the final dictionary, the computational load for dictionary learning is the same for both.
6. Conclusion and Future Work
This work introduced a hierarchical-learning approach using the Deep Optimum-Path Forest to design visual dictionaries. The proposed approach was assessed and compared against a previous approach proposed by Afonso et al. [18] in the context of Parkinson's disease identification. The experiments used six datasets derived from signal data collected while individuals were submitted to a handwriting exam. The exam comprises tasks supposed not to be trivial to Parkinson's disease patients, and the usage of signals allows the detection of subtle variations.

The main contributions of this work rely on the introduction of the proposed approach itself, its application in the context of automatic PD detection, and the usage of the Restricted Boltzmann Machine for data compression. Experimental results showed the potential of hierarchical-learning approaches, where interesting results were achieved. A general analysis pointed to improvements in most scenarios, and the proposed approach always figured among the best results. An in-depth investigation showed a more considerable improvement in accuracy for the healthy-individuals class in most scenarios.

With respect to the compressed representations, the RBM provided good models and achieved very interesting results (it either outperformed or was statistically similar to the original-sized representation and dOPF) in 12 out of 18 configurations (i.e., pairs [dictionary learner, classifier]) for the HC class, and 10 out of 18 configurations for the PD class. Regarding future work, we aim to study different ways to create hierarchical representations other than concatenation.
Acknowledgments
The authors are grateful to FAPESP grants
References

[1] M. Nye, A. Saxe, Are efficient deep representations learnable?, in: Proceedings of the International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=B1HI4FyvM