Nonlinear Dynamic Field Embedding: On Hyperspectral Scene Visualization
Dalton Lunga and Okan Ersoy

D. Lunga is with the Department of Electrical and Computer Engineering, Purdue University. He is also affiliated with the CSIR-Meraka Institute, Brummeria, Pretoria, South Africa (e-mail: dlunga@purdue.edu). O. Ersoy is with the Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-0501, USA (e-mail: ersoy@purdue.edu).
Abstract
Graph embedding techniques are useful for characterizing spectral signature relations in hyperspectral images. However, such images consist of disjoint classes due to spatial details that are often ignored by existing graph computing tools. Robust parameter estimation is a challenge for the kernel functions that compute such graphs, and finding a corresponding high-quality coordinate system to map signature relations remains an open research question. We address these challenges by first proposing a kernel function that combines spatial and spectral information in computing neighborhood graphs. Secondly, the study exploits the force field interpretation from mechanics to devise a unifying nonlinear graph embedding framework. The generalized framework leads to novel unsupervised multidimensional artificial field embedding techniques that rely on the simple additive assumption of pair-dependent attraction and repulsion functions. The formulations capture long-range and short-range distance-related effects often associated with living organisms and help to establish algorithmic properties that mimic mutual behavior for the purpose of dimensionality reduction. The main benefits of the proposed models include the ability to preserve the local topology of the data and to produce quality visualizations, i.e., maintaining disjoint meaningful neighborhoods. As part of the evaluation, visualization, gradient field trajectory, and semisupervised classification experiments are conducted for image scenes acquired by multiple sensors at various spatial resolutions over different types of objects. The results demonstrate the superiority of the proposed embedding framework over various widely used methods.
I. INTRODUCTION
Nonlinear dimensionality reduction has emerged as a key preprocessing step for extracting and visualizing regular structures within complex data sets. Prior to its popularity, easy-to-implement linear methods spanned decades of research on this task, dating back to principal component analysis (PCA) [1]–[3] and the classical scaling or multidimensional scaling (MDS) technique [4], and more recently the local Fisher discriminant analysis (LFDA) [5] and semi-supervised local Fisher discriminant analysis (SELF) [6]. With modern technology enabling the capability to gather and combine data from various sensing mechanisms, linear dimensionality reduction techniques have met their limitations in seeking meaningful structures from complex data [7]. For example, hyperspectral sensors have enabled the acquisition of greater detail about objects on the earth's surface, which poses a challenge for linear dimensionality reduction techniques. Feature extraction in such data sets can be accomplished by employing nonlinear methods such as maximum variance unfolding (MVU) [8], a method that computes maximum variance embedding maps subject to preserving local distances; locally linear embedding
(LLE) [9], a method that represents the relations of each neighborhood by linear coefficients that best reconstruct each data point from its neighbors; and Laplacian eigenmaps (LE) [10], which draws on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation to devise a geometrically motivated algorithm for constructing a representation for data sampled from a low $m$-dimensional manifold embedded in a higher $d$-dimensional space.

Further widely popular methods include the generalized MDS [11], [12], which extends the properties of classical scaling to nonlinear embedding; the self-organizing feature map (SOFM) [13], an unsupervised artificial neural network for learning low-dimensional maps; and graph embedding and its extensions for dimensionality reduction [14]. Probabilistic approaches include stochastic neighbor embedding (SNE) [15], a method that represents each object by a mixture of widely separated low dimensional factors capturing some of the local structure, and establishes global formations of clusters of similar maps. With improved modeling assumptions, variants of SNE have led to high quality embedding visualizations, e.g., the Student t-distribution based stochastic neighbor embedding (tSNE) [16], the elastic embedding algorithm [17], and the spherical stochastic neighbor embedding (sSNE) [18]. In most cases, these approaches rely on high dimensional neighborhood graphs to capture the geometrical relations of observations and use the graph weights as inputs for embedding purposes. The techniques are formulated under metric measures that include geodesic distances and, for others, non-metric probability measures including Kullback-Leibler divergences.

Notwithstanding individual differences in efficiency and their specific data applications, nonlinear dimensionality reduction methods share some common benefits, including better compression, better visualization, and reduction of classifier input features. Each embedding technique represents an attempt to search for a coordinate representation that resides on the data manifold. Recent comparative studies conducted on various nonlinear embedding techniques strongly indicated the inability of existing manifold learning methods to handle the disjoint class structures that exist in hyperspectral data [19]. In addition, many dimensionality reduction algorithms suffer from what is known in the machine learning community as the crowding problem [16]. The crowding problem can be described as a tendency of embedding techniques to collapse maps towards the center of the embedding space, resulting in increased class overlaps. Under such a phenomenon, many embedding algorithms fail to establish discriminative boundaries between structures of different classes. In this study, several functional forms are proposed to address this challenge within a unifying framework for developing nonlinear dimensionality reduction algorithms, with benefits that span visualization, compression, and classification tasks.

While incorporating some benefits from existing methods, the study provides a unified multidimensional artificial force embedding (MAFE) framework for nonlinear dimensionality reduction with new perspectives and high quality embedding functional forms. MAFE models are based on seeking a minimum energy configuration state of a high dimensional neighborhood graph in a lower dimensional space.
Underlying the framework is the premise that, given a predefined graph structure, the optimal embedding coordinates should generate a corresponding low dimensional neighborhood graph that preserves the high dimensional pairwise relations. The embedding framework draws on the force field interpretation to suggest insightful design properties that lead to pair-dependent attraction and repulsion functions for composing novel potential embedding fields. The functions are superposed to generate an odd function that enforces the pairwise interaction fields in the embedding space. The generated interaction is such that at longer range the attraction force, which pulls similar maps towards each other, dominates, while the repulsion force dominates at short range distances. Under such an environment, all maps experience a pushing and a pulling force from their relative neighbors due to the field generated by the potential functions. It is important to observe that the attraction function emphasizes the pulling together of similar maps, while the repulsion function acts as a barrier that generates a pushing force so that maps are dispersed. Much of the local relations that reveal meaningful structures in optimal embeddings should already have been captured by the similarity kernel function that constructs the neighborhood input graph. As a second contribution, the study combines MAFE models with a local bilateral similarity kernel for mapping spatial and spectral signatures onto the lower dimensional space. This is demonstrated by embedding real world remote sensing images that are characterized by high spectral resolution pixels with several hundred channels. Having many channels and nonlinearities in each pixel poses several challenges to conventional land cover methods, since visually different objects tend to exhibit overlapping or similar spectral signatures.

The paper is structured as follows. A review of the force field formulations is introduced in Section II. A general formulation of the multidimensional artificial field embedding (MAFE) framework is presented in Section III. In Section IV, we establish MAFE connections to existing popular nonlinear embedding techniques. In Section V, using the MAFE framework, we propose two novel techniques: the multidimensional artificial field embedding with bounded repulsion (MAFE-BR), and the multidimensional artificial field embedding with unbounded repulsion (MAFE-UR). In Section VIII, we describe the experimental setup and results on various data sets. A discussion of experiments and future work is presented in Section IX. Finally, we present conclusions in Section X.

II. FORCE FIELD MOTIVATION
Force field interpretations have a long history relating to the study of nature. In biology, researchers have long studied nature and discovered that populations often appear in patterns of aggregation, such as flocks of birds, schools of fish, and herds of mammals. Biological models that use forces between individuals analogous to physical forces have since been developed. The study in [20] is an example of early work that proposed the idea of mutual interactions between individuals, composed of attractions and repulsions, with the goal of maintaining the group as a stable mass. This idea was developed further in [21] by discussing the possibility of modeling forces between individual fish based on classical gravitation and electromagnetism. In a later paper [22], the author considered inverse power laws to model the repulsions and attractions between individuals, with repulsion stronger at short inter-individual distances. The work in [22] was compared to actual data collected from schools of fish to obtain model parameters. The model was based on a simple constant attraction and a repulsion inversely proportional to the square of the inter-individual distance. The same laws governing animal behavior have been observed to generalize to guiding artificial systems in carrying out complex tasks by relying only on local interactions. In fact, in individual-based frameworks, the basic rule is that aggregation behavior is the result of an interplay between a short-ranged repulsion and a long-ranged attraction between the individuals. Such intuition has been applied with success in the control of multi-agent systems, stability analysis of social foraging swarms [23], and robotic motion planning [24]. Most biology studies [20]–[22], [25], [26] and control engineering studies [23], [24], [27] seek to address questions that relate to maintaining a group as a stable mass, stability analysis, group cohesion, and obstacle avoidance. Notwithstanding the growing recognition of the importance of force field formulations in those areas, their role in the allied area of signal and image processing remains to be fully appreciated. This study is concerned with drawing on the collective dynamics of aggregation behavior to devise algorithms that establish manifold formations given a predefined structure, i.e., the formation of multiple manifolds given a high dimensional neighborhood graph as a constraint. As such, we focus on the application of the force field intuition to the problem of dimensionality reduction and manifold learning based on a graph embedding optimization framework.
III. DYNAMIC GRAPH EMBEDDING FORMULATION
Let $G = (E, V)$ be a finite undirected graph with vertex set $V$, edge set $E$, and no self loops. The elements of $E$ are designated as ideal springs. Furthermore, let $S = \{(w_{ij}, k_{ij})\}$ be the spring properties between each vertex $i$ and $j$ for all $\{(v_i, v_j)\} \in E$, where $w_{ij}$ is the normalized or unnormalized length without compression or extension computed for each observed pair in $Y = \{y_1, y_2, \cdots, y_N\}$, where $y_i \in \mathbb{R}^d$, and $k_{ij} = 1$ is the corresponding force constant. Let $G_S$ be a neighborhood spring graph, where $S$ denotes the spring relation. An embedding of $G_S$ is an assignment of vertices into an $m$-dimensional Euclidean space $\mathbb{R}^m$. Let $Z = \{z_1, z_2, \cdots, z_N\}^T$ be the assigned embedding of $G_S$, where $z_i \in \mathbb{R}^m$ is the position of vertex $i$'s map. When framed as a graph embedding task, where on each vertex the approach imagines a particle in motion and each edge weight is dictated by spring force laws, the corresponding problem of dimensionality reduction simply becomes that of establishing a minimum energy configuration governed by the structure in $W = [w_{ij}]$. The modeling assumptions are such that the configuration yields maps that preserve the pairwise relations characterized by the neighborhood graph $G_S$. Finding such a mapping is at the heart of every dimensionality reduction model, and it is the subject discussed shortly.

A mechanics interpretation of the graph embedding framework is presented as follows. Imagine the existence of a particle on every $z_i \in Z$ that is moving with the velocity of $Z$'s centroid. With a change of notation to denote the embedding positions as a state of a graph, let $\mathcal{Z} = [z_1^T, z_2^T, \cdots, z_N^T]^T$ be a long vector in $\mathbb{R}^{Nm}$. Thus we consider only the motion dynamics of individual maps, not the motion of the group. The approach assumes that all individual maps move simultaneously and that each map $i$ is aware of the positions of the other vertices (maps), as well as the strength of the corresponding force that defines each edge weight in the graph. The positions $z_i$ of individuals relative to the group centroid can change through rearrangements informed by pair-dependent force field interactions. Assuming such motion changes in continuous time, the velocity, as determined by the effect of group members on each vertex $i$ at position $z_i$, is described by

$$
\dot{z}_i = \sum_{j=1, j \neq i}^{N} F_{ij}(z_i - z_j), \quad i = 1, \cdots, N \tag{1}
$$

where $F_{ij}(z_i - z_j) = (z_i - z_j)\{F_r^{ij}(\|z_i - z_j\|) - F_a^{ij}(\|z_i - z_j\|)\}$ describes pairwise symmetric interactions between the $i$th and $j$th maps. Symmetry of the function follows from the fact that if map $i$ is attracted to map $j$, then $j$ is attracted to $i$. $F_r^{ij} : \mathbb{R}^+ \to \mathbb{R}^+$ denotes the magnitude of the repulsion term, whereas $F_a^{ij} : \mathbb{R}^+ \to \mathbb{R}^+$ represents the magnitude of the attraction term. The superposition of these two terms defines an interactive function that is the basis for the MAFE framework. Model insights and properties governing the choice of this function are presented in the following sections.

A. Force Field Function Properties
Artificial force field formulations assume that at large distances the attraction function dominates, and that at short distances the repulsion function dominates, while in between there is a unique distance at which both terms balance, defining a central path in a manner similar to the barrier method [28]. The choice of a suitable force field embedding interaction function $F_{ij}(z_i - z_j)$ is guided by the following properties:
1) There is a pair-equilibrium distance $\epsilon_{ij}$ at which $F_r^{ij}(\epsilon_{ij}) = F_a^{ij}(\epsilon_{ij})$; otherwise $F_a^{ij}(\|z_i - z_j\|) > F_r^{ij}(\|z_i - z_j\|)$ for $\|z_i - z_j\| > \epsilon_{ij}$, and $F_a^{ij}(\|z_i - z_j\|) < F_r^{ij}(\|z_i - z_j\|)$ for $\|z_i - z_j\| < \epsilon_{ij}$.

2) $F_{ij}$ is an odd function, i.e., $F_{ij}(-(z_i - z_j)) = -F_{ij}(z_i - z_j)$, and is therefore symmetric with respect to the origin.

3) There exist pair-dependent functions $U_{att}^{ij} : \mathbb{R}^+ \to \mathbb{R}^+$ and $U_{rep}^{ij} : \mathbb{R}^+ \to \mathbb{R}^+$ such that

$$
\nabla_{z_i}\, w_{ij} U_{att}^{ij}(\|z_i - z_j\|) = F_a^{ij}(\|z_i - z_j\|)(z_i - z_j)
$$
$$
\nabla_{z_i}\, U_{rep}^{ij}(\|z_i - z_j\|) = F_r^{ij}(\|z_i - z_j\|)(z_i - z_j)
$$

$U_{att}^{ij}$ and $U_{rep}^{ij}$ are viewed as artificial attraction and repulsion potential energy functions. The combined term $(z_i - z_j)F_r^{ij}(\|z_i - z_j\|)$ represents the actual repulsion effect, whereas the term $-(z_i - z_j)F_a^{ij}(\|z_i - z_j\|)$ represents the actual attraction effect. The vector $(z_i - z_j)$ establishes the alignment along which the attraction and repulsion interaction forces act in opposing directions. These functions describe the reactive approach taken by potential fields, in which the trajectories of particle motion are not planned explicitly. Instead, the interaction of every map with its neighbors is a superposition of fields that enables its position to cope with the changing environment of other maps. The motion dynamics can be rewritten to reflect the resultant forces on each individual map as

$$
\dot{z}_i = -\sum_{j=1, j \neq i}^{N}\left\{\nabla_{z_i}\, w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - \nabla_{z_i}\, U_{rep}^{ij}(\|z_i - z_j\|)\right\}
$$

The assumption that each map moves along the negative gradient has an implication: to achieve a minimum-energy configuration of the graph $G_S$, the attraction and repulsion potential functions should be chosen such that the minimum of $U_{att}^{ij}(\|z_i - z_j\|)$ occurs on or around $\|z_i - z_j\| = 0$, whereas the minimum of $-U_{rep}^{ij}(\|z_i - z_j\|)$ (or maximum of $U_{rep}^{ij}(\|z_i - z_j\|)$) occurs as $\|z_i - z_j\| \to \infty$, and such that the minimum of the combination $U_{att}^{ij}(\|z_i - z_j\|) - U_{rep}^{ij}(\|z_i - z_j\|)$ occurs at $\|z_i - z_j\| = \epsilon_{ij}$, thus defining the stationary state of motion that exists between pairs $i$ and $j$.

Using the above framework, the reactive potential that is effective on each individual map $i$ can be represented as

$$
U_i(\mathcal{Z}) = \sum_{j=1, j \neq i}^{N}\left\{w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - U_{rep}^{ij}(\|z_i - z_j\|)\right\} \tag{2}
$$

while the total superposed potential function on the neighborhood graph $G_S$ is defined by

$$
U(\mathcal{Z}) = \sum_{i=1}^{N} U_i(\mathcal{Z}) \tag{3}
$$

Letting $\Omega$ be the set of attraction and repulsion functions $F_{ij}(\cdot)$ satisfying the embedding field properties, new embedding models can be derived by solving the following general optimization problem:

$$
\mathcal{Z}^\star = \operatorname*{argmin}_{\mathcal{Z} \in \mathbb{R}^{Nm}} U(\mathcal{Z}) \tag{4}
$$

where $\mathcal{Z}^\star$ describes the minimum-energy configuration state of $G_S$ in the lower dimensional space. With some parameter adjustments on $F_{ij}(\cdot)$, the embedding maps will converge to a minimum-energy configuration that yields the optimal maps.
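To make this construction concrete, the following minimal Python sketch evaluates the force-field motion of equation (1) and descends toward the configuration of equation (4) for arbitrary pair-dependent attraction and repulsion magnitudes. The helper names, the constant-attraction and Gaussian-repulsion defaults, and the step-size schedule are our own illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def mafe_forces(Z, W, F_att, F_rep):
    """Force-field motion (eq. 1): dz_i/dt for every map z_i.

    Z : (N, m) current embedding coordinates.
    W : (N, N) symmetric high-dimensional graph weights w_ij.
    F_att, F_rep : functions of the pairwise distance returning the
        attraction / repulsion magnitudes F_a^ij and F_r^ij.
    """
    diff = Z[:, None, :] - Z[None, :, :]          # z_i - z_j, shape (N, N, m)
    dist = np.linalg.norm(diff, axis=2) + 1e-12   # ||z_i - z_j||, avoid division by 0
    # pairwise magnitude: repulsion minus (graph-weighted) attraction
    mag = F_rep(dist) - W * F_att(dist)
    np.fill_diagonal(mag, 0.0)                    # no self interaction
    return (mag[:, :, None] * diff).sum(axis=1)   # dz_i/dt = sum_j F_ij(z_i - z_j)

def gradient_descent_embedding(W, m=2, steps=500, lr=0.05,
                               F_att=lambda d: np.ones_like(d),   # constant pull (quadratic well)
                               F_rep=lambda d: np.exp(-d**2)):    # bounded short-range push
    """Move each map along the force field, i.e. along the negative gradient of U(Z)."""
    rng = np.random.default_rng(0)
    Z = rng.normal(scale=1e-2, size=(W.shape[0], m))
    for _ in range(steps):
        Z = Z + lr * mafe_forces(W=W, Z=Z, F_att=F_att, F_rep=F_rep)
    return Z
```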
Such an embedding framework can be adapted as a general platform for developing new nonlinear dimensionality reduction algorithms, but first we establish its links to existing popular embedding techniques in the following section.

IV. CONNECTIONS TO EXISTING METHODS
Many nonlinear techniques have been proposed for the tasks of visualization and dimensionality reduction. Despite their differences, they are applied in various fields as preprocessing building blocks for compression, visualization, and classification tasks. A basic question that we ask is whether some of the existing methods can be derived as special cases of the MAFE framework. A positive answer to this question is presented together with reformulations of popular techniques, including stochastic neighbor embedding (SNE) [15], Student-t stochastic neighbor embedding [16], Laplacian eigenmaps (LE) [10], Cauchy graph embedding [29], and spherical stochastic neighbor embedding (sSNE) [18]. Such interpretations show that MAFE is a unifying framework that not only inherits some algorithmic benefits of various techniques, but also provides extended functional properties combined with strong intuitive insights for creating new algorithms.
A. Stochastic Neighbor Embedding
Stochastic neighbor embedding [15] is a method for preserving probabilities on lower dimensional manifolds that are nonlinear. SNE assumes that edge weights are asymmetric probabilities $w_{ij}$ (i.e., $w_{ij} \neq w_{ji}$) of pairs of vertices being neighbors in the higher dimensional space. However, our presentation focuses on the symmetric version where $w_{ij} = w_{ji}$ for all pairs of vertices. The high dimensional edge weights are defined using Gaussian functions of the form

$$
w_{ij} = \frac{\exp\{-\|y_i - y_j\|^2/\sigma_i^2\}}{\sum_{r=1, r \neq i}^{N}\exp\{-\|y_r - y_i\|^2/\sigma_i^2\}} \tag{5}
$$

where $\sigma_i$ is computed using a binary search ensuring that the entropy of the distribution $W_i$ is approximately $\log(k)$, with $k$ defining the effective number of neighbors. In the lower dimensional space, a symmetric Gaussian probability $\hat{w}_{ij}$ is assumed between each pair of embedding maps, i.e., the embedding graph weights are computed as

$$
\hat{w}_{ij} = \frac{\exp\{-\|z_i - z_j\|^2\}}{\sum_{r=1, r \neq i}^{N}\exp\{-\|z_r - z_i\|^2\}} \tag{6}
$$

Each $z_i \in \mathbb{R}^m$ is the corresponding lower dimensional map of the observation $y_i \in \mathbb{R}^d$. SNE computes the maps by minimizing a sum of Kullback-Leibler (KL) divergences

$$
\sum_i KL(W_i \,\|\, \hat{W}_i) = \sum_i \sum_{j=1, j \neq i}^{N} w_{ij}\log\left(\frac{w_{ij}}{\hat{w}_{ij}}\right) \tag{7}
$$

The goal of (7) is to minimize the distortion between each of the $N$ high dimensional neighborhood distributions $W_i$ and their corresponding lower dimensional neighborhood distributions $\hat{W}_i$. The results obtained from this approach have demonstrated its superiority over methods that include locally linear embedding (LLE) [9], MDS [4], and Isomap [30]. However, the optimization algorithm is very unstable, which requires many experimentally defined parameters in order to attain meaningful results. A further expansion of (7), while ignoring terms that are not a function of the lower dimensional maps (terms that do not depend on $\hat{w}_{ij}$), much in parallel to the work of [17], reveals the log-sum term as a source of difficulty when computing the gradient and as a term that increases the nonlinearity of the model.

Computing the negative gradient of (7) yields the corresponding MAFE motion dynamics or force field equations of the form

$$
\dot{z}_i^{SNE} = -\nabla_{z_i} U_{SNE} = -\sum_{j=1, j \neq i}^{N} F_{SNE}(z_i - z_j) \tag{8}
$$

where the expression under the summation is defined as

$$
F_{SNE}(z_i - z_j) = (z_i - z_j)\left\{w_{ij} - \frac{\exp\{-\|z_i - z_j\|^2\}}{\sum_{r=1, r \neq i}^{N}\exp\{-\|z_r - z_i\|^2\}}\right\} \tag{9}
$$

Under the MAFE formulation, equation (8) describes the motion state of vertex maps seeking the graph's minimum energy configuration in the lower dimensional space. A simple observation identifies the attractive gradient force field to be $-w_{ij}(z_i - z_j)$ and the repulsion gradient force field to be $(z_i - z_j)\frac{\exp\{-\|z_i - z_j\|^2\}}{\sum_{r=1, r \neq i}^{N}\exp\{-\|z_r - z_i\|^2\}}$. The interpretation of (8) is that, at longer distances and with parameters set carefully, the embedding maps start to form clusters due to the strong attraction force field, while the repulsion force field is negligible. As $\|z_i - z_j\| \to 0$ for each $(i, j)$ pair, the repulsion magnitude dominates the interaction force vector. This causes the maps to push away from each other.
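As a quick illustration of this decomposition, the sketch below gives one possible, non-authoritative reading of equations (8) and (9); the unit bandwidth in the embedding space and the helper name are assumptions made for brevity.

```python
import numpy as np

def sne_force_fields(Z, W):
    """Attraction and repulsion force fields implied by eqs. (8)-(9).

    Z : (N, m) embedding coordinates, W : (N, N) symmetric input weights.
    Returns (attraction, repulsion), each of shape (N, m); the motion of
    map i is dz_i/dt = attraction[i] + repulsion[i].
    """
    diff = Z[:, None, :] - Z[None, :, :]              # z_i - z_j
    sq = (diff ** 2).sum(axis=2)                      # ||z_i - z_j||^2
    aff = np.exp(-sq)
    np.fill_diagonal(aff, 0.0)
    w_hat = aff / aff.sum(axis=1, keepdims=True)      # low-dimensional weights (eq. 6)
    attraction = -(W[:, :, None] * diff).sum(axis=1)  # -sum_j w_ij (z_i - z_j)
    repulsion = (w_hat[:, :, None] * diff).sum(axis=1)
    return attraction, repulsion
```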
The convergence of the algorithm is established when the forces balance, much in the same way as described for the general MAFE framework.

B. t-Stochastic Neighbor Embedding

t-Stochastic neighbor embedding [16] is similar to SNE, except that the lower dimensional maps are assumed to be better modeled by a Student t-distribution with one degree of freedom. This simple modification leads to a considerable improvement of results over SNE. The improvement is due to the pair-dependent inverse distance relation introduced by the Student t-distribution. Using this distribution, the joint embedding probabilities $\hat{w}_{ij}$ are defined by

$$
\hat{w}_{ij} = \frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{r=1, r \neq i}^{N}(1 + \|z_r - z_i\|^2)^{-1}} \tag{10}
$$

In tSNE, the cost function is also based on the sum of Kullback-Leibler (KL) divergences (the same as in (7)). The corresponding negative gradient of the expanded tSNE cost function gives the following MAFE motion dynamics or force field equations,

$$
\dot{z}_i^{tSNE} = -\nabla_{z_i} U_{tSNE} = -\sum_{j=1, j \neq i}^{N}\left\{\frac{w_{ij}(z_i - z_j)}{1 + \|z_i - z_j\|^2} - \frac{(z_i - z_j)(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{r=1, r \neq i}^{N}(1 + \|z_r - z_i\|^2)^{-1}}\right\} = -\sum_{(i,j) \in E} F_{tSNE}(z_i - z_j) \tag{11}
$$

where the term under the summation in (11) is defined as

$$
F_{tSNE}(z_i - z_j) = (z_i - z_j)\left\{\frac{w_{ij}}{1 + \|z_i - z_j\|^2} - \frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{r=1, r \neq i}^{N}(1 + \|z_r - z_i\|^2)^{-1}}\right\} \tag{12}
$$

In a similar manner, equation (11) describes the motion state of vertex maps seeking the graph's minimum energy configuration in the lower dimensional space. However, the attractive force field is described by $-(z_i - z_j)\frac{w_{ij}}{1 + \|z_i - z_j\|^2}$, a term whose magnitude approaches an inverse square law for large pairwise distances $\|z_i - z_j\|$. This property makes the coordinate representation of joint probabilities invariant to changes in scale for maps that are far apart. The repulsion force field is described by $(z_i - z_j)\frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{r=1, r \neq i}^{N}(1 + \|z_r - z_i\|^2)^{-1}}$. The magnitude of the inverse square law approximation in both terms dictates that the formation of clusters is established at long range distances, while the repulsion force field is magnified at short range distances, causing all maps to disperse from each other. The convergence of the algorithm is established when the two force fields balance.
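For contrast with the SNE sketch above, the following variant (again only an illustrative reading of equations (11) and (12), with hypothetical helper names) swaps in the Student-t kernel, so that both force magnitudes decay with the inverse of the squared embedding distance.

```python
import numpy as np

def tsne_force_fields(Z, W):
    """Attraction/repulsion fields of eq. (12) under the heavy-tailed Student-t kernel."""
    diff = Z[:, None, :] - Z[None, :, :]               # z_i - z_j
    inv = 1.0 / (1.0 + (diff ** 2).sum(axis=2))        # (1 + ||z_i - z_j||^2)^-1
    np.fill_diagonal(inv, 0.0)
    w_hat = inv / inv.sum(axis=1, keepdims=True)       # heavy-tailed weights (eq. 10)
    attraction = -((W * inv)[:, :, None] * diff).sum(axis=1)
    repulsion = (w_hat[:, :, None] * diff).sum(axis=1)
    return attraction, repulsion
```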
C. Spherical Stochastic Neighbor Embedding

The spherical stochastic neighbor embedding model [18] is also a variant of SNE, whose embedding representations are constrained to reside on a spherical manifold. As such, it assumes the joint probability distribution $\hat{W}_i$ to be defined on a spherical surface, $S^m = \{z \in \mathbb{R}^{m+1} : \|z\| = 1\}$, with its corresponding probable values $\hat{w}_{ij}$ obtained by means of an Exit distribution [31]. sSNE, similarly to SNE, exhibits a desirable property that enables the unfolding of many-to-one mappings in which objects of the same class appear in several disparate, spatially driven locations. The probable values $\hat{w}_{ij}$ are computed from the Exit expression

$$
\hat{w}_{ij} = \frac{(1 - \varrho^2)\,\|z_j - \varrho z_i\|^{-m}}{\sum_{r=1, r \neq i}^{N}(1 - \varrho^2)\,\|z_r - \varrho z_i\|^{-m}} \tag{13}
$$

where $\varrho$ is a concentration parameter that controls the assignment of neighborhood weights for a given central map $z_i$. sSNE's total cost function is given by
$$
\sum_{i}^{N} KL(W_i \,\|\, \hat{W}_i) + \lambda(z_i^T z_i - 1) = \sum_i \sum_{j=1, j \neq i}^{N} w_{ij}\log\left(\frac{w_{ij}}{\hat{w}_{ij}}\right) + \lambda(z_i^T z_i - 1) \tag{14}
$$

where $\lambda$ is the Lagrangian parameter enabling the implicit incorporation of the unit spherical constraint in the objective function. The negative gradient of the expanded sSNE function yields the corresponding MAFE motion dynamic equations of the form

$$
\dot{z}_i^{sSNE} = -2m\varrho\sum_{j=1, j \neq i}^{N}(z_j - \varrho z_i)\left\{\frac{w_{ij}}{\|z_j - \varrho z_i\|^{2}} - \frac{\|z_j - \varrho z_i\|^{-m-2}}{\sum_{r=1, r \neq i}^{N}\|z_r - \varrho z_i\|^{-m}}\right\} - 2\lambda z_i = 2m\varrho\sum_{j=1, j \neq i}^{N} F_{sSNE}(z_i - z_j) - 2\lambda z_i \tag{15}
$$

where the pairwise gradient force field on each map is defined by $F_{sSNE}(\cdot) = F_E(\cdot) - \lambda z_i$, with

$$
F_E(z_i - z_j) = -(z_j - \varrho z_i)\left\{\frac{w_{ij}}{\|z_j - \varrho z_i\|^{2}} - \frac{\|z_j - \varrho z_i\|^{-m-2}}{\sum_{r=1, r \neq i}^{N}\|z_r - \varrho z_i\|^{-m}}\right\} \tag{16}
$$

Contrary to both SNE and tSNE, equation (15) describes a MAFE dynamic model with maps constrained to lie on a spherical manifold, in turn defining a unit sphere neighborhood graph configuration. The attractive force field is described by $-(z_j - \varrho z_i)\frac{w_{ij}}{\|z_j - \varrho z_i\|^{2}}$, a term whose magnitude is based on the unbounded inverse square law for large pairwise distances $\|z_i - z_j\|$. This property also makes the spherical coordinate representation of joint probabilities invariant to changes in scale for maps that are far apart. The repulsion force field is described by $(z_j - \varrho z_i)\frac{\|z_j - \varrho z_i\|^{-m-2}}{\sum_{r=1, r \neq i}^{N}\|z_r - \varrho z_i\|^{-m}}$, also an unbounded inverse distance term. The unbounded magnitude of the inverse square law in both terms enables sSNE to exhibit the much needed capability of splitting clusters that overlap. However, due to the unbounded nature of both terms, as $\|z_j - \varrho z_i\| \to 0$ the minimum energy configuration of the graph becomes unstable. This introduces algorithmic inefficiencies when seeking the optimal embedding coordinates for large data sets [18].

D. Laplacian and Cauchy Embedding
All graph embedding techniques presented in this study make the assumption that if observations $i$ and $j$ are similar, i.e., $w_{ij}$ is large, then their corresponding maps $z_i$ and $z_j$ should also be close in the embedding space. The $m$-dimensional embedding coordinates can also be obtained using the Laplacian eigenmaps (LE) approach [10]. Considering the one-dimensional example, the idea is to minimize

$$
\operatorname*{argmin}_{z \;\mathrm{s.t.}\; \|z\| = 1,\; z^T\mathbf{1} = 0} U_{LE}(z) = \sum_{i,j=1, j \neq i}^{N} w_{ij}(z_i - z_j)^2 \tag{17}
$$

where $\mathbf{1} = [1, \cdots, 1]^T$ and $\|z\| = 1$ is a magnitude constraint imposed to avoid a solution where all $z_i$'s are zero, whereas the constraint $z^T\mathbf{1} = 0$ reduces the uncertainty about the non-uniqueness of the embedding solution. Under such constraints the problem is equivalent to

$$
\operatorname*{argmin}_{z}\; U_{LE}(z) = z^T L z \tag{18}
$$

where $L = D - W$ is the graph Laplacian, $D = \mathrm{diag}(d_1, \cdots, d_N)$, $d_i = \sum_j w_{ij}$, and $W = [w_{ij}]$ is the edge weight matrix. The solution is often computed by solving equation (18) as an eigenvector problem. A simple observation on (17) connects the Laplacian eigenmaps formulation to the MAFE framework. A force field interpretation of the Laplacian embedding identifies $\sum_{j=1, j \neq i}^{N} w_{ij}(z_i - z_j)^2$ as the artificial attractive potential, whose corresponding negative artificial force field provides the motion equation for maps as defined by

$$
\dot{z}_i^{LE} = -\frac{dU_{LE}(z)}{dz_i} = -\sum_{j=1, j \neq i}^{N} F_{LE}(z_i - z_j) \tag{19}
$$

where $F_{LE}(z_i - z_j) = (D - W)_{ij}(z_i - z_j)$. The dynamic Laplacian model of (19) does incorporate the magnitude and non-uniqueness constraints from its objective function. However, these constraints do not generate a repulsion field that can mitigate the crowding of embeddings. The Cauchy graph embedding [29] approach has a MAFE interpretation that follows the same steps as applied in the reformulation of Laplacian eigenmaps.
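A rough sketch of this reading is shown below; it only evaluates the attraction-type force field of equation (19) from the weight matrix and deliberately omits the eigenvector-based handling of the magnitude and centering constraints.

```python
import numpy as np

def laplacian_force_field(Z, W):
    """Force field of eq. (19): dz_i/dt = -sum_j (D - W)_ij (z_i - z_j).

    Purely attractive: there is no repulsion term to counter crowding."""
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian L = D - W
    diff = Z[:, None, :] - Z[None, :, :]            # z_i - z_j
    return -(L[:, :, None] * diff).sum(axis=1)
```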
V. NEW MAFE IMAGE EMBEDDING MODELS
Apart from the key insights obtained from the force field properties for designing suitable pair-dependent attraction and repulsion potential energies for a MAFE embedding model, another essential component involves choosing a similarity function suitable for incorporating all local information relevant to computing a neighborhood graph that captures the local structure of the high dimensional observations. In most popular techniques, the Gaussian kernel described in equation (5) is used to compute such relations. However, for certain application areas with spatial information, e.g., hyperspectral imagery, the Gaussian weights may not be suitable. As such, a parametric local kernel for computing the spectral neighborhood is discussed in the next section. The parametric kernel can be seen as an extension of the bilateral filter function [32]–[34] with a robust high dimensional covariance estimation component that enhances pixel differences and generates a sparse neighborhood graph. The design of practical and better kernel functions tends to be application-driven and remains an open research problem; its complete investigation is therefore not fully addressed in this study.
A. Spectral Neighborhood Graph Similarities
In various image processing applications, spatial preprocessing methods are often applied to remove noise and smooth images. These methods also enhance spatial texture information, resulting in features that improve the performance of classification techniques. For example, in [35], nonlinear diffusion partial differential equations (PDEs) and wavelet shrinkage were used for spatial preprocessing of hyperspectral images, and the results obtained demonstrated a significant improvement in classification performance. Unlike the use of PDEs, we adapt a local bilateral filtering approach to devise a pairwise pixel similarity function or kernel over the observed neighborhood graph. The local kernel function is defined by

$$
w(s_i, s_j, y_i, y_j) = \exp\left\{-\frac{\|s_i - s_j\|^2}{2\sigma_s^2}\right\} \cdot \exp\left\{-\frac{1}{2}(y_i - y_j)^T \Sigma_y^{-1}(y_i - y_j)\right\} \tag{20}
$$

where $s_i$ denotes the spatial coordinates of pixel $i$ and $y_i$ denotes the photometric $d$-dimensional spectral vector, with $d$ corresponding to the number of spectral channels. The expression $\|s_i - s_j\|^2$ weighs image pixel values as a function of the spatial distance from the center pixel, and $\sigma_s^2$ is the corresponding variance parameter. The kernel also employs a nonlinear term,
$$
\tilde{w}_p(i, j) = \exp\left\{-\frac{1}{2}(y_i - y_j)^T \Sigma_y^{-1}(y_i - y_j)\right\} \tag{21}
$$

which simply weighs pixel values as a function of the photometric differences between the center pixel and its neighbor pixels. For example, given $N$ hyperspectral pixels organized into a zero-mean data matrix $Y = [y_1\, y_2 \cdots y_N] \in \mathbb{R}^{d \times N}$, the sample covariance is computed as $S = \frac{1}{N} Y Y^T = \langle y y^T \rangle$, with the angle brackets denoting the average over the $N$ pixels. Thus, $S$ is a $d \times d$ matrix whose diagonal components indicate the magnitude of noise variation in each of the $d$ spectral channels, and whose off-diagonal elements denote the extent to which noise co-varies between each pair of spectral bands. It is easy to show [18] that the photometric weights can be summarized by

$$
\tilde{w}_p(i, j) = \exp\left\{-\frac{N}{2}\,\mathrm{tr}(\Sigma^{-1} S)\right\} \tag{22}
$$

We observe that the unnormalized kernel $K$ can be represented as a product of unnormalized Gaussian functions, one for each pixel $y_j$, yielding

$$
K = \exp\left\{-\frac{1}{2}\sum_{j=1}^{N}(y_j - y_i)^T \Sigma^{-1}(y_j - y_i)\right\} = \exp\left\{-\frac{N\,\mathrm{tr}(S\Sigma^{-1})}{2} + \sum_{j=1}^{N} y_j^T \Sigma^{-1} y_i - \frac{N}{2}\, y_i^T \Sigma^{-1} y_i\right\}
$$

where $\Sigma^{-1}$ is the inverse covariance matrix and $\mathrm{tr}(B)$ denotes the trace of matrix $B$. We can assume a zero-mean unnormalized Gaussian noise model over the pixels, i.e., we can simply subtract the center $y_i$ from the data, to obtain the simplified expression

$$
K = \exp\left\{-\frac{1}{2}\sum_{j=1}^{N} y_j^T \Sigma^{-1} y_j\right\} = \exp\left\{-\frac{\mathrm{tr}(Y^T \Sigma^{-1} Y)}{2}\right\} \tag{23}
$$

Note that $\mathrm{tr}(Y^T \Sigma^{-1} Y) = \mathrm{tr}(\Sigma^{-1} Y Y^T) = N\,\mathrm{tr}(\Sigma^{-1} S)$. This shows that $S$ is a sufficient statistic for characterizing the unnormalized likelihood (herein the photometric similarity) of the data $Y$, and we can further write

$$
K = \exp\left\{-\frac{N}{2}\,\mathrm{tr}(\Sigma^{-1} S)\right\} \tag{24}
$$

In practice the weights are computed by first decomposing the true covariance matrix into a product $\Sigma = E\Lambda E^T$, where $E$ is the orthogonal eigenvector matrix and $\Lambda$ is the corresponding diagonal matrix of eigenvalues, from which the required inverse covariance matrix is easily computed. We adapt the efficient sparse matrix transform (SMT) approach to estimating the covariance matrix $\Sigma$ [36]. The SMT approach solves the optimization problem $\hat{E} = \operatorname*{argmin}_{E \in \Omega}\{|\mathrm{diag}(E^T S E)|\}$ and sets $\hat{\Lambda} = \mathrm{diag}(\hat{E}^T S \hat{E})$, where $\Omega$ is the set of allowed orthogonal transforms that can be computed using a series of Givens rotations [36]. A simple manipulation shows that $\Sigma^{-1} = \hat{E}\hat{\Lambda}^{-1}\hat{E}^T$, so that we have
$$
\tilde{w}_p(i, j) = \exp\left\{-\frac{1}{2}(\hat{E}^T y_i - \hat{E}^T y_j)^T \hat{\Lambda}^{-1}(\hat{E}^T y_i - \hat{E}^T y_j)\right\}
$$

with the final graph weights computed from

$$
w(s_i, s_j, y_i, y_j) = \exp\left\{-\frac{\|s_i - s_j\|^2}{2\sigma_s^2}\right\} \cdot \tilde{w}_p(i, j) \tag{25}
$$

The SMT approach to computing the covariance matrix $\Sigma$ is efficient and robust in handling the singularities of $\Sigma$. Other approaches to computing $\Sigma$ have been used in the literature, including the PCA adaptation approach [32], where the singularity of $\Sigma$ is not carefully addressed.
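A minimal sketch of this spatial-spectral kernel is given below. For simplicity it substitutes an eigendecomposition of a shrinkage-regularized sample covariance for the SMT estimator of [36]; the shrinkage constant and the k-nearest-neighbor sparsification are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def bilateral_graph_weights(coords, spectra, sigma_s=2.0, shrink=1e-3, k=15):
    """Spatial-spectral graph weights in the spirit of eqs. (20) and (25).

    coords  : (N, 2) pixel coordinates s_i.
    spectra : (N, d) spectral vectors y_i.
    Returns a dense (N, N) weight matrix sparsified to k neighbors per pixel.
    """
    Y = spectra - spectra.mean(axis=0)                 # zero-mean data matrix
    S = Y.T @ Y / len(Y)                               # sample covariance (d x d)
    S += shrink * np.trace(S) / S.shape[0] * np.eye(S.shape[0])   # guard against singularity
    evals, E = np.linalg.eigh(S)                       # stand-in for the SMT estimate
    U = Y @ E                                          # rotated spectra, rows are E^T y_i
    # photometric term: Mahalanobis distance in the rotated basis (eq. 21)
    d2_spec = ((U[:, None, :] - U[None, :, :]) ** 2 / evals).sum(axis=2)
    # spatial term (eq. 20)
    d2_spat = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-0.5 * d2_spat / sigma_s**2) * np.exp(-0.5 * d2_spec)
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest neighbors of each pixel, then symmetrize
    thresh = -np.sort(-W, axis=1)[:, k - 1:k]
    W = np.where(W >= thresh, W, 0.0)
    return np.maximum(W, W.T)
```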
B. Attractive Potential Function

The fundamental idea underpinning the design of multidimensional artificial field embedding models is to treat the pair-equilibrium distances $\epsilon_{ij}$ for vertices in $G_S$ as attractive wells; that is, to consider the minimum-energy configuration between maps $i$ and $j$ as a sink for the potential energy function. For example, the attractive potential energy functions that we consider are bounded from below to allow for the existence of constant attraction force effects at all distances, that is, $U_{att}^{ij}(\|z_i - z_j\|) \geq \alpha$, where $\alpha$ is a positive constant for all $\|z_i - z_j\|$. In this study, we explore potential functions of the form

$$
U_{att}^{ij}(\|z_i - z_j\|) = \xi_a\|z_i - z_j\|^p \tag{26}
$$

where for values $0 < p \leq 1$ the pairwise function is conic in shape, and the resulting attractive force field has a constant cluster formation amplitude determined from the graph's edge weights $w_{ij}$. $\xi_a$ is an adaptive parameter related to the attraction force magnitude. Figure 2 shows the attractive potential energy generated from equation (26) for $p = 2$, $\xi_a = 1$, and $w_{ij} = 0.$. The shape corresponds to a quadratic potential, i.e., there is a global optimum that acts to pull all force field effects in its direction, thereby demonstrating the sink nature of the minimum point.

C. Repulsive Potential Functions
The nature of a repulsion function is chosen such that, as the distance between pair-points increases, its influence on the maps becomes negligible (i.e., the maps are in the long range zone where $F_r^{ij} < F_a^{ij}$). However, when the distance is small, the function generates a barrier, or repulsive force, between maps (i.e., the maps are in the short range zone where $F_r^{ij} > F_a^{ij}$). The procedure for selecting a repulsive potential function $U_{rep}^{ij}$ starts by considering an indicator function of the form

$$
I_+(\|z_i - z_j\|) = \begin{cases} 0 & \|z_i - z_j\| > \epsilon_{ij} \\ \infty & \|z_i - z_j\| \leq \epsilon_{ij} \end{cases} \tag{27}
$$

which is a non-increasing function of distance. Equation (27) best captures the behavioral form of a repulsive function that has negligible effects at large distances and dominates at short range distances. With all its appealing intuitive properties, the indicator function is not differentiable. As such, we require an approximation by a differentiable function whose gradient can create a repulsion force $F_r^{ij}$ with a magnitude that is inversely proportional to the distance between pairs of maps, i.e., $\|F_r^{ij}(z_i - z_j)\| = \frac{1}{\mathrm{dist}(z_i, z_j)}$. Such approximations can be chosen from, e.g.,
Gaussian, exponential, Cauchy, hyperbolic tangent, and inverse distance power functions. The behavior of such functions in approximating $I_+(\|z_i - z_j\|)$ is shown in Figure 1.

Fig. 1. Dashed lines show the function $I_+(z)$, and the solid curves show different forms of continuous decaying functions suitable for approximating $I_+(z)$.
1) Exponential Bounded Repulsion:
The exponential or unnormalized Gaussian curves in Figure 1 generate a continuous bounded approximation of (27). As the distance between maps grows, the magnitude of the curves approaches zero, while a maximum magnitude is assigned for maps that are very close in distance. A general unnormalized Gaussian bounded repulsion function can be devised in the form

$$
U_{rep}^{ij}(\|z_i - z_j\|) = \xi_r\sigma_r^2\exp\left\{-\frac{\|z_i - z_j\|^q}{\sigma_r^2}\right\} \tag{28}
$$

where $\sigma_r^2$ is the bounded repulsion variance parameter. For $q = 2$, the function has spherical symmetry, as shown in Figure 2. For values $1 < q \leq 2$ the repulsion potential field has the shape of a harmonic function often used in modeling obstacles in robotic path planning [24], while for $0 < q < 1$ it has the form of a tower centered at the origin, generating equidistributed repulsive force fields in the direction of the gradient, as shown in Figure 2.
2) Inverse Power Unbounded Repulsion:
The inverse distance function shown in Figure 1 provides a continuous approximation of the indicator function in equation (27). Its corresponding general repulsion potential function is given by

$$
U_{rep}^{ij}(\|z_i - z_j\|) = \frac{\xi_r}{\|z_i - z_j\|^q} \tag{29}
$$

where $q$ is a positive number. It is clear that as $\|z_i - z_j\| \to 0^+$ the repulsion becomes unbounded, i.e., for $q = 2$,

$$
\lim_{\|z_i - z_j\| \to 0^+} \frac{U_{rep}^{ij}(\|z_i - z_j\|)}{\|z_i - z_j\|} = \infty.
$$

The inverse square law structure of this function promotes an invariant representation of neighborhood similarities even with a change in scale for embedding points that are far apart. $\xi_r$ is a repulsion magnitude related parameter. Figure 3 shows the unbounded repulsion potential field generated from equation (29). In sharp contrast to existing embedding methods, this function affords properties that exhibit collision avoidance of maps as their coordinates change in search of the pair-equilibrium distances $\epsilon_{ij}$. The strong repulsive force field generated by this function also renders it more favorable for the rejoining of clusters that may have split during the search for a minimum energy configuration state of $G_S$.
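To make the two repulsion families and the attraction well concrete, the snippet below (an illustrative sketch; the parameter names follow equations (26), (28), and (29), and the example weight of 0.5 is our own choice) defines the three potentials and a superposed pairwise energy.

```python
import numpy as np

# Attraction potential of eq. (26): xi_a * d^p (bounded below, grows with distance).
def u_att(d, xi_a=1.0, p=2):
    return xi_a * d**p

# Bounded (unnormalized Gaussian) repulsion of eq. (28): finite even as d -> 0.
def u_rep_bounded(d, xi_r=1.0, sigma_r=1.0, q=2):
    return xi_r * sigma_r**2 * np.exp(-d**q / sigma_r**2)

# Unbounded inverse-power repulsion of eq. (29): diverges as d -> 0,
# acting as a collision-avoidance barrier between maps.
def u_rep_unbounded(d, xi_r=1.0, q=2):
    return xi_r / d**q

# Superposed pairwise potentials w_ij * U_att - U_rep for an example weight w_ij = 0.5.
d = np.linspace(0.1, 5.0, 200)
total_bounded = 0.5 * u_att(d) - u_rep_bounded(d)
total_unbounded = 0.5 * u_att(d) - u_rep_unbounded(d)
```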
D. Multidimensional Artificial Field Embedding with Bounded Repulsion

A combination of the attractive and bounded unnormalized Gaussian repulsion functions yields a multidimensional artificial field embedding (MAFE-BR) model described by

$$
U(\mathcal{Z}) = \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N}\left\{w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - U_{rep}^{ij}(\|z_i - z_j\|)\right\} \tag{30}
$$
$$
= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N}\left\{\xi_a w_{ij}\|z_i - z_j\|^p - \xi_r\sigma_r^2\exp\left\{-\frac{\|z_i - z_j\|^q}{\sigma_r^2}\right\}\right\} \tag{31}
$$

For $q = 2$, (31) yields the elastic embedding model [17] as a special case. Computing the gradient provides information on the direction and magnitude of motion for each individual map in the embedding space. This motion is described by

$$
\dot{z}_i = -\sum_{j=1, j \neq i}^{N}(z_i - z_j)\left\{\xi_a w_{ij}\, p\,\|z_i - z_j\|^{p-2} - \xi_r\, q\,\|z_i - z_j\|^{q-2}\exp\left\{-\frac{\|z_i - z_j\|^q}{\sigma_r^2}\right\}\right\}
$$

An illustration of the gradient field for a point with a strong attraction force field is shown in Figure 2.

Fig. 2. Illustrations of a superposed potential function for $p = q = 2$: attraction potential, bounded repulsion potential, superposed potential, and the resulting gradient field. Arrows indicate the negative gradient force field. The map located at the origin has a very strong attraction force over a large range of distances. The repulsion force is significantly small and mostly effective over a very short range.
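The following sketch evaluates the MAFE-BR motion of each map implied by equation (31); the defaults and the helper name are illustrative, and it reuses the force-field style of the earlier snippets.

```python
import numpy as np

def mafe_br_forces(Z, W, xi_a=1.0, xi_r=0.1, sigma_r=1.0, p=2, q=2):
    """MAFE-BR motion dz_i/dt: negative gradient of the potential in eq. (31)."""
    diff = Z[:, None, :] - Z[None, :, :]                       # z_i - z_j
    d = np.linalg.norm(diff, axis=2) + 1e-12
    attraction = xi_a * W * p * d**(p - 2)                      # graph-weighted pull
    repulsion = xi_r * q * d**(q - 2) * np.exp(-d**q / sigma_r**2)   # bounded push
    mag = attraction - repulsion
    np.fill_diagonal(mag, 0.0)
    return -(mag[:, :, None] * diff).sum(axis=1)                # dz_i/dt
```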
E. Multidimensional Artificial Field Embedding with Unbounded Repulsion

The second proposed model is a multidimensional artificial field embedding model with the unbounded repulsion inverse distance function of (29). By considering all pairwise map interactions for the vertices of $G_S$, the resultant total potential function of the MAFE-UR model is given by

$$
U(\mathcal{Z}) = \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N}\left\{w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - U_{rep}^{ij}(\|z_i - z_j\|)\right\} = \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N}\left\{\xi_a w_{ij}\|z_i - z_j\|^p - \frac{\xi_r}{\|z_i - z_j\|^q}\right\} \tag{32}
$$
Fig. 3. Illustrations of a superposed unbounded-repulsion potential function for $p = 1$, $q = 2$: attraction potential, unbounded repulsion potential, superposed potential, and the resulting gradient field. Arrows indicate the negative gradient force field. The map located at the origin has a very strong repulsion force over short range distances, while the attraction force is dominant over long range distances.

Under the supposed artificial force field model, the motion of each map is given by $\dot{z}_i = -\nabla_{z_i} U(\mathcal{Z})$, i.e.,

$$
\dot{z}_i = -\sum_{j=1, j \neq i}^{N}\left\{\nabla_{z_i}\, w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - \nabla_{z_i}\, U_{rep}^{ij}(\|z_i - z_j\|)\right\} = -\sum_{j=1, j \neq i}^{N}(z_i - z_j)\left\{\xi_a w_{ij}\, p\,\|z_i - z_j\|^{p-2} + \xi_r\, q\,\|z_i - z_j\|^{-q-2}\right\} \tag{33}
$$

Fig. 3 shows the total potential field generated from equation (32) and its corresponding gradient field obtained from (33) for $p = 1$, $q = 2$.

An important observation on the gradient fields in Figure 2 and Figure 3 is that, without an attraction term in each model, cluster formation would not occur, since all pairwise maps would disperse from each other; whereas by eliminating the repulsion term (by setting $\xi_r = 0$), all maps would collapse to a single point, leading to the crowding problem that has been a weakness of most existing embedding models. Table I summarizes typical force field functions that can be used to design embedding models that exhibit aggregation behavior.
TABLE I: A summary of functions compatible with the design of multidimensional artificial force field embedding techniques.

Technique | Long Range Attraction | Short Range Repulsion | Constraint
SNE | $\sum_{i,j=1}^{N} w_{ij}\|z_i - z_j\|^2$ | $\sum_{i=1}^{N}\log\sum_{j=1}^{N} e^{-\|z_i - z_j\|^2}$ | $\sum_j w_{ij} = 1$
tSNE | $\sum_{i,j=1}^{N} w_{ij}\log(1 + \|z_i - z_j\|^2)$ | $\sum_{i=1}^{N}\log\big(\sum_{j=1}^{N}\frac{1}{1 + \|z_i - z_j\|^2}\big)$ | $\sum_j w_{ij} = 1$
sSNE | $\sum_{i,j=1}^{N} w_{ij}\log(\|z_i - \rho z_j\|^m)$ | $\sum_{i=1}^{N}\log\big(\sum_{j=1}^{N}\frac{1}{\|z_i - \rho z_j\|^m}\big)$ | $z_i \in S^m$
LE | $\sum_{i,j=1}^{N} w_{ij}\|z_i - z_j\|^2$ | -- | $ZZ^T = I$, $Z\mathbf{1} = 0$
CE | $\sum_{i,j=1}^{N}\frac{w_{ij}}{\sigma^2 + \|z_i - z_j\|^2}$ | -- | $ZZ^T = I$, $Z\mathbf{1} = 0$
GE | $\sum_{i,j=1}^{N} w_{ij}\exp\{-\|z_i - z_j\|^2\}$ | -- | $ZZ^T = I$, $Z\mathbf{1} = 0$
MAFE-BR | $\sum_{i,j=1}^{N} w_{ij}\xi_a\|z_i - z_j\|^p$ | $\sum_{i,j=1}^{N}\xi_r\sigma_r^2\exp\{-\|z_i - z_j\|^q/\sigma_r^2\}$ | $p, q \geq 1$
MAFE-UR | $\sum_{i,j=1}^{N} w_{ij}\xi_a\|z_i - z_j\|^p$ | $\sum_{i,j=1}^{N}\frac{\xi_r}{\|z_i - z_j\|^q}$ | $p, q \geq 1$
MAFE-E | $\sum_{i,j=1}^{N} w_{ij}\sigma_a^2\exp\{-\|z_i - z_j\|^p/\sigma_a^2\}$ | $\sum_{i,j=1}^{N}\xi_r\sigma_r^2\exp\{-\|z_i - z_j\|^q/\sigma_r^2\}$ | $p, q \geq 1$
MAFE-H | $\sum_{i,j=1}^{N} w_{ij}\xi_a\|z_i - z_j\|^p$ | $\sum_{i,j=1}^{N}\frac{\xi_r}{\|z_i - z_j\|^q}$ | $p, q \geq 1$

VI. STOCHASTIC GRADIENT DESCENT OPTIMIZATION
The objective functions of MAFE-based models are far less nonlinear than those of SNE and tSNE; as such, their optimization is relatively simple and requires no random jitter terms to establish stability. The optimization adapts a variation of stochastic gradient descent [37] with a common adaptive learning rate

$$
\alpha^{(t+1)} = \alpha^{(t)} + \gamma_1\langle\nabla U(\mathcal{Z}^{(t-1)}), \nabla U(\mathcal{Z}^{(t)})\rangle + \gamma_2\langle\nabla U(\mathcal{Z}^{(t-2)}), \nabla U(\mathcal{Z}^{(t-1)})\rangle \tag{34}
$$

where $\alpha^{(t)}$ is the learning rate at iteration $t$, and $\gamma_1$ and $\gamma_2$ are the meta-learning rates. The main characteristic of this fast learning rate adaptation scheme is that it exploits gradient-related information from the current as well as the two previous embedding coordinates in the sequence. This stabilizes the values of the learning rate and helps the gradient descent algorithm to exhibit even faster convergence, leading to a better minimum-energy configuration. A description of the proposed algorithm is given in Algorithm 1. The algorithm's termination condition is $\|\nabla U(\mathcal{Z})\| \leq \epsilon$. The choice of $\gamma_1$ and $\gamma_2$ is not critical for finding the minimum-energy configuration; it only affects the rate at which the algorithm does so. Figure 6 shows gradient field trajectories generated by this optimization scheme for a real world data set.
Algorithm 1: Generalized MAFE Algorithm
Input: High dimensional observations $Y$; initialized optimization parameters $\alpha^{(1)}, \gamma_1, \gamma_2$.
Output: Embedding coordinates matrix $Z = \{z_1, z_2, \cdots, z_N\}$.
Compute the high dimensional neighborhood graph weights $w_{ij}$ from equation (25), or use a Gaussian kernel;
Randomly sample initial maps from a normal distribution, i.e., $Z^{(0)} \sim \mathcal{N}(0, I)$;
Set $\mathcal{Z}^{(1)} = [z_1^{(0)T}, z_2^{(0)T}, \cdots, z_N^{(0)T}]^T \in \mathbb{R}^{Nm}$;
while $\|\nabla U(\mathcal{Z}^{(t)})\| > \epsilon$ do
    Set $t = t + 1$;
    Compute new embedding coordinates using $\mathcal{Z}^{(t+1)} = \mathcal{Z}^{(t)} - \alpha^{(t)}\nabla U(\mathcal{Z}^{(t)})$;
    Calculate the new learning rate from $\alpha^{(t+1)} = \alpha^{(t)} + \gamma_1\langle\nabla U(\mathcal{Z}^{(t-1)}), \nabla U(\mathcal{Z}^{(t)})\rangle + \gamma_2\langle\nabla U(\mathcal{Z}^{(t-2)}), \nabla U(\mathcal{Z}^{(t-1)})\rangle$;
end
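A compact Python sketch of Algorithm 1 is given below. It assumes a caller-supplied gradient routine (e.g. built from the force-field helpers sketched earlier, such as `mafe_br_forces`); the meta-learning rates, initialization scale, and iteration cap are illustrative values rather than those used in the paper.

```python
import numpy as np

def generalized_mafe(W, grad_U, m=2, alpha=0.1, gamma1=1e-3, gamma2=1e-4,
                     eps=1e-5, max_iter=1000, seed=0):
    """Algorithm 1: gradient descent on U(Z) with the adaptive rate of eq. (34).

    W      : (N, N) neighborhood graph weights (eq. (25) or a Gaussian kernel).
    grad_U : callable returning the gradient of U at Z, shape (N, m);
             e.g. grad_U = lambda Z: -mafe_br_forces(Z, W).
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(W.shape[0], m))          # Z^(0) ~ N(0, I)
    g_prev2 = g_prev = grad_U(Z)
    for _ in range(max_iter):
        g = grad_U(Z)
        if np.linalg.norm(g) <= eps:              # termination: ||grad U(Z)|| <= eps
            break
        Z = Z - alpha * g                         # Z^(t+1) = Z^(t) - alpha^(t) grad U
        # adaptive learning rate of eq. (34), using the two previous gradients
        alpha = alpha + gamma1 * np.vdot(g_prev, g) + gamma2 * np.vdot(g_prev2, g_prev)
        g_prev2, g_prev = g_prev, g
    return Z
```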
The iterative optimization in (4), and in particular the models described by equations (32) and (33), converges when all pair-equilibrium distances $\epsilon_{ij}$ are established. A strong theoretical argument can be made to assert that the motion of maps is guaranteed to stop and that no oscillatory behavior exists at the minimum-energy configuration state. This is done by letting the invariant set of equilibrium positions be $\Xi_{equi} = \{\mathcal{Z} : \dot{\mathcal{Z}} = 0\}$. The proof needs to show that as $t \to \infty$ the state $\mathcal{Z}(t)$ converges to $\Xi_{equi}$, i.e., the minimum-energy configuration of a graph converges to a stationary state. This extends Theorem 2 of [23] to problems of data visualization and dimensionality reduction.

Theorem 1. Consider a graph embedding described by $\dot{z}_i = \sum_{j=1, j \neq i}^{N} F_{ij}(z_i - z_j)$, $i = 1, \cdots, N$, with pairwise force field functions $F_{ij}(z_i - z_j) = (z_i - z_j)\{F_r^{ij}(\|z_i - z_j\|) - F_a^{ij}(\|z_i - z_j\|)\}$. As $t \to \infty$, $\mathcal{Z}(t) \to \Xi_{equi}$.

Proof:
Consider the general energy function $U(\mathcal{Z}) = \sum_{i=1}^{N} U_i(\mathcal{Z})$, where $U_i(\mathcal{Z})$ is defined in (2). Taking the derivative of $U(\mathcal{Z})$ with respect to each $z_i$ yields

$$
\nabla_{z_i} U(\mathcal{Z}) = 2\sum_{j=1, j \neq i}^{N}\left\{\nabla_{z_i}\, w_{ij} U_{att}^{ij}(\|z_i - z_j\|) - \nabla_{z_i}\, U_{rep}^{ij}(\|z_i - z_j\|)\right\} = -2\dot{z}_i
$$

where in the second equality the negative gradient is identified with the direction of motion. Taking the time derivative of $U(\mathcal{Z})$ along the motion of a graph configuration yields

$$
\dot{U}(\mathcal{Z}) = \nabla U(\mathcal{Z})^T\dot{\mathcal{Z}} = \sum_{i=1}^{N}\nabla_{z_i} U(\mathcal{Z})^T\dot{z}_i = \sum_{i=1}^{N}\{-2\dot{z}_i\}^T\dot{z}_i = -2\sum_{i=1}^{N}\|\dot{z}_i\|^2 \leq 0, \quad \forall t.
$$

This result shows that the motion will continue in the direction of decreasing $U(\mathcal{Z})$ until the state where all $\dot{z}_i = 0$. By invoking the LaSalle invariance principle [38], it can be concluded that as $t \to \infty$ the graph configuration state $\mathcal{Z}(t)$ converges to the largest invariant subset of the set defined as

$$
\Xi = \{\mathcal{Z} : \dot{U}(\mathcal{Z}) = 0\} = \{\mathcal{Z} : \dot{z}_i = 0\} = \Xi_{equi}.
$$

Since each point in $\Xi_{equi}$ is an equilibrium point, $\Xi_{equi}$ is an invariant set, and this concludes the proof.

This general result holds for any function $F_{ij}$ chosen based on the embedding force field properties discussed in Section III. As such, it guarantees the convergence of our algorithms. However, in practice, the termination condition is set to some finite $\epsilon$ in the optimization of $U(\mathcal{Z})$. The result also extends to SNE, tSNE, sSNE, and other related MAFE reformulations.

In addition to obtaining optimal embedding coordinates, the optimization of MAFE-BR and MAFE-UR with $p = q = 2$ also learns the embedding space neighborhood graph weights. For MAFE-BR, the optimal embedding neighborhood graph is described by

$$
\tilde{w}_{ij}^{br} = \xi_a w_{ij} - \xi_r\exp\left\{-\frac{\|z_i - z_j\|^2}{\sigma_r^2}\right\}
$$

The iterative optimization of MAFE-UR generates embedding graph weights that are given by

$$
\tilde{w}_{ij}^{ur} = \xi_a w_{ij} - \frac{\xi_r}{\|z_i - z_j\|^2}
$$

In contrast to the high dimensional neighborhood graph weights $w_{ij}$, which are positive, the lower dimensional graph weights $\tilde{w}_{ij}^{br}$ and $\tilde{w}_{ij}^{ur}$ corresponding to the optimal embedding coordinates can be negative, as illustrated in Figure 4. This property establishes another point of significant contrast between the proposed MAFE based techniques and traditional embedding techniques that are known to enforce the learning of positive lower dimensional weights, e.g., LLE, SNE, tSNE, and Isomap.

Fig. 4. Minimum energy graph configuration weights learned after 100 iterations for a KSC Graminoid Marsh pixel map. MAFE-BR weights $\tilde{w}_{ij}^{br}$ are shown in (a) and MAFE-UR weights $\tilde{w}_{ij}^{ur}$ are shown in (b). For both models, the search for optimal embedding coordinates also leads to the learning of negative graph weights.
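A small sketch of these learned low-dimensional weights is shown below; it simply evaluates the two expressions above on converged coordinates, with the exponent choices mirroring the $p = q = 2$ reconstruction and therefore indicative only.

```python
import numpy as np

def learned_embedding_weights(Z, W, xi_a=1.0, xi_r=0.1, sigma_r=1.0):
    """Low-dimensional graph weights learned by MAFE-BR and MAFE-UR (p = q = 2).

    Unlike the input weights w_ij, these can take negative values."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                       # ignore self pairs
    w_br = xi_a * W - xi_r * np.exp(-d2 / sigma_r**2)  # MAFE-BR weights
    w_ur = xi_a * W - xi_r / d2                        # MAFE-UR weights
    return w_br, w_ur
```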
VII. HYPERSPECTRAL IMAGE SCENES
1) Botswana Hyperion:
Hyperion data with nine identified classes of complex natural vegetation were acquired over the Okavango Delta, Botswana, in May 2001 [39]. The general class groupings include seasonal swamps, occasional swamps, and woodlands. Signatures of several classes are spectrally overlapped, typically resulting in poor classification accuracies. After removing water absorption, noisy, and overlapping spectral bands, 145 bands were used for classification experiments. Classification error rates on signatures are computed after the data has been embedded into a lower dimensional space.
2) Kennedy Space Center (KSC):
Airborne hyperspectral data were acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor at 18-m spatial resolution over Kennedy Space Center during March 1996. Noisy and water absorption bands were removed, leaving 176 features for thirteen wetland and upland classes of interest. Cabbage Palm Hammock (Class 3) and Broad Leaf/Oak Hammock (Class 6) are upland trees; Willow Swamp (Class 2), Hardwood Swamp (Class 7), Graminoid Marsh (Class 8), and Spartina Marsh (Class 9) are trees and grasses in wetlands. Their spectral signatures are mixed and often exhibit only subtle differences. Visualization and classification results for all thirteen classes are reported using the lower dimensional space coordinates.
3) Indian Pine:
This scene was gathered by an AVIRIS sensor over the Indian Pines test site in north-western Indiana on June 12, 1992, at 20-m spatial resolution. The 220-band image scene covers an area 6 miles west of West Lafayette. The image has 16 classes consisting of agricultural fields: alfalfa, corn-notill, corn-min, corn, grass/pasture, grass/trees, grass/pasture-mowed, hay-windrowed, oats, soybean-notill, soybean-min, soybean-clean, wheat, woods, bldg-grass-tree-drives, and stone-steel towers. The image size is × pixels. The pixel resolution is 16 bits, corresponding to 65536 gray levels. 2550 pixels were selected to generate the sample training and testing data. Again, we report Euclidean and non-Euclidean embedding results, as well as an evaluation of the computed coordinates based on classification error rates for all 16 classes.
4) Salinas-A:
This is a small sub-scene of the Salinas image, denoted Salinas-A, that is commonly used in hyperspectral data analysis. It comprises × pixels located within the same scene at [samples, lines] = [591-676, 158-240] and includes six classes. The original scene was collected by the 224-band AVIRIS sensor over Salinas Valley, California, and is characterized by high spatial resolution (3.7-meter pixels). The area covered comprises 512 lines by 217 samples. The 20 water absorption bands were discarded, i.e., bands [108-112], [154-167], and 224. It includes vegetables, bare soils, and vineyard fields. The Salinas ground truth contains 16 classes.
ROUND TRUTH CLASSES FOR THE S ALINAS -A AND I NDIAN P INE SCENES WITH THEIR RESPECTIVE SAMPLES NUMBERS
Salinas-A Indian Pinec1 Brocoli-green-weeds-1 (391) c1 Alfalfa (54)c2 Corn-senesced-green-weeds (1343) c2 Corn-notill (1434)c3 Lettuce-romaine-4wk (616) c3 Corn-min (834)c4 Lettuce-romaine-5wk (1525) c4 Corn (234)c5 Lettuce-romaine-6wk (674) c5 Grass-pasture (497)c6 Lettuce-romaine-7wk (799) c6 Grass-trees (747)c7 Grass-mowed (26)c8 Hay-windrowed (489)c9 Oats (20)c10 Soybean-notill (968)c11 Soybean-min (2468)c12 Soybean-heavy till (614)c13 Wheat (212)c14 Woods (1294)c15 Bldg-grass-tree-dr (380)c16 Stone-steel towers (95) spatial resolution (3.7-meter pixels). The area covered comprises 512 lines by 217 samples. The20 water absorption bands were discarded, i.e . bands: [108-112], [154-167], 224. It includesvegetables, bare soils, and vineyard fields. Salinas ground truth contains 16 classes.VIII. E
XPERIMENTAL R ESULTS
MAFE evaluation on lower dimensional representations consists of various comparisons with popular techniques on all four image scenes. The comparison includes experimental results obtained with SNE, tSNE, Isomap, and LE, taking as input Gaussian kernel neighborhood graphs computed after a principal component analysis step that reduces the original dimensionality. In constructing the high dimensional neighborhood graph, we vary the number of neighbors from k = 1 to k = 50 and pick an optimal value of k = 15. The optimal value of k ensures that the embeddings are neither too noisy and unstable, nor that the geometry of the observations collapses the coordinates of dissimilar observations. All existing models are implemented as in [16], with the perplexity of the conditional distribution induced by the Gaussian kernel determined from its entropy H. The parameterized bilateral kernel function presented in Section V-A is used to generate the input graphs for MAFE, with no principal component analysis step. MAFE-BR results are generated with the setting p = q = 2 in (31), establishing a quadratic attraction potential and a spherical Gaussian repulsion potential; its magnitude parameters \xi_a and \xi_r are held at fixed values. MAFE-UR results are generated with the setting p = 2, q = 1 and similarly fixed values of \xi_a and \xi_r. The proposed MAFE optimization algorithm uses the norm of the gradient as the termination condition, i.e., \|\nabla U(Z^{(t)})\| < \epsilon, with a fixed small tolerance \epsilon. SNE, tSNE, and sSNE obtain solutions using standard gradient descent algorithms whose termination is defined by the number of iterations T, i.e., they terminate at T = 1000 iterations. Embedding maps obtained by Isomap are based on the classical formulation that admits the closed-form solution of an eigenvector problem, namely picking the leading components of variation, while LE is based on picking the trailing eigenvectors.

TABLE III
GROUND TRUTH CLASSES FOR IMAGE SCENES WITH THEIR RESPECTIVE SAMPLE NUMBERS

Botswana Hyperion:
c1 Water (158)
c2 Floodplain (228)
c3 Riparian (237)
c4 Firescar (178)
c5 Island interior (183)
c6 Woodlands (199)
c7 Savanna (162)
c8 Short mopane (124)
c9 Exposed soils (111)

Kennedy Space Center:
c1 Scrub (761)
c2 Willow swamp (243)
c3 Cabbage palm hammock (256)
c4 Cabbage palm/oak hammock (252)
c5 Slash pine (161)
c6 Oak/broadleaf hammock (229)
c7 Hardwood swamp (105)
c8 Graminoid marsh (431)
c9 Spartina marsh (520)
c10 Cattail marsh (404)
c11 Salt marsh (419)
c12 Mud flats (503)
c13 Water (927)
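As a rough illustration of how an input neighborhood graph of this kind can be assembled, the sketch below builds a k-nearest-neighbor graph whose weights combine a spectral Gaussian term with a spatial (pixel-coordinate) Gaussian term. It is only a simplified stand-in for the parameterized bilateral kernel of Section V-A, which additionally estimates the covariance structure with the sparse matrix transform; the kernel widths and the helper name are illustrative assumptions.

```python
import numpy as np

def bilateral_knn_graph(X, P, k=15, sigma_spec=1.0, sigma_spat=1.0):
    """Simplified joint spectral-spatial neighborhood graph (illustrative only).

    X : (N, d) spectral signatures, one row per pixel.
    P : (N, 2) pixel (row, col) coordinates of the same pixels.
    Returns a sparse, symmetric (N, N) weight matrix W with k neighbors per pixel.
    """
    N = X.shape[0]
    # Squared pairwise distances in the spectral and spatial domains.
    d_spec = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d_spat = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    # Bilateral weight: product of a spectral and a spatial Gaussian.
    K = np.exp(-d_spec / (2 * sigma_spec**2)) * np.exp(-d_spat / (2 * sigma_spat**2))
    np.fill_diagonal(K, 0.0)

    # Keep only each pixel's k strongest neighbors to obtain a sparse graph.
    W = np.zeros_like(K)
    for i in range(N):
        nbrs = np.argsort(K[i])[-k:]
        W[i, nbrs] = K[i, nbrs]
    return np.maximum(W, W.T)  # symmetrize
```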
A. Trajectories of Gradient Vector Field

The dynamic equation in (1) serves as an approximation to the true model that describes the central path in the search for optimal data manifolds. Additional insight into the uncovering of the underlying manifolds can be obtained by observing the gradient field trajectories as each map traverses towards the minimum energy configuration state of the embedding graph. The trajectory traces in 2D space provide a visual assessment of how the formation of clusters is affected by different cost functions and their corresponding optimization schemes. The gradient vector fields presented here were obtained by mapping fifteen hyperspectral pixels from three classes of both the Botswana and the Kennedy Space Center images. Three classes are considered for each image, i.e., Woodlands, Firescar, and Island Interior for Botswana, and Water, Graminoid Marsh, and Cabbage Palm for Kennedy Space Center. MAFE based trajectories are compared to gradient fields obtained with tSNE and SNE. Figure 5 and Figure 6 illustrate the formation of clusters under these iteratively optimized embedding algorithms. tSNE and SNE results are shown in Figures 6(d) and 6(e), from which a high degree of oscillation and random change in gradient vector directions can be observed. These instabilities cause inefficiencies in establishing the equilibrium distances \epsilon_{ij} between pairs of maps. In contrast, starting from the same initial maps, Figures 6(b) and 6(c) show that the MAFE-BR and MAFE-UR techniques produce smooth trajectories with no instabilities, i.e., no random changes in gradient vector directions. Such smooth gradient fields are due to two important features of the proposed MAFE framework. First, the local stochastic adaptive optimization scheme used in MAFE models retains the local gradient information from its last two iterates, thereby producing smooth and stable updates, and it does not depend on random jitter parameters in its update rule. Second, the careful design of pair-dependent attraction and repulsion functions, as well as the sparsity of the neighborhood graphs, allows for a smooth interaction between long range and short range forces on pairs of maps. This smooth interaction generates a robust motion of clusters that establishes the minimum energy configuration state much more quickly. As such, the optimization scheme used in MAFE models converges several orders of magnitude faster than the other algorithms, as shown in Figure 16.
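The adaptive update described above can be pictured with the following minimal sketch, in which each coordinate combines its current descent direction with the previous step. This momentum-style rule is only a stand-in for the stochastic adaptive scheme of [37]; the step size, momentum factor, and the gradient routine grad_U are assumptions made for illustration.

```python
import numpy as np

def descend(Z0, grad_U, step=0.1, momentum=0.5, tol=1e-3, max_iter=1000):
    """Gradient flow toward the minimum energy configuration, recording trajectories.

    Z0     : (N, m) initial embedding coordinates.
    grad_U : callable returning the (N, m) gradient of the energy U at Z.
    Returns the final coordinates and the list of iterates (the 2D trajectories).
    """
    Z = Z0.copy()
    velocity = np.zeros_like(Z)
    trajectory = [Z.copy()]
    for _ in range(max_iter):
        g = grad_U(Z)
        if np.linalg.norm(g) < tol:  # gradient-norm termination, as in the text
            break
        # Blend the new descent direction with the previous step for smooth iterates.
        velocity = momentum * velocity - step * g
        Z = Z + velocity
        trajectory.append(Z.copy())
    return Z, trajectory
```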
Fig. 5. Gradient based optimization illustrations over iterations for fifteen Botswana maps drawn from three different classes (Woodlands, Firescar, and Island Interior): (a) initial maps, (b) MAFE-BR, (c) SNE, (d) tSNE, all initialized from the same seed of points. MAFE-BR displays smooth gradient trajectories; similar points cluster together and traverse in the direction of the gradient field as indicated by the arrows. SNE and tSNE trajectories are subject to oscillations, including collisions of maps that lead to instabilities, as well as severe local maxima traps that slow down the optimization algorithm. Arrows point in the negative direction of the gradient.

Fig. 6. Gradient based optimization illustrations over iterations for fifteen KSC pixel maps drawn from three different classes (Water, Graminoid Marsh, and Cabbage Palm): (a) initial maps, (b) MAFE-BR, (c) MAFE-UR, (d) SNE, (e) tSNE, all initialized from the same points. Both MAFE-BR and MAFE-UR demonstrate smooth gradient trajectories, with points belonging to the same class clustering together and traversing in the negative direction of the gradient field (indicated by the arrows). SNE and tSNE trajectories are subject to oscillations and instabilities, leading to poor convergence of the algorithms.
B. Visualization of Embeddings
Nonlinear dimensionality reduction methods tend to demonstrate better performance when embedding data sets with a small number of classes than on problems with many classes [19]. This is caused by the complexity of the data manifolds represented in problems with many disparate classes. For example, in hyperspectral imagery, similar classes of data may give rise to multiple manifolds due to their spatial locations; this presents a challenge to many existing dimensionality reduction techniques, which have a tendency to map all similar or related data onto a single cluster. Such characteristics increase the chances of collapsing very different classes to the same embedding representation. Using the MAFE-BR and MAFE-UR models, this study makes an effort to address this challenge and provides high quality lower dimensional visualization results for complex data sets.

Figure 7 shows the Botswana RGB image and its ground reference data. The embedding results obtained with MAFE-BR, SNE, tSNE, MDS, Isomap, and LE are shown in Figure 8. MAFE-BR produces a different and superior embedding map compared to the other methods. For example, SNE computes coordinates that seem to separate different classes well; however, there is a significant overlap between the Riparian and Woodlands classes, and the clusters are more spread out, implying large variances in the embedding space. MAFE-BR combined with the bilateral kernel function achieves tight, spatially disjoint clusters and demonstrates no overlap between the Riparian and Woodlands classes, which are known to be the most difficult classes to separate for this data set [19]. tSNE, although capable of mitigating overcrowding of points and producing a good clustering effect, leads to significant overlap between the Riparian and Woodlands classes. The MDS and Isomap embeddings are similar, due to their shared dependence on the classical MDS solution, and both show a significant overlapping of coordinates. LE produces a solution with very limited interpretation of the separation of classes.

Fig. 7. (a) RGB image of the Botswana data. (b) Botswana labeled data.

Fig. 8. Embedding of all labeled samples of the Botswana data (panels: MAFE-BR, SNE, tSNE, MDS, Isomap, and LE).

Figure 9 shows both the RGB image and ground references of the Kennedy Space Center (KSC) data. Figure 10 shows the embedding results obtained with MAFE-BR, MAFE-UR, SNE, tSNE, Isomap, and LE. The lower dimensional embedding results demonstrate that MAFE-BR and MAFE-UR construct very tight clusters. In addition, due to the spatially-sensitive kernel, the embedding maintains the disjoint nature observed in the land cover categories of the ground truth. Furthermore, the MAFE models introduce the capability to tile different land cover classes when their boundaries appear to be overlapping. The models also achieve better neighborhood properties, as illustrated in Figure 15. The SNE and tSNE results demonstrate the existence of clusters for the Water, Salt Marsh, and Spartina Marsh classes; however, there is very little insight into the separability of the other ten classes. Isomap's solution demonstrates the presence of overcrowding and class overlap. LE sheds very little meaningful interpretation on the structure of all classes.

Fig. 9. (a) RGB image of the Kennedy Space Center data. (b) Kennedy Space Center labeled data.

Figure 11 and Figure 13 show ground truth image data for the Indian Pine and the Salinas-A scenes, respectively. Their corresponding MAFE based embeddings are shown in Figure 12 and Figure 14. The results further demonstrate the quality of the visualizations generated by the use of a bilateral kernel function in combination with a MAFE based embedding framework. Use of spatial information allows the disjoint nature of classes to be encoded in the neighborhood graph, i.e., it introduces the level of sparsity needed to establish disjoint neighborhoods. The interactive fields that generate the attraction and repulsion forces in MAFE techniques enable the preservation of such topological relations.
C. Frobenius Distance
As in many dimensionality reduction methods, the goal is to maintain the local structure of the data. Here, the study examines the local topology of the manifolds obtained by MAFE models in comparison to the other iteratively optimized embedding algorithms, i.e., SNE and tSNE. The most straightforward way to assess this is through the norm of the residual distance matrix:
\[
\| D - \hat{D} \|_{F} = \sqrt{ \sum_{ij} \big( \hat{D}_{ij} - D_{ij} \big)^{2} } \tag{35}
\]
where \| \cdot \|_F denotes the Frobenius norm, i.e., the square root of the sum of squares of the elements of the difference between D and \hat{D}, the high and low dimensional Euclidean distance matrices, respectively. The results shown in Figure 15 indicate that MAFE models achieve better local distance preserving representations in the sense of (35). The results also highlight the efficiency with which the optimization proposed for MAFE based models enables the discovery of neighboring related coordinates. As the number of iterations increases, SNE continues to show a reduced Frobenius error due to its small magnitude repulsion, which in turn has a negative effect, as it leads to overcrowding of maps. In the case of tSNE, it is clear that local distances are not preserved, due to its strong repulsion forces. tSNE has a tendency to separate maps of similar instances into multiple small separated clusters, a property that might be useful for visualization but is not suitable for tasks that require keeping similar maps in close proximity, e.g., COIL20 data visualization and hyperspectral data embedding.

Fig. 10. Embedding of the Kennedy Space Center data (panels: MAFE-BR, MAFE-UR, tSNE, SNE, Isomap, and LE; showing pixel coordinates).
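A direct way to evaluate (35) in practice is sketched below; the pairwise-distance helper and the array shapes are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(A):
    """Euclidean distance matrix between the rows of A."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.sqrt(sq)

def frobenius_residual(X, Z):
    """Residual Frobenius distance between high- and low-dimensional distance matrices.

    X : (N, d) original observations; Z : (N, m) embedding coordinates.
    Smaller values indicate better preservation of local distances, as in (35).
    """
    D = pairwise_distances(X)      # high dimensional distances
    D_hat = pairwise_distances(Z)  # embedding space distances
    return np.linalg.norm(D_hat - D, ord="fro")
```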
D. Semisupervised Classification
In various image analysis applications, including remote sensing, the main task entails establishing an automated image understanding process through object classification or land cover classification. Class label information is used to determine the designation of test or query samples in the lower dimensional space. Our experiments employ a one nearest neighbor (1NN) classifier, with accuracy based on the spectral angle mapper [40], [41], to determine whether the defined object classes or earth land covers occupy separable volumes in the lower dimensional space and whether they can be discriminated successfully. The experiments are carried out in the various embedding spaces. Embedded pixel maps were randomly sampled to generate training and testing samples, with results averaged over multiple runs. All approaches were compared on the same data samples to maintain consistency in error comparison. The results provide several interesting observations that are consistent with the visualization maps from Section VIII-B.
Tables IV and V list the 1NN classification performance accuracy per class. The trends observed in the MAFE-BR and MAFE-UR embedding spaces indicate coordinate representations that outperform the other spaces by significant margins in enabling high classification accuracy. Performance accuracies are reported by observing the mean values obtained from the Kappa statistic (KS) [42], computed as
\[
KS = \frac{ N \sum_{c=1}^{|C|} t_{cc} - \sum_{c=1}^{|C|} t_{c+}\, t_{+c} }{ N^{2} - \sum_{c=1}^{|C|} t_{c+}\, t_{+c} } ,
\]
where N is the number of testing samples, t_{cc} indicates the number of samples correctly classified in class c, t_{c+} denotes the number of testing samples labeled as class c, t_{+c} denotes the number of samples predicted as belonging to class c, and |C| denotes the total number of classes in the data. The percentage of correctly predicted samples out of the total testing samples provides the overall accuracy (OA), which is also reported in the results.
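The evaluation protocol above can be summarized in a short sketch: a 1NN classifier under the spectral angle distance, followed by the kappa statistic and overall accuracy computed from the resulting confusion matrix. The function names, the integer class labels, and the use of the spectral angle directly on embedding coordinates are illustrative assumptions.

```python
import numpy as np

def spectral_angle(a, b):
    """Spectral angle mapper distance between two vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def one_nn_predict(train_Z, train_y, test_Z):
    """1NN classification in the embedding space using the spectral angle."""
    preds = []
    for z in test_Z:
        dists = [spectral_angle(z, t) for t in train_Z]
        preds.append(train_y[int(np.argmin(dists))])
    return np.array(preds)

def kappa_and_oa(y_true, y_pred, num_classes):
    """Kappa statistic (KS) and overall accuracy (OA) from a confusion matrix."""
    T = np.zeros((num_classes, num_classes))
    for yt, yp in zip(y_true, y_pred):   # labels assumed to be 0..num_classes-1
        T[yt, yp] += 1
    N = T.sum()
    observed = np.trace(T)                          # sum_c t_cc
    chance = (T.sum(axis=1) * T.sum(axis=0)).sum()  # sum_c t_{c+} * t_{+c}
    ks = (N * observed - chance) / (N**2 - chance)
    oa = observed / N
    return ks, oa
```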
Fig. 11. Indian Pines ground truth image.

Fig. 12. Embedding of the Indian Pine scene: (a) MAFE-BR, (b) MAFE-UR (showing a subset of samples to avoid clutter).

Fig. 13. Salinas-A ground truth image.

The classification results obtained on the Botswana data lead to interpretations that are consistent with the visualization analysis of Figure 8. In general, all methods seem to provide embedding coordinates that lead to reasonable classification performance accuracy; however, the lowest accuracy results are achieved with the LE embedding representations. Furthermore, the lowest per-class accuracy is observed for class 3 (c3) and class 6 (c6), corresponding to the Riparian and Woodlands classes, respectively. Lower classification results on c3 and c6 are expected because these two classes are the most difficult to separate, consistent with the visual results of Figure 8 and with the findings reported in [19], [43], [44]. LMNN using 1NN achieves the second best result on both data sets, perhaps owing to its ability to make use of class label information in learning the Mahalanobis distance metric. The objective function of LMNN does have a force field structure in which class label information is used to compute the optimal metric, with the goal that the k nearest neighbors always belong to the same class (i.e., they are pulled closer by an attraction term), while samples from other classes are separated by a large margin (i.e., pushed away by a repulsion term). In contrast, both MAFE-BR and MAFE-UR have objective functions formulated only in terms of the distances between pairs of points, and no class label information is used during computation of the maps.

When classifying samples from the KSC data, a similar trend is observed, with both MAFE based techniques providing a coordinate representation from which a higher 1NN classification performance is achieved. The classification results for class 3 (c3), class 4 (c4), class 5 (c5), and class 6 (c6) indicate the lowest performance in all embedding spaces except for the solutions achieved by the MAFE based techniques. From the visualization result of Figure 10, these classes correspond to the Cabbage Palm Hammock, Cabbage Palm/Oak Hammock, Slash Pine, and Oak/Broadleaf Hammock classes, respectively. These are all categories of very similar upland trees; their spectral signatures are mixed and often exhibit only subtle differences. However, the neighborhood graph structure computed from the bilateral local kernel function, which incorporates spatial details, provides a significant boost to the separation of the different categories of objects. As such, MAFE based methods take advantage of this, in addition to the smooth attraction/repulsion interaction between pairwise maps, and yield embeddings that are clearly separable. On the other hand, sSNE achieves very similar performance results, with its input graph structure defined to take advantage of the spatial information in the images, even though it suffers from the algorithmic instabilities and inefficiencies associated with the optimization of its highly nonlinear objective function.

Fig. 14. Embedding of the Salinas-A image scene: (a) MAFE-UR, (b) MAFE-BR (showing a subset of samples to avoid clutter).

Fig. 15. Illustration of how the quality of local distance preservation varies with the number of iterations, measured by the Frobenius norm of the difference distance matrix, for the Kennedy Space Center data (curves for SNE, MAFE-BR, MAFE-UR, and tSNE). The MAFE related models and the SNE model show generally the same performance at preserving local distances. The Frobenius norm results obtained from tSNE show poor quality, indicating that the method does not preserve local distances for all observations; this is due to its nature of splitting clusters of data that belong to the same class.

Fig. 16. MAFE, SNE, and tSNE objective function optimization run times (in seconds) for the KSC data set.

Figure 17 shows the 1NN misclassification error plots as a function of the embedding dimension. Such error plots are commonly used in dimensionality reduction algorithms that rely on the so-called manifold projection approach for estimating the dimension of the embedding space. The manifold projection approach estimates the intrinsic dimension of the embedding space by carefully predefining a higher dimensional space neighborhood graph that achieves a lower dimensional representation with a better neighborhood preserving topology. Classification error plots provide one such criterion, by which the intrinsic dimension of the data is chosen as the lowest dimension that captures most of the variance (i.e., the regular information) in the data, with higher dimensions only adding redundancies. The bilateral kernel function proposed for constructing the higher dimensional neighborhood graph allows a significant automated sparsity property to be induced by estimating the covariance structure of the data. However, the study does acknowledge that choosing the intrinsic dimension is an open and difficult research question that warrants a deep separate study; here we simply include an experimental approach necessary for evaluating the proposed models. From Figures 17(b) and 17(a), the optimal dimension for mapping both the 145-dimensional Botswana spectral channels and the 176-dimensional Kennedy Space Center spectral channels is m = 6. The approximate optimal embedding dimension is chosen based on the lowest elbow drop point displayed in the 1NN classification error plots in the MAFE embedding spaces.
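A simple way to automate the elbow criterion described above is sketched below: the embedding dimension is chosen where the marginal drop in 1NN error falls below a tolerance. The error curve, the tolerance, and the helper name are illustrative assumptions, not the procedure used to produce Figure 17.

```python
import numpy as np

def elbow_dimension(errors, tol=0.01):
    """Pick the smallest embedding dimension after which the 1NN error stops improving.

    errors : 1-D array, errors[i] is the misclassification error at dimension i + 1.
    tol    : minimum per-dimension improvement still considered meaningful.
    """
    drops = -np.diff(errors)               # improvement gained by adding one dimension
    for m, drop in enumerate(drops, start=1):
        if drop < tol:                      # adding dimension m + 1 no longer helps
            return m
    return len(errors)

# Example with a hypothetical error curve that flattens after m = 6.
errs = np.array([0.40, 0.31, 0.24, 0.19, 0.15, 0.13, 0.128, 0.127])
print(elbow_dimension(errs))  # -> 6
```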
IX. DISCUSSION AND FUTURE WORK

MAFE nonlinear embedding techniques coupled with a bilateral kernel function demonstrated a better approach for accounting for the nonlinear mixtures of similar categories captured by high resolution sensors, as compared to other techniques. The general MAFE framework opens up valuable avenues and provides a platform for deriving further theoretical insights and developing new nonlinear dimensionality reduction algorithms.

TABLE IV
BOTSWANA DATA CLASSIFICATION RESULTS IN DIFFERENT EMBEDDING SPACES (per-class accuracies and Kappa statistic for MDS, SNE, tSNE, MAFE-BR, MAFE-UR, Isomap, LE, LMNN, and LFDA; single asterisks mark the most difficult classes to separate and double asterisks the methods with high 1NN classification accuracy).

TABLE V
KENNEDY SPACE CENTER DATA CLASSIFICATION RESULTS IN DIFFERENT EMBEDDING SPACES (per-class accuracies and Kappa statistic for MDS, tSNE, MAFE-BR, MAFE-UR, SNE, Isomap, LE, LMNN, and LFDA; single asterisks mark the most difficult classes to separate and double asterisks the methods with high 1NN classification accuracy).

Fig. 17. Mean plus/minus one standard error misclassification error comparison for the 1-nearest neighbor classifier based on the various embedding spaces while varying the embedding dimension: (a) Kennedy Space Center data and (b) Botswana data.

The nonlinear dimensionality reduction algorithms discussed in this study have a kernel dimension that is the square of the number of vectors in the sample space. As such, the nature of their objective functions can be summarized as follows.

• For MAFE based techniques, the objective function U : R^{N x m} -> R is defined over the space of embedding matrices Z in R^{N x m}. The neighborhood graph's pairwise kernel similarities are represented as an N x N matrix W. The complexity scales with the square of the number of observed samples, i.e., O(N^2), both in computation and memory usage.

• For sSNE, SNE, and tSNE, the objective function KL : R^{N x m} -> R is defined over the space of embedding matrices R^{N x m}. The probable-neighbor matrix is an N x N pairwise kernel matrix W. The complexity scales with the square of the number of observations, i.e., O(N^2), both in computation and memory usage.

• For MDS, the objective function sigma : R^{N x m} -> R is defined over the space of matrices R^{N x m}. The input data in this case are the geodesic distances, represented as an N x N kernel matrix D(Y) = [D_Y(y_i, y_j)]. The complexity scales with the square of the number of samples, i.e., O(N^2), both in computation and memory usage.

A. Algorithmic Complexity Challenges
For non-iterative methods, the spectral decomposition of a large dimensional kernel encounters difficulties in at least three aspects: large memory usage, high computational complexity, and computational instability. Methods that are capable of exploiting the sparsity structure of the neighborhood graph may be useful for overcoming the difficulties in memory usage and computational complexity. Approaches that incorporate rank-revealing concepts, randomized low rank approximation algorithms, greedy rank-revealing algorithms, and randomized anisotropic transformation algorithms, which approximate the leading eigenvalues and eigenvectors of dimensionality reduction kernels, may lead to faster algorithms. For iterative gradient based methods, efficient second order or Newton approximation algorithms that introduce additional local information about the curvature of the objective function can be developed to speed up convergence to the minimum energy configuration state. However, we caution that the computation of gradient directional vectors in each iteration of the optimization scheme continues to introduce computational hurdles for larger data sets.
As indicated, the complexity of these algorithms scales with the square of the number of pixels, i.e., O(N^2), which is prohibitive for obtaining efficient solutions on large-scale hyperspectral image pixel sets. This is a challenge faced by many embedding algorithms. A possible mitigating approach would be to split the pixels into overlapping patches, run a nonlinear embedding method on each patch, and use the coordinates of pixels in the overlapping patches as references to merge each pair of resulting submanifolds into a common global manifold. An even harder problem, not addressed in this study, pertains to the issue of spectral unmixing for pixels that contain more than a single land cover category [45]. If the spatial resolution of the sensors is poor, increased overlap of the spectral signatures of different land cover objects is imminent. In such cases, knowledge of the observations should be used to derive kernel representations that can incorporate such characteristics.
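One way to realize the patch-based mitigation described above is to align each pair of patch embeddings with a rigid (Procrustes-style) transform estimated on the pixels they share. The sketch below illustrates that alignment step only; the choice of an orthogonal alignment and the function signature are assumptions for illustration, not a prescription from the study.

```python
import numpy as np

def align_patch(Z_ref, Z_new, shared_ref, shared_new):
    """Map a new patch embedding into the reference frame using shared pixels.

    Z_ref, Z_new           : (N_ref, m), (N_new, m) patch embeddings.
    shared_ref, shared_new : index arrays of the same overlapping pixels in each patch.
    Returns Z_new expressed in the coordinate frame of Z_ref.
    """
    A, B = Z_ref[shared_ref], Z_new[shared_new]
    mu_a, mu_b = A.mean(0), B.mean(0)
    # Orthogonal Procrustes: rotation that best maps the shared coordinates onto each other.
    U, _, Vt = np.linalg.svd((A - mu_a).T @ (B - mu_b))
    R = U @ Vt
    return (Z_new - mu_b) @ R.T + mu_a
```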
B. Objective Function Formulation Challenges

The formulation of MAFE models encounters further challenges in the design of the total energy function. For example, the appealing force field embedding intuition is subject to numerous dynamic local maxima behaviors during cluster formation. This can be explained by observing that the formation of clusters creates local repulsions, leading to local traps for maps that still need to move closer to their most similar neighbors (as determined by the neighborhood graph weights). A simple illustration of the dynamic local maxima behavior is shown in Figure 18. The behavior is exacerbated by a weak choice of attraction potential energy function, which may lead to less effective pulling of trapped maps towards their closest similar neighbors. This shortcoming affects the convergence of the iterative embedding algorithms. The local maxima traps can perhaps be better handled by piecewise attraction potential functions whose short range and long range behavior varies with distance, in the hope of increasing the pulling force's magnitude enough to overcome the local repulsion forces from dissimilar neighbors.

Fig. 18. Illustration of dynamic local maxima traps from a MAFE based model, showing the 2D surface and contour plots. Arrows indicate the repulsive force fields that form during clustering. The disjoint cluster formation caused by the local repulsions on the green color class is shown on the right. (Best viewed in color.)
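To make the suggested remedy concrete, the sketch below shows one possible piecewise attraction potential whose pull strengthens beyond a switch-over distance. The quadratic/linear split, the switch distance, and the parameter values are purely illustrative assumptions, since the study only proposes this direction as future work.

```python
import numpy as np

def piecewise_attraction(d, d_switch=1.0, short_gain=1.0, long_gain=4.0):
    """Illustrative piecewise attraction potential over the pairwise distance d.

    Short range (d <= d_switch): a gentle quadratic pull.
    Long range  (d >  d_switch): a steeper linear pull, continuous at d_switch,
    intended to drag maps out of local repulsion traps.
    """
    d = np.asarray(d, dtype=float)
    short = 0.5 * short_gain * d**2
    value_at_switch = 0.5 * short_gain * d_switch**2
    long = value_at_switch + long_gain * (d - d_switch)
    return np.where(d <= d_switch, short, long)
```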
X. CONCLUSIONS
This report provides design and algorithmic insights leading to new nonlinear dimensionality reduction models with strong connections to widely used methods. It introduces a novel parameterized bilateral kernel function, based on a joint spatial and photometric distance, that improves the capability of capturing regularities within hyperspectral image pixels. Much of the disjoint class structure is encoded into the neighborhood graph through spatial information and estimation of the covariance structure using the sparse matrix transform. The study then adapts a graph embedding framework, namely the multidimensional artificial field embedding, featuring the example bounded and unbounded repulsion models, and demonstrates the general applicability of the force field intuition to problems in signal and image processing. The idea is to envision the motion of positional maps (representing graph vertices) as driven by attraction and repulsion forces that enable long range distance pulling of similar maps and short range distance repulsion for all maps, respectively. The force field interpretation provides a natural way to capture this intuition and promotes the design of pairwise interaction functions that act on pixel samples to establish the direction and magnitude of the interaction vectors. An adaptive iterative gradient based algorithm built on this notion was implemented to yield the minimum energy configuration of the neighborhood graph. The proposed MAFE-UR and MAFE-BR models were applied to data sets acquired from remote sensing. The new algorithms proposed in the study are shown to have desirable properties that preserve the local topology of observations while inducing strong natural global structures, e.g., disjoint spatially motivated clusters in hyperspectral imagery. In its general form, the framework yields formulations of currently popular dimensionality reduction methods under very few assumptions. Experimental work conducted on visualization, gradient field trajectories, and semisupervised classification tasks demonstrates that both MAFE-UR and MAFE-BR overcome the crowding problem and lead to better classification performance in comparison to the other approaches.

REFERENCES

[1] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559-572, 1901.
[2] I. Joliffe. Principal Component Analysis. Springer-Verlag, 1986.
[3] A. M. Martinez and A. Kak. PCA versus LDA. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(2):228-233, 2001.
[4] W. Torgerson. Multidimensional scaling I: Theory and method. Psychometrika, 17, 1952.
[5] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027-1061, 2007.
[6] M. Sugiyama, T. Ide, S. Nakajima, and J. Sese. Semi-supervised local Fisher discriminant analysis for dimensionality reduction. Machine Learning, 78(1-2):35-61, 2010.
[7] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2007.
[8] L. Song, A. J. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. Advances in Neural Information Processing Systems, 21, 2007.
[9] S. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[10] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[11] V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems, pages 721-728, 2003.
[12] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Numerical Geometry of Non-Rigid Shapes. Springer, NY, 2008.
[13] T. Kohonen and T. Honkela. Kohonen network. Scholarpedia, 2007.
[14] S. Yan, D. Xu, H. J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):40-51, 2007.
[15] G. Hinton and S. Roweis. Stochastic neighbor embedding. ICML, 15:833-840, 2002.
[16] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[17] M. A. Carreira-Perpinan. The elastic embedding algorithm for dimensionality reduction. International Conference on Machine Learning, pages 167-174, 2010.
[18] D. Lunga and O. Ersoy. Spherical stochastic neighbor embedding of hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, in press, 2012.
[19] M. M. Crawford, W. Kim, and M. Li. Exploring nonlinear manifold learning for classification of hyperspectral data. IEEE Trans. on Opt. Remote Sensing, 3:207-234, 2011.
[20] A. E. Parr. A contribution to the theoretical analysis of the schooling behavior of fishes. Occasional Papers of the Bingham Oceanographic Collection, 1(1):1-32, 1927.
[21] C. M. Breder. Studies on the structure of fish schools. Bull. Am. Mus. Nat. Hist., 98(1):3-27, 1951.
[22] C. Breder. Equations descriptive of fish schools and other animal aggregations. Ecology, 35(3):361-370, 1954.
[23] V. Gazi and K. M. Passino. Stability analysis of social foraging swarms. IEEE Trans. on Systems, Man, and Cybernetics, 34(1):539-557, 2004.
[24] J. C. Latombe. Robot Motion Planning. Kluwer Academic Publishers, Boston, MA, USA, 1991.
[25] D. Grunbaum. Schooling as a strategy for taxis in a noisy environment. Animal Groups in Three Dimensions, pages 257-281, 1997.
[26] D. Grunbaum. Schooling as a strategy for taxis in a noisy environment. Evolutionary Ecology, 12:503-522, 1998.
[27] Z. Xue, Z. Liu, C. Feng, and J. Zeng. Stability analysis of exponential type stochastic swarms with time-delay. Int. Journ. of Innovative Management, Information and Production, 2(3):1-12, 2011.
[28] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
[29] D. Luo, C. Ding, F. Nie, and H. Huang. Cauchy graph embedding. In Proc. of ICML, 2011.
[30] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[31] R. Durrett. Brownian Motion and Martingales in Analysis. Wadsworth Advanced Books and Software, Belmont, California, 1951.
[32] P. Honghong and R. Raghuveer. Hyperspectral image enhancement with vector bilateral filtering. IEEE International Conference on Image Processing, pages 3669-3672, 2009.
[33] K. Kotwal and S. Chaudhuri. Visualization of hyperspectral images using bilateral filtering. IEEE Transactions on Geoscience and Remote Sensing, 48(5):2308-2316, 2010.
[34] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. IEEE Int. Conf. on Computer Vision, pages 839-846, 1998.
[35] S. Velasco-Forero and V. Manian. Improving hyperspectral image classification using spatial preprocessing. IEEE TGRS, 6(2):297-301, 2009.
[36] J. Theiler, G. Cao, L. Bachega, and C. Bouman. Sparse matrix transform for hyperspectral image processing. Journal of Selected Topics in Signal Processing, 5(3):424-437, 2011.
[37] V. P. Plagianakos, G. D. Magoulas, and M. N. Vrahatis. Learning rate adaptation in stochastic gradient descent (Ch. 2). Information Systems Journal, Kluwer Academic Publishers, pages 15-26, 2001.
[38] J. P. LaSalle. Some extensions of Liapunov's second method. IRE Trans. on Circuit Theory, pages 520-527, 1960.
[39] A. L. Neuenschwander. Remote Sensing of Vegetation Dynamics in Response to Flooding and Fire in the Okavango Delta, Botswana. Ph.D. dissertation, Univ. Texas Austin, Austin, TX, 2007.
[40] R. H. Yuhas, A. F. H. Goetz, and J. W. Boardman. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Annual JPL Airborne Geoscience Workshop, pages 147-149, 1992.
[41] F. A. Kruse, K. B. Heidebrecht, A. B. Boardman, and P. J. Barloon. The spectral image processing system (SIPS): interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment, 44(2):145-163, 1993.
[42] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46, 1960.
[43] W. Di and M. M. Crawford. Active learning via multi-view and local proximity co-regularization for hyperspectral image classification. IEEE Journal of Selected Topics in Signal Processing, 5(3):618-628, 2011.
[44] L. Ma, M. M. Crawford, and J. Tian. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 48(11):4099-4109, 2010.
[45] A. Villa, J. Chanussot, J. A. Benediktsson, and C. Jutten. Spectral unmixing for the classification of hyperspectral images at a finer spatial resolution.