Enabling Explainable Fusion in Deep Learning with Fuzzy Integral Neural Networks
Muhammad Aminul Islam, Member, IEEE, Derek T. Anderson, Senior Member, IEEE, Anthony J. Pinar, Member, IEEE, Timothy C. Havens, Senior Member, IEEE, Grant Scott, Senior Member, IEEE, and James M. Keller, Life Fellow, IEEE

AUTHOR COPY, TO APPEAR IN SPECIAL ISSUE ON DEEP FUZZY MODELS, TRANSACTIONS ON FUZZY SYSTEMS
Abstract—Information fusion is an essential part of numerous engineering systems and biological functions, e.g., human cognition. Fusion occurs at many levels, ranging from the low-level combination of signals to the high-level aggregation of heterogeneous decision-making processes. While the last decade has witnessed an explosion of research in deep learning, fusion in neural networks has not observed the same revolution. Specifically, most neural fusion approaches are ad hoc, are not understood, are distributed versus localized, and/or explainability is low (if present at all). Herein, we prove that the fuzzy Choquet integral (ChI), a powerful nonlinear aggregation function, can be represented as a multi-layer network, referred to hereafter as ChIMP. We also put forth an improved ChIMP (iChIMP) that leads to a stochastic gradient descent-based optimization in light of the exponential number of ChI inequality constraints. An additional benefit of ChIMP/iChIMP is that it enables eXplainable AI (XAI). Synthetic validation experiments are provided and iChIMP is applied to the fusion of a set of heterogeneous architecture deep models in remote sensing. We show an improvement in model accuracy, and our previously established XAI indices shed light on the quality of our data, model, and its decisions.
Index Terms—data fusion, Choquet integral, deep learning, neural network, explainable AI

Muhammad Aminul Islam is with the Department of Electrical and Computer Engineering, Mississippi State University, MS, USA (e-mail: [email protected]). Derek Anderson, Grant Scott, and James Keller are with the Electrical Engineering and Computer Science Department, University of Missouri, MO, USA. Timothy Havens and Anthony Pinar are with the Electrical and Computer Engineering Department and the Computer Science Department, Michigan Technological University, MI, USA. Manuscript received May 20, 2018.
I. INTRODUCTION
Data are ubiquitous in today's technological era. This is both a blessing and a curse, as we are swimming in sensors but drowning in data. In order to cope with these data, many systems employ data/information fusion. For example, you are right now combining multiple sources of data, e.g., taste, smell, touch, vision, hearing, memories, etc. In remote sensing, it is common practice to combine lidar, hyperspectral, visible, radar, and/or other variable spectral-spatial-temporal resolution sensors to detect objects, perform earth observations, etc. This is the same story for computer vision, smart cars, Big Data, and numerous other thrusts. While the last decade has seen great strides in topics like deep learning, the reality is that our understanding of fusion in the context of neural networks (NNs) (and therefore deep learning) has not witnessed similar growth. Most approaches to fusion in NNs are ad hoc (specialized for a particular application) and/or they are not well understood nor explainable (i.e., how are the data being combined and why should we trust system outputs).

Let z_i ∈ ℝ^{D_i} be data from source i = 1, ..., N (sensor, algorithm, human). If fusion is needed, most approaches just concatenate, i.e., z = (z_1, ..., z_N)^t, resulting in a higher dimensional input. As such, "fusion" occurs somewhere in the network. Another approach is divide-and-conquer, where individual NNs are attached to each z_i, followed by NN(s) (or other machine learning algorithms like a support vector machine) to combine their outputs. However, whereas this approach gives rise to a modular design, plug-and-play possibilities, etc., it does so at the expense of likely not exposing low-level data correlations (which are in z). In [1], Xiaowei et al. explored infinite-valued logic on a set of pre-trained convolutional neural networks (CNNs). Pal, Mitra, and others (e.g., Keller and the fuzzy perceptron [2]) explored a variety of topics like fuzzy min-max networks, the fuzzy multilayer perceptron (MLP), Sugeno fuzzy measure densities [3], and fuzzy Kohonen networks. In 1992 [4], Yager put forth the ordered weighted average (OWA) [5] neuron—which technically is a linear order statistic (LOS), as the weights are real-valued numbers (vs. sets). In 1995, Sung-Bae utilized the OWA for NN aggregation at the decision/output level [6].

In 1995, Sung-Bae et al. explored the fuzzy integral, specifically the Sugeno integral, for NN aggregation [7]. They used the Sugeno λ-fuzzy measure (λ-FM) defined on the N singletons versus the full set of 2^N subsets, and the densities were derived from their respective accuracy rates on training data. In 2017 [8], we used the Sugeno and Choquet integral (ChI) for deep CNN (DCNN) fusion. Specifically, we used data augmentation and transfer learning to adapt GoogLeNet, AlexNet, and ResNet50 from camera imagery to remote sensing imagery. We then applied different aggregations—fuzzy integral, voting, arrogance, and weighted sum—to these DCNNs. A λ-FM with normalized classifier accuracy densities, and also a genetic algorithm, was used to learn the densities. In [9], quadratic programming was used to learn the full ChI, relative to pre-trained DCNNs. These are a few NN fusion approaches explored to date.

Herein, we investigate basic NN fusion questions. The first, Q1, is what fusions—aggregation functions to be precise—are possible relative to existing NN ingredients? Q1 is more-or-less an existence argument. The next, Q2, is can we represent and optimize an aggregation function, such as the ChI, as an NN? As such, Q2 addresses how do we find a solution (versus does one exist). Last, Q3, is can we open the hood on a fusion NN and understand what it has learned?

The following contributions are made in this paper. For Q1, we demonstrate that state-of-the-art aggregation operators are achievable using existing NN mathematics. Namely, we show that two NNs can compute the ChI; one based on a selection network and N! linear convex sums (LCS), the other based on the Mobius transform. We also logically and empirically show that it is a feat to approximate the ChI on limited variety and volume data (which is often the case). For Q2, we present the ChI multi-layer perceptron (ChIMP) (aka, dedicated fusion network), which can be optimized via stochastic gradient descent (SGD) in light of an exponential number of ChI inequality constraints. For Q3, we use indices for introspection on ChIMP. Whereas most NN fusion solutions to date operate on the basis of implicit and distributed computation, we focus on explicit and centralized computation to promote understandability. ChIMP is used here to fuse a set of heterogeneous architecture deep NNs for remote sensing. Adding ChIMP to the top of a collection of deep NNs results in a deep fuzzy NN.

The remainder of this article is organized as such. First, in Section II we introduce the capacity and integral. In Section III, different NNs (ChIMPs) for the ChI are put forth; Section IV is an improved ChIMP (iChIMP) (relative to SGD optimization) and Section V presents eXplainable AI (XAI) fusion. The final sections present our experiments, results, and conclusions.
II. BACKGROUND: MEASURE AND INTEGRAL
Let X = {x_1, x_2, ..., x_N} be N sources of data/information (e.g., sensor, human, algorithm), let h(x_i) be the input from source i, and let h = (h(x_1), ..., h(x_N))^t be a vector of inputs. An aggregation operator maps data-to-data, f_Θ(h) = y, which ideally obeys conditions such as idempotency, associativity, continuity, symmetry, etc. Typically, y is not multi-dimensional, but is ℝ-valued. The ChI is a nonlinear aggregation function parameterized by the FM [10], [11]. Whereas the integral has its roots in calculus, Keller et al. were the first to use it for pattern recognition/machine learning [12], [13], [14]. However, the integral has been used in many contexts, e.g., by Grabisch et al. in multi-criteria decision making (MCDM) [15]. Regardless of the application, the question remains: where do the parameters come from? Examples include human specification (which quickly becomes intractable; as N grows, there are 2^N variables and N(2^{N-1}) monotonicity constraints), learning from training data [16], or extrapolation in a crowd-sourced fashion [17], [18].

In addition, it is important to remark about the complexity of the ChI. For example, for N = 10 there are 1,024 variables and 5,120 constraints. In order to combat the computational complexity, imputation methods like the λ-FM have been put forth, where one specifies the measure of the N individuals and the λ-FM automatically fills in (guesses at) the remaining values. Grabisch, Labreuche, and others have explored routes like the k-additive integral to restrict the number of FM variables to at most k-tuples [19]. This helps control the complexity of the integral relative to tasks like human decision making and bounded rationality. In [16], we introduced a way to identify data-supported and data-unsupported ChI variables. Optimization is for data-supported variables only, new examples are classified as known or unsupported (i.e., requiring unknown variables), and imputation is used to make an intelligent guess in the case of missing variables. The next few subsections are quick reviews of the measure and integral.
A. Fuzzy Measure
The FM, g : 2^X → ℝ^+, is a function with the following two properties: (i) (boundary condition) g(∅) = 0; and (ii) (monotonicity) if A, B ⊆ X and A ⊆ B, then g(A) ≤ g(B). Sometimes a normality condition is imposed such that g(X) = 1.

B. Choquet Integral
The ChI of observation h on X is

\int h \circ g = C_g(h) = \sum_{j=1}^{N} h_{\pi(j)} \left[ g(A_{\pi(j)}) - g(A_{\pi(j-1)}) \right],   (1)

for A_{\pi(j)} = {x_{\pi(1)}, ..., x_{\pi(j)}}, g(A_{\pi(0)}) = 0, and permutation π such that h_{\pi(1)} ≥ h_{\pi(2)} ≥ ... ≥ h_{\pi(N)}. Shorthand notation h_i = h(x_i) is used.

C. Choquet Integral as N! Linear Convex Sum Operators
One way to discuss the ChI is in terms of N! LCS operators. Relative to a particular sorting of the data (π_i)—of which there are N! possible sorts—the ChI can be expressed as

f_{\pi_i} = \sum_{j=1}^{N} h_{\pi_i(j)} \left( g(A_{\pi_i(j)}) - g(A_{\pi_i(j-1)}) \right) = h^t_{\pi_i} w_{\pi_i},   (2)

where w_{\pi_i}(j) = g(A_{\pi_i(j)}) - g(A_{\pi_i(j-1)}). For the ChI, these N × N! weights are tied to the underlying 2^N FM variables.
Example 1. For N = 2, the ChI can be expanded as

C_g(h) = \begin{cases} h_1 w_1 + h_2 w_2, & h_1 \geq h_2 \\ h_2 w_3 + h_1 w_4, & h_2 > h_1 \end{cases}   (3)

where the weights are w_1 = g({x_1}), w_2 = 1 - g({x_1}), w_3 = g({x_2}), and w_4 = 1 - g({x_2}). Thus, there are four weights, but just two underlying free FM variables: the densities.
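To ground Eq. (1), the following minimal Python sketch (ours, not from the paper; the frozenset-keyed dictionary is an assumed FM representation) computes the ChI by sorting the inputs and differencing the measure, and confirms the special case, discussed in Section II-D below, where g(A) = |A|/N recovers the mean:

```python
import numpy as np

def choquet_integral(h, g):
    """ChI of Eq. (1): sort inputs descending, then accumulate
    h_pi(j) * (g(A_pi(j)) - g(A_pi(j-1))) over the sorted chain of sets.
    h is a length-N array; g maps frozensets of indices to measure values."""
    pi = np.argsort(h)[::-1]        # permutation with h[pi[0]] >= h[pi[1]] >= ...
    out, prev, A = 0.0, 0.0, set()
    for j in pi:
        A.add(j)                    # A_pi(j) = {x_pi(1), ..., x_pi(j)}
        out += h[j] * (g[frozenset(A)] - prev)
        prev = g[frozenset(A)]
    return out

# The mean FM of Section II-D, g(A) = |A|/N, recovers the average.
N = 3
g_mean = {frozenset(A): len(A) / N for A in
          [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]}
h = np.array([0.4, 0.9, 0.1])
print(choquet_integral(h, g_mean), h.mean())   # both 0.4666...
```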
Remark 1. This difference in weights versus underlying FM variables grows with respect to N. For example, when N = 5, there are 2^5 = 32 FM variables and 5 × 5! = 600 weights. However, for N = 10 there are 1,024 FM variables and 36,288,000 weights. The ChI can be seen as a form of variable compression.
Since the ChI is a parametric function, once the FM is determined, the ChI turns into a specific operator. For example: if g(A) = 1 for all nonempty A ⊆ X, the ChI becomes the maximum operator; if g(A) = 0 for all A ⊂ X, we recover the minimum; if g(A) = |A|/N, we recover the mean; and for g(A) = g(B) when |A| = |B|, ∀A, B ⊆ X, we obtain an LOS. In general, each of these cases can be viewed as constraints or simplifications on the FM, and therefore the ChI. Also, the reader should know that the all-too-familiar operators—mean, max, min, trimmed versions of these operators, etc.—are all subsets of the LOS and therefore of the ChI. As such, the ChI is useful because it can be adapted to suit a wide range of aggregation needs. This is a primary reason for selecting it for use in this article and for sensor fusion, in general.

III. THE CHOQUET INTEGRAL AS AN NN: ChIMP
In this section we explore question Q1: can an NN compute the ChI, and therefore a wide set of interesting, useful, and commonly used aggregation operators? To make a long story short, the answer is yes (see Example 2 and Fig. 1). But why is Q1 important? Well, there are many claims about what an NN (e.g., CNN) can do. Mathematically, a CNN can encode filters (linear time invariant filters such as a matched filter, low pass filter, or Gabor filter), random projections, and combinations thereof, to name a few. However, limited attention has been placed on understanding aggregation in an NN. Ideally, we would like to know if, and then where, fusion is occurring, understand what aggregation operator was selected (e.g., intersection like, union like, average like, something more exotic), what aggregation functions are possible relative to a network, etc. Understanding if an NN can compute the ChI gives us insight into what is possible, or not possible as it may be. (Keep in mind the difference between an existence proof and identifying a way to achieve, e.g., learn, it.)
Example 2. Consider the case of N = 2 and the NN outlined in Fig. 1. The network output is

o = u(h_1 - h_2)(h_1 w_1 + h_2 w_2) + u(h_2 - h_1)(h_2 w_3 + h_1 w_4),

where u is a unit/Heaviside step function, i.e., u(x) = 1 if x ≥ 0 and u(x) = 0 otherwise; in practice, many use a smooth approximation like the logistic function (1 + tanh(kx/2))/2 = 1/(1 + e^{-kx}), where the larger the k, the sharper the transition about x = 0. This gives us

o = u(h_1 - h_2)[h_1 g({x_1}) + h_2 (1 - g({x_1}))] + u(h_2 - h_1)[h_2 g({x_2}) + h_1 (1 - g({x_2}))];

thus,

o = h_1 g({x_1}) + h_2 (1 - g({x_1})),   h_1 > h_2,
o = h_2 g({x_2}) + h_1 (1 - g({x_2})),   h_2 > h_1,
o = 0.5[h_1 g({x_1}) + h_2 (1 - g({x_1}))] + 0.5[h_2 g({x_2}) + h_1 (1 - g({x_2}))],   h_1 = h_2.

Herein, we consider a modified unit step that splits ties, e.g., taking the value 1/2 at 0 for N = 2. (In the difference-in-g form of the ChI, equal input values admit multiple valid sorts—if h_1 = h_2 we can choose π(1) = 1, π(2) = 2 or π(1) = 2, π(2) = 1—and, depending on the underlying FM, these sorts can yield different ChI outputs. Most sort algorithms use an increasing or decreasing rule, i.e., default to one case; we instead average the tied cases.) Without loss of generality, this extends to any N. As the reader can see, ChIMP represents the ChI by a set of LCS operators and it uses a selection network to pick one of these results. Technically, our solution can learn and compute the ChI, but it has more functionality (freedom) than a standard ChI as we made the N! × N weights independent (and in ℝ versus ℝ^+), versus reducing them (sharing weights) into the underlying 2^N FM variables, which can be done. However, our goal in this section is not to make the simplest possible network; it is to show that an NN can represent the ChI.
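The following small numerical sketch (our illustration; the function names are ours) mirrors Example 2: two LCS neurons, one per sort order, gated by the modified unit step, with ties returning the average of the two LCS outputs:

```python
import numpy as np

def u(x, k=None):
    """Modified unit step of Example 2: 1 for x > 0, 0 for x < 0, 1/2 at 0.
    Passing k uses the smooth logistic surrogate 1/(1 + e^{-kx}) instead."""
    if k is not None:
        return 1.0 / (1.0 + np.exp(-k * x))
    return np.where(x > 0, 1.0, np.where(x < 0, 0.0, 0.5))

def chimp_n2(h1, h2, g1, g2):
    """Example 2's network: two LCS neurons (one per sort) and a
    selection stage built from unit steps; g1 = g({x1}), g2 = g({x2})."""
    lcs_12 = h1 * g1 + h2 * (1.0 - g1)      # sort h1 >= h2
    lcs_21 = h2 * g2 + h1 * (1.0 - g2)      # sort h2 > h1
    return u(h1 - h2) * lcs_12 + u(h2 - h1) * lcs_21

print(chimp_n2(0.8, 0.3, g1=0.6, g2=0.2))   # 0.8*0.6 + 0.3*0.4 = 0.60
print(chimp_n2(0.5, 0.5, g1=0.6, g2=0.2))   # 0.50: ties average the two LCSs
```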
Remark 2. As discussed in the introduction, answers for fusion are in the eye of the beholder; that is, context matters. Figure 1 does indeed give us the ChI. For example, we could fuse the softmax outputs, i.e., decision-in-decision-out (DIDO) fusion, of multiple deep learners (e.g., ResNet and GoogLeNet). On the other hand, if ChIMP was pushed back in the network, possibly connected directly to the inputs, it would likely function differently, e.g., signal-in-signal-out (SISO) or SIDO versus DIDO. For example, each LCS neuron could represent a matched filter and the selection network would pick one result. We mention this because it (that is, context) is substantial for XAI.

Fig. 1: NN to compute the ChI for N = 2. The first two blue neurons (dot products) on the left are the N! LCSs, the red neurons (nonlinearities) select an LCS (based on input sort order), and the right blue neuron sums the results. Dot is the dot product and Unit Step specifics are outlined in Section III. Multiple inputs to nodes are summed.
Remark 3. Our N! LCS-based ChIMP is not the only solution. Another example is based on the Mobius transform; see Fig. 2. The Mobius transform of g is

m(A) = \sum_{B \subseteq A} (-1)^{|A \setminus B|} g(B), ∀A ⊆ X,   (4)

which is invertible via the Zeta transform,

g(A) = \sum_{B \subseteq A} m(B), ∀A ⊆ X.   (5)

The Mobius transform of the ChI is

C_g(h) = \sum_{A \subseteq X} m(A) \bigwedge_{x_i \in A} h_i.   (6)

Thus, the Mobius ChI can be thought of as a dot product of Mobius terms and a t-norm of the integrand term h. There is no sort in the Mobius integral; the tradeoff is summing across 2^N versus N values. The point is, there are many ChIMPs. We presented two, but more of them likely exist.

Fig. 2: Mobius transform-based ChIMP.
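As a hedged illustration of Eqs. (4)-(6) (our code, with helper names of our choosing), the Mobius-based ChIMP can be checked numerically against the sorted form of Example 1:

```python
from itertools import combinations

def mobius(g, X):
    """Mobius transform, Eq. (4): m(A) = sum_{B subset-eq A} (-1)^{|A\\B|} g(B).
    Only nonempty B are summed since g(empty set) = 0."""
    m = {}
    for r in range(1, len(X) + 1):
        for A in combinations(X, r):
            A = frozenset(A)
            m[A] = sum((-1) ** (len(A) - len(B)) * g[frozenset(B)]
                       for rb in range(1, len(A) + 1)
                       for B in combinations(sorted(A), rb))
    return m

def choquet_mobius(h, m):
    """Sort-free ChI of Eq. (6): dot product of Mobius terms with
    minimums (the t-norm of Fig. 2) of the inputs indexed by each A."""
    return sum(mA * min(h[i] for i in A) for A, mA in m.items())

g = {frozenset({0}): 0.6, frozenset({1}): 0.2, frozenset({0, 1}): 1.0}
m = mobius(g, [0, 1])
print(choquet_mobius([0.8, 0.3], m))   # 0.60, matching Eq. (3) for h_1 > h_2
```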
Remark 4. The above N! LCS-based and Mobius transform-based ChIMPs do not scale well with respect to N. For example, the Mobius-based ChIMP has 2^N t-norm neurons and one dot product. At higher levels (closer to the output, e.g., DIDO) in a neural network, N might not be large (on the order of 10 classes), so all is well. However, if we consider using ChIMP at lower levels in a network, a large N could render ChIMP intractable. For example, consider fusing a set of convolutional filters of size 11 × 11. When unrolled, the 11 × 11 gives rise to N = 121. As 2^121 is a very large number, one could reduce ChIMP network complexity with respect to a method like k-additivity,

C^k_g(h) = \sum_{A \subseteq X, |A| \leq k} m(A) \bigwedge_{x_i \in A} h_i,   (7)

which uses tuples only up to, and including, k. The point is this: as N grows, the ChI can be restricted to suit the needs of an application at the expense of the loss of some expressability.

In summary, our response to Q1 is yes, an NN can represent the ChI and therefore a wide class of useful aggregation operators, from the min to max, average, and more exotic variants as well. Furthermore, there are multiple ways (architectures) to achieve this. Technically, there are an infinite number of possibilities, e.g., a recursive argument in which each solution is expanded by a single neuron, which could be bypassed or turned off by setting its weights to all zeros. This problem—existence of multiple ways to encode a solution—is well-known in many communities, e.g., bloating in genetic programming, which can be addressed using cost function augmentation with a complexity term. In the next section we explore the iChIMP architecture, which is simple to optimize using SGD and whose weights are explicit, enabling XAI.

Fig. 3: FM learnable network for N = 3. Dashed lines are learnable weights and dot has fixed w = (1, ..., 1)^t.

IV. iCHIMP

In this section we present iChIMP, an NN with an SGD solution. As such, this addresses Q2. To this end, we explore an alternative way of writing the ChI [20],

C_g(h) = \sum_{A \subseteq X} g(A) o(A),   (8)

where

o(A) = \begin{cases} \max\left(0, \bigwedge_{x_i \in A} h_i - \bigvee_{x_j \notin A} h_j\right), & A \subset X, \\ \bigwedge_{x_i \in A} h_i, & A = X. \end{cases}   (9)

A. Measure Network
Our idea is to design an FM NN. This network consists of constants, learnable weights, and existing neural mathematics (dot product, ReLU nonlinearity, and maximum neurons). Figure 3 illustrates the network for N = 3.

Specifically, the densities (FM variables whose set cardinality is 1) are represented as a weight vector, and a nonlinearity (e.g., ReLU) enforces the lower boundary condition. Next, each tuple is expressed as

g(A) = \bigvee_{B \subset A} g(B) + \Delta g(A),   (10)

where the underlying weight Δw_g(A) ∈ ℝ. Like before, a positive-value-enforcing nonlinearity is used to ensure the monotonicity property, forcing Δg(A) to reside in ℝ^+. If g(X) is required to be 1, then all of g can be renormalized by taking the minimum of g and 1. Otherwise, we can ignore the upper boundary condition, since this is not a hard requirement. (Note that the ChI formulation herein differs from that of [20] in one respect: Eq. (8) is for ℝ-valued inputs, whereas that in [20] is for ℝ^+ inputs.)

Fig. 4: Neural architecture for the integrand and N = 3; using f(a, b) = max(0, a - b), max, and min neurons, the outputs are o_1 = max(0, h_1 - h_2 ∨ h_3), o_2 = max(0, h_2 - h_1 ∨ h_3), o_3 = max(0, h_3 - h_1 ∨ h_2), o_12 = max(0, h_1 ∧ h_2 - h_3), o_13 = max(0, h_1 ∧ h_3 - h_2), o_23 = max(0, h_2 ∧ h_3 - h_1), and o_123 = h_1 ∧ h_2 ∧ h_3. Note, there are no learnable weights.

Fig. 5: Neural architecture that combines Figs. 3 and 4: the ChI is the dot product of O = [o_1 o_2 o_3 ... o_X] with the FM variables [g_1 g_2 g_3 ... g_X].

B. Integral and Evaluation Networks
The next piece of iChIMP is expanding the 2^N - 1 integrand terms in Eq. (8), as shown in Fig. 4. This MLP has no trainable weights. The network can be achieved using max, min, and a custom f(a, b) = max(0, a - b) neuron. The final step is a single dot product (see Fig. 5) of the expanded integrand terms (see Fig. 4) and the FM variable values (see Fig. 3).
C. iChIMP Optimization

For readability, the derivation of iChIMP SGD optimization is presented in the Appendix.
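Although the Appendix derives the gradients by hand, the same updates can be realized with a modern autograd toolkit. Below is a minimal, hypothetical PyTorch sketch of iChIMP (our reconstruction of Figs. 3-5 and Eqs. (8)-(10), not the authors' released code); the ReLUs realize the boundary and monotonicity constraints, so an off-the-shelf SGD optimizer applies without explicit inequality constraints:

```python
import itertools
import torch

class IChIMP(torch.nn.Module):
    """Sketch of iChIMP for small N: a measure network (Fig. 3), the
    integrand terms o(A) of Eq. (9), and their dot product, Eq. (8)."""
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.subsets = [frozenset(s) for r in range(1, n + 1)
                        for s in itertools.combinations(range(n), r)]
        # One raw weight per nonempty subset (densities and Delta w's);
        # the small random initialization is our choice.
        self.raw = torch.nn.ParameterDict(
            {self._key(A): torch.nn.Parameter(0.1 * torch.rand(()))
             for A in self.subsets})

    @staticmethod
    def _key(A):
        return "g" + "".join(str(i) for i in sorted(A))

    def measure(self):
        """Eq. (10): g(A) = max over immediate subsets + ReLU(Delta w), so
        monotonicity and the boundary condition hold by construction."""
        g = {}
        for A in self.subsets:                    # increasing cardinality
            delta = torch.relu(self.raw[self._key(A)])
            if len(A) == 1:
                g[A] = delta
            else:
                g[A] = torch.stack([g[A - {i}] for i in A]).max() + delta
        return g

    def forward(self, h):
        g, out = self.measure(), 0.0
        for A in self.subsets:
            inside = torch.stack([h[i] for i in A]).min()
            if len(A) == self.n:
                o = inside                        # o(X) of Eq. (9)
            else:
                outside = torch.stack([h[j] for j in range(self.n)
                                       if j not in A]).max()
                o = torch.relu(inside - outside)  # o(A) of Eq. (9)
            out = out + g[A] * o                  # Eq. (8)
        return out

# One SGD step; autograd reproduces the Appendix gradients numerically.
model = IChIMP(3)
opt = torch.optim.SGD(model.parameters(), lr=0.05)  # lr is our choice
h, label = torch.tensor([0.9, 0.4, 0.1]), torch.tensor(0.5)
loss = 0.5 * (model(h) - label) ** 2
opt.zero_grad(); loss.backward(); opt.step()
```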
V. XAI FOR THE CHOQUET INTEGRAL
In this section, a benefit of designing an explicit neural fusion network is highlighted. In [21], we established ChI indices for XAI. The reader can refer to [21] for the full mathematical explanation. Due to manuscript length, we are only able to summarize the indices.

The first category of fusion XAI indices explains the quality of the individual sources and their interaction characteristics. The utility of a source, e.g., a deep model, can be extracted via the Shapley index, Φ_g(i) ∈ [0, 1], where \sum_{i=1}^{N} Φ_g(i) = 1. On the other hand, the Interaction Index [22], I_g(i, j) ∈ [-1, 1], informs us about how two deep models interact with one another—that is, is there an advantage in combining sources. A value of 1 represents the maximum complementarity between i and j; a value of -1 represents the maximum redundancy between i and j.

A second category of fusion XAI indices tells us what aggregation was learned. This helps us determine if the data are being combined in a union, intersection, average, or perhaps more unique (and worthy of the ChI) fashion. In [21], we posed distance formulas to measure how similar a learned g is to the maximum, minimum, mean, and, in general, LOS.

The third category of fusion XAI indices is data centric. In [21], we determined how often an FM variable is encountered in training data, which helps us find missing FM variables. We also calculated what percentage of FM variables were observed in training data, what percentage of the N! possible LCSs were observed, and if there is a dominant walk (and therefore a lack of training data variety). These indices ultimately inform us about the quality of a learned solution and they highlight what is incomplete with respect to our model/training data. We also postulated a trust index based on what percentage of missing variables are used in a ChI calculation.

In summary, in [21] we discussed existing methods and proposed new ways to elicit information about a learned fusion. Since iChIMP is an explicit neural architecture, meaning we know which network elements map to which FM variables, XAI can help us understand, validate, and do iterative development.
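For reference, the first category of indices can be computed directly from a learned FM; the sketch below (ours) uses the standard Shapley and Murofushi-Soneda interaction formulas, which are consistent with the ranges quoted above, though the exact implementation in [21] may differ:

```python
from itertools import combinations
from math import factorial

def shapley(g, N):
    """Shapley importance of each source i (standard definition; the
    values sum to g(X), i.e., to 1 for a normalized FM)."""
    phi = []
    for i in range(N):
        rest = [k for k in range(N) if k != i]
        total = 0.0
        for r in range(len(rest) + 1):
            for A in combinations(rest, r):
                w = factorial(N - r - 1) * factorial(r) / factorial(N)
                total += w * (g[frozenset(A) | {i}] - g.get(frozenset(A), 0.0))
        phi.append(total)
    return phi

def interaction(g, N, i, j):
    """Murofushi-Soneda interaction index [22]: > 0 complementary,
    < 0 redundant, 0 for an additive measure."""
    rest = [k for k in range(N) if k not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):
        for A in combinations(rest, r):
            A = frozenset(A)
            w = factorial(N - r - 2) * factorial(r) / factorial(N - 1)
            total += w * (g[A | {i, j}] - g[A | {i}] - g[A | {j}] + g.get(A, 0.0))
    return total

g = {frozenset({0}): 0.6, frozenset({1}): 0.2, frozenset({0, 1}): 1.0}
print(shapley(g, 2))            # [0.7, 0.3]: source 0 carries more worth
print(interaction(g, 2, 0, 1))  # 0.2: mildly complementary
```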
TABLE I: Target FMs for Experiment 1. Only the FM2 row (the mean) is recoverable here: g({x_i}) = 1/3 for each singleton and g({x_i, x_j}) = 2/3 for each pair, with g(X) = 1. FM1 (soft-max), FM3 (soft-mean), and FM4 (arbitrary) are described in the text.
VI. EXPERIMENTS AND RESULTS
In this section, two experiments are performed. The first experiment uses synthetic data. As such, we know the answer and we can control all factors, e.g., noise. We generate familiar operators that range from optimistic union-like to pessimistic intersection-like, average-like, and random operators. The purpose is to show range and variation in the FM and our ability to learn it. Next, we take the validated iChIMP and use it to fuse a set of heterogeneous architecture DCNNs, for which we note no one knows the solution. The purpose of this experiment is to demonstrate iChIMP on real-world benchmark data and to compare it to existing work.
A. Experiment 1: Synthetic Data Set
The objective of Experiment 1 is to show that we recover the correct ChI and to compare iChIMP to an existing (non-neural) way of solving the ChI, i.e., quadratic programming (QP) [23]. Our data are generated pseudo-randomly from a uniform distribution on the unit interval and consist of M = 300 samples and three inputs (N = 3). We use four FMs with disparate aggregation behavior—FM1 is a soft-max, FM2 is a mean, FM3 is a soft-mean, and FM4 is an arbitrary FM—to generate their labels. Table I shows the four target FMs.

In order to investigate the impact of noise on learning, we perturb the true labels with random noise sampled from a Gaussian distribution. We consider six noise levels ranging from no noise to 50% of the true label standard deviation σ_y, i.e., σ_n ∈ {0, 0.01σ_y, 0.05σ_y, 0.1σ_y, 0.3σ_y, 0.5σ_y}.

The data are partitioned into two segments, one for training and one for test, and iChIMP is trained with a fixed learning rate and number of epochs. The iChIMP variables are initialized with a soft-mean-like FM with values randomly picked from a uniform distribution. For each FM, optimization is performed on the training data to learn the FM variables, which are then used to estimate the label/output of the test data. The mean squared error (MSE) with respect to the true training labels, true test labels, and FM variables is computed and used as a performance metric. The optimization task was repeated multiple times for iChIMP with different initializations, and we report the average of the resultant MSEs. Table II contains the results of Experiment 1.

Table II tells the following story. The MSEs for the individual methods and their differences are quite low (on the order of 10^-5 to 10^-4) even at noise levels as high as 0.5σ_y. This means that even though iChIMP is optimizing a non-convex network (with its ReLU, max, and min functions), it provides an approximation of the integrals as good as the QP method.
Experiment 1 validates iChIMP, and Experiment 2 uses it to fuse a set of heterogeneous deep CNNs (DCNNs) for remote sensing. An outstanding challenge in deep learning is network architecture. In general, no architecture has been shown to be superior across data sets. This is why we investigate the fusion of different architectures.

Two benchmark remote sensing data sets are investigated herein for land cover classification and object detection. The aerial image data set (AID) has 30 classes, approximately 330 images per class, and a ground sampling distance (GSD) that varies between 0.5 and 8 meters. The remote sensing imagery scene classification-45 (R45) data set was specifically designed to be challenging for remote sensing image scene classification. It contains 45 classes with 700 images per class and a variable GSD of 0.2 to 30 meters, although the vast majority of the R45 classes have a small GSD. The DCNNs are trained with dropout.
TABLE II: Experiment 1 Label and FM Variable Error Rates at Noise Levels σ_n = {0, 0.01σ_y, 0.05σ_y, 0.1σ_y, 0.3σ_y, 0.5σ_y}.

                         Label Error                              |                    FM Variable Error
           0        0.01     0.05     0.1      0.3      0.5      |  0        0.01     0.05     0.1      0.3      0.5
FM1 ChIMP  1.2E-15  3.4E-08  8.5E-07  3.4E-06  3.1E-05  8.5E-05  |  5.3E-15  1.4E-07  3.4E-06  1.4E-05  1.2E-04  3.4E-04
FM1 QP     1.4E-05  1.4E-05  1.8E-05  2.2E-05  5.4E-05  1.1E-04  |  1.1E-04  1.1E-04  1.4E-04  1.7E-04  3.3E-04  6.1E-04
FM2 ChIMP  1.2E-18  3.1E-08  7.9E-07  3.1E-06  2.8E-05  7.9E-05  |  9.5E-18  1.3E-07  3.1E-06  1.3E-05  1.1E-04  3.1E-04
FM2 QP     5.1E-06  3.9E-06  3.7E-06  4.8E-06  2.4E-05  6.7E-05  |  3.4E-05  2.8E-05  2.2E-05  2.1E-05  7.1E-05  2.0E-04
FM3 ChIMP  4.1E-20  3.4E-08  8.4E-07  3.4E-06  3.0E-05  8.4E-05  |  3.1E-19  1.3E-07  3.3E-06  1.3E-05  1.2E-04  3.3E-04
FM3 QP     1.1E-05  1.1E-05  1.1E-05  1.4E-05  4.0E-05  8.9E-05  |  5.8E-05  5.9E-05  5.8E-05  6.8E-05  1.7E-04  3.5E-04
FM4 ChIMP  1.8E-19  3.4E-08  8.4E-07  3.4E-06  3.0E-05  8.4E-05  |  1.1E-18  1.3E-07  3.4E-06  1.3E-05  1.2E-04  3.4E-04
FM4 QP     2.8E-06  2.5E-06  3.1E-06  4.7E-06  2.7E-05  7.3E-05  |  1.6E-05  1.4E-05  1.5E-05  1.8E-05  8.8E-05  2.4E-04
The trained DCNNs are then used in a locked state, i.e., no further learning happens during the fusion stage. The training of the DCNNs is done in a five-fold cross-validation manner; we have 5 sets of 80% training and 20% testing for both data sets. Per DCNN fold, three-fold cross-validation (CV) fusion is used (due to the limited volume of data). Table III summarizes the performance of the DCNNs and our fused solution for the test data sets. In particular, iChIMP outperforms the individual networks, reducing the error rate by 40% (3.8% down to 2.27%) over the best single DCNN architecture for AID, and similarly yielding a 30% relative error rate reduction for R45.

In most NNs, accuracy is the prime objective, and sometimes the only objective. However, we can apply our XAI indices to "open up" the learned solutions and interpret what is going on. On both R45 and AID, the learned Shapley values for CaffeNet, DenseNet, GoogLeNet, InceptionResNetV2, ResNet50, ResNet101, and Xception indicate that the deep nets have near equal overall importance, with perhaps the exception of Xception. Next, our XAI aggregation operator distance indices had a value of approximately 0 relative to the mean. As such, we know that we learned an additive measure, which is reinforced by our Interaction Index values near 0. At the moment, our XAI indices are evidence. That is, their results need to be analyzed by an expert to determine what is going on. In future work we will discover a way to automatically reason about our various XAI information to deduce high-level linguistic summaries for non-experts.

In prior work [8], we explored the "offline" fusion—QP versus iChIMP—of three DCNNs. Herein, we repeat those experiments using iChIMP. This is done because we are interested to see if fusion learned a mean in part due to adding more deep models. As we discovered in [8], we do not always learn equal Shapley values. For example, on RSD we fused CaffeNet, GoogLeNet, and ResNet50, and the shared weight solution had clearly non-uniform Shapley values. Furthermore, if we trained a different iChIMP per class, versus a single shared weight solution across classes, then, for example, the bridge and mountain classes preferred different deep models. This informs us that while a single shared weight iChIMP has the best overall accuracy, using fewer deep models leads to non-uniform Shapley values and non-mean-like aggregations. Furthermore, it informs us that different classes prefer different deep models. The question we need to address is why.

In [21], we created XAI indices that tell us which FM variables cannot be approximated from training data. To no surprise, based on the above, we ran our XAI indices and found that the "problem classes" that are bringing down the overall average per-class iChIMP solution are likely due to a lack of training data variety (aka, they have a large percentage of un-approximated FM variables). This is probably a contributing factor to why the shared weight iChIMP performs better than N per-class iChIMPs, and possibly why a mean aggregation might be a good general strategy for combining a larger number of deep models; e.g., seven versus three, which means we need to encounter 7! = 5,040 versus 3! = 6 unique sorts.
TABLE III: Experiment 2 ChIMP Accuracy Results (%) on Benchmark AID and R45 Data Sets. A dash denotes a value that could not be recovered; the last four R45 SDs are recomputed from the fold values.

             CaffeNet  DenseNet  GoogleNet  InceptionResNetV2  ResNet101  ResNet50  Xception  Shared ChIMP
AID  Fold 1    93.55     95.40     95.70          96.20           96.20     95.65     97.40        —
     Fold 2    93.00     94.90     95.30          93.75           96.15     96.15     96.90        —
     Fold 3    94.40     94.35     94.80          95.35           95.30     95.20     95.70        —
     Fold 4    93.60     95.40     95.00          93.40           95.05     95.30     95.70        —
     Fold 5    94.70     94.65     94.80          95.95           96.15     96.10     96.80        —
     Mean      93.85     94.94     95.12          94.93           95.77     95.68     96.50        —
     SD         0.69      0.46      0.38           1.28            0.55      0.44      0.76        —
R45  Fold 1    93.17     94.81     94.70          95.43           95.25     95.17     96.43        —
     Fold 2    93.41     95.59     95.60          95.65           95.60     95.44     95.97        —
     Fold 3    92.79     94.67     95.48          95.76           95.43     95.35     95.97        —
     Fold 4    93.51     93.60     95.62          95.51           95.68     95.41     96.11        —
     Fold 5    93.29     95.08     95.46          95.76           95.62     95.57     96.06        —
     Mean      93.23     94.75     95.37          95.62           95.52     95.39     96.11        —
     SD         0.28      0.73      0.38           0.15            0.18      0.15      0.19        —
In [21] we also created an XAI index called dominant walk. A dominant walk is when a large percentage of the input data is sorted in a specific permutation order. We ran this index on our iChIMP solutions and discovered that there is typically a very dominant walk order of h_1 ≥ h_2 ≥ ... ≥ h_N, the default order. This finding and pattern was too coincidental for our liking. Upon inspection, we discovered this is because a majority of our data has all of the networks' certainty in a single label (and 0s otherwise). As such, 1, 2, ..., N is the default sort order. This means that we have strong learners and there is not a lot of diverse information (variety) to help us learn fusion. As such, a better solution, to be addressed in future work, would be to learn these networks in parallel now that the ChI can be integrated into a homogeneous neural solution.

In summary, in this subsection we used iChIMP to fuse a set of heterogeneous DNNs. Furthermore, iChIMP has XAI tools that allow us to understand, explain, and pursue the design of new data collections and better architectures.

C. Computational Complexity
In this subsection, a time complexity analysis is provided. This study consists of three steps: assessing the complexity of (i) o(A), (ii) g(A), and finally (iii) C_g, the dot product of g and o, as in Eq. (8).

1) o(A): For A ⊂ X, the computation of o(A)—see Eq. (9)—has one minimum on x (1 ≤ x ≤ N - 1) elements, one maximum on N - x elements, one maximum on two elements, and one subtraction. This gives N + 3 operations, as O(n) is the worst case complexity for finding a maximum of n elements and O(1) for subtraction operations. Using a similar analysis, the number of operations for A = X is N. As a result, the total cost of computing o(A) for all A, where A ⊆ X and A ≠ ∅, is (2^N - 2)(N + 3) + N. The complexity of o(A) is the highest term without the constant, i.e., O(N 2^N).

2) g(A): Eq. (10) for g(A) has two parts: (a) Δg(A), a ReLU on two elements (thus cost 2); and (b) \bigvee_{B ⊂ A} g(B), which by monotonicity reduces to a maximum over the |A| immediate subsets of A (those of cardinality |A| - 1), with a cost of |A| - 1. As a result, the cost of g(A) is |A| + 1. Let NC_x be the number of combinations of N elements taken x at a time; there are NC_{|A|} sets with cardinality |A|. The total cost of computing g(A) for all A is NC_1 × 2 + ... + NC_N × (N + 1) < 2^N (N + 1), because NC_1 + ... + NC_N < 2^N. This results in a time complexity of O(N 2^N) for the computation of g(A).

3) C_g: Eq. (8) involves the dot product of g and o with 2^{N+1} floating point operations (FLOPS), including 2^N additions and 2^N multiplications, resulting in O(2^N) complexity.

Combining all three components, the time complexity of iChIMP is O(N 2^N) + O(N 2^N) + O(2^N) = O(N 2^N). This means that iChIMP has complexity of exponential order, O(N 2^N), w.r.t. the number of inputs, and complexity of polynomial order, O(M log(M)), with respect to the number of FM variables, M = 2^N.

Next, we discuss the cost of fusing deep models in practical applications. First, many pre-trained models, e.g., the ones used herein for large datasets like ImageNet, are publicly available. Common practice is to apply transfer learning, which is an offline procedure that is inexpensive relative to training an equivalent network from scratch. Next, the fusion of N models only adds 2^N variables, where 2^N is minuscule in relation to the number of parameters in the deep models. Translated, offline learning of the 2^N fusion variables is nominal. Online is a similar story. iChIMP is a tiny parallel network, relative to the complexity of a deep model, of common deep learning operations that have been accelerated in tool kits like TensorFlow and PyTorch. The only worthwhile cost of fusing N deep models is the storage and computational cost associated with evaluating N deep models, which can be performed in parallel. However, as evident in our experiments, fusion improves performance. Overall, the expense of fusing a set of deep models can be rationalized relative to the current dilemma of not knowing which deep model is correct.
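Under the unit-cost assumptions of the analysis above, these counts can be tabulated directly (a small sketch of ours):

```python
from math import comb

def ichimp_ops(N):
    """Operation counts from the Section VI-C analysis (max/min/subtract/
    ReLU are all counted as unit-cost operations)."""
    o_ops = (2 ** N - 2) * (N + 3) + N                          # integrand o(A)
    g_ops = sum(comb(N, a) * (a + 1) for a in range(1, N + 1))  # measure g(A)
    dot_ops = 2 ** (N + 1)                                      # 2^N adds + 2^N multiplies
    return o_ops + g_ops + dot_ops

for N in (3, 7, 10):
    print(N, ichimp_ops(N), N * 2 ** N)   # totals grow on the order of N * 2^N
```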
VII. CONCLUSION AND FUTURE WORK

In this article, we proposed a novel NN architecture with a gradient descent-based optimization solution that mimics the Choquet integral for information aggregation. This solution was demonstrated on synthetic data for validation and on real-world benchmark data sets in remote sensing, relative to the fusion of a set of heterogeneous architecture deep models. Adding iChIMP to the top of a set of deep neural networks produces a deep fuzzy neural network. Furthermore, we analyzed similarities and differences between multi-layer nets and the ChI, with respect to factors like representation and constrained learning algorithms. Last, our recently established XAI indices were used to "open up" our learned deep neural solutions, enabling interpretability, helping us understand what was learned, identifying flaws in our training data, and ultimately designing better deep model solutions.

The proposed NN architectures are not just limited to scalar-valued inputs and can be applied to higher-order fuzzy sets (FSs), such as Type-1 or Type-2, using Zadeh's extension principle. If one wishes to use an NN for higher-order FS-valued inputs, and can define an objective function and associated gradients, then using our proposed ChI-based NN is as straightforward as it is for real-valued inputs. Extensions of the fuzzy ChI for FS-valued inputs have been previously proposed [31]; we suggest application and extension of iChIMP for follow-on work.

In future work, now that we have the iChIMP foundation, we will explore efficient representations—e.g., k-additivity or our recently established data-compressed ChI [16]. We will also investigate advanced learning algorithms—e.g., dropout [32], regularization [33], etc.—with regard to the 2^N - 2 free variables. Once this is achieved we can push the ChI neuron back in the network for signal- and feature-level fusion, versus decision-level fusion. Furthermore, we will explore where and when a fusion neuron should be used, akin to the current revolution of what architecture should be employed. We will also, in the future, explore the possible benefit of enforcing g(X) = 1, focusing on what exact penalty functions to use to enforce a soft boundary. We will also continue to explore XAI and discover ways to linguistically summarize their contents for human consumption, and to use them possibly in optimization directly to encourage certain factors, e.g., diversity, specificity, or efficiency. In this paper, we used iChIMP to fuse pre-trained DCNNs. Next, we will explore how to simultaneously learn iChIMP and the component networks to improve factors like accuracy and network diversity.
APPENDIX

A. Derivative of maximum
The derivative of f = max{f_1(x), f_2(x)} is

\frac{df(x)}{dx} = \begin{cases} df_1(x)/dx, & f_1(x) > f_2(x) \\ df_2(x)/dx, & f_1(x) < f_2(x) \\ \frac{1}{2}\left( df_1(x)/dx + df_2(x)/dx \right), & f_1(x) = f_2(x). \end{cases}   (11)

Let J_{f=f_i} be an indicator function that denotes whether f_i is equal to f (in other words, the max of the f_i's) or not, i.e., J_{f=f_i} = 1 if f(x) = f_i(x), and 0 else. As such, we can write (11) as

\frac{df(x)}{dx} = \frac{\sum_{i=1}^{2} J_{f=f_i} \, df_i(x)/dx}{\sum_{i=1}^{2} J_{f=f_i}} = \sum_{i=1}^{2} I_{f=f_i} \frac{df_i(x)}{dx},   (12)

where I_{f=f_i} = J_{f=f_i} / \sum_{i=1}^{2} J_{f=f_i} is a normalized indicator variable. Respectively, (12) can be generalized for the case of f(x) = max{f_1(x), ..., f_n(x)} as

\frac{df(x)}{dx} = \sum_{i=1}^{n} I_{f=f_i} \frac{df_i(x)}{dx}.
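As a quick numerical sanity check of Eq. (12) (our snippet, not from the paper), the normalized-indicator rule returns the active branch's derivative away from ties and the average at ties:

```python
import numpy as np

def dmax(fs, dfs):
    """Generalized derivative of f = max{f_1, ..., f_n} per Eq. (12):
    average the derivatives of whichever f_i attain the max (ties split)."""
    fs, dfs = np.asarray(fs, float), np.asarray(dfs, float)
    hit = fs == fs.max()                   # indicator J_{f = f_i}
    return (hit * dfs).sum() / hit.sum()   # normalized indicator I_{f = f_i}

print(dmax([1.0, 3.0], [5.0, 7.0]))   # 7.0: only f_2 is active
print(dmax([2.0, 2.0], [5.0, 7.0]))   # 6.0: a tie averages the two slopes
```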
B. Derivative of min

The derivative for f(x) = min{f_1(x), ..., f_n(x)} is likewise

\frac{df(x)}{dx} = \sum_{i=1}^{n} I_{f=f_i} \frac{df_i(x)}{dx}.
C. Gradients of weights

Let y_k, k = 1, ..., M, be the iChIMP output for observation o_k. For a set of data, the error is

E = \frac{1}{2} \sum_{k=1}^{M} (l_k - y_k)^2.   (13)

For notational simplicity, we define 2^N - 1 - N max-of-subset auxiliary variables (not defined on the densities nor the empty set; see Fig. 3), g^m. For example, g^m_{12} = max(g_1, g_2) and g_{12} = g^m_{12} + max(Δw_{g_{12}}, 0). Without loss of generality, the following intermediate formulas are provided relative to N = 3:

∂E/∂y_k = (y_k - l_k) = e_k,   (14)
∂y_k/∂g_i = o_i,   (15)
∂g_{12}/∂g^m_{12} = 1,   (16)
∂g^m_{12}/∂g_1 = I(g^m_{12} = g_1),   (17)

and the same holds for the other g^m terms, respectively. As such,

∂g_{12}/∂g_1 = (∂g_{12}/∂g^m_{12})(∂g^m_{12}/∂g_1) = I_{g^m_{12}=g_1},   (18)

and

∂g_{12}/∂Δw_{g_{12}} = 1 if Δw_{g_{12}} > 0, else 0.   (19)

Furthermore, we define the indicator variable I_{c>0} = 1 if c > 0, else 0, and thus ∂g_{12}/∂Δw_{g_{12}} = I_{Δw_{g_{12}}>0}. Next, the gradients are

∂E/∂g_{123} = (∂E/∂y_k)(∂y_k/∂g_{123}) = e_k o_{123},   (20)

∂E/∂g_{12} = (∂E/∂y_k)(∂y_k/∂g_{12} + (∂y_k/∂g_{123})(∂g_{123}/∂g_{12})) = e_k (o_{12} + o_{123} I_{g^m_{123}=g_{12}}),   (21)

and similarly for ∂E/∂g_{13} and ∂E/∂g_{23}. In a similar fashion, the error gradient for each density is

∂E/∂g_1 = (∂E/∂y_k)(∂y_k/∂g_1 + (∂y_k/∂g_{12})(∂g_{12}/∂g_1) + (∂y_k/∂g_{13})(∂g_{13}/∂g_1) + (∂y_k/∂g_{123})(∂g_{123}/∂g_1))
        = e_k (o_1 + o_{12} I_{g^m_{12}=g_1} + o_{13} I_{g^m_{13}=g_1} + o_{123}(I_{g^m_{123}=g_{12}} I_{g^m_{12}=g_1} + I_{g^m_{123}=g_{13}} I_{g^m_{13}=g_1})),   (22)

and similarly for ∂E/∂g_2 and ∂E/∂g_3. Last,

∂E/∂Δw_{g_{123}} = (∂E/∂g_{123})(∂g_{123}/∂Δw_{g_{123}}) = e_k o_{123} I_{Δw_{g_{123}}>0},   (23)

∂E/∂Δw_{g_{12}} = (∂E/∂g_{12})(∂g_{12}/∂Δw_{g_{12}}) = e_k (o_{12} + o_{123} I_{g^m_{123}=g_{12}}) I_{Δw_{g_{12}}>0},   (24)

and similarly for ∂E/∂Δw_{g_{13}} and ∂E/∂Δw_{g_{23}}.
D. Gradients for inputs

Here we give the expressions for the gradients of the cost function with respect to the inputs h_1, h_2, and h_3. The gradients of the o_i's w.r.t. h_1 are

∂o_1/∂h_1 = I_{o_1 = h_1 - h_2∨h_3},
∂o_2/∂h_1 = -I_{o_2 = h_2 - h_1∨h_3} I_{h_1∨h_3 = h_1},
∂o_3/∂h_1 = -I_{o_3 = h_3 - h_1∨h_2} I_{h_1∨h_2 = h_1},
∂o_{12}/∂h_1 = I_{o_{12} = h_1∧h_2 - h_3} I_{h_1∧h_2 = h_1},
∂o_{13}/∂h_1 = I_{o_{13} = h_1∧h_3 - h_2} I_{h_1∧h_3 = h_1},
∂o_{23}/∂h_1 = -I_{o_{23} = h_2∧h_3 - h_1},
∂o_{123}/∂h_1 = I_{o_{123} = h_1}.

The gradient of h_1 w.r.t. the cost function, E, is

∂E/∂h_1 = (∂E/∂y_k) \sum_i (∂y_k/∂o_i)(∂o_i/∂h_1)
        = e_k (g_1 I_{o_1 = h_1 - h_2∨h_3} - g_2 I_{o_2 = h_2 - h_1∨h_3} I_{h_1∨h_3 = h_1} - g_3 I_{o_3 = h_3 - h_1∨h_2} I_{h_1∨h_2 = h_1} + g_{12} I_{o_{12} = h_1∧h_2 - h_3} I_{h_1∧h_2 = h_1} + g_{13} I_{o_{13} = h_1∧h_3 - h_2} I_{h_1∧h_3 = h_1} - g_{23} I_{o_{23} = h_2∧h_3 - h_1} + g_{123} I_{o_{123} = h_1}).

Similarly,

∂E/∂h_2 = e_k (-g_1 I_{o_1 = h_1 - h_2∨h_3} I_{h_2∨h_3 = h_2} + g_2 I_{o_2 = h_2 - h_1∨h_3} - g_3 I_{o_3 = h_3 - h_1∨h_2} I_{h_1∨h_2 = h_2} + g_{12} I_{o_{12} = h_1∧h_2 - h_3} I_{h_1∧h_2 = h_2} - g_{13} I_{o_{13} = h_1∧h_3 - h_2} + g_{23} I_{o_{23} = h_2∧h_3 - h_1} I_{h_2∧h_3 = h_2} + g_{123} I_{o_{123} = h_2}),

∂E/∂h_3 = e_k (-g_1 I_{o_1 = h_1 - h_2∨h_3} I_{h_2∨h_3 = h_3} - g_2 I_{o_2 = h_2 - h_1∨h_3} I_{h_1∨h_3 = h_3} + g_3 I_{o_3 = h_3 - h_1∨h_2} - g_{12} I_{o_{12} = h_1∧h_2 - h_3} + g_{13} I_{o_{13} = h_1∧h_3 - h_2} I_{h_1∧h_3 = h_3} + g_{23} I_{o_{23} = h_2∧h_3 - h_1} I_{h_2∧h_3 = h_3} + g_{123} I_{o_{123} = h_3}).

REFERENCES
[1] X. Gu, P. P. Angelov, C. Zhang, and P. M. Atkinson, "A massively parallel deep rule-based ensemble classifier for remote sensing scenes," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 3, pp. 345-349, March 2018.
[2] J. M. Keller and D. J. Hunt, "Incorporating fuzzy membership functions into the perceptron algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 693-699, 1985.
[3] P. Karczmarek, A. Kiersztyn, and W. Pedrycz, "On developing Sugeno fuzzy measure densities in problems of face recognition," International Journal of Machine Intelligence and Sensory Signal Processing, vol. 2, no. 1, pp. 80-96, 2017.
[4] R. R. Yager, "Applications and extensions of OWA aggregations," International Journal of Man-Machine Studies, vol. 37, no. 1, pp. 103-122, 1992.
[5] ——, "On ordered weighted averaging aggregation operators in multicriteria decisionmaking," IEEE Transactions on Systems, Man, and Cybernetics, vol. 18, no. 1, pp. 183-190, 1988.
[6] C. Sung-Bae, "Fuzzy aggregation of modular neural networks with ordered weighted averaging operators," International Journal of Approximate Reasoning, vol. 13, no. 4, pp. 359-375, 1995.
[7] S. B. Cho and J. H. Kim, "Combining multiple neural networks by fuzzy integral for robust classification," IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 2, pp. 380-384, 1995.
[8] G. J. Scott, R. A. Marcum, C. H. Davis, and T. W. Nivin, "Fusion of deep convolutional neural networks for land cover classification of high-resolution imagery," IEEE Geoscience and Remote Sensing Letters, 2017.
[9] D. Anderson, G. Scott, M. Islam, B. Murray, and R. Marcum, "Fuzzy Choquet integration of deep convolutional neural networks for remote sensing," in Computational Intelligence for Pattern Recognition, Springer-Verlag, 2018.
[10] M. Sugeno, "Theory of fuzzy integrals and its applications," Ph.D. thesis, Tokyo Institute of Technology, 1974.
[11] G. Choquet, "Theory of capacities," in Annales de l'institut Fourier, vol. 5. Institut Fourier, 1954, pp. 131-295.
[12] H. Tahani and J. Keller, "Information fusion in computer vision using the fuzzy integral," IEEE Transactions on Systems, Man, and Cybernetics, vol. 20, pp. 733-741, 1990.
[13] M. Grabisch and M. Sugeno, "Multi-attribute classification using fuzzy integral," in IEEE International Conference on Fuzzy Systems, 1992, pp. 47-54.
[14] M. Grabisch and J.-M. Nicolas, "Classification by fuzzy integral: Performance and tests," Fuzzy Sets and Systems, vol. 65, no. 2-3, pp. 255-271, 1994.
[15] M. Grabisch, "The application of fuzzy integrals in multicriteria decision making," European Journal of Operational Research, vol. 89, no. 3, pp. 445-456, 1996.
[16] M. A. Islam, D. T. Anderson, A. J. Pinar, and T. C. Havens, "Data-driven compression and efficient learning of the Choquet integral," IEEE Transactions on Fuzzy Systems, vol. PP, no. 99, pp. 1-1, 2017.
[17] T. C. Havens, D. T. Anderson, and C. Wagner, "Data-informed fuzzy measures for fuzzy integration of intervals and fuzzy numbers," IEEE Transactions on Fuzzy Systems, vol. 23, no. 5, pp. 1861-1875, Oct 2015.
[18] C. Wagner and D. T. Anderson, "Extracting meta-measures from data for fuzzy aggregation of crowd sourced information," in IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), June 2012, pp. 1-8.
[19] M. Grabisch, "k-order additive discrete fuzzy measures and their representation," Fuzzy Sets and Systems, vol. 92, no. 2, pp. 167-189, 1997.
[20] IEEE Transactions on Fuzzy Systems, 2018.
[21] B. Murray, M. A. Islam, A. J. Pinar, T. C. Havens, D. T. Anderson, and G. Scott, "Explainable AI for understanding decisions and data-driven optimization of the Choquet integral," in Proceedings of IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2018.
[22] T. Murofushi and S. Soneda, "Techniques for reading fuzzy measures (III): interaction index," in 9th Fuzzy System Symposium, Sapporo, Japan, 1993, pp. 693-696.
[23] D. T. Anderson, S. R. Price, and T. C. Havens, "Regularization-based learning of the Choquet integral," in IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), July 2014, pp. 2519-2526.
[24] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI, vol. 4, 2017, p. 12.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675-678.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[29] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2017.
[30] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, and C. H. Davis, "Training deep convolutional neural networks for land-cover classification of high-resolution imagery," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 4, pp. 549-553, 2017.
[31] D. T. Anderson, T. C. Havens, C. Wagner, J. M. Keller, M. F. Anderson, and D. J. Wescott, "Extension of the fuzzy integral for general fuzzy set-valued information," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1625-1639, 2014.
[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.