1-Dimensional polynomial neural networks for audio signal related problems
Habib Ben Abdallah∗, Christopher J. Henry, Sheela Ramanna

Department of Applied Computer Science, University of Winnipeg, Winnipeg, Manitoba, Canada

∗Corresponding author.
Email addresses: [email protected] (Habib Ben Abdallah), [email protected] (Christopher J. Henry), [email protected] (Sheela Ramanna)
Abstract
In addition to being extremely non-linear, modern problems require millions if not billions of parameters to solve, or at least to obtain a good approximation of the solution, and neural networks are known to assimilate that complexity by deepening and widening their topology in order to achieve the level of non-linearity needed for a better approximation. However, compact topologies are always preferred to deeper ones as they offer the advantage of using fewer computational units and fewer parameters. This compactness comes at the price of reduced non-linearity and thus of a limited solution search space. We propose the 1-Dimensional Polynomial Neural Network (1DPNN) model, which uses automatic polynomial kernel estimation for 1-Dimensional Convolutional Neural Networks (1DCNNs) and introduces a high degree of non-linearity from the first layer, which can compensate for the need for deep and/or wide topologies. We show that this non-linearity introduces more computational complexity but enables the model to yield better results than a regular 1DCNN with the same number of trainable parameters on various classification and regression problems related to audio signals. The experiments were conducted on three publicly available datasets and demonstrate that the proposed model can achieve much faster convergence than a 1DCNN on the regression problems that were tackled.
Keywords:
Audio signal processing, convolutional neural networks, denoising, machine learning, polynomial approximation, speech processing.
1. Introduction
The Artificial Neural Network (ANN) has nowadays become an extremely popular model for machine-learning applications that involve classification or regression [1]. Due to its effectiveness on feature-based problems, it has been extended with many variants, such as Convolutional Neural Networks (CNNs) [2, 3] or Recurrent Neural Networks (RNNs) [4, 5], that aim to solve a broader panel of problems involving, for example, signal processing [6, 7] and/or time series [8, 9]. However, as data becomes more available to use and exploit, problems are becoming richer and more complex, and deeper and bigger neural network topologies [10, 11] are used to solve them. Moreover, the computational load needed to train such models is steadily increasing, despite advances in high performance computing systems, and it may take up to several days just to develop a trained model that generalizes well on a given problem. This aggravation is partially due to the fact that complex problems involve highly non-linear solution spaces, and, to achieve a high level of non-linearity, deeper topologies [12] are required, since every network layer introduces a certain level of non-linearity with an activation function.

A well-known machine-learning trick to alleviate such a problem is to resort to a kernel transformation [13, 14]. The basic idea is to apply a non-linear function (that has certain properties) to the input features so that the search space becomes slightly non-linear and maybe (not always) more adequate for solving the given problem. However, such a trick highly depends on the choice of the so-called kernel, and may not be successful, due to the fact that the non-linearity of a problem cannot always be expressed through usual functions, e.g., polynomial, exponential, logarithmic or circular functions. Moreover, the search for an adequate kernel involves experimenting with different functions and evaluating their individual performance, which is time-consuming. One can also use the kernel trick with neural networks [15, 16], but the same problem remains: what is an adequate kernel for the given problem? The proposed model creates a variant of the 1-Dimensional CNN (1DCNN) that can estimate kernels for each neuron using a polynomial approximation with a given degree for each layer. This work aims to demonstrate that the non-linearity introduced by polynomial approximation may help to reduce the depth (number of layers), or at least the width (number of neurons per layer), of conventional neural network architectures, due to the fact that the estimated kernels are dependent on the problems that are considered, and more specific than the well-known general kernels.

The contribution of this work is a novel variant of the 1DCNN, which we call the 1-Dimensional Polynomial Neural Network (1DPNN), designed to create adequate kernels for each neuron in an automated way, and thus to better approximate the non-linearity of the solution space by increasing the complexity of the search space. We present a formal definition of the 1DPNN model, which includes forward and backward propagation, a new weight initialization method, and a detailed theoretical computational complexity analysis. Our proposed 1DPNN is evaluated in terms of the number of parameters, the computational complexity, and the estimation performance for various audio signal applications involving either classification or regression. Two classification problems were considered: musical note recognition on a subset of the NSynth [17] dataset, and spoken digit recognition on the Free Spoken Digits dataset [18]. Likewise, two regression problems were considered, for which a subset of the MUSDB18 [19] dataset was used, namely audio signal denoising and vocal source separation. Our evaluation methodology is based on comparing the performance of the 1DPNN to that of the 1DCNN under certain conditions, notably the equality of the number of trainable parameters. Therefore, we chose not to create very deep and wide networks, and to restrict our analysis to a fairly manageable number of parameters due to technical limitations.

The outline of our paper is as follows. Section 2 discusses the major contributions made regarding the introduction of additional non-linearities in neural networks. Section 3 formulates the model mathematically by detailing how the forward propagation and the backward propagation are performed, as well as providing a new way to initialize the weights and a detailed theoretical computational analysis of the model, while Section 4 describes how it was implemented and tackles the computational analysis from an experimental perspective. Section 5 describes in detail the experiments that were conducted in order to evaluate the model and shows its results. Finally, Section 6 addresses the strengths and the weaknesses of the model and proposes different ways of extending and improving it.
2. Related Work
Many of the works that try to add a degree of non-linearity to the neural network model focus either on enriching pooling layers [20, 21], creating new layers such as dropout layers [22] and batch normalization layers [23], or designing new activation functions [24, 25]. However, Livni et al. [26] considered changing the way a neuron behaves by introducing a quadratic function to compute its output. By stacking layers and creating a deep network, this quadratic function enables the network to learn any polynomial of a degree not exceeding the number of layers. Moreover, they provide an algorithm that can construct the network progressively. While achieving relatively good results on various problems with a small topology and with minimal human intervention, they only dealt with the perceptron model, which can be inappropriate for the problems they tackled, namely computer vision problems. In fact, the perceptron model does not take into account the spatial proximity of the pixels and only considers an image as a vector with no specific spatial arrangement or relationship between neighboring pixels, which is why a convolutional model is more appropriate for this kind of problem.

Wang et al. [27], on the contrary, considered the 2D convolutional model and changed the way a neuron operates by applying a kernel function on its input, which they call kervolution, thus not changing the number of parameters a network needs to train, but adding a level of non-linearity that can lead to better results than regular CNNs. They used a sigmoidal, a Gaussian and a polynomial kernel and studied their influence on the accuracy measured on various datasets. Furthermore, they used well-known architectures such as ResNet [28] and modified some layers to incorporate the kervolution operator. However, they show that this operator can make the model become unstable when it introduces more complexity than what the problem requires. Although they achieved better accuracy than state-of-the-art models, they still need to manually choose the kernel for each layer, which can be inefficient due to the sheer number of possibilities to choose from and because it can be really difficult to estimate how much non-linearity a problem needs.

Mairal [29], on the other hand, proposed a way to learn layer-wise kernels in a 2D convolutional neural network by learning a linear subspace in a Reproducing Kernel Hilbert Space [30] at each layer, from which feature maps are built using kernel compositions. Although the method introduces new parameters that need to be learned during the backpropagation, its main advantage is that it gets rid of the main problem of using kernels with machine learning, namely the decoupling of learning and data representation. The problems that the author tackled are image classification and image super-resolution [31], and the results that were obtained outperform approaches solely based on classical convolutional neural networks. Nevertheless, due to technical limitations, the kernel learning could not be tested on large networks, and its main drawback is that a pre-parametrized kernel function has to be defined for each layer of the network.

Tran et al. [32] went even further by proposing a relaxation of the fundamental operations used in the multilayer perceptron model, called the Generalized Operational Perceptron (GOP), which allows replacing the summation in the linear combination between weights and inputs by any other operation such as the median, maximum or product, which they call a pool operator.
Moreover, they propose the concept of nodal operators, which apply a non-linear function on the product between the weights and the inputs. They show that this configuration surpasses the regular multilayer perceptron model on well-known classification problems, with the same number of parameters, but with a slight increase in the computational complexity. However, since a pool operator, a nodal operator and an activation function have to be chosen for every neuron in every layer, they devised an algorithm to construct a network with the operators that should supposedly minimize the loss chosen for any given problem. Nevertheless, the main limitation of the model is that such operators need to be created and specified manually as an initial step, and choosing an operator for every neuron is a time-consuming and compulsory preliminary step that has to be performed in order to be able to use the model and to train it.

Kiranyaz et al. [33] extended the notion of GOP to 2-dimensional signals such as images and created the Operational Neural Network (ONN) model, which generalizes 2D convolutions to any non-linear operation without adding any new parameter. They prove that ONN models can solve complex problems like image transformation, image denoising and image synthesis with a minimal topology where CNN models fail to do so with the same topology and number of parameters. They also propose an algorithm that can create homogeneous layers (all neurons inside a layer have the same pool operator, nodal operator and activation function) and choose the operators that minimize any given loss in a greedy fashion. However, the aforementioned limitation still applies to the model since it is based on GOPs, and the operator choosing algorithm does not take into account the intra-layer dependency of the so-called operators. Therefore, they proposed the concept of generative neurons in [34] to approximate nodal operators by adding learnable parameters into the network. While achieving comparable results to the ONN model, they did not get rid of the need to manually choose the pool operator and the activation function.

In summary, in [26], [27] and [29], there is still a need to manually predetermine which kernel to use on each layer, and in [32] and [33, 34], there is a need to predetermine, for each neuron, the nodal operator, the pool operator and the activation function to use in order to create the network. The main gain of the proposed 1DPNN model is that there is no longer any need to search for a kernel that produces good results, as the model itself approximates a problem-specific polynomial kernel for each neuron.
3. Theoretical Framework
In this section, the 1DPNN will be formally defined with its relevant hyperparameters, its trainable parameters, and the way to train the model with the gradient descent algorithm [2]. A theoretical analysis of the computational complexity of the model is also made with respect to the complexity of the regular convolutional model.
3.1. 1DPNN Model Definition

The aim is to create a network whose neurons can perform non-linear filtering using a Taylor series decomposition around 0. For 1-dimensional signals, a regular neuron in a 1DCNN performs a convolution between its weight vector and its input vector, whereas the neuron that needs to be modeled for the 1DPNN should perform convolutions not only with its input vector but also with its exponentiations. In the following, we designate by $L$ the number of layers of a 1DPNN. $\forall l \in [[1, L]]$, $N_l$ is the number of neurons in layer $l$, $D_l$ is the degree of the Taylor decomposition of the neurons in layer $l$, and $\forall i \in [[1, N_l]]$, $y_i^{(l)}$ is the output vector of neuron $i$ in layer $l$, considered as having 1 line and $M_l$ columns representing the output samples indexed in $[[0, M_l - 1]]$. The input layer is considered as layer $l = 0$ with $N_0$ inputs in general (for a single-input network, $N_0 = 1$). $\forall l \in [[0, L]]$, we construct the $N_l \times M_l$ matrix $Y_l$ such that:

$$Y_l = \begin{pmatrix} y_1^{(l)} \\ \vdots \\ y_{N_l}^{(l)} \end{pmatrix}.$$

$\forall l \in [[1, L]]$, $\forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]]$, $w_{ijd}^{(l)}$ is the weight vector of neuron $i$ in layer $l$ corresponding to the exponent $d$ and to the output of neuron $j$ in layer $l - 1$, considered as having 1 line and $K_l$ columns indexed in $[[0, K_l - 1]]$. $\forall l \in [[1, L]]$, $\forall (i, d) \in [[1, N_l]] \times [[1, D_l]]$, we construct the $N_{l-1} \times K_l$ matrix $W_{id}^{(l)}$ such that:

$$W_{id}^{(l)} = \begin{pmatrix} w_{i1d}^{(l)} \\ \vdots \\ w_{iN_{l-1}d}^{(l)} \end{pmatrix}.$$

$\forall l \in [[1, L]]$, $\forall i \in [[1, N_l]]$, $b_i^{(l)}$ is the bias of neuron $i$ in layer $l$, and $f_i^{(l)}$ is a differentiable function called the activation function of neuron $i$ in layer $l$, such that $y_i^{(l)} = f_i^{(l)}\left(x_i^{(l)}\right)$, where $x_i^{(l)}$ is the pre-activation output of neuron $i$ in layer $l$.

Definition 1.
The output of a neuron in the 1DPNN is defined as such:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad y_i^{(l)} = f_i^{(l)}\left(\sum_{d=1}^{D_l} W_{id}^{(l)} * Y_{l-1}^{d} + b_i^{(l)}\right) = f_i^{(l)}\left(x_i^{(l)}\right), \tag{1}$$

where $*$ is the convolution operator, $Y_{l-1}^{d} = \underbrace{Y_{l-1} \odot \cdots \odot Y_{l-1}}_{d \text{ times}}$, and $\odot$ is the Hadamard product.
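To make Eq. (1) concrete, the following is a minimal NumPy sketch of a single 1DPNN neuron's forward pass; the function name and signature are our own illustrative choices, not the authors' code. It follows the sample indexing used in the proof of Proposition 1 below, where $x_i^{(l)}(m) = \sum_{j,d,k} w_{ijd}^{(l)}(k)\,(y_j^{(l-1)})^d(m+k) + b_i^{(l)}$.

```python
import numpy as np

def pnn_neuron_forward(Y_prev, W, b, f=np.tanh):
    """Hypothetical helper: forward pass of one 1DPNN neuron (Eq. (1)).

    Y_prev : (N_prev, M_prev) array, outputs of layer l-1.
    W      : (D, N_prev, K) array, one length-K kernel per degree and input.
    b      : scalar bias.
    f      : bounding activation function (see Remark 2).
    """
    D, N_prev, K = W.shape
    M_out = Y_prev.shape[1] - K + 1        # valid convolution output length
    x = np.full(M_out, float(b))
    Y_pow = np.ones_like(Y_prev)
    for d in range(D):
        Y_pow = Y_pow * Y_prev             # Hadamard power Y_{l-1}^{d+1}
        for j in range(N_prev):
            for m in range(M_out):
                # x(m) += sum_k w(k) * (y_j)^{d+1}(m + k)
                x[m] += np.dot(W[d, j], Y_pow[j, m:m + K])
    return f(x)
```

Note that each Hadamard power is obtained from the previous one with a single elementwise product, which is the optimization mentioned later in Remark 9.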
Remark 1. As stated above, the whole focus of this work is to learn the best polynomial function in each neuron for a given problem, which is entirely defined by the weights $W_{id}^{(l)}$ associated with $Y_{l-1}$ raised to the power of $d$.
Remark 2. Since the 1DPNN neuron creates a polynomial function using the weights $W_{id}^{(l)}$, the activation function $f_i^{(l)}$ can seem unnecessary to define, and can be replaced by the identity function. However, in the context of the 1DPNN model, the activation function plays the role of a bounding function, meaning that it can be used to control the range of the values of the created polynomial function.

3.2. 1DPNN Model Training

In order to enable the weights of the model to be updated so that it learns, we need to define a loss function that measures whether the output of the network is close to or far from the desired output, since it is a supervised model. We denote by $Y$ the desired output, by $\hat{Y}$ the output produced by the network, and by $\epsilon\left(Y, \hat{Y}\right)$ the loss between the desired output and the estimated output. $\epsilon$ needs to be differentiable, since the estimation of its derivative with respect to different variables is the key to learning the weights, due to the fact that gradient descent is used as a numeric optimization technique.

Given a function $\phi : \mathbb{R}^N \times \mathbb{R}^P \rightarrow \mathbb{R}^M$, where $(M, N, P) \in \mathbb{N}^{*3}$, the input is a tuple $X = (X_1, ..., X_N) \in \mathbb{R}^N$, and the parameters are $\theta = (\theta_1, ..., \theta_P)$; we consider a desired output from $X$ given $\phi$, called $Y$, where $Y \in \mathbb{R}^M$. The objective is to estimate $\theta$ so that $\epsilon(Y, \phi(X, \theta))$ is minimum, where $\epsilon : \mathbb{R}^M \times \mathbb{R}^M \rightarrow \mathbb{R}^+$ is a differentiable loss function. Gradient descent [35] is an algorithm that iteratively estimates new values of $\theta$ for $T$ iterations, or until a target loss $\epsilon_t$ is attained. $\forall t \in [[0, T]]$, $\hat{\theta}^{(t)} = (\hat{\theta}_1^{(t)}, ..., \hat{\theta}_P^{(t)})$ is the estimated value of $\theta$ at iteration $t$. Given an initial value $\hat{\theta}^{(0)}$ and a learning rate $\eta \in (0, 1)$:

$$\forall t \in [[0, T-1]], \quad \hat{\theta}^{(t+1)} = \hat{\theta}^{(t)} - \eta \nabla_\theta\, \epsilon\left(\hat{\theta}^{(t)}\right),$$

where $\nabla_\theta\, \epsilon\left(\hat{\theta}^{(t)}\right)$ is the gradient of $\epsilon$ with respect to $\theta$ applied on $\hat{\theta}^{(t)}$. In the case of the 1DPNN, the parameters are the weights and the biases, so there is a need to estimate the weight gradients $\frac{\partial \epsilon}{\partial w_{ijd}^{(l)}}$ and the bias derivatives $\frac{\partial \epsilon}{\partial b_i^{(l)}}$, $\forall l \in [[1, L]]$, $\forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]]$.

In order for the weights to be updated using the gradient descent algorithm, there is a need to estimate the contribution of each weight of each neuron to the loss, by means of calculating the gradient.
Proposition 1.
The gradient of the loss with respect to the weights of a 1DPNN neuron can be estimated using the following formula:

$$\forall l \in [[1, L]], \forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]], \quad \frac{\partial \epsilon}{\partial w_{ijd}^{(l)}} = \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}}\right) * \left(y_j^{(l-1)}\right)^d = \frac{\partial \epsilon}{\partial x_i^{(l)}} * \left(y_j^{(l-1)}\right)^d, \tag{2}$$

where:

- $\frac{\partial y_i^{(l)}}{\partial x_i^{(l)}} = \frac{\partial f_i^{(l)}}{\partial x_i^{(l)}}\left(x_i^{(l)}\right)$ is the derivative of the activation function of neuron $i$ in layer $l$ with respect to $x_i^{(l)}$;
- $\frac{\partial \epsilon}{\partial y_i^{(l)}}$ is the gradient of the loss with respect to the output of neuron $i$ in layer $l$, which we call the output gradient.

Proof of Proposition 1. To estimate the weight gradients, we use the chain rule, such that:

$$\forall l \in [[1, L]], \forall (i, j, k, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[0, K_l - 1]] \times [[1, D_l]], \quad \frac{\partial \epsilon}{\partial w_{ijd}^{(l)}}(k) = \sum_{m=0}^{M_l - 1} \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}}\right)(m) \cdot \frac{\partial x_i^{(l)}}{\partial w_{ijd}^{(l)}(k)}(m). \tag{3}$$

From Eq. (1), we can write:

$$\forall l \in [[1, L]], \forall (i, m, d) \in [[1, N_l]] \times [[0, M_l - 1]] \times [[1, D_l]], \quad x_i^{(l)}(m) = \sum_{j'=1}^{N_{l-1}} \sum_{d'=1}^{D_l} \sum_{k'=0}^{K_l - 1} w_{ij'd'}^{(l)}(k') \left(y_{j'}^{(l-1)}\right)^{d'}(m + k') + b_i^{(l)},$$

from which we deduce:

$$\forall l \in [[1, L]], \forall (i, j, m, k, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[0, M_l - 1]] \times [[0, K_l - 1]] \times [[1, D_l]], \quad \frac{\partial x_i^{(l)}}{\partial w_{ijd}^{(l)}(k)}(m) = \left(y_j^{(l-1)}\right)^d(m + k). \tag{4}$$

When injecting Eq. (4) into Eq. (3), we find that:

$$\frac{\partial \epsilon}{\partial w_{ijd}^{(l)}}(k) = \sum_{m=0}^{M_l - 1} \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}}\right)(m) \cdot \left(y_j^{(l-1)}\right)^d(m + k),$$

which is equivalent to:

$$\forall l \in [[1, L]], \forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]], \quad \frac{\partial \epsilon}{\partial w_{ijd}^{(l)}} = \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}}\right) * \left(y_j^{(l-1)}\right)^d = \frac{\partial \epsilon}{\partial x_i^{(l)}} * \left(y_j^{(l-1)}\right)^d. \tag{5}$$
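As a sanity check on Eq. (2), the short script below (our own illustrative code, assuming a squared-error loss and an identity activation) compares the analytic weight gradient with a finite-difference estimate for one neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
N_prev, M_prev, K, D = 2, 12, 3, 3
Y_prev = rng.standard_normal((N_prev, M_prev))
W = rng.standard_normal((D, N_prev, K))
b, M = 0.1, M_prev - K + 1
t = rng.standard_normal(M)                      # desired output

def forward(W):
    x = np.full(M, b)
    for d in range(D):
        for j in range(N_prev):
            yd = Y_prev[j] ** (d + 1)
            for m in range(M):
                x[m] += np.dot(W[d, j], yd[m:m + K])
    return x                                     # identity activation

def loss(W):
    r = forward(W) - t
    return 0.5 * float(r @ r)

# Analytic gradient, Eq. (2): de/dw_{ijd}(k) = sum_m (de/dx_i)(m) (y_j)^d(m+k)
dl_dx = forward(W) - t
grad = np.zeros_like(W)
for d in range(D):
    for j in range(N_prev):
        yd = Y_prev[j] ** (d + 1)
        for k in range(K):
            grad[d, j, k] = np.dot(dl_dx, yd[k:k + M])

# Finite-difference check on one coordinate
eps = 1e-6
W2 = W.copy(); W2[1, 0, 2] += eps
print((loss(W2) - loss(W)) / eps - grad[1, 0, 2])   # approximately 0
```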
Remark 3. The weight gradient estimation of a 1DPNN neuron is equivalent to that of a 1DCNN neuron when $D_l = 1$. This was to be expected, as the former is only a mere extension of the latter.
Remark 4. The output gradient can easily be determined for the last layer, since $\hat{Y} = Y_L$, which makes the loss directly dependent on $Y_L$. However, it cannot be determined as easily for the other layers, since the loss is only indirectly dependent on their outputs.

As stated in Remark 4, there is a need to find a way to estimate the gradient of the loss with respect to the inner layers' outputs, in order to estimate the weight gradients of their neurons as in Eq. (2).
Proposition 2.
The output gradient of a neuron in an inner layer can be estimated using the following formula:

$$\forall l \in [[1, L-1]], \forall j \in [[1, N_l]], \quad \frac{\partial \epsilon}{\partial y_j^{(l)}} = \sum_{d=1}^{D_{l+1}} d \sum_{i=1}^{N_{l+1}} \tilde{w}_{ijd}^{(l+1)} * \overset{\circ}{\frac{\partial \epsilon}{\partial x_i^{(l+1)}}} \odot \left(y_j^{(l)}\right)^{d-1}, \tag{6}$$

where:

- $\forall k \in [[0, K_{l+1} - 1]]$, $\tilde{w}_{ijd}^{(l+1)}(k) = w_{ijd}^{(l+1)}(K_{l+1} - 1 - k)$ is the rotated weight vector;
- $\overset{\circ}{\frac{\partial \epsilon}{\partial x_i^{(l+1)}}}$ is the zero-padded version of $\frac{\partial \epsilon}{\partial x_i^{(l+1)}}$, defined by $\overset{\circ}{\frac{\partial \epsilon}{\partial x_i^{(l+1)}}}(m) = \frac{\partial \epsilon}{\partial x_i^{(l+1)}}(m - K_{l+1} + 1)$ if $m \in [[K_{l+1} - 1, M_{l+1} + K_{l+1} - 2]]$, and $0$ otherwise.

Proof of Proposition 2. Since $Y_{l+1}$ is directly computed from $Y_l$, $\forall l \in [[1, L-1]]$, $Y_L$ is totally dependent on $Y_{l+1}$, so that we can write:

$$\forall l \in [[1, L-1]], \quad \epsilon\left(Y, \hat{Y}\right) = \epsilon\left(Y, \psi_{l+1}(Y_{l+1})\right), \tag{7}$$

where $Y$ is a given desired output and $\psi_{l+1}$ is the application of Eq. (1) from layer $l+1$ to layer $L$. From Eq. (7) we can write the differential of $\epsilon$ as such:

$$\forall l \in [[1, L-1]], \quad d\epsilon = \sum_{i=1}^{N_{l+1}} \frac{\partial \epsilon}{\partial y_i^{(l+1)}} \odot dy_i^{(l+1)}. \tag{8}$$

We can then derive the general expression of the output gradient of any neuron in layer $l$ as such:

$$\forall l \in [[1, L-1]], \forall j \in [[1, N_l]], \quad \frac{\partial \epsilon}{\partial y_j^{(l)}} = \sum_{i=1}^{N_{l+1}} \frac{\partial \epsilon}{\partial y_i^{(l+1)}} \odot \frac{\partial y_i^{(l+1)}}{\partial y_j^{(l)}} = \sum_{i=1}^{N_{l+1}} \frac{\partial \epsilon}{\partial y_i^{(l+1)}} \odot \frac{\partial y_i^{(l+1)}}{\partial x_i^{(l+1)}} \odot \frac{\partial x_i^{(l+1)}}{\partial y_j^{(l)}}.$$

This expression can be qualitatively interpreted as finding the contribution of each sample of $y_j^{(l)}$ to the loss, by finding its contribution to every output vector in layer $l+1$. Considering a layer $l \in [[1, L-1]]$, a neuron $j$ in layer $l$, and a neuron $i$ in layer $l+1$, a sample $m \in [[0, M_l - 1]]$ of $y_j^{(l)}$ contributes to $x_i^{(l+1)}$ in the following samples:

$$x_i^{(l+1)}(m - k) = \sum_{j'=1}^{N_l} \sum_{d=1}^{D_{l+1}} \sum_{k'=0}^{K_{l+1} - 1} w_{ij'd}^{(l+1)}(k') \left(y_{j'}^{(l)}\right)^d(m - k + k'), \quad k \in [[k_{lm}, k'_{lm}]],$$

where $k_{lm} = \max(0, m - M_{l+1} + 1)$ and $k'_{lm} = \min(m, K_{l+1} - 1)$. We can then extract the contribution of $y_j^{(l)}(m)$ to $x_i^{(l+1)}$ as such:

$$\frac{\partial x_i^{(l+1)}}{\partial y_j^{(l)}(m)}(m - k) = \sum_{d=1}^{D_{l+1}} d\, w_{ijd}^{(l+1)}(k) \left(y_j^{(l)}\right)^{d-1}(m). \tag{9}$$

Therefore, we can finally determine the full expression of the output gradient for each sample:

$$\forall l \in [[1, L-1]], \forall (j, m) \in [[1, N_l]] \times [[0, M_l - 1]], \quad \frac{\partial \epsilon}{\partial y_j^{(l)}}(m) = \sum_{i=1}^{N_{l+1}} \sum_{k=k_{lm}}^{k'_{lm}} \sum_{d=1}^{D_{l+1}} d \left(\frac{\partial \epsilon}{\partial y_i^{(l+1)}} \odot \frac{\partial y_i^{(l+1)}}{\partial x_i^{(l+1)}}\right)(m - k)\, w_{ijd}^{(l+1)}(k) \left(y_j^{(l)}\right)^{d-1}(m). \tag{10}$$

Eq. (10) can be decomposed as the sum along $i$ and $d$ of the product of $\left(y_j^{(l)}\right)^{d-1}$ with the correlation between $\frac{\partial \epsilon}{\partial x_i^{(l+1)}}$ and $w_{ijd}^{(l+1)}$, the correlation being a rotated version of the convolution, in which the samples of the weights are considered in an inverted order with respect to that of the convolution. Therefore, a correlation can be transformed into a convolution by considering the rotated weights $\tilde{w}_{ijd}^{(l+1)}$ defined above. However, we can only perform a valid convolution when $k_{lm} = 0$ and $k'_{lm} = K_{l+1} - 1$. Thus, to obtain a valid convolution otherwise, we consider the zero-padded version of $\frac{\partial \epsilon}{\partial x_i^{(l+1)}}$, designated by $\overset{\circ}{\frac{\partial \epsilon}{\partial x_i^{(l+1)}}}$ and defined above. Finally, we obtain the desired expression by considering the rotated version of the weights, and the zero-padded version of the gradient of the error with respect to $x_i^{(l+1)}$, in Eq. (10).

Remark 5.
The main difference between the output gradient estimation of a 1DPNN inner layer's neuron and that of a 1DCNN inner layer's neuron is that, in the former, the gradient depends on the output values of the considered layer, whereas in the latter, it does not. By injecting Eq. (6) into Eq. (2), we notice that, unlike that of a 1DCNN neuron, the weight gradient of a 1DPNN inner layer's neuron carries the information of its own output values as well as its previous layer's output values.
The bias is a parameter whose contribution to the loss needs to be estimated in order to properly train the model.
Proposition 3.
The bias gradient of a 1DPNN neuron can be estimated using the following formula:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad \frac{\partial \epsilon}{\partial b_i^{(l)}} = \sum_{m=0}^{M_l - 1} \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}}\right)(m) = \sum_{m=0}^{M_l - 1} \frac{\partial \epsilon}{\partial x_i^{(l)}}(m).$$

Proof of Proposition 3. Using the differential of $\epsilon$, we can determine the gradient of the loss with respect to the bias as such:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad \frac{\partial \epsilon}{\partial b_i^{(l)}} = \sum_{m=0}^{M_l - 1} \left(\frac{\partial \epsilon}{\partial y_i^{(l)}} \odot \frac{\partial y_i^{(l)}}{\partial x_i^{(l)}} \odot \frac{\partial x_i^{(l)}}{\partial b_i^{(l)}}\right)(m).$$

Since, from Eq. (1), we can notice that $\frac{\partial x_i^{(l)}}{\partial b_i^{(l)}}(m) = 1$, $\forall l \in [[1, L]]$, $\forall (i, m) \in [[1, N_l]] \times [[0, M_l - 1]]$, the desired expression follows.
The bias gradient formula of a 1DPNN neuron is the same as that of a 1DCNN neuron, regardless of the degree of the polynomial approximation.
Given a tuple $(X, Y)$ representing an input and a desired output, we generate $\hat{Y}$ using a defined architecture of the 1DPNN, then calculate $\frac{\partial \epsilon}{\partial Y_L}$ directly from the loss expression, in order to determine the weight gradients and the bias gradients for the output layer. Then, using $\frac{\partial \epsilon}{\partial Y_L}$ and Eq. (6), we calculate the output gradients, the weight gradients and the bias gradients of the previous layer. The process is repeated until the first layer is reached. After computing the gradients, we use gradient descent to update the weights as such:

$$\forall l \in [[1, L]], \forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]], \quad \left(w_{ijd}^{(l)}\right)^{(t+1)} = \left(w_{ijd}^{(l)}\right)^{(t)} - \eta \frac{\partial \epsilon}{\partial w_{ijd}^{(l)}}, \tag{11}$$

where $\eta$ is the learning rate and $t$ is the epoch. The same goes for updating the biases:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad \left(b_i^{(l)}\right)^{(t+1)} = \left(b_i^{(l)}\right)^{(t)} - \eta \frac{\partial \epsilon}{\partial b_i^{(l)}}.$$

3.3. 1DPNN Weight Initialization

Since the 1DPNN model uses polynomials to generate non-linear filtering, it is highly likely that, due to the fact that every feature map of every layer is raised to a power higher than 1, the weight updates become exponentially big (gradient explosion) or exponentially small (gradient vanishing), depending on the nature of the activation functions that are used. To remedy that, the weights have to be initialized in such a way that the highest power of any feature map is associated with the lowest weight values.
Definition 2.
Let $R(\alpha_l)$, $l \in [[1, L]]$, be a probability law with a parameter vector $\alpha_l$ used to initialize the weights of any layer $l$. The proposed weight initialization is defined as such:

$$\forall l \in [[1, L]], \forall (i, j, d) \in [[1, N_l]] \times [[1, N_{l-1}]] \times [[1, D_l]], \quad w_{ijd}^{(l)} \sim \frac{R(\alpha_l)}{d!}.$$

Remark 7.
This initialization offers the advantage of allowing the use of any known deep-learning initialization scheme, such as the Glorot Normalized Uniform Initialization [36], while adapting it to the weights associated with any degree. However, this does not ensure that there will be no gradient explosion, as it only provides an initial insight into how the weights should evolve. Completely avoiding gradient explosion can be achieved either by choosing activation functions bounded between -1 and 1, by performing weight or gradient clipping, or by using weight or activity regularization.
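A sketch of Definition 2 combined with the Glorot normalized uniform scheme [36]: the standard limit $\sqrt{6/(\mathrm{fan}_{in} + \mathrm{fan}_{out})}$ is kept, and the drawn weights for degree $d$ are divided by $d!$. The function is a hypothetical helper, not the authors' code.

```python
import math
import numpy as np

def init_degree_weights(shape, d, fan_in, fan_out, rng=None):
    """Glorot normalized uniform initialization scaled by 1/d! (Definition 2)."""
    if rng is None:
        rng = np.random.default_rng()
    limit = math.sqrt(6.0 / (fan_in + fan_out))      # Glorot uniform limit [36]
    return rng.uniform(-limit, limit, size=shape) / math.factorial(d)
```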
3.4. Theoretical Computational Complexity Analysis

The use of polynomial approximations in the 1DPNN model introduces a level of complexity that has to be quantified in order to estimate the gain of using it against using the regular convolutional model. Therefore, a thorough analysis is conducted with the aim of determining the complexity of the 1DPNN model with respect to the complexity of the 1DCNN model, for both the forward propagation and the learning process (forward propagation + backpropagation).
Definition 3.
Let $\gamma$ be a mathematical expression that can be brought to a combination of summations and products. We define $C$ as a function that takes as input a mathematical expression such as $\gamma$ and outputs an integer designating the number of operations performed to calculate that expression, so that $C(\gamma) \in \mathbb{N}$.

Example 1.
Let $N \in \mathbb{N}^*$ and $\gamma = \sum_{i=1}^{N} i^3$; then $C(\gamma) = 2N + N - 1$, since computing $i^3$ requires 2 multiplications, performed $N$ times, and the summation requires $N - 1$ additions.

Example 2. $\forall z \in \mathbb{C}$, $0 \leq C(z) \leq 2$, where $z = a + ib$, $(a, b) \in \mathbb{R}^2$.

Remark 8. $C$ does not take into account any possible optimization, such as the one for the exponentiation in Example 1, which can have a complexity of $O(\log m)$ where $m$ is the exponent, nor does it take into account any possible simplification of a mathematical expression, such as $\sum_{i=1}^{N} i = \frac{N(N+1)}{2}$. $C$ provides an upper bound complexity that is independent of any implementation.

Since the smallest computational unit in both models is the neuron, the complexity is calculated at its level for every operation performed during the forward propagation and the backpropagation. Moreover, every 1DPNN operation's complexity, denoted by $C_p$, is calculated as a function of the corresponding 1DCNN operation's complexity, denoted by $C_c$, since the aim is to compare both models with each other.

The forward propagation complexity is a measure relevant to the estimation of how complex a trained 1DPNN neuron is, and can provide an insight into how complex it is to use a trained model compared to using a trained 1DCNN model.
Proposition 4.
The computational complexity of a 1DPNN neuron's forward propagation with respect to that of a 1DCNN neuron is given by the following formula:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad C_p\left(y_i^{(l)}\right) = D_l\, C_c\left(y_i^{(l)}\right) + (D_l - 1)\left(\frac{1}{2} M_{l-1} N_{l-1} D_l - 2 M_l\right). \tag{12}$$

Proof of Proposition 4. The 1DPNN forward propagation is fully defined by Eq. (1), which can also be interpreted as the 1DCNN forward propagation if the degree of the polynomials is 1. Therefore, we can determine the 1DPNN complexity as a function of the 1DCNN complexity by first determining the complexity of $x_i^{(l)}$ before adding the biases, as such:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad C_p\left(x_i^{(l)} - b_i^{(l)}\right) = \sum_{d=1}^{D_l} \left(C_c\left(x_i^{(l)} - b_i^{(l)}\right) + (d - 1) M_{l-1} N_{l-1}\right) = D_l\, C_c\left(x_i^{(l)} - b_i^{(l)}\right) + \frac{1}{2} D_l (D_l - 1) M_{l-1} N_{l-1}, \tag{13}$$

where the term $(d - 1) M_{l-1} N_{l-1}$ accounts for computing the Hadamard power $Y_{l-1}^d$. Assuming that the activation functions are atomic, meaning that their complexity is $O(1)$, we have:

$$\forall l \in [[1, L]], \forall i \in [[1, N_l]], \quad \begin{cases} C_c\left(x_i^{(l)}\right) = C_c\left(x_i^{(l)} - b_i^{(l)}\right) + M_l, & C_c\left(y_i^{(l)}\right) = C_c\left(x_i^{(l)}\right) + M_l, \\ C_p\left(x_i^{(l)}\right) = C_p\left(x_i^{(l)} - b_i^{(l)}\right) + M_l, & C_p\left(y_i^{(l)}\right) = C_p\left(x_i^{(l)}\right) + M_l. \end{cases} \tag{14}$$

By using the relationships in Eq. (14) in Eq. (13), we obtain the desired expression.
Remark 9. Eq. (12) shows that the forward propagation's complexity of the 1DPNN does not scale linearly with the degrees of the polynomials, which means that a 1DPNN neuron with degree $D_l$ is more computationally complex than $D_l$ 1DCNN neurons. However, this complexity can be reduced in practice, since $Y_{l-1}^d$ can be calculated only once, stored in memory, and used for all neurons in layer $l$.

The learning complexity is a measure relevant to the estimation of how complex it is to train a 1DPNN neuron, and can provide an overall insight into how complex a model is to train compared to training a 1DCNN model. Since the learning process of the inner layers' neurons is more complex than that of the output layer's neurons, which can learn faster by estimating the output gradient directly from the loss function, the learning complexity will only be determined for the inner layers' neurons.

Proposition 5.
The computational complexity of the learning process of a 1DPNN inner layer's neuron, denoted by $\mathcal{L}_p^{(l)}$, is given by the following formula:

$$\forall l \in [[1, L-1]], \quad \mathcal{L}_p^{(l)} = D_l\, \mathcal{L}_c^{(l)} + (D_l - 1)\left(\left(M_{l-1} N_{l-1} + \frac{1}{2} M_l\right) D_l - M_l\right), \tag{15}$$

where $\mathcal{L}_c^{(l)}$ is the learning complexity of a 1DCNN neuron in a layer $l$.

Proof of Proposition 5. The difference between the two models in the backpropagation phase resides in the weight gradient estimation and the output gradient estimation. In fact, the bias gradient estimation is the same for both models, so there is no need to quantify its complexity. Since the weight gradient estimation is dependent on the output gradient estimation, the output gradient estimation is quantified first. From Eq. (6), we can quantify the output gradient estimation complexity for the 1DPNN model as such:

$$\forall l \in [[1, L-1]], \forall j \in [[1, N_l]], \quad C_p\left(\frac{\partial \epsilon}{\partial y_j^{(l)}}\right) = C_c\left(\frac{\partial \epsilon}{\partial y_j^{(l)}}\right) + \sum_{d=2}^{D_l} \left(C_c\left(\frac{\partial \epsilon}{\partial y_j^{(l)}}\right) + 2 M_l + (d - 2) M_l\right) = D_l\, C_c\left(\frac{\partial \epsilon}{\partial y_j^{(l)}}\right) + \frac{1}{2}(D_l + 2)(D_l - 1) M_l.$$

Since the 1DPNN neuron introduces $D_l$ times more weights than the 1DCNN neuron, the complexity of calculating the weight gradients for a pair of neurons $(i, j)$ as in Eq. (2) is the sum of the complexities of the corresponding weight gradients with respect to the degree, as such:

$$\forall l \in [[1, L]], \forall (i, j) \in [[1, N_l]] \times [[1, N_{l-1}]], \quad C_p\left(\frac{\partial \epsilon}{\partial w_{ij}^{(l)}}\right) = \sum_{d=1}^{D_l} C_p\left(\frac{\partial \epsilon}{\partial w_{ijd}^{(l)}}\right) = \sum_{d=1}^{D_l} \left(C_c\left(\frac{\partial \epsilon}{\partial w_{ij}^{(l)}}\right) + (d - 1) M_{l-1}\right) = D_l\, C_c\left(\frac{\partial \epsilon}{\partial w_{ij}^{(l)}}\right) + \frac{1}{2} D_l (D_l - 1) M_{l-1}.$$

Since one 1DPNN neuron in a layer $l$ has $D_l$ times more weights than a 1DCNN neuron, its weight update complexity is $D_l$ times that of the 1DCNN. Therefore, the complexity of Eq. (11) is:

$$\forall l \in [[1, L]], \forall (i, j) \in [[1, N_l]] \times [[1, N_{l-1}]], \quad C_p\left(\left(w_{ij}^{(l)}\right)^{(t+1)}\right) = D_l\, C_c\left(\left(w_{ij}^{(l)}\right)^{(t+1)}\right).$$

From the above expressions, we can determine the learning complexity of any neuron, which is the sum of the forward propagation complexity and the backward propagation complexity. Therefore, the learning complexity, denoted by $\mathcal{L}_{p,i}^{(l)}$, of a 1DPNN inner layer's neuron is:

$$\forall l \in [[1, L-1]], \forall i \in [[1, N_l]], \quad \mathcal{L}_{p,i}^{(l)} = C_p\left(y_i^{(l)}\right) + C_p\left(\frac{\partial \epsilon}{\partial y_i^{(l)}}\right) + \sum_{j=1}^{N_{l-1}} \left(C_p\left(\frac{\partial \epsilon}{\partial w_{ij}^{(l)}}\right) + C_p\left(\left(w_{ij}^{(l)}\right)^{(t+1)}\right)\right).$$

Since we suppose that the network is fully connected, $\mathcal{L}_{p,i}^{(l)}$ does not depend on $i$, thus we can replace it by $\mathcal{L}_p^{(l)}$. Therefore, by replacing each complexity in the above equation by its full expression, we obtain the desired expression, where:

$$\forall l \in [[1, L-1]], \forall i \in [[1, N_l]], \quad \mathcal{L}_c^{(l)} = C_c\left(y_i^{(l)}\right) + C_c\left(\frac{\partial \epsilon}{\partial y_i^{(l)}}\right) + \sum_{j=1}^{N_{l-1}} \left(C_c\left(\frac{\partial \epsilon}{\partial w_{ij}^{(l)}}\right) + C_c\left(\left(w_{ij}^{(l)}\right)^{(t+1)}\right)\right).$$
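Eqs. (12) and (15) are easy to evaluate numerically; the small helpers below (illustrative only, our naming) return the upper-bound operation counts of a 1DPNN neuron given the corresponding 1DCNN counts.

```python
def pnn_forward_ops(cc, D, M_prev, N_prev, M):
    """Eq. (12): forward-pass operation count of a 1DPNN neuron,
    given cc = C_c(y), the count for the corresponding 1DCNN neuron."""
    return D * cc + (D - 1) * (0.5 * D * M_prev * N_prev - 2 * M)

def pnn_learning_ops(lc, D, M_prev, N_prev, M):
    """Eq. (15): learning-process operation count of a 1DPNN inner neuron,
    given lc = L_c, the count for the corresponding 1DCNN neuron."""
    return D * lc + (D - 1) * ((M_prev * N_prev + 0.5 * M) * D - M)
```

For $D = 1$, both formulas collapse to the 1DCNN counts, which matches Remark 3.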
4. Implementation
This section describes the application programming interface (API) used to implement the 1DPNN model, as well as an evaluation of the implementation in the form of an experimental computational complexity analysis for the forward propagation and the learning process of a 1DPNN neuron.

4.1. Keras Tensorflow API Implementation
Tensorflow [37] is an API created by Google that is widely used for graphics processing unit (GPU)-based symbolic computation, especially for neural networks. It allows the implementation of a wide range of models and automatically takes into account the usual derivatives without the need to define them manually. However, Tensorflow is a low-level API that involves a lot of coding and memory management to define even a simple model. The 1DPNN model is mainly based on convolutions between inputs and filters, so it can be considered as an extension of the basic 1DCNN with slight changes. Therefore, there is no need to use a low-level API like Tensorflow or CUDA to define the 1DPNN, so the Keras API was used to implement the model.

Keras [38] is a high-level API built on top of Tensorflow that provides an abstraction of the internal operations performed to compute gradients or anything related. Keras allows building a network as a combination of layers, whether they are stacked sequentially or not. It uses the concept of a layer, which is the key element for performing any operation, and allows the definition of custom layers that can jointly be used with predefined layers, thus allowing the creation of a heterogeneous network composed of different types of layers. In Keras, there is only the need to define the forward propagation of the 1DPNN, since convolutions and exponents are considered basic operations, and the API takes care of storing the gradients and using them to calculate the weight updates.
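As an illustration of how little code the Keras route requires, the following is a minimal sketch of a 1DPNN layer written as a custom Keras layer (TensorFlow 2.x assumed; the class name PolyConv1D and its arguments are our own, not the paper's published code). Only the forward pass of Eq. (1) is written; Keras derives the backpropagation automatically, and the initializer applies the 1/d! scaling of Definition 2.

```python
import math
import tensorflow as tf

class PolyConv1D(tf.keras.layers.Layer):
    """Sketch of a 1DPNN layer: y = f(sum_d W_d * Y^d + b), Eq. (1)."""

    def __init__(self, filters, kernel_size, degree, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.filters, self.kernel_size, self.degree = filters, kernel_size, degree
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        in_channels = int(input_shape[-1])
        glorot = tf.keras.initializers.GlorotUniform()
        self.kernels = [
            self.add_weight(
                name=f"kernel_d{d}",
                shape=(self.kernel_size, in_channels, self.filters),
                # Definition 2: scale the initial weights of degree d by 1/d!
                initializer=lambda shape, dtype=None, d=d: glorot(shape, dtype)
                / math.factorial(d),
            )
            for d in range(1, self.degree + 1)
        ]
        self.bias = self.add_weight(name="bias", shape=(self.filters,),
                                    initializer="zeros")

    def call(self, inputs):
        power, out = inputs, None
        for kernel in self.kernels:
            conv = tf.nn.conv1d(power, kernel, stride=1, padding="VALID")
            out = conv if out is None else out + conv
            power = power * inputs          # next Hadamard power of the input
        return self.activation(out + self.bias)
```

A layer built this way can be dropped into a Sequential model next to standard Conv1D and pooling layers.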
4.2. Experimental Complexity Estimation

In order to evaluate the efficiency of the implementation, we compare the forward propagation complexity and the learning process complexity of a 1DPNN neuron with the respective theoretical upper bound complexities determined in Section 3.4, as well as with the complexities of a 1DCNN neuron and of a 1DPNN-equivalent neuron, by varying the degrees of the polynomials in a given range. This experimental analysis is made to show that the theoretical complexity is indeed an upper bound and that, as stated in Section 3.4, a 1DPNN neuron with a degree D is actually more complex than D 1DCNN neurons.

Theorem 1.
For any 1DPNN with $L \geq 1$ layers, there exists a 1DCNN with $L + 1$ layers that has the same number of parameters as the 1DPNN.

Proof of Theorem 1. Let $L \in \mathbb{N}^*$ be the number of layers of a 1DPNN and, $\forall l \in [[1, L+1]]$, let $N'_l$ be the number of neurons in a 1DCNN layer. Given any inner 1DPNN layer, we can create a 1DCNN layer that has the same number of parameters by using the equality between the number of parameters of each model's layer, as such:

$$\forall l \in [[1, L-1]], \quad N'_l N'_{l-1} K_l + N'_l = N_l N_{l-1} K_l D_l + N_l,$$

where $N'_0 = N_0$, since the input layer remains the same. We then determine $N'_l$ from that equation as such:

$$\forall l \in [[1, L-1]], \quad N'_l = \left\lfloor \frac{N_l (N_{l-1} K_l D_l + 1)}{N'_{l-1} K_l + 1} + \frac{1}{2} \right\rfloor, \tag{16}$$

where $N'_l$ is rounded to the nearest integer by adding $1/2$ before taking the floor. If we determine $N'_L$ in the same manner as we determine the number of neurons in the inner layers, we will change the number of neurons in the output layer of the 1DCNN, which is not a desired effect. To remedy that problem, we add another 1DCNN layer with $N'_{L+1} = N_L$ neurons having a filter size of 1 to the 1DCNN, and we adjust the number of neurons in layer $L$ so that the number of parameters in layers $L$ and $L+1$ equals the number of parameters in the 1DPNN layer $L$, using the following equality:

$$N'_L N'_{L-1} K_L + N'_L + N_L N'_L + N_L = N_L N_{L-1} K_L D_L + N_L.$$
By extracting $N'_L$ from this expression and rounding it, we end up with:

$$N'_L = \left\lfloor \frac{N_L N_{L-1} K_L D_L}{N'_{L-1} K_L + N_L + 1} + \frac{1}{2} \right\rfloor. \tag{17}$$
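The recurrence of Eqs. (16) and (17) is straightforward to implement; the following illustrative helper (our naming) computes the widths of a 1DPNN-equivalent 1DCNN, including the appended kernel-size-1 output layer.

```python
def equivalent_widths(N, K, D):
    """Widths N'_1..N'_{L+1} of the 1DPNN-equivalent 1DCNN (Theorem 1).

    N : [N_0, N_1, ..., N_L] neuron counts of the 1DPNN (N_0 = inputs).
    K : [None, K_1, ..., K_L] kernel sizes; D : [None, D_1, ..., D_L] degrees.
    """
    L = len(N) - 1
    Np = [N[0]]                                   # N'_0 = N_0
    for l in range(1, L):                         # inner layers, Eq. (16)
        Np.append(int(N[l] * (N[l - 1] * K[l] * D[l] + 1)
                      / (Np[l - 1] * K[l] + 1) + 0.5))
    # output layer, Eq. (17), followed by N_L neurons with kernel size 1
    Np.append(int(N[L] * N[L - 1] * K[L] * D[L]
                  / (Np[L - 1] * K[L] + N[L] + 1) + 0.5))
    Np.append(N[L])
    return Np[1:]
```

For a single-layer 1DPNN, the inner-layer loop is skipped and only Eq. (17) applies, which matches Remark 11 below.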
Remark 10. We call any 1DCNN recurrently constructed using the aforementioned theorem and Eqs. (16) and (17) a 1DPNN-equivalent network, because it has the same number of parameters as the 1DPNN it was constructed from. However, their respective search spaces and computational complexities are generally not equivalent.
Remark 11.
In a 1DPNN having only 1 layer ($L = 1$), the inner-layer construction of Eq. (16) is not needed, and $N'_L$ is obtained directly from Eq. (17) with $N'_{L-1} = N_0$.

Since the implementation is GPU-based, it is very difficult to estimate the actual execution time of a mathematical operation, since it can be split among parallel execution kernels and since the memory bandwidth greatly impacts it. Nevertheless, for a given amount of data, we can roughly estimate how fast a mathematical operation is performed by running it multiple times and averaging. However, that estimate also includes accesses to memory, which are the slowest operations that can run on a GPU, so it neither provides a true estimate of how fast an operation is performed nor aligns with the nominal computational power of the GPU.

Despite this limitation, this rough estimate is used to determine the execution times of a 1DPNN neuron, a 1DCNN neuron and a 1DPNN-equivalent neuron with different hyperparameters. Various networks with 2 layers, serving as a basis for the complexity estimation, are created with the hyperparameters defined in Table 1 below. Recall that $N_l$ is the number of neurons in layer $l$, $M_l$ is the number of samples of the signals created from the neurons of layer $l$, $K_l$ is the kernel size of the neurons in layer $l$, and $D_l$ is the degree of the polynomials that need to be estimated for each neuron in layer $l$. Since the output layer is a single 1DCNN neuron, the polynomial degree is only varied in the first layer.

Table 1: Hyperparameters for each layer of the networks used for the complexity estimation.
Hyperparameter   l = 0    l = 1      l = 2
M_l              …        …          …
N_l              …        …          …
K_l              n/a      25         25
D_l              n/a      [[1, …]]   1

The theoretical complexities determined in Eqs. (12) and (15) are expressed in terms of numbers of operations and need to be expressed in seconds. Therefore, given the execution times of the 1DPNN with $D = 1$, we can estimate how long it would theoretically take to perform the same operations with a different polynomial degree. For instance, we can estimate the time $T$ it takes to perform a forward propagation as a function of the polynomial degree $D$ using Eq. (12), as such:

$$\forall D \in [[1, D_{max}]], \quad T(D) = D\, T_1 + (D - 1)(c_1 D - c_2)\, T_0,$$

where:

- $T_1$ is the forward propagation execution time of a 1DCNN neuron,
- $c_1 = \frac{1}{2} M_0 N_0$,
- $c_2 = 2 M_1$, and
- $T_0$ is an estimate of the time it takes to perform one addition or one multiplication.

$T_0$ can only be estimated from the fact that $T$ is an increasing function of $D$, which means that:

$$\forall D \in [[1, D_{max}]], \quad \frac{\partial T}{\partial D}(D) = T_1 + (2 c_1 D - (c_1 + c_2))\, T_0 \geq 0.$$

This is equivalent to:

$$\forall D \in [[1, D_{max}]], \quad T_0 \geq \frac{T_1}{c_1 + c_2 - 2 c_1 D}.$$

Since $c_1 + c_2 - 2 c_1 D$ is a decreasing function of $D$, the final estimation of $T_0$ is $T_0 = \frac{T_1}{c_1 + c_2}$. The same can be done for the theoretical learning complexity defined in Eq. (15), by replacing $c_1$, $c_2$ and $T_1$ accordingly.

Figure 1 shows the evolution of the execution time of a neuron's forward propagation with respect to the degree of the polynomial for each of the 1DCNN neuron, the 1DPNN neuron and the 1DPNN-equivalent neuron. Figure 1(a) also shows the evolution of the theoretical complexity with respect to the degree. This confirms that Eq. (12) is indeed an upper bound complexity and that, with optimization, the forward propagation of a 1DPNN neuron can run in quasi-linear time, as shown in Figure 1(b). Moreover, with the chosen hyperparameters, the 1DPNN neuron is, on average, 1.94 times slower than the 1DPNN-equivalent neuron, which confirms the expectation that a 1DPNN neuron is more complex than a 1DPNN-equivalent neuron.

Figure 2 shows the evolution of the execution time of a neuron's learning process with respect to the degree of the polynomial for each of the 1DCNN neuron, the 1DPNN neuron and the 1DPNN-equivalent neuron. Figure 2(a), showing the evolution of the theoretical complexity with respect to the degree, confirms that Eq. (15) is in fact an upper bound, and Figure 2(b) shows that the learning process of a 1DPNN neuron can also run in quasi-linear time, as stated in Section 3.4. However, the 1DPNN neuron's learning process is, on average, 2.72 times slower than that of the 1DPNN-equivalent neuron, as is to be expected from Eq. (15).

Figure 1: Forward propagation execution time for 1 neuron of each network. (a) With theoretical execution time. (b) Without theoretical execution time.

Figure 2: Learning process execution time for 1 neuron of each network. (a) With theoretical execution time. (b) Without theoretical execution time.
5. Experiments And Results
Since the 1DPNN is basically an extension of the 1DCNN, there is a need to compare the 1DPNN model's performance with the 1DCNN model's performance under certain conditions. In fact, the number of parameters in a 1DPNN differs from the number of parameters in a 1DCNN with the same topology; therefore, we can devise 2 different comparison strategies that define the conditions of the experiments:

1. Topology-wise comparison, which consists in comparing the performances of a 1DCNN and a 1DPNN that both have the same topology.
2. Parameter-wise comparison, which consists in comparing the performances of a 1DCNN and a 1DPNN that both have the same number of parameters. The 1DCNN is created according to the definition of the 1DPNN-equivalent network detailed in Section 4.2.

Both models are evaluated on the same problems, consisting of 2 classification problems, which are musical note recognition on monophonic instrumental audio signals and spoken digit recognition, and 2 regression problems, which are audio signal denoising at 0 dB Signal-to-Noise Ratio (SNR) with Additive White Gaussian Noise (AWGN) and vocal source separation from a musical environment. The same learning rate is used for all models on any given problem.

A comparison of the performances of both models on all the problems is made by following the strategies described above. All problems involve 1-dimensional audio signals sampled at a given sampling rate and with a specific number of samples. However, the datasets used for the experiments contain either signals that have a high number of samples, or signals that have different numbers of samples within the same dataset. Due to technical limitations, creating a network that takes a high number of samples as input, or a different number of samples per signal, is time-consuming and irrelevant here, because the main aim of the experiments is to compare both models with each other, and not to produce state-of-the-art results on the considered problems. Therefore, the sliding window technique described below has been adopted for all the problems as a preprocessing step.
The sliding window technique consists of applying a sliding observation window on a signal to extract a certain number of consecutive samples, with a certain overlap between two consecutive windows. It is useful when dealing with signals that have a high number of samples, and when studying a locally stationary signal property (such as the frequency of a tone, which lasts for a certain duration).
Definition 4.
Let $x = (x_0, ..., x_{N-1}) \in \mathbb{R}^N$ be a temporal signal, let $w \in [[1, N]]$ be the size of the window, and let $\alpha \in [0, 1[$ be the overlap ratio between two consecutive windows. We define $\mathcal{W}_{w,\alpha}(x)$ as the set of all the observed windows for the signal $x$, and we express it as such:

$$\mathcal{W}_{w,\alpha}(x) = \left\{ (x_i, ..., x_{i+w-1}) \;\middle|\; i = \lfloor n(1-\alpha)w \rfloor,\; n \in \left[\left[0, \left\lfloor \frac{N - w}{(1-\alpha)w} \right\rfloor\right]\right] \right\}.$$

Remark 12.
The sliding window has the effect of widening the spectrum of the signal due to the Heisenberg-Gabor uncertainty principle [39], thus distorting it to a certain extent. Therefore, the size of the window has to be chosen so that a given observation window of the signal contains enough information to process it accordingly.
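A direct NumPy transcription of Definition 4 (illustrative helper, our naming):

```python
import numpy as np

def sliding_windows(x, w, alpha):
    """W_{w,alpha}(x) from Definition 4: size-w windows with overlap ratio alpha."""
    hop = (1 - alpha) * w
    n_max = int((len(x) - w) // hop)
    return np.stack([x[int(n * hop):int(n * hop) + w] for n in range(n_max + 1)])
```

For example, sliding_windows(x, 1600, 0.5) yields windows of 1600 samples in which consecutive windows share 800 samples.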
Since the 1DPNN and the 1DCNN are based on convolutions, they are basically used as regressors in the form of feature extractors. Their objective is to extract temporal features that will be used to classify the signals. In the case of this work, they are used to create a feature extractor block that is linked to a classifier, which classifies the input signals based on the extracted features, as described in Figure 3, where $x$ is a temporal signal, $(f_1, ..., f_p)$ are $p$ features extracted using either the 1DPNN or the 1DCNN, and $(c_1, ..., c_q)$ are probabilities describing whether the signal $x$ belongs to a certain class (there are $q$ classes in general). The classifier is a multilayer perceptron (MLP) in both problems, and the metric used to evaluate the performance of the models is the classification accuracy.

Figure 3: Block diagram of the classification procedure.

However, since the sliding window technique is used on all signals, the classifier will only be able to classify one window at a time. Therefore, for a given signal $x$, a window size $w$ and an overlap ratio $\alpha$ as defined in the previous subsection, we end up with $|\mathcal{W}_{w,\alpha}(x)|$ different classes for the same signal, where $|\mathcal{W}_{w,\alpha}(x)|$ is the cardinal of $\mathcal{W}_{w,\alpha}(x)$. Thus, we define the class of the signal as the statistical mode of all the classes estimated from every window extracted from the signal. For instance, if we have 3 classes and a signal gets decomposed into 10 windows such that 5 windows are classified as class 0, 2 windows as class 1, and 3 windows as class 2, the signal is classified as class 0, since it is the class that is represented the most among the estimated classes. As a result, the performance of each model is evaluated with respect to both the per-window accuracy and the per-signal accuracy.

For the musical note recognition problem, the dataset used is NSynth [17], a large-scale dataset of annotated musical notes produced by various instruments. It contains more than 300,000 signals sampled at 16 kHz, each lasting 4 seconds. The usual musical range is composed of 88 different notes, which represent our classes. Since the dataset is huge, we only use 12,678 signals for training and 4096 for testing. We also use a sliding window with a window size $w = 1600$ and an overlap of 0.
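Combining the window extractor above with any trained Keras-style classifier, the per-signal decision rule (statistical mode over the window predictions) can be sketched as follows; model.predict and the function name are assumptions for illustration.

```python
from collections import Counter

def classify_signal(model, x, w, alpha):
    """Per-signal label: statistical mode of the per-window class predictions."""
    windows = sliding_windows(x, w, alpha)        # from the sketch above
    probs = model.predict(windows)                # (n_windows, q) probabilities
    labels = probs.argmax(axis=1)
    return Counter(labels.tolist()).most_common(1)[0][0]
```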
Table 3: Accuracy per window for each network for note recognition over 10 folds.
NetworkStatistic (%) 1DPNN 1DCNN same topology 1DPNN-equivalentMinimum accuracy 79.9 75.3 77Maximum accuracy 81.1 78 80.2Average accuracy 80 77 79.47accuracy of the 1DPNN is higher than the 1DPNN-equivalent network.The networks are evaluated on whether they can classify a window from a signal correctly, but theyshould be evaluated on whether they classify a complete signal since the dataset is originally composed of4 second signals. In the case of this work, multiple windows are derived from a single signal that belongsto a certain class. Therefore, the windows are also considered as belonging to the same class of the signalthat they are derived from. However, the models may give a different classification for each window, so todetermine the class of a signal given the classification of its windows, we use the statistical mode of thedifferent classifications. By performing this postprocessing step, we end up with the average accuracy persignal for each network, as shown in Table 4. We notice that the per-signal accuracies are better than theper-window accuracies for all networks, and that the 1DPNN is 5% better at classifying the signals than the1DCNN with the same number of parameters.
Figure 4: Average accuracy for each epoch of all the networks trained on the note recognition dataset.

Table 4: Accuracy per signal for each network for note recognition over 10 folds.

Statistic (%)        1DPNN   1DCNN same topology   1DPNN-equivalent
Average accuracy     86.6    79.3                  81.7

For the spoken digit recognition problem, the dataset used is the Free Spoken Digits Dataset [18], which contains 2000 audio signals of different durations sampled at 8 kHz, consisting of people uttering digits from 0 to 9, which represent the classes of the signals. In this work, 1800 signals are considered for training and 200 signals for testing the networks. Since the dataset has signals of different durations, we use a sliding window with a window size of $w = 1600$ and an overlap of 0.
Table 6: Accuracy per window for each network for spoken digit recognition over 10 folds.
NetworkStatistic (%) 1DPNN 1DCNN same topology 1DPNN-equivalentMinimum accuracy 79.17 73.74 77.86Maximum accuracy 82.21 76.93 79.61Average accuracy 81.07 75.29 78.48
Figure 5: Average accuracy for each epoch of all the networks trained on the spoken digit recognition dataset.

Table 7: Accuracy per signal for each network for spoken digit recognition over 10 folds.
Statistic (%)        1DPNN   1DCNN same topology   1DPNN-equivalent
Average accuracy     86.22   78.57                 80.9
Both the 1DPNN and 1DCNN models take as input a signal and output a signal that usually has a lower temporal dimension due to border effects. However, in both regression problems that are considered, we need the output signal to have the same temporal dimension as the input signal. Therefore, we use zero-padding to avoid border effects. The metric that is used to evaluate the performances of the models for both problems is the Signal-to-Noise Ratio (SNR), measured in decibels (dB) and defined as such:
$$SNR = 10 \log_{10}\left(\frac{\mu_s}{\mu_n}\right),$$

where $\mu_s$ is the mean square of the signal without noise, and $\mu_n$ is the mean square of the noise. The dataset used for both problems is the MUSDB18 [19] dataset, containing 150 high-quality full-length stereo musical tracks (10 hours of recordings) with their isolated drums, bass, and vocal stems, sampled at 44.1 kHz. It is mainly used for source separation, but can also be used for instrument tracking or for noise reduction.

For the audio signal denoising problem, the aim is to restore voice recordings that are drowned in AWGN, making their individual SNR around 0 dB. However, since the dataset is huge and the experiments are restricted by technical limitations, we take a small subset of the training set and the test set, downsample the voice tracks to 16 kHz, and extract windows of 100 ms to obtain 40,000 short clips (80% for training, 20% for testing). The topologies used to solve the problem are detailed in Table 8 below. The statistics on the SNR gathered using 10-fold cross-validation are reported in Table 9, where we notice that the 1DPNN is slightly better than the 1DPNN-equivalent. However, Figure 6, representing the evolution of the 10-fold average SNR over the epochs, shows that the 1DPNN has a markedly faster convergence than the other networks.
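The SNR metric above can be computed directly from a clean reference and an estimate; in the sketch below (our helper, not the paper's code), the noise is taken as the residual between the estimate and the clean signal.

```python
import numpy as np

def snr_db(clean, estimate):
    """SNR in dB: mu_s is the mean square of the clean signal,
    mu_n the mean square of the residual noise."""
    mu_s = np.mean(np.square(clean))
    mu_n = np.mean(np.square(estimate - clean))
    return 10.0 * np.log10(mu_s / mu_n)
```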
Table 8: Networks’ topologies for audio signal denoising.
Table 9: SNR statistics for each network for audio signal denoising over 10 folds.
Statistic (dB)       1DPNN   1DCNN same topology   1DPNN-equivalent
Minimum SNR          6.28    5.53                  6.23
Maximum SNR          6.42    6.07                  6.35
Average SNR          6.34    5.88                  6.31
For the vocal source separation problem, the aim is to extract the vocal source from a musical recording containing musical instruments such as drums or bass. Since all signals in the dataset are stereo signals sampled at 44.1 kHz, it is very time-consuming to produce fully trained models on the entire dataset. Due to technical limitations, we take a small subset of the training set and the test set, and extract windows of 200 ms to obtain 150,000 short clips (120,000 for training and 30,000 for testing). The topologies used to solve the problem are detailed in Table 10 below. The statistics on the SNR gathered using 10-fold cross-validation are reported in Table 11, where we notice that there is a clear gap showing that even the worst 1DPNN performed better than the best 1DPNN-equivalent network. Figure 7 further emphasizes this gap by showing the 10-fold average SNR over 100 epochs. It also shows that, after only 10 epochs, the 1DPNN model starts to exceed the best average SNR achieved by the 1DCNN model in 100 epochs.

Figure 6: Average SNR in dB for each epoch of all the networks trained on audio signal denoising.

Table 10: Networks' topologies for vocal source separation.
Table 11: SNR statistics for each network for vocal source separation over 10 folds.
Statistic (dB)       1DPNN   1DCNN same topology   1DPNN-equivalent
Minimum SNR          1.93    1.61                  1.76
Maximum SNR          2.31    1.82                  1.91
Average SNR          2.08    1.73                  1.81

Figure 7: Average SNR in dB for each epoch of all the networks trained on vocal source separation.
6. Conclusion
In this paper, we have formally introduced a novel 1-Dimensional Polynomial Neural Network (1DPNN) model that induces a high degree of non-linearity starting from the initial layer, in an effort to produce compact topologies in the context of audio signal applications. Our experiments demonstrate that it has the potential to produce more accurate feature extraction and better regression than the conventional 1DCNN.
Furthermore, our model also shows faster convergence on the audio signal denoising and vocal source separation problems. We have also demonstrated that the 1DPNN model converges when we use activation functions bounded between -1 and 1, or when all the degrees are set to 1, in which case it becomes equivalent to a 1DCNN.

Our experiments are not sufficient to claim that our proposed 1DPNN surpasses the 1DCNN on all classes of complex classification and regression problems. In addition, there is still no mathematical proof that the 1DPNN is a convergent model. Moreover, the degree of every layer of the networks created for the experiments was chosen specifically to evaluate the performances. Therefore, a method that accurately estimates the appropriate degree for each layer needs to be created in order to replace the manual hyperparameter search. Furthermore, the stability of the model is not ensured, as stacked layers with high degrees can render the model unstable and make it lose its generalization capability. Therefore, there is also a need to estimate an upper bound on the degree of each layer so that the overall network learns stably from the given data. Finally, due to computational limits, the model could not be tested with the deep topologies that are used to solve very complex classification and regression problems involving huge amounts of data.

Future work may focus on demonstrating the conditions for the stability of the model as well as its convergence. Furthermore, a method that automatically sets the degree of each 1DPNN layer depending on the problem tackled may be developed so as to provide further independence from manual search. Moreover, the equations defining the 1DPNN model and its backpropagation can easily be extended to 2 and 3 dimensions, thus providing new models for image processing and video processing problems, for example. In addition, a more specific gradient descent optimization scheme may be developed to avoid gradient explosion. Finally, deeper 1DPNN topologies can be created to compete with state-of-the-art models on any 1-dimensional signal-related problem, not only audio signals. More thorough experiments and further mathematical analysis can help this model thrive and find its place as a standard model in the deep learning field.
Acknowledgement
This work was funded by the Mitacs Globalink Graduate Fellowship Program (no. FR41237) and the NSERC Discovery Grants Program (nos. 194376, 418413).
References

[1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015. https://doi.org/10.1016/j.neunet.2014.09.003.
[2] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in Shape, Contour and Grouping in Computer Vision, (Berlin, Heidelberg), p. 319, Springer-Verlag, 1999.
[3] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biol. Cybernetics, vol. 36, pp. 193–202, 1980. https://doi.org/10.1007/BF00344251.
[4] B. A. Pearlmutter, “Learning state space trajectories in recurrent neural networks,” in International 1989 Joint Conference on Neural Networks, pp. 365–372, vol. 2, 1989. https://doi.org/10.1109/IJCNN.1989.118724.
[5] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland, “Finite state automata and simple recurrent networks,” Neural Computation, vol. 1, no. 3, pp. 372–381, 1989. https://doi.org/10.1162/neco.1989.1.3.372.
[6] S. Liu and W. Deng, “Very deep convolutional neural network based image classification using small training sample size,” in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 730–734, 2015. https://doi.org/10.1109/ACPR.2015.7486599.
[7] J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017. https://doi.org/10.1109/LSP.2017.2657381.
[8] S. Oeda, I. Kurimoto, and T. Ichimura, “Time series data classification using recurrent neural network with ensemble learning,” in Knowledge-Based Intelligent Information and Engineering Systems (B. Gabrys, R. J. Howlett, and L. C. Jain, eds.), (Berlin, Heidelberg), pp. 742–748, Springer Berlin Heidelberg, 2006.
[9] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, “Recurrent neural networks for multivariate time series with missing values,” Scientific Reports, vol. 8, p. 6085, Apr. 2018. https://doi.org/10.1038/s41598-018-24271-9.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[11] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
[12] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989. https://doi.org/10.1016/0893-6080(89)90020-8.
[13] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” Ann. Statist., vol. 36, pp. 1171–1220, 06 2008. https://doi.org/10.1214/009053607000000677.
[14] Y. Cho and L. K. Saul, “Kernel methods for deep learning,” in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, eds.), pp. 342–350, Curran Associates, Inc., 2009.
[15] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 2627–2635, Curran Associates, Inc., 2014.
[16] D. Chen, L. Jacob, and J. Mairal, “Recurrent kernel networks,” in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 13431–13442, Curran Associates, Inc., 2019.
[17] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1068–1077, JMLR.org, 2017.
[18] [dataset] Z. Jackson, C. Souza, J. Flaks, and H. Nicolas, “Jakobovski/free-spoken-digit-dataset v1.0.6,” Oct. 2017. https://doi.org/10.5281/zenodo.1039893.
[19] [dataset] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017. https://doi.org/10.5281/zenodo.1117372.
[20] B. Graham, “Fractional max-pooling,” CoRR, vol. abs/1412.6071, 2014.
[21] C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (A. Gretton and C. C. Robert, eds.), vol. 51 of Proceedings of Machine Learning Research, (Cadiz, Spain), pp. 464–472, PMLR, 09–11 May 2016.
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jan. 2014.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456, JMLR.org, 2015.
[24] A. F. Agarap, “Deep learning using rectified linear units (relu),” CoRR, vol. abs/1803.08375, 2018.
[25] J. Turian, J. Bergstra, and Y. Bengio, “Quadratic features and deep architectures for chunking,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, (Boulder, Colorado), pp. 245–248, Association for Computational Linguistics, June 2009.
[26] R. Livni, S. Shalev-Shwartz, and O. Shamir, “A provably efficient algorithm for training deep networks,” CoRR, vol. abs/1304.7045, 2013.
[27] C. Wang, J. Yang, L. Xie, and J. Yuan, “Kervolutional neural networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 31–40, 2019. https://doi.org/10.1109/CVPR.2019.00012.
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. https://doi.org/10.1109/CVPR.2016.90.
[29] J. Mairal, “End-to-end kernel learning with supervised convolutional kernel networks,” in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), pp. 1399–1407, Curran Associates, Inc., 2016.
[30] N. Aronszajn, “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, pp. 337–404, 1950.
[31] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision – ECCV 2014 (D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds.), (Cham), pp. 184–199, Springer International Publishing, 2014.
[32] D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, “Heterogeneous multilayer generalized operational perceptron,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 710–724, 2020. https://doi.org/10.1109/TNNLS.2019.2914082.
[33] S. Kiranyaz, T. Ince, A. Iosifidis, and M. Gabbouj, “Operational neural networks,” Neural Computing and Applications, vol. 32, pp. 6645–6668, 2020.
[34] S. Kiranyaz, J. Malik, H. Ben Abdallah, T. Ince, A. Iosifidis, and M. Gabbouj, “Self-organized operational neural networks with generative neurons,” CoRR, vol. abs/2004.11778, 2020.
[35] A. Cauchy, “Méthode générale pour la résolution des systèmes d’équations simultanées,” C.R. Acad. Sci. Paris, vol. 25, pp. 536–538, 1847.
[36] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Y. W. Teh and M. Titterington, eds.), vol. 9 of Proceedings of Machine Learning Research, pp. 249–256, PMLR, 2010.
[38] F. Chollet et al., “Keras.” https://github.com/fchollet/keras, 2015. (accessed 27 July 2020).
[39] D. Gabor, “Theory of communication. Part 1: The analysis of information,” Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, vol. 93, no. 26, pp. 429–441, 1946. https://doi.org/10.1049/ji-3-2.1946.0074.
[40] P. Refaeilzadeh, L. Tang, and H. Liu, Cross-Validation, pp. 532–538. Boston, MA: Springer US, 2009. https://doi.org/10.1007/978-0-387-39940-9_565.