Nonlinear ISA with Auxiliary Variables for Learning Speech Representations
Amrith Setlur†, Barnabas Poczos†, Alan W Black†
†Carnegie Mellon University
[email protected], [email protected], [email protected]
Abstract
This paper extends recent work on nonlinear Independent Component Analysis (ICA) by introducing a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables. Observed high-dimensional acoustic features like log Mel spectrograms can be considered surface-level manifestations of nonlinear transformations over individual multivariate sources of information such as speaker characteristics, phonological content, etc. Under the assumptions of energy-based models, we use the theory of nonlinear ISA to propose an algorithm that learns unsupervised speech representations whose subspaces are independent and potentially highly correlated with the original non-stationary multivariate sources. We show how nonlinear ICA with auxiliary variables can be extended to a generic identifiable model for subspaces as well, while also providing sufficient conditions for the identifiability of these high-dimensional subspaces. Our proposed methodology is generic and can be integrated with standard unsupervised approaches to learn speech representations with subspaces that can theoretically capture independent higher-order speech signals. We evaluate the gains of our algorithm when integrated with the Autoregressive Predictive Coding (APC) model by showing empirical results on the speaker verification and phoneme recognition tasks.
Index Terms: ISA, speech representation learning, unsupervised learning
1. Introduction
The speech signals that we observe can be viewed as high-dimensional surface-level manifestations of samples from independent non-stationary sources that are entangled via a nonlinear mixing mechanism. These sources can be entangled at session, utterance or segment levels [1]. Speech representations learnt by training deep recurrent models [2, 3] over these surface-level features fail to capture the original signals in their purest disentangled form. Unsupervised disentanglement of speech representations has been an active area of research [4, 5], since it has been shown that recovering independent factors of variation can improve the performance of downstream tasks like Automatic Speech Recognition (ASR), especially under low resource constraints and domain mismatch [1]. Inspired by this, we propose an algorithm to learn unsupervised speech representations with independent subspaces, each of which can capture distinct disentangled source signals. These distinct subspaces can potentially be informative of patterns based on speaker characteristics or subphonetic events, which can be useful for learning a variety of acoustic models given very few labeled samples for each. Recently [6] it has been shown that learning disentangled representations is impossible without explicit bias on the algorithm and the data. Hence, we leverage a more principled approach to capturing the independent sources through the lens of nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.

Nonlinear Independent Component Analysis (ICA) is a provably unidentifiable problem [7], as opposed to linear ICA [8], which is identifiable given non-Gaussian sources and other fundamental restrictions on the mixing matrix [9]. Attempts [10, 11, 12] have been made to solve nonlinear ICA for i.i.d. distributions under slightly stronger assumptions on the generative process [13, 14]. Recent progress in the field [15, 16] has revolved around a generic identifiable model that renders the latent sources conditionally independent in the presence of auxiliary variables. However, most of this work [17, 18] has focused on univariate sources, which means that these models cannot be directly applied to speech, where the source signals are very high-dimensional. Hence, we extend the auxiliary-variables model proposed by [17] to multivariate sources by first stating sufficient conditions for the separability of the sources and then providing training objectives suitable for learning speech representations with finite audio samples. Nonlinear ISA is leveraged to learn unsupervised features on large unlabeled speech datasets. Using these features, simpler (linear) models are learnt on small labeled datasets.

Numerous approaches [5, 19, 20, 21] have been proposed for learning unsupervised speech representations. Recent ones [2, 5] have been based on predictive coding schemes that use language-model-like objectives. In parallel, there have been efforts to learn quantized representations via temporal segmentation and phonetic clustering [22] so as to map frame representations to linguistic units, but such models are fairly complicated and tricky to train. Also, most of these methods learn highly entangled representations that suffer from spurious correlations in the underlying data and thus fail to generalize. Our proposed algorithm improves upon these approaches by advocating for independent subspaces attained via additional constraints in the original optimization objectives. We begin by providing a theoretically identifiable model for nonlinear ISA and then discuss how the model can be incorporated into existing methods for learning unsupervised speech representations.
2. Theory
We introduce a generative model of the observed data that we assume henceforth, and present conditions under which the original multi-dimensional sources are identifiable. We assume that the observed data $x \in \mathcal{X} \subset \mathbb{R}^{nd}$ is generated by applying a nonlinear invertible transform $f$ to $n$ source signals $s_1 \ldots s_n \in \mathcal{S} \subset \mathbb{R}^d$. We are given a dataset $\mathcal{D} = \{(x^{(i)}, u^{(i)})\}_{i=1}^{N}$ with $N$ samples, where each $x^{(i)} = f(s^{(i)})$ and $s^{(i)} = \bigoplus_{j=1}^{n} s^{(i)}_j = [s^{(i)}_1 \ldots s^{(i)}_n]$. Here, $u^{(i)} \in \mathcal{U} \subset \mathbb{R}^p$ denotes the corresponding auxiliary variable for $x^{(i)}$, $\bigoplus$ denotes the concatenation operation, and $f: \mathcal{S}^n \to \mathcal{X}$ is a nonlinear mixing function (eqn. 1) that is differentiable almost everywhere (a.e.). The objective is to learn representations that can recover the source signals $\{s_i\}_{i=1}^{n}$ up to an identifiability factor that we shall define shortly. For notational convenience, we denote the $j$th scalar element in a vector $z$ as $z_j$, and the $i$th consecutive $d$-dimensional vector ($i$th subspace) in $z$ as $z_i$ or as $z_{i:} = [z_{(i-1)d+1} \ldots z_{id}]$.

Model
The source distributions $\{p_i(s_i)\}_{i=1}^{n}$ are assumed to be independent given the auxiliary variable $u$ (eqn. 1), and their densities are given by conditional energy-based models (eqn. 2):

$$x = f(s), \qquad \log p(s \mid u) = \sum_{i=1}^{n} \log p_i(s_i \mid u) \qquad (1)$$

$$p_i(s_i \mid u) = \frac{\exp\left(\phi_i(s_i)^T \eta_i(u)\right)}{Z_i(u)}, \qquad \phi_i: \mathcal{S} \to \mathbb{R}^m, \quad \eta_i: \mathcal{U} \to \mathbb{R}^m \qquad (2)$$
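As a concrete illustration of this generative model, the toy sketch below samples conditionally Gaussian sources, a special case of the exponential-family densities in eqn. (2), and mixes them with a random invertible leaky-ReLU network playing the role of $f$. All sizes and names are illustrative and are not part of our experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_aux = 3, 2, 8          # n sources, each d-dimensional; n_aux auxiliary values

# Conditional source parameters: a special case of eqn. (2),
# p_i(s_i | u) = N(mu_i[u], I), i.e. phi_i(s_i) = s_i and eta_i(u) = mu_i[u].
mu = rng.normal(size=(n, n_aux, d))

# Random mixing f: alternating linear maps (invertible almost surely) and leaky-ReLU.
weights = [rng.normal(size=(n * d, n * d)) for _ in range(3)]

def f(s):
    x = s
    for W in weights:
        x = x @ W.T
        x = np.where(x > 0, x, 0.2 * x)   # leaky-ReLU, invertible elementwise
    return x

def sample(num):
    u = rng.integers(0, n_aux, size=num)                                   # auxiliary variable
    s = np.stack([rng.normal(mu[i, u], 1.0) for i in range(n)], axis=1)    # (num, n, d)
    x = f(s.reshape(num, n * d))                                           # observed mixtures
    return x, u, s

x, u, s = sample(1024)
print(x.shape, u.shape, s.shape)
```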
Definition of Identifiability
We shall define the original sources $\{s_i\}_{i=1}^{n}$ to be identifiable if there exists an algorithm that takes as input a pair comprising the observed sample and the corresponding auxiliary variable, $(x = f(s), u)$, and outputs $[g_1(s_{\pi_1}), \ldots, g_n(s_{\pi_n})]$ for some permutation $\pi: \mathbb{N}_n \to \mathbb{N}_n$ over $\{1 \ldots n\}$. Each $g_i: \mathcal{S} \to \mathcal{S}$ is an invertible (a.e.) function and is defined as a function of a single distinct source $s_{\pi_i}$.

Popular algorithms [8] in linear ICA [9] rely on estimators of Mutual Information (MI) to separate the observed mixed samples into samples from the original source signals. Similarly, for nonlinear ICA we compute the MI between the observed and auxiliary variables ($I(x, u)$ in eqn. 3) using Noise Contrastive Estimation (NCE) [23]. A nonlinear logistic classifier is used to distinguish between correct (observed) pairs $(x^{(i)}, u^{(i)})$ and randomly generated incorrect pairs $(x^{(i)}, \tilde{u}^{(i)})$, where $\tilde{u}^{(i)}$ is drawn from the marginal distribution over $u$. The regression function for this logistic classifier is given by $r(x, u)$, where $h_i: \mathcal{X} \to \mathbb{R}^d$ and $\psi_i: \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R} \in L^2$ are sufficiently smooth universal function approximators (neural networks) and $\forall i$, $h_i$ is invertible a.e.

$$I(x, u) = \int_{x, u} \log \frac{p(x, u)}{p(x)\,p(u)} \, dP(x, u) = \int r(x, u) \, dP(x, u), \quad \text{where } r(x, u) = \sum_{i=1}^{n} \psi_i(h_i(x), u) \qquad (3)$$
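A minimal sketch of how the regression function in eqn. (3) could be parameterized is given below; the layer sizes and module names are illustrative choices rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class NCERegression(nn.Module):
    """r(x, u) = sum_i psi_i(h_i(x), u), as in eqn. (3)."""
    def __init__(self, x_dim, u_dim, n_subspaces, d):
        super().__init__()
        self.n, self.d = n_subspaces, d
        # h maps the observation x to n subspaces of dimension d each.
        self.h = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                               nn.Linear(256, n_subspaces * d))
        # One psi_i per subspace, taking (h_i(x), u) and returning a scalar.
        self.psi = nn.ModuleList([
            nn.Sequential(nn.Linear(d + u_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            for _ in range(n_subspaces)])

    def forward(self, x, u):
        y = self.h(x).view(x.shape[0], self.n, self.d)          # subspaces h_i(x)
        scores = [psi_i(torch.cat([y[:, i], u], dim=-1))        # psi_i(h_i(x), u)
                  for i, psi_i in enumerate(self.psi)]
        return torch.cat(scores, dim=-1).sum(dim=-1), y          # r(x, u) and h(x)
```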
The following main ISA separation theorem states that the vector $h(x) = \bigoplus_{i=1}^{n} h_i(x) \in \mathbb{R}^{nd}$, with subspaces $h_i(x) \in \mathbb{R}^d$, can recover $s_i$, since $\exists\, \pi, \{g_i\}_{i=1}^{n}$ such that $h_i(x) = g_i(s_{\pi_i})$.

Theorem 1. Given that we observe the dataset $\mathcal{D}$ with $N$ samples $\{x^{(i)} = f(s^{(i)}), u^{(i)}\}_{i=1}^{N}$ generated by a model based on eqns. (1, 2), then under the following assumptions:

1. Realizability Assumption: Given infinite ($N \to \infty$) samples, one can efficiently learn $\psi_i^*, h_i^*$ such that the NCE algorithm can estimate the mutual information $I(x, u)$ with an arbitrarily small error, using a regression function $r(x, u)$ that follows the form in eqn. 3.

2. Separability Assumption: $\forall\, s \in \mathcal{S}^n$ and $z \neq 0 \in \mathbb{R}^d$, with first- and second-order derivatives given by the tensors $\nabla \phi_i(s_i) \in \mathbb{R}^{m \times d}$ and $\nabla^2 \phi_i(s_i) \in \mathbb{R}^{m \times d \times d}$ respectively, $\exists\, \{u_l\}_{l=0}^{nd} \in \mathcal{U}^{nd+1}$ such that the set
$$\left\{ \bigoplus_{i=1}^{n} \begin{bmatrix} \nabla \phi_i(s_i)^T \\ (\nabla^2 \phi_i(s_i) \,\bar{\times}\, z)^T \end{bmatrix} \zeta_i(u_l, u_0) \right\}_{l=1}^{nd}$$
spans $\mathbb{R}^{nd}$, where $\zeta_i(u_l, u_0) = \eta_i(u_l) - \eta_i(u_0)$ (the separability assumption requires the auxiliary variables $u$ to have a sufficiently strong and diverse effect on the source distributions [17]),

the subspaces $\{h_i(x)\}_{i=1}^{n}$ can identify the conditionally independent sources $\{s_i\}_{i=1}^{n}$ up to the definition of identifiability.

Proof Sketch: For an observed sample $x \in \mathcal{X}$, let $y = h^*(x)$ be given by the optimal functions $\{h_i^*\}_{i=1}^{n}$. The functions $\{\psi_i^*, h_i^*\}_{i=1}^{n}$ are learnt using the NCE objective whose regression function is given by $r(x, u)$. Since $s = f^{-1}(h^{-1}(y))$ is a composition of two invertible transforms, we introduce $v: \mathbb{R}^{nd} \to \mathcal{S}^n$ with $s = v(y)$. Also, let $f^{-1}$ be denoted by $g$. From eqn. 3, at the optimum $r(x, u) = \log p(x \mid u) - \log p(x)$. Using the density transformation rules [25] for invertible functions, we can show that $\log p(x \mid u) = \log p(s \mid u) + \log|\det J_g(x)|$ and $\log p(x) = \log p(s) + \log|\det J_g(x)|$. Thus, $r(x, u) = \log p(s \mid u) - \log p(s)$. Using eqn. 3,
$$\sum_{i=1}^{n} \psi_i^*(y_i, u) = \log p(v(y) \mid u) - \log p(v(y)) \qquad (4)$$
We begin by substituting eqns. 1, 2 into the above result. Also, since eqn. 4 holds for all $\{u_l\}_{l=0}^{nd}$, we can get $nd + 1$ such equations, and from each we can subtract the equation given by $u_0$, which leaves us with $nd$ equations of the form $\sum_{i=1}^{n} \left[ \phi_i(v(y)_{i:})^T \zeta_i(u_l, u_0) - (\log Z_i(u_l) - \log Z_i(u_0)) \right] = \sum_{i=1}^{n} \psi_i^*(y_i, u_l)$. Taking the derivative of both sides of this equation w.r.t. $y_j$, and subsequently w.r.t. $y_k$ s.t. $\lceil j/d \rceil \neq \lceil k/d \rceil$, we get
$$\sum_{i=1}^{n} \Big[ \underbrace{\nabla \phi_i(v(y)_{i:})}_{(a)} \frac{\partial^2 v(y)_{i:}}{\partial y_j \partial y_k} \Big]^T \zeta_i(u_l, u_0) + \Big[ \underbrace{\Big( \nabla^2 \phi_i(v(y)_{i:}) \,\bar{\times}\, \frac{\partial v(y)_{i:}}{\partial y_j} \Big)}_{(b)} \frac{\partial v(y)_{i:}}{\partial y_k} \Big]^T \zeta_i(u_l, u_0) = 0$$
Concatenating (a) and (b) into a single matrix, the above can be written as a single Euclidean inner product:
$$\left( \bigoplus_{i=1}^{n} \begin{bmatrix} \nabla \phi_i(s_i)^T \\ \big( \nabla^2 \phi_i(s_i) \,\bar{\times}\, \frac{\partial v(y)_{i:}}{\partial y_j} \big)^T \end{bmatrix} \zeta_i(u_l, u_0) \right) \cdot \Gamma(y) = 0, \qquad \Gamma(y) = \bigoplus_{i=1}^{n} \begin{bmatrix} \frac{\partial^2 v(y)_{i:}}{\partial y_j \partial y_k} \\[2pt] \frac{\partial v(y)_{i:}}{\partial y_k} \end{bmatrix}$$
The above equation holds true for the $nd$ distinct values of the auxiliary variable $u_l$. For invertible $v$, if we assume that $\frac{\partial v(y)_{i:}}{\partial y_j} \neq 0$, then we can apply the separability assumption, which implies $\Gamma(y) = 0$; this in turn implies that $\frac{\partial v(y)_{i:}}{\partial y_k} = 0$. Thus $\forall i$, $\frac{\partial v(y)_{i:}}{\partial y_j} = 0 \,\vee\, \frac{\partial v(y)_{i:}}{\partial y_k} = 0$. Since $\lceil j/d \rceil \neq \lceil k/d \rceil$, $y_j$ and $y_k$ belong to distinct subspaces of $y = h(x)$, so the $i$th source given by $v(y)_{i:}$ cannot simultaneously be a function of two distinct subspaces of $h(x)$. Given the invertible function $f(h(\cdot))$ with its full-rank Jacobian, we can recover the sources $\{s_i\}_{i=1}^{n}$ via the subspaces of $h(x)$: $h_i(x) = g_i(s_{\pi_i})$ for an invertible function $g_i$ and permutation $\pi$. (Here $\bar{\times}$ denotes the 3rd mode product [24], and the subscript $i$ is dropped wherever it can be understood from context. For more details on the validity and necessity of similar results for independent components, as opposed to subspaces, we refer the reader to [17]; the proof sketch above, shown for completeness, extends the proof for the univariate case [17, 18].)

Hilbert-Schmidt Independence Criterion (HSIC) [26]
The above theorem proves the existence of functions $\psi^*, h^*$ that can not only compute $I(x, u)$ with arbitrary precision but can also recover the original multi-dimensional sources. However, the NCE algorithm relies on the assumption of infinite samples of positive $(x, u)$ and negative $(x, \tilde{u})$ pairs, which is rarely true in practice. Hence, along with the NCE objective which learns $r(x, u)$ to distinguish between those pairs, we introduce constraints imposed via the HSIC estimator that specifically account for independence amongst the subspaces of $h(x)$. This acts as a strong inductive bias for learning $\psi^*, h^*$ with finite observed samples of $(x, u)$. HSIC is a kernel-based statistical test of independence for two multivariate random variables and is well suited for high-dimensional data, as opposed to tests [27, 28, 29] based on the power divergence family and characteristic functions, which are mainly meant for low-dimensional random variables [26]. Given $\mathcal{D} = \{x^{(i)}, u^{(i)}\}_{i=1}^{N}$ with $N$ samples, let the set of features $h(x^{(i)})$ be denoted by $\{y^{(i)} = h(x^{(i)})\}_{i=1}^{N}$. For $d$-dimensional subspaces $j, k$, let $y_j \in \mathcal{Y}_j \subseteq \mathbb{R}^d$, $y_k \in \mathcal{Y}_k \subseteq \mathbb{R}^d$, and let $P_{jk}$ denote a Borel probability measure over $\mathcal{Y}_j \times \mathcal{Y}_k$ with $N$ i.i.d. samples $Z_{jk} := \{(y_j^{(i)}, y_k^{(i)})\}_{i=1}^{N}$ drawn from it. If $\mathcal{F}, \mathcal{G}$ are two Reproducing Kernel Hilbert Spaces (RKHS) equipped with kernels $k_f, k_g$ (with $k_f: \mathcal{Y}_j \times \mathcal{Y}_j \to \mathbb{R}$, $k_g: \mathcal{Y}_k \times \mathcal{Y}_k \to \mathbb{R}$; for $z, z' \in \mathcal{Y}_j$, $k_f(z, z') = \langle k_f(z, \cdot), k_f(z', \cdot) \rangle_{\mathcal{F}}$, and for $z, z' \in \mathcal{Y}_k$, $k_g(z, z') = \langle k_g(z, \cdot), k_g(z', \cdot) \rangle_{\mathcal{G}}$), then the biased empirical HSIC criterion is $\hat{H}_{jk} = \frac{1}{N^2}\,\mathrm{tr}(K_f^{(j)} H K_g^{(k)} H)$, where $K_f^{(j)}[p, q] = k_f(y_j^{(p)}, y_j^{(q)})$, $K_g^{(k)}[p, q] = k_g(y_k^{(p)}, y_k^{(q)})$, and $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T \in \mathbb{R}^{N \times N}$.
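The biased empirical HSIC estimator above can be computed directly from the Gram matrices of the two subspaces. The sketch below assumes RBF kernels with a fixed bandwidth, which is an illustrative choice and not a detail prescribed by the method.

```python
import torch

def rbf_gram(z, sigma=1.0):
    """Gram matrix K[p, q] = exp(-||z_p - z_q||^2 / (2 sigma^2))."""
    sq = torch.cdist(z, z) ** 2
    return torch.exp(-sq / (2 * sigma ** 2))

def hsic_biased(y_j, y_k, sigma=1.0):
    """Biased empirical HSIC between two subspaces y_j, y_k of shape (N, d)."""
    N = y_j.shape[0]
    K = rbf_gram(y_j, sigma)
    L = rbf_gram(y_k, sigma)
    # Centering matrix H = I - (1/N) 11^T.
    H = torch.eye(N, device=y_j.device) - torch.ones(N, N, device=y_j.device) / N
    return torch.trace(K @ H @ L @ H) / (N ** 2)
```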
Algorithm (NCE-HSIC)
We have shown that the NCE algorithm can learn a regression function of the form $r(x, u)$ (eqn. 3) with optimal predictors $\psi^*, h^*$ such that the subspaces of $h^*(x)$ can recover the original sources $s_i$. Constrained by a finite dataset, we use the biased empirical HSIC estimator $\hat{H}_{jk}$ (lower values imply more independence) as an additional objective while optimizing for $\psi^*, h^*$. If the true and noisy samples for the NCE algorithm are given by $(x^{(l)}, u^{(l)})$ and $(x^{(l)}, u^{(l' \neq l)})$ respectively, then the final loss objective $L_{nh}$ for NCE-HSIC is:

$$L_{nh} = \frac{1}{N} \sum_{l \in [N]} \left[ r(x^{(l)}, u^{(l' \neq l)}) - r(x^{(l)}, u^{(l)}) \right] + \lambda \sum_{j, k} \hat{H}_{jk}$$
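Assembled from the two previous sketches, a rough implementation of $L_{nh}$ could look as follows, assuming that negative pairs are formed by shuffling the auxiliary variables within a batch:

```python
import torch

def nce_hsic_loss(model, x, u, lam=1.0):
    """L_nh: contrastive NCE term plus pairwise HSIC penalties over the subspaces."""
    u_neg = u[torch.randperm(u.shape[0])]          # noisy pairs (x, u~), u~ from the marginal of u
    r_pos, y = model(x, u)                          # r(x, u) and subspaces h_i(x): (N, n, d)
    r_neg, _ = model(x, u_neg)
    nce = (r_neg - r_pos).mean()
    hsic = sum(hsic_biased(y[:, j], y[:, k])        # hsic_biased from the previous sketch
               for j in range(model.n) for k in range(j + 1, model.n))
    return nce + lam * hsic
```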
3. Proposed Methodology
Speech representations that can explicitly capture factors of variation like phoneme identities or speaker traits, while being invariant to other factors like the underlying pitch contour or background noise [4, 5], have proven to be beneficial since they are less prone to overfitting on spurious correlations in the data. Nevertheless, disentanglement is hard to achieve in general due to the presence of confounding variables [6]. In this section, we introduce our approach, APC-NCE-HSIC or ANH, to learn representations with independent subspaces that can theoretically capture distinct acoustic/linguistic units relevant for downstream tasks like ASR.

Nonlinear ISA provides us with a simple yet principled framework for learning speech representations in the presence of auxiliary variables, which in the case of sequential data like speech can be "time". Learning unsupervised representations can be posed as the problem of recovering, from entangled samples, the non-stationary sources that are independent given the auxiliary variable (the time frame sequence). (Auxiliary variables could potentially be given by other domains like the frequency spectrum, but in this work we focus only on time.) The NCE-HSIC algorithm can be used to identify the original factors of variation via distinct independent subspaces. In order to ensure that the independent subspaces are not only mutually exclusive but also have high MI with surface features like Mel-frequency cepstral coefficients (MFCC) or log Mel spectrograms (LMS), we build on existing approaches based on predictive coding strategies [19, 3]. Although our algorithm can be seamlessly integrated into any of these methods, in this work we show empirical results that highlight the performance improvements gained by incorporating the NCE-HSIC criterion into the APC model.
APC [2] is a language-model-based method to learn unsupervised speech representations. It uses a Recurrent Neural Network (RNN) to model temporal information within an acoustic sequence comprising LMS features $\{x_i\}_{i=0}^{T}$. Given these features up to a fixed time step $t$, the APC model predicts the surface feature $\tau$ time steps ahead, i.e. $x_{t+\tau}$. If $\{\hat{p}_i\}_{i=0}^{T-\tau}$ represents the sequence predicted by the RNN, then the $\ell_1$ loss used to train the model is given by:

$$L_{apc}(x) = \sum_{i=0}^{T-\tau} -\log p(x_{i+\tau} \mid x_0 \ldots x_i) = \sum_{i=0}^{T-\tau} |\hat{p}_i - x_{i+\tau}|$$
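A compact sketch of the APC objective, an RNN predicting the surface feature $\tau$ steps ahead under an $\ell_1$ loss, is shown below; the network sizes are placeholders and do not reproduce the exact architecture of [2].

```python
import torch
import torch.nn as nn

class APC(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)    # predicts the frame tau steps ahead

    def forward(self, x):                              # x: (batch, T, feat_dim)
        hidden, _ = self.rnn(x)
        return self.proj(hidden), hidden

def apc_loss(pred, x, tau):
    """L_apc: l1 distance between the prediction at frame i and the frame at i + tau."""
    return (pred[:, :-tau] - x[:, tau:]).abs().mean()
```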
APC-NCE-HSIC, or ANH, is our proposed model, where features with independent subspaces are learnt through the NCE-HSIC criterion applied to the hidden states of the RNN module trained with the APC objective above. Specifically, the function $h(x)$ is modeled using the RNN. The NCE-HSIC criterion increases the correlation of the original sources with the subspaces of $h(x)$, or in this case the subspaces of the hidden states of the RNN. If the RNN is parameterized by $\theta \in \Theta$, then the hidden state can be represented as the function $h(\theta, x)$. With $r(x, u) = \sum_{i=1}^{n} \psi_i(h_i(\theta, x), u)$, the final objective is:

$$\underset{\{\psi_i\}_{i=1}^{n},\, \theta}{\operatorname{argmin}} \; L_{anh} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} L_{apc}(x) + \beta L_{nh} \qquad (5)$$
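For illustration, one optimization step of the combined objective in eqn. (5) could be organized as in the sketch below, which reuses the APC and NCE-HSIC sketches given earlier; here psi_model is assumed to expose the same interface as the NCE regression module but to read its subspaces directly from the RNN hidden state.

```python
import torch

def anh_step(apc_model, psi_model, optimizer, x, u, tau=5, beta=1.0, lam=1.0):
    """One optimization step of L_anh = L_apc + beta * L_nh (eqn. 5); all values illustrative."""
    pred, hidden = apc_model(x)                        # hidden: (B, T, n*d) states h(theta, x)
    loss_apc = apc_loss(pred, x, tau)                  # APC l1 prediction loss (earlier sketch)
    # Every frame contributes a (h(theta, x_t), u_t) pair to the NCE-HSIC term; in practice
    # the HSIC Gram matrices would be computed on a subsample of frames to stay tractable.
    h_flat = hidden.reshape(-1, hidden.shape[-1])      # (B*T, n*d)
    u_flat = u.reshape(-1, u.shape[-1])                # (B*T, u_dim) auxiliary variables
    loss_nh = nce_hsic_loss(psi_model, h_flat, u_flat, lam=lam)
    loss = loss_apc + beta * loss_nh
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```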
Auxiliary Variables
The original LMS sequence of length $T$ is fragmented into time segments $\{s_j\}_{j=1}^{\lceil T/\gamma \rceil}$ of length $\gamma$, and each element $x_{j,t}$ in a given segment $s_j$ has its auxiliary variable $u_{j,t}$ set to the value $j$, which is simply the corresponding segment's position in the input sequence. The hidden states of the RNN, along with the generated auxiliary variables, are passed to the NCE module, which first generates positive $(x_t, u_t)$ and negative $(x_t, \tilde{u}_t)$ pairs and then learns $\psi^*, \theta^*$ to distinguish between them optimally. Upon completion of the unsupervised learning phase, the hidden state for the $t$th frame with surface features $x_t$ comprises $n$ subspaces $\{h_i(\theta^*, x_t)\}_{i=1}^{n}$ that capture different factors of variation, independent for the same value of the auxiliary variable $u_t$. Thus the hidden states can efficiently decouple factors that vary independently locally.
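Generating these time-based auxiliary variables requires only a few lines of indexing; in the sketch below the segment index is one-hot encoded, which is an illustrative choice rather than a detail prescribed by the method.

```python
import torch

def make_aux_variables(T, gamma, batch_size):
    """u_t = index of the length-gamma segment that frame t falls into."""
    seg_idx = torch.arange(T) // gamma                              # (T,) segment id per frame
    n_segments = int(seg_idx.max().item()) + 1
    u = torch.nn.functional.one_hot(seg_idx, n_segments).float()    # (T, n_segments)
    return u.unsqueeze(0).expand(batch_size, -1, -1)                # (B, T, n_segments)
```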
NCE is a powerful tool to estimate MI and has been used in recent works like CPC [3], which relies on the NCE objective to distinguish pairs of context vectors from the same or different time segments. This approach is similar to Time Contrastive Learning (TCL) [18], an algorithm for nonlinear ICA. Although TCL has only been shown to work for univariate cases and CPC does not model independent subspaces explicitly, they serve as a strong motivation for our approach, which addresses both concerns.

4. Experiments and Results
In this section, we empirically evaluate the performance of the proposed ANH algorithm against two baseline models, APC and CPC, on two downstream tasks: (1) phoneme recognition (PR) and (2) speaker verification (SV).

Datasets and Implementation
The LibriSpeech corpus [30] was used for unsupervised training of the ANH model and the other baselines. The datasets for PR and SV were picked from the WSJ [31] and TIMIT corpora respectively. (For brevity we skip the details of the datasets and refer the reader to [2], from which we borrowed the dataset splits and input LMS features.) For APC we use a multi-layer unidirectional LSTM with residual connections exactly as detailed in [2], with the exception of using 4 layers in the LSTM (wherever mentioned explicitly), and for CPC the modifications suggested in [2] are made for a fair comparison. In the unsupervised phase we train the RNN using the $L_{anh}$ objective; the weights $\beta$ in $L_{anh}$ and $\lambda$ in $L_{nh}$ were treated as hyperparameters and tuned empirically, and unless specified otherwise all ANH models are trained with $\gamma = 30$. The RNN hidden states are assumed to be a collection of $n = 4$ contiguous subspaces, each of which has $d = 128$ dimensions. These subspaces of the RNN, parameterized by $\theta$, represent the output $\{h_i(\theta, x)\}_{i=1}^{n}$, where $x$ is the LMS feature and $h_i(\theta, x)$ is the $i$th subspace. The NCE module also needs $\psi_i(\cdot, \cdot)$, which is implemented using 4-layer MLPs with ReLU activations, dropout and batch normalization. For $L_{nh}$, five negative pairs are drawn for every positive pair. In the supervised phase, once the ISA features $h(\theta^*, x)$ given by the hidden states (final layer) of the trained RNN ($\theta^*$) are extracted, a supervised linear classifier is trained over features from each frame for PR, whereas an LDA model is trained over features averaged over the entire sequence for SV.
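The supervised phase can be sketched with off-the-shelf classifiers as below; the arrays stand in for the extracted ISA features and labels, and the specific scikit-learn models are merely illustrative of the linear classifier and LDA backend described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
nd = 4 * 128   # n = 4 subspaces of d = 128 dimensions each

# Placeholder stand-ins; in practice these come from the trained RNN's hidden states.
feats_frame = rng.normal(size=(1000, nd)); phone_labels = rng.integers(0, 40, size=1000)
feats_utt = rng.normal(size=(200, nd));    spk_labels = rng.integers(0, 20, size=200)

# Phoneme recognition: linear classifier over per-frame ISA features.
pr_clf = LogisticRegression(max_iter=1000).fit(feats_frame, phone_labels)

# Speaker verification: LDA model over features averaged across each utterance.
sv_model = LinearDiscriminantAnalysis().fit(feats_utt, spk_labels)
```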
Phoneme Recognition
Table 1 highlights the performance (Phone Error Rates, PER) of our approach (ANH) against the best variants of the CPC and APC models. The supervised baseline (LMS + MLP), which involves training a 3-layer nonlinear classifier over the LMS features, fails to capture contextual information. Even though CPC can learn contextual features, it only captures information relevant for recognizing contexts that are $\tau$ steps apart; thus it may ignore signals that remain relatively stationary for the entire utterance [2]. On the other hand, APC directly predicts surface features $\tau$ steps ahead and thus can model sub-phonetic context useful in predicting the next phone. ANH with $\tau = 5$ has the lowest PER, since the addition of the NCE-HSIC objective enables the model to learn noise-free subspaces that can capture relevant factors like formant movements. Finally, adding layers to the RNN further improves the scores.
Speaker Verification
Results for SV are summarized in Table 2, which shows the lower Equal Error Rates (EER) achieved by ANH as compared to the baselines. It has been shown that in deep language models, lower layers model local syntax while the higher ones capture semantic content [2, 32]. We make similar observations, since the EER values increase (for all $\tau$) when the ANH model has more than 3 layers. Lowering $\tau$ reduced the EER in most cases and had minimal impact on the independence ($\hat{H}_{jk}$).
Ablations
The NCE-HSIC model, when trained without the $L_{apc}$ loss, rendered independent subspaces but performed poorly on PR, since there is no reason to believe such subspaces would retain phonetic information. Adding the APC objective aids the model (ANH) in learning acoustic features while disentangling the factors across subspaces (Table 1). Removing the HSIC criterion increased the PER, and the model training also took (×2) longer to converge. This reinforces our hypothesis that the HSIC criterion provides a good inductive bias for a more generalizable model.
Table 1: Performance comparison (based on PER) on the Phoneme Recognition task (WSJ corpus [31]).

    Method                       PER (τ=2)   PER (τ=5)   PER (τ=10)
    LMS + MLP (supervised)         42.5
    CPC [3]                        41.8        44.6        47.3
    APC (3-layer) [2]              36.6        35.7        35.5
    APC (4-layer) [2]              34.5        35.2
    ANH (3-layer) (Ours)           33.2
    ANH (4-layer) (Ours)
    Ablations:
    APC + NCE
    NCE-HSIC
    NCE
Table 2: Performance (based on EER) on the speaker verification task (TIMIT corpus). (* choosing different layers [2])

    Method                       EER (τ=2)   EER (τ=3)   EER (τ=5)   EER (τ=10)
    CPC features [3]               5.62        5.29        5.42        6.01
    APC (3-layer)-1* [2]           3.82        3.67        3.88        4.01
    APC (3-layer)-2* [2]
    ANH (2-layer) (Ours)           3.53        3.35        3.91        4.12
    ANH (3-layer) (Ours)           3.45
Independence
In order to measure the independence of the four 128-dimensional subspaces of the RNN states, the absolute values of Pearson's correlation were computed on the validation splits for PR and SV. When averaged over all possible pairs, these correlations were lowest when both the NCE and HSIC objectives were included in $L_{nh}$; with $\lambda = 0$ the values increased, but were still significantly lower than the average absolute correlation values obtained for APC features on PR and SV.
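The independence metric used here, the average absolute Pearson correlation between dimensions belonging to different subspaces, can be computed as in the short sketch below (a possible rendering of the described measurement):

```python
import numpy as np

def avg_abs_subspace_correlation(y, n, d):
    """Mean |Pearson correlation| over all dimension pairs lying in different subspaces.
    y: (num_frames, n * d) matrix of hidden states."""
    corr = np.abs(np.corrcoef(y, rowvar=False))       # (nd, nd) correlation matrix
    subspace = np.repeat(np.arange(n), d)              # subspace id of each dimension
    mask = subspace[:, None] != subspace[None, :]      # keep only cross-subspace pairs
    return corr[mask].mean()
```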
Time Segment Length (γ)
We show the impact of the time segment length $\gamma$ on the phoneme classification task in Table 3. As we increase $\gamma$, the total number of segments (and auxiliary variables) in an utterance reduces. Theoretically, $nd$ distinct auxiliary variables are needed to identify $n$ sources, each of which is $d$-dimensional (sec. 2), and setting $\gamma$ to larger values leads to higher PERs. Additionally, we observe that when the RNN is trained with higher values of $\tau$ for the APC objective, the PER drops when using wider segments. This may indicate that the distribution of the underlying factors remains stationary for longer periods at higher values of $\tau$.

Table 3: Comparing different values of γ for the ANH (3-layer) model on the phoneme classification task (PER for τ = 2, 5, 10 at γ = 10, 20, 30, 50).
5. Conclusion
We extend nonlinear ICA and show how the proposed algorithm to compute MI between the observed and auxiliary variables can provably identify independent subspaces under certain regularity conditions. We also use the algorithm to learn unsupervised speech representations with disentangled subspaces when integrated with existing approaches like APC. Future work may involve a closer analysis of the features in these subspaces, to understand which orthogonal components are represented by each and how they can prove useful for downstream tasks.
6. References

[1] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017, pp. 1878–1889.
[2] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," arXiv preprint arXiv:1904.03240, 2019.
[3] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[4] Y. Li and S. Mandt, "Disentangled sequential autoencoder," arXiv preprint arXiv:1803.02991, 2018.
[5] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using wavenet autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
[6] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, "Challenging common assumptions in the unsupervised learning of disentangled representations," arXiv preprint arXiv:1811.12359, 2018.
[7] A. Hyvärinen and P. Pajunen, "Nonlinear independent component analysis: Existence and uniqueness results," Neural Networks, vol. 12, no. 3, pp. 429–439, 1999.
[8] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[9] A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, no. 4, pp. 94–128, 1999.
[10] Y. Tan, J. Wang, and J. M. Zurada, "Nonlinear blind source separation using a radial basis function network," IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 124–134, 2001.
[11] L. B. Almeida, "MISEP – linear and nonlinear ICA based on mutual information," Journal of Machine Learning Research, vol. 4, no. Dec, pp. 1297–1318, 2003.
[12] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," arXiv preprint arXiv:1410.8516, 2014.
[13] P. Brakel and Y. Bengio, "Learning independent features with adversarial nets for non-linear ICA," arXiv preprint arXiv:1710.05050, 2017.
[14] J. A. Lee, C. Jutten, and M. Verleysen, "Non-linear ICA by using isometric dimensionality reduction," in International Conference on Independent Component Analysis and Signal Separation. Springer, 2004, pp. 710–717.
[15] I. Khemakhem, D. P. Kingma, and A. Hyvärinen, "Variational autoencoders and nonlinear ICA: A unifying framework," arXiv preprint arXiv:1907.04809, 2019.
[16] I. Khemakhem, R. P. Monti, D. P. Kingma, and A. Hyvärinen, "ICE-BeeM: Identifiable conditional energy-based deep models," arXiv preprint arXiv:2002.11537, 2020.
[17] A. Hyvärinen, H. Sasaki, and R. E. Turner, "Nonlinear ICA using auxiliary variables and generalized contrastive learning," arXiv preprint arXiv:1805.08651, 2018.
[18] A. Hyvärinen and H. Morioka, "Unsupervised feature extraction by time-contrastive learning and nonlinear ICA," in Advances in Neural Information Processing Systems, 2016, pp. 3765–3773.
[19] Y.-A. Chung and J. Glass, "Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech," arXiv preprint arXiv:1803.08976, 2018.
[20] B. Milde and C. Biemann, "Unspeech: Unsupervised speech context embeddings," arXiv preprint arXiv:1804.06775, 2018.
[21] W.-N. Hsu, Y. Zhang, and J. Glass, "Learning latent representations for speech generation and transformation," arXiv preprint arXiv:1704.04222, 2017.
[22] A. H. Liu, T. Tu, H.-y. Lee, and L.-s. Lee, "Towards unsupervised speech recognition and synthesis with quantized speech representation learning," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7259–7263.
[23] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
[24] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[25] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[26] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, "A kernel statistical test of independence," in Advances in Neural Information Processing Systems, 2008, pp. 585–592.
[27] T. R. Read and N. A. Cressie, Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer Science & Business Media, 2012.
[28] A. Kankainen, Consistent Testing of Total Independence Based on the Empirical Characteristic Function. University of Jyväskylä, 1995, vol. 29.
[29] A. Feuerverger, "A consistent test for bivariate dependence," International Statistical Review / Revue Internationale de Statistique, pp. 419–433, 1993.
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[31] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.
[32] M. E. Peters, M. Neumann, L. Zettlemoyer, and W.-t. Yih, "Dissecting contextual word embeddings: Architecture and representation," arXiv preprint arXiv:1808.08949, 2018.