Nonlinear ISA with Auxiliary Variables for Learning Speech Representations
Amrith Setlur†, Barnabas Poczos†, Alan W Black†
†Carnegie Mellon University
[email protected], [email protected], [email protected]
Abstract
This paper extends recent work on nonlinear Independent Component Analysis (ICA) by introducing a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables. Observed high-dimensional acoustic features like log Mel spectrograms can be considered surface-level manifestations of nonlinear transformations over individual multivariate sources of information such as speaker characteristics, phonological content, etc. Under the assumptions of energy-based models, we use the theory of nonlinear ISA to propose an algorithm that learns unsupervised speech representations whose subspaces are independent and potentially highly correlated with the original non-stationary multivariate sources. We show how nonlinear ICA with auxiliary variables can be extended to a generic identifiable model for subspaces as well, while also providing sufficient conditions for the identifiability of these high-dimensional subspaces. Our proposed methodology is generic and can be integrated with standard unsupervised approaches to learn speech representations with subspaces that can theoretically capture independent higher-order speech signals. We evaluate the gains of our algorithm when integrated with the Autoregressive Predictive Coding (APC) model by showing empirical results on the speaker verification and phoneme recognition tasks.
Index Terms: ISA, speech representation learning, unsupervised learning
1. Introduction
The speech signals that we observe can be viewed as high-dimensional surface-level manifestations of samples from independent non-stationary sources that are entangled via a nonlinear mixing mechanism. These sources can be entangled at session, utterance or segment levels [1]. Speech representations learnt by training deep recurrent models [2, 3] over these surface-level features fail to capture the original signals in their purest disentangled form. Unsupervised disentanglement of speech representations has been an active area of research [4, 5], since it has been shown that recovering independent factors of variation can improve the performance of downstream tasks like Automatic Speech Recognition (ASR), especially under low resource constraints and domain mismatch [1]. Inspired by this, we propose an algorithm to learn unsupervised speech representations with independent subspaces, each of which can capture distinct disentangled source signals. These distinct subspaces can potentially be informative of patterns based on speaker characteristics or subphonetic events, which can be useful for learning a variety of acoustic models given very few labeled samples for each. Recently [6] it has been shown that learning disentangled representations is impossible without explicit bias on the algorithm and the data. Hence, we leverage a more principled approach to capturing the independent sources through the lens of nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.

Nonlinear Independent Component Analysis (ICA) is a provably unidentifiable problem [7], as opposed to linear ICA [8], which is identifiable given non-Gaussian sources and other fundamental restrictions on the mixing matrix [9]. Attempts [10, 11, 12] have been made to solve nonlinear ICA for i.i.d. distributions under slightly stronger assumptions on the generative process [13, 14]. Recent progress in the field [15, 16] has revolved around a generic identifiable model that renders the latent sources conditionally independent in the presence of auxiliary variables. However, most of this work [17, 18] has focused on univariate sources, which means that these models cannot be directly applied to speech, where the source signals are very high-dimensional. Hence, we extend the auxiliary-variables model proposed by [17] to multivariate sources by first stating sufficient conditions for the separability of the sources and then providing training objectives suitable for learning speech representations with finite audio samples. Nonlinear ISA is leveraged to learn unsupervised features on large unlabeled speech datasets. Using these features, simpler (linear) models are learnt on small labeled datasets.

Numerous approaches [5, 19, 20, 21] have been proposed for learning unsupervised speech representations. Recent ones [2, 5] have been based on predictive coding schemes that use language-model-like objectives. In parallel, there have been efforts to learn quantized representations via temporal segmentation and phonetic clustering [22] so as to map frame representations to linguistic units, but such models are fairly complicated and tricky to train. Also, most of these methods learn highly entangled representations that suffer from spurious correlations in the underlying data and thus fail to generalize. Our proposed algorithm improves upon these approaches by advocating for independent subspaces attained via additional constraints in the original optimization objectives. We begin by providing a theoretically identifiable model for nonlinear ISA and then discuss how the model can be incorporated into existing methods for learning unsupervised speech representations.
2. Theory
We introduce a generative model of the observed data that we assume henceforth, and present conditions under which the original multi-dimensional sources are identifiable. We assume that the observed data $x \in \mathcal{X} \subset \mathbb{R}^{nd}$ is generated by applying a nonlinear invertible transform $f$ to $n$ source signals $s_1 \ldots s_n \in \mathcal{S} \subset \mathbb{R}^d$. We are given a dataset $\mathcal{D} = \{(x^{(i)}, u^{(i)})\}_{i=1}^{N}$ with $N$ samples, where each $x^{(i)} = f(s^{(i)})$ and $s^{(i)} = \bigoplus_{j=1}^{n} s^{(i)}_j = [s^{(i)}_1 \ldots s^{(i)}_n]$. Here, $u^{(i)} \in \mathcal{U} \subset \mathbb{R}^p$ denotes the corresponding auxiliary variable for $x^{(i)}$, $\bigoplus$ denotes the concatenation operation, and $f: \mathcal{S}^n \to \mathcal{X}$ is a nonlinear mixing function (eqn. 1) that is differentiable almost everywhere (a.e.). The objective is to learn representations that can recover the source signals $\{s_i\}_{i=1}^{n}$ up to an identifiability factor that we shall define shortly. For notational convenience, we denote the $j$th scalar element in a vector $z$ as $z_j$, and the $i$th consecutive $d$-dimensional vector ($i$th subspace) in $z$ as $z_i$ or as $z_{i:} = [z_{(i-1)d+1} \ldots z_{id}]$.

Model
The source distributions $\{p_i(s_i)\}_{i=1}^{n}$ are assumed to be independent given the auxiliary variable $u$ (eqn. 1), and their densities are given by conditional energy-based models (eqn. 2):

$$x = f(s), \qquad \log p(s \mid u) = \sum_{i=1}^{n} \log p_i(s_i \mid u) \qquad (1)$$

$$p_i(s_i \mid u) = \frac{\exp\left(\phi_i(s_i)^T \eta_i(u)\right)}{Z_i(u)}, \qquad \phi_i: \mathcal{S} \to \mathbb{R}^m, \quad \eta_i: \mathcal{U} \to \mathbb{R}^m \qquad (2)$$
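As a concrete illustration of this generative model, the toy sketch below samples conditionally Gaussian sources, a special case of the exponential-family densities in eqn. (2), and mixes them with a random invertible leaky-ReLU network playing the role of $f$. All sizes and names are illustrative and are not part of our experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_aux = 3, 2, 8          # n sources, each d-dimensional; n_aux auxiliary values

# Conditional source parameters: a special case of eqn. (2),
# p_i(s_i | u) = N(mu_i[u], I), i.e. phi_i(s_i) = s_i and eta_i(u) = mu_i[u].
mu = rng.normal(size=(n, n_aux, d))

# Random mixing f: alternating linear maps (invertible almost surely) and leaky-ReLU.
weights = [rng.normal(size=(n * d, n * d)) for _ in range(3)]

def f(s):
    x = s
    for W in weights:
        x = x @ W.T
        x = np.where(x > 0, x, 0.2 * x)   # leaky-ReLU, invertible elementwise
    return x

def sample(num):
    u = rng.integers(0, n_aux, size=num)                                   # auxiliary variable
    s = np.stack([rng.normal(mu[i, u], 1.0) for i in range(n)], axis=1)    # (num, n, d)
    x = f(s.reshape(num, n * d))                                           # observed mixtures
    return x, u, s

x, u, s = sample(1024)
print(x.shape, u.shape, s.shape)
```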
Definition of Identifiability
We shall define the original sources $\{s_i\}_{i=1}^{n}$ to be identifiable if there exists an algorithm that takes as input a pair comprising the observed sample and the corresponding auxiliary variable, $(x = f(s), u)$, and outputs $[g_1(s_{\pi_1}), \ldots, g_n(s_{\pi_n})]$ for some permutation $\pi: \mathbb{N}_n \to \mathbb{N}_n$ over $\{1 \ldots n\}$. Each $g_i: \mathcal{S} \to \mathcal{S}$ is an invertible (a.e.) function and is defined as a function of a single distinct source $s_{\pi_i}$.

Popular algorithms [8] in linear ICA [9] rely on estimators of Mutual Information (MI) to separate the observed mixed samples into samples from the original source signals. Similarly, for nonlinear ICA we compute the MI between the observed and auxiliary variables ($I(x, u)$ in eqn. 3) using Noise Contrastive Estimation (NCE) [23]. A nonlinear logistic classifier is used to distinguish between correct (observed) pairs $(x^{(i)}, u^{(i)})$ and randomly generated incorrect pairs $(x^{(i)}, \tilde{u}^{(i)})$, where $\tilde{u}^{(i)}$ is drawn from the marginal distribution over $u$. The regression function for this logistic classifier is given by $r(x, u)$, where $h_i: \mathcal{X} \to \mathbb{R}^d$ and $\psi_i: \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R} \in L^2$ are sufficiently smooth universal function approximators (neural networks) and $\forall i$, $h_i$ is invertible a.e.

$$I(x, u) = \int_{x, u} \log \frac{p(x, u)}{p(x)\,p(u)} \, dP(x, u) = \int r(x, u) \, dP(x, u), \quad \text{where } r(x, u) = \sum_{i=1}^{n} \psi_i(h_i(x), u) \qquad (3)$$
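A minimal sketch of how the regression function in eqn. (3) could be parameterized is given below; the layer sizes and module names are illustrative choices rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class NCERegression(nn.Module):
    """r(x, u) = sum_i psi_i(h_i(x), u), as in eqn. (3)."""
    def __init__(self, x_dim, u_dim, n_subspaces, d):
        super().__init__()
        self.n, self.d = n_subspaces, d
        # h maps the observation x to n subspaces of dimension d each.
        self.h = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                               nn.Linear(256, n_subspaces * d))
        # One psi_i per subspace, taking (h_i(x), u) and returning a scalar.
        self.psi = nn.ModuleList([
            nn.Sequential(nn.Linear(d + u_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            for _ in range(n_subspaces)])

    def forward(self, x, u):
        y = self.h(x).view(x.shape[0], self.n, self.d)          # subspaces h_i(x)
        scores = [psi_i(torch.cat([y[:, i], u], dim=-1))        # psi_i(h_i(x), u)
                  for i, psi_i in enumerate(self.psi)]
        return torch.cat(scores, dim=-1).sum(dim=-1), y          # r(x, u) and h(x)
```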
The following main ISA separation theorem states that the vector $h(x) = \bigoplus_{i=1}^{n} h_i(x) \in \mathbb{R}^{nd}$, with subspaces $h_i(x) \in \mathbb{R}^d$, can recover $s_i$, since $\exists\, \pi, \{g_i\}_{i=1}^{n}$ such that $h_i(x) = g_i(s_{\pi_i})$.

Theorem 1. Given that we observe the dataset $\mathcal{D}$ with $N$ samples $\{x^{(i)} = f(s^{(i)}), u^{(i)}\}_{i=1}^{N}$ generated by a model based on eqns. (1, 2), then under the following assumptions:

1. Realizability Assumption: Given infinite ($N \to \infty$) samples, one can efficiently learn $\psi_i^*, h_i^*$ such that the NCE algorithm can estimate the mutual information $I(x, u)$ with an arbitrarily small error, using a regression function $r(x, u)$ that follows the form in eqn. 3.

2. Separability Assumption: $\forall\, s \in \mathcal{S}^n$ and $z \neq 0 \in \mathbb{R}^d$, with first- and second-order derivatives given by the tensors $\nabla \phi_i(s_i) \in \mathbb{R}^{m \times d}$ and $\nabla^2 \phi_i(s_i) \in \mathbb{R}^{m \times d \times d}$ respectively, $\exists\, \{u_l\}_{l=0}^{nd} \in \mathcal{U}^{nd+1}$ such that the set
$$\left\{ \bigoplus_{i=1}^{n} \begin{bmatrix} \nabla \phi_i(s_i)^T \\ (\nabla^2 \phi_i(s_i) \,\bar{\times}\, z)^T \end{bmatrix} \zeta_i(u_l, u_0) \right\}_{l=1}^{nd}$$
spans $\mathbb{R}^{nd}$, where $\zeta_i(u_l, u_0) = \eta_i(u_l) - \eta_i(u_0)$ (the separability assumption requires the auxiliary variables $u$ to have a sufficiently strong and diverse effect on the source distributions [17]),

the subspaces $\{h_i(x)\}_{i=1}^{n}$ can identify the conditionally independent sources $\{s_i\}_{i=1}^{n}$ up to the definition of identifiability.

Proof Sketch: For an observed sample $x \in \mathcal{X}$, let $y = h^*(x)$ be given by the optimal functions $\{h_i^*\}_{i=1}^{n}$. The functions $\{\psi_i^*, h_i^*\}_{i=1}^{n}$ are learnt using the NCE objective whose regression function is given by $r(x, u)$. Since $s = f^{-1}(h^{-1}(y))$ is a composition of two invertible transforms, we introduce $v: \mathbb{R}^{nd} \to \mathcal{S}^n$ with $s = v(y)$. Also, let $f^{-1}$ be denoted by $g$. From eqn. 3, at the optimum $r(x, u) = \log p(x \mid u) - \log p(x)$. Using the density transformation rules [25] for invertible functions, we can show that $\log p(x \mid u) = \log p(s \mid u) + \log|\det J_g(x)|$ and $\log p(x) = \log p(s) + \log|\det J_g(x)|$. Thus, $r(x, u) = \log p(s \mid u) - \log p(s)$. Using eqn. 3,
$$\sum_{i=1}^{n} \psi_i^*(y_i, u) = \log p(v(y) \mid u) - \log p(v(y)) \qquad (4)$$
We begin by substituting eqns. 1, 2 into the above result. Also, since eqn. 4 holds for all $\{u_l\}_{l=0}^{nd}$, we can get $nd + 1$ such equations, and from each we can subtract the equation given by $u_0$, which leaves us with $nd$ equations of the form $\sum_{i=1}^{n} \left[ \phi_i(v(y)_{i:})^T \zeta_i(u_l, u_0) - (\log Z_i(u_l) - \log Z_i(u_0)) \right] = \sum_{i=1}^{n} \psi_i^*(y_i, u_l)$. Taking the derivative of both sides of this equation w.r.t. $y_j$, and subsequently w.r.t. $y_k$ s.t. $\lceil j/d \rceil \neq \lceil k/d \rceil$, we get
$$\sum_{i=1}^{n} \Big[ \underbrace{\nabla \phi_i(v(y)_{i:})}_{(a)} \frac{\partial^2 v(y)_{i:}}{\partial y_j \partial y_k} \Big]^T \zeta_i(u_l, u_0) + \Big[ \underbrace{\Big( \nabla^2 \phi_i(v(y)_{i:}) \,\bar{\times}\, \frac{\partial v(y)_{i:}}{\partial y_j} \Big)}_{(b)} \frac{\partial v(y)_{i:}}{\partial y_k} \Big]^T \zeta_i(u_l, u_0) = 0$$
Concatenating (a) and (b) into a single matrix, the above can be written as a single Euclidean inner product:
$$\left( \bigoplus_{i=1}^{n} \begin{bmatrix} \nabla \phi_i(s_i)^T \\ \big( \nabla^2 \phi_i(s_i) \,\bar{\times}\, \frac{\partial v(y)_{i:}}{\partial y_j} \big)^T \end{bmatrix} \zeta_i(u_l, u_0) \right) \cdot \Gamma(y) = 0, \qquad \Gamma(y) = \bigoplus_{i=1}^{n} \begin{bmatrix} \frac{\partial^2 v(y)_{i:}}{\partial y_j \partial y_k} \\[2pt] \frac{\partial v(y)_{i:}}{\partial y_k} \end{bmatrix}$$
The above equation holds true for the $nd$ distinct values of the auxiliary variable $u_l$. For invertible $v$, if we assume that $\frac{\partial v(y)_{i:}}{\partial y_j} \neq 0$, then we can apply the separability assumption, which implies $\Gamma(y) = 0$; this in turn implies that $\frac{\partial v(y)_{i:}}{\partial y_k} = 0$. Thus $\forall i$, $\frac{\partial v(y)_{i:}}{\partial y_j} = 0 \,\vee\, \frac{\partial v(y)_{i:}}{\partial y_k} = 0$. Since $\lceil j/d \rceil \neq \lceil k/d \rceil$, $y_j$ and $y_k$ belong to distinct subspaces of $y = h(x)$, so the $i$th source given by $v(y)_{i:}$ cannot simultaneously be a function of two distinct subspaces of $h(x)$. Given the invertible function $f(h(\cdot))$ with its full-rank Jacobian, we can recover the sources $\{s_i\}_{i=1}^{n}$ via the subspaces of $h(x)$: $h_i(x) = g_i(s_{\pi_i})$ for an invertible function $g_i$ and permutation $\pi$. (Here $\bar{\times}$ denotes the 3rd mode product [24], and the subscript $i$ is dropped wherever it can be understood from context. For more details on the validity and necessity of similar results for independent components, as opposed to subspaces, we refer the reader to [17]; the proof sketch above, shown for completeness, extends the proof for the univariate case [17, 18].)

Hilbert-Schmidt Independence Criterion (HSIC) [26]
The above theorem proves the existence of functions $\psi^*, h^*$ that can not only compute $I(x, u)$ with arbitrary precision but can also recover the original multi-dimensional sources. However, the NCE algorithm relies on the assumption of infinite samples of positive $(x, u)$ and negative $(x, \tilde{u})$ pairs, which is rarely true in practice. Hence, along with the NCE objective which learns $r(x, u)$ to distinguish between those pairs, we introduce constraints imposed via the HSIC estimator that specifically account for independence amongst the subspaces of $h(x)$. This acts as a strong inductive bias for learning $\psi^*, h^*$ with finite observed samples of $(x, u)$. HSIC is a kernel-based statistical test of independence for two multivariate random variables and is well suited for high-dimensional data, as opposed to tests [27, 28, 29] based on the power divergence family and characteristic functions, which are mainly meant for low-dimensional random variables [26]. Given $\mathcal{D} = \{x^{(i)}, u^{(i)}\}_{i=1}^{N}$ with $N$ samples, let the set of features $h(x^{(i)})$ be denoted by $\{y^{(i)} = h(x^{(i)})\}_{i=1}^{N}$. For $d$-dimensional subspaces $j, k$, let $y_j \in \mathcal{Y}_j \subseteq \mathbb{R}^d$, $y_k \in \mathcal{Y}_k \subseteq \mathbb{R}^d$, and let $P_{jk}$ denote a Borel probability measure over $\mathcal{Y}_j \times \mathcal{Y}_k$ with $N$ i.i.d. samples $Z_{jk} := \{(y_j^{(i)}, y_k^{(i)})\}_{i=1}^{N}$ drawn from it. If $\mathcal{F}, \mathcal{G}$ are two Reproducing Kernel Hilbert Spaces (RKHS) equipped with kernels $k_f, k_g$ (with $k_f: \mathcal{Y}_j \times \mathcal{Y}_j \to \mathbb{R}$, $k_g: \mathcal{Y}_k \times \mathcal{Y}_k \to \mathbb{R}$; for $z, z' \in \mathcal{Y}_j$, $k_f(z, z') = \langle k_f(z, \cdot), k_f(z', \cdot) \rangle_{\mathcal{F}}$, and for $z, z' \in \mathcal{Y}_k$, $k_g(z, z') = \langle k_g(z, \cdot), k_g(z', \cdot) \rangle_{\mathcal{G}}$), then the biased empirical HSIC criterion is $\hat{H}_{jk} = \frac{1}{N^2}\,\mathrm{tr}(K_f^{(j)} H K_g^{(k)} H)$, where $K_f^{(j)}[p, q] = k_f(y_j^{(p)}, y_j^{(q)})$, $K_g^{(k)}[p, q] = k_g(y_k^{(p)}, y_k^{(q)})$, and $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T \in \mathbb{R}^{N \times N}$.
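The biased empirical HSIC estimator above can be computed directly from the Gram matrices of the two subspaces. The sketch below assumes RBF kernels with a fixed bandwidth, which is an illustrative choice and not a detail prescribed by the method.

```python
import torch

def rbf_gram(z, sigma=1.0):
    """Gram matrix K[p, q] = exp(-||z_p - z_q||^2 / (2 sigma^2))."""
    sq = torch.cdist(z, z) ** 2
    return torch.exp(-sq / (2 * sigma ** 2))

def hsic_biased(y_j, y_k, sigma=1.0):
    """Biased empirical HSIC between two subspaces y_j, y_k of shape (N, d)."""
    N = y_j.shape[0]
    K = rbf_gram(y_j, sigma)
    L = rbf_gram(y_k, sigma)
    # Centering matrix H = I - (1/N) 11^T.
    H = torch.eye(N, device=y_j.device) - torch.ones(N, N, device=y_j.device) / N
    return torch.trace(K @ H @ L @ H) / (N ** 2)
```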
Algorithm (NCE-HSIC)
We have shown that the NCE algorithm can learn a regression function of the form $r(x, u)$ (eqn. 3) with optimal predictors $\psi^*, h^*$ such that the subspaces of $h^*(x)$ can recover the original sources $s_i$. Constrained by a finite dataset, we use the biased empirical HSIC estimator $\hat{H}_{jk}$ (lower values imply more independence) as an additional objective while optimizing for $\psi^*, h^*$. If the true and noisy samples for the NCE algorithm are given by $(x^{(l)}, u^{(l)})$ and $(x^{(l)}, u^{(l' \neq l)})$ respectively, then the final loss objective $L_{nh}$ for NCE-HSIC is:

$$L_{nh} = \frac{1}{N} \sum_{l \in [N]} \left[ r(x^{(l)}, u^{(l' \neq l)}) - r(x^{(l)}, u^{(l)}) \right] + \lambda \sum_{j, k} \hat{H}_{jk}$$
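Assembled from the two previous sketches, a rough implementation of $L_{nh}$ could look as follows, assuming that negative pairs are formed by shuffling the auxiliary variables within a batch:

```python
import torch

def nce_hsic_loss(model, x, u, lam=1.0):
    """L_nh: contrastive NCE term plus pairwise HSIC penalties over the subspaces."""
    u_neg = u[torch.randperm(u.shape[0])]          # noisy pairs (x, u~), u~ from the marginal of u
    r_pos, y = model(x, u)                          # r(x, u) and subspaces h_i(x): (N, n, d)
    r_neg, _ = model(x, u_neg)
    nce = (r_neg - r_pos).mean()
    hsic = sum(hsic_biased(y[:, j], y[:, k])        # hsic_biased from the previous sketch
               for j in range(model.n) for k in range(j + 1, model.n))
    return nce + lam * hsic
```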
3. Proposed Methodology
Speech representations that can explicitly capture factors of variation like phoneme identities or speaker traits, while being invariant to other factors like the underlying pitch contour or background noise [4, 5], have proven to be beneficial since they are less prone to overfitting on spurious correlations in the data. Nevertheless, disentanglement is hard to achieve in general due to the presence of confounding variables [6]. In this section, we introduce our approach, APC-NCE-HSIC or ANH, to learn representations with independent subspaces that can theoretically capture distinct acoustic/linguistic units relevant for downstream tasks like ASR.

Nonlinear ISA provides us with a simple yet principled framework for learning speech representations in the presence of auxiliary variables, which in the case of sequential data like speech can be "time". Learning unsupervised representations can be posed as the problem of recovering, from entangled samples, the non-stationary sources that are independent given the auxiliary variable (the time frame sequence). (Auxiliary variables could potentially be given by other domains like the frequency spectrum, but in this work we focus only on time.) The NCE-HSIC algorithm can be used to identify the original factors of variation via distinct independent subspaces. In order to ensure that the independent subspaces are not only mutually exclusive but also have high MI with surface features like Mel-frequency cepstral coefficients (MFCC) or log Mel spectrograms (LMS), we build on existing approaches based on predictive coding strategies [19, 3]. Although our algorithm can be seamlessly integrated into any of these methods, in this work we show empirical results that highlight the performance improvements gained by incorporating the NCE-HSIC criterion into the APC model.
APC [2] is a language-model-based method to learn unsupervised speech representations. It uses a Recurrent Neural Network (RNN) to model temporal information within an acoustic sequence comprising LMS features $\{x_i\}_{i=0}^{T}$. Given these features up to a fixed time step $t$, the APC model predicts the surface feature $\tau$ time steps ahead, i.e. $x_{t+\tau}$. If $\{\hat{p}_i\}_{i=0}^{T-\tau}$ represents the sequence predicted by the RNN, then the $\ell_1$ loss used to train the model is given by:

$$L_{apc}(x) = \sum_{i=0}^{T-\tau} -\log p(x_{i+\tau} \mid x_0 \ldots x_i) = \sum_{i=0}^{T-\tau} |\hat{p}_i - x_{i+\tau}|$$
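A compact sketch of the APC objective, an RNN predicting the surface feature $\tau$ steps ahead under an $\ell_1$ loss, is shown below; the network sizes are placeholders and do not reproduce the exact architecture of [2].

```python
import torch
import torch.nn as nn

class APC(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)    # predicts the frame tau steps ahead

    def forward(self, x):                              # x: (batch, T, feat_dim)
        hidden, _ = self.rnn(x)
        return self.proj(hidden), hidden

def apc_loss(pred, x, tau):
    """L_apc: l1 distance between the prediction at frame i and the frame at i + tau."""
    return (pred[:, :-tau] - x[:, tau:]).abs().mean()
```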
APC-NCE-HSIC, or ANH, is our proposed model, where features with independent subspaces are learnt through the NCE-HSIC criterion applied to the hidden states of the RNN module trained with the APC objective above. Specifically, the function $h(x)$ is modeled using the RNN. The NCE-HSIC criterion increases the correlation of the original sources with the subspaces of $h(x)$, or in this case the subspaces of the hidden states of the RNN. If the RNN is parameterized by $\theta \in \Theta$, then the hidden state can be represented as the function $h(\theta, x)$. With $r(x, u) = \sum_{i=1}^{n} \psi_i(h_i(\theta, x), u)$, the final objective is:

$$\underset{\{\psi_i\}_{i=1}^{n},\, \theta}{\operatorname{argmin}} \; L_{anh} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} L_{apc}(x) + \beta L_{nh} \qquad (5)$$
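For illustration, one optimization step of the combined objective in eqn. (5) could be organized as in the sketch below, which reuses the APC and NCE-HSIC sketches given earlier; here psi_model is assumed to expose the same interface as the NCE regression module but to read its subspaces directly from the RNN hidden state.

```python
import torch

def anh_step(apc_model, psi_model, optimizer, x, u, tau=5, beta=1.0, lam=1.0):
    """One optimization step of L_anh = L_apc + beta * L_nh (eqn. 5); all values illustrative."""
    pred, hidden = apc_model(x)                        # hidden: (B, T, n*d) states h(theta, x)
    loss_apc = apc_loss(pred, x, tau)                  # APC l1 prediction loss (earlier sketch)
    # Every frame contributes a (h(theta, x_t), u_t) pair to the NCE-HSIC term; in practice
    # the HSIC Gram matrices would be computed on a subsample of frames to stay tractable.
    h_flat = hidden.reshape(-1, hidden.shape[-1])      # (B*T, n*d)
    u_flat = u.reshape(-1, u.shape[-1])                # (B*T, u_dim) auxiliary variables
    loss_nh = nce_hsic_loss(psi_model, h_flat, u_flat, lam=lam)
    loss = loss_apc + beta * loss_nh
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```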
Auxiliary Variables
The original LMS sequence of length $T$ is fragmented into time segments $\{s_j\}_{j=1}^{\lceil T/\gamma \rceil}$ of length $\gamma$, and each element $x_{j,t}$ in a given segment $s_j$ has its auxiliary variable $u_{j,t}$ set to the value $j$, which is simply the corresponding segment's position in the input sequence. The hidden states of the RNN, along with the generated auxiliary variables, are passed to the NCE module, which first generates positive $(x_t, u_t)$ and negative $(x_t, \tilde{u}_t)$ pairs and then learns $\psi^*, \theta^*$ to distinguish between them optimally. Upon completion of the unsupervised learning phase, the hidden state for the $t$th frame with surface features $x_t$ comprises $n$ subspaces $\{h_i(\theta^*, x_t)\}_{i=1}^{n}$ that capture different factors of variation, independent for the same value of the auxiliary variable $u_t$. Thus the hidden states can efficiently decouple factors that vary independently locally.
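Generating these time-based auxiliary variables requires only a few lines of indexing; in the sketch below the segment index is one-hot encoded, which is an illustrative choice rather than a detail prescribed by the method.

```python
import torch

def make_aux_variables(T, gamma, batch_size):
    """u_t = index of the length-gamma segment that frame t falls into."""
    seg_idx = torch.arange(T) // gamma                              # (T,) segment id per frame
    n_segments = int(seg_idx.max().item()) + 1
    u = torch.nn.functional.one_hot(seg_idx, n_segments).float()    # (T, n_segments)
    return u.unsqueeze(0).expand(batch_size, -1, -1)                # (B, T, n_segments)
```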
NCE is a powerful tool to estimate MI and has been used in recent works like CPC [3], which relies on the NCE objective to distinguish pairs of context vectors from the same or different time segments. This approach is similar to Time Contrastive Learning (TCL) [18], an algorithm for nonlinear ICA. Although TCL has only been shown to work for univariate cases and CPC does not model independent subspaces explicitly, they serve as a strong motivation for our approach, which addresses both concerns.

4. Experiments and Results
In this section, we empirically evaluate the performance of the proposed ANH algorithm against two baseline models, APC and CPC, on two downstream tasks: (1) phoneme recognition (PR) and (2) speaker verification (SV).

Datasets and Implementation
The LibriSpeech corpus [30] was used for unsupervised training of the ANH model and the other baselines. The datasets for PR and SV were picked from the WSJ [31] and TIMIT corpora respectively. (For brevity we skip the details of the datasets and refer the reader to [2], from which we borrowed the dataset splits and input LMS features.) For APC we use a multi-layer unidirectional LSTM with residual connections exactly as detailed in [2], with the exception of using 4 layers in the LSTM (wherever mentioned explicitly), and for CPC the modifications suggested in [2] are made for a fair comparison. In the unsupervised phase we train the RNN using the $L_{anh}$ objective; the weights $\beta$ in $L_{anh}$ and $\lambda$ in $L_{nh}$ were treated as hyperparameters and tuned empirically, and unless specified otherwise all ANH models are trained with $\gamma = 30$. The RNN hidden states are assumed to be a collection of $n = 4$ contiguous subspaces, each of which has $d = 128$ dimensions. These subspaces of the RNN, parameterized by $\theta$, represent the output $\{h_i(\theta, x)\}_{i=1}^{n}$, where $x$ is the LMS feature and $h_i(\theta, x)$ is the $i$th subspace. The NCE module also needs $\psi_i(\cdot, \cdot)$, which is implemented using 4-layer MLPs with ReLU activations, dropout and batch normalization. For $L_{nh}$, five negative pairs are drawn for every positive pair. In the supervised phase, once the ISA features $h(\theta^*, x)$ given by the hidden states (final layer) of the trained RNN ($\theta^*$) are extracted, a supervised linear classifier is trained over features from each frame for PR, whereas an LDA model is trained over features averaged over the entire sequence for SV.
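The supervised phase can be sketched with off-the-shelf classifiers as below; the arrays stand in for the extracted ISA features and labels, and the specific scikit-learn models are merely illustrative of the linear classifier and LDA backend described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
nd = 4 * 128   # n = 4 subspaces of d = 128 dimensions each

# Placeholder stand-ins; in practice these come from the trained RNN's hidden states.
feats_frame = rng.normal(size=(1000, nd)); phone_labels = rng.integers(0, 40, size=1000)
feats_utt = rng.normal(size=(200, nd));    spk_labels = rng.integers(0, 20, size=200)

# Phoneme recognition: linear classifier over per-frame ISA features.
pr_clf = LogisticRegression(max_iter=1000).fit(feats_frame, phone_labels)

# Speaker verification: LDA model over features averaged across each utterance.
sv_model = LinearDiscriminantAnalysis().fit(feats_utt, spk_labels)
```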
Phoneme Recognition
Table 1 highlights the performance (Phone Error Rates, PER) of our approach (ANH) against the best variants of the CPC and APC models. The supervised baseline (LMS + MLP), which involves training a 3-layer nonlinear classifier over the LMS features, fails to capture contextual information. Even though CPC can learn contextual features, it only captures information relevant for recognizing contexts that are $\tau$ steps apart; thus it may ignore signals that remain relatively stationary for the entire utterance [2]. On the other hand, APC directly predicts surface features $\tau$ steps ahead and thus can model sub-phonetic context useful in predicting the next phone. ANH with $\tau = 5$ has the lowest PER, since the addition of the NCE-HSIC objective enables the model to learn noise-free subspaces that can capture relevant factors like formant movements. Finally, adding layers to the RNN further improves the scores.
Speaker Verification
Results for SV are summarized in Table 2, which shows the lower Equal Error Rates (EER) achieved by ANH as compared to the baselines. It has been shown that in deep language models, lower layers model local syntax while the higher ones capture semantic content [2, 32]. We make similar observations, since the EER values increase (for all $\tau$) when the ANH model has more than 3 layers. Lowering $\tau$ reduced the EER in most cases and had minimal impact on the independence ($\hat{H}_{jk}$).
Ablations
The NCE-HSIC model, when trained without the $L_{apc}$ loss, rendered independent subspaces but performed poorly on PR, since there is no reason to believe such subspaces would retain phonetic information. Adding the APC objective aids the model (ANH) in learning acoustic features while disentangling the factors across subspaces (Table 1). Removing the HSIC criterion increased the PER, and the model training also took (×2) longer to converge. This reinforces our hypothesis that the HSIC criterion provides a good inductive bias for a more generalizable model.
Table 1: Performance comparison (based on PER) on the Phoneme Recognition task (WSJ corpus [31]).

    Method                       PER (τ=2)   PER (τ=5)   PER (τ=10)
    LMS + MLP (supervised)         42.5
    CPC [3]                        41.8        44.6        47.3
    APC (3-layer) [2]              36.6        35.7        35.5
    APC (4-layer) [2]              34.5        35.2
    ANH (3-layer) (Ours)           33.2
    ANH (4-layer) (Ours)
    Ablations:
    APC + NCE
    NCE-HSIC
    NCE
Table 2: Performance (based on EER) on the speaker verification task (TIMIT corpus). (* choosing different layers [2])

    Method                       EER (τ=2)   EER (τ=3)   EER (τ=5)   EER (τ=10)
    CPC features [3]               5.62        5.29        5.42        6.01
    APC (3-layer)-1* [2]           3.82        3.67        3.88        4.01
    APC (3-layer)-2* [2]
    ANH (2-layer) (Ours)           3.53        3.35        3.91        4.12
    ANH (3-layer) (Ours)           3.45
Independence
In order to measure the independence of the four 128-dimensional subspaces of the RNN states, the absolute values of Pearson's correlation were computed on the validation splits for PR and SV. When averaged over all possible pairs, these correlations were lowest when both the NCE and HSIC objectives were included in $L_{nh}$; with $\lambda = 0$ the values increased, but were still significantly lower than the average absolute correlation values obtained for APC features on PR and SV.
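The independence metric used here, the average absolute Pearson correlation between dimensions belonging to different subspaces, can be computed as in the short sketch below (a possible rendering of the described measurement):

```python
import numpy as np

def avg_abs_subspace_correlation(y, n, d):
    """Mean |Pearson correlation| over all dimension pairs lying in different subspaces.
    y: (num_frames, n * d) matrix of hidden states."""
    corr = np.abs(np.corrcoef(y, rowvar=False))       # (nd, nd) correlation matrix
    subspace = np.repeat(np.arange(n), d)              # subspace id of each dimension
    mask = subspace[:, None] != subspace[None, :]      # keep only cross-subspace pairs
    return corr[mask].mean()
```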
Time Segment Length (γ)
We show the impact of the time segment length $\gamma$ on the phoneme classification task in Table 3. As we increase $\gamma$, the total number of segments (and auxiliary variables) in an utterance reduces. Theoretically, $nd$ distinct auxiliary variables are needed to identify $n$ sources, each of which is $d$-dimensional (sec. 2), and setting $\gamma$ to larger values leads to higher PERs. Additionally, we observe that when the RNN is trained with higher values of $\tau$ for the APC objective, the PER drops when using wider segments. This may indicate that the distribution of the underlying factors remains stationary for longer periods at higher values of $\tau$.

Table 3: Comparing different values of γ for the ANH (3-layer) model on the phoneme classification task (PER for τ = 2, 5, 10 at γ = 10, 20, 30, 50).
5. Conclusion
We extend nonlinear ICA and show how the proposed algorithm to compute MI between the observed and auxiliary variables can provably identify independent subspaces under certain regularity conditions. We also use the algorithm to learn unsupervised speech representations with disentangled subspaces when integrated with existing approaches like APC. Future work may involve a closer analysis of the features in these subspaces, to understand which orthogonal components are represented by each and how they can prove useful for downstream tasks.
6. References

[1] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017, pp. 1878–1889.
[2] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," arXiv preprint arXiv:1904.03240, 2019.
[3] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[4] Y. Li and S. Mandt, "Disentangled sequential autoencoder," arXiv preprint arXiv:1803.02991, 2018.
[5] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using wavenet autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
[6] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, "Challenging common assumptions in the unsupervised learning of disentangled representations," arXiv preprint arXiv:1811.12359, 2018.
[7] A. Hyvärinen and P. Pajunen, "Nonlinear independent component analysis: Existence and uniqueness results," Neural Networks, vol. 12, no. 3, pp. 429–439, 1999.
[8] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[9] A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, no. 4, pp. 94–128, 1999.
[10] Y. Tan, J. Wang, and J. M. Zurada, "Nonlinear blind source separation using a radial basis function network," IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 124–134, 2001.
[11] L. B. Almeida, "MISEP – linear and nonlinear ICA based on mutual information," Journal of Machine Learning Research, vol. 4, no. Dec, pp. 1297–1318, 2003.
[12] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," arXiv preprint arXiv:1410.8516, 2014.
[13] P. Brakel and Y. Bengio, "Learning independent features with adversarial nets for non-linear ICA," arXiv preprint arXiv:1710.05050, 2017.
[14] J. A. Lee, C. Jutten, and M. Verleysen, "Non-linear ICA by using isometric dimensionality reduction," in International Conference on Independent Component Analysis and Signal Separation. Springer, 2004, pp. 710–717.
[15] I. Khemakhem, D. P. Kingma, and A. Hyvärinen, "Variational autoencoders and nonlinear ICA: A unifying framework," arXiv preprint arXiv:1907.04809, 2019.
[16] I. Khemakhem, R. P. Monti, D. P. Kingma, and A. Hyvärinen, "ICE-BeeM: Identifiable conditional energy-based deep models," arXiv preprint arXiv:2002.11537, 2020.
[17] A. Hyvärinen, H. Sasaki, and R. E. Turner, "Nonlinear ICA using auxiliary variables and generalized contrastive learning," arXiv preprint arXiv:1805.08651, 2018.
[18] A. Hyvärinen and H. Morioka, "Unsupervised feature extraction by time-contrastive learning and nonlinear ICA," in Advances in Neural Information Processing Systems, 2016, pp. 3765–3773.
[19] Y.-A. Chung and J. Glass, "Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech," arXiv preprint arXiv:1803.08976, 2018.
[20] B. Milde and C. Biemann, "Unspeech: Unsupervised speech context embeddings," arXiv preprint arXiv:1804.06775, 2018.
[21] W.-N. Hsu, Y. Zhang, and J. Glass, "Learning latent representations for speech generation and transformation," arXiv preprint arXiv:1704.04222, 2017.
[22] A. H. Liu, T. Tu, H.-y. Lee, and L.-s. Lee, "Towards unsupervised speech recognition and synthesis with quantized speech representation learning," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7259–7263.
[23] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
[24] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[25] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[26] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, "A kernel statistical test of independence," in Advances in Neural Information Processing Systems, 2008, pp. 585–592.
[27] T. R. Read and N. A. Cressie, Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer Science & Business Media, 2012.
[28] A. Kankainen, Consistent Testing of Total Independence Based on the Empirical Characteristic Function. University of Jyväskylä, 1995, vol. 29.
[29] A. Feuerverger, "A consistent test for bivariate dependence," International Statistical Review / Revue Internationale de Statistique, pp. 419–433, 1993.
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[31] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.
[32] M. E. Peters, M. Neumann, L. Zettlemoyer, and W.-t. Yih, "Dissecting contextual word embeddings: Architecture and representation," arXiv preprint arXiv:1808.08949, 2018.