A survey and an extensive evaluation of popular audio declipping methods
Pavel Záviška, Pavel Rajmic, Alexey Ozerov, and Lucas Rencker
Abstract—Dynamic range limitations in signal processing often lead to clipping, or saturation, in signals. Audio declipping is the task of estimating the original audio signal given its clipped measurements, and it has attracted a lot of interest in recent years. Audio declipping algorithms often make assumptions about the underlying signal, such as sparsity or low-rankness, as well as about the measurement system. In this paper, we provide an extensive review of audio declipping algorithms proposed in the literature. For each algorithm, we present the assumptions being made about the audio signal, the modeling domain, and the optimization algorithm. Furthermore, we provide an extensive numerical evaluation of popular declipping algorithms on real audio data. We evaluate each algorithm in terms of the signal-to-distortion ratio, as well as using perceptual metrics of sound quality. The article is accompanied by a repository containing the evaluated methods.
Index Terms—audio clipping, saturation, declipping, model, sparsity, learning, optimization, evaluation, survey
P. Záviška and P. Rajmic are with the Signal Processing Laboratory, Brno University of Technology, Czech Republic. E-mail: [email protected]. A. Ozerov is with InterDigital, Cesson-Sévigné, France. E-mail: [email protected]. L. Rencker is with the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK. E-mail: [email protected].

I. INTRODUCTION

CLIPPING is a non-linear signal distortion usually appearing when a signal exceeds its allowed dynamic range. As a typical instance, an analog signal that is digitized can be clipped in value when its original peak values go beyond the largest (or lowest) digit representation. For this reason, the effect is also called saturation.

Clipping in audio signals has a great negative effect on the perceptual quality of audio [1], and it reduces the accuracy of automatic speech recognition [2], [3] and other audio analysis applications. To improve the perceived quality of audio, a recovery of the clipped samples can be performed; this process is usually termed declipping.

Many audio declipping methods are available today. They are based on different modeling assumptions, tested on very different audio datasets, and evaluated by different methodologies. The goal of this article is to survey the existing approaches to audio declipping, categorize them and emphasize some interconnections. An equally important goal of this contribution is the numerical evaluation of selected audio declipping methods on a representative dataset and providing a freely available MATLAB toolbox. The subset of methods under consideration was selected based on our intent to cover different restoration techniques, on the popularity of the methods, and on the availability—or reproducibility—of their implementations (which go hand in hand in many cases). It is worth saying that restoration quality is the primary focus of the evaluation, but comments on the speed of computation are provided as well.

In the case of hard clipping, which is the degradation considered in this survey, the input signal exceeding the prescribed dynamic range [−θ_c, θ_c] is limited in amplitude such that

    y_n = { x_n              for |x_n| < θ_c,
          { θ_c · sgn(x_n)   for |x_n| ≥ θ_c,                                  (1)

where [x_1, ..., x_N] = x ∈ R^N denotes the original (clean) signal and [y_1, ..., y_N] = y ∈ R^N the observed clipped signal. The limiting constant θ_c is referred to as the clipping threshold (this article supposes that clipping is symmetric, without affecting generality). See Fig. 1 for an example of a clipped signal (and its various reconstructions).

Inspired by the terminology used in audio source separation [4], and machine learning in general [5], the audio declipping methods can be cast into two main categories:
• unsupervised, or blind, where the signal is recovered assuming some generic regularisation (or modeling assumption) on how a natural audio signal should look, but no additional clean audio signals are involved, and
• supervised, where the signal recovery model parameters, or a part of them, are trained on (estimated from) clean audio examples that should be similar to the audio sample to be recovered (e.g., all signals are speech signals).

To date, the vast majority of state-of-the-art audio declipping approaches are unsupervised, and as such, we limit this study to those approaches. However, supervised approaches are emerging as well, and they are mostly based on deep neural networks (DNNs), trained to declip speech signals [6], [7], [8]. Supervised approaches are more specialized (since they are usually trained on particular classes of audio), and potentially more powerful, simply because more relevant information could be contained in the training set. As such, we believe that supervised learning is one of the potential and promising directions of evolution of research on audio declipping.

As for the unsupervised approaches, they often follow a generic path:
1) A modeling domain (e.g., time, analysis or synthesis, see Sec. II-A) is chosen.
2) A generic model regularizing an audio signal is specified in the chosen domain (e.g., autoregressive model, sparsity, group sparsity or low-rank structure).
3) Model parameters to be estimated from the clipped signal are specified (e.g., decomposition coefficients, coefficients and the dictionary, non-negative matrix factorization parameters, etc.).
4) A criterion linking model parameters and observations to be optimized is specified (though the criterion is related to modeling, different choices are often possible; for instance, sparsity-assisted methods may penalize coefficients with the ℓ_0 or ℓ_1 norm). The criterion may or may not include the following ingredients:
   • Clipped part consistency: whether the clipping constraint in the missing part holds (see Sec. III).
   • Reliable part consistency: whether the reconstructed signal in the reliable part equals the observed clipped signal (see Sec. III).
5) A suitable algorithm to optimize the model criterion is chosen / constructed (e.g., orthogonal matching pursuit, expectation maximization, etc.).
6) Once the algorithm has terminated (typically, a fixed number of iterations is performed or a condition designed to check convergence is satisfied), the final signal is formed.

Most state-of-the-art unsupervised audio declipping approaches characterized by the above-mentioned ingredients, including the approaches evaluated in this paper, are summarized in Table I.

In the case of multichannel audio (e.g., stereo), declipping may exploit correlations between different audio sources / objects in different channels, and this can improve the result over a straightforward, dummy solution of applying single-channel declipping to each channel independently. This was investigated for the first time in [9], and then studied as well in [10], though with a different approach. These works have shown that using the inter-channel dependencies can indeed improve the declipping performance. We do not evaluate those methods in this article, though, since there are only a few of them so far and such a task would require the creation of a particular multichannel dataset.
Roadmap of the article.
Section II formulates the problem and prepares the notation used in further parts. A survey of declipping methods is given in Section III, while the methods selected for a thorough evaluation are described in more detail in Section IV. Then, the experiment and evaluation setup are explained in Section V, together with the results and their discussion. Finally, a conclusion is given and some perspectives for future research are indicated.

II. PROBLEM FORMULATION
In correspondence with the clipping model (1), it is possible to divide the signal samples into three disjoint sets of indexes R, H, L such that R ∪ H ∪ L = {1, ..., N} and, correspondingly, to distinguish the reliable samples (not influenced by clipping), samples clipped from above to the high clipping threshold θ_c, and samples clipped from below to the low clipping threshold (−θ_c), respectively. The respective projection operators M_R, M_H and M_L (masks) are used to select only the samples from the corresponding set. With the mask operators, the following feasible sets can be defined:

    Γ_R = { x̃ | M_R x̃ = M_R y },                                              (2a)
    Γ_H = { x̃ | M_H x̃ ≥ θ_c },   Γ_L = { x̃ | M_L x̃ ≤ −θ_c },                   (2b)
    Γ   = Γ_R ∩ Γ_H ∩ Γ_L.                                                    (2c)

Note that these sets depend on the observation y, since the masks do; hence, formally we should write Γ(y), for example, but we omit the dependence on the signal at most places for brevity.

Fig. 1. Example of a clipped signal. In addition, various recovery possibilities are depicted, showing different types of consistency of the solution (discussed later in Sec. III).

Additional constraints like M_H x̃ ≤ θ_max and M_L x̃ ≥ −θ_max can be appended to further restrict Γ. This is not typically done, since in general the original dynamic range is not known, but for example [15] reports an improvement in signal recovery after such a trick for heavily clipped signals.

The declipping task is clearly ill-posed, since there is an infinite number of solutions that satisfy (1). Therefore, considering some additional information about the signal is crucial. That is where a signal or statistical model comes into play, which regularizes the inverse problem.
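To make the notation concrete, the following minimal NumPy sketch (our own illustration, not part of the accompanying MATLAB toolbox; all variable names are ours) clips a signal according to (1), derives the index sets R, H, L from the clipped observation, and implements the elementwise projection onto the feasible set Γ:

```python
import numpy as np

def hard_clip(x, theta_c):
    """Hard clipping according to (1)."""
    return np.clip(x, -theta_c, theta_c)

def clipping_masks(y, theta_c):
    """Boolean masks for the index sets R, H, L derived from the clipped observation y."""
    H = y >= theta_c          # clipped from above
    L = y <= -theta_c         # clipped from below
    R = ~(H | L)              # reliable samples
    return R, H, L

def project_Gamma(u, y, R, H, L, theta_c):
    """Elementwise projection of u onto Gamma = Gamma_R ∩ Gamma_H ∩ Gamma_L."""
    p = u.copy()
    p[R] = y[R]                              # keep reliable samples
    p[H] = np.maximum(u[H], theta_c)         # stay at or above the threshold
    p[L] = np.minimum(u[L], -theta_c)        # stay at or below the threshold
    return p

# toy usage
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
y = hard_clip(x, 0.6)
R, H, L = clipping_masks(y, 0.6)
```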
A. Preliminaries and some notation

Most of the declipping methods rely on signal processing in a transformed domain. In such a context, A : R^N → C^P will denote the analysis operator, and D : C^P → R^N is the synthesis operator. The operators are linear, it holds P ≥ N, and the operators are connected through the relation D = A*. The asterisk * denotes the adjoint operator. For computational reasons, authors often restrict themselves to Parseval tight frames [33], i.e., transforms for which DD* = A*A = Id. Unitary operators are clearly special cases of Parseval tight frames, where D = A^{−1}.

In synthesis-based signal models, one seeks coefficients z ∈ C^P that follow some prescribed properties, both directly in C^P and after synthesizing them into the signal Dz ∈ R^N. In the analysis models, one seeks a signal x ∈ R^N that satisfies some properties both in R^N and after the analysis into coefficients, Ax ∈ C^P [34].
TABLE I
CATEGORIZATION OF EXISTING SINGLE-CHANNEL UNSUPERVISED DECLIPPING APPROACHES. METHODS TREATED IN DETAIL AND EVALUATED ARE HIGHLIGHTED IN BOLD. TABLE ENTRIES THAT DID NOT FIT INTO A CLEAR CATEGORY ARE LEFT WITH THE "N/A" MARK.

Method | Modeling domain | Modeling assumptions | Model parameters | Optimization criterion | Clipping consistency | Rel. part consistency | Optimization algorithm
Janssen'86 [11] | AR | AR model | AR params. | ML | no | yes | EM
Abel'91 [12] | spectrum | limited bandwidth | band limit | several | yes | yes | N/A
Fong'01 [13] | N/A | AR model | N/A | AR coefs & correlation coefs | N/A | N/A | Monte Carlo
Dahimene'08 [14] | time | AR model | AR params. | least squares | no | yes | pseudoinverse
Adler'11 [15] | synthesis | sparsity | transform coefs | ℓ_0-min | yes | no | OMP
Weinstein'11 [16] | synthesis | sparsity | transform coefs | reweighted ℓ_1-min | yes | yes | CVX
Miura'11 [17] | synthesis | sparsity | transform coefs | ℓ_0-min | no | N/A | RVP (MP)
Kitić'13 [18] | synthesis | sparsity | transform coefs | ℓ_0-min | approximately | approximately | IHT
Defraene'13 [19] | synthesis | sparsity & psychoacoust. | transform coefs | ℓ_1-min | yes | no | CVX
Siedenburg'14 [20] | synthesis | social sparsity | transform coefs | social shrinkage | approximately | approximately | (F)ISTA
Kitić'14 [21] | analysis | sparsity | transform coefs | ℓ_0-min | yes | yes | ADMM
Jonscher'14 [22] | synthesis | sparsity | transform coefs | N/A | no | N/A | N/A
Bilen'15 [23] | analysis | low-rank NMF | NMF params. | ML | yes | yes | EM
Kitić'15 [24] | analysis & synthesis | sparsity | transform coefs | ℓ_0-min | yes | yes | ADMM
Harvilla'15 [25] | time | smoothness | signal samples | regularized LS | no | no | explicit formula
Takahashi'15 [26] | N/A | low rank | signal samples | quadratic | yes | yes | custom
Elvander'17 [27] | synthesis | sparsity | transform coefs | atomic norm min | yes | yes | SD
Rencker'18 [28] | synthesis | sparsity & learned dict. | transform coefs & dictionary | ℓ_0-min | approximately | approximately | alternate GD
Gaultier'19 [29] | analysis & synthesis | sparsity | transform coefs | ℓ_0-min | yes | yes | ADMM
Záviška'19 [30] | synthesis | sparsity | transform coefs | ℓ_0-min | yes | yes | ADMM
Záviška'19b [31] | synthesis | sparsity & psychoacoust. | transform coefs | ℓ_1-min | yes | yes | DR

Abbreviations: ADMM ... Alternating Direction Method of Multipliers, AR ... Autoregressive, CV ... Condat–Vũ algorithm, CVX ... convex opt. toolbox [32], DR ... Douglas–Rachford alg., EM ... Expectation–Maximization, (F)ISTA ... (Fast) Iterative Shrinkage Thresholding Alg., GD ... Gradient Descent, IHT ... Iterative Hard Thresholding, LS ... Least Squares, ML ... Maximum Likelihood, NMF ... Nonnegative Matrix Factorization, (O)MP ... (Orthogonal) Matching Pursuit, RVP ... Recursive Vector Projection, SD ... Semidefinite programming

In finite-dimensional spaces (which is our case), the operators D and A can be identified with matrices. The matrix D corresponding to the synthesis is often called the dictionary, since its columns are the basic blocks used in building the signal via their linear combination [35].

Since the methods covered by this survey concern exclusively audio, it is natural that the majority of the methods use transforms that map the signal to the time-frequency (TF) plane (and vice versa), such as the short-time Fourier transform (STFT), often referred to as the discrete Gabor transform (DGT) [33], [36]. Methods based on such time-frequency transforms work with (possibly overlapping) blocks of the signal. Such signal chunks are usually created by means of time-domain windowing; note that this is the reason why we will speak about the windows of the signal, alternatively about the signal blocks, but not about the signal's time frames, in order to avoid confusion with the above-introduced concept of frames in vector spaces. The TF coefficients z are treated as a vector from the mathematical point of view, but note that for the user it is often more convenient to form a matrix [z_ft] from z, whose rows represent frequencies and whose columns correspond to time-shifts of the windows. Methods in Sections IV-I and IV-J will need to explicitly refer to individual signal blocks. For this sake, an additional index t will be used, such that, for instance, y_t will denote the t-th block of the clipped signal; in analogy to this, appending t to the masking operators and to the feasible sets will refer to their restriction in time, such as M_Rt or Γ(y_t) = Γ_t.

Norms of vectors will be denoted by ‖·‖, usually appended with a lower index characterizing the particular type of the norm. Using no index corresponds to the case of the operator norm (i.e., the largest singular value of the operator).

Many methods are based on the concept of so-called sparsity, popularized in the last two decades [35], [37].
Exploiting sparsity means that, within feasible declipping solutions, signals Dz with a low number of nonzero elements of z are prioritized (synthesis model), or signals x with a low number of nonzeros in Ax are preferred (analysis model) [37]. Mathematically, the ‖·‖_0 pseudo-norm is used to count the nonzeros of a vector.

Many of the methods below utilize convex optimization; typically a sum of convex functions has to be minimized. In line with the recent trend, numerical solutions of such problems will be found using the so-called proximal splitting algorithms [38], [39], [40]. The proximal algorithms are iterative schemes, usually with only a few important—but rather simple—computations in each iteration. Each such computational step is related to the respective function in the sum separately. We recall that the proximal operator of a convex function f is the mapping

    prox_f(x) = arg min_z  (1/2) ‖x − z‖_2^2 + f(z).                          (3)

The concept provides a generalization of the common projection operator [38].
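As a small illustration of (3) (our own example, not taken from the surveyed papers): the proximal operator of a scaled ℓ_1 norm is elementwise soft thresholding, while the proximal operator of the indicator function of a set reduces to the projection onto that set.

```python
import numpy as np

def prox_l1(x, tau):
    """prox of tau*||.||_1: elementwise soft thresholding (also valid for complex x)."""
    mag = np.abs(x)
    return x * np.maximum(1.0 - tau / np.maximum(mag, 1e-12), 0.0)

def prox_indicator_box(x, lo, hi):
    """prox of the indicator function of the box [lo, hi]: simply the projection onto it."""
    return np.clip(x, lo, hi)

z = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(z, 0.5))               # shrinks every entry towards zero
print(prox_indicator_box(z, -1, 1))  # projects onto [-1, 1]
```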
III. DECLIPPING METHODS

Whichever technique is employed, every declipping method can be assigned to one of several classes, based on the so-called consistency. A fully consistent method seeks a solution that is a member of the intersection Γ = Γ_R ∩ Γ_H ∩ Γ_L, or, in other words, the recovered signal should equal the original samples in the reliable part and, at the same time, it should lie beyond the clipping thresholds in the clipped part. A method consistent in the reliable part belongs to Γ_R, while a method consistent in the clipped part is a member of Γ_H ∩ Γ_L. A fully inconsistent method does not require the strict membership of the solution in any of the sets Γ_R, Γ_H, Γ_L. See Fig. 1 for examples of different types of solution consistency. Consistent methods reflect the observed clipped signal and the clipping model, but the solutions are usually quite slow to compute. Inconsistency usually means a gain in speed in exchange for only approximate solutions (which might still sound great to the human auditory system).

Apart from the detailed treatment of the methods selected for further numerical evaluation, this section is devoted to surveying the other declipping methods in the literature.

Abel and Smith [12] discuss declipping of signals whose spectral band is limited (more than at the Nyquist frequency). This assumption is the key ingredient leading to a convex optimization program. The recovery uses oversampling and interpolation with sinc functions. The method is fully consistent, apart from the "noisy" variant treated at the end of [12].

Fong and Godsill [13] approach the declipping problem from the viewpoint of Bayesian statistical signal processing. The main assumption is the autoregressive nature of the signal, and the declipped samples are found using Monte Carlo particle filtering. The experiment follows a very simplified scenario (a single test on a very short speech sample).

Dahimene et al. [14] also start from the autoregressive (AR) assumption imposed on the signal. The paper forms a system of linear equations which is row-wise pruned in correspondence with the positions of the clipped samples. Two means of signal estimation are suggested: one based on ordinary least squares and another based on Kalman filtering. This modeling does not guarantee any consistency in the clipped part.

The method introduced by Miura et al. [17] is based on a procedure coined recursive vector projection (RVP) by the authors. It turns out that it is actually the classical matching pursuit algorithm [41] restricted to the reliable samples of the signal. Thus, it is a synthesis approach, with the dictionary described as the (possibly overcomplete) DFT. Since the clipping constraints are not taken into consideration, [17] is a method inconsistent in the clipped part, and its idea is actually quite similar to audio inpainting using the orthogonal matching pursuit [42].

Jonscher et al. [22] introduce the Frequency Selective Extrapolation method. The signal is processed block by block. The clipped samples are treated as missing and their reconstruction is performed sequentially. The model behind the method can be understood as synthesis sparsity-based, with an overcomplete Fourier dictionary. Tests were carried out on speech signals only.

The method proposed by Takahashi et al. [43], [26] starts from the interesting observation that the Hankel matrix formed from a signal that follows the autoregressive (AR) model has rank identical to the order of the respective AR process. Therefore, the approach aims at estimating the unknown but clipped elements of the Hankel matrix, whose rank is being minimized at the same time. After a series of simplifications, the authors formulate a convex optimization problem.
The reported results look promising, but unfortunately no data or codes are available.

Harvilla and Stern [25] introduce the RBAR (Regularized Blind Amplitude Reconstruction) method, which is reported to run in real time. The declipping task is formulated as an extended Tikhonov-regularized least squares problem, where the main idea is to penalize large deviations in the signal's second difference and, at the same time, to penalize deviations from the clipping level in the clipped part. The experiments were carried out on speech signals; no codes are available.

Elvander et al. [27] introduce probably the first approach that adapts the grid-less sparse recovery framework to the declipping problem. In brief, grid-less means that the dictionary no longer contains countably many columns. In the case of [27], the continuous range of frequencies is available as the building blocks for the signal. Such an approach comes at the price that the resulting minimization problem is a semidefinite program, which can be computationally expensive.

In his PhD thesis, Gaultier [29] extends the idea of earlier algorithms based on hard thresholding [18], [21], [24]. The author works with an idea similar to the one published in [20], and introduces coefficient neighborhoods such that the TF coefficients are not processed individually (as commonly done), but group-wise. This is shown to be beneficial, especially for mild clipping.

This survey considers only hard clipping governed by Eq. (1), but to be complete, let us shortly mention the existence of soft clipping (and the corresponding declipping methods). The transfer function of soft clipping does not break suddenly at the points −θ_c and θ_c as in the case of hard clipping. Rather, a certain transition interval is present around these spots that makes the transfer function smoother, resulting in less spectral distortion of clipping. Recovery of signals that have been soft-clipped is treated in [44], [45], [46], to name a few.

Let us conclude this section with the important remark that if a user (or a particular application) does not require consistency in the clipped part, the declipping problem can be treated as so-called audio inpainting, considering clipped samples simply as missing, hence ignoring potentially useful information. Audio inpainting is an area in itself, see for example [47], [48], [49] and references therein. In our declipping experiment below, we included Janssen's method [11] as the representative of such methods (being actually very successful in audio inpainting).

IV. DECLIPPING METHODS SELECTED FOR EVALUATION
This section explains the principles of the methods that were selected for the evaluation procedure. Each method comes with its algorithm in pseudocode (software implementation is addressed later in Section V). Some of the existing methods based on synthesis sparsity are quite easily adaptable to the respective analysis counterpart; we included such unpublished variants in several cases to cover a wider range of methods. The order of the methods in this section was chosen such that greedy-type algorithms are covered first, then ℓ_1-based methods (optionally including psychoacoustics) are presented, then methods that can adapt to the signal, and, as the last one, the simple Janssen's method [11] serving as the "anchor".

A. Constrained Orthogonal Matching Pursuit (C-OMP)
The approach to audio declipping proposed by Adler et al. [15] follows the same idea as the article [42] devoted to inpainting. The Orthogonal Matching Pursuit (OMP) is a well-known greedy-type algorithm [50], [51], used here as the first part of the procedure that approximates sparse coefficients in the NP-hard problem

    arg min_z ‖z‖_0  s.t.  ‖M_R y − M_R D z‖_2 ≤ ε,  D z ∈ Γ_H ∩ Γ_L.          (4)

It is clear that (4) is a synthesis-based signal model and that it is clip-consistent, but inconsistent in the reliable part. The authors of [15] used an overcomplete discrete cosine transform (DCT) in the role of the synthesis operator D.

The signal is cut into overlapping windows first, and the OMP is applied in each window separately. In the course of the OMP iterations, more and more significant coefficients with respect to D are picked in a greedy fashion. Once such a subset of coefficients fulfils ‖M_R y − M_R D z‖_2 ≤ ε, where ε > 0 is the user-set precision required for the reliable part, the OMP stops. Notice that doing this is effectively performing audio inpainting using OMP—however, as the very last step of the estimation, the current solution is updated using convex optimization. This makes the approach very slow, since it requires an iterative algorithm. The authors of [15] rely in this step on the CVX toolbox [32], in which (the subset of) D is handled in matrix form, which decelerates the computations even more. After the algorithm is finished, the coefficients are synthesized using D. Individual windows of the declipped signal are then put together using the common overlap–add procedure. The algorithm for a single window is summarized in Alg. 1.

Algorithm 1: Constrained OMP declipping [15] (C-OMP)
  Input: D, y ∈ R^N, R, H, L
  Parameters: ε > 0
  1: Using OMP, find an approximate solution ẑ to the problem
       arg min_z ‖z‖_0  s.t.  ‖M_R y − M_R D_w z‖_2 ≤ ε
  2: Fix the support Ω ⊆ {1, ..., P} of ẑ
  3: Solve the constrained convex program
       ẑ_Ω = arg min_{z_Ω} ‖M_R y − M_R D_Ω z_Ω‖_2  s.t.  D_Ω z_Ω ∈ Γ_H ∩ Γ_L
  return D_Ω ẑ_Ω

Some remarks should be made here. First, consider D as a matrix for the moment; the OMP requires that the columns of D have the same energy, i.e., the same ℓ_2 norm—this kind of normalization guarantees a fair selection of coefficients, at least for oscillatory signals such as audio. To preserve such a condition, the problem at line 1 of Alg. 1 needs to weight the columns of D, which arises from the fact that the subsets of the columns used for estimation, M_R D, do not contain the same energy. We denote this weighted synthesis D_w. For more details, see the original paper [15] or the discussion on different means of weighting in [52].

Second, Alg. 1 uses the notation D_Ω for the synthesis operator restricted just to its columns contained in the set Ω, and, by analogy, the restricted vector of coefficients z_Ω. There is actually no need to weight the columns of D_Ω for the purpose of solving the problem at line 3.

Third, notice that the condition ‖M_R y − M_R D z‖_2 ≤ ε is in general violated after the update at line 3, because there might be no solution to the convex program which would satisfy it. Furthermore, hand in hand with this, notice that while D is usually chosen as a frame in R^N, the restriction D_Ω does not have to inherit this property anymore, and this fact naturally applies to M_R D_Ω as well. In turn, when OMP finds Ω ⊂ {1, ..., P}, there is no guarantee of the existence of any solution to the convex program. We will return to this issue in the experimental part, since it will serve as an explanation of the strange numerical behavior of the C-OMP in some rare cases.
B. A-SPADE

A-SPADE (Analysis SPArse DEclipper) was introduced by Kitić et al. in [24]. It is a natural successor of the similar sparsity-based approaches [18] and [21], which are outperformed by A-SPADE [24].

The A-SPADE algorithm approximates the solution of the following NP-hard regularized inverse problem

    min_{x,z} ‖z‖_0  s.t.  x ∈ Γ(y)  and  ‖Ax − z‖_2 ≤ ε,                      (5)

where x ∈ R^N stands for the unknown signal in the time domain, and z ∈ C^P contains the (also unknown) coefficients. Parseval tight frames [33], i.e., DD* = A*A = Id, are considered. The processing of the signal is done sequentially, window by window. Due to the overlaps of windows, it is enough to use a simple TF transform like the DFT or the DCT (both possibly oversampled), which all are Parseval tight frames.

The optimal solution to (5) in each window is approximated by means of the alternating direction method of multipliers (ADMM). The resulting algorithm is in Alg. 2; for a detailed derivation and discussion, see [53]. The computational cost of A-SPADE is dominated by the signal transformations; the algorithm requires one synthesis and one analysis per iteration. The hard thresholding H_k is a trivial operation. The projection at line 3 seeks the signal x ∈ Γ whose analysis Ax is closest to (z̄^(i+1) − u^(i)). For tight frames, this task can be translated to an elementwise mapping in the time domain [53],

    (proj_Γ(u))_n = { y_n              for n ∈ R,
                    { max(θ_c, u_n)    for n ∈ H,                              (6)
                    { min(−θ_c, u_n)   for n ∈ L,

with (u)_n denoting the n-th element of the vector, i.e., (u)_n = u_n.

Compared to most available algorithms, it is fairly easy to tune the parameters. The variable k directly represents the number of selected coefficients in the hard-thresholding step. This number grows in the course of the iterations, driven by the parameters s and r (every r-th iteration, k is increased by s). The algorithm works really well with the basic setting, where k steps by one (i.e., s = r = 1).

Algorithm 2: A-SPADE algorithm [24]
  Input: A, y ∈ R^N, R, H, L, ε > 0
  Parameters: s, r ∈ N
  Initialization: x̂^(0) ∈ R^N, u^(0) ∈ C^P, k = s
  1: for i = 0, 1, ... until ‖A x̂^(i+1) − z̄^(i+1)‖_2 ≤ ε do
  2:   z̄^(i+1) = H_k(A x̂^(i) + u^(i))
  3:   x̂^(i+1) = arg min_x ‖A x − z̄^(i+1) + u^(i)‖_2  s.t.  x ∈ Γ
  4:   u^(i+1) = u^(i) + A x̂^(i+1) − z̄^(i+1)
  5:   if (i + 1) mod r = 0 then k = k + s
  return x̂^(i+1)
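A minimal single-window sketch of Alg. 2 follows (our own illustration, not the authors' reference implementation). The orthonormal DCT stands in for the Parseval tight frames used in [24]; since it is unitary, the projection step reduces to proj_Γ applied to D(z̄ − u), cf. (6). The boolean masks R, H, L are assumed to come from the clipping_masks sketch above.

```python
import numpy as np
from scipy.fft import dct, idct

def hard_threshold(z, k):
    """Keep the k largest-magnitude coefficients, zero the rest."""
    out = np.zeros_like(z)
    if k > 0:
        idx = np.argsort(np.abs(z))[-k:]
        out[idx] = z[idx]
    return out

def a_spade_window(y, R, H, L, theta_c, eps=0.1, s=1, r=1, max_it=500):
    """One window of A-SPADE (Alg. 2) with A = orthonormal DCT, so D = A^{-1} = idct."""
    A = lambda x: dct(x, norm='ortho')
    D = lambda z: idct(z, norm='ortho')
    x_hat = y.copy()
    u = np.zeros_like(y)
    k = s
    for i in range(max_it):
        z_bar = hard_threshold(A(x_hat) + u, k)
        # x-update: for a unitary A the minimizer is proj_Gamma(D(z_bar - u)), cf. (6)
        v = D(z_bar - u)
        x_hat = np.where(R, y,
                         np.where(H, np.maximum(v, theta_c),
                                     np.minimum(v, -theta_c)))
        u = u + A(x_hat) - z_bar
        if np.linalg.norm(A(x_hat) - z_bar) <= eps:
            break
        if (i + 1) % r == 0:
            k += s
    return x_hat
```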
C. S-SPADE

Similarly to A-SPADE, Kitić et al. [24] introduce also a synthesis-based formulation,

    min_{x,z} ‖z‖_0  s.t.  x ∈ Γ(y)  and  ‖x − Dz‖_2 ≤ ε.                     (7)

However, the S-SPADE algorithm from [24], which copes with (7), has been shown in [30], [53] to actually solve a different optimization problem than (7). The same publications suggested a new version of S-SPADE as the true counterpart of A-SPADE, and showed its superiority in terms of SDR performance. Such an algorithm is in Alg. 3.

Algorithm 3: S-SPADE algorithm according to [30]
  Input: D, y ∈ R^N, R, H, L, ε > 0
  Parameters: s, r ∈ N
  Initialization: x̂^(0) ∈ R^N, u^(0) ∈ R^N, k = s
  1: for i = 0, 1, ... until ‖x̂^(i+1) − D z̄^(i+1)‖_2 ≤ ε do
  2:   z̄^(i+1) = H_k(D*(x̂^(i) − u^(i)))
  3:   x̂^(i+1) = arg min_x ‖D z̄^(i+1) − x + u^(i)‖_2  s.t.  x ∈ Γ
  4:   u^(i+1) = u^(i) + D z̄^(i+1) − x̂^(i+1)
  5:   if (i + 1) mod r = 0 then k = k + s
  return x̂^(i+1)

The computational complexity of Algs. 2 and 3 is the same. Each employs one synthesis, one analysis, the hard thresholding and an elementwise mapping per iteration.

D. Declipping using weighted ℓ_1 minimization

The above methods are based on the greedy approach to sparsity. Now we present several methods that rely on the so-called convex relaxation: the idea is to substitute the non-convex ℓ_0 pseudonorm with the "closest" convex norm, which is the ℓ_1 norm [35], [54]. The two declipping formulations in this section are quite basic, but to our knowledge they were treated only in [56] and [31]. In this article, they are included in the evaluation in their simple form, but they also serve as the building block for the algorithms in further sections.

Let us start with the simple synthesis-based task

    arg min_z ‖w ⊙ z‖_1  s.t.  D z ∈ Γ,                                        (8)

where D is the synthesis operator and ⊙ denotes the elementwise product of two vectors. The vector w > 0 can be set to all ones when no coefficients should be prioritized, but the larger an element of w is, the more the corresponding coefficient in z is penalized in the optimization. Due to the strict requirement Dz ∈ Γ, such an approach is fully consistent.

To find an appropriate algorithm to solve (8), it is convenient to rewrite it in an unconstrained form:

    arg min_z ‖w ⊙ z‖_1 + ι_Γ(Dz),                                             (9)

where the hard constraint from (8) has been equivalently replaced by the indicator function ι,

    ι_C(u) = { 0    for u ∈ C,
             { +∞   for u ∉ C.                                                 (10)

Next, the observation is used that ι_Γ(Dz) = (ι_Γ ∘ D)(z) = ι_{Γ*}(z), with Γ* being the set of coefficients consistent with the clipping model, i.e., Γ* = { z̃ | D z̃ ∈ Γ }; cf. definitions (2). The Douglas–Rachford (DR) algorithm [55] is able to find the minimizer of a sum of two convex functions of our type. The algorithm is presented in Alg. 4.
Algorithm 4: Douglas–Rachford (DR) algorithm solving (8) [56]
  Input: D, y ∈ R^N, w ∈ R^P, R, H, L
  Parameters: λ = 1, γ > 0
  Initialization: z^(0) ∈ C^P
  1: for i = 0, 1, ... until convergence do
  2:   z̃^(i) = proj_{Γ*}(z^(i))                                % using (11)
  3:   z^(i+1) = z^(i) + λ (soft_{γw}(2 z̃^(i) − z^(i)) − z̃^(i))
  return D z^(i+1)

It iterates over two principal steps. The first is the projection onto Γ*, which corresponds to the proximal operator of ι_{Γ*} (recall the definition of prox in (3)). This projection is nontrivial, and an explicit formula exists only in the case when DD* is diagonal. According to [56], for Parseval tight frames it holds

    proj_{Γ*}(z) = z − D*(Dz − proj_Γ(Dz)),                                    (11)

where proj_Γ is a trivial, elementwise time-domain mapping. The second step involves soft thresholding soft_{τw} with the vector of thresholds τw, which coincides with the proximal operator of the weighted ℓ_1 norm. The soft thresholding is an elementwise mapping defined by

    (soft_{τw}(z))_i = z_i · max(1 − τ w_i / |z_i|, 0).                        (12)

The analysis counterpart of (8) reads

    arg min_x ‖w ⊙ Ax‖_1  s.t.  x ∈ Γ,                                         (13)

where A is the analysis operator. Unfortunately, the presence of A inside the weighted ℓ_1 norm prevents us from using the DR algorithm as above. The Chambolle–Pock (CP) algorithm [57] is able to cope with problems including a general linear operator. Its particular form for signal declipping is shown in Alg. 5.
Algorithm 5: Chambolle–Pock (CP) algorithm solving (13)
  Input: A, y ∈ R^N, w ∈ R^P, R, H, L
  Parameters: ζ, σ > 0 and ρ ∈ [0, 1]
  Initialization: x^(0) ∈ R^N, x̄^(0) = x^(0), v^(0) ∈ C^P
  1: for i = 0, 1, ... until convergence do
  2:   v^(i+1) = clip_w(v^(i) + σ A x̄^(i))
  3:   x^(i+1) = proj_Γ(x^(i) − ζ A* v^(i+1))
  4:   x̄^(i+1) = x^(i+1) + ρ (x^(i+1) − x^(i))
  return x̄^(i+1)

There, clip is the Fenchel–Rockafellar conjugate of the soft thresholding, defined as

    clip_w(x) = (Id − soft_w)(x).                                              (14)

For ρ = 1, the CP algorithm converges if ζσ‖A‖² < 1. Looking at Algorithms 4 and 5, one recognizes that both have identical cost per iteration. The dominating operations are A and D.
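The following NumPy fragment (our own illustration) implements the weighted soft thresholding (12), its complementary clipping operator (14), and the coefficient-domain projection (11). The orthonormal DCT again stands in for a general Parseval tight frame, and `proj_gamma` is expected to be the time-domain projection onto Γ, e.g., the project_Gamma sketch from Sec. II with the observation and masks bound in.

```python
import numpy as np
from scipy.fft import dct, idct

def soft(z, thresh):
    """Weighted soft thresholding (12); `thresh` may be a scalar or a vector of thresholds."""
    mag = np.abs(z)
    return z * np.maximum(1.0 - thresh / np.maximum(mag, 1e-12), 0.0)

def clip_op(z, w):
    """clip_w = Id - soft_w, cf. (14)."""
    return z - soft(z, w)

def proj_gamma_star(z, proj_gamma):
    """Projection (11) onto Gamma* for a Parseval tight frame.
    Here D = idct(., norm='ortho') and D* = dct(., norm='ortho')."""
    Dz = idct(z, norm='ortho')
    return z - dct(Dz - proj_gamma(Dz), norm='ortho')
```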
E. Declipping in Sparseland (Rℓ1CC)

Weinstein and Wakin [16] present four approaches to declipping, all based on sparsity. The basic synthesis model (8) is actually covered by the article as well, under the acronym BPCC (Basis Pursuit with Clipping Constraints). In this section, we review the most successful method of [16] with coefficient reweighting, referred to as Rℓ1CC (Reweighted ℓ_1 with Clipping Constraints) by the authors. It is again a fully consistent approach.

Rℓ1CC follows a well-known idea from the field of sparse recovery / compressed sensing: to enhance sparsity of the solution, a standard iterative program is performed repeatedly, and the actual weights w are adapted based on the current temporary solution [58]. This way, large coefficients are penalized less and less during the course of the runs, while the opposite is true for the tiny coefficients, leading to a sharper final sparsity and, in effect, to a significantly better bias of the solution [59], [60]. The described effect, however, is not achieved automatically; in some applications, the improvement can be large compared to the non-adaptive case [61], but sometimes it does not improve much [62] or even fails. It is worth noticing that it is not correct to say that Rℓ1CC solves (8); see the discussion in [58].

To be more specific, Rℓ1CC starts by solving the problem (8) with weights set to w = 1. Based on the solution, w is recomputed and (8) is solved again, and again, until a reasonable convergence criterion is fulfilled. The authors of [16], however, provide no algorithm to solve (8). We know from Sec. IV-D that the DR algorithm can be used. In turn, Rℓ1CC is presented in Alg. 6, with the elementwise reweighting performed inside the loop. Note that in practice, the number of outer loop repetitions should be controlled in order to avoid a drop in performance; see the discussion in the evaluation part. (The codes available at https://github.com/aweinstein/declipping rely on CVX [32].)
Algorithm 6: Rℓ1CC using the Douglas–Rachford algorithm
  Input: D, y ∈ R^N, R, H, L
  Parameters: ε > 0
  Initialization: z^(0) ∈ C^P, w^(0) = 1
  1: for i = 0, 1, ... until convergence do
  2:   Solve (8) using Alg. 4 with w^(i)           % returns z^(i+1)
  3:   w^(i+1) = 1 / (|z^(i+1)| + ε)               % update weights elementwise
  return D z^(i+1)

For this survey, we found it interesting to also include the analysis variant, which was not considered in [16]. The procedure is analogous to the one just presented: Problem (13) is solved repeatedly, now by the CP algorithm, and the weights w are adapted depending on the last available solution, similarly to how it is done in Alg. 6. The difference is, however, that the solution of the CP algorithm is a time-domain signal; recomputing the weights thus requires application of an additional analysis to it.
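A compact sketch of the outer reweighting loop of Alg. 6 follows (our own illustration). The inner weighted-ℓ_1 solver is passed in as a function, e.g., an implementation of the DR iteration of Alg. 4; the stand-in used below is only a placeholder so that the snippet runs, not a real declipper.

```python
import numpy as np

def reweighted_l1_declip(solve_weighted, P, n_outer=5, eps=1e-3):
    """Outer reweighting loop of Alg. 6.
    `solve_weighted(w)` must return the coefficient vector minimizing (8) for weights w."""
    w = np.ones(P)
    z = None
    for _ in range(n_outer):
        z = solve_weighted(w)            # inner solver, e.g., Douglas-Rachford (Alg. 4)
        w = 1.0 / (np.abs(z) + eps)      # large coefficients get penalized less
    return z

# placeholder inner solver (NOT a real declipper): shrinks a fixed spectrum by the weights
rng = np.random.default_rng(0)
spectrum = rng.standard_normal(64)
dummy_solver = lambda w: spectrum / (1.0 + w)
z_hat = reweighted_l1_declip(dummy_solver, P=64)
```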
F. Social Sparsity

Siedenburg et al. [20] utilize the concept of so-called social sparsity [63], [64] for audio declipping. The plain sparsity induced by the ℓ_1 norm as the regularizer, used in the above sections, resulted in the soft thresholding of each coefficient individually in the respective algorithms. The social sparsity approach is more general: it allows shrinkage of a coefficient based on the values of the coefficients in its neighborhood.

The particular design of the neighborhood depends heavily on the task to solve. For declipping, i.e., reverting a time-domain damage, TF neighborhoods that spread in the direction of time are beneficial, since they help to share and leak information in the time direction. Such neighborhoods promote persistence in time, since with clipping, it makes more sense to focus on harmonic structures in audio than on transients. Mathematically, the problem to solve is

    min_z { ½‖M_R Dz − M_R y‖_2^2 + ½‖h(M_H Dz − M_H θ_c 1)‖_2^2
            + ½‖h(−M_L Dz − M_L θ_c 1)‖_2^2 + λ R(z) }.                        (15)

It is a synthesis model and it allows inconsistency in the reliable part (see the first term). The terms with h penalize the distance of the solution Dz from the feasible set Γ in the clipped part; the function h, called the hinge, acts elementwise such that for each element of its input,

    h(u) = { u   for u < 0,
           { 0   otherwise.                                                    (16)

The bold symbol 1 represents the vector of ones as long as the signal. Since the use of h in (15) does not guarantee that Dz stays above and below the clipping thresholds, the method [20] is also inconsistent in the clipped part.

The first three terms in (15) are clearly differentiable, even with Lipschitz-continuous gradient. Therefore, (15) can be treated as a sum of two functions, the second of them, R, being possibly non-smooth. This observation makes it possible to use standard optimization algorithms such as ISTA or FISTA [65], [66], [38], as outlined in Alg. 7.
Algorithm 7: ISTA-type social sparsity declipper [20]
  Input: y ∈ R^N, λ > 0, R, H, L, D
  Parameters: γ ∈ R, δ = ‖DD*‖
  Initialization: ẑ^(0), z^(0) ∈ C^P
  1: for i = 0, 1, ... until convergence do
  2:   g_1 = D* M_R*(M_R D z^(i) − M_R y)                       % gradients
  3:   g_2 = D* M_H* h(M_H D z^(i) − M_H θ_c 1)
  4:   g_3 = D* M_L* h(−M_L D z^(i) − M_L θ_c 1)
  5:   ẑ^(i+1) = S_{λ/δ}(z^(i) − (1/δ)(g_1 + g_2 + g_3))        % step, shrink
  6:   z^(i+1) = ẑ^(i+1) + γ (ẑ^(i+1) − ẑ^(i))                  % extrapolate
  return D ẑ^(i+1)

In step 5, the gradients are added. Looking at the structure of the particular gradients at lines 2–4 reveals that in a practical implementation, a much more effective way of computing g_1 + g_2 + g_3 is possible, containing a single application of D and D*. Another important trick, not included in Alg. 7 for clarity of presentation, is the warm start / adaptive restart strategy [67]. The authors of [20] found out that the overall convergence is significantly accelerated if ISTA is first run for a large λ for a few hundred iterations, then λ is decreased, and so on, until the target value of λ from (15) is reached.

In Alg. 7, the shrinkage operator S plays the role of the proximal operator of R. The regularizer R should promote the expected structure of the signal's TF coefficients. Paper [20] suggests using four types of social shrinkage of TF coefficients z, indexed by t (in time) and f (in frequency):
• LASSO (L): S_λ(z_ft) = z_ft · max(1 − λ/|z_ft|, 0).
• Windowed Group LASSO (WGL): S_λ(z_ft) = z_ft · max(1 − λ/‖N(z_ft)‖_2, 0), where N(z_ft) denotes a vector formed from the coefficients in the neighborhood of the TF position (f, t).
• Empirical Wiener (EW): S_λ(z_ft) = z_ft · max(1 − λ²/|z_ft|², 0).
• Persistent Empirical Wiener (PEW): S_λ(z_ft) = z_ft · max(1 − λ²/‖N(z_ft)‖_2², 0).

Simple LASSO shrinkage is identical to soft thresholding and corresponds to the proximal operator of R = ‖·‖_1. The Empirical Wiener, also known as the non-negative garrote [68], is better than LASSO in terms of bias [60] but still operates on coefficients individually. EW is the proximal operator of a function R that has no explicit form; see for instance [69, Sec. 3.2] for more details. In contrast to LASSO and EW, both WGL and PEW involve TF neighborhoods, such that the resulting value of the processed coefficient z_ft depends on the energy contained in N(z_ft). Again, the difference is only the second power used by PEW. It is interesting to note that the study [70] proves that neither WGL nor PEW is the proximal operator of any penalty R; in other words, the respective shrinkages are purely heuristic. Hand in hand with this, there are guarantees of convergence of Alg. 7 for LASSO and the Empirical Wiener if the extrapolation parameter γ is set properly [20], [38], while setting it in the case of WGL and PEW requires trial tuning.

The experiments in [20] give evidence that only PEW and EW are well-performing out of the four choices, and they outperform both the OMP [15] and C-IHT [18] approaches.
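The four shrinkage rules can be written compactly for a whole TF coefficient matrix. The sketch below is ours: it uses a simple box neighborhood extending only in the time direction (the neighborhoods in [20] may be shaped differently), and assumes the λ² form of EW/PEW given above.

```python
import numpy as np
from scipy.ndimage import convolve1d

def neighborhood_energy(Z, width=7):
    """Sum of |z|^2 over a time-direction neighborhood around each TF bin."""
    kernel = np.ones(width)
    return convolve1d(np.abs(Z) ** 2, kernel, axis=1, mode='constant')

def social_shrink(Z, lam, kind='PEW', width=7):
    """LASSO / WGL / EW / PEW shrinkage of a TF matrix Z (rows: frequency, cols: time)."""
    absZ2 = np.abs(Z) ** 2
    nbhd2 = neighborhood_energy(Z, width)
    eps = 1e-12
    if kind == 'L':
        factor = 1 - lam / np.sqrt(absZ2 + eps)
    elif kind == 'WGL':
        factor = 1 - lam / np.sqrt(nbhd2 + eps)
    elif kind == 'EW':
        factor = 1 - lam ** 2 / (absZ2 + eps)
    elif kind == 'PEW':
        factor = 1 - lam ** 2 / (nbhd2 + eps)
    else:
        raise ValueError(kind)
    return Z * np.maximum(factor, 0.0)
```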
G. Perceptual Compressed Sensing

The approach by Defraene et al. [19] is by far the first to include psychoacoustics in declipping (both in the model itself and in the evaluation). Although the approach is not quite related to compressed sensing, we refer to the method as the authors coined it. The following optimization problem is solved:

    min_z { ½‖M_R Dz − M_R y‖_2^2 + λ‖w ⊙ z‖_1 }  s.t.  Dz ∈ Γ_H ∩ Γ_L.        (17)

This might look like just another variation of synthesis-based declipping, but the main difference to the other methods is that the weights w are computed based on a human perception model. Note that, with respect to our terminology, this method is consistent in the clipped part.

To be more specific about the method, the signal is processed window by window. Ignoring the introduction of correct notation, task (17) is solved independently for the signal chunks given by windowing. The recovered signal is obtained by the synthesis D and by reusing the reliable samples at positions given by the set R. Once all windows are processed, the final signal is obtained via the overlap-add procedure.

The psychoacoustic model enters through w. The authors of [19] rely on the MPEG-1 Layer 1 psychoacoustic model 1 [71], [72], which computes the instantaneous masking curve based on the incoming signal and on the absolute threshold of hearing. In short, such a curve informs us about the spectral components in the signal that will not be perceived when other strong components are present. This effect is commonly known as instantaneous or frequency masking. Inspired by this effect, w is set as the inverse of this curve (the curve in dB is non-negative, which justifies such an approach from the mathematical point of view). Application of such weights can be interpreted as discouraging the introduction of distinctively audible new spectral components that are not present in the original signal. On the other hand, the introduction of less audible or inaudible spectral components is tolerated to a greater extent [19].

It is worth noticing that a correctly treated masking curve 1/ should be computed based on the original signal, which is not available in practice, and 2/ should be applied to the very current window of the signal. The authors of [19] cope with both issues in such a way that they recurrently use the just-declipped window as the base for calculating the masking curve, which is then applied in declipping the currently processed window.

In terms of the numerical treatment of (17), the authors propose an algorithm coined PCSL1. Its core optimization part refers to the CVX toolbox [32], but no particular codes for PCSL1 are available, unfortunately. After several unsuccessful trials with CVX, we decided to solve the problem with the so-called generic proximal algorithm introduced by Condat and Vũ [39], [73]. Such an algorithm, in the following abbreviated as the CV algorithm, is able to solve convex problems with more than two terms, possibly even containing linear operators. This is the case of (17), indeed. Alg. 8 presents the particular shape of the CV algorithm for declipping. The projection onto Γ_H ∩ Γ_L is done using the second and third lines of (6). Algorithm 8 is guaranteed to converge if the step sizes σ and τ satisfy the convergence condition of the Condat–Vũ algorithm (see [39], [73]).
Algorithm 8: Condat–Vũ (CV) algorithm solving (17)
  Input: D, y ∈ R^N, w ∈ R^P, λ > 0, R, H, L
  Parameters: σ, τ > 0 and ρ ∈ (0, 1]
  Initialization: z^(0) ∈ C^P, u^(0) ∈ R^N
  1: for i = 0, 1, ... until convergence do
  2:   z̃^(i+1) = soft_{τλw}(z^(i) − τ D*[M_R* M_R (D z^(i) − y) + u^(i)])
  3:   z^(i+1) = ρ z̃^(i+1) + (1 − ρ) z^(i)
  4:   p^(i+1) = u^(i) + σ D (2 z̃^(i+1) − z^(i))                % auxiliary
  5:   ũ^(i+1) = p^(i+1) − σ proj_{Γ_H ∩ Γ_L}(p^(i+1) / σ)
  6:   u^(i+1) = ρ ũ^(i+1) + (1 − ρ) u^(i)
  return D z^(i+1)

H. Psychoacoustically motivated ℓ_1 minimization

The second method that involves psychoacoustics, by Záviška et al. [31], is similar to the above (Section IV-G), but it is designed as completely consistent. Recall that this means that the declipped signal should belong to the set Γ defined in (2). The problem solved in [31] is actually identical to (8), but the weights are now derived from the human perception model. It is a synthesis-based signal model, and again, the Douglas–Rachford algorithm presented in Alg. 4 can be applied to find the numerical solution (with the efficient projection onto Γ* in the case that D is a tight frame).

In contrast to [19], the paper [31] discusses multiple ways of choosing the weights w. Besides the basic inversion, several other options of "inverting" the masking curve are introduced and evaluated. Surprisingly, the best declipping results were obtained using weights which simply grow quadratically with frequency! Such an option is not psychoacoustically inspired at all, but its success might be explained by the fact that clipped signals have a very rich spectrum, while the spectra of original signals decay with increasing frequency. In other words, regularizing the spectrum in such a way (in addition to sparsity) seems to be more powerful than modeling delicate perceptual effects. For the experiments in Sec. V, just this "parabola" option was selected.

Motivated by such an interesting observation, we also employed these quadratic weights w in the method from Sec. IV-G. See more in the evaluation part of the article.
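As an illustration of such frequency-growing weights (the exact offset and normalization used in [31] may differ; the parabola below is only a plausible sketch of ours), one can build a weight vector over the TF coefficients of a single window:

```python
import numpy as np

def parabolic_weights(n_freq_bins, w_min=1.0, w_max=10.0):
    """Weights growing quadratically with frequency, one value per frequency bin."""
    f = np.linspace(0.0, 1.0, n_freq_bins)        # normalized frequency
    return w_min + (w_max - w_min) * f ** 2

# one weight per coefficient of a window (a 1024-bin DCT/DFT window is assumed here)
w = parabolic_weights(1024)
```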
I. Dictionary Learning approach

In the next two sections, we consider T windows y_1, ..., y_T of the clipped signal y, and their corresponding operators and consistency sets M_Rt, Γ_t = Γ(y_t).

The sparsity-based methods reviewed so far use fixed and known synthesis operators, such as the DCT or Gabor transforms. However, another approach consists in adapting D to the observed data. Learning the dictionary (we prefer this term since D will be treated in its matrix form from now on), rather than using an off-the-shelf one, has shown improvements in inverse problems such as denoising or inpainting [74], [37]. Dictionary learning (DL) from clipped measurements has been formulated by Rencker et al. [75], [28]. Given a collection of T clipped signals y_1, ..., y_T (here typically corresponding to T overlapping time windows extracted from a signal), dictionary learning from clipped measurements can be formulated as:

    min_{z_1,...,z_T, D}  Σ_{t=1}^{T} d(D z_t, Γ_t)  s.t.  ‖z_t‖_0 ≤ K,  t = 1, ..., T,    (18)

where d(D z_t, Γ_t) is the Euclidean distance of D z_t to the set Γ_t, and Γ_t is the feasibility set corresponding to the signal y_t (as defined in (2)). Note that, using the notation in Sec. II, d(·, Γ_t) is equivalent to the data-fidelity term in (15), and is convex and differentiable with Lipschitz gradient thanks to the convexity of Γ_t. DL algorithms typically alternate between optimizing z_1, ..., z_T with D fixed (sparse coding step) and optimizing D with z_1, ..., z_T fixed (dictionary update step) [37].

The sparse coding step solves, for each t independently:

    min_{z_t}  d(D z_t, Γ_t)  s.t.  ‖z_t‖_0 ≤ K,                               (19)

which can be approximated using consistent Iterative Hard Thresholding (IHT). Consistent IHT, proposed in [18], is a simple algorithm that iterates:

    z_t ← H_K(z_t − μ Dᵀ(D z_t − proj_{Γ_t}(D z_t))),                          (20)

which corresponds to a gradient descent step (with step size μ), followed by hard thresholding. The ℓ_0 constraint in (18) can also be relaxed into an ℓ_1 constraint, in which case the sparse coding step corresponds to an ISTA-type algorithm as in Alg. 7. The dictionary update step is formulated as:

    min_{D ∈ 𝒟}  Σ_{t=1}^{T} d(D z_t, Γ_t),                                    (21)

which can be solved using (accelerated) gradient descent. Note that D is constrained to belong to 𝒟 = { D = [d_1, ..., d_P] | ‖d_p‖_2 ≤ 1 for p = 1, ..., P } in order to avoid scaling ambiguity. The overall dictionary learning algorithm is presented in Algorithm 9. When y_1, ..., y_T correspond to overlapping windows extracted from a given signal, each window can be recovered using the estimated dictionary and sparse coefficients as D̂ ẑ_1, ..., D̂ ẑ_T. The overall signal can then be estimated using overlap-add.

Algorithm 9: Dictionary learning algorithm for declipping [28]
  Input: y_1, ..., y_T ∈ R^N, R, H, L
  Parameters: K, P
  Initialization: z_1^(0), ..., z_T^(0) ∈ R^P, D^(0) ∈ R^{N×P}
  1: for i = 0, 1, ... until convergence do
  2:   Solve for t = 1, ..., T, using, e.g., consistent IHT:
         z_t^(i+1) = arg min_{z_t} d(D^(i) z_t, Γ_t)  s.t.  ‖z_t‖_0 ≤ K
  3:   Solve using, e.g., accelerated gradient descent:
         D^(i+1) = arg min_{D ∈ 𝒟} Σ_{t=1}^{T} d(D z_t^(i+1), Γ_t)
  return D^(i+1) z_1^(i+1), ..., D^(i+1) z_T^(i+1)
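A minimal NumPy sketch of the consistent IHT update (20) for one window follows (our own illustration, assuming a matrix dictionary D and boolean masks as in Sec. II; the step size and iteration count are arbitrary choices of ours):

```python
import numpy as np

def consistent_iht(D, y, R, H, L, theta_c, K, mu=None, n_iter=100):
    """Sparse coding of one clipped window by consistent IHT, cf. (19)-(20)."""
    N, P = D.shape
    if mu is None:
        mu = 1.0 / np.linalg.norm(D, 2) ** 2       # step size below 1/||D||^2
    z = np.zeros(P)
    for _ in range(n_iter):
        Dz = D @ z
        # elementwise projection of Dz onto Gamma_t, cf. (6)
        p = np.where(R, y,
                     np.where(H, np.maximum(Dz, theta_c),
                                 np.minimum(Dz, -theta_c)))
        z = z - mu * (D.T @ (Dz - p))              # gradient step on the distance term
        # hard thresholding H_K: keep the K largest entries in magnitude
        keep = np.argsort(np.abs(z))[-K:]
        mask = np.zeros(P, dtype=bool)
        mask[keep] = True
        z[~mask] = 0.0
    return z
```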
J. Nonnegative Matrix Factorization

Another approach, proposed recently by Bilen et al. [23], [76], is based on nonnegative matrix factorization (NMF). This is also a dictionary learning approach, though, instead of learning a dictionary of waveforms, it learns a nonnegative dictionary together with nonnegative decomposition coefficients to approximate the unknown power spectrogram of the original signal. Equivalently, this translates into the assumption that the power spectrogram is approximately low-rank. Given that the power spectrogram is a phase-free representation, this modeling is phase-invariant, thus allowing the use of a dictionary of a considerably smaller size than the dictionary in the approach presented in Section IV-I.

The NMF modeling is defined on the latent clean-signal power spectrogram obtained from the analysis short-time Fourier transform (STFT) coefficients. Note that it is also possible to decompose power spectrograms of synthesis coefficients, as in [77], though this was not yet done for audio declipping but for the related problem of compressed sensing recovery [77]. More specifically, the analysis NMF approach assumes that the nonnegative power spectrogram matrix P = [p_ft]_{f,t=1}^{F,T} (with p_ft = |z_ft|², and z_ft being the clean-signal STFT coefficients) is approximated as

    P ≈ V = WH,                                                                (22)

with W ∈ R_+^{F×K} and H ∈ R_+^{K×T} nonnegative matrices and K > 0 usually chosen much smaller than F and T (K ≪ min(F, T)). Matrix W can be understood as the power spectrum dictionary (the columns of W being characteristic spectral patterns), while H contains the corresponding nonnegative decomposition (activation) coefficients.

Note that the modeling in (22) is not yet well defined, since the approximation is not specified mathematically and the power spectrogram P is unknown. To specify it properly, it is proposed in [23], [76] to resort to maximum likelihood (ML) optimization under a probabilistic Gaussian formulation of Itakura–Saito (IS) NMF [78]. To simplify the formulation here, let all signals be considered either in the time domain (windowed, with overlap) or in the frequency domain, with the two domains related by the DFT for each time block separately. The DFT, denoted A here, A : C^F → C^F, is unitary, and the number of frequency channels F is identical to the number of time-domain samples. Let Y = [y_1, ..., y_T] and X = [x_1, ..., x_T] denote the windowed versions of the clipped and original (unknown) signals, respectively, and Z = [z_1, ..., z_T] the STFT of the original signal (z_t = A x_t). It is assumed that the coefficients in Z are all mutually independent, and each coefficient z_ft follows a complex circular zero-mean Gaussian distribution z_ft ∼ N_c(0, v_ft), with V = [v_ft]_{f,t} being the low-rank power spectrogram approximation specified in (22). The NMF model parameters are estimated by optimizing the ML criterion (see [23] for details)

    (W, H) = arg max_{W', H'} p(Y | W', H')                                    (23)

via the generalized expectation–maximization (GEM) algorithm [79] with multiplicative update (MU) rules [78], and the final windowed signal block estimate X̂ is recovered via Wiener filtering, see (24). (Estimating windowed signal blocks results in a problem relaxation [76], since the overlapping frames are clearly not independent, but those dependencies are not exploited during the estimation.) This is altogether summarized in Alg. 10, where M_Rt denotes the restriction of the operator M_R to block t, and all operators (e.g., M_Rt or A*), when applied to matrices, are applied column-wise.

Algorithm 10: NMF GEM algorithm [23]
  Input: y_1, ..., y_T ∈ R^F, R, H, L
  Parameters: K > 0
  Initialization: W^(0) ∈ R_+^{F×K}, H^(0) ∈ R_+^{K×T} (random), V^(0) = W^(0) H^(0)
  1: for i = 0, 1, ... until convergence do
  2:   Estimate the posterior power spectrogram P̂ = [p̂_ft]:
         p̂_ft = |ẑ_ft^(i+1)|² + Σ̂_{z_t z_t}(f, f),
       where (f, f) picks the f-th diagonal matrix entry and
         ẑ_t^(i+1) = Σ*_{M_Rt y_t, z_t} Σ^{−1}_{M_Rt y_t, M_Rt y_t} M_Rt y_t,              (24)
         Σ̂_{z_t z_t} = Σ_{z_t z_t} − Σ*_{M_Rt y_t, z_t} Σ^{−1}_{M_Rt y_t, M_Rt y_t} Σ_{M_Rt y_t, z_t},
       with Σ_{z_t z_t} = diag([v_ft^(i)]_f), Σ_{M_Rt y_t, z_t} = M_Rt A* Σ_{z_t z_t},
       Σ_{M_Rt y_t, M_Rt y_t} = M_Rt A* Σ*_{M_Rt y_t, z_t}.
  3:   Compute: x̂_1^(i+1) = A* ẑ_1^(i+1), ..., x̂_T^(i+1) = A* ẑ_T^(i+1)
  4:   Update the NMF parameters using the MU rules:
         W^(i+1) = W^(i) ⊙ ( ([W^(i) H^(i)]^{.−2} ⊙ P̂) (H^(i))ᵀ ) ⊘ ( [W^(i) H^(i)]^{.−1} (H^(i))ᵀ ),
         H^(i+1) = H^(i) ⊙ ( (W^(i+1))ᵀ ([W^(i+1) H^(i)]^{.−2} ⊙ P̂) ) ⊘ ( (W^(i+1))ᵀ [W^(i+1) H^(i)]^{.−1} ),
       with ⊙ and [·]^{.b} denoting the elementwise matrix product and power, ⊘ denoting elementwise division, and [·]ᵀ denoting the transpose.
  5:   Update: V^(i+1) = W^(i+1) H^(i+1)
  return x̂_1^(i+1), ..., x̂_T^(i+1), after applying the corresponding synthesis window and overlap-add to get the signal in the time domain.

It should be highlighted that, though the NMF modeling (22) is defined on the signal power spectrogram, the signal is reconstructed with both amplitude and phase, since the Wiener filtering (24) with a complex-valued Wiener gain (the matrix Σ*_{M_Rt y_t, z_t} Σ^{−1}_{M_Rt y_t, M_Rt y_t}) maps from the time domain to the complex-valued STFT domain.

Note that the consistency in the clipped part is not satisfied in Algorithm 10, and it is difficult to satisfy it properly, since with this constraint the posterior distribution of x_t (given the observations and the NMF model) is no longer Gaussian. To take the constraint into account, an ad hoc strategy was proposed in [23], [76]. This strategy consists in checking whether the clipping constraint is satisfied by the current block estimates x̂_t and, if not, the samples of those blocks for which it is violated are projected onto the corresponding clipping thresholds and dynamically added to the reliable set; the estimation steps 2 and 3 are then repeated for those blocks and, if necessary, iterated until the clipping constraint is completely satisfied. Whenever the main algorithm's iteration starts over, the reliable set is re-initialized (see [23], [76]).
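For illustration, the multiplicative IS-NMF updates used in step 4 can be written in a few lines of NumPy (a sketch of ours, assuming the posterior power spectrogram P̂ has already been computed; a small constant guards the divisions):

```python
import numpy as np

def is_nmf_mu_step(W, H, P_hat, eps=1e-12):
    """One pair of Itakura-Saito NMF multiplicative updates for P_hat ≈ WH."""
    V = W @ H + eps
    W = W * (((V ** -2) * P_hat) @ H.T) / ((V ** -1) @ H.T + eps)
    V = W @ H + eps                                # recompute with the updated W
    H = H * (W.T @ ((V ** -2) * P_hat)) / (W.T @ (V ** -1) + eps)
    return W, H

# toy usage: factorize a random nonnegative "power spectrogram"
rng = np.random.default_rng(0)
P_hat = rng.random((257, 40))
W, H = rng.random((257, 5)), rng.random((5, 40))
for _ in range(50):
    W, H = is_nmf_mu_step(W, H, P_hat)
```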
The Janssen’s method [11] published back in 1986 and thor-oughly discussed recently in [80] relies on the autoregressive(AR) signal model. It assumes that a particular signal samplecan be induced from a fixed linear combination of precedingsamples. The coefficients in such a combination are the ARcoefficients and their total number is called the order of theAR model. The model can be alternatively interpreted suchthat the audio signal is generated by a Gaussian white noisefiltered by an all-pole filter.In practice, the AR model can be successfully applied tosignals containing harmonic components. Janssen’s method cannot handle the clipping constraints; hence in declipping, itis only possible to use it in order to replace the clipped samplesby the values linearly estimated from the reliable samples.In that regard, Janssen’s method belongs to the approachesinconsistent in the clipped part.Despite its simplicity and age, the algorithm is a strongcompetitor of the most recent audio inpainting methods [81].That is why we decided to consider it within our evaluation.V. E
V. EVALUATION
This section compares the selected audio declipping methods in terms of the quality of restoration. First, the experiments that were performed are described, along with a characterization of the audio dataset. The evaluation metrics used to objectively assess the quality of restoration are then presented. Subsection V-C contains details about the algorithms from the practical viewpoint, such as parameter settings and comments on the behavior of the algorithms. Finally, the results are presented and discussed.
A. Experiment design and the dataset
The audio database used for the evaluation consists of 10 musical excerpts in mono, sampled at 44.1 kHz, with an approximate length of 7 seconds. They were extracted from the EBU SQAM database (https://tech.ebu.ch/publications/sqamcd). The excerpts were thoroughly selected to cover a wide range of audio signal characteristics. Since a significant number of methods is based on signal sparsity, care was taken to include different levels of sparsity in the signals (with respect to the Gabor transform).

To our best knowledge, the only declipping experiments including audio sampled at 44.1 kHz were carried out in [19] and [31], while the others used audio sampled at 16 kHz at most. This survey thus provides the very first large-scale experiment on high-quality sampled audio.

The input data were clipped in agreement with the model (1), using clipping levels chosen to lead to 7 different input signal-to-distortion ratios (SDR). The SDR for two signals u and v is defined as

SDR(u, v) = 20 log_10 ( ‖u‖ / ‖u − v‖ ).   (25)

Recall that x denotes the original and y the clipped signal; hence, the input SDR is computed as SDR(x, y). With respect to the human perception of clipping severity, the SDR is more meaningful than treating signals according to the clipping levels or according to the percentage of clipped samples [29]. The particular input SDR levels were chosen to cover the range from very harsh clipping to mild but still noticeable clipping. The specific values, along with the respective percentages of clipped samples, are visualized in Fig. 2. Since the input SDR is used, there is no need to peak-normalize the audio samples before processing, because the number of clipped samples remains the same independently of scaling.

All the data have been processed and evaluated using MATLAB in double precision; therefore, there is no additional distortion caused by quantization during the process.

Fig. 2. Percentage of the clipped samples for the selected input SDRs (one curve per excerpt: violin, clarinet, bassoon, harp, glockenspiel, celesta, accordion, guitar, piano, wind ensemble).
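For illustration, the following MATLAB sketch shows how a clipping threshold θ_c matching a prescribed input SDR can be found by bisection, using the hard clipping model (1) and the SDR definition (25). The signal x here is a synthetic placeholder standing in for one of the SQAM excerpts, and the target value is only an example from the tested range (roughly 1 to 20 dB).

```matlab
% Find the clipping threshold theta_c that yields a prescribed input SDR
% (illustrative sketch; x stands in for a peak-normalized test excerpt).
x = randn(7 * 44100, 1); x = x / max(abs(x));     % placeholder for a 7 s excerpt

sdr  = @(u, v) 20 * log10(norm(u) / norm(u - v)); % Eq. (25)
clip = @(s, th) sign(s) .* min(abs(s), th);       % hard clipping, Eq. (1)

target_sdr = 5;                                   % example input SDR in dB
lo = 0; hi = max(abs(x));                         % bisection bounds on theta_c
for it = 1:50
    th = (lo + hi) / 2;
    if sdr(x, clip(x, th)) < target_sdr
        lo = th;                                  % too much distortion: raise the threshold
    else
        hi = th;                                  % too little distortion: lower the threshold
    end
end
theta_c = (lo + hi) / 2;
y = clip(x, theta_c);                             % clipped observation
```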
B. Evaluation metrics
To evaluate the restoration quality, we use several metrics.
1) Signal-to-Distortion Ratio:
Firstly, we utilize the signal-to-distortion ratio (SDR), which is one of the simplest, yet most widely used, metrics. It expresses the physical quality of the restoration, i.e., how numerically close the recovered signal x̂ is to the ground truth x. The restored signal x̂ is evaluated using the output SDR, computed as SDR(x, x̂) according to (25). Note that evaluating the SDR on the whole signal may handicap the methods that produce signals inconsistent in the reliable part. (Some such methods replace the reliable samples with the reliable samples of the input clipped signal y as the final step of the restoration.) Therefore, as, e.g., in [20], we compute the SDR on the clipped part only, SDR_c, as

SDR_c(x, x̂) = 20 log_10 ( ‖[M_H; M_L] x‖ / ‖[M_H; M_L] x − [M_H; M_L] x̂‖ ).   (26)

Since this article aims at signal restoration, we will rather use the SDR improvement, i.e., the difference between the SDR of the restored and of the clipped signal, formally defined as

∆SDR_c = SDR_c(x, x̂) − SDR_c(x, y),   (27)

and similarly for ∆SDR. Note that this criterion does not take into consideration whether a method is or is not consistent in the clipped part, since (26) does not either. Note also that in the case of consistency in the reliable part, ∆SDR produces the same values whether the SDR is computed on the whole signal or on the clipped samples only, and ∆SDR and ∆SDR_c coincide.
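As a small worked example, the sketch below evaluates (26) and (27) in MATLAB, assuming x, y and theta_c from the earlier sketch and a restored signal x_hat produced by some declipping method; the clipping masks are derived directly from the clipped input.

```matlab
% SDR on the clipped samples only, Eq. (26), and its improvement, Eq. (27).
% Assumes x (clean), y (clipped), x_hat (restored) and theta_c are available.
H = (y >=  theta_c);                          % samples clipped from above
L = (y <= -theta_c);                          % samples clipped from below
c = H | L;                                    % all clipped positions

sdr   = @(u, v) 20 * log10(norm(u) / norm(u - v));
sdr_c = @(ref, est) sdr(ref(c), est(c));      % restriction to the clipped part

delta_sdr_c = sdr_c(x, x_hat) - sdr_c(x, y);  % Delta SDR_c, Eq. (27)
```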
2) PEAQ:
A good quality assessment should correspond as much as possible to perceptual experience. From this point of view, the SDR is not the best metric, since physical similarity with the original signal does not automatically imply perceptual similarity and quality. Hence, an evaluation metric reflecting the human perceptual system should be used. PEAQ (Perceptual Evaluation of Audio Quality), which became the ITU-R recommendation BS.1387 in 1999, is considered a standard for audio quality evaluation. The final output of PEAQ is the Objective Difference Grade (ODG), rating the perceived difference between the clean and the degraded signal. The ODG scale is shown in Table II. For the experiments, we used the MATLAB implementation developed according to the revised version of PEAQ (BS.1387-1) and available together with a detailed review of this method [82]. Unfortunately, this implementation is suited only for signals with 48 kHz sampling frequency. Since the audio database used is sampled at 44.1 kHz, we upsample the signals in order to compute the PEAQ ODG.
TABLE II
OBJECTIVE DIFFERENCE GRADE

ODG   Impairment description
 0    Imperceptible
−1    Perceptible, but not annoying
−2    Slightly annoying
−3    Annoying
−4    Very annoying
3) PEMO-Q:
As another evaluation metric taking into account the human auditory system, we use the PEMO-Q method published in [83]. The MATLAB implementation is freely available for academic use and research. PEMO-Q computes the perceptual similarity measure (PSM), which can be mapped to the ODG score (see Table II). The mapping to the ODG is available only for signals with 44.1 kHz sampling frequency, but since this is the sampling rate of the database used, no additional resampling is required.

C. Algorithms and settings
Parameter fine-tuning is a necessary part of the experiment, as the overall results depend highly on the parameter selection. In our experiments, we attempted to tune the parameters of each algorithm to produce the best possible restoration result in terms of the SDR.

Several of the above-presented algorithms employ a time-frequency (TF) transform and/or processing by blocks. For such cases, we tried to unify this setting across the algorithms to ensure a fair comparison. The optimal way would be to tune the parameters for every input SDR separately (for instance, harsh clipping with a great number of clipped samples benefits from the use of longer windows), but for simplicity we used a compromise among all the cases, and each method's parameters stay fixed for all test signals and all clipping levels. Specifically, if the algorithm processes the signal block by block, blocks of 8192 samples (approximately 186 ms) are used. If the algorithm utilizes the DGT, we used the 8192 samples long Hann window with 75% overlap and 16384 frequency channels. Unfortunately, such a setting could not be used in C-OMP, NMF and DL due to the computational complexity of these algorithms. For C-OMP and DL, we used windows of 1024 samples, with 75% overlap, and twice-redundant dictionaries of size P = 2048. The NMF algorithm used windows of size 2048, with 2048 frequency channels. The implementation of the different TF transforms is handled by the LTFAT toolbox [36] in most of the methods.

As for the termination criterion, we resort to using just a simple maximum number of iterations. This number was empirically set for each algorithm independently to make sure that the algorithm fully converges. The only exceptions are the SPADE algorithms, which have the ε parameter involved directly in the problem formulation, see (5) and (7); ε is thus naturally used as the termination threshold.

If an algorithm produces a signal inconsistent in the reliable part, we do not replace that part with the original reliable samples before the evaluation. Naturally, such a replacement increases the overall output SDR. However, note that in terms of perceptual metrics, doing this can be beneficial for high input SDR values; in such a case, we found empirically that the PEMO-Q and PEAQ ODG improve a little. In general, nevertheless, the replacement causes discontinuities in the waveform, which leads to the introduction of artificial higher harmonics that in the end degrade the restoration quality. Naturally, this effect is more pronounced for low input SDR values, where this kind of degradation prevails over the advantage of matching the reliable samples, resulting in a decrease of the ODG. Note that these additional variants were excluded from the comparison for clarity.
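The unified DGT setting can be sketched in MATLAB as follows, assuming the LTFAT functions firwin, dgtreal, gabdual and idgtreal are available on the path; the evaluated implementations may instantiate the transform differently.

```matlab
% Unified DGT setting (sketch): 8192-sample Hann window, 75% overlap,
% 16384 frequency channels; assumes the LTFAT toolbox [36].
w = 8192;                        % window length (~186 ms at 44.1 kHz)
a = w / 4;                       % hop size, i.e., 75% overlap
M = 16384;                       % number of frequency channels (twice redundant)
g = firwin('hann', w);           % analysis window

Ls = numel(y);                   % y: signal to be analyzed (e.g., the clipped input)
c  = dgtreal(y, g, a, M);        % analysis: DGT coefficients of the real signal
gd = gabdual(g, a, M);           % canonical dual window for synthesis
yr = idgtreal(c, gd, a, M, Ls);  % synthesis: reconstructs y up to numerical precision
```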
1) Constrained OMP:
C-OMP was tested using the implementation provided by the authors of [15], [42] in the SMALLbox MATLAB toolbox (http://small.inria.fr/software-data/smallbox/). We used the DGT-based implementation with the min-constraint, as we found that it provided a good tradeoff between performance and computational complexity. Note that when the clipping level is low (many samples missing), the constrained optimization (the final step of Alg. 1) often failed to converge. We believe this is because the support set Ω estimated with the OMP (without any clipping constraint) is often suboptimal, leading to a signal that is clipping-inconsistent. As a result, a signal that belongs to the range space of D_Ω and to the clipping consistency set Γ_H ∩ Γ_L might not exist, or might have a very large amplitude, thus making the constraint D_Ω z_Ω ∈ Γ_H ∩ Γ_L infeasible. In that scenario, following the guidelines from the authors in [42] (implemented in the accompanying code), we simply return the output of the (unconstrained) OMP as the solution.
2) SPADE:
Although we have the original implementation of A-SPADE, we use our own implementation, which is slightly improved over the original version. The main differences are in the way the signal is windowed and in the hard-thresholding step, where we take into account the complex-conjugate structure of the DFT and always process pairs of coefficients (hence producing purely real signals with the inverse DFT). The above applies also to the S-SPADE implementation. Both SPADE algorithms process the signal by overlapping blocks. Each block is processed separately and the blocks are folded back using the standard overlap-add (OLA) procedure. The frequency representation in each block is computed using a twice-redundant DFT (forming a Parseval frame). The parameters of S-SPADE and A-SPADE are identical and they correspond to the description in Sec. IV-B. It is fairly easy and intuitive to tune them. The algorithm works very well with the default parameters (s = 1, r = 1). During testing, we found out that for the most extreme clipping (input SDR = 1 dB) it helps to increase the number of iterations by incrementing k only every second iteration, i.e., setting r = 2. This option lifted the average SDR by 1.2 dB.

The termination criterion is based on the minimized residue, i.e., on Ax − z for A-SPADE and on x − Dz for S-SPADE. The algorithms run until the ℓ2 norm of the residue is smaller than ε. We used the default value of ε from the original papers. Decreasing ε may increase the number of iterations, but it does not improve the overall restoration quality (in terms of any of the three considered quality measures).
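The conjugate-pair hard thresholding mentioned above can be sketched as follows; this is an illustrative MATLAB function, not a verbatim excerpt of our implementation, and it assumes that the column vector z holds the DFT of one real, even-length block.

```matlab
function z = hard_threshold_pairs(z, k)
% Hard thresholding that respects the complex-conjugate structure of the DFT
% of a real block: conjugate pairs are kept or zeroed together, so the inverse
% DFT remains purely real (illustrative sketch).
    N    = numel(z);                         % even-length DFT of one block (column)
    half = z(1:N/2 + 1);                     % unique bins: DC ... Nyquist
    [~, idx] = sort(abs(half), 'descend');
    keep = false(size(half));
    keep(idx(1:min(k, numel(half)))) = true; % keep the k largest unique bins
    half(~keep) = 0;
    z = [half; conj(half(end-1:-1:2))];      % mirror back with conjugate symmetry
end
```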
3) Plain ℓ1 minimization: The algorithms based on the ℓ1 relaxation described in Sec. IV-D are designed to process the input signal all at once using the DGT, with the DGT parameters specified above in this section. The synthesis variant was computed using the DR algorithm (Alg. 4) and the analysis variant was solved by the CP algorithm (Alg. 5). It was quite difficult to tune the parameters of these algorithms, since each test sound required a slightly different setting to obtain reasonable convergence. It also sometimes happens that the output SDR starts to drop after reaching its peak and then stabilizes at a lower value. In such a case, we let the algorithm reach the maximum number of iterations and take the result from the final iteration. Note that the just-described behavior is more common for the methods described below that employ the same DR and CP algorithms but use coefficient weighting (i.e., weights different from all ones). Because of these issues, we set all the parameters to unity, which turns out to be a compromise covering all the cases; we set λ = γ = 1 for the DR algorithm and ζ = σ = ρ = 1 for the CP algorithm. In both algorithms, the termination criterion was simply set to 3000 iterations, by which point the algorithms had reached the minimum with sufficient accuracy.
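As an illustration of the synthesis variant, the following MATLAB sketch shows a generic Douglas–Rachford iteration for min_z ‖z‖_1 subject to Dz lying in the clipping-consistent set Γ. It is not a verbatim transcription of Alg. 4: D and Dt are assumed function handles for a Parseval-tight synthesis/analysis pair (so the projection onto the constraint set takes the simple closed form used below), and y, H, L and theta_c follow the earlier sketches.

```matlab
% Generic Douglas-Rachford sketch for the synthesis l1 declipping problem
% min ||z||_1  s.t.  D(z) is clipping-consistent (set Gamma).
% Assumes D (synthesis) and Dt (analysis) satisfy D(Dt(u)) = u for any signal u.
rel       = ~(H | L);                                 % reliable positions
soft      = @(z, t) sign(z) .* max(abs(z) - t, 0);    % prox of t*||.||_1 (complex-safe)
projGamma = @(u) y .* rel + max(u, theta_c) .* H + min(u, -theta_c) .* L;
projC     = @(z) z + Dt(projGamma(D(z)) - D(z));      % projection onto {z : D(z) in Gamma}

gamma = 1; lambda = 1;                                % unit parameters, as in the text
zdr = Dt(y);                                          % initialize from the clipped signal
for it = 1:3000
    zc  = projC(zdr);                                 % feasible coefficients
    zdr = zdr + lambda * (soft(2*zc - zdr, gamma) - zc);
end
x_hat = D(projC(zdr));                                % declipped signal estimate
```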
4) Declipping in Sparseland (Rℓ1CC):
The original codes of the Rℓ1CC method rely on CVX, whose disadvantage is that the transforms are handled only in matrix form; this prevents CVX from being used with longer window lengths. Therefore, we re-implemented the original approach using the DR algorithm, as described in Alg. 6. For the DR algorithm, we used the same setting as in the non-weighted case, i.e., λ = γ = 1, but the maximum number of DR iterations was set to 1000. The original paper [16] uses 10 outer-cycle iterations of the Rℓ1CC algorithm, but our implementation uses only 6, since after the sixth iteration the performance started to decrease.

Concerning the operators, we use the DGT instead of the DFT used for the toy examples in [16]. The parameter ε used for updating the weights was set to a small constant. Besides that, we introduced another parameter, δ: the inner cycle is terminated if the relative change of the DGT coefficients between two subsequent iterations drops below δ. This modification is used just to speed up the computations and has no effect on the restoration quality.

On top of that, we also include the analysis variant using the CP algorithm (with ζ = σ = ρ = 1 and 1000 inner iterations). Also in the case of the analysis variant, the restoration quality starts to decrease from the seventh outer iteration.
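The reweighting loop itself follows the usual pattern, sketched below in MATLAB. The inner solver solve_weighted_l1 is a hypothetical placeholder for the weighted DR (or CP) routine, eps_w stands for the small constant mentioned above (not its exact value), and D is the synthesis handle from the previous sketch.

```matlab
% Outer reweighting cycle (illustrative sketch of the Rl1CC scheme).
eps_w = 1e-3;                          % assumed placeholder for the small constant
w = [];                                % empty weights = plain (unweighted) l1 in pass 1
for outer = 1:6                        % six outer iterations, as used in our tests
    z_hat = solve_weighted_l1(y, w);   % hypothetical inner solver (DR or CP based)
    w = 1 ./ (abs(z_hat) + eps_w);     % emphasize currently small coefficients
end
x_hat = D(z_hat);                      % synthesize the final estimate
```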
5) Social Sparsity:
For the experiments, we used the implementation of the algorithm kindly provided by the authors of [20]. For clarity, only the best-performing variants are included in the evaluation, i.e., the Empirical Wiener (EW) and the Persistent EW (PEW). In the case of the PEW, one needs to specify the size of the coefficient neighborhood in the TF plane. For our test case (audio at 44.1 kHz and the DGT), the best-performing size was the 3 × 7 neighborhood (i.e., 3 coefficients in the direction of frequency and 7 coefficients in time).

One needs to carefully tune the number of inner and outer iterations and the distribution of the parameter λ during the course of the iterations. In the final algorithm, we used 500 inner and 20 outer iterations, with λ decreasing logarithmically over the outer iterations. As for the step size, the authors of [20] report using a fixed value of γ smaller than one, leading to fast convergence. Following the codes provided by the authors, however, our γ develops according to the formula (k − 1)/(k + 5), where k is the iteration counter. Such an approach corresponds to the acceleration in FISTA [66].

Sometimes it happens that the optimization gets stuck (especially in the first couple of outer iterations) and starts to converge again with the next outer iteration (i.e., when λ is decreased). For this reason, we slightly modified the algorithm by adding a threshold δ, used to break the outer iteration even if the maximum number of inner iterations has not been reached: the ℓ2 norm of the difference between the time-domain solutions of the current and the previous iteration is compared with δ, which was set to a small value. Even though the algorithm does not produce signals consistent in any earlier defined sense, the result is usually not too far from the set Γ.
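The PEW shrinkage with the 3 × 7 neighborhood can be sketched as follows; this is a generic illustration after [20], [63] (coefficients are shrunk according to the energy accumulated over their TF neighborhood), and the authors' actual code may differ in details such as normalization.

```matlab
function Zs = pew_shrink(Z, lambda)
% Persistent Empirical Wiener shrinkage on a (frequency x time) coefficient
% matrix Z, using a 3 x 7 time-frequency neighborhood (illustrative sketch).
    nbhd   = ones(3, 7);                        % 3 bins in frequency, 7 frames in time
    energy = conv2(abs(Z).^2, nbhd, 'same');    % energy pooled over the neighborhood
    gain   = max(1 - lambda^2 ./ energy, 0);    % empirical-Wiener-type gain
    Zs     = Z .* gain;                         % phase of the coefficients is preserved
end
```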
6) Perceptual Compressed Sensing:
Unfortunately, we were not able to obtain either the implementation of this method or any example of its output signals. The authors state that it relies on CVX, which is not quite practical for our experiment. Since the scores reported in [19] look promising, we re-implemented the method using the Condat–Vũ algorithm, which mimics the algorithmic scheme suggested in [19]. The signal is processed block by block, as in the SPADE algorithms. Our experiment includes the non-weighted variant, CSL1, the perceptually weighted variant, PCSL1, and, inspired by the very good results of quadratic weights in [31], an additional parabola-weighted variant, coined PWCSL1. The parameters γ, σ, τ and ρ of the CV algorithm were fixed to the same values for all three mentioned variants (with σ = 1). The maximum number of iterations was set to 500 for CSL1 and PCSL1 and to 5000 for PWCSL1.

In contrast to the original article, the "official" implementation of the MPEG psychoacoustic model 1 (PM 1) could not be used, because it is strictly limited to 512-sample-long windows. In this survey, our goal is to compare the algorithms with the best possible settings and with the same DGT settings across all methods; therefore, we wanted to work with the 8192 samples long Hann window with 75% overlap and 16384 frequency channels. Hence, instead of the official implementation, we had to switch to a slightly modified and simplified version of the MPEG Layer 1 PM 1, which is not restricted in terms of the block length.

The PCSL1 algorithm places the original reliable samples at the reliable positions at the very end of the processing. This leads to some ODG gain for mild clipping levels. However, as discussed earlier, we present results without such a replacement.
7) Psychoacoustically motivated ℓ1 minimization: The algorithms used here are the same as in the case of the plain ℓ1 minimization (Sec. V-C3), but the weights are now perceptually motivated. The original paper [31] presented several ways of incorporating the psychoacoustic information into the declipping process (only for the synthesis case). The best option turned out to be simple weights that grow quadratically with frequency. In the experiments, we present only this "parabola-weighted" option, but we newly include the analysis variant as well. All the parameters and the maximum number of iterations for both DR and CP are identical to the plain case.
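For illustration, such parabolic weights might be constructed as in the MATLAB sketch below; the exact normalization and offset used in [31] may differ, and num_frames is a placeholder for the number of DGT time frames.

```matlab
% Coefficient weights growing quadratically with frequency (illustrative
% sketch of the "parabola-weighted" option; the scaling in [31] may differ).
Mhalf = 16384/2 + 1;                    % unique frequency channels of the real DGT
num_frames = 100;                       % placeholder: number of DGT time frames
f = (0:Mhalf-1).' / (Mhalf-1);          % normalized frequency in [0, 1]
w_freq = 1 + f.^2;                      % quadratic growth with frequency
W = repmat(w_freq, 1, num_frames);      % same weights in every time frame
% the weighted objective is then sum(sum(W .* abs(Z))) over the coefficients Z
```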
8) Dictionary Learning approach:
For the dictionary learning algorithm, each signal is first decomposed into T overlapping windows y_1, ..., y_T of size 1024 (cf. the settings above), for a total of approximately T = 1200 windows per signal. These are directly used as inputs to the algorithm, such that the dictionary is learned and evaluated on the same signal. As the optimal sparsity parameter K depends on the signal as well as on the clipping level, we adopt here an adaptive sparsity strategy. At the first sparse coding step (20), we first iterate (20) with K = 1, then sequentially increase K every few iterations, until an error threshold ε is reached. The resulting sparsity level K̂ is then fixed throughout the rest of the algorithm. The dictionary is initialized with a real DCT dictionary of size P = 2048. We perform 20 iterations of the sparse coding and dictionary update steps. The sparse coding steps (apart from the first one, which uses the adaptive sparsity strategy) are computed using 20 iterations of consistent IHT. The dictionary update step is computed using 20 steps of accelerated gradient descent. To improve the convergence, the sparse coefficients and the dictionary are always initialized using the estimates from the previous iteration. Using the learned dictionary D̂ and the sparse coefficients ẑ_1, ..., ẑ_T, each of the individual windows is reconstructed as D̂ẑ_1, ..., D̂ẑ_T, and the estimated signal is then recovered using overlap-add.
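The adaptive sparsity strategy for the first sparse coding step might look as in the following MATLAB sketch. The function consistent_iht_step, the mask M_R (selecting the reliable samples of the window y_t), the dictionary D and the specific constants are hypothetical placeholders; only the overall logic of gradually increasing K until the misfit threshold is met follows the description above.

```matlab
% Adaptive sparsity for the first sparse coding step (illustrative sketch):
% start with K = 1 and raise K every few consistent-IHT iterations until the
% misfit on the reliable samples drops below a threshold.
K = 1; eps_err = 1e-3;                          % assumed placeholder threshold
z = zeros(P, 1);                                % sparse code of one window
for it = 1:200
    z = consistent_iht_step(y_t, D, z, K);      % hypothetical: one clipping-consistent IHT update
    if norm(M_R * (y_t - D*z)) < eps_err        % error on the reliable samples only
        break;
    end
    if mod(it, 5) == 0                          % "every few iterations" (placeholder: 5)
        K = K + 1;
    end
end
K_hat = K;                                      % kept fixed for the rest of the algorithm
```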
9) NMF:
As for the NMF-based declipping, we used the STFT with a sine window of size F = 2048 samples and 50% overlap. The NMF model order was set to K = 20. The GEM algorithm (Alg. 10) was run for 50 iterations. These choices follow those made in the corresponding paper [23], except for the STFT window size, which is 64 ms (1024 samples for 16 kHz-sampled signals) in [23] and 46.4 ms (2048 samples for 44.1 kHz-sampled signals) here. Note that the computational load of the NMF approach grows drastically with increasing window length. This is due to the matrix inversion in the Wiener filtering (24) and to the iterative ad hoc clipping constraint management strategy described at the end of Sec. IV-J. As such, even with the 2048 samples long window, declipping of some sequences took more than 8 hours. This is why we have not chosen a longer window size for the experiments.
10) Janssen:
For the evaluation of the Janssen algorithm, we adapted the codes published in the SMALLbox. The signal is processed block by block. The order of the AR filter does not play a significant role in the quality of declipping; it turns out that the number of iterations is significantly more important. The ∆SDR increases with the number of iterations, but from a certain point the algorithm starts to diverge wildly. The reason is probably that the algorithm does not have enough reliable data to fit the AR parameters to the known signal. This turning point differs among the test signals and is highly influenced by the input SDR (the smaller the input SDR, the sooner the divergence appears). In our experiments, the order of the filter is set to 512 and we run 3 iterations. Using 5 iterations produces slightly better results, but for one of the test signals at 1 dB input SDR the divergence appears, devaluing the average score.
D. Computational cost
Though the survey concentrates primarily on the restoration quality, some remarks on the computational cost of the methods are valuable. Below we quote the rough computational time needed to process one second of audio (i.e., 44100 samples). No parallel computing is considered, and the time spent on tuning the parameters is not included. The overall time needed to completely recompute our experiments is estimated at one month when performed on a common PC.

1) C-OMP: It takes 5 to 10 minutes to declip one second of the signal.
2) SPADE: The clipping threshold determines the performance: the higher the input SDR, the longer it takes for the SPADE algorithms to converge. Processing one second of audio takes from 22 up to 64 seconds for A-SPADE and from 14 up to 52 seconds for S-SPADE.
3) Plain ℓ1: The computational time is roughly 20 seconds for both the DR and CP algorithms, independent of the clipping threshold.
4) Reweighted ℓ1: It takes 66 seconds for the DR algorithm and 56 seconds for the CP algorithm.
5) Social Sparsity: Slightly less than 2 minutes for EW and slightly more than 2 minutes in the case of PEW.
6) Perceptual Compressed Sensing: CSL1 and PCSL1 roughly 20 seconds, PWCSL1 below three minutes.
7) PW ℓ1: Same as for Plain ℓ1.
8) Dictionary Learning: 1 to 2 minutes depending on the clipping level, the algorithm generally converging a bit faster when the clipping level is low.
9) NMF: The average computation time is 30 minutes per one second of audio. The particular time depends on the input SDR; for the lowest, the cost can rise up to 1 hour.
10) Janssen: Depends heavily on the input SDR; the durations differ by two orders of magnitude: for 1 dB input SDR, Janssen takes about 16 minutes per second of audio, and for 20 dB input SDR it is done in 5–15 s.

E. Results and discussion
The results of the declipping in terms of performance are presented and commented on in this section. Recall that the comparison is done in terms of three objective metrics: ∆SDR_c, PEAQ, and PEMO-Q. In the bar graphs that follow, algorithms coming from the same family share the same color. If a method was examined in both the analysis and the synthesis variant, the analysis variant is graphically distinguished via hatching. Other variants (e.g., multiple shrinkage operators in the SS algorithms or different weights within the CSL family) use gray stippling. The abbreviations used in the legends are used all over the text, but are also summarized in Table III.

The ∆SDR_c results are presented in Fig. 3, the PEAQ ODG values in Fig. 4 and the PEMO-Q ODG values in Fig. 5. The reader can easily draw a number of conclusions by studying the plots; nevertheless, we try to summarize the most important and interesting facts inferred from these results. We concentrate more on the ODG score, designed to reflect properties of the human auditory system.

First of all, note that the SDR scores correlate with the ODG scores to a certain degree. There are exceptions, however: compare, for example, Rℓ1CC CP and Rℓ1CC DR, which behave in just the opposite way (SDR versus ODG). Note also that the ODG values of PEMO-Q are uniformly worse than those of PEAQ; however, the relation between the scores of the individual methods is retained. An exception is the family of CSL1, PCSL1, PWCSL1, where we observe a difference.

The main messages can be summarized as follows:
• Both variants of the SPADE algorithm perform similarly and very well in terms of all three metrics and across all levels of degradation, while the analysis variant is slightly preferred.
• Using social sparsity leads to very good results. In particular, the SS PEW method (which assumes persistence of frequencies in time) performs overall best in terms of the SDR and among the best few in terms of the perceptual measures.
• In the medium to mild clipping regime, NMF is the clear winner in terms of ODG and also performs very well in SDR. With more severe clipping (1 and 3 dB) it behaves worse, but is still very competitive.
• Despite its simplicity, the results of the parabola-weighted ℓ1 minimization are uniformly very good, in terms of all three metrics. Its SDR values more or less correspond to those of SPADE, and the ODG scores show that for harsh clipping, PW ℓ1 is even the best declipping method.
• Introducing reweighting improves over the plain ℓ1 minimization, especially for the analysis variant, but this holds only for the SDR results. In terms of ODG, the effect is just reversed. (This corresponds to our experience with reweighting in audio inpainting, see [61], but we unfortunately do not have an explanation for this effect.) In fact, the performance of the plain ℓ1 can be found surprisingly satisfactory in the PEAQ ODG graph.
• The family of psychoacoustically weighted optimizations (CSL1) failed. The best results are achieved using the parabolic weights, which are in fact not psychoacoustically motivated. These observations are especially interesting since the original article [19] reported better declipping quality (but on a different dataset).
Fig. 3. Average ∆SDR_c results. The abbreviations from the legend are summarized in Table III.
Fig. 4. Average PEAQ ODG results. The PEAQ ODG of the clipped signal is depicted in gray.
Fig. 5. Average PEMO-Q ODG results.

• The low scores of dictionary learning may probably be attributed to the fact that it uses the IHT algorithm in its sparse coding step (see Alg. 9), which has recently been surpassed by its successors, for example SPADE. A second factor could be that the initial dictionary is the real-valued DCT and that the iterates of Alg. 9 remain in the real domain, causing phase-related artifacts.
• The Janssen method performs well only in the very high input SDR regime; otherwise it fails due to the lack of reliable samples. Recall that Janssen was included in the study since it performs on par with state-of-the-art methods in audio inpainting (of compact signal gaps). Clearly, the hypothesis that it could be similarly successful in declipping is not validated.

Besides the restoration quality, which is the main concern of the article, other factors can also be taken into consideration. For example, consider the rough cost of computation reported in Sec. V-D. According to these values, the NMF is 90 times slower than the PW ℓ1 algorithms, while producing results of almost identical quality for most clipping levels. On the other hand, some methods require painful parameter tuning to achieve good results. In that regard, NMF or SPADE in particular can be seen as advantageous.

F. Software and data
Besides the numerical evaluation, the intent of the article is to collect the implementations of the examined methods and to make them publicly available, both for reproducibility purposes and to stimulate future research in this area. The webpage https://rajmic.github.io/declipping2020/ contains the link to the MATLAB implementations (except the NMF, which is not publicly available). The tests have been performed in MATLAB version 2019b.

This article provides an objective evaluation, though PEAQ and PEMO-Q pursue being as close as possible to human perception. An individual subjective assessment, which is always the most decisive, can be made via the supplied webpage, where all sound examples are directly playable (or downloadable).

TABLE III
TABLE OF EXAMINED ALGORITHMS AND THEIR ABBREVIATIONS AND REFERENCES. SOME OF THE ALGORITHMS WERE PROPOSED IN THIS ARTICLE IN ORDER TO MAKE IT AS COMPLETE AS POSSIBLE. TWO ALGORITHMS ARE NOT PRESENTED.

Abbreviation | Full name | Algorithm utilized | Reference
C-OMP | Constrained Orthogonal Matching Pursuit | Alg. 1 | Adler'11 [15]
A-SPADE | Analysis SParse Audio DEclipper | Alg. 2 | Kitić'15 [24]
S-SPADE | Synthesis SParse Audio DEclipper | Alg. 3 | Záviška'19 [30]
ℓ1 CP | ℓ1-minimization using Chambolle–Pock (analysis) | Alg. 5 | proposed
ℓ1 DR | ℓ1-minimization using Douglas–Rachford (synthesis) | Alg. 4 | Rajmic'19 [56]
Rℓ1CC CP | Reweighted ℓ1-min. with Clipping Constraints using Chambolle–Pock (analysis) | — | proposed
Rℓ1CC DR | Reweighted ℓ1-min. with Clipping Constraints using Douglas–Rachford (synthesis) | Alg. 6 | Weinstein'11 [16]
SS EW | Social Sparsity with Empirical Wiener | Alg. 7 | Siedenburg'14 [20]
SS PEW | Social Sparsity with Persistent Empirical Wiener | Alg. 7 | Siedenburg'14 [20]
CSL1 | Compressed Sensing method minimizing ℓ1-norm | Alg. 8 | Defraene'13 [19]
PCSL1 | Perceptual Compressed Sensing method minimizing ℓ1-norm | Alg. 8 | Defraene'13 [19]
PWCSL1 | Parabola-Weighted Compressed Sensing method minimizing ℓ1-norm | Alg. 8 | proposed
PW ℓ1 CP | Parabola-Weighted ℓ1-minimization using Chambolle–Pock (analysis) | Alg. 5 | proposed
PW ℓ1 DR | Parabola-Weighted ℓ1-minimization using Douglas–Rachford (synthesis) | Alg. 4 | Záviška'19b [31]
DL | Dictionary Learning approach | Alg. 9 | Rencker'18 [28]
NMF | Nonnegative Matrix Factorization | Alg. 10 | Bilen'15 [23]
Janssen | Janssen method for inpainting | — | Janssen'86 [11]

VI. CONCLUSION AND PERSPECTIVES
The article presented the declipping problem and an overview of methods used to solve it. Besides such a survey, several popular declipping methods of different kinds were selected for deeper evaluation. This is the first time so many methods are compared on the same audio dataset (moreover sampled at 44.1 kHz). The main focus of the article was the restoration quality, which is reported in terms of three metrics. However, other factors, such as the computational cost and the complexity of tuning parameters, are also discussed.

The algorithms studied and compared in this paper exhibit various performances and computational times. Some algorithms perform better at low clipping levels, while others perform better at high clipping levels. The choice of algorithm thus depends on the input data. Nevertheless, the methods based on social shrinkage, nonnegative matrix factorization, weighting of the transform coefficients and, last but not least, the greedy SPADE seem to yield results that make them preferred choices. Depending on the application, the computational time of each algorithm and the time-consuming tuning of parameters might also be a decisive selection criterion.

Directions for future research may include combining the strategies and modeling assumptions of the various algorithms presented in this paper. For instance, the social sparsity regularizer of [20], or the perceptual weights of [19], could be combined with S-SPADE or other algorithms. Dictionary learning could be combined with S-SPADE or social sparsity. Most algorithms discussed here use the synthesis model; however, developing their analysis counterparts could also be a promising idea. We could also imagine assigning weights in order to favor clipping consistency. Finally, we have focused in this paper on unsupervised techniques that do not rely on a training set with clean data. However, the success of supervised techniques, and in particular deep learning based techniques, in tackling many other problems in computer vision, speech recognition, and audio analysis, motivates the further study of supervised techniques for audio declipping. Recent deep learning based approaches to audio declipping have shown promising results in the context of speech declipping [6], [7], [8]. A potential research direction would be to combine the power of supervised techniques with the signal assumptions, modeling and algorithms discussed in this article. One such direction could be the recent finding that artificial networks that bear the structure of unfolded proximal algorithms are able to join signal modeling with learning from data, possibly keeping advantages of both distinct worlds; see for example [84] in the context of image processing.

Note that this survey and the papers under consideration only investigate declipping of signals that are clipped in the digital domain. However, when clipping occurs in the analog domain, the situation is different, since before the A/D conversion a low-pass filter is applied to avoid aliasing. Since the clipping distortion is wide-band, clipping aliasing [85], a quite unpleasant distortion, is present in the latter case. This effect is well known to audio engineers, and a digital aliasing-free clipping or compression of dynamics is often realized via upsampling [86]. For example, [87] addresses the aliasing reduction in digitally clipped signals, though without declipping itself.
While the methods covered by this survey reduce both the aliasing and the remaining narrow-band clipping distortion, if the clipping is realized in the analog domain or in a particular aliasing-free manner (e.g., via upsampling [86]), different new declipping algorithms should be developed and applied, simply because the clipping process is different in that case.

ACKNOWLEDGMENT
The authors would like to thank M. Kowalski for the implementations related to the paper [20], O. Mokrý for computing the results of the Janssen method, and Z. Průša for helping with the accompanying HTML page. Thanks to S. Kitić and N. Bertin for discussing the SPADE algorithms and projections with tight frames.

The work of P. Záviška and P. Rajmic was supported by the Czech Science Foundation (GAČR) project number 20-29009S. The work of L. Rencker was supported by the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 MacSeNet.
REFERENCES

[1] C.-T. Tan, B. C. J. Moore, and N. Zacharov, "The effect of nonlinear distortion on the perceived quality of music and speech signals,"
J. Audio Eng. Soc., vol. 51, no. 11, pp. 1012–1031, 2003.
[2] J. Málek, "Blind compensation of memoryless nonlinear distortions in sparse signals," in , Sept 2013.
[3] Y. Tachioka, T. Narita, and J. Ishii, "Speech recognition performance estimation for clipped speech based on objective measures,"
AcousticalScience and Technology , vol. 35, no. 6, pp. 324–326, 2014.[4] A. Ozerov and C. F´evotte, “Multichannel nonnegative matrix factoriza-tion in convolutive mixtures for audio source separation,”
IEEE TASLP ,vol. 18, no. 3, pp. 550–563, 2009.[5] R. Sathya and A. Abraham, “Comparison of supervised and unsuper-vised learning algorithms for pattern classification,”
Intl. Journal ofAdvanced Research in Artificial Intelligence , vol. 2, no. 2, 2013.[6] F. Bie, D. Wang, J. Wang, and T. F. Zheng, “Detection andreconstruction of clipped speech for speaker recognition,”
SpeechCommunication , vol. 72, pp. 218 – 231, 2015.[7] H. B. Kashani, A. Jodeiri, M. M. Goodarzi, and S. G. Firooz, “Image toimage translation based on convolutional neural network approach forspeech declipping,” 2019.[8] W. Mack and E. A. P. Habets, “Declipping speech using deep filtering,”in , Oct 2019, pp. 200–204.[9] A. Ozerov, C¸ . Bilen, and P. P´erez, “Multichannel audio declipping,” in , March 2016, pp. 659–663.[10] C. Gaultier, N. Bertin, and R. Gribonval, “Cascade: Channel-awarestructured cosparse audio declipper,” in , 2018.[11] A. J. E. M. Janssen, R. N. J. Veldhuis, and L. B. Vries, “Adaptiveinterpolation of discrete-time signals that can be modeled as autoregres-sive processes,”
IEEE Trans. Acoustics, Speech and Signal Processing ,vol. 34, no. 2, pp. 317–330, 4, 1986.[12] J. Abel and J. Smith, “Restoring a clipped signal,” in
IEEE ICASSP ,1991. pp. 1745–1748 vol.3.[13] W. Fong and S. Godsill, “Monte carlo smoothing for non-linearlydistorted signals,” in , vol. 6, 2001, pp. 3997–4000vol.6.[14] A. Dahimene, M. Noureddine, and A. Azrar, “A simple algorithm for therestoration of clipped speech signal,”
Informatica , vol. 32, pp. 183–188,2008.[15] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, and M. Plumbley,“A constrained matching pursuit approach to audio declipping,” in
IEEEICASSP , 2011, pp. 329–332.[16] A. J. Weinstein and M. B. Wakin, “Recovering a clipped signal insparseland,”
Sampling Theory in Signal and Image Processing , vol. 12,no. 1, pp. 55–69, 2013.[17] S. Miura, H. Nakajima, S. Miyabe, S. Makino, T. Yamada, andK. Nakadai, “Restoration of clipped audio signal using recursive vectorprojection,” in
TENCON 2011 , Nov 2011, pp. 394–397. [18] S. Kiti´c, L. Jacques, N. Madhu, M. Hopwood, A. Spriet, andC. De Vleeschouwer, “Consistent iterative hard thresholding for signaldeclipping,” in
ICASSP , May 2013, pp. 5939–5943.[19] B. Defraene, N. Mansour, S. D. Hertogh, T. van Waterschoot, M. Diehl,and M. Moonen, “Declipping of audio signals using perceptual com-pressed sensing,”
IEEE Transactions on Audio, Speech, and LanguageProcessing , vol. 21, no. 12, pp. 2627–2637, Dec 2013.[20] K. Siedenburg, M. Kowalski, and M. Dorfler, “Audio declipping withsocial sparsity,” in
Acoustics, Speech and Signal Processing (ICASSP),2014 IEEE International Conference on . IEEE, 2014, pp. 1577–1581.[21] S. Kiti´c, N. Bertin, and R. Gribonval, “Audio declipping by cosparsehard thresholding,” in , 2014.[22] M. Jonscher, J. Seiler, and A. Kaup, “Declipping of speech signals usingfrequency selective extrapolation,” in
Speech Communication; 11. ITGSymposium , Sept 2014, pp. 1–4.[23] C¸ . Bilen, A. Ozerov, and P. P´erez, “Audio declipping via nonnegativematrix factorization,” in
Applications of Signal Processing to Audio andAcoustics (WASPAA), 2015 IEEE Workshop on , Oct 2015, pp. 1–5.[24] S. Kiti´c, N. Bertin, and R. Gribonval, “Sparsity and cosparsity for audiodeclipping: a flexible non-convex approach,” in
LVA/ICA 2015 , Liberec,Czech Republic, Aug. 2015.[25] M. J. Harvilla and R. M. Stern, “Efficient audio declipping usingregularized least squares,” in , April 2015.[26] T. Takahashi, K. Uruma, K. Konishi, and T. Furukawa, “Block adaptivealgorithm for signal declipping based on null space alternating opti-mization,”
IEICE Transactions on Information and Systems , vol. E98.D,no. 1, pp. 206–209, 2015.[27] F. Elvander, J. Sw¨ard, and A. Jakobsson, “Grid-less estimation of satu-rated signals,” in , Oct 2017, pp. 372–376.[28] L. Rencker, F. Bach, W. Wang, and M. D. Plumbley, “Consistentdictionary learning for signal declipping,” in
Latent Variable Analysisand Signal Separation . Springer International Publishing, 2018.[29] C. Gaultier, “Design and evaluation of sparse models and algorithmsfor audio inverse problems,” Theses, Universit´e Rennes 1, Jan. 2019.[30] P. Z´aviˇska, P. Rajmic, O. Mokr´y, and Z. Pr˚uˇsa, “A proper versionof synthesis-based sparse audio declipper,” in ,Brighton, UK, May 2019, pp. 591–595.[31] P. Z´aviˇska, P. Rajmic, and J. Schimmel, “Psychoacoustically motivatedaudio declipping based on weighted l1 minimization,” in , July 2019, pp. 338–342.[32] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convexprogramming,” http://cvxr.com/cvx.[33] O. Christensen,
Frames and Bases, An Introductory Course . Boston:Birkh¨auser, 2008.[34] M. Elad, P. Milanfar, and R. Rubinstein, “Analysis versus synthesis insignal priors,” in
Inverse Problems 23 (200) , 2005, pp. 947–968.[35] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions ofsystems of equations to sparse modeling of signals and images,”
SIAMReview , vol. 51, no. 1, pp. 34–81, 2009.[36] Z. Pr˚uˇsa, P. L. Søndergaard, N. Holighaus, C. Wiesmeyr, and P. Balazs,“The Large Time-Frequency Analysis Toolbox 2.0,” in
Sound, Music,and Motion , Springer International Publishing, 2014, pp. 419–442.[37] M. Elad,
Sparse and Redundant Representations: From Theory toApplications in Signal and Image Processing . Springer, 2010.[38] P. Combettes and J. Pesquet, “Proximal splitting methods in signalprocessing,”
Fixed-Point Algorithms for Inverse Problems in Science andEngineering , pp. 185–212, 2011.[39] L. Condat, “A generic proximal algorithm for convex optimization—application to total variation minimization,”
Signal Processing Letters,IEEE , vol. 21, no. 8, pp. 985–989, Aug 2014.[40] M. Fadili and J.-L. Starck, “Monotone operator splitting for optimizationproblems in sparse recovery.” IEEE Publishing, 2009, pp. 1461–1464.[41] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictio-naries,”
IEEE Transactions on Signal Processing , vol. 41, no. 12, pp.3397–3415, 1993.[42] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, and M. Plumbley,“Audio Inpainting,”
IEEE Transactions on Audio, Speech, and LanguageProcessing , vol. 20, no. 3, pp. 922–932, March 2012.[43] T. Takahashi, K. Konishi, and T. Furukawa, “Hankel structured matrixrank minimization approach to signal declipping,” in , Sep. 2013, pp. 1–5.[44] F. R. ´Avila, M. P. Tcheou, and L. W. P. Biscainho, “Audio soft declippingbased on constrained weighted least squares,”
IEEE Signal ProcessingLetters , vol. 24, no. 9, pp. 1348–1352, Sep. 2017. [45] F. R. Avila and L. W. P. Biscainho, “Audio soft declipping based onweighted l1-norm,” in , Oct 2017, pp. 299–303.[46] S. Gorlow and J. D. Reiss, “Model-based inversion of dynamic rangecompression,” IEEE Transactions on Audio, Speech, and LanguageProcessing , vol. 21, no. 7, pp. 1434–1444, 2013.[47] O. Mokr´y, P. Z´aviˇska, P. Rajmic, and V. Vesel´y, “Introducing SPAIN(SParse Audio INpainter),” in . IEEE, 2019.[48] Y. Bahat, Y. Y. Schechner, and M. Elad, “Self-content-based audioinpainting,”
Signal Processing , vol. 111, 2015.[49] A. Marafioti, N. Perraudin, N. Holighaus, and P. Majdak, “A context en-coder for audio inpainting,”
IEEE/ACM Transactions on Audio, Speech,and Language Processing , vol. 27, no. 12, pp. 2362–2372, dec 2019.[50] J. Tropp, “Greed is good: Algorithmic results for sparse approximation,”
IEEE Transactions on Information Theory , vol. 50, 2004.[51] B. L. Sturm and M. G. Christensen, “Comparison of orthogonal match-ing pursuit implementations,” in , 2012, pp. 220–224.[52] O. Mokr´y and P. Rajmic, “Audio inpainting: Revisited and reweighted,”2020.[53] P. Z´aviˇska, O. Mokr´y, and P. Rajmic, “S-SPADE Done Right:Detailed Study of the Sparse Audio Declipper Algorithms,”Brno University of Technology, techreport, Sep. 2018, URL:https://arxiv.org/pdf/1809.09847.pdf.[54] D. L. Donoho and M. Elad, “Optimally sparse representation in general(nonorthogonal) dictionaries via ℓ minimization,” Proceedings of TheNational Academy of Sciences , vol. 100, no. 5, pp. 2197–2202, 2003.[55] P. Combettes and J. Pesquet, “A Douglas–Rachford splitting approach tononsmooth convex variational signal recovery,”
IEEE Journal of SelectedTopics in Signal Processing , vol. 1, no. 4, pp. 564–574, 2007.[56] P. Rajmic, P. Z´aviˇska, V. Vesel´y, and O. Mokr´y, “A new generalizedprojection and its application to acceleration of audio declipping,”
Axioms , vol. 8, no. 3, Sep. 2019.[57] A. Chambolle and T. Pock, “A first-order primal-dual algorithm forconvex problems with applications to imaging,”
Journal of MathematicalImaging and Vision , vol. 40, no. 1, pp. 120–145, 2011.[58] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity byreweighted ℓ minimization,” Journal of Fourier Analysis and Applica-tions , vol. 14, pp. 877–905, 12 2008.[59] T. Hastie, R. Tibshirani, and M. Wainwright,
Statistical learning withsparsity : the lasso and generalization , Boca Raton: CRC Press, 2015.[60] P. Rajmic, “Exact risk analysis of wavelet spectrum thresholding rules,”in
IEEE Electronics, Circuits and Systems, 2003, proceedings , vol. 2,12 2003, pp. 455–458, Vol.2.[61] O. Mokr´y and P. Rajmic, “Reweighted l1 minimization for audioinpainting,” in
Proceedings of the 2019 SPARS workshop , Toulouse,Jul. 2019.[62] M. Novosadov´a and P. Rajmic, “Piecewise-polynomial signal segmenta-tion using reweighted convex optimization,” in
Proceedings of the 40thInternational Conference TSP , Barcelona, 2017, pp. 769–774.[63] M. Kowalski, K. Siedenburg, and M. D¨orfler, “Social Sparsity! Neigh-borhood Systems Enrich Structured Shrinkage Operators,”
Signal Pro-cessing, IEEE Transactions on , vol. 61, no. 10, pp. 2498–2511, 2013.[64] I. Bayram, “Mixed norms with overlapping groups as signal priors,”in
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEEInternational Conference on , May 2011, pp. 4036–4039.[65] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholdingalgorithm for linear inverse problems with a sparsity constraint,”
Communications on Pure and Applied Mathematics , vol. 57, no. 11,pp. 1413–1457, 2004.[66] A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-ThresholdingAlgorithm for Linear Inverse Problems,”
SIAM Journal on ImagingSciences , vol. 2, no. 1, pp. 183–202, 2009.[67] B. O’Donoghue and E. Candes, “Adaptive restart for acceleratedgradient schemes,”
Foundations of Computational Mathematics , 2013.[68] H.-Y. Gao, “Wavelet shrinkage denoising using the non-negativegarrote,”
Journal of Computational and Graphical Statistics , vol. 7,no. 4, pp. 469–488, 1998.[69] A. Antoniadis, “Wavelet methods in statistics: Some recent develop-ments and their applications,”
Statistics Surveys , vol. 1, pp. 16–55, 2007.[70] R. Gribonval and M. Nikolova, “A characterization of proximity opera-tors.”[71] A. Spanias, T. Painter, and V. Atti,
Audio Signal Processing and Coding .John Wiley & Sons, Inc., 12 2005.[72] S. Shlien, “Guide to mpeg-1 audio standard,”
IEEE Transactions onBroadcasting , vol. 40, no. 4, pp. 206–218, 1994. [73] B. C. V˜u, “A splitting algorithm for dual monotone inclusions involvingcocoercive operators,”
Advances in Computational Mathematics , vol. 38,no. 3, pp. 667–681, Apr. 2013.[74] J. Mairal, F. Bach, and J. Ponce, “Sparse modeling for image and visionprocessing,”
Found. Trends Comput. Graph. Vis. , vol. 8, no. 2-3, pp. 85–283, 2014.[75] L. Rencker, F. Bach, W. Wang, and M. D. Plumbley, “Sparse recoveryand dictionary learning from nonlinear compressive measurements,”
IEEE Transactions on Signal Processing , vol. 67, no. 21, 2019.[76] C¸ . Bilen, A. Ozerov, and P. P´erez, “Solving time-domain audio inverseproblems using nonnegative tensor factorization,”
IEEE Transactions onSignal Processing , vol. 66, no. 21, pp. 5604–5617, Nov 2018.[77] C. F´evotte and M. Kowalski, “Estimation with low-rank time–frequencysynthesis models,”
IEEE Transactions on Signal Processing , vol. 66,no. 15, pp. 4121–4132, 2018.[78] C. F´evotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factor-ization with the itakura-saito divergence: With application to musicanalysis,”
Neural computation , vol. 21, no. 3, pp. 793–830, 2009.[79] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihoodfrom incomplete data via the em algorithm,”
Journal of the RoyalStatistical Society: Series B , vol. 39, no. 1, 1977.[80] L. Oudre, “Interpolation of missing samples in sound signals based onautoregressive modeling,”
Image Processing On Line , vol. 8, pp. 329–344, Oct. 2018.[81] O. Mokr´y, P. Z´aviˇska, P. Rajmic, and V. Vesel´y, “Introducing SPAIN(SParse Audio INpainter),” EUSIPCO 2019.[82] P. Kabal, “An examination and interpretation of ITU-R BS.1387: Percep-tual evaluation of audio quality,” MMSP Lab Technical Report, McGillUniversity, Tech. Rep., May 2002.[83] R. Huber and B. Kollmeier, “PEMO-Q—A new method for objectiveaudio quality assessment using a model of auditory perception,”
IEEETrans. Audio Speech Language Proc. , vol. 14, no. 6, 2006.[84] J. H. R. Chang et al. “One network to solve them all — solvinglinear inverse problems using deep projection models,” in . IEEE, oct 2017.[85] P. H. Kraght, “Aliasing in digital clippers and compressors,”
Journal ofthe Audio Engineering Society , vol. 48, no. 11, pp. 1060–1065, 2000.[86] D. Mapes-Riordan, “A worst-case analysis for analog-quality (alias-free)digital dynamics processing,” in
AES Convention 105 , 1998.[87] F. Esqueda, S. Bilbao, and V. V¨alim¨aki, “Aliasing reduction in clippedsignals,”