Semiparametric Wavelet-based JPEG IV Estimator for endogenously truncated data
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Digital Object Identifier 10.1109/ACCESS.2019.2929571
NIR BILLFELD, University of Haifa, Haifa, Israel (e-mail: [email protected])
MOSHE KIM, University of Haifa, Haifa, Israel (e-mail: [email protected])
Corresponding author: Moshe Kim (e-mail: [email protected]). This work was supported by the Research Authority, The University of Haifa. The authors would like to thank seminar participants of the Faculty of Mathematics and Computer Science, The Weizmann Institute of Science, and of the Statistics departments of Tel-Aviv and Haifa Universities, for very constructive comments.
ABSTRACT
A new and enriched JPEG algorithm is provided for identifying redundancies in a sequence of irregular noisy data points, one which also accommodates a reference-free criterion function. Our main contribution is in formulating analytically (instead of approximating) the inverse of the transpose of the JPEG-wavelet transform without involving matrices, which are computationally cumbersome. The algorithm is suitable for the widespread situations in which the original data distribution is unobservable, such as cases where there is deficient representation of the entire population in the training data (in machine learning) and the covariate shift assumption is thus violated. The proposed estimator corrects for both biases: the one generated by endogenous truncation and the one generated by endogenous covariates. Results from utilizing 2,000,000 different distribution functions verify the applicability and high accuracy of our procedure in cases where the disturbances are neither jointly nor marginally normally distributed.
INDEX TERMS
JPEG, semiparametric, biorthogonal wavelet, causality, proximal gradient descent, lifting scheme, denoising, covariate shift, training data, reference-free
I. INTRODUCTION
Scientists routinely try to model and extract causal relations among covariates, rather than merely their correlations. (For a specific form of causality due to treatment effects, see the definition in [1].) In practice, however, the presence of endogenous covariates in the model challenges causal inference, due to comovement of the random disturbance with these covariates. We distinguish between population-induced comovement and training-data comovement (as in machine learning), without having to rely on a covariate shift assumption, since the behavioral (causal) model embedded in the training data does not necessarily describe the behavior in the entire population (see the discussion in [2]).

The common way to overcome the aforementioned is to generate variation in the endogenous covariate without introducing variation in the random disturbance. This is achieved by employing a proper instrumental variable (IV). Application of a proper instrumental variable generates variation in the endogenous covariate without introducing variation in the random disturbance, and hence the IV is orthogonal to it. Thus, the IV should contribute to exogeneity and has therefore been extremely popular in empirical work. Once we have analytically shown that the IV estimator is no longer valid in an endogenously truncated environment, we offer a truncation-proof estimator, which is a semiparametric wavelet-based JPEG-IV denoising algorithm. (Note that we deal with an endogenously truncated sample selection model, to be differentiated from censored sample selection models [3]–[5], in which there exists information pertaining to the non-participants.) This denoising algorithm decomposes the random disturbance into a noise part and a systemic bias part, enabling the elimination of the truncation bias. (A wavelet is a bandwidth-free estimator based on a multi-scale representation of the data; it is a widely used denoising technique [6].) The magnitude of this bias is captured by the size of the wavelet coefficients, which quantify and
measure the degree of redundancy hidden in a sequence of data points. Consequently, this algorithm nests the conventional IV estimator as a special case: in the absence of systemic endogenous truncation, the wavelet coefficients approach zero, except for the intercept, which describes the coarse level of the function.

Our main contribution to the biorthogonal wavelet estimator is in formulating analytically (instead of approximating) the inverse of the transpose of the wavelet transform without involving matrices, which are computationally cumbersome [7], as well as in the management of irregularly-spaced data. Additionally, our proposed methodology enables the combination of several resolution-dependent penalty functions in the estimation procedure.

Wavelets are useful in denoising data. Several image quality assessment (IQA) measures have been introduced to choose the optimal level of denoising. These approaches can be classified as full-reference (FR), in cases where the original (noise-free) image is observed; reduced-reference (RR), in cases where there is partial information about the reference; and reference-free (RF), in cases where the original image is not accessible [10], [11]. As we deal with truncated distributions, the source (the complete non-truncated distribution) is intrinsically unobservable, and thus we cannot assess the success of the denoising by comparing it to the original non-truncated distribution. Therefore, we select both the thresholding (tuning) parameter and the penalty function using a reference-free criterion function.

The proposed JPEG IV is biorthogonal, thus preserving both the symmetry (the original shape of the data) and the compact support (small number of coefficients) properties of the data. Importantly, the proposed methodology is easy to compute, as it precludes the need to find an optimal bandwidth, as is conventionally done.
These properties make it suitable for denoising, by alleviating both the problem of coefficient expansion and that of border discontinuities [15]. The proposed algorithm corrects for both sources of bias: the endogeneity of covariates as well as the endogenous self-selection bias. We run Monte Carlo simulations to measure the magnitude of the potential bias in the parameter estimates under endogenous truncation when a conventional IV is employed to eliminate the endogeneity bias.

Several remarks are in order. Unlike the Fourier transform, the wavelet estimator preserves not only the data's average (coarse) behavior but also its local behavior, capturing deviations (details) from the average; this renders our denoising suitable for irregularly-spaced data, which largely depend on local behavior, and local behavior plays an important role in the denoising. Regarding [7]: "The implemented routine for the inverse transpose transform is approximate." (p. 285). In the orthogonal wavelet design, various interpolation methods are used to alleviate these irregularities [8], and specific methodologies can be used to extend the Haar wavelet transform to the unequally spaced case [9]. It is known that soft thresholding provides smoother results than hard thresholding because it is continuous; the latter, however, provides better edge preservation. Our JPEG IV is a biorthogonal wavelet, as it requires two sets of vectors, the dual basis set and the series expansion set, to obtain a denoised representation of the data; the elements of the former set are orthogonal to the corresponding elements of the latter set (see [12] for a formal definition of biorthogonality). Finally, kernel estimation involves a computational burden due to the necessity of finding the optimal bandwidth [13]; unlike the nonparametric case, in the semiparametric context there is no "protocol" for finding the optimal bandwidth, and the traditional bandwidth choice methods might lead to biased estimates due to improper bandwidth choice [14].
Our empirical implementation shows that even under mild correlation between the random disturbances, the resulting bias in the estimated parameter of the endogenous covariate in the substantive equation can amount to almost tenfold the true parameter value. Further, for the sake of generality of the offered estimator, we subject it to various distributions in which the disturbances are neither jointly nor marginally normally distributed. These disturbances are constructed as realizations of non-symmetric and non-unimodal distribution functions.

The rest of this paper is organized as follows. The methodology is presented in Section II. Section III prepares the ground for the biorthogonal wavelet. Section IV presents our proposed JPEG algorithm. In Section V we employ Monte Carlo simulations to validate our estimator's performance. Section VI concludes.
II. METHODOLOGY
As discussed above, the IV is based on the following basic requirements: it is correlated with the endogenous covariate and orthogonal to the random disturbance. Additionally, it must satisfy the exclusion restriction, such that in the presence of the endogenous covariate the IV must be excluded from the regression; the IV is allowed to affect the dependent variable only through its effect on the endogenous covariate. However, the orthogonality condition is rarely satisfied in the presence of endogenous truncation, which is very frequently the nature of data used in empirical research, and therefore the IV will not provide a solution to the endogeneity problem. In what follows, we demonstrate the shortcoming of the conventional IV estimator, as well as the potential bias generated in an environment of endogenously truncated data.

Suppose that there is a population random vector (z; x_1, x_{-1}; w) and that there is an independent and identically distributed sample {z_i, x_{1i}, x_{-1i}, w_i}_{i=1}^N drawn from this population, referred to as the complete data set, consisting of N observations. The instrumental variable is z, the endogenous variable is x_1, and the exogenous random variables are (x_{-1}; w), where w ∈ R^l is a covariate vector. Let ξ_{1i}, ξ_{2i} and v_i be jointly dependent random disturbances with the respective marginal distribution functions F_{ξ_1}, F_{ξ_2} and F_v; their joint distribution function is F_{ξ_1,ξ_2,v}. The model is semiparametric, as neither the marginals nor the joint distribution function are required to be specified by the researcher. (Throughout, capital letters indicate random variables and lower-case letters indicate realizations of these random variables. Unlike the practice in some other studies, we do not apply only normally distributed disturbances.)

The underlying model is composed of two parts. The first part consists of a selection equation, while the second part
consists of the substantive (of interest) equation.
The population (non-truncated) selection equation is defined as:

y*_{2i} = w_i^T γ + ξ_{2i}   (1)

where γ ∈ R^l and w_i ∈ R^l are the selection equation's coefficient and covariate vectors, respectively. The selection equation's random disturbance is denoted by ξ_{2i}.
The substantive equation and the endogenous variable equation are defined as a system of equations:

y*_{1i} = x_i^T β + ξ_{1i}                (the substantive equation)
x*_{1i} = [z_i^T, x_{-1i}^T] δ + v_i      (the endogenous variable equation)   (2)

where β ∈ R^p and δ ∈ R^p are coefficient vectors, x_{1i} is an endogenous variable included in the vector x_i ∈ R^p, and the exogenous variables are denoted by x_{-1i}^T. The substantive and endogenous variable equations' random disturbances are denoted by ξ_{1i} and v_i, respectively. However, the variables y*_{2i}, y*_{1i}, x*_{1i} are latent in the truncated environment, and their respective observed realizations are denoted by y_{2i}, y_{1i}, x_{1i}, defined in (3) and (4) to follow. The variable y*_{2i} is latent, while y_{2i} is observed and defined as:

y_{2i} = 1 if y*_{2i} ≥ 0;  unobserved if y*_{2i} < 0      (the selection equation)   (3)

[y_{1i}, x_{1i}] = [y*_{1i}, x*_{1i}] if y*_{2i} ≥ 0;  unobserved if y*_{2i} < 0      (the substantive equations)   (4)

In the next section we reformulate the substantive equation as a partially linear single index model.

A. SEMIPARAMETRIC SELECTIVITY BIAS CORRECTION
The key difference between censored and truncated sample selection models is that in the former the entire covariate set (including that of the non-participants) and the selection variable are fully observed, whereas in the latter the entire data set is truncated. Nevertheless, in both cases the substantive equation can be represented as a partially linear regression, in which the dependent variable is observed only for the participants, as we are about to show. Following [16], the conditional expectation of the substantive equation's disturbance in semiparametric (censored) sample selection models is some generally unknown function M_1(·) (to be estimated) of the selection equation's covariates w_i:

E[ξ_{1i} | ξ_{2i} > −w_i^T γ] = M_1(w_i^T γ)   (5)

where γ is the selection equation's coefficient vector. (This approach is a generalization of the well-known inverse Mills ratio estimator introduced by [3] for the substantive equation's bias term E[ξ_{1i} | ξ_{2i} > −w_i^T γ] in the case of a censored sample selection model. Note the difference between censored data and truncated data, the latter being the case we deal with.) Since y_{1i} is observed only if i is a participant, the substantive equation's dependent variable obtains the following functional form:

y_{1i} = x_i^T β + M_1(w_i^T γ) + ε̃_i   (6)

where M_1(w_i^T γ) is the bias term and ε̃_i is white noise. The regression equation in (6) is referred to as a semiparametric partially linear regression (SP-NLS), in which the non-linear part is the bias term function. This regression can be estimated semiparametrically in the case of a truncated sample selection model using a non-linear least squares procedure, as suggested by [13]. Both the [13] and [16] models involve kernel function estimation. However, kernel estimates' accuracy is sensitive to the selected bandwidth. This entails the potential problem of finding the optimal bandwidth, resulting in computational complexity.
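The mechanics of (6) can be illustrated with a deliberately simplified numerical sketch. This is not the paper's estimator: it assumes jointly normal disturbances, a known scalar selection index (γ = 1), and a low-order polynomial series standing in for the unknown M_1(·); all parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
w = rng.normal(size=n)                      # selection covariate (index w'γ with γ = 1)
x = 0.5 * w + rng.normal(size=n)            # exogenous regressor, correlated with w
xi2 = rng.normal(size=n)                    # selection disturbance
xi1 = 0.6 * xi2 + 0.8 * rng.normal(size=n)  # substantive disturbance, correlated with xi2
beta = 1.5
y = beta * x + xi1
s = xi2 > -w                                # endogenous truncation: keep participants only

# naive OLS on the truncated sample ignores the bias term M_1(w'γ)
X0 = np.column_stack([np.ones(s.sum()), x[s]])
b_naive = np.linalg.lstsq(X0, y[s], rcond=None)[0][1]

# partially linear fit: a polynomial series in the index stands in for M_1(·)
idx = w[s]
X1 = np.column_stack([x[s]] + [idx ** k for k in range(4)])
b_series = np.linalg.lstsq(X1, y[s], rcond=None)[0][0]

print(b_naive, b_series)
```

On data of this kind the naive truncated-sample OLS coefficient is pulled away from β = 1.5, while augmenting the regression with the series terms absorbs the selection bias and restores the substantive coefficient.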
Due to the lack of applicability of the traditional bandwidth selection methods in the semiparametric context, informal methods are used, which may lead to a non-ignorable bias in the estimates [18]. In order to avoid the problems involved in kernel estimation, our methodology relies on a (thresholding-propagated) nonlinear wavelet-based JPEG IV estimator to approximate the bias term in (6).

The substantive equation depicted in (6) deals with the endogenous truncation bias, assuming that the random disturbance and the covariates are not jointly dependent. However, in cases where this random disturbance is jointly dependent with one (or more) of the covariates, two bias terms emerge: the first is propagated by the endogenous truncation and the second by the endogenous covariate. Next we present a decomposition theorem (Theorem 1), which enables reformulating the substantive equations as a partially linear single index model in the presence of an endogenous covariate.
B. DECOMPOSITION OF THE SUBSTANTIVE EQUATIONS
Theorem 1.
Let the underlying model be as depicted in (3) and (4). Denote by ε_{1i} and ϵ_{1i} the random disturbances constructed as ε_{1i} = y*_{1i} − E[y*_{1i} | x_i] and ϵ_{1i} = y_{1i} − E[y*_{1i} | y_{2i} = 1], respectively. The following requirements must hold:
(i) E[y*_{1i} | y_{2i} = 1] = E[x_i^T β | x_i] + E[ξ_{1i} | x_i] + E[ε_{1i} | y_{2i} = 1], ∀ i ∈ {1, ..., N};
(ii) y_{1i} = x_i^T β + ϵ*_{1i}, E[ϵ_{1i} | y_{2i} = 1] = 0, ϵ*_{1i} ≡ ϵ_{1i} + E[ξ_{1i} | x_i] + M_1(w_i^T γ), ∀ i ∈ {i | y_{2i} = 1}.
Proof. By the construction ε_{1i} = y*_{1i} − E[y*_{1i} | x_i], it follows that

y*_{1i} = E[x_i^T β | x_i] + E[ξ_{1i} | x_i] + ε_{1i}   (7)

Using (7) we get:

E[y*_{1i} | y_{2i} = 1] = E[ε_{1i} | y_{2i} = 1] + E{E[x_i^T β | x_i] | y_{2i} = 1} + E{E[ξ_{1i} | x_i] | y_{2i} = 1}   (8)

which is simplified to:

E[y*_{1i} | y_{2i} = 1] = E[x_i^T β | x_i] + E[ξ_{1i} | x_i] + E[ε_{1i} | y_{2i} = 1]   (9)

In order to obtain the substantive equation in the truncated environment, we construct ϵ_{1i} = y_{1i} − E[y*_{1i} | y_{2i} = 1], where E[y*_{1i} | y_{2i} = 1] is obtained from (9). Following [17], the conditional expectation of ε_{1i} given participation is expressed by some unknown function M_1(·) as E[ε_{1i} | y_{2i} = 1] = M_1(w_i^T γ). (There is an open question as to whether there is a way to choose a bandwidth sequence that is optimal for the estimation of the parameters [17]: "The well known bandwidth selection rules used in non-parametric estimation, such as cross validation, are not generally applicable to semi-parametric settings." [18, p. 191].)
Thus, we obtain:

y_{1i} = x_i^T β + M_1(w_i^T γ) + E[ξ_{1i} | x_i] + ϵ_{1i}   (10)

where x_i^T β contains the substantive covariates, M_1(w_i^T γ) is the selection bias term, E[ξ_{1i} | x_i] is the endogeneity bias term and ϵ_{1i} is white noise; ϵ*_{1i} collects the last three components. Equation (10) is thus a decomposition of the substantive equation into its components: the substantive equation's covariates, the selection bias term, the endogeneity bias term and a stochastic white noise term. It is easy to see that the conventional IV cannot be sufficient for eliminating the endogeneity bias E[ξ_{1i} | x_i] in (10), since under truncation the endogeneity bias term is actually E[ξ_{1i} | x_i, y_{2i} = 1].
Similarly, we construct ϵ_{2i} = x_{1i} − E[x*_{1i} | y_{2i} = 1], where E[x*_{1i} | y_{2i} = 1] satisfies:

E[x*_{1i} | y_{2i} = 1] = E[v_i | y_{2i} = 1] + E[[z_i^T, x_{-1i}^T] δ | y_{2i} = 1]   (11)

to get:

ϵ_{2i} = x_{1i} − E[v_i | y_{2i} = 1] − E[[z_i^T, x_{-1i}^T] δ | y_{2i} = 1]   (12)

We express E[v_i | y_{2i} = 1] in (12) as E[v_i | y_{2i} = 1] = M_2(w_i^T γ), where M_2(·) is some unknown function, and obtain (see Theorem 4 to follow):

x_{1i} = [z_i^T, x_{-1i}^T] δ + M_2(w_i^T γ) + ϵ_{2i}   (13)

where [z_i^T, x_{-1i}^T] δ contains the substantive covariates, M_2(w_i^T γ) is the selection bias term and ϵ_{2i} is white noise; ϵ*_{2i} collects the last two components. It is easy to see the joint dependence of ϵ*_{1i} and ϵ*_{2i} through the selection bias terms in (10) and (13). ∎

Next we formulate the relationship between the covariates and the dependent variables in the equations to be estimated, in the presence of an endogenous covariate in the substantive equation under truncation.
C. TRUNCATED SAMPLE SELECTION MODEL WITH AN ENDOGENOUS COVARIATE
In cases where the substantive equation's dependent variable is a function of an endogenous covariate x_{1i}, and both x_{1i} and y_{1i} (as in (4)) are truncated, we face a truncated sample selection model with an endogenous covariate. Thus, the semiparametric partially linear index model in a truncated environment consists of the following system of equations (by the construction of y_{1i}, the equality E[y_{1i} | y_{2i} = 1] = E[y*_{1i} | y_{2i} = 1] must be satisfied, which implies that E[ϵ_{1i} | y_{2i} = 1] = E[y_{1i} | y_{2i} = 1] − E[y*_{1i} | y_{2i} = 1] = 0):

y_{1i} = x_i^T β + M_1(w_i^T γ) + ϵ**_{1i},  where ϵ**_{1i} ≡ E[ξ_{1i} | x_i] + ϵ_{1i} and ϵ_{1i} is white noise
x_{1i} = [z_i^T, x_{-1i}^T] δ + M_2(w_i^T γ) + ϵ_{2i},  where ϵ_{2i} is white noise   (14)

where ϵ**_{1i} and ϵ_{2i} are two jointly dependent random disturbances which, by construction, are independent of the random-variables vector w. The intrinsic endogeneity in the model is captured by the joint dependence of ϵ**_{1i} and the covariates. The presence of the function M_2(·) implies that we allow for dependence between v_i (the endogenous part of x_{1i}) and the selection equation's random disturbance ξ_{2i} (in (1), the complete, non-truncated, sample selection equation).
Our primary interest is to show that the instrumental variable and the random disturbance might be correlated in a truncated environment, as will be depicted in Theorem 2 to follow. To this end, we denote a truncated environment using the indicator (selection) variable s = I(ξ_2 > −w^T γ) and postulate the following assumptions:

Assumption 1.
The instrumental variable z is jointly distributed with all covariates in the data: E[z | w = w] = G(w), where G(·) is some function of w.

Assumption 2.
Conditioning the instrumental variable z both on the random variable w and on a stochastic function of w denoted by F(w, ε) (given that the stochastic component ε is an i.i.d. white noise which is independent of z) is the same as conditioning it only on w. Formally: E[z | w = w, F(w, ε)] = E[z | w = w].

These two assumptions imply that the conditional expectation of the instrumental variable, given the selection variable, is a function of the random-variables vector w, as the following proposition argues:

Proposition 1.
Given Assumptions 1 and 2, the conditional expectation of the instrumental variable given the selection variable is a function of w; that is, E[z | s = s] = ∫_w G(w) f_{w|s}(w | s = s) dw, ∀ s ∈ {0, 1}, where f_{w|s}(w | s = 1) and f_{w|s}(w | s = 0) are the conditional density functions of the vector w given participation and non-participation, respectively.
Proof. It is easy to see that w mediates between z and s, using the Tower property of conditional expectation [19] (the Tower property is used interchangeably with the law of iterated expectations; for a formal proof see [19]):

E[z | s = s] = E_w[E[z | w, s] | s = s],  s ∈ {0, 1}

The indicator variable s is a stochastic function of w; thus, it follows from Assumption 2 that

E_w[E[z | w, s] | s = s] = E_w[E[z | w] | s = s],  s ∈ {0, 1}

Following Assumption 1 we get:

E[z | s = s] = E_w[E[z | w] | s = s] = E_w[G | s = s] = ∫_w G(w) f_{w|s}(w | s = s) dw,  s ∈ {0, 1}   (15)
∎

(Recall that the dependence between the two random disturbances in (14) stems from the dependence between v_i and ξ_2 (as in (2)) in the complete, non-truncated data; that the random vector w is not to be confused with its realization w_i; and that the model's intrinsic endogeneity refers to the joint dependence of the random disturbance and the covariates in the population, unlike a conditional joint dependence of the random disturbance and the covariates given participation in the sample.)

In Theorem 2 to follow we use Proposition 1 to present our primary argument: in truncated sample selection models, the orthogonality condition of the instrumental variable with respect to the random disturbance might be violated. This violation stems from a dependency between the instrumental variable and the selection equation's covariates.
Theorem 2 (Lack of orthogonality). Let ξ_1 and ξ_2 be two jointly distributed random disturbances, and let z be a valid instrumental variable satisfying E[z · ξ_1] = 0. Denote a random-variables vector w ∈ R^l, a parameter vector γ ∈ R^l, and a truncated environment using the indicator variable s = I(ξ_2 > −w^T γ). Suppose that the following conditions are satisfied: (i) Assumptions 1 and 2 hold; (ii) E[ξ_1 | s = s, w = w] = M(w^T γ); (iii) z and ξ_1 are conditionally independent given w and s; (iv) G and M are linearly dependent in the truncated environment (given s). Under conditions (i)-(iv) above, z is not orthogonal to the random disturbance ξ_1 given s.
Proof. Using the Tower property, the following must hold (the second equality follows from the conditional independence of z and ξ_1 given w and s):

E[z ξ_1 | s = s] = E_w[E[z ξ_1 | w, s] | s = s] = E_w[E_z[z | w, s] E_{ξ_1}[ξ_1 | w, s] | s = s] = E_w[G M | s = s] = ∫_w G(w) M(w^T γ) f_{w|s}(w | s = s) dw

Similarly (using Proposition 1), E[z | s = s] = ∫_w G(w) f_{w|s}(w | s = s) dw and

E[ξ_1 | s = s] = E_w[E[ξ_1 | w, s] | s = s] = E_w[M | s = s] = ∫_w M(w^T γ) f_{w|s}(w | s = s) dw   (16)

Since G and M are conditionally linearly dependent random variables in the truncated environment (given s), it follows that

∫_w G(w) M(w^T γ) f_{w|s}(w | s = s) dw ≠ ∫_w G(w) f_{w|s}(w | s = s) dw · ∫_w M(w^T γ) f_{w|s}(w | s = s) dw   (17)

and consequently:

E[z ξ_1 | s = s] ≠ E[z | s = s] E[ξ_1 | s = s]  ⇒  COV[z, ξ_1 | s = s] ≠ 0

Therefore, z is not orthogonal to ξ_1 given s (in the truncated environment). ∎

However, the orthogonality condition can be satisfied by removing the contamination factor (the term generating the comovement between the random disturbance and the instrumental variable), as shown in the following Theorem 3.
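The mechanism behind Theorem 2 can be reproduced numerically. The sketch below is illustrative only: it assumes jointly normal disturbances with correlation rho and a scalar selection covariate with γ = 1, so that the bias term M(w^T γ) takes the closed inverse-Mills-ratio form; removing that term restores orthogonality, which is the content of Theorem 3 to follow.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 200_000
w = rng.normal(size=n)                                     # selection covariate, γ = 1
z = 0.8 * w + rng.normal(size=n)                           # instrument, jointly distributed with w (Assumption 1)
rho = 0.6
xi2 = rng.normal(size=n)                                   # selection disturbance
xi1 = rho * xi2 + sqrt(1 - rho ** 2) * rng.normal(size=n)  # substantive disturbance
s = xi2 > -w                                               # participation indicator s = I(xi2 > -w'γ)

cov_full = np.cov(z, xi1)[0, 1]                            # population: E[z·ξ1] = 0 holds
cov_trunc = np.cov(z[s], xi1[s])[0, 1]                     # truncated sample: orthogonality fails

# Under normality the bias term has the closed form M(w'γ) = rho·λ(w'γ),
# with λ the inverse Mills ratio; removing it restores orthogonality.
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF
lam = np.exp(-w[s] ** 2 / 2) / np.sqrt(2 * np.pi) / Phi(w[s])   # inverse Mills ratio
cov_clean = np.cov(z[s], xi1[s] - rho * lam)[0, 1]

print(cov_full, cov_trunc, cov_clean)
```

The first and third covariances are negligible while the second is clearly bounded away from zero, matching the violation and its removal.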
(The conditional linear dependence between G and M given the indicator (selection) variable implies that E[G M | s = s] ≠ E[G | s = s] E[M | s = s]. Since G and M are both functions of the random variable w, this inequality implies that ∫ G(w) M(w^T γ) f_{w|s}(w) dw ≠ ∫ G(w) f_{w|s}(w) dw · ∫ M(w^T γ) f_{w|s}(w) dw.)

Theorem 3 (Bias removal). Let ξ_1 and ξ_2 be two jointly distributed random disturbances, and let z be a valid instrumental variable satisfying E[z · ξ_1] = 0. Denote a random-variables vector w ∈ R^l, a parameter vector γ ∈ R^l, and a truncated environment using the indicator variable s = I(ξ_2 > −w^T γ). Suppose that the following conditions are satisfied: (i) Assumptions 1 and 2 hold; (ii) E[ξ_1 | s = s, w = w] = M(w^T γ).
Under conditions (i) and (ii) above, removing the contamination factor (the bias term) from the residual in the truncated environment leads to orthogonality of the instrumental variable to the substantive equation's disturbance, such that E[z(ξ_1 − M(w^T γ)) | s = s] = 0.
Proof. Express E[z(ξ_1 − M(w^T γ)) | s = s] as a difference of two conditional expectations:

E[z(ξ_1 − M(w^T γ)) | s = s] = E[z ξ_1 | s = s] − E[z M(w^T γ) | s = s]

Using the Tower property (and the conditional independence of z and M(w^T γ) given w and s), we get:

E[z M(w^T γ) | s = s] = E_w[E[z M(w^T γ) | w, s] | s = s] = E_w[E_z[z | w, s] E[M(w^T γ) | w, s] | s = s] = E[G(w) M(w^T γ) | s = s]

Since E[z ξ_1 | s = s] = E[G(w) M(w^T γ) | s = s] (proof of Theorem 2), it follows that E[z(ξ_1 − M(w^T γ)) | s = s] = 0. Moreover,
COV[z, ξ_1 − M(w^T γ) | s = s] = E[z(ξ_1 − M(w^T γ)) | s = s] − E[z | s = s] E[ξ_1 − M(w^T γ) | s = s] = 0   (18)
∎

Therefore, a valid instrumental variable z is orthogonal to the truncated-distribution (non-contaminated) disturbance ϵ**_{1i} in (14), even though z and w are dependent.
The joint dependence of (ξ_1, ξ_2, v) implies the violation of the zero-mean expectation (under truncation) in the x_{1i} regression equation (4), such that E[v | ξ_2 > −w^T γ] = M_2(w^T γ) ≠ E[v] = 0. That is, the conditional expectation of v given an endogenous truncation is a function of the covariate vector w, while in the population it does not depend on w and has a zero mean. This violation is a precondition for the endogeneity of {x_{-1i}, z_i} with respect to v_i, given participation, in the regression of x_{1i}. (As discussed in [3], the fact that the conditional disturbance (given participation) in the substantive equation of x_{1i} is a function of the selection equation's covariates leads to a potential correlation between the disturbance and the substantive equation's covariates.) The following theorem indicates that such a violation is also obtained in cases where the comovement of v and ξ_2 is entirely related to a variation in ξ_2.

Theorem 4 (Conditional independence). Let ξ_1 and ξ_2 be two jointly distributed random disturbances of the substantive and selection equations, respectively. Let v be a random variable which depends on ξ_2, such that v and ξ_1 are conditionally independent given ξ_2. Denote a random-variables vector w ∈ R^l independent of (ξ_1, ξ_2, v) with a realization w, a parameter vector γ ∈ R^l, and a truncated environment using the indicator variable s = I(ξ_2 > −w^T γ).
This correlation implies the endogeneity of the substantive equation's covariates {x_{-1i}, z_i} with respect to its random disturbance v_i, given participation. Assume that the following conditions are satisfied: (i) the conditional expectation of the random disturbance given participation is E[ξ_2 | ξ_2 > −w^T γ] = M(w^T γ) [16]; (ii) E[v | ξ_2, ξ_2 > −w^T γ] = E[v | ξ_2] = H(ξ_2) (endogeneity); (iii) H(·) is a monotonic mapping R ↦ R.
Under conditions (i)-(iii) above, E[v | ξ_2 > −w^T γ] ≠ E[v], regardless of the conditional independence of v and ξ_1 given ξ_2.
Proof. Applying the Tower property to E[v | s = 1]:

E[v | ξ_2 > −w^T γ] = E[v | s = 1] = E_{ξ_2}{E[v | ξ_2, s] | s = 1} = E[H(ξ_2) | s = 1] = M_2(w^T γ) ≠ E[v].

It can be shown that ξ_2 mediates between v and s (participation), in that it generates a comovement between the random variables v and s. The last equality relies on the fact that the random variable H(ξ_2) is a monotonic mapping of ξ_2, implying dependence on s due to the dependency between ξ_2 and s. ∎

Next we show that the conventional IV estimator is inconsistent in the presence of a truncated environment in which the expectations of the instrumental variable and of the random disturbance are functions of the selection equation's covariate vector w. The proof in Section II-D to follow relies on a linear-dependence assumption between these two functions of w. The rationale for the linear dependence is that the conditional expectation of the random disturbance ξ_1 generally satisfies monotonicity with respect to the index variable w^T γ. Therefore, it is enough to assume that, on average, z is affected monotonically by the index variable w^T γ to generate a linear dependence between z and the conditional expectation of ξ_1 given participation.

D. THE CONVENTIONAL IV ESTIMATOR'S ASYMPTOTIC BIAS
The IV estimator's asymptotic bias is derived as follows:

β̂_iv = (z^T x_1)^{-1} z^T y_1 = (z^T x_1)^{-1} z^T (x_1 β + M_1(w^T γ) + ϵ**_1)   (19)
β̂_iv = (z^T x_1)^{-1} z^T (x_1 β) + (z^T x_1)^{-1} z^T M_1(w^T γ) + (z^T x_1)^{-1} z^T ϵ**_1
β̂_iv = β + (z^T x_1)^{-1} z^T M_1(w^T γ) + (z^T x_1)^{-1} z^T ϵ**_1
plim_{N→∞} β̂_iv = β + plim_{N→∞}[(N^{-1} z^T x_1)^{-1}] · plim_{N→∞}[N^{-1} z^T M_1(w^T γ)]

where the last term is the asymptotic bias.
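This inconsistency is easy to reproduce in a small simulation. The sketch below is illustrative only (jointly normal disturbances, a scalar selection covariate with γ = 1, δ = 1, and correlation values chosen for the illustration, none of which come from the paper): the same just-identified IV estimator that is consistent on the complete data drifts far from β once the sample is endogenously truncated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
w = rng.normal(size=n)                                     # selection covariate
z = w.copy()                                               # instrument, valid in the population
v = rng.normal(size=n)                                     # endogenous part of x_1
xi2 = 0.7 * v + np.sqrt(1 - 0.49) * rng.normal(size=n)     # selection disturbance, dependent on v
xi1 = 0.9 * xi2 + np.sqrt(1 - 0.81) * rng.normal(size=n)   # substantive disturbance
x1 = z + v                                                 # endogenous covariate (δ = 1)
beta = 2.0
y1 = beta * x1 + xi1
s = xi2 > -w                                               # endogenous truncation, γ = 1

def iv(zv, xv, yv):
    """Just-identified IV slope estimate with an intercept."""
    zc = zv - zv.mean()
    return (zc @ yv) / (zc @ xv)

b_full = iv(z, x1, y1)              # complete data: consistent for beta
b_trunc = iv(z[s], x1[s], y1[s])    # truncated sample: asymptotically biased
print(b_full, b_trunc)
```

On the complete data the estimate sits essentially on β = 2, whereas on the truncated subsample it is pulled well away from it, exactly as the plim expression predicts.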
Given any correlation between z and M_1(w^T γ), plim_{N→∞}[N^{-1} z^T M_1(w^T γ)] does not converge to 0. Thus, the β̂_iv estimator is an inconsistent estimator of β.
Next we discuss the two types of joint dependence present in our model. This is done in order to facilitate the understanding of our proposed procedure, which is intended to correct for the bias propagated by each type of joint dependence.

III. PRELIMINARIES
The objective is to eliminate the selection bias terms captured by M_1(·) and M_2(·) in (14). As we do not want to impose a specific distribution function on the random disturbances, the aforementioned elimination should be performed in a nonparametric manner. (Recall that both functions are dependent through w by construction, generally leading to some degree of linear dependence.) This can be achieved using a semiparametric estimation method, which is distribution-free. However, the bias term might be a discontinuous function with different levels of smoothness that must be accommodated. These issues can be alleviated through multi-resolution analysis, by employing the wavelet estimator [20]. The wavelet is a bandwidth-free estimator based on the idea of a multi-scale representation of the data [21], and it is used as a denoising technique via simple thresholding, which relies on the concept of sparsity.

The applicability of the classical wavelet estimator is problematic in several important respects. First, it limits the sample size to a power of two, 2^J with J a non-negative integer, and the observations to be equispaced, which challenges the estimation in the case of irregularly-spaced data. Second, the classical wavelet estimator imposes the parametric assumption that the disturbances are independent identically distributed normal variables [23]. Lastly, there are the problems of coefficient expansion and border discontinuities.

In order to overcome these limitations, second generation wavelets have been introduced [24], which define wavelets in terms of lifting steps instead of matrices, so as to reduce computational complexity. An earlier attempt to deal with irregularly-spaced data using second generation wavelets is presented in [21], by postulating a prior distribution function for the wavelet coefficients.
Alternative approaches extend the Haar wavelet transform to accommodate irregular data [27]. Both first and second generation wavelet estimation methods involve three steps: (i) coefficient estimation (forward transform); (ii) denoising by element-wise thresholding (coefficient selection); and (iii) reconstruction of the data without the noise (inverse transform). It is important to notice that the sequential nature of estimation relying on element-wise thresholding is applicable only to a limited class of wavelets, referred to as orthogonal wavelets, which suffer from the limitations described above. The main shortcoming of orthogonal wavelets is that the compact support and symmetry properties, which are useful in denoising, are conflicting. To preserve both of these properties, the biorthogonal wavelet-based JPEG is used [29], [30]. (Due to its multi-scale property, we can distinguish the important information, the function's average behavior, from the noise. The coarse scales (lower resolution levels) usually convey important information, while at fine scales there is usually more noise. Sparsity implies that the majority of wavelet coefficients are small and can be replaced by zero [22]. The observations' locations in space or time must be of equal distance. The standard orthogonal wavelet transform has the shortcoming that it requires a large number of coefficients (coefficient expansion) to represent the original data [15]. The lifting steps are consecutive operations of prediction (scaled moving average) and update (scaled first difference) used to obtain the wavelet coefficients. [21] adopts the parametric Bayesian denoising approach introduced by [25], [26] to obtain the wavelet coefficients, assuming the coefficients are distributed according to a continuous mixture of a normal by a Beta density. Unlike biorthogonality, orthogonality and symmetry are conflicting properties for the design of compactly supported nontrivial wavelets; see Theorem 8.1.4 in [28].)
The JPEG algorithm used here is termed 'wavelet CDF 9/7'. In what follows we briefly explain the concept of biorthogonality. Denote a set of functions $\{\varphi_k(t)\}$ which spans a vector space $F$, referred to as the expansion set. By construction, any function $g(t) \in F$ can be expressed using a series expansion, such that $g(t) = \sum_k \eta_k \varphi_k(t)$, where $\eta_k$ and $\varphi_k$ are the expansion coefficients and expansion functions, respectively. The set $\{\varphi_k(t)\}$ is biorthogonal to the set $\{\tilde{\varphi}_k(t)\}$ if $\langle \varphi_k, \tilde{\varphi}_{k'} \rangle = d(k - k')\ \forall k$ and $k'$, with $\langle \cdot, \cdot \rangle$ being the $L^2$ inner product and $d(\cdot)$ the Kronecker delta. These two sets form a biorthogonal system, in which $\{\tilde{\varphi}_k(t)\}$ is referred to as the dual basis of $\{\varphi_k(t)\}$. Thus, we get the following unique representation:

$$\eta_k = \langle g(t), \tilde{\varphi}_k(t) \rangle \quad (20)$$

Substituting each $\eta_k$ coefficient with its analytic expression in (20), we obtain:

$$g(t) = \sum_k \langle g(t), \tilde{\varphi}_k(t) \rangle\, \varphi_k(t) \quad (21)$$

Obviously, in the present case of biorthogonality, the coefficients in (20) are obtained using the dual basis and the function is reconstructed in (21) using another basis, which is the expansion set. In cases where $\{\tilde{\varphi}_k(t)\} = \{\varphi_k(t)\}$ we have an orthogonal basis $\{\varphi_k(t)\}$, which is referred to as self-dual. Therefore, biorthogonality is a generalization of orthogonality that allows for a larger class of expansions. Recall that our objective is to estimate the bias term of an unknown functional form, captured by $M(\cdot)$ in (14), the conditional expectation of $\varepsilon_i$ given participation, defined as $E[\varepsilon_i \mid y_i = 1] = M(w_i^T\gamma)$. In what follows, we attend to the estimation of $M(\cdot)$ using the wavelet estimator. We use the concept of a frame (Definition 1) to define a Riesz basis (Definition 2).
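The biorthogonal expansion in (20)-(21) can be checked numerically. The following is a minimal sketch in which a random invertible matrix stands in for the expansion set (the names `B`, `B_dual` are illustrative, not the paper's actual wavelet bases): the dual basis is obtained as the columns of the inverse-transpose, so that coefficients are computed against the dual basis and the function is reconstructed with the original one.

```python
import numpy as np

# Toy biorthogonal system: columns of B play the role of {phi_k};
# the dual basis {phi~_k} is given by the columns of (B^T)^{-1}.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)   # invertible stand-in
B_dual = np.linalg.inv(B.T)

# Biorthogonality: <phi_k, phi~_k'> = d(k - k'), the Kronecker delta
G = B.T @ B_dual
assert np.allclose(G, np.eye(4))

# Unique representation g = sum_k <g, phi~_k> phi_k   (eqs. (20)-(21))
g = rng.standard_normal(4)
eta = B_dual.T @ g          # expansion coefficients via the dual basis
assert np.allclose(B @ eta, g)
```

When `B` is orthogonal, `B_dual` coincides with `B` (the self-dual case described above); the construction above works for any invertible expansion set, which is the sense in which biorthogonality generalizes orthogonality.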
A Riesz basis is a building block in the definition of biorthogonal wavelets (Definition 3) to follow. Let $H$ be a separable Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. We denote a sequence $F = \{f_k, k \in \Lambda\} \subset H$, in which $\Lambda \subset \mathbb{Z}$. We use the following frame and Riesz basis definitions [31]:
Definition 1 (Frame). $F$ is called a frame if there are constants $0 < A \leq B$ such that $\forall f \in H$: $A\|f\|^2 \leq \sum_{k \in \Lambda} |\langle f, f_k \rangle|^2 \leq B\|f\|^2$.

Definition 2 (Riesz basis). A sequence $F$ is a Riesz basis if and only if it is a frame with the additional property that upon the removal of any element from the sequence, it ceases to be a frame.

Let $L^2(\mathbb{R})$ be the space of square integrable, real-valued functions on $\mathbb{R}$. We use the following biorthogonal wavelet definition [32]. (Here $d(k) = 1$ if $|k| = 0$ and $d(k) = 0$ if $|k| > 0$. The set $\{\varphi_k(t)\}$ is orthogonal if $\langle \varphi_k, \varphi_{k'} \rangle = 0\ \forall k \neq k'$. For brevity, we present only one of the bias terms; identical treatment is applied to the other.)

Definition 3 (Biorthogonal wavelet). A pair of functions $\varphi_{j,k}, \tilde{\varphi}_{j,k} \in L^2(\mathbb{R})$ is a pair of biorthogonal wavelets if the sets $\{\varphi_{j,k} \mid j, k \in \mathbb{Z}\}$ and $\{\tilde{\varphi}_{j,k} \mid j, k \in \mathbb{Z}\}$ form Riesz bases for $L^2(\mathbb{R})$ and if any function $g \in L^2(\mathbb{R})$ has the representation $g = \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} \langle g, \tilde{\varphi}_{j,k} \rangle \varphi_{j,k}$.

It is worth noticing that both existence and uniqueness of the series representation are satisfied in Definition 3. However, our proposed nonparametric estimator might be unstable, rendering the estimation problem ill-posed, which is one of the challenges in nonparametric estimation of unknown functions. This ill-posed problem can be alleviated by employing regularization on the wavelet series expansion coefficients [33], [34]. Let $u_i = y_i - x_i^T\beta$ be the $i$'th element in the vector $u$ of size $n \times 1$, which satisfies:

$$u_i = M(t_i) + \epsilon_i \quad (22)$$

where $\{t_i\}_{i=1}^n$ is a sequence in which the $i$'th element satisfies $t_i = w_i^T\gamma$ and $\epsilon_i$ is the white noise described in (14).
We use $\Phi_I$ and $\Phi_F \equiv \Phi_I^{-1}$ to denote the inverse and forward transformation matrices, respectively, each of size $n \times n$ [35], in an orthogonal wavelet. We note that using orthogonal wavelets one obtains the closed-form solution for the wavelet coefficients as follows:

$$\widehat{M} = \Phi_I\, \rho_\lambda(\Phi_F u) \quad (23)$$

where $\widehat{M}(\cdot)$ is the estimate of the unknown function $M(\cdot)$, and $\rho_\lambda(\cdot)$ represents the element-wise thresholding generated by some penalty function, in which the tuning parameter is represented by $\lambda$. The procedure in (23), obtaining $\widehat{M}(\cdot)$ by employing a thresholding operator, is referred to as denoising. We depart from the denoising procedure in (23) by employing biorthogonal wavelets, as we are interested in the applicability of the general case where the penalization is not element-wise, due to correlations among wavelet regressors. In such cases no closed-form solution exists, which necessitates the regularized least squares optimization method to follow. Let $\Psi_I$ and $\Psi_F \equiv (\Psi_I^T \Psi_I)^{-1}\Psi_I^T$ denote the inverse and forward transformation matrices, respectively, each of size $n \times n$, of the wavelet-based JPEG, which is a biorthogonal wavelet. (An estimator violating at least one of the requirements of existence, uniqueness and stability is referred to as ill-posed. A more general formulation solves the ill-posed problem by employing regularization in cases where a linear transform of the unknown function replaces the original function [33]. The forward transform is referred to as the Discrete Wavelet Transform (DWT). It has been shown that the performance of the wavelet estimator can be improved when the dependencies among coefficients are taken into account [36].)
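The relation $\Psi_F \equiv (\Psi_I^T \Psi_I)^{-1}\Psi_I^T$ says that the forward transform is the pseudoinverse of the inverse transform, which preserves perfect reconstruction even when $\Psi_I$ is not orthogonal. A quick numerical sketch (a random invertible matrix stands in for the actual JPEG transform, which is an assumption for illustration only):

```python
import numpy as np

# Psi_I below is a random well-conditioned stand-in, NOT the actual
# wavelet-based JPEG inverse transform.
rng = np.random.default_rng(1)
Psi_I = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)

# Forward transform as the pseudoinverse: Psi_F = (Psi_I^T Psi_I)^{-1} Psi_I^T
Psi_F = np.linalg.solve(Psi_I.T @ Psi_I, Psi_I.T)

u = rng.standard_normal(6)
delta = Psi_F @ u                      # forward: coefficients
assert np.allclose(Psi_F @ Psi_I, np.eye(6))   # perfect reconstruction
assert np.allclose(Psi_I @ delta, u)           # exact round-trip
```

For an orthogonal wavelet, $\Psi_I^T \Psi_I = I$ and the expression collapses to $\Psi_F = \Psi_I^T$, recovering the transpose relationship of the orthogonal case.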
Let $\rho_{\lambda,\gamma}(\cdot)$ be the minimax concave penalty (MCP) function [37], defined as [38]:

$$\rho_{\lambda,\gamma}(\theta) = \begin{cases} \lambda\theta - \dfrac{\theta^2}{2\gamma} & \text{if } \theta \leq \gamma\lambda, \\[1ex] \tfrac{1}{2}\gamma\lambda^2 & \text{if } \theta > \gamma\lambda \end{cases} \quad (24)$$

where $\theta \in (-\infty, \infty)$ is the parameter to be penalized, $\lambda > 0$ and $\gamma \in (1, \infty)$. We define the resolution-dependent regularized least squares at resolution levels $1, \dots, J$:

$$\widehat{\delta} = \arg\min_{\delta}\ \frac{1}{2n}\left\|u - \Psi_I\delta\right\|^2 + \sum_{j=1}^{J} P_{\lambda_j,\gamma_j}(\delta_j) \quad (25)$$

where $\delta = [\delta_1^T, \delta_2^T, \dots, \delta_J^T]^T$ is the wavelet coefficient vector of size $n \times 1$ and $\delta_j$ is of size $n_j \times 1$. $\|\cdot\|$ is the usual $\ell_2$ (Euclidean) norm, defined as $\|b\| = \left(\sum_{i=1}^n |b_i|^2\right)^{1/2}$. The penalty function is $P_{\lambda_j,\gamma_j}(\delta_j) = \sum_{k=1}^{n_j} \rho_{\lambda_j,\gamma_j}(|\delta_{j,k}|)$. It is evident that when $\delta_j \to 0$, the bias propagated by the endogenous truncation approaches zero and thus our algorithm reduces to the conventional IV estimator. The univariate solution of a regularized least squares problem using the penalty function in (24) is denoted by $S_\alpha(\cdot)$ and defined as:

$$S_\alpha(\tilde{\delta}; \lambda, \gamma) = \begin{cases} \dfrac{1}{1 - 1/(\alpha\gamma)}\, \mathrm{sign}(\tilde{\delta})\max\!\left(|\tilde{\delta}| - \lambda/\alpha,\, 0\right) & \text{if } |\tilde{\delta}| \leq \gamma\lambda, \\[1ex] \tilde{\delta} & \text{if } |\tilde{\delta}| > \gamma\lambda \end{cases} \quad (26)$$

where $\alpha \in (0, \infty)$. It is worth noting that if $\gamma \to \infty$ the solution is the soft-thresholding introduced by [40]; in the case that $\alpha\gamma \to 1^+$ the solution is hard-thresholding (see proof in Appendix A). To reduce computational complexity, the optimization problem in (25) is reformulated as:

$$\delta^{(iter+1)} = \arg\min_{\delta}\ -\frac{1}{n}\left(\delta - \delta^{(iter)}\right)^T \Psi_I^T \left(u - \Psi_I\delta^{(iter)}\right) + \frac{\alpha}{2}\left\|\delta - \delta^{(iter)}\right\|^2 + \sum_{j=1}^{J} P_{\lambda_j,\gamma_j}(\delta_j) \quad (27)$$

where $\Psi_I^T$ is the transpose of the matrix $\Psi_I$, $\alpha I$ is an approximation of the Hessian and $I$ is the identity matrix of size $n \times n$. (For a definition of biorthogonal wavelets see [24]. Unlike biorthogonality, orthogonality implies that the wavelet regressors are mutually uncorrelated and that the inverse transform is the transpose of the forward transform. This simplifies the computation, as the wavelet coefficients are obtained analytically (in closed form) using element-wise thresholding operators, e.g., the hard and soft thresholding operators. However, we opted for the biorthogonal wavelet to exploit the correlation structure of the regressors. Biorthogonal wavelets preserve the perfect reconstruction property as well (by employing dual filters), but are more flexible in that the inverse of the transform matrix is not required to be its transpose. Consequently, the thresholding is applied to the entire coefficient vector. The penalty function in (24) represents a family of penalty functions as a generalization of soft thresholding (if $\gamma \to \infty$) and hard thresholding (if $\alpha\gamma \to 1^+$) [38]. In order to utilize the minimax concave (MCP) penalty function in (24), we depart from the regularized least squares algorithm in [39], as it is limited to its special case of the LASSO penalty function. We introduce $\alpha$ as an approximation to the Hessian of the least squares problem in order to obtain an element-wise thresholding. This amounts to a dimension reduction technique for reducing computational complexity. For the special case of $\alpha = 1$, see [38].)
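The MCP penalty (24) and its thresholding operator (26) can be sketched in a few lines. The function names below are illustrative; the reconstruction of the piecewise formulas follows the standard MCP of [37], with soft thresholding recovered as $\gamma \to \infty$ and hard thresholding as $\alpha\gamma \to 1^+$:

```python
import numpy as np

def mcp_penalty(theta, lam, gamma):
    """MCP of eq. (24): quadratically tapered below gamma*lam, constant above."""
    theta = np.abs(theta)
    return np.where(theta <= gamma * lam,
                    lam * theta - theta ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_threshold(d, lam, gamma, alpha=1.0):
    """Univariate solution S_alpha of eq. (26); requires alpha*gamma > 1.
    Inputs beyond gamma*lam pass through untouched (no bias on large
    coefficients), while small inputs are shrunk toward zero."""
    soft = np.sign(d) * np.maximum(np.abs(d) - lam / alpha, 0.0)
    return np.where(np.abs(d) <= gamma * lam,
                    soft / (1.0 - 1.0 / (alpha * gamma)),
                    d)
```

For example, with $\lambda = 1$, $\gamma = 4$, $\alpha = 1$: an input of 0.5 is set to zero, an input of 2 becomes $1/(1 - 1/4) \cdot (2 - 1) = 4/3$, and an input of 5 (beyond $\gamma\lambda = 4$) is left unchanged, which is the lack-of-bias property that distinguishes MCP from the soft threshold.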
The number of iterations is denoted by the integer $iter$. For brevity, we divide the argument to be minimized by $\alpha$ and complete the squares using the expressions in $\|\cdot\|^2$ [39] to get:

$$\delta^{(iter+1)} = \arg\min_{\delta}\ \frac{1}{2}\left\|\delta - \left(\delta^{(iter)} + \frac{1}{\alpha n}\Psi_I^T\left(u - \Psi_I\delta^{(iter)}\right)\right)\right\|^2 + \frac{1}{\alpha}\sum_{j=1}^{J} P_{\lambda_j,\gamma_j}(\delta_j) \quad (28)$$

The iterative procedure performs MCP-thresholding on a proximal gradient-descent update for $k = 1, \dots, n$ (see Algorithm 5 in the Appendix):

$$\delta_{j,k}^{(iter+1)} = S_\alpha\!\left(\delta_{j,k}^{(iter)} + \frac{1}{\alpha n}\,\psi_{I_k}^T\left(u - \Psi_I\delta^{(iter)}\right);\ \lambda_j, \gamma_j\right) \quad (29)$$

where $\delta_{j,k}^{(iter)}$ is the $k$'th coefficient in the vector $\delta_j^{(iter)}$ and $\psi_{I_k}$ is the $k$'th column in $\Psi_I$ (the inverse wavelet transform). The notation $\psi_{I_k}^T$ denotes the transpose of $\psi_{I_k}$. We use (29) to update the wavelet coefficients iteratively until the update is negligible, such that the following convergence criterion is satisfied:

$$\left\|\delta^{(iter+1)} - \delta^{(iter)}\right\| \Big/ \left\|\delta^{(iter)}\right\| < \tau \quad (30)$$

where $\tau$ is the tolerance, a positive real number that we set arbitrarily. The optimization method in (29) involves matrix multiplication, which is computationally infeasible for large data sets. To alleviate this computational complexity we develop a lifting scheme to be employed in order to perform the transposed inverse of the wavelet transform, consisting of lifting steps (see Algorithms 1-4 to follow). Conventionally, a lifting step can be either a prediction, a procedure generating a smoothed version of the data (the scaled coefficients), or an update, a procedure generating the remainder (the detail coefficients) between the data and its smoothed version.
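The update (29) with the stopping rule (30) amounts to a proximal gradient loop. A minimal sketch follows; it uses a single $(\lambda, \gamma)$ pair rather than the paper's per-resolution-level pairs, and the function names are illustrative:

```python
import numpy as np

def mcp_threshold(d, lam, gamma, alpha):
    """Element-wise MCP thresholding S_alpha of eq. (26); alpha*gamma > 1."""
    soft = np.sign(d) * np.maximum(np.abs(d) - lam / alpha, 0.0)
    return np.where(np.abs(d) <= gamma * lam,
                    soft / (1.0 - 1.0 / (alpha * gamma)), d)

def prox_gradient_mcp(u, Psi, lam, gamma, alpha, tol=1e-6, max_iter=1000):
    """Proximal gradient iteration of eqs. (28)-(30): a gradient step on the
    least-squares term, then MCP thresholding, stopped when the relative
    coefficient change falls below tol (the criterion in (30))."""
    n = len(u)
    delta = np.zeros(Psi.shape[1])
    for _ in range(max_iter):
        step = delta + (1.0 / (alpha * n)) * Psi.T @ (u - Psi @ delta)
        new = mcp_threshold(step, lam, gamma, alpha)
        if np.linalg.norm(new - delta) < tol * (np.linalg.norm(delta) + 1e-12):
            return new
        delta = new
    return delta
```

With $\lambda = 0$ the thresholding is inactive and the loop reduces to plain gradient descent on the least-squares term, so for an identity transform the coefficients converge to the data itself; this is a convenient sanity check of the iteration.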
For the present case we define a new operator, because the existing lifting steps do not provide an analytic representation of the transposed inverse, as discussed in [7]. In the next section we discuss the main idea behind lifting steps, in order to obtain the transposed-inverse transform analytically. First we describe the lifting steps for regularly-spaced data given a sample size of $2^J$ for a non-negative integer $J$. Then, in equations (1)-(IV-E) to follow, we alleviate these two restrictions by formulating our proposed algorithm.
1) Lifting steps to obtain the wavelet coefficients
Let $w = (w_1, \dots, w_n)$ be a discrete sequence of data consisting of $n$ real numbers; the sequence is referred to as dyadic iff $n = 2^J$ for some integer $J \geq 0$. The sequence can be expressed uniquely in terms of detail (difference) and summation coefficients, denoted by $\{d_{J-1,k}\}_{k=1}^{n/2}$ and $\{c_{J-1,k}\}_{k=1}^{n/2}$, respectively. The former capture the variation in the sequence at different scales and locations, and the latter are a smooth representation of the original sequence. The multi-scale representation of a function $g \in L^2(\mathbb{R})$ is obtained as follows:

$$g(t) = \sum_{k \in \mathbb{Z}} c_{0,k}\,\phi_{0,k}(t) + \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} d_{j,k}\,\varphi_{j,k}(t) \quad (31)$$

The first set of terms, $\phi_{0,k}$, represents the average level of the function $g$ and the second set of terms, $\varphi_{j,k}$, represents its details by accumulating information at a set of scales $j \in \mathbb{Z}$. Let $\{0, \dots, J-1\}$ denote a set of scales (resolution levels). We define $d_{J-1,k}$ and $c_{J-1,k}$ as follows [41]:

$$d_{J-1,k} = w_{2k} - w_{2k-1}, \qquad c_{J-1,k} = w_{2k} + w_{2k-1}, \qquad k = 1, \dots, 2^{J-1}. \quad (32)$$

The key idea is that a small detail coefficient $d_{J-1,k}$ implies that $w_{2k}$ is very close to $w_{2k-1}$, and vice versa; as such, a smoother function is represented by a sequence of small detail coefficients. In order to represent the sequence at a coarser scale (using a lower resolution), we define the coefficients:

$$d_{J-2,k} = c_{J-1,2k} - c_{J-1,2k-1}, \qquad c_{J-2,k} = c_{J-1,2k} + c_{J-1,2k-1}, \qquad k = 1, \dots, 2^{J-2}. \quad (33)$$

By repeating the procedure in (33) we obtain detail and smooth coefficients at lower resolutions. The multiscale algorithm stops when the single coarsest smooth coefficient is produced. Next we discuss how to select the thresholding (tuning) parameter in (25) optimally for each resolution level.
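The recursion in (32)-(33) can be sketched directly. The snippet below uses the unnormalized pairwise differences and sums exactly as written above (any scaling of the smooth channel is omitted, matching the equations; the function name is illustrative):

```python
import numpy as np

def haar_lifting_forward(w):
    """Multiscale decomposition of eqs. (32)-(33): pairwise differences give
    the detail coefficients, pairwise sums the smooth ones, repeated on the
    smooth channel until a single coarsest coefficient remains.
    (0-based slicing: w[1::2] - w[0::2] is w_{2k} - w_{2k-1} in 1-based terms.)"""
    c = np.asarray(w, dtype=float)
    n = len(c)
    assert n > 0 and n & (n - 1) == 0, "dyadic length required: n = 2**J"
    details = []
    while len(c) > 1:
        details.append(c[1::2] - c[0::2])   # d: detail (difference)
        c = c[1::2] + c[0::2]               # c: smooth (sum)
    return c[0], details
```

A constant sequence produces all-zero details at every scale, illustrating the sparsity argument: the smoother the function, the smaller the detail coefficients that the thresholding step can discard.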
2) Optimal thresholding by a reference-free criterion function
Since we deal with truncated distributions, the source (the complete non-truncated distribution) is intrinsically unobservable and thus we cannot assess the success of denoising by comparing it to the original non-truncated distribution. Therefore, we utilize the "two-fold cross-validation" in [42], which is a reference-free criterion function assessing the quality of the function estimated by denoising:

$$\lambda_{j,\gamma_j} = \arg\min_{\lambda_{j,\gamma_j}} \left\{ \frac{1}{2}\left\|\widehat{f}^{\,o}_{\lambda_{j,\gamma_j}} - u^e_j\right\|^2 + \frac{1}{2}\left\|\widehat{f}^{\,e}_{\lambda_{j,\gamma_j}} - u^o_j\right\|^2 \right\} \quad (34)$$

where $\lambda_{j,\gamma_j}$ is the tuning (thresholding) parameter used in (29), in which $\gamma_j$ indexes a specific penalty function. The odd and even subsamples are denoted by $u^o_j$ and $u^e_j$, respectively, and their corresponding estimates are $\widehat{f}^{\,o}_{\lambda_{j,\gamma_j}}$ and $\widehat{f}^{\,e}_{\lambda_{j,\gamma_j}}$. These estimates are obtained by employing the iterative procedure in (29). As previously discussed, our proposed truncation-proof IV estimator requires controlling for both bias terms. For generality and applicability of the proposed estimator, we adopt a semiparametric approach which is not subject to distributional assumptions and consequently does not require specifying the functional forms of these unknown functions. (The methodology implemented in [42] chooses one threshold that is applicable to all resolution levels in the wavelet transform. In the present case, however, we select a threshold for each level in order to implement a multi-resolution analysis, increasing our proposed estimator's accuracy.)
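The reference-free criterion (34) can be sketched as follows. A simple soft-threshold rule stands in for the full wavelet denoiser of (29), and the interface (`denoise(sub, lam)`) is a hypothetical simplification; the point is that each half-sample is scored against the *other* half, so no clean reference signal is ever required:

```python
import numpy as np

def soft_denoise(s, lam):
    """Toy denoiser: soft-threshold each point toward zero. A stand-in for
    the wavelet denoiser of eq. (29), used here only for illustration."""
    return np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)

def twofold_cv(u, denoise, lambdas):
    """Reference-free two-fold cross-validation in the spirit of eq. (34):
    denoise the odd subsample and score it against the even one, and vice
    versa, then pick the lambda minimizing the combined error."""
    u = np.asarray(u, dtype=float)
    n = len(u) - (len(u) % 2)              # use an even number of points
    u_odd, u_even = u[0:n:2], u[1:n:2]
    scores = []
    for lam in lambdas:
        f_odd, f_even = denoise(u_odd, lam), denoise(u_even, lam)
        scores.append(0.5 * np.sum((f_odd - u_even) ** 2)
                      + 0.5 * np.sum((f_even - u_odd) ** 2))
    return lambdas[int(np.argmin(scores))]

# On pure noise the criterion favours heavy thresholding over none at all
rng = np.random.default_rng(0)
best = twofold_cv(rng.standard_normal(400), soft_denoise, [0.0, 10.0])
```

Because neighbouring points share the underlying signal but not the noise, the odd half is a usable proxy reference for the even half; this is exactly what replaces the unobservable non-truncated distribution here.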
The wavelet-based JPEG semiparametric estimator is chosen for its many advantages. It enables a multi-resolution representation of the noisy data points, implying that the data points are characterized both globally and locally. Such a multi-resolution decomposition facilitates distinguishing between the noise and the systemic part. The systemic part is the functional relationship between the covariates and the dependent variables in the regression equations in (14). An additional advantage of incorporating the aforementioned newly introduced JPEG estimator is that it accommodates data sets with different types of irregularities, such as the non-equispaced design that will be described in section IV-B to follow. These irregularities are alleviated by introducing the locations in space of the various data points as an additional covariate that is unique for each resolution level. An additional merit of our approach is enabling a group-wise denoising rather than the traditional element-wise JPEG denoising used in image processing. Group-wise denoising plays an important role in data denoising, as it takes into account potential dependencies among the various data points. Thus, our contributions to the JPEG algorithm are controlling for irregularities, group-wise thresholding on the entire data and utilizing a reference-free criterion function to choose the optimal threshold. In the next section we describe the JPEG algorithm, which is introduced to estimate each of the bias terms. Although our proposed denoising procedure is applicable to both even and odd sample sizes (as will be demonstrated in Algorithm 1 to follow), for ease of presentation and without loss of generality, the denoising procedure is formulated as a function of a data set consisting of $n$ observations.

IV. THE WAVELET-BASED JPEG DENOISING
Let $\{(u_i, t_i)\}_{i=1}^n$ be a pairwise sequence of $n$ data points as described in (22), such that $t_i < t_j\ \forall i < j$. The sequence $\{u_i\}_{i=1}^n$ denotes the noisy data points (or colors of pixels in image processing) and their respective locations in space are represented by $\{t_i\}_{i=1}^n$. The JPEG algorithm is a procedure generating a multi-resolution denoised representation of the sequence $\{u_i\}_{i=1}^n$, which is denoted by the sequence $\{\widehat{u}_i\}_{i=1}^n$. The purpose of the present section is three-fold: first, to describe the JPEG algorithm to be employed in order to obtain a noise-free representation of the noisy data; second, to extend the JPEG algorithm to be compatible with irregularities in the data; and third, to incorporate a reference-free criterion to evaluate the accuracy of the denoising procedure. (Global representation is a weighted average (smoothing) of the data, while local representation consists of more detailed information regarding first differences between neighboring data points. We define $\Delta_i \equiv t_i - t_{i-1}\ \forall\, 2 \leq i \leq n$, such that equispaced (regular) data is a sequence of data points satisfying $\Delta_i = \Delta_j\ \forall i$ and $j$. Other cases are referred to as non-equispaced (irregularly) spaced data.) Applying the conventional JPEG algorithm to a vector of data points is equivalent to employing three different procedures on the noisy data: (i) the JPEG forward transform $T_F: \mathbb{R}^{n\times 1} \to \mathbb{R}^{n\times 1}$ to obtain the wavelet-based JPEG coefficients (as will be shown in (45) to follow); (ii) coefficient selection $T_S: \mathbb{R}^{n\times 1} \to \mathbb{R}^{n\times 1}$ by applying a thresholding procedure (as will be shown in (49)); and (iii) the JPEG inverse transform $T_I: \mathbb{R}^{n\times 1} \to \mathbb{R}^{n\times 1}$, which recovers the noise-free data by utilizing the selected coefficients (as will be shown in (46) to follow). In the ensuing section we introduce auxiliary matrices to be used in each of the JPEG transforms, which are essential to construct the covariate matrix in the wavelet-based JPEG regression (in (48) to follow). A.
AUXILIARY MATRICES FOR THE JPEG ALGORITHM
The implementation of the JPEG algorithm necessitates the construction of the $T_F$ and $T_I$ operators. For this purpose, we construct the auxiliary matrices $A_n$, $S_n$ and $\{H^{(t)}_{n,\ell}\}_{\ell=1}^{4}$, which are the shifting, rescaling and smoothing operator matrices, respectively, each of size $n \times n$. Let $S_n$ and $S_n^{-1}$ be the rescaling and inverse-rescaling matrices, respectively, each of size $n \times n$. Their elements are defined for $m = 0, \dots, n/2$ as:

$$(S_n)_{i,j} = \begin{cases} 1/\phi & \text{if } i = j = 2m \\ \phi & \text{if } i = j = 2m+1 \\ 0 & \text{if } i \neq j \end{cases}, \qquad (S_n^{-1})_{i,j} = \begin{cases} \phi & \text{if } i = j = 2m \\ 1/\phi & \text{if } i = j = 2m+1 \\ 0 & \text{if } i \neq j \end{cases} \quad (35)$$

The rescaling operator $\tilde{v} = S_n v$ takes a vector $v$ of size $n \times 1$ and returns a rescaled vector $\tilde{v}$ of the same size, such that the even and odd elements of the original vector are multiplied by the scalars $1/\phi$ and $\phi$, respectively. Let $A_n$ be a shifting operator matrix of size $n \times n$; its elements are defined for $m = 0, \dots, n/2$ as:

$$(A_n)_{i,j} = \begin{cases} 1 & \text{if } (i > n/2,\ j = 2m) \text{ or } (i \leq n/2,\ j = 2m+1) \\ 0 & \text{otherwise} \end{cases} \quad (36)$$

The operator $\tilde{v} = A_n v$ takes a vector $v = [v_1, \dots, v_n]^T$ of size $n \times 1$ and returns the vector $\tilde{v} = [v_{odd}^T, v_{even}^T]^T$. The vectors $v_{odd} = [v_1, \dots, v_{n-3}, v_{n-1}]^T$ and $v_{even} = [v_2, \dots, v_{n-2}, v_n]^T$ consist of the odd and even elements of $v$, respectively. Unlike the conventional JPEG, we allow for data irregularities by controlling for the data set's location in space. To do so, we denote a sequence of matrices $\{H^{(t)}_{n,\ell}\}_{\ell=1}^{4}$, such that the elements of the matrix $H^{(t)}_{n,\ell}\ \forall \ell \in \{1,2,3,4\}$, of size $n \times n$, are defined for $m = 1, \dots, n/2$ as:

$$\left(H^{(t)}_{n,\ell}\right)_{i,j},\ \ell \in \{1,3\}: \quad \begin{cases} \pi_\ell\, \omega^{(t)}_{\ell, i-1} & \text{if } i = 2m \text{ and } j = i-1 \\ 1 & \text{if } j = i \\ \pi_\ell\,\left(1 - \omega^{(t)}_{\ell, i-1}\right) & \text{if } i = 2m \text{ and } j = i+1 \\ 0 & \text{otherwise} \end{cases} \quad (37)$$

$$\left(H^{(t)}_{n,\ell}\right)_{i,j},\ \ell \in \{2,4\}: \quad \begin{cases} \pi_\ell\, \omega^{(t)}_{\ell, i-1} & \text{if } i = 2m+1 \text{ and } j = i-1 \\ 1 & \text{if } j = i \\ \pi_\ell\,\left(1 - \omega^{(t)}_{\ell, i-1}\right) & \text{if } i = 2m+1 \text{ and } j = i+1 \\ 0 & \text{otherwise} \end{cases} \quad (38)$$
where each of the sequences $\{\omega^{(t)}_{\ell,l}\}\ \forall \ell \in \{1,2,3,4\}$ are the interpolation weights (defined in (39) to follow) that control for the location in space of the data points (enabling irregular, non-equispaced data to be used), and $\pi_1, \pi_2, \pi_3, \pi_4$ are scalar constants described in [43], which are referred to as the filter coefficients of the wavelet-based JPEG. In the special case in which $\omega^{(t)}_{\ell,l} = 0.5\ \forall l$ and $\ell \in \{1,2,3,4\}$, the algorithm reduces to the regular-spaced wavelet-based JPEG. We define the linear interpolation weights:

$$\omega^{(t)}_{\ell, i} = \begin{cases} \dfrac{\tilde{t}_{i+1} - \tilde{t}_i}{\tilde{t}_{i+1} - \tilde{t}_{i-1}} & \text{if } \ell \in \{1,3\} \\[2ex] \dfrac{\tilde{t}_{i+2} - \tilde{t}_{i+1}}{\tilde{t}_{i+2} - \tilde{t}_i} & \text{if } \ell \in \{2,4\} \end{cases} \quad (39)$$

$$\tilde{t}_l = \begin{cases} t_1 & \text{if } l = 2m,\ l < 1 \\ t_l & \text{if } 1 \leq l \leq n \\ t_{n-1} & \text{if } l = 2m+1,\ l > n-1 \end{cases} \quad (40)$$

For tractability, we formulate the JPEG coefficient estimation problem as a linear regression estimation, which necessitates obtaining closed-form expressions of the JPEG forward and inverse transforms. These closed-form expressions are required to characterize the JPEG covariate matrix to be used in the wavelet-based JPEG regression. In the following section we express analytically each of the forward and inverse transforms in matrix notation, as functions of the auxiliary matrices presented above.

B. THE VARIOUS JPEG TRANSFORMS IN MATRIX NOTATION
Employing our proposed JPEG algorithm on a data set involves representing the data set at multiple resolution levels, a property referred to as multi-resolution analysis. Let $J$ be the highest resolution level, which requires the same number of data points as in the noisy data set. The data set representation at the $j$'th resolution level $\forall j < J$ is a transformation of the data set representation at the finer (higher) resolution level $j+1$. Consequently, the JPEG noise-free representation $\forall j < J$ can be formulated recursively. However, the implication of this formulation is that the location in space of the data points at any given resolution level is also determined recursively. This fact stems from depicting the noisy data set $u = [u_1, \dots, u_n]^T$ and its location in space $t = [t_1, \dots, t_n]^T$ as a pairwise sequence. For ease of notation we construct the adjusted space location operator $\forall j < J$:

$$t_j \equiv \prod_{h=j+1}^{J} \tilde{A}_h\, t, \qquad \tilde{A}_j = \begin{bmatrix} A_{m(j) \times m(j)} & 0_{m(j) \times (m(J) - m(j))} \\ 0_{(m(J) - m(j)) \times m(j)} & I_{(m(J) - m(j)) \times (m(J) - m(j))} \end{bmatrix} \quad (41)$$

where $m(j) \equiv \lceil n/2^{J-j} \rceil$, $J = \lceil \log_2(2n) \rceil$ is the number of resolution levels and $j$ is a specific resolution level. (The operator notation $\lceil \cdot \rceil$ represents the ceiling of a real number.) This recursive formulation takes the noisy data points' locations in space as control variables, which are essential for alleviating irregularities in the noisy data. Using the adjusted space location sequence $\{t_j\}$ in (41), we define the matrix $\Psi^{(t)}_F$ (to be used in (45) to follow) for $J \in$
$\{1, \dots, \lceil \log_2(2n) \rceil\}$ resolution levels as:

$$\Psi^{(t)}_F \equiv \prod_{j=1}^{J-1} \tilde{\Phi}^{(t_j)}_j\, \Phi^{(t)}_{m(J) \times m(J)} \quad (42)$$

$$\tilde{\Phi}^{(t)}_j = \begin{bmatrix} \Phi^{(t)}_{m(j) \times m(j)} & 0_{m(j) \times (m(J) - m(j))} \\ 0_{(m(J) - m(j)) \times m(j)} & I_{(m(J) - m(j)) \times (m(J) - m(j))} \end{bmatrix} \quad (43)$$

where $\Phi^{(t)}_{m \times m} \equiv A_m S_m H^{(t)}_{m,4} H^{(t)}_{m,3} H^{(t)}_{m,2} H^{(t)}_{m,1}$ and $I_{m \times m}$ is the identity matrix of size $m \times m$. It is worth noticing that $\Phi^{(t)}_{m \times m}$ is a product of invertible matrices and consequently its inverse is characterized as:

$$\left(\Phi^{(t)}_{m \times m}\right)^{-1} = \left(H^{(t)}_{m,1}\right)^{-1} \left(H^{(t)}_{m,2}\right)^{-1} \left(H^{(t)}_{m,3}\right)^{-1} \left(H^{(t)}_{m,4}\right)^{-1} S_m^{-1} A_m^{-1} \quad (44)$$

The multi-resolution JPEG forward transform operator $T_F(u, t)$ is the linear transform:

$$\delta = \Psi^{(t)}_F u, \qquad T_F(u, t) = \Psi^{(t)}_F u \quad (45)$$

where $u$, $t$ and $\delta$ are the noisy data, the location in space and the JPEG coefficient vectors, respectively, each of size $n \times 1$. Similarly, the multi-resolution JPEG inverse transform operator $T_I(\delta, t)$ is the linear transform:

$$u = \Psi^{(t)}_I \delta, \qquad T_I(\delta, t) = \Psi^{(t)}_I \delta \quad (46)$$

where $u$, $t$ and $\delta$ are as above, each of size $n \times 1$. The matrix $\Psi^{(t)}_I$ in (46) is defined for $J \in \{1, \dots, \lceil \log_2(2n) \rceil\}$ resolution levels as:

$$\Psi^{(t)}_I \equiv \left(\Phi^{(t)}_{m(J) \times m(J)}\right)^{-1} \prod_{j=J-1}^{1} \left(\tilde{\Phi}^{(t_j)}_j\right)^{-1} \quad (47)$$

where $\Psi^{(t)}_I$ is the analytic inverse transform operator. We introduce the wavelet-based JPEG nonparametric regression given a non-equispaced irregular data set:

$$u(t) = \left(\Psi^{(t)}_I \delta\right)_t = \sum_{k \in \mathbb{Z}} c_{0,k}\,\phi_{0,k}(t) + \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} d_{j,k}\,\varphi_{j,k}(t) \quad (48)$$

where $\delta$ consists of the sequences of coefficients $\{c_{0,k}\}$ and $\{d_{j,k}\}$, which capture the function's average behavior and details, respectively.
The matrix $\Psi^{(t)}_I$ consists of the sequences of covariates $\{\phi_{0,k}(t)\}$ and $\{\varphi_{j,k}(t)\}$. Unlike the present case, which employs group-wise denoising on the entire data, in the conventional JPEG the coefficient selection operator $T_S(\delta, t)$ is constructed to be used element-wise (for each resolution level $j$), e.g., $T_S(\delta_{ij}, t) = S_\alpha(\delta_{ij}; \lambda_{j,\gamma_j}, \gamma_j)$ using $S_\alpha(\cdot)$ in (26), where $\lambda_{j,\gamma_j}$ and $\gamma_j$ are defined in (34) using $\alpha = 1$, such that $\{\delta_{ij}\}_{i=1}^{m(j)}$ is a subset of the vector $\delta$ consisting of the $j$'th resolution-level coefficients. In the following section, we discuss the JPEG group-wise coefficient selection used to perform denoising.
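The auxiliary operators of section IV-A can be sketched numerically. The snippet below builds $S_n$ of (35), $A_n$ of (36) and the left-neighbour interpolation weights in the spirit of (39); the function names are illustrative, and positions are counted 1-based as in the paper (so the 0-based Python index 0 is the first, odd, position):

```python
import numpy as np

def rescale_matrix(n, phi):
    """S_n of eq. (35): diagonal matrix scaling even positions by 1/phi
    and odd positions by phi (positions counted 1-based as in the paper)."""
    pos = np.arange(1, n + 1)                       # 1-based positions
    return np.diag(np.where(pos % 2 == 0, 1.0 / phi, phi))

def shift_matrix(n):
    """A_n of eq. (36): permutation de-interleaving v into [v_odd, v_even]."""
    A = np.zeros((n, n))
    half = (n + 1) // 2
    for k in range(half):
        A[k, 2 * k] = 1.0               # odd (1-based) positions go on top
    for k in range(n - half):
        A[half + k, 2 * k + 1] = 1.0    # even positions go below
    return A

def interp_weights(t):
    """Interior linear-interpolation weights in the spirit of eq. (39):
    (t~_{i+1} - t~_i) / (t~_{i+1} - t~_{i-1}). On an equispaced grid every
    weight collapses to 0.5, recovering the regular-spaced JPEG case."""
    t = np.asarray(t, dtype=float)
    return (t[2:] - t[1:-1]) / (t[2:] - t[:-2])
```

For example, `shift_matrix(5)` applied to `[1, 2, 3, 4, 5]` returns `[1, 3, 5, 2, 4]`, and on the irregular grid `[0, 1, 3]` the single interior weight is $2/3$ rather than the equispaced $0.5$, which is precisely how the location in space enters the transform.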
1) JPEG group-wise coefficients selection for irregular datadenoising
Lastly, we formulate the transpose of the inverse wavelet transform, in order to employ the group-wise denoising procedure depicted in (29):

$$\tilde{u} = \left(\Psi^{(t)}_I\right)^T u, \qquad \left(\Psi^{(t)}_I\right)^T \equiv \prod_{j=1}^{J-1} \left[\left(\tilde{\Phi}^{(t_j)}_j\right)^{-1}\right]^T \left[\left(\Phi^{(t)}_{m(J) \times m(J)}\right)^{-1}\right]^T \quad (49)$$

where $\left[\left(\Phi^{(t)}_{m \times m}\right)^{-1}\right]^T$ is constructed as:

$$\left[\left(\Phi^{(t)}_{m \times m}\right)^{-1}\right]^T = \left[A_m^{-1}\right]^T \left[S_m^{-1}\right]^T \left[\left(H^{(t)}_{m,4}\right)^{-1}\right]^T \left[\left(H^{(t)}_{m,3}\right)^{-1}\right]^T \left[\left(H^{(t)}_{m,2}\right)^{-1}\right]^T \left[\left(H^{(t)}_{m,1}\right)^{-1}\right]^T \quad (50)$$

The JPEG estimator, denoted by $\widehat{\delta}$, is obtained by minimizing the objective function in (28) using the iterative procedure in (29), given the chosen thresholding level. The latter is determined by minimizing the reference-free criterion function depicted in section III-2. The construction of the wavelet-based JPEG transform matrices involves computational complexity, a problem which can be alleviated by employing a faster algorithm, referred to as 'a lifting step'. In the succeeding section we discuss the algorithm to obtain a denoised representation of a non-equispaced data design by using a procedure that does not necessitate matrix operations, reducing computational complexity.

C. THE IRREGULAR FORWARD TRANSFORM
The irregular forward transform in (45) is the procedure toobtain the wavelet coefficients, δ , as follows: Algorithm 1
The wavelet-based JPEG Forward Transform. procedure FORWARDTRANSFORM(u, Grid, Level)
Require: (i) two real-number vectors u and Grid of size n × 1; (ii) Level ∈ {1, ..., log2(n)}; Ensure:
Output ← a real-number coefficient vector δ of size n × 1;
The JPEG filter coefficients: π1 ← − . ; π2 ← − . ; π3 ← +0. ; π4 ← +0. ; ϕ ← +1.
Start:
n ← length of u
m ← ceiling of (n / 2)
Q ← ceiling of (m / 2)
d ← copy the odd elements of u
s ← copy the even elements of u
top:
NewGrid ← Grid
The parameter vector φδ is used to generate a zero wavelet coefficient for the generated observation: φδ [1] = − × ( π × π × π / (1 + 2 × π × π φδ [2] = 2 × ( π × π / (1 + 2 × π × π φδ [3] = 2 × ( π π × π × π × π / (1 + 2 × π × π if n is an odd number then s [ m ] ← d [ m − ∗ φδ [1] + s [ m − ∗ φδ [2] + d [ m ] ∗ φδ [3] NewGrid[n+1] ← NewGrid[n] + NewGrid[n]-NewGrid[n-1] end if
OddGrid ← copy the odd elements of NewGrid
EvenGrid ← copy the even elements of NewGrid
s ← s + FILTER(d, 0, π1, NewGrid)
d ← d + FILTER(s, 1, π2, NewGrid)
s ← s + FILTER(d, 0, π3, NewGrid)
d ← d + FILTER(s, 1, π4, NewGrid)
u[1:m] ← d[1:m]
u[(m+1):n] ← s[1:(n−m)]
Grid[1:m] ← OddGrid
Grid[(m+1):n] ← EvenGrid
Scaling the wavelet coefficients:
d ← δ[1:Q] ∗ ϕ
s ← δ[(Q+1):m] / ϕ
if Level > 1 then
{u[1:m], Grid[1:m]} ← FORWARDTRANSFORM(u[1:m], Grid[1:m], Level−1)
end if
return (u, Grid)
end procedure

The Filter function in Algorithm 1 is defined as follows:

Algorithm 2
The wavelet Forward and Inverse Filter. procedure FILTER(Series, Even, π, Grid) Require: (i) two real-number vectors: Series of size n × 1 and Grid of size n × 1; (ii) two scalars: Even ∈ {0, 1} and π ∈ ℝ. Ensure:
Output ← Filter, a real-number vector of size n × 1 consisting of the predicted series;
n ← length of Series
O ← copy the odd elements of Grid
E ← copy the even elements of Grid
if Even = 0 then
Low ← O[1:(n−1)]
High ← O[2:n]
weights ← (High − E[1:(n−1)]) / (High − Low)
ωl ← [weights, 0.5]
ωh ← 1 − ωl
S ← [Series[1], Series]
Filter ← π ∗ ωl ∗ S[1:n] + π ∗ ωh ∗ S[2:(n+1)]
else
Low ← E[1:(n−1)]
High ← E[2:n] weights ← (High-O[2:n])/(High-Low) ωl ← [0.5, weights] ωh ← − ωl S ← [Series, Series[n]] Filter ← π ωl *S[1:n] + π ωh *S[2:(n+1)] end if return ( Filter ) end procedure D. THE IRREGULAR INVERSE TRANSFORM
The irregular inverse transform is the procedure used to reconstruct the vector u in (46), as follows:

Algorithm 3
The wavelet-based JPEG Inverse Transform

procedure INVERSETRANSFORM(δ, Grid, Level)
Require: (i) Two real-number vectors δ and Grid of size n × 1; (ii) Level ∈ {1, ..., log2(n)}.
Ensure: Output ← a real-number vector u of size n × 1.
  The JPEG filter coefficients:
    π1 ← −1.586134342; π2 ← −0.052980119; π3 ← +0.882911076; π4 ← +0.443506852; ϕ ← +1.149604398
  Start:
    n ← length of δ
    m ← ceiling of (n / 2^(Level−1))
    Q ← ceiling of (m/2)
  Rescaling the wavelet coefficients:
    d ← δ[1:Q] / ϕ
    s ← δ[(Q+1):m] ∗ ϕ
  OldGrid ← Grid
  for i = 1 to m do
    index ← (2(i − 1) + 1) ∗ (i ≤ Q) + 2(i − Q) ∗ (i > Q)
    Grid[index] ← OldGrid[i]
  end for
  top:
    NewGrid ← Grid[1:m]
    if m is an odd number then
      s[Q] ← 0
      NewGrid[m+1] ← Grid[m] − Grid[m−1] + Grid[m]
    end if
    d ← d − FILTER(s, 1, π4, NewGrid)
    s ← s − FILTER(d, 0, π3, NewGrid)
    d ← d − FILTER(s, 1, π2, NewGrid)
    s ← s − FILTER(d, 0, π1, NewGrid)
  for i = 1 to m do
    index ← (2(i − 1) + 1) ∗ (i ≤ Q) + 2(i − Q) ∗ (i > Q)
    δ[index] ← d[i] ∗ (i ≤ Q) + s[i − Q] ∗ (i > Q)
  end for
  u ← δ
  if Level > 1 then
    {u, Grid} ← INVERSETRANSFORM(δ, Grid, Level−1)
  end if
  return (u, Grid)
end procedure

E. THE IRREGULAR TRANSPOSE OF THE INVERSE TRANSFORM
The irregular transpose of the inverse transform in (49) enables obtaining the vector ũ, as follows (see the transposed-inverse filter function, TRANSINVFILTER, in Algorithm 6):

Algorithm 4
The wavelet-based JPEG Transposed-Inverse Transform

procedure TRANSINVERSETRANSFORM(u, Grid, Level)
Require: (i) Two real-number vectors u and Grid of size m × 1; (ii) Level ∈ {1, ..., log2(m)}.
Ensure: Output ← Ψ_TI u.
  The JPEG filter coefficients:
    π1 ← −1.586134342; π2 ← −0.052980119; π3 ← +0.882911076; π4 ← +0.443506852; ϕ ← +1.149604398
  Start:
    Output ← u
    m ← length of u
    Q ← ceiling of (m/2)
    d ← copy the odd elements of u
    s ← copy the even elements of u
    OddGrid ← copy the odd elements of Grid
    EvenGrid ← copy the even elements of Grid
  top:
    NewGrid ← Grid
    if m is an odd number then
      s[Q] ← 0
      NewGrid[m+1] ← NewGrid[m] + NewGrid[m] − NewGrid[m−1]
    end if
    d ← d − TRANSINVFILTER(s, 0, π1, NewGrid)
    s ← s − TRANSINVFILTER(d, 1, π2, NewGrid)
    d ← d − TRANSINVFILTER(s, 0, π3, NewGrid)
    s ← s − TRANSINVFILTER(d, 1, π4, NewGrid)
  Rescaling the wavelet coefficients:
    d ← d[1:Q] / ϕ
    s ← s[(Q+1):m] ∗ ϕ
  Output[1:Q] ← d[1:Q]
  Output[(Q+1):m] ← s[1:(m−Q)]
  Grid[1:Q] ← OddGrid
  Grid[(Q+1):m] ← EvenGrid
  if Level > 1 then
    Output[1:Q] ← TRANSINVERSETRANSFORM(Output[1:Q], Grid[1:Q], Level−1)
  end if
  return (Output, Grid)
end procedure

In the ensuing section we describe the estimation procedure of the parameter vector of interest, β, using the wavelet-based JPEG estimate of M̂(·) obtained from (29).

F. THE INSTRUMENT VARIABLE ESTIMATOR: TRUNCATED SAMPLE
Denote the truncated data by a sequence of observations {y_i, x_i, w_i, z_i}, i = 1, ..., n, such that each observation is an independent realization of the conditional joint distribution function of the random variables {y, x, w, z}, given that they are selected into the sample. The endogenous variable is denoted by x1 and is included in the vector x. There are two types of joint dependence between the covariate vector and the substantive equation's random disturbance. The first type is intrinsic to the model and is generated by variation in v (the endogenous part of x1), leading to a comovement between x and ξ1. The second type is related to the sample selection and is generated by variation in w, leading to a comovement between the covariate vector x and ξ1. This implies that there are two sources of endogeneity to be taken into consideration: the first is related to the endogenous covariate, while the second is due to the truncation environment of the data. Next we discuss the two-step estimation procedure employed to correct for both the endogeneity bias and the bias propagated by truncation.

G. THE ESTIMATION PROCEDURE
In this section we introduce a two-step estimation procedure to eliminate the two sources of bias discussed. To eliminate the endogeneity bias we adapt an approach similar to the two-step procedure in [44] for partially linear single-index model estimation, in which the first stage is a regression of the endogenous covariate on all the exogenous covariates and the instrumental variable. In the second stage, the endogenous covariate is substituted with the fitted values obtained from the first stage. However, the estimation approach in [44] cannot be implemented in a truncated environment, because it treats the first-stage regression as a linear population regression (as if the entire covariates' distribution function were observed). We alleviate this by modeling both the first- and the second-stage equations as endogenously truncated equations. In order to eliminate the endogenous truncation bias, we control for this source of bias by including the truncation bias term as an additional covariate in the substantive equations, as depicted in (14).
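The two-step correction described above can be sketched in a deliberately simplified linear form. This is only an illustration, not the paper's semiparametric estimator: all names are hypothetical, the truncation-bias control M_hat is assumed to be supplied externally (in the paper it is the wavelet-based JPEG estimate M̂(w)), and plain least squares stands in for the partially linear index minimization.

```python
import numpy as np

def two_stage_iv(y, x_endog, X_exog, z, M_hat):
    """Toy two-step IV with a truncation-bias control column.

    First stage: regress the endogenous covariate on the exogenous
    covariates, the instrument z, and the bias control M_hat.
    Second stage: replace the endogenous covariate by its fitted
    values and regress y on the resulting exogenous design.
    """
    n = len(y)
    # First-stage design: intercept, exogenous covariates, instrument, control.
    D1 = np.column_stack([np.ones(n), X_exog, z, M_hat])
    gamma, *_ = np.linalg.lstsq(D1, x_endog, rcond=None)
    x_hat = D1 @ gamma  # fitted (exogenous) part of the endogenous covariate
    # Second-stage design: fitted values replace the endogenous covariate.
    D2 = np.column_stack([np.ones(n), x_hat, X_exog, M_hat])
    beta, *_ = np.linalg.lstsq(D2, y, rcond=None)
    return beta  # [intercept, coef on x_hat, exogenous coefs, coef on M_hat]
```

With a valid instrument, the coefficient on `x_hat` is freed from the comovement between the endogenous covariate and the substantive disturbance; in the truncated setting the extra `M_hat` column plays the role of the selection-bias control described above.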
Thus, partial linearity applies to both the first- and the second-stage equations. In the first stage, we regress the endogenous covariate on the instrumental and exogenous variables by minimizing the partially linear index model:

(δ̂, θ̂_f) = arg min_{(δ, θ_f) ∈ Θ × ∆_{K_n}}  Σ_{i=1}^{n} ( x_{1i} − [xᵀ_{−1,i}, zᵀ_i] δ − M̂(w_i; θ_f) )²   (51)

In the second stage, the endogenous variable is replaced by its predicted value obtained from the first stage in (51), and we minimize the following function:

(β̂, θ̂_f) = arg min_{(β, θ_f) ∈ Θ × ∆_{K_n}}  Σ_{i=1}^{n} ( y_i − [x̂_{1i}, xᵀ_{−1,i}] β − M̂(w_i; θ_f) )²   (52)

As can be seen in (52), the two sources of endogeneity bias we deal with are: (i) the bias propagated by the endogenous covariate, which is alleviated by utilizing the covariate set [x̂_{1i}, xᵀ_{−1,i}] consisting entirely of exogenous covariates; and (ii) the bias propagated by the endogenous truncation, which is alleviated by controlling for the selection bias term M̂(·). Next we present a Monte Carlo simulation to examine our semiparametric IV estimator's performance in a truncated environment.

V. SIMULATION
In this section, we generate multiple random data sets used to examine our model's performance across different sample sizes. First, we describe the data generation process (DGP).
A. DATA GENERATION PROCESS
Denote the sample size by N ∈ {500, 2000, 3000, 5000, 8000, 10000}. In order not to restrict the data generation process to the family of symmetric unimodal distribution functions, a mixture of distribution functions is utilized to generate each of the selection model's disturbances, which are jointly dependent (as discussed in section V-A1 to follow). In order to verify that our proposed model performs well under different data generating processes (DGP), we construct a data set consisting of 2,000,000 distribution functions, practically generating 100 million realizations which are not i.i.d. (The estimates obtained given the various data distribution functions will be supplied upon request.) By construction, each observation is randomly drawn from a unique mixture of distribution functions.
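The idea of drawing every observation from its own randomly composed mixture, so that the sample is deliberately not i.i.d., can be illustrated as follows. This is a toy sketch: the three-entry menu of candidate distributions is an assumption for illustration only, standing in for the paper's menu of 2,000,000 continuous distribution functions.

```python
import numpy as np

def draw_non_iid_sample(n, seed=0):
    """Each observation is drawn from its own random finite mixture,
    so no two observations share the same marginal distribution."""
    rng = np.random.default_rng(seed)
    # Illustrative menu of candidate distributions (assumed, not the paper's).
    menu = [
        lambda r: r.normal(0.0, 1.0),
        lambda r: r.gamma(2.0, 1.0),
        lambda r: r.uniform(-2.0, 2.0),
    ]
    sample = np.empty(n)
    for i in range(n):
        # Mixture weights are re-drawn per observation.
        w = rng.dirichlet(np.ones(len(menu)))
        k = rng.choice(len(menu), p=w)   # pick a mixture component...
        sample[i] = menu[k](rng)         # ...and draw the datum from it
    return sample
```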
1) The disturbances’ joint distribution function
Each triple of disturbances {ξ_{1i}, ξ_{2i}, v_i} is randomly and independently drawn from F_{ξ1,ξ2,v}, the joint distribution function of the substantive and participation equations' disturbances. This joint density function consists of two components: a Copula function, which characterizes the disturbances' dependence structure, and three marginal distribution functions F_{ξ1}, F_{ξ2} and F_v. In order to verify our model's performance in the presence of random disturbances whose distribution functions are not restricted to the family of symmetric and unimodal distributions, each of the sample selection model's disturbances ξ1 and ξ2 is marginally distributed according to a mixture of three different distribution functions: (i) a normal distribution with expectation and standard deviation parameters (µ, σ_a), denoted N(µ, σ_a); (ii) a normal distribution with parameters (−µ, σ_b), denoted N(−µ, σ_b); (iii) a gamma distribution with parameters (µϕ, ϕ), denoted Gamma(µϕ, ϕ). The mixture distribution function is defined as:

v ∼ N(0, σ_v²)
ξ_j ∼ w1 N(µ, σ_a) + w2 N(−µ, σ_b) + w3 Gamma(µϕ, ϕ),  j = 1, 2,   (53)

where the mixture weights w1, w2, w3 sum to one and E[ξ_j] = 0, E[v] = 0. The parameter set (µ, σ_a, σ_b, ϕ, σ_v), with µ = 4, is arbitrarily chosen. Due to its simplicity, the Clayton Copula (as will be discussed in section V-A2 to follow), with its degree-of-dependence parameter set to assure a mild correlation between the disturbances, is used for controlling the dependence structure. Choosing a mild correlation is important in order to be conservative, examining the potential bias in the parameter estimates under conditions which are not extreme. Next we employ a function characterizing the dependence properties of the Copula [46], referred to as a generator function, to construct the joint dependence of the random disturbances in (53).
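A minimal sketch of this construction follows: Clayton-coupled uniforms are generated by the standard Marshall-Olkin mixing argument for the generator ψ(t) = (1 + t)^(−1/θ), and each uniform is then pushed through an empirical quantile of a three-part mixture as in (53). The mixture weights (0.25, 0.5, 0.25) are an assumption chosen only so that the mixture mean is zero (the paper's exact weights are not reproduced here), and the parameter values are illustrative.

```python
import numpy as np

def clayton_uniforms(n, d, theta, seed=0):
    """Marshall-Olkin sampling of the d-dimensional Clayton copula
    with generator psi(t) = (1 + t)**(-1/theta): U_j = psi(E_j / V),
    V ~ Gamma(1/theta, 1), E_j i.i.d. Exp(1)."""
    rng = np.random.default_rng(seed)
    V = rng.gamma(1.0 / theta, 1.0, size=(n, 1))   # shared mixing variable
    E = rng.exponential(1.0, size=(n, d))          # independent exponentials
    return (1.0 + E / V) ** (-1.0 / theta)

def mixture_quantile(u, rng, mu=4.0, sa=1.0, sb=1.0, phi=2.0, m=100_000):
    """Map uniforms u to a mixture marginal as in (53), via the empirical
    quantile of a large reference sample (weights 0.25/0.5/0.25 assumed)."""
    comp = rng.choice(3, size=m, p=[0.25, 0.5, 0.25])
    ref = np.where(comp == 0, rng.normal(mu, sa, m),
          np.where(comp == 1, rng.normal(-mu, sb, m),
                   rng.gamma(mu * phi, 1.0 / phi, m)))  # mean mu, sd sqrt(mu/phi)
    ref -= ref.mean()            # center so the sample mean is exactly zero
    return np.quantile(ref, u)
```

Composing the two, `mixture_quantile(clayton_uniforms(n, 3, theta)[:, j], rng)` yields marginally mixture-distributed disturbances whose joint dependence is Clayton, mirroring the construction described above.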
2) Archimedean Copula function
An Archimedean Copula is a Copula characterized by a non-increasing, continuous generator function ψ : [0, ∞] → [0, 1], which satisfies ψ(0) = 1, ψ(∞) = 0 and is strictly decreasing on [0, inf{t : ψ(t) = 0}]. In particular, we are interested in the d-dimensional Archimedean Copula family (in the present case d = 3), which has the simple algebraic form [46]:

C(u_1, ..., u_d) = ψ( ψ⁻¹(u_1) + ··· + ψ⁻¹(u_d) ),  (u_1, ..., u_d) ∈ [0, 1]^d,   (54)

where ψ is a specific function known as the generator of C. (Any continuous joint distribution function can be characterized by a set of marginal distribution functions together with a joint distribution function determining the dependence structure, referred to as a Copula function; Sklar's Theorem [45].) The gamma parameters (µϕ, ϕ) imply that the gamma expectation and standard deviation are (µ, √(µ/ϕ)), respectively. To generate the disturbances, the Clayton Copula's generator ψ(t) = (1 + t)^(−1/θ) is chosen.

The covariate vector of the i'th observation is a realization of the random variables [z, x2, w1, w2], which are jointly distributed with their corresponding marginal distribution functions {𝔽ᵢ_z, 𝔽ᵢ_{x2}, 𝔽ᵢ_{w1}, 𝔽ᵢ_{w2}}. The dependence structure is modeled by utilizing a Gaussian Copula, which is a convenient way to generate high-dimensional data. By construction, each datum is generated by utilizing a different sequence of marginal distribution functions (constructed as a finite mixture drawn from a menu of 2,000,000 continuous distribution functions). These random variables have expectation vector µ = [0, 0, 0, 0]ᵀ and an arbitrarily chosen covariance matrix

Σ (4 × 4) = [ σ²_z       σ_{z,x2}   σ_{z,w1}   σ_{z,w2}
              σ_{z,x2}   σ²_{x2}    σ_{x2,w1}  σ_{x2,w2}
              σ_{z,w1}   σ_{x2,w1}  σ²_{w1}    σ_{w1,w2}
              σ_{z,w2}   σ_{x2,w2}  σ_{w1,w2}  σ²_{w2} ].

We generate the data y_{1i}, y_{2i}, x_{1i} according to the following data generation process (DGP) [49]:

DGP:
y*_{1i} = α1 + β1 x_{1i} + β2 x_{2i} + ξ_{1i}
y*_{2i} = α2 + γ1 w_{1i} + γ2 w_{2i} + ξ_{2i}
x*_{1i} = δ1 z_i + δ2 x_{2i} + v_i   (55)

where each i'th element of the sequence {x_{2i}, z_i, w_{1i}, w_{2i}}, i = 1, ..., N, is an independent realization of the random variables (x2, z, w1, w2). We choose the parameter setting [α1, α2, β1, β2, δ1, δ2, γ1, γ2] with, in particular, α1 = 2, β1 = 1, β2 = 1.25, δ1 = 0.5 and δ2 = 1. The truncated data set is characterized by the following equations:

[ y_{1i} ; x_{1i} ] = [ α1 + β1 x_{1i} + β2 x_{2i} + ξ_{1i} ; δ1 z_i + δ2 x_{2i} + v_i ]  if y*_{2i} ≥ 0,
Unobserved  if y*_{2i} < 0,   (56)

where x_{1i} is an endogenous variable included in the vector x_i ∈ R^p, in which all elements (except x_{1i}) are exogenous, and β ∈ R^p is the coefficient vector. The substantive equation's random disturbance is denoted by ξ_{1i}.

B. SIMULATIONS RESULT
We have randomly generated, for each sample size N ∈ {500, 2000, 3000, 5000, 8000, 10000}, multiple data sets using the data generation process elaborated on in V-A. (Here d = 3, representing the three-dimensional vector of random disturbances (v_i, ξ_{1i}, ξ_{2i}). Knowing the distribution corresponding to a generator ψ, [47] presented a sampling algorithm for exchangeable Archimedean copulas which does not require knowledge of the copula density; this algorithm is therefore applicable to large dimensions [48].) For a given number of observations N, four different models are estimated: (i) an OLS estimator utilizing a sample consisting of random realizations from the complete distribution function, without correcting for the endogeneity of the x1 covariate; (ii) a conventional IV estimator, correcting for the endogeneity of the x1 covariate using the aforementioned entire distribution function; (iii) both OLS and conventional IV estimators applied to a truncated portion of the data distribution function consisting of participants only (without correcting for the self-selection bias); and (iv) truncated-sample estimates using the developed wavelet-based JPEG IV estimator, correcting for both truncation and endogeneity biases. Table 1 presents summary statistics of estimates for models (i) and (ii), while Table 2 presents summary statistics for models (iii) and (iv). In Table 3, different convergence measures of these estimates are presented.

TABLE 1: Monte Carlo Simulation - Non-truncated (complete) data set with (without) IV correction

True Parameter | Estimate | Sample size: 500 | 2000 | 3000 | 5000 | 8000 | 10000

Full sample, OLS estimates (without IV correction):
β1 = 1      Mean    2.6848  2.6798  2.6807  2.6810  2.6809  2.6805
            Median  2.6857  2.6815  2.6811  2.6812  2.6819  2.6812
            Std     0.1602  0.1138  0.0858  0.0605  0.0534  0.0507
β2 = 1.25   Mean   -0.7009 -0.6953 -0.6988 -0.6983 -0.6963 -0.6976
            Median -0.7085 -0.6983 -0.6967 -0.6994 -0.6968 -0.6976
            Std     0.2420  0.1743  0.1275  0.0838  0.0710  0.0652

Full sample, OLS estimates (with IV correction):
β1 = 1      Mean    0.9826  0.9995  1.0025  1.0020  0.9986  0.9991
            Median  1.0050  1.0057  1.0083  0.9996  0.9997  0.9998
            Std     0.4397  0.3072  0.2174  0.1393  0.1116  0.1006
β2 = 1.25   Mean    1.2687  1.2493  1.2457  1.2469  1.2510  1.2517
            Median  1.2365  1.2382  1.2438  1.2466  1.2497  1.2506
            Std     0.5406  0.3771  0.2655  0.1692  0.1366  0.1211

Note: The true parameters are those used in the data generation process. We estimate by ordinary least squares (OLS) the parameters for the full sample and for the truncated sample without correction for the selectivity bias. Then we calculate for these estimates the mean, median and standard deviation (Std.) over all data sets; the standard deviations are obtained using the estimates from the Monte Carlo simulations.

Entries in Table 1 indicate that, regardless of sample size, the means of the
OLS estimates are biased: e.g., for a sample size of 10,000 observations, β1 = 2.68 and β2 = -0.70, while the means of the full-sample IV estimates are β1 = 1.00 and β2 = 1.25 for the same sample size. For a sample size of 500 observations, the standard deviation obtained for β1 (the endogenous covariate's coefficient) using the IV estimator is about 2.7 times larger than under OLS, and it decreases from 0.44 to 0.10 when the sample size increases from 500 to 10,000 observations. In Table 2 to follow, the estimates obtained using the truncated data are presented. It is evident that the means of the truncated-sample OLS estimates are β1 = 2.26 and β2 = -0.31 for a sample size of 500 observations, whereas the true parameter values are β1 = 1 and β2 = 1.25, respectively. This is a huge bias, which is hardly improved as the sample size increases. Further, applying a conventional IV produces estimates which still carry a huge bias, particularly β1 = 0.11 and β2 = 2.05 for the same sample size (500 observations). (For the sake of brevity, we delegate results of the first stage to Appendix B.)

TABLE 2: Monte Carlo Simulation - Truncated data set with (without) truncation bias correction
True Parameter | Estimate | Sample size: 500 | 2000 | 3000 | 5000 | 8000 | 10000

Truncated sample, OLS estimates (without truncation bias correction):
β1 = 1      Mean    2.2600  2.2493  2.2504  2.2485  2.2488  2.2492
            Median  2.2711  2.2514  2.2534  2.2510  2.2484  2.2493
            Std     0.2676  0.1878  0.1362  0.0865  0.0723  0.0675
β2 = 1.25   Mean   -0.3084 -0.2926 -0.2974 -0.2955 -0.2939 -0.2940
            Median -0.3170 -0.2884 -0.3048 -0.2974 -0.2944 -0.2942
            Std     0.3693  0.2647  0.1909  0.1162  0.0940  0.0851

Truncated sample, OLS estimates with IV correction (without truncation bias correction):
β1 = 1      Mean    0.1084  0.1805  0.1942  0.1910  0.1850  0.1847
            Median  0.1410  0.2095  0.2074  0.2003  0.1901  0.1876
            Std     0.7595  0.5288  0.3588  0.2235  0.1777  0.1634
β2 = 1.25   Mean    2.0544  2.0148  1.9973  2.0018  2.0083  2.0088
            Median  2.0022  1.9798  1.9833  1.9968  2.0058  2.0082
            Std     0.8910  0.6219  0.4212  0.2601  0.2055  0.1878

Truncated sample, JPEG IV estimates:
β1 = 1      Mean    0.9455  0.9875  1.0058  1.0010  1.0004  1.0000
            Median  0.9717  1.0085  1.0079  1.0047  1.0013  1.0009
            Std     0.7325  0.3641  0.2972  0.2178  0.1732  0.1577
β2 = 1.25   Mean    1.3174  1.2830  1.2617  1.2465  1.2506  1.2499
            Median  1.2945  1.2623  1.2579  1.2456  1.2518  1.2469
            Std     0.8145  0.4081  0.3324  0.2433  0.1933  0.1754

Note: The true parameters are those used in the data generation process. We estimate the parameters for the truncated sample without correction for the selectivity bias by employing the OLS and the (conventional) IV methods. Additionally, we estimate the parameters using the same truncated sample by employing our JPEG IV estimator, correcting for both truncation and endogeneity bias. For brevity, we introduce here only the second-stage results and delegate the first-stage results to Table 4 in Appendix B. In each of the estimation procedures (OLS, IV and JPEG IV), we calculate for these estimates the mean, median and standard deviation (Std.) over all data sets.
Entries in Table 2 indicate that, regardless of sample size, the means of the truncated-sample IV estimates are severely biased (ranging from one-tenth to one-fifth of the estimate that would have emerged by employing the conventional IV method in the absence of truncation). Note that the estimates' accuracy is hardly improved as the sample size increases; this is due to the presence of two sources of bias. The mean estimate of β1 (the endogenous covariate's parameter) obtained from implementing our proposed methodology basically mimics the results obtained using a random sample from the entire data distribution function for sample sizes above 2,000 observations. The standard deviations of this estimate for sample sizes of 500 and 10,000 observations are 0.73 and 0.16, respectively. For a sample size of 5,000 observations (or above), the mean estimate of β2 (the exogenous covariate's parameter) approximates the estimate obtained by employing the conventional IV using a random sample from the entire data distribution function. However, the estimate of β2 obtained by employing the conventional IV in a truncated sample is biased even for 10,000 observations. We conduct a sensitivity test to measure the influence of an increase in the number of observations on the accuracy of the truncated sample's parameter estimates. The first accuracy measure we use is the standardized root mean square error, RMSE_j, measuring the bias in the truncated regression estimate relative to the true parameter value that would have been obtained in a non-truncated distribution, defined as:

RMSE_j(Ω) = [ (1/Ω) Σ_{i=1}^{Ω} ( (β̂ˢ_{i,j} − βˢ_j) / βˢ_j )² ]^(1/2),   (57)

(For the sake of brevity, we have omitted the estimates of the nuisance parameters, which can be furnished upon request, as well as the parameter estimates of the first stage, which are delegated to Table 4 in Appendix B.)
where β̂ˢ_{i,j} and βˢ_j stand for the substantive (s) equation's j'th coefficient estimated in the i'th sample and the coefficient in the theoretical model that would have been obtained in the entire population, respectively, and Ω is the number of data sets generated for the Monte Carlo simulations (each consisting of N observations). Another measure is based on a formula similar to (57) and is intended to gauge the relative accuracy of the truncated sample's estimates in comparison to the full-sample estimates, defined as:

R_j(Ω) = [ (1/Ω) Σ_{i=1}^{Ω} ( (β̂ᵗˢ_{i,j} − β̂ˢ_{i,j}) / β̂ˢ_{i,j} )² ]^(1/2),   (58)

where β̂ᵗˢ_{i,j} and β̂ˢ_{i,j} stand for the substantive (s) equation's j'th coefficient estimated using the truncated (t) sample and the full sample, respectively. This measure evaluates the model's relative performance in the truncated sample with respect to the conventional IV using the full sample. The last accuracy measure is the δ coefficient used for calculating the estimators' standard-deviation convergence rate n^δ with respect to the sample size. It depicts the speed of the standard deviation's shrinkage resulting from increasing the sample size. This coefficient is calculated based on the following ratio:

δ = ln(σ1/σ2) / ln(n2/n1),   (59)

where σ1 and σ2 are the estimate's standard deviations calculated for data sets with n1 and n2 observations, respectively (calculated for a given estimate).

TABLE 3: Monte Carlo Simulation - Convergence measures

Model's estimates | Parameter | Number of observations: 500 | 2000 | 3000 | 5000 | 8000 | 10000
RMSE measure
  Full sample, OLS estimates - IV correction:       β1, β2
  Truncated sample, OLS estimates - IV correction:  β1, β2
  Truncated sample, JPEG IV estimates:              β1, β2

R_j(n) measure - relative to full sample IV
  Truncated sample, OLS estimates - IV correction:  β1, β2
  Truncated sample, JPEG IV estimates:              β1, β2

δ consistency measure (n^δ ≡ the convergence rate)
  Truncated sample, OLS estimates - IV correction:  β1    -      0.7627  0.3538  0.2949  0.2254  0.1793
  Truncated sample, JPEG IV estimates:              β1    -      0.5040  0.5017  0.6072  0.4875  0.4201

Note: The parameters are those used in the data generation process. We examine three different measures for the parameters presented in Table 3. First, the standardized root mean square error (RMSE) between each model's estimates and the true parameters is calculated based on equation (57). Second, the R_j measure is calculated based on equation (58). Third, the convergence rate, measured by n^δ, is calculated for both the conventional IV and the JPEG IV estimates; this convergence-rate measure implies that multiplying the sample size by 2 shrinks the estimators' standard deviations by a factor of 2^δ.

Table 3 entries indicate that the root mean square error (RMSE) measure of the estimates obtained by employing the conventional IV estimator, using a random sample from the entire data distribution function, gets smaller as the sample size increases, as can be expected. However, applying the same procedure to the truncated data set leads to RMSE measures which are in the range of 2- to 8-fold larger, for sample sizes of 500 to 10,000 observations, respectively. This is indeed a huge bias generated by the conventional IV, which is not immune to truncation bias. Additionally, the RMSE measures show negligible improvement with the number of observations for the conventional IV, whereas there is a huge improvement in the RMSE with the number of observations for the JPEG IV estimator provided by our model. The proximity between the JPEG IV and the full-sample IV estimates increases with the sample size, as reflected by the R_j proximity measure. Using the same measure, we find a much smaller improvement in the proximity between the truncated-sample conventional IV and the full-sample IV, relative to the improvement in the proximity between the JPEG IV and the full-sample IV estimates. It is evident that the JPEG IV is a √n-consistent estimator, as depicted by the δ consistency measure, which is about 0.5, implying that multiplying the sample size by 2 shrinks the estimator's standard deviation by a factor of 2^0.5 = √2. It is also evident that the truncated-data conventional IV functions poorly in terms of consistency, as shown by the entries in Table 3.

VI. CONCLUSION
We provide an analytical proof showing that in endogenously truncated data the conventional IV estimator does not perform the task it was intended to; rather, it introduces an additional unintended bias into the parameter estimates of the substantive equation. The instrumental variable is itself endogenous in the context of endogenously truncated data, due to a comovement between the instrumental variable and the substantive equation's random disturbance, generated by mediating covariates. We offer a truncation-proof JPEG IV, shown to be a proper estimator under endogenous truncation. Monte Carlo simulations attest to the JPEG IV estimator's high accuracy and its √n consistency. These results have been verified by utilizing 2,000,000 different distribution functions (not restricted to the unimodal symmetric family), generating 100 million realizations to construct the covariates' data sets, which are not imposed to be i.i.d. The various distribution functions attest to a very high accuracy of the model, as depicted by the parameter estimates that closely mimic the true parameters.

APPENDIX. APPENDICES
A. AN ELEMENT-WISE THRESHOLDING
In this appendix we show that, for any γ ∈ (1, ∞), S_α(·) in (26) is the solution to the minimax concave penalty problem in (24) for all α ∈ (1/γ, ∞).

Proof.
Let δ̃ ≡ δ^(iter) + (1/(αn)) Ψᵀ_I (u − Ψ_I δ^(iter)) and

δ = arg min_δ  (α/2) ‖δ − δ̃‖² + P_{λ_j,γ_j}(δ).   (60)

As P_{λ_j,γ_j}(δ) is a separable function of δ, the minimization problem can be implemented in an element-wise fashion. Let δ_k be the solution to the following univariate regularized least squares problem:

δ_k = { arg min_δ (α/2)(δ − δ̃_k)² + λ|δ| − δ²/(2γ)   if |δ_k| ≤ λγ
      { arg min_δ (α/2)(δ − δ̃_k)² + λ²γ/2            if |δ_k| > λγ   (61)

We obtain the first order condition of (61) with respect to δ:

δ_k = { (1 − 1/(αγ))⁻¹ ( δ̃_k − sign(δ_k) λ/α )   if |δ_k| ≤ λγ
      { δ̃_k                                      if |δ_k| > λγ   (62)

Note that if |δ̃_k| > λγ and |δ_k| ≤ λγ, it implies that either sign(δ_k) = −1 and −λγ ≤ δ̃_k ≤ λγ − 2λ/α, or sign(δ_k) = 1 and −λγ + 2λ/α ≤ δ̃_k ≤ λγ. Both cases contradict the fact that |δ̃_k| > λγ. Similarly, if |δ̃_k| ≤ λγ and |δ_k| > λγ, it contradicts the fact that δ_k = δ̃_k, which follows directly from (62). Consequently, |δ̃_k| > λγ ⇔ |δ_k| > λγ. We get:

δ_k = { (δ̃_k − λ/α) / (1 − 1/(αγ))   if 0 < (δ̃_k − λ/α) / (1 − 1/(αγ)) ≤ λγ
      { (δ̃_k + λ/α) / (1 − 1/(αγ))   if −λγ ≤ (δ̃_k + λ/α) / (1 − 1/(αγ)) < 0   (63)

and, provided that α ∈ (1/γ, ∞), (63) simplifies to:

δ_k = { (δ̃_k − λ/α) / (1 − 1/(αγ))   if λ/α < δ̃_k ≤ λγ
      { (δ̃_k + λ/α) / (1 − 1/(αγ))   if −λγ ≤ δ̃_k < −λ/α   (64)

After some algebraic manipulation we obtain:

δ_k = { (1 − 1/(αγ))⁻¹ sign(δ̃_k)( |δ̃_k| − λ/α )   if λ/α < |δ̃_k| ≤ λγ
      { δ̃_k                                        if |δ̃_k| > λγ   (65)  □

Next we show how to obtain an analytic representation of the transpose of the wavelet inverse transform.
For doing so, we base our arguments on the equivalence between the matrix-product and the lifting-scheme representations of the wavelet transform. Using this equivalence, we generate a filter that renders the costly construction of matrices unnecessary; our objective is to reduce computational complexity. Next we present the first stage estimates.
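The element-wise thresholding operator derived in Appendix A can be sketched as follows. This is a minimal NumPy version under the assumption that the zero branch covers |δ̃_k| ≤ λ/α (the region below the first case of (65)); it requires α ∈ (1/γ, ∞) so that 1 − 1/(αγ) > 0.

```python
import numpy as np

def mcp_threshold(delta_tilde, lam, alpha, gamma):
    """Element-wise MCP thresholding: leaves large entries untouched
    (|d| > lam*gamma), zeroes small ones (|d| <= lam/alpha), and
    soft-thresholds with rescaling in between."""
    d = np.asarray(delta_tilde, dtype=float)
    scale = 1.0 - 1.0 / (alpha * gamma)   # positive when alpha > 1/gamma
    return np.where(
        np.abs(d) > lam * gamma,
        d,                                                          # untouched region
        np.sign(d) * np.maximum(np.abs(d) - lam / alpha, 0.0) / scale,
    )
```

Note that the two branches agree at |δ̃_k| = λγ, so the rule is continuous, which is the usual motivation for the minimax concave penalty over the hard-thresholding rule.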
B. FIRST STAGE ESTIMATES
TABLE 4: Monte Carlo Simulation - Truncated data set with truncation bias correction
True Parameter | Estimate | Sample size: 500 | 2000 | 3000 | 5000 | 8000 | 10000

Truncated sample, JPEG IV estimates (first stage):
δ1 = 0.5   Mean    0.4944  0.4975  0.4973  0.4996  0.5000  0.4999
           Median  0.4951  0.4963  0.4971  0.4999  0.4998  0.5000
           Std     0.1236  0.0675  0.0514  0.0461  0.0352  0.0298
δ2 = 1     Mean    0.9976  0.9983  0.9994  1.0001  1.0005  0.9999
           Median  0.9977  0.9989  0.9989  1.0001  1.0006  1.0003
           Std     0.0902  0.0444  0.0360  0.0279  0.0221  0.0197

Note: The true parameters are those used in the data generation process. We estimate the parameters using the truncated sample by employing our JPEG IV estimator, correcting for both truncation and endogeneity bias. We calculate for these estimates the mean, median and standard deviation (Std.) over all data sets; the standard deviations are obtained using the estimates from the Monte Carlo simulations.

Algorithm 5 The wavelet-based JPEG penalized regression

procedure PENALIZEDLINEARREGRESSION(u, Grid, Level, λ, γ)
Require: (i) Two real-number vectors u and Grid of size n × 1; (ii) Level ∈ {1, ..., log2(n)}; (iii) λ ∈ (0, ∞) and γ ∈ (1, ∞).
Ensure: Output ← a real-number coefficient vector δ̂ of size n × 1.
  n ← length of u
top:
  δold ← FORWARDTRANSFORM(u, Grid, Level−1)
  ûold ← u
  residualold ← u − ûold
  Objold ← (1/(2n)) Σ_{i=1}^{n} residualold[i]² + Σ_{i=1}^{n} MCP.PENALTY(|δold[i]|, λ, α, γ)
  vold ← TRANSINVERSETRANSFORM(residualold, Grid, Level−1)
  flag ← TRUE; gap ← infinite; tolerance ← a small positive constant; maxiter ← a maximal number of iterations; η ← a step-size inflation factor
  while flag = TRUE and gap > tolerance do
    α ← an initial step-size constant; Iter ← 1
    repeat
      Update ← δold + (1/(αn)) vold
      δnew ← S_α(Update, λ, α, γ)
      ûnew ← INVERSETRANSFORM(δnew, Grid, Level−1)
      residualnew ← u − ûnew
      Objnew ← (1/(2n)) Σ_{i=1}^{n} residualnew[i]² + Σ_{i=1}^{n} MCP.PENALTY(|δnew[i]|, λ, α, γ)
      α ← αη
      flag ← Objnew < Objold
    until flag = TRUE or Iter > maxiter
    if flag = TRUE then
      δold ← δnew; Objold ← Objnew; residualold ← residualnew
      vold ← TRANSINVERSETRANSFORM(residualnew, Grid, Level−1)
    end if
    gap ← ‖δnew − δold‖ / ‖δold‖
    Iter ← Iter + 1
  end while
  return (δold)
end procedure

C. ALGORITHMS
Algorithm 6
The Transposed-Inverse filter

procedure TRANSINVFILTER(Series, Even, π, Grid)
Require: (i) Two real-number vectors: Series of size n × 1 and Grid of size n × 1; (ii) two scalars: Even ∈ {0, 1} and π ∈ R.
Ensure: Output ← Filter, a real-number vector of size n × 1 consisting of the predicted series.
  n ← length of Series
  O ← copy the odd elements of Grid
  E ← copy the even elements of Grid
  if Even = 0 then
    Low ← O[1:(n−1)]
    High ← O[2:n]
    weights ← (High − E[1:(n−1)]) / (High − Low)
    ϖl ← [0, 1 − weights]
    ϖh ← [weights, 1]
    S ← [Series[1], Series]
  else
    Low ← E[1:(n−1)]
    High ← E[2:n]
    weights ← (High − O[2:n]) / (High − Low)
    ϖl ← [1, 1 − weights]
    ϖh ← [weights, 0]
    S ← [Series, Series[n]]
  end if
  Filter ← π ∗ ϖl ∗ S[1:n] + π ∗ ϖh ∗ S[2:(n+1)]
  return (Filter)
end procedure

ACKNOWLEDGMENT
We would like to thank Boaz Nadler, Yaniv Tenzer and seminar participants at the Weizmann Institute of Science and Tel-Aviv University for very constructive comments.
Dr. NIR BILLFELD is a researcher at the University of Haifa, Israel. He received the B.A. in economics and statistics from the University of Haifa (2006), the M.A. in economics from Tel-Aviv University (2010), and the Ph.D. from the University of Haifa (2019). His work has appeared in IEEE.
Prof. MOSHE KIM is professor of economics at the University of Haifa, Israel. He is the founder and former director of the Barcelona Banking Summer School at the Universitat Pompeu Fabra, Barcelona, former director of the endowed chair of banking at Humboldt University of Berlin, Senior Distinguished Fellow at the Swedish School of Economics in Helsinki (Hanken), institute research professor at the German Institute for Economic Research (DIW), consultant at the Central Bank of Norway, and recently visited NYU Shanghai. He was recently declared a high-end foreign expert by the Chinese foreign ministry and is a recent recipient of the Outstanding Tutor Award from the Chinese Ministry of Education. He holds a Ph.D. from the University of Toronto. Kim's research interests are econometrics, banking, financial markets, and industrial organization. His books include Microeconometrics of Banking: Methods, Applications, and Results (Oxford University Press, 2009). His work has appeared in IEEE, the Journal of Finance, the Journal of Monetary Economics, the Journal of Financial Intermediation, the Journal of Business and Economic Statistics, the Journal of Money, Credit and Banking, the International Economic Review, the Journal of Accounting and Economics, the Journal of Public Economics, the Journal of Urban Economics, the Journal of Law and Economics, the Journal of Banking and Finance, the Journal of Industrial Economics, and the International Journal of Industrial Organization.