An Augmented Regression Model for Tensors with Missing Values
Feng Wang, Mostafa Reisi Gahrooei, Zhen Zhong, Tao Tang, Jianjun Shi
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract—Heterogeneous but complementary sources of data provide an unprecedented opportunity for developing accurate statistical models of systems. Although existing methods have shown promising results, they are mostly applicable to situations where the system output is measured in its complete form. In reality, however, it may not be feasible to obtain the complete output measurement of a system, which results in observations that contain missing values. This paper introduces a general framework that integrates tensor regression with tensor completion and proposes an efficient optimization framework that alternates between two steps for parameter estimation. Through multiple simulations and a case study, we evaluate the performance of the proposed method. The results indicate the superiority of the proposed method over a benchmark.
Note to Practitioners—The proposed method aims to obtain an accurate estimate of the regression model when certain entries of the response are inaccessible. By considering both the information from multiple inputs and the structure of the response, the proposed method achieves a more accurate estimation of the output tensor. To apply the method in practice, two assumptions should hold. First, the response tensor should be low-rank, meaning that the number of its underlying variation patterns is much smaller than its dimensions. Second, the relationship between the input tensors and the response should be linear or approximately linear. The presented method uses tensor decomposition techniques to exploit the correlation structures of the high-dimensional data and prevent overfitting. Another benefit of the integrated framework is that the rank of the response tensor is estimated automatically as the algorithm iterates, and the estimate can be used directly in the parameter estimation.
Index Terms—Tensor regression, missing values, Tucker decomposition, ALS-ADMM.

(*Corresponding author: Mostafa Reisi Gahrooei.) F. Wang and T. Tang are with the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]). M. R. Gahrooei is with the Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA (e-mail: [email protected]).

NOMENCLATURE

Lower- or upper-case letter: a scalar, e.g., $a$ or $A$.
Lower- or upper-case boldface letter: a vector or a matrix, e.g., $\mathbf{w}$ or $\mathbf{W}$.
Euler script letter: a tensor, e.g., $\mathcal{Y}$.
$\mathcal{X}_j$: input tensor $j$ of the regression model, $\mathcal{X}_j \in \mathbb{R}^{I \times P_{j1} \times \cdots \times P_{jm_j}}$, with sample dimension $I$ and other dimensions $P_{jk}$, $k = 1, \cdots, m_j$.
$\mathcal{Y}$: output tensor of the regression model, $\mathcal{Y} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$.
$\mathcal{B}_j$: coefficient tensor $j$ of the regression model with respect to $\mathcal{X}_j$.
$\mathcal{C}_j$: core tensor of the Tucker decomposition of $\mathcal{B}_j$.
$Y_{(n)}$: mode-$n$ matricization of tensor $\mathcal{Y}$.
$\Omega$: index set defining the missingness of the response.
$\mathcal{P}_\Omega$: orthogonal projection based on the index set $\Omega$.
rank$(\cdot)$: rank function of a tensor.
$\tilde{\mathcal{Y}}$: response with missing values.
$\mathbf{U}_{jk}$: bases that span the space of input tensor $j$.
$\mathbf{V}_l$: bases that span the space of the response.
$\mathcal{M}_i$: the $i$th local copy of tensor $\mathcal{Y}$.
$M_{(i)}$: mode-$i$ matricization of $\mathcal{M}_i$.
$\Lambda_i$: dual variable.
$\lambda$: tuning parameter.

I. INTRODUCTION
The rapid development of sensing and computing technology has accelerated the collection of heterogeneous sets of data, which may include a combination of scalars and high-dimensional (HD) data points such as profiles, images, and point clouds. Many applications, including multistage manufacturing [1], aircraft prognostics [2], and medical
Z. Zhong and J. Shi are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA (e-mail: [email protected]; [email protected]).
The authors of [10] extended the PLS approach to tensor data by using Tucker decomposition. Li et al. [11] proposed a generalized linear model with input tensors based on Tucker decomposition. Yan et al. [12] proposed to link structured point clouds to scalar process variables with tensor regression. To consider a more general case, Lock [13] proposed a regression model that estimates an output tensor based on an input tensor (tensor-on-tensor regression, TOT), using a CP decomposition imposed on the tensors of model parameters. However, the TOT method is limited by the inherent limitations of CP decomposition and can only include a single input tensor. Recently, Gahrooei et al. [14] extended TOT to multiple tensor-on-tensor regression (MTOT) by leveraging Tucker operations. Although these existing approaches provide effective ways to model processes using HD data, they assume the available output tensor is structured and complete. In many applications, however, these assumptions are not valid, and the measurements may contain missing values for reasons such as sensor failure or the excessive cost of full measurement. For example, in the lithography process of semiconductor manufacturing, overlay (OV) errors (i.e., the misalignment between different layers), which are highly dependent on the lithography machine settings (e.g., the alignment of the lens and the location of the wafer stage), are only measured at a limited number of marked locations on the wafer (see Fig. 1). Therefore, obtaining a full picture of OV errors requires developing models that link the overlays to machine settings based on partially observed OV errors on wafers.
Another example is the battery system of the Tesla Model S, with more than 7000 cells [15], where a limited number of sensors monitors the temperature of only a few of these cells. As a result, the temperature data obtained from these batteries are structured but contain missing observations. To effectively monitor the battery condition, a modeling framework is essential that uses these data to estimate the temperature of all cells based on the observed measurements and on covariates such as the speed and the temperature of the coolant flow in the battery pack. In these situations, where the output contains missing values, an effective estimation of the model parameters is challenging because of the absence of several observations in the high-dimensional output, which renders the exploitation of its complex correlation structure even more difficult. A naïve approach to addressing this challenge is to first complete the HD outputs, using matrix [16] or tensor completion approaches [17], and then construct a prediction model based on the completed data using existing HD regression methods (e.g., [14]). Many methods have been developed for the tensor completion problem, which attempt to model the relationship between the known entries and the missing values of a matrix or a tensor. These methods rely on a low-rank assumption and either design a decomposition of the tensor [18], [19] or perform rank minimization [17], [20], as summarized in Table II. Nevertheless, these approaches do not consider other potentially relevant and available data when performing the tensor completion task. In many applications, however, the

TABLE I. REGRESSION METHODS
Methods | Literature | Characteristics
Principal component regression (PCR) or partial least squares (PLS) | [4], [5] | Fails to exploit the ordering or spatial structure of profiles or images.
Functional regression | [6], [7] | Mostly on profile data. Difficult to extend to higher dimensions.
Tensor regression (TOT, MTOT) | [12], [13], [14] | Tensor inputs and tensor output. Assumes complete data.

TABLE II. TENSOR COMPLETION METHODS

Methods | Literature | Characteristics
Decomposition-based methods | CP [18], Tucker [19] decompositions | Use tensor decomposition methods. Rank must be specified.
Rank-minimization-based methods | CP [17], Tucker [20] rank | Automatic rank estimation.
Fig. 1. Illustration of overlay measurements in the lithography process.

$\mathcal{X}_j$, $j = 1, \cdots, m$ represent the available inputs, which are complete, and $\tilde{\mathcal{Y}}$ denotes an incomplete output of a process. Note that in situations where $\mathcal{X}_j$, $j = 1, \cdots, m$ contain missing values, one can complete them using tensor completion approaches, as they are independent variables. The goal is to estimate the model parameters $\mathcal{B}_j$, given the inputs and the incomplete output. After learning the model parameters, a complete output can be estimated for a given set of new inputs. As shown in Fig. 2, our procedure completes the output and learns the model parameters iteratively. Specifically, the completion of the response $\mathcal{Y}$ takes advantage of both the information in the inputs, via the estimated parameters $\mathcal{B}_j$, and the structural information in the response. By considering the information from the inputs, the completion of the response is expected to be more accurate than a completion procedure that uses only the structural information of $\mathcal{Y}$. Furthermore, a better completion of the response $\mathcal{Y}$ also benefits the estimation of the model parameters. Therefore, all available information from both the inputs and the output is fully and iteratively exploited, resulting in more accurate model estimation and improved prediction performance. Mathematically, the model parameters are learned by minimizing an integrated objective function that includes a mean-square-error term and a tensor-rank penalty on the response $\mathcal{Y}$ for global exploitation of the relationship between the observed and missing entries. To avoid overfitting, a decomposition is applied to the model parameters $\mathcal{B}_j$ by considering the correlation structures within the input and output spaces.
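As a toy illustration of this alternating idea, the following sketch treats a matrix-variate special case (one vector input per sample and a matrix output) and substitutes a simple model-based imputation for the nuclear-norm completion step of the paper; all sizes, the noise level, and the 70% observation rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
I, P, Q = 30, 5, 8                                   # hypothetical sizes
X = rng.standard_normal((I, P))                      # complete input
B_true = rng.standard_normal((P, Q))
Y_full = X @ B_true + 0.01 * rng.standard_normal((I, Q))
Omega = rng.random((I, Q)) < 0.7                     # observed-entry mask

B = np.zeros((P, Q))
for _ in range(50):
    # Completion step (simplified): keep observed entries, impute the rest
    # with the current model prediction. The paper instead solves a
    # nuclear-norm-penalized completion problem here.
    Y = np.where(Omega, Y_full, X @ B)
    # Estimation step: refit the coefficients on the completed output.
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)

rel_err = np.linalg.norm(B - B_true) / np.linalg.norm(B_true)
```

Each pass first fills the unobserved entries with the current prediction and then refits the coefficients by least squares, mirroring the complete-then-estimate loop of Fig. 2.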
The optimization problem is then solved through a novel and efficient algorithm with two block coordinate descent (BCD) steps. The proposed optimization framework iteratively completes the output tensor and learns the model parameters until convergence. The rest of the article is organized as follows. In Section II, we propose an integrated framework for augmented tensor regression with missing values and elaborate on the algorithms for estimating the missing values of the response as well as the model parameters. Section III carries out two simulation studies: the first implements a curve-on-curve simulation with missing entries in the profile output; the second generates images and profiles for the inputs and the output. Based on these two simulations, the proposed method is evaluated in comparison to a benchmark method. In Section IV, we conduct a case study to estimate the incomplete overlay errors in the semiconductor lithography process. Finally, we conclude the paper in Section V.

II. FORMULATION OF TENSOR REGRESSION WITH INCOMPLETE RESPONSE
In this section, we propose an approach that augments tensor-on-tensor regression with tensor completion for a more accurate estimation of the model parameters and of the high-dimensional process output with missing values. First, we introduce the notation and concepts of tensor algebra used in this paper.

A. Tensor Notation and Multilinear Algebra
In this paper, we use Euler script letters to denote tensors. For instance, $\mathcal{X} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N}$ denotes a tensor of order $N$, where $P_n$ indicates the dimension of mode $n$ of the tensor $\mathcal{X}$. The mode-$n$ matricization and the vectorization of a tensor $\mathcal{X}$ are denoted by $X_{(n)} \in \mathbb{R}^{P_n \times P_{-n}}$ (with $P_{-n} = P_1 P_2 \cdots P_{n-1} P_{n+1} \cdots P_N$) and $\mathrm{vec}(\mathcal{X})$, respectively. A tensor $\mathcal{X}$ can be obtained by folding any of its matricizations, which is denoted as $\mathcal{X} = \mathrm{fold}(X_{(n)})$, $n \in \{1, \ldots, N\}$. The mode-$n$ product of a tensor $\mathcal{X}$ with a matrix $\mathbf{U} \in \mathbb{R}^{K \times P_n}$ is denoted by $\mathcal{X} \times_n \mathbf{U} \in \mathbb{R}^{P_1 \times \cdots \times P_{n-1} \times K \times P_{n+1} \times \cdots \times P_N}$, where the $(p_1 \cdots p_{n-1}\, k\, p_{n+1} \cdots p_N)$ entry of the product is given by $(\mathcal{X} \times_n \mathbf{U})_{p_1 \cdots p_{n-1} k p_{n+1} \cdots p_N} = \sum_{p_n=1}^{P_n} x_{p_1 \cdots p_N}\, u_{k p_n}$. The Frobenius norm of a tensor $\mathcal{X}$ is defined as the Frobenius norm of its matricization along any mode, e.g., $\|\mathcal{X}\|_F^2 = \|X_{(n)}\|_F^2$. The contraction product of two tensors $\mathcal{B} \in \mathbb{R}^{P_1 \times \cdots \times P_N \times Q_1 \times \cdots \times Q_M}$ and $\mathcal{X} \in \mathbb{R}^{P_1 \times \cdots \times P_N}$ is denoted by $\mathcal{X} * \mathcal{B} \in \mathbb{R}^{Q_1 \times \cdots \times Q_M}$, with $(\mathcal{X} * \mathcal{B})_{q_1 \cdots q_M} = \sum_{p_1 \cdots p_N} \mathcal{X}_{p_1 \cdots p_N} \mathcal{B}_{p_1 \cdots p_N q_1 \cdots q_M}$. We use $\|\mathbf{X}\|_* = \sum_i \sigma_i(\mathbf{X})$ to denote the nuclear norm of the matrix $\mathbf{X}$, where $\sigma_i(\mathbf{X})$ is the $i$th largest singular value of the matrix. The nuclear norm of a tensor $\mathcal{X}$, denoted by $\|\mathcal{X}\|_*$, is defined as the weighted average of the nuclear norms of its matricizations along each mode [17], i.e., $\|\mathcal{X}\|_* = \sum_{n=1}^{N} \alpha_n \|X_{(n)}\|_*$, where $\alpha_n > 0$ and $\sum_n \alpha_n = 1$. We let $\Omega$ denote an index set, and $\mathcal{P}_\Omega(\mathcal{X})$ represent the orthogonal projection that keeps the entries of $\mathcal{X}$ with indices in $\Omega$ and sets the entries with indices outside $\Omega$ to 0. The Tucker rank of a tensor $\mathcal{X}$ is defined as the combination of the column ranks of its matricizations; that is, $\mathrm{rank}(\mathcal{X}) = (r_1, r_2, \cdots, r_N)$, where $r_n$ is the column rank of $X_{(n)}$ [21].

Fig. 2. Proposed framework of the augmented regression modeling with an incomplete response.

B. Problem Formulation
Let $I$ denote the number of available samples in the training set, $\mathcal{Y}_i \in \mathbb{R}^{Q_1 \times Q_2 \times \cdots \times Q_M}$ ($i = 1, \cdots, I$) be the response tensor of the $i$th sample, which contains missing values, and $\mathcal{X}_{ji} \in \mathbb{R}^{P_{j1} \times P_{j2} \times \cdots \times P_{jm_j}}$ ($j = 1, \cdots, m$; $i = 1, \cdots, I$) represent the $j$th input tensor (predictor) of the $i$th sample. Let us denote by $\Omega_i$ the set of indices at which the entries of $\mathcal{Y}_i$ are observed. Then, we characterize the relationship between the inputs and the output by

$\mathcal{P}_{\Omega_i}(\mathcal{Y}_i) = \mathcal{P}_{\Omega_i}\left(\sum_{j=1}^{m} \mathcal{X}_{ji} * \mathcal{B}_j\right) + \mathcal{P}_{\Omega_i}(\mathcal{E}_i), \quad i = 1, \cdots, I,$ (1)

where $\mathcal{B}_j \in \mathbb{R}^{P_{j1} \times P_{j2} \times \cdots \times P_{jm_j} \times Q_1 \times Q_2 \times \cdots \times Q_M}$ is the tensor of coefficients related to input $j$ and $\mathcal{E}_i$ is a tensor of errors whose entries follow a random process. For a more compact representation, we can fold the tensors $\mathcal{Y}_i$, $\mathcal{X}_{ji}$, and $\mathcal{E}_i$ ($i = 1, \cdots, I$) along the sample mode to obtain tensors $\mathcal{Y} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$, $\mathcal{X}_j \in \mathbb{R}^{I \times P_{j1} \times P_{j2} \times \cdots \times P_{jm_j}}$ ($j = 1, \cdots, m$), and $\mathcal{E} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$. Then, the model can be written as

$\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega\left(\sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right) + \mathcal{P}_\Omega(\mathcal{E}),$ (2)

where $\Omega = \{(i, q_1, \cdots, q_M)\}$ is the set of indices at which the entries of $\mathcal{Y}$ are observed. Upon estimation of $\mathcal{B}_j$, the complete output can be estimated by $\sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j$, as illustrated in Fig. 3. Because the output contains missing values, existing approaches that assume complete tensors are not applicable for estimating the parameters. Fortunately, due to the potential correlation structure within high-dimensional data points, including profiles and images, the output tensor generally has a low-rank structure. This low-rank structure is key in designing an approach for estimating the model parameters in the presence of an incomplete output tensor.
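As a sketch of model (2) in code, the following uses a single input tensor and a random observation mask; the contraction product is written as an einsum over the non-sample modes, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
I, P1, P2, Q1, Q2 = 5, 4, 3, 6, 7                  # hypothetical sizes
X = rng.standard_normal((I, P1, P2))                # a complete input tensor
B = rng.standard_normal((P1, P2, Q1, Q2))           # coefficient tensor B_j
E = 0.1 * rng.standard_normal((I, Q1, Q2))          # error tensor
Y = np.einsum('ipq,pqab->iab', X, B) + E            # model (2) with one input
Omega = rng.random(Y.shape) < 0.8                   # ~80% of entries observed
P_Omega_Y = np.where(Omega, Y, 0.0)                 # orthogonal projection P_Omega
```

The mask `Omega` plays the role of the index set $\Omega$: the projection keeps observed entries and zeroes out the rest, which is exactly the information available for fitting.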
The goal of our work is to build a regression model between the input tensors $\mathcal{X}_j$ ($j = 1, \cdots, m$) and the incomplete output tensor $\tilde{\mathcal{Y}}$. This can be achieved by integrating the low-rank structure of the response into the least-squares loss as follows:

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda\, \mathrm{rank}(\mathcal{Y}) + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (3)

The first term in the objective function, $\lambda\, \mathrm{rank}(\mathcal{Y})$, exploits the low-rank structure of the output tensor. Note that this problem assumes that the output is low-rank, which is common in high-dimensional spaces and can be validated based on domain knowledge. $\lambda$ is a user-specified tuning parameter that should be selected based on the procedure discussed in Section II.E. Unfortunately, problem (3) is NP-hard due to the nonconvexity of $\mathrm{rank}(\mathcal{Y})$. To address this issue, we employ a convex relaxation of the $\mathrm{rank}(\mathcal{Y})$ term. Specifically, we use the tensor nuclear norm to approximate the rank penalty, leading to the following tractable problem:

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$, (4)

where $\|\mathcal{Y}\|_* = \sum_{n} \alpha_n \|Y_{(n)}\|_*$ as defined in Section II.A. Solving this problem without imposing any constraint on the parameters $\mathcal{B}_j$ would result in severe overfitting, as the number of parameters is extremely large. To avoid this issue, we decompose $\mathcal{B}_j$ ($j = 1, \cdots, m$) using a Tucker operation and write

$\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M.$
Here, $\mathcal{C}_j \in \mathbb{R}^{\tilde{P}_{j1} \times \cdots \times \tilde{P}_{jm_j} \times \tilde{Q}_1 \times \cdots \times \tilde{Q}_M}$ is a core tensor with $\tilde{P}_{jk} \ll P_{jk}$ ($j = 1, \cdots, m$; $k = 1, \cdots, m_j$) and $\tilde{Q}_l \ll Q_l$ ($l = 1, \cdots, M$); $\{\mathbf{U}_{jk} : j = 1, \cdots, m;\ k = 1, \cdots, m_j\}$ is a set of bases that spans the $j$th input space; and $\{\mathbf{V}_l : l = 1, \cdots, M\}$ is a set of bases that spans the output space. Therefore, we aim to solve

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$; $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M$, $j = 1, \ldots, m$. (5)

To estimate the model parameters, we design an algorithm with the following two steps (see Algorithm 1):

Step 1 ($\mathcal{Y}$-update): This step solves the following sub-problem, assuming that the model parameters $\mathcal{B}_j^k$ ($j = 1, \cdots, m$) are given at the $(k+1)$th iteration of the algorithm:

$\mathcal{Y}^{k+1} = \arg\min_{\mathcal{Y}} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j^k\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (6)

Let $\mathcal{Z}^k = \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j^k$; then the problem can be rewritten as

$\mathcal{Y}^{k+1} = \arg\min_{\mathcal{Y}} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \mathcal{Z}^k\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (7)

Step 2 ($\mathcal{B}_j$-update): Given the estimated $\mathcal{Y}^k$ and the remaining $\mathcal{B}_l^k$ ($l = 1, \cdots, m$, $l \neq j$), this step solves

$\mathcal{B}_j^{k+1} = \arg\min_{\mathcal{B}_j} \left\|\mathcal{Y}_j^k - \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M,$ (8)

where $\mathcal{Y}_j^k = \mathcal{Y}^k - \sum_{l \neq j} \mathcal{X}_l * \mathcal{B}_l^k$.

Fig. 3. Illustration of the regression model for heterogeneous sets of data.
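The Tucker construction of the coefficient tensor used in problems (5)-(8) can be sketched with a generic mode-$n$ product; the sizes, the two input modes, and the two output modes below are hypothetical.

```python
import numpy as np

def mode_product(T, U, mode):
    # Mode-n product T x_n U: multiply the matrix U onto axis `mode` of T.
    Tm = np.moveaxis(T, mode, 0)
    out = np.tensordot(U, Tm, axes=(1, 0))   # new leading axis has size U.shape[0]
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(1)
core = rng.standard_normal((2, 2, 3, 3))     # low-dimensional core C_j
U1 = rng.standard_normal((5, 2))             # input-space bases U_j1, U_j2
U2 = rng.standard_normal((4, 2))
V1 = rng.standard_normal((6, 3))             # output-space bases V_1, V_2
V2 = rng.standard_normal((7, 3))

# B_j = C_j x_1 U_j1 x_2 U_j2 x_3 V_1 x_4 V_2
B = mode_product(mode_product(mode_product(mode_product(core, U1, 0),
                                           U2, 1), V1, 2), V2, 3)
```

Because the core is small relative to the full coefficient tensor, this parameterization is what keeps the number of free parameters manageable and prevents overfitting.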
The stopping criterion of Algorithm 1 is defined by a maximum number of iterations.

C. Solution to the $\mathcal{Y}$-update

The $\mathcal{Y}$-update problem (i.e., Equation (7)) can be transformed into

$\min_{Y_{(1)}, \cdots, Y_{(N)}} \sum_{i=1}^{N} \alpha_i \left( \lambda \left\|Y_{(i)}\right\|_* + \frac{1}{2}\left\|Y_{(i)} - Z_{(i)}\right\|_F^2 \right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (9)

The details of this transformation are provided in Appendix A. The default value of each weight $\alpha_i$, $i = 1, \ldots, N$, is set to $1/N$; that is, we assign equal weights to the tensor modes in the nuclear norm. Since this problem is the same at each iteration, we use the general notation $\mathcal{Y}$ and $\mathcal{Z}$, omitting the iteration index $k$ for simplicity. Solving the above problem is challenging because of the interdependent nuclear-norm terms: the entries of the matricizations of the output tensor are shared across the nuclear-norm terms and therefore cannot be treated independently while minimizing the objective function. To tackle this issue, we define auxiliary variables to split these interdependent terms. Specifically, we introduce additional local matrices $M_{(1)}, \cdots, M_{(N)}$, which represent the mode-$i$ matricizations of local tensors $\mathcal{M}_1, \cdots, \mathcal{M}_N$, respectively. Then, we have the following problem:

$\min_{M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}} \sum_{i=1}^{N} \alpha_i \left( \lambda \left\|M_{(i)}\right\|_* + \frac{1}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 \right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$; $\mathcal{Y} = \mathcal{M}_i$, $i = 1, \cdots, N$. (10)

Here, $\mathcal{Y}$ can be treated as a global variable, and the problem becomes a global consensus problem. By introducing $N$ local variables, problem (10) becomes separable, since the local variables are independent of each other.
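A minimal sketch of the machinery behind this splitting: each local update unfolds the tensor along one mode, applies singular-value soft-thresholding (the proximal operator of the nuclear norm), and folds the result back; the tensor sizes and thresholds are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n matricization: the chosen mode becomes the rows.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    # Inverse of unfold for a tensor of the given shape.
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def svt(M, tau):
    # Singular-value soft-thresholding: prox of tau * (nuclear norm).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(3)
T = rng.standard_normal((4, 5, 6))
M1 = svt(unfold(T, 1), tau=1.0)      # shrink along the mode-1 matricization
T1 = fold(M1, 1, T.shape)            # local copy M_i folded back to a tensor
```

Running this per mode yields the local tensors $\mathcal{M}_i$; a consensus step then reconciles them into a single global estimate.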
To solve (10), we first handle the consensus constraints $\mathcal{Y} = \mathcal{M}_i$ ($i = 1, \cdots, N$) by defining the augmented Lagrangian as follows:

$L_\rho\left(M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N\right) = \sum_{i=1}^{N} \left( \lambda \alpha_i \left\|M_{(i)}\right\|_* + \frac{\alpha_i}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 + \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right),$ (11)

where $\langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle$ denotes the inner product of the tensors $\Lambda_i$ and $\mathcal{Y} - \mathcal{M}_i$; $\Lambda_i$ ($i = 1, \ldots, N$) are the tensors of dual variables; and $\rho$ is the step size. Then, the problem is represented as

$\min_{M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N} L_\rho\left(M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N\right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (12)

The resulting ADMM algorithm for this problem is summarized in Algorithm 2, which includes three iterative steps: a local-variable update, a global-variable update, and a dual-variable update. For the local-variable update, we solve the unconstrained problem

$\min_{M_{(i)}} f_i\left(M_{(i)}\right),$ (13)

where $f_i\left(M_{(i)}\right) = \lambda \alpha_i \left\|M_{(i)}\right\|_* + \frac{\alpha_i}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 + \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2$. This problem is equivalent to

$\min_{M_{(i)}} \tau_i \left\|M_{(i)}\right\|_* + \frac{1}{2}\left\|M_{(i)} - W_{(i)}\right\|_F^2,$ (14)

where $\tau_i = \frac{\lambda \alpha_i}{\alpha_i + \rho}$ and $W_{(i)} = \frac{\alpha_i Z_{(i)} + \rho Y_{(i)} + \Lambda_{(i)}}{\alpha_i + \rho}$. The detailed derivation is provided in Appendix B. The minimization problem in Equation (14) can be solved by first computing the singular value decomposition (SVD) of $W_{(i)}$ and then applying a soft-thresholding operator to the singular values. This result is proved in [16] and presented in the following proposition.

Algorithm 1: BCD algorithm for solving problem (5)
1: Inputs: $\mathcal{X}_1, \ldots, \mathcal{X}_m$, $\tilde{\mathcal{Y}}$, and $\Omega$. Initialize $\mathcal{B}_1^0, \ldots, \mathcal{B}_m^0$ randomly and let $\mathcal{Y}^0 = \tilde{\mathcal{Y}}$.
2: Loop:
3:   $\mathcal{Y}$-update step: update $\mathcal{Y}^k$ to $\mathcal{Y}^{k+1}$ by solving Equation (7) given $\mathcal{B}_j^k$ ($j = 1, \cdots, m$).
4:   Let $\mathcal{Y}^k = \mathcal{Y}^{k+1}$.
5:   for $j = 1, \cdots, m$ do:
6:     $\mathcal{B}_j$-update step: update $\mathcal{B}_j^k$ to $\mathcal{B}_j^{k+1}$ by solving Equation (8) given $\mathcal{Y}^k$ and the remaining $\mathcal{B}_l^k$ for $l = 1, \cdots, m$, $l \neq j$.
7:     Let $\mathcal{B}_j^k = \mathcal{B}_j^{k+1}$.
8:   end
9: Until convergence.

Proposition 1. Let $\mathbf{Y} \in \mathbb{R}^{m \times n}$ and let $\mathbf{Y} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}$ be the SVD of $\mathbf{Y}$, where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthonormal matrices, $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix, and $r = \mathrm{rank}(\mathbf{Y})$. Then, $\widehat{\mathbf{X}} \equiv \arg\min_{\mathbf{X}} \left( \tau\|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{X} - \mathbf{Y}\|_F^2 \right)$ is given by $\widehat{\mathbf{X}} = \mathbf{U}_r \boldsymbol{\Sigma}_\tau \mathbf{V}_r^{\mathrm{T}}$, where $\boldsymbol{\Sigma}_\tau$ is diagonal with $(\boldsymbol{\Sigma}_\tau)_{ii} = \max(0, \boldsymbol{\Sigma}_{ii} - \tau)$, and $\mathbf{U}_r$ and $\mathbf{V}_r$ are the first $r$ columns of $\mathbf{U}$ and $\mathbf{V}$.

To update the global variable, we solve the following constrained optimization problem:

$\min_{\mathcal{Y}} g(\mathcal{Y})$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}}),$ (15)

where $g(\mathcal{Y}) = \sum_{i=1}^{N} \left( \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right)$. First, we compute its Lagrangian as

$L_g(\mathcal{Y}, \Phi) = \sum_{i=1}^{N} \left( \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right) + \langle \Phi, \mathcal{P}_\Omega(\mathcal{Y} - \tilde{\mathcal{Y}}) \rangle,$ (16)

where $\Phi$ is the tensor of the dual variable for this sub-problem. Then, a closed-form solution is derived from the optimality condition:

$\mathcal{Y} = \begin{cases} \tilde{\mathcal{Y}}, & \text{if } (i, q_1, \cdots, q_M) \in \Omega, \\ \frac{1}{N}\sum_{i=1}^{N} \left( \mathcal{M}_i - \frac{1}{\rho}\Lambda_i \right), & \text{otherwise.} \end{cases}$ (17)

The detailed derivations are provided in Appendix C. Finally, the dual variables $\Lambda_i$ are updated separately according to

$\Lambda_i \leftarrow \Lambda_i + \rho\left(\mathcal{Y} - \mathcal{M}_i\right).$ (18)

D. Solution to the $\mathcal{B}_j$-update

After the estimation of the response $\mathcal{Y}$ in each iteration of Algorithm 1, we estimate the tensors of coefficients $\mathcal{B}_j$. The key is to learn the core tensor $\mathcal{C}_j$ and the bases $\mathbf{V}_l$ and $\mathbf{U}_{jk}$. Motivated by [14], we learn $\mathbf{U}_{jk}$ directly from the input spaces by performing Tucker decomposition on the input tensors. After the estimation of $\mathbf{U}_{jk}$, the problem of learning $\mathcal{B}_j$ becomes the following optimization problem:

$\left(\mathcal{C}_j, \mathbf{V}_1, \cdots, \mathbf{V}_M\right) = \arg\min_{\mathcal{B}_j} \left\| Y_{j(1)} - X_{j(1)} \mathbf{B}_j \right\|_F^2$
subject to: $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M$; $\mathbf{V}_l^{\mathrm{T}} \mathbf{V}_l = \mathbf{I}_{\tilde{Q}_l}$ ($l = 1, \cdots, M$), (19)

where $Y_{j(1)}$ and $X_{j(1)}$ are the mode-1 matricizations of $\mathcal{Y}_j$ and $\mathcal{X}_j$, respectively; $\mathbf{B}_j \in \mathbb{R}^{P_j \times Q}$ is the matricization of the tensor $\mathcal{B}_j$ with $P_j = \prod_{k=1}^{m_j} P_{jk}$ and $Q = \prod_{l=1}^{M} Q_l$; and $\mathbf{I}_{\tilde{Q}_l} \in \mathbb{R}^{\tilde{Q}_l \times \tilde{Q}_l}$ is an identity matrix. Problem (19) can be solved using the ALS-BCD algorithm of [14]. One requirement of the ALS-BCD algorithm is knowledge of the rank of the response $\mathcal{Y}$. Since $\mathcal{Y}$ is estimated during the $\mathcal{Y}$-update step of Algorithm 1, we use the high-order SVD (HOSVD) to estimate its rank, namely the column ranks of $Y_{(n)}$ for $n = 1, \cdots, N$. More specifically, we use the truncated SVD to estimate the rank of the mode-$n$ matricization $Y_{(n)}$ by specifying a ratio of data variability to be explained. The ranks of the matricizations of $\mathcal{Y}$ along all modes are estimated in this way and then combined into the Tucker rank of the tensor $\mathcal{Y}$ for the ALS-BCD algorithm.

E. Selection of Hyper-parameters
In the proposed method, the global hyper-parameter $\lambda$ and the local hyper-parameters $\tilde{P}_{jk}$ ($j = 1, 2, \cdots, m$; $k = 1, 2, \cdots, m_j$) must be identified. $\lambda$ balances the rank penalty and the least-squares error in the objective function. Based on our experiments, the proposed method is robust to a wide range of values of $\lambda$; in this paper, we set $\lambda = 1$ by default. This parameter can also be determined by cross-validation. Similar to the procedure for estimating the rank of $\mathcal{Y}$ in Section II.D, the Tucker ranks of the input tensors $\mathcal{X}_j$ can be estimated using the truncated HOSVD. Based on the estimated ranks of the input tensors, the parameters $\tilde{P}_{jk}$ ($j = 1, 2, \cdots, m$; $k = 1, 2, \cdots, m_j$) are determined.

III. SIMULATION STUDY
In this section, a set of simulation studies is carried out to evaluate the performance of our proposed method, herein referred to as tensor regression with missing values (TRMV). A benchmark method, named TC-MTOT, is used for comparison: the completion of the output tensor (TC) is first conducted with the rank-minimization method of [16], [17], and then the regression model between the inputs and the completed response is constructed using MTOT [14]. The estimated
Algorithm 2: ADMM algorithm for problem (12)
1: Inputs: $\mathcal{X}_1, \ldots, \mathcal{X}_m$, the estimated $\widehat{\mathcal{B}}_1^{k-1}, \ldots, \widehat{\mathcal{B}}_m^{k-1}$, $\tilde{\mathcal{Y}}$, and $\Omega$.
2: Loop:
3:   for $i = 1, \ldots, N$:
4:     $M_{(i)}^t \leftarrow \arg\min_{M_{(i)}} f_i\left(M_{(i)}\right)$; $\mathcal{M}_i^t \leftarrow \mathrm{fold}\left(M_{(i)}^t\right)$ (local variables)
5:   $\mathcal{Y}^t \leftarrow \arg\min_{\{\mathcal{Y}:\, \mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})\}} g(\mathcal{Y})$ (global variable)
6:   for $i = 1, \ldots, N$:
7:     $\Lambda_i^t \leftarrow \Lambda_i^{t-1} + \rho\left(\mathcal{Y}^t - \mathcal{M}_i^t\right)$ (dual variables)
8: Until convergence.

A. Curve-on-curve Regression
Profile-on-profile regression has a wide range of applications in both manufacturing and healthcare [22]. This simulation mimics profile-on-profile regression modeling when the output contains missing values. Following the simulation study in [23], we first randomly generate $\mathcal{B}_1, \cdots, \mathcal{B}_m$ as $\mathcal{B}_j(s, t) = \frac{1}{3}\left[a_{1j}(s)b_{1j}(t) + a_{2j}(s)b_{2j}(t) + a_{3j}(s)b_{3j}(t)\right]$, where $a_{lj}(s)$ and $b_{lj}(t)$ ($l = 1, 2, 3$; $j = 1, \cdots, m$) are Gaussian processes with covariance function $k(z, z') = \left(1 + 20|z - z'|\right)e^{-20|z - z'|}$. Then, the $m$ input profiles are simulated in the following three steps: 1) generate a matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times m}$ whose $(j, j')$th entry equals 1 for $j = j'$ and equals $\rho_x = 0$ or $0.5$ for $j \neq j'$; 2) apply the eigendecomposition $\boldsymbol{\Sigma} = \mathbf{H}\mathbf{H}^{\mathrm{T}}$ to obtain an $m \times m$ matrix $\mathbf{H}$, and generate a set of profiles $\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_m$ using a Gaussian process with covariance function $k(z, z') = e^{-2|z - z'|}$; 3) generate the input curves at any given point $s$ by $\left[\mathbf{x}_1(s), \cdots, \mathbf{x}_m(s)\right] = \left[\mathbf{w}_1(s), \cdots, \mathbf{w}_m(s)\right]\mathbf{H}^{\mathrm{T}}$. According to this procedure, for each $s$ the vector $\left[x_1(s), \cdots, x_m(s)\right]$ follows a multivariate Gaussian distribution with covariance $\boldsymbol{\Sigma}$. When setting $\rho_x = 0$ to generate the data, the variables of the vector $\left[x_1(s), \cdots, x_m(s)\right]$ are uncorrelated. The output profiles are finally simulated by $y_i(t) = \sum_{j=1}^{m} \int \mathcal{B}_j(s, t)\, x_{ij}(s)\, ds + e_i(t)$, where the $e_i(t)$ are independent random variables from a Gaussian distribution with mean zero and variance $\sigma^2$. All input and output profiles are generated over an equidistant grid of size 100, defined over the intervals $[0, 2)$ and $[0, 1)$, respectively. In this study, we set $m = 2$.
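The Gaussian-process sampling used in steps 1)-3) can be sketched by drawing sample paths on a grid through a Cholesky factor of the covariance matrix; the specific kernel form and grid below are illustrative assumptions.

```python
import numpy as np

def gp_sample(grid, kernel, n_samples, rng, jitter=1e-8):
    # Draw sample paths of a zero-mean Gaussian process on a 1-D grid.
    D = np.abs(grid[:, None] - grid[None, :])        # pairwise distances
    K = kernel(D) + jitter * np.eye(len(grid))       # covariance + jitter
    L = np.linalg.cholesky(K)                        # K = L L^T
    return L @ rng.standard_normal((len(grid), n_samples))

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 100)                           # equidistant grid of size 100
k = lambda d: (1.0 + 20.0 * d) * np.exp(-20.0 * d)   # assumed Matern-type kernel
W = gp_sample(t, k, n_samples=3, rng=rng)            # three sample profiles
```

Multiplying the resulting independent profiles by $\mathbf{H}^{\mathrm{T}}$, as in step 3), induces the cross-correlation $\rho_x$ between the input curves.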
After the profiles are generated, we randomly remove $\gamma = 80\%$ or $90\%$ of the entries of the output profiles in the training dataset. We refer to these data generation steps as procedure A. Fig. 4 shows an example of the ground-truth profile, the profile with 80% missing values, and its estimation by TRMV. Based on procedure A, we first generate a dataset of 200 samples with $\rho_x = 0$. We use $n_{tr} = 100$ samples for model training and hyper-parameter estimation and the remaining $n_{te} = 100$ samples for testing. For quantitative evaluation, the standardized prediction error (SPE) is used to evaluate the performance of the proposed model; it is defined as $SPE = \|\mathcal{Y} - \widehat{\mathcal{Y}}\|_F / \|\mathcal{Y}\|_F$. For better illustration, we transform the SPE by taking the negative inverse of its logarithm, i.e., $-1/\log(SPE)$, called the transformed SPE (TSPE). Given the generated dataset, we employ the TRMV and TC-MTOT methods to complete the response and estimate the tensors of coefficients $\mathcal{B}_j$ ($j = 1, 2$) at the aforementioned levels of missing values. The trained models are then applied to the testing set to predict the responses. The performance of each model is then evaluated and compared in terms of TSPE. Fig. 5 compares the performance of TRMV to that of TC-MTOT at different levels of missing values in the response. The results reveal that TRMV outperforms TC-MTOT at every level of missing values (LMV). For example, at 80% missing values with $\rho_x = 0$, the mean TSPE of the proposed method is 0.3382, which is significantly lower than that of TC-MTOT, 0.7206. This indicates that TRMV can take advantage of the available information in both the inputs and the response, leading to improved estimations. Fig. 6 shows the process of rank estimation. As illustrated, when the proportion of missing values is relatively small, our algorithm converges to the true rank in fewer iterations; for example, fewer than 60 iterations are sufficient. (a) $\rho_x = 0$ (b) $\rho_x = 0.5$ Fig. 5.
Fig. 5. Performance comparison between our proposed method (TRMV) and TC-MTOT under different levels of missing values: (a) $\sigma = 0$; (b) $\sigma = 0.5$.
Fig. 6. Process of rank estimation for the response at different levels of missing values: (a) 80%; (b) 90%.
Fig. 4. Examples of the output profiles, including the original profiles (100 observations), the profiles with 80% missing values (20 observations), and the estimated ones: (a) ground truth; (b) 80% missing; (c) estimation.

B. Simulation Study for Image Outputs
In this simulation, we evaluate the performance of our proposed method when multiple forms of data are available. Specifically, the waveform surfaces $\mathcal{Y}_i$ are generated based on two input tensors, $\mathcal{X}_{1i}$ and $\mathcal{X}_{2i}$ $(i = 1, \cdots, n_t)$, where $n_t$ is the sample size. First, the sampling points $x_{sjk} = k/m_{sj}$ $(s = 1, 2;\ j = 1, \cdots, d_s;\ k = 1, \cdots, m_{sj})$ are defined in order to generate the input tensors. Then, we set the factor matrices $U_{sj} = [u_{sj1}, u_{sj2}, \cdots, u_{sjr_s}]$ $(s = 1, 2;\ j = 1, \cdots, d_s)$, where
$$u_{sjt} = \begin{cases} [\cos(2\pi t x_{sj1}), \cdots, \cos(2\pi t x_{sjm_{sj}})]^{\top}, & \text{if } t \text{ is odd},\\ [\sin(2\pi t x_{sj1}), \cdots, \sin(2\pi t x_{sjm_{sj}})]^{\top}, & \text{if } t \text{ is even}. \end{cases}$$
Then, the elements of a core tensor $\mathcal{S}_{si}$ are generated randomly from a standard normal distribution. Next, $\mathcal{X}_{si}$ is simulated as
$$\mathcal{X}_{si} = \mathcal{S}_{si} \times_1 U_{s1} \times_2 \cdots \times_{d_s} U_{sd_s}, \quad (s = 1, 2;\ i = 1, \cdots, n_t).$$
Furthermore, the coefficient tensors $\mathcal{B}_s$ are generated for the regression model. First, a core tensor $\mathcal{C}_s$ is simulated from a standard normal distribution. Next, the basis of the output space, $V_j = [v_{j1}, v_{j2}, \cdots, v_{jr}]$ $(j = 1, \cdots, q)$, is generated as
$$v_{jt} = \begin{cases} [\cos(2\pi t y_{j1}), \cdots, \cos(2\pi t y_{jm_j})]^{\top}, & \text{if } t \text{ is odd},\\ [\sin(2\pi t y_{j1}), \cdots, \sin(2\pi t y_{jm_j})]^{\top}, & \text{if } t \text{ is even}, \end{cases}$$
where $y_{jk} = k/m_j$. Finally, the coefficient tensors $\mathcal{B}_s$ are computed as
$$\mathcal{B}_s = \mathcal{C}_s \times_1 U_{s1} \times_2 \cdots \times_{d_s} U_{sd_s} \times_{d_s+1} V_1 \times_{d_s+2} \cdots \times_{d_s+q} V_q.$$
Given the input tensors and coefficient tensors, the response tensors are calculated as
$$\mathcal{Y}_i = \sum_{s=1}^{S} \mathcal{X}_{si} * \mathcal{B}_s + \mathcal{E}_i,$$
where $*$ denotes the tensor contraction of the regression model and $\mathcal{E}_i$ is the error tensor whose elements follow a normal distribution $\mathcal{N}(0, \sigma)$. We refer to these data generation steps as procedure B.
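A stripped-down version of procedure B can be sketched as follows. This sketch makes simplifying assumptions (one input mode, a 2-way Tucker product, small dimensions), and all names and sizes are illustrative rather than the paper's actual configuration.

```python
# Minimal sketch of procedure B: Fourier factor matrices, random Gaussian
# cores, a Tucker-structured coefficient, and a noisy linear response.
import numpy as np

def fourier_basis(m, r):
    """Columns alternate cos/sin waves evaluated at x_k = k/m."""
    x = np.arange(1, m + 1) / m
    cols = []
    for t in range(1, r + 1):
        f = np.cos if t % 2 == 1 else np.sin   # odd t -> cos, even t -> sin
        cols.append(f(2 * np.pi * t * x))
    return np.column_stack(cols)               # shape (m, r)

rng = np.random.default_rng(1)
m, r, q, sigma = 30, 3, 4, 0.01                # sizes, rank, output rank, noise
U = fourier_basis(m, r)                        # input basis
V = fourier_basis(20, q)                       # output basis

# Input profile: a random core expanded by the input basis (rank-r structure).
s = rng.standard_normal(r)
x = U @ s                                      # simulated input, shape (30,)

# Coefficient B = C x_1 U x_2 V (Tucker product in the 2-way case).
C = rng.standard_normal((r, q))
B = U @ C @ V.T                                # shape (30, 20)

# Response: contraction of the input with B plus Gaussian noise.
y = x @ B + sigma * rng.standard_normal(20)
print(y.shape)                                 # (20,)
```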
This simulation is conducted for two main purposes: (i) to illustrate the response completion and prediction procedures; and (ii) to evaluate the robustness of the proposed method under different noise levels and different rank configurations. First, we generate $S = 2$ inputs with $d_1 = 1$ and $d_2 = 2$, i.e., a low-rank profile $\mathcal{X}_{1i}$ and a low-rank image $\mathcal{X}_{2i}$. Then, the low-rank response tensors $\mathcal{Y}_i$ are generated, and the core tensors of the model parameters, $\mathcal{C}_1$ and $\mathcal{C}_2$, have correspondingly small dimensions. Here, 200 samples are first generated under $\sigma = 0$, with 100 training samples and 100 testing samples. Then, we randomly remove certain proportions of the entries, such as $p =$ 80% and 90%, from the training responses. In this study, we also use the TSPE to evaluate the performance of the proposed method. For each level of missing values, we train models using TRMV and TC-MTOT on the training set. Then, we evaluate and compare the estimated models in terms of their TSPEs, calculated on the testing set. Fig. 7 illustrates an example of the ground truth surface, the surface with 90% missing values, and its estimation by TRMV. As illustrated, the estimation result is highly consistent with the ground truth. The TSPEs are computed on the testing set to compare the proposed method to the benchmark. The results are plotted in Fig. 8. Although the TSPEs of both methods increase as the percentage of missing values grows, TRMV retains the smaller errors.
Fig. 7. Example of output surfaces, including the actual surface (a), the one with 90% missing values (b), and the estimation using TRMV (c).
Fig. 9. Process of rank estimation of the output tensor at 80% and 90% missing values.
Fig. 8. Performance comparison between the TRMV model and the benchmark under different levels of missing values.
To evaluate the robustness of the proposed method, two additional datasets are generated based on procedure B. Each setting contains 100 training samples and 100 testing samples. The first dataset is generated with responses $\mathcal{Y}_i$ of ranks [3, 4, 5, 6, 7, 8] and inputs $\mathcal{X}_{1i}$ and $\mathcal{X}_{2i}$ of rank 2. Here, the noise level is set to zero (i.e., $\sigma = 0$). Then, we remove entries randomly from the training responses to mimic different levels of missing values. For the second dataset, different levels of noise (i.e., $\sigma = 0.001 \times [0, 2, 4, 6, 8, 10]$) are added to the response.
Here, the inputs $\mathcal{X}_{1i}$ (a profile) and $\mathcal{X}_{2i}$ (an image) are of rank 3, and the response $\mathcal{Y}_i$ is of rank 3. Next, we randomly generate an index set $\Omega$ with 90% missing entries and then project the training set onto $\Omega$, i.e., $\mathcal{P}_{\Omega}(\mathcal{Y})$. Fig. 10 shows the TSPEs obtained by TRMV and TC-MTOT at different rank settings at the 80% (panel (a)) and 90% (panel (b)) levels of missing values. In both panels, the dotted lines with error bars show the mean and variance of the TSPEs obtained by TC-MTOT; the solid lines with error bars show those obtained by TRMV. Panel (a) of Fig. 10 shows that although TC-MTOT achieves TSPEs comparable to TRMV for datasets that are highly correlated and have a simpler structure (i.e., rank 3 or 4), it performs worse when the structure becomes more complex. This superiority stems from the use of input information in completing the output tensor. From the results illustrated in panels (a) and (b), we conclude that TRMV performs more robustly, with smaller mean and variance of the prediction errors at every rank setting. Table III shows the TSPEs obtained by TRMV and TC-MTOT at different levels of noise. As reported in the table, TRMV achieves smaller TSPEs than TC-MTOT at every level of noise and missing values. For example, when $\sigma = 0.006$ and LMV = 70%, TRMV achieves a TSPE of 0.4398, while TC-MTOT achieves a TSPE of 0.4583. Although the TSPEs increase at higher levels of noise and missing values, TRMV predicts the output more robustly than TC-MTOT. This superiority mainly arises because TRMV benefits from the systematic and simultaneous estimation of both the model parameters and the missing values.

IV. CASE STUDY
In this section, we further evaluate the effectiveness of our proposed method by predicting the overlay (OV) errors in the lithography process. We first introduce OV errors and then implement the TRMV method on an OV dataset. As shown in Fig. 11 [24], the lithography process aims to transfer a 2-D pattern from an optical mask to a light-sensitive chemical photoresist on the wafer surface, which is a critical step in semiconductor manufacturing. During this process, the desired pattern is etched into the material through a series of chemical treatments, such as exposure to light, cleaning, etc. In practice, a wafer may go through such photolithographic cycles multiple times. During each cycle, one small rectangular field on the wafer surface is completed. Misalignment errors may occur when the patterns on the mask cannot be projected exactly at the desired location on the surface of a wafer. This misalignment is called overlay error and is critical to wafer quality. As mentioned in the introduction, OV errors are measured only at a limited number of marked locations over the wafer because it is impossible to measure the OV errors of every rectangular field on the wafer. In addition, OV errors are highly dependent on the settings of the
Fig. 11. Illustration of the scanner in semiconductor lithography.
TABLE III: PERFORMANCE COMPARISON BETWEEN TRMV AND TC-MTOT AT DIFFERENT LEVELS OF NOISE FOR THE WAVEFORM DATASET (columns: LMV = 60%, 70%, 80%, 90%; rows: noise level $\sigma$; entries: TSPEs of TRMV and TC-MTOT).
Fig. 10. Performance comparison between TRMV and TC-MTOT at different values of ranks: (a) 80% and (b) 90% missing values.
lithography machine, which can be used to predict the change of the OV errors. Knowledge of the OV errors can support process adjustments, such as the alignment of the projection lens and the location tuning of the wafer stage. To improve the quality of a lithography process, it is critical to correct the OV errors by modeling the relationship between the machine settings and the partially measured OV errors. To construct and evaluate our model, we generate a dataset containing the machine settings and OV errors for 200 wafers based on a validated procedure explained in Appendix D.
We use these validated simulated data for two main reasons. First, the real data are proprietary. Second, the known "ground truth" OV errors can be used to evaluate the accuracy of our method. Given the generated dataset, we divide the 200 samples into a training set and a testing set, with 100 samples in each. For the training set, we assume that 80% or 90% of the points on a wafer are inaccessible. The goal is to build a model that estimates the OV errors given the settings of the lithography machine. Given the incomplete training data, we simultaneously estimate the model parameters and complete the missing entries using TRMV. Finally, we investigate the prediction performance of the estimated model on the testing set and calculate the prediction error in terms of the TSPE. In this study, TC-MTOT is again used as the benchmark method for comparison. Fig. 12 shows the prediction results of the two models estimated by TRMV and TC-MTOT at 80% and 90% missing values, respectively. From the results, it is evident that TRMV outperforms TC-MTOT at both levels of missing values.

V. CONCLUSION
In this paper, we develop a systematic framework for tensor regression when the response contains missing values. The novelty of this methodology lies in integrating tensor completion with tensor regression in a unified manner. To address the challenge posed by missing values in the response, a tensor nuclear-norm penalty is introduced into the least-squares loss function. Meanwhile, a Tucker decomposition is applied to the model coefficients to prevent overfitting. A two-step BCD algorithm is proposed to estimate the unknowns, including the missing entries and the model coefficients. Iteratively, we first complete the missing entries of the response using the ADMM algorithm. Given the completed response, the core tensors and the bases of the coefficients are then estimated by a BCD-ALS algorithm. Notably, the completion procedure in the first step of the algorithm enables the estimation of the response rank, which can be used by the second step. These integrated optimization efforts lead to three major advantages of the proposed methodology: (i) more accurate model estimation and response completion; (ii) robust prediction performance; and (iii) automatic rank convergence of the response. Two simulation studies and a case study are conducted to evaluate the performance of the proposed method in comparison to a two-step method, i.e., TC-MTOT. The numerical results show that TRMV significantly outperforms TC-MTOT. The main contribution of this work is to explore a new direction in tensor regression with missing values. An interesting extension for future study is the selection of input tensors, in the sense that some available inputs may not be informative for the response estimation.
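The alternating scheme summarized above (a completion step for the response followed by a coefficient-estimation step) can be illustrated with a deliberately simplified matrix-valued sketch. Here singular-value thresholding and ordinary least squares stand in for the ADMM completion and BCD-ALS steps of the actual method, and all names, the threshold `tau`, and the blending rule for missing entries are assumptions for illustration only.

```python
# Simplified stand-in for the alternating estimation: (i) fit coefficients by
# least squares, (ii) obtain a low-rank completion by singular-value
# thresholding, and keep observed entries fixed at their measured values.
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trmv_sketch(X, Y_obs, mask, tau=0.5, n_iter=20):
    """Alternate between a coefficient step and a completion step."""
    Y = np.where(mask, Y_obs, 0.0)                 # crude initialization
    for _ in range(n_iter):
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # coefficient step
        Y_fit = X @ B                              # regression-based estimate
        Y_low = svt(Y, tau)                        # low-rank completion step
        # Observed entries stay fixed; missing entries blend the model fit
        # and the low-rank estimate (a crude analogue of the Y-update).
        Y = np.where(mask, Y_obs, 0.5 * (Y_fit + Y_low))
    return Y, B

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
B_true = rng.standard_normal((5, 8))
Y_true = X @ B_true
mask = rng.random(Y_true.shape) >= 0.8             # 80% of entries missing
Y_hat, B_hat = trmv_sketch(X, Y_true, mask)
print(np.linalg.norm(Y_hat - Y_true) / np.linalg.norm(Y_true))
```

The point of the sketch is the control flow, not the exact updates: the observed entries anchor both steps, which is what lets the coefficient estimate and the completion improve each other across iterations.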
APPENDIX A: TRANSFORMATION OF (7) TO (9)
The transformation of the Frobenius norm from problem (7) to problem (9) is derived as follows. Since $\|\mathcal{Y} - \mathcal{M}\|_F^2 = \|Y_{(n)} - M_{(n)}\|_F^2$ for all $n$, and $\sum_{n=1}^{N} \alpha_n = 1$, we have
$$\|\mathcal{Y} - \mathcal{M}\|_F^2 = \sum_{n=1}^{N} \alpha_n \|\mathcal{Y} - \mathcal{M}\|_F^2 = \sum_{n=1}^{N} \alpha_n \|Y_{(n)} - M_{(n)}\|_F^2.$$

APPENDIX B: PROBLEM (13) TRANSFORMATION
The objective of problem (13) can be transformed as follows. Since $\mathcal{Y}$ and $\Lambda_n$ are given, the $\mathcal{M}$-update problem (Equation (14)) is equivalent to
$$\begin{aligned} &\tau \alpha_n \|M_{(n)}\|_{*} + \alpha_n \|M_{(n)} - Z_{(n)}\|_F^2 - \langle \Lambda_n, \mathcal{M}_n \rangle + \frac{\rho}{2} \|\mathcal{Y} - \mathcal{M}_n\|_F^2 \\ &= \tau \alpha_n \|M_{(n)}\|_{*} + \alpha_n \|M_{(n)} - Z_{(n)}\|_F^2 + \frac{\rho}{2} \left\| M_{(n)} - \left( Y_{(n)} + \frac{1}{\rho} \Lambda_{(n)} \right) \right\|_F^2 + c \\ &= \tau \alpha_n \|M_{(n)}\|_{*} + \left( \alpha_n + \frac{\rho}{2} \right) \left\| M_{(n)} - \frac{\alpha_n Z_{(n)} + (\rho/2) D_{(n)}}{\alpha_n + \rho/2} \right\|_F^2 + c, \end{aligned} \tag{B1}$$
where $c$ is a constant and $D_{(n)} = Y_{(n)} + \frac{1}{\rho} \Lambda_{(n)}$.

APPENDIX C: SOLUTION TO (17)
In this appendix, we derive the solution to problem (17). Setting the partial derivatives of the Lagrangian $L_y(\mathcal{Y}, \Phi)$ to zero, we have
$$\begin{cases} \dfrac{\partial L_y(\mathcal{Y}, \Phi)}{\partial \mathcal{Y}} = \sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y} - \mathcal{M}_n \right) \right] + \mathcal{P}_{\Omega}(\Phi) = 0, \\[4pt] \dfrac{\partial L_y(\mathcal{Y}, \Phi)}{\partial \Phi} = \mathcal{P}_{\Omega}(\mathcal{Y} - \widetilde{\mathcal{Y}}) = 0, \end{cases} \tag{C1}$$
where $\widetilde{\mathcal{Y}}$ denotes the observed response. When $(i_1, i_2, \cdots, i_N) \notin \Omega$, we only have
$$\sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y}(i_1, i_2, \cdots, i_N) - \mathcal{M}_n \right) \right] = 0. \tag{C2}$$
Fig. 12. Performance comparison between TRMV and TC-MTOT under 80% and 90% missing values.
Otherwise, we have
$$\begin{cases} \mathcal{P}_{\Omega}(\Phi) = -\sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y} - \mathcal{M}_n \right) \right], \\ \mathcal{P}_{\Omega}(\mathcal{Y} - \widetilde{\mathcal{Y}}) = 0. \end{cases} \tag{C3}$$
Thus, we obtain the optimal solution of the minimization problem as
$$\mathcal{Y}(i_1, \cdots, i_N) = \begin{cases} \widetilde{\mathcal{Y}}(i_1, \cdots, i_N), & \text{if } (i_1, \cdots, i_N) \in \Omega, \\ \dfrac{1}{N} \sum_{n=1}^{N} \left( \mathcal{M}_n - \dfrac{1}{\rho} \Lambda_n \right)(i_1, \cdots, i_N), & \text{otherwise}. \end{cases} \tag{C4}$$

APPENDIX D: CASE STUDY DATA GENERATION
This appendix elaborates on the popular procedure used for generating the OV errors in the lithography process. The OV error is denoted as a vector οΏ½πΉπΉ π₯π₯ , πΉπΉ π¦π¦ οΏ½ in 2D coordinate system, with the initial point ( π₯π₯ , π¦π¦ ) and the terminal point οΏ½π₯π₯ + πΉπΉ π₯π₯ , π¦π¦ + πΉπΉ π¦π¦ οΏ½ representing the locations of the previous layer and the current layer, respectively. To maintain OV requirement, the polynomial model is widely used in practice for overlay control [25], [26], which is represented as follows, οΏ½πΉπΉ π₯π₯ ( π₯π₯ , π¦π¦ ) = π€π€ π₯π₯ πππΉπΉ π¦π¦ ( π₯π₯ , π¦π¦ ) = π€π€ π¦π¦ ππ , (D1) where πΉπΉ π₯π₯ and πΉπΉ π¦π¦ are the coordinates of the OV error; ππ =[1, π₯π₯ , π¦π¦ , π₯π₯ , π₯π₯π¦π¦ , π¦π¦ , π₯π₯ , π₯π₯ π¦π¦ , π₯π₯π¦π¦ , π¦π¦ ] ππ β β is a basis; π€π€ π₯π₯ =[ ππ , ππ , ππ , β¦ , ππ ] and π€π€ π¦π¦ = [ ππ , ππ , ππ , β¦ , ππ ] define the signatures of an OV error, which are used for on-line correction via machine settings such as wafer stage, lens, mask, and so on, along the motion directions. With this model, the OV error can be broken into linear components, i.e., ( ππ , ππ , β¦ , ππ ) and nonlinear components, i.e., ( ππ , ππ , β¦ , ππ ) , where the linear ones associate with errors including reticle rotation, wafer rotation, to name a few, and the nonlinear ones correspond to causes such as lens distortion and random errors [26]. Specifically, a set of points in a wafer are first specified to construct a basis ππ = [ ππ , ππ , β¦ , ππ ππ ] . For the signature, we only take the variation of linear components into consideration, which means ( ππ , ππ , β¦ , ππ ) equals to zeros. The variation patterns of the signature ( ππ , ππ , β¦ , ππ ) are randomly generated and we obtain ππ π₯π₯ππ = οΏ½ππ , ππ , ππ , 0, β¦ ,0 οΏ½ and ππ π¦π¦ππ = οΏ½ππ , ππ , ππ , 0, β¦ ,0 οΏ½ for ππ = 1, β¦ , ππ . 
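The basis and signature construction above, together with the resulting OV pattern, can be sketched as follows. This is a hypothetical illustration of the third-order polynomial model (D1) under the stated assumption that only the three linear-signature coefficients are nonzero; the measurement locations and all variable names are assumptions.

```python
# Sketch of the polynomial OV model (D1): build the degree-3 basis at m
# wafer locations, draw a linear-only signature, and evaluate the pattern.
import numpy as np

def poly_basis(x, y):
    """Third-order basis g(x, y) = [1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3]."""
    return np.array([np.ones_like(x), x, y, x**2, x * y, y**2,
                     x**3, x**2 * y, x * y**2, y**3])      # shape (10, m)

rng = np.random.default_rng(3)
m = 50                                                     # measurement locations
x, y = rng.uniform(-1, 1, m), rng.uniform(-1, 1, m)        # points in [-1, 1]^2
G = poly_basis(x, y)                                       # basis matrix, (10, m)

# Linear signature only: first three coefficients random, the rest zero.
w_x = np.concatenate([rng.standard_normal(3), np.zeros(7)])
w_y = np.concatenate([rng.standard_normal(3), np.zeros(7)])

# OV pattern at the m locations: delta_x = w_x^T g(x, y), likewise for y.
delta_x, delta_y = w_x @ G, w_y @ G
print(delta_x.shape, delta_y.shape)                        # (50,) (50,)
```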
Then, for each $(\psi_{xi}, \psi_{yi})$, the associated OV pattern $(r_{xi}, r_{yi})$ can be calculated as $r_{xi} = [\delta_{xi1}, \ldots, \delta_{xim}]$ and $r_{yi} = [\delta_{yi1}, \ldots, \delta_{yim}]$.

REFERENCES
[1] Y. Ding, J. H. Jin, D. Ceglarek, and J. J. Shi, "Process-oriented tolerancing for multi-station assembly systems,"
IIE Trans., vol. 37, no. 6, pp. 493-508, Jun 2005. [2] K. B. Liu, N. Z. Gebraeel, and J. J. Shi, "A Data-Level Fusion Model for Developing Composite Health Indices for Degradation Modeling and Prognostic Analysis,"
IEEE Trans. Autom. Sci. Eng., vol. 10, no. 3, pp. 652-664, Jul 2013. [3] H. Zhou, L. X. Li, and H. T. Zhu, "Tensor Regression with Applications in Neuroimaging Data Analysis,"
J. Am. Stat. Assoc., vol. 108, no. 502, pp. 540-552, Jun 2013. [4] S. Wold, K. Esbensen, and P. Geladi, "Principal Component Analysis,"
Chemometr. Intell. Lab., vol. 2, no. 1-3, pp. 37-52, Aug 1987. [5] H. Abdi, "Partial least squares regression and projection on latent structure regression (PLS Regression),"
Wiley Interdiscip. Rev. Comput. Stat., vol. 2, no. 1, pp. 97-106, Jan-Feb 2010. [6] X. W. Yue, H. Yan, J. G. Park, Z. Y. Liang, and J. J. Shi, "A Wavelet-Based Penalized Mixed-Effects Decomposition for Multichannel Profile Detection of In-Line Raman Spectroscopy,"
IEEE Trans. Autom. Sci. Eng., vol. 15, no. 3, pp. 1258-1271, Jul 2018. [7] N. Jin, S. Y. Zhou, T. S. Chang, and H. H. H. Huang, "Identification of influential functional process variables for surface quality control in hot rolling processes,"
IEEE Trans. Autom. Sci. Eng., vol. 5, no. 3, pp. 557-562, Jul 2008. [8] H. Yan, K. Paynabar, and J. J. Shi, "Image-Based Process Monitoring Using Low-Rank Tensor Decomposition,"
IEEE Trans. Autom. Sci. Eng., vol. 12, no. 1, pp. 216-227, Jan 2015. [9] X. W. Yue, J. G. Park, Z. Y. Liang, and J. J. Shi, "Tensor Mixed Effects Model With Application to Nanomanufacturing Inspection,"
Technometrics, vol. 62, no. 1, pp. 116-129, Jan 2 2020. [10] Q. B. Zhao, C. F. Caiafa, D. P. Mandic, Z. C. Chao, Y. Nagasaka, N. Fujii, L. Q. Zhang, and A. Cichocki, "Higher Order Partial Least Squares (HOPLS): A Generalized Multilinear Regression Method,"
IEEE Trans. Pattern Anal., vol. 35, no. 7, pp. 1660-1673, Jul 2013. [11] X. S. Li, D. Xu, H. Zhou, and L. X. Li, "Tucker Tensor Regression and Neuroimaging Analysis,"
Statist. Biosci., vol. 10, no. 3, pp. 520-545, Dec 2018. [12] H. Yan, K. Paynabar, and M. Pacella, "Structured Point Cloud Data Analysis Via Regularized Tensor Regression for Process Modeling and Optimization,"
Technometrics, vol. 61, no. 3, pp. 385-395, Jul 3 2019. [13] E. F. Lock, "Tensor-on-Tensor Regression,"
J. Comput. Graph. Stat., vol. 27, no. 3, pp. 638-647, 2018. [14] M. R. Gahrooei, H. Yan, K. Paynabar, and J. J. Shi, "Multiple Tensor-on-Tensor Regression: An Approach for Modeling Processes With Heterogeneous Sources of Data,"
Technometrics,
Jan 14 2020. [15] X. F. Lin, H. E. Perez, J. B. Siegel, and A. G. Stefanopoulou, "Robust Estimation of Battery System Temperature Distribution Under Sparse Sensing and Uncertainty,"
IEEE Trans. Control Syst. Technol., vol. 28, no. 3, pp. 753-765, May 2020. [16] J. F. Cai, E. J. Candes, and Z. W. Shen, "A Singular Value Thresholding Algorithm for Matrix Completion,"
SIAM J. Optimiz., vol. 20, no. 4, pp. 1956-1982, 2010. [17] J. Liu, P. Musialski, P. Wonka, and J. P. Ye, "Tensor Completion for Estimating Missing Values in Visual Data,"
IEEE Trans. Pattern Anal., vol. 35, no. 1, pp. 208-220, Jan 2013. [18] G. Tomasi and R. Bro, "PARAFAC and missing values,"
Chemometr. Intell. Lab., vol. 75, no. 2, pp. 163-180, Feb 28 2005. [19] D. Kressner, M. Steinlechner, and B. Vandereycken, "Low-rank tensor completion by Riemannian optimization,"
BIT, vol. 54, no. 2, pp. 447-468, Jun 2014. [20] C. Y. Lu, J. S. Feng, Y. D. Chen, W. Liu, Z. C. Lin, and S. C. Yan, "Tensor Robust Principal Component Analysis with a New Tensor Nuclear Norm,"
IEEE Trans. Pattern Anal., vol. 42, no. 4, pp. 925-938, Apr 1 2020. [21] T. G. Kolda and B. W. Bader, "Tensor Decompositions and Applications,"
SIAM Rev., vol. 51, no. 3, pp. 455-500, Sep 2009. [22] F. Cismondi, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. C. Sousa, and S. N. Finkelstein, "Missing data in medical databases: Impute, delete or classify?,"
Artif. Intell. Med., vol. 58, no. 1, pp. 63-72, May 2013. [23] X. Qi and R. Y. Luo, "Function-on-function regression with thousands of predictive curves,"
J. Multivariate Anal., vol. 163, pp. 51-66, Jan 2018. [24] Canon. (2018).
What Is Semiconductor Lithography Equipment?
Available: https://global.canon/en/technology/support29.html. [25] B. Y. Hsueh, G. K. Huang, C.-C. Yu, J. K. Hsu, C.-C. K. Huang, C.-J. Huang, and D. Tien, "High order correction and sampling strategy for 45nm immersion lithography overlay control," in
Metrology, Inspection, and Process Control for Microlithography XXII , 2008, vol. 6922, p. 69222Q: International Society for Optics and Photonics. [26] T. Kikuchi, Y. Ishii, and N. Tokuda, "Introduction of new techniques for matching overlay enhancement,"
Optical Microlithography XIV, Pts. 1 and 2, vol. 4346, pp. 1608-1616, 2001.
Feng Wang (Student Member, IEEE) received the M.S. degree in Traffic Information Engineering & Control from Beijing Jiaotong University, Beijing, China, in 2016, where he is working toward the Ph.D. degree with the State Key Laboratory of Rail Traffic Control and Safety. He is currently a visiting scholar in the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta. His research interests include statistical modeling and prognostics with high-dimensional data.
Mostafa Reisi-Gahrooei received the masterβs degree in computational science and engineering and the Ph.D. degree in industrial and systems engineering from Georgia Institute of Technology, Atlanta, GA, USA, and the M.Sc. degrees in transportation engineering and applied mathematics from the Southern Illinois University Edwardsville, Edwardsville, IL, USA. He is currently an Assistant Professor with the Department of Industrial and Systems Engineering, the University of Florida, Gainesville, FL, USA. His research focuses on modeling, monitoring, and control of complex systems with functional, high-dimensional data. Dr. Reisi Gahrooei is a member of the Institute for Operations Research and the Management Sciences (INFORMS) and the Institute of Industrial and Systems Engineers (IISE).
Zhen Zhong received the B.S. degree in Electrical Engineering from University of Science and Technology of China, Anhui, China, in 2017. Currently, he is a Ph.D. student at the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta. His research interests are focused on the process control in semiconductor manufacturing.
Tao Tang (Senior Member, IEEE) received the Ph.D. degree in automatic control theory and application from the Chinese Academy of Sciences, Beijing, China, in 1991. He is currently the Director of the School of Electronic and Information Engineering and the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China. His research interests include both high-speed and urban railway train control systems, as well as intelligent transportation systems. He is a member of the Experts Group of the High Technology Research and Development Program of China (863 Program) and the Leader in the field of the Modern Transportation Technology Experts Group. He is also a Specialist with the National Development and Reform Commission and the Beijing Urban Traffic Construction Committee.