An Augmented Regression Model for Tensors with Missing Values
Feng Wang, Mostafa Reisi Gahrooei, Zhen Zhong, Tao Tang, Jianjun Shi
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract—Heterogeneous but complementary sources of data provide an unprecedented opportunity for developing accurate statistical models of systems. Although existing methods have shown promising results, they are mostly applicable to situations where the system output is measured in its complete form. In reality, however, it may not be feasible to obtain the complete output measurement of a system, which results in observations that contain missing values. This paper introduces a general framework that integrates tensor regression with tensor completion and proposes an efficient optimization framework that alternates between two steps for parameter estimation. Through multiple simulations and a case study, we evaluate the performance of the proposed method. The results indicate the superiority of the proposed method over a benchmark.
Note to Practitioners—The proposed method aims to obtain an accurate estimate of the regression model when certain entries of the response are inaccessible. By considering both the information from multiple inputs and the structure of the response, the proposed method achieves a more accurate estimation of the output tensor. To apply the method in practice, two assumptions should hold. First, the response tensor should be low-rank, meaning that the number of its underlying variation patterns is much smaller than its dimensions. Second, the relationship between the input tensors and the response should be linear or approximately linear. The presented method uses tensor decomposition techniques to exploit the correlation structures of the high-dimensional data and prevent overfitting. Another benefit of the integrated framework is that the rank of the response tensor is estimated automatically as the algorithm iterates, and the estimate can be used directly in the parameter estimation.
Index Terms—Tensor regression, missing values, Tucker decomposition, ALS-ADMM.

(*Corresponding author: Mostafa Reisi Gahrooei.) F. Wang and T. Tang are with the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]). M. R. Gahrooei is with the Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA (e-mail: [email protected]).

NOMENCLATURE

Lower- or upper-case letter: a scalar, e.g., $a$ or $A$.
Lower- or upper-case boldface letter: a vector or a matrix, e.g., $\mathbf{w}$ or $\mathbf{W}$.
Euler script letter: a tensor, e.g., $\mathcal{Y}$.
$\mathcal{X}_j$: input tensor $j$ of the regression model, $\mathcal{X}_j \in \mathbb{R}^{I \times P_{j1} \times \cdots \times P_{jm_j}}$, with sample dimension $I$ and other dimensions $P_{jk}$, $k = 1, \cdots, m_j$.
$\mathcal{Y}$: output tensor of the regression model, $\mathcal{Y} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$.
$\mathcal{B}_j$: coefficient tensor $j$ of the regression model with respect to $\mathcal{X}_j$.
$\mathcal{C}_j$: core tensor of the Tucker decomposition of $\mathcal{B}_j$.
$Y_{(n)}$: mode-$n$ matricization of tensor $\mathcal{Y}$.
$\Omega$: index set defining the missingness of the response.
$\mathcal{P}_\Omega$: orthogonal projection based on the index set $\Omega$.
rank$(\cdot)$: rank function of a tensor.
$\tilde{\mathcal{Y}}$: response with missing values.
$\mathbf{U}_{jk}$: bases that span the space of input tensor $j$.
$\mathbf{V}_l$: bases that span the space of the response.
$\mathcal{M}_i$: the $i$th local copy of tensor $\mathcal{Y}$.
$M_{(i)}$: mode-$i$ matricization of $\mathcal{M}_i$.
$\Lambda_i$: dual variable.
$\lambda$: tuning parameter.

I. INTRODUCTION
The rapid development of sensing and computing technology has accelerated the collection of heterogeneous sets of data, which may include a combination of scalars and high-dimensional (HD) data points such as profiles, images, and point clouds. Many applications, including multistage manufacturing [1], aircraft prognostics [2], and medical
Z. Zhong and J. Shi are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA (e-mail: [email protected]; [email protected]).
The authors of [10] extended the PLS approach to tensor data by using Tucker decomposition. Li et al. [11] proposed a generalized linear model with input tensors based on Tucker decomposition. Yan et al. [12] proposed to link structured point clouds to scalar process variables with tensor regression. To consider a more general case, Lock [13] proposed a regression model that estimates an output tensor based on an input tensor (tensor-on-tensor regression, TOT), using a CP decomposition imposed on the tensors of model parameters. However, the TOT method is limited by the inherent limitations of CP decomposition and can only include a single input tensor. Recently, Gahrooei et al. [14] extended TOT to multiple tensor-on-tensor regression (MTOT) by leveraging Tucker operations. Although these existing approaches provide effective ways to model processes using HD data, they assume the available output tensor is structured and complete. In many applications, however, these assumptions are not valid, and the measurements may contain missing values for reasons such as sensor failure or the excessive cost of full measurement. For example, in the lithography process of semiconductor manufacturing, overlay (OV) errors (i.e., the misalignment between different layers), which are highly dependent on the lithography machine settings (e.g., the alignment of the lens and the location of the wafer stage), are only measured at a limited number of marked locations on the wafer (see Fig. 1). Therefore, obtaining a full picture of OV errors requires developing models that link the overlays to machine settings based on partially observed OV errors on wafers.
Another example is the battery system of the Tesla Model S, with more than 7000 cells [15], where a limited number of sensors monitors the temperature of only a few of these cells. As a result, the temperature data obtained from these batteries are structured but contain missing observations. To effectively monitor the battery condition, a modeling framework is essential that uses these data to estimate the temperature of all cells based on the observed measurements and on covariates such as the speed and the temperature of the coolant flow in the battery pack. In these situations, where the output contains missing values, an effective estimation of the model parameters is challenging because of the absence of several observations in the high-dimensional output, which renders the exploitation of its complex correlation structure even more difficult. A naïve approach to addressing this challenge is to first complete the HD outputs, using matrix [16] or tensor completion approaches [17], and then construct a prediction model based on the completed data using existing HD regression methods (e.g., [14]). Many methods have been developed for the tensor completion problem, which attempt to model the relationship between the known entries and the missing values of a matrix or a tensor. These methods rely on a low-rank assumption and either design a decomposition of the tensor [18], [19] or perform rank minimization [17], [20], as summarized in Table II. Nevertheless, these approaches do not consider other potentially relevant and available data when performing the tensor completion task. In many applications, however, the

TABLE I. REGRESSION METHODS
Methods | Literature | Characteristics
Principal component regression (PCR) or partial least squares (PLS) | [4], [5] | Fails to exploit the ordering or spatial structure of profiles or images.
Functional regression | [6], [7] | Mostly on profile data. Difficult to extend to higher dimensions.
Tensor regression (TOT, MTOT) | [12], [13], [14] | Tensor inputs and tensor output. Assumes complete data.

TABLE II. TENSOR COMPLETION METHODS

Methods | Literature | Characteristics
Decomposition-based methods | CP [18], Tucker [19] decompositions | Use tensor decomposition methods. Rank must be specified.
Rank-minimization-based methods | CP [17], Tucker [20] rank | Automatic rank estimation.
Fig. 1. Illustration of overlay measurements in the lithography process.

$\mathcal{X}_j$, $j = 1, \cdots, m$ represent the available inputs, which are complete, and $\tilde{\mathcal{Y}}$ denotes an incomplete output of a process. Note that in situations where $\mathcal{X}_j$, $j = 1, \cdots, m$ contain missing values, one can complete them using tensor completion approaches, as they are independent variables. The goal is to estimate the model parameters $\mathcal{B}_j$, given the inputs and the incomplete output. After learning the model parameters, a complete output can be estimated for a given set of new inputs. As shown in Fig. 2, our procedure completes the output and learns the model parameters iteratively. Specifically, the completion of the response $\mathcal{Y}$ takes advantage of both the information in the inputs, via the estimated parameters $\mathcal{B}_j$, and the structural information in the response. By considering the information from the inputs, the completion of the response is expected to be more accurate than a completion procedure that uses only the structural information of $\mathcal{Y}$. Furthermore, a better completion of the response $\mathcal{Y}$ also benefits the estimation of the model parameters. Therefore, all available information from both the inputs and the output is fully and iteratively exploited, resulting in more accurate model estimation and improved prediction performance. Mathematically, the model parameters are learned by minimizing an integrated objective function that includes a mean-square-error term and a tensor-rank penalty on the response $\mathcal{Y}$ for global exploitation of the relationship between the observed and missing entries. To avoid overfitting, a decomposition is applied to the model parameters $\mathcal{B}_j$ by considering the correlation structures within the input and output spaces.
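As a toy illustration of this alternating idea, the following sketch treats a matrix-variate special case (one vector input per sample and a matrix output) and substitutes a simple model-based imputation for the nuclear-norm completion step of the paper; all sizes, the noise level, and the 70% observation rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
I, P, Q = 30, 5, 8                                   # hypothetical sizes
X = rng.standard_normal((I, P))                      # complete input
B_true = rng.standard_normal((P, Q))
Y_full = X @ B_true + 0.01 * rng.standard_normal((I, Q))
Omega = rng.random((I, Q)) < 0.7                     # observed-entry mask

B = np.zeros((P, Q))
for _ in range(50):
    # Completion step (simplified): keep observed entries, impute the rest
    # with the current model prediction. The paper instead solves a
    # nuclear-norm-penalized completion problem here.
    Y = np.where(Omega, Y_full, X @ B)
    # Estimation step: refit the coefficients on the completed output.
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)

rel_err = np.linalg.norm(B - B_true) / np.linalg.norm(B_true)
```

Each pass first fills the unobserved entries with the current prediction and then refits the coefficients by least squares, mirroring the complete-then-estimate loop of Fig. 2.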
The optimization problem is then solved through a novel and efficient algorithm with two block coordinate descent (BCD) steps. The proposed optimization framework iteratively completes the output tensor and learns the model parameters until convergence. The rest of the article is organized as follows. In Section II, we propose an integrated framework for augmented tensor regression with missing values and elaborate on the algorithms for estimating the missing values of the response as well as the model parameters. Section III carries out two simulation studies: the first implements a curve-on-curve simulation with missing entries in the profile output; the second generates images and profiles for the inputs and the output. Based on these two simulations, the proposed method is evaluated in comparison to a benchmark method. In Section IV, we conduct a case study to estimate the incomplete overlay errors in the semiconductor lithography process. Finally, we conclude the paper in Section V.

II. FORMULATION OF TENSOR REGRESSION WITH INCOMPLETE RESPONSE
In this section, we propose an approach that augments tensor-on-tensor regression with tensor completion for a more accurate estimation of the model parameters and of the high-dimensional process output with missing values. First, we introduce the notation and concepts of tensor algebra used in this paper.

A. Tensor Notation and Multilinear Algebra
In this paper, we use Euler script letters to denote tensors. For instance, $\mathcal{X} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N}$ denotes a tensor of order $N$, where $P_n$ indicates the dimension of mode $n$ of the tensor $\mathcal{X}$. The mode-$n$ matricization and the vectorization of a tensor $\mathcal{X}$ are denoted by $X_{(n)} \in \mathbb{R}^{P_n \times P_{-n}}$ (with $P_{-n} = P_1 P_2 \cdots P_{n-1} P_{n+1} \cdots P_N$) and $\mathrm{vec}(\mathcal{X})$, respectively. A tensor $\mathcal{X}$ can be obtained by folding any of its matricizations, which is denoted as $\mathcal{X} = \mathrm{fold}(X_{(n)})$, $n \in \{1, \ldots, N\}$. The mode-$n$ product of a tensor $\mathcal{X}$ with a matrix $\mathbf{U} \in \mathbb{R}^{K \times P_n}$ is denoted by $\mathcal{X} \times_n \mathbf{U} \in \mathbb{R}^{P_1 \times \cdots \times P_{n-1} \times K \times P_{n+1} \times \cdots \times P_N}$, where the $(p_1 \cdots p_{n-1}\, k\, p_{n+1} \cdots p_N)$ entry of the product is given by $(\mathcal{X} \times_n \mathbf{U})_{p_1 \cdots p_{n-1} k p_{n+1} \cdots p_N} = \sum_{p_n=1}^{P_n} x_{p_1 \cdots p_N}\, u_{k p_n}$. The Frobenius norm of a tensor $\mathcal{X}$ is defined as the Frobenius norm of its matricization along any mode, e.g., $\|\mathcal{X}\|_F^2 = \|X_{(n)}\|_F^2$. The contraction product of two tensors $\mathcal{B} \in \mathbb{R}^{P_1 \times \cdots \times P_N \times Q_1 \times \cdots \times Q_M}$ and $\mathcal{X} \in \mathbb{R}^{P_1 \times \cdots \times P_N}$ is denoted by $\mathcal{X} * \mathcal{B} \in \mathbb{R}^{Q_1 \times \cdots \times Q_M}$, with $(\mathcal{X} * \mathcal{B})_{q_1 \cdots q_M} = \sum_{p_1 \cdots p_N} \mathcal{X}_{p_1 \cdots p_N} \mathcal{B}_{p_1 \cdots p_N q_1 \cdots q_M}$. We use $\|\mathbf{X}\|_* = \sum_i \sigma_i(\mathbf{X})$ to denote the nuclear norm of the matrix $\mathbf{X}$, where $\sigma_i(\mathbf{X})$ is the $i$th largest singular value of the matrix. The nuclear norm of a tensor $\mathcal{X}$, denoted by $\|\mathcal{X}\|_*$, is defined as the weighted average of the nuclear norms of its matricizations along each mode [17], i.e., $\|\mathcal{X}\|_* = \sum_{n=1}^{N} \alpha_n \|X_{(n)}\|_*$, where $\alpha_n > 0$ and $\sum_n \alpha_n = 1$. We let $\Omega$ denote an index set, and $\mathcal{P}_\Omega(\mathcal{X})$ represent the orthogonal projection that keeps the entries of $\mathcal{X}$ with indices in $\Omega$ and sets the entries with indices outside $\Omega$ to 0. The Tucker rank of a tensor $\mathcal{X}$ is defined as the combination of the column ranks of its matricizations; that is, $\mathrm{rank}(\mathcal{X}) = (r_1, r_2, \cdots, r_N)$, where $r_n$ is the column rank of $X_{(n)}$ [21].

Fig. 2. Proposed framework of the augmented regression modeling with an incomplete response.

B. Problem Formulation
Let $I$ denote the number of available samples in the training set, $\mathcal{Y}_i \in \mathbb{R}^{Q_1 \times Q_2 \times \cdots \times Q_M}$ ($i = 1, \cdots, I$) be the response tensor of the $i$th sample, which contains missing values, and $\mathcal{X}_{ji} \in \mathbb{R}^{P_{j1} \times P_{j2} \times \cdots \times P_{jm_j}}$ ($j = 1, \cdots, m$; $i = 1, \cdots, I$) represent the $j$th input tensor (predictor) of the $i$th sample. Let us denote by $\Omega_i$ the set of indices at which the entries of $\mathcal{Y}_i$ are observed. Then, we characterize the relationship between the inputs and the output by

$\mathcal{P}_{\Omega_i}(\mathcal{Y}_i) = \mathcal{P}_{\Omega_i}\left(\sum_{j=1}^{m} \mathcal{X}_{ji} * \mathcal{B}_j\right) + \mathcal{P}_{\Omega_i}(\mathcal{E}_i), \quad i = 1, \cdots, I,$ (1)

where $\mathcal{B}_j \in \mathbb{R}^{P_{j1} \times P_{j2} \times \cdots \times P_{jm_j} \times Q_1 \times Q_2 \times \cdots \times Q_M}$ is the tensor of coefficients related to input $j$ and $\mathcal{E}_i$ is a tensor of errors whose entries follow a random process. For a more compact representation, we can fold the tensors $\mathcal{Y}_i$, $\mathcal{X}_{ji}$, and $\mathcal{E}_i$ ($i = 1, \cdots, I$) along the sample mode to obtain tensors $\mathcal{Y} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$, $\mathcal{X}_j \in \mathbb{R}^{I \times P_{j1} \times P_{j2} \times \cdots \times P_{jm_j}}$ ($j = 1, \cdots, m$), and $\mathcal{E} \in \mathbb{R}^{I \times Q_1 \times Q_2 \times \cdots \times Q_M}$. Then, the model can be written as

$\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega\left(\sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right) + \mathcal{P}_\Omega(\mathcal{E}),$ (2)

where $\Omega = \{(i, q_1, \cdots, q_M)\}$ is the set of indices at which the entries of $\mathcal{Y}$ are observed. Upon estimation of $\mathcal{B}_j$, the complete output can be estimated by $\sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j$, as illustrated in Fig. 3. Because the output contains missing values, existing approaches that assume complete tensors are not applicable for estimating the parameters. Fortunately, due to the potential correlation structure within high-dimensional data points, including profiles and images, the output tensor generally has a low-rank structure. This low-rank structure is key in designing an approach for estimating the model parameters in the presence of an incomplete output tensor.
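As a sketch of model (2) in code, the following uses a single input tensor and a random observation mask; the contraction product is written as an einsum over the non-sample modes, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
I, P1, P2, Q1, Q2 = 5, 4, 3, 6, 7                  # hypothetical sizes
X = rng.standard_normal((I, P1, P2))                # a complete input tensor
B = rng.standard_normal((P1, P2, Q1, Q2))           # coefficient tensor B_j
E = 0.1 * rng.standard_normal((I, Q1, Q2))          # error tensor
Y = np.einsum('ipq,pqab->iab', X, B) + E            # model (2) with one input
Omega = rng.random(Y.shape) < 0.8                   # ~80% of entries observed
P_Omega_Y = np.where(Omega, Y, 0.0)                 # orthogonal projection P_Omega
```

The mask `Omega` plays the role of the index set $\Omega$: the projection keeps observed entries and zeroes out the rest, which is exactly the information available for fitting.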
The goal of our work is to build a regression model between the input tensors $\mathcal{X}_j$ ($j = 1, \cdots, m$) and the incomplete output tensor $\tilde{\mathcal{Y}}$. This can be achieved by integrating the low-rank structure of the response into the least-squares loss as follows:

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda\, \mathrm{rank}(\mathcal{Y}) + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (3)

The first term in the objective function, $\lambda\, \mathrm{rank}(\mathcal{Y})$, exploits the low-rank structure of the output tensor. Note that this problem assumes that the output is low-rank, which is common in high-dimensional spaces and can be validated based on domain knowledge. $\lambda$ is a user-specified tuning parameter that should be selected based on the procedure discussed in Section II.E. Unfortunately, problem (3) is NP-hard due to the nonconvexity of $\mathrm{rank}(\mathcal{Y})$. To address this issue, we employ a convex relaxation of the $\mathrm{rank}(\mathcal{Y})$ term. Specifically, we use the tensor nuclear norm to approximate the rank penalty, leading to the following tractable problem:

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$, (4)

where $\|\mathcal{Y}\|_* = \sum_{n} \alpha_n \|Y_{(n)}\|_*$ as defined in Section II.A. Solving this problem without imposing any constraint on the parameters $\mathcal{B}_j$ would result in severe overfitting, as the number of parameters is extremely large. To avoid this issue, we decompose $\mathcal{B}_j$ ($j = 1, \cdots, m$) using a Tucker operation and write

$\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M.$
Here, $\mathcal{C}_j \in \mathbb{R}^{\tilde{P}_{j1} \times \cdots \times \tilde{P}_{jm_j} \times \tilde{Q}_1 \times \cdots \times \tilde{Q}_M}$ is a core tensor with $\tilde{P}_{jk} \ll P_{jk}$ ($j = 1, \cdots, m$; $k = 1, \cdots, m_j$) and $\tilde{Q}_l \ll Q_l$ ($l = 1, \cdots, M$); $\{\mathbf{U}_{jk} : j = 1, \cdots, m;\ k = 1, \cdots, m_j\}$ is a set of bases that spans the $j$th input space; and $\{\mathbf{V}_l : l = 1, \cdots, M\}$ is a set of bases that spans the output space. Therefore, we aim to solve

$\min_{\mathcal{Y}, \mathcal{B}_1, \cdots, \mathcal{B}_m} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$; $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M$, $j = 1, \ldots, m$. (5)

To estimate the model parameters, we design an algorithm with the following two steps (see Algorithm 1):

Step 1 ($\mathcal{Y}$-update): This step solves the following sub-problem, assuming that the model parameters $\mathcal{B}_j^k$ ($j = 1, \cdots, m$) are given at the $(k+1)$th iteration of the algorithm:

$\mathcal{Y}^{k+1} = \arg\min_{\mathcal{Y}} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j^k\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (6)

Let $\mathcal{Z}^k = \sum_{j=1}^{m} \mathcal{X}_j * \mathcal{B}_j^k$; then the problem can be rewritten as

$\mathcal{Y}^{k+1} = \arg\min_{\mathcal{Y}} \; \lambda \|\mathcal{Y}\|_* + \frac{1}{2}\left\|\mathcal{Y} - \mathcal{Z}^k\right\|_F^2$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (7)

Step 2 ($\mathcal{B}_j$-update): Given the estimated $\mathcal{Y}^k$ and the remaining $\mathcal{B}_l^k$ ($l = 1, \cdots, m$, $l \neq j$), this step solves

$\mathcal{B}_j^{k+1} = \arg\min_{\mathcal{B}_j} \left\|\mathcal{Y}_j^k - \mathcal{X}_j * \mathcal{B}_j\right\|_F^2$
subject to: $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M,$ (8)

where $\mathcal{Y}_j^k = \mathcal{Y}^k - \sum_{l \neq j} \mathcal{X}_l * \mathcal{B}_l^k$.

Fig. 3. Illustration of the regression model for heterogeneous sets of data.
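The Tucker construction of the coefficient tensor used in problems (5)-(8) can be sketched with a generic mode-$n$ product; the sizes, the two input modes, and the two output modes below are hypothetical.

```python
import numpy as np

def mode_product(T, U, mode):
    # Mode-n product T x_n U: multiply the matrix U onto axis `mode` of T.
    Tm = np.moveaxis(T, mode, 0)
    out = np.tensordot(U, Tm, axes=(1, 0))   # new leading axis has size U.shape[0]
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(1)
core = rng.standard_normal((2, 2, 3, 3))     # low-dimensional core C_j
U1 = rng.standard_normal((5, 2))             # input-space bases U_j1, U_j2
U2 = rng.standard_normal((4, 2))
V1 = rng.standard_normal((6, 3))             # output-space bases V_1, V_2
V2 = rng.standard_normal((7, 3))

# B_j = C_j x_1 U_j1 x_2 U_j2 x_3 V_1 x_4 V_2
B = mode_product(mode_product(mode_product(mode_product(core, U1, 0),
                                           U2, 1), V1, 2), V2, 3)
```

Because the core is small relative to the full coefficient tensor, this parameterization is what keeps the number of free parameters manageable and prevents overfitting.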
The stopping criterion of Algorithm 1 is defined by a maximum number of iterations.

C. Solution to the $\mathcal{Y}$-update

The $\mathcal{Y}$-update problem (i.e., Equation (7)) can be transformed into

$\min_{Y_{(1)}, \cdots, Y_{(N)}} \sum_{i=1}^{N} \alpha_i \left( \lambda \left\|Y_{(i)}\right\|_* + \frac{1}{2}\left\|Y_{(i)} - Z_{(i)}\right\|_F^2 \right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (9)

The details of this transformation are provided in Appendix A. The default value of each weight $\alpha_i$, $i = 1, \ldots, N$, is set to $1/N$; that is, we assign equal weights to the tensor modes in the nuclear norm. Since this problem is the same at each iteration, we use the general notation $\mathcal{Y}$ and $\mathcal{Z}$, omitting the iteration index $k$ for simplicity. Solving the above problem is challenging because of the interdependent nuclear-norm terms: the entries of the matricizations of the output tensor are shared across the nuclear-norm terms and therefore cannot be treated independently while minimizing the objective function. To tackle this issue, we define auxiliary variables to split these interdependent terms. Specifically, we introduce additional local matrices $M_{(1)}, \cdots, M_{(N)}$, which represent the mode-$i$ matricizations of local tensors $\mathcal{M}_1, \cdots, \mathcal{M}_N$, respectively. Then, we have the following problem:

$\min_{M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}} \sum_{i=1}^{N} \alpha_i \left( \lambda \left\|M_{(i)}\right\|_* + \frac{1}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 \right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$; $\mathcal{Y} = \mathcal{M}_i$, $i = 1, \cdots, N$. (10)

Here, $\mathcal{Y}$ can be treated as a global variable, and the problem becomes a global consensus problem. By introducing $N$ local variables, problem (10) becomes separable, since the local variables are independent of each other.
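A minimal sketch of the machinery behind this splitting: each local update unfolds the tensor along one mode, applies singular-value soft-thresholding (the proximal operator of the nuclear norm), and folds the result back; the tensor sizes and thresholds are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n matricization: the chosen mode becomes the rows.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    # Inverse of unfold for a tensor of the given shape.
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def svt(M, tau):
    # Singular-value soft-thresholding: prox of tau * (nuclear norm).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(3)
T = rng.standard_normal((4, 5, 6))
M1 = svt(unfold(T, 1), tau=1.0)      # shrink along the mode-1 matricization
T1 = fold(M1, 1, T.shape)            # local copy M_i folded back to a tensor
```

Running this per mode yields the local tensors $\mathcal{M}_i$; a consensus step then reconciles them into a single global estimate.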
To solve (10), we first handle the consensus constraints $\mathcal{Y} = \mathcal{M}_i$ ($i = 1, \cdots, N$) by defining the augmented Lagrangian as follows:

$L_\rho\left(M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N\right) = \sum_{i=1}^{N} \left( \lambda \alpha_i \left\|M_{(i)}\right\|_* + \frac{\alpha_i}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 + \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right),$ (11)

where $\langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle$ denotes the inner product of the tensors $\Lambda_i$ and $\mathcal{Y} - \mathcal{M}_i$; $\Lambda_i$ ($i = 1, \ldots, N$) are the tensors of dual variables; and $\rho$ is the step size. Then, the problem is represented as

$\min_{M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N} L_\rho\left(M_{(1)}, \cdots, M_{(N)}, \mathcal{Y}, \Lambda_1, \cdots, \Lambda_N\right)$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})$. (12)

The resulting ADMM algorithm for this problem is summarized in Algorithm 2, which includes three iterative steps: a local-variable update, a global-variable update, and a dual-variable update. For the local-variable update, we solve the unconstrained problem

$\min_{M_{(i)}} f_i\left(M_{(i)}\right),$ (13)

where $f_i\left(M_{(i)}\right) = \lambda \alpha_i \left\|M_{(i)}\right\|_* + \frac{\alpha_i}{2}\left\|M_{(i)} - Z_{(i)}\right\|_F^2 + \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2$. This problem is equivalent to

$\min_{M_{(i)}} \tau_i \left\|M_{(i)}\right\|_* + \frac{1}{2}\left\|M_{(i)} - W_{(i)}\right\|_F^2,$ (14)

where $\tau_i = \frac{\lambda \alpha_i}{\alpha_i + \rho}$ and $W_{(i)} = \frac{\alpha_i Z_{(i)} + \rho Y_{(i)} + \Lambda_{(i)}}{\alpha_i + \rho}$. The detailed derivation is provided in Appendix B. The minimization problem in Equation (14) can be solved by first computing the singular value decomposition (SVD) of $W_{(i)}$ and then applying a soft-thresholding operator to the singular values. This result is proved in [16] and presented in the following proposition.

Algorithm 1: BCD algorithm for solving problem (5)
1: Inputs: $\mathcal{X}_1, \ldots, \mathcal{X}_m$, $\tilde{\mathcal{Y}}$, and $\Omega$. Initialize $\mathcal{B}_1^0, \ldots, \mathcal{B}_m^0$ randomly and let $\mathcal{Y}^0 = \tilde{\mathcal{Y}}$.
2: Loop:
3:   $\mathcal{Y}$-update step: update $\mathcal{Y}^k$ to $\mathcal{Y}^{k+1}$ by solving Equation (7) given $\mathcal{B}_j^k$ ($j = 1, \cdots, m$).
4:   Let $\mathcal{Y}^k = \mathcal{Y}^{k+1}$.
5:   for $j = 1, \cdots, m$ do:
6:     $\mathcal{B}_j$-update step: update $\mathcal{B}_j^k$ to $\mathcal{B}_j^{k+1}$ by solving Equation (8) given $\mathcal{Y}^k$ and the remaining $\mathcal{B}_l^k$ for $l = 1, \cdots, m$, $l \neq j$.
7:     Let $\mathcal{B}_j^k = \mathcal{B}_j^{k+1}$.
8:   end
9: Until convergence.

Proposition 1. Let $\mathbf{Y} \in \mathbb{R}^{m \times n}$ and let $\mathbf{Y} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}$ be the SVD of $\mathbf{Y}$, where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthonormal matrices, $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix, and $r = \mathrm{rank}(\mathbf{Y})$. Then, $\widehat{\mathbf{X}} \equiv \arg\min_{\mathbf{X}} \left( \tau\|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{X} - \mathbf{Y}\|_F^2 \right)$ is given by $\widehat{\mathbf{X}} = \mathbf{U}_r \boldsymbol{\Sigma}_\tau \mathbf{V}_r^{\mathrm{T}}$, where $\boldsymbol{\Sigma}_\tau$ is diagonal with $(\boldsymbol{\Sigma}_\tau)_{ii} = \max(0, \boldsymbol{\Sigma}_{ii} - \tau)$, and $\mathbf{U}_r$ and $\mathbf{V}_r$ are the first $r$ columns of $\mathbf{U}$ and $\mathbf{V}$.

To update the global variable, we solve the following constrained optimization problem:

$\min_{\mathcal{Y}} g(\mathcal{Y})$
subject to: $\mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}}),$ (15)

where $g(\mathcal{Y}) = \sum_{i=1}^{N} \left( \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right)$. First, we compute its Lagrangian as

$L_g(\mathcal{Y}, \Phi) = \sum_{i=1}^{N} \left( \langle \Lambda_i, \mathcal{Y} - \mathcal{M}_i \rangle + \frac{\rho}{2}\left\|\mathcal{Y} - \mathcal{M}_i\right\|_F^2 \right) + \langle \Phi, \mathcal{P}_\Omega(\mathcal{Y} - \tilde{\mathcal{Y}}) \rangle,$ (16)

where $\Phi$ is the tensor of the dual variable for this sub-problem. Then, a closed-form solution is derived from the optimality condition:

$\mathcal{Y} = \begin{cases} \tilde{\mathcal{Y}}, & \text{if } (i, q_1, \cdots, q_M) \in \Omega, \\ \frac{1}{N}\sum_{i=1}^{N} \left( \mathcal{M}_i - \frac{1}{\rho}\Lambda_i \right), & \text{otherwise.} \end{cases}$ (17)

The detailed derivations are provided in Appendix C. Finally, the dual variables $\Lambda_i$ are updated separately according to

$\Lambda_i \leftarrow \Lambda_i + \rho\left(\mathcal{Y} - \mathcal{M}_i\right).$ (18)

D. Solution to the $\mathcal{B}_j$-update

After the estimation of the response $\mathcal{Y}$ in each iteration of Algorithm 1, we estimate the tensors of coefficients $\mathcal{B}_j$. The key is to learn the core tensor $\mathcal{C}_j$ and the bases $\mathbf{V}_l$ and $\mathbf{U}_{jk}$. Motivated by [14], we learn $\mathbf{U}_{jk}$ directly from the input spaces by performing Tucker decomposition on the input tensors. After the estimation of $\mathbf{U}_{jk}$, the problem of learning $\mathcal{B}_j$ becomes the following optimization problem:

$\left(\mathcal{C}_j, \mathbf{V}_1, \cdots, \mathbf{V}_M\right) = \arg\min_{\mathcal{B}_j} \left\| Y_{j(1)} - X_{j(1)} \mathbf{B}_j \right\|_F^2$
subject to: $\mathcal{B}_j = \mathcal{C}_j \times_1 \mathbf{U}_{j1} \times_2 \cdots \times_{m_j} \mathbf{U}_{jm_j} \times_{m_j+1} \mathbf{V}_1 \times_{m_j+2} \cdots \times_{m_j+M} \mathbf{V}_M$; $\mathbf{V}_l^{\mathrm{T}} \mathbf{V}_l = \mathbf{I}_{\tilde{Q}_l}$ ($l = 1, \cdots, M$), (19)

where $Y_{j(1)}$ and $X_{j(1)}$ are the mode-1 matricizations of $\mathcal{Y}_j$ and $\mathcal{X}_j$, respectively; $\mathbf{B}_j \in \mathbb{R}^{P_j \times Q}$ is the matricization of the tensor $\mathcal{B}_j$ with $P_j = \prod_{k=1}^{m_j} P_{jk}$ and $Q = \prod_{l=1}^{M} Q_l$; and $\mathbf{I}_{\tilde{Q}_l} \in \mathbb{R}^{\tilde{Q}_l \times \tilde{Q}_l}$ is an identity matrix. Problem (19) can be solved using the ALS-BCD algorithm of [14]. One requirement of the ALS-BCD algorithm is knowledge of the rank of the response $\mathcal{Y}$. Since $\mathcal{Y}$ is estimated during the $\mathcal{Y}$-update step of Algorithm 1, we use the high-order SVD (HOSVD) to estimate its rank, namely the column ranks of $Y_{(n)}$ for $n = 1, \cdots, N$. More specifically, we use the truncated SVD to estimate the rank of the mode-$n$ matricization $Y_{(n)}$ by specifying a ratio of data variability to be explained. The ranks of the matricizations of $\mathcal{Y}$ along all modes are estimated in this way and then combined into the Tucker rank of the tensor $\mathcal{Y}$ for the ALS-BCD algorithm.

E. Selection of Hyper-parameters
In the proposed method, the global hyper-parameter $\lambda$ and the local hyper-parameters $\tilde{P}_{jk}$ ($j = 1, 2, \cdots, m$; $k = 1, 2, \cdots, m_j$) must be identified. $\lambda$ balances the rank penalty and the least-squares error in the objective function. Based on our experiments, the proposed method is robust to a wide range of values of $\lambda$; in this paper, we set $\lambda = 1$ by default. This parameter can also be determined by cross-validation. Similar to the procedure for estimating the rank of $\mathcal{Y}$ in Section II.D, the Tucker ranks of the input tensors $\mathcal{X}_j$ can be estimated using the truncated HOSVD. Based on the estimated ranks of the input tensors, the parameters $\tilde{P}_{jk}$ ($j = 1, 2, \cdots, m$; $k = 1, 2, \cdots, m_j$) are determined.

III. SIMULATION STUDY
In this section, a set of simulation studies is carried out to evaluate the performance of our proposed method, herein referred to as tensor regression with missing values (TRMV). A benchmark method, named TC-MTOT, is used for comparison: the completion of the output tensor (TC) is first conducted with the rank-minimization method of [16], [17], and then the regression model between the inputs and the completed response is constructed using MTOT [14]. The estimated
Algorithm 2: ADMM algorithm for problem (12)
1: Inputs: $\mathcal{X}_1, \ldots, \mathcal{X}_m$, the estimated $\widehat{\mathcal{B}}_1^{k-1}, \ldots, \widehat{\mathcal{B}}_m^{k-1}$, $\tilde{\mathcal{Y}}$, and $\Omega$.
2: Loop:
3:   for $i = 1, \ldots, N$:
4:     $M_{(i)}^t \leftarrow \arg\min_{M_{(i)}} f_i\left(M_{(i)}\right)$; $\mathcal{M}_i^t \leftarrow \mathrm{fold}\left(M_{(i)}^t\right)$ (local variables)
5:   $\mathcal{Y}^t \leftarrow \arg\min_{\{\mathcal{Y}:\, \mathcal{P}_\Omega(\mathcal{Y}) = \mathcal{P}_\Omega(\tilde{\mathcal{Y}})\}} g(\mathcal{Y})$ (global variable)
6:   for $i = 1, \ldots, N$:
7:     $\Lambda_i^t \leftarrow \Lambda_i^{t-1} + \rho\left(\mathcal{Y}^t - \mathcal{M}_i^t\right)$ (dual variables)
8: Until convergence.

A. Curve-on-curve Regression
Profile-on-profile regression has a wide range of applications in both manufacturing and healthcare [22]. This simulation mimics profile-on-profile regression modeling when the output contains missing values. Following the simulation study in [23], we first randomly generate $\mathcal{B}_1, \cdots, \mathcal{B}_m$ as $\mathcal{B}_j(s, t) = \frac{1}{3}\left[a_{1j}(s)b_{1j}(t) + a_{2j}(s)b_{2j}(t) + a_{3j}(s)b_{3j}(t)\right]$, where $a_{lj}(s)$ and $b_{lj}(t)$ ($l = 1, 2, 3$; $j = 1, \cdots, m$) are Gaussian processes with covariance function $k(z, z') = \left(1 + 20|z - z'|\right)e^{-20|z - z'|}$. Then, the $m$ input profiles are simulated in the following three steps: 1) generate a matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times m}$ whose $(j, j')$th entry equals 1 for $j = j'$ and equals $\rho_x = 0$ or $0.5$ for $j \neq j'$; 2) apply the eigendecomposition $\boldsymbol{\Sigma} = \mathbf{H}\mathbf{H}^{\mathrm{T}}$ to obtain an $m \times m$ matrix $\mathbf{H}$, and generate a set of profiles $\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_m$ using a Gaussian process with covariance function $k(z, z') = e^{-2|z - z'|}$; 3) generate the input curves at any given point $s$ by $\left[\mathbf{x}_1(s), \cdots, \mathbf{x}_m(s)\right] = \left[\mathbf{w}_1(s), \cdots, \mathbf{w}_m(s)\right]\mathbf{H}^{\mathrm{T}}$. According to this procedure, for each $s$ the vector $\left[x_1(s), \cdots, x_m(s)\right]$ follows a multivariate Gaussian distribution with covariance $\boldsymbol{\Sigma}$. When setting $\rho_x = 0$ to generate the data, the variables of the vector $\left[x_1(s), \cdots, x_m(s)\right]$ are uncorrelated. The output profiles are finally simulated by $y_i(t) = \sum_{j=1}^{m} \int \mathcal{B}_j(s, t)\, x_{ij}(s)\, ds + e_i(t)$, where the $e_i(t)$ are independent random variables from a Gaussian distribution with mean zero and variance $\sigma^2$. All input and output profiles are generated over an equidistant grid of size 100, defined over the intervals $[0, 2)$ and $[0, 1)$, respectively. In this study, we set $m = 2$.
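The Gaussian-process sampling used in steps 1)-3) can be sketched by drawing sample paths on a grid through a Cholesky factor of the covariance matrix; the specific kernel form and grid below are illustrative assumptions.

```python
import numpy as np

def gp_sample(grid, kernel, n_samples, rng, jitter=1e-8):
    # Draw sample paths of a zero-mean Gaussian process on a 1-D grid.
    D = np.abs(grid[:, None] - grid[None, :])        # pairwise distances
    K = kernel(D) + jitter * np.eye(len(grid))       # covariance + jitter
    L = np.linalg.cholesky(K)                        # K = L L^T
    return L @ rng.standard_normal((len(grid), n_samples))

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 100)                           # equidistant grid of size 100
k = lambda d: (1.0 + 20.0 * d) * np.exp(-20.0 * d)   # assumed Matern-type kernel
W = gp_sample(t, k, n_samples=3, rng=rng)            # three sample profiles
```

Multiplying the resulting independent profiles by $\mathbf{H}^{\mathrm{T}}$, as in step 3), induces the cross-correlation $\rho_x$ between the input curves.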
After the profiles are generated, we randomly remove $\gamma = 80\%$ or $90\%$ of the entries of the output profiles in the training dataset. We refer to these data generation steps as procedure A. Fig. 4 shows an example of the ground-truth profile, the profile with 80% missing values, and its estimation by TRMV. Based on procedure A, we first generate a dataset of 200 samples with $\rho_x = 0$. We use $n_{tr} = 100$ samples for model training and hyper-parameter estimation and the remaining $n_{te} = 100$ samples for testing. For quantitative evaluation, the standardized prediction error (SPE) is used to evaluate the performance of the proposed model; it is defined as $SPE = \|\mathcal{Y} - \widehat{\mathcal{Y}}\|_F / \|\mathcal{Y}\|_F$. For better illustration, we transform the SPE by taking the negative inverse of its logarithm, i.e., $-1/\log(SPE)$, called the transformed SPE (TSPE). Given the generated dataset, we employ the TRMV and TC-MTOT methods to complete the response and estimate the tensors of coefficients $\mathcal{B}_j$ ($j = 1, 2$) at the aforementioned levels of missing values. The trained models are then applied to the testing set to predict the responses. The performance of each model is then evaluated and compared in terms of TSPE. Fig. 5 compares the performance of TRMV to that of TC-MTOT at different levels of missing values in the response. The results reveal that TRMV outperforms TC-MTOT at every level of missing values (LMV). For example, at 80% missing values with $\rho_x = 0$, the mean TSPE of the proposed method is 0.3382, which is significantly lower than that of TC-MTOT, 0.7206. This indicates that TRMV can take advantage of the available information in both the inputs and the response, leading to improved estimations. Fig. 6 shows the process of rank estimation. As illustrated, when the proportion of missing values is relatively small, our algorithm converges to the true rank in fewer iterations; for example, fewer than 60 iterations are sufficient. (a) $\rho_x = 0$ (b) $\rho_x = 0.5$ Fig. 5.
Fig. 5. Performance comparison between our proposed method (TRMV) and TC-MTOT under different levels of missing values: (a) $\sigma = 0$; (b) $\sigma = 0.5$.
Fig. 6. Process of rank estimation for the response at different levels of missing values: (a) 80%; (b) 90%.
Fig. 4. Examples of the output profiles, including the original profiles (100 observations), the profiles with 80% missing values (20 observations), and the estimated ones: (a) ground truth; (b) 80% missing; (c) estimation.

B. Simulation Study for Image Outputs
In this simulation, we evaluate the performance of our proposed method when multiple forms of data are available. Specifically, the waveform surfaces $\mathcal{Y}_i$ are generated based on two input tensors, $\mathcal{X}_{1i}$ and $\mathcal{X}_{2i}$ $(i = 1, \cdots, n_t)$, where $n_t$ is the sample size. First, the sampling points $x_{sjk} = k/m_{sj}$ $(s = 1, 2;\ j = 1, \cdots, d_s;\ k = 1, \cdots, m_{sj})$ are defined in order to generate the input tensors. Then, we set the factor matrices $U_{sj} = [u_{sj1}, u_{sj2}, \cdots, u_{sjr_s}]$ $(s = 1, 2;\ j = 1, \cdots, d_s)$, where
$$u_{sjt} = \begin{cases} [\cos(2\pi t x_{sj1}), \cdots, \cos(2\pi t x_{sjm_{sj}})]^{\top}, & \text{if } t \text{ is odd},\\ [\sin(2\pi t x_{sj1}), \cdots, \sin(2\pi t x_{sjm_{sj}})]^{\top}, & \text{if } t \text{ is even}. \end{cases}$$
Then, the elements of a core tensor $\mathcal{S}_{si}$ are generated randomly from a standard normal distribution. Next, $\mathcal{X}_{si}$ is simulated as
$$\mathcal{X}_{si} = \mathcal{S}_{si} \times_1 U_{s1} \times_2 \cdots \times_{d_s} U_{sd_s}, \quad (s = 1, 2;\ i = 1, \cdots, n_t).$$
Furthermore, the coefficient tensors $\mathcal{B}_s$ are generated for the regression model. First, a core tensor $\mathcal{C}_s$ is simulated from a standard normal distribution. Next, the basis of the output space, $V_j = [v_{j1}, v_{j2}, \cdots, v_{jr}]$ $(j = 1, \cdots, q)$, is generated as
$$v_{jt} = \begin{cases} [\cos(2\pi t y_{j1}), \cdots, \cos(2\pi t y_{jm_j})]^{\top}, & \text{if } t \text{ is odd},\\ [\sin(2\pi t y_{j1}), \cdots, \sin(2\pi t y_{jm_j})]^{\top}, & \text{if } t \text{ is even}, \end{cases}$$
where $y_{jk} = k/m_j$. Finally, the coefficient tensors $\mathcal{B}_s$ are computed as
$$\mathcal{B}_s = \mathcal{C}_s \times_1 U_{s1} \times_2 \cdots \times_{d_s} U_{sd_s} \times_{d_s+1} V_1 \times_{d_s+2} \cdots \times_{d_s+q} V_q.$$
Given the input tensors and coefficient tensors, the response tensors are calculated as
$$\mathcal{Y}_i = \sum_{s=1}^{S} \mathcal{X}_{si} * \mathcal{B}_s + \mathcal{E}_i,$$
where $*$ denotes the tensor contraction of the regression model and $\mathcal{E}_i$ is the error tensor whose elements follow a normal distribution $\mathcal{N}(0, \sigma)$. We refer to these data generation steps as procedure B.
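A stripped-down version of procedure B can be sketched as follows. This sketch makes simplifying assumptions (one input mode, a 2-way Tucker product, small dimensions), and all names and sizes are illustrative rather than the paper's actual configuration.

```python
# Minimal sketch of procedure B: Fourier factor matrices, random Gaussian
# cores, a Tucker-structured coefficient, and a noisy linear response.
import numpy as np

def fourier_basis(m, r):
    """Columns alternate cos/sin waves evaluated at x_k = k/m."""
    x = np.arange(1, m + 1) / m
    cols = []
    for t in range(1, r + 1):
        f = np.cos if t % 2 == 1 else np.sin   # odd t -> cos, even t -> sin
        cols.append(f(2 * np.pi * t * x))
    return np.column_stack(cols)               # shape (m, r)

rng = np.random.default_rng(1)
m, r, q, sigma = 30, 3, 4, 0.01                # sizes, rank, output rank, noise
U = fourier_basis(m, r)                        # input basis
V = fourier_basis(20, q)                       # output basis

# Input profile: a random core expanded by the input basis (rank-r structure).
s = rng.standard_normal(r)
x = U @ s                                      # simulated input, shape (30,)

# Coefficient B = C x_1 U x_2 V (Tucker product in the 2-way case).
C = rng.standard_normal((r, q))
B = U @ C @ V.T                                # shape (30, 20)

# Response: contraction of the input with B plus Gaussian noise.
y = x @ B + sigma * rng.standard_normal(20)
print(y.shape)                                 # (20,)
```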
This simulation is conducted for two main purposes: (i) to illustrate the response completion and prediction procedures; and (ii) to evaluate the robustness of the proposed method under different noise levels and different rank configurations. First, we generate $S = 2$ inputs with $d_1 = 1$ and $d_2 = 2$, i.e., a low-rank profile $\mathcal{X}_{1i}$ and a low-rank image $\mathcal{X}_{2i}$. Then, the low-rank response tensors $\mathcal{Y}_i$ are generated, and the core tensors of the model parameters, $\mathcal{C}_1$ and $\mathcal{C}_2$, have correspondingly small dimensions. Here, 200 samples are first generated under $\sigma = 0$, with 100 training samples and 100 testing samples. Then, we randomly remove certain proportions of the entries, such as $p =$ 80% and 90%, from the training responses. In this study, we also use the TSPE to evaluate the performance of the proposed method. For each level of missing values, we train models using TRMV and TC-MTOT on the training set. Then, we evaluate and compare the estimated models in terms of their TSPEs, calculated on the testing set. Fig. 7 illustrates an example of the ground truth surface, the surface with 90% missing values, and its estimation by TRMV. As illustrated, the estimation result is highly consistent with the ground truth. The TSPEs are computed on the testing set to compare the proposed method to the benchmark. The results are plotted in Fig. 8. Although the TSPEs of both methods increase as the percentage of missing values grows, TRMV retains the smaller errors.
Fig. 7. Example of output surfaces, including the actual surface (a), the one with 90% missing values (b), and the estimation using TRMV (c).
Fig. 9. Process of rank estimation of the output tensor at 80% and 90% missing values.
Fig. 8. Performance comparison between the TRMV model and the benchmark under different levels of missing values.
To evaluate the robustness of the proposed method, two additional datasets are generated based on procedure B. Each setting contains 100 training samples and 100 testing samples. The first dataset is generated with responses $\mathcal{Y}_i$ of ranks [3, 4, 5, 6, 7, 8] and inputs $\mathcal{X}_{1i}$ and $\mathcal{X}_{2i}$ of rank 2. Here, the noise level is set to zero (i.e., $\sigma = 0$). Then, we remove entries randomly from the training responses to mimic different levels of missing values. For the second dataset, different levels of noise (i.e., $\sigma = 0.001 \times [0, 2, 4, 6, 8, 10]$) are added to the response.
Here, the inputs $\mathcal{X}_{1i}$ (a profile) and $\mathcal{X}_{2i}$ (an image) are of rank 3, and the response $\mathcal{Y}_i$ is of rank 3. Next, we randomly generate an index set $\Omega$ with 90% missing entries and then project the training set onto $\Omega$, i.e., $\mathcal{P}_{\Omega}(\mathcal{Y})$. Fig. 10 shows the TSPEs obtained by TRMV and TC-MTOT at different rank settings at the 80% (panel (a)) and 90% (panel (b)) levels of missing values. In both panels, the dotted lines with error bars show the mean and variance of the TSPEs obtained by TC-MTOT; the solid lines with error bars show those obtained by TRMV. Panel (a) of Fig. 10 shows that although TC-MTOT achieves TSPEs comparable to TRMV for datasets that are highly correlated and have a simpler structure (i.e., rank 3 or 4), it performs worse when the structure becomes more complex. This superiority stems from the use of input information in completing the output tensor. From the results illustrated in panels (a) and (b), we conclude that TRMV performs more robustly, with smaller mean and variance of the prediction errors at every rank setting. Table III shows the TSPEs obtained by TRMV and TC-MTOT at different levels of noise. As reported in the table, TRMV achieves smaller TSPEs than TC-MTOT at every level of noise and missing values. For example, when $\sigma = 0.006$ and LMV = 70%, TRMV achieves a TSPE of 0.4398, while TC-MTOT achieves a TSPE of 0.4583. Although the TSPEs increase at higher levels of noise and missing values, TRMV predicts the output more robustly than TC-MTOT. This superiority mainly arises because TRMV benefits from the systematic and simultaneous estimation of both the model parameters and the missing values.

IV. CASE STUDY
In this section, we further evaluate the effectiveness of our proposed method by predicting the overlay (OV) errors in the lithography process. We first introduce OV errors and then implement the TRMV method on an OV dataset. As shown in Fig. 11 [24], the lithography process aims to transfer a 2-D pattern from an optical mask to a light-sensitive chemical photoresist on the wafer surface, which is a critical step in semiconductor manufacturing. During this process, the desired pattern is etched into the material through a series of chemical treatments, such as exposure to light, cleaning, etc. In practice, a wafer may go through such photolithographic cycles multiple times. During each cycle, one small rectangular field on the wafer surface is completed. Misalignment errors may occur when the patterns on the mask cannot be projected exactly at the desired location on the surface of a wafer. This misalignment is called overlay error and is critical to wafer quality. As mentioned in the introduction, OV errors are measured only at a limited number of marked locations over the wafer because it is impossible to measure the OV errors of every rectangular field on the wafer. In addition, OV errors are highly dependent on the settings of the
Fig. 11. Illustration of the scanner in semiconductor lithography.
TABLE III: PERFORMANCE COMPARISON BETWEEN TRMV AND TC-MTOT AT DIFFERENT LEVELS OF NOISE FOR THE WAVEFORM DATASET (columns: LMV = 60%, 70%, 80%, 90%; rows: noise level $\sigma$; entries: TSPEs of TRMV and TC-MTOT).
Fig. 10. Performance comparison between TRMV and TC-MTOT at different values of ranks: (a) 80% and (b) 90% missing values.
lithography machine, which can be used to predict the change of the OV errors. Knowledge of the OV errors can support process adjustments, such as the alignment of the projection lens and the location tuning of the wafer stage. To improve the quality of a lithography process, it is critical to correct the OV errors by modeling the relationship between the machine settings and the partially measured OV errors. To construct and evaluate our model, we generate a dataset containing the machine settings and OV errors for 200 wafers based on a validated procedure explained in Appendix D.
We use these validated simulated data for two main reasons. First, the real data are proprietary. Second, the known "ground truth" OV errors can be used to evaluate the accuracy of our method. Given the generated dataset, we divide the 200 samples into a training set and a testing set, with 100 samples in each. For the training set, we assume that 80% or 90% of the points on a wafer are inaccessible. The goal is to build a model that estimates the OV errors given the settings of the lithography machine. Given the incomplete training data, we simultaneously estimate the model parameters and complete the missing entries using TRMV. Finally, we investigate the prediction performance of the estimated model on the testing set and calculate the prediction error in terms of the TSPE. In this study, TC-MTOT is again used as the benchmark method for comparison. Fig. 12 shows the prediction results of the two models estimated by TRMV and TC-MTOT at 80% and 90% missing values, respectively. From the results, it is evident that TRMV outperforms TC-MTOT at both levels of missing values.

V. CONCLUSION
In this paper, we develop a systematic framework for tensor regression when the response contains missing values. The novelty of this methodology lies in integrating tensor completion with tensor regression in a unified manner. To address the challenge posed by missing values in the response, a tensor nuclear-norm penalty is introduced into the least-squares loss function. Meanwhile, a Tucker decomposition is applied to the model coefficients to prevent overfitting. A two-step BCD algorithm is proposed to estimate the unknowns, including the missing entries and the model coefficients. Iteratively, we first complete the missing entries of the response using the ADMM algorithm. Given the completed response, the core tensors and the bases of the coefficients are then estimated by a BCD-ALS algorithm. Notably, the completion procedure in the first step of the algorithm enables the estimation of the response rank, which can be used by the second step. These integrated optimization efforts lead to three major advantages of the proposed methodology: (i) more accurate model estimation and response completion; (ii) robust prediction performance; and (iii) automatic rank convergence of the response. Two simulation studies and a case study are conducted to evaluate the performance of the proposed method in comparison to a two-step method, i.e., TC-MTOT. The numerical results show that TRMV significantly outperforms TC-MTOT. The main contribution of this work is to explore a new direction in tensor regression with missing values. An interesting extension for future study is the selection of input tensors, in the sense that some available inputs may not be informative for the response estimation.
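The alternating scheme summarized above (a completion step for the response followed by a coefficient-estimation step) can be illustrated with a deliberately simplified matrix-valued sketch. Here singular-value thresholding and ordinary least squares stand in for the ADMM completion and BCD-ALS steps of the actual method, and all names, the threshold `tau`, and the blending rule for missing entries are assumptions for illustration only.

```python
# Simplified stand-in for the alternating estimation: (i) fit coefficients by
# least squares, (ii) obtain a low-rank completion by singular-value
# thresholding, and keep observed entries fixed at their measured values.
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trmv_sketch(X, Y_obs, mask, tau=0.5, n_iter=20):
    """Alternate between a coefficient step and a completion step."""
    Y = np.where(mask, Y_obs, 0.0)                 # crude initialization
    for _ in range(n_iter):
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # coefficient step
        Y_fit = X @ B                              # regression-based estimate
        Y_low = svt(Y, tau)                        # low-rank completion step
        # Observed entries stay fixed; missing entries blend the model fit
        # and the low-rank estimate (a crude analogue of the Y-update).
        Y = np.where(mask, Y_obs, 0.5 * (Y_fit + Y_low))
    return Y, B

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
B_true = rng.standard_normal((5, 8))
Y_true = X @ B_true
mask = rng.random(Y_true.shape) >= 0.8             # 80% of entries missing
Y_hat, B_hat = trmv_sketch(X, Y_true, mask)
print(np.linalg.norm(Y_hat - Y_true) / np.linalg.norm(Y_true))
```

The point of the sketch is the control flow, not the exact updates: the observed entries anchor both steps, which is what lets the coefficient estimate and the completion improve each other across iterations.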
APPENDIX A: TRANSFORMATION OF (7) TO (9)
The transformation of the Frobenius norm from problem (7) to problem (9) is derived as follows. Since $\|\mathcal{Y} - \mathcal{M}\|_F^2 = \|Y_{(n)} - M_{(n)}\|_F^2$ for all $n$, and $\sum_{n=1}^{N} \alpha_n = 1$, we have
$$\|\mathcal{Y} - \mathcal{M}\|_F^2 = \sum_{n=1}^{N} \alpha_n \|\mathcal{Y} - \mathcal{M}\|_F^2 = \sum_{n=1}^{N} \alpha_n \|Y_{(n)} - M_{(n)}\|_F^2.$$

APPENDIX B: PROBLEM (13) TRANSFORMATION
The objective of problem (13) can be transformed as follows. Since $\mathcal{Y}$ and $\Lambda_n$ are given, the $\mathcal{M}$-update problem (Equation (14)) is equivalent to
$$\begin{aligned} &\tau \alpha_n \|M_{(n)}\|_{*} + \alpha_n \|M_{(n)} - Z_{(n)}\|_F^2 - \langle \Lambda_n, \mathcal{M}_n \rangle + \frac{\rho}{2} \|\mathcal{Y} - \mathcal{M}_n\|_F^2 \\ &= \tau \alpha_n \|M_{(n)}\|_{*} + \alpha_n \|M_{(n)} - Z_{(n)}\|_F^2 + \frac{\rho}{2} \left\| M_{(n)} - \left( Y_{(n)} + \frac{1}{\rho} \Lambda_{(n)} \right) \right\|_F^2 + c \\ &= \tau \alpha_n \|M_{(n)}\|_{*} + \left( \alpha_n + \frac{\rho}{2} \right) \left\| M_{(n)} - \frac{\alpha_n Z_{(n)} + (\rho/2) D_{(n)}}{\alpha_n + \rho/2} \right\|_F^2 + c, \end{aligned} \tag{B1}$$
where $c$ is a constant and $D_{(n)} = Y_{(n)} + \frac{1}{\rho} \Lambda_{(n)}$.

APPENDIX C: SOLUTION TO (17)
In this appendix, we derive the solution to problem (17). Setting the partial derivatives of the Lagrangian $L_y(\mathcal{Y}, \Phi)$ to zero, we have
$$\begin{cases} \dfrac{\partial L_y(\mathcal{Y}, \Phi)}{\partial \mathcal{Y}} = \sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y} - \mathcal{M}_n \right) \right] + \mathcal{P}_{\Omega}(\Phi) = 0, \\[4pt] \dfrac{\partial L_y(\mathcal{Y}, \Phi)}{\partial \Phi} = \mathcal{P}_{\Omega}(\mathcal{Y} - \widetilde{\mathcal{Y}}) = 0, \end{cases} \tag{C1}$$
where $\widetilde{\mathcal{Y}}$ denotes the observed response. When $(i_1, i_2, \cdots, i_N) \notin \Omega$, we only have
$$\sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y}(i_1, i_2, \cdots, i_N) - \mathcal{M}_n \right) \right] = 0. \tag{C2}$$
Fig. 12. Performance comparison between TRMV and TC-MTOT under 80% and 90% missing values.
Otherwise, we have
$$\begin{cases} \mathcal{P}_{\Omega}(\Phi) = -\sum_{n=1}^{N} \left[ \Lambda_n + \rho \left( \mathcal{Y} - \mathcal{M}_n \right) \right], \\ \mathcal{P}_{\Omega}(\mathcal{Y} - \widetilde{\mathcal{Y}}) = 0. \end{cases} \tag{C3}$$
Thus, we obtain the optimal solution of the minimization problem as
$$\mathcal{Y}(i_1, \cdots, i_N) = \begin{cases} \widetilde{\mathcal{Y}}(i_1, \cdots, i_N), & \text{if } (i_1, \cdots, i_N) \in \Omega, \\ \dfrac{1}{N} \sum_{n=1}^{N} \left( \mathcal{M}_n - \dfrac{1}{\rho} \Lambda_n \right)(i_1, \cdots, i_N), & \text{otherwise}. \end{cases} \tag{C4}$$

APPENDIX D: CASE STUDY DATA GENERATION
This appendix elaborates on the popular procedure used for generating the OV errors in the lithography process. The OV error is denoted as a vector οΏ½πΉπΉ π₯π₯ , πΉπΉ π¦π¦ οΏ½ in 2D coordinate system, with the initial point ( π₯π₯ , π¦π¦ ) and the terminal point οΏ½π₯π₯ + πΉπΉ π₯π₯ , π¦π¦ + πΉπΉ π¦π¦ οΏ½ representing the locations of the previous layer and the current layer, respectively. To maintain OV requirement, the polynomial model is widely used in practice for overlay control [25], [26], which is represented as follows, οΏ½πΉπΉ π₯π₯ ( π₯π₯ , π¦π¦ ) = π€π€ π₯π₯ πππΉπΉ π¦π¦ ( π₯π₯ , π¦π¦ ) = π€π€ π¦π¦ ππ , (D1) where πΉπΉ π₯π₯ and πΉπΉ π¦π¦ are the coordinates of the OV error; ππ =[1, π₯π₯ , π¦π¦ , π₯π₯ , π₯π₯π¦π¦ , π¦π¦ , π₯π₯ , π₯π₯ π¦π¦ , π₯π₯π¦π¦ , π¦π¦ ] ππ β β is a basis; π€π€ π₯π₯ =[ ππ , ππ , ππ , β¦ , ππ ] and π€π€ π¦π¦ = [ ππ , ππ , ππ , β¦ , ππ ] define the signatures of an OV error, which are used for on-line correction via machine settings such as wafer stage, lens, mask, and so on, along the motion directions. With this model, the OV error can be broken into linear components, i.e., ( ππ , ππ , β¦ , ππ ) and nonlinear components, i.e., ( ππ , ππ , β¦ , ππ ) , where the linear ones associate with errors including reticle rotation, wafer rotation, to name a few, and the nonlinear ones correspond to causes such as lens distortion and random errors [26]. Specifically, a set of points in a wafer are first specified to construct a basis ππ = [ ππ , ππ , β¦ , ππ ππ ] . For the signature, we only take the variation of linear components into consideration, which means ( ππ , ππ , β¦ , ππ ) equals to zeros. The variation patterns of the signature ( ππ , ππ , β¦ , ππ ) are randomly generated and we obtain ππ π₯π₯ππ = οΏ½ππ , ππ , ππ , 0, β¦ ,0 οΏ½ and ππ π¦π¦ππ = οΏ½ππ , ππ , ππ , 0, β¦ ,0 οΏ½ for ππ = 1, β¦ , ππ . 
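The basis and signature construction above, together with the resulting OV pattern, can be sketched as follows. This is a hypothetical illustration of the third-order polynomial model (D1) under the stated assumption that only the three linear-signature coefficients are nonzero; the measurement locations and all variable names are assumptions.

```python
# Sketch of the polynomial OV model (D1): build the degree-3 basis at m
# wafer locations, draw a linear-only signature, and evaluate the pattern.
import numpy as np

def poly_basis(x, y):
    """Third-order basis g(x, y) = [1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3]."""
    return np.array([np.ones_like(x), x, y, x**2, x * y, y**2,
                     x**3, x**2 * y, x * y**2, y**3])      # shape (10, m)

rng = np.random.default_rng(3)
m = 50                                                     # measurement locations
x, y = rng.uniform(-1, 1, m), rng.uniform(-1, 1, m)        # points in [-1, 1]^2
G = poly_basis(x, y)                                       # basis matrix, (10, m)

# Linear signature only: first three coefficients random, the rest zero.
w_x = np.concatenate([rng.standard_normal(3), np.zeros(7)])
w_y = np.concatenate([rng.standard_normal(3), np.zeros(7)])

# OV pattern at the m locations: delta_x = w_x^T g(x, y), likewise for y.
delta_x, delta_y = w_x @ G, w_y @ G
print(delta_x.shape, delta_y.shape)                        # (50,) (50,)
```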
Then, for each $(\psi_{xi}, \psi_{yi})$, the associated OV pattern $(r_{xi}, r_{yi})$ can be calculated as $r_{xi} = [\delta_{xi1}, \ldots, \delta_{xim}]$ and $r_{yi} = [\delta_{yi1}, \ldots, \delta_{yim}]$.

REFERENCES
[1] Y. Ding, J. H. Jin, D. Ceglarek, and J. J. Shi, "Process-oriented tolerancing for multi-station assembly systems,"
IIE Trans., vol. 37, no. 6, pp. 493-508, Jun 2005. [2] K. B. Liu, N. Z. Gebraeel, and J. J. Shi, "A Data-Level Fusion Model for Developing Composite Health Indices for Degradation Modeling and Prognostic Analysis,"
IEEE Trans. Autom. Sci. Eng., vol. 10, no. 3, pp. 652-664, Jul 2013. [3] H. Zhou, L. X. Li, and H. T. Zhu, "Tensor Regression with Applications in Neuroimaging Data Analysis,"
J. Am. Stat. Assoc., vol. 108, no. 502, pp. 540-552, Jun 2013. [4] S. Wold, K. Esbensen, and P. Geladi, "Principal Component Analysis,"
Chemometr. Intell. Lab., vol. 2, no. 1-3, pp. 37-52, Aug 1987. [5] H. Abdi, "Partial least squares regression and projection on latent structure regression (PLS Regression),"
Wiley Interdiscip. Rev. Comput. Stat., vol. 2, no. 1, pp. 97-106, Jan-Feb 2010. [6] X. W. Yue, H. Yan, J. G. Park, Z. Y. Liang, and J. J. Shi, "A Wavelet-Based Penalized Mixed-Effects Decomposition for Multichannel Profile Detection of In-Line Raman Spectroscopy,"
IEEE Trans. Autom. Sci. Eng., vol. 15, no. 3, pp. 1258-1271, Jul 2018. [7] N. Jin, S. Y. Zhou, T. S. Chang, and H. H. H. Huang, "Identification of influential functional process variables for surface quality control in hot rolling processes,"
IEEE Trans. Autom. Sci. Eng., vol. 5, no. 3, pp. 557-562, Jul 2008. [8] H. Yan, K. Paynabar, and J. J. Shi, "Image-Based Process Monitoring Using Low-Rank Tensor Decomposition,"
IEEE Trans. Autom. Sci. Eng., vol. 12, no. 1, pp. 216-227, Jan 2015. [9] X. W. Yue, J. G. Park, Z. Y. Liang, and J. J. Shi, "Tensor Mixed Effects Model With Application to Nanomanufacturing Inspection,"
Technometrics, vol. 62, no. 1, pp. 116-129, Jan 2 2020. [10] Q. B. Zhao, C. F. Caiafa, D. P. Mandic, Z. C. Chao, Y. Nagasaka, N. Fujii, L. Q. Zhang, and A. Cichocki, "Higher Order Partial Least Squares (HOPLS): A Generalized Multilinear Regression Method,"
IEEE Trans. Pattern Anal., vol. 35, no. 7, pp. 1660-1673, Jul 2013. [11] X. S. Li, D. Xu, H. Zhou, and L. X. Li, "Tucker Tensor Regression and Neuroimaging Analysis,"
Statist. Biosci., vol. 10, no. 3, pp. 520-545, Dec 2018. [12] H. Yan, K. Paynabar, and M. Pacella, "Structured Point Cloud Data Analysis Via Regularized Tensor Regression for Process Modeling and Optimization,"
Technometrics, vol. 61, no. 3, pp. 385-395, Jul 3 2019. [13] E. F. Lock, "Tensor-on-Tensor Regression,"
J. Comput. Graph. Stat., vol. 27, no. 3, pp. 638-647, 2018. [14] M. R. Gahrooei, H. Yan, K. Paynabar, and J. J. Shi, "Multiple Tensor-on-Tensor Regression: An Approach for Modeling Processes With Heterogeneous Sources of Data,"
Technometrics,
Jan 14 2020. [15] X. F. Lin, H. E. Perez, J. B. Siegel, and A. G. Stefanopoulou, "Robust Estimation of Battery System Temperature Distribution Under Sparse Sensing and Uncertainty,"
IEEE Trans. Control Syst. Technol., vol. 28, no. 3, pp. 753-765, May 2020. [16] J. F. Cai, E. J. Candes, and Z. W. Shen, "A Singular Value Thresholding Algorithm for Matrix Completion,"
SIAM J. Optimiz., vol. 20, no. 4, pp. 1956-1982, 2010. [17] J. Liu, P. Musialski, P. Wonka, and J. P. Ye, "Tensor Completion for Estimating Missing Values in Visual Data,"
IEEE Trans. Pattern Anal., vol. 35, no. 1, pp. 208-220, Jan 2013. [18] G. Tomasi and R. Bro, "PARAFAC and missing values,"
Chemometr. Intell. Lab., vol. 75, no. 2, pp. 163-180, Feb 28 2005. [19] D. Kressner, M. Steinlechner, and B. Vandereycken, "Low-rank tensor completion by Riemannian optimization,"
BIT, vol. 54, no. 2, pp. 447-468, Jun 2014. [20] C. Y. Lu, J. S. Feng, Y. D. Chen, W. Liu, Z. C. Lin, and S. C. Yan, "Tensor Robust Principal Component Analysis with a New Tensor Nuclear Norm,"
IEEE Trans. Pattern Anal., vol. 42, no. 4, pp. 925-938, Apr 1 2020. [21] T. G. Kolda and B. W. Bader, "Tensor Decompositions and Applications,"
SIAM Rev., vol. 51, no. 3, pp. 455-500, Sep 2009. [22] F. Cismondi, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. C. Sousa, and S. N. Finkelstein, "Missing data in medical databases: Impute, delete or classify?,"
Artif. Intell. Med., vol. 58, no. 1, pp. 63-72, May 2013. [23] X. Qi and R. Y. Luo, "Function-on-function regression with thousands of predictive curves,"
J. Multivariate Anal., vol. 163, pp. 51-66, Jan 2018. [24] Canon. (2018).
What Is Semiconductor Lithography Equipment?
Available: https://global.canon/en/technology/support29.html. [25] B. Y. Hsueh, G. K. Huang, C.-C. Yu, J. K. Hsu, C.-C. K. Huang, C.-J. Huang, and D. Tien, "High order correction and sampling strategy for 45nm immersion lithography overlay control," in
Metrology, Inspection, and Process Control for Microlithography XXII , 2008, vol. 6922, p. 69222Q: International Society for Optics and Photonics. [26] T. Kikuchi, Y. Ishii, and N. Tokuda, "Introduction of new techniques for matching overlay enhancement,"
Optical Microlithography XIV, Pts. 1 and 2, vol. 4346, pp. 1608-1616, 2001.
Feng Wang (Student Member, IEEE) received the M.S. degree in Traffic Information Engineering & Control from Beijing Jiaotong University, Beijing, China, in 2016, where he is working toward the Ph.D. degree with the State Key Laboratory of Rail Traffic Control and Safety. He is currently a visiting scholar in the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta. His research interests include statistical modeling and prognostics with high-dimensional data.
Mostafa Reisi-Gahrooei received the masterβs degree in computational science and engineering and the Ph.D. degree in industrial and systems engineering from Georgia Institute of Technology, Atlanta, GA, USA, and the M.Sc. degrees in transportation engineering and applied mathematics from the Southern Illinois University Edwardsville, Edwardsville, IL, USA. He is currently an Assistant Professor with the Department of Industrial and Systems Engineering, the University of Florida, Gainesville, FL, USA. His research focuses on modeling, monitoring, and control of complex systems with functional, high-dimensional data. Dr. Reisi Gahrooei is a member of the Institute for Operations Research and the Management Sciences (INFORMS) and the Institute of Industrial and Systems Engineers (IISE).
Zhen Zhong received the B.S. degree in Electrical Engineering from University of Science and Technology of China, Anhui, China, in 2017. Currently, he is a Ph.D. student at the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta. His research interests are focused on the process control in semiconductor manufacturing.
Tao Tang (Senior Member, IEEE) received the Ph.D. degree in automatic control theory and application from the Chinese Academy of Sciences, Beijing, China, in 1991. He is currently the Director of the School of Electronic and Information Engineering and the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China. His research interests include both high-speed and urban railway train control systems, as well as intelligent transportation systems. He is a member of the Experts Group of the High Technology Research and Development Program of China (863 Program) and the Leader in the field of the Modern Transportation Technology Experts Group. He is also a Specialist with the National Development and Reform Commission and the Beijing Urban Traffic Construction Committee.