SONIC: SOcial Network with Influencers and Communities
Cathy Yi-Hsuan Chen ∗ University of Glasgow
Wolfgang Karl Härdle † Humboldt-Universität zu Berlin [email protected]
Yegor Klochkov ‡ Cambridge-INET, Faculty of Economics, University of Cambridge [email protected]
February 9, 2021
Abstract
The integration of social media characteristics into an econometric framework requires modeling a high-dimensional dynamic network whose parameter dimension is typically much larger than the number of observations. To cope with this problem, we introduce SONIC, a new high-dimensional network model that assumes that (1) only a few influencers drive the network dynamics, and (2) the community structure of the network is characterized by homogeneity of response to specific influencers, implying their underlying similarity. An estimation procedure is proposed based on a greedy algorithm and LASSO regularization. Through theoretical study and simulations, we show that the matrix parameter can be estimated even when the sample size is smaller than the size of the network. Using a novel dataset retrieved from one of the leading social media platforms, StockTwits, and quantifying user opinions via natural language processing, we model the dynamics of the opinion network among a select group of users and further detect the latent communities. With a sparsity regularization, we identify important nodes in the network.
JEL codes: C1, C22, C51, G41

Keywords: social media, network, community, influencers, sentiment

∗ Adam Smith Business School, University of Glasgow, UK, and Humboldt-Universität zu Berlin, Germany; corresponding author.

† BRC Blockchain Research Center, Humboldt-Universität zu Berlin, Germany; Sim Kee Boon Institute, Singapore Management University, Singapore; WISE Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China; Dept. of Information Science and Finance, National Chiao Tung University, Hsinchu, Taiwan, ROC; Dept. of Mathematics and Physics, Charles University, Prague, Czech Republic. Support from grants DFG IRTG 1792, CAS XDA 23020303, and COST Action CA19130 is gratefully acknowledged.

‡ The work was done while this author was a postgraduate student at Humboldt-Universität zu Berlin. Financial support from the German Research Foundation (DFG) via the International Research Training Group 1792 "High Dimensional Nonstationary Time Series" at Humboldt-Universität zu Berlin is gratefully acknowledged.
A network is defined through a set of nodes and edges with a given adjacency structure. In a social, financial, or econometric context, such networks are often dynamic, and nodes, such as individuals or firms, change their activities over time. An analysis of such network dynamics is often based on vector autoregression. Consider a network that produces a time series $Y_t \in R^N$, $t = 1, \ldots, T$, whose dependencies between elements are modeled through the equation

$Y_t = \Theta Y_{t-1} + W_t,$   (1.1)

where the $W_t$ are innovations that satisfy $E[W_t \mid F_{t-1}] = 0$ with $F_t = \sigma\{Y_t, Y_{t-1}, \ldots\}$, so that the interactions between the nodes are described by an autoregression operator $\Theta \in R^{N \times N}$. In terms of the network connections, we say that a node $i$ is connected to the node $j$ if $\Theta_{ij} \neq 0$, so that the nonzero coefficients represent the adjacency matrix of such a network, and the sparsity of $\Theta$ represents the number of edges. For large-scale time series, one encounters the curse of dimensionality, as estimating the matrix parameter $\Theta$ with $N^2$ elements requires a significantly large number of observations $T$.

Several attempts to reduce the dimensionality have been made in the past literature. Assuming that the elements of a time series form a connected network, Zhu et al. (2017) introduce a Network Autoregression (NAR) with $\Theta_{ij} = \beta A_{ij} / \sum_{k=1}^{N} A_{ik}$, provided that the adjacency matrix $A \in R^{N \times N}$ is known. Here, the regression operator is defined up to a single parameter $\beta$, called the network effect, which can be estimated through simple least squares. Zhu et al. (2019) also extend this model to conditional quantiles. Furthermore, Zhu and Pan (2020) argue that a single network parameter may not be satisfactory, as it treats all nodes of the network homogeneously. In particular, the NAR implies that each node is affected by its neighbors to the same extent, while in reality we may have, e.g., financial institutions that are affected less or more than the others (see Mihoci et al. (2020)). Hence they propose to detect communities in a network based on the given adjacency matrix and suggest that the nodes in each community share a separate network effect parameter. Gudmundsson (2018) takes a somewhat opposite direction: their BlockBuster algorithm determines the communities through the estimated autoregressive model, which, however, does not solve the dimensionality problem. Apart from this line of work, sparse regularisations have been extensively used, see Fan et al. (2009); Han et al. (2015); Melnyk and Banerjee (2016).

To sum up, we point out the following problems that one may encounter while dealing with vector autoregression in this social media context:

• The VAR parameter dimension is significant; one requires even larger time intervals for consistent estimation. Even if one can afford such a dataset, in the long run autoregressive models may have time-varying parameters, see e.g. Čížek et al. (2009). We therefore impose some assumptions on the structure of the operator $\Theta$, so that estimation with moderate sample sizes is possible.

• The NAR model assumes that the adjacency matrix is known. In particular, this is justified for social networks with a stable and natural friendship/follower-followee relationship. For a realistic network of financial institutions, there is no explicitly defined adjacency matrix, and one has to evaluate it heuristically using additional information (identical shareholders, trading volumes) or through analyzing correlations and lagged cross-correlations between returns or risk profiles, see Diebold and Yılmaz (2014) and Chen et al. (2019b). However, there is no rigorous reason to believe that the operator in (1.1) depends explicitly on such an adjacency matrix, see also Cha et al. (2010).

Our main contribution is to propose a new method for modeling social network dynamics, which is a challenging task in the presence of the curse of dimensionality and in the absence of knowledge of the adjacency matrix.
The proposed SONIC (SOcial Network analysis with Influencers and Communities) has the following advantages. First, it allows us to identify the hidden figures who mainly drive the opinion generating process on social media. Second, it discovers the hidden community structure. The proposed estimation algorithm uncovers the hidden figures and communities simultaneously until the minimal empirical risk is attained. Third, we discuss the theoretical properties and underpinnings that ensure estimation efficiency. Apart from dimensionality, social media data feature missing observations, bringing another challenge to researchers. The proposed SONIC is therefore equipped with a correction mechanism for missing observations. We demonstrate the applicability of SONIC on a novel social media dataset.

In more detail, the heuristics behind the assumptions of SONIC are motivated by social media users' activities and characteristics. Based on well-known user experience on platforms like Facebook, Twitter, etc., one can assume that some users have significantly more followers than others. Take, for example, celebrities, athletes, analysts, politicians, or Instagram divas. In a network view, these users are the nodes that have much more influence than the rest of the nodes: these nodes are thereby defined as influencers. In the framework of autoregression, a node $j$ is an influencer if there is a substantial amount of other nodes $i$ such that $\Theta_{ij} \neq 0$. Assuming that the number of influencers is limited, we fix only a few columns of the matrix $\Theta$ to be non-zero. This allows us to concentrate on the connections to the influencers, significantly reducing the number of parameters to be estimated. A similar idea is used in Chen et al. (2018), with a group-LASSO regularisation imposed, yielding a solution with few active columns. Notice, however, that relying on sparsity alone still requires $T > N$, see e.g. Fan et al. (2009); Chernozhukov et al. (2020).

It is also well known that social networks have small communities, with the nodes exhibiting higher connection density or similar behavior inside communities. Zhu and Pan (2020) analyze a more realistic set-up by allowing separate parameters for each community instead of a single network effect parameter. In our notation, the conditional mean of the response of the node $i$ satisfies $E[Y_{it} \mid F_{t-1}] = \Theta_{i1} Y_{1,t-1} + \cdots + \Theta_{iN} Y_{N,t-1}$. Therefore, the behavior of the node $i$ is characterized by the coefficients $\Theta_{i1}, \ldots, \Theta_{iN}$, i.e., the nodes it depends upon. We assume that the nodes are separated into a few clusters such that the nodes from the same cluster share the same dependency structure, which brings a bigger picture into the view: instead of saying that two nodes from the same cluster are more likely to be connected, we say that they connect to the same influencers.

We apply SONIC to data from the social media platform StockTwits (https://stocktwits.com). For each user, one can quantify the average sentiment score, via a textual analysis, over the messages he posts during the day. Analyzing these high-dimensional time series, on the one hand, we can identify influencers, that is, the users whose opinions are overwhelmingly important, and on the other hand, we determine the community structure. One challenge emerges here: the presence of missing observations, since sometimes users do not leave any message. We treat this as follows: assume there is an underlying opinion process that follows the network dynamics (1.1). However, such an opinion process might be partially observed, given the random arrival of messages from each user, which motivates a commonly used model for missing observations that involves masked Bernoulli random variables. The proposed SONIC accommodates this situation. We return to it in detail in Section 3.3.

The rest of the paper is organized as follows. Section 2 introduces readers to the StockTwits platform and describes in detail the available dataset and the process of extracting users' sentiment scores. In Section 3, we first introduce our SONIC model, then describe the estimation procedure and provide a consistency result. In Section 4, we provide simulation results that support the theoretical properties of our estimator. Next, in Section 5, we present and discuss the results of the application of our model to the dataset retrieved from StockTwits. Section 6 concludes. We dedicate Section 7 to the proofs, as well as Sections A and B in the appendix. Readers can find all numerical examples and the codes developed for the SONIC model on .

Social media are an ideal platform where users can easily communicate with each other, exchange information, and share opinions. The increasing popularity of social media is evidence of growing demand for exchanging opinions and information among granular users. Among social media platforms, we are particularly interested in StockTwits for several reasons. Firstly, it is a social media platform designed for sharing ideas between investors, traders, and entrepreneurs. It is similar to Twitter but dedicated to the discussion of financial issues. One of the innovations that led to its popularity is a well-designed reference between the message content and the mentioned stock symbols. Conversations are organized around 'cashtags' (e.g., '$AAPL' for APPLE; '$BTC.X' for BITCOIN) that allow users to narrow down streams on specific assets. Secondly, users can express their sentiments/opinions by labeling their messages as 'Bearish' (negative) or 'Bullish' (positive) via a toggle button. These are so-called self-reported sentiments, and these labeled data permit the use of supervised textual analysis that requires a training dataset.

We use the StockTwits Application Programming Interface (API) to retrieve all messages containing the preferred cashtags. The StockTwits API also provides, for each message, its unique user identifier, the time it was posted with one-second precision, and the sentiment declared by the user ('Bullish,' 'Bearish,' or unclassified). Among over a thousand tickers/symbols, we particularly pick two symbols, $AAPL for APPLE and $BTC.X for BITCOIN, which represent the most popular security and cryptocurrency, respectively. Given that the two symbols may attract investors/users with different degrees of interaction, we may uncover disparate network dynamics. In Table 1, we summarize the messages' statistics and document the generated sentiment series. Firstly, the BTC investors tend to disclose their sentiment, evident by 44% of labeled messages.
Symbols                                AAPL                       BTC
message volume                         449,761                    644,597
number of distinct users (N)           26,521                     25,492
number of bullish messages             133,316                    196,555
number of bearish messages             48,186                     90,677
percentage of bullish messages         20.6%                      30.4%
percentage of bearish messages         7.4%                       14.0%
percentage of labeled messages         28.0%                      44.4%
mean of sentiment                      0.285                      0.292
standard deviation of sentiment        0.478                      0.397
size of positive training dataset      99,985                     147,759
size of negative training dataset      36,100                     67,752
message volume per day                 730                        305
number of positive terms in lexicon    4,000                      3,775
number of negative terms in lexicon    4,000                      3,759
number of daily observations (T)       423                        2108
sample period                          2017-05-22 to 2019-01-27   2013-03-21 to 2018-12-27

Table 1: Summary statistics of social media messages
Two main methods are used for textual sentiment analysis: dictionary-based approaches and machine learning techniques. We opt for the dictionary-based approach in consideration of transparency, comprehension, a lower computational burden, and the short texts involved. StockTwits, like Twitter, limits message length to 140 characters, which further limits the power of a machine learning-based approach given the little contextual information in short texts. A dictionary, or lexicon, is a list of words labeled as positive, negative, or neutral. Given such a list, the bag-of-words approach consists of counting the number of positive and negative words in a document in order to assign it a sentiment value or a tone. For example, a simple dictionary containing only the words 'good' and 'bad' with positive and negative labels, respectively, would classify the sentence 'Bitcoin is a good investment' as positive with a tone of +1.

The simplicity of the dictionary-based approach guarantees transparency and replicability; on the con side, it comes with limitations on natural language analysis. First, referring to Deng et al. (2017) and the 'context of discourse,' one needs to be aware of the content domain, to which language interpretation is sensitive. For example, Loughran and McDonald (2011) point out that words like 'tax' or 'cost' are classified as negative by the Harvard General Inquirer lexicon, whereas they should be considered neutral in the financial context. Another example concerns quantifying sentiment on cryptocurrency: users frequently rely on emojis such as 🚀 (positive) and 💩 (negative) when talking about cryptocurrencies. These are missing in the traditional dictionary.

Bearing the aforementioned considerations in mind, in the sentiment quantification for the messages of AAPL we employ the social media lexicon developed by Renault (2017), while in the case of BTC we advocate the lexicon tailored for cryptocurrency assets by Chen et al. (2019a). Renault (2017) demonstrates that the constructed lexicon significantly outperforms the benchmark dictionaries while remaining competitive with high-level machine learning algorithms. Based on 125,000 bullish and another 125,000 bearish messages published on StockTwits, using the lexicon for social media achieves 90% of classified messages and 75.24% of correct classifications. With a collection of 1,533,975 messages from 38,812 distinct users, posted between March 2013 and December 2018, and related to 465 cryptocurrencies listed in StockTwits, Chen et al. (2019a) document that implementing the crypto lexicon classifies 83% of messages, with 86% of them correctly classified.

To convert unstructured text into a machine-readable form, we proceed with natural language processing (NLP) using the NLTK toolkit. First, all messages are lowercased. Tickers ('$BTC.X,' '$LTC.X,' ...), dollar or euro values, hyperlinks, numbers, and mentions of users are respectively replaced by the words 'cashtag,' 'moneytag,' 'linktag,' 'numbertag,' and 'usertag.' The prefix 'negtag_' is added to any word consecutive to 'not,' 'no,' 'none,' 'neither,' 'never,' or 'nobody.' Finally, the three stopwords 'the,' 'a,' 'an' and all punctuation except the characters '?' and '!' are removed. For each collected message we filter the terms appearing in the designated lexicon and equally weight the filtered terms to generate the sentiment score of the message, which means that the sentiment score of a message is estimated as the average over the weights of the lexicon terms it contains.
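The preprocessing and scoring pipeline just described can be summarized in a short script. The sketch below is only illustrative and rests on assumptions not stated in the paper: the lexicon is represented as a plain word-to-weight dictionary (the actual Renault (2017) and Chen et al. (2019a) lexicons are external resources), and simple regular expressions stand in for whatever tokenization was used within NLTK.

```python
import re

# Hypothetical lexicon: term -> weight in [-1, 1]; the real lexicons contain thousands of terms.
LEXICON = {"bullish": 0.9, "moon": 0.7, "negtag_good": -0.6, "crash": -0.8}

NEGATORS = {"not", "no", "none", "neither", "never", "nobody"}
STOPWORDS = {"the", "a", "an"}

def preprocess(message):
    """Apply the cleaning steps described in the text and return a token list."""
    text = message.lower()
    text = re.sub(r"\$[a-z]+(\.x)?", "cashtag", text)        # tickers
    text = re.sub(r"[\$€]\s?\d+(\.\d+)?", "moneytag", text)   # dollar/euro values
    text = re.sub(r"https?://\S+", "linktag", text)           # hyperlinks
    text = re.sub(r"@\w+", "usertag", text)                   # user mentions
    text = re.sub(r"\d+", "numbertag", text)                  # numbers
    text = re.sub(r"[^\w\s?!]", " ", text)                    # drop punctuation except ? and !
    tokens, negate = [], False
    for tok in text.split():
        if tok in STOPWORDS:
            continue
        if negate:
            tok, negate = "negtag_" + tok, False
        if tok in NEGATORS:
            negate = True
        tokens.append(tok)
    return tokens

def sentiment_score(message):
    """Average lexicon weight over the matched terms; None if no term matches."""
    weights = [LEXICON[t] for t in preprocess(message) if t in LEXICON]
    return sum(weights) / len(weights) if weights else None

print(sentiment_score("$BTC.X not good, expecting a crash!"))
```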
Since the weights of the lexicon terms lie in the range $[-1, 1]$, so does the resulting message sentiment score. Figure 2.1 presents the users' sentiment in a heatmap, with the $y$-axis for the user's ID and the $x$-axis for the message posting date; each cell of the heatmap is the quantified sentiment score. The level of sentiment is color-coded, so that the evolution and dynamics of sentiment among users can be read from such a heatmap presentation. It appears that users express diverging opinions over time. From Figure 2.1a (AAPL) or Figure 2.1b (BTC), one observes similar color codes among a group of users at a particular date or period, indicating a contemporaneous and potentially intertemporal dependency among users' sentiment time series. The correlation matrices of users' sentiment time series in Figure 2.2 support this observation.

The percentage of correct classification is defined as the proportion of correct classifications among all classified messages, while the percentage of classified messages is denoted as the proportion of classified messages among all messages. The list of symbols can be found at https://api.stocktwits.com/symbol-sync/symbols.csv.
Let us first introduce some basic notation. Throughout the whole paper, $N$ always denotes the size of the network. Denote by $[N]$ the set of integers from 1 to $N$, i.e., $[N] = \{1, \ldots, N\}$. For a subset of indices $\Lambda \subset [N]$ we denote its complement $\Lambda^c = [N] \setminus \Lambda$. Moreover, if $A$ is an $N \times N$ matrix and $\Lambda_1, \Lambda_2 \subset [N]$ are two subsets of indices, we denote the submatrix $A_{\Lambda_1, \Lambda_2} = (A_{ij})_{i \in \Lambda_1, j \in \Lambda_2}$. We also write for short $A_{\Lambda, \cdot} = A_{\Lambda, [N]}$ and $A_{\cdot, \Lambda} = A_{[N], \Lambda}$.

Furthermore, for a vector $a \in R^d$ denote by $\mathrm{diag}\{a\} \in R^{d \times d}$ the square matrix that has the values $a_1, \ldots, a_d$ on the diagonal and zeros elsewhere. For a square matrix $A \in R^{d \times d}$ we denote by $\mathrm{Diag}(A) \in R^{d \times d}$ the diagonal matrix of the same size that coincides with $A$ on the diagonal, i.e., $\mathrm{Diag}(A) = \mathrm{diag}(A_{11}, \ldots, A_{dd})$. For the off-diagonal part we use the notation $\mathrm{Off}(A) = A - \mathrm{Diag}(A)$.

For a real vector $x \in R^d$ and $q \geq 1$ or $q = \infty$ denote the $\ell_q$-norm $\|x\|_q = (|x_1|^q + \cdots + |x_d|^q)^{1/q}$; for $q = 2$ we drop the index, i.e., $\|x\| = \|x\|_2$; we also denote the pseudo-norm $\|x\|_0 = \sum_i 1(x_i \neq 0)$. For $A \in R^{d_1 \times d_2}$, let $\sigma_1(A) \geq \sigma_2(A) \geq \cdots \geq \sigma_{\min(d_1, d_2)}(A)$ denote the non-trivial singular values of $A$. We will also refer to $\sigma_{\min}(A)$ as the least nontrivial singular value, i.e., $\sigma_{\min}(A) = \sigma_{\min(d_1, d_2)}(A)$. Furthermore, we write $|||A|||_{op} = \max_j \sigma_j(A)$ for the spectral norm and $|||A|||_F = \mathrm{Tr}^{1/2}(A^\top A) = \big(\sum_{j=1}^{\min(d_1, d_2)} \sigma_j^2(A)\big)^{1/2}$ for the Frobenius norm. Additionally, we introduce element-wise norms $\|A\|_{p,q}$ for $p, q \geq 1$ (possibly $\infty$), where $\|A\|_{p,q}$ denotes the $\ell_q$ norm of the vector composed of the $\ell_p$ norms of the rows of $A$, i.e., $\|A\|_{p,q} = \big(\sum_i \big(\sum_j |A_{ij}|^p\big)^{q/p}\big)^{1/q}$. Notice that $\|A\|_{2,2} = |||A|||_F$. Finally, let $e_1, \ldots, e_N$ denote the standard basis in $R^N$, i.e., $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ with the element 1 at the $i$-th position.

Figure 2.1: Social media users' sentiment over time; (a) AAPL users, (b) BTC users. The $y$-axis is the user's id, while the $x$-axis is the time stamp.

Figure 2.2: Correlation matrix of users' sentiment time series; (a) AAPL, (b) BTC.
The daily aggregate sentiment indicator for each symbol is obtained by averaging, at 24-hour intervals, the sentiment scores of the individual messages published per calendar day.

Θ: Influencers & communities

In our set-up, the behavior of each node $i \in [N]$ is characterized by the coefficients $\Theta_{i1}, \ldots, \Theta_{iN}$, and when we group the nodes using their characteristics the notion of community is merged with the notion of cluster. We assume that the nodes are separated into clusters such that these coefficients remain quantitatively comparable for the nodes within each cluster. Let us first give a precise definition of a clustering.

Definition 3.1. A $K$-clustering of the set of nodes $[N]$ is a sequence $C = (C_1, \ldots, C_K)$ of $K$ subsets of $[N]$ such that

• any two subsets are disjoint, $C_i \cap C_j = \emptyset$ for $i \neq j$;

• the union of the subsets $C_j$ gives all nodes, $C_1 \cup \cdots \cup C_K = \{1, \ldots, N\}$.

Two clusterings $C$ and $C'$ are equivalent if the corresponding clusters are equal up to a relabeling, i.e., there is a permutation $\pi$ on $\{1, \ldots, K\}$ such that $C_j = C'_{\pi(j)}$ for every $j = 1, \ldots, K$. Furthermore, define a distance between two clusterings as

$d(C, C') = \min_\pi \sum_{j=1}^{K} |C_j \setminus C'_{\pi(j)}|.$

Remark 3.1. The distance between clusterings is, in fact, the minimal number of nodes one has to transfer from one cluster to another in order to make the clusterings equivalent. To see this, notice that each clustering can be defined as a sequence $(l_1, \ldots, l_N)$ of $N$ labels taking values in $\{1, \ldots, K\}$, so that each cluster is defined as $C_j = \{i : l_i = j\}$. Then, if the clustering $C'$ corresponds to the labels $l'_1, \ldots, l'_N$, the distance between them reads as

$d(C, C') = \min_\pi \sum_{i=1}^{N} 1(l_i \neq \pi(l'_i)).$
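For small $K$ the clustering distance can be computed directly from the label representation in Remark 3.1 by enumerating the permutations; a minimal sketch (not part of the paper, function names are ours) is given below. For larger $K$ one would replace the enumeration with an assignment-problem solver.

```python
from itertools import permutations

def clustering_distance(labels_a, labels_b, K):
    """d(C, C') from Remark 3.1: the minimal number of label disagreements
    over all relabelings (permutations) of the second clustering."""
    assert len(labels_a) == len(labels_b)
    best = len(labels_a)
    for perm in permutations(range(K)):          # pi maps labels of C' to labels of C
        mismatches = sum(la != perm[lb] for la, lb in zip(labels_a, labels_b))
        best = min(best, mismatches)
    return best

# Example: two 2-clusterings of 6 nodes that differ in a single node.
print(clustering_distance([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 1], K=2))  # -> 1
```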
We specify our model by imposing assumptions concerning the communities and the presence of influencers.

Definition 3.2.
We say that $\Theta \in \mathrm{SONIC}(s, K)$ (SOcial Network with Influencers and Communities) if

• each user is influenced by at most $s$ influencers, i.e.,

$\max_i \sum_{j=1}^{N} 1(\Theta_{ij} \neq 0) \leq s;$

• there is a $K$-clustering $C = (C_1, \ldots, C_K)$ such that $\Theta_{ij} = \Theta_{i'j}$, $j = 1, \ldots, N$, whenever $i, i'$ are from the same cluster $C_l$, $l = 1, \ldots, K$.

We will also say that $\Theta$ has clustering $C$.

Once $\Theta \in \mathrm{SONIC}(s, K)$ has clustering $C = (C_1, \ldots, C_K)$, the following factor representation takes place:

$\Theta = Z_C V^\top,$   (3.1)

where $Z_C$, $V$ are $N \times K$ matrices such that

• $Z_C = [z_{C_1}, \ldots, z_{C_K}]$ is a normalized index matrix of the clustering $C$, where for any $C \subset [N]$ we denote

$z_C = \frac{1}{\sqrt{|C|}} \big(1(1 \in C), \ldots, 1(N \in C)\big) \in R^N$

a normalized index vector for the cluster $C$, so that $Z_C^\top Z_C = I_K$;

• $V = [v_1, \ldots, v_K]$ has sparse columns, $\|v_j\|_0 \leq s$, i.e., only a few nodes are active and carrying information.

We present a schematic picture of what we expect in Figure 3.1 (a network with $K = 3$ and $s = 1$). Here, the nodes from the same clusters are subject to the same influencers (the grey nodes may be in any of the clusters), which also coincides with the idea of Rohe et al. (2016), who look for the right-hand singular vectors of the Laplacian in a directed network, grouping the nodes affected by the same group of nodes.

The equation (3.1) is akin to bilinear factor models, which appear in the econometric literature as models with factor loadings, see e.g. Moon and Weidner (2018) and the references therein. It is also a popular machine learning technique for low-rank approximation, see the thorough review in Udell et al. (2016). Chen and Schienle (2019) use sparse factors for a closely related model. We also mention the line of work (Kapetanios et al., 2019; Parker and Sul, 2016; Pesaran and Yang, 2020) with a similar notion of dominant units, but in contrast with our analysis, they are defined through modeling cross-sectional dependencies.

A network of size $N$ represents a multivariate time series $Y_t = (Y_{1t}, \ldots, Y_{Nt})^\top \in R^N$, where $Y_{it}$ is the response of a node $i = 1, \ldots, N$ at a time $t = 1, \ldots, T$ and is contaminated with missing observations. Instead of specifying the exact distribution under the parametric model (1.1), we assume there is a true parameter $\Theta^* \in R^{N \times N}$ and some unknown probability measure $P$ with expectation $E$, such that under this measure the time series follows the autoregressive equation

$Y_t = \Theta^* Y_{t-1} + W_t,$   (3.2)

with $E[W_t \mid F_{t-1}] = 0$ for $F_{t-1} = \sigma(W_{t-1}, W_{t-2}, \ldots)$. For the sake of simplicity, we additionally assume that the $W_t$ are independent and have $\mathrm{Var}(W_t) = S$ under $P$. Once $|||\Theta^*|||_{op} < 1$, the process admits the stationary representation

$Y_t = \sum_{k \geq 0} (\Theta^*)^k W_{t-k},$   (3.3)

and the covariance of the process reads as

$\Sigma = \mathrm{Var}(Y_t) = \sum_{k \geq 0} (\Theta^*)^k S \{(\Theta^*)^k\}^\top.$   (3.4)

For simplicity, we consider sub-Gaussian vectors $W_t$, as this allows us to have deviation bounds for covariance estimation with exponential probabilities. Recall the following definition, which appears, e.g., in Vershynin (2018).

Definition 3.3.
A random vector $W \in R^d$ is called $L$-sub-Gaussian if for every $u \in R^d$ it holds that

$\|u^\top W\|_{\psi_2} \leq L \|u^\top W\|_{L_2},$

where for a random variable $X \in R$ we denote

$\|X\|_{\psi_2} = \inf\big\{C > 0 : E \exp(|X|^2 / C^2) \leq 2\big\}, \qquad \|X\|_{L_2} = E^{1/2} |X|^2.$

Estimating SONIC is not impeded by the presence of missing data, which appears to be one of the features of social media data. We adopt the framework of Lounici (2014) for vectors with missing observations, assuming that each variable $Y_{it}$ is independently and only partially observed with some probability. Formally speaking, instead of having a realization of the whole vector $Y_t$, we only observe the masked process $Z_t$ defined as

$Z_t = (\delta_{1t} Y_{1t}, \ldots, \delta_{Nt} Y_{Nt})^\top, \quad t = 1, \ldots, T,$   (3.5)

where $\delta_{it} \sim \mathrm{Be}(p_i)$ are independent Bernoulli random variables for every $i = 1, \ldots, N$ and some $p_i \in (0, 1]$. That is, each $Y_{it}$ is only observed with probability $p_i$, independently from the other variables, with $\delta_{it} = 1$ corresponding to the observed $Y_{it}$ and $\delta_{it} = 0$ to the unobserved $Y_{it}$. Obviously, the case $p_i = 1$ for every $i = 1, \ldots, N$ corresponds to the process without missing observations. Therefore, the framework constituted by (3.5) serves as a generalization of dynamic network models.

Remark 3.2.
In terms of the StockTwits world, we interpret the process $Y_t$ as an unobserved underlying opinion process. Such an opinion process quantified from the messages is subject to the random arrival of messages, as users disclose their opinions randomly on social media. Although one may restrict the sample to the case of full observation, the statistical inference may then be questionable. Also, discarding nodes with very few missing observations is a waste of available information. Given the fact that some users are more active than others, we need to account for different probabilities $p_i$.

Notice that in general the probabilities $p_i$ are not known, but they can be easily estimated through the frequencies $\hat{p}_i = T^{-1} \sum_{t=1}^{T} 1[Y_{it} \neq 0]$. Set $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_N)^\top$. Following Lounici (2014), we denote the observed empirical covariance $\Sigma^* = T^{-1} \sum_{t=1}^{T} Z_t Z_t^\top$ and consider the following covariance estimator,

$\hat{\Sigma} = \mathrm{diag}\{\hat{p}\}^{-1} \mathrm{Diag}(\Sigma^*) + \mathrm{diag}\{\hat{p}\}^{-1} \mathrm{Off}(\Sigma^*) \, \mathrm{diag}\{\hat{p}\}^{-1}.$

This estimator is motivated by the fact that $E \Sigma^*_{ii} = p_i \Sigma_{ii}$ and $E \Sigma^*_{ij} = p_i p_j \Sigma_{ij}$ for $i \neq j$ in the case of independent observations. The state-of-the-art bound for the error of such a covariance estimator is inspired by Klochkov and Zhivotovskiy (2020), Theorem 4.2. In the case of independent vectors $Y_t$ and equal probabilities of observation $p_1 = \cdots = p_N = p$ they show that for any $u \geq 1$, with probability at least $1 - e^{-u}$,

$|||\hat{\Sigma} - \Sigma|||_{op} \leq C |||\Sigma|||_{op} \left( \sqrt{\frac{\tilde{r}(\Sigma) \log \tilde{r}(\Sigma)}{T p}} \vee \sqrt{\frac{u}{T p}} \vee \frac{\tilde{r}(\Sigma) \{\log \tilde{r}(\Sigma) + u\} \log T}{T p} \right),$

where $\tilde{r}(\Sigma) = \mathrm{Tr}(\Sigma) / |||\Sigma|||_{op}$ denotes the effective rank of the covariance $\Sigma$. The effective rank appears as well in the classic covariance estimation problem (i.e., $p = 1$), see, e.g., Koltchinskii and Lounici (2017), who even provide a matching lower bound. Notice that the effective rank takes values between 1 and the rank of $\Sigma$. However, if there is no additional structure, it can be of order $N$, which means that the bound above can only guarantee an error of order $\sqrt{N / (T p)}$, not taking into account the logarithms.

On the other hand, one only needs to bound the error within specific low-dimensional subspaces. Say, given two projectors $P, Q$ of rank lower than $N$, one needs to bound the error

$|||P(\hat{\Sigma} - \Sigma) Q|||_{op},$

which can be significantly smaller than the total error $|||\hat{\Sigma} - \Sigma|||_{op}$. For example, if we are interested in the error of estimation of $\Sigma_{\Lambda, \Lambda}$, where $\Lambda \subset [N]$, the corresponding projectors would have the form $P = Q = \sum_{i \in \Lambda} e_i e_i^\top$. Notice that this projector will be sparse, in the sense that most of its values will be zeros, when $|\Lambda|$ is much smaller than $N$. In fact, due to the unknown probabilities $p_i$, which we estimate via the frequencies, the "sparsity" of the projectors $P, Q$ will play an important role as well. We define it below.
Definition 3.4.
Let $P \in R^{N \times N}$ be a symmetric projector, i.e., $P^2 = P$ and $P^\top = P$. Let $\Lambda \subset [N]$ be the smallest set such that $P_{ij}$ is nonzero only for indices $i, j \in \Lambda$. Then, we refer to the value $|\Lambda|$ as the sparsity of $P$.
We employ this technical condition to state bounds for the error of the covariance estimator with missing observations. The corresponding diagonal projector $\Pi_\Lambda = \sum_{i \in \Lambda} e_i e_i^\top$ commutes not only with $P$, but also with any other diagonal operator, in particular with $\mathrm{diag}\{\hat{p}\}^{-1}$. Thus, with the help of this larger projector (obviously, $\mathrm{Rank}(P) \leq |\Lambda|$) we can take into account the error that comes from the estimated frequencies.

The following theorem provides a deviation bound for the autoregressive process (3.2). Unlike the bound of Klochkov and Zhivotovskiy (2020), it accounts for possibly distinct probabilities $p_i$.

Theorem 3.5.
Assume the vectors $W_t$ are independent $L$-sub-Gaussian and also $|||\Theta^*|||_{op} \leq \gamma < 1$, $p_i \geq p_{\min} > 0$. Let $P, Q \in R^{N \times N}$ be two arbitrary orthogonal projectors of ranks $M_1$, $M_2$ and with sparsities $K_1$, $K_2$, respectively. Suppose that $u > 0$ is such that

$\frac{\max\{1, K_1, K_2, \sqrt{K_1 K_2} \log T\} \log(4N) + u}{T p_{\min}} \leq 1.$   (3.6)

Then, it holds with probability at least $1 - e^{-u}$ that

$|||P(\hat{\Sigma} - \Sigma) Q|||_{op} \leq C |||S|||_{op} \left( \sqrt{\frac{(M_1 \vee M_2)(\log N + u)}{T p_{\min}}} \vee \frac{\sqrt{M_1 M_2}\, (\log N + u) \log T}{T p_{\min}} \right),$

where $C = C(\gamma, L)$ only depends on $L$ and $\gamma$. See the proof of this result in Section A.

Additionally, we are interested in estimating the lag-1 cross-covariance under the same scenario. Namely, based on the sample $Z_1, \ldots, Z_T$ and given the estimated probabilities $\hat{p}_1, \ldots, \hat{p}_N$, we wish to estimate the matrix $A = E\, Y_t Y_{t+1}^\top$. Since $E[Y_{t+1} \mid F_t] = \Theta^* Y_t$ for the linear process (3.3), the corresponding cross-covariance reads as $A = \Sigma (\Theta^*)^\top$. Consider the following estimator,

$\hat{A} = \mathrm{diag}\{\hat{p}\}^{-1} A^* \, \mathrm{diag}\{\hat{p}\}^{-1},$

where $A^*$ is the observed empirical cross-covariance

$A^* = \frac{1}{T-1} \sum_{t=1}^{T-1} Z_t Z_{t+1}^\top.$

For this estimator, we provide an upper bound, again with a restriction to some low-dimensional subspaces.
Theorem 3.6.
Under the conditions of Theorem 3.5, it holds with probability at least $1 - e^{-u}$ that

$|||P(\hat{A} - A) Q|||_{op} \leq C |||S|||_{op} \left( \sqrt{\frac{(M_1 \vee M_2)(\log N + u)}{T p_{\min}}} \vee \frac{\sqrt{M_1 M_2}\, (\log N + u) \log T}{T p_{\min}} \right),$

where $C = C(\gamma, L)$ only depends on $\gamma$ and $L$. We postpone the proof to Section A.
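The two plug-in estimators of this subsection are easy to compute directly from the masked sample. The snippet below is a sketch under our own conventions (missing entries of $Z_t$ are stored as zeros, matching how the frequencies $\hat{p}_i$ are estimated in the text); it is not the authors' code.

```python
import numpy as np

def missing_data_moments(Z):
    """Z: (T, N) array of masked observations, zeros where Y_it is unobserved.
    Returns (p_hat, Sigma_hat, A_hat) as defined in Section 3.3."""
    T, N = Z.shape
    p_hat = (Z != 0).mean(axis=0)                      # hat p_i = frequency of observed entries
    # Observed empirical covariance and lag-1 cross-covariance.
    Sigma_star = Z.T @ Z / T
    A_star = Z[:-1].T @ Z[1:] / (T - 1)
    # Bias correction: E Sigma*_ii = p_i Sigma_ii, E Sigma*_ij = p_i p_j Sigma_ij (i != j).
    inv_p = 1.0 / p_hat
    Sigma_hat = np.diag(inv_p) @ (Sigma_star - np.diag(np.diag(Sigma_star))) @ np.diag(inv_p) \
                + np.diag(inv_p) @ np.diag(np.diag(Sigma_star))
    A_hat = np.diag(inv_p) @ A_star @ np.diag(inv_p)
    return p_hat, Sigma_hat, A_hat

# Usage with a toy masked sample:
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 10))
mask = rng.random(size=Y.shape) < 0.7                  # each entry observed with probability 0.7
p_hat, Sigma_hat, A_hat = missing_data_moments(Y * mask)
```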
In order to estimate the matrix $\Theta = Z_C V^\top$, we need to estimate both $C$ and $V$ simultaneously. Suppose that we have some clustering $C$ at hand and we aim to estimate the corresponding $V$. The mean squared loss from the fully observed sample is

$R^*(V; C) = \frac{1}{2(T-1)} \sum_{t=1}^{T-1} \|Y_{t+1} - Z_C V^\top Y_t\|^2 = \frac{1}{2} \mathrm{Tr}(V^\top \tilde{\Sigma} V) - \mathrm{Tr}(V^\top \tilde{A} Z_C) + \frac{1}{2(T-1)} \sum_{t=1}^{T-1} \|Y_{t+1}\|^2,$   (3.7)

where we used the fact that $Z_C^\top Z_C = I_K$ and the cyclic invariance of the trace, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$. Here, we also denote

$\tilde{\Sigma} = \frac{1}{T-1} \sum_{t=1}^{T-1} Y_t Y_t^\top, \qquad \tilde{A} = \frac{1}{T-1} \sum_{t=1}^{T-1} Y_t Y_{t+1}^\top,$

the empirical covariance and the empirical lag-1 covariance built on a sample $Y_1, \ldots, Y_T$, respectively, which we observe only partially. In reality, the feasible estimators are $\hat{\Sigma}$ and $\hat{A}$, which we introduced in the previous section. A natural solution is to plug these estimators into the expression (3.7) instead of the unobserved $\tilde{\Sigma}$ and $\tilde{A}$. The last term $(2(T-1))^{-1} \sum_{t=1}^{T-1} \|Y_{t+1}\|^2$ does not depend on the parameters $C$ and $V$ at all; therefore, we drop it and consider the empirical risk

$R(V; C) = \frac{1}{2} \mathrm{Tr}(V^\top \hat{\Sigma} V) - \mathrm{Tr}(V^\top \hat{A} Z_C).$

In particular, it is not hard to derive from Theorems 3.5 and 3.6 that for any fixed pair $C$, $V$ the values of $R(V; C)$ and $R^*(V; C) - (2(T-1))^{-1} \sum_{t=1}^{T-1} \|Y_{t+1}\|^2$ are close with high probability.

As we are searching for a sparse matrix $V$, we additionally impose a LASSO regularization and end up with the following convex optimization,

$\hat{V}_{C, \lambda} = \arg\min_V R_\lambda(V; C), \qquad R_\lambda(V; C) = R(V; C) + \lambda \|V\|_{1,1} = \frac{1}{2} \mathrm{Tr}(V^\top \hat{\Sigma} V) - \mathrm{Tr}(V^\top \hat{A} Z_C) + \lambda \|V\|_{1,1},$

where $\|V\|_{1,1} = \sum_{ij} |v_{ij}|$ and the tuning parameter $\lambda > 0$ depends on the dimension $N$ and the number of observations $T$. Concerning this minimization problem, we have the following observations:

• the problem reduces to simple quadratic programming and therefore can be efficiently solved;

• since $\|V\|_{1,1} = \sum_{j=1}^{K} \|v_j\|_1$, we can rewrite

$R_\lambda(V; C) = \frac{1}{2} \mathrm{Tr}(V^\top \hat{\Sigma} V) - \mathrm{Tr}(V^\top \hat{A} Z_C) + \lambda \|V\|_{1,1} = \sum_{j=1}^{K} \left\{ \frac{1}{2} v_j^\top \hat{\Sigma} v_j - v_j^\top \hat{A} z_j + \lambda \|v_j\|_1 \right\}.$

Therefore, we need to solve $K$ independent problems of size $N$, which reduces the computational complexity, and these problems may be solved in parallel.

Ideally, we want to solve the following problem (note that the number of clusters $K$ and the tuning parameter $\lambda$ are fixed)

$F_\lambda(C) \to \min_C, \qquad F_\lambda(C) = \min_V R_\lambda(V; C).$   (3.8)

We can employ a simple greedy procedure. In the beginning, we initialize $C^{(0)} = (l_1, \ldots, l_N)$ randomly; each label takes values $1, \ldots, K$. Then, at a step $t$, we try to change the one label of a node that reduces the risk the most; in other words, we try all the clusterings in the nearest vicinity of the current solution $C^{(t)}$, i.e.,

$C^{(t+1)} = \arg\min_{d(C, C^{(t)}) \leq 1} F_\lambda(C).$

At each such step, we would need to calculate $F_\lambda(C)$ for $O\{N(K-1)\}$ different candidates.

Remark 3.4. In general, it is impossible to optimize an arbitrary function $f(C)$ with respect to a clustering. The $K$-means problem is well known to be NP-hard; however, approximate solutions are widely used in practice, see Shindler et al. (2011) and Likas et al. (2003).

Instead, we use an alternating procedure. We initialize a random clustering $C^{(0)}$ and compute the LASSO solution $V^{(0)} = \hat{V}_{C^{(0)}, \lambda}$. When updating the clustering, we fix the matrix $V = V^{(t)}$ and solve the problem

$R_\lambda(V; C) = \frac{1}{2} \mathrm{Tr}(V^\top \hat{\Sigma} V) - \mathrm{Tr}(V^\top \hat{A} Z_C) + \lambda \|V\|_{1,1} \to \min_C,$

where only the term $-\mathrm{Tr}(V^\top \hat{A} Z_C)$ depends on $C$. Minimizing by conducting a few steps of the greedy procedure, we obtain the next clustering update $C^{(t+1)}$. Then, we again update the $V$-factor by setting $V^{(t+1)} = \hat{V}_{C^{(t+1)}, \lambda}$. We continue until the clustering does not change or the number of iterations exceeds a specified limit. The pseudo-code in Algorithm 1 summarizes this procedure.

Result: a pair $(\hat{C}, \hat{V})$
  initialize $C^{(0)} = (l_1^{(0)}, \ldots, l_N^{(0)})$ randomly; $t \leftarrow 0$
  while $t <$ max_iter do
    update $\hat{V}^{(t)} \leftarrow \arg\min_V R_\lambda(V; C^{(t)})$
    for $i = 1, \ldots, N$ do
      for $l = 1, \ldots, K$ do
        consider the candidate $C' = (l_1^{(t)}, \ldots, l_{i-1}^{(t)}, l, l_{i+1}^{(t)}, \ldots, l_N^{(t)})$
        $r_{il} \leftarrow -\mathrm{Tr}\big((V^{(t)})^\top \hat{A} Z_{C'}\big)$
      end
    end
    $(i^*, l^*) \leftarrow \arg\min_{i,l} r_{il}$
    update $C^{(t+1)} \leftarrow (l_1^{(t)}, \ldots, l_{i^*-1}^{(t)}, l^*, l_{i^*+1}^{(t)}, \ldots, l_N^{(t)})$
    if $C^{(t+1)} = C^{(t)}$ then return $(C^{(t)}, V^{(t)})$ else $t \leftarrow t + 1$
  end

Algorithm 1: Alternating greedy clustering procedure.
In this section, we show the existence of a locally optimal solution in the neighborhood of the true parameter with high probability. We call a clustering solution $\hat{C}$ locally optimal if the functional $F_\lambda(\cdot)$ in (3.8) attains its minimum value at the point $\hat{C}$ among its nearest neighbours $d(C, \hat{C}) \leq 1$. In particular, Algorithm 1 stops at such a solution.
Conditions
Here we describe the conditions that we need for the consistency result. The first condition summarizes the requirements of Theorems 3.5 and 3.6.
Assumption 1.
There is some $\Theta^* \in R^{N \times N}$ such that $|||\Theta^*|||_{op} \leq \gamma$ for some $\gamma < 1$, and the time series $Y_t$ follows (3.3). The innovations $W_t$ are independent with $E W_t = 0$ and $\mathrm{Var}(W_t) = S$. Moreover, each $W_t$ is $L$-sub-Gaussian.

The next assumption concerns the structure of the true operator $\Theta^*$ described in Section 3.2.

Assumption 2.
The true VAR operator admits a decomposition with a $K$-clustering $C^*$,

$\Theta^* = Z_{C^*} (V^*)^\top,$

and meets the following conditions:

1. $|||\Theta^*|||_{op} = |||V^*|||_{op} \leq \gamma < 1$ for some constant $\gamma \in (0, 1)$;

2. cluster separation:

$\sigma_{\min}\big([V^*]^\top \Sigma V^*\big) \geq a$   (3.9)

for some $a > 0$;

3. sparsity: for every $j = 1, \ldots, K$ the active set $\Lambda_j = \mathrm{supp}(v_j^*)$ satisfies $|\Lambda_j| \leq s$;

4. active coefficients separated from zero: there is $\tau > 0$ such that

$|v_{ij}^*| \geq \tau s^{-1/2}, \quad i \in \Lambda_j, \; j = 1, \ldots, K.$   (3.10)

Here each $v_j^*$ with $\|v_j^*\| \leq 1$ has at most $s$ nonzero values, hence the normalization;

5. significant cluster sizes: for some $\alpha \in (0, 1)$ it holds that

$\frac{\min_j |C_j^*|}{\max_j |C_j^*|} \geq \alpha.$

Notice that condition (3.9) corresponds to an appropriate separation of the clusters, i.e., each $v_j^*$ is far enough from a linear combination of the rest. Another assumption imposes conditions on the population covariance $\Sigma$.

Assumption 3.
The covariance of $Y_t$ reads as

$\Sigma = \sum_{k=0}^{\infty} (\Theta^*)^k S [(\Theta^*)^k]^\top,$

where $S = \mathrm{Var}(W_t)$, and it is assumed that

1. bounded operator norm: $|||\Sigma|||_{op} \leq \sigma_{\max}$;

2. restricted least eigenvalue: $\sigma_{\min}(\Sigma_{\Lambda_j, \Lambda_j}) \geq \sigma_{\min}$, $j = 1, \ldots, K$.

Note that we do not require that the smallest eigenvalue of $\Sigma$ is bounded away from zero, only those corresponding to the small subsets of indices are. Such an assumption is not too restrictive. In fact, $\Sigma^{-1}_{\Lambda_j, \Lambda_j}$ would correspond to the Fisher information if we were estimating the vector $v_j$ knowing the cluster $C_j^*$ and the sparsity pattern $\Lambda_j$ in advance. In what follows, we assume that the ratio $\sigma_{\max} / \sigma_{\min} \leq \kappa$ is bounded by some constant $\kappa \geq 1$. Additionally, we treat the values $L$, $\gamma$, $a$, $\tau$, and $\alpha$ as constants. Below we focus on to what extent the relationship between $N, T, s, K$, and the probabilities of observation $p_i$, $i = 1, \ldots, N$, allows consistent estimation of the parameter $\Theta$.

Finally, we present the assumption that allows controlling the exact recovery of the sparsity patterns for the LASSO estimator.

Assumption 4.
For every $j = 1, \ldots, K$ it holds that

$\|\Sigma_{\Lambda_j^c, \Lambda_j} \Sigma^{-1}_{\Lambda_j, \Lambda_j}\|_{1, \infty} \leq \frac{1}{4}.$

Recall that $\Lambda^c$ is the complement of $\Lambda \subset [N]$ in $[N]$.

Remark 3.5. Zhao and Yu (2006) call the inequality $\|\Sigma_{\Lambda_j^c, \Lambda_j} \Sigma^{-1}_{\Lambda_j, \Lambda_j}\|_{1, \infty} < \eta$ with constant $\eta \in (0, 1)$ the strong Irrepresentable Condition. To avoid technical burden, we pick the concrete constant $\eta = 1/4$. In a special case with fixed design and no noise, Tropp (2006) shows that an inequality of this form guarantees exact recovery of the support of $v_j$. In Section B, we show a straightforward extension of Tropp's sparsity recovery results to the case with random design and missing observations.

We are now ready to state our main theorem.

Theorem 3.7.
Suppose that Assumptions 1-4 hold. There are constants c, C > thatdepend on L, γ such that the following holds. Suppose, (cid:115) sn ∗ log NT p (cid:95) (cid:115) s log N log TT p ≤ c, (3.11) where n ∗ = max j ≤ K | C ∗ j | and, additionally, N ≥ ( Cα ∨ κ ) K . Then, with probability atleast − /N for any λ in the range Cσ max (cid:115) log NT p ≤ λ ≤ c (cid:110) κ − ( a /σ max ) K − s − (cid:94) σ min τ s − (cid:111) , (3.12) and, additionally, λ ≥ Cα K/N , there is a locally optimal solution ˆ C satisfying ||| Z ˆ C ˆ V (cid:62) ˆ C ,λ − Θ ∗ ||| F ≤ (cid:40) σ − √ Ks + Cγa (cid:18) σ max σ min (cid:19) K √ s (cid:41) λ . Moreover, the exact support recovery takes place, i.e., supp( ˆ V ˆ C ,λ ) = supp( V ∗ ) . Remark 3.6.
In the above theorem we only show the existence of a local minimumof the functional F λ ( C ) defined in (3.8) near the true clustering C ∗ and, in addition,the statistical properties of the corresponding estimator ˆΘ λ . This is not uncommon in . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov λ gives greater erroronce it is in the required range. This comes naturally, as the result is based on the exactrecovery, see e.g., Tropp (2006). Ideally, we want to choose the smallest available value, λ ∗ = Cσ max (cid:115) log NT p . (3.13)In this case, the error of the estimator reads as ||| ˆΘ λ ∗ − Θ ∗ ||| F ≤ C (cid:48) K (cid:115) s log NT p , where C (cid:48) does not depend on N, T, K, s . Notice that in a hypothetical situation wherethe clustering C ∗ is known precisely, we only need to estimate the matrix V that consistsof at most Ks non-zero parameters. Therefore, according to Lemma 7.7, the LASSOestimator must give us ||| Z C ∗ ˆ V (cid:62)C ∗ ,λ ∗ − Θ ∗ ||| F = ||| ˆ V C ∗ ,λ ∗ − V ∗ ||| F ≤ C (cid:48) (cid:115) Ks log NT p , where we used the fact that Z C ∗ has orthonormal columns; see also Melnyk and Banerjee(2016) and Han et al. (2015). We may say in a loose way that not knowing the exactclustering provides an estimator that is at most √ K times worse.Let us take a closer look at condition (3.11). Under the cluster size restriction fromAssumption 2, we have that all clusters have the size of order N/K , since α NK ≤ | C ∗ j | ≤ α − NK , j = 1 , . . . , K.
Therefore, if we ignore missing observations, we only need( sN/K ) log NT ≤ c, (3.14)with some constant c depending on α , enabling the estimation toward the parameters.So, once K is large enough, the estimator works with the corresponding error. Noticethat the (cid:96) -regularisation alone requires the number of the observations to be at leastthe number of edges times log N , see Fan et al. (2009). In our setting, the number ofconnections is up to N s , hence such a condition reads as (cid:114) sN log NT ≤ . Therefore, the SONIC model is an improvement in this regard. Finally, we point outthat the conditions of Theorem 3.7 imply some limitations on the size of the network . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov N s log NT p ≤ c, where c > L , γ , a , τ , and α . Though we donot state that this condition is necesarry, it is clear that in some cases the estimation ispossible even when N > T . The theoretical properties of the SONIC model and the developed theorems can be furthersupported via simulation. We check the discussed theorems and properties via relativeestimation errors and cluster errors. We particularly discuss the choice of regularizationparameter and number of clusters before turning to the StockTwits applications.We set up the simulations as follows. Take N = 100 and s = 1, while K will vary inthe range 5 ...
25. For every $K = 5, 10, 15, 20, 25$ we construct the following matrix $\Theta^*$:

• pick clusters $C_j^*$ having approximately the same size $\frac{N}{K} \pm 1$;

• for every $j = 1, \ldots, K$ set $v_j^* = 0.5\, e_j = (0, \ldots, 0.5, \ldots, 0)^\top$, with a single nonzero value at the place $j$, so that $s = 1$;

• by construction we have

$|||\Theta^*|||_{op} = |||V^*|||_{op} = 0.5, \qquad |||\Theta^*|||_F = |||V^*|||_F = 0.5\sqrt{K}.$

As for the sample size, we consider two scenarios:

(a) with $T = 100$ and $p_i = 1$, i.e., no missing observations;

(b) with $T = 400$ and $p_i = 0.5$, i.e., each $Y_{it}$ is observed with probability $0.5$.

We simulate the innovations $W_{-20}, \ldots, W_{-1}, W_0, \ldots, W_T \sim \mathrm{N}(0, I)$ and set

$Y_t = \sum_{k=0}^{20} (\Theta^*)^k W_{t-k}, \quad t = 1, \ldots, T,$

where, due to $0.5^{20} \approx 10^{-6}$, the terms for $k >$
20 can be neglected. In Figure 4.1 we showthe relative error E ||| ˆΘ − Θ ∗ ||| F / ||| Θ ∗ ||| F along the regularization paths for different choicesof K . Picking the best λ , we show the relative error against the number of clusters inFigure 4.2. We also show that the clustering error E d ( ˆ C , C ∗ ) in Figure 4.3 is subject tothe choice of K . All expectations are estimated based on 20 independent simulations.Evidently, within the considered range of cluster numbers, larger ones lead to a smallerrelative error as well as smaller clustering error. The simulations partially confirm thediscussion in the end of the previous section, namely, that the conditions of Theorem 3.7can be met when K is large enough, although not too large. In addition, we can see that . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (a) T = 100 and p min = 1. (b) T = 400 and p min = 0 . Figure 4.1: Expected relative loss E ||| ˆΘ − Θ ∗ ||| F ||| Θ ∗ ||| F for different λ . N = 100, and K =5 , , , ,
25. SoNIC simulation study (a) T = 100 and p min = 1. (b) T = 400 and p min = 0 . Figure 4.2: Expected relative loss E ||| ˆΘ − Θ ∗ ||| F ||| Θ ∗ ||| F for optimal λ , N = 100, and K = 5 , ..., T = 100 and p i = 1, and the graphs for the scenario(b) with T = 400 and p i = 0 . λ inFigure 4.2. This is consistent with the results of Section 3.3 and with the Theorem 3.7,where the value T p plays the role of the effective number of observations. λ It is often suggested to use the regularisation λ = σ (cid:112) log N /T in the LASSO literature,where σ stands for the noise level (Belloni and Chernozhukov, 2013; Van de Geer, 2008;Bickel et al., 2009; Van de Geer et al., 2014). In the example above, we have σ = . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (a) T = 100 and p min = 1. (b) T = 400 and p min = 0 . Figure 4.3: Expected clustering error E d ( ˆ C , C ∗ ) for optimal λ , N = 100, and K = 5 , ..., (a) T = 100 and p min = 1. (b) T = 400 and p min = 0 . Figure 4.4: The optimal value of λ for N = 100 and K = 5 , ...,
25. The red line corre-sponds to the value λ = (cid:113) log NT p . SoNIC simulation study1. In our case of missing observations, the value T must be replaced by T p , theeffective number of observations. Furthermore, Wang and Samworth (2018) recommendto disregard multiplicative constants that appear in theory in front of σ (cid:113) log N / ( T p )(see equation (3.13)) since it leads to consistent, but rather conservative estimation.The simulation results support this choice. Let us take a look at the regularisationpaths in Figure 4.1 for different values of K . All of the graphs that we show exhibit similarbehavior: with λ increasing, the evaluated expected relative loss drops until it reaches itsminimum, then it starts to increase until it reaches the constant value that correspondsto ˆΘ λ = 0, which obviously happens once the regularization is big enough. Typically, the“oracle” choice corresponds to the minimizer of the expected loss E ||| ˆΘ λ − Θ ∗ ||| F . In order . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov K = 5 , ...,
25, we pick thetuning parameter (among the available choices on the graph) that delivers the minimumto the evaluated expected loss. In Figure 4.4 we show the values of the best λ for each K = 5 , ...,
25 (blue line) and compare it to the heuristic value (cid:113) log
NT p (red line). Weobserve that once the number of clusters is large enough ( K ≥ λ approximately equals to (cid:113) log NT p . On the other hand, as the graph inFigure 4.3 suggests, for K ≤
10 the number of nodes assigned to a wrong cluster growssignificantly, and one cannot estimate the model with any given regularization parameter.
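For reference, the data-generating process used in this section, together with the kind of plug-in regularization level discussed above, can be sketched as follows. The cluster construction and the truncation at lag 20 follow the description at the beginning of the section; the function and variable names are ours, and the final tuning-parameter line is only the heuristic, not a prescription from the paper.

```python
import numpy as np

def make_theta(N, K, rng):
    """Theta* = Z_C V^T with clusters of size ~N/K and v*_j = 0.5 e_j (s = 1)."""
    labels = np.sort(rng.permutation(N) % K)             # roughly equal cluster sizes
    Z = np.zeros((N, K))
    for j in range(K):
        members = np.flatnonzero(labels == j)
        Z[members, j] = 1.0 / np.sqrt(len(members))
    V = np.zeros((N, K))
    V[np.arange(K), np.arange(K)] = 0.5                   # v*_j = 0.5 e_j
    return Z @ V.T, labels

def simulate(Theta, T, p_obs, rng, burn=20):
    """Y_t = sum_{k<=burn} Theta^k W_{t-k}, then Bernoulli(p_obs) masking."""
    N = Theta.shape[0]
    W = rng.standard_normal((T + burn, N))
    powers = [np.linalg.matrix_power(Theta, k) for k in range(burn + 1)]
    Y = np.array([sum(powers[k] @ W[t - k] for k in range(burn + 1))
                  for t in range(burn, T + burn)])
    mask = rng.random((T, N)) < p_obs
    return Y * mask                                        # zeros where unobserved

rng = np.random.default_rng(1)
Theta_star, labels = make_theta(N=100, K=10, rng=rng)
Z = simulate(Theta_star, T=400, p_obs=0.5, rng=rng)

# Heuristic tuning parameter in the spirit of Section 4.1: sigma_hat * sqrt(log N / (T * p_bar)),
# with sigma_hat taken from the spectrum of the missing-data covariance estimator (Remark 4.1).
p_bar = (Z != 0).mean()
```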
Remark 4.1.
In practice, one must evaluate the noise level σ in a data-driven way(Belloni and Chernozhukov, 2013). We suggest to evaluate it using the spectrum ofthe covariance estimator ˆΣ. One obvious choice can be ˆ σ = (cid:107) ˆΣ (cid:107) . However, this maylead to an overestimated noise level. We suggest using the following strategy. SinceΣ = Θ ∗ Σ(Θ ∗ ) (cid:62) + S , we expect the original covariance to have either K or K − S = σI . We thereforesuggest using the singular value ˆ σ = σ K ( ˆΣ), which means that we skip the first K − λ = σ K ( ˆΣ) (cid:115) log NT p . In the next section, we stick to this strategy. K via stability analysis In the simulation study above we fixed a priori the number of clusters. When applyingSONIC to empirical data, this is rarely the case. One possible way to decide the number K is to analyze the stability of the clustering algorithm (Rakhlin and Caponnetto, 2007;Le Gouic and Paris, 2018). The idea is that if we guess the number of clusters correctly,then on different subsamples we should get similar results. On the other hand, if ourguess is wrong, we can end up with randomly split or glued clusters. In other words,the resulting clustering will be unstable with respect to the change of the sample. Wetherefore propose the following procedure. Consider a sequence of intervals I , . . . , I l ⊂{ , . . . , T } of the same length and let us estimate the clusterings ˆ C I j using the observations( Y t ) t ∈ I j for each j = 1 , . . . , l . If the number of clusters is correct, we expect that thepairwise distances ˆ C j are small. We take l = 6 intervals of length 3 T / ±
1, each of theform I j = (cid:20) j − T + 1 , j + 1420 T (cid:21) , j = 1 , . . . , , (4.1)so that we include all available observations. We then calculate the distances d ( ˆ C I , ˆ C I j )for each j = 2 , . . . , l and for different choices of K . We suggest to choose the number ofclusters that has small distances d ( ˆ C I , ˆ C I j ) when compared to the total number of nodesin the network.We demonstrate how the picture can look in the following simulation scenarios:(a) N = 100, K = 2, p min = 1, and T = 100 , , , , N = 100, K = 2, p min = 0 .
5, and T = 100 , , , , . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov N = 100, K = 5, p min = 1, and T = 100 , , , , N = 100, K = 5, p min = 0 .
5, and T = 100 , , , , T = 100 , , , , K = 2 and p min = 1, at first we do not see any stability. Although,the clustering errors corresponding to the correct guess K = 2 may be smaller, theyare still rather large when compared to the total number of nodes. Only for T = 2000the clustering distances become small (up to 4), and we can clearly see that there isonly two clusters. Figure 4.5b shows the results for K = 2 and p min = 0 .
5. Since theeffective number of observations is
T p , the considered numbers of observations arenot enough in this case. Figure 4.5c shows the results for K = 5 and p min = 1. Here,we can see stable estimation of the clustering for T = 1000 , K = 2. Notice that in thecase T = 2000, the correct case K = 5 shows the smallest distance between clusteringsobtained from different windows. For K = 6 it is still rather small, but the choice isincorrect. Figure 4.5d shows the results for K = 5 and p min = 0 .
5. Effectively, thenumber of observations reduces by four times, and we can see the similarity between thegraph for T = 2000 and for T = 500 in Figure 4.5c, as well as somewhat resemblancebetween T = 1000 in Figure 4.5d and T = 200 in Figure 4.5c. We can see that none of thegraphs in Figure 4.5d demonstrates stability due to the lack of simulated observations.In conclusion, we suggest to look for the smallest number of clusters that shows a“reasonably” small clustering difference for different windows, in the sense that it is muchsmaller than the total amount of nodes. However, at this point we are not able to provideany statistical explanation of what is a “reasonable” clustering distance. The stabilityanalysis we suggest should be used as a qualitative heuristic. Here we present the applicability of SONIC to the dataset described in Section 2. We lookat the two (AAPL and BTC) networks comprising of users’ sentiment time series. Thesetwo symbols, representing the most popular security and cryptocurrency respectively,may reveal disparate characteristics, thereby distinct network dynamics featured withdifferent communities and influencers.To ensure that the model is applicable in the real world, we require that the ob-servations are persistent with the same probability p i over the considered time period.Moreover, since in Theorems 3.5 and 3.6 the amount of observations scales with the factor p , we need to avoid the users whose p i is too small. We propose the following criteriain sample selection to account for missing observations.1. pick users with estimated probability ˆ p i ≥ . p i ≥ . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (a) N = 100, K = 2, and p min = 1.(b) N = 100, K = 2, and p min = 0 . N = 100, K = 5, and p min = 1.(d) N = 100, K = 5 and p min = 0 . Figure 4.5: Analysis of stability for four different scenarios and T =100 , , , , T , each point represents one of the five distances d ( ˆ C I , ˆ C I j ), where the clusterings are estimated based on the moving window (4.1). Onthe x -axis we have different guesses for the number of clusters 2 , ...,
15, the y -axis repre-sent the clustering distance. SoNIC stability simulation . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov λ = 0 .
08 according toSection 4.1 and Remark 4.1. As for the number of clusters, we perform the analysisdescribed in Section 4.2 and present the results in Figure 5.2a for K = 2 , , , ,
6. Basedon these results we suggest to pick K = 2 with maximum clustering distance 3 outof 36 users in total. We present a heatmap visualization for the estimated matrix ˆΘin Figure 5.1a, where we identify the candidates of influencers with the identificationnumbers 619769, 850976, 5, 962572, 526780, 473512. To parallel our identification withthe indicators from social conventions in terms of what ought to possess as influencerse.g. the number of followers, we analyze the social network profiles of selected usersincluding the register date of membership, the number of followers, the number of ideas,liked count, etc.To retrieve users’ social profiles, we use the StockTwits API toolkit to request theusers’ message streams and profiles.We stratify the retrieved data and particularly focuson the number of followers, the number of ideas, and the liked count, in hopes of theseselected characteristics to comply with the social consensus in terms of the notion ofinfluencers. Table 2 summarizes influencers’ social profile and reports the correspondingpercentile rank among a pool of users.The identified users appear to either attract many followers or behave actively, pro-vided with tremendous ideas (posts) or liked count. The first three influencers representthe trading companies offering technical and fundamental analysis for the symbols ofinterest. It shows that investment companies or financial industry entrepreneurs targettheir potential customers appearing on social media and influence them strategically.The latter two are financial analysts or trading consultants, and they may serve for asmall group of users.As to the BTC dataset, applying the proposed strategy, we end up with λ = 0 .
21 and,using the results in Figure 5.2b, we choose K = 2. Figure 5.1b displays the estimatedmatrix ˆΘ and identifies the influencers 398367 and 969971. Likewise, we elicit theirsocial profile data and document the relevant features in Table 2. The first one is aninvestment company with a specialization on crypto assets, while the second one is acrypto specialist updating price information and producing the technical analytics tocryptocurrency traders. Both broadcast tactical trading information and update thesefrequently.We notice that for anyone relying on these social characteristics may oversimplify thetask of identifying influencers. One should be aware that some users with much morefollowers or ideas may not be able to surpass those being identified via our approach interms of opinions’ importance. In the case of BTC, those who have specialized themselvesin crypto-assets may lend themselves to serve a relatively smaller group of people withspecific trading preference, albeit not attracting granular followers. Prediction performance compared with other methods
Prediction performance compared with other methods

To highlight the advantages of the proposed model, we compare the prediction accuracy of our method with other benchmarks. We consider the following prevalent benchmarks, each of which accounts for missing observations (a small numerical sketch of these baselines follows the list):

• VAR with missing observations:
$$\hat\Theta = \arg\min_{\Theta \in \mathbb{R}^{N \times N}} \tfrac{1}{2}\,\mathrm{Tr}(\Theta \hat\Sigma \Theta^{\top}) - \mathrm{Tr}(\Theta \hat A),$$
where $\hat\Sigma$ and $\hat A$ are the covariance and cross-covariance estimators, respectively (recall the definitions from Section 3.3);

• Lasso VAR with missing observations:
$$\hat\Theta = \arg\min_{\Theta \in \mathbb{R}^{N \times N}} \tfrac{1}{2}\,\mathrm{Tr}(\Theta \hat\Sigma \Theta^{\top}) - \mathrm{Tr}(\Theta \hat A) + \lambda \|\Theta\|_{1,1},$$
where we choose the same $\lambda$ as in SONIC;

• Constant estimator $\hat\Theta = 0$, which corresponds to no correlation across time.
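For illustration, the first problem has the closed-form solution $\hat\Theta = \hat A^{\top}\hat\Sigma^{-1}$ (when $\hat\Sigma$ is invertible, which may fail when T is small relative to N — one reason the regularized variants matter), and the ℓ1-penalized problem can be handled with a few proximal-gradient (ISTA) steps. The sketch below makes standard choices (zero initialization, step size $1/\|\hat\Sigma\|_{\mathrm{op}}$) that are ours, not necessarily those used in the paper.

```python
import numpy as np

def var_missing(Sigma_hat: np.ndarray, A_hat: np.ndarray) -> np.ndarray:
    """Minimizer of 1/2 Tr(T S T') - Tr(T A): Theta = A' S^{-1}."""
    return np.linalg.solve(Sigma_hat, A_hat).T        # solve S X = A, return X'

def soft_threshold(M: np.ndarray, tau: float) -> np.ndarray:
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lasso_var_missing(Sigma_hat, A_hat, lam, n_iter=500):
    """ISTA for 1/2 Tr(T S T') - Tr(T A) + lam * ||T||_{1,1}."""
    N = Sigma_hat.shape[0]
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat).max()  # 1 / Lipschitz constant of the gradient
    Theta = np.zeros((N, N))
    for _ in range(n_iter):
        grad = Theta @ Sigma_hat - A_hat.T            # gradient of the smooth part
        Theta = soft_threshold(Theta - step * grad, step * lam)
    return Theta
```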
In our exercise, we split the available sample — 82 weeks for AAPL and 78 weeks for BTC — into train and test subsamples of approximately 70% and 30%. When measuring the prediction error on data with missing observations we run into the same problem as in estimation. Ideally, we would like to evaluate
$$\frac{1}{T_{\mathrm{test}} - 1} \sum_{t=2}^{T_{\mathrm{test}}} \|Y_t - \hat\Theta Y_{t-1}\|^2,$$
where $T_{\mathrm{test}}$ is the number of observations in the test sample and $\hat\Theta$ is estimated on the training sample. Observe that (similarly to (3.7)),
$$\frac{1}{T_{\mathrm{test}} - 1} \sum_{t=2}^{T_{\mathrm{test}}} \|Y_t - \hat\Theta Y_{t-1}\|^2 = \mathrm{Tr}\Bigg( \frac{1}{T_{\mathrm{test}} - 1} \sum_{t=2}^{T_{\mathrm{test}}} Y_t Y_t^{\top} - 2\hat\Theta \Bigg[ \frac{1}{T_{\mathrm{test}} - 1} \sum_{t=2}^{T_{\mathrm{test}}} Y_{t-1} Y_t^{\top} \Bigg] + \hat\Theta \Bigg[ \frac{1}{T_{\mathrm{test}} - 1} \sum_{t=1}^{T_{\mathrm{test}} - 1} Y_t Y_t^{\top} \Bigg] \hat\Theta^{\top} \Bigg),$$
so we replace $\frac{1}{T_{\mathrm{test}}-1}\sum_{t=2}^{T_{\mathrm{test}}} Y_t Y_t^{\top}$ and $\frac{1}{T_{\mathrm{test}}-1}\sum_{t=2}^{T_{\mathrm{test}}} Y_{t-1} Y_t^{\top}$ with $\hat\Sigma_{\mathrm{test}}$ and $\hat A_{\mathrm{test}}$, respectively, the covariance and cross-covariance estimators computed from the test sample. To sum up, we evaluate the prediction performance by
$$\mathrm{Tr}(\hat\Sigma_{\mathrm{test}}) - 2\,\mathrm{Tr}(\hat\Theta \hat A_{\mathrm{test}}) + \mathrm{Tr}(\hat\Theta \hat\Sigma_{\mathrm{test}} \hat\Theta^{\top}),$$
as sketched in the code below.

The results are presented in Table 3. We find that, in terms of prediction performance, SONIC is slightly better than the sparse VAR for the AAPL dataset and as good as the sparse VAR for the BTC dataset. The regular VAR blows up in both cases, which is not surprising given the dimension and the sample sizes. The similarity between SONIC and the sparse VAR shows that the number of clusters K = 2 is too small to benefit from our model in terms of performance; notice that condition (3.11) of Theorem 3.7 is likely to break for a small number of clusters. However, the fact that SONIC is not worse than the sparse VAR confirms that the proposed model indeed reflects the dynamics of a real sentiment-based network. In addition, we compare the results with the constant estimator Θ̂ = 0, which corresponds to the no-causality case; in both datasets its loss is higher than that of SONIC.
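A minimal sketch of this out-of-sample evaluation follows. The inverse-probability-weighted covariance and cross-covariance estimators below are a simplified reconstruction of the construction in Appendix A (with the observation probabilities replaced by their estimates); the exact estimators of Section 3.3 are not reproduced here, and all variable names are ours.

```python
import numpy as np

def weighted_cov_estimators(Z: np.ndarray, observed: np.ndarray):
    """Inverse-probability-weighted covariance / cross-covariance estimators.

    Z: T x N panel with missing entries set to zero; observed: T x N boolean mask.
    Off-diagonal entries are rescaled by 1/(p_i p_j), diagonal entries by 1/p_i.
    """
    T, _ = Z.shape
    p_hat = observed.mean(axis=0)                      # estimated observation probabilities
    S_star = Z.T @ Z / T                               # (1/T) sum_t Z_t Z_t'
    A_star = Z[:-1].T @ Z[1:] / (T - 1)                # (1/(T-1)) sum_t Z_{t-1} Z_t'
    inv_p = 1.0 / p_hat
    W = np.outer(inv_p, inv_p)
    Sigma_hat = S_star * W
    np.fill_diagonal(Sigma_hat, np.diag(S_star) * inv_p)
    return Sigma_hat, A_star * W

def prediction_loss(Theta_hat, Sigma_test, A_test):
    """Tr(Sigma_test) - 2 Tr(Theta A_test) + Tr(Theta Sigma_test Theta')."""
    return (np.trace(Sigma_test)
            - 2.0 * np.trace(Theta_hat @ A_test)
            + np.trace(Theta_hat @ Sigma_test @ Theta_hat.T))
```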
                         AAPL             BTC
SONIC                    2.609            4.332
VAR                      1.302 × 10^{…}   … × 10^{…}
Sparse VAR               2.659            4.332
Θ = 0 (no causality)     5.719            8.995

Table 3: Prediction error for SONIC and the alternative methods. SoNIC AAPL BTC benchmark
Interest in the dynamics of user interaction on social media is growing rapidly. Social media have become an attractive venue where users can easily and instantly interact with each other. Research in this strand is, however, challenging. From an econometric point of view, these dynamics require effective state-of-the-art methodologies that cope with the curse of dimensionality and characterize psychological interdependence. From a quantitative perspective, textual analysis distills the text-based information from Twitter or StockTwits into a numerical expression of sentiment or opinion. The joint evolution of the sentiment variables of individual users constitutes a dynamic network of possibly growing dimension.

In order to cope with dimensionality in a limited-observation setting, we propose SONIC (SOcial Network analysis with Influencers and Communities). SONIC characterizes social network dynamics and interdependence through identified influencers and detectable communities. We provide and discuss several theoretical results on the asymptotic consistency of the dynamic network parameters, even when observations are missing, and we propose an estimation procedure based on a greedy algorithm and LASSO regularization that we extensively test in simulations. In the empirical application to the AAPL and BTC sentiment networks, we detect K = 2 communities using stability analysis and subsequently identify the influencers. We discuss the choice of the LASSO regularization parameter λ and the choice of the number of clusters.

This section is devoted to the proof of Theorem 3.7. We start with some preliminary lemmata and then proceed with the proof, which consists of several steps. Following the ideas in Gribonval et al. (2015), the proof relies on an explicit representation of the loss function.

We use the following simplified notation. Denote by $z^*_j = z_{C^*_j}$ the columns of $Z^* = Z_{C^*}$, and write $n^*_j = |C^*_j|$ for every $j = 1, \dots, K$. When the clustering $C = (C_1, \dots, C_K)$ is clear from the context, we also write $Z$ for $Z_C$, $z_j$ for $z_{C_j}$, and $n_j = |C_j|$ for every $j = 1, \dots, K$.

Lemma 7.1.
Suppose that C j is such that (cid:107) z C j − z ∗ j (cid:107) ≤ . . Then, . | C ∗ j | ≤ | C j | ≤ . | C ∗ j | . Proof.
Suppose, n j = | C j | > n ∗ j = | C ∗ j | , then r = (cid:107) z j − z ∗ j (cid:107) = 2 − (cid:113) n j n ∗ j | C j ∩ C ∗ j | ≥ − (cid:115) n ∗ j n j , since | C j ∩ C ∗ j | ≤ n ∗ j . Thus, √ n j − (cid:113) n ∗ j ≤ ( r / √ n j , which due to r ≤ . n j ≤ . n ∗ j .If n j < n ∗ j we have, r ≥ (cid:107) z j − z ∗ j (cid:107) = 2 − | C j ∩ C (cid:48) j | (cid:113) n j n ∗ j ≥ − (cid:115) n j n ∗ j , and the fact that r ≤ . n ∗ j ≤ . n j . Lemma 7.2.
Let (cid:107) z C − z C (cid:107) ≤ . . Then, (cid:107) z C − z C (cid:107) ≤ . (cid:112) N (cid:107) z C − z C (cid:107) . Proof.
Let N j = | C j | and a = | C ∩ C | , b = | C \ C | , c = | C \ C | , so that N = a + b , N = a + c , and | C (cid:52) C | = b + c . We have, (cid:107) z C − z C (cid:107) = (cid:18) √ N − √ N (cid:19) a + bN + cN ≥ bN + cN . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:107) z C − z C (cid:107) = (cid:12)(cid:12)(cid:12)(cid:12) √ N − √ N (cid:12)(cid:12)(cid:12)(cid:12) a + b √ N + c √ N ≤ (cid:12)(cid:12)(cid:12)(cid:12) √ N − √ N (cid:12)(cid:12)(cid:12)(cid:12) a + (cid:112) N ∨ N (cid:107) z C − z C (cid:107) . Since | N − N | ≤ b + c we obviously have, (cid:12)(cid:12)(cid:12)(cid:12) √ N − √ N (cid:12)(cid:12)(cid:12)(cid:12) a = | N − N | a (cid:112) ( a + b )( a + c )( √ a + b + √ a + c ) ≤ ( b + c ) a √ N ∨ N √ a (2 √ a ) ≤ (cid:112) N ∨ N (cid:107) z C − z C (cid:107) / , and it is left to apply Lemma 7.1. Lemma 7.3.
Suppose, min j n ∗ j max j n ∗ j ≥ α for some α ∈ (0 , and let (cid:107) z j − z ∗ j (cid:107) ≤ r . Suppose, r ≤ . . Then, (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ . α − / r . Proof.
1) We first consider the case | C j | = n ∗ j . It holds then[ z ∗ j ] (cid:62) ( z ∗ j − z j ) = 1 n ∗ j ( n ∗ j − | C j ∩ C ∗ j | ) = 1 n ∗ j | C ∗ j \ C j | . Moreover, for every k (cid:54) = j it holds | [ z ∗ k ] (cid:62) ( z ∗ j − z j ) | = | [ z ∗ k ] (cid:62) z j | = 1 (cid:113) n ∗ k n ∗ j | C ∗ k ∩ C j | ≤ α − / n ∗ j | C ∗ k ∩ C j | . Summing up, we get (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ α − / n ∗ j | C ∗ j \ C j | + (cid:88) k (cid:54) = j | C ∗ k ∩ C j | ≤ α − / n ∗ j (cid:0) | C ∗ j \ C j | + | C j \ C ∗ j | (cid:1) = α − / n ∗ j | C j (cid:52) C ∗ j | . It is left to notice that in the case | C j | = | C ∗ j | = n ∗ j we have exactly (cid:107) z j − z ∗ j (cid:107) = n ∗ j | C j (cid:52) C ∗ j | .2) Suppose, n j = | C j | > n ∗ j . Obviously, we can decompose C j = C (cid:48) j ∪ B such that | C (cid:48) j | = n ∗ j and B ∩ C ∗ j = ∅ . Setting z (cid:48) j = z C (cid:48) j we get by the above derivations that (cid:107) [ Z ∗ ] (cid:62) ( z (cid:48) j − z ∗ j ) (cid:107) ≤ α − / (cid:107) z (cid:48) j − z ∗ j (cid:107) . Since C (cid:48) j ∩ C ∗ j = C j ∩ C ∗ j we can compare the . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:107) z j − z ∗ j (cid:107) = 2 − (cid:113) n j n ∗ j | C j ∩ C ∗ j | > − n ∗ j | C j ∩ C ∗ j | = (cid:107) z (cid:48) j − z ∗ j (cid:107) . Taking the remainder b = z j − z (cid:48) j we have, b i = n j − / − ( n ∗ j ) − / , i ∈ C (cid:48) j ,n j − / , i ∈ B, . Setting d = n j − n ∗ j = | B | it is easy to obtain | n j − / − ( n ∗ j ) − / | ≤ dn j √ n ∗ j . Thus, we get K (cid:88) k =1 | [ z ∗ k ] (cid:62) b | ≤ k (cid:88) i =1 (cid:112) n ∗ k dn j (cid:113) n ∗ j | C (cid:48) j ∩ C ∗ k | + | B ∩ C ∗ k | √ n j ≤ α − / dn ∗ j n j | C (cid:48) j | + α − / (cid:113) n ∗ j n j d< α − / d (cid:113) n j n ∗ j . We show that the latter is at most 2 . α − / r . Indeed, it is not hard to show that from n j ≤ . n ∗ j (see Lemma 7.1) it follows n j − n ∗ j (cid:113) n j n ∗ j ≤ . − n ∗ j (cid:113) n j n ∗ j ≤ . × r , thus (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ . α − / r and the result follows.3) The case n j < n ∗ j can be resolved similarly to the previous one. Since | C ∗ j \ C j | ≥ n ∗ j − n j we can pick a subset B ⊂ C ∗ j \ C j of size d = n ∗ j − n j and set C (cid:48) j = B ∪ C j with | C (cid:48) j | = n ∗ j ; set also z (cid:48) j = z C (cid:48) j . Then, we have (cid:107) z (cid:48) j − z ∗ j (cid:107) = 2 − | C (cid:48) j ∩ C ∗ j | n ∗ j ≤ − | C j ∩ C (cid:48) j | (cid:113) n j n ∗ j = (cid:107) z j − z ∗ j (cid:107) . Thus, by the first part of this proof it holds (cid:107) [ Z ∗ ] (cid:62) ( z (cid:48) j − z ∗ j ) (cid:107) ≤ α − / r . Setting b = z (cid:48) j − z j we have, b i = ( n ∗ j ) − / − n j − / , i ∈ C j ,n ∗ j − / , i ∈ B, . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov | n j − / − ( n ∗ j ) − / | ≤ dn ∗ j √ n j we obtain, K (cid:88) k =1 | [ z ∗ k ] (cid:62) b | ≤ k (cid:88) i =1 (cid:112) n ∗ k dn ∗ j √ n j | C j ∩ C ∗ k | + | B ∩ C ∗ k | (cid:113) n ∗ j ≤ α − / d ( n ∗ j ) / n / j | C j | + α − / n ∗ j d< α − / dn ∗ j . It is left to notice that r ≥ − n j (cid:113) n j n ∗ j = 2( (cid:113) n ∗ j − √ n j ) √ n j = 2( n ∗ j − n j ) n ∗ j + (cid:113) n j n ∗ j ≥ d n ∗ j , therefore (cid:107) [ Z ∗ ] (cid:62) b (cid:107) ≤ α − / r , thus (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ α − / r . Lemma 7.4.
Let r = ||| Z C − Z ∗ ||| F and suppose that r ≤ . . Then ||| P C − P C ∗ ||| F ≥ r (1 − α − r ) .Proof. Denote z j = z C j and r j = (cid:107) z j − z ∗ j (cid:107) . It holds, ||| P C − P C ∗ ||| F = 2 K − P C P C ∗ ) = 2 K − (cid:88) j,k ( z (cid:62) j z ∗ k ) . Notice, that 2 z (cid:62) j z ∗ j = 2 − (cid:107) z j (cid:107) − (cid:107) z ∗ j (cid:107) + 2 z (cid:62) j z ∗ j = 2 − (cid:107) z j − z ∗ j (cid:107) , i.e., z (cid:62) j z ∗ j = 1 − r j / − ( z (cid:62) j z ∗ j ) = r j − r j /
4, whereas ([ z ∗ j ] (cid:62) ( z j − z ∗ j )) = r j /
4. Since weadditionally have [ z ∗ k ] (cid:62) ( z j − z ∗ j ) = [ z ∗ k ] (cid:62) z j for k (cid:54) = j , it holds2 K − (cid:88) j,k ( z (cid:62) j z ∗ k ) = 2 (cid:88) j r j − r j / − (cid:88) j (cid:88) k (cid:54) = j (cid:16) [ z ∗ k ] (cid:62) ( z j − z ∗ j ) (cid:17) = 2 r − (cid:88) j,k (cid:16) [ z ∗ k ] (cid:62) ( z j − z ∗ j ) (cid:17) = 2 r − (cid:88) j (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) By Lemma 7.3 we have for every j = 1 , . . . , K (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ . α − / r j , therefore (cid:88) j (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ α − (cid:88) j r j ≤ α − r , thus inequality follows. Lemma 7.5.
Let $C, C'$ be such that $|C \triangle C'| = 1$. Then $\|z_C - z_{C'}\|^2 \le 2/(|C| \vee |C'|)$.

Proof. Suppose $|C'| > |C|$, so that $C' = C \cup \{a\}$. Denoting $n = |C|$, we have
$$\|z_C - z_{C'}\|^2 = n\left(\sqrt{\tfrac{1}{n+1}} - \sqrt{\tfrac{1}{n}}\right)^2 + \frac{1}{n+1} = \frac{(\sqrt{n+1} - \sqrt{n})^2 + 1}{n+1} \le \frac{2}{n+1}.$$

The proof consists of several steps, each represented by a separate lemma.
Lemma 7.6.
Suppose, Assumption 1 holds and let N ≥ . There is a constant C = C ( γ, L ) , so that if max(2 , s log T, n ∗ ) log NT p ≤ , (7.1) then with probability at least − /N and for with ∆ = Cσ max (cid:113) log NT p the followinginequalities take place for every j = 1 , . . . , K (cid:107) ˆ A − A (cid:107) ∞ , ∞ ≤ ∆ , (cid:107) Σ − j , Λ j ( ˆ A Λ j , · − A Λ j , · ) (cid:107) ∞ , ∞ ≤ σ − ∆ ; (7.2) (cid:107) ( ˆ A − A ) z ∗ j (cid:107) ∞ ≤ ∆ , (cid:107) Σ − j , Λ j ( ˆ A Λ j , · − A Λ j , · ) z ∗ j (cid:107) ∞ ≤ σ − ∆ ; (7.3) (cid:107) ˆΣ − Σ (cid:107) ∞ , ∞ ≤ ∆ , (cid:107) ( ˆΣ Λ j , · − Σ Λ j , · ) v ∗ j (cid:107) ∞ ≤ ∆ ; (7.4) (cid:107) Σ − j , Λ j ( ˆΣ Λ j , · − Σ Λ j , · ) v ∗ j (cid:107) ∞ ≤ σ − ∆ ; (7.5) ||| ˆΣ Λ j , Λ j − Σ Λ j , Λ j ||| op ≤ √ s ∆ . (7.6) Proof.
By Theorem 3.6 for any pair a , b ∈ R N with (cid:107) a (cid:107) ≤ (cid:107) b (cid:107) ≤ ≥ − N − m , | a (cid:62) ( ˆ A − A ) b | ≤ Cσ max (cid:40)(cid:115) ( m + 1) log NT p (cid:95) ( m + 1) log N log TT p (cid:41) . Suppose for a moment that m is such that (cid:115) ( m + 1) s log NT p log T = O (1) , (7.7)so that we can neglect the second term. In order to meet the condition (3.6) we also needto have, max { , (cid:107) a (cid:107) , (cid:107) b (cid:107) , (cid:112) (cid:107) a (cid:107) (cid:107) b (cid:107) log T } log(4 N ) + m log NT p ≤ . Set, A = { ( e i , e i (cid:48) ) : i, i (cid:48) ≤ N } , B = { ( e i , z ∗ l ) : i ≤ N, l ≤ K } , . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov j = 1 , . . . , KA j = { ( σ min Σ − j , Λ j e i , e i (cid:48) ) : i ∈ Λ j , i (cid:48) ≤ N } ,B j = { ( σ min Σ − j , Λ j e i , z ∗ l ) : i ∈ Λ j , l ≤ K } . We have | A | ≤ N , | B | ≤ N K and | A j | ≤ sN, | B j | ≤ sK for j = 1 , . . . , N , so since s, K ≤ N together they have not more than 4 N pairs of vectors ( a , b ), each havingnorm bounded by one. In addition, each ( a , b ) ∈ A j has (cid:107) a (cid:107) ≤ s and (cid:107) b (cid:107) = 1, whereaseach (cid:107) a (cid:107) ≤ s , (cid:107) b (cid:107) ≤ n ∗ . In the worst case, we needmax(2 , s, n ∗ , √ n ∗ s log T ) log(4 N ) + m log NT p ≤ . Taking a union bound, we have that the inequalities (7.2) and (7.3) hold with probabilityat least 1 − N − m . By analogy, we can show that (7.4) and (7.5) hold with probabilityat least 1 − N − m .As for the last inequality, for every j = 1 , . . . , K pick P j = (cid:80) i ∈ Λ j e i e (cid:62) i , i.e., projectorsonto the subspace of vectors supported on Λ j . Then by Theorem 3.5 it holds withprobability at least 1 − KN − m for every j = 1 , . . . , K (taking into account (7.7)) ||| ˆΣ Λ j , Λ j − Σ Λ j , Λ j ||| op = ||| P j ( ˆΣ − Σ) P j ||| op ≤ Cσ max (cid:115) s ( m + 1) log NT p . The sparsity condition is satisfied oncemax(2 , s log T ) log(4 N ) + uT p ≤ . The total probability will be at least 1 − N − m − KN − m , which is at least 1 − /N whenever m ≥ N ≥
2, and both sparsity conditions are satisfied for m = 7.In what follows we use the additional notation. For a vector v ∈ R N let sign( v ) ∈{− , , } N denotes the vector consisting of coordinates,sign( v ) j = − , v j < , , v j = 0 , , v j > j = 1 , . . . , N . We write ¯ s j = sign( v ∗ j ) for each j = 1 , . . . , K . In addition, s ∗ j = (¯ s j ) Λ j , which onlyconsists of the values ± j is the support of v ∗ j .In the following, we apply the technique from Gribonval et al. (2015). Suppose thatthe LASSO solution ˆ v j for a given clustering C is not only supported exactly on Λ j , butits signs are matching those of the true v ∗ j . Let s (cid:62) j ∈ {− , , } N be the vector consistingof the signs of coordinates of v ∗ j , i.e. − . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov v ∗ j . Then, (cid:107) ˆ v j (cid:107) = ¯ s (cid:62) j (ˆ v j ) Λ j . Therefore, we can write(ˆ v j ) Λ j = arg min v ∈ R Λ j v (cid:62) ˆΣ Λ j , Λ j v − v (cid:62) ˆ A Λ j , · z j + λ ¯ s (cid:62) j v = ˆΣ − j , Λ j ( ˆ A Λ j , · z j − λ ¯ s j ) , and plugging this solution into the risk function we get that F λ ( C ) = Φ λ ( C ), where thelatter is defined explicitlyΦ λ ( C ) = − K (cid:88) j =1 ( ˆ A Λ j , · z j − λ ¯ s j ) (cid:62) ˆΣ − j , Λ j ( ˆ A Λ j , · z j − λ ¯ s j ) . The next lemma shows that such representation takes place in the local vicinity of thetrue clustering C ∗ . Lemma 7.7.
Suppose, the inequalities (7.2) – (7.6) take place. Assume, s ∆ ≤ / , ≤ λ ≤ σ min τ s − . (7.8) Then, for any C = ( C , . . . , C K ) satisfying max j (cid:107) z C j − z C ∗ j (cid:107) ≤ . ∧ . (cid:114)(cid:16) σ max α − / + √ n ∗ ∆ (cid:17) − λ (7.9) it holds ||| ˆ V λ, C − V ∗ ||| F ≤ σ − √ Ksλ, and the equality F λ ( C ) = Φ λ ( C ) takes place.Proof. Taking into account Z (cid:62) Z = I K , it holds R λ ( V ; C ) = 12 Tr (cid:16) V (cid:62) ˆΣ V (cid:17) − Tr (cid:16) V (cid:62) ˆ AZ (cid:17) + λ (cid:107) V (cid:107) , = K (cid:88) j =1 v (cid:62) j ˆΣ v j − v (cid:62) j ˆ A z j + λ (cid:107) v j (cid:107) , so that the optimization problem separates into K independent subproblems. Solvingeach of the problems 12 v (cid:62) j ˆΣ v j − v (cid:62) j ˆ A z j + λ (cid:107) v j (cid:107) → min v j corresponds to Corollary B.3 with ˆ D = ˆΣ and ˆ c = ˆ A z j , whereas the “true” version ofthe problem corresponds to ¯ D = Σ and ¯ c = A z ∗ j = Σ(Θ ∗ ) (cid:62) z ∗ j = Σ v ∗ j . We need to controlthe differences between ˆ c and ¯ c , and between ˆ D and ¯ D . It holds, (cid:107) ˆ A z j − A z ∗ j (cid:107) ∞ ≤(cid:107) A ( z j − z ∗ j ) (cid:107) ∞ + (cid:107) ( ˆ A − A ) z ∗ j (cid:107) ∞ + (cid:107) ( ˆ A − A )( z j − z ∗ j ) (cid:107) ∞ . Since A = Σ V ∗ [ Z ∗ ] (cid:62) , we bound the first term using Lemma 7.3 (cid:107) A ( z j − z ∗ j ) (cid:107) ∞ ≤ (cid:107) Σ V ∗ (cid:107) ∞ , ∞ (cid:107) [ Z ∗ ] (cid:62) ( z j − z ∗ j ) (cid:107) ≤ . α − / (cid:107) Σ V ∗ (cid:107) ∞ , ∞ r j . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov , whereas the fourth term satisfies (cid:107) ( ˆ A − A )( z j − z ∗ j ) (cid:107) ∞ ≤ (cid:107) ˆ A − A (cid:107) ∞ , ∞ (cid:107) z j − z ∗ j (cid:107) ≤ . √ n ∗ r j , where we also used Lemma 7.2. Summing up,we get, (cid:107) ˆ c − c (cid:107) ∞ ≤ . σ max α − / + (cid:113) n ∗ j ∆ ) r j + ∆ . Similarly, we bound (cid:107) Σ Λ j , Λ j (ˆ c Λ j − ¯ c Λ j ) (cid:107) ∞ as follows (cid:107) Σ − j , Λ j ( ˆ A Λ j , · z j − A Λ j , · z ∗ j ) (cid:107) ∞ ≤(cid:107) Σ − j , Λ j A ( z j − z ∗ j ) (cid:107) ∞ + (cid:107) Σ − j , Λ j ( ˆ A Λ j , · − A Λ j , · ) z ∗ j (cid:107) ∞ + (cid:107) Σ − j , Λ j ( ˆ A Λ j , · − A Λ j , · )( z j − z ∗ j ) (cid:107) ∞ ≤(cid:107) Σ − j , Λ j A ( z j − z ∗ j ) (cid:107) ∞ + 1 . σ − ∆ √ n ∗ r j + σ − ∆ ≤ . σ − (2 σ max α − / + (cid:113) n ∗ j ∆ ) r j + σ − ∆ To sum up, Corollary B.3 is applied with δ c =1 . σ max α − / + √ n ∗ ∆ ) r j + ∆ ,δ (cid:48) c =1 . σ − (2 σ max α − / + √ n ∗ ∆ ) r j + σ − ∆ δ D =∆ , δ (cid:48) D = ∆ , δ (cid:48)(cid:48) D = σ − ∆ . It requires the conditions,3 { . σ max α − / + √ n ∗ ∆ ) r j + 2∆ } ≤ λ, s ∆ ≤ , and due to the fact that (cid:107) D − j , Λ j (cid:107) , ∞ ≤ √ s ||| D − j , Λ j ||| op and Assumption 3.10,2 σ − (1 . σ max α − / + √ n ∗ ∆ ) r j + 2∆ + √ sλ ) < τ s − / , which are not hard to derive from the given inequalities. Together this yields that ˆ v j issupported on Λ j and the solution satisfies(ˆ v j ) Λ j = ˆΣ − j , Λ j (cid:16) ˆ A Λ j , · z j − λ s ∗ j (cid:17) , and the corresponding minimum is equal to12 ˆ v (cid:62) j ˆΣˆ v (cid:62) j − ˆ v (cid:62) j ˆ A z j + λ (ˆ v j ) (cid:62) Λ j s ∗ j = − (cid:16) ˆ A Λ j , · z j − λ s ∗ j (cid:17) (cid:62) ˆΣ − j , Λ j (cid:16) ˆ A Λ j , · z j − λ s ∗ j (cid:17) . Summing up, we get the corresponding expression for F λ ( C ). Moreover, we have (cid:107) ˆ v j − v ∗ j (cid:107) ≤ √ s (cid:110) + 1 . σ max α − + √ n ∗ ∆ ) r j + λ (cid:111) ≤ σ − √ s (cid:18) λ . λ
20 + λ (cid:19) ≤ σ − √ sλ, . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov ||| ˆ V λ, C − V ∗ ||| F .Consider the function,¯Φ λ ( C ) = − k (cid:88) j =1 (cid:0) A Λ j , · z j − λ s ∗ j (cid:1) (cid:62) Σ − j , Λ j (cid:0) A Λ j , · z j − λ s ∗ j (cid:1) . The following lemma shows how this function grows with C retreating from the trueclustering C ∗ . Lemma 7.8.
Suppose, C is a clustering such that r = ||| Z C − Z ∗ ||| F ≤ . . Then, ¯Φ λ ( C ) − ¯Φ λ ( C ∗ ) ≥ a r (1 − α − r ) − λ √ Ks ||| V ∗ ||| F r. Proof.
Denoting ¯Φ ( C ) = − (cid:80) kj =1 z (cid:62) j ˆ A (cid:62) Λ j , · ˆΣ − j , Λ j ˆ A Λ j , · z j (which indeed corresponds to λ = 0), we have the decomposition¯Φ λ ( C ) − ¯Φ λ ( C ∗ ) = ¯Φ ( C ) − ¯Φ ( C ∗ ) − λ K (cid:88) j =1 [ s ∗ j ] (cid:62) Σ − j , Λ j A Λ j , · ( z j − z ∗ j ) . Let us first deal with the term ¯Φ ( C ) − ¯Φ ( C ∗ ). Note that since [ v ∗ j ] Λ j = Σ − j , Λ j A Λ j , · z ∗ j ,we have ¯Φ ( C ∗ ) = − K (cid:88) j =1 [ v ∗ j ] (cid:62) Σ v ∗ j = −
12 Tr([ V ∗ ] (cid:62) Σ V ∗ ) = −
12 Tr(Θ ∗ Σ[Θ ∗ ] (cid:62) ) . whereas ¯Φ ( C ) = min V =[ v ,..., v k ]
12 Tr( V (cid:62) Σ V ) − Tr( V (cid:62) AZ C )where the minimum is taken s.t. the restrictions supp( v j ) ⊂ Λ j . Dropping the restrictionswe get, ¯Φ ( C ) − ¯Φ ( C ∗ ) ≥ min V
12 Tr( V (cid:62) Σ V ) − Tr( V (cid:62) AZ C ) + 12 Tr(Θ ∗ Σ[Θ ∗ ] (cid:62) )= min V ||| Z C V (cid:62) Σ / ||| F − Tr( Z C V (cid:62) Σ[Θ ∗ ] (cid:62) ) + ||| Θ ∗ Σ / ||| F = min V ||| ( Z C V (cid:62) − Θ ∗ )Σ / ||| F . It is not hard to calculate that the minimum is attained for V = [Θ ∗ ] (cid:62) Z C and therefore¯Φ ( C ) − ¯Φ ( C ∗ ) ≥ ||| ( Z C Z (cid:62)C − I )Θ ∗ Σ / ||| F ≥ a ||| ( Z C Z (cid:62)C − I ) Z ∗ ||| F , where the latter follows using Θ ∗ = Z ∗ [ V ∗ ] (cid:62) and from the fact that λ min ([ V ∗ ] (cid:62) Σ V ∗ ) ≥ σ .Moreover, ||| ( Z C Z (cid:62)C − I ) Z ∗ ||| F = Tr(( P C − I ) P C ∗ ( P C − I )) = Tr( P C ∗ ) − Tr( P C P C ∗ )= 12 ||| P C − P C ∗ ||| F , . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov P C ) = Tr( P C ∗ ) = K . It is left to recall the result ofLemma 7.4, so that we get¯Φ ( C ) − ¯Φ ( C ∗ ) ≥ a r − α − r ) . As for the linear term, it holds K (cid:88) j =1 [ s ∗ j ] (cid:62) Σ − j , Λ j A Λ j , · ( z j − z ∗ j ) ≤ K (cid:88) j =1 (cid:107) [ s ∗ j ] (cid:62) Σ − j , Λ j A Λ j , · (cid:107) r Since A = Σ[Θ ∗ ] (cid:62) , we have A (cid:62) Λ j , · Σ − j , Λ j s ∗ j = Θ ∗ Σ · , Λ j Σ − j , Λ j s ∗ j . Denote, x = Σ · , Λ j Σ − j , Λ j s ∗ j ,then we have x Λ j = s j and (cid:107) x Λ j (cid:107) ∞ = 1. Moreover, by the ERC property (cid:107) x Λ cj (cid:107) ∞ = (cid:107) Σ Λ cj , Λ j Σ − j , Λ j s j (cid:107) ∞ ≤ (cid:107) Σ Λ cj , Λ j Σ − j , Λ j (cid:107) , ∞ ≤ / . We have (cid:107) A (cid:62) Λ j , · Σ − j , Λ j s ∗ j (cid:107) = (cid:107) (cid:88) z ∗ j [ v ∗ j ] (cid:62) x (cid:107) = K (cid:88) k =1 | [ v ∗ k ] (cid:62) x | , where, since v ∗ k is supported on Λ k of size at most s , | [ v ∗ k ] (cid:62) x | ≤ (cid:107) v ∗ k (cid:107) (cid:107) x (cid:107) ∞ ≤ √ s (cid:107) v ∗ k (cid:107) . Summing up, we get (cid:107) A (cid:62) Λ j , · Σ − j , Λ j s ∗ j (cid:107) ≤ s ||| V ∗ ||| F , so that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) j =1 [ s ∗ j ] (cid:62) Σ − j , Λ j A Λ j , · ( z j − z ∗ j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ Ks ||| V ∗ ||| F r. The lemma now follows from the two terms put together.The next step is to bound the difference Φ λ ( C ) − ¯Φ λ ( C ) uniformly in the neighbourhoodof C ∗ . Lemma 7.9.
Suppose that the inequalities (7.2) – (7.6) hold and let ∆ ≤ σ min / (2 √ s ) ∨ λ , σ max /σ min ≤ n ∗ , λ ≤ σ min s − Let some r ≤ . satisfies √ sn ∗ ∆ r ≤ σ max . Then, sup ||| Z − Z ∗ ||| F ≤ r | Φ λ ( C ) − ¯Φ λ ( C ) − Φ λ ( C ∗ ) + ¯Φ λ ( C ∗ ) |≤ (cid:32)(cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F + σ max σ min √ K (cid:33) Delta r + 16 σ max σ min √ sn ∗ ∆ r . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov Proof.
Denote, ˜Φ λ ( C ) = − K (cid:88) j =1 (cid:0) A Λ j , · z j − λ s ∗ j (cid:1) (cid:62) ˆΣ − j , Λ j (cid:0) A Λ j , · z j − λ s ∗ j (cid:1) , so that we have | ˜Φ λ ( C ) − ¯Φ λ ( C ) − ˜Φ λ ( C ∗ ) + ¯Φ λ ( C ∗ ) |≤ K (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:0) A Λ j , · ( z j + z ∗ j ) − λ s ∗ j (cid:1) (cid:62) ( ˆΣ − j , Λ j − Σ − j , Λ j ) A Λ j , · ( z j − z ∗ j ) (cid:12)(cid:12)(cid:12) First of all, due to (7.6) it holds, ||| ˆΣ − j , Λ j − Σ − j , Λ j ||| op ≤ σ − √ s ∆ − σ − √ s ∆ ≤ σ − √ s ∆ . Since A = Σ[Θ ∗ ] (cid:62) , we have (cid:107) A Λ j , · ( z j − z ∗ j ) (cid:107) ≤ σ max r j (cid:107) A Λ j , · ( z j + z ∗ j ) − λ s ∗ j (cid:107) ≤ σ max (2 (cid:107) v ∗ j (cid:107) + r j ) + 2 λ √ s. Then by Cauchy-Schwartz, | ˜Φ λ ( C ) − ¯Φ λ ( C ) − ˜Φ λ ( C ∗ ) + ¯Φ λ ( C ∗ ) | ≤ σ − √ s ∆ K (cid:88) j =1 σ max r j (cid:8) σ max (2 (cid:107) v j (cid:107) + r j ) + 2 λ √ s (cid:9) ≤ (cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F ∆ r + 2 σ max σ λs √ K ∆ r + (cid:18) σ max σ min (cid:19) √ s ∆ r . Going further,Φ λ ( C ) − ˜Φ λ ( C ) = − K (cid:88) j =1 (cid:16) ( A Λ j , · + ˆ A Λ j , · ) z j − λ s ∗ j (cid:17) (cid:62) ˆΣ − j , Λ j ( ˆ A Λ j , · − A Λ j , · ) z j , which implies that | Φ λ ( C ) − ˜Φ λ ( C ) − Φ λ ( C ∗ ) + ˜Φ λ ( C ∗ ) |≤ K (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) ( A Λ j , · + ˆ A Λ j , · )( z j − z ∗ j ) (cid:17) (cid:62) ˆΣ − j , Λ j ( ˆ A Λ j , · − A Λ j , · ) z j (cid:12)(cid:12)(cid:12)(cid:12) + 12 K (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) ( A Λ j , · + ˆ A Λ j , · ) z ∗ j − λ s ∗ j (cid:17) (cid:62) ˆΣ − j , Λ j ( ˆ A Λ j , · − A Λ j , · )( z j − z ∗ j ) (cid:12)(cid:12)(cid:12)(cid:12) (7.10) . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:107) ( ˆ A Λ j , · − A Λ j , · )( z j − z ∗ j ) (cid:107) ≤ √ s (cid:107) ˆ A Λ j , · − A Λ j , · (cid:107) ∞ , ∞ (cid:107) z j − z ∗ j (cid:107) ≤ . √ sn ∗ ∆ r j . Therefore, it follows (cid:107) ( ˆ A Λ j , · + A Λ j , · )( z j − z ∗ j ) (cid:107) ≤ σ max r j + 1 . √ sn ∗ ∆ r j . Moreover, using (7.3) we get (cid:107) ( ˆ A Λ j , · − A Λ j , · ) z j (cid:107) ≤ ∆ + 1 . √ sn ∗ ∆ r j (cid:107) ( ˆ A Λ j , · + A Λ j , · ) z ∗ j − λ s ∗ j (cid:107) ≤ σ max (cid:107) v j (cid:107) + ∆ + 2 λ √ s. and we also have ||| ˆΣ − j , Λ j ||| op ≤ σ − due to the condition σ − √ s ∆ ≤ /
2. Thus weget that the first sum of (7.10) is bounded by σ − K (cid:88) j =1 (cid:16) σ max r j + 1 . √ sn ∗ ∆ r j (cid:17) (cid:16) ∆ + 1 . √ sn ∗ ∆ r j (cid:17) ≤ σ max σ min ∆ √ Kr + 1 . σ − √ sn ∗ ∆ r + 3 . σ max σ min √ sn ∗ ∆ r + 2 . σ − sn ∗ ∆ r , while the second sum is bounded by σ − K (cid:88) j =1 (cid:0) σ max (cid:107) v ∗ j (cid:107) + ∆ + 2 λ √ s (cid:1) (cid:16) . √ sn ∗ ∆ r j (cid:17) ≤ . σ min (cid:16) σ max √ sn ∗ + √ sn ∗ ∆ + 2 λs √ n ∗ (cid:17) ∆ r ≤ . σ min (cid:16) σ max √ sn ∗ + λs √ n ∗ (cid:17) ∆ r where we used the fact that max j (cid:107) v ∗ j (cid:107) ≤ ||| V ∗ ||| op = ||| Θ ∗ ||| op < ≤ σ max . Combining all the bounds we get | Φ λ ( C ) − ¯Φ λ ( C ) − Φ λ ( C ∗ ) + ¯Φ λ ( C ∗ ) |≤ (cid:40)(cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F + 2 σ max σ λs √ K + 2 σ max σ min √ K (cid:41) ∆ r + (cid:40) . σ max σ min √ sn ∗ + 3 . σ − λs √ n ∗ + 1 . σ − √ sn ∗ ∆ + (cid:18) σ max σ min (cid:19) √ s (cid:41) ∆ r + 3 . σ max σ min √ sn ∗ ∆ r + 2 . σ − sn ∗ ∆ r , where by r ≤ . √ sn ∗ ∆ ≤ σ max we can neglect the third and the fourth power,respectively, and thus the required bound follows. Lemma 7.10.
There are numerical constants c, C > such that the following holds. . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov Suppose, the inequalities take place: (cid:115) sn ∗ log NT p ≤ c a σ min σ , n ∗ ≥ σ max /σ min . (7.11) Let Cσ max (cid:113) log NT p ≤ λ ≤ cσ min τ s − , and set ¯ r = 0 . ∧ . √ α ∧ . (cid:114)(cid:16) σ max α − / + √ n ∗ ∆ (cid:17) − λ. Then under the inequalities (7.2) – (7.6) the clustering ˆ C = arg min ||| Z C − Z ∗ ||| F ≤ r max F λ ( C ) satisfies ||| Z ˆ C − Z ∗ ||| F ≤ Ca (cid:18) σ max σ min (cid:19) λK √ s . Proof.
It is not hard to see that for ∆ = (cid:113) log NT p the inequalities required by Lem-mata 7.7–7.9 are satisfied for r ≤ ¯ r due to (7.11) and conditions on λ and ¯ r . Sinceobviously ˆ C satisfies F λ ( ˆ C ) ≤ F λ ( C ∗ ), we have for r = ||| Z ˆ C − Z C ∗ ||| F ≤ r max F λ ( ˆ C ) − F λ ( C ∗ ) ≥ ¯Φ λ ( C ) − ¯Φ λ ( C ) − | F λ ( C ) − ¯Φ λ ( C ) − F λ ( C ∗ ) + ¯Φ λ ( C ∗ ) |≥ a r (cid:0) − α − r (cid:1) − λ √ Ks ||| V ∗ ||| F r − (cid:40)(cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F + σ max σ min √ K (cid:41) ∆ r − σ max σ min √ sn ∗ ∆ r = a r (cid:18) − α − r − a σ max σ min √ sn ∗ ∆ (cid:19) − λ √ Ks ||| V ∗ ||| F r − (cid:40)(cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F + σ max σ min √ K (cid:41) ∆ r . Since ¯ r ≤ . √ α implies 10 α − r ≤ , it holds by (7.11)1 − α − r − a σ max σ min √ sn ∗ ∆ ≥ . Therefore, after dividing by r , we get that such optimal clustering must satisfy a r ≤ λ √ Ks ||| V ∗ ||| F + 4 (cid:40)(cid:18) σ max σ min (cid:19) √ s ||| V ∗ ||| F + σ max σ min √ K (cid:41) ∆ . Recalling that ||| V ∗ ||| F ≤ √ K , ∆ = Cσ max (cid:113) log NT p , and ∆ = C (cid:113) s log NT p yields the result.Now we are ready to finalize the proof of Theorem 3.7. Firstly, we need to showthat the clustering ˆ C from the lemma above is locally optimal. By Lemma 7.5, any . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov C (cid:48) satisfies ||| Z C (cid:48) − Z ˆ C ||| F ≤ √ αN/K . Therefore, ||| Z C (cid:48) − Z C ∗ ||| F ≤ Ca (cid:18) σ max σ min (cid:19) λK √ s + 2 α − / (cid:114) KN , and it is enough to check that this value is at most ¯ r . We check that each of the termsis at most ¯ r/
2. For the first one, it is sufficient to have Ca (cid:18) σ max σ min (cid:19) α − / λK √ s ≤ . ,C a (cid:18) σ max σ min (cid:19) λ (cid:16) σ max α − / + √ n ∗ ∆ (cid:17) K s ≤ . , and both are satisfied due to the upper bound λ ≤ cκ − ( a /σ max ) K − s − and the re-quirement (cid:113) sn ∗ log NT p ≤ c . For the second term we need α − KN ≤ . α, α − (cid:16) σ max α − / + √ n ∗ ∆ (cid:17) KN ≤ λ, both are satisfied once N ≥ Cα K and λ ≥ Cσ max α − / KN .Moreover, by Lemma 7.7 we have for ˆΘ = Z ˆ C ˆ V ˆ C ,λ ||| ˆΘ − Θ ∗ ||| F ≤ ||| Z ˆ C ( ˆ V ˆ C ,λ − V ∗ ) (cid:62) ||| F + ||| ( Z ˆ C − Z ∗ ) V ∗ ||| F ≤ σ − √ Ksλ + Ca (cid:18) σ max σ min (cid:19) γK √ sλ, which finishes the proof. References
Avery, C. N., Chevalier, J. A., and Zeckhauser, R. J. (2016). The "CAPS" Prediction System and Stock Market Returns. Review of Finance, 20(4):1363–1381.
Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547.
Bickel, P. J., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
Cha, M., Haddadi, H., Benevenuto, F., and Gummadi, K. P. (2010). Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, pages 10–17.
Chen, C. Y.-H., Després, R., Guo, L., and Renault, T. (2019a). What makes cryptocurrencies special? Investor sentiment and price predictability during the bubble. IRTG 1792 Discussion Paper 2019-016.
Chen, C. Y.-H., Härdle, W. K., and Okhrin, Y. (2019b). Tail event driven networks of SIFIs. Journal of Econometrics, 208(1):282–298.
Warwick Economics Research Paper Series No. 1120.
Chen, M., Fernández-Val, I., and Weidner, M. (2021). Nonlinear factor models for network and panel data. Journal of Econometrics, 220(2):296–324. Annals Issue: Celebrating 40 Years of Panel Data Analysis: Past, Present and Future.
Chen, S. and Schienle, M. (2019). Pre-screening and reduced rank regression for high-dimensional cointegration. KIT working paper.
Chen, Y., Trimborn, S., and Zhang, J. (2018). Discover Regional and Size Effects in Global Bitcoin Blockchain via Sparse-Group Network AutoRegressive Modeling. Available at SSRN: https://ssrn.com/abstract=3245031.
Chernozhukov, V., Härdle, W. K., Huang, C., and Wang, W. (2020). LASSO-Driven Inference in Time and Space. Annals of Statistics, to appear.
Čížek, P., Härdle, W., and Spokoiny, V. (2009). Adaptive pointwise estimation in time-inhomogeneous conditional heteroscedasticity models. The Econometrics Journal, 12(2):248–271.
Deng, S., Sinha, A. P., and Zhao, H. (2017). Adapting sentiment lexicons to domain-specific social media texts. Decision Support Systems, 94:65–76.
Diebold, F. X. and Yılmaz, K. (2014). On the network topology of variance decompositions: Measuring the connectedness of financial firms. Journal of Econometrics, 182(1):119–134.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2):521.
Gribonval, R., Jenatton, R., and Bach, F. (2015). Sparse and spurious: dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61(11):6298–6319.
Gudmundsson, G. (2018). Community Detection in Large Vector Autoregressions. Available at SSRN: https://ssrn.com/abstract=3072985.
Han, F., Lu, H., and Liu, H. (2015). A Direct Estimation of High Dimensional Stationary Vector Autoregressions. The Journal of Machine Learning Research, 16(1):3115–3150.
Hsu, D., Kakade, S., and Zhang, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17(52):6 pp.
Kapetanios, G., Pesaran, M. H., and Reese, S. (2019). Detection of units with pervasive effects in large panel data models. USC-INET Research Paper.
Kim, S.-H. and Kim, D. (2014). Investor sentiment from internet message postings and the predictability of stock returns. Journal of Economic Behavior & Organization, 107, Part B:708–729.
Klochkov, Y. and Zhivotovskiy, N. (2020). Uniform Hanson–Wright type concentration inequalities for unbounded entries via the entropy method. Electronic Journal of Probability, 25(20):1–30.
Bernoulli, 23(1):110–133.
Le Gouic, T. and Paris, Q. (2018). A notion of stability for k-means clustering. Electronic Journal of Statistics, 12(2):4239–4263.
Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461.
Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.
Lounici, K. (2014). High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058.
Melnyk, I. and Banerjee, A. (2016). Estimating structured vector autoregressive models. In Proceedings of the 33rd International Conference on Machine Learning, pages 830–839.
Mihoci, A., Althof, M., Chen, C. Y.-H., and Härdle, W. K. (2020). FRM Financial Risk Meter. In Advances in Econometrics Conference, volume 42, The Econometrics of Networks.
Moon, H. R. and Weidner, M. (2018). Nuclear norm regularized estimation of panel regression models. arXiv preprint arXiv:1810.10987.
Parker, J. and Sul, D. (2016). Identification of unknown common factors: Leaders and followers. Journal of Business & Economic Statistics, 34(2):227–239.
Pesaran, M. H. and Yang, C. F. (2020). Econometric analysis of production networks with dominant units. Journal of Econometrics, in press.
Rakhlin, A. and Caponnetto, A. (2007). Stability of k-means clustering. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, pages 1121–1128.
Renault, T. (2017). Intraday online investor sentiment and return patterns in the US stock market. Journal of Banking & Finance, 84:25–40.
Rohe, K., Qin, T., and Yu, B. (2016). Co-clustering directed graphs to discover asymmetries and directional communities. In Proceedings of the National Academy of Sciences, volume 113, pages 12679–12684.
Shindler, M., Wong, A., and Meyerson, A. W. (2011). Fast and Accurate k-means For Large Datasets. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pages 2375–2383.
Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051.
Udell, M., Horn, C., Zadeh, R., and Boyd, S. (2016). Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118.
The Annals of Statistics, 42(3):1166–1202.
Van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Wang, T. and Samworth, R. J. (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):57–83.
Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7(Nov):2541–2563.
Zhu, X. and Pan, R. (2020). Grouped Network Vector Autoregression. Statistica Sinica, 30:1437–1462.
Zhu, X., Pan, R., Li, G., Liu, Y., and Wang, H. (2017). Network vector autoregression. The Annals of Statistics, 45(3):1096–1123.
Zhu, X., Wang, W., Wang, H., and Härdle, W. K. (2019). Network quantile autoregression. Journal of Econometrics, 212:345–358.
A Proof of Theorems 3.5 and 3.6
Recall that we have a time series, Y t = (cid:88) k ≥ Θ k W t − k , t ∈ Z , (A.1)where W t ∈ R N , t ∈ Z are independent vectors with E W t = 0 and Var( W t ) = S . We alsohave ||| Θ ||| op ≤ γ for some γ <
1, and the covariance Σ = Var( Y t ) reads asΣ = (cid:88) k ≥ Θ k S [Θ k ] (cid:62) . We have the observations Z t = ( δ t Y t , . . . , δ Nt Y Nt ) (cid:62) , t = 1 , . . . , T, (A.2)where δ it ∼ Be( p i ) are independent Bernoulli random variables for every i = 1 , . . . , N and t = 1 , . . . , T and some p i ∈ (0 , X ∈ R the value (cid:107) X (cid:107) ψ j = inf (cid:40) C > E exp (cid:32)(cid:12)(cid:12)(cid:12)(cid:12) XC (cid:12)(cid:12)(cid:12)(cid:12) j (cid:33) ≤ (cid:41) . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov ψ j -norm. For j = 1 the norm is referred to as subexponential and for j = 2 as sub-Gaussian , see Definition 3.3. Theorem A.1 (Klochkov and Zhivotovskiy (2020), Proposition 4.1) . Suppose, the ma-trices A t for t = 1 , . . . , T are independent and let M = max t (cid:13)(cid:13) ||| A t ||| op (cid:13)(cid:13) ψ is finite. Then, S T = (cid:80) Tt =1 A t satisfies for any u ≥ P (cid:104) ||| S T − E S T ||| op > C (cid:110)(cid:112) σ (log N + u ) + M log T (log N + u ) (cid:111)(cid:105) ≤ e − u , where σ = ||| (cid:80) Tt =1 E A (cid:62) t A t ||| op ∨ ||| (cid:80) Tt =1 E A t A (cid:62) t ||| op and C is an absolute constant. Both Lounici (2014) and Klochkov and Zhivotovskiy (2020) assume that the proba-bilities of the observations are given. Using Chernov’s bound for the differenceˆ p i − p i = 1 N T (cid:88) t =1 δ it − E δ it , and applying the union bound, we derive that with probability at least 1 − e − u it holdsthat max i ≤ N (cid:12)(cid:12)(cid:12)(cid:12) − ˆ p i p i (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:115) N ) + uT p min + log(4 N ) + uT p min . (A.3)Notice that what appears on the right-hand side of the above display is dominated bythe error that appears in Theorems 3.5 and 3.6. Consider the auxiliary estimators˜Σ = diag { p } − Diag(Σ ∗ ) + diag { p } − Off(Σ ∗ )diag { p } − , ˜ A = diag { p } − A ∗ diag { p } − . Then, we have for ˆ I = diag { ˆ p } − diag { p } thatˆΣ = ˆ I Diag( ˜Σ) + ˆ I Off( ˜Σ) ˆ I, ˆ A = ˆ I ˜ A ˆ I. Given that log(4 N )+ uT p min ≤ , we easily get that by (A.3), ||| ˆ I − I ||| op ≤ δ = 3 (cid:115) log(4 N ) + uT p min . (A.4)with the corresponding probability. In this case, we have ||| ˆ A − A ||| op ≤ ||| ˆ I ˜ A ˆ I − A ||| op ≤ ||| ˆ I ||| op ||| ˜ A − A ||| op + ||| ˆ IA ˆ I − A ||| op ≤ (1 + δ ) ||| ˜ A − A ||| op + 2(1 + δ ) δ ||| A ||| op . Similarly, |||
Diag( ˆΣ) − Diag(Σ) ||| op ≤ (1 + δ ) ||| Diag( ˜Σ) − Diag(Σ) ||| op + δ ||| Σ ||| op , ||| Off( ˆΣ) − Off(Σ) ||| op ≤ (1 + δ ) ||| Off( ˜Σ) − Off(Σ) ||| op + 4(1 + δ ) δ ||| Σ ||| op . Recall that S = Var( W t ), and from (3.4) we can easily derive that ||| Σ ||| op ≤ − γ ||| S ||| op , . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov ||| Θ ∗ ||| op ≤ γ <
1. Correspondingly, from A = Θ ∗ Σ, it follows ||| A ||| op ≤ γ − γ ||| S ||| op .The condition (3.6) ensures that δ ≤
3, and both the theorems now follow from theproposition below.
Proposition A.2.
Under the conditions of Theorems 3.5 and 3.6, for any two projectorswith
P, Q with ranks M , M , respectively, we have that for any u > , with probabilityat least − e − u , ||| P (Diag( ˜Σ) − Diag(Σ)) Q ||| op ≤ C ||| S ||| op (cid:32)(cid:115) ( M ∨ M )(log N + u ) T p (cid:95) √ M M (log N + u ) log TT p (cid:33) . (A.5) and ||| P (Off( ˜Σ) − Off(Σ)) Q ||| op ≤ C ||| S ||| op (cid:32)(cid:115) ( M ∨ M )(log N + u ) T p (cid:95) √ M M (log N + u ) log TT p (cid:33) . (A.6) Moreover, with probability at least − e − u we have that, ||| P ( ˜ A − A ) Q ||| op ≤ C ||| S ||| op (cid:32)(cid:115) ( M ∨ M )(log N + u ) T p (cid:95) √ M M (log N + u ) log TT p (cid:33) . (A.7) Here, C = C ( γ, L ) only depends on γ and L . We first derive Theorems 3.5 and 3.6 from the above proposition. Then the rest ofthe section will be devoted to the proof of the above proposition.
Proof of Theorems 3.5 and 3.6.
Observe that ||| P (Diag( ˆΣ) − Diag(Σ)) Q ||| op = ||| P ( ˆ I Diag( ˜Σ) − Diag(Σ)) Q ||| op ≤ ||| P ( ˆ I − I )Diag( ˜Σ) Q ||| op + ||| P (Diag( ˜Σ) − Diag(Σ)) Q ||| op The last term of the right-hand side is controlled by (A.5). As for the first one, let Λbe the support of P in accordance with Definition 3.4, and set Π Λ = (cid:80) i ∈ Λ e i e (cid:62) i , so that P = P Π Λ and Rank(Π Λ ) = K . Moreover, Π Λ is diagonal, therefore, Π Λ ˆ I = ˆ I Π Λ . This,we have ||| P ( ˆ I − I )Diag(Σ) Q ||| op = ||| P ( ˆ I − I )Π Λ Diag(Σ) Q ||| op ≤ δ ( ||| Σ ||| op + ||| Π Λ (Diag( ˜Σ) − Diag(Σ)) Q ||| op ) . By (A.5) and (3.6) we have that with probability at least 1 − e − u , ||| Π Λ (Diag( ˜Σ) − Diag(Σ)) Q ||| op ≤ C ||| Σ ||| op , . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov ||| Σ ||| op ≤ (1 − γ ) − ||| S ||| op . Furthermore, ||| P (Off( ˆΣ) − Off(Σ)) Q ||| op = ||| P ( ˆ I Off( ˜Σ) ˆ I − Off(Σ)) Q ||| op ≤||| P (Off( ˜Σ) − Off(Σ)) Q ||| op + ||| P ( ˆ I − I )Off( ˜Σ) ˆ IQ ||| op + ||| P Off( ˜Σ)( ˆ I − I ) Q ||| op Let Λ (cid:48) be the sparsity pattern for the projector Q and Π Λ (cid:48) is the corresponding diagonalprojector. Then we apply (A.6) to Π Λ (Off( ˜Σ) − Off(Σ))Π Λ (cid:48) ) so that provided with (3.6),we have with probability at least 1 − e − u , ||| Π Λ Off( ˜Σ)Π Λ (cid:48) ||| op ≤ C ||| S ||| op . Using that Π Λ (cid:48) , Π Λ commute with the diagonal matrices ˆ I , ˆ I − I , and given that δ ≤ ||| P Off( ˜Σ)( ˆ I − I ) Q ||| op + ||| P ( ˆ I − I )Off( ˜Σ) ˆ IQ ||| op ≤ C δ ||| S ||| op . Applying (A.5) to P (Diag( ˜Σ) − Diag(Σ)) Q with probability 1 − e − u and (A.6) to P (Off( ˜Σ) − Off(Σ)) Q , and putting the diagonal and off-diagonal terms together, weget that, with probability at least 1 − e − u , it holds that ||| P ( ˆΣ − Σ) Q ||| op ≤ C ||| S ||| op (cid:32) δ + (cid:115) ( M ∨ M )(log N + u ) T p (cid:95) √ M M (log N + u ) log TT p (cid:33)
It remains to notice that the bound (A.4) for δ holds with probability at least 1 − e − u ,and the corresponding δ is dominated by the remaining error term. This concludes theproof of Theorem 3.5.Theorem 3.6 can be proved treating P ( ˆ A − A ) Q similarly to the off-diagonal caseabove.We now turn to the proof of Proposition A.2. Let δ t = ( δ t , . . . , δ tN ) (cid:62) denotes thevector with Bernoulli variables from above corresponding to the time point t . In whatfollows we consider the following matrices, A k,jt,t (cid:48) = diag { δ t } Θ k W t − k W (cid:62) t (cid:48) − j [Θ j ] (cid:62) diag { δ t (cid:48) } , so that since Z t = (cid:80) k ≥ diag { δ t } Θ k W t − k , we have Z t Z (cid:62) t = (cid:88) k,j ≥ diag { δ t } Θ k W t − k W (cid:62) t − j [Θ j ] (cid:62) diag { δ t } = (cid:88) k,j ≥ A k,jt,t . Therefore, the decomposition takes placeΣ ∗ = (cid:88) k,j ≥ S k,j , S k,j = 1 T T (cid:88) t =1 A k,jt,t , (A.8)and we shall analyze the sum S k,j for every pair of k, j ≥ ||| S ||| op = 1, since if we . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov Lemma A.3.
Under the assumptions of Theorem 3.5 it holds, (cid:107)||| P diag { p } − Diag( A k,jt,t (cid:48) ) Q ||| op (cid:107) ψ ≤ Cp − (cid:112) M M γ k + j , (cid:107)||| P diag { p } − Off( A k,jt,t (cid:48) )diag { p } − Q ||| op (cid:107) ψ ≤ Cp − (cid:112) M M γ k + j , with some C = C ( L ) > .Proof. Denote for simplicity x = Θ k W t − k , y = Θ j W t (cid:48) − j , as well as x δ = diag { δ t } x , y δ =diag { δ t } y , such that A k,jt,t (cid:48) = x δ [ y δ ] (cid:62) . Since W t are sub-Gaussian and ||| Θ k S Θ k ||| op ≤ γ k ,we have for any u ∈ R N log E exp( u (cid:62) x ) ≤ C (cid:48) γ k (cid:107) u (cid:107) , (A.9)and since δ t takes values in [0 , N , same takes place for x δ . By Theorem 2.1 in Hsu et al.(2012) it holds for any matrix A and vector u ∈ R N , (cid:107)(cid:107) A x δ (cid:107)(cid:107) ψ ≤ C (cid:48)(cid:48) γ k ||| A ||| F , (cid:107) u (cid:62) x δ (cid:107) ψ ≤ C (cid:48)(cid:48) γ k (cid:107) u (cid:107) , (A.10)and, similarly, (cid:107)(cid:107) A y δ (cid:107)(cid:107) ψ ≤ C (cid:48)(cid:48) γ j ||| A ||| F , (cid:107) u (cid:62) y δ (cid:107) ψ ≤ C (cid:48)(cid:48) γ j (cid:107) u (cid:107) . We first deal with the diagonal term. Let P = (cid:80) M i =1 u j u (cid:62) j be its eigen-decompositionwith (cid:107) u j (cid:107) = 1, then (cid:107)||| P diag( x δ ) ||| op (cid:107) ψ = (cid:107)||| diag( x δ ) P diag( x δ ) ||| op (cid:107) ψ ≤ M (cid:88) j =1 (cid:107)||| diag( x δ ) u j u (cid:62) j diag( x δ ) ||| op (cid:107) ψ = M (cid:88) j =1 (cid:107)(cid:107) diag( u j ) x δ (cid:107)(cid:107) ψ , where each term in the latter is bounded by γ k due the fact that ||| diag( u j ) ||| F = 1.Summing up and taking square root, we arrive at (cid:13)(cid:13) ||| P diag( x δ ) ||| op (cid:13)(cid:13) ψ ≤ √ C (cid:48)(cid:48) M γ k .Taking into account similar bound for Q diag( y δ ), we have by H¨older inequality (cid:107)||| P diag { δ } − diag( x δ )diag( y δ ) Q ||| op (cid:107) ψ ≤ p − (cid:107)||| P diag( x δ ) ||| op (cid:13)(cid:13) ψ (cid:107)||| Q diag( y δ ) ||| op (cid:107) ψ ≤ C (cid:48)(cid:48) (cid:112) M M γ k + j , which yields the bound for the diagonal. As for the off-diagonal, consider first the wholematrix, (cid:107)||| P x δ [ y δ ] (cid:62) Q ||| op (cid:107) ψ ≤ (cid:107)(cid:107) P x δ (cid:107)(cid:107) ψ (cid:107)(cid:107) Q y δ (cid:107)(cid:107) ψ ≤ ( C (cid:48)(cid:48) ) (cid:112) M M γ j + k , and since Off( A j,kt,t (cid:48) ) = A j,kt,t (cid:48) − Diag( A j,kt,t (cid:48) ), the bound follows from the triangular inequality.The following technical lemma will help us to upper-bound σ in Theorem A.1. . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov Lemma A.4.
Let δ , . . . , δ N consists of independent Bernoulli components with proba-bilities of success p , . . . , p N and set p min = min i ≤ N p i . Let a , b ∈ R N be two arbitraryvectors. It holds, E (cid:32)(cid:88) i δ i p i a i b i (cid:33) ≤ p − (cid:107) a (cid:107) (cid:107) b (cid:107) , E (cid:88) i (cid:54) = j δ i δ j p i p j a i b j ≤ p − (cid:107) a (cid:107) (cid:107) b (cid:107) + 4 (cid:32)(cid:88) i a i (cid:33) (cid:32)(cid:88) i b i (cid:33) . Additionally, if δ (cid:48) , . . . , δ (cid:48) N are independent copies of δ , . . . , δ N , it holds E (cid:88) i,j δ i δ (cid:48) j p i p j a i b j ≤ p − (cid:107) a (cid:107) (cid:107) b (cid:107) + 4 (cid:32)(cid:88) i a i (cid:33) (cid:32)(cid:88) i b i (cid:33) . Proof.
It holds, E (cid:32)(cid:88) i δ i p i a i b i (cid:33) = (cid:88) i,j E δ i δ j p i p j a i b i a j b j = (cid:88) i,j { ( i = j )( p − i − } a i b i a j b j ≤ (cid:32)(cid:88) i a i b i (cid:33) + ( p − − (cid:88) i a i b i ≤(cid:107) a (cid:107) (cid:107) b (cid:107) + ( p − − (cid:107) a (cid:107) (cid:107) b (cid:107) . To show the second inequality we use decoupling (Theorem 6.1.1 in Vershynin (2018))and the trivial inequality ( x + y ) ≤ x + 2 y , E (cid:88) i (cid:54) = j δ i δ j p i p j a i b j ≤ (cid:88) i (cid:54) = j a i b j + 2 E (cid:88) i (cid:54) = j ( δ i − p i )( δ j − p j ) p i p j a i b j ≤ (cid:88) i (cid:54) = j a i b j + 32 E (cid:88) i (cid:54) = j ( δ i − p i )( δ (cid:48) j − p j ) p i p j a i b j . (A.11)Denote for simplicity δ i = δ i − p i and δ (cid:48) i = δ (cid:48) i − p i . Since the latter are centered we have, E (cid:88) i (cid:54) = j δ i δ (cid:48) j p i p j a i b j = (cid:88) i (cid:54) = jk (cid:54) = l E δ i δ k p i p k E δ (cid:48) j δ (cid:48) l p j p j a i a k b j b l (A.12)note that the expectation E δ i δ k is only non-vanishing when i = k , in which case it holds E δ i = p i − p i . Taking into account similar property of E δ (cid:48) j δ (cid:48) l we have that the sum aboveis equal to (cid:88) i (cid:54) = j ( p i − p i )( p j − p j ) p i p j a i b j ≤ ( p − − (cid:88) i,j a i b j ≤ ( p − − (cid:107) a (cid:107) (cid:107) b (cid:107) . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:88) i (cid:54) = j a i b j ≤ (cid:88) i,j a i b j + 2 (cid:32)(cid:88) i a i b j (cid:33) ≤ (cid:32)(cid:88) i a i (cid:33) (cid:32)(cid:88) i b i (cid:33) + 2 (cid:107) a (cid:107) (cid:107) b (cid:107) , which recalling (A.11) and noting that 32( p − − + 4 ≤ p − for p min ∈ [0 , S kj defined in (A.8), dealingseparately with diagonal and off-diagonal parts. After that, we present the proof ofTheorem 3.5. Lemma A.5.
Under the assumptions of Theorem 3.5, it holds for any u ≥ withprobability at least − e − u ||| P diag { p } − (Diag( S k,j ) − E Diag( S k,j )) Q ||| op ≤ Cγ k + j (cid:32)(cid:115) M ∨ M (log N + u ) T p min (cid:95) √ M M (log N + u ) T p min (cid:33) where C = C ( K ) only depends on K .Proof. Note that, P diag { p } − Diag( S kj ) Q = T − T (cid:88) t =1 A t , A t = P diag { p } − Diag( A k,jt,t ) Q. By Lemma A.3 we have (cid:107)||| A t ||| op (cid:107) ψ ≤ Cp − √ M M γ k + j . Moreover, using decomposi-tion Q = (cid:80) M j =1 u j u j , we have ||| E A t A (cid:62) t ||| op ≤||| E diag { p } − Diag( A k,jt,t ) Q Diag( A k,jt,t )diag { p } − ||| op ≤ M (cid:88) j =1 ||| E diag { p } − Diag( A k,jt,t ) u j u (cid:62) j Diag( A k,jt,t )diag { p } − ||| op ≤ M (cid:88) j =1 sup (cid:107) γ (cid:107) =1 E ( γ (cid:62) diag { p } − Diag( A k,jt,t ) u j ) By definition, Diag( A k,jt,t ) = diag { δ ti x i y i } Ni =1 for x = Θ k W t − k , y = Θ j W t − j . Let E δ de-notes the expectation w.r.t. the Bernoulli variables and conditioned on everything else.Setting a = ( x γ , . . . , x N γ N ) (cid:62) and b = ( y u , . . . , y N u N ) (cid:62) , we have by the first inequal-ity of Lemma A.4, E ( γ (cid:62) diag { p } − Diag( A k,jt,t ) u j ) = EE δ (cid:32)(cid:88) i γ i x i δ ti p i y i u i (cid:33) ≤ p − E (cid:107) a (cid:107) (cid:107) b (cid:107) ≤ p − E / (cid:107) a (cid:107) E / (cid:107) b (cid:107) . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:107) a (cid:107) = (cid:88) i γ i x i = x (cid:62) diag { γ } x , so since Tr(diag { γ } ) = 1 and due to (A.9) and by Theorem 2.1 Hsu et al. (2012), itholds E / (cid:107) a (cid:107) ≤ (cid:107)(cid:107) a (cid:107) (cid:107) ψ ≤ C (cid:48) γ k . Similarly, it holds E / (cid:107) a (cid:107) ≤ C (cid:48) γ j , which togetherimplies ||| E A t A (cid:62) t ||| op ∨ ||| E A (cid:62) t A (cid:62) t ||| op ≤ C (cid:48)(cid:48) M ∨ M γ k +2 j . Now notice that A t is not necessary an independent sequence, as A t depends directlyon ( W t − k , W t − j , δ t ), which might intersect with t (cid:48) = t + | j − k | . However, if we take a set I ⊂ [1 , T ] such that any two t, t (cid:48) ∈ I satisfy | t (cid:48) − t | (cid:54) = | j − k | then the sequence ( A t ) t ∈ I isindependent. We separate the whole interval [1 , T ] into two such independent sets, I = { t ∈ [1 , T ] : (cid:100) t/ | j − k |(cid:101) is odd } ,I = { t ∈ [1 , T ] : (cid:100) t/ | j − k |(cid:101) is even } =[1 , T ] \ I . (A.13)Indeed, if for t, t (cid:48) ∈ I then (cid:100) t/ | j − k |(cid:101) and (cid:100) t (cid:48) / | j − k |(cid:101) are either equal or differ in at leasttwo, so that in the first case we have | t − t (cid:48) | < | j − k | and in the second | t − t (cid:48) | > | j − k | .Since both intervals have at most T elements, it holds by Theorem A.1 with probabilityat least 1 − e − u for both j , ||| (cid:88) t ∈ I j A t − E A t ||| op ≤ Cγ j + k (cid:18)(cid:113) p − ( M ∨ M ) T (log N + u ) ∨ p − (cid:112) M M (log N + u ) log T (cid:19) , so summing up the two and dividing by T , we get the result. Lemma A.6.
Under the assumptions of Theorem 3.5, it holds for any u ≥ withprobability at least − e − u ||| P diag { p } − (Off( S k,j ) − E Off( S k,j ))diag { p } − Q ||| op ≤ Cγ k + j (cid:32)(cid:115) M ∨ M (log N + u ) T p (cid:95) √ M M (log N + u ) log TT p (cid:33) where C = C ( K ) only depends on K .Proof. It holds, P diag { p } − Off( S kj )diag { p } − Q = T − T (cid:88) t =1 B t ,B t = P diag { p } − Off( A k,jt,t )diag { p } − Q. By Lemma A.3 we have (cid:107)||| B t ||| op (cid:107) ψ ≤ Cp − √ M M γ k + j . Using decomposition Q = . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:80) M j =1 u j u j with (cid:107) u j (cid:107) = 1 we get that ||| E B t B (cid:62) t ||| op ≤||| E diag { p } − Off( A k,jt,t )diag { p } − Q diag { p } − Off( A k,jt,t )diag { p } − ||| op ≤ M (cid:88) j =1 ||| E diag { p } − Off( A k,jt,t )diag { p } − u j u (cid:62) j diag { p } − Off( A k,jt,t )diag { p } − ||| op ≤ M (cid:88) j =1 sup (cid:107) γ (cid:107) =1 E ( γ (cid:62) diag { p } − Off( A k,jt,t )diag { p } − u j ) Again, using the notation x = Θ k W t − k , y = Θ j W t − j and a = diag { γ } x , b = diag { u } y ,we have Off( A j,kt,t ) = Off( xy (cid:62) ). Therefore, by Lemma A.4 E ( γ (cid:62) diag { p } − Off( A k,jt,t )diag { p } − u j ) = EE δ (cid:88) i (cid:54) = j γ i δ it p i x i y j δ jt δ j u j = EE δ (cid:88) i (cid:54) = j δ it p i δ jt δ j a i b j ≤ p − E (cid:107) a (cid:107) (cid:107) b (cid:107) + 4 E (cid:32)(cid:88) i a i (cid:33) (cid:32)(cid:88) i b i (cid:33) . From the proof of Lemma A.6 we know that E (cid:107) a (cid:107) (cid:107) b (cid:107) ≤ C (cid:48) γ k +2 j . Moreover, we have (cid:80) i a i = γ (cid:62) x and (cid:80) i b i = u (cid:62) y . Thus, by (A.10) it holds E / (cid:107) γ (cid:62) x (cid:107) ≤ (cid:107) γ (cid:62) x (cid:107) ψ ≤ C (cid:48) γ j and, similarly, E / (cid:107) u (cid:62) y (cid:107) ≤ C (cid:48) γ k . Putting those bounds together and applying Cauchy-Schwarz inequality, we have ||| E B t B (cid:62) t ||| op ≤ C (cid:48)(cid:48) p − M γ k +2 j . By analogy, ||| E B t B (cid:62) t ||| op ∨ ||| E B (cid:62) t B t ||| op ≤ C (cid:48)(cid:48) p − M ∨ M γ k +2 j . Applying the same sample splitting (A.13) we obtain the bound ||| (cid:88) t A t − E A t ||| op ≤ Cγ j + k (cid:18)(cid:113) p − ( M ∨ M ) T (log N + u ) ∨ p − (cid:112) M M (log N + u ) (cid:19) , which divided by T provides the result. Proof of (A.5) . Setting, D k,j = diag { p } − Diag( S k,j ) , by Lemma A.5 for any u ≥ ||| P ( D k,j − E D k,j ) Q ||| op > Cγ k + j (cid:32)(cid:115) M ∨ M (log N + u ) T p (cid:95) √ M M (log N + u ) T p (cid:33) holds with probability at least 1 − e − u . Take a union of those bounds for every k, j with u = u k,j = k + j + 1 + u (cid:48) for arbitrary u (cid:48) ≥
0. The total probability of complementary . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov (cid:88) k,j ≥ e − k − j − − u (cid:48) = e − − u (cid:48) (cid:88) k ≥ e − k = e − u (cid:48) / ( e − < e − u (cid:48) . By definition, Diag( ˜Σ) = diag { p } − (cid:80) i,j ≥ S k,j . Due to (A.8) and since E ˜Σ = Σ, it holdson such event ||| P (Diag( ˜Σ) − Diag(Σ)) Q ||| op ≤ (cid:88) k,j ≥ ||| P ( D k,j − E D k,j ) Q ||| op ≤ C (cid:88) k,j ≥ γ k + j (cid:32)(cid:115) M ∨ M (log N + u k,j ) T p (cid:95) √ M M (log N + u k,j ) T p (cid:33) ≤ C (cid:48) (cid:88) k,j ≥ γ k + j (cid:32)(cid:115) ( M ∨ M ) log NT p (cid:95) √ M M log NT p (cid:33) + C (cid:88) k,j ( k + j ) γ k + j (cid:32)(cid:115) ( M ∨ M ) u (cid:48) T p (cid:95) √ M M u (cid:48) T p (cid:33) , which completes the proof due to the equalities (cid:88) k,j ≥ γ k + j = (cid:88) k ≥ γ k = 1(1 − γ ) (cid:88) k,j ≥ ( k + j ) γ k + j =2 (cid:88) k,j ≥ kγ k + j = 2(1 − γ ) (cid:88) k ≥ kγ k = 2(1 − γ ) . Proof of (A.6) . This works similarly to the above, but applying Lemma A.6 to D k,j =diag { p } − Off( S k,j )diag { p } − and using the fact that Off( ˜Σ) = (cid:80) j,k ≥ D j,k by definition. Proof of (A.7) . Recall the definition, A k,jt,t (cid:48) = diag { δ t } Θ k W t − k W (cid:62) t (cid:48) − j [Θ j ] (cid:62) diag { δ t (cid:48) } . Then, it holds Z t Z (cid:62) t +1 = (cid:88) k,j ≥ diag { δ t } Θ k W t − k W (cid:62) t +1 − j [Θ j ] (cid:62) diag { δ t +1 } = (cid:88) k,j ≥ A k,jt,t +1 , and the decomposition takes place, A ∗ = (cid:88) k,j ≥ S k,j , S k,j = 1 T − T − (cid:88) t =1 A k,jt,t +1 . . Y.-H. Chen, W.K. H¨ardle, and Y. Klochkov S k,j separately. Observe that P diag { p } − S k,j diag { p } − Q = 1 T − T − (cid:88) t =1 B t , B t = P diag { p } − A k,jt,t +1 diag { p } − Q. By Lemma A.3 each term satisfiesmax t (cid:107)||| B t ||| op (cid:107) ψ ≤ C (cid:112) M M γ k + j . Furthermore, let Q = (cid:80) M j =1 u j u (cid:62) j with unit vectors u j . Also, denoting x = Θ k W t − k and y = Θ k W t +1 − k it holds A k,jt,t +1 = diag { δ t } xy (cid:62) diag { δ t +1 } . Then, using Lemma A.4 wehave for any unit γ ∈ R N , E ( γ (cid:62) diag { p } − A k,jt,t +1 diag { p } − u j ) = EE δ (cid:88) i,j γ i x i δ ti p i δ t +1 ,j p j y j u j ≤ p − E (cid:107) diag { γ } x (cid:107) (cid:107) diag { u } y (cid:107) + E ( γ (cid:62) x )( u (cid:62) y ) , which due to the subgaussianity of x and y yields, E (cid:107) diag { γ } x (cid:107) (cid:107) diag { u } y (cid:107) ≤ E / (cid:107) diag { γ } x (cid:107) E / (cid:107) diag { u } y (cid:107) ≤ C (cid:48) γ k +2 j E ( γ (cid:62) x )( u (cid:62) y ) ≤ E / ( γ (cid:62) x ) E / ( u (cid:62) y ) ≤ C (cid:48) γ k +2 j . Therefore, we get that ||| E B t B (cid:62) t ||| op = sup (cid:107) γ (cid:107) =1 M (cid:88) j =1 E (cid:16) γ (cid:62) diag { p } − A k,jt,t +1 diag { p } − u j (cid:17) ≤ C (cid:48)(cid:48) p − M γ k +2 j . Using similar derivations we can arrive at σ = ||| E B t B (cid:62) t ||| op ∨ ||| E B (cid:62) t B t ||| op ≤ C (cid:48)(cid:48) p − ( M ∨ M ) γ k +2 j . Now we separate the indices t = 1 , . . . , T into four subsets, such that each correspondsto a set of independent matrices B t . Since each B t is generated by W t − k , W t +1 − j , δ t , and δ t +1 , we need to ensure that none of the pair of indices t, t (cid:48) from the same subset satisfies | t − t (cid:48) | = | k − j + 1 | nor | t − t (cid:48) | = 1. It can be satisfied by the following partition. 
First, we split the indices into two subsets containing the odd and the even indices, respectively, so that neither subset contains two indices with $|t - t'| = 1$. Then, each of these two subsets is further split into two according to the scheme (A.13), so that $|t - t'| = |k - j + 1|$ is also avoided within each subset. Therefore, applying the Bernstein inequality, Theorem A.1, to each sum separately and summing the bounds, we get that for any $u \ge 0$, with probability at least $1 - e^{-u}$,
$$\big\| P\,\mathrm{diag}\{p\}^{-1}(S_{k,j} - \mathbf{E}\, S_{k,j})\,\mathrm{diag}\{p\}^{-1} Q \big\|_{\mathrm{op}} \le C\gamma^{k+j}\left(\sqrt{\frac{p^{-2}(M_1\vee M_2)(\log N + u)}{T}} \vee \frac{\sqrt{M_1M_2}\,(\log N + u)\log T}{T}\right).$$
Similarly to the proof of Theorem 3.5, we take the union of those bounds for every $k, j$ with $u = j + k + u'$, and then the result follows.

B LASSO and missing observations
Suppose we observe a signal $y \in \mathbb{R}^n$ of the form
$$y = \Phi b^{*} + \varepsilon,$$
where $\Phi = [\phi_1, \dots, \phi_p] \in \mathbb{R}^{n\times p}$ is a dictionary of words $\phi_j \in \mathbb{R}^n$ and $b^{*}$ is a sparse parameter with support $\Lambda \subset \{1, \dots, p\}$. We want to recover the exact sparse representation by solving the quadratic program
$$\frac{1}{2}\|y - \Phi b\|^2 + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p}. \qquad \text{(B.1)}$$
Denote by $\mathbb{R}^{\Lambda}$ the set of vectors with elements indexed by $\Lambda$; for $x \in \mathbb{R}^p$ let $x_\Lambda \in \mathbb{R}^{\Lambda}$ be the result of taking only the elements indexed by $\Lambda$. With some abuse of notation we will associate every vector $x_\Lambda \in \mathbb{R}^{\Lambda}$ with the vector $x$ from $\mathbb{R}^p$ that has the same coefficients on $\Lambda$ and zeros elsewhere. Let $\Phi_\Lambda = [\phi_j]_{j\in\Lambda}$ be the subdictionary composed of the words indexed by $\Lambda$, and let $P_\Lambda$ be the projector onto the corresponding subspace. The following sufficient conditions for the global minimizer of (B.1) to be supported on $\Lambda$ are due to Tropp (2006), who uses the notion of the exact recovery coefficient,
$$\mathrm{ERC}_{\Phi}(\Lambda) = 1 - \max_{j\notin\Lambda}\|\Phi_\Lambda^{+}\phi_j\|_1.$$
The results are summarized in Theorem B.1 below.
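As a purely illustrative companion (not taken from the paper), the short Python sketch below solves the program (B.1) by proximal gradient descent and evaluates the exact recovery coefficient for a candidate support; the dictionary, the noise level and all function names are our own choices.

    import numpy as np

    def soft_threshold(v, tau):
        # proximal map of tau * ||.||_1, applied elementwise
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def lasso_B1(Phi, y, gamma, n_iter=5000):
        # minimize 0.5 * ||y - Phi b||^2 + gamma * ||b||_1 by ISTA
        step = 1.0 / np.linalg.norm(Phi, 2) ** 2      # 1 / Lipschitz constant of the quadratic part
        b = np.zeros(Phi.shape[1])
        for _ in range(n_iter):
            b = soft_threshold(b - step * Phi.T @ (Phi @ b - y), step * gamma)
        return b

    def erc(Phi, support):
        # ERC_Phi(Lambda) = 1 - max_{j not in Lambda} ||Phi_Lambda^+ phi_j||_1
        pinv = np.linalg.pinv(Phi[:, support])
        outside = [j for j in range(Phi.shape[1]) if j not in support]
        return 1.0 - max(np.abs(pinv @ Phi[:, j]).sum() for j in outside)

    rng = np.random.default_rng(0)
    n, p, s = 200, 50, 3
    Phi = rng.standard_normal((n, p)) / np.sqrt(n)
    b_star = np.zeros(p); b_star[:s] = [2.0, -1.5, 1.0]           # support Lambda = {0, 1, 2}
    y = Phi @ b_star + 0.05 * rng.standard_normal(n)
    b_tilde = lasso_B1(Phi, y, gamma=0.2)
    print(erc(Phi, list(range(s))), np.nonzero(np.abs(b_tilde) > 1e-6)[0])

When the printed coefficient is positive and $\gamma$ dominates $\|\Phi^{\top}\varepsilon\|_\infty$ in the sense of the theorem below, the recovered support stays inside $\Lambda$.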
Theorem B.1 (Tropp (2006)). Let $\tilde b$ be a solution to (B.1). Suppose that $\|\Phi^{\top}\varepsilon\|_\infty \le \gamma\,\mathrm{ERC}_{\Phi}(\Lambda)$. Then,
• the support of $\tilde b$ is contained in $\Lambda$;
• the distance between $\tilde b$ and the optimal (non-penalized) parameter satisfies
$$\|\tilde b - b^{*}\|_\infty \le \|\Phi_\Lambda^{+}\varepsilon\|_\infty + \gamma\big\|(\Phi_\Lambda^{\top}\Phi_\Lambda)^{-1}\big\|_{\infty,\infty}, \qquad \|\Phi_\Lambda(\tilde b - b^{*}) - P_\Lambda\varepsilon\| \le \gamma\big\|(\Phi_\Lambda^{+})^{\top}\big\|_{2,\infty}.$$

In what follows, we want to extend this result to allow for missing observations. Observe that the program (B.1) is equivalent to
$$\frac{1}{2} b^{\top}[\Phi^{\top}\Phi]\, b - b^{\top}[\Phi^{\top}y] + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p},$$
where $D = \Phi^{\top}\Phi$ and $c = \Phi^{\top}y$. Suppose that instead we only have access to some estimators $\hat D \ge 0$ and $\hat c$ that are close enough to the original matrix and vector, respectively, which may come, e.g., from a missing observations model. Then we can solve instead the following problem,
$$\frac{1}{2} b^{\top}\hat D\, b - b^{\top}\hat c + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p}. \qquad \text{(B.2)}$$
In what follows, we provide a slight extension of Tropp's result towards missing observations; the proof mainly follows the same steps. Below, for a matrix $D$ and two sets of indices $A, B$, we denote the corresponding submatrix by $D_{A,B}$, and for a vector $c$ the corresponding subvector by $c_A$.
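As a rough numerical companion to (B.2), and not part of the original text, the sketch below forms plug-in estimators of $D$ and $c$ from a design whose entries are observed independently with probability p_obs, using standard inverse-probability-weighting corrections, and then minimizes the surrogate objective by proximal gradient; the correction formulas and all names are assumptions made for illustration only.

    import numpy as np

    def surrogate_lasso_B2(D_hat, c_hat, gamma, n_iter=5000):
        # minimize 0.5 * b' D_hat b - b' c_hat + gamma * ||b||_1
        step = 1.0 / np.linalg.norm(D_hat, 2)
        b = np.zeros(len(c_hat))
        for _ in range(n_iter):
            v = b - step * (D_hat @ b - c_hat)
            b = np.sign(v) * np.maximum(np.abs(v) - step * gamma, 0.0)
        return b

    rng = np.random.default_rng(1)
    n, p, p_obs = 500, 40, 0.7
    Phi = rng.standard_normal((n, p))
    b_star = np.zeros(p); b_star[[0, 3, 7]] = [1.5, -2.0, 1.0]
    y = Phi @ b_star + 0.1 * rng.standard_normal(n)

    delta = rng.binomial(1, p_obs, size=(n, p))         # missingness pattern
    Phi_obs = Phi * delta                               # unobserved entries replaced by zero
    # inverse-probability-weighted surrogates for D = Phi'Phi and c = Phi'y
    D_hat = Phi_obs.T @ Phi_obs / p_obs ** 2
    np.fill_diagonal(D_hat, np.diag(Phi_obs.T @ Phi_obs) / p_obs)
    c_hat = Phi_obs.T @ y / p_obs

    b_tilde = surrogate_lasso_B2(D_hat, c_hat, gamma=25.0)
    print(np.nonzero(np.abs(b_tilde) > 1e-3)[0])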
Lemma B.2. Suppose that
$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty \le \gamma\big(1 - \|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\big).$$
Then, the solution $\tilde b$ to (B.2) is supported on $\Lambda$.

Proof. Let $\tilde b$ be the solution to (B.2) under the restriction $\mathrm{supp}(b) \subset \Lambda$. Since $\hat D \ge 0$, such a restricted solution exists and satisfies the first-order condition
$$\hat D_{\Lambda,\Lambda}\tilde b - \hat c_\Lambda + \gamma g = 0, \qquad g \in \partial\|\tilde b\|_1,$$
where $\partial f(b)$ denotes the subdifferential of a convex function $f$ at a point $b$; in the case of the $\ell_1$ norm we have $\|g\|_\infty \le 1$.
Thus,
$$\tilde b = \hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \gamma\hat D^{-1}_{\Lambda,\Lambda} g. \qquad \text{(B.3)}$$
Next, we want to check that $\tilde b$ is a global minimizer. To do so, let us compare the objective function at a point $b = \tilde b + \delta e_j$ for an arbitrary index $j\notin\Lambda$ and $\delta \ne 0$. Since $\|b\|_1 = \|\tilde b\|_1 + |\delta|$, we have
$$L(b) - L(\tilde b) = \frac{1}{2} b^{\top}\hat D b - \frac{1}{2}\tilde b^{\top}\hat D\tilde b - \hat c^{\top}(b - \tilde b) + \gamma|\delta| = \frac{\delta^2}{2} e_j^{\top}\hat D e_j + |\delta|\gamma + \delta\, e_j^{\top}\hat D\tilde b - \delta\hat c_j > |\delta|\gamma + \delta\, e_j^{\top}\hat D\tilde b - \delta\hat c_j,$$
where the latter comes from the fact that $\hat D$ is positive definite. Applying the equality (B.3) yields
$$e_j^{\top}\hat D\tilde b = \hat D_{j,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \gamma\hat D_{j,\Lambda}\hat D^{-1}_{\Lambda,\Lambda} g,$$
therefore, taking into account $\|g\|_\infty \le 1$,
$$L(b) - L(\tilde b) > |\delta|\Big[\gamma\big(1 - \|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\big) - \big|\hat D_{j,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \hat c_j\big|\Big],$$
where the right-hand side is nonnegative by the condition of the lemma. Since $j\notin\Lambda$ and $\delta$ are arbitrary, convexity of the objective implies that $\tilde b$ is a global solution as well.
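Lemma B.2's sufficient condition can be checked directly for a given triple $(\hat D, \hat c, \Lambda)$. The helper below is our own sketch, with $\|\cdot\|_{\infty,\infty}$ implemented as the maximum row-wise $\ell_1$ norm; it returns both sides of the inequality.

    import numpy as np

    def lemma_B2_condition(D_hat, c_hat, support, gamma):
        # left side:  || D_hat[L^c,L] D_hat[L,L]^{-1} c_hat_L - c_hat_{L^c} ||_inf
        # right side: gamma * (1 - || D_hat[L^c,L] D_hat[L,L]^{-1} ||_{inf,inf})
        idx = np.asarray(support)
        comp = np.array([j for j in range(len(c_hat)) if j not in set(support)])
        G = D_hat[np.ix_(comp, idx)] @ np.linalg.inv(D_hat[np.ix_(idx, idx)])
        lhs = np.max(np.abs(G @ c_hat[idx] - c_hat[comp]))
        rhs = gamma * (1.0 - np.max(np.abs(G).sum(axis=1)))
        return lhs, rhs, lhs <= rhs

Applied, for instance, to the plug-in pair of the previous sketch with support = [0, 3, 7], a returned True flag certifies that the minimizer of (B.2) is supported inside $\Lambda$.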
Remark B.1. It is not hard to see that in the exact case $\hat D = \Phi^{\top}\Phi$ and $\hat c = \Phi^{\top}y$ the condition of Lemma B.2 takes the form $\|\Phi^{\top}_{\Lambda^c}(P_\Lambda - I)\varepsilon\|_\infty \le \gamma\,\mathrm{ERC}_{\Phi}(\Lambda)$, analogous to the condition of Theorem B.1.

Since we are particularly interested in applications to time series, the feature matrix $\Phi$ is in fact random, so stating an ERC-like condition on it directly might result in additional unnecessary technical difficulties. Instead, let us assume that there is some other matrix $\bar D$, potentially the expectation of $\Phi^{\top}\Phi$, which is close enough to $\hat D$ (with some probability, although we state all the results of this section deterministically), and the value that controls the exact recovery looks like
$$\mathrm{ERC}(\Lambda; \bar D) = 1 - \|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}.$$
Additionally, we set $\bar c = \bar D b^{*} = \bar D_{\cdot,\Lambda} b^{*}_\Lambda$, the vector that $\hat c$ is intended to approximate. Note that in this case we have $\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda - \bar c_{\Lambda^c} = \bar D_{\Lambda^c,\Lambda} b^{*}_\Lambda - \bar c_{\Lambda^c} = 0$, thus the conditions of Lemma B.2 hold for $\bar D, \bar c$ once $\mathrm{ERC}(\Lambda; \bar D)$ and $\gamma$ are nonnegative. In what follows, we control the values appearing in the lemma for $\hat D$ and $\hat c$ through the differences between $\bar c, \bar D$ and $\hat c, \hat D$, respectively, thus allowing the exact recovery of the sparsity pattern.
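The population quantity $\mathrm{ERC}(\Lambda; \bar D)$ is straightforward to evaluate once $\bar D$ is specified. The snippet below is again our own illustration, with $\bar D$ chosen as a Toeplitz autocorrelation matrix purely for concreteness; it computes one minus the maximum row-wise $\ell_1$ norm of $\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}$.

    import numpy as np

    def erc_population(D_bar, support):
        # ERC(Lambda; D_bar) = 1 - || D_bar[L^c,L] D_bar[L,L]^{-1} ||_{inf,inf}
        idx = np.asarray(support)
        comp = np.array([j for j in range(D_bar.shape[0]) if j not in set(support)])
        G = D_bar[np.ix_(comp, idx)] @ np.linalg.inv(D_bar[np.ix_(idx, idx)])
        return 1.0 - np.max(np.abs(G).sum(axis=1))

    p, rho = 30, 0.5
    D_bar = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type correlation
    print(erc_population(D_bar, [0, 5, 10]))    # nonnegative values are what Lemma B.2 needs for (D_bar, c_bar)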
Corollary B.3. Let $\bar D$ and $\bar c$ be such that $\bar c = \bar D b^{*}$. Assume that
$$\|\hat c - \bar c\|_\infty \le \delta_c, \qquad \|\bar D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\|_\infty \le \delta'_c,$$
$$\|\bar D^{-1}_{\Lambda,\Lambda}(\hat D_{\Lambda,\cdot} - \bar D_{\Lambda,\cdot})\|_{\infty,\infty} \le \delta_D, \qquad \|(\hat D_{\cdot,\Lambda} - \bar D_{\cdot,\Lambda}) b^{*}_\Lambda\|_\infty \le \delta'_D, \qquad \|\bar D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda}) b^{*}_\Lambda\|_\infty \le \delta''_D.$$
Suppose that
$$\mathrm{ERC}(\Lambda; \bar D) \ge 3/4, \qquad 3\delta_c + 3\delta'_D \le \gamma, \qquad s\delta_D \le 1/16,$$
where $|\Lambda| = s$. Then the solution $\tilde b$ to (B.2) is supported on a subset of $\Lambda$ and satisfies
$$\tilde b_\Lambda = \hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \gamma\hat D^{-1}_{\Lambda,\Lambda} g \qquad \text{(B.4)}$$
with some $g\in\mathbb{R}^{s}$ satisfying $\|g\|_\infty \le 1$. The max-norm error satisfies
$$\|\tilde b - b^{*}\|_\infty \le 2\big(\delta''_D + \delta'_c + \gamma\|\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\big),$$
while the $\ell_2$-norm error satisfies
$$\|\tilde b - b^{*}\| \le 2\sqrt{s}\,\big(\delta''_D + \delta'_c + \gamma\sigma^{-1}\big),$$
where $\sigma$ denotes the smallest eigenvalue of $\bar D_{\Lambda,\Lambda}$. If additionally $2(\delta''_D + \delta'_c + \gamma\|\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}) \le \min_{j\in\Lambda}|b^{*}_j|$, then we have exact recovery, so that the following equality takes place:
$$\tilde b_\Lambda = \hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \gamma\hat D^{-1}_{\Lambda,\Lambda} s_\Lambda, \qquad \text{where } s = \mathrm{sign}(b^{*}).$$

Proof. First, observe that
$$D_{\Lambda^c,\Lambda} D^{-1}_{\Lambda,\Lambda} c_\Lambda - c_{\Lambda^c} = \Phi^{\top}_{\Lambda^c}(\Phi_\Lambda\Phi^{+}_\Lambda y - y) = \Phi^{\top}_{\Lambda^c}(P_\Lambda - I)\varepsilon.$$
By Lemma B.4 we have
$$\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le \|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} + 4s\delta_D \le \frac{1}{2},$$
and, since $\bar c_{\Lambda^c} = \bar D_{\Lambda^c,\Lambda} b^{*}_\Lambda = \bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda$,
$$\begin{aligned}
\big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty
&\le \big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda\big\|_\infty + \big\|\hat c_{\Lambda^c} - \bar c_{\Lambda^c}\big\|_\infty \\
&\le \big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \big\|\hat D_{\Lambda^c,\Lambda}(\hat D^{-1}_{\Lambda,\Lambda} - \bar D^{-1}_{\Lambda,\Lambda})\bar c_\Lambda\big\|_\infty + \big\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda\big\|_\infty + \delta_c \\
&\le \big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \big\|\hat D_{\Lambda^c,\Lambda}(\hat D^{-1}_{\Lambda,\Lambda} - \bar D^{-1}_{\Lambda,\Lambda})\bar c_\Lambda\big\|_\infty + \delta'_D + \delta_c.
\end{aligned}$$
Here, $\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\|_\infty \le \delta_c\,\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le \delta_c/2$.
Moreover, we have
$$\big\|\hat D_{\Lambda^c,\Lambda}(\hat D^{-1}_{\Lambda,\Lambda} - \bar D^{-1}_{\Lambda,\Lambda})\bar c_\Lambda\big\|_\infty = \big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda\big\|_\infty \le \|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\,\big\|(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda\big\|_\infty \le \delta'_D/2.$$
Using the condition on $\gamma$, we get that
$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty \le \frac{3}{2}(\delta'_D + \delta_c) \le \frac{\gamma}{2} \le \gamma\big(1 - \|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\big),$$
so that the conditions of Lemma B.2 are satisfied and (B.4) takes place. Therefore, we can write
$$\begin{aligned}
\tilde b_\Lambda - b^{*}_\Lambda
&= \hat D^{-1}_{\Lambda,\Lambda}\hat c_\Lambda - \bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda - \gamma\hat D^{-1}_{\Lambda,\Lambda} g \\
&= \hat D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\bar c_\Lambda + \hat D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\hat D^{-1}_{\Lambda,\Lambda} g \\
&= \hat D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda}) b^{*}_\Lambda + \hat D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\hat D^{-1}_{\Lambda,\Lambda} g \\
&= \hat D^{-1}_{\Lambda,\Lambda}\bar D_{\Lambda,\Lambda}\Big(\bar D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda}) b^{*}_\Lambda + \bar D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\bar D^{-1}_{\Lambda,\Lambda} g\Big).
\end{aligned}$$
By Lemma B.4 we have $\|\hat D^{-1}_{\Lambda,\Lambda}\bar D_{\Lambda,\Lambda}\|_{\infty\to\infty} \le 2$, hence
$$\|\tilde b_\Lambda - b^{*}_\Lambda\|_\infty \le 2\big\|\bar D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda}) b^{*}_\Lambda\big\|_\infty + 2\big\|\bar D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + 2\gamma\|\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le 2\big(\delta''_D + \delta'_c + \gamma\|\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\big).$$
Since we also have $\|\hat D^{-1}_{\Lambda,\Lambda}\bar D_{\Lambda,\Lambda}\|_{\mathrm{op}} \le 2$ and $\|g\| \le \sqrt{s}$, it holds
$$\|\tilde b_\Lambda - b^{*}_\Lambda\| \le 2\sqrt{s}\,\Big(\big\|\bar D^{-1}_{\Lambda,\Lambda}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda}) b^{*}_\Lambda\big\|_\infty + \big\|\bar D^{-1}_{\Lambda,\Lambda}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \gamma\|\bar D^{-1}_{\Lambda,\Lambda}\|_{\mathrm{op}}\Big) \le 2\sqrt{s}\,\big(\delta''_D + \delta'_c + \gamma\sigma^{-1}\big).$$

The proof above relies on the following technical lemma, which collects a few elementary inequalities.
Lemma B.4.
Set $\delta_c = \|\hat c - \bar c\|_\infty$ and $\delta_D = \|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}$. Suppose that $\|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le 1$ and $s\delta_D \le 1/2$. Then it holds that
• for any $q \ge 1$,
$$\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q} \le 2, \qquad \|\hat D^{-1}_{\Lambda,\Lambda}\bar D_{\Lambda,\Lambda}\|_{q\to q} \le 2;$$
• $\|\hat D_{\Lambda^c,\Lambda}\hat D^{-1}_{\Lambda,\Lambda} - \bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le 4s\delta_D$.

Proof.
First, we have
$$\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q} = \|I + (\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q} \le 1 + \|(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\|_{q\to q}\,\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q} \le 1 + s\delta_D\,\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q},$$
which, after solving this inequality and using $s\delta_D \le 1/2$, turns into
$$\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{q\to q} \le \frac{1}{1 - s\delta_D} \le 2.$$
Similarly, $\|\hat D^{-1}_{\Lambda,\Lambda}\bar D_{\Lambda,\Lambda}\|_{q\to q} \le 2$. Next,
$$\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty} \le \|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\,\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty\to\infty} \le 2s\delta_D,$$
and
$$\begin{aligned}
\|\bar D_{\Lambda^c,\Lambda}(\bar D^{-1}_{\Lambda,\Lambda} - \hat D^{-1}_{\Lambda,\Lambda})\|_{\infty,\infty}
&\le \|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\,\|(\hat D_{\Lambda,\Lambda} - \bar D_{\Lambda,\Lambda})\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty\to\infty} \\
&\le \|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\,\|(\hat D_{\Lambda,\Lambda} - \bar D_{\Lambda,\Lambda})\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty\to\infty}\,\|\bar D_{\Lambda,\Lambda}\hat D^{-1}_{\Lambda,\Lambda}\|_{\infty\to\infty} \le 2\,\|\bar D_{\Lambda^c,\Lambda}\bar D^{-1}_{\Lambda,\Lambda}\|_{\infty,\infty}\, s\delta_D,
\end{aligned}$$