Manifold Feature Index: A novel index based on high-dimensional data simplification
Chenkai Xu a, Hongwei Lin a,b, Xuansu Fang a
a School of Mathematical Sciences, Zhejiang University, Hangzhou, 310027, China
b State Key Laboratory of CAD & CG, Zhejiang University, Hangzhou, 310058, China
Abstract
In this paper, we propose a novel stock index model, the manifold feature (MF) index, to reflect the overall price activity of the entire stock market. Based on the theory of manifold learning, the studied stock dataset is assumed to lie on a low-dimensional manifold embedded in a higher-dimensional Euclidean space. After data preprocessing, its manifold structure and discrete Laplace-Beltrami operator (LBO) matrix are constructed. We propose a high-dimensional data feature detection method that detects feature points on the eigenvectors of the LBO, and the stocks corresponding to these feature points are taken as the constituent stocks of the MF index. Finally, the MF index is generated by a weighted formula using the prices and market capitalizations of these constituents. The stock market studied in this research is the Shanghai Stock Exchange (SSE). We propose four metrics to compare the MF index series with the SSE index series (SSE 50, SSE 100, SSE 150, SSE 180, and SSE 380). From the perspective of data approximation, the results demonstrate that our indexes are closer to the stock market than the SSE index series. From the perspective of risk premium, the MF indexes have higher stability and lower risk.
Keywords:
Stock Index, Constituent Selection, Manifold Learning, High-dimensional Data Simplification
1. Introduction
A stock market index or stock index is a statistic that reflects the overall price activity of the stock market. It is typically designed by stock exchanges or financial service providers and computed from the prices of selected stocks (constituents) with weights. For investors and financial managers, a stock market index is a tool to design portfolios, test investment results, and forecast market trends. For companies, news agencies, and government agencies, it is a tool to observe and predict the development trends of social politics and economics. Furthermore, it is the basis of some financial derivatives, such as stock index futures and stock index options, which provide more diverse investment strategies and reduce the risk of investment.

A primary criterion of a stock market index is transparency [1]. For this reason, the two main parts of computing a stock market index, selecting the constituent stocks and presetting the weights, must be published by the stock exchanges or financial service providers who design the index. Some stock indexes, such as the Shanghai Stock Exchange composite index (SSE), whose constituents include all the stocks in the Shanghai Stock Exchange, do not need to select constituent stocks. However, most existing stock indexes, such as the Dow Jones Industrial Average (DJIA) index and the National Association of Securities Dealers Automated Quotations (NASDAQ) composite index, select constituents from a larger stock sample space. The selection criteria include market capitalization (market cap), free float ratio, sales/earnings (S/E), net profit, and other financial factors. The constituents are typically determined by experts in exchanges or financial services after ranking them based on the aforementioned criteria. After selecting the constituents of the stock index, the next step is to preset the weights to composite them as an index. Besides simply
Email addresses: [email protected] (Chenkai Xu), [email protected] (Hongwei Lin), (Xuansu Fang)
Preprint submitted to Expert Systems with Applications, June 22, 2020

using market cap (SSE), the SSE index series (SSE 50, SSE 100, SSE 150, etc.) and the DJIA add free float to determine the weights. In addition, some indexes consider more weighting factors, including cash flow, dividends, book value (Financial Times Stock Exchange index series, FTSE), growth variables, and value variables (S&P index series).

For existing stock index models, the selection of constituents typically considers financial factors such as market cap and free float, regardless of stock price changes. This means that the information on price activity is not used when selecting constituents. Meanwhile, the selection strategy is typically based on ranking and is finally determined by experts in exchanges or financial services; therefore, subjective factors are involved. In this paper, we propose a novel method for selecting constituents that considers more information on price activity and specifies an objective selection rule. Our index is named the manifold feature (MF) index, as we treat the stock dataset as a point cloud embedded in a manifold, and the selection method is based on the structure of the manifold itself.

Our study includes all the stocks in the Shanghai Stock Exchange. After data preprocessing, the closing price curves of all stocks are represented as high-dimensional vectors with the same dimensionality. Under the selection rules, each stock is regarded as a high-dimensional point, and the entire stock dataset is regarded as a point cloud in a high-dimensional Euclidean space. Based on the theory of manifold learning [2, 3], the high-dimensional dataset can be regarded as points on a low-dimensional manifold embedded in a higher-dimensional Euclidean space. Using the construction method in [4], we establish the connections between the stocks and construct the discrete Laplace-Beltrami operator (LBO) matrix.
The eigenfunctions of the LBO, which can be regarded as scalar fields defined on the manifold, present the geometric features of the manifold [5]. We propose a high-dimensional data feature detection method to select constituents using these eigenvectors. In our selection method, we detect the local maximum and minimum points of the eigenvectors as feature points of the entire dataset. The detection starts from the eigenvector with the smallest eigenvalue and stops when the required number of points is reached. As each feature point corresponds to a stock in the SSE, the points (stocks) that we detect are regarded as the constituents selected from the complete stock space. Finally, the MF index is computed using the existing weighting method, similar to that of the SSE composite index.

Our contributions are twofold.

• We propose a new stock index, the MF index, whose method of constituent selection is completely new and different from those of existing stock indexes. The method regards the stocks as points on a low-dimensional manifold embedded in a high-dimensional Euclidean space, and selects constituents based on the manifold structure itself.

• To select the constituents, we propose a high-dimensional data feature detection algorithm based on the LBO of the manifold. The algorithm detects the local maximum and minimum points of the eigenvectors as feature points of the entire dataset, and the stocks corresponding to these feature points are regarded as the constituents selected from the complete stock space.

In addition, to quantitatively measure the quality of our index, this study employs four metrics, including
the Pearson correlation coefficient, Alpha, Beta, and Jensen's alpha. They are used to analyze our index in the experiments, from the aspects of data approximation and risk premium.

The remainder of this paper is organized as follows. Section 2 reviews related work on stock indexes, the applications of the LBO, and the metrics we use to measure the quality of our index. Section 3 introduces the detailed procedure for generating the MF index, including data preprocessing, the theory and computation of high-dimensional data simplification, which is used to determine the constituents of the MF index, as well as the weighting formula. Moreover, we propose an algorithm for our method. In Section 4, the four metrics used to measure the quality of the MF index are introduced. Section 5 presents the experimental results, detailed analyses, and discussions. Section 6 concludes the paper and presents further discussion.
2. Related work
In this section, we review related work in three aspects. First, we address the work related to the stock index model, including stock indexes that are mainly used in global stock markets and novel index models proposed in previous papers and patents. Second, we review manifold learning on high-dimensional data, which is the core idea of our work; moreover, the applications of the LBO on a manifold are introduced. Finally, the studies and applications of the four metrics, the Pearson correlation coefficient, Alpha, Beta, and Jensen's Alpha, are reviewed, because we use these metrics to compare our index with current indexes in our experiments.
A capitalization-weighted index is a stock market index wherein the constituents are weighted according to the total market capitalization (or free-float market capitalization) of their outstanding shares. It is widely used in global stock markets, including the DJIA, NASDAQ, S&P 500, SSE, and HSI. According to Haugen et al. [6], a capitalization-weighted index is inefficient when investors disagree about the risk and expected returns, when short selling is restricted, when investment income is taxed, among other reasons. There is already a small number of stock indexes weighted by a fundamentally based method, which considers a company's economic fundamental factors rather than the company's listed market value [7]. Dow Jones Global Titans 50 (DJGT), Dow Jones Sector Titans Index (DJST), and Dow Jones Asian Titans 50 (DJAT) use two fundamental factors, sales/earnings and net profit, as criteria for selection. The RAFI (Research Affiliates) series in the FTSE index series uses sales/earnings, cash flow, dividends, and book value, which are fundamental factors, as the criteria for selection [8].

Meanwhile, some papers and patents have proposed new stock index models. Wang et al. [7] proposed a fundamental stock price constituent index (JOYFI 300), which uses decision trees and logistic regression to select the constituents and compute the weights. Fernholz et al. [9] proposed a diversity-weighted index, which has the desirable characteristics of a passive large-stock index and a performance advantage over a capitalization-weighted index under conditions of a neutral diversity change. Arnott et al. [10] proposed a non-capitalization weighted index model. Unlike typical market capitalization weighting and price weighting, it uses financial metrics such as book value, sales, revenue, and earnings, as well as some non-financial metrics. Arnott's model avoids overexposure to overvalued securities and underexposure to undervalued securities. Sauter et al.
[11] provided a computer-implemented method to design an index, which uses a systematic stock migration process to add or remove stocks from an index. One advantage of Sauter's model is that the number of stocks in the index need not be a fixed value.
The basic idea of manifold learning, which is employed in this study to analyze stock data, is to assume that high-dimensional data lie on an embedded non-linear manifold in the higher-dimensional space [2, 3]. A typical application of manifold learning is non-linear dimensionality reduction, such as Isomap [2], LLE [3], and Laplacian eigenmaps [4]. Manifold learning is also widely used in other fields, including signal processing, image classification, energy engineering data analysis, medical image data, and financial data dimensionality reduction [12, 13, 14, 15, 16, 17, 18]. Our study focuses not only on the manifold itself, but also on the LBO on the manifold, which is helpful for analyzing the structure of the manifold from the perspective of the frequency domain. In the Fourier analysis of a manifold [19], the eigenvalues of the LBO specify the discrete frequency domain of the manifold, and the eigenfunctions are extensions of the basis functions. They have been widely used in shape analysis [20, 21], shape matching [22], shape retrieval [20, 23], and mesh processing [24, 19]. More importantly, the LBO has been successfully applied in high-dimensional data processing. A well-known application of the LBO in high-dimensional data processing is Laplacian eigenmaps [4]. To appropriately represent high-dimensional data for machine learning and pattern recognition, Belkin et al. [4] proposed a locality-preserving method for non-linear dimensionality reduction using the LBO eigenfunctions. Laplacian eigenmaps have a natural connection to clustering and can be computed efficiently.
In this section, we introduce four metrics that are used to quantitatively measure the quality of ourindex.
The Pearson correlation coefficient was developed by Karl Pearson [25] from a related idea introduced by Francis Galton in the 1880s [26], and its mathematical formula was derived and published by Auguste Bravais in 1844 [27]. It is a metric of the linear correlation between two variables X and Y, and it is widely used in financial research, medicine, bioinformatics, climatology, signal processing, and recommender systems [28, 29, 30, 31, 32, 33, 34]. Alpha, which is also called excess return, is a finance term that measures how much a portfolio or investment returns in excess of the market. It is typically applied in financial research, including fund management, portfolio analysis, leverage aversion, and risk parity [35, 36, 37, 38].
Beta originates from the capital asset pricing model (CAPM) [39] and is a measure of the systematic risk of an individual stock in comparison to the entire market. Similar to Alpha, Beta's main application area is financial research, such as the hedge fund market and stock returns [40, 41, 42, 43]. Jensen's Alpha was proposed by Jensen [44] and considers the risk factor Beta when calculating excess returns. It can be seen as a version of the standard Alpha based on a theoretical performance index instead of a market index. Jensen's Alpha is also mainly applied in financial research, such as portfolio returns, market activity analysis, and stock trading [45, 46, 47, 48].
3. Methodology
The stock market studied in this paper is the Shanghai Stock Exchange (SSE). To construct the manifold structure of all stocks in the SSE, we use the closing price data from January 2014 to January 2018. Suppose we want to update the list of constituents at the beginning of the t-th year; then the price data in the (t−1)-th year are used. If the number of trading days in that year is m, the closing price of each stock can be represented by an m-dimensional vector v = (v^1, v^2, · · · , v^m). Considering listing, delisting, suspension, and other situations, the vector of closing prices may be incomplete (v^i = null). Therefore, we need to perform some preprocessing to ensure that all vectors are complete for the subsequent work.

Data completion.
When a stock is suspended, its closing price may not exist in the database; therefore, the original closing price vector may be incomplete due to the absence of a few elements. In this situation, we set the price of each absent element to that of the previous trading day.
Data screening.
As the study duration is one year, there may be some stocks that are listed or delisted at some time in the target year; the data of these stocks are absent before listing or after delisting. Completing the absent elements here is inappropriate: because the constituents we select are used to represent the entire market in the target year, they should be stable in this year, or at least be traded throughout the year, so stocks involved in listing or delisting are unsuitable. Therefore, we remove the stocks that are listed or delisted during the study duration.
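As an illustration, the completion and screening steps above, together with the unit-norm transformation described next, might be sketched as follows. This is a hypothetical NumPy sketch: the function name `preprocess` and the NaN encoding of missing closes are our assumptions, not the paper's C++ implementation.

```python
import numpy as np

def preprocess(prices):
    """prices: (n_stocks, m_days) array, with np.nan marking missing closes.
    Hypothetical sketch of the three preprocessing steps described in the text."""
    filled = prices.copy()
    # Data completion: carry the previous trading day's close forward.
    for row in filled:
        for k in range(1, len(row)):
            if np.isnan(row[k]):
                row[k] = row[k - 1]
    # Data screening: drop stocks whose series are still incomplete
    # (listed or delisted during the year, leaving leading gaps).
    keep = ~np.isnan(filled).any(axis=1)
    filled = filled[keep]
    # Data transformation: scale each vector to unit Euclidean norm so that
    # distances reflect price changes rather than price levels.
    return filled / np.linalg.norm(filled, axis=1, keepdims=True), keep
```

Forward-filling repairs suspension gaps, while series that still contain missing values (listing or delisting) are dropped rather than completed, as the screening rule requires.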
Data transformation.
In the next section, we build connections between stocks using the Euclidean distance. Suppose there are two stocks with the same price change trend but a significant price difference; then the distance between them will be large. We aim to avoid this situation, because in this paper, the connection between stocks is expected to be related only to price changes rather than to the prices themselves. Therefore, we add a data transformation to the preprocessing. To decrease the influence of the price level, we normalize the vectors in the data preprocessing as

ṽ^i = v^i / √(∑_{k=1}^m (v^k)²).

3.2. High-dimensional data simplification

After data preprocessing, each stock in the studied market is represented by an m-dimensional vector, which can also be regarded as an m-dimensional point. Meanwhile, the entire stock market is regarded as an m-dimensional point cloud. In this section, a high-dimensional point cloud simplification method is introduced. The simplified point cloud is regarded as the constituent stock dataset.

3.2.1. LBO on manifold

The Laplace-Beltrami operator ∆ is the divergence of the gradient. Suppose (
M, g) is a Riemannian manifold (a differentiable manifold M with Riemannian metric g), and f ∈ C² is a real-valued function defined on M. The LBO ∆ on M is defined as

∆f = div(grad f), (1)

where grad f is the gradient of f, and div(·) is the divergence operator. The eigenvalues and eigenfunctions of the operator in Eq. 1 can be calculated by solving the eigen-equation

∆f = −λf, (2)

where λ is a real number. The solution of Eq. 2 is a list of eigenvalues

0 ≤ λ_0 ≤ λ_1 ≤ λ_2 ≤ ... ≤ +∞, (3)

with the corresponding eigenfunctions

φ_0(x), φ_1(x), φ_2(x), · · · , (4)

which are normalized and orthogonal. In the case of a closed manifold, the first eigenvalue λ_0 is always zero. The eigenvalues λ_i, i = 0, 1, 2, · · · specify the discrete frequency domain of the LBO, and the eigenfunctions are the extensions of the basis functions of Fourier analysis to a manifold [19]. The numerical distributions of different eigenfunctions correspond to geometric information of different frequencies, and the eigenfunctions with larger eigenvalues contain higher-frequency information [5]. In Fig. 1, the eigenfunctions of a 2D manifold are shown, with eigenvalues increasing from Fig. 1(a) to Fig. 1(f). For each eigenfunction, the value is shown in colors from blue to red, corresponding to -1 to 1, respectively. The dark red or blue parts correspond to the geometric features of the model, such as convex and concave areas. This provides the main inspiration for our constituent selection method. We detect the local extreme points of the eigenfunctions (typically the points with the darkest color) and merge them into the simplified dataset, which can be regarded as the constituents of the MF index. In the next section, the specific details are introduced.

Figure 1: Overview of LBO eigenfunctions with different frequencies of a 2D manifold.

3.2.2. Discrete LBO on high-dimensional dataset

Our research object is all the stocks in the Shanghai Stock Exchange in a target year.
We use the daily closing price curve to describe a stock. After the data preprocessing described in Section 3.1, each stock can be represented as a high-dimensional vector. According to the assumption in [2, 3, 4], the stocks are points distributed on a lower-dimensional manifold that is embedded in a higher-dimensional Euclidean space. This section describes the construction of the manifold structure and the discrete LBO from these discrete high-dimensional points.

Suppose the number of stocks in the SSE in the study year is n, and each stock is represented as a high-dimensional vector v_i = (v_i^1, v_i^2, · · · , v_i^m), i = 1, 2, · · · , n. The dimension of each stock v_i is m, the number of trading days in this year, so the vector v_i can also be regarded as a point in the m-dimensional Euclidean space. We use the discretization method developed in [4] to construct the discrete LBO. The LBO discretization method consists of two steps:

1. adjacency graph construction,
2. weight matrix construction,

which are explained in detail as follows.

Adjacency graph construction.
In our implementation, we employ k-nearest neighbors (KNN) to construct an adjacency graph (a manifold connection structure). For any vertex v, its KNN set N_v = {v_i, i = 1, 2, · · · , k} is determined using a KD-tree [49], a fast algorithm for finding the k nearest points of v in a high-dimensional Euclidean space. Then, the connection between the vertex v and each neighbor v_i, i = 1, 2, · · · , k is established, thus constructing the adjacency graph. It is worth noting that the KNN relation is not necessarily symmetric; i.e., it is possible that v_i ∈ N_{v_j} but v_j ∉ N_{v_i}. Consequently, the weight matrix (see below) constructed by the KNN method is not guaranteed to be symmetric.
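A minimal sketch of the adjacency-graph step, using SciPy's KD-tree in place of the OpenCV FLANN implementation the paper uses; the function name is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(points, k):
    """KNN table: row i holds the indices of the k nearest neighbours of
    point i (excluding i itself)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)  # k+1: the nearest point of any point is itself
    return idx[:, 1:]
```

On three points at 0, 1, and 3 with k = 1, point 1 is the nearest neighbour of point 2 while point 2 is not the nearest neighbour of point 1, illustrating the asymmetry noted above.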
Weight computation.
Using the adjacency graph, the connections between each point v_i and the other points in the dataset are established. Based on the adjacency graph, the weight w_ij between points v_i and v_j can be calculated as follows:

w_ij = −e^{−‖v_i − v_j‖² / t},  if i, j are adjacent,
w_ii = −∑_{k≠i} w_ik,  if i = j,
w_ij = 0,  otherwise, (5)

where t is an adjustment parameter. As stated above, the weight matrix W̃ = [w_ij] constructed by the KNN method is not symmetric, so we take the matrix

W = (W̃ + W̃ᵀ) / 2,

and the diagonal matrix A = diag(a_1, a_2, · · · , a_n), where a_i = w_ii (refer to Eq. (5)). Finally, the LBO on the manifold is discretized into the matrix

L = A⁻¹ W,

whatever the dimension of the space on which the LBO is defined. Meanwhile, Eq. 2 is discretized as Lφ = A⁻¹Wφ = λφ, where λ is an eigenvalue of the Laplacian matrix L and φ is the corresponding eigenvector. It is equivalent to

Wφ = λAφ. (6)

Because the matrix W is symmetric and the matrix A is diagonal and positive, it has been shown [4] that the eigenvalues λ are all non-negative real numbers satisfying

0 = λ_1 ≤ λ_2 ≤ · · · ≤ λ_n,

and the corresponding eigenvectors φ_i, i = 1, 2, · · · , n are orthogonal, i.e.,

⟨φ_i, φ_j⟩_A = φ_iᵀ A φ_j = 0, i ≠ j.

3.2.3. Data simplification

By solving Eq. (6), the eigenvalues λ_1 ≤ λ_2 ≤ · · · ≤ λ_n and the corresponding eigenvectors φ_i, i = 1, 2, · · · , n are determined. The local maxima and minima of the eigenvectors are used as the feature points. As stated above, the eigenvalue λ_i measures the frequency of its corresponding eigenvector φ_i: the larger the eigenvalue, the higher the frequency of the eigenvector φ_i, and the greater the number of feature points on φ_i. The data simplification algorithm first detects the feature points of the eigenvector φ_1 and adds them to the simplified dataset. Then, it detects and adds the feature points of φ_2 to the simplified dataset.
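Under our reading of Eqs. (5)-(6), the weight construction and the generalized eigen-equation might be sketched as follows. This is a dense NumPy/SciPy toy version with an illustrative function name; the paper solves the sparse problem with ARPACK.

```python
import numpy as np
from scipy.linalg import eigh

def lbo_eigenpairs(points, nbrs, t=1.0, k_eigs=4):
    """Discrete LBO on a KNN graph following Eqs. (5)-(6): negative
    heat-kernel weights off the diagonal, w_ii = -sum_{k != i} w_ik,
    symmetrisation, then the generalized problem W phi = lambda A phi."""
    n = len(points)
    Wt = np.zeros((n, n))
    for i, row in enumerate(nbrs):
        for j in row:
            d2 = np.sum((points[i] - points[j]) ** 2)
            Wt[i, j] = -np.exp(-d2 / t)      # adjacent: w_ij = -e^{-||vi-vj||^2/t}
    W = 0.5 * (Wt + Wt.T)                    # W = (W~ + W~^T) / 2
    np.fill_diagonal(W, 0.0)
    np.fill_diagonal(W, -W.sum(axis=1))      # w_ii = -sum_{k != i} w_ik
    A = np.diag(np.diag(W))                  # a_i = w_ii
    lam, phi = eigh(W, A)                    # eigenvalues in ascending order
    return lam[:k_eigs], phi[:, :k_eigs]
```

On a connected graph the smallest eigenvalue is zero (constant eigenvector), matching the spectrum stated above.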
This procedure is performed iteratively until the number of elements of the simplified dataset reaches the required number of constituents.

As each eigenvector φ_i, i = 1, 2, · · · , n is a scalar function defined on an unorganized point set in a high-dimensional space, the detection of the maxima and minima of this function is not straightforward. For each point x, we use its KNN result N_x, calculated previously in Section 3.2.2, as the neighborhood construction. The method for detecting the maxima and minima of the function φ(x) is as follows:

Maximum and minimum point detection:
For a data point x in the dataset, if φ(y) < φ(x) for all y ∈ N_x, the data point x is a local maximum point of φ(x). Conversely, if φ(y) > φ(x) for all y ∈ N_x, the data point x is a local minimum point of φ(x).

Figure 2: Maximum and minimum point detection: x is a local maximum point in (a), a local minimum point in (b), and neither in (c).

As illustrated in Fig. 2, for any y ∈ N_x, if φ(y) < φ(x), it is labeled as ⊕; otherwise, it is labeled as ⊖. For a local maximum point, all the neighbors are labeled as ⊕; for a local minimum point, all the neighbors are labeled as ⊖.

As each data point x corresponds to a stock in the studied stock market, the simplified dataset composed of feature points naturally corresponds to a simplified stock set, which is regarded as the set of constituents used to compute the MF index.

3.3. Index computation

In this section, a detailed methodology to generate the MF index is introduced. For stock indexes related to the SSE, the SSE composite index is composed of all SSE-listed stocks, including A-share and B-share stocks. For other indexes, such as SSE 50, SSE 100, and SSE 180, the constituents are selected from a larger constituent space. Taking the SSE 50 as an example, its constituents are selected from the constituent list of the SSE 180. The selection criteria include size, liquidity, and so on. After ranking the stocks by negotiable market capitalization and trading value, in principle, the top 50 ranked stocks are selected, except for stocks with abnormal market performance, and some inappropriate stocks are removed by the Index Advisory Committee.
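By contrast, our constituent selection relies on the feature-point detection rule of Section 3.2 (Fig. 2), which can be sketched as follows; the function name is illustrative, and `nbrs` is the KNN table of Section 3.2.2:

```python
import numpy as np

def feature_points(phi, nbrs):
    """Indices that are local maxima or minima of the scalar field phi
    over each point's KNN neighbourhood (the rule of Fig. 2)."""
    feats = []
    for x, neigh in enumerate(nbrs):
        vals = phi[list(neigh)]
        # all neighbours strictly below (max) or strictly above (min)
        if np.all(vals < phi[x]) or np.all(vals > phi[x]):
            feats.append(x)
    return feats
```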
In our method, the constituents of the MF index are determined by the high-dimensional data simplification method stated in Section 3.2.3, and the selected constituent stock set is regarded as a simplified dataset that can represent the entire SSE stock market. The index calculation method we use is the same as that of the SSE composite index. The index is weighted using the following formula:
MF(t) = (∑_i P_i(t) N_i / D) × B, (7)

where MF(t) is the index value at time t. For the i-th stock in the database, P_i(t) is the stock price at time t and N_i is the number of shares issued. B is the base level, which is a fixed number. D is a divisor used to adjust the MF index when changes occur in the constituent list (delisting), the share structure (share changes), or the market cap (rights issues and bonus issues) due to non-trading factors. When such situations happen to a stock, its market cap (P_i(t) × N_i) may change; then the value of the MF index changes, but we expect the index to be unchanged, because it should be affected only by trading. This is the reason for introducing the divisor D in the calculation formula; the adjustment formula for the new divisor D_new is

D_new = D_old × (M_new / M_old), (8)

where D_old is the divisor before adjustment, M_old is the market cap before adjustment, and M_new is the market cap after adjustment.

At the end of this section, the generation method of the MF index is presented in Algorithm 1.

Algorithm 1
MF Index
Input: Closing prices and market caps of stocks in the SSE; number of constituents N.
1. Data preprocessing: the stocks are represented as v_i, i = 1, 2, · · · , n;
2. Build the KNN graph of all stocks and construct the discrete LBO L;
3. Solve Eq. 6 to obtain the eigenvectors φ_i;
4. Initialize the simplified dataset S = Ø, k = 0;
5. Data simplification:
   while |S| < N do
     k = k + 1;
     detect the feature points on φ_k and add them to S;
   end while
   if |S| > N, delete the stocks with the smallest market caps until |S| = N;
6. Calculate the MF index from the constituents in S using Eq. 7.
Output: MF N index.
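The weighting formula Eq. (7) and the divisor adjustment Eq. (8) used in the last step can be sketched as follows; the function names and the base level of 100 are illustrative choices:

```python
def mf_index(prices, shares, divisor, base=100.0):
    """Eq. (7): MF(t) = (sum_i P_i(t) * N_i / D) * B."""
    return sum(p * n for p, n in zip(prices, shares)) / divisor * base

def adjust_divisor(d_old, cap_old, cap_new):
    """Eq. (8): D_new = D_old * M_new / M_old, so that non-trading
    changes in market cap leave the index level unchanged."""
    return d_old * cap_new / cap_old
```

After a rights or bonus issue changes the total market cap without any trading, recomputing the divisor with Eq. (8) reproduces the previous index level, as intended.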
4. Metrics
As stated above, after selecting the constituents using our high-dimensional data simplification method and computing the weighted formula Eq. 7, the proposed MF index is generated. The issue now is how the quality of the MF index should be evaluated. In this section, four metrics are devised to measure the MF index: the Pearson correlation coefficient, Alpha, Beta, and Jensen's alpha. They measure the quality of the MF index from two aspects:

• data correlation (Pearson correlation coefficient),
• risk premium (Alpha, Beta, and Jensen's alpha).

Pearson correlation coefficient (Pearson):
The direct purpose of designing a stock index is to reflect the overall price activity of the stock market using fewer stocks. The constituents we select are expected to represent the entire stock market better than other indexes with the same number of constituents. For example, if we set the number of constituents of the MF index to 50, then the MF 50 should be more similar to the SSE composite index than the SSE 50 is.

We use Pearson to describe the similarity between two indexes. In statistics, Pearson is a metric of the linear correlation between two variables X and Y. In this paper, two indexes are represented as high-dimensional vectors v_X and v_Y, whose dimension m is the number of trading days in a target study year; the vector v_X = (v_X^1, v_X^2, · · · , v_X^m) can be viewed as m samples of the variable X, and the vector v_Y = (v_Y^1, v_Y^2, · · · , v_Y^m) as m samples of the variable Y. Then, Pearson can be calculated as follows:

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y). (9)

If we denote v̄_X = (1/m) ∑_{k=1}^m v_X^k and v̄_Y = (1/m) ∑_{k=1}^m v_Y^k, Eq. (9) can be calculated by

ρ_{X,Y} = ∑_{k=1}^m (v_X^k − v̄_X)(v_Y^k − v̄_Y) / ( √(∑_{k=1}^m (v_X^k − v̄_X)²) √(∑_{k=1}^m (v_Y^k − v̄_Y)²) ). (10)

Pearson measures the correlation between the vectors v_X and v_Y: the closer Pearson is to 1, the higher the correlation between the two indexes v_X and v_Y.

However, in terms of investment fund management, because an index can be regarded as a portfolio, and a well-performing index should be instructive for investors making asset portfolios, we also measure the MF index from the perspective of the risk premium. In this section, we use Alpha, Beta, and
Jensen's Alpha, three metrics that measure the quality of the MF index from three investment perspectives: the excess returns, the risk, and the excess returns after risk adjustment, respectively. Before introducing the metrics, we provide the following symbol descriptions:

• R_i: the realized return of the portfolio or investment,
• R_m: the realized return of the appropriate market index,
• R_rf: the risk-free rate of return.

In this paper, all returns are monthly returns; R_i is calculated using our MF index, R_m is calculated using the SSE composite index (SSE) during the same period, and R_rf is approximated using national debt, taking R_rf = 0.2% for easy calculation.
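Before turning to the risk metrics, the sample form of Pearson in Eq. (10) can be sketched as follows; a minimal NumPy sketch with an illustrative function name:

```python
import numpy as np

def pearson(vx, vy):
    """Eq. (10): sample Pearson correlation of two index series."""
    vx, vy = np.asarray(vx, float), np.asarray(vy, float)
    dx, dy = vx - vx.mean(), vy - vy.mean()  # centre both series
    return float(dx @ dy / np.sqrt((dx @ dx) * (dy @ dy)))
```

Two perfectly linearly related series give 1 (or −1 for an inverse relation), the ideal case of an index tracking the market.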
Alpha ( α ) is a financial term that measures how much a portfolio or investment returns in excess ofthe market. In investment fund management, Alpha is commonly calculated as the excess returns a fund9anager achieves over a fund’s stated benchmark. In this paper, we calculate the average of the MF index’sexcess monthly returns over a specific benchmark, with the average of the SSE’s monthly returns, as: α = ¯ R i − ¯ R m . (11)If the MF index has an α >
0, the realized monthly returns of the MF index exceeds the SSE, which meansthe MF index is a portfolio that performs better than the SSE in the excess monthly returns. Meanwhile,
Alpha can also measure the similarity between the index and the market, that is, the closer
Alpha is to 0,the higher is the similarity between the index and the market.
Beta ( β ) is a measure of the systematic riskof an individual stock in comparison to the entire market. In statistical terms, it represents the slope of theline through a regression of data points from an index’s returns against the entire market. In this paper, wecalculate β as: β = cov ( R i , R m ) var ( R m ) . (12)If the MF index has a β = 1 .
0, it indicates that its price activity is strongly correlated with the market(SSE). If the MF index has a β < .
0, it means that the security of the MF index is theoretically lessvolatile than the market. If the MF index has a β > .
0, it indicates that the security of the MF index istheoretically more volatile than the market. Some stocks even have negative β , a β = − .
0, which meansthat the index is inversely correlated to the market benchmark, as if it were an opposite mirror image of thebenchmarks trends. Meanwhile,
Beta can also measure the similarity between the index and the market,that is, the closer
Beta is to 1.0, the higher is the similarity between the index and the market.
Jensen's Alpha (α_J) is a version of the standard Alpha based on a theoretical performance index instead of a market index. The theoretical performance index usually needs to be adjusted by a risk factor. In this paper, we use β to calculate α_J as:

α_J = R_i − [R_rf + β(R_M − R_rf)].    (13)

If the realized returns of the MF index exceed the theoretical performance index after adjustment by beta, i.e., α_J > 0, the index can be said to have excess returns. Investors are always looking for investment products with higher α_J values. Similar to α, α_J can also measure the similarity between the index and the market: the closer α_J is to 0, the higher the similarity between the index and the market.
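A direct transcription of Eq. (13), with all input values invented for illustration:

```python
# Illustrative values: realized return R_i, market return R_M, risk-free
# rate R_rf (all made up), and a beta as computed by Eq. (12).
r_i, r_m, r_rf, beta = 0.0058, 0.0050, 0.0020, 1.1

# Eq. (13): subtract the beta-adjusted theoretical return from R_i.
alpha_j = r_i - (r_rf + beta * (r_m - r_rf))
# alpha_j > 0 here: the toy index earns excess returns after risk adjustment.
```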
5. Implementation, results and discussions
The market under study is the SSE. A total of 1559 stocks were extracted over a period of 5 years, from 2014 to 2018. We update the constituent list of the MF index on the first trading day of each year; that is, we use the closing prices of all stocks in the t-th year to determine the constituents, and compute the MF index in the (t+1)-th year using these constituents. The excess returns mentioned above are monthly returns. We use the SSE composite index as the market benchmark, and national debt to approximate the risk-free return.

The proposed method is implemented in C++ with OpenCV and ARPACK [50], and it executes on a PC with a 3.6 GHz Intel Core i7-4790 CPU, a GTX 1060 GPU, and 16 GB of memory. We employ the OpenCV function 'cv::flann' to perform the KNN operation [49] and ARPACK to solve the sparse symmetric matrix eigen-equation (Eq. 6).
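The KNN step can be illustrated with a brute-force sketch; the paper uses OpenCV's 'cv::flann' on 244-d yearly price vectors for speed, while the toy 2-d points below are invented:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn(points, query, k):
    """Brute-force k-nearest-neighbour search: a slow stand-in for the
    approximate search done with cv::flann in the paper."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return order[:k]

# Toy 2-d points; the paper's points are 244-d yearly closing-price vectors.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
neighbours = knn(pts, (0.1, 0.1), 2)  # indices of the two closest points
```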
Firstly, in Fig. 3, a visual display of the MF index series is given to illustrate how closely the MF index approximates the market (SSE). We extract the closing prices of all stocks in 2017 (244 trading days) and perform the preprocessing described in Section 3.1; a 244-d point cloud set of 1438 points is then built. The MF index generation algorithm (Algorithm 1) sets the number of constituents to 50, 100, 150, 180, and 380, for later comparison with SSE 50, SSE 100, SSE 150, SSE 180, and SSE 380, respectively. Using the closing
Figure 3: Display of MF index series, SSE index series, and SSE composite index in 2018. (a) Normalized MF 150, SSE 150, and SSE. (b) The difference between MF 150 (180) and SSE vs. that between SSE 150 (180) and SSE. (c) The difference between the MF index series and SSE. (d) The difference between the SSE index series and SSE.
Figure 4: Pearson, Alpha, Beta and Jensen's Alpha of indexes in 2015.

5.1. Metrics

This section quantitatively demonstrates the characteristics of the MF index series using the four metrics introduced in Section 4. In Fig. 4, we calculate the Pearson, Alpha, Beta, and Jensen's Alpha of the MF index series and the SSE index series in 2015.
The Pearson correlation coefficient quantifies the approximation and reflects the similarity between the indexes and the stock market. We set a base line (yellow) at 1.0; the closer the Pearson is to 1.0, the higher the correlation between the indexes and the market (SSE). In Fig. 4(a), the Pearson of MF 100, MF 150, MF 180, and MF 380 are closer to the base line than those of SSE 100, SSE 150, SSE 180, and SSE 380. This result indicates that MF 100, MF 150, MF 180, and MF 380 approximate the stock market (SSE) more closely than SSE 100, SSE 150, SSE 180, and SSE 380, respectively.
Alpha calculates the excess monthly returns of indexes over a benchmark (the SSE composite index). If the Alpha of an index is 0, the index has the same excess returns as the market (SSE). We set the base line at 0; the closer Alpha is to 0, the higher the similarity between the index and the market. In Fig. 4(b), the Alpha of MF 100, MF 150, MF 180, and MF 380 are closer to 0 than those of SSE 100, SSE 150, SSE 180, and SSE 380, respectively. This means that the returns of the MF index series are closer to the market than those of the SSE indexes. From the perspective of Alpha, the MF index series approximates the stock market more closely than the SSE index series (the Alpha of MF 180 and the Jensen's Alpha of MF 180 are close to 0, so they are difficult to distinguish). To reinforce this conclusion, we calculate the Alpha of each index over four years and display the results in Table 1. To make it easier to judge which is closer to the market, we mark the Alpha values that are closer to 0. In Table 1, the α of the MF indexes are closer to 0 than those of the SSE, except for MF 50 in 2015 and 2018. This result reinforces the conclusion above.

Table 1:
Alpha of indexes from 2015-2018

Indexes         Number of constituents
                50         100        150        180        380
2015   SSE   -1.3965%   -0.7249%   -0.0524%    0.0142%    0.6185%
             -0.3905%   -0.2689%   -0.8270%   -0.6286%   -0.4335%
             -0.5911%   -1.4430%    0.3667%   -0.7367%
       MF    -0.8548%   -0.3600%    0.0772%   -0.0696%    0.0982%
Beta calculates the systematic risk of indexes in comparison to the benchmark market (the SSE composite index). We set the base line at 1.0; the closer Beta is to 1.0, the closer the risk of the index is to that of the market. In Fig. 4(c), the Beta of MF 150 and MF 380 are closer to the base line than those of SSE 150 and SSE 380. From the perspective of the metric Beta, this means that MF 150 and MF 380 approximate the stock market (SSE) more closely than SSE 150 and SSE 380, respectively. Meanwhile, from the perspective of systematic risk, the Beta of the MF indexes are all less than those of the SSE indexes; thus, the systematic risk of the MF index series is less than that of the SSE index series in 2015. Combining this with the excess-return results in Fig. 4(b), the MF indexes have lower excess returns and lower systematic risk, while the SSE indexes have higher excess returns and higher systematic risk. This is a typical conclusion in investment theory.
Jensen's Alpha considers both excess returns and systematic risk. In Fig. 4(d), the result is similar to that of Alpha. The Jensen's Alpha of MF 100, MF 150, MF 180, and MF 380 are closer to 0 than those of SSE 100, SSE 150, SSE 180, and SSE 380, respectively. From the perspective of Jensen's Alpha, the MF index series approximates the stock market more closely than the SSE index series.

Summarizing all the results in Fig. 4, the MF index series approximates the stock market more closely than the SSE index series in 2015. To make the conclusion more convincing, in Fig. 5, we calculate the data from 2015 to 2018 and show the mean value of the difference between each metric and its base line. For example, in Fig. 5(a), to compare whether MF 50 or SSE 50 is closer to the market, we calculate the mean distance from the Pearson of MF 50 to the base line as:

D̄_Pearson = (1/4) Σ_{i=2015}^{2018} |Pearson^i_{MF 50} − 1.0|,

and the calculation for the Pearson of SSE 50 is the same. The results demonstrate that, except for the Pearson and Beta of MF 50 and MF 180 (Fig. 5(a) and Fig. 5(c)), the metrics of the MF index series are all less than those of the SSE index series. Finally, we conclude that the MF index series approximates the stock market more closely than the SSE index series.
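The mean-distance-to-base-line comparison can be sketched as follows; the yearly Pearson values are invented, and the base line is 1.0 for Pearson and Beta, 0 for Alpha and Jensen's Alpha:

```python
# Mean absolute distance of a yearly metric from its base line; a smaller
# value means the index tracked the market more closely over 2015-2018.
pearson_mf50 = {2015: 0.97, 2016: 0.93, 2017: 0.95, 2018: 0.96}  # invented

d_bar = sum(abs(v - 1.0) for v in pearson_mf50.values()) / len(pearson_mf50)
```

The same computation, run once per index and per metric, produces the bars plotted in Fig. 5.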
Figure 5: Mean value of Pearson, Alpha, Beta and Jensen's Alpha from 2015 to 2018.
In this section, we discuss the stability of the MF indexes in two respects. On the one hand, for a metric of a single stock index, such as the Pearson of MF 50: since MF 50 is designed to reflect the price activity of the market, its Pearson is expected to stay close to 1.0 in different years; that is, the Pearson of a good index should be stable around 1.0. We use the standard difference to describe the stability of the MF indexes: the closer the standard difference of a metric is to 0, the higher the stability of the index. In Fig. 6, we calculate the standard difference of each metric for each index over 4 years. The results show that the stability of the MF index series is better than that of the SSE index series, especially in Fig. 6(b) and (d). In Fig. 6(a) and Fig. 6(c), the stability of MF 150 and MF 380 is better than that of SSE 150 and SSE 380, respectively.

On the other hand, for a metric of a stock index series, such as the Pearson of the MF index series: as the constituent number N increases, the Pearson of the MF index series is expected to gradually approach 1.0. A stable index series is expected to have a stable Pearson over the entire series (MF 50, MF 100, MF 150, MF 180 and MF 380), so that the MF index series can reflect the activity of the market stably. In Fig. 7, we calculate the standard difference of each metric over each index series per year. The result shows that the MF index series is more stable than the SSE index series in every year.
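Reading the paper's "standard difference" as the standard deviation, the stability measure can be sketched with the standard library; the yearly Beta values are invented:

```python
from statistics import pstdev  # population standard deviation

# Stability of one metric across years: a smaller spread means the index
# reflects the market more consistently.
beta_by_year = [0.98, 1.01, 0.97, 1.00]  # hypothetical Beta of one index

spread = pstdev(beta_by_year)  # closer to 0 = more stable
```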
Figure 6: The standard difference of four metrics from 2015-2018 for each index.
An experimental observation is that MF 50 performs worse than SSE 50 in almost all cases (Fig. 4), while MF 100, MF 150, MF 180, and MF 380 do not. The reason may be that our high-dimensional simplification method approximates the market better as the number of constituents increases, as in Fig. 3(c), whereas the SSE index series does not exhibit such an approximation, as in Fig. 3(d). Thus, MF indexes may perform worse than SSE indexes when they have few constituents; however, as the number of constituents increases, this disadvantage disappears.
6. Conclusion
In this paper, we introduce the construction and calculation of the MF index series in detail. By considering the stock dataset as a low-dimensional manifold embedded in a higher-dimensional Euclidean space, we construct its manifold structure and discrete LBO matrix. After detecting the local maximum and minimum points on the eigenvectors of the LBO, the stocks corresponding to these feature points are regarded as the constituents selected from the complete stock space. Moreover, this study details the weighting formula for generating the MF index and the index maintenance rules applied when the list of constituents changes or other non-trading factors are involved. A complete calculation algorithm (Algorithm 1) for the MF index series is given. In the experiments conducted, we compare the MF index series and the SSE index series using four metrics. The results demonstrate that, from the perspective of data approximation, our indexes are closer to the stock market (SSE composite index), and from the perspective of the risk premium, our indexes have higher stability and lower risk.

Finally, two directions for future work are proposed. First, in terms of the construction of the manifold structure, the descriptors of stocks in this study use only high-dimensional closing prices; they do not consider market cap, free float, S/E, and other financial factors. The advantage is that the relationship between different dimensions need not be considered when calculating the high-dimensional Euclidean distance. However, a manifold structure based solely on the closing price is not the best. Therefore, adding new factors to establish a better manifold structure is worth considering. Second, with regard to the weighting formula of the final MF index, because this study aims to approximate the MF index to the SSE composite index, it uses the same market cap weighting method as the SSE composite index without considering other financial factors such as the free float.
However, in the existing stock market, some stock index weighting calculations consider more financial factors, such as sales, revenue, and earnings. These weighting methods can also reflect the stock market sufficiently; therefore, modifying the weighting method is also worth considering.

Figure 7: The standard difference of the MF index series (SSE index series) in each year.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61872316 and 61932018.

References

[1] A. W. Lo, What is an index?, The Journal of Portfolio Management 42 (2) (2016) 21-36.
[2] M. Balasubramanian, E. L. Schwartz, The isomap algorithm and topological stability, Science 295 (5552) (2002) 7.
[3] D. L. Donoho, C. Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100 (10) (2003) 5591-5596.
[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373-1396.
[5] D. Shen, P. T. Bremer, M. Garland, V. Pascucci, J. C. Hart, Spectral surface quadrangulation, ACM Transactions on Graphics 25 (3) (2006) 1057-1066.
[6] R. A. Haugen, N. L. Baker, The efficient market inefficiency of capitalization-weighted stock portfolios, The Journal of Portfolio Management 17 (3) (1991) 35-40.
[7] X. Wang, L.-z. Yin, Z.-b. Fang, Research on constructing stock price constituent index, Forecasting (3) (2009) 10.
[8] C. W. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica: Journal of the Econometric Society (1969) 424-438.
[9] R. Fernholz, R. Garvy, J. Hannon, Diversity-weighted indexing, Journal of Portfolio Management 24 (2) (1998) 74.
[10] R. D. Arnott, P. C. Wood, Non-capitalization weighted indexing system, method and computer program product, US Patent 7,620,577 (Nov. 17, 2009).
[11] G. U. Sauter, J. D. Troyer, Method of constructing a stock index, US Patent 7,558,751 (Jul. 7, 2009).
[12] Y. Wang, G. Xu, L. Liang, K. Jiang, Detection of weak transient signals based on wavelet packet transform and manifold learning for rolling element bearing fault diagnosis, Mechanical Systems and Signal Processing 54 (2015) 259-276.
[13] B. Tang, T. Song, F. Li, L. Deng, Fault diagnosis for a wind turbine transmission system based on manifold learning and Shannon wavelet support vector machine, Renewable Energy 62 (2014) 1-9.
[14] Y. Huang, G. Kou, A kernel entropy manifold learning approach for financial data analysis, Decision Support Systems 64 (2014) 31-42.
[15] Y. Huang, G. Kou, Y. Peng, Nonlinear manifold learning for early warnings in financial markets, European Journal of Operational Research 258 (2) (2017) 692-702.
[16] R. Guerrero, C. Ledig, A. Schmidt-Richberg, D. Rueckert, Alzheimer's Disease Neuroimaging Initiative (ADNI), et al., Group-constrained manifold learning: Application to AD risk assessment, Pattern Recognition 63 (2017) 570-582.
[17] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric learning for image set classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1137-1145.
[18] S. Changpinyo, W.-L. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327-5336.
[19] B. Vallet, B. Lévy, Spectral geometry processing with manifold harmonics, in: Computer Graphics Forum, Vol. 27, Wiley Online Library, 2008, pp. 251-260.
[20] M. Reuter, F.-E. Wolter, N. Peinecke, Laplace-Beltrami spectra as "shape-DNA" of surfaces and solids, Computer-Aided Design 38 (4) (2006) 342-366.
[21] M. Ovsjanikov, Q. Mérigot, F. Mémoli, L. Guibas, One point isometric matching with the heat kernel, in: Computer Graphics Forum, Vol. 29, Wiley Online Library, 2010, pp. 1555-1564.
[22] A. Sharma, R. Horaud, Shape matching based on diffusion embedding and on mutual isometric consistency, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, IEEE, 2010, pp. 29-36.
[23] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, M. Ovsjanikov, Shape Google: Geometric words and expressions for invariant shape retrieval, ACM Transactions on Graphics (TOG) 30 (1) (2011) 1.
[24] B. Lévy, Laplace-Beltrami eigenfunctions towards an algorithm that "understands" geometry, in: IEEE International Conference on Shape Modeling and Applications 2006 (SMI'06), IEEE, 2006, pp. 13-13.
[25] K. Pearson, Notes on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240-242.
[26] F. Galton, Regression towards mediocrity in hereditary stature, The Journal of the Anthropological Institute of Great Britain and Ireland 15 (1886) 246-263.
[27] A. Bravais, Analyse mathématique sur les probabilités des erreurs de situation d'un point, Impr. Royale, 1844.
[28] B. V. de Melo Mendes, R. M. de Souza, Measuring financial risks with copulas, International Review of Financial Analysis 13 (1) (2004) 27-45.
[29] D. Y. Kenett, X. Huang, I. Vodenska, S. Havlin, H. E. Stanley, Partial correlation analysis: Applications for financial markets, Quantitative Finance 15 (4) (2015) 569-578.
[30] M. M. Mukaka, A guide to appropriate use of correlation coefficient in medical research, Malawi Medical Journal 24 (3) (2012) 69-71.
[31] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016) 346-354.
[32] J. Abbot, J. Marohasy, Application of artificial neural networks to rainfall forecasting in Queensland, Australia, Advances in Atmospheric Sciences 29 (4) (2012) 717-730.
[33] J. Benesty, J. Chen, Y. Huang, On the importance of the Pearson correlation coefficient in noise reduction, IEEE Transactions on Audio, Speech, and Language Processing 16 (4) (2008) 757-765.
[34] L. Sheugh, S. H. Alizadeh, A note on Pearson correlation coefficient as a metric of similarity in recommender system, in: 2015 AI & Robotics (IRANOPEN), IEEE, 2015, pp. 1-6.