Random Projection and Its Applications
Mahmoud Nabil
Department of Electrical and Computer Engineering, Tennessee Tech. University, TN, USA
[email protected]
Abstract—Random projection is a foundational research topic that connects a number of machine learning algorithms under a common mathematical basis. It is used to reduce the dimensionality of a dataset by efficiently projecting the data points into a lower-dimensional space while approximately preserving the relative distances between them. In this paper, we explain the random projection method: its mathematical background and foundation, the applications that currently adopt it, and an overview of its current research directions.
Index Terms—Big Data, Random Projections, Dimensionality Reduction
I. INTRODUCTION
Data transformation and projection are fundamental tools used in many applications to analyze data sets and characterize their main features. Principal component analysis (PCA) for square matrices, and its generalization, singular value decomposition (SVD), for rectangular real or complex matrices, are examples of orthogonal data transformation techniques used in many fields such as signal processing and statistics. They are used to transform sparse matrices into condensed matrices in order to obtain high information density, pattern discovery, space efficiency, and the ability to visualize the data set. Despite their popularity, classical dimensionality reduction techniques have some limitations. First, the resulting directions of projection are data dependent, which causes problems when the size of the data set grows in the future. Second, they require substantial computational resources, so they are impractical for high-dimensional data. For instance, R-SVD, one of the fastest algorithms for SVD, requires $O(kmn + k'n)$ time [1] for an $m \times n$ matrix (where $k$ and $k'$ are constants). Third, in some applications access to the data is restricted to streams, where only frame sequences are available in each period of time. Last, these algorithms approximate data in a low-dimensional space but not near a linear subspace.

Random projection was introduced to address these limitations. The idea is to project the data points onto random directions that are independent of the dataset. Random projection is simpler and computationally faster than classical methods, especially as the dimensionality grows: its computational requirement is $O(dmn)$ for an $m \times n$ matrix [2], where $d$ is the number of projected dimensions. This means that it trades processing time against a controlled accuracy for the intended application. An interesting fact about random projection is that it preserves the distances between the original and the projected data points with high probability. Therefore, besides its geometric intuition, random projection can be viewed as a locality-sensitive hashing method that can be used for data hiding and security applications [3], [4], [5].

Another task that frequently involves random projection when the data dimensionality is high is nearest neighbor search, where the target is to return a group of data points that are closely related to a given query. One can ask here why textual search methods such as the inverted index work on large document data sets but cannot work for images. There are two main reasons. First, textual data are sparse: if you pick any document, it contains only a small set of tokens from the language vocabulary, whereas image data are dense, since for any image the useful pixels span most of the image. Second, the tokens themselves are features of the document, where only two or three words are enough to describe it, unlike pixels. These reasons make random projection appealing for nearest neighbor search applications. The idea is that, for a given search query, instead of performing a brute-force similarity match against all data points in the dataset, we only need to search the region that surrounds the query. The search is done in two stages, namely candidate selection and candidate evaluation, where every data point in the reduced search space is evaluated. The core idea is to partition the search space into dynamic, variable-size regions.
This forces close data points to be mapped to the same regions, which increases their probability of becoming candidates for a given search query in the same region. In addition, to further increase the search success rate, the search space can be partitioned several times, depending on the required accuracy and the processing time. Figure 1 shows an example of random projection using an approximate nearest neighbors method on two-dimensional data, where regions have different colors.

Figure 1. Random projection using approximate nearest neighbors method.

In practice, some companies have adopted random projection in their systems. Spotify, a digital music platform, uses this method to find approximate nearest neighbors for music recommendation [6] as part of their open-source system (https://github.com/spotify/annoy). Etsy, an e-commerce platform, uses random projection for user/product recommendation; their model can be adapted in other ways, such as finding people with similar interests, finding products that can be bought together, and so on.

Random projection is based upon the Johnson-Lindenstrauss lemma [7], proposed in 1984, which states that "a set of points in a high-dimensional space can be projected into a lower-dimensional subspace in such a way that relative distances between data points are nearly preserved." It should be noted that the lower-dimensional subspace is selected randomly based on some distribution. Furthermore, some recent and faster algorithms that rely on this lemma will also be discussed in this paper.

The remainder of this paper is organized as follows. The mathematical background and theorem proof are discussed in Section II. Some faster and computationally efficient random projection methods are discussed in Section III. Applications and the current research perspective are discussed in Section IV. Finally, we draw our conclusion in Section V.
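To make the two-stage search concrete, the following is a minimal sketch, assuming NumPy, of candidate selection with random hyperplanes; the helper name `hash_region`, the data sizes, and the number of hyperplanes are our own illustrative choices, not details of the systems cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_region(points, hyperplanes):
    """Map each point to a region ID given by the sign pattern of its
    dot products with the random hyperplanes."""
    signs = points @ hyperplanes.T > 0          # (n_points, n_planes) booleans
    return [tuple(row) for row in signs]        # hashable region keys

# Toy dataset: 1000 points in 2-D, partitioned by 4 random hyperplanes.
data = rng.normal(size=(1000, 2))
planes = rng.normal(size=(4, 2))                # directions are data-independent
buckets = {}
for idx, key in enumerate(hash_region(data, planes)):
    buckets.setdefault(key, []).append(idx)

# Candidate selection: look only in the query's region, then evaluate.
query = rng.normal(size=(1, 2))
key = hash_region(query, planes)[0]
candidates = buckets.get(key, [])
if candidates:
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    nearest = candidates[int(np.argmin(dists))]
    print(f"evaluated {len(candidates)} of {len(data)} points; nearest id: {nearest}")
```

A single partition can miss near neighbors that fall just across a region boundary, which is why production systems such as Annoy build many independent random partitions and merge their candidates before the evaluation stage.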
II. MATHEMATICAL BACKGROUND
The ultimate aim of any data transformation/projection technique is to preserve as much information as possible between the original and the transformed data sets while better presenting the data in its new form. An essential step towards the proof that a vector $v \in \mathbb{R}^d$, where $d$ is typically large, can be randomly projected to a $k$-dimensional space $\mathbb{R}^k$ is the Johnson-Lindenstrauss lemma [7] below:
Lemma 1. For any $0 < \epsilon < 1$ and a set $V$ of $n$ points in $\mathbb{R}^d$, there exists $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $v_i, v_j \in V$ the following inequality holds with high probability:

$$(1 - \epsilon)\sqrt{k}\,|v_i - v_j| \le |f(v_i) - f(v_j)| \le (1 + \epsilon)\sqrt{k}\,|v_i - v_j|$$

The lemma acts as a limiting bound (sandwich) for the distance between the projected vectors $|f(v_i) - f(v_j)|$ in terms of the distance between the original vectors $|v_i - v_j|$.
Proof. Assume without loss of generality that the projection function $f : \mathbb{R}^d \to \mathbb{R}^k$ is given by $f(v) = (u_1 \cdot v, u_2 \cdot v, \ldots, u_k \cdot v)$, where each $u_i \in \mathbb{R}^d$ is a Gaussian vector with independent zero-mean, unit-variance entries. In addition, assume that $|v| = 1$.

Step 1. Each $u_i \cdot v$ is an independent Gaussian random variable with zero mean and unit variance. This is easily proved: since $u_i \cdot v = \sum_{j=1}^{d} u_{ij} v_j$ is a sum of independent Gaussian random variables, the resulting random variable $u_i \cdot v$ is also Gaussian, with mean equal to the sum of the individual means (zero), and variance

$$\mathrm{Var}(u_i \cdot v) = \mathrm{Var}\Big(\sum_{j=1}^{d} u_{ij} v_j\Big) = \sum_{j=1}^{d} v_j^2\,\mathrm{Var}(u_{ij}) = |v|^2 = 1$$

Step 2. According to the Gaussian Annulus Theorem [8], for any high-dimensional Gaussian vector $x \in \mathbb{R}^d$ and for $\beta \le \sqrt{d}$, all but $3e^{-c\beta^2}$ of the probability mass lies within the annulus $\sqrt{d} - \beta \le |x| \le \sqrt{d} + \beta$. This can be written as

$$\mathrm{Prob}\big(\big|\,|x| - \sqrt{d}\,\big| \ge \beta\big) \le 3e^{-c\beta^2} \quad (1)$$

Applying the Gaussian Annulus Theorem (1) to the Gaussian vector $f(v) \in \mathbb{R}^k$ and setting $\beta = \epsilon\sqrt{k} \le \sqrt{k}$, we get

$$\mathrm{Prob}\big(\big|\,|f(v)| - \sqrt{k}\,\big| \ge \epsilon\sqrt{k}\big) \le 3e^{-c\epsilon^2 k}$$

Multiplying the inner inequality by $|v| = 1$,

$$\mathrm{Prob}\big(\big|\,|f(v)| - \sqrt{k}|v|\,\big| \ge \epsilon\sqrt{k}|v|\big) \le 3e^{-c\epsilon^2 k} \quad (2)$$

The latter equation is called the random projection theorem; it gives an upper bound on the probability that the difference between the norm of the projected vector and $\sqrt{k}$ times the norm of the original vector exceeds a certain threshold. What is interesting is that with high probability $|f(v)| \approx \sqrt{k}|v|$. So to estimate the distance between any two vectors $v_1$ and $v_2$, we can use the projection of their difference, $f(v_1 - v_2) = f(v_1) - f(v_2)$, and

$$\frac{|f(v_1) - f(v_2)|}{\sqrt{k}} \approx |v_1 - v_2|$$

Step 3. By applying the Random Projection Theorem (2) to $v_i - v_j$, the distance $|f(v_i) - f(v_j)|$ is bounded within the range $[(1 - \epsilon)\sqrt{k}|v_i - v_j|,\ (1 + \epsilon)\sqrt{k}|v_i - v_j|]$ with probability $1 - 3e^{-c\epsilon^2 k}$, which proves Lemma 1.

Two interesting facts follow from this proof. First, the number of projected dimensions $k$ is completely independent of the original number of dimensions $d$; it can be shown that it depends only on the number of points in the dataset, in a logarithmic form, and on the selected error threshold $\epsilon$, where

$$k \ge \frac{3 \ln n}{c\epsilon^2} \quad (3)$$

However, the error $\epsilon$ has a quadratic effect in the denominator of equation (3), which means that for an error of 0.01, $k$ should be in the range of tens of thousands, which is very high. Second, unlike PCA and SVD, the projection function is completely independent of the original data. In addition, the $k$ projection directions need not be orthogonal.
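Lemma 1 is easy to check numerically. Below is a minimal sketch, assuming NumPy; the dimensions $d$ and $k$, the number of points $n$, and the threshold $\epsilon$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n, eps = 10_000, 1_000, 20, 0.2

# f(v) = (u_1.v, ..., u_k.v) with each u_i a standard Gaussian vector,
# i.e. one k x d Gaussian matrix applied to every point.
U = rng.normal(size=(k, d))
V = rng.normal(size=(n, d))                     # n points in R^d
P = V @ U.T                                     # their projections in R^k

# Check (1-eps)sqrt(k)|vi-vj| <= |f(vi)-f(vj)| <= (1+eps)sqrt(k)|vi-vj|.
ok, pairs = 0, 0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(V[i] - V[j])
        proj = np.linalg.norm(P[i] - P[j])
        lo, hi = (1 - eps) * np.sqrt(k) * orig, (1 + eps) * np.sqrt(k) * orig
        ok += lo <= proj <= hi
        pairs += 1
print(f"{ok}/{pairs} pairwise distances preserved within eps={eps}")
```

With these settings every pair typically satisfies the bound, while shrinking $k$ or $\epsilon$ makes violations increasingly likely, consistent with equation (3).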
III. COMPUTATIONALLY EFFICIENT METHODS

Despite the simplicity of the random projection method shown in Section II, in some applications, such as databases, it may still be costly. Achlioptas [9] therefore proposed a new method that is computationally efficient for this kind of application. Achlioptas showed that for a random $d \times k$ transformation matrix $T$, where each entry $t_{ij}$ is an independent random variable that follows one of the following very simple probability distributions

$$t_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$$

$$t_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$

then, with probability at least $1 - n^{-\beta}$ and for all vectors in the database, the Johnson-Lindenstrauss lemma is satisfied. This method is very efficient due to the use of integer arithmetic in the calculations.
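A minimal sketch of this construction, assuming NumPy; the helper name `achlioptas_matrix` and the sizes are our own choices. In the sparse variant two thirds of the entries are zero, and the nonzero work reduces to additions and subtractions, with the $\sqrt{3}$ scale applied once at the end.

```python
import numpy as np

rng = np.random.default_rng(2)

def achlioptas_matrix(d, k, sparse=True):
    """Database-friendly projection matrix in the style of Achlioptas [9].
    Entries are +/-1 (prob. 1/2 each), or sqrt(3) * {+1, 0, -1} with
    probabilities {1/6, 2/3, 1/6} in the sparse variant."""
    if sparse:
        vals = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
        return np.sqrt(3) * vals
    return rng.choice([1.0, -1.0], size=(d, k))

# Project 100 points from R^1000 down to R^50 and compare one pairwise
# distance before and after (rescaled so the norms are comparable).
X = rng.normal(size=(100, 1000))
T = achlioptas_matrix(1000, 50)
Xp = (X @ T) / np.sqrt(50)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Xp[0] - Xp[1]))
```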
IV. RELATED APPLICATIONS AND CURRENT RESEARCH

Sparse recovery is an inverse problem to random projection, and it is the basic building block behind compressed sensing and matrix completion. In this section we define each of these applications and show how they were inspired by the random projection idea.
A. Compressed Sensing
According to the Shannon-Nyquist sampling theorem, in order to be able to reconstruct a signal of bandwidth $B$ from its samples, we need a sampling rate of $2B$. In compressed sensing, a much lower sampling rate can be used while signal reconstruction remains achievable.

Consider a camera with a 10-megapixel resolution that captures a high-quality image and then automatically converts it to a storage-efficient format such as JPEG, so that the resulting image can be stored in a compressed form of about 100 kilobytes with about the same resolution acceptable to the human eye. This looks like a large waste of the captured data. The idea is that, unlike the traditional approach of acquiring high-quality measurements and then storing them efficiently, compressed sensing works differently: as shown in Figure 2, the sampling and compression stages are merged together, and the receiver has to decode the incoming message.

Figure 2. Traditional communication architecture versus compressed sensing architecture.

In compressed sensing, each sensor acquires a very low-quality measurement, as in the 'single-pixel camera' [10]; nevertheless, we should be able to combine and decompress all the sensed data and obtain an acceptable quality compared to the 10-megapixel camera. In a nutshell, the classical view of sensing was to measure as much data as possible, which is very wasteful. In compressed sensing, the idea is to take $m$ random measurements; then, with high probability, we are still able to reconstruct the measured signal. In [11], Candès et al. proposed the Exact Reconstruction Principle, which gives new bounds for reconstructing any signal from its random compressed samples.

Let us consider a discrete-time signal $f \in \mathbb{R}^n$. In addition, let $\Psi \in \mathbb{R}^{n \times n}$ be a basis matrix with columns $\psi_i \in \mathbb{R}^n$, so that any signal $f$ can be represented as a linear combination of the columns of $\Psi$. In particular, suppose that our signal is defined by

$$f = \sum_{i=1}^{n} \psi_i x_i = \Psi x$$

where $x \in \mathbb{R}^n$ is a sparse coefficient vector that determines the significance of each basis vector $\psi_i$. We can measure $f$ by taking a few random measurements

$$y_j = \phi_j^T f = \phi_j^T \Psi x \quad (4)$$

where $\phi_j \in \mathbb{R}^n$ is the $j$th compressed sensing vector.
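The guarantees in [11] are stated for $\ell_1$-minimization decoding; as a simpler illustrative stand-in, the sketch below (assuming NumPy, with $\Psi = I$ so that $f = x$, and all sizes chosen arbitrarily) recovers a sparse signal from $m \ll n$ random measurements using greedy orthogonal matching pursuit.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, s = 256, 64, 5                            # signal length, measurements, sparsity

# Sparse coefficient vector x (Psi = I here, so f = x) and random measurements.
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.normal(size=s)
Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # rows are the sensing vectors phi_j
y = Phi @ x                                     # y_j = phi_j^T f, as in eq. (4)

# Orthogonal matching pursuit: greedily pick the column most correlated
# with the residual, then re-fit the coefficients on the chosen support.
support, residual = [], y.copy()
for _ in range(s):
    support.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    residual = y - Phi[:, support] @ coef
x_hat = np.zeros(n)
x_hat[support] = coef
print("recovery error:", np.linalg.norm(x_hat - x))
```

With $m = 64$ random measurements of a 5-sparse signal of length $n = 256$, recovery is typically exact up to numerical error, illustrating how few random projections suffice.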
B. Low Rank Matrix Completion

Another interesting task is low-rank matrix completion. It is used in many applications, such as image in-painting, where the goal is to recover deteriorated pixels in an image, as shown in Figure 3. Another example is the Netflix problem, where the goal is to complete the customer-movie rating matrix, given only some of the customers' ratings, in order to build a robust recommendation system. The Netflix one-million-dollar grand prize was awarded to the BellKor team [16] for a system that improved recommendation accuracy by 10.06%.

Consider a partially observed matrix $Y \in \mathbb{R}^{m \times n}$. We define the matrix completion problem as finding the minimum-rank matrix $X \in \mathbb{R}^{m \times n}$ that best approximates $Y$. Without the rank restriction, the matrix completion problem is underdetermined, because the missing values could be assigned arbitrary values. The mathematical formulation of the problem is

$$\min_{X \in \mathbb{R}^{m \times n}} \mathrm{rank}(X) \quad \text{s.t.} \quad X_{ij} = Y_{ij} \ \text{for observed locations } (i, j)$$

In general, rank minimization is an NP-hard problem. However, in [17] Candès et al. proposed a convex relaxation of the problem that minimizes the nuclear norm $\|X\|_*$, defined as the sum of the singular values of $X$, together with assumptions on the number of observed entries in $Y$ under which $X$ can be recovered with high probability. The nuclear norm minimization is given by

$$\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* = \sum_{i} \sigma_i(X) \quad \text{s.t.} \quad X_{ij} = Y_{ij} \ \text{for observed locations } (i, j) \quad (7)$$

The assumptions proposed to make the matrix completion problem solvable are:

1) The observed entries are sampled uniformly from all subsets of entries.

2) Coherence: the aim is to measure how well the rows and/or columns of $X$ align with the basis vectors. We are interested in low-coherence subspaces: if the column and row spaces are $U$ and $V$, then $\max(\mu(U), \mu(V)) \le \mu_0$ for some positive value $\mu_0$, where $\mu_0$ is the coherence factor. In addition, the matrix $\sum_{1 \le k \le r} u_{ik} v_{jk}$ should have its entries upper-bounded by $\mu_1 \sqrt{r/(n_1 n_2)}$, where $n_1$ and $n_2$ are the matrix dimensions.

3) Number of observed entries: this sets a lower bound on the number of observed elements $m$ in $X$ so that completion is possible. In [17], Candès proved that this lower bound is

$$m \ge C \max(\mu_1^2,\ \mu_0^{1/2}\mu_1,\ \mu_0 n^{1/4})\, n r\, (\beta \log n)$$

where $C$ and $\beta$ are constants and $\mu_1 = \mu_0\sqrt{r}$. For $\beta > 2$, equation (7) is solvable and its solution equals $Y$ with high probability $1 - cn^{-\beta}$.
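Equation (7) is a convex program usually solved by semidefinite programming; as an illustrative proxy, the sketch below uses a simple SoftImpute-style iteration, assuming NumPy: soft-threshold the singular values (which shrinks the nuclear norm) and then restore the observed entries. The sizes, sampling rate, and threshold `tau` are arbitrary choices of ours, not values from [17].

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 60, 50, 3

# Ground-truth low-rank matrix and a uniformly sampled observation mask.
Y = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
mask = rng.random((m, n)) < 0.4                 # ~40% of entries observed

X = np.where(mask, Y, 0.0)                      # start from the observed entries
tau = 1.0                                       # soft-threshold level (arbitrary)
for _ in range(200):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U * np.maximum(s - tau, 0.0)) @ Vt     # singular value shrinkage step
    X[mask] = Y[mask]                           # enforce X_ij = Y_ij on observations
err = np.linalg.norm((X - Y)[~mask]) / np.linalg.norm(Y[~mask])
print("relative error on unobserved entries:", err)
```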
C. Human Activity Recognition

Tracking the state and actions of elderly and disabled people using sensors attached to their bodies has considerable importance in health-care applications. It can facilitate monitoring the patient's body, detecting any abnormal condition, and reporting it. In [18], the authors proposed an offline method that can recognize daily human activities. The system has three main stages: (a) de-noising the sensor data, (b) feature extraction and feature dimensionality reduction using the computationally efficient random projection presented in Section III, and (c) classification using the Jaccard distance between kernel density probabilities. The reported results on the USC-HAD dataset (Human Activity Dataset) are a within-person classification accuracy of 95.52% and an inter-person identification accuracy of 94.75%.

D. Privacy Preserving Distributed Data Mining

In many data mining applications, such as health care, fraud detection, customer segmentation, and bio-informatics, privacy and security concerns have immense importance due to dealing with different types of sensitive data. This calls for privacy-preserving techniques that can work on encrypted or noisy data while being able to compute a set of predefined operations, such as Euclidean distance, dot product, and correlation, accurately and efficiently. In [3], the authors introduced a data perturbation technique using a random projection transformation, where some noise is added to the data before being sent to the cloud server. The proposed technique preserves the statistical properties of the dataset and also allows its dimensionality to be reduced. It is considered a value distortion approach, where all data entries are perturbed directly and at once (i.e., not independently) using multiplicative random projection noise. The advantage of this technique is that many elements are mapped to one element, which is totally different from the traditional individual data perturbation techniques, and therefore it is even harder for an adversary to reconstruct the plaintext data. The technique depends on the following lemmas.

Lemma 2. For a random matrix $R \in \mathbb{R}^{p \times q}$ where all entries $r_{i,j}$ are independent and identically chosen from a Gaussian distribution with zero mean and variance $\sigma_r^2$,

$$E(R^T R) = p\sigma_r^2 I, \qquad E(R R^T) = q\sigma_r^2 I$$

Proof. Let us prove the first equality. Let $\epsilon_{i,j}$ be the $(i,j)$ entry of $R^T R$; then

$$\epsilon_{i,j} = \sum_{t=1}^{p} r_{t,i}\, r_{t,j}$$

$$E(\epsilon_{i,j}) = \sum_{t=1}^{p} E(r_{t,i}\, r_{t,j}) = \begin{cases} \sum_{t=1}^{p} E(r_{t,i})\, E(r_{t,j}) = 0 & i \ne j \\ \sum_{t=1}^{p} E(r_{t,i}^2) = p\sigma_r^2 & i = j \end{cases}$$

Lemma 3. For any two data sets $X \in \mathbb{R}^{m \times n_1}$ and $Y \in \mathbb{R}^{m \times n_2}$, let the random matrix $R \in \mathbb{R}^{k \times m}$ have entries $r_{i,j}$ independent and identically chosen from an unknown distribution with zero mean and variance $\sigma_r^2$, and let $U = \frac{1}{\sqrt{k}\,\sigma_r} R X$ and $V = \frac{1}{\sqrt{k}\,\sigma_r} R Y$. Then

$$E(U^T V) = X^T Y$$

The above result enables the following statistical measurements (distance, angle, correlation) to be applied to the hidden data, given that the original vectors are normalized:

$$\mathrm{dist}(x, y) = \sqrt{\sum_i (x_i - y_i)^2} = \sqrt{\sum_i x_i^2 + \sum_i y_i^2 - 2\sum_i x_i y_i} = \sqrt{2 - 2\, x^T y}$$

$$\cos\theta = \frac{x^T y}{|x| \cdot |y|} = x^T y, \qquad \rho_{x,y} = x^T y$$

Thus, the number of attributes of the data can be reduced by random projection while the statistical dependencies among the observations are maintained. It is worth mentioning that, given only the projected data $U$ or $V$, the original data cannot be retrieved, as the number of possible solutions is infinite.

For error analysis, it can easily be proven that the mean and the variance of the difference between the projected and the original inner products are given by

$$E(u^T v - x^T y) = 0, \qquad \mathrm{Var}(u^T v - x^T y) \le \frac{2}{k}$$

It can be seen that the error goes down as $k$ increases. This implies that the technique works better in high-dimensional spaces.

For privacy analysis, two types of attacks are considered:

1) The adversary tries to retrieve the exact values of the projected matrix $X$ or $Y$. The authors proved that when the original dimensionality $m$ exceeds the projected dimensionality $k$, the linear system is underdetermined, so even if the matrix $R$ is disclosed, the original matrices cannot be retrieved.

2) The adversary tries to estimate the matrix $X$ or $Y$. If the distribution of $R$ is known and the adversary generates $\hat{R}$ according to that distribution, then

$$\frac{1}{\sqrt{k}\,\hat{\sigma}_r} \hat{R}^T u = \frac{1}{\sqrt{k}\,\hat{\sigma}_r} \hat{R}^T \frac{1}{\sqrt{k}\,\sigma_r} R x = \frac{1}{k\,\hat{\sigma}_r \sigma_r}\, \hat{\epsilon}\, x$$

where $\hat{\epsilon} = \hat{R}^T R$, and the estimate of any data element of the vector $x$ is given by

$$\hat{x}_i = \frac{1}{k\,\hat{\sigma}_r \sigma_r} \sum_t \hat{\epsilon}_{i,t}\, x_t$$

The expectation and the variance of this estimate are

$$E(\hat{x}_i) = 0, \qquad \mathrm{Var}(\hat{x}_i) = \frac{1}{k} \sum_t x_t^2$$

So the adversary can only obtain a noisy estimate centered around zero.

The authors considered three applications in their paper, all of which rely on dot product estimation: distance estimation, k-means clustering, and the linear perceptron. As a result, the random projection-based multiplicative perturbation technique keeps both the statistical properties and the confidentiality of the data.
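A quick numerical illustration of Lemma 3, assuming NumPy; all sizes and $\sigma_r$ are arbitrary choices of ours. Both parties perturb their data with the same secret $R$, yet inner products, and hence the distance, angle, and correlation formulas above, are approximately preserved, with error shrinking as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, sigma_r = 2000, 500, 2.0                  # original dim, projected dim (k < m), noise std

# Two parties' data; columns are unit-norm records in R^m.
X = rng.normal(size=(m, 4))
X /= np.linalg.norm(X, axis=0)
Y = rng.normal(size=(m, 4))
Y /= np.linalg.norm(Y, axis=0)

# Both apply the same secret perturbation U = R X / (sqrt(k) sigma_r),
# where R has i.i.d. zero-mean entries with variance sigma_r^2 (Lemma 3).
R = rng.normal(0.0, sigma_r, size=(k, m))
U = R @ X / (np.sqrt(k) * sigma_r)
V = R @ Y / (np.sqrt(k) * sigma_r)

# Inner products survive the perturbation in expectation: E(U^T V) = X^T Y.
print("max |U^T V - X^T Y|:", float(np.abs(U.T @ V - X.T @ Y).max()))
```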
V. CONCLUSION

In this paper, we explained random projection and the mathematical foundation behind it. In addition, we explained some related applications, such as compressed sensing, which made a breakthrough relative to the traditional communication theorems, since a very low sampling rate can be used while signal reconstruction remains achievable. We also explained the matrix completion problem, which is a basis for many data mining tasks such as recommendation systems and image in-painting algorithms.

REFERENCES

[1] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[2] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001, pp. 245-250.
[3] K. Liu, H. Kargupta, and J. Ryan, "Random projection-based multiplicative data perturbation for privacy preserving distributed data mining," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 92-106, 2006.
[4] S. Jassim, H. Al-Assam, and H. Sellahewa, "Improving performance and security of biometrics using efficient and stable random projection techniques," in Image and Signal Processing and Analysis, 2009. ISPA 2009. Proceedings of 6th International Symposium on. IEEE, 2009, pp. 556-561.
[5] B. Yang, D. Hartung, K. Simoens, and C. Busch, "Dynamic random projection for biometric template protection," in Biometrics: Theory Applications and Systems (BTAS), 2010 Fourth IEEE International Conference on. IEEE, 2010, pp. 1-7.
[6] S. Arya, D. M. Mount, N. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," in Proc. 5th ACM-SIAM Sympos. Discrete Algorithms, 1994, pp. 573-582.
[7] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, no. 189-206, p. 1, 1984.
[8] A. Blum, J. Hopcroft, and R. Kannan, "Foundations of data science," Vorabversion eines Lehrbuchs, 2016.
[9] D. Achlioptas, "Database-friendly random projections," in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2001, pp. 274-281.
[10] M. F. Duarte, M. A. Davenport, D. Takbar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83-91, 2008.
[11] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
[12] E. J. Candes and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203-4215, 2005.
[13] J. Lukas, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205-214, 2006.
[14] C. A. Metzler, A. Maleki, and R. G. Baraniuk, "From denoising to compressed sensing," IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5117-5144, 2016.
[15] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, no. 3, pp. 253-263, 2008.
[16] Y. Koren, "The BellKor solution to the Netflix grand prize," Netflix Prize Documentation, vol. 81, pp. 1-10, 2009.
[17] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
[18] R. Damaševičius, M. Vasiljevas, J. Šalkevičius, and M. Woźniak, "Human activity recognition in AAL environments using random projections," Computational and Mathematical Methods in Medicine, vol. 2016, 2016.