Efficient Sequential and Parallel Algorithms for Estimating Higher Order Spectra
Abdullah-Al Mamun, Zigeng Wang, Xingyu Cai, Nalini Ravishanker, Sanguthevar Rajasekaran
Dept. of Computer Science, University of Connecticut, Storrs, CT 06269, USA
[email protected]
Dept. of Statistics, University of Connecticut, Storrs, CT 06269, USA
Abstract.
Polyspectral estimation is a problem of great importance in the analysis of nonlinear time series, with applications in biomedical signal processing, communications, geophysics, image, radar, sonar and speech processing, etc. Higher order spectra (HOS) have been used in unsupervised and supervised clustering in big data scenarios, in testing for Gaussianity, to suppress Gaussian noise, to characterize nonlinearities in time series data, and so on [13].

Any algorithm for computing the k-th order spectrum of a time series of length n needs Ω(n^{k-1}) time, since the output size will be Ω(n^{k-1}) as well. Given that we live in an era of big data, n could be very large. In this case, sequential algorithms might take unacceptable amounts of time. Thus it is essential to develop parallel algorithms. There is also room for improving existing sequential algorithms. In addition, parallel algorithms in the literature are nongeneric. In this paper we offer generic sequential algorithms for computing higher order spectra that are asymptotically faster than any published algorithm for HOS. Further, we offer memory efficient algorithms. We also present optimal parallel implementations of these algorithms on parallel computing models such as the PRAM and the mesh. We provide experimental results on our sequential and parallel algorithms. Our parallel implementation achieves very good speedups.

Keywords: higher order spectra; sequential algorithms; parallel algorithms; linear speedups
Fast computation of HOS such as the bispectrum and the trispectrum becomes especially crucial for long nonlinear time series. For example, intra-day financial data analysis usually involves very long time series of stock or index returns, or time durations between events of interest such as price or volume changes, see [18]. Typically, depending on the liquidity of a stock, the time series length within a single day can be as high as 20,000 or more. [15] discusses the use of HOS for monitoring the condition of rotating machinery due to cracks whose signatures are captured as long nonlinear time series consisting of 2,560 observations per second. Existing algorithms are very slow. For instance, the MATLAB code [17] to compute the bispectrum takes 23 seconds on an input of size 2,048. In the application of cracks and misalignment detection [15], if we collect samples for one hour, the sample size will be more than 9 million and the MATLAB code will take an estimated time of more than 14 years! Thus it is essential to improve existing sequential algorithms, see [6]. It is also important to develop effective parallel algorithms. Existing parallel algorithms are either inefficient or apply to only specific architectures. In this paper we offer general parallel algorithms that are very efficient.

Problem Statement: If X(i) is a stationary random process (i denoting discrete time), the moments of order k are given by [13]:

m_k^X(w_1, w_2, ..., w_{k-1}) = E[X(i) X(i + w_1) X(i + w_2) ··· X(i + w_{k-1})].

Cumulants are functions of the moments. For example, the first order cumulant is defined as c_1^X = m_1^X = E[X(i)], the second order cumulant as c_2^X(w_1) = m_2^X(w_1) − (m_1^X)^2, and so on. The moments and cumulants defined above are based on expectations over the (infinite) ensemble.
For ergodic processes, these ensemble averages may be estimated using the corresponding time averages. The Fourier transforms of the third and fourth cumulants are respectively the bispectrum and the trispectrum. The problem we address is the following: Given a finite sequence X(1), X(2), ..., X(n), compute a smoothed sample bispectrum and trispectrum which are statistically consistent estimates of the corresponding true HOS.

Some Applications:
HOS are useful in unsupervised and supervised classification of long sequences of nonlinear time series, with applications in finance, geoscience, neuroscience, etc. HOS can also be used as a test for Gaussianity of any data, since if X(i) is Gaussian, the cumulant c_k^X(w_1, w_2, ..., w_{k-1}) = 0 for k > 2.

Direct Method for HOS:
Two kinds of algorithms can be found in the literature for HOS: a direct method using the Fast Fourier Transform (FFT) and an indirect method via the Fourier transform of the third cumulant. In this paper we use the direct method. However, the algorithms we propose can be used for the indirect method as well. The following summary of the direct method can be found in [13]. Let X(1), X(2), ..., X(n) be the input sequence. The direct bispectrum (DBS) method works as follows:

Algorithm
DBS
1) Partition the input into K parts with M samples in each part. Let X^i stand for the i-th part, for 1 ≤ i ≤ K = n/M.
2) In each part, subtract the mean of that part from each element in the part.
3) Compute the Discrete Fourier Transform F_X^i(k) for each part: F_X^i(k) = Σ_{u=0}^{M−1} X^i(u) e^{−j(2π/M)uk}, for k = 0, 1, ..., M−1, i = 1, 2, ..., K, and j = √−1.
4) Estimate the third order spectrum of each part: C_X^i(k_1, k_2) = (1/M) F_X^i(k_1) F_X^i(k_2) F_X^{i*}(k_1 + k_2), for i = 1, 2, ..., K, where * denotes complex conjugation. Due to various symmetries, it suffices to compute the bispectrum C_X^i(k_1, k_2) only in the principal domain: 0 ≤ k_2 ≤ k_1, k_1 + k_2 < M/2.
5) Smoothing over a window of size M_3 × M_3 yields a consistent estimate of the true bispectrum: C̃_X^i(k_1, k_2) = (1/M_3^2) Σ_{n_1=−M_3/2}^{M_3/2−1} Σ_{n_2=−M_3/2}^{M_3/2−1} C_X^i(k_1 + n_1, k_2 + n_2).
6) The estimated bispectrum of the entire time series is computed as the average over all parts: Ĉ(w_1, w_2) = (1/K) Σ_{i=1}^{K} C̃_X^i(w_1, w_2).

Time Complexity Analysis:
Step 2 in the direct method takes O(n) time. Step 3 takes a total of O(n log M) time. Step 4 takes O(M^2) time per part. Thus Step 4 takes a total of O(KM^2) = O(Mn) time. In Step 5, smoothing is done. For every point (k_1, k_2), the smoothed value C̃_X^i(k_1, k_2) is computed as the average value of C_X^i over a region of size O(M_3^2). Thus each such computation takes O(M_3^2) time. The total time taken in Step 5 is O(M_3^2 · KM^2) = O(M_3^2 · Mn). In summary, the total run time of the direct method is O(M_3^2 · Mn). In this paper we show that this run time can be improved to O(Mn). Note that this run time is independent of M_3.

Known parallel algorithms for HOS:
We summarize below some of the known algorithms. As we can see, these algorithms are very inefficient and restricted to specific architectures.

Manolakos, et al. [12] discuss the importance of power spectra in signal processing. Following this, they employ the canonical mapping methodology (CMM) to derive parallel programs for computing the bispectrum. This paper focused exclusively on the design of the systolic array and no experimental results were presented. In [10], the authors present data parallel algorithms for computing 3rd and 4th order moments on the MasPar-1 SIMD parallel system. Their program handles input sequences only up to a fixed length. Their algorithm can be thought of as a mesh algorithm. No time complexity analyses were given in the paper and the algorithm was very specific to the MasPar-1 machine. In [11] also, the authors consider the parallel computation of the bispectrum. They have implemented the direct and the indirect methods using two different parallel programming techniques: semi-automatic and fully automatic using the Power C Analyzer. The machine used was the Silicon Graphics Power Challenge MIMD Machine HOTBLACK. This paper also falls under the category of developing a parallel program for a specific machine. In [4] and [3] the authors consider parallel reconstruction of images using bispectra. They parallelize the bispectrum algorithm in a straightforward manner without worrying about achieving optimal run times.

Contributions of this paper:
None of the above papers deals with the problem of constructing smoothed sample HOS which are consistent estimates of the true HOS. In this paper our focus is on developing generic parallel algorithms that can be employed on any parallel machine or platform. We also provide experimental evaluations of our algorithms. For HOS computing algorithms, one of the major bottlenecks could be the memory needed. For computing order k moments, the memory needed is Ω(n^{k-1}). This could indeed be prohibitive. For example, when k = 3 and n = 10^6, the memory needed will be at least 1,000 GB. Thus it is essential to develop memory efficient algorithms. In this paper we address this crucial problem. Also, for bispectrum computation with smoothing over a window of size M_3, existing algorithms take O(M_3^2 · Mn) time. In this paper we present sequential and parallel algorithms that do only O(Mn) work. Here M is the size of each part of the input.

In this section we show how to improve the run time of the direct method from O(M_3^2 · Mn) to O(Mn). The new algorithm is based on an efficient way of computing window sums, which we describe next.

The case of 1D data:
Let X = k_1, k_2, ..., k_n be any sequence of real numbers and let w be a window size. The problem is to compute s_i = Σ_{j=1}^{w} k_{i+j−1}, for 1 ≤ i ≤ (n − w + 1).

A straightforward algorithm for this problem will take O(nw) time. We can improve this to O(n) using overlaps in successive window sums. Specifically, s_{i+1} = s_i − k_i + k_{i+w}, for 1 ≤ i ≤ (n − w). This means that s_{i+1} can be obtained from s_i in O(1) time. Therefore, if we compute the window sums in the order s_1, s_2, ..., s_{n−w+1}, we can compute all of them in O(n) time.
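The 1D window-sums recurrence above can be sketched as follows (a minimal illustration; the function name and plain-Python style are ours, not from the paper):

```python
def window_sums(x, w):
    """All w-length window sums of x in O(n) time.

    The first window is summed directly; every later window
    reuses the previous one: s[i+1] = s[i] - x[i] + x[i+w].
    """
    n = len(x)
    s = [sum(x[:w])]
    for i in range(1, n - w + 1):
        s.append(s[i - 1] - x[i - 1] + x[i + w - 1])
    return s
```

For example, `window_sums([1, 2, 3, 4, 5], 2)` returns `[3, 5, 7, 9]`, at a cost of one addition and one subtraction per window after the first.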
The above idea can be extended to 2D data as well. Let A = (a_{i,j}) be an n × n matrix and let w be a window size. Consider the problem of computing s_{i,j} = Σ_{u=1}^{w} Σ_{v=1}^{w} a_{i+u−1,j+v−1}, for 1 ≤ i ≤ (n − w + 1) and 1 ≤ j ≤ (n − w + 1).

A trivial algorithm for solving the above problem will take O(n^2 w^2) time. We can improve this run time to O(n^2) as follows. Here r_{i,j} denotes the horizontal window sum starting at (i, j) and c_{i,j} the vertical window sum starting at (i, j); using the 1D technique above, all the r values in a row (respectively, all the c values in a column) can be computed in O(n) time.

Algorithm 1 WS
1: for i = 1 to n do
2:   for j = 1 to (n − w + 1) do
3:     r_{i,j} = Σ_{k=0}^{w−1} a_{i,j+k};
4:   end for
5: end for
6: for j = 1 to n do
7:   for i = 1 to (n − w + 1) do
8:     c_{i,j} = Σ_{k=0}^{w−1} a_{i+k,j};
9:   end for
10: end for
11: Compute s_{1,1} in w^2 time;
12: for j = 2 to (n − w + 1) do
13:   s_{1,j} = s_{1,j−1} − c_{1,j−1} + c_{1,j+w−1};
14: end for
15: for i = 2 to (n − w + 1) do
16:   for j = 1 to (n − w + 1) do
17:     s_{i,j} = s_{i−1,j} − r_{i−1,j} + r_{i+w−1,j};
18:   end for
19: end for

Analysis:
The total run time of the above algorithm is O(n^2).

Direct Method for Bispectrum: We can employ the above window sums algorithms in the smoothing step (Step 5) of the direct method. In this case we get the following theorem.
Theorem 1.
We can compute the bispectrum of any input of size n using the direct method in O(Mn) time, M being the partition size. □

In this section we describe the parallel models of computing that we employ in this paper, namely, the PRAM and the mesh. A Parallel Random Access Machine (PRAM) is a collection of RAMs working in synchrony where communication takes place with the help of a common block of shared memory [7, 8]. Depending on how read and write conflicts are handled, a PRAM can further be classified into three: Exclusive Read and Exclusive Write (EREW) PRAM, Concurrent Read and Exclusive Write (CREW) PRAM, and Concurrent Read and Concurrent Write (CRCW) PRAM. There are variants of a CRCW PRAM depending on how write conflicts are handled. In a Common-CRCW PRAM, concurrent writes are permissible only if the processors trying to write in the same cell at the same time have the same data to write. In an Arbitrary-CRCW PRAM, if more than one processor tries to write in the same cell at the same time, an arbitrary one of them succeeds. In a Priority-CRCW PRAM, processors have assigned priorities; write conflicts are resolved using these priorities.

An n × n mesh can be represented as a directed n × n grid-graph whose nodes correspond to processing elements and whose edges correspond to bidirectional communication links [7]. If two processors are connected by an edge, they can communicate in a unit step. Otherwise, they communicate by sending a message along a connecting path. The work done by a parallel algorithm that uses P processors and runs in time T is defined as the product P × T.

Let ⊕ be any associative unit-time computable binary operator defined in some domain Σ. Given a sequence of n elements k_1, k_2, ..., k_n from Σ, the problem of prefix computation is to compute k_1, k_1 ⊕ k_2, ..., k_1 ⊕ k_2 ⊕ ··· ⊕ k_n. Proof of the following Lemma can be found in relevant texts (such as [7, 8]).
Prefix computation on a sequence of n elements can be performed in O(log n) time using n/log n EREW PRAM processors.
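A parallel prefix computation proceeds in O(log n) rounds. The following sequential simulation of the simpler recursive-doubling variant (which does O(n log n) work, unlike the work-optimal algorithm of Lemma 1) conveys the idea; it is our illustrative sketch, not the lemma's algorithm:

```python
def prefix(seq, op):
    """Prefix computation for an associative operator `op`.

    Simulates recursive doubling: in round d, position i combines
    the value at position i - d (d a power of two) with its own;
    after ceil(log2 n) rounds, position i holds seq[0] op ... op seq[i].
    """
    a = list(seq)
    d = 1
    while d < len(a):
        # On a PRAM, all positions update in parallel in one step.
        a = [a[i] if i < d else op(a[i - d], a[i]) for i in range(len(a))]
        d *= 2
    return a
```

For instance, `prefix([1, 2, 3, 4], lambda x, y: x + y)` returns `[1, 3, 6, 10]`, and the operator can be any associative one, e.g. `max`.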
We now show how to implement the direct method on an EREW PRAM optimally. First we consider the computation of window sums in 1D and 2D.
The case of 1D data in parallel:
Let X = k_1, k_2, ..., k_n be any sequence of real numbers and let w be a window size. The problem is to compute s_i = Σ_{j=1}^{w} k_{i+j−1}, for 1 ≤ i ≤ (n − w + 1).

A straightforward PRAM algorithm for this problem could use (n − w + 1) processors. Each processor can in parallel compute one window sum in O(w) time. The work done will be O(nw). We can improve these bounds using the prefix computation.

1) Perform a prefix sums computation on k_1, k_2, ..., k_n. Let the results be q_1, q_2, ..., q_n; let q_0 = 0;
2) for i = 1 to (n − w + 1) in parallel do s_i = q_{i+w−1} − q_{i−1};

Analysis:
Step 1 can be done using n/log n EREW PRAM processors in O(log n) time (c.f. Lemma 1). The for loop of line 2 can be performed in O(1) time using n EREW PRAM processors. Using the slow-down lemma (see e.g., [7, 8]), Step 2 can also be completed in O(log n) time using n/log n processors. Thus we arrive at the following lemma.

Lemma 2.
The window sums computation problem on any input sequence of length n can be solved in O(log n) time using n/log n EREW PRAM processors. □
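With a cumulative-sum routine standing in for the parallel prefix computation of Lemma 1, the two-step algorithm above reads as follows (a stdlib-only sketch; the paper's experiments use C++/OpenMP):

```python
from itertools import accumulate

def window_sums_prefix(x, w):
    """Window sums via prefix sums: s_i = q_{i+w-1} - q_{i-1},
    with q the prefix-sum sequence and q_0 = 0.
    Both steps parallelize directly (Lemmas 1 and 2)."""
    q = [0] + list(accumulate(x))            # step 1: prefix sums
    return [q[i + w] - q[i]                  # step 2: one subtraction per window
            for i in range(len(x) - w + 1)]
```

For example, `window_sums_prefix([1, 2, 3, 4, 5], 2)` returns `[3, 5, 7, 9]`; on a PRAM the subtraction step is a single parallel round.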
The case of 2D data in parallel:
Let A = (a_{i,j}) be an n × n matrix and let w be a window size. We are interested in computing s_{i,j} = Σ_{u=1}^{w} Σ_{v=1}^{w} a_{i+u−1,j+v−1}, for 1 ≤ i ≤ (n − w + 1) and 1 ≤ j ≤ (n − w + 1).

A trivial algorithm for solving the above problem will do O(n^2 w^2) work. We can improve this work to O(n^2) as follows. In this algorithm, t_{i,0} = 0, for 1 ≤ i ≤ (n − w + 1).

Algorithm 2
WS PRAM
1: for j = 1 to n in parallel do
2:   Compute window sums in column j;
3:   Specifically, let c_{i,j} = Σ_{k=0}^{w−1} a_{i+k,j}, for 1 ≤ i ≤ (n − w + 1);
4: end for
5: for i = 1 to (n − w + 1) in parallel do
6:   Perform a prefix sums computation on c_{i,1}, c_{i,2}, ..., c_{i,n};
7:   Let the results be t_{i,1}, t_{i,2}, ..., t_{i,n};
8: end for
9: for i = 1 to (n − w + 1) in parallel do
10:   for j = 1 to (n − w + 1) in parallel do
11:     s_{i,j} = t_{i,j+w−1} − t_{i,j−1};
12:   end for
13: end for

Analysis:
In line 1, for a specific value of j, window sums can be computed in O(log n) time using n/log n EREW PRAM processors (c.f. Lemma 2). Thus the for loop of line 1 can be completed in O(log n) time given n^2/log n EREW PRAM processors.

In line 5, for any given value of i, the prefix sums computation can be performed in O(log n) time using n/log n EREW PRAM processors (c.f. Lemma 1). As a result, the for loop of line 5 takes O(log n) time given n^2/log n EREW PRAM processors.

Line 11 can be performed (for a given i and j) in O(1) time using one processor. Therefore, the for loop of line 9 can be performed in O(1) time given (n − w + 1)^2 EREW PRAM processors. Using the slow-down lemma, the for loop of line 9 can also be completed in O(log n) time using n^2/log n processors.

Put together, the above algorithm runs in a total of O(log n) time using n^2/log n EREW PRAM processors. Clearly, this algorithm is asymptotically work-optimal. We arrive at the following theorem:
Theorem 2.
The window sums computation problem on any n × n matrix can be solved in O(log n) time using n^2/log n EREW PRAM processors. □
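Algorithm WS PRAM can be simulated sequentially as below; each commented stage is a loop that would run as one parallel phase on the PRAM (the plain-Python phrasing is ours):

```python
from itertools import accumulate

def window_sums_2d(A, w):
    """2D w-by-w window sums following Algorithm WS PRAM:
    vertical window sums per column, prefix sums per row,
    then one subtraction per output entry."""
    n = len(A)
    # Lines 1-4: vertical window sums c[i][j] = sum_k A[i+k][j];
    # each column is an independent 1D problem (parallel over j).
    cols = list(zip(*A))
    c = [[0] * n for _ in range(n - w + 1)]
    for j in range(n):
        q = [0] + list(accumulate(cols[j]))
        for i in range(n - w + 1):
            c[i][j] = q[i + w] - q[i]
    # Lines 5-8: prefix sums t along each row of c (parallel over i).
    # Lines 9-13: s[i][j] = t[i][j+w-1] - t[i][j-1], with t[i][0] = 0.
    s = []
    for row in c:
        t = [0] + list(accumulate(row))
        s.append([t[j + w] - t[j] for j in range(n - w + 1)])
    return s
```

For instance, `window_sums_2d([[1, 2], [3, 4]], 2)` returns `[[10]]`, the sum of the single 2×2 window.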
In this section we present a PRAM algorithm for direct bispectrum computation. There are 6 steps in the algorithm (c.f. Algorithm DBS). We discuss how to parallelize each step. Let X(1), X(2), ..., X(n) be the input sequence.

Step 1 is that of partitioning the data into K parts and this does not cost any time, since the input will be given in the common memory. Let the parts be X^i, 1 ≤ i ≤ K.

In Step 2, finding the mean of X^i can be done in O(log M) time using M/log M processors, for a specific i. Thus the means of all the parts can be found in O(log M) time using n/log M processors. Using the slow-down lemma, Step 2 can be performed in O(log n) time using n/log n EREW PRAM processors.

Step 3 involves the computation of the Discrete Fourier Transform (DFT) F_X^i(k) for each part: F_X^i(k) = Σ_{u=0}^{M−1} X^i(u) e^{−j(2π/M)uk}, for k = 0, 1, ..., M−1 and i = 1, 2, ..., K. For each part, the time taken is O(log M) using M processors (see e.g., [7, 8]). Therefore, the DFTs of all the parts can be computed in O(log M) time using n processors.

We have to estimate the third order spectrum of each part in Step 4. Specifically, we have to compute C_X^i(k_1, k_2), for 1 ≤ i ≤ K and 0 ≤ k_2 ≤ k_1, k_1 + k_2 < M/2. This can be done in O(1) time using O(nM) processors. Equivalently, Step 4 can also be done in O(log n) time using nM/log n EREW PRAM processors (using the slow-down lemma).

Step 5 is concerned with the smoothing operation. The value of the bispectrum at any point is computed as an average over a surrounding window of size M_3 × M_3. This step can be performed using Algorithm WS PRAM (c.f. Theorem 2). For each part, this step can be completed in O(log M) time using M^2/log M processors. For all the K parts together, Step 5 takes O(log M) time using nM/log M processors. Using the slow-down lemma, Step 5 can be completed in O(log n) time employing nM/log n processors.

In Step 6, the bispectrum is computed as the average over all parts.
In particular, we have to compute Ĉ(w_1, w_2) = (1/K) Σ_{i=1}^{K} C̃_X^i(w_1, w_2). For a given w_1 and w_2, Ĉ(w_1, w_2) can be computed using a prefix sums computation on K elements and hence can be done in O(log K) time using K/log K processors. Thus Step 6 can be completed in O(log K) time using M^2 · K/log K = nM/log K processors. The slow-down lemma implies that Step 6 can also be done in O(log n) time using nM/log n processors. In summary, we get the following theorem.

Theorem 3. We can compute the bispectrum on any sequence of length n in O(log n) time using nM/log n EREW PRAM processors, where M is the size used to partition the input sequence. □

The following theorems pertain to computing the bispectrum in a memory efficient manner. Proofs are omitted due to space constraints and will be supplied in the full version.

Theorem 4. We can solve the window sums problem on any n × n matrix in O(n log n) time using n/log n EREW PRAM processors using only O(nw) memory, w being the window size. □

Theorem 5. Bispectrum computation on any given sequence of length n can be performed in O(n log n) time using M/log n EREW PRAM processors and O(M · M_3) memory, where M is the size of each part and M_3 is the window size of smoothing (assuming that M = ω(log n)). □

Theorem 6. Window sums on any n × n data can be computed in O((n^2/w^2) log n) time using w^2/log n EREW PRAM processors and O(w^2) memory, w being the window size. □

Theorem 7. Window sums on any n × n data can be computed in O(n^2 log n) time using w/log n EREW PRAM processors and O(w) memory, w being the window size. □

Note that the work done in the above algorithm is O(n^2 w) and hence the algorithm is not work optimal. However, the memory used is very small. Theorems 2, 4, 6, and 7 consider memories of different specific sizes. Theorems 2 and 6 can be used to develop work optimal algorithms when the memory available is m for any w^2 ≤ m ≤ n^2.
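To illustrate the flavor of the memory-efficient variants above, here is a sketch that streams the matrix row by row and keeps only the current band of w rows, so the working memory is O(nw) rather than O(n^2); the generator-based interface and function name are our own framing, not the paper's algorithm:

```python
from collections import deque

def window_sums_stream(rows, w):
    """Yields one row of w-by-w window sums per incoming row, once the
    first w rows have arrived; retains only the last w input rows."""
    band = deque()          # the last w input rows
    col = None              # col[j] = vertical sum of the band in column j
    for row in rows:
        band.append(row)
        col = [c + x for c, x in zip(col, row)] if col else list(row)
        if len(band) > w:
            old = band.popleft()                      # evict the oldest row
            col = [c - x for c, x in zip(col, old)]   # and its contribution
        if len(band) == w:
            # 1D sliding-window pass over the column sums gives one output row.
            s = [sum(col[:w])]
            for j in range(1, len(col) - w + 1):
                s.append(s[j - 1] - col[j - 1] + col[j + w - 1])
            yield s
```

Because `rows` can be any iterator, the full matrix never has to reside in memory; each output row still costs only O(n) time.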
The following theorems consider the mesh model and higher order spectra, respectively. Proofs are omitted due to space constraints.

Theorem 8. Bispectrum computation of a sequence of length n can be performed on an n × n mesh in O(n) time. □

Theorem 9. On any input of size n, we can compute the k-th order spectrum in O(nM^{k−2}) time, for any k ≥ 3, where M is the size of each part in the input. □

We have conducted extensive experiments to evaluate the performance of our proposed approaches. In this section we report the results. All the experiments have been performed on a test server equipped with an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz, with 16 cores (hyper-threading to 32 threads), 256 GB main memory and a 4 TB HDD disk. All the algorithms have been implemented using C++ and the standard GCC compiler. The parallel version is implemented using OpenMP. We have used a value of 1 throughout. We have generated different types of time series data sequences for our experiments using guidelines given in [5, 9].

We have implemented algorithms for bispectrum and trispectrum computations. For both of them, we have compared 5 different approaches: the naive approach with O(M_3^2 · Mn) run time, denoted as Naive, where m below stands for M_3; our sequential algorithm that takes O(Mn) time (c.f. Theorem 1), called WS in keeping with the above sections; our fastest algorithm of Theorem 5, which does O(Mn) work and uses O(Mm) space, called Fast; the most efficient algorithm in both time and memory (O(Mn) time with very small memory), called Efficient; and the parallel approach (denoted as Parallel) with P threads, P = 2, 4, 8, 16.

We have set a run time threshold of 10 hours and a memory threshold of 100 GB. Any algorithm exceeding one or both of these thresholds was forced to stop. For the largest datasets, some of the algorithms exceeded these thresholds.
In Figure 1, we show the running times of different approaches for bispectrum and trispectrum, respectively. Note that this is a log plot; thus the curves differ by orders of magnitude. From this figure we can clearly see that, compared with the naive algorithm, all of our algorithms offer much better run times.

[Figure 1 plots running time (ms) against log(n) for the Naive, WS, Fast and Efficient approaches, for bispectrum and trispectrum computation.]

Fig. 1. Run time comparison for different values of n

We provide the maximum memory cost during the running time of each algorithm in Table 1. From this table we see that the memory cost in our experiments is consistent with our theoretical analyses. As shown, the Efficient approach is extremely memory efficient. For instance, for the longer time series tested, it requires less than 100 MB of memory. In contrast, even the second most memory efficient approach, Fast, uses around 1 GB, and the others occupy more than 30 GB. Efficient takes a longer time than Fast, which demonstrates the trade-off between space and time.

Table 1. Memory (in MB) comparison

Bispectrum:
n            Naive   WS        Fast     Efficient
(smaller)    NA      32890.0   344.3    37.9
(larger)     NA      NA        1055.8   86.3

Trispectrum:
n            Naive   WS        Fast     Efficient
(smaller)    NA      30224.3   1297.7   64.6
(larger)     NA      NA        7869.3   228.4

We have compared the running times of bispectrum computation by our Fast implementation and the HOSA Toolbox in MATLAB [17] for a single thread. The results from the two programs match exactly. Table 2 shows how fast our Fast implementation becomes as series lengths increase. We have computed the bispectrum for every pair of frequencies with a linear smoothing window using both of these implementations. In our experiments with the HOSA Toolbox, we supplied an overlap of 0.

Table 2.
Comparison of running times (in sec) of Fast and the HOSA Toolbox

series length   window length   Fast    HOSA Toolbox
128             21              0.001   0.010
256             33              0.005   0.042
512             49              0.011   0.220
1,024           77              0.032   1.755
2,048           117             0.126   23.342
4,096           181             0.448   329.961
8,192           279             1.751   3102.4

Next, we evaluate our proposed parallel algorithm. Since the Efficient approach has a significant advantage in memory, we have implemented the parallel version of Efficient to offer a fast and memory efficient approach to higher order spectra computations. We have tested the Parallel algorithm using P = 2, 4, 8, 16 threads over a wide range of values of n.

[Figure 2 plots parallel speedup against log(P) for bispectrum and trispectrum computation, for several values of n.]

Fig. 2. Speedups using 2, 4, 8, 16 cores

In Figure 2 we plot the parallel speedup against the number of cores P. As we can see, from 2 cores to 16 cores (log(P) = 1, 2, 3, 4), for smaller n values the gain from multiple cores is not as significant as for larger n values. This is due to the overhead introduced in multi-core implementations, such as processor scheduling and communication. For instance, in the case of the smallest n for bispectrum computation, our 2-core Efficient approach has a run time of around 1 ms; however, using 16 threads, Parallel still took 1 ms to finish. Thus for small datasets, the overhead of work scheduling becomes dominant. On the contrary, the computation time is still the dominant part for large datasets, e.g., more than 10 hours for the largest n using the sequential algorithm. This makes our parallel algorithm especially useful for large datasets.

From the memory point of view, the memory cost for the parallel implementation is linearly dependent on the number of cores. This is due to the fact that each processor is independently working on its own smoothing window.

We have evaluated four approaches, Naive, WS, Fast, and Efficient, respectively, as well as the Parallel algorithm.
Both bispectrum and trispectrum implementations have been tested on different lengths of time series data. All of our proposed algorithms outperform the Naive algorithm by orders of magnitude, in terms of both run time and memory. Please note that the naive algorithm is the best found in the literature. Fast is around 25× faster for the longest series tested, and Efficient uses only a small fraction of the memory. Although WS is simpler, Fast runs the fastest. This could be due to cache misses and memory access time costs, as WS occupies a significantly larger memory. Fast and Efficient display a memory-time trade-off. Efficient has a better balance and is extremely frugal in memory usage.

Parallel is a fast and memory saving algorithm, suitable for problems with very large n. A linear speedup can be achieved by Parallel on larger datasets. The memory cost for Parallel is also linearly dependent on P. For large n, Parallel's performance is better than for small n. This is due to the overhead of the parallel implementation, making the parallel approach more preferred for large datasets. Large datasets are quite relevant in today's world of big data.

In this paper we offer efficient sequential and parallel algorithms for computing HOS. The work done by our algorithms is asymptotically better than that of any known algorithm for HOS. For our proposed sequential approach, the run time is reduced to O(Mn). Another crucial problem in computing HOS is the need for large memories. To address this problem, we offer memory efficient algorithms. We have also presented work optimal parallel algorithms. Experimental results reveal that our algorithms are indeed highly competitive and especially suitable for long sequence data problems.

Acknowledgement. This work has been partly supported by NSF Grants 1447711 & 1743418 to SR.

References

1. U.R. Acharya, E.Y.K. Ng, S.V. Sree, C.K. Chua, and S. Chattopadhyay, Higher order spectra analysis of breast thermograms for the automated identification of breast cancer, Expert Systems, 31(1), 2014, pp. 37-47.
2. U.R. Acharya, V.K. Sudarshan, J.E.W. Koh, R.J. Martis, J.H. Tan, S.L. Oh, A. Muhammad, Y. Hagiwara, M.R.K. Mookiah, K.P. Chua and C.K. Chua, Application of higher-order spectra for the characterization of coronary artery disease using electrocardiogram signals, Biomedical Signal Processing and Control.
3. Journal of Real-Time Image Processing (2016): 1-11.
4. S. Hajmohammadi, S. Nooshabadi and J.P. Bos, Massive parallel processing of image reconstruction from bispectrum through turbulence, Applied Optics, 54.32 (2015): 9370-9378.
5. J.L. Harvill, N. Ravishanker, and B.K. Ray, Bispectral-based methods for clustering time series, Computational Statistics & Data Analysis, 64 (2013): 113-131.
6. M.J. Hinich, Testing for Gaussianity and linearity of a stationary time series, Journal of Time Series Analysis, 3 (1982): 169-176.
7. E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, Silicon Press, 1998.
8. J. Ja Ja, An Introduction to Parallel Algorithms, Addison-Wesley Professional, 1992.
9. D.A. Jones, Nonlinear autoregressive processes, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 360, No. 1700, The Royal Society, 1978.
10. J.N. Kalamatianos and E.S. Manolakos, Parallel computation of higher order moments on the MasPar-1 machine, Proc. International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, 1995.
11. K.N. Le, G.K. Egan and K.P. Dabke, Parallel computation of the bispectrum, Proc. Fifth International Symposium on Signal Processing and its Applications, ISSPA '99, Brisbane, Australia, 22-25 August, 1999.
12. E.S. Manolakos, H.M. Stellakis, and D.H. Brooks, Parallel processing for biomedical signal processing, Computer, Vol. 24, Issue 3, March 1991.
13. C.L. Nikias and A.P. Petropulu, Higher-Order Spectra Analysis: A Nonlinear Signal Processing Framework, New Jersey: Prentice-Hall, 1993.
14. M. Sanaullah, A review of higher order statistics and spectra in communication systems, Global Journal of Science Frontier Research, Physics and Space Science.
15. Structural Health Monitoring, 6.4 (2007): 325-334.
16. T. Subba Rao and M.M. Gabr, An Introduction to Bispectral Analysis and Bilinear Time Series Models, Lecture Notes in Statistics, Springer-Verlag, 1984.
17. A. Swami, J.M. Mendel and C.L. Nikias, Higher-order spectral analysis toolbox: for use with MATLAB, The MathWorks and United Signals & Systems, Inc., 2001. Retrieved from .
18. J. Zou, Y. An and H. Yan, Volatility matrix inference in high-frequency finance with regularization and efficient computations.