Optimization meets Big Data: A survey
Ricardo Di Pasquale
Facultad de Ingeniería y Ciencias Agrarias, UCA (Pontificia Universidad Católica Argentina), Buenos Aires, [email protected]
Javier Marenco
Instituto de Ciencias, UNGS (Universidad Nacional de Gral. Sarmiento), Buenos Aires, [email protected]
Abstract — This paper reviews recent advances in big data optimization, providing the state of the art of this emerging field. The main focus of this review is on optimization techniques being applied in big data analysis environments. Implementations of integer linear programming, coordinate descent methods, the alternating direction method of multipliers, simulation optimization, and metaheuristics such as evolutionary and genetic algorithms, particle swarm optimization, differential evolution, and the fireworks, bat, firefly, and cuckoo search algorithms are reviewed and discussed. The relation between big data optimization and software engineering topics such as information work-flow styles, software architectures, and software frameworks is discussed. A comparative analysis of the platforms being used in big data optimization environments is provided in order to present the state of the art of possible architectures and topologies.
Keywords – Big Data; Optimization; Metaheuristics
I. INTRODUCTION
Optimization is a field at the intersection of mathematics, computer science, and operations research, which studies methods to find the best solution(s), according to a criterion usually given by an objective function, among a set of alternatives bound by a set of constraints.

The big data concept is related to the theory and technologies that allow the processing of big volumes of data which, for size or complexity reasons, cannot be processed with traditional tools. The prefix "big" signals more than an evolution; it is a paradigm shift: classical data analytics cannot deal with the big data 5Vs [1]-[2] (volume, velocity, variety, veracity, and value). Data science is considered an evolutionary extension of statistics [3] with the added capacity of dealing with massive amounts of data; it can be seen as a fusion between computer science and statistics [1].

Big data analytics (BDA) is an evolution of data analytics (and data mining) techniques and algorithms [4]. A comprehensive definition of big data analytics is given in [5][6]: "Big data analytics is the area of research focused on collecting, examining, and processing large multi-modal and multi-source data sets in order to discover patterns, correlations and extract information from data".

Although data mining and optimization are different fields of study, they have many points in common; e.g., a classification problem can be considered an optimization problem where the goal is to maximize the classification accuracy and minimize the complexity under certain constraints [7]. In this vein, machine learning (ML) and optimization can be considered as two faces of the same coin [2]. Knowledge discovery in databases (KDD) problems can also be treated as optimization problems, e.g., [8]-[11].

What happens when optimization meets big data? Is there a new field of study at this intersection?
If optimization algorithms could be applied to big data with no additional effort or study, then there would be no need for a new field. Likewise, a new field of study would not be necessary if none of the following elements had any effect on optimization results: new data types, new data ingest patterns, streaming, data complexity, or data size. How optimization interacts with big data is discussed in this paper.

Several works in the literature, such as [12]-[17], have contributed to considering Big Data Optimization (BDO) as a new field. BDO arises where new optimization algorithms (or scalable versions of classical algorithms) and techniques must be developed to be applied in combination with Big Data Analytics (BDA). In this way, BDO arises from a cross-fertilization [18] between statistics, optimization, and applied mathematics.

BDA aids BDO in dealing with data analysis. BDO employs all the tools used in classical optimization, but its most distinctive feature is the development of distributed and scalable implementations of classical optimization algorithms, as well as the development of new methods and algorithms to deal with big data issues [15].

The remainder of the paper is organized as follows. Section II reviews previous work on exact methods, including integer linear programming, classical optimization solvers, and coordinate descent methods, introducing Hadoop MapReduce as one of the main big data platforms. Section III focuses on the use of metaheuristics, simulation, and iterative methods in BDO. Section IV discusses the role of software engineering in BDO, as well as state-of-the-art BDO information work-flow styles, software architectures, and frameworks. Section V reviews the main platforms, while conclusions and future trends are drawn in Section VI.

II. BIG DATA OPTIMIZATION IN EXACT METHODS
The first open source framework for the implementation of distributed file systems and big data processing via the map reduce (MR) framework [19] was Apache Hadoop [20]. The work in [21], integrating Integer Linear Programming (ILP), notes the difficulty of efficiently solving ILP in this context, and literally asks "can we find an easy way to handle the massive computations without involving sophisticated mathematical logic?"
The rest of [21] sets out to answer this question by stating that the Hadoop framework is suitable for running these kinds of solutions. The solution in [21] applies a dual decomposition method to separate the variables of an ILP problem into a set of smaller ILP dual subproblems (i.e., of a smaller dimension), combined with the observation that iterative updates of the Lagrangian multipliers can fit into Hadoop's MR model. The authors implemented an air traffic flow optimization problem of the USA National Airspace System (NAS) to illustrate their statements. The IBM Ilog CPLEX Java API was used to integrate with a Hadoop MR cluster: distributed processing was handled by Hadoop, and parallel local processing (the ILP subproblems) was handled by the CPLEX solver. The numeric results are very interesting and demonstrate scalability, an appreciated distributed processing property. Although it is an interesting, well implemented, and well documented solution, a few observations should be kept in mind: (a) Hadoop MR is well suited to processing big amounts of data in a streaming way, which is at odds with the iterative nature of the process; if every decomposed subproblem could not be solved in only one iteration, the overhead of the framework could be prohibitive. And (b), the classical Hadoop MR approach is to "send" algorithms where the data is allocated; if the subproblems cannot be served by the data allocated in their node, then the network impact may also be severe.
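As an illustration, the decomposition pattern of [21] — independent subproblems solved in a map phase, with a reduce phase aggregating their outputs and updating the Lagrangian multiplier by a subgradient step — can be sketched on a toy separable problem. All problem data below is invented for illustration; the actual system in [21] solves ILP subproblems with CPLEX rather than the closed-form continuous subproblems used here.

```python
def solve_subproblem(c_i, lam, upper):
    """'Map' step: min_x x^2 + (c_i + lam) * x over [0, upper] has a
    closed-form solution, so each subproblem can run on its own node."""
    return max(0.0, min(upper, -(c_i + lam) / 2.0))

def dual_decomposition(c, b, upper=2.0, step=0.3, iters=200):
    """Toy Lagrangian dual decomposition for
    min sum_i (x_i^2 + c_i x_i)  s.t.  sum_i x_i = b,  0 <= x_i <= upper."""
    lam = 0.0
    for _ in range(iters):
        # "map": solve all subproblems independently for the current multiplier
        x = [solve_subproblem(c_i, lam, upper) for c_i in c]
        # "reduce": aggregate and update the multiplier (subgradient ascent)
        lam += step * (sum(x) - b)
    return x, lam

x, lam = dual_decomposition([1.0, 2.0, 3.0], b=2.0)
```

The coupling constraint is the only thing tying the subproblems together, which is why one multiplier update per MR round suffices; the observation in (a) above applies here too, since each round of this loop would be a full MR job.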
Another approach to implementing distributed optimization is [22], whose objective is to introduce a framework for distributed optimization with arbitrary local solvers. This work extends the CoCoA framework [23] with a new version called CoCoA+. In [22] it is stated that traditional optimization solvers (the ones working on only one node) have been developed and improved over a long time by the software industry and, in consequence, nowadays perform better than the new distributed solvers. In contrast with [21], [22] considers Hadoop MR too slow a hard disk-based work-flow system. CoCoA+ is implemented in C++/MPI. Numeric results show that CoCoA+ converges faster than other algorithms like DiSCO [24], but tends to be slower after some iterations. Performance analyses comparing local solvers are also shown in [22]. Some observations follow: (a) data distribution needs some prerequisites in order to run under MPI, and fault tolerance needs to be implemented manually in C++/MPI. And (b), the observation about data availability in each node, made above about [21], is also valid for [22].

Coordinate descent methods (CDM) are iterative optimization algorithms that deal with an optimization problem by solving a sequence of lower dimensional optimization problems, and are considered among the most successful algorithms in BDA and BDO environments [4]. Essentially, the algorithm modifies only one coordinate of the variables vector in each iteration. There are parallel and distributed implementations of these algorithms. In the work proposed in [4], the LASSO algorithm was implemented with CDM, showing very interesting numeric results and a near linear speedup. The experiments were executed on 24 cores; regrettably, the work does not include multinode distributed executions of the algorithm.

III. BIG DATA OPTIMIZATION IN HEURISTIC METHODS
After Hadoop MR became the main big data processing platform, the main problems detected were: (a) lack of iterative nature: Hadoop MR does not admit an iterative style of programming, and implementing such an algorithm implies overhead that limits development; there were many documented efforts to work around this non-iterative nature, such as [25] and [26]. And (b), the limited expressive power of the map reduce style. There were efforts to make developers "think in map reduce", such as [27], where graph algorithms are implemented (and extensively explained) in a pure map reduce manner; however, in general, state cannot simply be shared between mappers, and it is not simple to convert algorithms to this format.

When Apache Spark was introduced, it rapidly turned into the evolution of the Hadoop MR framework. It is built upon the implementation of Resilient Distributed Datasets (RDD), as introduced in [28], to build in-memory clusters. This kind of structure takes advantage of all the benefits of Hadoop while also bringing an iterative nature to the process and expanding the framework's expressive power. Most of the optimization work discussed below needs support for iterative implementations, so the emergence of Spark as a standard for big data processing was great news for BDO.
A. Alternating direction method of multipliers (ADMM)
Originally developed in 1974 [29], this method was introduced to solve convex optimization problems by dividing them into smaller pieces. Although ADMM can be considered an exact method, its nature allows implementations (modifications) well suited for delivering approximate results, an appreciated feature in big data environments.

The implementation of [15] aims to apply ADMM-based algorithms to the optimization of communications in smart grids, particularly power flow systems with security constraints. It introduces the ADMM technique in two blocks, and an extension to n blocks. The authors affirm that ADMM has a parallel nature that can be implemented with Spark, excluding Hadoop MR because of its problems with iterative procedures. The authors reference a previous publication [20] where they proposed a distributed parallel approach for big data scale optimal power flow with security constraints. The goal of [15] is to optimize the economics of the power flow of electric energy dispatch, introducing the evaluation of security constraints or risk analysis scenarios. This implementation takes the classical economic dispatch problem and introduces some previously elaborated contingency scenarios as constraints. The implementation is based on an ADMM distributed process in such a way that each node can solve a different subproblem in parallel. The numerical results showed that after a few iterations the output of the proposed algorithm was near to optimal solutions, which means that the approach is able to approximate good solutions rapidly. The authors used a standard benchmark (called IEEE 57 bus), which shows a speedup factor within the interval (1.4, 2.4), and they expect to raise the speedup factor to within the interval (4.4, 4.8) by improving communications.

B. Metaheuristic algorithms

1) Evolutionary algorithms (EA):
EA, including genetic algorithms (GA), is a field that can naturally make use of distributed and parallel computing. The work described in [1] focuses on GA and swarm intelligence. The authors' vision with respect to the simultaneous application of BDA and EA should be considered dual: apply EA on big data, or apply big data techniques in order to improve EA.

The authors of [1] emphasize the difference in the methods used to generate new solutions in metaheuristics, recognizing two main categories: instance based search (IBS) and model based search (MBS). In IBS, new candidate solutions are generated using only the current solution; simulated annealing and iterated local search belong to the IBS category. In MBS, candidate solutions are generated using an (explicit or implicit) model that employs information from previous candidate solutions, making the search focus on high quality solution regions; ant colony optimization and estimation of distribution algorithms belong to the MBS category. Finally, [1] establishes key challenges of EA in order to solve BDA problems:

- BDA requires rapid data mining on big volumes of data. It reinforces the idea of characterizing data mining problems as optimization problems.

- The complexity of data is also relevant, given that high-dimensional data is not the same as a big volume of data. Performance in optimization problems tends to decrease exponentially with the number of variables or goals, as the search space grows (also exponentially).

- Dynamic problem management: the nature of data in the real world is dynamic; it changes rapidly in short periods of time. Sometimes these problems can be considered non-stationary environment [30] or uncertain environment [31] problems; in both cases EA have been applied successfully [32] [33].
- Multi-objective optimization: EA are particularly good at this kind of problem, allowing a Pareto optimal set to be found in only one run [34].

A big data perspective on EA is given in [35], where it is stated that, in order to explore a big data store (and also high-dimensional data), some kind of data analysis is needed to guide genetic operations. The use of informed genetic operators (IGOs) is illustrated in order to achieve good performance in big data EA (or GA). IGOs usually use a meta-model (or reduced model) that is built on top of data analysis. In [35], the use of IGOs is aligned with the idea of expanding the search to unexplored regions. These kinds of regions are discovered by allowing less fit individuals to develop, which implies that the flow of the GA (or EA) must not kill those individuals prematurely. To achieve this goal, populations are divided into tiers of subpopulations, built as dynamic niches classified by fitness ranks, where promising sub-optimal solutions can survive. The implementation in [35] is very similar to the multiple-deme GA approach in [36], in particular the idea of population islands and the migration process. The algorithm avoids local convergence by identifying it in the algorithm work-flow and replacing redundant individuals (by a minor fitness criterion) with more promising candidate solutions from unexplored areas of the solution space. The IGO operation in [35] is a mutation.
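A minimal sketch of the multiple-deme idea referenced above — independent island populations with periodic ring migration of each island's best individual — assuming a toy one-max objective and invented parameters throughout. This is not the IGO-based implementation of [35]; it only illustrates the island/migration skeleton that [35] and [36] share.

```python
import random

GENOME, POP, ISLANDS, GENS, MIGRATE_EVERY = 32, 20, 4, 60, 10

def fitness(ind):
    # toy objective ("one-max"): count of 1-bits; a real GA would plug in
    # its own evaluation here
    return sum(ind)

def evolve(pop):
    """One generation: tournament selection, one-point crossover, bit-flip mutation."""
    new = []
    for _ in range(len(pop)):
        a = max(random.sample(pop, 3), key=fitness)
        b = max(random.sample(pop, 3), key=fitness)
        cut = random.randrange(1, GENOME)
        child = a[:cut] + b[cut:]
        child = [g ^ (random.random() < 0.01) for g in child]  # 1% bit-flip
        new.append(child)
    return new

random.seed(0)
islands = [[[random.randint(0, 1) for _ in range(GENOME)] for _ in range(POP)]
           for _ in range(ISLANDS)]
for gen in range(1, GENS + 1):
    islands = [evolve(pop) for pop in islands]   # islands evolve independently
    if gen % MIGRATE_EVERY == 0:                 # ring migration: best replaces worst
        bests = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            pop[pop.index(min(pop, key=fitness))] = bests[i - 1]

best = max((ind for pop in islands for ind in pop), key=fitness)
```

In a distributed setting each island maps naturally onto one node, with migration as the only inter-node communication, which is what makes this scheme attractive for big data platforms.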
2) Parallel Particle Swarm Optimization:
A PSO approach based on Hadoop MR was used in [37] in order to optimize a back propagation (BP) neural network. The goal of this work is to improve the precision and performance of the classification. The implementation, based on Hadoop MR, makes use of a classical PSO algorithm with a parallel design. The authors highlight how optimization techniques (algorithms and metaheuristics like PSO and GA) have been used successfully to adjust neural network weights. Besides, this work values research focused on developing parallel or distributed designs for classical algorithms, e.g., [38], where an MPI implementation of a BP neural network algorithm on a supercomputer was developed. Some of these works rely on GPU technology, which requires detailed knowledge of the hardware where the solution runs, highlighting Hadoop MR as a hardware-independent solution. Some works in Hadoop MR are referenced in [37], like [39], where an MR model was used to design a density-based clustering algorithm with good experimental results.

One of the most important problems in migrating BP algorithms to big data platforms referenced in [37] is how inadequately traditional attribute reduction (fundamental in rough set theory) performs; to deal with it, the authors reference [40], where hierarchical attribute reduction algorithms for big data using MR were proposed.
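A minimal sketch of the classical global-best PSO that [37] parallelizes; the per-particle fitness evaluations are the natural unit to farm out to mappers in an MR port. The sphere objective and all parameters below are invented for illustration, not taken from [37].

```python
import random

def pso(f, dim=5, swarm=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Classical global-best PSO minimizing f over R^dim.
    In a MapReduce design, the f(pos[i]) evaluations are distributed."""
    rnd = random.Random(seed)
    pos = [[rnd.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]              # personal bests
    gbest = min(pbest, key=f)                # global best
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                # inertia + cognitive pull (pbest) + social pull (gbest)
                vel[i][d] = (w * vel[i][d]
                             + c1 * rnd.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rnd.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=f)
    return gbest

sphere = lambda x: sum(v * v for v in x)
best = pso(sphere)
```

When PSO tunes BP network weights as in [37], `f` becomes the network's classification error on (a sample of) the training data, which is exactly the expensive, data-parallel part.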
C. Simulation optimization (SO)
SO studies how to find optimal solutions for systems represented by computer simulation models. It has some distinctive characteristics [42]: objective values can only be estimated, with a certain noise level and an important computational cost, given that each objective function evaluation requires a simulation run. SO has been used in several environments, given the complex nature of models of real stochastic systems that cannot be processed with other techniques.

The authors emphasize the advantages of cloud computing for SO, remarking on the possibilities brought by parallelism and distributed processing. They identify two types of challenges presented by big data adoption in SO: large data, and the quantity of existing data sources together with the different styles of information processing. The authors suggest adopting "multi-fidelity optimization with ordinal transformation and optimal sampling" (MO2TOS) [43], a framework for SO that also covers data processing.

IV. BIG DATA OPTIMIZATION AND SOFTWARE ENGINEERING
As stated in [18], BDO emphasizes the idea of a cross-fertilization between statistics, applied mathematics, and optimization, but this seems to be insufficient: BDA and software engineering (SE) also need to be heavily considered when implementing algorithms in this field. As seen in [13], [14], and [17], researchers had to build software frameworks to deal with BDO. It seems evident that the complexity of big data affects more than the "data" dimension of BDO solutions. Usually, optimization researchers do not need to deal with software frameworks, but BDO researchers cannot ignore this need. Something similar happens with the "process" dimension: optimization researchers just needed the data in order to use it as input for the optimization model, but BDO interacts with processes and data in a different, much more integrated way. Hence, the big data paradigm shift in optimization implies dealing not only with big databases or high-dimensional data, but also with work-flows and software frameworks.

As seen in [2], it is good practice to tackle a complex system by dividing it into less complex parts, in order to optimize them sequentially. With BDO, optimization problems need to be tackled in an integrated way. In particular, software architecture turns out to be a relevant issue in this cross-fertilization schema, in order to develop better BDO solutions and to be able to apply optimization to new types of data such as, e.g., stream data [13][44]. To the best of our knowledge, there are no specific publications on BDO SE, but there are several interesting works on big data SE, such as [45]-[48]. In BDA SE, the conference paper [47] appeared at the 2017 IEEE CCWC, and [48] at BIGDSE.

The "process" dimension, usually studied in SE, adds value to BDO by specifying the information work-flow style to be used in a particular solution. There are no studies of information work-flow styles in BDO, but the information extracted from the study of the bibliography suggests the following classification.
A. A classification of BDO information work-flow styles

1) Classical cascade work-flow
This is the simplest work-flow to implement. It starts with the BDA phase, where the data sources are processed in order to build the optimization model input. Then, the optimization model (e.g., LP) produces results or solutions. The particle swarm algorithm implementation shown in [37] follows this type of work-flow, as does the airline route profitability analysis and optimization in [41]. The diagram in Fig. 1 represents the classical cascade work-flow.
Fig. 1. Classical cascade work-flow
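A minimal sketch of the cascade style, assuming invented toy data: a BDA phase aggregates raw records into model coefficients, and a separate optimization phase (here a small 0/1 knapsack solved by dynamic programming, a stand-in for any optimization model) consumes them once. The two phases never feed back into each other, which is what defines this style.

```python
from collections import defaultdict

# --- BDA phase: reduce raw records to model inputs (toy, invented data) ---
events = [("ads", 40, 3), ("email", 25, 1), ("ads", 35, 2),
          ("social", 30, 2), ("email", 20, 1)]      # (channel, value, cost)
value, cost = defaultdict(int), defaultdict(int)
for channel, v, c in events:
    value[channel] += v
    cost[channel] += c

# --- Optimization phase: 0/1 knapsack over the aggregated channels ---
def knapsack(items, budget):
    """items: list of (name, value, cost); returns (best_value, chosen names)."""
    best = {0: (0, [])}                    # spent cost -> (total value, picks)
    for name, v, c in items:
        for spent, (tot, chosen) in sorted(best.items()):
            s = spent + c
            if s <= budget and (s not in best or best[s][0] < tot + v):
                best[s] = (tot + v, chosen + [name])
    return max(best.values())

items = [(ch, value[ch], cost[ch]) for ch in value]
best_value, plan = knapsack(items, budget=5)
```

The key property is that the BDA output (`items`) is frozen before the optimization model runs; an iterative or feedback style would instead loop back into the aggregation step.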
2) Iterative cascade work-flow
This style of work-flow is similar to the classical cascade work-flow. Iteration may happen when (a) a refinement is required in the data process, or (b) stream processing refines the data process because of its real time nature. More refined results are output from the optimization process at each iteration. The simulation methods for optimization [42] employ this style of work-flow. The diagram in Fig. 2 represents this style.
Fig. 2. Iterative cascade work-flow

Fig. 3. Feedback model work-flow
3) Feedback model work-flow
This style of work-flow is designed under the premise that the optimization model is the main process of the solution, implementing a producer-consumer pattern where BDA is the producer and the optimization model is the consumer. The fireworks algorithm framework in [14] works under the premise of an information work-flow similar to the feedback model work-flow, as do the works in [49] and [50]. The diagram in Fig. 3 illustrates the feedback model work-flow.
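The producer-consumer premise of this style can be sketched with two threads and a queue: a hypothetical BDA producer pushes analysed batches, and the optimizer refines its incumbent on each one. The batches and the "re-solve" (a plain `min`) are invented stand-ins for real analytics output and a real optimization step.

```python
import queue
import threading

updates = queue.Queue()

def bda_producer(batches):
    """BDA side: pushes freshly analysed data (invented cost updates)."""
    for batch in batches:
        updates.put(batch)
    updates.put(None)                       # sentinel: no more data

def optimizer_consumer(results):
    """Optimization side: refines the incumbent each time new data arrives."""
    incumbent = float("inf")
    while True:
        batch = updates.get()
        if batch is None:
            break
        incumbent = min(incumbent, min(batch))  # stand-in for a real re-solve
        results.append(incumbent)

results = []
producer = threading.Thread(target=bda_producer,
                            args=([[9.0, 7.5], [8.0, 6.1], [6.5, 7.0]],))
consumer = threading.Thread(target=optimizer_consumer, args=(results,))
consumer.start()
producer.start()
producer.join()
consumer.join()
```

The queue decouples the two rates: the producer can keep analysing while the consumer is mid-solve, which is the practical appeal of this style for streaming data.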
4) Integrated model
This model is not a work-flow at all, because both processes are conceived as fully integrated with each other. In this model, the feedback between optimization and BDA is continuous, and the results are conceived as dynamic. When real time streaming is the main data source and the model must solve in near real time, this work-flow model should be employed, as it should when a "best solution possible" approach is needed in a real time context. The diagram in Fig. 4 illustrates this model.
B. Software architecture (SA)
Developments in SA like the Lambda Architecture [51] allow BDO to implement solutions with sophisticated on-line BDA in a near real time manner. In this kind of architecture, the "best result right now" approach is preferred: usually the dynamic nature of BDO problems would make results obsolete in short periods of time. Although there are many criticisms of the Lambda Architecture, such as [52], other software frameworks providing similar solutions are still developing, like Apache Samza and Apache Beam. Researchers in SA are concerned with how cloud computing can serve as a platform for enterprise integrated big data solutions; an interesting road map is documented in [48].
Fig. 4. Integrated model
C. Software frameworks (SF)
SF are usually studied in SA, as part of SE. The value added by software frameworks to BDO is that they allow optimization researchers to focus on "optimization" itself, while the framework deals with data and process distribution and parallelization, as well as domain model design. SF should be studied as high level conceptual frameworks.

One of the main ideas behind SF is the reuse and extension of software and concepts. The authors of [17] extend jMetal, an existing multi-objective optimization framework implemented in Java [53]. This extension enables the possibility of running distributed metaheuristics on Spark, and also manages streaming Spark data sources. In order to validate the new extension, the authors implemented a bi-objective TSP [54] with real traffic flow data. They chose a GA called NSGA-II (Non-dominated Sorting Genetic Algorithm-II), extending the functionality provided by jMetal in order to dynamically adjust several execution parameters.

In Section III we mentioned that [14] presents an FWA framework. This framework groups several implementations of FWA, and defines some domain entities suitable for extending the framework by adding another FWA. There is not enough documentation in [14] to evaluate this SF from an SA perspective; judging its utility by studying [14], it is a reasonably good toolkit for FWA.

Something similar occurs with [13]. Also introduced in Section III, it presents an automated differential evolution SF (ADEF). Its main objective is to encapsulate the logic of DE algorithms and variants (e.g., multi-objective), but it also automatically configures the best set of operators to use, as no single DE operator is considered the best for solving all types of optimization problems. There is not enough information in [13] to evaluate the SF from an SE perspective.

Mentioned in Section II, CoCoA+ was developed in [22] as an extension of the CoCoA framework in C++/MPI, with the objective of managing the distributed processing of arbitrary local solvers running on a cluster.
Observations from the SE perspective were presented in Section II: too much work is needed to efficiently distribute data, manage processes, and synchronize state in MPI with arbitrary tools running locally all over the cluster.

Mentioned in Section II, MO2TOS was presented in [43] and used in [42] to run SO. This SF has two main objectives: (1) support SO modeling and (2) support BDA data processing activities. The SF consists of two methodologies: ordinal transformation (OT, low fidelity) and optimal sampling (OS, high fidelity). No details about its implementation are provided in the bibliography, so it is impossible to find out its internal architectural characteristics.

V. BIG DATA OPTIMIZATION PLATFORMS
The focus in this section is to review BDA platforms useful in the BDO context.

The Hadoop file system (HDFS) is one of the most important standards for distributed file systems for big data purposes. A great early and extensive work about HDFS from Yahoo researchers is given in [55].

MPI is a portable library designed to provide synchronization and communication functionality to distributed processes. Its specification was standardized in 1993. It is well suited for supercomputer architectures and high performance network clusters; nowadays its main application is to process heavy CPU intensive solutions [56]. When used with commodity hardware, it can be combined with Beowulf clusters [57], a cluster architecture specification compatible with low cost clusters and open source software.

As mentioned in Section II, Hadoop MR was introduced as a simplified and highly scalable data processing platform [19] well suited for BDA processing, and many higher abstraction level tools are based on MR. Its main advantages in the big data context over previous technologies like MPI are discussed in [27]:

- The way the problem is divided into smaller parts in order to be executed on a distributed platform: Hadoop MR abstracts these processes.

- The way tasks are assigned to workers in a cluster: in MPI this must be explicitly defined by developers, while Hadoop MR assigns map or reduce tasks according to historical performance.

- In MPI, data must be allocated carefully in order to let the workers have all they need to process. Hadoop MR employs a different paradigm: algorithms travel through the network to where the data is allocated, not the other way around.

- The way synchronization between nodes is coordinated: MPI developers usually specify a master program; in Hadoop MR, a job and a task coordinator are implemented in the core of the platform.
- The way high availability (HA) and fault tolerance (FT) are achieved: MPI HA and FT have to be programmed by developers, or provided by the cluster, while Hadoop MR is designed to provide HA and FT.

MR disadvantages include the lack of iterative nature and a limited power of expression. Spark emerged as a solution for both. Based on RDD structures, Spark enables in-memory cluster processing [28]. It also includes several actions and transformations that facilitate BDA processing, as well as specific ML, streaming, and graph processing libraries.

Several works have compared distributed and big data platforms. Specifically for the BDA context, [6] is a comparative performance analysis of Spark on Hadoop against MPI/OpenMP on Beowulf clusters. The algorithms selected for performance benchmarking were KNN (K-Nearest Neighbors) and Pegasos SVM (Support Vector Machines). The data used for the benchmark were 7 GB of HIGGS data [58] with 11 million records and 28 dimensions. Ref. [6] highlights the difference in performance between Spark and Hadoop, placing Spark closer to the high performance computing (HPC) paradigm. In CPU terms, the numeric results showed superior performance for Beowulf but, in a qualitative analysis, Spark has advantages in FT, replication management, hot node swapping, and simplicity. Several observations about the results shown in [6] must be considered: (a) Spark's own KNN and SVM algorithms were not used; MLlib provides such implementations, and they should outperform a simple MR implementation. (b) An advantage of Spark/Hadoop in processing large data is shown, but big data platforms were designed to process many terabytes of data, even petabytes; 7 GB should not be considered enough data to compare big data platforms with HPC platforms, as this may bias the results.

Near real time BDA is an emerging field related to streaming. In this kind of solution, analysis should not be based on batch patterns.
As stated in [2] and [44], there are many cases where BDO must be applied directly to streaming data. A comparative analysis of streaming BDA performance is made in [44], where Spark Streaming capabilities were compared with Apache AsterixDB. AsterixDB is a scalable, open source big data management system that had not yet reached a version 1 release, but has many BDA-relevant features to be considered. In [44], the Spark/Cassandra stack was compared with AsterixDB in a social network data streaming solution. The algorithms used in [44] were "word count" and a sentiment analysis algorithm. The results in [44] show that AsterixDB reached a better throughput and lower latency, although AsterixDB suffered some performance issues when the data batch segment size was increased. The author remarked that the work was not accomplished by Spark or AsterixDB experts, and was not implemented in a functional language like Scala. Several discussion issues were documented after publication; e.g., the Spark/Cassandra version stores streaming data before processing, which generates overhead, and a simple workaround or design improvement would increase Spark/Cassandra throughput.

VI. CONCLUSIONS AND FUTURE TRENDS
BDO has emerged as a new field of study in recent years. BDO emerges from a cross-fertilization between optimization, applied mathematics, statistics, BDA, and SE. Classical HPC platforms are not always suitable for running BDO software; specific BDA platforms like Spark on Hadoop have demonstrated a better fit for this task. Several optimization algorithms have been successfully rewritten in order to be applied in BDO environments; other algorithms have been developed specifically to match BDO requirements.

A new classification of information work-flow styles for BDO is proposed in this work, inspired by the review of the bibliography and previous studies.

Future trends include new comparative studies and benchmarks on specific BDO problems with adequate amounts of data; new applications to classical optimization problems as well as new industrial applications; and studies on the implications of real time streaming BDO and the paradigm shift of continuous real time BDO.

REFERENCES

[1] S. Cheng, B. Liu, Y. Shi, Y. Jin, and B. Li, "Evolutionary computation and big data: key challenges and future directions", in: Y. Tan and Y. Shi (eds.), Data Mining and Big Data, DMBD 2016, Lecture Notes in Computer Science, vol. 9714, Springer, Cham, pp. 3-14, 2016. DOI: 10.1007/978-3-319-40973-3_1.
[2] A. Lodi, "Big data and mixed-integer (nonlinear) programming", Data Science for Whole Energy Systems, Alan Turing Institute scoping workshop, 28-29 Jan 2016, Edinburgh.
[3] D. Cielen, A.D.B. Meysmann, and M. Ali, "Introducing data science: big data, machine learning, and more, using Python tools", Manning Publications Co., 2016, ISBN 9781633430037.
[4] C. Tsai, C. Lai, H. Chao, and A. Vasilakos, "Big data analytics: a survey", Journal of Big Data, 2:21, 2015. DOI: 10.1186/s40537-015-0030-3.
[5] Y. Zhai, Y. Ong, and I. Tsang, "The emerging big dimensionality", IEEE Computational Intelligence Magazine, vol. 9, no. 3, pp. 14-26, 2014.
[6] J.L. Reyes-Ortiz, L. Oneto, and D. Anguita, "Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf", Procedia Computer Science, vol. 53, pp. 121-130, 2015, INNS Conference on Big Data.
[7] W. Art Chaovalitwongse, C.A. Chou, Z. Liang, and S. Wang, "Applied optimization and data mining", Annals of Operations Research 249, pp. 1-3, 2017. DOI: 10.1007/s10479-017-2402-x.
[8] S. Ezhil and C. Vijayalakshmi, "An implementation of integer programming techniques in clustering algorithm", Indian Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 1, pp. 173-179, Feb-Mar 2012, ISSN: 0976-5166.
[9] G. Pavanelli, M.T.A. Steiner, A. Goes, A.M. Pavanelli, and D.M. Bertholdi Costa, "Extraction of classification rules in databases through metaheuristic procedures based on GRASP", Advanced Materials Research, vols. 945-949, pp. 3369-3375, Trans Tech Publications, 2014.
[10] V. Ramasamy, "Prioritization of association rules using multidimensional genetic algorithm", International Journal of Applied Engineering Research, vol. 10, no. 55, pp. 2288-2291, January 2015.
[11] W. Wu, L. Liu, and B. Xu, "Application research on data mining algorithm in intrusion detection system", Chemical Engineering Transactions, vol. 51, 2016, ISBN 978-88-95608-43-3, ISSN 2283-9216.
[12] A. Emrouznejad (ed.), "Big data optimization: recent developments and challenges", Springer, 2016, ISBN 978-3-319-30263-8. DOI: 10.1007/978-3-319-30265-2.
[13] S. Elsayed and R. Sarker, "Differential evolution framework for big data optimization", Memetic Computing, Springer, 2016. DOI: 10.1007/s12293-015-0174-x.
[14] M. El Majdouli, I. Rbouh, S. Bougrine, B. El Benani, and A. El Imrani, "Fireworks algorithm framework for big data optimization", Memetic Computing 8(4), pp. 333-347, Springer, 2016. DOI: 10.1007/s12293-016-0201-6.
[15] L. Liu and Z. Han, "Multi-block ADMM for big data optimization in smart grid", Systems and Control (cs.SY), 2015. DOI: 10.1109/ICCNC.2015.7069405.
[16] P. Richtarik and M. Takac, "Parallel coordinate descent methods for big data optimization", Mathematical Programming, 156(1), pp. 433-484, Springer, 2012.
[17] C. Barba-González et al., "A big data optimization framework based on jMetal and Spark" ("Un framework para big data optimization basado en jMetal y Spark"), The 2nd International Workshop on Machine learning, Optimization & big Data, MOD 2016, SIAF Learning Village, Tuscany, Volterra (Pisa), Italy, August 26-29, 2016.
[18] J. Fan, F. Han, and H. Liu, "Challenges of big data analysis", National Science Review, 1(2), pp. 293-314, 2014.
[19] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04 Proceedings of the 6th Symposium on Operating Systems Design & Implementation, vol. 6, pp. 10-22, 2004.
[20] F. Wang, J. Qiu, J. Yang, B. Dong, X. Li, and Y. Li, "Hadoop high availability through metadata replication", CloudDB '09 Proceedings of the 1st International Workshop on Cloud Data Management, pp. 37-44, 2009, ISBN: 978-1-60558-802-5. DOI: 10.1145/1651263.1651271.
[21] Y. Cao and D. Sun, "Large-scale and big optimization based on Hadoop", in A. Emrouznejad (ed.), "Big data optimization: recent developments and challenges", Springer, 2016, ISBN 978-3-319-30265-2.
[22] C. Ma et al., "Distributed optimization with arbitrary local solvers", arXiv:1512.04039, 2015.
[23] M. Jaggi et al., "Communication-efficient distributed dual coordinate ascent", in Advances in Neural Information Processing Systems 27, pp. 3068-3076, 2014.
[24] Y. Zhang and L. Xiao, "DiSCO: distributed optimization for self-concordant empirical loss", in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 362-370, 2015.
[25] C. Jin, C. Vecchiola, and R. Buyya, "MRPGA: an extension of MapReduce for parallelizing genetic algorithms", in eScience '08, IEEE 4th International Conference on eScience, ISBN 978-1-4244-3380-3. DOI: 10.1109/eScience.2008.78.
[26] F. Teng and D. Tuncay, "Genetic algorithms with MapReduce runtimes", Indiana University Bloomington, School of Informatics and Computing, unpublished.
[27] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce", Synthesis Lectures on Human Language Technologies, Morgan & Claypool, 2010, ISBN-13: 978-1608453429.
[28] M. Zaharia et al., "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing", 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15-28, San José, California, 2012, ISBN 978-931971-92-8.
[29] R. Glowinski, "On alternating direction methods of multipliers: a historical perspective", in W. Fitzgibbon, Y.A. Kuznetsov, P. Neittaanmäki, and O. Pironneau (eds.), Computational Methods in Applied Sciences, vol. 34, "Modeling, Simulation and Optimization for Science and Technology", pp. 59-82, Springer, 2014.
[30] R.W. Morrison and K.A. De Jong, "A test problem generator for non-stationary environments", in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 1999), vol. 3, pp. 2047-2053, July 1999.
[31] Y. Jin, J.
Branke, “Evolutionary optimization in uncertainenvironments - a survey”, IEEE Trans. Evol. Comput. 9(3), pp. 303–317, 2005.[32] S. Yang, C. Li, “A clustering particle swarm optimizer for locatingand tracking multiple optima in dynamic environments”, IEEE Trans.Evol. Comput. 14(6), pp. 959–974, 2010.[33] L.T. Bui, Z. Michalewicz, E. Parkinson, M.B. Abello, “Adaptation indynamic environments: a case study in mission planning”, IEEETrans. Evol. Comput. 16(2), pp. 190–209, 2012.34] C.A.C. Coello, G.B. Lamont, D.A.V. Veldhuizen, “Evolutionaryalgorithms for solving multi-objective problems” in Genetic andEvolutionary Computation Series, 2nd ed. Springer, New York, 2007.[35] M. Bhattacharya, R. Islam, and J. Abawajy, “Evolutionaryoptimization: a big data perspective” Journal of network andcomputer applications, vol. 59, pp. 416-426, doi:10.1016/j.jnca.2014.07.032.[36] E. Cantú-Paz, "A survey of parallel genetic algorithms", CalculateursParalleles, Vol. 10, 1998.[37] J. Cao, H. Cui, H. Shi, and L. Jiao, “Big data: a parallel particleswarm optimization-back-propagation neural network algorithmnased on mapreduce”, PLoS ONE 11(6): e0157551.doi:10.1371/journal.pone.0157551, 2016.[38] J.Q. Feng, W.D. Gu, J.S. Pan, H.J. Zhong, J.D. Huo, “Parallelimplementation of BP neural network for traffic prediction on sunwayblue light supercomputer” in Applied Mechanics & Materials, Nr.614, pp. 521–525, 2014.[39] Y. Kim, K. Shim, M.S. Kim, J.S. Lee, “DBCURE-MR: An efficientdensity-based clustering algorithm for large data using MapReduce”,Information Systems Nr. 42, pp. 15–35, 2013. DOI:10.1016/j.is.2013.11.002.[40] J. Qian, P. Lv, X.D. Yue, C.H. Liu, Z.J. Jing, “Hierarchical attributereduction algorithms for big data using MapReduce”, Knowledge-based Systems, Nr. 73, pp. 18–31, 2015, DOI:10.1016/j.knosys.2014.09.001.[41] E. Kasturi, S. Prasanna Devi, S. Vinu Kiran, and S. 
Manivannan,“Airline route profitability analysis and optimization using big dataanalytics on aviation data sets under heuristic techniques” - ProcediaComputer Science 87 (2016), pp. 86–92 -Fourth InternationalConference on Recent Trends in Computer Science & Engineering. [42] J. Xu, E. Huang, C. Chen and L. Hay Lee, “Simulation optimization:a review and exploration in the new era of cloud computing and bigdata”, Asia-Pacific Journal of Operational Research Vol. 32, No. 3(2015) 1550019 (34 pages), World Scientific Publishing Co. &Operational Research Society of Singapore.[43] J. Xu, S. Zhang, E. Huang, C.H. Chen, L.H. Lee and N. Celik“Efficient multifidelity simulation optimization”, Proceedings of 2014Winter Simulation Conference, 2014.[44] P. Pääkönen, “Feasibility analysis of AsterixDB and Spark streamingwith Cassandra for stream based processing”, Journal of Big Data, ‐‐