Next Generation Business Intelligence and Analytics: A Survey
Quoc Duy Vo, Jaya Thomas, Shinyoung Cho, Pradipta De, Bong Jun Choi, Lee Sael
IEEE COMMUNICATIONS SURVEYS & TUTORIALS
Quoc Duy Vo∗, Jaya Thomas∗, Shinyoung Cho∗, Pradipta De†, Bong Jun Choi∗, Lee Sael∗
∗Department of Computer Science, SUNY Korea, Incheon, South Korea
Department of Computer Science, Stony Brook University, New York, USA
Email: {rayvo, jaya.thomas, sycho, sael, bjchoi}@sunykorea.ac.kr
†Department of Computer Sciences, Georgia Southern University, Georgia, USA
Email: [email protected]
Abstract—Business Intelligence and Analytics (BI&A) is the process of extracting and predicting business-critical insights from data. Traditional BI focused on data collection, extraction, and organization to enable efficient query processing for deriving insights from historical data. With the rise of big data and cloud computing, there are many challenges and opportunities for BI. In particular, with the growing number of data sources, traditional BI&A is evolving to provide intelligence at different scales and perspectives: operational BI, situational BI, and self-service BI. In this survey, we review the evolution of business intelligence systems in full scale, from the back-end architecture to the front-end applications. We focus on the changes in the back-end architecture that deals with the collection and organization of the data. We also review the changes in the front-end applications, where analytic services and visualization are the core components. Using a use case from BI in healthcare, which is one of the most complex enterprises, we show how BI&A will play an important role beyond its traditional usage. The survey provides a holistic view of Business Intelligence and Analytics for anyone interested in a complete picture of the different pieces in the emerging next generation BI&A solutions.
Index Terms—Business intelligence, Operational BI, Situational BI, Self-service BI, Machine Learning, BI data analytics, Integrative data analysis, Healthcare BI, Data governance
I. INTRODUCTION
Business Intelligence can be expressed as the automated process of collecting raw data from heterogeneous sources and organizing them in a systematic manner. With these automated processes, models and insights can be derived from the data to improve business processes. The best practice in enterprise BI architectures is to split the back-end architecture, associated with data collection and data organization, from the front-end, where data are analyzed and displayed to the user. Transaction data are generated when transactions are processed and are stored in an Online Transaction Processing (OLTP) server, also called an Operational Data Source. From the OLTP servers, data are extracted, transformed, and loaded into a data warehouse, which is a structured data repository. Different query optimization techniques can be applied on the data warehouse to speed up the analytics queries that run on it. Further speed-up can be achieved through the creation of data marts, which are subsets of the data warehouse.

In addition to the traditional data sources, i.e., transaction data, the sources of BI data are evolving to include even the messages sent via company intranets and the personal profiles of employees and customers from the web. Mobile devices and other sensors also add to the data sources. However, many of these data sources are not structured. Texts from messages posted on online social networks and data from different sensors are two types of unstructured data. This makes it challenging to maintain the data in a traditional relational database while still achieving query efficiency. From the perspective of data analysis, more data means more opportunity for the analytics engines to discover insight. However, the big data challenges remain even from the analytics perspective.

The increase in data opens up opportunities for expanding the scope of BI beyond being just a mechanism to analyze trends from historical data.
Situational BI can combine real-time data from sensors and other personal information to infer insights that are not traditionally available [1]. Operational BI is concerned with providing real-time insights into business operations; for example, a call center operative may benefit from getting instant feedback on their work. Analytics is also evolving with the notion of self-service BI, where users may compose the analytics rules based on meta-information about the data exposed to them. These new approaches to BI, however, must be carefully orchestrated such that the enterprise governance and compliance models are not violated.

In this survey, we capture the shifting trends in BI architecture. For the back-end, we show how different technology evolutions are transforming the architecture. For the front-end, where analytics engines play the pivotal role, we focus on different trends in machine learning that are enabling the evolution of BI from a traditional historical analysis tool. We also chart how the challenges in enforcing enterprise governance models are being addressed in this landscape.

Business Intelligence is no longer just a tool to support enterprise environments. It can be used by public enterprises and by the government to understand social-scale initiatives and predict requirements. We showcase a healthcare use case to illustrate how the evolved BI architectures fit into practice. Two other technology trends, namely big data and cloud computing, are also closely tied to the changes happening in the BI architecture. We present the connections to big data and the use of cloud computing as opportunities, and discuss research challenges.

Fig. 1: Different components in traditional BI architecture.
Fig. 2: A traditional BI system.

II. PRELIMINARIES
A. Traditional BI
Traditional BI systems use reporting mechanisms to access transaction data stored in a data warehouse. Analyzing transaction data can help us detect patterns and predict business trends. A traditional BI system consists of three separate layers, as shown in Figure 2: the presentation layer, the application layer, and the database layer. With the three-tier architecture, it is challenging to fulfill service-level objectives such as maximal response time and minimal throughput rates. This is due to the difficulty of predicting execution times when the application layer does not know about the data storage management at the lower-level layers. Although a typical BI system can give us a forward view of the business, it is well known that traditional BI systems are slow, rigid, and time-consuming, and that their maintenance requires expert knowledge. Much research has been conducted toward improving the three-tier architecture as well as adding the modern features of next generation BI.
B. Modern Features of Next Generation BI

1) Operational (Real-Time) BI:
The competitive pressure on today's businesses has increased the need for near real-time BI, also called operational BI. The goal of operational BI is to reduce the latency between data acquisition time and data analysis time. Reducing this latency enables the system to take appropriate actions as soon as an event occurs. With operational BI realized, businesses can detect patterns or temporal trends over streaming operational data.
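The low-latency idea above can be sketched minimally: a running aggregate is updated as each event arrives, so an alert can fire the moment a threshold is crossed rather than after a batch ETL cycle. The metric name and threshold below are hypothetical, purely for illustration.

```python
from collections import deque

class StreamingMonitor:
    """Maintains a sliding-window average over an event stream and
    signals an alert as soon as the average crosses a threshold."""

    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Ingest one measurement; return True if an alert should fire."""
        self.window.append(value)
        avg = sum(self.window) / len(self.window)
        return avg > self.threshold

# Hypothetical call-center metric: seconds of hold time per call.
monitor = StreamingMonitor(window_size=3, threshold=60)
hold_times = [30, 45, 50, 90, 120, 150]
alerts = [monitor.observe(t) for t in hold_times]
```

The alert fires at the fourth call, while a nightly batch report would only surface the degradation the next day.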
2) Situational BI:
Situational BI enables situational awareness. Situational BI is important for companies where fast shifts of situations, often external business trends, affect the business [2]. However, such external information, which mostly comes from the corporate intranet, external vendors, or the internet, is unstructured. Moreover, these unstructured data need to be integrated with structured information from the local data warehouse to support decision making in real time. For example, a business might want to know whether its customers are posting positive or negative comments about its new product. With the analysis of the comments, businesses can provide immediate feedback to the development team to make the product more competitive. As another example, it is important for a company to know whether a natural disaster has affected its contracted suppliers. Being aware of natural disasters enables business operatives to take the actions necessary to minimize the loss [3].
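The comment-polarity example above can be illustrated with a deliberately crude lexicon-based scorer. The word lists and comments are invented for illustration; a production system would use a trained sentiment model over the unstructured stream.

```python
# Illustrative lexicons (hypothetical); not a real sentiment model.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"broken", "slow", "refund", "disappointed"}

def sentiment(comment):
    """Crude lexicon polarity: +1 per positive word, -1 per negative word."""
    words = [w.strip(".,!?") for w in comment.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

comments = [
    "Love the new product, excellent battery",
    "Screen arrived broken, want a refund",
]
scores = [sentiment(c) for c in comments]
```

Even this sketch shows the shape of the pipeline: unstructured text is reduced to a structured score that can be joined with warehouse data (product, region, time) for real-time feedback.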
3) Self-service BI:
Self-service BI (SSBI) enables end users to create analytical queries and reports without the IT department's involvement. The user interface in SSBI applications must be user-friendly, intuitive, and easy to use, so that technical knowledge of the data warehouse is not required. Users should also be allowed to access or extend not just IT-curated data sources, but also non-traditional ones.
Fig. 3: A classification of database systems
C. Data Architecture

1) Background and Challenges:
The traditional architecture of business applications consists of three separate layers: presentation, application, and database. With the three-tier architecture, the execution time is hard to predict due to the correlation between low-level data management operations and high-level processes. Workload management solutions are usually built on top of general-purpose database management systems, which incur time delays when executing requests in parallel. This creates challenges for modern business applications that need to work as operational or real-time BI. Therefore, technologies that enable performing analytical queries and business transaction queries at the same time on the same data are important.

Today's enterprises use an extraction, transformation, and load (ETL) model to extract data, perform transformations, and load the transformed data into the data warehouse. This model relies on two types of processes that are vital to business operations: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP is used to manage business processes, such as order processing. OLAP is used to support strategic decision making, such as sales analytics. Workloads from OLTP and OLAP are traditionally executed on the same database system. However, OLAP workloads consist mostly of massive read-only operations on data that is constantly being updated by OLTP. Therefore, when both workloads execute on a single database, transaction processing performance can be unpredictable due to resource contention. It is thus beneficial to separate the workloads for OLTP and OLAP.
Figure 3-a shows the traditional ETL-based BI where OLTP and OLAP are separate. In this architecture, each OLAP workload has to wait until the data in the data warehouse are completely updated and visible, causing delays.

To reduce the delay, today's operational BI systems perform OLTP and shorter-running analytical queries, called short OLAP workloads, together on the operational database management system (DBMS), as shown in Figure 3-b. However, many short OLTP transactions, which make changes to the database, may conflict with longer-running OLAP workloads. High synchronization overhead is required to handle the resource contention, which results in low overall resource utilization.

In addition, commercial DBMSs use special techniques, such as shadow copies [4], to handle mixed workloads with low performance overhead. That is, different workloads are separated and executed on different logical copies of the data. This may cause additional space overhead, which increases the infrastructure requirements and costs. Therefore, managing these mixed workloads (OLTP and OLAP) in the data management systems is a big challenge for current disk-based DBMSs [5].
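The ETL flow described above can be sketched minimally with two in-memory SQLite databases standing in for the OLTP source and the warehouse. The table and column names are hypothetical; the point is only the extract, transform (aggregate), load sequence.

```python
import sqlite3

# Hypothetical OLTP source: one row per order transaction.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 100.0), (2, "EU", 50.0), (3, "US", 200.0)])

# Hypothetical warehouse target: an aggregated, query-friendly fact table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")

# Extract from the operational store ...
rows = oltp.execute("SELECT region, amount FROM orders").fetchall()

# ... Transform (aggregate per region) ...
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# ... Load into the warehouse for analytical (OLAP) queries.
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())

result = dict(warehouse.execute("SELECT region, total FROM sales_by_region"))
```

The batch boundary between the two connections is exactly the source of the staleness that operational BI systems try to eliminate: analytical queries on `sales_by_region` see nothing until the load step completes.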
D. Recent BI systems

1) Extended traditional BI systems:
In this section, we present existing traditional BI techniques which can perform OLTP transactions and OLAP queries on the same database without interfering with each other. Given the dramatic explosion of dynamic data volumes, integrating these mixed workloads on the same system requires extreme performance improvements.

• In-memory database (IMDB)
In most of today's BI systems, the mixed workload comprised of OLTP and OLAP on a single system can be handled by using an in-memory (or main-memory) database (IMDB). This technique requires the system to store all data in main memory, which is faster than disk-optimized databases since the internal optimization algorithms are simpler and use fewer CPU instructions. When querying the data, this approach provides faster and more predictable performance than disk by eliminating the seek time.

However, IMDB systems lack durability, since the stored information is lost when the device loses power or is reset. Many IMDB systems have proposed different mechanisms to support durability, such as snapshot files, transaction logging, non-volatile DIMMs, non-volatile random access memory, and high availability. Table I shows recent BI systems which use different approaches to keep most or all data in main memory in order to obtain high OLTP throughput rates. For example, the H-Store system operates on a distributed cluster of shared-nothing machines where the data resides entirely in main memory. By removing traditional DBMS features, such as buffer management, locking, and latching, the H-Store system can execute transaction processing at high throughput rates. The H-Store prototype was recently commercialized by a start-up company named VoltDB [9].

• Hybrids with on-disk database
| System | Type | Approaches | Achievements | Year |
| H-Store [6] | IMDB | Distributed, row-store technique | High OLTP throughput rate | 2008 |
| Radu Stoica [7] | Hybrid | Data re-organization, reduced paging I/O, improved memory hit rate | High performance | 2013 |
| Siberia [8] | Hybrid | Cold data access and migration mechanisms | Acceptable access rates with 7-14% throughput loss | 2014 |
TABLE I: A summary of traditional BI systems which handle mixed workloads (OLTP and OLAP).

Although main memory is becoming large enough to hold most OLTP databases, it may not always be the best option. Based on the access patterns of OLTP workloads, where some records are "hot" (frequently accessed) and others are "cold" (infrequently or never accessed), recent systems tend to store the coldest records on fast secondary storage devices while hot records reside in memory to guarantee good performance. For example, Stoica and Ailamaki [7] proposed a method to migrate data of a main-memory database to a larger and cheaper secondary storage. In their work, in order to reduce OS paging I/O and improve main-memory hit rates, the relational data structures are re-organized using the access statistics of the OLTP workloads.

Recently, Siberia has been introduced as a framework for managing cold data in the Microsoft Hekaton IMDB engine [8]. Similar to [7], it does not require a database to be stored entirely in main memory; only some tables can be declared main-memory resident and managed by Hekaton. Siberia focuses on how to migrate records to and from the cold store and how to access and update records in the cold store in a transactionally consistent manner. The experimental evaluation shows that when the cold store resides on commodity flash, Siberia leads to an acceptable throughput loss of 7-14%, given cold data access rates appropriate for a main-memory optimized database.
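The hot/cold split underlying these hybrid systems can be illustrated with a minimal sketch: partition record identifiers by access frequency, keeping hot records in memory and marking the rest for migration. The record ids, counts, and threshold below are hypothetical; real systems such as Siberia use far more sophisticated, transactionally consistent classifiers.

```python
def split_hot_cold(access_counts, hot_threshold):
    """Partition record ids by access frequency: records accessed at least
    `hot_threshold` times stay in memory; the rest migrate to secondary
    storage. `access_counts` maps record id -> number of accesses."""
    hot = {rid for rid, n in access_counts.items() if n >= hot_threshold}
    cold = set(access_counts) - hot
    return hot, cold

# Hypothetical OLTP access statistics gathered over a monitoring window.
stats = {"r1": 120, "r2": 3, "r3": 0, "r4": 57}
hot, cold = split_hot_cold(stats, hot_threshold=50)
```

The design choice mirrored here is that classification is driven by observed workload statistics, not by static schema annotations, so the hot set can shift as the workload does.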
2) BI Systems with modern features:
In this section, we describe three modern BIs: operational BI, situational BI, and self-service BI. Table II shows a categorization of recent BI systems in terms of modern features.

While the H-Store system is limited to OLTP transaction processing, a recent system called HyPer can handle mixed workloads from both OLTP and OLAP at extremely high throughput rates using a low-overhead mechanism for creating differential snapshots [10]. This system employs a lock-less approach which allows all OLTP transactions to be executed sequentially or on private partitions. In parallel to the OLTP processing, the HyPer system executes OLAP queries on the same consistent snapshot. These virtual memory snapshots are created by forking the OLTP process and kept consistent via the implicit operating system/processor-controlled lazy copy-on-write mechanism.

Similar to H-Store, MobiDB is a special-purpose main-memory DBMS which guarantees serializability and supports mixed workloads using a queuing approach [11]. Instead of processing incoming transactions and periodic business queries right away, MobiDB adds them to a queue and processes them later. These requests are first analyzed to estimate how long they would take to perform. Using this analysis and the required guarantees in terms of throughput rate and latency, MobiDB decides when to execute the queued requests adaptively. Therefore, the execution times can be estimated quite accurately.

To alert business managers of situations that can potentially affect their business, Castellanos et al. [12] propose a novel platform, called SIE-OBI, that integrates the functionalities required to exploit relevant fast-stream information from the web. The authors propose novel algorithms which extract and correlate the information obtained from the web with historical data stored in the data warehouse to detect situation patterns.
Only relevant information is extracted from two or more disparate sources of unstructured data, typically one internal slow text stream and one external fast text stream. This platform is designed to reduce the time and effort of building data flows that integrate structured and unstructured, slow and fast streams, and to analyze them in near real time.
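The queuing idea behind MobiDB can be sketched as follows: each request arrives with an estimated execution cost, and the scheduler drains the queue cheapest-first so short OLTP transactions are not stuck behind long OLAP queries. This is a deliberate simplification of MobiDB's adaptive scheduling; the request names and costs are hypothetical.

```python
import heapq

class QueuedScheduler:
    """Simplified sketch of a cost-aware request queue: requests carry an
    estimated execution cost, and drain() dequeues cheapest-first so that
    short transactions are not delayed by long analytical queries."""

    def __init__(self):
        self._queue = []
        self._counter = 0  # tie-breaker preserving arrival order

    def submit(self, name, estimated_cost):
        heapq.heappush(self._queue, (estimated_cost, self._counter, name))
        self._counter += 1

    def drain(self):
        """Return request names in execution order."""
        order = []
        while self._queue:
            _, _, name = heapq.heappop(self._queue)
            order.append(name)
        return order

sched = QueuedScheduler()
sched.submit("olap_sales_report", estimated_cost=500)
sched.submit("oltp_update_order", estimated_cost=2)
sched.submit("oltp_new_order", estimated_cost=3)
order = sched.drain()
```

Because every request passes through the queue with a cost estimate attached, overall latency and throughput become predictable, which is the property the queuing approach is after.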
E. Data Governance

1) Background:
Data Management International (DAMA I) [13] defines data governance as "the exercise of authority and control over the management of data assets, and the planning, supervision and control over data management and use". Data governance defines the roles and responsibilities of the organization to promote desirable behavior in the use of data [14]. Data governance is differentiated from data management, which involves determining standards for data quality and making and implementing decisions [15]. It is also differentiated from BI governance, which aims to provide a customized framework for decision making through the governance of all activities within the BI environment [16]. DAMA I [17] defines ten data management functions, as shown in Figure 4. The data governance function provides high-level planning, supervision, and control over all other functions.

In this section, we focus on only four data management functions related to next generation BI, which requires fast access to data, utilizing external data, and allowing general users to analyze data. Data architecture management involves setting data standards, developing and maintaining the enterprise data architecture, and connecting the application projects with the architecture. Data quality management focuses on planning, applying, and controlling activities that apply quality management techniques to measure, assess, improve, and ensure that the data is fit for use. Data warehousing and business intelligence management focuses on providing decision support data for reporting, query, and analysis. Metadata management focuses on activities that enable easy access to high-quality metadata, such as architecture, integration, control, and delivery.

| System | Approaches | Achievements | Operational BI | Situational BI | Self-Service BI | Year |
| HyPer [10] | Hardware-assisted replication mechanisms; copy-on-write mechanism | Fast OLAP query response times; high throughput rates for both OLTP and OLAP | O | X | X | 2011 |
| MobiDB [11] | Queuing approach | Low latency; high throughput rates for both OLTP and OLAP; optimum space overhead | O | X | X | 2011 |
| SIE-OBI [12] | Data extraction algorithm; information correlation | Reduced latency; reduced effort to build data flows | O | O | X | 2012 |

TABLE II: A categorization of recent BI systems in terms of modern features.

Fig. 4: Data governance framework.
2) Deploying Next Generation BI in Data Governance:
The governance of data is becoming vital to an enterprise as data becomes an asset. An enterprise derives business value and makes decisions based on the information derived from data. Thus, data needs to be governed to ensure its quality, which directly affects the quality of the decisions taken by an enterprise [18].

More effective governance of data can result in a higher level of decision making. To achieve effective governance of data, data governance maturity models help an enterprise to understand its data governance and to know the anticipated next steps [19]. Several data governance maturity models [20] have been proposed to guide an enterprise in recognizing the level at which its data governance stands. Oracle Corporation [19] predicted that a data governance maturity model will assist the enterprise in determining where it is in the evolution of its data governance discipline, identifying the short-term steps necessary to get to the next level, and improving its data governance capabilities. The highest maturity level of the Oracle model is integrating data governance with business intelligence.

Next generation BI supports near real-time insights and uses the influx of external information, which creates a huge amount of data flows and manipulations. This requires highly matured data governance to provide data quality, integrity, and reliability. These three properties are crucial for extracting accurate insight via data mining techniques. For example, self-service BI tools, e.g., Tableau and QlikTech, allow users to discover insight from multiple data sources without modeling the data environment and creating complicated ETL processes, which are among the most difficult and time-consuming tasks in BI. These new features allow users to access data easily and get quick results and agile data visualization. To enable the evolution of the next generation BI, data governance is crucial for the reliability of the insight discovered.
For example, in the case of self-service BI, the fact that end users are able to access and manipulate their own data decreases the reliability of the results of BI [21]. In data governance, useful functions can be considered to ensure reliability, such as tracking the data lineage back to the source and creating logs of how the data were manipulated or transformed. Integrating data governance into the next generation BI, however, faces some challenges due to the requirement of agile and reliable responses in the presence of a huge amount of external data and general user participation.
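The lineage-tracking function mentioned above can be sketched as a thin wrapper that logs every transformation applied to a dataset, so any result can be traced back to its source. The source file name and transformations are hypothetical, chosen only to show the mechanism.

```python
class TrackedDataset:
    """Wraps a list of records and logs every transformation applied to it,
    a minimal sketch of data lineage for self-service BI."""

    def __init__(self, records, source):
        self.records = list(records)
        self.lineage = [f"loaded from {source}"]

    def transform(self, description, fn):
        """Apply fn to every record and append a human-readable log entry."""
        self.records = [fn(r) for r in self.records]
        self.lineage.append(description)
        return self

# Hypothetical source file and transformations.
data = TrackedDataset([100, 250, 175], source="crm_export.csv")
data.transform("convert cents to dollars", lambda v: v / 100)
data.transform("round to 1 decimal", lambda v: round(v, 1))
```

After the two steps, `data.lineage` holds the full manipulation history, which is exactly the audit trail a governance function would inspect when a self-service result is questioned.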
3) Data governance challenges:
There are two major features of the next generation BI that affect the data governance model. In the next generation BI, decision making should be more effective and expeditious over a large amount of data from multiple sources and formats. Data from multiple sources, however, make data governance more complicated to control and difficult to manage properly, which can in turn result in ineffective decision making. If data from different sources conflict, the decision maker has to do much more research and analyze the various data and their sources to determine or approximate what is true and accurate, and these processes are costly. Thus, it is important to manage data across heterogeneous sources and applications in the next generation BI system.

In the next generation BI, especially self-service BI, business users are being involved in the decision making procedure. In general, a centralized IT organization and several data stewards have engaged in data governance initiatives, maintaining a data management platform, a metadata repository, and a suite of data management tools to handle disparate data. In advance, they standardize common data definitions for master data and reference data, which are broadly shared across many enterprise applications. When they receive disparate data, they match them to the predefined common data definitions, determine their quality, define any rules, transform them, and integrate them. But in the next generation BI, users also define their own data names, and manipulate or integrate data on their own using different self-service BI tools. They might want to upload the results to the database and share their insights with others. Business user participation in the data process can throw data into chaos, since the same data can be transformed and integrated in different ways by a centralized organization and data stewards with data management tools and by business users with self-service BI tools. Thus, metadata sharing standards for common data transformations, common data names, and common integration rules are crucial across these participants [22].
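The matching of disparate, user-defined field names to predefined common definitions can be sketched as a small metadata registry. The canonical names and aliases below are hypothetical; in practice the registry would be the shared metadata repository maintained by the data stewards.

```python
# Hypothetical shared metadata registry: canonical name -> accepted aliases.
REGISTRY = {
    "customer_id": {"cust_id", "customerid", "client_no"},
    "order_total": {"total", "order_amount"},
}

def canonicalize(record):
    """Rename user-defined fields to their canonical names; unknown fields
    are kept as-is so no user data is silently dropped."""
    out = {}
    for key, value in record.items():
        canonical = next((c for c, aliases in REGISTRY.items()
                          if key == c or key in aliases), key)
        out[canonical] = value
    return out

# A row as a business user might name it in a self-service tool.
user_row = {"cust_id": 42, "order_amount": 99.5, "notes": "rush"}
clean = canonicalize(user_row)
```

Routing every self-service upload through one such mapping is what keeps centrally governed data and user-generated data from diverging into the "chaos" described above.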
4) Data governance model for next generation BI:
The design of the data governance model is classified as centralized vs. decentralized and hierarchical vs. cooperative. The centralized design assigns all decision making authority to a central IT department, while the decentralized design distributes the authority over individual business units [14].
| Roles | Description | Tasks |
| Executive Sponsor | Provides sponsorship, strategic direction, funding, advocacy, and oversight | Consulted |
| Data Quality Board | Defines the data governance framework for the whole enterprise and controls its implementation | Consulted, Informed |
| Chief Steward | Puts the board's decisions into practice, enforces the adoption of standards, helps establish DQ metrics and targets | Consulted |
| Business Data Steward | Details corporate-wide DQ standards and policies for his/her area of responsibility from a business perspective | Accountable, Responsible |
| Technical Data Steward | Provides standardized data element definitions and formats, profiles and explains source system details and data flows between systems | Accountable, Responsible |
TABLE III: Specification of the Data Governance Model for Next Generation BI.

III. NEXT GENERATION FRONTEND ARCHITECTURE
A. Data
Business insight is obtained from raw data that is heterogeneous in nature. The heterogeneity in the data may result from differences in data sources, data format, data type, or diversity in the data extraction process. Depending on the context of study, data sources can be human generated, machine generated, internal data sources, web and social media, transaction data, biometric data, etc. Further, the data format may vary from structured, unstructured, and semi-structured to images, text, videos, audio, etc. [23], [24]. Considering the data type, there is heterogeneity across metadata, master data, and historical and transactional data. In business intelligence studies, the heterogeneity in data is also contributed by the data extraction method, which may depend on on-demand feeds, continuous feeds, real-time feeds, or time series.
B. BI Analysis
The growing volume of data in many businesses makes cost-effective manual data analysis virtually impossible. The use of data mining techniques in business not only handles the volume and variety of data, but also helps to take proactive, knowledge-driven decisions and enhance business intelligence in general. Data mining is a broad term that covers a number of processes, such as data modeling techniques, statistical analysis, and machine learning, that search large amounts of data for consistent patterns or relationships and derive predictive information [25]. Data mining leverages machine learning algorithms to improve predictive analysis. Machine learning techniques have become popular in BI as they can handle the growing volumes and varieties of available data, and computational processing has become cheaper and more powerful. The two broad categories of machine learning algorithms are supervised and unsupervised [26]. Supervised learning involves training data that contain sets of labeled training examples, whereas unsupervised learning draws inference from training data without labeled responses.

Business intelligence and analytics share a mutual relationship. Analytics focuses on uncovering data insights that may be beneficial for strategic planning, resulting in competitive advantages, by analyzing customer and market behavior in new ways to deliver real insight more quickly. The traditional BI analysis process followed a consistent procedure to explore future decisions from historical data. The common tools used to carry out the analysis include sophisticated data analysis tools such as OLAP/ROLAP, machine learning tools, and visualization tools [27]. These tools provide business users direct access to interact with and visualize the business data without knowing the technical complexity of data retrieval, storage, and processing.
OLAP is able to provide some analytic functionality, such as exploring large amounts of data stored in a multidimensional database and their relationships, carrying out complex computation, and visually representing results from different points of view [28]. However, such analysis fails to give deeper insight and address questions such as "why", which require the more exploratory perspectives provided by machine learning techniques. These techniques are employed to carry out predictive analytics tasks, build analytic models at a lower level, search for predictable behaviors and business rules, and look for answers that predict performance and prescribe specific actions or recommendations. The techniques include regression modeling, clustering, neural networks, genetic algorithms, text mining, decision trees, and more.
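The contrast between OLAP and predictive modeling can be made concrete with a minimal example: where OLAP would aggregate past sales, a regression model extrapolates them. The closed-form least-squares fit below is standard; the monthly sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical monthly sales: OLAP would sum them, regression extrapolates.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]
a, b = fit_line(months, sales)
forecast_month_6 = a + b * 6
```

The aggregate answers "what happened"; the fitted line answers the forward-looking question, which is the added value of the predictive layer in the BI front-end.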
C. Time series forecasting
In this survey, we discuss dynamic decision techniques for time series forecasting. Time series data are characterized by high dimensionality, large volume, and continuous evolution. The analysis of time series data is a powerful analytics tool, as it helps to address questions such as the rate of change in user behavior over time, covariance between products, marketing promotion strategies, current trends in product sales, profit monitoring, anomaly detection, etc., in nearly all enterprises, such as sales, manufacturing, mobile companies, hospitals, etc.
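The anomaly-detection use mentioned above can be sketched with a simple rule: flag any point that deviates from the mean of the preceding window by more than k standard deviations. The hourly-sales series and parameters are hypothetical; this is a baseline, not any of the surveyed methods.

```python
from statistics import mean, pstdev

def flag_anomalies(series, window, k):
    """Flag points deviating from the mean of the preceding `window`
    values by more than `k` times that window's standard deviation."""
    flags = []
    for i, value in enumerate(series):
        past = series[max(0, i - window):i]
        if len(past) < window:
            flags.append(False)  # not enough history yet
            continue
        m, s = mean(past), pstdev(past)
        flags.append(s > 0 and abs(value - m) > k * s)
    return flags

# Hypothetical hourly sales counts with one spike.
series = [10, 11, 10, 12, 11, 40, 11, 10]
flags = flag_anomalies(series, window=4, k=3)
```

Only the spike is flagged; because the window slides forward, the detector adapts as the series evolves, which is the property that matters for continuously evolving time series.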
1) Machine Learning in financial forecasting:
Financial prediction of stock prices is of great interest for investors as well as analysts. However, predicting the right time to buy or sell a stock is not an easy task, as the price is influenced by many factors. We list some of the machine learning techniques from the literature commonly used to carry out such predictions.

Support Vector Machine (SVM) is a popular supervised machine learning algorithm, which can be used for classification or regression problems. SVM algorithms are able to capture complex relationships between data samples without carrying out difficult transformations. Cao et al. [29] proposed an application of SVM for financial time series forecasting using a single data source. The work proposed an SVM with adaptive parameters (ASVM) to handle the structural changes in financial data. The experiment was carried out on five real futures contracts collated from the Chicago Mercantile Market.

Fuzzy logic methods have proven effective in dealing with complex systems containing uncertainties that are otherwise difficult to model. In [30], a granular-computing-based fuzzy model is proposed to improve the accuracy of financial forecasts. The experiments considered a first-order fuzzy time series model on the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX), Dow-Jones Industrial Average (DJIA), S&P 500, and IBOVESPA stock indexes.

Hybrid methods are also widely used. A hybrid neurogenetic approach that combined a recurrent neural network and a genetic algorithm was used to predict economic growth [31]. A recurrent neural network with one hidden layer was used for the prediction model, and the genetic algorithm was used to optimize the weights. The data used were from 36 companies on the New York Stock Exchange (NYSE) and the National Association of Securities Dealers Automated Quotations (NASDAQ).
Another hybrid approach, regularized least squares fuzzy support vector regression, was proposed to address the noise and non-stationarity existing in financial time series data [32]. Six financial data sets were gathered from the Yahoo finance website for IBM, Microsoft, Google Inc., Red Hat Software, Citigroup, and the Standard & Poor's 500. The performance of multi-output support vector regression (MSVR) on interval-valued data was evaluated over short and long horizons [33]. The global datasets used for testing were the S&P 500 for the US, the FTSE 100 for the UK, and the Nikkei 225 for Japan.

Today, stock market databases are flooded with a wide range of complex data from diverse sources: market data, reference data, exchange and security descriptions, fundamental data such as enterprise financials, analyst reports, and filings, and even data from social media, which may include blogs, web feeds, etc. Moreover, financial data is highly dynamic and volatile, which raises the need for an integrative approach that combines data from other sources and contributes to the accuracy of the forecast.
2) Machine learning in sales forecasting:
The prediction of future sales by an enterprise is termed sales forecasting, which is part of its critical management strategy. Machine learning techniques such as genetic algorithms (GAs) are suitable candidates for this task, since GAs are most useful in multiclass, high-dimensional problems where heuristic knowledge is sparse or incomplete. Neural networks are also considered efficient computing models for pattern classification, function approximation, and regression problems. The sales forecasting problem for printed circuit board (PCB) sales was addressed in [34], where the model is built by integrating GAs and neural networks. The study was carried out for PCB electronic industries in Taiwan, where the data features included monthly sales amount, total production square measures, etc. PCB sales were also studied in [35] using integrated genetic fuzzy systems (GFS) and data clustering. The approach was experimented on a PCB data source with parameters such as pre-processed historical data, the consumer price index, liquid crystal element demand, and the total production value of PCB. Kuo et al. [36] proposed a hybrid machine learning system with a fuzzy neural network on data from a local supermarket chain. Their method enabled the incorporation of expert knowledge in the forecasting. In one example of automobile sales forecasting [37], an adaptive network-based fuzzy inference system was considered, which included several economic variables for prediction, such as current automobile sales quantity, coincident indicator, leading indicator, wholesale price index, and income. The data analyzed covered the whole automobile market in Taiwan, including sales data for sedans, small commercial vehicles, and large commercial vehicles. Sales forecasting is very complicated owing to the influence of internal and external environments.
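The GA-plus-neural-network idea behind [34] can be sketched minimally: a small genetic algorithm evolves the weights of a one-hidden-layer network to fit a synthetic seasonal sales series. The population size, mutation scale, network size, and the data itself are illustrative assumptions, not the parameters of the cited study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic monthly "sales": trend + yearly seasonality + noise, scaled to [0, 1].
months = np.arange(72)
sales = 50 + 0.5 * months + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 72)
sales = (sales - sales.min()) / (sales.max() - sales.min())

# Features: previous 3 months; target: current month.
X = np.stack([sales[i:i + 3] for i in range(len(sales) - 3)])
y = sales[3:]

def predict(w, X):
    # One-hidden-layer network (3 -> 4 -> 1); weights flattened into vector w.
    W1 = w[:12].reshape(3, 4); b1 = w[12:16]
    W2 = w[16:20];             b2 = w[20]
    return np.tanh(X @ W1 + b1) @ W2 + b2

def fitness(w):
    return -np.mean((predict(w, X) - y) ** 2)   # GA maximizes fitness = -MSE

# Plain generational GA: tournament selection, blend crossover, Gaussian mutation.
pop = rng.normal(0, 0.5, (60, 21))
for gen in range(150):
    scores = np.array([fitness(w) for w in pop])
    new = [pop[scores.argmax()].copy()]          # elitism: keep the best as-is
    while len(new) < len(pop):
        i, j = rng.integers(0, len(pop), 2)
        p1 = pop[i] if scores[i] > scores[j] else pop[j]
        i, j = rng.integers(0, len(pop), 2)
        p2 = pop[i] if scores[i] > scores[j] else pop[j]
        new.append(0.5 * (p1 + p2) + rng.normal(0, 0.05, 21))  # crossover + mutation
    pop = np.array(new)

best = pop[np.argmax([fitness(w) for w in pop])]
mse = float(np.mean((predict(best, X) - y) ** 2))
print(f"GA-trained network MSE: {mse:.4f}")
```

The GA sidesteps gradient computation entirely, which is why such hybrids are attractive when the fitness function is noisy or non-differentiable; in practice the cited works combine this search with domain-specific features rather than raw lags.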
3) Machine learning in healthcare:
Time series data are studied by healthcare enterprises using machine learning to understand existing patterns that help the administration make strategic decisions. Predictions based on time series data include outpatient visits and customer behavior in choosing a hospital. Chang et al. [38] propose a fuzzy logic approach based on a weighted transitional matrix to forecast outpatient (patients who receive medical treatment without being admitted to a hospital) visits. The forecasting is important, as effective prediction helps the administration manage operations, distribute resources, and handle other aspects. The built model was tested on data gathered from the department of internal medicine in a hospital. The data had two features to be monitored: the month of the year and the number of outpatients. Another similar study was carried out by Hadavandi et al. [39], where a hybrid model was built that combines a genetic algorithm with fuzzy rule-based learning to
forecast outpatient visits. The data were collected from the department of internal medicine in a hospital in Taiwan and four big hospitals in Iran, the data features again being the month and the number of outpatients. In a different application, neural network techniques were used to make predictions about consumer behavior in choosing a hospital [40]. The results are useful as the hospital operating environment is becoming more competitive. The data features considered were the cost of medical care, accessibility, parking, hospital reputation, doctor reputation, doctors' medical skill, modern equipment, etc.
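A first-order weighted transitional matrix forecast in the spirit of [38] can be sketched as follows: monthly counts are partitioned into intervals, transitions between intervals are counted and row-normalized, and the forecast is the transition-weighted average of the interval midpoints. The synthetic data and the number of intervals are illustrative assumptions, and this simplification omits the fuzzification details of the cited model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic monthly outpatient counts with a seasonal pattern.
months = np.arange(60)
visits = 1000 + 150 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 30, 60)

# 1) Partition the value range into k equal intervals A_1..A_k.
k = 7
edges = np.linspace(visits.min() - 1, visits.max() + 1, k + 1)
mids = 0.5 * (edges[:-1] + edges[1:])
states = np.digitize(visits, edges) - 1        # interval index of each month

# 2) Weighted transition matrix: W[i, j] counts transitions A_i -> A_j,
#    each row normalized to a probability distribution (uniform if unseen).
W = np.zeros((k, k))
for a, b in zip(states[:-1], states[1:]):
    W[a, b] += 1
row = W.sum(axis=1, keepdims=True)
W = np.divide(W, row, out=np.full_like(W, 1.0 / k), where=row > 0)

# 3) Forecast: next value = transition-weighted average of interval midpoints.
def forecast(state):
    return W[state] @ mids

preds = np.array([forecast(s) for s in states[:-1]])
mae = float(np.mean(np.abs(preds - visits[1:])))
print(f"in-sample MAE: {mae:.1f} visits")
```

The resulting error is bounded roughly by the interval width, which is the usual trade-off in fuzzy time series models: more intervals sharpen the forecast but leave fewer observed transitions per row.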
D. Improving BI via integrative data analysis
In today's enterprise, data are created from multiple sources. Data from multiple sources can provide insights to increase productivity, improve policy making, support performance measurement, and help in strategic planning. These insights can deliver benefits such as improved customer satisfaction, quality improvement, increased accessibility and analysis of information, timeliness, and better information utilization. The need to handle heterogeneous data with automated analytics algorithms can be addressed through predictive analysis, which supports the study of the integrated data using machine learning algorithms that continuously improve the accuracy of predictive models and enable them to adjust.

An integrative analysis task involves machine learning techniques for integrating the available training data from different data sources in order to better analyze them and generate a proactive response. A single data source may not contain all the required information about a data object. Thus, combining multiple sources of information for a particular data object adds different or missing information that may lead to better prediction accuracy. Integrating data from multiple sources and making decisions from these combined sources is becoming common practice to enhance prediction performance in applications such as bioinformatics, image classification, and stock markets. The two most common integrative analysis techniques are Multiple Kernel Learning (MKL) and the Bayesian Network (BN) [23]. Tensor-decomposition-based data mining algorithms are also promising: tensors are high-dimensional arrays that naturally allow multiple sources of the same component to be integratively represented and analyzed, for both time series and static data [41]–[43]. However, we will mainly focus on applications of MKL.
1) Multiple kernel learning:
Multiple kernel learning (MKL) refers to a set of machine learning methods that use a predefined set of kernels, where kernel selection depends on the notions of similarity that may exist in the data source. MKL learns an optimal linear or non-linear combination of kernels as part of the algorithm, resulting in data integration. MKL algorithms have been developed for supervised, semi-supervised, as well as unsupervised learning.

In stock price prediction, it has been observed that relying only on historical time series data is not sufficient; rather, considering multiple sources of information, such as news and trading volume, can significantly improve stock market volatility prediction [44]. Experiments carried out on HKEx 2001 stock market datasets show that applying multiple kernel learning to multiple data sources results in higher accuracy and a lower degree of false prediction compared to single-source data.

The work in [45] shows that analyzing communication dynamics on the internet together with stock price movements may provide new insights into the relations between stock prices and communication patterns. The authors use MKL to combine information from time series data (stock price and stock volume) with other data sources, namely news and comments, including the frequency of news, the frequency of comments, the average length and standard deviation of the length of comments, the number of early/late responses, etc. The experiment was tested on the stocks of Amazon, Microsoft, and Google, and the MKL prediction model was found to outperform other baseline methods in terms of Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE).

Integrative data approaches are also applied to draw inferences from biomedical data. In [46], the paper reports the advantage of the MKL integrative approach, which thoroughly combines complementary information from heterogeneous data sources, over sparse integration methods for biomedical data.
The experiments covered several applications: prioritization of disease-relevant genes, including recently discovered prostate cancer genes, by genomic data fusion; clinical decision support by integrating microarray and proteomics data; clinical decision support in cancer diagnosis by integrating multiple kernels over genome-wide data; and computational complexity and numerical experiments on large-scale problems. Another application is seen in [47], where different biological measurements are integrated using regularized unsupervised multiple kernel learning to find cancer subtypes. In [48], Bayesian networks are used to integrate two different biological data types, gene expression and transcription factor binding (ChIP-chip), to discover transcriptional modules.

Integrative data analysis also finds applicability in high-dimensional hyperspectral classification. In a satellite remote sensing application [49], the MKL approach proves beneficial because of the high-dimensional features induced by using many heterogeneous information sources. The approach is based on the automatic optimization of a linear combination of kernels dedicated to different meaningful sets of features, such as groups of bands, contextual or textural features, or bands acquired by different sensors. The results showed good performance of the method in image classification when multispectral, hyperspectral, contextual, or multi-sensor information was used. Urban classification using MKL was carried out in [50] to integrate heterogeneous features from two data sources, i.e., spectral images and LiDAR data. Features from spectral images are good at identifying ground cover such as trees, grass, and soil, whereas features from LiDAR data perform better for classes with flat surfaces such as buildings.
The classification model built by combining data sources with a complementary relationship significantly improves the classification accuracy.
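A minimal MKL-style sketch of the ideas above: two RBF kernels computed on two hypothetical sources describing the same samples (say, price features and news features) are combined with weights proportional to their centered kernel-target alignment, one simple heuristic for kernel weighting. Full MKL methods learn the weights jointly with the predictor; all data and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic "sources" for the same 200 samples; only source 2 carries signal.
n = 200
X_price = rng.normal(0, 1, (n, 5))       # e.g., price-derived features
X_news = rng.normal(0, 1, (n, 10))       # e.g., news-derived features
y = np.sign(X_news[:, 0] + 0.5 * X_news[:, 1] + 0.2 * rng.normal(0, 1, n))

def rbf(X, gamma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K1, K2 = rbf(X_price, 0.1), rbf(X_news, 0.05)

def alignment(K, y):
    # Centered kernel-target alignment: how well K matches the ideal kernel y y^T.
    H = np.eye(len(y)) - 1.0 / len(y)
    Kc = H @ K @ H
    return (y @ Kc @ y) / (np.linalg.norm(Kc, "fro") * (y @ y))

# Weights proportional to alignment (negative alignments clipped to zero).
a = np.array([alignment(K1, y), alignment(K2, y)])
w = np.clip(a, 0, None)
w = w / w.sum()
K = w[0] * K1 + w[1] * K2        # the combined, "integrated" kernel

# Kernel ridge "classifier" on the combined kernel.
alpha = np.linalg.solve(K + 1e-2 * np.eye(n), y)
acc = float(np.mean(np.sign(K @ alpha) == y))
print(f"kernel weights: {w.round(2)}, train accuracy: {acc:.2f}")
```

The alignment heuristic correctly gives the informative source the larger weight; the point of MKL is that such weighting is data-driven, so uninformative or redundant sources are automatically discounted rather than hand-pruned.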
IV. USE CASES
A. Use case: BI in Healthcare
The wealth of available data in healthcare will continue to increase. This, together with the rising demand for improved quality of patient care, necessitates the improvement of healthcare BI. This rising demand can best be addressed by considering business intelligence technological solutions for data acquisition, storage, interpretation, and evaluation.
1) Heterogeneous data in healthcare:
Healthcare deals with huge heterogeneity in the data coming from different sources. The major challenges involved in managing healthcare data are its format, structure, and complex nature. Healthcare data occur in different formats: numeric, textual, digital, images, videos, multimedia, etc. An electronic health record (EHR) holds hundreds of rows of textual and numerical data corresponding to patient demographics, progress notes, vital signs, medical histories, diagnoses, radiology images, medications, and lab and test results. Healthcare data are both structured and unstructured. Structured data refer to lab or patient demographic data that are consistent and stored in a pre-defined format. Unstructured data are nonuniform yet can be of great value in analyzing patient data; they include clinical notes, audio voice dictations, email messages and attachments, text messages, online video, and typed transcriptions. The presence of structured and unstructured data makes healthcare data complex to process, and the complexity increases further as the number of variables increases.
2) Use of sensors in healthcare:
Business intelligence in healthcare provides a wide range of analytics to improve decision making related to both patients and performance, covering many functional areas, including resource planning, care delivery, patient accounting, and the financial and revenue cycle. Technological advancements have allowed greater accessibility to data. The use of sensors in healthcare produces large volumes of data continuously over time. An application can be seen in the Intensive Care Unit (ICU), where sensors such as ECG, EEG, blood pressure monitors, and respiratory monitors are used to track the current state of patients.
3) ICU case study:
Intensive care units have a data-rich environment with multiple sources of continuous data originating from medical devices, including electroencephalograms, bedside monitors, brain tissue oxygen monitors, central venous catheters (CVC), clinical information systems, and ventilators, resulting in several kilobits of data each second per patient. A patient in a severe health state is often monitored by a number of body sensors connected to monitoring devices producing large volumes of physiological data, along with comprehensive and detailed clinical data and minute-by-minute trends for the patient. ICU patient monitoring also involves systems that generate alerts or alarms when the physiological state of the patient shows relevant abnormalities or changes in the patient's condition. Physicians set certain thresholds which, when exceeded, trigger the alarm. However, such simple alerting schemes may result in a large number of false alarms. For example, Tsien and Fackler [51] found that 92% of alarms were false alarms in their observation of a pediatric intensive care unit, and in [52], the authors digitally recorded all the alarms for 38 patients in a 12-bed medical ICU and retrospectively assessed their relevance and validity: only 17% of the alarms were relevant, and 44% were technically false.

In case of a medical emergency, it is critical for the patient to receive medical aid at the earliest opportunity. Recent advancements in sensor technology help achieve this goal by monitoring crucial patient parameters such as heart beat, body temperature, blood measurements, EEG, etc. In this scenario, business intelligence can play a key role by carrying out predictive analysis that integrates the sensor data with the available traditional data. The processed data that give the vital statistics of the patient can be transmitted to the doctor to guide the paramedic traveling with the patient.
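The false-alarm problem above can be illustrated with a toy rule: compared with alarming on every threshold crossing, requiring a smoothed signal to stay above the threshold for several consecutive samples suppresses noise-driven alarms while still catching a sustained event. The threshold, window lengths, and synthetic vital-sign stream are illustrative assumptions, not a clinical algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic heart-rate stream: stable around 80 bpm with measurement noise,
# plus one genuine sustained tachycardia episode at samples 600..699.
hr = 80 + 8 * rng.standard_normal(1000)
hr[600:700] += 40

THRESH = 100  # alarm threshold in bpm (illustrative)

# Naive rule: alarm on every sample above the threshold.
naive_alarms = hr > THRESH

# Persistence rule: alarm only if a 5-sample moving average stays above the
# threshold for 10 consecutive samples.
smooth = np.convolve(hr, np.ones(5) / 5, mode="same")
above = smooth > THRESH
persist = np.array([above[max(0, i - 9):i + 1].all() for i in range(len(above))])

true_event = np.zeros(1000, dtype=bool)
true_event[600:700] = True

def false_alarms(alarms):
    # Number of alarm samples that fall outside the genuine episode.
    return int((alarms & ~true_event).sum())

print("naive false alarms:  ", false_alarms(naive_alarms))
print("persist false alarms:", false_alarms(persist))
print("episode detected:", bool(persist[600:700].any()))
```

The trade-off is detection latency: the persistence rule fires roughly ten samples into the episode rather than at the first crossing, which is why real alarm-validation work such as [52] evaluates both relevance and timeliness.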
In addition, this information can be used to take appropriate action as soon as the patient arrives at the hospital.

The complexity of the data means that existing approaches may not be suitable for managing data in healthcare. In healthcare, business rules and definitions are volatile and may change over time. This calls for a solution based on an agile approach that can handle data from multiple sources, in different formats, both structured and unstructured, and that manages the complexity within an ever-changing regulatory environment. Thus, intelligent machine learning algorithms are required to find coherent meaning from disparate data and to process heterogeneous data captured through different sensors.

V. CONCLUSION
We have reviewed traditional and next generation business intelligence and analytics in a holistic view. The three-tier traditional BI architecture is still valid. However, it is not sufficient for providing real-time analysis, situation awareness, and self-service capabilities. The next generation BIs, i.e., operational BI, situational BI, and self-service BI, each focus on realizing one of these three capabilities, which are becoming more important in BI as data become business assets. We looked at the challenges and enabling technologies for the three types of BI from the back-end perspective. We also pointed out that data governance is critical in the next generation BIs. On the front end, we examined data analytics methods, focusing on time series forecasting and integrative data analysis. We also looked at a healthcare use case and showed that next generation BI and analytics can save lives. Next generation BI&A is at its beginning, and several challenges remain in both the back-end architecture and the front-end analysis.

REFERENCES

[1] A. Löser, F. Hueske, and V. Markl, "Situational business intelligence," in Business Intelligence for the Real-Time Enterprise. Springer, 2009, pp. 1–11.
[2] A. Löser, F. Hueske, and V. Markl, "Situational business intelligence," in Business Intelligence for the Real-Time Enterprise, ser. Lecture Notes in Business Information Processing, M. Castellanos, U. Dayal, and T. Sellis, Eds. Springer Berlin Heidelberg, 2009, vol. 27, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-03422-0_1
[3] M. Castellanos, C. Gupta, S. Wang, and U. Dayal, "Leveraging web streams for contractual situational awareness in operational BI," in Proceedings of the 2010 EDBT/ICDT Workshops, ser. EDBT '10. New York, NY, USA: ACM, 2010, pp. 7:1–7:8. [Online]. Available: http://doi.acm.org/10.1145/1754239.1754248
[4] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems. Pearson, 2014.
[5] H. Kuno, U. Dayal, J. Wiener, K. Wilkinson, A. Ganapathi, and S. Krompass, "Managing dynamic mixed workloads for operational business intelligence," in Databases in Networked Information Systems, ser. Lecture Notes in Computer Science, S. Kikuchi, S. Sachdeva, and S. Bhalla, Eds. Springer Berlin Heidelberg, 2010, vol. 5999, pp. 11–26. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-12038-1_2
[6] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi, "H-store: A high-performance, distributed main memory transaction processing system," Proc. VLDB Endow., vol. 1, no. 2, pp. 1496–1499, Aug. 2008. [Online]. Available: http://dx.doi.org/10.14778/1454159.1454211
[7] R. Stoica and A. Ailamaki, "Enabling efficient OS paging for main-memory OLTP databases," in Proceedings of the Ninth International Workshop on Data Management on New Hardware, ser. DaMoN '13. New York, NY, USA: ACM, 2013, pp. 7:1–7:7. [Online]. Available: http://doi.acm.org/10.1145/2485278.2485285
[8] A. Eldawy, J. Levandoski, and P.-A. Larson, "Trekking through Siberia: Managing cold data in a memory-optimized database," Proc. VLDB Endow., vol. 7, no. 11, pp. 931–942, Jul. 2014. [Online]. Available: http://dx.doi.org/10.14778/2732967.2732968
[9] "VoltDB," https://voltdb.com/.
[10] A. Kemper and T. Neumann, "HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots," in Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ser. ICDE '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 195–206. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2011.5767867
[11] M. Seibold, A. Kemper, and D. Jacobs, "Strict SLAs for operational business intelligence," in Cloud Computing (CLOUD), 2011 IEEE International Conference on, July 2011, pp. 25–32.
[12] M. Castellanos, C. Gupta, S. Wang, U. Dayal, and M. Durazo, "A platform for situational awareness in operational BI," Decision Support Systems.
[13] The DAMA Guide to the Data Management Body of Knowledge. Technics Publications, Bradley Beach, 2009.
[14] K. Weber, B. Otto, and H. Österle, "One size does not fit all—a contingency approach to data governance," Journal of Data and Information Quality (JDIQ), vol. 1, no. 1, p. 4, 2009.
[15] V. Khatri and C. V. Brown, "Designing data governance," Communications of the ACM, vol. 53, no. 1, pp. 148–152, 2010.
[16] R. S. Seiner, "Real-world data governance: BI governance and the governance of BI data."
[17] D. M. Association et al., "DAMA DMBOK functional framework (version 3.02)," DAMA International, 2008.
[18] NASCIO, "Data governance: Managing information as an enterprise asset, part 1: An introduction," NASCIO Governance Series, 2009.
[19] ——, "Data governance part III: Frameworks: Structure for organizing complexity," NASCIO Governance Series, 2009.
[20] P. Aiken, M. D. Allen, B. Parker, and A. Mattia, "Measuring data management practice maturity: A community's self-assessment," Computer, vol. 40, no. 4, pp. 42–50, 2007.
[21] B. Potter and R. Software, "Self-service BI vs. data governance," https://tdwi.org/articles/2015/03/17/self-service-bi-vs-data-governance.aspx, Mar. 17, 2015.
[22] M. Ferguson, "Is self-service BI going to drive a truck through enterprise data governance?" http://intelligentbusiness.biz/wordpress/?p=489, accessed: 2015-10.
[23] J. Thomas and L. Sael, "Overview of integrative analysis methods for heterogeneous data," in The 2015 International Conference on Big Data and Smart Computing (BigComp 2015), no. 1, 2015, pp. 266–270. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7072811
[24] ——, "Maximizing information through multiple kernel-based heterogeneous data integration and applications to ovarian cancer."
[25] Advances in Knowledge Discovery and Data Mining. The MIT Press, 1996.
[26] R. Michalski, J. Carbonell, and T. Mitchell, Machine Learning: An Artificial Intelligence Approach. Springer Science & Business Media, 2013.
[27] R. A. Khan and S. K. Quadri, "Business intelligence: An integrated approach," Business Intelligence Journal, vol. 5, no. 1, pp. 64–70, 2012.
[28] S. Chaudhuri, U. Dayal, and V. Narasayya, "An overview of business intelligence technology," Communications of the ACM, vol. 54, no. 8, pp. 88–98, 2011.
[29] L. J. Cao and F. E. H. Tay, "Support vector machine with adaptive parameters in financial time series forecasting," IEEE Trans. on Neural Networks, vol. 14, no. 6, pp. 1506–1518, 2003.
[30] M.-Y. Chen and B.-T. Chen, "A hybrid fuzzy time series model based on granular computing for stock price forecasting," Information Sciences, vol. 294, pp. 227–241, 2015.
[31] Y.-K. Kwon and B.-R. Moon, "A hybrid neurogenetic approach for stock forecasting," IEEE Trans. on Neural Networks, vol. 18, no. 3, pp. 851–864, 2007.
[32] R. Khemchandani, Jayadeva, and S. Chandra, "Regularized least squares fuzzy support vector regression for financial time series forecasting," Expert Systems with Applications, vol. 36, pp. 132–138, 2009.
[33] T. Xiong, Y. Bao, and Z. Hu, "Multiple-output support vector regression with a firefly algorithm for interval-valued stock price index forecasting," Knowledge-Based Systems, vol. 55, pp. 87–100, 2014.
[34] P.-C. Chang, Y.-W. Wang, and C.-Y. Tsai, "Evolving neural network for printed circuit board sales forecasting," Expert Systems with Applications, vol. 29, no. 1, pp. 83–92, 2005.
[35] E. Hadavandi, H. Shavandi, and A. Ghanbari, "An improved sales forecasting approach by the integration of genetic fuzzy systems and data clustering: Case study of printed circuit board," Expert Systems with Applications, vol. 38, pp. 9392–9399, 2011.
[36] R. Kuo and K. Xue, "Fuzzy neural networks with application to sales forecasting," Fuzzy Sets and Systems, vol. 108, pp. 123–143, 1999.
[37] F.-K. Wang, K.-K. Chang, and C.-W. Tzeng, "Using adaptive network-based fuzzy inference system to forecast automobile sales," Expert Systems with Applications, vol. 38, pp. 10587–10593, 2011.
[38] C.-H. Chang, J.-W. Wang, and C.-H. Lia, "Forecasting the number of outpatient visits using a new fuzzy time series based on weighted-transitional matrix," Expert Systems with Applications, vol. 34, no. 4, pp. 2568–2575, 2008.
[39] E. Hadavandi, H. Shavandi, A. Ghanbari, and S. Abbasian-Naghneh, "Developing a hybrid artificial intelligence model for outpatient visits forecasting in hospitals," Applied Soft Computing, vol. 12, no. 2, pp. 700–711, 2012.
[40] W.-I. Lee, B.-Y. Shih, and Y.-S. Chung, "The exploration of consumers' behavior in choosing hospital by the application of neural network," Expert Systems with Applications, vol. 34, no. 2, pp. 806–816, 2008.
[41] L. Sael, I. Jeon, and U. Kang, "Scalable tensor mining," Big Data Research, Helsinki, Finland, 2016.
[43] K. Shin, L. Sael, and U. Kang, "Fully scalable methods for distributed tensor factorization," IEEE Transactions on Knowledge and Data Engineering.
[44] in Services Computing (SCC), 2012 IEEE Ninth International Conference. IEEE, 2012, pp. 49–56.
[45] S. Deng, T. Mitsubuchi, K. Shioda, T. Shimada, and A. Sakurai, "Multiple kernel learning on time series data and social networks for stock price prediction," in Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on. IEEE, 2011, pp. 228–234.
[46] S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J. A. Suykens, B. D. Moor, and Y. Moreau, "L2-norm multiple kernel learning and its application to biomedical data fusion," BMC Bioinformatics, vol. 11, no. 309, 2010.
[47] N. K. Speicher and N. Pfeifer, "Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery," Bioinformatics, vol. 31, no. 12, pp. i268–i275, 2015.
[48] R. S. Savage, Z. Ghahramani, J. E. Griffin, B. J. d. l. Cruz, and D. L. Wild, "Discovering transcriptional modules by Bayesian data integration," Bioinformatics, vol. 26, no. 12, pp. i158–i167, 2010.
[49] D. Tuia, G. Camps-Valls, G. Matasci, and M. Kanevski, "Learning relevant image features with multiple-kernel classification," IEEE Trans. on Geoscience and Remote Sensing, vol. 48, no. 10, pp. 3780–3791, 2010.
[50] Y. Gu, Q. Wang, X. Jia, and J. A. Benediktsson, "A novel MKL model of integrating LiDAR data and MSI for urban area classification," IEEE Trans. on Geoscience and Remote Sensing, vol. 53, no. 10, pp. 5312–5326, 2015.
[51] C. Tsien and J. Fackler, "Poor prognosis for existing monitors in the intensive care unit," Crit. Care Med., vol. 25, no. 4, pp. 614–619, 1997.
[52] S. Siebig, S. Kuhls, M. Imhoff, J. Langgartner, M. Reng, J. Schölmerich, U. Gather, and C. E. Wrede, "Collection of annotated data in a clinical validation study for alarm algorithms in intensive care: A methodologic framework."