Gaining insight from large data volumes with ease
Valentin Kuznetsov∗, Cornell University, Ithaca, NY, USA 14850
Abstract. Efficient handling of large data volumes becomes a necessity in today’s world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends, which can be transformed into economic incentives (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well established patterns in HEP communities. New data insight can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models on a terabyte scale. We provide concrete examples within the context of the CMS experiment, where Big Data tools are already playing or will play a significant role in daily operations.

With the CERN LHC program underway, we start seeing an exponential acceleration of data growth in the High-Energy Physics (HEP) field¹. In Run II, CERN experiments operated in the petabyte (PB) regime. For instance, in 2017 the CMS experiment alone produced around 30 billion raw events and complemented them with 16 billion Monte Carlo simulated events. It successfully transferred a few PB/week with average transfer rates of 2-6 GB/s; almost 20 PB of data were replicated at GRID T1 sites and about 80 PB at T2 sites. The disk utilization was at the level of 20% at T1 and T2 sites. With the upcoming High Luminosity LHC (HL-LHC) program at CERN, all HEP experiments will face a new challenge: the exabyte (10^18 bytes) era of computing [1]. We anticipate that new techniques and technologies will be required to handle this unprecedented amount of data. For example, the overall time of a typical physics analysis may be significantly reduced by moving away from sequential processing of events at GRID data-centers to data-reduction facilities based on Big Data technologies [2].
Similarly, the experiment’s meta-data and services are undergoing significant changes by embracing data processing on Hadoop+Spark platforms and by adding NoSQL databases to their traditional RDBMS counterparts in a growing list of data-services. In this paper, we discuss new technologies and techniques for handling experiment meta-data to gain additional insight from distributed data sources located at data-centers, on HDFS, and in relational and NoSQL databases. The new information obtained with parallel processing of large datasets helps us better understand our resources, more efficiently utilize computing infrastructure, and gradually move towards a data-driven approach in the HL-LHC era of computing.

∗ e-mail: [email protected]
¹ Here we refer to the data as raw and Monte Carlo (MC) data produced and processed by experiments, as well as the associated meta-data, and distinguish them explicitly in the paper.

When the concept of relational databases was introduced in 1970 [3] it solved many problems in the world of information technology. In the HEP community, RDBMS technologies were used in many data services, from online calibration to offline data bookkeeping systems, as well as in the surrounding infrastructure. Most HEP experiments successfully adopted RDBMS technologies (open-source and commercial products) for their needs. At CERN, almost all production systems rely on ORACLE databases. For instance, in CMS we use it for dozens of data-services, and the two largest databases, DBS and PhEDEx [4], are around a few hundred GBs each, excluding indexes. During Run II operations we started seeing limitations of RDBMS-based solutions in addressing the experiment’s needs. For instance, in CMS, users are required to place queries across multiple databases to find their desired information. To overcome these obstacles, the CMS experiment has developed the Data Aggregation System [5].
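To illustrate the design idea behind such an aggregation layer, the sketch below shows in plain Python how a single user query can be answered by fanning out to several independent data-services and merging the results through a cache. The service names, record fields, and cache policy are hypothetical and serve illustration only; the actual system talks to RDBMS-backed services and uses a MongoDB cache.

```python
# Illustrative sketch of an aggregation layer over independent
# data-services (hypothetical services and fields): one query fans out
# to every service and the merged record is kept in a cache.

# Mock data-services, each knowing one aspect of a dataset
def bookkeeping_service(dataset):
    return {"dataset": dataset, "nevents": 1_000_000}

def location_service(dataset):
    return {"dataset": dataset, "sites": ["T1_US_FNAL", "T2_CH_CERN"]}

CACHE = {}  # stands in for the MongoDB caching layer

def query(dataset):
    """Aggregate records from all services for a dataset, with caching."""
    if dataset in CACHE:
        return CACHE[dataset]
    record = {}
    for service in (bookkeeping_service, location_service):
        record.update(service(dataset))
    CACHE[dataset] = record
    return record

print(query("/PrimDS/Campaign/TIER"))
```

Subsequent identical queries are served from the cache, which is what makes such a layer cheap to place on top of many slow backends.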
It was designed as an additional layer above existing data-services (based on RDBMS backends) and used a NoSQL database (MongoDB [6]) as a caching layer. It aggregates information from different data-services and presents it to end-users via a flexible Query Language based on SQL syntax, without explicitly requiring joins among database tables. Even though it was successfully used by CMS in production for many years, the growth of information in the experiment requires new solutions to address the increasing demand for information, e.g. for monitoring purposes. By the end of Run II, users became more interested in a new type of aggregated information which requires data processing among distributed databases, joins of various attributes, and spans across large datasets. A typical example would be data popularity plots of user activities for all T1 and T2 sites over a large period of time, e.g. up to a year. To extract this information, it was required to join almost all tables among the three largest databases: the CMS Data Bookkeeping System (DBS), which tracks all experimental datasets; the PhEDEx database, which knows about data location; and the data popularity database, which keeps track of user jobs run at various data-centers. Such tasks cannot be accomplished via SQL queries since the information physically resides in different databases, and even though ORACLE tools provide the ability to perform cross-database joins, we found that they do not scale well in practice. Therefore, the process required manual extraction of relevant information from all databases, proper data pre-processing, and a complex workflow to obtain the desired results. Later, we realized that such a workflow can be easily implemented if all required data reside on HDFS, where we can apply the Spark framework [7] over distributed dataframes and take advantage of parallel processing on an HDFS cluster.
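The cross-database join described above can be sketched in plain Python over small in-memory samples. The dataset names, column names, and values below are invented for illustration; in production the same logic is expressed as Spark DataFrame joins over the full DBS, PhEDEx, and popularity snapshots on HDFS.

```python
# Sketch of the cross-database join described above (hypothetical
# records): DBS dataset names, PhEDEx replica locations, and popularity
# counts are joined to yield accesses per site.

# DBS: catalog of experimental datasets
dbs = [
    {"dataset_id": 1, "dataset": "/PrimDS1/Campaign1/MINIAOD"},
    {"dataset_id": 2, "dataset": "/PrimDS2/Campaign1/AOD"},
]

# PhEDEx: dataset replicas at GRID sites
phedex = [
    {"dataset_id": 1, "site": "T1_US_FNAL"},
    {"dataset_id": 1, "site": "T2_CH_CERN"},
    {"dataset_id": 2, "site": "T2_DE_DESY"},
]

# Popularity: user job accesses per dataset
popularity = [
    {"dataset_id": 1, "naccesses": 120},
    {"dataset_id": 2, "naccesses": 7},
]

def popularity_by_site(dbs, phedex, popularity):
    """Join the three sources and sum dataset accesses per site."""
    names = {r["dataset_id"]: r["dataset"] for r in dbs}
    acc = {r["dataset_id"]: r["naccesses"] for r in popularity}
    result = {}
    for rep in phedex:
        did = rep["dataset_id"]
        if did in names:
            result[rep["site"]] = result.get(rep["site"], 0) + acc.get(did, 0)
    return result

print(popularity_by_site(dbs, phedex, popularity))
# → {'T1_US_FNAL': 120, 'T2_CH_CERN': 120, 'T2_DE_DESY': 7}
```

What takes hash lookups here takes distributed shuffles at scale, which is exactly the work Spark parallelizes across the HDFS cluster.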
Data placement of various meta-data sources started in early 2015 and includes several CMS databases, HTCondor logs, CMSSW file access logs, file transfer records, as well as Workflow Management logs. At the moment, the CMS experiment has migrated dozens of data sources to HDFS (Table 1) and accumulated more than 32 TB of data stored in various data-formats.

  Data source                            Format   Size
  HTCondor logs [8]                      JSON     11.1 TB
  AAA (Global Data Access) logs [9]      JSON     11 TB
  EOS logs [10]                          JSON     5.3 TB
  FTS (File Transfer System) logs [11]   JSON     4.2 TB
  PhEDEx snapshots [4]                   CSV      3.3 TB
  WMArchive logs [12]                    Avro     1.3 TB
  CMSSW (CMS SoftWare framework) logs    Avro     0.5 TB
  DBS tables [4]                         CSV      0.3 TB
  JobMonitoring logs                     Avro     0.2 TB

Table 1. Current snapshot of CMS meta-data stored on HDFS.

The data on HDFS are stored in various data-formats which are suited for different purposes: log files are usually streamed to HDFS in their native data-format (JSON); database tables are easily converted into the CSV data-format; while large unstructured data sets, e.g. in the case of the WMArchive system [12], are converted into the compact, fast, binary Avro data-format with a pre-defined schema. Fortunately, the HDFS libraries support a broad variety of data-formats, and the Spark framework works seamlessly and efficiently with all of them.

The availability of large datasets and their efficient processing on Hadoop clusters open up new possibilities to push the boundaries of analytics tasks beyond traditional approaches based on relational databases. The run time to process a TB of data using the Spark framework on HDFS is of the order of a couple of minutes and is not restricted to the content of a single database; multiple sources can be combined and efficiently processed.

New approaches to handle large datasets in the HEP community are emerging from the business world. First, NoSQL solutions are adopted to allow storage of unstructured documents, support the distributed nature of actors, and provide information replication. For example, in CMS we successfully adopted the MongoDB [6] and CouchDB [13] technologies for different use cases. The former is used as caching and persistent layers in the Data Aggregation System [5], while the latter is successfully used in the Data Management and Workflow Management system [14] to continuously replicate workflows across the distributed agents handling CMS Monte Carlo production.

As mentioned in the previous section, the Hadoop eco-system starts playing a significant role in almost every HEP experiment. Moreover, the CERN central monitoring system (MONIT) [15] heavily relies on it and incorporates various technologies and tools in its stack, e.g. Kafka, ElasticSearch, InfluxDB, Kibana, Grafana, etc. But all of these innovations come with their own price. Users are required to learn a broad variety of new tools, data-formats, etc., and understand how to run their workflows in such an environment. And experiments need to adapt their tools and data management systems to the new technologies, too. For example, in CMS a typical Spark workflow is a quite complex task, see Fig. 1.
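At the record level, the normalization behind reading such heterogeneous formats (JSON logs, CSV table dumps) can be sketched in plain Python. The field names and sample records below are invented; at scale, Spark’s built-in JSON and CSV readers and the Avro data source perform the equivalent work in parallel across the cluster.

```python
import csv
import io
import json

# Hypothetical one-line samples of the formats listed in Table 1.
json_line = '{"site": "T2_CH_CERN", "bytes": 1024}'  # streamed log record
csv_line = "T2_DE_DESY,2048"                          # database table dump

def from_json(line):
    """A JSON log line is already a structured record."""
    return json.loads(line)

def from_csv(line, header=("site", "bytes")):
    """A CSV row needs a schema (header) and type coercion."""
    row = next(csv.reader(io.StringIO(line)))
    rec = dict(zip(header, row))
    rec["bytes"] = int(rec["bytes"])
    return rec

# Once normalized, records from different sources combine freely.
records = [from_json(json_line), from_csv(csv_line)]
print(sum(r["bytes"] for r in records))  # → 3072
```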
We rely on the Python Spark APIs to read and pre-process data from multiple data-providers on HDFS. The aggregated information is often placed back into the Asynchronous Message Queuing system and ends up either in the ElasticSearch engine or in the CERN MONIT systems. Obviously, such a workflow is difficult to construct properly and even harder to maintain in the long run. Often, the majority of tasks were repeated among different users, and a new level of abstraction was desired.

We simplified user access to HDFS, Spark, and the CERN Hadoop eco-system via the CMSSpark framework [16], which takes care of setting up the cluster environment, provides a layer of abstraction for data access on HDFS, performs data-format transformations into Spark DataFrames, and handles job submission and data placement back to HDFS and/or the CERN MONIT systems.

Figure 1. A typical CMSSpark workflow to process data on an HDFS cluster via the Spark framework: a Python template (code.py) runs via spark-submit (set environment; spark-submit --jars X.jar --master XXX code.py ${1+"$@"}) in a SparkContext, reads data from HDFS, and delivers results to AMQ and ElasticSearch.

At the end, users are required only to write the analysis code to process the desired DataFrames. Submission of user tasks to the CERN HDFS cluster is simplified to a single command that wraps the spark-submit template shown in Fig. 1.
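A minimal sketch of what such a wrapper can look like is shown below. The jar path, master setting, and option names are placeholders modeled on the spark-submit template in Fig. 1, not the actual CMSSpark implementation.

```shell
#!/bin/sh
# Sketch of a CMSSpark-style submission wrapper: it assembles the
# spark-submit invocation so that users only supply their analysis
# script. Paths and options below are placeholders.

build_submit_cmd() {
    script="$1"
    shift
    # cluster environment setup (JAVA_HOME, HADOOP_CONF_DIR, ...) would
    # happen here before the command is executed
    echo "spark-submit --master yarn --jars /path/to/extra.jar $script $*"
}

# Users call the wrapper with their script and its own arguments:
build_submit_cmd code.py --date 20181101
```

The point of the abstraction is that users never touch the spark-submit options themselves; the wrapper keeps them uniform across all workflows.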
Such simplicity has boosted the adoption of Hadoop tools within the CMS collaboration, and the framework was quickly adapted to a variety of use cases, see the discussion in Sect. 3.1.

Although the discussion above applied to meta-data, the Big Data tools can also help the HEP community apply new techniques to process and analyze real data. For instance, the authors of [2] proposed to use HDFS/Hadoop as a data-reduction facility in HEP analysis, with the ambitious goal of reducing 1 PB of raw input data to 1 TB of output data in a few hours. This approach can be further extended not only to HEP analysis per se, but also to training Machine Learning (ML) models on petabyte datasets.

ML models have been used in the HEP community for years, e.g. Boosted Decision Trees or simple Neural Networks, which are often used in various physics analyses. Recent advances in technology, on both the hardware and software fronts, allow ML models to be adopted universally, from computing infrastructure to trigger systems [17]. But the key problem with ML training is the data preparation step, which involves data transformation and pre-processing. The most common data-format for ML training is CSV (or alike), while HEP data are stored in the ROOT data-format. Recent developments in ROOT I/O [18] and ROOT data access [19] open up a possibility to directly read and process ROOT data on an HDFS cluster. With this change we are already able to organize new types of workflows: reading petabytes of data, performing the necessary data transformation and pre-processing on HDFS, training ML models, and delivering them to end-users as a data-service [20].

Fig. 2 demonstrates such an R&D pipeline in the context of the TFaaS project for the CMS experiment. Preliminary studies show that we can read HEP events at a rate of 100 kHz (50 MB/s) via the uproot [21] library, pre-process a TB of data in a range of minutes to a few hours on the HDFS cluster [22] (the processing time strongly depends on the specific use-case and the complexity of the executed workflow), train a model, and serve it to end-users via the
Figure 2.
A TensorFlow as a Service [20] architecture for a HEP use case. The input raw data in the ROOT data-format can be read from remote data-providers, including XRootD servers, local filesystems, and HDFS, processed and transformed on the HDFS/Spark platform, and fed into an ML framework for training (e.g. via SparkML). The trained model can be served to end-users or to the entire framework (CMSSW) as an HTTP-based data-service.
TensorFlow as a Service [20] tool with high throughput in a distributed environment. Our benchmarks have shown that we can easily achieve a throughput of 500 req/s for concurrent clients using a single node serving the TFaaS data-service.

These new approaches are developing rapidly and are partially adopted in the CMS experiment. Below we briefly discuss a few of them in the context of the CMS monitoring infrastructure.

The CMS monitoring infrastructure is gradually migrating from experiment-specific tools to the central CERN Monitoring system [15]. The CERN MONIT system consists of more than a hundred data producers with a 3.5 TB/day injection rate and dozens of Spark jobs running 24/7. It is highly integrated with the CERN Analytix cluster, composed of 39 nodes with 64 GB of RAM and 32 cores per node (a mix of Intel® Xeon® CPU E5-2650 @ 2 GHz and AMD Opteron™ 6276 processors), capable of storing and handling PBs of data.

So far we have migrated several experiment dashboards to the CERN MONIT infrastructure, among them the AAA, EOS, and HTCondor dashboards, task monitoring of user analysis jobs, as well as WMArchive sub-systems. We found that the Spark platform significantly improved our analytics capabilities. For instance, in the WMArchive [12] system we can promptly perform the following tasks:

• identify failed workflows and problematic sites
• spot production issues via log look-ups and exit codes
• monitor CMS production status, including site and campaign monitoring and the extraction of throughput metrics
• perform data aggregation and produce aggregated statistics.

The system was designed to collect 100M+ documents per year from distributed WMAgents with an upload rate of O(1M) documents per day. The documents are injected into a local MongoDB cache and transferred to HDFS for long-term storage. We periodically run daily and hourly aggregation jobs to gain insight into Monte Carlo production workflows. This information is fed into the CERN MONIT system, where it is displayed in various dashboards.

Using CMS data on HDFS, as outlined in Table 1, we analyzed the most popular data tier among end-users in 2017. To our satisfaction it was MINIAODSIM, accessed by 56%, 42%, 40%, and 40% in the AAA, EOS, CMSSW, and CRAB systems, respectively. It was followed by the MINIAOD, AOD, and RAW data-tiers. The usage of the RECO data-tier was quite negligible, at the level of a few percent in the corresponding systems.

We also looked at the data popularity content of our data and successfully used the Hadoop Spark platform as a data reduction and processing facility. We performed studies to predict dataset popularity using user AAA logs [22].
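The WMArchive aggregation tasks listed earlier (failed workflows, problematic sites, exit-code look-ups) boil down to group-and-count logic over job documents. The sketch below illustrates it in plain Python; the field names, exit codes, and sites are invented, while the production jobs run the same logic in Spark over Avro documents on HDFS.

```python
from collections import Counter

# Hypothetical WMArchive-like job documents; field names and values are
# invented to illustrate the aggregation tasks listed above.
docs = [
    {"workflow": "wf1", "site": "T2_CH_CERN", "exitCode": 0},
    {"workflow": "wf2", "site": "T2_DE_DESY", "exitCode": 8021},
    {"workflow": "wf3", "site": "T2_DE_DESY", "exitCode": 8021},
    {"workflow": "wf4", "site": "T1_US_FNAL", "exitCode": 0},
]

def failures_per_site(docs):
    """Count failed jobs (non-zero exit code) per site."""
    return Counter(d["site"] for d in docs if d["exitCode"] != 0)

print(failures_per_site(docs))  # → Counter({'T2_DE_DESY': 2})
```

A site that dominates such a counter is a natural candidate for the "problematic sites" dashboards fed into the CERN MONIT system.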
We demonstrated that it can be modeled via ML and used as a seed by the CMS dynamic data placement system. It is worth mentioning that the original dataset had 2B rows of AAA logs combined with PhEDEx database tables. This dataset was reduced to 0.5M records in a couple of hours and fed into the SparkML framework for training the ML model. We foresee that such a workflow pipeline can be successfully adopted in the HL-LHC era, where intelligent data placement may play a critical role.

Finally, we used JobMonitoring and WMArchive data to measure site performance. These studies targeted concrete architectures at T2 sites, where node throughput was calculated as the number of processed events per second for various processor architectures, taking into account the number of job slots per core. The results complemented the well-established HS06 scores and were included in the HEPiX Benchmarking Working Group [23].

The CMS experiment is continuously improving its computing and offline infrastructure. In particular, it is shifting its monitoring infrastructure to the central CERN monitoring system and has a large number of ML projects. In this paper we discussed a gradual shift in handling large datasets in the CMS experiment. A large portion of the CMS meta-data has been successfully migrated to HDFS and complements our existing database solutions. The usage of Hadoop tools and the surrounding eco-system is no longer a problem due to the simplicity of the CMSSpark framework. It simplified access to a broad variety of CMS data located on HDFS by abstracting the data access layer, and provided a simple and uniform way to submit, process, and analyze these data.

We also described a new use case for Machine Learning training over large distributed datasets, and we are continuously delivering ML models to end-users via the new TensorFlow as a Service data platform, recently developed and currently undergoing testing in the CMS experiment.
Such a service may not only serve the CMS experiment per se but can be applicable to other experiments. It may fill the gap of integrating ML tools into existing infrastructures and experiment frameworks. Finally, we foresee that this approach will gain popularity in the upcoming years of data taking in the HL-LHC regime, where training ML models over petabyte datasets will become the norm.

I would like to thank my CMS colleagues David Lange (Princeton) and Carl Vuosalo (Univ. of Wisconsin) for their support and various contributions to the CMS monitoring infrastructure, including the production and validation of T1/T2 usage plots. I would also like to thank Luca Menichetti from CERN IT, who provided support for the development, maintenance, and deployment of our scripts on the Spark platform. Special thanks go to the CERN MONIT team for their collaboration and support in setting up and maintaining the CMS Monitoring dashboards.
References

[1] A. A. Alves Jr. et al., A Roadmap for HEP Software and Computing R&D for the 2020s, https://arxiv.org/abs/
[2] Big Data in HEP: A comprehensive use case study, doi:10.1088/, https://arxiv.org/abs/
[3] A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13(6): 377-387, doi:10.1145/
[4] M. Giffels, Y. Guo, V. Kuznetsov, N. Magini and T. Wildish, The CMS Data Management System, J. Phys.: Conf. Ser. 513 (2014) 042052, doi:10.1088/
[5] The CMS Data Aggregation System, doi:10.1016/j.procs.2010.04.172
[6] MongoDB document-oriented database, https://docs.mongodb.org/
[7] Apache Spark, https://spark.apache.org/
[8] D. Thain, T. Tannenbaum and M. Livny (2005), Distributed computing in practice: the Condor experience, Concurrency and Computation: Practice and Experience 17(2-4): 323-356, doi:10.1002/cpe.938
[9] K. Bloom et al., Any Data, Any Time, Anywhere: Global Data Access for Science, BDC 2015: 85-91, https://arxiv.org/pdf/
[10] Disk storage at CERN: Handling LHC data and beyond, J. Phys.: Conf. Ser. 513 (2014) 042017, doi:10.1088/
[11] FTS3: New Data Movement Service For WLCG, J. Phys.: Conf. Ser. 513 (2014) 032081
[12] V. Kuznetsov, N. Fischer, Y. Guo, The archive solution for distributed workflow management agents of the CMS experiment at LHC, Computing and Software for Big Science 2018, 2:1, doi:10.1007/s41781-018-0005-0
[13] Apache CouchDB data-management system, http://couchdb.apache.org/
[14] D. Evans et al., The CMS workload management system, Journal of Physics: Conference Series, Volume 396, Part 3
[15] A. Aimar et al., Unified Monitoring Architecture for IT and Grid Services, Journal of Physics: Conference Series 2017, 898 092033, doi:10.1088/
[16] CMSSpark: a general purpose framework to run CMS experiment workflows on the HDFS/Spark platform, doi:10.5281/zenodo.1401228, https://zenodo.org/badge/latestdoi/
[17] Machine Learning in High Energy Physics Community White Paper, https://arxiv.org/abs/
[18] Optimizing ROOT IO For Analysis, https://arxiv.org/abs/
[19] Fast Access to Columnar, Hierarchical Data via Code Transformation, CoRR abs/
[20] TensorFlow as a Service, doi:10.5281/zenodo.1308048
[21] DIANA-HEP scikit-hep uproot library, Minimalist ROOT I/O in pure Python and Numpy, https://github.com/scikit-hep/uproot
[22] M. Meoni, V. Kuznetsov, L. Menichetti, J. Rumševičius, T. Boccali, D. Bonacorsi, Exploiting Apache Spark platform for CMS computing analytics, ACAT 2017, http://arxiv.org/abs/
[23] HEPiX Benchmarking Working Group, https://w3.hepix.org/