A proposed solution for analysis management in high energy physics
Mingrui Zhao
a Department of Nuclear Physics, China Institute of Atomic Energy, Beijing 102413, China
b Center of High Energy Physics, Tsinghua University, Beijing 100084, China
Abstract
This paper presents an architecture for analysis management in high energy physics experiments. Some new concepts on data analysis are introduced, and a protocol for organizing and operating an analysis is proposed. A toolkit following this architecture has been developed; it provides a solution for analysis management with both flexibility and reproducibility. Foreseen developments of this toolkit are discussed.

The author was a student at Tsinghua University when this work started. E-mail address: [email protected]

Introduction
A physics analysis should be well managed. One of the most important aspects of analysis management is analysis preservation, which is essential in scientific research. It is the responsibility of a researcher to clarify the details needed to produce the result from the raw data. These details include the code, the data, the analysis steps and any other information needed to reproduce the results. An analysis result that cannot be reproduced is meaningless and should be regarded as being produced through "magic". With a reproducible analysis, knowledge can be transferred between analyzers: a young analyzer can learn the complete analysis steps from the analysis code itself, and if an analyzer leaves a group, his or her colleagues can pick up the work without barriers.

Another critical point in analysis management is convenience. Convenience and reproducibility have to be balanced: a tool with high reproducibility but low convenience will force users to bypass it, and the reproducibility will then be lost.

People often overlook the significance of analysis management, arguing that management is just bookkeeping work. However, managing an analysis is not an easy job. Recording the analysis details to an extent that allows others to repeat the analysis requires quite a lot of manpower. The present situation is that most analyses cannot be reproduced.

Why is analysis management difficult? One of the major sources of difficulty is the preservation of data. Data are generally sizable, and preserving them by using twice as many disks seems unwise. The code and the data are usually only loosely bound: the data are not regenerated immediately after the code is written or modified, leading to mismatches between analysis code and results, and if some data were generated by some code, it is usually impossible to recover the code from the data. Besides, the following problems often occur in high energy physics analyses and make it harder to reproduce an analysis conveniently, as pointed out in [6] and [4]:

• Data and code are frequently modified or moved. For example, the shape of a fit function may be adjusted, or the upstream data may be modified so that the whole analysis has to be rerun.

• Analyses in high energy physics are usually long-term and highly complicated. Human beings are not reliable, and analyzers often have terrible habits when organizing data and code. The exact analysis steps might be forgotten by the analyzer after a long time.

The importance of analysis management has been recognized within the high energy physics community. A conference proceeding on data preservation [5] was published in 2012. The CERN Analysis Preservation portal (CAP) [3] has been set up to promote the systematic preservation of analyses for the LHC experiments. CAP provides a centralized platform on which scientists can document their analysis from the start of a new project. CAP uses REANA [4] as a backend, a platform for reusable research data analyses, developed with the joint effort of CERN IT, SIS, DASPOS and DIANA-HEP.

In this paper, a proposal is raised to resolve the pain of analysis management. The proposal includes an analysis architecture and a toolkit. The recent development of container techniques both conceptually inspires the design and technically provides a tool to realize it, as will be discussed in detail in the following sections. The new management system changes the current behavior of analyzers as little as possible.
The management system allows the users to use whatever analysis tool they want. A possible connection of this proposal with CAP and REANA is also discussed.

1 Conceptual design
An analysis always proceeds step by step. Each step requires some inputs and generates some outputs. If the inputs, outputs and steps are regarded as vertices, and directed edges are drawn from the inputs to the steps as well as from the steps to the outputs, the structure of an analysis naturally forms a directed acyclic graph, which can be called a workflow. Fig. 1 shows examples of workflows.
Figure 1: Examples of analysis workflows. The upper one represents a simple analysis and the lower a slightly more complicated one.

A workflow exists in every analysis, and the workflow of an analysis contains fruitful information. It tells the procedure of the analysis, and it is an invariant when the physical locations or names of the data or code change.

In the traditional way of managing an analysis, the workflow is usually defined only implicitly, through the data locations written in the code, and it is usually ignored and seldom used. Workflows are also used in other domains such as bioinformatics, medical imaging, astronomy, and chemistry, and workflow languages such as CWL [1] have been developed. Describing the workflow explicitly has the advantage that the analysis procedure is transparent and directly readable to both the machine and the analyzer, clarifying the exact steps of the analysis: the machine can run the analysis according to it, and the analyzer can obtain knowledge of the analysis from it.

Additionally, there are two observations:

• A step in a workflow does not need to know the whole workflow. It only needs to know its predecessors to execute this specific step.

• The step can be abstract, i.e. only the topological structure of the analysis matters.

Although workflows are widely used in management tools, it is seldom realized that the key function of the workflow in analysis preservation is to provide a tight binding between code and data.

The typical workflow concept differs slightly from the one in this paper. Workflow languages describe the exact steps to run the analysis. In this architecture, I choose to describe the workflow in a half-explicit way: the workflow is neither defined implicitly as described above, nor written out explicitly in a single sheet using some workflow language. It is defined through the relationships between steps. The exact name or directory of an analysis step is unimportant; when they change, the analysis stays invariant. The different methods of describing the workflow are obviously equivalent. However, the half-explicit way makes the problem clear and is more suitable for my design, as will be shown in the following sections.

Some of the analysis steps can be quite similar, and it is better to reuse them. I would like to separate the common part and the specialized part of an analysis step into an "algorithm" and a "task". Loosely speaking, an "algorithm" contains the environment definition and the code template; it is used to run a "task" and can be shared by several "tasks", while a "task" contains the parameters. The rigorous definitions of "algorithm" and "task" are given in the next section, and they will be written in italics to clear up possible ambiguities. The workflow of a possible analysis after the separation of "algorithm" and "task" is shown in Fig. 2.

Figure 2: Example of a workflow with separated algorithms and tasks.

It is also possible to refine the analysis architecture further. A task is always followed by some data, so it is convenient to merge the "data" and the "task" into one object. This merge makes the binding of code and data even tighter. The final architecture is illustrated in Fig. 3.

Figure 3: Example of a workflow with the combination of task and data.
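To make the half-explicit description more concrete, the sketch below models an analysis as tasks that record only their own algorithm and predecessors, so the full directed acyclic graph is never written down in one place but can always be recovered by walking the links. This is only an illustrative sketch; the class and attribute names (Algorithm, Task, predecessors) are placeholders of mine, not part of the Chern toolkit.

```python
# Minimal sketch of the half-explicit workflow description:
# every task records only its own algorithm and predecessors,
# and the DAG is recovered by walking these links.
# Names (Algorithm, Task, predecessors) are illustrative only.

class Algorithm:
    """Common part of a step: environment definition and code template."""
    def __init__(self, name, environment, code_template):
        self.name = name
        self.environment = environment
        self.code_template = code_template

class Task:
    """Specialized part of a step: parameters plus links to its inputs.
    After the merge described above, the task also owns its output data."""
    def __init__(self, name, algorithm=None, parameters=None, predecessors=()):
        self.name = name
        self.algorithm = algorithm
        self.parameters = parameters or {}
        self.predecessors = list(predecessors)  # upstream tasks only

    def ancestors(self):
        """Walk the links backwards to recover the part of the DAG
        this task needs; no global workflow file is required."""
        seen = []
        stack = list(self.predecessors)
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.append(t)
                stack.extend(t.predecessors)
        return seen

# Example corresponding to the simple workflow of Fig. 1 / Fig. 3
selection = Algorithm("selection", "root:6.24", "select.C")
fitting = Algorithm("fitting", "root:6.24", "fit.C")
raw_data = Task("raw_data")
task1 = Task("task1", selection, {"cut": "pt > 1.0"}, [raw_data])
task2 = Task("task2", fitting, {"model": "gauss"}, [task1])
print([t.name for t in task2.ancestors()])  # ['task1', 'raw_data']
```

Note that renaming or moving a task does not appear anywhere in this description; only the links between steps matter, which is exactly the invariance discussed above.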
The code and the data in the above design are still loosely bound, since a modification of the code does not immediately affect the data. There is a conflict between the requirement of modifying the code and the preservation of the corresponding data. In other words, a difficulty in analysis management is balancing flexibility and reproducibility. To solve this difficulty, the concept of "impression" is introduced.

Noticing the conflicting requirements discussed above, the analysis can be split into an abstract layer and a concrete layer according to whether flexibility or reproducibility is required. To be specific, the abstract layer of an analysis is what the author writes or operates on. The concrete layer is a copy of the abstract layer from which the machine-useless information is stripped; such information is only for the convenience of the analyzer, for example the path where a task is put. The concrete layer can really be run, and that is why it is called "concrete". The corresponding entities of "algorithm" and "task" in the concrete layer are "image" and "container", whose concepts are borrowed from Docker [2]. They can be executed to generate data. After the separation of the abstract layer and the concrete layer, the topological structure of an analysis may look like Fig. 4.

Since the abstract layer of the analysis contains only code and some metadata, it has quite a small size. It can easily be shared between analyzers for reuse and review, and it can also be uploaded to a central portal for centralized management.

Figure 4: Example of the abstract and concrete layers.
The reader may be confused about how such a splitting can balance flexibility with reproducibility. In fact, one should go one step further and promote "image" and "container" to "impression". An "impression" is a version of an "image" or a "container". It is able to generate data uniquely, and it is tightly bound to the "task" and "algorithm" in the abstract layer. Therefore, a tight binding of code and data, through an intermediate "impression", is established. The abstract layer of the analysis is allowed to be modified arbitrarily. Once it is modified, the link between the abstract layer and the original "impression" breaks immediately, i.e. the data are sensitive to the code. Therefore, the only objects that need to be preserved are the code and the corresponding "impressions", which all have small size.

For example, as shown in Fig. 5, assume the original version of the analysis has four impressions, "Image 1.1", "Image 2.1", "Container 1.1", and "Container 2.1". After the modification of "Algorithm 2", a new impression "Image 2.2" is created. At the same time, "Task 2" also generates a new impression, "Container 2.2". Although the contents of "Task 2" do not change, the upstream dependence of "Task 2" has changed. The workflow of the concrete part of the analysis keeps changing while the analysis is ongoing; that is why a half-explicit way of constructing the workflow is more suitable for this case.

The abstract layer contains the full history of impressions, so the analysis can be reverted to a specific version if necessary.

An "impression" is designed to determine the result. Conceptually, there should be a standalone runner to run the impression. The runner can be on a local machine or on a remote machine. The analyzer sends an "impression" to the runner, the runner executes the "impression" and generates the result, and the analyzer then gets the results from the runner. The "containers" with tiny resource consumption can be run on a personal computer. For the "containers" with huge resource consumption, the analyzer can choose to transfer them to a computing grid or to powerful computing machines. The computed containers can be transferred back to the personal computer to view the results or to satisfy the dependences for executing the next container on the personal computer.
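One natural way to obtain the tight code-data binding described above is to derive the impression identifier from the contents of the task or algorithm together with the impressions it depends on, so that any modification, in the object itself or upstream, invalidates the link automatically. The sketch below illustrates this idea; the hashing scheme and function names are my own assumptions, not the exact scheme used by the Chern toolkit.

```python
# Sketch: derive an impression id from the object's contents plus the
# impression ids of its dependencies. Any change in the code or in an
# upstream impression yields a new id, so stale data can be detected.
# The exact scheme used by Chern may differ; this is illustrative only.
import hashlib
import os

def impression_id(directory, dependency_ids):
    """Hash every file under `directory` together with the ids of the
    impressions this object depends on."""
    h = hashlib.sha256()
    for root, _, files in sorted(os.walk(directory)):
        for name in sorted(files):
            path = os.path.join(root, name)
            h.update(os.path.relpath(path, directory).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    for dep in sorted(dependency_ids):
        h.update(dep.encode())
    return h.hexdigest()
```

Under such a scheme, even if the contents of "Task 2" stay the same, the new id of "Image 2.2" enters the hash of its impression, which reproduces the "Container 2.2" behaviour of the example above.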
Each "task", "algorithm" or "impression" can have a status attached. The status is determined so that the runner can manage the job sequence and the analyzer can monitor the progress of the analysis.
Figure 5: Example of the abstract and concrete layers. The containers and images are all impressions.
Since an "impression" is immutable, it is not difficult to determine its status. If an "impression" has been executed, its status is "built" (for an "image") or "done" (for a "container"). The ongoing status is "building" or "running", respectively. If the execution happens to fail, a "failed" status is assigned.

The status of an algorithm or task can be "impressed" or "new". The status is "impressed" if the contents of the algorithm or task are the same as those of the corresponding impression, and the dependences of the two also match; otherwise, the status is "new".
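The "impressed" versus "new" decision can be expressed compactly: an object is "impressed" only if its current contents match its latest impression and every dependency is itself impressed with a matching impression id. The function below sketches this check under the same hypothetical naming as the previous examples; it is not the toolkit's actual code.

```python
# Sketch of the status rules described above (names are illustrative).
# Impression statuses: "building"/"built" for images, "running"/"done"
# for containers, "failed" on error. A task or algorithm is "impressed"
# only if its contents and its dependences both match its latest
# impression; otherwise it is "new".

def object_status(obj):
    """Return "impressed" or "new" for a task or algorithm.

    `obj` is assumed to expose:
      obj.latest_impression  -- the last recorded impression or None
      obj.content_hash()     -- hash of the current directory contents
      obj.dependencies       -- predecessor tasks/algorithms
    and an impression exposes `id`, `content_hash` and `dependency_ids`.
    """
    imp = obj.latest_impression
    if imp is None or obj.content_hash() != imp.content_hash:
        return "new"
    # Dependences must also match: every predecessor must itself be
    # impressed, and its impression must be the one recorded here.
    dep_ids = {d.latest_impression.id
               for d in obj.dependencies
               if d.latest_impression is not None
               and object_status(d) == "impressed"}
    if len(dep_ids) != len(obj.dependencies) or dep_ids != set(imp.dependency_ids):
        return "new"
    return "impressed"
```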
Having the "impression" in hand, it is natural to link all the analyses in a collaboration together. An analyzer then does not have to ask his or her colleagues for data, and a change of an upstream analysis is also detected by the analyzer.

Protocol

This protocol is designed to provide a standard way to organize an analysis in high energy physics. If the analysis is organized in a standard way, toolkit developers can develop software that helps the analyzer to manage it. The protocol aims to validate the conceptual design of the analysis management system, and the toolkit developer benefits from a standard way of organization when developing tools that make the operations easier.

Figure 6: A possible analysis organization.
Chern repository

All the analysis code and metadata should be put in one directory, called a Chern repository. The Chern repository contains many subdirectories, some of which contain a hidden folder called ".chern". A directory with the hidden ".chern" folder is called an object. The root directory of the repository is also an object. It should be stressed that an object is a directory in the file system.

There are four types of object: algorithm, task, directory and project. Each object contains a file "README.md" used for documentation purposes, and a file named ".chern/config.json" serving as the configuration file of the object. The type of the object (algorithm, task, directory or project) is written in this configuration file. In addition to the files common to all kinds of objects, the different types of object have different contents:

• A directory is just a folder that contains other sub-objects. The objects in a directory can be of any type except directory.

• A project is exactly the same as a directory, except that it is the root of the whole analysis repository.

• An algorithm contains the code templates and the environment specification files (a Dockerfile for the time being), provided by the collaboration. There is a folder called ".chern/impressions" containing all the impressions corresponding to the algorithm.

• A task contains a file "parameter.json" with the parameters of the task. There is also a folder called ".chern/impressions", as for an algorithm.

The connections between the tasks and algorithms, i.e. the workflow, are specified in the configuration files of the tasks and algorithms. In the configuration file of a task, the algorithm and the tasks that this task depends on are recorded, in the form of paths relative to the project. Each depending task also has an "alias" recorded in this configuration file. In the configuration file of a task or algorithm, the tasks that depend on it are recorded as well.

Other files or directories that are not organized according to the statements above do not belong to the repository and are ignored.

Impression

An impression is a version of a task or algorithm. It has a unique id and is physically stored in ".chern/impressions/(ID)" under the task or algorithm, where (ID) refers to the id of the impression. Under ".chern/impressions/(ID)", there is a file named "config.json", called the configuration file of the impression, and a folder named "contents" storing the contents of the corresponding task or algorithm. In the configuration file, the file tree of the contents is written, together with the historical impression id and the impressions that this impression depends on. The information stored in an impression is enough to run the impression anywhere.
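As an illustration of the layout above, the snippet below creates a minimal repository skeleton with one algorithm and one task and writes their configuration files. The JSON field names used here ("object_type", "dependencies", "aliases") are placeholders of my own; the real Chern configuration schema may use different keys.

```python
# Sketch: create a minimal Chern-style repository layout on disk.
# The JSON field names ("object_type", "dependencies", "aliases")
# are placeholders, not the toolkit's actual schema.
import json
import os

def create_object(path, object_type, extra=None):
    """Create an object directory with .chern/config.json and README.md."""
    os.makedirs(os.path.join(path, ".chern"), exist_ok=True)
    config = {"object_type": object_type}
    config.update(extra or {})
    with open(os.path.join(path, ".chern", "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    open(os.path.join(path, "README.md"), "a").close()

create_object("analysis", "project")
create_object("analysis/selection", "algorithm")
create_object(
    "analysis/fit_mass", "task",
    {"algorithm": "selection",   # path relative to the project
     "dependencies": [],         # predecessor tasks
     "aliases": {}},             # alias -> predecessor path
)
# Parameters of the task live in parameter.json inside the task directory.
with open("analysis/fit_mass/parameter.json", "w") as f:
    json.dump({"mass_window": [5.2, 5.4]}, f, indent=2)
```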
I/O map

The parameters defined in a task should be transferable to the corresponding algorithm. Let us take a ROOT or C++ application as an example. When the analyzer writes a template code in ROOT for an algorithm, he or she can include a header file called "arguments". After including the header, "parameters[(PARAMETER NAME)]" can be used as a std::string value in the template code, where (PARAMETER NAME) represents a parameter defined in the task.

Dealing with the inputs and outputs is similar. As discussed above, each predecessor of the task has an alias. After including the header, the analyzer can use "folder[(ALIAS)]" as a std::string value; it gives the storage location of the data of the predecessor task. The output directory for the results is simply "folder["output"]". The runner is responsible for generating the header file "arguments".

Other programming languages can be supported in a similar way. However, this does not mean that a standard transfer header has to be created for every programming language we can imagine. Creating a "header file" for a scripting language like bash, tcsh or python, and using it to start an unsupported language, is feasible.

External data task

It is a common case that the starting point of a data analysis is external data. However, as "data" can only exist inside a "task", there is no place to put the external data. This can be solved in the following way. The analyzer creates a file named "data.json" under a task and pretends that the task can be executed to generate the data. The md5 checksums of the data expected to be generated are recorded in "data.json". The "data.json" cannot actually be run; the analyzer has to feed the external data to the runner manually, pretending that the data were generated by the task.
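Returning to the I/O map above: one way the runner could produce the "arguments" header is sketched below, writing a small C++ header that fills the parameters and folder maps for one container. This generated header is my own guess at a possible implementation, not the header Chern actually produces.

```python
# Sketch: one way a runner could generate the "arguments" header that a
# ROOT/C++ template includes. The header actually produced by Chern may
# look different; this only illustrates the parameter/folder mapping.
def write_arguments_header(path, parameters, folders):
    """parameters: task parameters; folders: alias -> mounted data path
    (the "output" alias points to the directory for the results)."""
    lines = ["#include <map>", "#include <string>",
             "std::map<std::string, std::string> parameters = {"]
    lines += [f'    {{"{key}", "{value}"}},' for key, value in parameters.items()]
    lines += ["};", "std::map<std::string, std::string> folder = {"]
    lines += [f'    {{"{alias}", "{location}"}},' for alias, location in folders.items()]
    lines += ["};"]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_arguments_header(
    "arguments",
    {"mass_window": "5.2,5.4"},
    {"selected": "/data/container_1.1", "output": "/data/container_2.2"},
)
```

With such a header, the template code only needs `#include "arguments"` before reading `parameters[...]` and `folder[...]` as described above.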
Operations

The operations applied to the repository should be restricted in order to keep its structure. The following operations are allowed and should be regarded as the atomic operations:

• Creating a repository: make an empty folder and create the files ".chern/config.json" and ".chern/project.json" under it. Write the object type "project" into ".chern/config.json".

• Creating a directory: make an empty folder and create the file ".chern/config.json" under it. Write the object type "directory" into ".chern/config.json".

• Creating an algorithm: make an empty folder and create the file ".chern/config.json" under it. Write the object type "algorithm" into ".chern/config.json".

• Creating a task: make an empty folder and create the file ".chern/config.json" under it. Write the object type "task" into ".chern/config.json".

• Linking an algorithm and a task: add the task to the successor list of the algorithm in the configuration file of the algorithm. Add the algorithm to the predecessor list of the task. A sketch of the bookkeeping behind linking and unlinking is given after this list.

• Unlinking an algorithm and a task: remove the task from the successor list of the algorithm. Remove the algorithm from the predecessor list of the task.

• Linking a task to a second task: add the second task to the successor list of the first task. Add the first task to the predecessor list of the second task.

• Unlinking a task from a second task: remove the second task from the successor list of the first task. Remove the first task from the predecessor list of the second task. Remove the alias.

• Moving an object: move the directory. Since an object is recorded in its predecessors and successors in the form of the path relative to the root of the whole repository, the configuration files of its predecessors and successors have to be modified to use the new relative path.

• Copying an object: copy the directory and unlink the predecessors and the successors.

• Removing an object: unlink the predecessors and the successors, then remove the directory.

• Modifying "README.md": just edit "README.md".

• Creating an impression: once an impression is required for an algorithm or a task, the impressions of the preceding tasks or algorithms are first checked to be available. After that, the contents of the task or algorithm and the preceding impressions define the required impression.
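The sketch below illustrates the predecessor/successor bookkeeping that the link and unlink operations perform on the two configuration files; the JSON field names follow the placeholder schema of the earlier repository sketch and are not the toolkit's exact keys.

```python
# Sketch of the atomic "link" and "unlink" operations described above.
# Both objects record the relation: the algorithm (or upstream task)
# keeps a successor list, the downstream task keeps a predecessor list.
# Field names ("successors", "predecessors") are placeholders.
import json
import os

def _load(path):
    with open(os.path.join(path, ".chern", "config.json")) as f:
        return json.load(f)

def _save(path, config):
    with open(os.path.join(path, ".chern", "config.json"), "w") as f:
        json.dump(config, f, indent=2)

def link(upstream, downstream, project_root):
    """Record that `downstream` depends on `upstream` (directories on disk).
    Paths are stored relative to the project root."""
    up_rel = os.path.relpath(upstream, project_root)
    down_rel = os.path.relpath(downstream, project_root)
    up_cfg, down_cfg = _load(upstream), _load(downstream)
    up_cfg.setdefault("successors", []).append(down_rel)
    down_cfg.setdefault("predecessors", []).append(up_rel)
    _save(upstream, up_cfg)
    _save(downstream, down_cfg)

def unlink(upstream, downstream, project_root):
    """Remove the relation recorded by link()."""
    up_rel = os.path.relpath(upstream, project_root)
    down_rel = os.path.relpath(downstream, project_root)
    up_cfg, down_cfg = _load(upstream), _load(downstream)
    if down_rel in up_cfg.get("successors", []):
        up_cfg["successors"].remove(down_rel)
    if up_rel in down_cfg.get("predecessors", []):
        down_cfg["predecessors"].remove(up_rel)
    _save(upstream, up_cfg)
    _save(downstream, down_cfg)
```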
The Chern toolkit

A prototype toolkit named "Chern" has been designed to realize the conceptual design. The GitHub repository of the toolkit is https://github.com/hepChern, and the user documentation can be found at http://chern.redthedocs.io/en/latest. A few examples, including some toy examples and some real analyses, can be found on GitHub.

Commands

The toolkit provides commands that realize the operations defined in the section above, and combinations of them, helping the analyzer to operate the Chern repository conveniently. I choose to use the IPython notebook to hide the details of managing the Chern repository as much as possible. Commands such as "mv", "cd", "ls", "cp" and many others are redefined. For example, "ls" not only shows the contents of a directory, but also additional information such as the status and the predecessors, as in Fig. 7. Other helpful commands such as "mktask" to create a new task and "submit" to submit the current object to the runner are also provided.

Figure 7: The contents shown when the user types "ls" under a folder or a task.
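A hypothetical session is shown below to illustrate how these commands might be combined; the prompts and the output format are illustrative only and may differ from what the real toolkit prints.

```
In [1]: cd my_analysis
In [2]: mktask fit_mass          # create a new task
In [3]: ls                       # objects with status and predecessors
        selection   (algorithm)   status: impressed
        fit_mass    (task)        status: new
In [4]: cd fit_mass
In [5]: submit                   # send the current object to the runner
```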
Running an impression

An impression should be able to be executed on any machine and generate exactly the same results. In the past, a virtual machine was needed to achieve such a requirement. Nowadays, the development of container techniques makes this much easier. The runner mounts the storage of the container and links the containers; the "arguments" file is created by the runner at this time.
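As an illustration of what "mounting the storage and linking the containers" could look like with Docker, the sketch below launches one impression with its own output volume writable and the volumes of its predecessors mounted read-only. The directory conventions, image naming and the command run inside the container are assumptions of mine, not Chern's actual runner.

```python
# Sketch: how a runner could execute one impression with Docker.
# Output storage is mounted read-write, predecessor storage read-only,
# and the generated "arguments" header is mounted into the work area.
# Paths, image names and mount points are assumptions for illustration.
import subprocess

def run_impression(image_tag, output_dir, predecessor_dirs, arguments_file):
    cmd = ["docker", "run", "--rm",
           "-v", f"{output_dir}:/data/output",
           "-v", f"{arguments_file}:/workdir/arguments:ro"]
    for alias, path in predecessor_dirs.items():
        cmd += ["-v", f"{path}:/data/{alias}:ro"]
    cmd += [image_tag, "root", "-b", "-q", "/workdir/run.C"]
    subprocess.run(cmd, check=True)

run_impression(
    "chern/image_2.2",
    "/storage/container_2.2",
    {"selected": "/storage/container_1.1"},
    "/storage/container_2.2/arguments",
)
```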
Communication

The communication protocols between the Chern repository and the runner, and between runners, are not defined in the protocol above, since different communication methods are in principle equivalent and it is better to support several of them. In the current prototype, only HTTP is used for the communication between the Chern repository and the runner. The exchange of impressions and results between runners on different machines is not realized in the current prototype.

Use cases

Using the proposed Chern architecture is convenient in the following cases, which often occur in the analysis of high energy physics:

• With the management system, analyzers can directly read the analysis steps from the repository.

• The analysis steps are preserved for every piece of data, so the detailed steps used to generate the data are clear.

• The name and location of a directory are not used by the backend to run the analysis. They can therefore be modified arbitrarily without changing any result. This feature is especially useful when adding new parallel steps to an analysis.

• Similar analyses can easily be constructed. An analyzer can simply copy a Chern repository from someone else and perform some slight tuning to create a new analysis.

Future development
Although the full history is recorded in the Chern repository, the time machine function is not realized in the prototype. Reverting the analysis is often useful, and this function will be added in a future version of the toolkit.
A job monitoring system can be developed to manage the job status and disk usage.
The current frontend of the toolkit is based on the IPython notebook, but the architecture is in fact independent of the frontend. The frontend could also be a web application or a GUI application, which might be more user-friendly for analyzers.
Technically, the current prototype of Chern only runs jobs naively, while REANA supports a lot of advanced techniques. REANA uses container technologies (Docker) and aims at supporting several different declarative workflow engines (Yadage, CWL), several different computing cloud backends for job execution (Kubernetes, HTCondor) and several different storage backends for data sharing (EOS, Ceph). Using REANA as a backend could make Chern more powerful. However, an interface between them requires a lot of effort.

Conclusion

In this paper, an architecture for managing an analysis is proposed. With the new architecture, the workflow is constructed in an apparent way without losing convenience, and the requirement of preservation is also satisfied. A novel toolkit is provided for analyzers, so that they can do their analysis much as they do it today. The future development of the architecture and the toolkit is also discussed: the standard, the time machine function, the new frontends and the new backend are proposed.
Acknowledgements

I would like to express my gratitude to Sidan Chen, Mengzhen Wang, Li Xu, and Zhenwei Yang, who helped to polish this paper.
References

[1] Common Workflow Language.

[2] Docker, https://docker.com.

[3] X. Chen et al. CERN Analysis Preservation: A Novel Digital Library Service to Enable Reusable and Reproducible Research. International Conference on Theory and Practice of Digital Libraries, pages 347-356, 2016.

[4] A. Khodak et al. REANA - Reusable Analyses.

[5] Roman Kogler, David M. South, Michael Steder, ICFA DPHEP Study Group, et al. Data preservation in high energy physics. In