A computational EXFOR database
Georg Schnabel ∗, ∗∗
Division of Applied Nuclear Physics, Uppsala University, Sweden
Abstract.
The EXFOR library is a useful resource for many people in the field of nuclear physics. In particular, the experimental data in the EXFOR library serve as a starting point for nuclear data evaluations. There is an ongoing discussion about how to make evaluations more transparent and reproducible. One important ingredient may be convenient programmatic access to the data in the EXFOR library from high-level languages. To this end, the complete EXFOR library can be converted to a MongoDB database. This database can be conveniently searched and accessed from a wide variety of programming languages, such as C++, Python, Java, Matlab, and R. This contribution provides some details about the successful conversion of the EXFOR library to a MongoDB database and shows simple usage examples to underline its merits. All codes required for the conversion have been made available online and are open-source. In addition, a Dockerfile has been created to facilitate the installation process.

The EXFOR library [1], a comprehensive collection of experimental reaction data, is a valuable resource for many people in the field of nuclear physics. The Nuclear Reaction Data Centers (NRDC) host online services to search for, visualize, and retrieve data in various formats, among them the well-known EXFOR web retrieval system [2]. In particular, both the flexible search options provided by those online services and the simple structure of so-called computational formats, such as C4, are tremendously helpful in nuclear data evaluations. Notwithstanding the indisputable value of existing services and formats, there may be use cases involving reaction data from EXFOR which are not yet optimally covered by them:

1.
There is an ongoing discussion in the community about how to make evaluations more transparent and reproducible, as demonstrated by the recent proposal for WPEC subgroup 49 on reproducibility in nuclear data evaluation, and the idea of automated evaluation pipelines is gaining momentum. Such a pipeline is a sequence of scripts in a programming language that perform an evaluation. Choices of an evaluator are implemented as instructions in scripts, thereby removing the need for manual intervention in the event that an evaluation needs to be reproduced. The creation of such scripts should be facilitated as much as possible to enable evaluators to focus on the essential evaluation work. Readability also counts: other evaluators should later be able to understand the scripts with as little effort as possible. Convenient access to the EXFOR library in just a few lines of code, which means both searching and retrieving data, from a variety of programming languages serves this goal.

2. It can be foreseen that more and more sophisticated methods from statistics and machine learning will be used to analyze the data in the EXFOR library. The variety of available algorithms and the rapid development of new ones suggest that it may be difficult to come up with a computational format that suits all potential requirements. In this application scenario, users should be put in the position to quickly create a customized computational format themselves and traverse the EXFOR library in any way they want.

(∗ Currently at IAEA NAPC-Nuclear Data Section. ∗∗ e-mail: [email protected])

The solution put forward in this contribution to enable user-friendly access to information in the EXFOR library is to convert the complete EXFOR library to a MongoDB database [3] using JSON [10] as an intermediate format. This database application and the associated database architecture are very popular in the commercial world. It can be downloaded and used free of charge.
Due to their widespread adoption, MongoDB databases can be accessed from a wide variety of programming languages including C++, Python, Java, Perl, Matlab, and R. The rich query language and the possibility to access any textual and non-textual information suggest that a MongoDB database is a viable solution for the two use cases described above.

Finally, we want to stress that similar efforts have been undertaken in the past and associated codes released to the public to facilitate accessing data in the EXFOR library: The X4TOC4 code [4] developed by D.E. Cullen converts EXFOR entries into a tabular format. The ENDVER code [5] developed by A. Trkov can convert EXFOR data to the computational C4 format. The x4i code developed by D.A. Brown provides a programmatic interface to EXFOR, and a fork of it, the x4i3 code [6], is available on GitHub. A thorough comparison with these codes, their functionality, and their design considerations is outside the scope of this paper. The web retrieval system of the IAEA has also been upgraded recently to enable the retrieval of data in the JSON format.

In this contribution, we first discuss the original EXFOR format [7–9] in comparison with the JSON format [10]. Afterwards we discuss the creation of a MongoDB database based on EXFOR data given in the JSON format. Then we provide a simple example of how the MongoDB EXFOR database can be accessed from Python. A conclusion section ends this document.

The EXFOR format [7–9] is a data exchange format designed to enable the storage of numerical data and textual information related to experiments. Data associated with an experiment are bundled together as a so-called entry. An entry comprises several subentries. The first subentry contains general bibliographical information, such as the authors of the measurements or the facility, and specifications common to all subsequent subentries.
From the second subentry onwards, each subentry contains the data associated with a specific reaction process measured in the experiment.

The EXFOR format defining the structure of an entry was conceived to be readable by humans and machines. In order to be readable by machines, it follows rigid structural rules. For instance, a line contains two or more fields and the size of a field must be a multiple of eleven characters. At most 66 character slots in a line can be used to accommodate fields. Many lines are dedicated to storing key-value pairs: The first field of size eleven specifies a keyword, such as AUTHOR, and the second field of size 55 contains the associated value, e.g., the names of the authors.

The rigid structure is in principle a good thing, as it makes it easier to write programs that parse the content of EXFOR entries and make it accessible in a high-level programming language. However, the syntax of the EXFOR format follows complex rules and reflects its coevolution with the Fortran programming language. Writing parsers for EXFOR entries in a high-level language, such as Python or R, is therefore a tedious and time-consuming endeavor.

Furthermore, even though the EXFOR format is human-readable, modifications by hand bear the risk of inadvertently violating format specifications. For instance, the BIB keyword introducing a block with bibliographical information and the description of the data is followed by two fields indicating the number of keywords and the number of lines in the block. The entry is easily put into an inconsistent state by adding a bibliographical field by hand without updating these counter fields accordingly. Here one may argue that users do not need to change EXFOR entries themselves. However, it can be countered that format specifications may also be motivated by the needs of users. If users have the option to create their individual EXFOR entries with customized fields pertinent to their domain of application, e.g., an alternative representation of covariance matrices, this could provide helpful input for potential extensions of the official EXFOR library in the future.

Both problems, the need to write a parser and the error-proneness of modifications by hand, can be solved by converting the entries given in the EXFOR format to another format that is widely supported in the field of information technology. A natural candidate for that purpose is the JSON format [10]. It is a hierarchical format and can store numerical and textual data. It therefore provides all the features needed to store the information available in an EXFOR entry without any loss of information or accuracy.

An R package to convert entries or subentries given in the EXFOR format to the JSON format has been implemented and is available for download [11]. The philosophy of the converter is to preserve the logical structure of the original EXFOR entry as much as possible. Keywords in the original entry are also keywords in the JSON object.

Once an EXFOR entry or subentry is available as a JSON object, one can make use of existing functionality in most of the popular programming languages to retrieve information or manipulate the JSON object.
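As a minimal sketch of such a manipulation, the following snippet adds a customized field to an entry and writes it back to JSON text. The entry shown is hypothetical and heavily abbreviated: its nesting (SUBENT, BIB, DATA, UNIT, TABLE) is inferred from the access paths used in this paper, while the values, the layout of the table, and the field name MY_COMMENT are placeholders chosen for illustration only.

```python
import json

# Hypothetical, abbreviated entry. The nesting is an assumption inferred
# from the access paths used in this paper; all values are placeholders.
entry = {
    "SUBENT": [
        {"BIB": {"AUTHOR": "(A.Author,B.Author)"}},
        {
            "BIB": {"REACTION": "(26-FE-56(N,P)25-MN-56,,SIG)"},
            "DATA": {
                "UNIT": ["MEV", "MB"],
                "TABLE": {"EN": [1.0, 2.0], "DATA": [10.0, 20.0]},
            },
        },
    ]
}

# Add a customized field. In contrast to the original EXFOR format,
# no keyword or line counters have to be updated by hand.
entry["SUBENT"][0]["BIB"]["MY_COMMENT"] = "hypothetical user-defined field"

# Serialize to JSON text and parse it back again.
text = json.dumps(entry, indent=2)
restored = json.loads(text)
```

The round trip through `json.dumps` and `json.loads` leaves the nested structure unchanged, so the modified entry can be stored and read back without a dedicated parser.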
Here is a simple example that shows how to extract information from an EXFOR entry given as a JSON object using Python:

    import json

    # read an EXFOR entry stored as a JSON file
    with open('exforEntryFile') as json_file:
        entry = json.load(json_file)

    # authors are given in the first subentry
    entry['SUBENT'][0]['BIB']['AUTHOR']
    # reaction string and data table of the fourth subentry
    entry['SUBENT'][3]['BIB']['REACTION']
    entry['SUBENT'][3]['DATA']['TABLE']

As this example shows, having the EXFOR entries stored as JSON objects greatly facilitates access to the information. The creation of customized computational formats also becomes feasible for users without too much effort, e.g., by writing a small Python script to this end. However, it is equally important to be able to effectively search for relevant data, which is the topic of the next section.

Storing a large collection of data and searching through them is the purpose of database software. Often so-called relational databases are employed for storing data. The data are organized in several tables, and the association of rows residing in different tables is established by columns that contain the same information in each of these tables. For instance, in the context of EXFOR, each table could possess a column EntryID containing the entry identification numbers which uniquely identify experiments. This column links the rows of different tables together.

Using this database type to store the EXFOR library requires restructuring the information present in the collection of EXFOR entries as a collection of tables. The variability in the amount of information of different subentries makes this conversion process challenging. For instance, a specific field type can be present in one subentry but absent in another one.
These challenges for the conversion can be and have been solved, as proven by the conception and implementation of the web retrieval system of the IAEA and its underlying SQL database [2].

As the organization of data in EXFOR entries (hierarchical) is different from that in an SQL database (relational), there may be use cases where it is beneficial to preserve the original organization of the data, provided that the powerful search capabilities associated with a database application can be achieved as well. Database solutions that can store the EXFOR entries in a way that preserves their original structure are available in the form of so-called document-oriented databases. This database type belongs to the class of NoSQL databases. Instead of using a collection of tables, document-oriented databases manage a collection of documents. A document includes all information related to one entity. This is in sharp contrast to relational databases, where properties belonging to one entity are usually distributed over several tables. The advantage of this database software in the context of the EXFOR library is that each EXFOR entry can be stored as a document without the need to restructure or distribute its information.

The database adopted here is MongoDB [3], which can be downloaded and used free of charge. In a MongoDB database, documents are organized in collections. Documents are stored in the BSON format, which stands for binary JSON. Technical details aside, the BSON and JSON formats are equivalent. For this reason, it is possible to directly insert EXFOR entries given in the JSON format into a collection of a MongoDB database. A script to convert EXFOR entries to JSON objects and insert them into a MongoDB database has been made available at [11].

Once the MongoDB database is filled, the expressive query language provided by the MongoDB database software can be used to search for relevant information.
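The insertion step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the script provided at [11]: the helper names `read_json_documents` and `insert_documents` are hypothetical, each file is assumed to hold exactly one JSON object, and `collection` is assumed to be a pymongo collection (or any object offering an `insert_many` method).

```python
import json

def read_json_documents(paths):
    """Load one JSON object per file and return them as a list of dicts."""
    docs = []
    for path in paths:
        with open(path, encoding="utf-8") as json_file:
            docs.append(json.load(json_file))
    return docs

def insert_documents(collection, docs):
    """Insert dicts into a pymongo-style collection and return their number."""
    if docs:  # pymongo's insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)

# Usage against a running MongoDB server (assumed setup, not executed here):
#   from pymongo import MongoClient
#   client = MongoClient("localhost", 27017)
#   paths = ["entry1.json", "entry2.json"]   # hypothetical file names
#   insert_documents(client["exfor"]["entries"], read_json_documents(paths))
#   client.close()
```

Because JSON objects map directly to Python dictionaries and those in turn to MongoDB documents, no restructuring step is needed between reading the files and inserting them.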
Simple usage examples showing how EXFOR data can be queried from Python are the topic of the next section. The remainder of this section elaborates on some choices made for the conversion from the original EXFOR library to a MongoDB database as it is provided at [12]. The following list describes the modifications made, which are in the opinion of the author reasonable and helpful:

1. The logical units most people operate with are not entries but subentries. For this reason, the documents in the MongoDB database are subentries.

2. The first subentry of an entry contains information which is common to all subsequent subentries. For this reason, the information of the first subentry has been merged into all subsequent subentries of the same entry before they were added to the MongoDB database. Sometimes collisions of field names occur during the merging process. In such cases, the offending field names in the first subentry are altered by adding the suffix _firstSub to them prior to merging.

3. The COMMON section of an original EXFOR subentry contains quantities that are constant for all measured data points. For instance, an angle-differential cross section may have been measured at various angles for 15 MeV incident neutrons. Thus the neutron energy is often stored in a COMMON field to avoid redundancy. From the user point of view, it may still be helpful to have the incident energy stored in the DATA table together with the angles and the measured cross section values. Information in the COMMON section is therefore merged into the DATA table but nevertheless also preserved as a separate field.

4. Standardized units help to avoid conversion errors. Therefore all energies have been converted to MeV and all cross sections to millibarn. The units of compound quantities, such as those associated with angle-differential cross sections and spectra, are modified accordingly.

Some other modifications of less significance have not been mentioned here for the sake of brevity. They are documented in the manual at [12] accompanying the installation files for the MongoDB EXFOR database.

According to the TIOBE index [13], Python is one of the most popular programming languages, and it is also widely adopted in the field of nuclear physics. Therefore some simple examples of how to retrieve data from the MongoDB EXFOR database are provided here to demonstrate the ease of use. The following examples rely on the pymongo module. To interact with the database, one first needs to connect to it:

    from pymongo import MongoClient

    # connect to a MongoDB server running on the local computer
    client = MongoClient('localhost', 27017)
    db = client["exfor"]
    entries = db["entries"]

One of the most elementary user actions is to retrieve a subentry using its subentry identification number (an eight-digit string).

    subent = entries.find_one({'ID': '11701004'})

The expression passed as an argument to the function find_one specifies the search query. Search queries for a MongoDB database have to be formulated as JSON objects. Since the data structure called a (nested) dictionary in Python is essentially equivalent to a JSON object, Python users can probably get used to this query syntax quickly.

The result of the query is a (potentially nested) dictionary. It can be explored by making use of the Python functions provided for dictionaries.
Just as an example:

    subent["BIB"]["AUTHOR"]
    subent["DATA"]["UNIT"]
    subent["DATA"]["TABLE"]

As a final example of a more advanced query, we can use a regular expression to match reaction strings that specify neutrons (N) as projectile, Fe-56 as target, and angle-integrated cross sections (SIG):

    import re

    # raw string, so that the backslashes reach the regular expression engine
    regex = re.compile(r"^\(26-FE-56\(N,[^)]+\)[^,]*,,SIG\)")
    subents = entries.find({'BIB.REACTION': regex})

The variable subents is an iterator, which can be used to iterate over the found subentries in a loop, e.g.:

    for subent in subents:
        print(subent["BIB"]["AUTHOR"])

Finally, the connection to the MongoDB database should be closed:

    client.close()

This example provided just a small glimpse into the possibilities to interact with the data in the MongoDB EXFOR database. Comprehensive information about the query language can be found in the official MongoDB documentation.

We argued that the JSON format is a suitable format to store all the information available in the EXFOR library. Entries and subentries in the EXFOR library can be converted to the JSON format without loss of information or accuracy. Due to the wide support for the JSON format, the extraction of EXFOR data from JSON objects is trivial in high-level languages, such as Python. An EXFOR to JSON converter package has been made available at [11].

We also argued that a relational database may not always be the ideal solution to store the data of the EXFOR library from the perspective of the user. Document-oriented databases, such as MongoDB, enable storing the EXFOR library without structural transformations. As JSON objects can be imported into a MongoDB database and an EXFOR to JSON converter is in place, the conversion of the complete EXFOR library into a MongoDB database is straightforward.
A script that automates this conversion process is available at [14].

The complete installation process of the MongoDB EXFOR database requires several steps, such as the installation of the MongoDB database and the conversion of EXFOR entries. Therefore, to facilitate the installation for the user, the installation process has been automated to a large extent using the Docker technology for virtualization. The required files and the instructions to install the EXFOR MongoDB database on the local computer can be found at [12].

It is hoped that the provided computational database will be helpful for users in its current form. Future use cases will certainly point to possible improvements, and issues will potentially surface. Because users can modify the database themselves without too much effort, they can become designers and, for instance, create their own computational formats and databases. This circumstance may foster the development of tools and formats related to nuclear data and potentially provides inspiration for how to improve the official EXFOR library.

References

[1] N. Otuka, E. Dupont, V. Semkova, B. Pritychenko, A. Blokhin, M. Aikawa, S. Babykina, M. Bossant, G. Chen, S. Dunaeva et al., Nuclear Data Sheets , 272 (2014)
[2] V. Zerkin, B. Pritychenko, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment , 31 (2018)
[3] MongoDB: The database for modern applications, https: // /
[4] D. Cullen, A. Trkov, Tech. Rep. IAEA-NDS-0080, IAEA (2001)
[5] A. Trkov, Tech. Rep. IAEA-NDS-77, IAEA (2008)
[6] x4i3 - EXFOR interface for Python, https://github.com/afedynitch/x4i3
[7] Tech. Rep. IAEA-NDS-0206, IAEA (2008)
[8] V. McLane, Tech. Rep. IAEA-NDS-207, IAEA (2000)
[9] EXFOR Basics: A short guide to the neutron reaction data exchange format, https: // / nrdc / basics /
[10] JSON format, http: // /
[11] exforParser: EXFOR to JSON converter, https://github.com/gschnabel/exforParser
[12] Dockerfile and installation instructions to create MongoDB EXFOR database, http: // /
[13] TIOBE Index, https: // / tiobe-index /
[14] createExforDb: Script to create MongoDB database with EXFOR data, https://github.com/gschnabel/createExforDb