Conquery: an open source application to analyze high content healthcare data
Fabian Kovacs, Max Thonagel, Marion Ludwig, Alexander Albrecht, Hannes Priehn, Manuel Hegner, Dirk Enders, Lennart Hickstein, Maximilian von Knobloch, Anne Rothhardt, Jochen Walker
InGef – Institute for applied healthcare research Berlin GmbH, Germany; bakdata GmbH, Berlin, Germany
Corresponding author: Jochen Walker, Spittelmarkt 12, 10117 Berlin, [email protected]; +49 30 586945 470
ABSTRACT
Background:
Big data in healthcare must be exploited to achieve substantial gains in efficiency and competitiveness. The analysis of patient-related data in particular holds huge potential to improve decision-making processes in the healthcare sector. Most analytical approaches used today are highly time- and resource-consuming. Conquery, the open source software presented here, provides advanced but intuitive data analysis without the need for specialized statistical training.
Results:
Conquery is a column-oriented distributed timeseries database and analysis platform. Its main application is the analysis of per-person medical records by non-technical medical professionals. Complex analyses are composed in the Conquery frontend by dragging tree nodes into the query editor. Queries are evaluated in a column-oriented fashion by a bespoke distributed query engine for medical records. We present a custom compression scheme that uses precomputed as well as online-calculated metadata and data statistics to achieve low response times.
Conclusions:
Conquery is an efficient and intuitive open source software for performant and secure data analysis and aims at supporting decision-making processes in the healthcare sector.
KEYWORDS:
Data mining, software, delivery of healthcare, information storage and retrieval, statistical data analyses
BACKGROUND
IMPLEMENTATION
For the sake of comprehensibility, we differentiate between the operator and the user. The operator is responsible for maintaining a Conquery instance, while the user interfaces with Conquery through the browser-based user interface (or Conquery's JSON-based query language). Users are assumed to be non-technical medical experts, whereas operators require both technical knowledge (Java, JSON, SQL, etc.) and domain expertise. The data organization of Conquery is illustrated schematically in figure 1.
Architecture overview
Conquery is implemented in Java, using the manager-worker paradigm for distributed parallel computation. The manager node maintains the schema and data distribution, but performs no computation of its own, nor does it hold any data. The worker nodes hold non-replicated data and do the computation on behalf of the manager node. All data in Conquery is identified by a global primary identifier (Id). This Id marks data that belongs to the same entity and is used to partition the data among workers. In brief, all records relating to the same entity are stored on a single worker node. Additionally, queries are evaluated per entity, allowing complete parallelization of query evaluation.
Entities
Entities are defined by a primary Id and the set of associated data throughout the whole database. They usually consist of a person and their respective medical records but could also comprise data related to a single hospital case. As will be further explained below, a query in Conquery is evaluated per entity, requiring that all data of an entity be stored on a single worker node.
Tables & Imports
Tables define a data scheme composed of columns and their data-type descriptions (string, money, number, date, etc.), serving as the building blocks of so-called concepts (outlined below). A table itself does not hold any data; instead it is associated with data by so-called imports. Imports are defined as a compressed binary representation of the underlying data, generated in a mandatory preprocessing step executed by the operator beforehand. The most important part of preprocessing is creating a compressed representation of strings: instead of storing the same string multiple times, it is stored in a Patricia-trie-based dictionary and only the index into the dictionary is stored in the record. [7] When loading an import, the manager node distributes the data evenly among its workers; the manager itself only stores mappings from the internal dictionary to the entities.
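The dictionary encoding described above can be sketched in a few lines. This is a hypothetical illustration, not Conquery's actual implementation: a plain HashMap stands in for the Patricia-trie dictionary, and the class names are ours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary-encoding repeated strings: each distinct value is
// stored once and records keep only its integer index.
public class StringDictionary {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Return the index of the value, inserting it on first occurrence.
    public int encode(String value) {
        return indexOf.computeIfAbsent(value, v -> {
            values.add(v);
            return values.size() - 1;
        });
    }

    // Resolve an index back to the original string (e.g. for CSV rendering).
    public String decode(int index) {
        return values.get(index);
    }

    public int size() {
        return values.size();
    }
}
```

Because the indices are plain integers, the encoded column can subsequently undergo the integer-based compression described below.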
The workers analyze the incoming data, create an abstract description of it, and then apply compression and memory optimizations on a per-column-per-import basis. For example, a column of integer numbers might only contain values ranging from 0 to 255, allowing them to be stored as single bytes instead of full 8-byte longs. Strings and dates are internally represented as numbers. The resulting integer representation of these non-integer datatypes can therefore also undergo integer-based compression. The underlying values are only computed when strictly necessary (i.e. for query evaluation and CSV rendering).
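The width selection described above can be sketched as follows. This is a minimal illustration under our own assumptions (offset encoding relative to the column minimum), not Conquery's actual code:

```java
// Sketch of the per-column memory optimization: after scanning a column's
// minimum and maximum, values are stored as the offset (value - min) in the
// smallest primitive type that can hold the range.
public class ColumnNarrowing {
    public enum Width { BYTE, SHORT, INT, LONG }

    // Choose the narrowest storage width for offsets in [0, max - min].
    // Assumes max - min does not overflow a long.
    public static Width choose(long min, long max) {
        long range = max - min;
        if (range <= 0xFFL) return Width.BYTE;        // e.g. values 0..255
        if (range <= 0xFFFFL) return Width.SHORT;
        if (range <= 0xFFFFFFFFL) return Width.INT;
        return Width.LONG;
    }
}
```

Storing the offset rather than the raw value means that even a column of large but narrowly spread values (e.g. 100000 to 100200) fits into single bytes.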
The workers' analysis results in an internal data description that is used to generate and compile new Java classes at runtime. These classes store data in a column-oriented fashion and reproduce the original values when queried. The aim of this online compilation is, firstly, to provide enough type information for the Java virtual machine (JVM) to optimize the code effectively and, secondly, to store the data in a memory-optimized fashion. The JVM can analyze the usage of methods at runtime and will subsequently generate optimized code where appropriate (e.g. inlining method calls, pruning unused code and removing redundant type checks). [8]
Comparison of compression
The compression strategy outlined above results in a compression comparable to Gzip. Comparing the size of raw CSV files to Gzip and Conquery's compression, Gzip yields an average size of 20.98% of the original CSV, while Conquery's average is 24.87%. However, Conquery's data format is a record-based format meant for in-memory seeking at single-event precision, which Gzip does not support natively. [9] In addition, for on-disk storage, Conquery's compressed files can be compressed further with Gzip; the resulting size is 62.11% smaller than that of files exclusively compressed with Gzip.
Concepts
Concepts are the central component of Conquery. They describe the underlying semantics of the data. Concepts are specified in JSON format with reference to certain columns in a table containing conceptual information as structured text. Conquery allows automatic classification of conceptual information according to a specified hierarchical schema, such as ICD-10-GM, arranging it into a tree-like structure. To exemplify this with the ICD-10-GM code "G20.11" for the diagnosis "Parkinson's disease with fluctuations and medium to severe impairment": it is first classified under the chapter for diseases of the nervous system, subordinated to "G20-G26 - Extrapyramidal and movement disorders", followed by "G20 - Parkinson's Disease", and then "G20.11 - Parkinson's disease with fluctuations and medium to severe impairment". This hierarchy is defined in a tree-structured JSON file (see supplementary data).
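The prefix conditions that drive such a hierarchy can be illustrated with a small longest-prefix classifier. This is a hypothetical sketch, not Conquery's actual implementation: a plain map stands in for the Patricia trie, and a HashMap caches resolved codes (the "flattening" mentioned below).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of prefix-based classification: a code is assigned to the tree node
// with the longest registered prefix, and resolved codes are cached.
public class PrefixClassifier {
    private final Map<String, String> nodeByPrefix = new TreeMap<>();
    private final Map<String, String> cache = new HashMap<>();

    public void add(String prefix, String nodeName) {
        nodeByPrefix.put(prefix, nodeName);
    }

    // Classify a code by shortening it until a registered prefix matches.
    public String classify(String code) {
        return cache.computeIfAbsent(code, c -> {
            for (String p = c; !p.isEmpty(); p = p.substring(0, p.length() - 1)) {
                String node = nodeByPrefix.get(p);
                if (node != null) return node;
            }
            return null; // no matching concept node
        });
    }
}
```

A real Patricia trie answers the prefix lookup in a single descent instead of the repeated map probes shown here; the cache then makes repeated classifications of the same code a single hash lookup.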
A query for all diagnoses of "G20-G26" will therefore not only contain "G20" diagnoses, but also every subordinated diagnosis. We optimize this classification threefold: First, since a record cannot be changed once it is loaded, classifications are computed only once, at load time. Second, we use a Patricia trie for classification, as it specifically allows querying for prefixes. [7] Third, we cache classifications, effectively flattening the trie on demand. Classification can also be subject to more involved conditions, such as basic Boolean logic and filtering along multiple columns. Multiple tables can be funneled into the same concept, allowing the user to query for records across different domains (e.g. inpatient or outpatient records) containing the same conceptual information. Concepts also contain definitions for filtering and aggregation per table and designate a set of date columns that define the timeframe in which a record must have taken place.
Queries
The basic semantics of a Conquery query are "return all entities satisfying X", where X is a set of conditions on the entities' records. Additionally, every query returns a union of the times over which X holds true for the entity and, if specified, aggregation results over the entity. Conquery therefore executes all queries independently per entity, which enables trivial parallelization while maintaining a high degree of readability, one of its core features. The lifetime of a query is illustrated and described in detail in figure 2.
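Because all records of an entity reside on one worker, query evaluation reduces to an independent predicate per entity and parallelizes trivially. The following is a minimal sketch of that model under our own assumptions; the Record type and the predicate are illustrative stand-ins, not Conquery's actual query plan.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of per-entity evaluation: each entity's record set is tested
// independently, so the work can be distributed without coordination.
public class PerEntityEvaluation {
    public record Record(String icdCode) {}

    // Return the ids of all entities whose records satisfy the condition.
    public static List<String> evaluate(Map<String, List<Record>> recordsByEntity,
                                        Predicate<List<Record>> condition) {
        return recordsByEntity.entrySet().parallelStream()
                .filter(e -> condition.test(e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

The absence of any cross-entity state in the predicate is exactly what rules out relative queries and joins, as discussed in the Limitations section.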
Figure 1 Data organization of Conquery
Filters and Aggregators
Filters are defined in a concept over a specified table (examples: see supplementary data). Aggregators compute values over an entity's records and decide whether an entity is included or not (i.e. whether the aggregation results comply with the user's requirements).
The aggregators' values are extracted and stored to be transmitted in batches to the manager node. Finally, a result CSV file is issued to the user (figure 2, step 8).
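An aggregator in this sense both produces a result value and acts as a filter. The following is a hypothetical sketch of a COUNT-distinct aggregator with a minimum threshold, mirroring the "Case number" filter in the supplementary data; the class name and API are ours, not Conquery's.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an aggregator that counts distinct case ids per entity and
// decides inclusion based on a user-supplied minimum.
public class CountDistinctAggregator {
    private final Set<String> seen = new HashSet<>();
    private final int minimum;

    public CountDistinctAggregator(int minimum) {
        this.minimum = minimum;
    }

    // Consume one record's case id.
    public void accept(String caseId) {
        seen.add(caseId);
    }

    // Aggregation result reported in the output CSV.
    public int result() {
        return seen.size();
    }

    // Filter decision: does the aggregated value meet the user's requirement?
    public boolean isIncluded() {
        return seen.size() >= minimum;
    }
}
```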
Figure 2 Lifetime of a query in Conquery:
1. Submission of an API-layer description of the query, by the user, to the manager node
2. Distribution of the query description to all worker nodes
3. Translation of the API description into a concrete query plan (optimization of the query and preparation of necessary data structures)
4. Per-entity submission of a job into a queue for parallel processing
5. Evaluation of the query per entity, producing at least one result line for entities which satisfy the query, but possibly more for specific types of queries (outlined below)
6. Batched transmission of query results to the manager node
7. Once all worker nodes have sent results, or a single node has sent a failure result, the query is marked as finished and the user can download the result in CSV format if available
8. Rendering of the query result to CSV from the internal representation
RESULTS AND DISCUSSION
The user interface displays concepts according to the tree structure outlined above, in a directory-like format, allowing any novel user easy navigation through the hierarchy (fig. 3). Queries can be composed by dragging tree nodes onto the query editor, where the user can combine (OR) or intersect (AND) node criteria with further nodes by choosing the corresponding AND/OR drop zones (compare fig. 3). Clicking on a query element opens an advanced settings dialogue to define further parameters. This allows users to select specific data sources or to extend the output with further aggregation results through the options found under "additional values". The extent and content of the dialogue depend on the underlying concept node. The query is executed by clicking the "start query" button, and a "CSV" button appears once the backend has finished query execution, offering the option to download the resulting CSV file for further data processing. The query is automatically saved as a query history entry and can be re-executed and referred to in later tasks. For advanced users, Conquery offers a REST-API over HTTP to programmatically formulate requests, for example from Python. The queries can then be used to generate custom analyses and reports, which can be integrated into Conquery through a Jupyter-based backend.
Figure 3 Conquery user interface to construct queries.
Specific concepts are dragged from the "CONCEPTS" panel into the "EDITOR" panel (right). The user can choose between a "GROUP EDITOR" (setup of the concept-based study cohort) or a "TIMEBASED EDITOR", which takes into account the chronology of events. The resulting cohort can be accessed via an output CSV file or reused for further queries. Previous queries are available in the "QUERIES" panel (middle). Basic information on the concepts can be found in the left "INFO" panel.
Conquery currently serves as the basis for an in-house developed analysis tool that is used by various German health insurance providers and enables a growing number of hospitals to rapidly explore their research data. Once implemented in a local DWH setup, it can be used for real-world evidence studies and the generation of hypotheses, to improve quality of care and study start-up, or to enhance patient recruitment and the execution of clinical trials. The analytical possibilities grow with the extension of the data integrated in the specific DWH. As a next step, Conquery aims to support research institutions in understanding and efficiently using their data and thus to increase data value and the feasibility of prospective projects. Future work will encompass cloud integration and scalability features, preview and visualization features for the frontend, and further performance improvements of the query engine. As Conquery is in active development and use, other minor and major features not listed here are expected to be added. Supporting the recent interface standard FHIR (HL7) is a prospective challenge and will further improve software usability. Moreover, Conquery offers the import and integration of heterogeneous data types, such as claims data, data from health apps, medical registries or electronic medical records.
Limitations
Conquery's design of single-pass, entity-independent query evaluation results in two major limitations of its query engine. The query language does not allow for relative queries (e.g. a person being in the 50th percentile of a ranking) or SQL-like joins. Both restrictions are justified by the objective of minimizing inter-worker communication (reducing programmatic complexity and query latency). To achieve similar functionality, we instead offer a system for implementing new query logic that acts as a driver for multiple coordinated queries.
CONCLUSIONS
AVAILABILITY AND REQUIREMENTS
DECLARATIONS
Consent for publication
Not applicable.
Availability of data and materials
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Funding
Partly funded by the BMWi project HLaN-Health Reality Lab Network (01MD18004D)
Author Contributions
JW: original idea, led study, manuscript drafting and editing; ML/DE: manuscript drafting and editing; MH: software development; AA: manuscript drafting and editing; FK: software development, manuscript drafting and editing; MvK/LH/MT/AR/HP: software development and support for manuscript discussions. All authors read and approved the manuscript.
Acknowledgments
We thank Peter Ihle from the working group “Erhebung und Nutzung von Sekundärdaten“ (AGENS group of DGSMP and DGepi) for providing a test dataset for a demo version of Conquery. Current affiliation of Manuel Hegner is Deutsche Bahn AG, department HCB, Berlin, Germany.
REFERENCES
[1] IDC/EMC, "The Digital Universe Driving Data Growth in Healthcare," 2014. [Online].
[2] McKinsey & Company, "Digitalisierung im Gesundheitswesen: die 34-Milliarden-Euro-Chance für Deutschland," p. 9, Sep. 27, 2018.
[3] K. Lee, N. Weiskopf, and J. Pathak, "A Framework for Data Quality Assessment in Clinical Research Datasets," AMIA Annu. Symp. Proc., vol. 2017, pp. 1080–.
[4] et al., "A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data," EGEMS Wash. DC, vol. 4, no. 1, p. 1244, 2016, doi: 10.13063/2327-9214.1244.
[5] E. Scheufele et al., "tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform," AMIA Summits Transl. Sci. Proc., vol. 2014, pp. 96–.
[6] et al., "Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2)," J. Am. Med. Inform. Assoc. JAMIA, vol. 17, no. 2, pp. 124–.
[7] D. R. Morrison, "PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric," Association for Computing Machinery, Oct. 01, 1968, doi: 10.1145/321479.321481 (accessed Apr. 07, 2020).
[8] "Understanding JIT Compilation and Optimizations." https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/underst_jit.html (accessed Apr. 07, 2020).
[9] "GZinga: Seekable and Splittable Gzip," Oct. 09, 2015. https://tech.ebayinc.com/engineering/gzinga-seekable-and-splittable-gzip/ (accessed Apr. 07, 2020).
Supplementary Data
Code
{
  ...
  "name": "g20-g26",
  "description": "Extrapyramidal and movement disorders",
  "condition": { "type": "PREFIX_RANGE", "min": "G20", "max": "G26" },
  "children": [
    {
      "name": "g20",
      "description": "Parkinson’s disease",
      "condition": { "type": "PREFIX", "prefix": "G20" },
      "children": [
        {
          "name": "g20_1",
          "description": "Parkinson’s disease with moderate to severe impairment",
          "condition": { "type": "PREFIX", "prefix": "G201" },
          "children": [
            {
              "name": "g20_11",
              "description": "Parkinson’s disease with moderate to severe impairment with fluctuations",
              "condition": { "type": "PREFIX", "prefix": "G2011" }
            }
          ]
        }
      ]
    }
  ]
}
Code
{
  "label": "Hospital Diagnoses",
  "validityDates": [
    { "label": "Case begin", "column": "hospital_diagnosis.case_begin" },
    { "label": "Case end", "column": "hospital_diagnosis.case_end" }
  ],
  "column": "hospital_diagnosis.icd_code",
  "filters": [
    {
      "type": "SELECT",
      "label": "Diagnose kind",
      "column": "hospital_diagnosis.kind",
      "labels": { "primary": "Primary", "secondary": "Secondary", "initial": "Initial" }
    },
    { "type": "COUNT", "distinct": true, "label": "Case number", "column": "hospital_diagnosis.case_id" }
  ],
  "selects": [
    { "label": "ICD-Codes", "type": "DISTINCT", "column": "hospital_diagnosis.icd_code" },
    { "label": "Number of Cases", "type": "COUNT", "distinct": true, "column": "hospital_diagnosis.case_id" }
  ]
}