[PDF] An Empirical Study on the Design and Evolution of NoSQL Database Schemas

Abstract

We study how software engineers design and evolve their domain model when building applications against NoSQL data stores. Specifically, we target Java projects that use object-NoSQL mappers to interface with schema-free NoSQL data stores. Given the source code of ten real-world database applications, we extract the implicit NoSQL database schema. We capture the sizes of the schemas, and investigate whether the schema is denormalized, as is recommended practice in data modeling for NoSQL data stores. Further, we analyze the entire project history, and with it, the evolution history of the NoSQL database schema. In doing so, we conduct the so far largest empirical study on NoSQL schema design and evolution.

Full PDF

AAn Empirical Study on the Design andEvolution of NoSQL Database Schemas

Stefanie Scherzinger and Sebastian Sidortschuck OTH Regensburg, Regensburg, Germany [email protected] OTH Regensburg, Regensburg, Germany& SPARETECH.io, Stuttgart, Germany [email protected]

Abstract.

We study how software engineers design and evolve theirdomain model when building applications against NoSQL data stores.Speciﬁcally, we target Java projects that use object-NoSQL mappers tointerface with schema-free NoSQL data stores. Given the source codeof ten real-world database applications, we extract the implicit

NoSQLdatabase schema. We capture the sizes of the schemas, and investigatewhether the schema is denormalized, as is recommended practice in datamodeling for NoSQL data stores. Further, we analyze the entire projecthistory, and with it, the evolution history of the NoSQL database schema.In doing so, we conduct the so far largest empirical study on NoSQLschema design and evolution.

Keywords:

Schema Evolution · NoSQL Databases · Empirical Study.

Schema-ﬂexible NoSQL data stores have become popular backends for buildingdatabase applications. Systems like MongoDB allow for ﬂexible changes to thedomain model during application development. In particular, they have proventhemselves in settings where applications are frequently deployed to their pro-duction environment, e.g., when web applications are built in an agile approach.While the data stores do not enforce a global schema, the application codegenerally assumes that persisted entities adhere to a certain (if loose) domainmodel. Given that schema-ﬂexibility is one of the major selling points of NoSQLdata stores, this raises the question how the domain model, and thereby theimplied

NoSQL database schema , actually evolves. We empirically study thedynamics of NoSQL database schema evolution. Further, we investigate thequestion whether the NoSQL database schema is denormalized, as commonlyrecommended in literature, e.g. [15].Unfortunately, real-world data dumps of NoSQL data stores are hard to comeby. We therefore resort to analyzing the source code of applications hosted onGitHub. We focus on the relevant software stack shown in Figure 1a, namely Javaapplications that use an object-NoSQL mapper to store data in either Google a r X i v : . [ c s . D B ] F e b Scherzinger and Sidortschuck

Cloud Datastore or MongoDB , both popular and mature data stores. Amongover 1.2K open source GitHub repositories with this stack, we have identiﬁedthe ten projects with the largest NoSQL schemas (a notion introduced shortly).Previous studies on schema evolution have primarily focused on schema-full,relational databases [5,11,13,18,20,25]. About NoSQL schema evolution in real-world applications, little is known that is based on systematic, empirical studies(versus anecdotal evidence): Earlier studies have a diﬀerent focus (such as theusage of certain mapper features [14]), or analyze a single project (c.f. [12]).In this paper, we introduce our notion of the NoSQL database schema,which is implicit in object mapper class declarations, even though the under-lying NoSQL data stores are schema-free. In this setting, this paper makes thefollowing contributions: – We formulate three research questions, namely (RQ1) whether the NoSQLdatabase schema is denormalized (as recommended in literature), (RQ2)which growth in complexity we can observe in NoSQL database schemasover the project development time, and (RQ3) how the NoSQL databaseschema evolves, thereby identifying the common changes. – We analyze the ten projects with the largest NoSQL database schemasamong over 1.2K candidate projects, based on static code analysis and thecommit history. We are able to conﬁrm that denormalization is common inNoSQL database schemas. We are further able to show evidence of evolu-tionary changes to the NoSQL database schema in all analyzed projects. – We discuss our ﬁndings w.r.t. related studies on relational schema evolutionand ﬁnd that the churn rate of NoSQL schemas is comparatively high.

Structure.

Next, we introduce preliminaries in Section 2. In Section 3, we describeour methodology, and state our research questions. In Section 4, we present theresults of our study, which we then discuss in Section 5. We point out threats tothe validity of our results in Section 6, and give an overview over related workin Section 7. We conclude with an outlook on future work.

We next introduce the software stack studied, as well as our terminology.

Physical entities.

We consider two popular NoSQL data stores: Google CloudDatastore (called Datastore hereafter) is commercial and hosted on the GoogleCloud Platform, MongoDB is open source. Both data stores are schema-free(however, MongoDB oﬀers optional schema validation). Both manage document-like data, which we refer to as the (physical) entities . On an abstract level, anentity is a collection of key-value pairs, or properties . Entities may be nestedand properties may be multi-valued. We sketch a Datastore entity representinga player and his or her missions in a role playing game in Figure 1a, in (simpliﬁed)JSON notation, to abstract away from system-proprietary storage formats. https://cloud.google.com/datastore/, available since 2009. Fig. 1. (a) The object-NoSQL mapper separates the domain model from the NoSQLdata store (adapted from [6]). (b) Not all code changes are actually schema-relevant.

Domain models.

In principle, each entity in a schema-free data store may haveits very own, unique structure. However, in database applications, it is safeto assume that the software engineers have agreed on some domain model , assketched in Figure 1a. In our setting, the domain model is captured by Java classdeclarations, yet in the Figure, we use the more compact UML notation. (Fornow, we ignore the @ -labeled annotations.) Class Player declares attributes foran identiﬁer, a name, an amount of credits, and a list of missions. Each missionalso has an identiﬁer, a title, a level of diﬃculty, and tracks its completion. Object-NoSQL Mappers.

Object mappers are state-of-the-art in building data-base applications [6]. Like object-relational mappers, the object-NoSQL map-pers Objectify and Morphia map Java objects to entities. Objectify is tied toDatastore, and Morphia to MongoDB. With object-NoSQL mappers, develop-ers merely specify their domain model as Java classes that are annotated withthe keyword @Entity . Each entity-class has a unique key (annotated with @Id ).The object mapper provides methods for saving and loading: In Figure 1a, theclass name and the identifying attribute are mapped to the designated proper-ties _kind and _id . Objectify maps the player’s list of missions to an array ofnested entities. Yet at application runtime, an entity-class declaration may notmatch the structure of all persisted entities, as discussed next. Lazy data migration.

The data store may also store legacy versions of entities.Figure 1b shows a new version of entity-class Player, with changes due to newrequirements in the software development project. Attribute coins has replaced credits . Merely changing the entity-class in the application code does not af-fect any existing entities. Instead, persisted entities are only migrated lazily, https://github.com/objectify/objectify https://github.com/MorphiaOrg/morphia Scherzinger and Sidortschuck upon loading: The new version of entity-class Player in Figure 1b is backwards-compatible with Figure 1a. Once the legacy entity for Frodo has been loaded, thecorresponding Java object will have an attribute coins , as annotation @AlsoLoad lazily renames attributes.Thus, to obtain a summary of the structural variety of physical entities inthe data store (based on code analysis alone, not having access to the data storecontents itself), we need to consider the entire evolution history of entity-classes. NoSQL database schema evolution.

We base our notion of the

NoSQL databaseschema (or shorter,

NoSQL schema ) on the domain model. This idea of treatingentity-classes as schema declarations is re-current in literature, c.f. [4, 17]. Notethat not all Java attributes are relevant for the NoSQL database schema: At-tributes that are transient, e.g., carrying Objectify annotation @Ignore , are notschema-relevant: The value of hoursSinceLastLogin in Figure 1b is not per-sisted (it may be derived from lastLogin ). Also, class methods are not schema-relevant. Thus, code changes that only aﬀect transient attributes or class meth-ods are part of software evolution, but not of schema evolution. Therefore, theyare not considered schema changes by us.

Denormalized entity-classes.

The recommendation in working with Datastoreand MongoDB is to intentionally denormalize the schema. This can be doneby either nesting entities, or by using multi-valued properties, such as the arrayof

Mission s in Figure 5. There are various motivations for denormalization, onebeing that traditionally, the query languages do not provide a join operator(such is still the case in Google Cloud Datastore, and this also used to be thecase for MongoDB), so joined data is materialized in the data store. Anotherreason is that transactions between an arbitrary number of entities may notbe supported . Consequently, transactionally safe updates are often realized byupdates to a single, aggregate entity.In the following, we say an entity-class is denormalized if it does not declareﬂat, relational-style tuples in ﬁrst normal form, i.e., with atomic attribute valuesonly. So unless all schema-relevant attributes have Java primitive types (such as Integer , String , Boolean , . . . ), we say the entity-class is denormalized. As wediscuss in Section 3.3, this is a practical yet conservative approach.As an example, the entity-class declarations for players, sketched in Figure 1,is denormalized, due to the multi-valued attribute listOfMissions . In the following, we describe our methodology, such as the context of our analysis,the research questions, and the analysis process. While our outline has stronganalogies to Qiu et al. [13] and their analysis of relational schema evolution, our We point to the concepts such of entity groups and cross-group transactions in the classic

Google Cloud Datastore [16], which is in the process of being deprecated.esign and Evolution of NoSQL Database Schemas 5

Table 1.

Characteristics of the studied database applications.

Project Life Cycle O b j e c t i f y Cryptonomica/cryptonomica 04/16 ∼ ∼

29 0 ∼ ∼ ∼

82 0 ∼ ∼ ∼

55 138 ∼ ∼ ∼

79 0 ∼ ∼ ∼

43 0 ∼ M o r ph i a altiplanogao/tallyframework 06/15 ∼ ∼

24 0 ∼ ∼ ∼

36 0 ∼ ∼ ∼

28 0 ∼ ∼ ∼

27 0 ∼ ∼ ∼

51 0 ∼ process is rather diﬀerent: we cannot analyze schemas declared in a declarativedata deﬁnition language, such as SQL. Rather, we need to parse raw Java code. We used BigQuery to identify relevant open source repositories on GitHub, asof September 4th, 2018. We consider a repository (which we synonymously referto as a project) relevant if it contains Java import statements for Objectify orMorphia. We cloned over 1.2K candidate repositories and excluded any reposi-tories that (1) have fewer than 20 commits (to exclude tinker projects), (2) arethe Morphia or Objectify source code (or forks thereof), (3) or are ﬂagged asforks from repositories already covered, with no schema-relevant code changesafter the fork. We analyze the project history using git log . This allows usto re-trace the development history of all entity classes. We parse and aggregatethe log output using Python scripts.Among all projects analyzed, we determined the maximum number of entity-classes throughout the project history, and settled on the top-5 projects forObjectify and Morphia respectively. Table 1 lists these projects with their lifecycles up to the latest commit at the time of our analysis. We also state the totalnumber of commits at the time. We state the minimum and maximum numberof entity-classes throughout the project history, as well as the total number oflines of code between the ﬁrst and last analyzed commit (measured with cloc and reported in thousands). Google BigQuery is a commercial cloud service. This data warehousing tool allowsfor querying the GitHub open data collection, mostly non-forked projects with anopen source license: https://cloud.google.com/bigquery/. We state the exact command pattern for reproducability: git log--before=2018-09-04T00:00:00 --cherry-pick --date-order --pretty=format:"%H;%aI;%cI;%P" . https://github.com/AlDanial/cloc Scherzinger and Sidortschuck We analyze the structure of entity-classes, whether they map to ﬂat tuples in ﬁrst normal form, or whether theyrepresent denormalized data.

RQ2: What is the growth in complexity of the NoSQL schema?

We cap-ture schema complexity based on metrics recognized in literature.

RQ3: How does the NoSQL schema evolve?

We automatically identify andclassify evolutionary changes to the NoSQL schema.

Locating entity-classes.

We replay the commit histories and use the Java parserQDox to parse class declarations. We identify entity-classes by the object map-per annotation @Entity , which may also be inherited. Denormalization.

To determine whether an entity-class is denormalized, we parseits Java declaration and strip away attributes that are not relevant to the NoSQLschema. We then analyze the types of the remaining attributes. Unless all haveprimitive types (such as

Integer , String , or

Boolean ), we assume that theentity-class is denormalized.In most cases, we correctly recognize denormalization: (1) if the entity classdeclaration contains container classes (e.g., a Java

Collection ), and thereforean attribute is multi-valued. (2) Equally, the entity-class may contain nestedentity classes, giving it a hierarchical structure.However, there are also cases where this approach is a conservative sim-pliﬁcation, and we might falsely categorize an entity-class as denormalized: anattribute type may be declared in a third-party library, which is inaccessibleto us (see also our discussion in Section 6). Also, an attribute type may be acustom type that the developers declared. To realize that a custom type is just awrapper for a basic Java type, we would have to run more involved code analysis.Yet typically, polymorphic types are involved, and we are confronted with theinherent limitations of static code analysis.

Identifying schema changes.

We identify commits with schema-relevant changesby comparing succeeding versions of the application source code: We registerwhen (1) a new entity-class is added or an entity-class is removed, (2) a schema-relevant attribute is added or removed in an entity-class declaration, and (3) fur-ther, changes to schema-relevant attributes, such as to their types, default ini-tializations, or even object mapper annotations. We only focus on changes whichwe can recognize programmatically. Recognizing renaming or splitting an entity-class, or renaming an attribute, are instances of the challenge of schema matchingand mapping [2], and cannot be fully automated. https://github.com/paul-hammant/qdox In earlier versions of the mapper libraries, this annotation was only optional, so itcannot be relied upon. We therefore also search for the mandatory annotation @Id ,and thus reliably detect polymorphic entity-classes. (c.f. Section 6).esign and Evolution of NoSQL Database Schemas 7

Cryptonomica/cryptonomica FraunhoferCESE/madcap google/nomulus nareshPokhriyal86/testing Nekorp/Tikal-Technology (a) Objectify-based projects. altiplanogao/tallyframework bujilvxing/QinShihuang catedrasaes-umu/NoSQLDataEngineering GBPeters/PubInt MKLab-ITI/simmo

Denormalized entity-classOther (b) Morphia-based projects.

Fig. 2.

Visualization of denormalized NoSQL database schemas.

We analyze the entity-class declarations in their most current version w.r.t. de-normalization. The results are visualized in Figure 2. For each analyzed project,we show a dot matrix chart. The number of dots represents the number of entity-classes. The brighter (orange) dots represent the entity-class declarations whichwe must assume to be denormalized, due to the limits of static code analysis.The darker (blue) dots represent the other entity-classes.Notably, each project contains at least one denormalized entity-class, so allschemas are denormalized. With the exception of two Objectify-based projects,denormalized entity-classes dominate the NoSQL database schemas. There areeven two Morphia-based projects where all entity-classes are denormalized.

Results.

We ﬁnd that each project analyzed has denormalized entity-classes in itsNoSQL schema. This shows that developers make active use of denormalization.However, without qualitative studies based on developer surveys, we do notknow whether (1) the developers consciously chose a database which allows fora denormalized database schema, as this better suits their conceptual model.However, it could also be that (2) they are actually forced denormalize theirdata model, due to the technological limitations of NoSQL data stores (brieﬂydiscussed in Section 2).

In empirical studies on relational schema evolution, the number of tables isconsidered a simple approximation for schema complexity [7]. Accordingly, we

Scherzinger and Sidortschuck

0% 20% 40% 60% 80% 100%0%20%40%60%80%100% Cryptonomica/cryptonomica 0% 20% 40% 60% 80% 100%FraunhoferCESE/madcap 0% 20% 40% 60% 80% 100%google/nomulus 0% 20% 40% 60% 80% 100%nareshPokhriyal86/testing 0% 20% 40% 60% 80% 100%Nekorp/Tikal-Technology (a) Objectify-based projects.

0% 20% 40% 60% 80% 100%0%20%40%60%80%100% altiplanogao/tallyframework 0% 20% 40% 60% 80% 100%bujilvxing/QinShihuang 0% 20% 40% 60% 80% 100%catedrasaes-umu/NoSQLD... 0% 20% 40% 60% 80% 100%GBPeters/PubInt 0% 20% 40% 60% 80% 100%MKLab-ITI/simmo (b) Morphia-based projects.

Fig. 3.

Evolution trend of entity classes. The horizontal axes show the project progress,in percentage of commits analyzed. The vertical axes show the complexity of the schemaw.r.t. its maximum, for two alternative metrics. (Visualization modeled after [13].) track the number of entity-classes over time in Figure 3 (based on a visualizationidea from [13]). For each project, one chart is shown. On the horizontal axis, wetrack the progress of the project, measured as the percentage of git commitsanalyzed. For the madcap project, this is based on 853 commits (c.f. Table 1).On the vertical axis, we track the size of the NoSQL database schema using twometrics. One is the number of entity classes (blue solid line). This metric is alsonormalized w.r.t. its maximum throughout the project history. So for madcap ,the 100% peak corresponds to 82 entity classes, some of which were removed inthe later phase of the project.The second line denotes a “proxy metric” [7] for approximating the size of theNoSQL schema, where we count the lines of code of the entity-classes (includingsuperclasses, excluding comments and empty lines), and thereby compute the

Schema-LoC . There is shrinkage, yet overall, schema complexity increases.

Results. As in the study by Qiu et al. on relational software evolution [13], wecan conﬁrm that while the projects diﬀer in their life-spans and commit activity,in nearly all projects, the NoSQL schema grows over time. However, there may bephases of refactoring, leading to dips in the curves. Apparently, Schema-LoClends itself nicely as a proxy-metric, and we obtain high correlation coeﬃcients We ﬁnd this proxy-metric preferable over counting (schema-relevant) attributes, asis common in studies on relational schema evolution: (1) Entity-classes with moreschema-relevant attributes have more lines of code accordingly. (2) In static codeanalysis, we cannot reliably count nested attributes: Abstract container classes andthe use of polymorphism in general, make it impossible to know the number and na-ture of nested attributes at compile time. With Schema-LoC, we are able to abstractfrom this issue.esign and Evolution of NoSQL Database Schemas 9

Fig. 4.

Visualizing relative schema sizes and churn. Each rectangle represents an entity-class, its area proportional to its size in lines of code (speciﬁcally Schema-LoC). Thehue represents the relative frequency of schema changes within the same project. when comparing to the number of entity classes. As Schema-LoC depends onthe number of attributes in an entity class, we can retrace an eﬀect reportedin [13], namely that entity-classes and their attributes (corresponding to tablesand columns) have largely analogous dynamics. In general, the schema growsmore than it shrinks. This is in line with studies on relational schema evolution. One observation in [13] was that the schema stabilizes early: There, for 7 outof 10 projects, 60% of the maximum number of tables is reached in the ﬁrst 20%of the commits. Interestingly, in our study, the number of entity-classes reachesthe 60% in only 4 projects. In [13], less than 2% of all commits contain validschema changes (across all ten projects analyzed there). In our study, the shareof commits with schema-relevant changes is between 2.8% and over 30%, with 4projects reaching over 20%. Clearly, we observe higher churn rates.

We ﬁrst investigate how often entity-classes undergo schema changes when com-pared to others inside the same project, and how large they are in terms of ourproxy metric Schema-LoC. Figure 4 visualizes the entity classes making up theten NoSQL schemas as a tree map. This ﬁgure is best viewed in color. Eachcolored area represents one project. Inside, each rectangle represents one entity-class, the area proportional to its Schema-LoC. Darker hue indicates that anentity-class has undergone more schema changes than the other entity-classesin the same project. For instance, for nomulus , the darkest area represents 12schema changes against the same entity-class. Thus, some entity-classes changequite more often than others. However, there are also projects where schemachanges aﬀect entity-classes quite uniformly.

0% 20% 40% 60% 80% 100%Nekorp/Tikal-TechnologynareshPokhriyal86/testinggoogle/nomulusFraunhoferCESE/madcapCryptonomica/cryptonomica O b j e c t i f y Add entity-classRemove entity-class Add schema-relevant attributeRemove schema-relevant attribute Change schema-relevant attribute0% 20% 40% 60% 80% 100%MKLab-ITI/simmoGBPeters/PubIntcatedrasaes-umu/NoSQLD...bujilvxing/QinShihuangaltiplanogao/tallyframework M o r p h i a (a) By project. Entity-class Schema-relevant attributeadd remove add remove changeObjectify

Morphia

Overall (b) Objectify-based vs. Morphia-based projects.

Project Type Initialization Annotations O b j e c t i f y Cryptonomica/cryptonomica 2 0 3FraunhoferCESE/madcap 9 0 6google/nomulus 11 2 58nareshPokhriyal86/testing 0 0 0Nekorp/Tikal-Technology 0 0 3 M o r ph i a altiplanogao/tallyframework 7 3 54bujilvxing/QinShihuang 32 33 15catedrasaes-umu/NoSQLDataEngineering 43 0 13GBPeters/PubInt 0 2 2MKLab-ITI/simmo 7 5 18 (c) Drill-down into the remaining changes to schema-relevant attributes. Fig. 5.

Distinguishing diﬀerent kinds of schema changes: (a) and (b): Relative shares ofschema changes (by project and by mapper library). (c) Zooming in on the remainingchanges in schema-relevant attributes mentioned in (a), showing absolute values.

In Figure 5, we capture the distribution of schema changes according tothe kind of change. In Subﬁgure 5a (after Qiu et al. in [13]), we break downthe distribution of changes by project. Note that when a new entity-class isadded, we do not count this as adding attributes at the same time. Notably, thedistributions are project-speciﬁc. We now discuss two projects that stand out.In the fourth Objectify-based project, adding an entity-class makes up fornearly all changes. Considering the project characteristics in Table 1 reveals that esign and Evolution of NoSQL Database Schemas 11 this project is an outlier in several regards: With only 25 commits, it has barelymade the bar for being considered in our analysis (see Section 3.1). At the sametime, this project holds the second largest number of entity-classes in any projectconsidered in this analysis. Since the life cycle considered is only two months,this project is in a very early stage of development at the time of this analysis.Thus, it seems plausible that at this early phase, the developers kick start theirdata model by declaring the entity classes in bulk.In contrast, the second Morphia-based project stands out as the project withthe least share of entity class additions. Since the git commit messages are inChinese (which the authors of this paper do not master), we ﬁnd it diﬃcult toretrace the developers’ motivation. What is noticeable in Subﬁgure 3b is thatwhile the number of entity classes increases in less than 10 distinct steps, theproxy metric Schema-LoC changes in more ﬁne-granular steps. Thus, the entity-classes undergo more frequent changes. This matches the distribution plotted,as then the share of entity-class creations is smaller by comparison. Subﬁgure 5bsummarizes Subﬁgure 5a, and aggregates the changes by the mapper library.While we see project-speciﬁc ﬂuctuations, when we group by mapper library, wealso observe diﬀerences in the distribution. Overall, additions (whether of entityclasses or of schema-relevant attributes) dominate.In Table 5c, we break down the schema-relevant attribute changes listed inSubﬁgures 5a and 5b: (a)

For some projects, types change. (b)

For others, theinitialization changes. A drill-down reveals that (as may be expected) addingan initial value is the most frequent change, followed by changing the initializa-tion value. (c)

In other cases, mapper annotations that aﬀect the schema areadded or removed. The most frequent annotations added are @PersistField and @Reference . The ﬁrst is from a third-party framework. Since it is schema-relevant, we report it. The second supports referential constraints. Sporadically,third-party annotations are added to declare additional constraints, such as @Min . Results. We can conﬁrm the observations from related work on relationalschema evolution that schema changes are generally not distributed uniformly [13,24]. As already observed for RQ2, the trend is that entity-classes are addedmore frequently than they are removed. We see a similar pattern for schema-relevant attributes, in line with studies on relational schema evolution. Over-all, in 9 out of 10 projects, additions collectively account for more than 50%of the changes. In 5 projects, they even account for over 70% of the changes. While additions are generally more frequent, there are also projects whereremovals of entity classes occur to a non-signiﬁcant degree. Related work onrelational schema evolution has shown that there are what the authors call sur-vivor tables [22], whereas there are that are more short-lived. The observationthat entity-class removals are very project speciﬁc has also been made in [13]. Among all annotation changes, only 15 concern referential constraints (an-notation @Reference ). The authors of two relate studies on relational schemaevolution, both [13] and [21], have observed that changes concerning referentialintegrity constraints are also rare in relational schema evolution. With NoSQLdata stores, this is to be expected, as referential integrity is not supported to the same extent. While Qiu et al. [13] found changes in attribute types tobe the number one change for half of the projects analyzed (even outnumberingadditions of either tables or columns), we do not see evidence of this eﬀect here.

We can reproduce the main results from related work on relational schema evo-lution: There is strong evidence of NoSQL schema evolution, and additions aredominant schema changes. However, we do not see the schema stabilizing inthe early phases of all projects, which may partly be due to shorter project lifespans: The ten projects studied in [13] are PHP applications backed by rela-tional databases, and have longer life cycles (two with ten years), more commits(starting at nearly 5K), and more lines of code. This is to be expected with amuch older and thus more widely adopted stack.Still, we do suspect that NoSQL developers evolve their schema more continu-ously. One indicator supporting this hypothesis is that we see higher churn rates ,so a larger share of the commits contains code changes that aﬀect the schema.This calls for further study. Due to this churn, making sure that entity-class dec-larations are “backwards” compatible with legacy entities, persisted by earlierversions of the application code, may become an overwhelming task. There areﬁrst proposals for assisting tools, e.g., by type-checking versions of entity-classdeclarations [3]. Clearly, more research is needed on systematic tool support.The fact that denormalization is common shows that solutions for managingrelational schema evolution, managing ﬂat tuples, will not transfer immediately.Rather, when devising frameworks, we may want to turn to related work onframeworks for handling schema evolution in XML (e.g. [9]) or object-orienteddatabases (e.g. [26]) for inspiration on what has shown to be feasible.

Construct validity. (1)

With applications using older versions of Objectify andMorphia, we cannot rely on the @Entity -annotation to identify entity-classes, sowe also consider the @Id annotation. To be conﬁdent that this does not lead tofalse positives, we performed manual checks. (With Objectify, we cross-checkedwhich entity-classes were registered with

ObjectifyService , a mandatory pro-gramming step.) (2)

In static analysis, we encounter a limitation with attributetypes from third-party libraries. Tracking down these libraries is out of scope(and not even possible in all cases). Thus, there are attributes that are not fullycaptured by Schema-LoC. Yet as this is a proxy-metric to start with, we con-sider this threat acceptable. Third-party libraries also aﬀect the recognition ofentity-classes as denormalized. Having sampled and inspected the entity-classdeclarations, we are conﬁdent that – given the limitations of static code analysis– the risk of false positives is acceptable. (3)

We treat each single commit ascontributing to a new version of the schema. There are software developmentteams that operate by continuous deployment, so tested code is immediately esign and Evolution of NoSQL Database Schemas 13 and autonomously deployed to the production environment. There, in theory,each commit containing a schema change comprises a new schema version. Yetrather often, a release to production comprises more than one commit. Unfor-tunately, we are not able to tell in static code analysis which commits wherereleased when. There are development teams that tag release commits, but thisis project-internal culture, and not consistently the practice across all ten stud-ied projects. Therefore, we must go by the simplifying assumption that eachcommit declares a new NoSQL database schema.

External validity.

We next discuss threats in generalizing our results to othersoftware stacks. (1)

It would be desirable to search additional code repositories,and extend to further NoSQL data stores, object mapper libraries, and program-ming languages. (2)

Extending our analysis to projects that do not use object-mappers requires a diﬀerent kind of static code analysis, and was implementedin a related study that involved a single MongoDB project [12]. At the sametime, object mappers are state-of-the art in modern application development,and by now, Objectify and Morphia are actually part of oﬃcial Datastore andMongoDB tutorials (even though they started as independent projects). Thus,we do analyze a highly relevant stack. (3)

There is the fundamental questionwhether studies on open source projects generalize to commercial projects.

Database schema evolution is a timeless research area, with various proposalshow to systematically manage schema changes. Providing tool support, however,is not the scope of this paper. In the following discussion of related work, wetherefore focus on empirical studies on schema evolution in open source projects.It is only natural that the availability of public code repositories has en-abled empirical studies on relational schema evolution [5,11,13,18–20,22,23,25].Among their key ﬁndings, these studies show that the schema evolves. Theyconﬁrm that adding tables or columns are frequent changes. In these settings,the schema is speciﬁed declaratively (usually in SQL). Accordingly, the term schema modiﬁcation operations (SMOs) [5] does not transfer well to our stack.Rather than declarative DDL statements, we need to parse raw Java code: Whilethe authors of [25] also parse application code, they do so to extract declarativestatements embedded in code.So far, there are only few empirical studies on schema evolution in NoSQLdata stores. Our work builds on an earlier analysis [14] on the adoption of map-per annotations for lazy schema evolution, which is a diﬀerent focus. The au-thors in [12] present an approach for identifying a schema evolution history inMongoDB-based Java applications. Diﬀerent from us, the authors do not assumethat an object-NoSQL mapper is used to access the data store. Rather, they an-alyze direct calls to the MongoDB API. The schema derived is similar to ournotion of the NoSQL schema, since it captures the perspective of the applicationcode. The authors evaluated their approach for a single open source project, whereas our study has a broader basis, considering ten projects. Moreover, theircontribution is to derive a visualization of the schema evolution history.Meanwhile, there is a growing body of work on extracting schema descrip-tions [1, 4, 8] from large collections of JSON data. While this a bottom-up ap-proach, starting from the data, we proceed top-down, analyzing application code.In capturing schema complexity based on Java class declarations, we couldhave resorted to software metrics [10]. However, it is not clear how metricsindicating an overly complex object-oriented design (e.g. classes with many at-tributes) transfer. The practice of building aggregate models in NoSQL schemadesign may actually be orthogonal.We refer to [7] for a high-level discussion on schema variety versus codevariety, as well as metrics for programmatic schema analysis.

In this paper, we present the study on NoSQL schema evolution with the largestdata basis so far, analyzing ten real-world, open source projects. We track theschema growth as well as the nature of changes to the NoSQL schema. We areable to reproduce most of the insights of related studies on relational schemaevolution, but we have also identiﬁed subtle diﬀerences.Since this is a ﬁrst systematic study, many interesting questions remain unan-swered. We remark on two. (1) Originally, we set out to compile detailed statis-tics on the structure of denormalized entity-classes, such as their nesting depth.However, we found that Java code written by experienced developers (e.g., as isthe case with Google’s nomulus project) is highly polymorphic. This makes itimpossible to compute reliable statistics based on the static analysis of entity-class declarations. However, more holistic analysis techniques, such as data ﬂowanalysis of the entire application code, might reveal further insights. (2) We seeevidence that the schema evolves, but we do not know the factors that inﬂuenceNoSQL schema evolution. This calls for follow-up work, where we take the git commit messages into account, which often comment the reason for a schemachange. What is also needed are qualitative studies, surveying developers whoroutinely deal with NoSQL schema evolution.

Acknowledgements

This project was funded by the

Deutsche Forschungsgemein-schaft (DFG, German Research Foundation), grant number

References

1. Baazizi, M., Colazzo, D., Ghelli, G., Sartiani, C.: Parametric schema inference formassive JSON datasets. VLDB J. (4), 497–521 (2019)2. Bellahsene, Z., Bonifati, A., Rahm, E.: Schema Matching and Mapping. SpringerPublishing Company, Incorporated, 1st edn. (2011)3. Cerqueus, T., Cunha de Almeida, E., Scherzinger, S.: Safely Managing Data Vari-ety in Big Data Software Development. In: Proc. BIGDSE’15 (2015)4. Chilln, A.H., Ruiz, D.S., Molina, J.G., Morales, S.F.: A Model-Driven Approachto Generate Schemas for Object-Document Mappers. IEEE Access , 59126–59142(2019)5. Curino, C.A., Tanca, L., Moon, H.J., Zaniolo, C.: Schema evolution in Wikipedia:Toward a Web Information System Benchmark. In: Proc. ICEIS’08 (2008)esign and Evolution of NoSQL Database Schemas 156. Fowler, M.: Patterns of Enterprise Application Architecture. Addison-Wesley Long-man Publishing Co., Inc., Boston, MA, USA (2002)7. Jain, S., Moritz, D., Howe, B.: High variety cloud databases. In: Proc. ICDE Work-shops 2016 (2016)8. Klettke, M., St¨orl, U., Scherzinger, S.: Schema Extraction and Structural OutlierDetection for JSON-based NoSQL Data Stores. In: Proc. BTW’15 (2015)9. Kl´ımek, J., Mal´y, J., Necask´y, M., Holubov´a, I.: eXolutio: Methodology for Designand Evolution of XML Schemas Using Conceptual Modeling. Informatica, Lith.Acad. Sci. (3), 453–472 (2015)10. Lanza, M., Marinescu, R.: Object-Oriented Metrics in Practice: Using SoftwareMetrics to Characterize, Evaluate, and Improve the Design of Object-OrientedSystems. Springer Publishing Company, Incorporated, 1st edn. (2010)11. Lin, D.Y., Neamtiu, I.: Collateral Evolution of Applications and Databases. In:Proc. IWPSE-Evol’09 (2009)12. Meurice, L., Cleve, A.: Supporting schema evolution in schema-less NoSQL datastores. In: Proc. SANER’17 (2017)13. Qiu, D., Li, B., Su, Z.: An Empirical Analysis of the Co-evolution of Schema andCode in Database Applications. In: Proc. ESEC/FSE’13 (2013)14. Ringlstetter, A., Scherzinger, S., Bissyand´e, T.F.: Data Model Evolution Usingobject-NoSQL Mappers: Folklore or State-of-the-art? In: Proc. BIGDSE’16 (2016)15. Sadalage, P.J., Fowler, M.: NoSQL Distilled: A Brief Guide to the Emerging Worldof Polyglot Persistence. Addison-Wesley Professional, 1st edn. (2012)16. Sanderson, D.: Programming Google App Engine with Java: Build & Run ScalableJava Applications on Google’s Infrastructure. O’Reilly Media, Inc., 1st edn. (2015)17. Scherzinger, S., Cerqueus, T., Cunha de Almeida, E.: ControVol: A framework forcontrolled schema evolution in NoSQL application development. In: ICDE 2015.pp. 1464–1467 (2015)18. Sjøberg, D.: Quantifying schema evolution. Information & Software Technology (1), 35–44 (1993)19. Skoulis, I., Vassiliadis, P., Zarras, A.V.: Open-Source Databases: Within, Outside,or Beyond Lehman’s Laws of Software Evolution? In: Proc. CAiSE 2014. pp. 379–393 (2014)20. Skoulis, I., Vassiliadis, P., Zarras, A.V.: Growing Up with Stability. Inf. Syst. (C), 363–385 (Oct 2015)21. Vassiliadis, P., Kolozoﬀ, M., Zerva, M., Zarras, A.V.: Schema evolution and foreignkeys: a study on usage, heartbeat of change and relationship of foreign keys to tableactivity. Computing (10), 1431–1456 (2019)22. Vassiliadis, P., Zarras, A.V.: Survival in Schema Evolution: Putting the Lives ofSurvivor and Dead Tables in Counterpoint. In: Proc. CAiSE 2017. pp. 333–347(2017)23. Vassiliadis, P., Zarras, A.V., Skoulis, I.: How is Life for a Table in an EvolvingRelational Schema? Birth, Death and Everything in Between. In: Proc. ER 2015.pp. 453–466 (2015)24. Vassiliadis, P., Zarras, A.V., Skoulis, I.: Gravitating to rigidity: Patterns of schemaevolution - and its absence - in the lives of tables. Inf. Syst.63