DBOS: A Proposal for a Data-Centric Operating System
Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia
The DBOS Committee∗
[email protected]

∗ DBOS committee members in alphabetic order: Michael Cafarella (MIT CSAIL), David DeWitt (MIT CSAIL), Vijay Gadepally (MIT LLSC), Jeremy Kepner (MIT LLSC), Christos Kozyrakis (Stanford University), Tim Kraska (MIT CSAIL), Michael Stonebraker (MIT CSAIL), and Matei Zaharia (Stanford University).

Abstract
Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today's most important applications and propose a plan for the development of a DBOS proof of concept.
Current operating systems have evolved over the last forty years into complex overlapping code bases [70, 4, 51, 57], which were architected for very different environments than exist today. The cloud has become a preferred platform, for both decision support and online serving applications. Serverless computing supports the concept of elastic provision of resources, which is very attractive in many environments. Machine learning (ML) is causing many applications to be redesigned, and future operating systems must intimately support such applications. Hardware is becoming massively parallel and heterogeneous. These "sea changes" make it imperative to rethink the architecture of system software, which is the topic of this paper.

Mainstream operating systems (OSs) date from the 1980s and were designed for the hardware platforms of 40 years ago, consisting of a single processor, limited main memory and a small set of runnable tasks. Today's cloud platforms contain hundreds of thousands of processors, heterogeneous computing resources (including CPUs, GPUs, FPGAs, TPUs, SmartNICs, and so on) and multiple levels of memory and storage. These platforms support millions of active users that access thousands of services. Hence, the OS must deal with a scale problem of 10^5 or 10^6 more resources to manage and schedule. Managing OS state is a much bigger problem than 40 years ago in terms of both throughput and latency, as thousands of services must communicate to respond in near real-time to a user's click [21, 5].

Forty years ago, there was little thought about parallelism. After all, there was only one processor. Now it is not unusual to run Map-Reduce or Apache Spark jobs with thousands of processes using millions of threads [13]. Stragglers creating long tails inevitably result from substantial parallelism and are the bane of modern systems: incredibly costly and nearly impossible to debug [21].

Forty years ago programmers typically wrote monolithic programs that ran to completion and exited. Now, programs may be coded in multiple languages, make use of libraries of services (like search, communications, databases, ML, and others), and may run continuously with varying load. As a result, debugging has become much more complex and involves a flow of control in multiple environments. Debugging such a network of tasks is a real challenge, not considered forty years ago.

Forty years ago there was little-to-no thought about privacy and fraud. Now, GDPR [73] dictates system behavior for Personally Identifiable Information (PII) on systems that are under continuous attack. Future systems should build in support for such constructs. Moreover, there are many cases of bad actors doctoring photos or videos, and there is no chain of provenance to automatically record and facilitate exposure of such activity.

Machine learning (ML) is quickly becoming central to all large software systems. However, ML is typically bolted onto the top of most systems as an afterthought. Application and system developers struggle to identify the right data for ML analysis and to manage synchronization, ordering, freshness, privacy, provenance, and performance concerns.
Future systems should directly support and enable AI applications and AI introspection, including first-order support for declarative semantics for AI operations on system data.

In our opinion, serverless computing will become the dominant cloud architecture. One does not need to spin up a virtual machine (VM), which will sit idle when there is no work to do. Instead, one should use an execution environment like Amazon Lambda. (In this paper, we use Lambda as an exemplar of any resource allocation system that supports "pay only for what you use.") Lambda is an efficient task manager that encourages one to divide up a user task into a pipeline of several-to-many subtasks. Resources are allocated to a task when it is running, and no resources are consumed at other times. In this way, there are no dedicated VMs; instead there is a collection of short-running subtasks. As such, users only pay for the resources that they consume and their applications can scale to thousands of functions when needed. We expect that Lambda will become the dominant cloud environment unless the cloud vendors radically modify their pricing algorithms. Lambda will cause many more tasks to exist, creating a more expansive task management problem.

Lastly, "bloat" has wreaked havoc on elderly OSs, and the path lengths of common operations such as sending a message and reading bytes from a file are now uncompetitively expensive. One key reason for the bloat is the uncontrolled layering of abstractions. Having a clean, declarative way of capturing and operating on operating system state can help reduce that layering.

These changed circumstances dictate that system software should be reconsidered. In this proposal, we explore a radically different design for operating systems that we believe will scale to support the performance, management and security challenges of modern computing workloads: a data-centric architecture for operating systems built around clean separation of all state into database tables, and leveraging the extensive work in DBMS engine technology to provide scalability, high performance, ease of management and security. We sketch why this design could eliminate many of the difficult software engineering challenges in current OSes and how it could aid important applications such as HPC and Internet service workloads. In the next seven sections, we describe the main tenets of this data-centric architecture. Then, in Section 9, we sketch a proposal concerning how to move forward.

One of the main reasons that current operating systems are so hard to scale and secure is the lack of a single, centralized data model for OS state. For example, the Linux kernel contains dozens of different data structures to manage the different parts of the OS state, including a process table, scheduler, page cache, network packet queues, namespaces, filesystems, and many permissions tables. Moreover, each of the kernel components offers different interfaces for management, such as the dozens of APIs to monitor system state (/proc, perf, iostat, netstat, etc.). This design means that any efforts to add capabilities to the system as a whole must be Herculean in scope. For example, there has been more than a decade of effort to make the Linux kernel more scalable on multicores by improving the scalability of one component at a time [11, 10, 50, 51], which is still not complete. Likewise, it took years to add uniform security management interfaces to Linux, AppArmor [6] and SELinux [64], that have to be kept in sync with changes to the other kernel components.
It similarly took years to enable DTrace [16], a heavily engineered and custom language for querying system state developed in Solaris, to run on other OSs. The OS research community has also proposed numerous extensions to add powerful capabilities to OSs, such as tracing facilities [23, 35], tools for undoing changes made by bad actors [45], and new security models [78, 66], but these remain academic prototypes due to the engineering cost of integrating them into a full OS.

To improve the scalability, security and operability of OSes, we propose a data-centric architecture: designing the OS to explicitly separate data from computation, and centralize all state in the OS into a uniform data model. In particular, we propose using database tables, a simple data model that has been used and optimized for decades, to represent OS state. With the data-centric approach, the process table, scheduler state, flow tables, permissions tables, etc. all become database tables in the OS kernel, allowing the system to offer a uniform interface for querying this state. Moreover, the work to scale or modify OS behavior can now be shared among components. For example, if the OS components access their state via table queries, then instead of reimplementing dozens of data structures to make them scalable on multicores, it is enough to scale the implementations of common table operations. Likewise, new debugging or security features can be implemented against the tabular data model once, instead of requiring separate integration work with each OS component. Finally, making the OS state explicitly isolated also enables radical changes in OS functionality, such as support for zero-downtime updates [3, 59], distributed scale-out [63, 7], rich monitoring [16, 2], and new security models [78, 66].

To manage the state in a data-centric operating system, we will require a scalable and reliable implementation of database tables. For this purpose, we simply recommend building the OS over a scale-out DBMS engine, leveraging the decades of engineering and operational experience running mission-critical applications. In other words, we suggest building a database operating system (DBOS). While the DBMS engine will need some basic resource management functionality to bootstrap its execution, this could be done over a cluster of servers running current OSs, and eventually bootstrapped over the new DBOS. Today, DBMS engines already manage the most critical information in some of the largest computer systems on the planet (e.g. cloud provider control planes). Thus, we believe that they can handle the challenges in a next-generation OS. Moreover, recent trends such as support for polystores [68, 53] that combine multiple storage engines will enable the DBMS to use appropriate storage strategies for each of the wide range of data types in an OS, from process tables all the way to file systems.

In more detail, this DBOS approach results in several prescriptive suggestions, as discussed in the next section.

All OS state should be stored in tables in the DBMS.
Unix was developed with the mantra that "everything is a file". This mantra should be updated to "everything is a table", with first class support for high performance declarative semantics for query and AI operations on dense, sparse, and hypersparse tables [32, 28, 41, 37, 43, 15]. For example, there should be a task table with the state of every task known to the system, a flow table with ongoing network flows, a set of tables to represent the file system, etc. [38].
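To make the idea concrete, the following sketch (in Python, using SQLite purely as a stand-in for the scale-out DBMS engine) shows what minimal task, flow, and file tables might look like; every table and column name here is hypothetical, not a proposed schema.

import sqlite3

# SQLite stands in for the scale-out DBMS; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task (
    task_id    INTEGER PRIMARY KEY,
    node_id    INTEGER,      -- where the task is scheduled
    state      TEXT,         -- 'RUNNABLE', 'RUNNING', 'DONE', ...
    start_time REAL,
    end_time   REAL
);
CREATE TABLE flow (
    flow_id    INTEGER PRIMARY KEY,
    src_addr   TEXT, dst_addr TEXT,
    src_port   INTEGER, dst_port INTEGER,
    bytes_sent INTEGER DEFAULT 0
);
CREATE TABLE file (
    inode  INTEGER PRIMARY KEY,
    parent INTEGER,           -- parent directory inode
    name   TEXT,
    owner  INTEGER,
    size   INTEGER,
    atime  REAL               -- last access time
);
""")

# "Everything is a table": listing runnable tasks is just a query.
print(db.execute("SELECT task_id FROM task WHERE state = 'RUNNABLE'").fetchall())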
All changes to OS state should be through DBMS transactions.
The OS will need to include multiple routines in complex imperative code to implement APIs or complex resource management logic, but when these routines need to access OS state, we will require them to do so through DBMS transactions. This choice offers several benefits. First, parallelism and concurrency become easier to reason about because there is a transaction manager to identify conflicts. Second, computation threads in the OS can safely fail without corrupting system state, enabling a wide range of features including geographic distribution, improved reliability, and hot-swapping OS code. Third, transactions provide a natural point to enforce security and integrity constraints as is standard in DBMSs today.
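As a minimal sketch of this discipline (hypothetical, simplified task and node tables), admitting a new task is one transaction: either the task row is inserted and a core is reserved, or neither change happens.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task (task_id INTEGER PRIMARY KEY, node_id INTEGER, state TEXT);
CREATE TABLE node (node_id INTEGER PRIMARY KEY, free_cores INTEGER);
INSERT INTO node VALUES (1, 4);
""")

def spawn_task(task_id, node_id):
    # All OS state changes go through a DBMS transaction: the task is recorded
    # AND a core is reserved, or the whole update rolls back.
    with db:  # sqlite3 commits on success, rolls back on exception
        cur = db.execute("UPDATE node SET free_cores = free_cores - 1 "
                         "WHERE node_id = ? AND free_cores > 0", (node_id,))
        if cur.rowcount == 0:
            raise RuntimeError("no free core on node %d" % node_id)
        db.execute("INSERT INTO task VALUES (?, ?, 'RUNNABLE')", (task_id, node_id))

spawn_task(7, 1)
print(db.execute("SELECT * FROM task").fetchall())
print(db.execute("SELECT * FROM node").fetchall())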
The DBMS should be leveraged to perform all functions of which it is capable.
For example, files should be supported as blobs and tables in the DBMS. As a result, file operations are simply queries or updates to the DBMS. File protection should be implemented using DBMS security features such as view-based access controls for complex security policies. In other words, there should only be ONE extensible security system, which will hopefully be better at avoiding configuration errors and leaks than the sprawl of configuration tools today. Authentication should similarly be done only once using DBMS facilities. Finally, virtualization and containerization features can elegantly be implemented using database views: each container simply acts on a view of the OS state tables restricted to objects in that container.

As a result, ALL system data should reside in the DBMS. To achieve very high performance, the DBMS must leverage sophisticated caching and parallelization strategies and compile repetitive queries into machine code [2], as is being done by multiple SQL DBMSs, including Redshift [3]. A DBMS supports transactions, so ALL OS objects should be transactional. As a result, transactions are implemented just once, and used by everybody.
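As a toy illustration of "file operations are simply queries" (not the proposed implementation; the file_data table and read() helper are hypothetical), a byte-range read becomes a substr() query over a blob column.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file_data (inode INTEGER PRIMARY KEY, data BLOB)")
db.execute("INSERT INTO file_data VALUES (1, ?)", (b"hello, dbos world",))

def read(inode, offset, length):
    # A file read is just a query; SQL substr() is 1-indexed.
    row = db.execute("SELECT substr(data, ?, ?) FROM file_data WHERE inode = ?",
                     (offset + 1, length, inode)).fetchone()
    return row[0] if row else None

print(read(1, 7, 4))   # b'dbos'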
Decision support capabilities are facilitated.
OSs currently perform many decision support and monitoring tasks. These include:

• Choosing the next task to run
• Discovering stragglers in a parallel computation
• Finding over- (or under-) loaded resources
• Discovering utilization for the various resources
• Predicting bottlenecks in real-time systems

All of these can be queries to the DBMS; a straggler query is sketched below.
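For example, straggler detection reduces to a query like the following sketch (hypothetical task table; the "twice the average completed runtime" threshold is an arbitrary illustrative choice).

import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE task (task_id INTEGER, job_id INTEGER, state TEXT,
                                 start_time REAL, end_time REAL)""")
now = time.time()
# Nine tasks of job 42 finished in about 10 seconds; task 9 has run for 60 seconds.
db.executemany("INSERT INTO task VALUES (?, 42, 'DONE', ?, ?)",
               [(i, now - 100, now - 90) for i in range(9)])
db.execute("INSERT INTO task VALUES (9, 42, 'RUNNING', ?, NULL)", (now - 60,))

stragglers = db.execute("""
    SELECT task_id FROM task
    WHERE job_id = 42 AND state = 'RUNNING'
      AND ? - start_time > 2 * (SELECT AVG(end_time - start_time)
                                FROM task WHERE job_id = 42 AND state = 'DONE')
""", (now,)).fetchall()
print(stragglers)   # [(9,)]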
Performance optimization:
OS kernel subsystems have often undergone extensive refactoring to improve performance by changing the data structures used to manage various state [52, 75, 31, 69]. If the OS had been designed around a DBMS instead, many of these updates would amount to changing indexes or changing operator implementations in the DBMS (e.g., adding parallel versions of operators). Moreover, the DBMS approach would enable further methods to improve performance that are not implemented in OSes today, such as cost-based optimization (switching access paths for an operation based on the current data statistics and expected size of the operation) or adaptive mid-query reoptimization.
Security:
DBMS access control tools such as view-, attribute- and role-based ACLs [18, 74] can elegantly implement many of the security policies in SELinux, AppArmor and other OS security modules. Moreover, if these rules are implemented as view definitions or SQL statements within the DBMS, the security checking code can be compiled into the queries that regular OS operations run, instead of being isolated in a separate module that adds overhead to OS operations [48].
Virtualization and containerization:
Tremendous engineering effort has gone into enabling virtualization and containerization in OSes over the past decade, i.e., enabling a single instance of the OS to host multiple applications that each get the abstraction of an isolated system environment. These changes have generally required modifying all data structures and a large amount of logic in the kernel to support different "namespaces" of objects for each container. With DBOS, virtualization and containerization can elegantly be achieved using DBMS views: each container's DBMS queries only have access to a view that restricts to objects with that container ID, whereas a root user can have access to all objects. We believe that many queries and logic in OS components would not have had to be modified at all to add virtualization with this approach, other than being made to run on these views instead of on the raw OS state tables.
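A minimal sketch of the idea (hypothetical task table and container IDs): the root user queries the base table, while a container's queries are confined to a view over its own objects.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task (task_id INTEGER, container_id INTEGER, state TEXT);
INSERT INTO task VALUES (1, 100, 'RUNNING'), (2, 100, 'RUNNABLE'), (3, 200, 'RUNNING');

-- Container 100 only ever sees this view of the OS state.
CREATE VIEW task_c100 AS SELECT * FROM task WHERE container_id = 100;
""")

print(db.execute("SELECT task_id FROM task_c100").fetchall())   # [(1,), (2,)]
print(db.execute("SELECT task_id FROM task").fetchall())        # root sees every task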
Geographic distributability:
After all, nodes in a cloud vendor's offering are geographically distributed, and transactional replication is a desired service of cloud offerings. This can be trivially provided by a geographically dispersed DBMS. This is in keeping with "implement any function only once, in the interest of simplicity".
More sophisticated file management:
Since files are stored in the DBMS as blobs and tables, the directory structure is a collection of tables, and SQL access control is used for protection, the large amount of code that implements current file systems essentially disappears. Also, we claim that current DBMSs, which use aggressive query compilation and caching, have gotten a great deal faster than the DBMSs of yesteryear. Moreover, multinode main-memory DBMSs such as VoltDB and MemSQL are capable of tens of millions of simple transactions per second. Since a file read/write is just such a simple transaction, we believe that our proposed implementation can be performance competitive. In addition, more sophisticated file search becomes trivial to implement. For example, finding all files underneath a specific directory accessed in the last 24 hours that are more than 1 GByte in size is merely a SQL query, as sketched below. The net result is additional features, much less code and (hopefully) competitive performance.
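That query might look like the following sketch (hypothetical file table; a recursive common table expression walks the directory subtree).

import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE file (inode INTEGER PRIMARY KEY, parent INTEGER,
                                 name TEXT, size INTEGER, atime REAL)""")
now = time.time()
db.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", [
    (1, None, "/",       0,         now),
    (2, 1,    "data",    0,         now),
    (3, 2,    "big.bin", 5 * 2**30, now - 3600),        # 5 GB, accessed 1 hour ago
    (4, 2,    "old.bin", 3 * 2**30, now - 7 * 86400),   # accessed a week ago
])

# Files under inode 2 ("/data") accessed in the last 24 hours and larger than 1 GB.
rows = db.execute("""
    WITH RECURSIVE subtree(inode) AS (
        SELECT inode FROM file WHERE inode = 2
        UNION ALL
        SELECT f.inode FROM file f JOIN subtree s ON f.parent = s.inode)
    SELECT f.name FROM file f JOIN subtree s USING (inode)
    WHERE f.size > 1073741824 AND f.atime > ? - 86400
""", (now,)).fetchall()
print(rows)   # [('big.bin',)]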
Better scheduling:
There will be task and resource tables in the DBMS capturing what tasks run on cores, chips, nodes, and datacenters and what resources are available. Scheduling thousands of parallel tasks in such environments as Map-Reduce and Spark is mainly an exercise in finding available resources and stragglers, because running time is the time of the slowest parallel task. Finding outliers in a large task table is merely a decision support query that can be coded in SQL. Again, we believe that the additional functionality can be provided at a net savings in code.
Enhanced state management:
Using this approach it is straightforward to divide application state into two portions. The first is transient and can be stored in data structures external to the DBMS. The second is persistent and must be stored in the DBMS transactionally. Since replication will be provided for all DBMS objects, application failures can simply fail over to a new instance. This instance reads the persistent state from the DBMS and resumes the computation. This failover architecture was pioneered by Tandem Computers in the 1980s and can be provided nearly for free using our architecture.

Additional benefits accrue to this architecture by using a modern "server-less" application architecture, a topic which we defer to Section 8.
Data communications can be readily expressed as operations on a geographically distributed DBMS. A pull-based system can be supported by the sender writing a record into the DBMS and the receiver reading it. A push-based system can be supported by the sender writing to the DBMS and setting a trigger to alert the receiver when it becomes active. This can be readily extended to multiple senders and recipients. In addition, DBMS transactions support exactly-once messages. Such an approach significantly simplifies programming, allowing the programmer to easily implement non-blocking send programs that have been demonstrated to deliver bandwidth comparable to more complex messaging systems [36, 12].

The CPU overhead of conventional TCP/IP communication is considered onerous by most, and new lighter-weight mechanisms, such as RDMA and kernel-bypass systems, are an order of magnitude faster [9, 56]. Hence, it seems reasonable to build special-purpose lightweight communication systems whose only customer is the DBMS. This has already been shown to accelerate DBMS transactions by an order of magnitude, relative to TCP/IP in a local area networking environment [77], and it is possible that appropriate hardware could offer advantages of this approach in a wide area networking world. As such, it is an interesting exercise to see if a competitive messaging system can be done through the DBMS. It should also be noted that Amazon Lambda uses a storage-based communication system [72]. Of course, a performant implementation would use something much faster than S3, such as a multi-node main memory DBMS.

If this approach is successful, this will lower the complexity of future system software by replacing a heavyweight general-purpose system with a lightweight and optimized, special-purpose one. It seems highly likely that the approach will work well in a hardware-assisted LAN environment. WAN utilization seems more speculative.
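A pull-style sketch of such DBMS-mediated messaging (hypothetical message table; a push-based variant would add a trigger or notification): the sender inserts a row, and the receiver drains its inbox inside one transaction, which is also what makes consumption exactly-once.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE message (msg_id INTEGER PRIMARY KEY AUTOINCREMENT,
                                    sender TEXT, recipient TEXT, body BLOB)""")

def send(sender, recipient, body):
    with db:
        db.execute("INSERT INTO message (sender, recipient, body) VALUES (?, ?, ?)",
                   (sender, recipient, body))

def receive_all(recipient):
    # Read-and-delete in a single transaction: each message is consumed exactly once.
    with db:
        rows = db.execute("SELECT msg_id, sender, body FROM message WHERE recipient = ?",
                          (recipient,)).fetchall()
        db.execute("DELETE FROM message WHERE recipient = ?", (recipient,))
    return rows

send("taskA", "taskB", b"hello")
print(receive_all("taskB"))   # [(1, 'taskA', b'hello')]
print(receive_all("taskB"))   # []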
It is clear that privacy will be a future requirement of all system software. GDPR [73] is the European law that mandates "the right to be forgotten". In other words, Personally Identifiable Information (PII) that a service holds on an individual must be permanently removed upon a user request. In addition, data access must be based on the notion of "purposes". Purposes are intended to capture the idea that performing aggregation for reporting purposes is a very different use case than performing targeted advertising based on PII data. In SQL DBMSs access control is based on the notion of individuals and their roles. These constructs have nothing to do with purposes, and a separate mechanism is required. Obviously, this is a DBMS service. As noted in [29], a clean DBMS design can facilitate locating and deleting PII data inside the DBMS. However, one must also deal with the case where data is copied to an application and then sent to a second application. Since all communication between applications goes through the DBMS, this message can be recorded by the DBMS, allowing the DBMS to track PII data even when it goes out to applications. Of course, this will not prevent a malicious human from writing PII data to the screen and copying it outside of the system. To deal with these kinds of leaks, applications must be "sandboxed" either virtually or cryptographically, which can be readily incorporated into the database [25, 76, 60, 42, 58, 27].
Data provenance is key to addressing many of the ills of modern data-centric life. Consider the following problems:
Data forging:
Detecting whether a photograph is doctored has become impossible for the typical news consumer. Even if a news service wants to provide trustworthy authorship information about its articles and photos, it has no trustworthy way to do so. Simply signing a photograph at the time it was taken is not sufficient, since there are some data-mutating operations (such as cropping or color adjustment) that news organizations must perform before publication.
Data debugging:
Modern machine learning projects involve huge data pipelines, incorporating datasets and models from many different sources. Debugging pipeline output requires closely examining and testing these different inputs. Unfortunately, these inputs can come from partners with opaque engineering pipelines, or are incorporated in an entirely untracked manner, such as via a downloaded email attachment. As a result, simply enumerating the inputs to a data pipeline can be challenging, and fixing "root cause data problems" is frequently impossible.
Data spills:
Today, an inadvertent data revelation is an irreversible mistake. There is no such thing as cleaning up after a database of social security numbers is mistakenly posted online. Although data handling practices must and can be improved, ensuring total data privacy today is a very difficult and brittle problem.

Data consumption and understanding:
Much of modern life (as a professional, a consumer, and a citizen) consists of consuming and acting on data. The data processes that produce human-comprehensible outputs, such as the plots in a scientific article, are so complicated that it is quite easy for there to be errors that are undetectable even to the producer. Consider the case of economists Carmen Reinhart and Kenneth Rogoff, who in 2010 wrote an enormously influential article on public finance, cited by Representative Paul Ryan to defend a 2013 budget proposal, that was later found to be based on simplistic errors in an Excel spreadsheet [71]. The authors did not acknowledge the error until three years after the paper was first written. Responsible data use means people must be able to quickly examine and understand the processes that yield the data artifacts all around us.
Data policy compliance:
Datasets and models often carry policies about how they can be used. For example, a predictive medical model might be appropriate for some age populations, but not others. Unfortunately, it is impossible for anyone, whether a data artifact producer or consumer, to have confidence about how data is being used.

A strong data provenance system would help address all of the above problems. All data operations by a modern operating system, such as copying, mutating, transmitting, etc., should be tracked and stored for possible later examination. It should be impossible to perform operations on a modern OS that sidestep responsible data provenance tracking. Our proposed DBOS architecture effectively logs all such operations, allowing an authoritative chain of provenance to be recorded. (As with all the data the system collects, it will be stored in a DBMS.) This will support solutions to all of the above issues, requiring only log processing applications. Furthermore, first-class support for provenance throughout OS data structures will also simplify many system administration tasks, such as recovering from user errors or security breaches [19].
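As a sketch of what such logging enables (hypothetical provenance table), if every copy or transform is recorded as an edge from a source object to a derived object, then "everything derived, directly or indirectly, from this photograph" is a recursive query.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE provenance (op TEXT, src TEXT, dst TEXT, at REAL);
INSERT INTO provenance VALUES
  ('copy',      'raw_photo.jpg', 'crop.jpg',        1.0),
  ('transform', 'crop.jpg',      'article_fig.png', 2.0),
  ('copy',      'other.csv',     'report.pdf',      3.0);
""")

derived = db.execute("""
    WITH RECURSIVE lineage(obj) AS (
        VALUES ('raw_photo.jpg')
        UNION
        SELECT p.dst FROM provenance p JOIN lineage l ON p.src = l.obj)
    SELECT obj FROM lineage WHERE obj <> 'raw_photo.jpg'
""").fetchall()
print(derived)   # [('crop.jpg',), ('article_fig.png',)]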
Designing an operating system requires making assumptions about its future workload and data. These assumptions then materialize themselves as default parameters, heuristics, and various compromises. Unfortunately, all these decisions can significantly impact performance, especially if the assumptions turn out to be wrong. For example, if we assume that the OS mainly runs very short Lambda-like functions, then reducing the overhead of starting a Lambda function may be more critical than optimal scheduling. However, if we assume the workload is dominated by long-running, memory-intensive services, we require a very different scheduling algorithm, fair resource allocation strategies, and service migration techniques, whereas the startup time will matter very little.

Moreover, operating systems offer a variety of knobs to tune the system for a particular workload or hardware. While providing flexibility, all the options put a burden on the administrator to set the knobs correctly and to adjust them in case the workload, data, or hardware changes.

To overcome those challenges, we suggest that DBOS should be introspective, adaptable, and self-tuning through two design principles:
Knob-free design:
We believe that all parameters of the system should be designed to be self-tuning from the beginning. That is, DBOS will deploy techniques similar to SmartChoices [17] for all parameters and constants to make them automatically tunable. The key challenge in globally optimizing all these parameters is then to gather and analyze the state of the OS and the different components. Storing all this information in the OS database will significantly simplify the process and make true self-tuning possible.

Learned components:
To address a wide range of use cases, the system developer often has to make algorithmic compromises. For instance, every operating system requires a scheduling algorithm, but the chosen scheduling algorithm might not be optimal under all workloads or hardware types. In order to provide the best performance, we envision that the system is able to automatically switch the algorithm used, based on the workload and data. This would apply to scheduling, memory management, etc. [22, 20].

In some cases it might even be possible to learn the entire component or parts of it. For example, recent results have shown that it is sometimes possible to learn a scheduling algorithm that performs better than traditional, more static heuristics [54, 55]. This learning of components would allow the system to more readily adapt to the workload and data, and perhaps provide unprecedented performance.

To achieve a knob-free design and learned components, we suggest that DBOS needs to be designed from the beginning to be Reinforcement Learning (RL)-enabled. RL is the leading technique to tune knobs and build components based on observed behavior in an online fashion. Today, RL is usually added as an afterthought. This leads to several problems, including difficulty in finding the right reward function or supporting the required RL exploration phase. In many cases this requires the extra work of building a simulator or a lightweight execution environment to try out new approaches. By making RL a first-class citizen in the system design, we believe that we can overcome these challenges. Moreover, managing all state data in a database and making it analyzable will again be a key enabler for this effort.

If successful, the resulting system would be able to quickly adapt itself to changing conditions and provide unprecedented performance for a wide range of workloads while making the administration of the system considerably easier.
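As a deliberately tiny sketch of the flavor of such a learned component (an epsilon-greedy bandit over two policies, not a full RL agent; the table, policies, and tuning rule are all illustrative assumptions), observed outcomes live in the OS database and drive the next choice.

import random, sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sched_outcome (policy TEXT, latency_ms REAL)")
db.executemany("INSERT INTO sched_outcome VALUES (?, ?)",
               [("fifo", 120.0), ("fifo", 130.0), ("shortest_job_first", 80.0)])

POLICIES = ["fifo", "shortest_job_first"]

def choose_policy(epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the policy with the
    # lowest observed mean latency, read straight from the OS database.
    if random.random() < epsilon:
        return random.choice(POLICIES)
    row = db.execute("""SELECT policy FROM sched_outcome
                        GROUP BY policy ORDER BY AVG(latency_ms) LIMIT 1""").fetchone()
    return row[0] if row else random.choice(POLICIES)

def record_outcome(policy, latency_ms):
    # New measurements flow back into the database, closing the tuning loop.
    with db:
        db.execute("INSERT INTO sched_outcome VALUES (?, ?)", (policy, latency_ms))

p = choose_policy()
record_outcome(p, 95.0)
print(p)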
Managing compute, storage, and communication hardware is a primary function for an operating system. The key abstractions in existing operating systems were developed for the homogeneous hardware landscape of the last century. Kernel threads (processes), virtual memory, files, and sockets were sufficient to abstract and manage single-core computers with limited main memory backed by a slow hard disk, connected with low-bandwidth, high-latency networking.

Present-day hardware looks radically different. A single server machine contains tens to hundreds of cores in one or more chips, terabytes of main memory across a dozen channels, and multiple storage devices (SSDs and HDDs). The end of Dennard scaling [49] and the ascent of machine learning applications has led to the introduction of domain-specific accelerators like GPUs and TPUs, each with its own primitives for massively parallel computation and high-bandwidth memory [33]. The end of scaling for DRAM technology is motivating multi-level main memory systems using storage-class memories (SCM) [34]. Network interfaces allow direct access to remote memory at speeds faster than local storage. Beyond the single node, concepts such as multi-cloud, edge cloud, globally replicated clouds, and hardware disaggregation introduce heterogeneity in the type and scale of hardware resources. Existing operating systems were not designed for such scales or heterogeneity. This shortcoming is a primary culprit for the software bloat in applications and operating systems, including kernel-bypass subsystems. Solutions have limited portability and are difficult to understand, debug, and reuse.

Placing the operating system state in a DBMS introduces two properties that are useful in managing heterogeneous hardware. First, it clearly separates compute from data access. The operating system can manage data placement, caching, replication, and synchronization separately from the accelerated functions that operate on it. Second, it clearly separates control-plane from data-plane actions. One can improve or customize control-plane operations, such as scheduling, independently of the compute implementation using the best available accelerators.

To run efficiently on heterogeneous hardware, DBOS will be designed around two key principles.

Accelerated interfaces to DBMS:
DBOS will implement the interfaces that allow heterogeneous hardware to interact with the DBMS, hiding the overall system scale and complexity. For example, the interface to a compute accelerator like a TPU can be a query that applies a user-defined function (UDF). The accelerator implements the UDF, while DBOS implements the query that involves preparing inputs and outputs. This interface remains constant regardless of whether the accelerator is local, disaggregated, or in a remote datacenter. The accelerator state is stored in the DBMS to facilitate scheduling and introspection. DBOS will directly manage memory and storage layers, as part of the DBMS resources available for data sharing, replication, or caching. DBOS interfaces will leverage existing hardware mechanisms, such as virtual memory, as well as emerging mechanisms such as zero-copy/direct memory access networking interfaces or coherent fabrics (CXL). Over time, hardware mechanisms will evolve to further accelerate the interactions between the DBMS and heterogeneous hardware. For example, SmartNICs will be optimized to accelerate DBMS interfaces, not just RDMA protocols, while GPUs and TPUs will directly support DBMS data operations.
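To illustrate only the shape of this interface (SQLite's create_function stands in for dispatching a UDF to a local or remote accelerator; the classify function and image table are hypothetical), the caller simply issues a query that applies the UDF and never sees where it runs.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE image (img_id INTEGER PRIMARY KEY, pixels BLOB)")
db.execute("INSERT INTO image VALUES (1, ?)", (bytes(range(16)),))

def classify(pixels):
    # Placeholder for work an accelerator (GPU/TPU) would perform; here, a trivial statistic.
    return float(sum(pixels)) / len(pixels)

db.create_function("classify", 1, classify)

# The interface is just a query applying the UDF, regardless of whether the UDF
# executes locally, on an attached accelerator, or in a remote datacenter.
print(db.execute("SELECT img_id, classify(pixels) FROM image").fetchall())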
Accelerating the DBMS itself:
The performance and scalability of DBOS itself rely heavily on the speed of DBMS operations. In addition to distributed execution and extensive caching, the DBMS will build upon modern hardware: accelerators, storage-class memory, and fast SmartNICs. Since all communication, dataplane, and control-plane operations interface with the DBMS, the deployment of specialized accelerators for common DB operations like joins, filters, and aggregations will likely become essential [1].
Historically, the programming model of choice was a single-threaded computation with execution interspersed with stalls for I/O or screen communication. This model effectively requires multitasking to fill in for the stalls. In turn, this requires interprocess protection and other complexity.

Instead, we would recommend that everybody adopt the Lambda model, popularized by AWS [72]. In other words, computation is done in highly parallel "bursts," and resources are relinquished between periods of computation [24]. This model allows one to give the CPU to one task at a time, eschewing multithreading and multiprogramming. In addition, parallel processing can be done with a collection of short-lived, stateless tasks that communicate through the DBMS. The DBMS optimizes the communication by locally caching and co-scheduling communicating tasks when possible. In effect, this is "server-less computing," whereby one only pays for resources that are used and not for long-lived tasks. Hence, under current cloud billing practices, this will save significant dollars.

That means DBOS should adopt the Lambda model as well. One should divide up a query plan into "steps" (operators). Each operator is executed (in parallel) and then dies. State is recorded in the DBMS. Sharding of the data allows operator parallelism. Each Lambda task is given an exclusive set of resources, e.g., one or more cores, until it dies. In the interest of simplicity and security, multi-tenancy and multi-threading may be turned off.

There is a sharded scheduling table in the DBMS. A task is runnable or waiting. The scheduler picks a runnable task, via a query, and executes it. When the task quits, the scheduler loops. This will work well as long as applications utilize the Lambda model.

Dynamic optimization in the OS is gated by the time it takes to stop, checkpoint, migrate, and restart applications/processes/threads. In the cloud, this is often minutes, which means that very little dynamic optimization is possible. Recent work has demonstrated that hand-coded fast launch (thousands of applications per second) is possible [61, 62]. This is all human-controlled static optimization [14]. The optimizing scheduler in DBOS should be able to do this dynamically and launch millions of applications per second [38].

Obviously, DBOS is a huge undertaking. An actual commercial implementation will take tens of person-years. As such, we need to quickly validate the ideas in this document. Hence, we discuss demonstrating the validity of the ideas and then discuss convincing the systems community that DBOS is worth the effort involved.
A key challenge is to show a DBMS capable of acceptable performance and scalability to form the foundation of DBOS. We believe that such a system should have the following characteristics:
Multi-core, multi-node executor:
Many DBMSs support this today.
Server-less architecture:
Commercial DBMSs are moving toward allocating CPU resources on a per-query basis. Snowflake has moved aggressively in this direction, building on a distributed file system (S3), aggressive caching, and sharding only for CPU resources [65].
Polystore architecture:
Clearly, DBOS will need to manage data from heterogeneous sources such as process tables, schedulers, network tables, namespaces, and many permissions tables. It is unlikely that any single data management system will be able to efficiently manage the diversity and scale of the associated data structures. Different OS functionality will naturally fit into different types of storage engines, and a polystore architecture [68, 53] can provide a single interface to these disparate and federated systems. A critical system characteristic would be to avoid developing a "one size fits all" [67, 26, 44] solution that is incapable of adapting as new types of data are collected and managed by DBOS.
Open source code:
Obviously any code in a DBOS prototype should be readily available.
Lambda-style, serverless runtime system:
This will facilitate optimizing resource allocation. Possible choices include SciDB, Presto, Accumulo, etc. We think the best option is to start with a prototype that comprises a DBMS built on an MIT Lambda-style system.

We view the key design choices of AWS Lambda as a reservation-free, fixed-resource service for short-lived functions, and we will embody those in our own system. Other choices in today's commercial version of Lambda, such as S3 as the exclusive storage system, or the lack of direct communication between functions, seem like they should be rethought. It is unclear whether uniform resource constraints on the Lambda functions are a key design choice, or whether the system should offer heterogeneous resource constraints to enable a more flexible development environment.

We would expect to replace S3 as the storage system with something much faster [46, 47], based on the discussion earlier. We expect that in one person-year we could demonstrate a LAN-based system along these lines. We would then expect to test the performance of this prototype in two contexts. The first goal is to provide file system performance comparable to today's systems. In addition, we expect to show our communication implementation can be comparable to or faster than traditional TCP/IP networking.

To bootstrap running the DBMS itself, we plan to rely on minimal operating systems that have already been designed for cloud environments, such as unikernels, Dune or IX [8, 9], which are designed to run one application at a time and to give it high-performance access to the hardware. We will also make sure that the DBMS runs on Linux systems for easy development. The main facilities that the DBMS needs to bootstrap are a boot and configuration process, network access (which can also be used for logging), threads, and an interface to access storage. In the latter case, because the DBMS will manage all large data structures, raw block access may be sufficient. Today's minimal OSes already support these facilities for hosting server applications as efficiently as possible in virtualized datacenters.

As a first example of using DBOS to improve current OS functionality, we will implement a data-centric log processing and monitoring infrastructure in DBOS that can monitor applications using existing OSes such as Linux. OSes, networks, schedulers, and file systems generate enormous amounts of logs and metadata which are mostly kept in raw files. Attempts to put these in databases (OS logs to Splunk; network logs to NetApp; scheduler logs to MySQL; file system metadata to MySQL) barely meet minimal auditing requirements.

A DBMS-based OS that organically stored these data in a high-performance database with first class support for dense, sparse, and hypersparse tables would be a huge win, as it would make these data readily analyzable and actionable. It would also be able to execute streaming queries to compute complicated monitoring views in real time in order to simplify system management; simple metrics such as "how many files has each user created" can sometimes take hours to run with today's file systems and OSes. Our team has conducted experiments showing that high-performance databases such as Apache Accumulo, SciDB, and RedisGraph can easily absorb this data while enabling analyses that are not currently possible [38, 15, 39, 40]. Examples include "All files touched by a user during a time window", "Largest 10 folders owned by a user", "Computing cycles consumed by an application during a time window", "Network traffic caused by a specific application", and so on.
These are very important questions for cloud operators, yet they are very difficult to answer and require custom-built tools. A DBMS OS should be able to answer these questions by design.
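For instance (hypothetical log tables), two of the questions above, the files touched by a user during a time window and the largest folders a user owns, reduce to short queries.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file_access (path TEXT, uid INTEGER, at REAL);
CREATE TABLE file (path TEXT, folder TEXT, owner INTEGER, size INTEGER);
INSERT INTO file_access VALUES ('/data/a.csv', 501, 1000.0), ('/data/b.csv', 501, 5000.0);
INSERT INTO file VALUES ('/data/a.csv', '/data', 501, 40), ('/home/x', '/home', 501, 10);
""")

# All files touched by user 501 during a time window.
print(db.execute("SELECT DISTINCT path FROM file_access "
                 "WHERE uid = 501 AND at BETWEEN 0 AND 2000").fetchall())

# Largest 10 folders owned by user 501.
print(db.execute("""SELECT folder, SUM(size) AS total FROM file
                    WHERE owner = 501 GROUP BY folder
                    ORDER BY total DESC LIMIT 10""").fetchall())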
In Section 7, we discussed DBMS support for heterogeneous hardware, GPUs and FPGAs, based on user-defined DBMS functions. Our plan is to implement a prototype of this functionality to demonstrate its feasibility and performance.

One of the defining features of modern datacenters is hardware heterogeneity. Far from being a uniform pool of machines, datacenters offer machines with different memory, storage, processing, and other capacities. Most notably, different machines offer vastly different accelerator capacities. Although GPUs for machine learning tasks comprise the most common class of accelerator, datacenters also contain FPGAs and other accelerators for video processing and encryption applications. These accelerators can be expensive: it is not feasible to outfit every machine in a large system with a top-flight GPU. Matching a heterogeneous workload to a heterogeneous pool of resources is a complicated and important task that is tailor-made for machine- rather than human-driven optimization.

To address this challenge, we need to first design the DBMS-based API in DBOS so that it allows for portability: the same user code can drive execution on a local or remote GPU. Next, we need to exploit the flexibility of Lambda-style task allocation and the visibility into system state through the DBMS in order to implement scheduling algorithms that utilize datacenter resources better than naive server-centric allocation schemes. We will demonstrate this functionality by running a range of workloads on small clusters and by simulating larger datacenter environments.

Since DBOS is designed around a distributed DBMS, it is a natural fit for data mining applications like the log processing discussed in Section 9.2. However, it is not as obvious a match for online-serving applications, such as social networks, e-commerce sites, and media services, that consume large fractions of cloud systems. These applications consist of tens to thousands of microservices that must quickly communicate and respond to user actions within tight service level objectives (SLOs) [30]. Some microservices are simple tasks, such as looking up session information, while others are complicated functions such as recommendation systems based on neural networks or search functions using distributed indices. Microservice applications form the bulk of software-as-a-service products today and are the most critical operational applications for many organizations.

We will prototype an end-to-end microservices workload, such as a Twitter-like social network, in order to evaluate DBOS's feasibility for these applications. During this process, we will answer two key questions. First, can DBOS support the computation and communication patterns of such latency-critical applications in a performant manner? Second, can DBOS help address the challenges in developing, scaling, and evolving such applications over time?

With DBOS, a social network will be implemented as a collection of serverless functions operating on multiple database tables. This presents multiple opportunities for performance optimization. For example, DBOS can colocate communicating functions to avoid remote communication, or selectively introduce new caching layers and indexes.
Accelerators are also now used in many components of microservice applications, such as recommendation engines for social network content and search result re-ranking, so we will use the accelerator management capabilities in Section 9.3 to automatically offload and optimize these tasks.

Finally, because DBOS uses a serverless model, data management decisions such as sharding and replicating datasets or evolving schemas are separated from the application code. This makes it significantly easier for application developers to implement architectural changes that are very difficult in microservice applications today. We will show how to use DBOS to easily implement several such architectural changes:

1. Changing the partitioning and schema of data in the application to improve performance (a common type of change that requires large engineering efforts in today's services).
2. Changing the partitioning of compute logic, e.g., moving from a "monolith" of co-located functions to separately scaling instances for different parts of the application logic.
3. Making the application GDPR-compliant, by storing each user's data in their geographic region and using the data provenance features of DBOS to track which data was derived from each user or delete it on demand.
4. Changing the security model (e.g., which users can see data from minors or from European citizens) without having to refactor the majority of application code.

We have presented a dramatically simpler view of systems software which avoids implementing the same functions in multiple components. Instead, the architecture bets on a sophisticated DBMS to implement most functionality. Section 9 suggested initial experiments to demonstrate feasibility. Obviously, these steps should be carried out first.

Following that, there are still many unanswered questions. The most notable one is "Can this scale to a million nodes?". To the best of our knowledge, nobody has built a distributed DBMS at this scale. Clearly, there will be unforeseen bottlenecks and inefficiencies to contend with. Managing storage for a 1M-node DBMS will be a challenge. A second question is "Can this be built to function efficiently?" Section 9 discussed the file system and IPC. However, memory management, caching, scheduling, and outlier processing are still issues. Obviously, the next step is to build a full-function prototype to answer these questions.

References

[1] S. R. Agrawal, S. Idicula, A. Raghavan, E. Vlachos, V. Govindaraju, V. Varadarajan, C. Balkesen, G. Giannikis, C. Roth, N. Agarwal, and E. Sedlar. A many-core architecture for in-memory data processing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 '17, page 245–258, New York, NY, USA, 2017. Association for Computing Machinery.
[2] D. Ardelean, A. Diwan, and C. Erdman. Performance analysis of cloud applications. In , pages 405–417, Renton, WA, Apr. 2018. USENIX Association.
[3] J. Arnold and M. F. Kaashoek. Ksplice: Automatic rebootless kernel updates. In
Pro-ceedings of the 4th ACM European Conference on Computer Systems , EuroSys ’09, page187–198, New York, NY, USA, 2009. Association for Computing Machinery.[4] V. Atlidakis, J. Andrus, R. Geambasu, D. Mitropoulos, and J. Nieh. Posix abstractionsin modern operating systems: The old, the new, and the missing. In
Proceedings of theEleventh European Conference on Computer Systems , EuroSys ’16, New York, NY, USA,2016. Association for Computing Machinery.[5] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan. Attack of the killer microseconds.
Commun. ACM , 60(4):48–54, Mar. 2017.[6] M. Bauer. Paranoid penguin: An introduction to Novell AppArmor.
Linux J. , 2006(148):13,Aug. 2006.[7] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe,A. Sch¨upbach, and A. Singhania. The multikernel: A new os architecture for scalablemulticore systems. In
Proceedings of the ACM SIGOPS 22nd Symposium on OperatingSystems Principles , SOSP ’09, page 29–44, New York, NY, USA, 2009. Association forComputing Machinery.[8] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazi`eres, and C. Kozyrakis. Dune: Safeuser-level access to privileged cpu features. In
Proceedings of the 10th USENIX Conferenceon Operating Systems Design and Implementation , OSDI’12, page 335–348, USA, 2012.USENIX Association.[9] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion. IX: Aprotected dataplane operating system for high throughput and low latency. In , pages 49–65,Broomfield, CO, Oct. 2014. USENIX Association.[10] S. S. Bhat, R. Eqbal, A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Scaling a filesystem to many cores using an operation log. In
Proceedings of the 26th Symposium on Op-erating Systems Principles , SOSP ’17, page 69–86, New York, NY, USA, 2017. Associationfor Computing Machinery.[11] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, andN. Zeldovich. An analysis of linux scalability to many cores. In
Proceedings of the 9thUSENIX Conference on Operating Systems Design and Implementation , OSDI’10, page1–16, USA, 2010. USENIX Association.[12] C. Byun, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle,M. Hubbell, M. Jones, A. Klein, P. Michaleas, J. Mullen, A. Prout, A. Rosa, S. Samsi,C. Yee, and A. Reuther. Large scale parallelization using file-based communications. In , pages 1–7, 2019.[13] C. Byun, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Hubbell,P. Michaleas, J. Mullen, A. Prout, A. Rosa, C. Yee, and A. Reuther. Llmapreduce: Multi-level map-reduce for high performance data analysis. In , pages 1–8, 2016.
[14] C. Byun, A. Klein, L. Milechin, P. Michaleas, J. Mullen, A. Prout, A. Rosa, S. Samsi, C. Yee, A. Reuther, and et al. Optimizing xeon phi for interactive data analysis. , Sep 2019.
[15] P. Cailliau, T. Davis, V. Gadepally, J. Kepner, R. Lipman, J. Lovitz, and K. Ouaknine. Redisgraph graphblas enabled graph database. , May 2019.
[16] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instrumentation of production systems. In
Proceedings of the Annual Conference on USENIX Annual TechnicalConference , ATEC ’04, page 2, USA, 2004. USENIX Association.[17] V. Carbune, T. Coppey, A. Daryin, T. Deselaers, N. Sarda, and J. Yagnik. Smartchoices:Hybridizing programming and machine learning. In
Reinforcement Learning for RealLife (RL4RealLife) Workshop in the 36th International Conference on Machine Learning(ICML), , 2019.[18] D. D. Chamberlin, M. M. Astrahan, M. W. Blasgen, J. N. Gray, W. F. King, B. G. Lindsay,R. Lorie, J. W. Mehl, T. G. Price, F. Putzolu, P. G. Selinger, M. Schkolnick, D. R. Slutz,I. L. Traiger, B. W. Wade, and R. A. Yost. A history and evaluation of system r.
Commun.ACM , 24(10):632–646, Oct. 1981.[19] R. Chandra, T. Kim, and N. Zeldovich. Asynchronous intrusion recovery for interconnectedweb services. In
Proceedings of the Twenty-Fourth ACM Symposium on Operating SystemsPrinciples , SOSP ’13, page 213–227, New York, NY, USA, 2013. Association for ComputingMachinery.[20] E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini. Resourcecentral: Understanding and predicting workloads for improved resource management inlarge cloud platforms. In
Proceedings of the 26th Symposium on Operating Systems Prin-ciples , SOSP ’17, page 153–167, New York, NY, USA, 2017. Association for ComputingMachinery.[21] J. Dean and L. A. Barroso. The tail at scale.
Commun. ACM , 56(2):74–80, Feb. 2013.[22] C. Delimitrou and C. Kozyrakis. Paragon: Qos-aware scheduling for heterogeneous data-centers. In
Proceedings of the Eighteenth International Conference on Architectural Supportfor Programming Languages and Operating Systems , ASPLOS ’13, page 77–88, New York,NY, USA, 2013. Association for Computing Machinery.[23] P. Feiner, A. D. Brown, and A. Goel. Comprehensive kernel instrumentation via dynamicbinary translation. In
Proceedings of the Seventeenth International Conference on Archi-tectural Support for Programming Languages and Operating Systems , ASPLOS XVII, page135–146, New York, NY, USA, 2012. Association for Computing Machinery.[24] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Win-stein. From laptop to lambda: Outsourcing everyday jobs to thousands of transient func-tional containers. In ,pages 475–488, Renton, WA, July 2019. USENIX Association.[25] B. Fuller, M. Varia, A. Yerukhimovich, E. Shen, A. Hamlin, V. Gadepally, R. Shay, J. D.Mitchell, and R. K. Cunningham. Sok: Cryptographically protected database search. In , pages 172–191, 2017.[26] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Matt-son, and M. Stonebraker. The bigdawg polystore system and architecture. In , pages 1–6, 2016.[27] V. Gadepally, B. Hancock, B. Kaiser, J. Kepner, P. Michaleas, M. Varia, and A. Yerukhi-movich. Computing on masked data to improve the security of big data. In , pages 1–6, 2015.[28] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards,M. Hubbell, P. Michaleas, J. Mullen, A. Prout, A. Rosa, C. Yee, and A. Reuther. D4m: ringing associative arrays to database engines. In , pages 1–6, 2015.[29] V. Gadepally, T. Mattson, M. Stonebraker, F. Wang, G. Luo, Y. Laing, and A. Dubovit-skaya. Heterogeneous Data Management, Polystores, and Analytics for Healthcare: VLDB2019 Workshops, Poly and DMAH, Revised Selected Papers , volume 11721. Springer Na-ture, August 2019.[30] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken,B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang,L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. An open-sourcebenchmark suite for microservices and their hardware-software implications for cloud &edge systems. In
Proceedings of the Twenty-Fourth International Conference on Architec-tural Support for Programming Languages and Operating Systems , ASPLOS ’19, page 3–18,New York, NY, USA, 2019. Association for Computing Machinery.[31] T. Gleixner. Refactoring the Linux kernel. https://kernel-recipes.org/en/2017/talks/refactoring-the-linux-kernel/ , 2017.[32] D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs. Graphulo implementation of server-side sparse matrix multiply in the accumulo database. In , pages 1–7, 2015.[33] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patter-son. A domain-specific supercomputer for training deep neural networks.
Commun. ACM ,63(7):67–78, June 2020.[34] A. K. Kamath, L. Monis, A. T. Karthik, and B. Talawar. Storage class memory: Principles,problems, and possibilities, 2019.[35] P. Kedia and S. Bansal. Fast dynamic binary translation for the kernel. In
Proceedingsof the Twenty-Fourth ACM Symposium on Operating Systems Principles , SOSP ’13, page101–115, New York, NY, USA, 2013. Association for Computing Machinery.[36] J. Kepner.
Parallel MATLAB for multicore and multinode computers . SIAM, 2009.[37] J. Kepner, P. Aaltonen, D. Bader, A. Bulu¸c, F. Franchetti, J. Gilbert, D. Hutchison,M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, C. Yang, J. D. Owens, M. Zalewski,T. Mattson, and J. Moreira. Mathematical foundations of the graphblas. In , pages 1–9, 2016.[38] J. Kepner, R. Brightwell, A. Edelman, V. Gadepally, H. Jananthan, M. Jones, S. Madden,P. Michaleas, H. Okhravi, K. Pedretti, A. Reuther, T. Sterling, and M. Stonebraker. Tabu-larosa: Tabular operating system architecture for massively parallel heterogeneous computeengines. In , pages1–8, 2018.[39] J. Kepner, K. Cho, K. Claffy, V. Gadepally, P. Michaleas, and L. Milechin. Hypersparseneural network analysis of large-scale internet traffic. , Sep 2019.[40] J. Kepner, T. Davis, C. Byun, W. Arcand, D. Bestor, W. Bergeron, V. Gadepally,M. Hubbell, M. Houle, M. Jones, A. Klein, P. Michaleas, L. Milechin, J. Mullen, A. Prout,A. Rosa, S. Samsi, C. Yee, and A. Reuther. 75,000,000,000 streaming inserts/second usinghierarchical hypersparse graphblas matrices, 2020.[41] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, andA. Reuther. Associative array model of sql, nosql, and newsql databases. In , pages 1–9, 2016.[42] J. Kepner, V. Gadepally, P. Michaleas, N. Schear, M. Varia, A. Yerukhimovich, and R. K.Cunningham. Computing on masked data: a high performance method for improving bigdata veracity. In ,pages 1–6, 2014.
[43] J. Kepner and H. Jananthan.
Mathematics of big data: Spreadsheets, databases, matrices,and graphs . MIT Press, 2018.[44] Y. Khan, A. Zimmermann, A. Jha, V. Gadepally, M. D’Aquin, and R. Sahay. One sizedoes not fit all: Querying web polystores.
IEEE Access , 7:9598–9617, 2019.[45] T. Kim, X. Wang, N. Zeldovich, and M. F. Kaashoek. Intrusion recovery using selectivere-execution. In
Proceedings of the 9th USENIX Conference on Operating Systems Designand Implementation , OSDI’10, page 89–104, USA, 2010. USENIX Association.[46] A. Klimovic, H. Litz, and C. Kozyrakis. Reflex: Remote flash = local flash. In
Proceedingsof the Twenty-Second International Conference on Architectural Support for ProgrammingLanguages and Operating Systems , ASPLOS ’17, page 345–359, New York, NY, USA, 2017.Association for Computing Machinery.[47] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis. Pocket: Elas-tic ephemeral storage for serverless analytics. In , pages 427–444, Carlsbad, CA, Oct. 2018.USENIX Association.[48] M. Larabel. The performance cost to selinux on fedora 31. , 2020.[49] C. E. Leiserson, N. C. Thompson, J. S. Emer, B. C. Kuszmaul, B. W. Lampson, D. Sanchez,and T. B. Schardl. There’s plenty of room at the top: What will drive computer performanceafter moore’s law?
Science , 368(6495), 2020.[50] Scaling in the linux networking stack. .[51] J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Qu´ema, and A. Fedorova. The linux scheduler:A decade of wasted cores. In
Proceedings of the Eleventh European Conference on ComputerSystems , EuroSys ’16, New York, NY, USA, 2016. Association for Computing Machinery.[52] J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Qu´ema, and A. Fedorova. The linux scheduler:A decade of wasted cores. In
Proceedings of the Eleventh European Conference on ComputerSystems , EuroSys ’16, New York, NY, USA, 2016. Association for Computing Machinery.[53] J. Lu, I. Holubov´a, and B. Cautis. Multi-model databases and tightly integrated polystores:Current practices, comparisons, and open challenges. In
Proceedings of the 27th ACMInternational Conference on Information and Knowledge Management , pages 2301–2302,2018.[54] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learningscheduling algorithms for data processing clusters. In J. Wu and W. Hall, editors,
Pro-ceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019,Beijing, China, August 19-23, 2019 , pages 270–288. ACM, 2019.[55] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean. Hierarchical planningfor device placement. 2018.[56] C. Mitchell, Y. Geng, and J. Li. Using one-sided rdma reads to build a fast, cpu-efficientkey-value store. In
Proceedings of the 2013 USENIX Conference on Annual TechnicalConference , USENIX ATC’13, page 103–114, USA, 2013. USENIX Association.[57] Y. Padioleau, J. L. Lawall, and G. Muller. Understanding collateral evolution in linuxdevice drivers.
SIGOPS Oper. Syst. Rev. , 40(4):59–71, Apr. 2006.[58] R. Poddar, T. Boelter, and R. A. Popa. Arx: an encrypted database using semanticallysecure encryption.
Proceedings of the VLDB Endowment , 12(11):1664–1678, 2019.[59] J. Poimboeuf. Introducing kpatch: Dynamic kernel patching. http://rhelblog.redhat.com/2014/02/26/kpatch/ , 2014.[60] R. A. Popa, C. M. Redfield, N. Zeldovich, and H. Balakrishnan. Cryptdb: protectingconfidentiality with encrypted query processing. In
Proceedings of the Twenty-Third ACMSymposium on Operating Systems Principles , pages 85–100, 2011.
[61] A. Reuther, C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, M. Jones, P. Michaleas, A. Prout, A. Rosa, and et al. Scalable system scheduling for hpc and big data.
Journal of Parallel and Distributed Computing , 111:76–92, Jan 2018.[62] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gade-pally, M. Houle, M. Hubbell, and et al. Interactive supercomputing on 40,000 cores formachine learning and data analysis. , Sep 2018.[63] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. Legoos: A disseminated, distributed osfor hardware resource disaggregation. In
Proceedings of the 13th USENIX Conference onOperating Systems Design and Implementation , OSDI’18, page 69–87, USA, 2018. USENIXAssociation.[64] S. Smalley, C. Vance, and W. Salamon. Implementing SELinux as a Linux security module.Technical report, 2001.[65] The snowflake cloud data platform.[66] C. Song, B. Lee, K. Lu, W. Harris, T. Kim, and W. Lee. Enforcing kernel security in-variants with data flow integrity. In . The InternetSociety, 2016.[67] M. Stonebraker and U. C¸ etintemel. ” one size fits all” an idea whose time has come andgone. In
Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker , pages441–462. 2018.[68] R. Tan, R. Chirkova, V. Gadepally, and T. G. Mattson. Enabling query processing acrossheterogeneous data models: A survey. In , pages 3211–3220, 2017.[69] J. Thumshin. Introduction to the Linux block i/o layer. https://media.ccc.de/v/784-introduction-to-the-linux-block-i-o-layer , 2016.[70] C.-C. Tsai, B. Jain, N. A. Abdul, and D. E. Porter. A study of modern linux api usage andcompatibility: What to support when you’re supporting. In
Proceedings of the EleventhEuropean Conference on Computer Systems , EuroSys ’16, New York, NY, USA, 2016.Association for Computing Machinery.[71] J. Weisenthal. Reinhart and rogoff: ’full stop,’ we made a microsoft excel blunder in ourdebt study, and it makes a difference, April 2013.[72] Wikipedia. AWS Lambda. https://en.wikipedia.org/wiki/AWS_Lambda , July 2020.[73] Wikipedia. General data protection regulation. https://en.wikipedia.org/wiki/General_Data_Protection_Regulation , July 2020.[74] Attribute-based access control — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Attribute-based_access_control&oldid=967477902 , 2020.[75] Completely fair scheduler — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Completely_Fair_Scheduler&oldid=959791832 , 2020.[76] S. Yakoubov, V. Gadepally, N. Schear, E. Shen, and A. Yerukhimovich. A survey ofcryptographic approaches to securing big-data analytics in the cloud. In , pages 1–6, 2014.[77] E. Zamanian, X. Yu, M. Stonebraker, and T. Kraska. Rethinking database high availabilitywith rdma networks.
Proc. VLDB Endow. , 12(11):1637–1650, July 2019.[78] N. Zeldovich, S. Boyd-Wickizer, E. Kohler, and D. Mazi`eres. Making information flowexplicit in histar.
Commun. ACM, 54(11):93–101, Nov. 2011.