Parallel Rendering and Large Data Visualization
Dissertation submitted to the Faculty of Business, Economics and Informatics of the University of Zurich to obtain the degree of Doktor / Doktorin der Wissenschaften, Dr. sc. (corresponds to Doctor of Science, PhD)

presented by Stefan Eilemann from Neuchâtel, NE, Switzerland

Approved in February 2019 at the request of Prof. Dr. Renato Pajarola and Prof. Dr. Markus Hadwiger

The Faculty of Business, Economics and Informatics of the University of Zurich hereby authorizes the printing of this dissertation, without indicating an opinion of the views expressed in the work.

Zurich, February 13, 2019
Chairman of the Doctoral Board: Prof. Dr. Thomas Fritz

ABSTRACT
We are living in the big data age: An ever increasing amount of data is being produced through data acquisition and computer simulations. While large scale analysis and simulations have received significant attention for cloud and high-performance computing, software to efficiently visualise large data sets is struggling to keep up.

Visualization has proven to be an efficient tool for understanding data; in particular, visual analysis is a powerful tool to gain intuitive insight into the spatial structure and relations of 3D data sets. Large-scale visualization setups are becoming ever more affordable, and high-resolution tiled display walls are in reach even for small institutions. Virtual reality has arrived in the consumer space, making it accessible to a large audience.

This thesis addresses these developments by advancing the field of parallel rendering. We formalise the design of system software for large data visualization through parallel rendering, provide a reference implementation of a parallel rendering framework, introduce novel algorithms to accelerate the rendering of large amounts of data, and validate this research and development with new applications for large data visualization. Applications built using our framework enable domain scientists and large data engineers to better extract meaning from their data, making it feasible to explore more data and enabling the use of high-fidelity visualization installations to see more detail of the data.
KURZFASSUNG

Data is the gold of the 21st century: computer simulations, imaging techniques and other data acquisition systems generate ever larger volumes of data. Relative to simulation software and distributed systems for cloud environments, visualization software for displaying large data sets has been neglected in research and development.

Visualization is an efficient means of analysing large amounts of data. In particular, the visualization of three-dimensional data sets allows an intuitive understanding of their spatial relationships and structure. Visualization hardware is available to ever more users; high-resolution display walls in particular are now affordable even for small institutions.

This doctoral thesis addresses these developments with parallel software and algorithms for the visualization of three-dimensional data sets. As a foundation for research and development, we formalise the software architecture for parallel rendering and present our reference implementation. On this basis, we present new research results and algorithms for faster visualization of large data sets. Visualization software built with our library validates our approach, and allows users to analyse more data in greater detail.
ACKNOWLEDGMENTS
The research leading to this proposal was supported in part by the Blue Brain Project, the Swiss National Science Foundation under Grant 200020-129525, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project), the Hasler Stiftung grant (project number ), and the King Abdullah University of Science and Technology (KAUST) through the KAUST-EPFL alliance for Neuro-Inspired High Performance Computing.

I would like to take the opportunity to thank the Blue Brain Project and its visualization team, RTT AG (now part of Dassault Systems), KAUST, University of Siegen, the Electronic Visualization Laboratory at the University of Illinois Chicago, and all the other contributors for their support in the research and development leading to this thesis.

I would like to thank Prof. Renato Pajarola and the VMML for his long-term commitment to my research work and Patrick Bouchaud for putting me onto the path taken by this thesis. A special gratitude goes to all collaborators who joined me in this endeavour: Daniel Nachbaur, Cedric Stalder, Maxim Makhinya, Christian Marten, Dardo D. Kleiner, Carsten Rohn, Daniel Pfeifer, Sarah Amsellem, Juan Hernando, Marwan Abdellah, Raphael Dumusc, Lucas Peetz Dulley, Jafet Villafranca, Philippe Robert, Ahmet Bilgili, Tobias Wolf, Dustin Wueest, and Martin Lambers.
CONTENTS

Abstract
Kurzfassung
Acknowledgments
List of Figures
List of Benchmarks
Bibliography
Conference Publications
Journal Articles
Curriculum Vitae
CHAPTER 1

BACKGROUND
After decades of exponential growth in computational performance, storage and data acquisition, computing is now well in the big data age, where future advances are measured in our capability to extract meaningful information from the available data. Visual analysis based on the interactive rendering of three-dimensional data has proven to be a particularly efficient approach to gain intuitive insight into spatial structures and the relations of very large 3D data sets. For example, the electrical slice simulation in Figure 1.1 (top left) contains millions of voltage samples per time step. A visualisation makes this electrical activity immediately understandable, and highlights any anomalies in the simulation. These developments create new, unique challenges for applications and system software to enable users to fully exploit the available resources to gain insight from their data.

Figure 1.1: Large Data Visualisation of a Brain Simulation, Molecular Visualisation in the Cave 2, Exploration of EM Stack Reconstructions in a Cave, Collaborative Data Analysis on a Tiled Display Wall

The quantity of computed, measured or collected data is growing exponentially, fuelled by the pervasive diffusion of digitalisation in modern life. Moreover, the fields of science, engineering and technology are increasingly defined by a data-driven approach to research and development. High-quality and large-scale data is continuously generated at a growing rate from sensor and scanning systems, as well as from data collections and numerical simulations in a number of science and technology domains.

Display technology has made significant progress in the last decade: High-resolution screens and tiled display walls are now affordable for most organisations, and are getting deployed at an increasing rate. This increased resolution and display size helps with understanding the data through higher fidelity, but causes a quadratic increase in the pixels to be rendered, which in turn challenges rendering algorithms to deliver an interactive frame rate. Such large-scale visualisation systems are often driven by multiple GPUs and workstations, making it natural, and most times necessary, to drive them using parallel and distributed applications.

However, not only are applications becoming more and more data-driven, but the technology used to tackle these kinds of problems has also been witnessing a paradigm shift towards massively parallel on-chip and distributed parallel cluster solutions. On one hand, parallelism within a system has increased massively, with tens of CPU cores, thousands of GPU cores and multiple CPUs and GPUs in a single system. On the other hand, massively parallel distributed systems are easily accessible from various cloud infrastructure providers, and are also affordable for on-site hosting for many organisations.
System software capable of performing efficient interactive data exploration by exploiting the available hardware parallelism has not kept up with the pace of hardware developments and data gathering capabilities. Mostly, this is due to an inherent delay between hardware and software capabilities, as development typically only starts once the hardware is available. Secondly, existing software is often engineered for different design parameters and has a significant inertia to change, to the extreme cost of having to rewrite it from scratch.

In the context of emerging data-intensive knowledge discovery and data analysis, efficient interactive data exploration methodologies have become critical. Visual analysis by means of interactive visualisation and inspection of three-dimensional data is a particularly efficient approach to gain intuitive insight into the spatial structure and relations of very large 3D data sets. However, defining visual and interactive methods that scale with problem size and the degree of parallelism, as well as the generic applicability of high-performance interactive visualisation methods and systems, are recognised among the major current and future challenges.

1.2 Challenges
Increased display fidelity and faster rendering performance help to visualise large data sets efficiently. Parallel rendering is one approach to achieve this goal by using multiple GPUs, and often multiple computers, to improve the rendering performance. It creates a new set of research challenges, which can be broken down into more concrete challenges, starting with formalising and implementing the architecture of a parallel rendering framework.

These sub-challenges to build better scalable parallel rendering applications can be identified as finding better task decompositions, decreasing the cost of the result composition, reducing the latency of the overall system, and minimising synchronisation between the parallel execution threads.

Interactive visualisation poses its own unique set of challenges. The goal is to present a believable alternate universe to the visual system of the user. This process turns interactive visualisation into a powerful tool, by utilising the brain's native capabilities to interpret and understand data. Virtual Reality (VR) takes this goal to the extreme, and when done right, makes users forget that they interact with a virtual world.

To achieve this goal, visualisation has the daunting task of transforming large amounts of data into coloured pixels in a short amount of time. Believable visualisation has to minimise the latency between user input and the resulting output, and to maximise the number of frames rendered per second. With increased immersion in the data, these parameters become more important: for Virtual
Reality, a 60 Hz refresh rate and a latency below 50 ms are required, whereas for non-immersive desktop visualisation 10 Hz and 200 ms are acceptable.

When starting from a given rendering problem, the first task of a parallel rendering system is to decompose (parallelise) this task into independent sub-tasks, each rendered by a separate resource in parallel. While the basics of this decomposition have been researched extensively, there are architectural challenges to make these decompositions easily available in a generic and structured manner. Load balancing these tasks for an optimal parallelisation presents many still unaddressed challenges for modern visualisation cluster sizes, consisting of tens to hundreds of GPUs, and for increasingly affordable high-fidelity visualisation systems with tens of displays and hundreds of millions of pixels.

By scaling up the amount of resources employed to accelerate the rendering task, the task of combining the partial results from each resource becomes more challenging. For some decomposition algorithms, the amount of data to composite grows linearly with the amount of parallel resources used, and keeping the compositing time within the available budget is a non-trivial problem.

For parallel rendering, these constraints make building a parallel and distributed application harder compared to other distributed applications for simulations and cloud computing. In particular, one has to be careful with synchronisation and pipelining of operations to minimise latency. In addition, an interactive application has different requirements when it comes to resource allocation compared to other large-scale distributed computing domains.

Last, but not least, a significant challenge is how to make all this research available to the large data scientists with the actual needs and use cases for parallel rendering.
1.3 Parallel Rendering

Parallel rendering utilises multiple rendering units (GPUs), often on different computers, to generate images for one or more output displays. Scalable rendering is the subset of parallel rendering which uses multiple resources to accelerate the rendering of one or more outputs. The goal of parallel rendering is to increase the output resolution, rendering performance or rendering quality. Traditionally the focus has been on the first two goals, often in isolation of each other; e.g., algorithms and implementations for Cave systems tend to be different from scalable rendering for large data visualisation.

The main performance indicator for Large Data Interactive Rendering is the performance of the rendering algorithm, that is, the framerate with which the program produces new images. This framerate can be improved either by using faster or more hardware, or by better algorithms exploiting existing hardware and data.
This thesis primarily focuses on the first approach, using parallel rendering to exploit the CPU and GPU parallelism available on a single system or a distributed cluster. The early fundamental concepts have been laid out in [Molnar et al., 1994] and [Crockett, 1997] (Figure 1.2). A number of domain specific parallel rendering algorithms and special purpose hardware solutions have been proposed in the past; however, only few generic parallel rendering frameworks have been developed.
Figure 1.2: Sort-Last, Sort-Middle and Sort-First Parallel Rendering
Sort-last rendering decomposes the rendering task in data space, that is, each resource renders a part of the data. In the end, partial fragments from each resource are composited into a final result image. Sort-middle rendering also decomposes the rendering at the data level, but collects and sorts the unshaded primitives before or after rasterisation, and then performs the fragment shading on the sorted data. Sort-first rendering decomposes the rendering task in screen space, and the application needs either to be fill-rate bound or to have efficient view frustum culling to scale the rendering performance. We will focus on sort-last and sort-first rendering, since sort-middle architectures are only feasible in a hardware implementation due to the large amount of data processed and transferred in the sorting stage.
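As a concrete illustration, a sort-first decomposition can be sketched as splitting the destination viewport into one screen-space tile per rendering resource. The names and the simple static horizontal split below are illustrative only, not the API of any particular framework:

```cpp
#include <cassert>
#include <vector>

// Illustrative sort-first decomposition: split a normalised destination
// viewport horizontally into one screen-space tile per rendering resource.
// Each resource then renders only the geometry visible inside its tile.
struct Viewport { float x, y, w, h; };

std::vector<Viewport> sortFirstSplit(const Viewport& vp, int numResources)
{
    std::vector<Viewport> tiles;
    const float tileW = vp.w / numResources;
    for (int i = 0; i < numResources; ++i)
        tiles.push_back({ vp.x + i * tileW, vp.y, tileW, vp.h });
    return tiles;
}
```

A load balancer would adjust these tile boundaries every frame based on measured render times, rather than using a static equal split.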
Cluster-based parallel rendering has been commercialised for off-line rendering (i.e. distributed ray-tracing) for computer generated animated films or special effects, since the typically used ray-tracing technique is inherently amenable to parallelisation for off-line processing. Other special purpose solutions exist for parallel rendering in specific application domains such as volume rendering [Li et al., 1997; Wittenbrink, 1998; Huang et al., 2000; Schulze and Lang, 2002; Garcia and Shen, 2002; Nie et al., 2005] or geo-visualisation [Vezina and Robertson, 1991; Agranov and Gotsman, 1995; Li et al., 1996; Johnson et al., 2006]. However, such specific solutions are typically not applicable as a generic parallel rendering paradigm and do not translate to arbitrary scientific visualisation and distributed graphics problems.

In [Niski and Cohen, 2007] parallel rendering of hierarchical level-of-detail (LOD) data has been addressed and a solution specific to sort-first tile-based parallel rendering has been presented. While the presented approach is not a generic parallel rendering system, basic concepts presented in [Niski and Cohen, 2007], such as load management and adaptive LOD data traversal, can be carried over to other sort-first parallel rendering solutions.
Historically, high-performance real-time rendering systems have relied on an integrated proprietary system architecture, such as the early SGI graphics supercomputers. Special purpose solutions have become a niche product as their graphics performance did not keep up with off-the-shelf workstation graphics hardware and the scalability of clusters.

Due to its conceptual simplicity, a number of special purpose image compositing hardware solutions for sort-last parallel rendering have been developed. The proposed hardware architectures include Sepia [Moll et al., 1999; Lever, 2004], Sepia 2 [Lombeyda et al., 2001a; Lombeyda et al., 2001b], Lightning 2 [Stoll et al., 2001], Metabuffer [Blanke et al., 2000; Zhang et al., 2001], MPC Compositor [Muraki et al., 2001] and PixelFlow [Molnar et al., 1992; Eyles et al., 1997], of which only a few have reached the commercial product stage (i.e. Sepia 2 and MPC Compositor). However, the inherent inflexibility and setup overhead have limited their distribution and application support. Moreover, with the recent advances in the speed of CPU-GPU and GPU-GPU interfaces, such as PCI Express, NVLink and other modern interconnects, combinations of software and GPU-based solutions offer more flexibility at a comparable performance.
A number of algorithms and systems for parallel rendering have been developed in the past. Some general concepts applicable to cluster parallel rendering have been presented in [Mueller, 1995; Mueller, 1997] (sort-first architecture), [Samanta et al., 1999; Samanta et al., 2000] (load balancing), [Samanta et al., 2001] (data replication), or [Cavin et al., 2005; Cavin and Mion, 2006] (scalability). On the other hand, specific algorithms have been developed for cluster based rendering and compositing, such as [Ahrens and Painter, 1998], [Correa et al., 2002] and [Yang et al., 2001; Stompel et al., 2003]. However, these approaches do not constitute APIs and libraries that can be readily integrated into existing visualisation applications, although the issue of the design of a parallel graphics interface has been addressed in [Igehy et al., 1998].

Only few generic APIs and (cluster-)parallel rendering systems exist, including VR Juggler [Bierbaum et al., 2001] (and its derivatives), Chromium [Humphreys et al., 2002] (an evolution of [Humphreys and Hanrahan, 1999; Humphreys et al., 2000; Humphreys et al., 2001]), ClusterGL [Neal et al., 2011] and OpenGL Multipipe SDK [Jones et al., 2004; Bhaniramka et al., 2005; MPK, 2005]. These approaches can be categorised into transparent interception and distribution of the OpenGL command stream, and into the parallelisation of the application rendering code (Figure 1.3).
Figure 1.3: Parallel Execution (left) versus Transparent OpenGL Interception (right)
VR Juggler
VR Juggler [Bierbaum et al., 2001; Just et al., 1998] is a graphics framework for virtual reality applications, shielding the application developer from the underlying hardware architecture, devices and operating system. Its main aim is ease of use in virtual reality configurations, without the need to know about the devices and hardware configuration details, but not specifically to provide scalable rendering. Extensions of VR Juggler, such as ClusterJuggler [Bierbaum and Cruz-Neira, 2003] and NetJuggler [Allard et al., 2002], are typically based on the replication of application and data on each cluster node and only take care of synchronisation issues, but fail to provide a flexible and powerful configuration mechanism that efficiently supports scalable rendering, as also noted in [Staadt et al., 2003]. VR Juggler does not support scalable parallel rendering such as sort-first and sort-last task decomposition and image compositing, nor does it provide other important features for parallel rendering, such as network swap barriers (synchronisation), distributed objects, image compression and transmission, or multiple rendering threads per process.
Chromium
Chromium [Humphreys et al., 2002] provides a powerful and transparent abstraction of the OpenGL API, allowing a flexible configuration of display resources. It is limited in scalability due to its focus on streaming OpenGL commands through a network of nodes, often initiated from a single source. This has also been observed in [Staadt et al., 2003], and is caused by the size of the OpenGL stream. This data stream not only contains OpenGL calls, but also geometry and image data. Only when the geometry and textures are mostly static and can be kept in GPU memory on the graphics card is no significant bottleneck to be expected, as the OpenGL stream is then composed of a relatively small number of rendering instructions. For typical real-world visualisation applications, display and object settings are interactively manipulated, data and parameters may change dynamically, and large data sets do not fit statically in GPU memory, but are often dynamically loaded from out-of-core and/or multi-resolution data structures. This can lead to frequent updates not only of the commands and parameters which have to be distributed, but also of the rendered data itself (geometry and texture), thus causing the OpenGL stream to expand dramatically. Furthermore, this stream of function calls and data must be packaged and broadcast in real-time over the network to multiple nodes for each rendered frame. This makes CPU performance and network bandwidth the likely limiting factors.

The performance experiments in [Humphreys et al., 2002] indicate that Chromium works well when the rendering problem is fill-rate limited. This is due to the fact that the OpenGL commands and a non-critical amount of rendering data can be distributed to multiple nodes without significant problems.
The critical fill-rate work is then performed locally on the graphics hardware.

Chromium also provides some facilities for parallel application development: a sort-last, binary-swap compositing stream processing unit and an OpenGL extension providing synchronisation primitives, such as a barrier and semaphore. It leaves problems like configuration, task decomposition, and process and thread management unaddressed. Parallel Chromium applications tend to be written for one specific parallel rendering use case and configuration, e.g. the sort-first distributed memory volume renderer in [Bethel et al., 2003], or the sort-last parallel volume renderer raptor [Houston, 2005]. We are not aware of a generic Chromium-based application using many-to-one sort-first or stereo decompositions.

The concept of transparent OpenGL interception popularised by WireGL and Chromium has received further contributions. While some commercial implementations such as TechViz and MechDyne Conduit continue to exist, on the research side only ClusterGL [Neal et al., 2011] has been presented recently. ClusterGL employs the same approach as Chromium, but delivers a significantly faster implementation of transparent OpenGL interception and distribution for parallel rendering. Transparent OpenGL interception is an appealing approach for some applications, as it requires no code changes. It has inherent limitations due to the fact that eventually the bottleneck becomes the single-threaded application rendering code, the amount of application data the single application instance can load or process, or the size of the OpenGL command stream sent over the network.
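The binary-swap compositing scheme mentioned above can be quantified with a back-of-the-envelope traffic model (a sketch for illustration, not Chromium's implementation): with N nodes (a power of two) and P pixels per frame, each node exchanges half of its current region per round over log2(N) rounds, so the pixels sent per node sum to P·(1 − 1/N) and stay bounded as N grows.

```cpp
#include <cassert>

// Back-of-the-envelope model of binary-swap compositing traffic:
// in each of the log2(numNodes) rounds, a node keeps one half of its
// current region and sends the other half, so pixels sent per node
// sum to pixels * (1 - 1/numNodes).
double binarySwapPixelsSentPerNode(double pixels, int numNodes)
{
    double sent = 0.0;
    double region = pixels;
    for (int n = numNodes; n > 1; n /= 2)
    {
        region /= 2.0; // keep one half, send the other
        sent += region;
    }
    return sent;
}
```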
CGLX
CGLX [Doerr and Kuester, 2011] aims to bring parallel execution transparently to OpenGL applications by emulating the GLUT API and intercepting certain OpenGL calls. Its target use case is multi-display installations, i.e., static sort-first rendering with no compositing. In contrast to frameworks like Chromium and ClusterGL, which distribute OpenGL calls, CGLX follows the distributed application approach. This works transparently for trivial applications, but quickly requires the application developer to address the complexities of a distributed application when mutable application state needs to be synchronised across processes. For production applications, writing parallel applications remains the only viable approach for scalable rendering, as shown by the success of Paraview, Visit and Equalizer-based applications.
OpenGL Multipipe SDK
OpenGL Multipipe SDK (MPK) [Bhaniramka et al., 2005] implemented an effective parallel rendering API for a shared memory multi-CPU/GPU system. It is similar to IRIS Performer [Rohlf and Helman, 1994] in that it handles multi-GPU rendering by a lean abstraction layer via a callback mechanism, and that it runs different application tasks in parallel. However, MPK is neither designed nor meant for rendering nodes separated by a network. MPK focuses on providing a parallel rendering framework for a single application, parts of which are run in parallel on multiple rendering channels, such as the culling, rendering and final image compositing processes. The author used to be the technical lead developer of OpenGL Multipipe SDK; therefore Equalizer is in many ways an evolution of MPK for distributed execution, improved performance and better configurability.
Tiled Display Walls
Software for driving and interacting with tiled display walls has received significant attention, in particular Sage [Renambot et al., 2004] and Sage 2 [Marrinan et al., 2014]. Sage was built entirely around the concept of a shared framebuffer where all content windows are separate applications using pixel streaming. It is no longer actively supported. Sage 2 is a complete, browser-centric reimplementation where each application is a web application distributed across browser instances. DisplayCluster [Johnson et al., 2012], and its continuation Tide [Blue Brain Project, 2016], also implement the shared framebuffer concept of Sage, but provide a few native content applications integrated into the display servers. These solutions implement a scalable display environment and are a target display platform for scalable 3D graphics applications.
In the next chapter, we give a summary of the contributions of this thesis, listing relevant publications and the contributions of the author to these publications. Chapter 3 introduces the architecture of a parallel rendering framework, the foundation for this thesis. Chapter 4 presents new algorithms for the task decomposition in parallel rendering. Chapter 5 focuses on optimisations to reduce the cost of recombining the results of a parallel rendering decomposition. Chapter 6 describes better approaches to balance the task assignment to rendering resources. Chapter 7 describes the design and architecture of a network library tailored to parallel rendering. Before the conclusion in Chapter 9, Chapter 8 provides an overview of the major Equalizer applications.
CHAPTER 2

CONTRIBUTIONS
This chapter summarises the main contributions of this thesis. In each section, we list the relevant publications and specify the contributions of the author.
A major contribution of this thesis is the formalisation of the architecture for aparallel rendering framework and its reference implementation, which advancesthe state of the art in many aspects:
Minimally invasive API:
The guiding principle for the API design was to allow applications to retain all their rendering code and application logic. The programming interface is based on a set of C++ classes, modelled closely on the resource hierarchy of a graphics rendering system. The application subclasses these objects and overrides C++ task methods, similar to C callbacks. These task methods will be called in parallel by the framework, depending on the current configuration. The contract for the implementation of the task methods does not assume any specific rendering library, algorithm or technology, thus facilitating the adaptation of existing applications for parallel rendering. This parallel rendering interface is significantly different from Chromium [Humphreys et al., 2002] and more similar to VR Juggler [Bierbaum et al., 2001] and MPK [Bhaniramka et al., 2005].
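The task-method contract can be sketched with a minimal mock. The class and method names below are illustrative stand-ins, not the actual framework headers:

```cpp
#include <cassert>

// Sketch of the callback-style task-method contract (illustrative names,
// not the actual framework API): the framework owns the execution model
// and invokes overridden task methods with the current rendering context.
struct RenderContext
{
    float viewport[4]; // region of the output assigned to this invocation
};

class Channel // framework-provided base class (mock)
{
public:
    virtual ~Channel() = default;
    virtual void frameDraw(const RenderContext& context) { /* default */ }
};

class MyChannel : public Channel // application subclass
{
public:
    int drawCalls = 0;
    void frameDraw(const RenderContext& context) override
    {
        // apply context.viewport as the render viewport, then run the
        // application's existing, unmodified drawing code
        ++drawCalls;
    }
};
```

Depending on the configuration, the framework may invoke the same task method once per resource and frame, so unmodified application draw code runs in parallel under different rendering contexts.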
Runtime configuration:
The architecture of our parallel rendering framework makes a clear separation between the rendering algorithm and the runtime configuration. It provides a contract between the framework and the application code based on a rendering context, and uses this context to drive the application output depending on the runtime configuration. Application developers are unaware of parallel rendering setups and make no assumptions on how the rendering code will be executed. This clear separation yields parallel rendering applications which can be deployed on a wide set of installations, and are often configured in new ways unforeseen during their deployment.
Display abstraction:
Large scale visualisation systems cover a wide set of use cases, from classical workstation setups to monoscopic tiled display walls, stereoscopic, edge-blended multi-projector walls, and fully immersive installations such as CAVE systems. Consequently, applications running on these systems serve many different use cases. Our novel canvas-layout abstraction provides a simple configuration for all these installations and empowers applications using these installations with 2D and 3D contextual information, runtime stereo configuration, and head tracking.
Compound trees:
The introduction of compounds, and their underlying contract, provides a formalisation of a flexible task decomposition and result recomposition for parallel rendering. Compound trees allow for easy specification of complex parallel task decomposition strategies, which are implemented and executed by the Equalizer system. They generalise parallel rendering principles without hardcoding a specific parallel rendering algorithm, thus proposing an orthogonal parameter set for decomposing rendering tasks, assembling results, and adapting these parameters at runtime. Furthermore, they facilitate new parallel rendering research due to their flexibility and extensibility.
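The idea can be sketched as a small tree data structure. The fields below (channel, data range, children) are an illustrative subset of what a compound carries, not the full Equalizer compound format:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative compound-tree sketch: interior nodes describe the
// decomposition, leaves bind a rendering task to a resource (channel).
struct Compound
{
    int channel = -1;               // leaf: rendering resource; -1: interior
    float range[2] = { 0.f, 1.f };  // sort-last data range [begin, end)
    std::vector<std::unique_ptr<Compound>> children;
};

// Build a two-way sort-last decomposition: each leaf renders half of the
// data; the parent recombines (composites) the partial images.
std::unique_ptr<Compound> makeSortLastPair(int channelA, int channelB)
{
    auto root = std::make_unique<Compound>();
    auto left = std::make_unique<Compound>();
    left->channel = channelA;
    left->range[1] = 0.5f;
    auto right = std::make_unique<Compound>();
    right->channel = channelB;
    right->range[0] = 0.5f;
    root->children.push_back(std::move(left));
    root->children.push_back(std::move(right));
    return root;
}
```

Because the decomposition is plain data, a runtime component can rewrite the ranges or viewports each frame without touching the rendering code.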
Equalizers:
The namesake of our framework, equalizers are active components hooked into a compound tree which modify compound tree parameters at runtime. For example, a sort-first load balancer adapts the sub-viewports assigned to each resource at runtime. Compounds are the passive configuration, and equalizers are the active component to optimise this configuration dynamically. This makes their implementation independent of the rest of the framework, providing a powerful abstraction for research and development of better resource usage for parallel rendering.
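A reactive sort-first load equalizer for two resources can be sketched as follows. It assumes that render time is proportional to the assigned viewport width, which is a deliberate simplification of real cost models; the function name is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Sketch of a reactive sort-first load equalizer for two resources.
// From last frame's render times, estimate each side's cost per unit of
// viewport width, then move the split so both sides are predicted to
// finish at the same time (assumes time scales linearly with width).
float balanceSplit(float split, float timeLeft, float timeRight)
{
    const float costLeft = timeLeft / split;           // time per unit width
    const float costRight = timeRight / (1.f - split); // time per unit width
    // Solve s * costLeft == (1 - s) * costRight for the new split s.
    return costRight / (costLeft + costRight);
}
```

With equal frame times the split stays at 0.5; a slower left side is assigned a smaller viewport on the next frame.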
Modular architecture:
Our architecture is built from layers that gradually provide higher level abstractions. On the lowest level, a network library for distributed abstractions provides the substrate for Equalizer and its applications. Within each library, a clear separation of responsibilities allows an easy combination of existing algorithms. For example, an advanced feature like a cross-segment equalizer relies on per-segment load equalizers, and both equalizers reconfigure the underlying compound tree each frame.

[Eilemann et al., 2009] and [Eilemann et al., 2018] publish the architectural foundations of parallel rendering frameworks. All algorithmic and architectural contributions in these publications are contributed by the author, while the experimental results have significant contributions from the secondary authors.

[Bhaniramka et al., 2005] provides in many ways the foundation for Equalizer; the author was a contributor to the implementation of this parallel rendering framework for shared memory systems.
Based on the flexible system architecture, we implemented new scalable rendering algorithms, introduced in [Eilemann et al., 2009] and [Eilemann et al., 2018]. A particular focus was put on reducing the cost of the expensive compositing step of sort-last rendering.

[Eilemann and Pajarola, 2007] provides an analysis of parallel compositing algorithms in an early version of our parallel rendering framework, showing that direct send compositing has advantages on commodity visualisation clusters. The implementation, algorithms and experimental analysis in this paper are contributed by the author.

[Makhinya et al., 2010] introduces more sort-last compositing optimisations, most notably automatic region-of-interest detection and new compression algorithms. The author contributed the foundations for using regions of interest, and the fast RLE compression with optimised data preconditioning.

[Eilemann et al., 2012] introduces many algorithmic optimisations for modern visualisation clusters, including asynchronous compositing, thread and memory placement on NUMA architectures, regions of interest, and an analysis of real-world application performance. We show that careful system design and detailed optimisations are necessary to achieve scalability on larger visualisation clusters. For this publication, the author contributed the algorithms, large parts of the implementation in Equalizer, and some of the experimental analysis.
Optimal resource usage in larger visualisation clusters relies on an even distribution of work over the available resources. This load balancing problem requires real-time algorithms based on imperfect knowledge of the system and application behaviour. In our architecture, load balancing is achieved by modifying the compound tree parameters at runtime. For example, a sort-first load balancer adapts the sub-viewports assigned to each resource at runtime. These so-called equalizers are components hooked into the compound tree, which makes their use and implementation independent of the rest of the framework.

[Eilemann et al., 2009] and [Eilemann et al., 2018] provide experimental results on the effectiveness of our sort-first and sort-last load balancing implementations. In the latter publication, we compare two different reactive load balancing algorithms and show that the theoretically superior algorithms do not necessarily provide better performance in realistic scenarios.

[Erol et al., 2011] introduces a novel algorithm for load balancing an arbitrary set of rendering resources to drive visualisation installations with many output displays, like tiled display walls or multi-projector systems. The author contributed the algorithm and implementation for this publication.

[Steiner et al., 2016] provides an implementation and detailed analysis of central task queueing with work packages and different task affinity modes for sort-first and sort-last rendering. The author provided the base queueing infrastructure for this publication.
CHAPTER 3

PARALLEL RENDERING ARCHITECTURE
A generic parallel rendering framework has to cover a wide range of use cases, target systems, and configurations. This requires a strong separation between the implementation of the application and its configuration, combined with a careful design to allow the resulting program to scale up to hundreds of nodes, while providing a minimally invasive API for the developer. In this section we present the system architecture of the Equalizer parallel rendering framework, and motivate its design in contrast to related work.

The motivation to use parallel rendering is either the need to drive multiple displays or projectors from multiple GPUs and potentially multiple nodes, or the need to increase rendering performance to visualise more data or to use a more demanding rendering algorithm for higher visual quality. Occasionally both needs coincide, e.g., for the analysis of large data sets on high-fidelity visualisation systems.

Parallel rendering has similarities to other distributed computing domains like cloud computing and high-performance computing (HPC). It aims to accelerate the completion of a task by parallelising a time-consuming algorithm, or to allow the computation of a larger problem by employing multiple resources. Certain aspects are shared across these distributed computing domains, such as the need to load balance the parallel task execution, to minimise synchronisation and communication overhead, and to find a task decomposition which produces correct results.

Parallel rendering has one significant additional constraint: serving an interactive use case. Depending on the application domain and visualisation system, a framerate between 10 Hz and 120 Hz is typically required for useful user interaction. This translates to a budget of 8 ms to 100 ms to decompose the task to render a frame, perform the parallel rendering, and composite and display the result.
In comparison, cloud computing and HPC typically have turnaround times of seconds to hours. Therefore, many algorithms for parallel rendering compute a suboptimal solution, but do so in at most a few milliseconds.

Fundamentally, two approaches enable applications to use multiple GPUs: transparent interception at the graphics API (typically OpenGL), or extending the application to support parallel rendering natively (Figure 1.3). The first approach has been extensively explored by Chromium and others, while the second is the foundation for this thesis. The architecture of Equalizer is founded on an in-depth requirements analysis of typical visualisation applications, existing frameworks, and previous work on OpenGL Multipipe SDK.

The task of parallelising a visualisation application boils down to configuring the application's rendering code differently for each resource, enabling this rendering code to access the correct data, and synchronising execution. For scalable rendering, when multiple GPUs are used to accelerate a single output, partial results need to be collected from all contributing resources, combined, and sent to the output.

The architecture of our parallel rendering framework addresses the following research questions:

• How can we reduce end-to-end system latency for better user experience?

• In a generic parallel rendering framework, how can we schedule the different rendering stages to minimise the latency for the user?

• How can we architect the parallel rendering framework to minimise synchronisation between threads? Section 3.2 introduces our asynchronous execution model, which has been carefully designed to minimise synchronisation points, maximise pipelining, and enable early display of rendered images.

• How can we maximise the impact of this research on large data scientists? Ultimately, accessible applications determine the impact for large data research.
With Equalizer we provide the base building blocks: a minimally invasive API and distributed execution layer to lower the entry barrier for application developers, a flexible configuration with a clear separation from the implementation, and comprehensive VR features and tiled display wall integration addressing a wide set of visualisation installations. Various applications, introduced in Chapter 8, have been developed using Equalizer.

In this chapter, we will first describe the execution model and resource configuration, followed by how the generic configuration is used to model the desired visualisation setup, and finally introduce specifics of scalable and distributed rendering.

3.2 Asynchronous Execution Model
The core execution model for parallel rendering was pioneered by CAVELib [DeFanti et al., 1998], refined by OpenGL Multipipe SDK for shared memory systems and scalable rendering, and substantially extended by Equalizer for asynchronous and distributed execution. Analysing the typical architecture of a visualisation application, we observe an initialization phase, a main rendering loop, and an exit phase. Equalizer decomposes these steps for parallel execution.

The main rendering loop typically consists of four phases: submitting the rendering commands to the graphics subsystem, displaying the rendered image, retrieving events from the operating system, and updating the application state before a new image is rendered. Usually, the configuration of the rendering is largely hard-coded, with a few configurable parameters such as field of view or stereo separation. For parallel execution, we need to separate the rendering code from this main loop, and execute it in parallel with different rendering parameters, as shown in Figure 3.1. Similarly, the initialisation and exit phases also need to be decomposed to allow managing multiple distributed resources.

Figure 3.2 shows the execution of the rendering tasks of a two-node sort-first compound without latency and with a latency of one frame. The asynchronous execution pipelines rendering operations and hides imbalances in the load distribution, resulting in an improved framerate. We have observed a speedup of 15% on a five-node rendering cluster when using a latency of one frame instead of no latency in a sort-first configuration.

Another critical design decision concerns synchronisation points. Most implementations, including OpenGL Multipipe SDK, use a per-frame barrier or similar synchronisation to manage parallel execution.
In larger installations, this is detrimental to scalability, as even slight load imbalances limit parallel speedup. The Equalizer execution model is fully asynchronous, and only introduces synchronisation points when strictly required. The main synchronisation points are: configured swap barriers between a set of output channels which have to display simultaneously, the availability of input frames for scalable rendering, and a task synchronisation to prevent runaway of the main loop execution. By default, Equalizer keeps up to one frame of latency in execution, that is, some resources might render the next frame while others are still finishing the current frame. Nonetheless, finished resources will immediately display their result. This asynchronous execution architecture, coupled with a frame of latency, allows pipelining of many operations, such as the application event processing, task computation and load balancing, rendering, image readback, compression, network transmission, and compositing. It also hides small imbalances in the task distribution, as they usually average out over multiple frames.

Figure 3.1: Simplified Execution Flow of a Classical Visualisation Application and an Equalizer Application

Figure 3.2: Synchronous and Asynchronous Execution

In practical scenarios, application initialisation and exit is also a factor for usability. Consequently, these phases are also parallelised in Equalizer. A first pass identifies the resources to be launched or terminated, kicks off the tasks, and then uses a second pass to synchronise their execution and results.

Benchmark 3.a:
Parallel Application Startup
Benchmark 3.a shows the startup time of eqPly, our parallel polygon renderer. This benchmark simply measures the time taken by Config::init, which includes the render client process creation using ssh from the application node, library loading from a shared filesystem, network setup, OpenGL and window initialisation, and object data mapping for the Equalizer resource instances and a few internal objects used by eqPly. The benchmark confirms that the application launch scales nicely to a medium cluster size. A slight increase in startup time with larger configurations is expected, since more processes increase the load on the shared filesystem and worsen distribution and synchronisation overheads. Due to the shared filesystem used for the executable, the startup times show a large uncertainty, indicated by the standard deviation bars.

In comparison to interception approaches as used by Chromium, our asynchronous programming model inherently provides better performance. Benchmark 3.b tests the rendering performance for driving a simple tiled display wall configuration with a static model, rotating about its vertical axis, placed such that it nicely covers the different screens.

A standard tile-sort Chromium configuration is comparable to a simple Equalizer display wall setup, where in each case a single GPU and node is responsible for driving the attached display. The polygonal model is rendered using eqPly and uses display lists for the static geometry. Using display lists allows Chromium to send geometry and texture data only once to the rendering nodes (retained mode rendering) and display them repeatedly using glCallLists(), which is inexpensive in terms of network overhead. This setup is favourable for Chromium, because the display lists are transmitted only once over the network, and only simple display calls are processed and distributed by Chromium for each rendered frame.
Benchmark 3.b: Driving a Tiled Display Wall (sort-first polygonal rendering of the David 1mm model at 1280x1024 per display; frames per second for Equalizer and Chromium over the number of displays/GPUs)
Chromium initially increases performance when adding nodes, but it quickly stagnates, and even decreases, when more nodes are added. In contrast, Equalizer continually improves performance with more added nodes and exhibits a smooth drop-off in speed-up, due to the expected synchronisation and network overhead as the rendered data gets negligible in size per node. This performance difference is also due to the fact that Equalizer can benefit from distributed parallel view frustum culling on each render thread.
Figure 3.3:
Parallel Rendering Entities
Equalizer is a framework to facilitate the development of distributed and multi-threaded parallel rendering applications. The programming interface is based on a set of C++ classes, modelled closely after the resource hierarchy of a graphics rendering system. The application subclasses these objects and overrides C++ task methods, similar to C callbacks. These task methods are called in parallel by the framework, depending on the current configuration. This parallel rendering interface is significantly different from Chromium [Humphreys et al., 2002] and more similar to VRJuggler [Bierbaum et al., 2001] and OpenGL Multipipe SDK [Bhaniramka et al., 2005].
To separate the responsibilities in a parallel rendering application, different entities are responsible for different aspects of the runtime system: the application process driving a rendering session, the server controlling the parallel rendering configuration, render clients executing the rendering tasks, and an administrative API to reconfigure the rendering session at runtime. All processes communicate with each other through a common network library (Collage) and a client library implementing the Equalizer API, as shown in Figure 3.3.

The administrative API connects to a server and allows some changes to the running configuration, e.g., to create new output channels. Its description is outside the scope of this thesis, and it is mentioned here for completeness.
The main application thread in Equalizer drives the rendering, that is, it carries out the main event loop, but does not actually execute any rendering. Depending on the configuration, the application process often hosts one or more render client threads. These application render threads are identical in behaviour and implementation to render threads on the render client nodes. When a configuration has no additional nodes besides the application node, we have a single-process, multi-threaded rendering application: all application code is executed in the same process, and no network data distribution has to be implemented.

The main rendering loop is simple: the application requests a new frame to be rendered, synchronises on the completion of a frame, and processes events received from the render clients. It may perform idle processing between the start and synchronisation of a frame. Figure 3.1 shows a simplified execution model of an Equalizer application.
The Equalizer server manages the parallel rendering session. It is an asynchronous execution thread or process which receives requests from the application and serves these requests using the current configuration: launching and stopping rendering client processes on nodes, determining the rendering tasks for a frame, and synchronising the completion of tasks.
During initialisation, the application provides a rendering client executable. The rendering client is often, especially for simple applications, the same executable as the application. However, in more sophisticated implementations, the rendering client can be another executable which only contains the application-specific rendering code. The server deploys this rendering client on all nodes specified in the configuration. Render clients may run on a different architecture or operating system from the main application; the underlying network library ensures type safety and endian ordering.

In contrast to the application process, the rendering client main loop is completely controlled by Equalizer, based on application commands. A render client consists of the following threads: the node main thread, one network receive thread, one thread for each graphics card (GPU) to execute rendering tasks, and optionally one thread per GPU for asynchronous readback. If a configuration also uses the application node for rendering, then the application process uses one or more render threads, consistent with render client processes. The Equalizer client library implements the main loop, which receives network commands, processes them, and invokes the necessary task methods provided by the developer.

The task methods clear the frame buffer as necessary, execute the OpenGL rendering commands as well as readback, and assemble partial frame results for scalable rendering. All tasks have default implementations, so that only the application-specific methods have to be implemented, which at a minimum involves the frameDraw() method executing a rendering task. For example, the default callbacks for frame recomposition during scalable rendering implement tile-based assembly for sort-first and stereo decompositions, and unordered z-buffer compositing for sort-last rendering.

Render Context
The render context is the core entity abstracting the application-specific rendering algorithm from the system-specific configuration. It specifies:
Buffer
OpenGL-style read and draw buffer as well as colour mask. These parameters are influenced by the current eye pass, eye separation and anaglyphic stereo settings.
Viewport
Two-dimensional pixel viewport restricting the rendering area. The pixel viewport is influenced by the destination viewport definition and viewports set for sort-first decompositions.
Frustum
Frustum parameters as defined by glFrustum. Typically the frustum is used to set up the OpenGL projection matrix. The frustum is influenced by the destination's view definition, sort-first decomposition, tracking head matrix and the current eye pass.
Head Transformation
A transformation matrix positioning the frustum. For planar views this is an identity matrix; it is mainly used in immersive rendering. The head transformation is usually used to set up the 'view' part of the modelview matrix, before static light sources are defined.
Range
A one-dimensional range within the interval [0..1]. This parameter is optional and should be used by the application to render only the appropriate subset of its data for sort-last rendering.
View
The view object from the logical rendering configuration, as introduced below. It holds view-specific data, such as the camera, model or any other application state.
Event Handling
Event handling routes events from the source window in the rendering thread to the application main thread for consumption. At each step, events can be observed, transformed or dropped. Events are received from the operating system in the rendering thread, transformed there into a generic representation, and sent to the application main thread. The application processes them in the main loop and modifies its internal state accordingly. This follows the natural data flow for most windowing systems and has natural semantics for thread-safe event handling. For Qt, Equalizer internally dispatches events from the process main thread to the render threads to ensure consistent behaviour.
Non-Uniform Memory Access (NUMA) is a common hardware architecture for high-performance visualisation clusters. Modern multi-socket render nodes use a NUMA architecture, where each CPU socket has a number of locally-attached memory buses, GPU and network devices, and CPU sockets are linked to each other with an interconnect. Accessing a memory address located on another processor has a performance penalty for both bandwidth and latency, and accessing a GPU or network interface from a remote processor is slower than a local access.
Figure 3.4:
Exemplary Dual-Socket NUMA Node
Figure 3.4 shows one such NUMA visualisation node, used in the experiments of [Eilemann et al., 2012]. It has two CPU sockets with six cores each, three GPUs connected to the two sockets, and two network cards (10 Gigabit Ethernet and InfiniBand) connected to one socket.

In our parallel rendering system, a number of threads are used to drive a single process in the cluster: the main thread (main), one rendering thread for each GPU (draw) and one thread to finish asynchronous downloads (read), one thread for receiving network data (recv), one command processing thread (cmd), and one thread for image transmission to other nodes (xmit). We have implemented automatic thread placement by extending and using the hwloc library in Equalizer. We restrict all node threads (main, recv, cmd, xmit) to the cores of the processor local to the network card, and all GPU threads (draw, read) to the cores of the processor closest to the respective GPU.
Figure 3.5:
Thread Placement on a NUMA Node
Figure 3.5 shows the thread placement for the node used in Figure 3.4. Threads are bound to all cores of the respective socket, and the ratio of cores to threads varies with the used hardware and software configuration. Many of the threads do not occupy a full core at runtime; especially the node threads are mostly idle on a rendering client.

When using the default first-touch memory placement strategy, memory is allocated on the processor where it is first accessed. All GPU-specific memory allocations are done by the render threads executing the rendering code, therefore placing the CPU-side buffers onto the same socket as the corresponding GPU. Similarly, network buffers are allocated and used from one of the node threads.
Benchmark 3.c:
Thread Affinity on NUMA Hardware
We tested the influence of thread placement by explicitly placing the threads either on the correct or incorrect processor. A low-level memory bandwidth test shows a substantial performance difference between these two settings. We found that this leads to a performance improvement of more than 6% in real-world rendering loads, as shown in Benchmark 3.c. This benchmark uses the aforementioned cluster nodes, and renders polygonal data using sort-first scalable rendering. The exact experiment setup is described in [Eilemann et al., 2012]. While this is a relatively small influence, it becomes more important with higher frame rates, as the relative draw time decreases and the importance of the memory-intensive compositing step increases. Thread placement is therefore one of the components to achieve scalability on larger visualisation clusters with NUMA nodes.

3.3 Configuration

A configuration consists of the declaration of the rendering resources, the physical and logical description of the projection system, and the configuration of how the aforementioned resources are used for parallel and scalable rendering. A configuration is an instantiated class hierarchy in memory used by the server to compute rendering tasks, and has a serialised text file format to read and write configuration files.

The rendering resources are represented in a hierarchical tree structure which corresponds to the physical and logical resources found in a 3D rendering environment: nodes (computers), pipes (graphics cards), windows, and channels (2D rendering areas in a window).

Physical layouts of display systems are configured using canvases with segments, which represent 2D rendering areas composed of multiple displays or projectors.
Logical layouts are applied to canvases and define the views on a canvas. Observers observe multiple views and represent a head-tracked user in a visualisation application.

Scalable resource usage is configured using a compound tree, which is a hierarchical representation of the rendering decomposition and result recomposition across the resources.
The first part of the configuration is a hierarchical structure of nodes → pipes → windows → channels describing the rendering resources. The developer uses instances of these classes to implement application logic and manage data.

The node is the representation of a single computer in a cluster. One operating system process of the render client executable is used for each node. Each configuration might also use an application node, in which case the application process is also used for rendering.

The pipe is the abstraction of a graphics card (GPU), and uses an operating system thread for rendering. All pipe, window and channel task methods are executed from the pipe thread. The pipe maintains the information about the GPU to be used by the windows for rendering.

The window encapsulates a drawable and an OpenGL context. The drawable can be an on-screen window or an off-screen pbuffer or framebuffer object (FBO). Windows on the same pipe share their OpenGL rendering resources. They execute their rendering tasks sequentially on the pipe's execution thread, in the order they are defined in the configuration.

The channel abstracts an OpenGL viewport within its parent window. It is the entity executing the actual rendering. The channel's rendering context is overwritten when it is rendering for another channel during scalable rendering. Multiple channels in application windows may be used to view the model from different viewports. Sometimes, a single window is split across multiple projectors, e.g., by using an external splitter such as the Matrox TripleHead2Go.
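As an illustration, a minimal resource hierarchy in the serialised configuration file format might look roughly as follows (a schematic sketch; hostname, device and viewport values are placeholders, and keyword details may differ from the actual file format):

```
server
{
    config
    {
        node
        {
            connection { hostname "render1" }
            pipe
            {
                device 0                        # GPU to use on this node
                window
                {
                    viewport [ 0 0 1280 800 ]   # on-screen window in pixels
                    channel { name "channel1" } # 2D rendering area
                }
            }
        }
    }
}
```

The nesting mirrors the node → pipe → window → channel hierarchy described above; the named channel is later referenced by segments and compounds.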
Display resources are the second part of the configuration. They describe the physical display setup (canvases → segments), the logical display (layouts → views) and the head tracking of users within the visualisation installation (observers).

A canvas represents one physical projection surface, e.g., a PowerWall, a curved screen, an immersive installation, or a window on a workstation. Canvases provide a convenient way to configure projection surfaces. They group a set of segments (displays or projectors) into a 2D projection surface. A canvas uses layouts describing logical views. Typically, a desktop window uses one canvas, one segment, one layout and one view. One configuration might drive multiple canvases, for example a projection wall with an operator station. Planar surfaces, e.g., a display wall, configure a frustum for the respective canvas. For non-planar surfaces, the frustum is configured on each display segment. The application rendering code has access to the 2D area being updated, for example to draw 2D menus on top of the 3D rendering.

The frustum can be specified as a wall or projection description in the global reference system, which is shared with the head-tracking matrix of the application. A wall is defined by the bottom-left, bottom-right and top-left coordinates relative to the origin. A projection is defined by the position and head-pitch-roll orientation of the projector, as well as the horizontal and vertical field of view and the distance to the projection wall. Figure 3.6 illustrates the wall and projection frustum parameters. All size units are in meters.

A canvas consists of one or more segments. A planar canvas typically has a frustum description, which initialises the segment frusta based on the 2D area covered by each segment. Non-planar frusta are configured by overriding the default segment frusta. These frusta typically describe a physically correct display setup for Virtual Reality installations.
Figure 3.6:
Wall and Projection Parameters
A canvas has one or more layouts. One of the layouts is the active layout, that is, its set of views is currently used for rendering. It is possible to specify OFF as a layout, which deactivates the canvas. The same layout may be used on different canvases, for example to mirror a display wall layout on a control station window.

A segment represents one output channel of the canvas, e.g., a projector or a display. A segment has an output channel, which references the channel to which the display device is connected. To synchronise the video output, a swap barrier is configured to synchronise the respective window buffer swaps. Swap barriers can use network-based software synchronisation or hardware synchronisation based on NVidia's G-Sync hardware.
Figure 3.7:
A Canvas using four Segments
A segment covers a two-dimensional region of its parent canvas, configured by the segment viewport. The viewport is in normalised coordinates relative to the canvas. Segments might overlap (edge-blended projectors) or have gaps between each other (display walls, Figure 3.7). The viewport is used to configure the segment's default frustum from the canvas frustum description, and to place logical views correctly.

A layout is the grouping of logical views. It is used by one or more canvases. For all given layout/canvas combinations, Equalizer creates destination channels when the configuration is loaded. These destination channels may later be referenced by compounds to configure scalable rendering. Layouts can be switched at runtime by the application. Switching a layout activates different destination channels for rendering.

A view is a logical view of the application data, in the sense used by the Model-View-Controller pattern. It can configure a scene, viewing mode, viewing position, or any other representation of the application's data. The view object is accessible to the application thread and all render threads contributing to its rendering. This allows the application to manage view-specific data by attaching it as a distributed object to the view, which is synchronised from the application main thread to the render clients at the beginning of each frame.

A view has a fractional viewport relative to its layout. A layout is usually fully covered by its views. Each view can have a frustum description. The view's logical frustum overrides physical frusta specified at the canvas or segment level. This is typically used for non-physically correct rendering, e.g., to compare two models side-by-side on a canvas. If the view does not specify a frustum, it uses the sub-frustum resulting from the covered area on the canvas.

(Dataset courtesy of the VolVis distribution of SUNY Stony Brook, NY, USA.)
A view might reference an observer, in which case its frustum is head-tracked.
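The relationship between canvases, segments, layouts and views described above can be sketched in Equalizer's configuration-file syntax. The following fragment is illustrative only; the channel names and viewport values are invented, and the keyword spelling follows the style of the compound examples shown later in this chapter:

```
canvas
{
    layout "main"                  # active layout; OFF would disable the canvas
    segment                        # left display of a two-segment wall
    {
        channel  "channel-left"    # output channel driving the display device
        viewport [ 0 0 .48 1 ]     # normalised region of the canvas
    }
    segment                        # right display, with a gap for the bezel
    {
        channel  "channel-right"
        viewport [ .52 0 .48 1 ]
    }
}

layout
{
    name "main"
    view { viewport [ 0 0 1 1 ] }  # a single view covering the whole layout
}
```

Loading this layout/canvas combination would create one destination channel per segment covered by the view.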
Figure 3.8: Layout with four Views
Figure 3.8 shows an example layout using four views on a single segment. Figure 3.9 shows a real-world setup of a single canvas with six segments using underlap for the display bezels, with a two-view layout. This configuration generates eight destination channels.

An observer represents an actor looking at one or multiple views. It has a head matrix, defining its position and orientation within the world, as well as eye offset and focus distance parameters. Typically, a configuration has one observer. Configurations with multiple observers are used if multiple head-tracked users are in the same configuration session, e.g., a non-tracked control host with two tracked head-mounted displays.
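A layout mixing tracked and untracked views might be sketched as follows; the names and the exact keyword for referencing an observer are illustrative:

```
observer { name "user" }               # one (optionally head-tracked) actor

layout
{
    name "control"
    view { viewport [ 0 0 .5 1 ] }                    # untracked overview
    view { viewport [ .5 0 .5 1 ]  observer "user" }  # head-tracked view
}
```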
Compound trees describe how multiple rendering resources are combined to produce the desired output, especially how multiple GPUs are aggregated to increase rendering performance. They are one of the core innovations, enabling a flexible resource configuration. Compounds are modified at runtime by equalizers to implement dynamic behaviour, e.g., for load balancing.

Figure 3.9: Tiled Display Wall using one Canvas with six Segments and a Layout with two Views

Compounds are a data structure describing the execution of rendering tasks in the form of a tree. Each compound corresponds to a set of rendering tasks (clear, draw, assemble, readback) and references a channel from the resource description executing these tasks. The allocation of channels on pipes and nodes determines which resources execute the task, and what can be executed in parallel. A compound may provide output frames from the readback task to others, and can request input frames from others for its own assembly task. Output frames are linked to input frames by name.

Compound trees are a logical description of the rendering pipeline, and only reference the actual physical resources through their channels. This allows mapping a compound tree to different physical configurations by simply replacing the channel references. For example, one can test the functionality of a sort-last configuration by using channels of different windows on a single-GPU workstation before deploying it to multiple physical GPUs.

A simple leaf compound description for rendering a part of the data set, given by the data range, into a particular region of the viewport is shown in Figure 3.10.
The data range is a logical mapping of the data set onto the unit interval and is left to the application to interpret appropriately. Hence, the range [0 .5] indicates that the first half of the data set should be rendered, for example the first half of the triangles of a polygonal mesh. The viewport is given by the parameters [x y width height] as fractions of the parent's viewport; in the example the data is thus rendered into the left half of the viewport. The resulting framebuffer data, including per-pixel colour and depth, of the rendering executed on this channel is read back and made available to other compounds under the name left half.

    compound
    {
        channel  "draw"
        buffer   [ COLOR DEPTH ]
        range    [ 0 .5 ]
        viewport [ 0 0 .5 1 ]
        outputframe { name "left half" }
    }

Figure 3.10: Compound Rendering half of the Data Set into half of the Viewport
A non-leaf compound performing an image assembly and compositing task is shown in Figure 3.11. Framebuffer data is read from two other compounds, which executed the rendering for part a and part b of the data set in parallel. The compound itself executes, by default, z-depth visibility compositing of the two input images on its channel and returns the resulting colour framebuffer in the output frame named frame.display.

    compound
    {
        channel "display"
        inputframe  { name "part a" }
        inputframe  { name "part b" }
        outputframe { buffer [ COLOR ] }
    }

Figure 3.11: Compound Performing Image Compositing
Leaf compounds execute all tasks by default, but the focus is often on the draw task, with a default assemble and standard readback task used to pass the resulting image data on to other compounds for further compositing. While leaf compounds execute the rendering in parallel, non-leaf compounds often correspond to, but are not restricted to, the (parallel) image compositing and assembly part. The readback and assemble tasks are only active if output or input frames have been specified, respectively. Otherwise the rendered image frame is left in place for further processing in a parent compound sharing the same channel.
Note that non-leaf nodes in the compound tree traverse their children first before performing their default assemble and readback tasks. Furthermore, compounds only define the logical task decomposition structure, while the execution is actually performed on the referenced channels. Since compounds can share channels, as is often done between a parent and one of its child compounds, rendered image data can sometimes be left in place, avoiding readback and transfer to another node.

All attributes, as well as the channel, are inherited from the parent compound if not specified otherwise. The viewport, range, period and phase, pixel, subpixel and eye attributes describe the decomposition into the parent's 2D viewport, database range, temporal, pixel, subpixel and eye passes, respectively.

A more formal classification of compound entities is:

Root compound is the top-level compound of a compound tree. It might also be a destination compound, or can be empty (not referencing a channel) when synchronising multiple destination channels.
Destination compound(s) are the top-most compounds referencing a channel, which becomes the destination channel. This destination channel determines the rendering context for the whole subtree, that is, compounds and their channels lower in the hierarchy contribute to the rendering of the destination channel by executing part of the destination render context and providing output frames which are eventually composited onto the destination channel.
Source compounds are the leaf nodes in a compound tree. They typically use a different channel from the destination channel and configure scalability by overriding render context parameters, which decomposes the rendering of the destination channel. By adding output and input frames, the partial results are collected and composited:
Decomposition
On each child compound the rendering task of that child can be limited by setting the viewport, range, period and phase, pixel, subpixel, eye or zoom as desired.

Compositing

Source compounds define an output frame to read back the result. This output frame is used as an input frame on the destination compound receiving the pixels. The frames are connected with each other by their name, which has to be unique within the root compound tree. For parallel compositing, the algorithm is described by defining multiple input and output frames across all source compounds and restricting the tasks to assemble and readback.
Intermediate compounds may be used to simplify the task decomposition or to configure parallel compositing.
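To make this classification concrete, the following sketch shows a minimal two-way sort-first compound in the style of Figures 3.10 and 3.11; channel and frame names are invented. The destination compound renders the left half in place, while a source compound renders the right half and returns it through an output frame:

```
compound
{
    channel "destination"              # destination compound

    compound                           # child sharing the destination channel
    {
        viewport [ 0 0 .5 1 ]          # renders the left half in place
    }
    compound                           # source compound on a second GPU
    {
        channel  "source"
        viewport [ .5 0 .5 1 ]         # decomposition: right half
        outputframe { name "right" }   # compositing: read back the result
    }
    inputframe { name "right" }        # assembled onto the destination
}
```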
Virtual Reality is an important field for parallel rendering, and requires special attention to be supported as a first-class citizen in a generic parallel rendering framework.
Equalizer has been used in many virtual reality installations, such as the Cave2 [Febretti et al., 2013], the high-resolution C6 CAVE at the KAUST visualisation laboratory, and head-mounted displays (Figure 1.1). In the following we lay out the features needed to support these installations, motivated by application use cases.
Head tracking is the minimal feature needed to support immersive installations.
Equalizer supports multiple, independently tracked views through the observer abstraction. Built-in VRPN support enables the direct, application-transparent configuration of a VRPN tracker device. Alternatively, applications can provide a 4 × 4 tracking matrix. Both CAVE-style tracking with fixed projection surfaces and HMD tracking with moving displays are implemented.

To our knowledge, all parallel rendering systems have the focal plane coincide with the physical display surface. For better viewing comfort, we introduce a new dynamic focus mode, where the application defines the distance of the focal plane from the observer, based on the current look-at distance.

Figure 3.12 illustrates this feature in a top-down view of a Cave. The observed teapot is significantly behind the front projection wall in the virtual world. In a standard implementation (left side), the focal plane coincides with the projection surface. In our implementation, the application configures a focus distance to coincide with the observed teapot (right side). The dotted line shows the focal plane for both projection walls. Initial experiments show that this provides better viewing comfort, in particular for objects placed in front of the physical displays.
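These tracking features surface in the observer configuration. A sketch follows; the device name and the keyword spellings are illustrative rather than authoritative:

```
observer
{
    vrpn_tracker   "Tracker0@localhost"   # application-transparent head tracking
    focus_distance 1.5                    # focal plane distance from the observer
    focus_mode     relative_to_observer   # dynamic focus instead of the display plane
}
```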
Traditional head tracking computes the left and right eye positions using an interocular distance. However, since human heads are not symmetric, we support an optional configuration of individual, measured 3D eye translations relative to the tracking matrix.
Figure 3.12: Dynamic Focus in a Cave
The model unit allows applications to specify a scaling factor between the model and the real world, enabling the exploration of macroscopic or microscopic worlds in virtual reality. The unit is per view, allowing different scale factors within the same application. It scales both the specified projection surface and the eye positions (and therefore the eye separation) to achieve the necessary effect.
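As a sketch, a microscopy application might scale a model specified in micrometres into a human-sized world with a per-view unit; the keyword name is illustrative:

```
layout
{
    name "exploration"
    view
    {
        viewport   [ 0 0 1 1 ]
        model_unit 1e-6    # model coordinates are micrometres: scales the
                           # projection surface and eye separation accordingly
    }
}
```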
Applications can switch each view between mono and stereo rendering at runtime, and run monoscopic and stereoscopic views concurrently. This switch can involve starting and stopping resources and processes for passive stereo or stereo-dependent task decompositions.
Simulations performed on today's high-performance supercomputers produce massive amounts of data, which are often too expensive to move to another system. Tiled display walls have proven to help understand complex data due to their size, resolution and collaborative usage. Often the two systems are not located in the same facility because of power constraints or other factors.

Software for driving tiled display walls has converged on the collaborative aspect of these installations. Sage, Sage 2, DisplayCluster and Omegalib implement a multi-window environment around a shared framebuffer concept. DisplayCluster provides a dynamic, desktop-like windowing system with built-in media viewing capability that supports ultra high-resolution imagery and video content, as well as remote streaming, allowing arbitrary applications from remote sources to be shown. Figure 3.13 shows our evolution of DisplayCluster, called Tide [Blue Brain Project, 2016], running on a 24 megapixel tiled display wall.

Figure 3.13: Tiled Display Wall with Remote Rendering of the Equalizer-based RTNeuron Application
Streaming to a Tide wall is implemented using the Deflect [Nachbaur et al., 2014] client library. The application provides an image buffer to Deflect, which is compressed using libjpeg-turbo and sent asynchronously and in parallel by the stream library. Multiple stream sources from multiple processes can provide content to a single wall window, enabling parallel streaming for parallel rendering applications. Deflect also implements an event model, where the application registers to receive keyboard, mouse and window management events from the wall.

We integrated the stream library into Equalizer to send the framebuffer of each destination channel of a view to DisplayCluster, using a direct FBO download (if possible) or a texture download. We use asynchronous transmission to pipeline compression, streaming and rendering. Received events from DisplayCluster are converted and forwarded to Equalizer's event system. This integration allows all Equalizer applications to benefit from streaming without code changes, configured by specifying the DisplayCluster hostname on all views to be streamed.

We evaluated the overall system performance using the Blue Brain Project setup shown in Figure 3.14. The supercomputer and data are located in a remote supercomputing centre in Lugano, whereas the tiled display wall is at the project's main office in Lausanne. Both locations are linked using a high-speed WAN link. The HPC installation has a co-located visualisation cluster for remote rendering scenarios.

Benchmark 3.d shows the performance of streaming RTNeuron rendering from the Lugano cluster to the remote 24 megapixel wall. We tested three resolutions (2K, 4K and 8K) and four different tile sizes. Due to a configuration issue, the WAN link delivered only 1 GBit/s throughput during the benchmark. RTNeuron is an Equalizer-based application used in the Blue Brain Project to analyse results from detailed neuronal simulations.

Benchmark 3.d: Remote RTNeuron - Tide Parallel Rendering (frame rates for 2K, 4K and 8K wall resolutions across four tile sizes, at JPEG compression qualities of 75%, 95% and 100%; 256² tiles delivered 31/20/18 FPS at 2K/4K/8K)
Figure 3.14: Remote Streaming Scenario (a four-rack BlueGene/Q with GPFS storage and visualisation nodes with Tesla K20 and GTX 580 GPUs in Lugano, connected via a 10 GBit/s WAN link to the display wall in Lausanne)
The results show that interactive frame rates are achievable even at the full native resolution, and that 95% compression delivers the best performance in most cases. Based on the experiments we settled on a fixed tile size and 100% compression quality, which avoids artefacts, as the default settings in Equalizer.

Decoupling the display system and software from the rendering system has many benefits. It increases robustness, provides reliable performance on a shared, collaborative device, facilitates media and device integration, and minimises data movement.

In contrast to most other parallel rendering frameworks, Equalizer decouples the compositing algorithm from the task decomposition. This is a key aspect of our architecture, allowing a flexible configuration, often in many unforeseen ways.
The compound tree with its task decomposition, input and output frames, is a specialised description to "program" scalable rendering across parallel resources. Compositing is configured using output frames connected to input frames, compound tasks and eye passes, as well as frame parameters. In its simplest form, a sort-first source compound provides an output frame, which is routed to an input frame of the same name on the destination compound. The source viewport decomposes the task, and the output frame collects this partial result from the source channel to composite it at the correct offset onto the destination channel.

Frame parameters customise pixel handling. They include the transfer buffers (colour, depth), partial channel viewport, pixel zooming (upscale and downscale), and transport method (on-GPU texture or CPU memory). An output frame may be connected to multiple input frames. Frame parameters are used together with compound tasks for parallel compositing, and advanced features such as monitoring and dynamic frame resolution, introduced in Chapter 6.

Figure 3.15: Direct Send Compositing
Figure 3.15 shows the pixel flow of direct send compositing for a three-way sort-last decomposition, where the destination channel also contributes to the rendering. In the first step, each channel exchanges two colour + depth tiles with its neighbours, and then z-composites its own tile. This yields one complete tile on each channel, of which two colour tiles are then assembled on the destination channel, where the third tile is already in place.

The corresponding compound tree is shown schematically in Figure 3.16. Each of the three channels has a child compound to execute the rendering and read back two incomplete tiles for sort-last compositing on the corresponding two sibling compounds. These three leaf compounds represent the first step in Figure 3.15. One level up, each channel receives two tiles and assembles them onto its partially rendered result, creating a complete tile (middle step in Figure 3.15). For the two source-only channels, a final colour-only output image is connected to the destination channel. The arrows illustrate the pixel flow for one of the tiles.

Figure 3.16: Direct Send Compound
For most rendering applications, even a relatively complex setup such as this example requires very little programmer involvement. The abstractions provided by the resource description, render context and compounds enable Equalizer to reconfigure the application almost transparently. For polygonal rendering, it suffices that the application honours the range parameter of the render context to decompose its rendering. All other tasks, in particular the parallel compositing and pixel transfers, are fully handled by Equalizer. Applications which require ordered compositing, for example volume rendering, overwrite the assemble callback to reorder the input frames correctly before passing them on to the compositing code.

The abstraction through frames is flexible, but still allows many architectural optimisations:
Unordered Compositing:
Unless overwritten by the application, Equalizer composites input frames in the order they become available, not in the order they are configured. In the example in Figure 3.15, the destination channel assembles its four input frames one by one as each output frame is received. Due to asynchronous execution and the resulting pipelining of operations, the availability changes each frame depending on the runtime of other tasks.
Parallel Compression, Downloads and Network Transfers:
Compressing, transmitting and receiving frames between nodes is handled by threads independent of the render thread. GPU transfers are handled by asynchronous PBO transfers. Pipelining all these operations with the actual rendering significantly minimises the compositing overhead.
On-GPU Transfers:
Occasionally the source and destination channel are on the same GPU. Using textures as pixel buffers minimises the overhead of the output to input frame transfer.

In Chapter 5 we provide detailed information on our compositing advances.
Compounds provide only a static configuration of the parallel rendering setup.
Equalizers are algorithms that hook into a compound and modify parameters of their respective subtree at runtime, to dynamically optimise the resource usage. Each equalizer focuses on tuning one aspect of the decomposition, allowing several of them to be combined in a configuration. Due to their nature, they are transparent to application developers, but might have application-accessible parameters to tune their behaviour. Resource equalisation is the critical component for scalable parallel rendering, and is therefore the eponym of the Equalizer project name.

Compounds are a static snapshot of a configuration, and equalizers provide dynamic configuration on top. This separation of responsibilities is an important architectural component of our parallel rendering framework. In Chapter 6 we provide an extensive overview of the available equalizers.
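As an illustrative sketch, a load equalizer might be attached to a two-way sort-first compound as follows; the equalizer then rewrites the child viewports each frame based on the measured load (keyword spellings and names are illustrative):

```
compound
{
    channel "destination"
    load_equalizer { mode 2D  damping .5 }   # adjusts child viewports at runtime

    compound { viewport [ 0 0 .5 1 ] }       # initial split: left half, in place
    compound                                 # right half on a second GPU
    {
        channel  "source"
        viewport [ .5 0 .5 1 ]
        outputframe { name "right" }
    }
    inputframe { name "right" }
}
```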
Supported by the distributed execution layer, introduced in the next section, Equalizer implements dynamic reconfiguration of a running visualisation application. This functionality is used by runtime layout switches and runtime reliability.

Runtime reconfiguration is designed to be a side effect of the internal resource management algorithm, that is, the initialisation and exit of a configuration uses the same code path as the runtime addition of a single channel. Rendering resources in Equalizer are reference counted by the compounds using them, and the state change from inactivated to activated triggers a launch or stop of the associated resource. These resource counters are propagated up the resource hierarchy: a channel will (de)activate its window, pipe and node.

Channel and window (de)activation are relatively lightweight and only incur the creation and initialisation of the class instance (and associated OpenGL resources) on the client. A pipe (de)activation additionally incurs a new (or removed) operating system thread, and a node (de)activation is tied to a process. Depending on the application logic, at some level of the resource hierarchy application data has to be distributed to the rendering client. An application may use pre-launched rendering clients which run even when not active, and can use this to cache application state for faster reconfiguration.
Layout switches are caused by the activation of a different layout on a canvas by the application code. A typical layout switch will only (de)activate channels, which is a very lightweight operation. Since each combination of a layout and a canvas creates a unique set of destination channels, the destination compounds of these channels may use a different set of source channels for rendering, which may reside on different GPUs or even nodes in the cluster. Some configurations use a different number of rendering GPUs or even nodes, causing the startup or exit of rendering processes in the cluster.

Runtime reliability detects failed nodes in a visualisation cluster, independent of the cause (hardware or software failure). The server tracks a 'last seen' time stamp for each node. When waiting for a task to finish, the server uses this time stamp to detect failures. Potentially failed nodes are pinged with a special command, which is processed even if all application threads are busy. Nodes still answering this command are considered alive for a longer period, after which they are considered failed, likely due to an infinite loop in the application code. Failed nodes are removed from the configuration, and their associated compounds are deactivated. In the case of load-balanced source channels, the load equalizer will simply reassign the work to other source channels. For static configurations, the source channel contribution will be missing from the final image. For destination channels, the corresponding output display will disappear from the configuration.
An important part of writing a parallel rendering application is the communication layer between the individual processes. Equalizer relies on the Collage network library for its internal operation. Collage provides networking functionality at different abstraction layers, gradually providing higher-level functionality for the programmer. Figure 3.17 shows the main primitives in Collage:
Connection
A stream-oriented point-to-point communication line. Connections transmit raw data reliably between two endpoints for unicast connections, and between a set of endpoints for multicast connections. For unicast, process-local pipes, TCP and InfiniBand RDMA are implemented. For multicast, a reliable, UDP-based protocol is discussed in Section 7.3.
DataI/OStream
Abstracts the marshalling of C++ classes from or to a set of connections by implementing output stream operators. Uses buffering to aggregate data for network transmission. Performs byte swapping during input if the endianness differs between the remote and local node.
Node and LocalNode
The abstraction of a process in the cluster. Nodes communicate with each other using connections. A LocalNode listens on various connections and processes requests for a given process. Received data is wrapped in ICommands and dispatched to command handler methods. A Node is a proxy for a remote LocalNode.
Object
Provides object-oriented and versioned data distribution of C++ objects between nodes. Objects are registered or mapped on a LocalNode.
Figure 3.17: Communication between two Collage Objects
Collage implements a few generic distributed objects, which are used by Equalizer and other applications. A barrier is a distributed primitive used for software swap synchronisation. Its implementation follows a simple master-slave approach, which has shown to be sufficient for this use case. Queues are distributed, single-producer, multiple-consumer FIFO queues, used for tile and chunk compounds (Section 4.2.4). To hide network latencies, consumers prefetch items.

An object map facilitates the distribution and synchronisation of a collection of distributed objects. Master versions can be registered on a central node, e.g., the application node in Equalizer. Consumers, e.g., Equalizer render clients, can selectively map the objects they are interested in. Committing the object map commits all registered objects and syncs their new version to the slaves. Syncing the map on the slaves synchronises all mapped instances to the new version recorded in the object map. This effective design allows data distribution with minimal application logic.

Chapter 7 contains more information on our network library.
CHAPTER 4

SCALABLE RENDERING
Scalable rendering is a subset of parallel rendering which aims to improve the frame rate of a rendering workload by decomposing it over multiple rendering resources. Parallel rendering includes other use cases, for example using multiple GPUs to drive the individual displays of a tiled display wall. This chapter addresses the research question of how we can improve the rendering performance of visualisation applications to enable users to explore more data.

Scalable rendering research has put a lot of focus on two of the three architectures classified by [Molnar et al., 1992]: sort-first and sort-last rendering. Sort-middle rendering is still largely confined to hardware implementations due to its high communication cost of sorting and distributing fragments to processing units.

In this chapter, we present new parallelisation variants of sort-first rendering, and other decompositions which break out of this standard classification. For each mode, we introduce its algorithm and implementation, potential impact on the application code, as well as its strengths and weaknesses. Due to the flexible architecture of our parallel rendering system, these modes are largely usable with any Equalizer application and can be combined with all other modes.

Most of these rendering modes are similar to sort-first rendering, in that they decompose the final view spatially or temporally, while computing complete pixels on each source channel. Stereo compounds decompose per eye pass, DPlex compounds temporally, and pixel and subpixel compounds use equal spatial decompositions. Finally, tile and chunk compounds implement implicit load balancing for sort-first and sort-last rendering using queueing of work items.

This wide set of decomposition modes for scalable rendering, embedded in our generalised compound structure, enables applications and researchers to decompose the rendering task in, as far as we know, any way possible for a rendering pipeline.
To our knowledge no other implementation provides this breadth and flexibility, and some algorithms appear for the first time in Equalizer.
Sort-first rendering decomposes the rendering task in screen space. It has many variants: tiled display walls and similar installations perform sort-first parallel rendering naturally by using multiple GPUs to drive the output displays, and classic sort-first scalable rendering assigns one screen-space region to each rendering resource, often combined with load balancing. Equalizer supports these classic sort-first modes. In the following subsections we present other variants of sort-first rendering, each tailored to a certain use case.

Figure 4.1: Anaglyphic Stereo Compound
Stereo, or eye, decomposition is a specialised version of sort-first rendering. Two GPUs are each assigned one of the eye passes during stereo rendering. For passive stereo, no compositing step is needed, whereas for active and anaglyphic stereo the frame buffer for one eye pass has to be copied to the destination channel. Due to the strong similarity between both eye passes, this mode provides close to perfect load balance. Figure 4.1 shows an anaglyphic stereo compound.

While many visualisation applications provide passive or active stereo rendering, and sometimes decomposition using two GPUs, our implementation within our flexible compound structure allows stereo decomposition to be fully exploited. Stereoscopic and monoscopic rendering is no special case in the architecture, but rather a configuration of the rendering resources. Among other things, this allows extending a two-way stereo decomposition with further resources and any other scalable rendering mode. One can also easily set up an application with mixed rendering, e.g., to render a monoscopic view on a control workstation while rendering stereoscopically on a larger immersive installation.

Stereo compounds are configured by restricting each source to a single eye pass. Typically, one of the channels also configures the cyclop eye pass, which is activated when the view is switched to monoscopic rendering.

Figure 4.2: Pixel Compound
Pixel compounds (Figure 4.2) decompose the destination channel by interleaving pixels in image space. They are a variant of sort-first rendering well suited for fill-limited applications which are not geometry bound, for example direct volume rendering. Source channels cannot reduce geometry load through view frustum culling, since each source channel has almost the same frustum as the destination channel. However, the fragment load on all source channels is reduced linearly and well load balanced due to the interleaved distribution of pixels. This functionality is transparent to Equalizer applications, and the default compositing uses the stencil buffer to blit pixels onto the destination channel.

Pixel compounds are configured by restricting each source compound with a pixel kernel. The kernel describes the size of the decomposition in 2D space, and the 2D pixel offset within this region. This follows our design philosophy of enabling features by generalising the underlying algorithm rather than hardcoding them.

Figure 4.3: Subpixel Compound
Subpixel compounds (Figure 4.3) are similar to pixel compounds, but they decompose the work for a single pixel, for example during Monte-Carlo ray tracing, FSAA or depth-of-field rendering. The default compositing algorithm uses accumulation and averaging of all computed fragments for a pixel. Like pixel compounds, this mode is naturally load balanced in the fragment processing stage, but cannot scale geometry processing. This feature is not fully transparent to the application, since it decomposes rendering algorithms which render multiple samples per pixel. Applications need to adapt their rendering code, for example to jitter or tilt the frustum based on the subpixel of the current rendering pass.

Subpixel compounds increase the amount of pixels to be composited linearly with the number of source channels. They can use the same parallel compositing algorithms as sort-last rendering. Since the compositing logic is decoupled from the task decomposition, it reuses the same code as sort-last parallel compositing except for the combination step on each GPU.

Subpixel compounds are configured on each source compound with a subpixel kernel. This kernel describes the number of contributing sources, and the offset for each source in this range.
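The pixel and subpixel kernels might be written as follows, in the style of the earlier compound examples; the exact [offset size] parameter layout and the names are illustrative:

```
compound                               # pixel compound: interleave columns
{
    channel "destination"
    compound { pixel [ 0 0 2 1 ] }     # even columns, rendered in place
    compound
    {
        channel "source"
        pixel [ 1 0 2 1 ]              # odd columns on a second GPU
        outputframe { name "odd" }
    }
    inputframe { name "odd" }          # blitted via the stencil buffer
}

compound                               # subpixel compound: two samples per pixel
{
    channel "destination"
    compound { subpixel [ 0 2 ] }      # first of two sample passes
    compound
    {
        channel "source"
        subpixel [ 1 2 ]               # second pass, e.g. a jittered frustum
        outputframe { name "sample" }
    }
    inputframe { name "sample" }       # accumulated and averaged by default
}
```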
Tile decompositions (Figure 4.4) are a variant of sort-first rendering. They decompose the scene into a set of fixed-size image tiles. These tasks, or work packages, are queued and processed by all source channels by polling a central queue on the server. Prefetching ensures that the task communication overlaps with rendering.
Figure 4.4: Tile Compound
As shown in [Steiner et al., 2016], work packages can provide better performance due to being implicitly load balanced, as long as there is an insignificant overhead for the render task setup. This mode is transparent to Equalizer applications. We have used a tile compound to scale an interactive ray tracing application to hundreds of rendering nodes using RTT Deltagen.

Tile compounds are configured using output and input queues. The destination channel has an output queue, which configures the tile size and represents the server-side end of the task queue. Each source compound has an input queue of the same name, which represents the client-side queue end polling tasks from the server. Output frames from tile sources are automatically configured with the current tile offset and size for correct assembly on the destination channel.
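The queue semantics can be sketched as follows; this is a single-process toy model of the server-side output queue and client-side polling, not Equalizer's distributed implementation:

```python
from queue import Empty, Queue

def output_queue(width, height, tile_size):
    """Server-side output queue: the destination viewport split into
    fixed-size tiles (x, y, w, h), clipped at the right/bottom border."""
    tasks = Queue()
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tasks.put((x, y, min(tile_size, width - x),
                       min(tile_size, height - y)))
    return tasks

def drain(tasks):
    """Client side: poll tiles until the queue is empty. Faster sources
    simply pull more tiles, which is the implicit load balancing."""
    rendered = []
    while True:
        try:
            rendered.append(tasks.get_nowait())
        except Empty:
            return rendered
```

In the real system the queue ends live on different nodes and tiles are prefetched, so the poll latency overlaps with rendering.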
Sort-last rendering decomposes the rendering task in object space, that is, each rendering resource produces an incomplete full-resolution image. To our knowledge, sort-last rendering always requires a compositing step, which is the challenging part for this decomposition mode. It is often addressed using parallel compositing, which we discuss in Section 5.2.

Equalizer does support classical sort-last rendering with or without load balancing, where each resource renders one part of the application's database. Furthermore, we also implement chunk compounds, which are similar to tile compounds (Section 4.2.4), with which they share a lot of the infrastructure. Chunk compounds also produce work packages, although using a fixed-size subrange of data for each package instead of the tile coordinates used for tile compounds.
Time-multiplexing distributes full frames over the available resources, such that each resource only renders a subset of the visible frames (Figure 4.5). This mode, also called alternate frame rate or DPlex, was first implemented in [Bhaniramka et al., 2005] for shared memory machines. The algorithm is, however, much better suited for distributed memory systems, since the separate memory space makes concurrent rendering of different frames much easier to implement. While it increases the framerate almost linearly, it cannot improve the latency between user input and the corresponding output. At best, it can achieve the same latency as the single-GPU case, when perfect linear scalability is achieved. Consequently, this decomposition mode is mostly useful for non-interactive film generation.
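The round-robin frame distribution can be sketched with the period and phase parameters used to configure DPlex compounds; `renders_frame` is a hypothetical helper, not Equalizer API:

```python
def renders_frame(frame, period, phase):
    """A source compound restricted with the given period and phase
    renders `frame` iff the frame number matches its temporal
    restriction (render every period-th frame, starting at phase)."""
    return frame % period == phase

# Three sources with period 3 and phases 0..2: every frame is rendered
# by exactly one source, round-robin.
schedule = [[s for s in range(3) if renders_frame(f, 3, s)]
            for f in range(6)]
```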
Figure 4.5: Time-Multiplex Compound
DPlex is very well load balanced, since most applications observe a strong frame-to-frame coherence with respect to the rendering load. This decomposition mode has the peculiarity that small imbalances tend to accumulate such that the concurrent frames all finish simultaneously. To provide a smooth framerate, if so desired, a framerate equalizer can be installed on the destination compound. Section 6.5 covers this functionality. It is transparent to Equalizer applications, but does require the configuration latency to be greater than or equal to the number of source channels.

DPlex rendering is not hardcoded into our framework, but configured by restricting the rendering task temporally on each source compound. This is achieved by setting period and phase parameters, which configure the number of frames skipped and the starting offset on the given source compound. A simple DPlex compound would have a destination compound with n source compounds, where each source has a period of n and one phase from 0..n−1. While this generalization may seem artificial, it opens up different use cases, for example giving a fast GPU a smaller period, thus giving it more work.

Stereo-selective compounds have different configurations, depending on the current rendering mode. Each compound sub-tree can restrict the eye passes it renders from the default left, right and cyclop passes. Depending on the active stereo mode (stereo or mono), restricted compound trees may be skipped or activated. This is used on one hand to configure stereo compounds, but may also be used to configure different decompositions depending on the stereo mode. Figure 4.6 shows a simple example: a dual-GPU setup is used with eye-parallel rendering during stereo rendering, and a standard sort-first parallel rendering during monoscopic rendering. Note that the rendering mode is runtime-configurable, that is, the application can switch the view from monoscopic to stereoscopic rendering at any time, activating and deactivating the configured compounds and attached resources. It is also possible to configure a different set of resources (nodes and GPUs) per stereo mode, triggering the launch and exit of render client processes during the stereo switch.
Figure 4.6: Stereo-Selective Compound
A major contribution of our parallel rendering system is the flexible system architecture. While many applications and frameworks implement a subset of the features mentioned above, most of them hardcode the algorithms, predetermining the number of possible configurations. In Equalizer, both the decomposition and the recomposition of the rendering task are derived through a number of orthogonal parameters, which are easily combined to configure common scalable rendering modes. For advanced usage, they can also be configured for many other use cases. During deployment of Equalizer, we have seen many interesting and unforeseen configurations:

• Reusing the period parameter used to configure the number of frames in a DPlex compound, an underpowered control workstation for a large tiled display wall was configured to render only every other frame using a period of two. Due to the standard latency of one frame, this meant that the display wall rendering became the bottleneck. It could now render at a substantially higher framerate than before, when the control host was the bottleneck.

• Rerouting one of the eye passes of a head-mounted display to a large display using an output and input frame, external users could observe the interaction and view of the person using the HMD. The same can be achieved by mirroring the video signal by other means, but this was not available on the given setup.

• Using combined stereo and sort-first decomposition on the central tiles of a tiled display wall. Oftentimes the central tiles of a tiled display wall receive a higher rendering load than the outer tiles. In this particular configuration, each tile was driven by a dual-GPU node using active stereo compounds, and the middle segments were given an additional machine, setting up a two-way sort-first decomposition under each node of the two-way stereo compound.
• Combined sort-last and sort-first decomposition: sort-first rendering is typically limited in the scalability of the decomposition step, where geometry overlap between resources often yields diminishing returns after about ten GPUs. Sort-last rendering, on the other hand, is often limited by the overhead of the compositing step. Combining both modes enables balancing these constraints for better scalability.
Benchmarks for static compound configurations are relatively rare, since most practical settings use some type of load balancing. They are however interesting in that they show how well different rendering algorithms are naturally load balanced. In [Eilemann et al., 2018], we collected some data for polygonal and volume rendering. Benchmark 4.a provides a strong scalability benchmark for both types of rendering and a set of compounds. The linear scaling graph provides a theoretical limit for perfect scalability compared to the single-threaded, single-GPU rendering performance.
Benchmark 4.a: Compound Scalability (polygonal and volume rendering)
For static task decomposition, polygonal rendering performs better with sort-last compared to sort-first. Sort-last performs a static decomposition in data space, which reduces the geometry processing load per GPU, the dominant factor in our polygonal rendering code. Since this decomposition can be computed easily, even a static decomposition is relatively balanced. A sort-first decomposition can reduce the geometry processing through view frustum culling, but the remaining visible set will be relatively unbalanced on each GPU, depending on the current camera position.

For volume rendering, this balance is reversed and sort-first performs better. Typically, a volume renderer is bound by fragment processing. Consequently, sort-first rendering scales better than sort-last, since the screen space is equally divided. For both rendering algorithms, one can observe static imbalances in the zig-zag graphs, where odd numbers of resources coincidentally split the rendering load less evenly than even numbers.

Tile compounds provide close to linear scalability, and in some cases super-linear scaling. Compared to the other compounds, tile compounds are naturally load balanced, providing this excellent scalability relative to static sort-first and sort-last. Super-linear scaling is due to their small work package size, which makes rendering more cache-friendly. Polygonal rendering has a higher static overhead per tile due to the CPU-side view frustum culling, and therefore scales less well compared to volume rendering.

Pixel compounds provide predictable scalability, but fail to approach ideal linear scaling due to their increased compositing cost and constant geometry load. Benchmarks 6.a, 6.b, 6.c and 6.d provide more realistic scalability data when using these compounds with load balancers.
CHAPTER 5

COMPOSITING
Compositing collects and combines partial results from multiple resources during scalable rendering onto one or more destination channels. While significant characteristics of the decomposition step, discussed in the previous chapter, are dependent on the application rendering code, compositing is largely a generic problem and can be implemented and optimised in a parallel rendering framework. Consequently, this area of parallel rendering research has received significant attention. By integrating many state-of-the-art optimisations into our parallel rendering framework, we provide a generic solution that scales well on modern visualisation cluster architectures.

We present new insight into the behaviour of known sort-last parallel compositing algorithms on mid-size visualisation clusters (compared to high-end HPC systems), the importance of streaming sort-last compositing and spatial sort-last polygonal rendering, the impact of state-of-the-art optimisations such as region of interest and asynchronous compositing, as well as image compression algorithms for high-speed interconnects. This chapter addresses the research question of which new algorithms will decrease the time needed to composite rendering results, in particular for sort-last rendering.
Parallel compositing leverages multiple compute resources, memory bandwidth and network bandwidth within a visualisation cluster to accelerate the compositing step in parallel rendering. During sort-last and subpixel decompositions, each rendering resource produces an output which needs to be combined with the results of other resources on a per-pixel level. This compositing step reduces the amount of information, either through depth-sorting or blending multiple input fragments into a single pixel. This loss of information in the compositing step can be exploited by distributing the work over multiple resources, and then collecting the reduced image tiles, commonly called parallel compositing.
For sort-last rendering, two approaches to combine the partial results exist: z-sorting using the depth buffer, and spatial rendering decomposition with ordered compositing.

The first algorithm uses both the colour and depth buffer, and assigns the final pixel to the colour of the source with the front-most depth buffer value. It requires no spatial ordering of the data during rendering. It does not correctly composite pixels with transparent geometry, since there is no guarantee of the blending order. Owing to the use of the depth buffer, it is also more expensive, since both colour and depth data need to be processed. Furthermore, depth buffer readback tends to be slower, and compression algorithms for depth buffer data do not perform as well as colour buffer compression. Depth-sorted compositing is often used for polygonal data, as shown in Figure 5.1(a), since these applications often do not sort their geometry into convex spatial regions.
Figure 5.1: Depth-Sorted and Back-to-Front Sort-Last Compositing
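Per pixel, the depth-sorted algorithm reduces to keeping the front-most fragment across all partial frames; a minimal sketch over (colour, depth) buffer pairs:

```python
def z_composite(frames):
    """Depth-sorted compositing of partial frames, each given as a pair
    of equally sized (colours, depths) lists: per pixel, keep the
    colour of the source with the smallest (front-most) depth value."""
    n = len(frames[0][0])
    return [min(((cols[i], deps[i]) for cols, deps in frames),
                key=lambda cd: cd[1])[0]
            for i in range(n)]
```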
Spatial compositing is often used for direct volume rendering and requires the application to render convex regions of data on each source, and then depth-sort the partial images produced by each source. The partial images are composited in order, typically with alpha-blending. Since the sorting happens at the image level, rather than the fragment level as in the first algorithm, it can operate using only the colour buffer, as shown in Figure 5.1(b). This algorithm can produce correct transparency, since the convex regions allow ordered blending.

Spatial compositing provides better performance, and better scalability when used with other optimisations such as region of interest and load balancing, due to the compact regions produced by the spatial sorting. While it is typically used for volume rendering, we have applied spatial sort-last rendering and compositing to polygonal data, by sorting the data spatially and using clipping planes to generate perfectly convex rendering subsets.
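A sketch of ordered compositing using the standard "over" operator on premultiplied RGBA values, blending depth-sorted partial images front to back:

```python
def over(front, back):
    """'Over' operator for premultiplied RGBA tuples with values in [0,1]."""
    k = 1.0 - front[3]
    return tuple(f + k * b for f, b in zip(front, back))

def ordered_composite(images_front_to_back):
    """Alpha-blend depth-sorted partial images (here a single pixel per
    image, for brevity); the convex regions rendered by each source
    guarantee a correct blending order, so no depth buffer is needed."""
    result = (0.0, 0.0, 0.0, 0.0)
    for image in images_front_to_back:
        result = over(result, image)
    return result
```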
Benchmark 5.a: Spatial versus Depth-Sorted Sort-Last Rendering (frames per second, full cortical column)
Benchmark 5.a shows the difference between spatial and depth-sorted sort-last rendering in RTNeuron (Section 8.3). Due to complexities in the application data model and the disadvantageous geometrical structure of neurons, the spatial rendering in RTNeuron scales less than the round-robin allocation used for the depth-sorted mode. Still, owing to the significantly reduced compositing load of this mode, both due to a smaller region of interest and no depth buffer transfers, spatial sort-last rendering has a significantly better framerate. The exact experiment setup can be found in [Eilemann et al., 2012].
Contrary to most other implementations, parallel compositing algorithms in Equalizer are not hardcoded, but rather configured explicitly. The transport of pixel data for compositing is expressed through connected output and input frames. Output and input frames are connected by name; they do not need to follow the compound hierarchy, and a single output frame may be consumed by multiple input frames. Output frame parameters configure a subset of the rendering (viewport and framebuffer attachments), and are read back after rendering and assembly. Furthermore, every step of the compositing pipeline is implemented in Equalizer and transparent for the application developer. Some steps may be replaced with application code, for example ordering frames during compositing.

Two commonly used parallel compositing algorithms are direct send and binary swap. Both distribute the compositing task equally over all available resources, then collect the composited tiles on the destination channel.

Direct send, shown in Figure 3.15, uses one assemble operation on each resource to fully composite a single tile. Binary swap, shown in Figure 5.2, exchanges pixels between pairs of nodes using a binary compositing tree which gradually assembles a tile on each resource. Both use a sort-first-like assembly operation to collect the fully assembled tiles on the destination channel. 2-3 swap [Yu et al., 2008] is an extension to binary swap, which overcomes the power-of-two source channel requirement by exchanging compositions between groups of two or three nodes in the compositing tree.
Figure 5.2: Binary Swap Sort-Last Compositing
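The communication pattern of direct send can be sketched as a message schedule: each of n resources owns one final tile and receives that tile from every other resource in a single round:

```python
def direct_send_schedule(n):
    """Direct send: the framebuffer is split into n tiles, resource i
    composites tile i, so every resource sends tile i of its own
    partial rendering to resource i. One round, n*(n-1) messages."""
    return [(src, dst) for src in range(n)
            for dst in range(n) if src != dst]
```

Binary swap instead exchanges half-buffers between pairs over log2(n) rounds, trading fewer, larger messages for more synchronisation points.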
In [Eilemann and Pajarola, 2007] we have shown that on commodity clusters, direct send compositing provides better performance than the binary swap commonly used on HPC systems. While it uses more messages in total, direct send has fewer synchronisation points than binary swap. Moreover, as a result of the early assembly optimisation (Section 5.6), direct send can handle imbalances between nodes better, since a late channel has a smaller penalty on the overall execution time.
As a result of the asynchronous architecture of our framework, streaming sort-last compositing is a viable alternative to more involved parallel compositing algorithms in smaller sort-last configurations. Figure 5.3(a) shows a streaming sort-last compound. The output of one source channel is copied to the next channel in the chain, which then composites it on top of its own rendering, streaming the combined frame on to the next source. At the end of the chain, the destination channel completes the input frame by compositing it with its own rendering.
Figure 5.3: Streaming Sort-Last Compound
Equalizer only synchronises the input to the output frames, therefore this configuration creates a pipeline in which the compositing operations form the "critical path". Each channel has its draw pass delayed by the time taken by all preceding compositing operations, as shown schematically in Figure 5.3(b).
This pipelining emerges naturally due to the synchronisation points introduced by the compositing configuration. The total system latency for sort-last stream compounds is t_draw + (n − 1) × (t_readback + t_assemble). Note that the readback and assemble times are usually an order of magnitude smaller than the render time, which makes this compound attractive for small-to-medium sized decompositions, since it has minimal compositing overhead and less synchronisation compared to parallel compositing algorithms.

During scalable rendering, pixel data has to be copied from the source channel framebuffer to either the destination channel framebuffer, or to an intermediate channel during parallel compositing. The associated distributed image compositing cost is directly dependent on how much data has to be sent over the network, which in turn is related to how much screen space is actively covered. For sort-last rendering every node potentially renders into the entire framebuffer, resulting in a linear increase in the amount of pixels composited for an increasing number of nodes. Depending on the data set and viewpoint, only a subset of the framebuffer shows pixels generated from the data. With an increasing number of nodes, the set of affected pixels typically decreases, leaving blank areas that can be omitted for transmission and compositing.

Equalizer provides an API for the programmer to provide the region of interest (ROI). The ROI is the screen-space 2D bounding box fully enclosing the data rendered by a single resource, which can be easily computed by calculating the screen-space projection of the model's bounding volume. We have extended the core parallel rendering framework to use this application-provided ROI to optimise the load equalizer and tree equalizer, as well as image compositing. Figure 5.3(a) outlines the region of interest of each source. The compositing code uses the ROI to minimise image readback size, and consequently network transmission.
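Computing the ROI from a model's bounding volume can be sketched as follows; `project` is an assumed application callback mapping a 3D point to normalised device coordinates in [-1,1], not Equalizer API:

```python
import math

def screen_space_roi(corners, project, width, height):
    """2D pixel bounding box enclosing the projected corners of a
    bounding volume, clamped to the viewport; returns (x, y, w, h)."""
    xs, ys = [], []
    for corner in corners:
        nx, ny = project(corner)
        xs.append((nx * 0.5 + 0.5) * width)   # NDC -> pixel coordinates
        ys.append((ny * 0.5 + 0.5) * height)
    x0 = max(0, int(math.floor(min(xs))))
    y0 = max(0, int(math.floor(min(ys))))
    x1 = min(width, int(math.ceil(max(xs))))
    y1 = min(height, int(math.ceil(max(ys))))
    return x0, y0, max(0, x1 - x0), max(0, y1 - y0)
```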
The ROI is an output frame parameter, and is transmitted to all input frames together with the pixel data. On the input frame, the compositing code respects this parameter to place the pixel data in the right position. Further, the ROI of the rendering pass is automatically merged with the ROI of the composited frames for readback. The usage of ROI for load balancing is described in Section 6.2.

Applying ROI for sort-first rendering provides a small improvement for the rendering performance, as shown in Benchmark 5.b(a) from [Eilemann et al., 2012]. As the number of resources increases, the ROI becomes more important, since the relative amount of time spent in compositing increases as the rendering load decreases. With ROI enabled we observed performance improvements between 5-20%, reaching 60 Hz when using 33 GPUs. Without ROI, the framerate peaked at less than 50 Hz when using 27 GPUs.
Benchmark 5.b: Region of Interest for Sort-First and Sort-Last Rendering
ROI is crucial for sort-last rendering performance. In our experiments in [Eilemann et al., 2012], we used a polygonal renderer creating relatively compact regions during sort-last decomposition, while still using depth-sorted compositing (cf. Figure 5.3(a)). This is a relatively common use case for sort-last rendering. In this mode, we can observe significant speedups with ROI (up to 4x), as shown in Benchmark 5.b(b). In [Makhinya et al., 2010] this application-provided ROI was extended by an algorithm which automatically computes the ROI by analysing the framebuffer. This algorithm has the advantage of simplifying the application developer's life, and can also conveniently detect holes in the rendered framebuffer.
Asynchronous compositing pipelines pixel transfers with rendering operations by moving them to a separate thread. Compositing in a distributed parallel rendering system is decomposed into readback of the produced pixel data (1), optional compression of this pixel data (2), transmission to the destination node consisting of send (3) and receive (4), optional decompression (5), and composition consisting of upload (6) and assembly (7) in the destination framebuffer.

In a naive implementation, operations 1 to 3 are executed serially on one core, and 4 to 7 on another. In our parallel rendering system, operations 2 to 5 are executed asynchronously to the rendering operations 1, 6 and 7. Furthermore, we use a latency of one frame, which means that two rendering frames are always in execution, allowing the pipelining of these operations, as shown in Benchmark 5.c. We have implemented asynchronous readback using OpenGL pixel buffer objects, further increasing the parallelism by pipelining the rendering and pixel transfers, as shown in Figure 5.4.
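A toy steady-state model (our simplifying assumption, ignoring bus and network contention) illustrates why pipelining pays off: with a latency of one frame, the transfer cost of stages 2-5 hides behind the next frame's rendering:

```python
def frame_time(t_draw, t_transfer, asynchronous):
    """Steady-state time per frame for one source channel: the serial
    scheme pays the transfer on every frame, while the pipelined
    scheme overlaps it with the next frame's rendering."""
    return max(t_draw, t_transfer) if asynchronous else t_draw + t_transfer
```

With a 28 ms draw and 6 ms transfer, the serial scheme needs 34 ms per frame while the pipelined one stays at 28 ms, until the transfer itself becomes the bottleneck.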
Benchmark 5.c: Synchronous Readback and Upload
In the asynchronous case, the rendering thread performs only application-specific rendering operations, since the overhead of starting an asynchronous readback becomes negligible. Equalizer uses a plugin system to implement GPU-CPU transfer modules that are runtime loadable. We extended this plugin API to allow the creation of asynchronous transfer plugins, and implemented such a plugin using OpenGL pixel buffer objects (PBO). At runtime, one rendering thread and one download thread are used for each GPU, as well as one transmit thread per process. The download threads are created lazily when needed.
Benchmark 5.d: Asynchronous Compositing Sort-First Rendering
Asynchronous compositing is, together with region of interest, one of the most influential optimisations for scaling rendering to large cluster sizes. For sort-first rendering, shown in Benchmark 5.d from [Eilemann et al., 2012], pipelining the readback with the rendering yields a performance gain of about 10%. At higher frame rates, when the rendering time of a single resource decreases, asynchronous readback has an even higher impact of over 25%.
The image compositing stages in distributed rendering are fundamentally limited by the GPU-to-node and node-to-node image data throughput. Efficient image coding, compression and transmission must be considered to minimise this bottleneck.
Figure 5.4: Asynchronous Readback and Upload
Basic run-length encoding (RLE) has been used as a fast algorithm to improve network throughput for interactive image transmission. However, it only gives sufficient results in specific rendering contexts and fails to provide a general improvement, as shown in [Makhinya et al., 2010]. RLE only works well in compacting large empty or uniform colour areas, but is often useless for non-trivial full-frame colour results. We developed two enhancements to improve RLE: per-component RLE compression and reordering of colour bits. These preconditioning steps exploit typical characteristics of image data for run-length encoding.

Equalizer also integrates more complex compression algorithms such as libjpeg-turbo, which are of little practical use on modern cluster interconnects. Their compression overhead is often too high to be amortised by the decreased network transmission time on 10 GBit/s or faster interconnects. For remote image streaming, as discussed in Section 3.5, they remain a viable compression algorithm. Based on our work, [Makhinya et al., 2010] implemented GPU-based YUV subsampling before the image download, which has negligible overhead, reasonable compression artefacts, and a good compression ratio.
Run-length encoding (RLE) is a simple compression scheme and is, on modern architectures, purely constrained by the available memory bandwidth. For image compression in visualisation applications, we can exploit some characteristics of the data to improve the compression ratio over standard RLE compression. Our basic RLE implementation is a fast 64-bit version comparing two pixels at a time (8 bit per channel RGBA format). This choice is motivated by the fact that modern processors have 64-bit registers, thus using 64-bit tokens optimises throughput. While this method is very fast, it shows poor compression results in most practical settings, since it can only compress adjacent pixels of the same colour. We have observed a compression rate of up to 10% in practical scenarios.
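The principle can be sketched as token-based RLE; for clarity this sketch works on whole pixel values rather than packed 64-bit words:

```python
def rle_encode(pixels):
    """Run-length encode a flat pixel sequence into (value, count)
    pairs; only runs of identical pixels compress, which is why plain
    pixel-level RLE performs poorly on non-uniform image content."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Inverse of rle_encode."""
    return [value for value, count in runs for _ in range(count)]
```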
Figure 5.5: 64 bit and Per-Component RLE Compression
The first improvement is to treat each colour component separately by producing four independent RLE-compressed output streams, as illustrated in Figure 5.5. This per-component RLE improves the compression rate from 10% to about 25%, as individual colour components change less often than full pixels.

The second improvement is bit-swizzling of colour values before per-component compression. This swizzling step is a data preconditioner, which reorders and interleaves the per-component bits as shown in Figure 5.6 by grouping them by significance. Now the per-component RLE compression separately compresses the higher, medium and lower order bits in separate streams, thus achieving stronger compression for smoothly changing colour values, since high-order bits change less often.
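The swizzling preconditioner can be sketched as a pure bit reordering; the exact bit layout below is illustrative (Figure 5.6 shows the actual grouping used):

```python
def swizzle(r, g, b, a):
    """Interleave the bits of four 8-bit components by significance:
    bit 7 of R, G, B, A first, then bit 6, and so on. High-order bits
    of smoothly changing colours end up clustered together, lengthening
    the runs seen by the subsequent per-component RLE pass."""
    out = 0
    for bit in range(7, -1, -1):
        for component in (r, g, b, a):
            out = (out << 1) | ((component >> bit) & 1)
    return out
```

Only shift and mask operations are needed, so the step runs entirely in registers, matching the observation that the whole algorithm stays memory bound.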
Figure 5.6: Swizzling Preconditioning of 32-bit RGBA Values
Swizzling improves the compression rate to up to 40% for the same scenario as above. The preconditioning step only requires bit shift and mask operations, is entirely executed in registers, and has no measurable impact on performance, since the whole algorithm is memory bound on modern CPUs.

All RLE compressors perform a data decomposition on the input image, and parallelise the compression of the resulting sub-images across multiple threads. This parallel execution improves the performance by saturating multiple memory channels compared to a single-threaded implementation.

Benchmark 5.e summarises the compression results from [Makhinya et al., 2010]. We have chosen sort-first rendering, since this highlights the results for the
RLE compressor, which is optimised for colour image data. This benchmark was run on a visualisation cluster with Gigabit Ethernet at a resolution of 1280×1024 pixels. It rendered the David statue at 1 mm resolution, resulting in a rendering time of about 28 ms on a single GPU. The theoretical maximum line shows the upper limit for sort-first compositing with uncompressed image data and no rendering time. It decreases as the destination channel contributes to the sort-first rendering and does not require a pixel transfer; with an increasing number of remote source channels, the destination channel's own share decreases, requiring more pixels to be transferred.
Benchmark 5.e: Image Compression in Sort-First Polygonal Rendering
The graph shows how various incremental improvements add up to significant performance gains. Even in a relatively difficult scenario with a fast rendering time and, by modern standards, a slow network interconnect, we were able to more than double the performance to above 60 Hz.

The basic 8-byte RLE compressor performs just minimally better than no compression. Both stay relatively close to the theoretical maximum, but cannot quite reach it due to load imbalances and non-zero rendering time. The swizzling preconditioner can significantly reduce the compositing time, and even improve the overall framerate.

The YUV compressor is an on-GPU compression plugin which performs a colour space conversion and lossy chroma subsampling. It can be combined with the CPU-based RLE compressor, which then interleaves and compresses the Y, U and V channels, resulting in major performance improvements. Both compression steps have virtually no computational overhead and are memory bandwidth bound. Since the YUV compressor runs on the GPU, it reduces the costly GPU-to-CPU transfer time over the PCI Express link.
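The YUV path can be sketched on the CPU as follows; the BT.601-style coefficients and 4:2:2-style layout are our assumptions for illustration, the real plugin runs the conversion as a GPU shader:

```python
def rgb_to_yuv(r, g, b):
    """BT.601-style RGB -> YUV conversion for 8-bit values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b + 128.0
    v = 0.615 * r - 0.515 * g - 0.100 * b + 128.0
    return y, u, v

def subsample_row(rgb_row):
    """Full-resolution luma, chroma averaged over horizontal pixel
    pairs (assumes an even row width): a lossy 4:2:2-style layout
    that drops a third of the YUV data before network transfer."""
    yuv = [rgb_to_yuv(*p) for p in rgb_row]
    luma = [p[0] for p in yuv]
    u = [(yuv[i][1] + yuv[i + 1][1]) / 2 for i in range(0, len(yuv), 2)]
    v = [(yuv[i][2] + yuv[i + 1][2]) / 2 for i in range(0, len(yuv), 2)]
    return luma, u, v
```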
Equalizer uses runtime-loadable plugins to transfer pixel data from and to the GPU, as well as plugins to compress and decompress pixel data for network compression. This separation allows different code paths for multi-GPU machines, where no CPU-based compression is used, and for distributed execution, where data is compressed before network transfer.

The GPU transfer might also apply compression. This is typically done on the GPU to reduce the amount of memory transferred over the GPU-CPU interconnect. One example is YUV subsampling, where a shader implements the RGB to YUV colour space conversion and subsequent chroma subsampling. Furthermore, a GPU transfer plugin may implement asynchronous downloads, where the download is started from the render thread and finished in a separate download thread, as shown in Figure 5.4. CPU compression plugins are always executed from asynchronous threads, concurrently to the rendering threads.

The implementation of these steps in plugins provides a clean separation and interface for users and researchers interested in experimenting with image compression for interactive parallel rendering.
Early assembly provides better pipelining when the frame assembly order is not important, for example for sort-first rendering and sort-last rendering with z-compositing. Our default compositing code uses a signal on all input frames, which is triggered on each input frame arrival. The compositing code then picks and composites this image, assembling images early and out of order as they become available. This decreases the time to solution, since the assemble operation finishes once the last frame arrives, plus the time to assemble this last frame. In-order assembly would, statistically, leave several of the n frames still to assemble after the last frame arrives (unless other constraints make the arrival not fully random or network-constrained).
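The effect of assembly order on the time to solution can be sketched with a small simulation; the arrival times and the fixed per-frame assemble cost are hypothetical inputs:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of why out-of-order assembly finishes earlier: given the arrival
// time of each input frame and a fixed per-frame assemble cost, compute when
// compositing finishes. Out of order, every frame that arrived while we were
// busy is assembled as soon as possible, so at most one assemble remains
// after the last arrival. In order, a late early-numbered frame blocks all
// later ones.
double finishOutOfOrder(std::vector<double> arrival, double cost)
{
    std::sort(arrival.begin(), arrival.end()); // process in arrival order
    double t = 0;
    for (double a : arrival)
        t = std::max(t, a) + cost; // assemble each frame once it is ready
    return t;
}

double finishInOrder(const std::vector<double>& arrival, double cost)
{
    double t = 0;
    for (double a : arrival) // must assemble in frame order 0, 1, 2, ...
        t = std::max(t, a) + cost;
    return t;
}
```

With arrivals {5, 1, 2, 3} and unit assemble cost, out-of-order assembly finishes at 6 while in-order assembly finishes at 9: frames 1-3 queue up behind the late frame 0.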
CHAPTER 6

LOAD BALANCING
Load balancing performs resource assignment per source channel based on workload, with the goal of equalising resource utilisation. Static load balancing, shown in Figure 6.1 top, performs this assignment once during initialisation. Dynamic load balancing can either be reactive or predictive (middle and bottom of Figure 6.1). Reactive load balancing utilises statistics from previous frames to estimate the future load distribution. Predictive load balancing uses an application-provided load estimate (also called cost function) to predict the load distribution for the current frame. Both approaches reassign resources dynamically, typically for each rendered frame. Implicitly load balanced algorithms achieve a good load balance by other means, for example by work stealing between resources.

Figure 6.1: Static, Reactive and Predictive Load Balancing

In our framework load balancing is implemented by Equalizers, which are an addition to compound trees. They modify parameters of their respective compound subtree at runtime to dynamically optimise the resource usage, by tuning one aspect of the decomposition or recomposition. Due to their nature, they are transparent to application developers, but might have application-accessible parameters to tune their behaviour. Resource equalisation is the critical component for scalable rendering, and therefore the eponym for the Equalizer project name.

In this section we present various equalizer implementations: two variants of reactive load balancing for sort-first and sort-last rendering, implicitly load-balanced work packages for sort-last and sort-first rendering, cross-segment load balancing for multi-display installations, constant frame rate rendering using dynamic frame resolution, and monitoring of large-scale visualisation systems. These equalizers address the research question of how we can improve load balancing for sort-first rendering, in particular for large display systems.
Sort-first (Figure 6.2) and sort-last load balancing are the most obvious optimisations for these parallel rendering modes. Our load equalizers are fully transparent for application developers; they use a reactive approach based on past rendering times. This requires good frame-to-frame coherence for optimal results, which is the case for most rendering applications. Equalizer implements two different algorithms, a load equalizer and a tree equalizer, which have proven advantageous for different types of rendering load.

Both equalizers extract their load metrics from statistics collected by the rendering clients, which are sent asynchronously from the clients to the server, where the equalizers subscribe to them for operation. At the beginning of each frame, the server triggers all equalizers on all compound trees, which enables them to set new decomposition parameters before the rendering tasks are computed.
Figure 6.2: Load Balancing
The load equalizer builds a model of the rendering load in screen space or data space. It stores a 2D (for sort-first) or 1D (for sort-last) grid of the load, mapping the load of each channel. The load is stored in normalised 2D/1D coordinates using time per area as its measure. The contributing source channels are organised in a binary kD-tree. The algorithm then balances the two branches of each level by equalising the integral over the cost area map on each side. This algorithm is similar to [Abraham et al., 2004], which uses a dual-level tree. Our binary split tree provides more compact tiles for larger cluster configurations, since the split direction alternates on each level.
Figure 6.3: Load Cost Area Map with (top) and without (bottom) using Region of Interest Information
The load balancer has to operate on the assumption that the load is uniform within one load grid tile. Naturally this leads to estimation errors, since in reality the load is not uniformly distributed; in Figure 6.3 it tends to increase towards the centre of the screen. We reuse the Region of Interest (ROI) from compositing of each source channel to automatically refine the load grid, as shown in Figure 6.3, top left. In cases where the rendered data projects only to a limited screen area, this ROI refinement provides the load balancer with a more accurate load estimation, leading to a better load prediction during the balancing step.

The tree equalizer uses the same binary kD-tree structure as the load equalizer for recursive load balancing. It computes the accumulated render time of all children for each node of the tree and uses the result to allocate an equal render time to each subtree. It makes no assumption about the load distribution in 2D or 1D space; it only tries to correct the imbalance in render time.
Both equalizers implement tunable parameters allowing application developers to optimise the load balancing based on the characteristics of their rendering algorithm. These parameters are accessible through an API from the application main thread:
Split Mode configures the tile layout: horizontal stripes, vertical stripes, or 2D, a binary tree split alternating the split axis on each level, resulting in compact 2D tiles.
Damping reduces frame-to-frame oscillations. The load distribution within the region of interest, assumed to be uniform by the load balancers, is in reality not uniform, causing the load balancing to overshoot. Damping is a normalised scalar defining how much of the computed delta from the previous position is applied to the new split.
Resistance eliminates small deltas in the load balancing step, i.e., it only changes the viewport or range if the change is over the configured limit. This can help the application to cache visibility computations, since the frustum does not change with each frame.
Boundaries define the modulo factor in pixels onto which a load split may fall. Some rendering algorithms produce artefacts related to the OpenGL raster position, e.g., screen door transparency, which can be eliminated by aligning the boundary to the pixel repetition. Furthermore, some rendering algorithms are sensitive to cache alignments, which can again be exploited by choosing the corresponding boundary.
Usage is a per-child normalised resource utilisation coefficient. The equalizer will assign proportional work to this resource, and deactivate it if the usage is 0. This parameter is primarily used by the cross-segment load balancer to reassign resources between destination channels. It can also be used to configure heterogeneous GPU resources more efficiently.
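How damping, resistance and boundaries interact on a single split update can be sketched as follows; this is illustrative, not the actual equalizer code:

```cpp
#include <cassert>

// Sketch of how the tuning parameters shape one split update: the balancer
// proposes a new split position in pixels, damping scales the applied delta,
// resistance suppresses changes below a threshold, and the boundary snaps
// the result to a pixel modulo.
int applySplit(int current, int proposed, double damping, int resistance,
               int boundary)
{
    int delta = static_cast<int>((proposed - current) * damping);
    if (delta > -resistance && delta < resistance)
        return current; // below resistance: keep the frustum stable
    int pos = current + delta;
    return (pos / boundary) * boundary; // align to the boundary modulo
}
```

With damping 0.5, resistance 4 and boundary 8, a proposed move from 100 to 200 is damped to 150 and snapped to 144, while a proposed move to 104 is suppressed entirely.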
Load balancing can be classified into explicit and implicit approaches. Explicit methods centrally compute a task decomposition up-front, before a new frame is rendered, while implicit methods decompose the workload into task units that can be dynamically assigned to the resources during rendering, based on the work progress of the individual resources. Explicit load balancing typically assigns a single task to each resource to minimise static per-task cost. The aforementioned load and tree equalizers implement explicit reactive load balancing.

Implicit load balancing uses a much finer granularity, assigning significantly more rendering tasks to the resources. These tasks are assigned using central task distribution, or task stealing between resources. Implicit algorithms are more commonly used in offline raytracing than in real-time rasterisation, because of the practically non-existent per-tile cost in raytracing. Since each rendered task directly sends its result for compositing, work packages exhibit a better pipelining of rendering and compositing operations.

Our work package implementation uses a task pulling mechanism, an approach that has been employed before in distributed computing. Rather than having the server push tasks to the rendering clients, our dynamic work packages approach manages fine-grained tasks on a server-side queue, while the clients request and execute the tasks as they become available. Every rendering client employs a small local, prefetched queue of work packages to hide the round-trip latency of fetching new packages. During rendering, a client first works on packages from its local queue and concurrently requests packages from the server whenever the number of available packages sinks below a threshold.
In [Steiner et al., 2016] this basic, random task assignment was extended with client-affinity models.

At the beginning of each frame, the server generates tiles or database ranges of a configurable size for each compound with an output queue. Compounds with an input queue matching the name of the output queue generate a special draw task, which causes the render client to set up its input queue, and to call frameDraw and frameReadback for each received work package.
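The pull-based protocol can be sketched as a single-threaded simulation; the Server and Client types are illustrative stand-ins for the server-side queue and render client:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <queue>
#include <vector>

// Sketch of the dynamic work package protocol: the server keeps a central
// queue of fine-grained tasks; each client holds a small prefetched local
// queue and requests a new batch whenever it drops below a threshold,
// hiding the request round-trip latency behind useful work.
struct Server
{
    std::queue<int> tasks; // tile or range identifiers

    std::vector<int> fetch(size_t n) // hand out up to n tasks
    {
        std::vector<int> batch;
        while (n-- && !tasks.empty())
        {
            batch.push_back(tasks.front());
            tasks.pop();
        }
        return batch;
    }
};

struct Client
{
    std::deque<int> local;
    size_t threshold = 2, batchSize = 4;
    int done = 0;

    void render(Server& server)
    {
        while (true)
        {
            if (local.size() < threshold) // prefetch before running dry
                for (int t : server.fetch(batchSize))
                    local.push_back(t);
            if (local.empty())
                return; // no tasks left anywhere
            local.pop_front(); // execute one work package
            ++done;
        }
    }
};
```

In the real system the fetch is an asynchronous network request issued concurrently with rendering; the simulation only shows the queue discipline.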
Benchmark 6.a shows the performance of static and dynamic sort-first and sort-last rendering. The experiment setup is described in [Eilemann et al., 2018]. The results show that, as expected, load balancing improves the performance over a static task decomposition significantly. The simpler tree equalizer outperforms the load grid-driven load equalizer in most cases, except for sort-first volume rendering, where the load in the region of interest is relatively uniform. This result is counter-intuitive, since the load equalizer operates with more information and should be able to produce better results. It seems to confirm that simple algorithms often outperform theoretically better, but more complex implementations. The decoupling of the load balancing algorithm from the rest of the system, enabled by the compound architecture, opens the possibility for more research in this area.

Benchmark 6.b provides a more detailed analysis of the different equalizers. Using volume rendering, we measure the performance of decomposition modes under heterogeneous load, which was achieved by varying the number of volume samples used for each fragment (1-7) while rendering. This allowed for a consistent linear scaling of rendering load, which was randomly varied either per frame, or per node. The linear scaling of load per node corresponds to a scaling of resources. Doubling the rendering load on a specific node corresponds to halving its available rendering resources. To the system this node would then contribute the value 0.5 in terms of normalised compute resources.
Benchmark 6.a: Sort-First and Sort-Last Scalability
Benchmark 6.b: Sort-First and Sort-Last Equalizer Behaviour
Benchmark 6.b (left) models how individual modes perform on heterogeneous systems. In this case the tree equalizer performs best, as it allows us to a priori define how much usage it should make of individual nodes, i.e., bias the allocation of rendering time in accordance with the (simulated) compute resources. Benchmark 6.b (right), on the other hand, illustrates how the decomposition modes perform on a system where compute resources fluctuate randomly every frame, as can be the case for shared rendering nodes in virtualised environments. For this scenario the tile equalizer seems best suited, as it load balances implicitly and does not assume coherence of available resources between frames. The simpler tree equalizer outperforms the load equalizer in this experiment.
The tile equalizer often outperforms the tree equalizer. This suggests that the underlying implicit load balancing of task queues can be superior to the explicit methods of the load equalizer and tree equalizer in high load situations, where the additional overhead of tile generation and distribution is more justified. The relatively simple nature of our rendering algorithms also favours work packages, since they have a near-zero static overhead per rendering pass. [Steiner et al., 2016] contains additional experiments.
A serious challenge for all distributed rendering systems driving large multi-display installations is dealing with the varying rendering load per display, and therefore the graphics load on its driving GPUs. Cross-segment load balancing (CSLB) is a novel dynamic load balancing approach to dynamically allocate n rendering resources to m output channels (with n ≥ m), as shown in Figure 6.4. (Image Copyright Realtime Technology AG, 2008)
Figure 6.4: Six GPU to two Display Cross-Segment Load Balancing
The m output channels each drive a display or projector of a multi-display system. Commonly, each destination channel is solely responsible for the rendering and potentially compositing of its corresponding display segment. A key element of CSLB is that the m GPUs physically driving the m display segments are not restricted to a one-to-one mapping of rendering tasks to the corresponding display segment. CSLB performs a dynamic assignment of n graphics resources from a pool to drive m different destination display segments, where the m destination channel GPUs themselves may also be part of the pool of graphics resources. CSLB also does not require a planar display surface, that is, the algorithm works equally for tiled display walls and immersive installations.

Dynamic resource assignment is performed through load balancing components that exploit statistical data from previous frames to decide the optimal GPU usage for each segment, as well as the optimal distribution of work among them. The algorithm is also compatible with predictive load balancing based on a load estimation given by the application.

CSLB is implemented in Equalizer as two layers of hierarchically organised components specified in the configuration. The first level globally assigns fractions of resources to each destination channel, and the second level consists of load equalizers or tree equalizers balancing the assigned resources for each destination segment.
Figure 6.5: Dual-GPU, Dual-Display Cross-Segment Load Balancing Setup. (a) CSLB resource setup: a view_equalizer assigns usage to the sources of each destination channel; Channel 1 uses Source 1 (usage 1.0) and Source 2 (usage 0.2), Channel 2 uses Source 1 (usage 0.0) and Source 2 (usage 0.8). (b) CSLB configuration file format:

    compound
    {
        view_equalizer {}
        compound
        {
            channel "Channel1"
            load_equalizer {}
            compound {}
            compound
            {
                channel "Source2"
                outputframe {}
            }
            inputframe {}
            ...
        }
        compound
        {
            channel "Channel2"
            load_equalizer {}
            compound {}
            compound
            {
                channel "Source1"
                outputframe {}
            }
            inputframe {}
        }
        ...
    }
Figure 6.5 depicts a snapshot of the simplest CSLB setup along with its configuration file. Two destination channels, Channel1 and Channel2, each connected to a projector, create the final output for a multi-projector view. Each projector is driven by a distinct GPU, constituting the source channels Source1 and Source2. Each source channel GPU can contribute to the rendering of the other destination channel segment. For each destination channel, a set of potential resources is allocated. A top-level view equalizer assigns the usage to each resource, based on which the per-segment load equalizers compute the 2D split to balance the assigned resources within the display. The left segment of the display has a higher workload, so both Source1 and Source2 are used to render for Channel1, whereas Channel2 uses only Source2 to render the image for the right segment. The schematic also shows the current usage of the four potential source compounds, where only three have an active draw pass at this point in time.

CSLB uses a two-stage approach, where a view equalizer at the top level of the compound hierarchy handles the resource assignment. Each child of this root compound has one destination (segment) channel, corresponding to one of the m display segments, using a load equalizer or tree equalizer. The view equalizer supervises the different destination channels of a multi-display setup; the load equalizers on the other hand are responsible for the partitioning of the rendering task of each segment among its child compounds. They use the precomputed usage of each child to allocate a corresponding amount of work to the child. Therefore, each destination channel of a display segment has its source channel leaf nodes sharing the actual rendering load. One physical GPU assigned to a source channel can be referenced in multiple leaf nodes, and thus contribute to different displays.

For performance reasons the view equalizer assigns each resource to at most two rendering tasks, e.g., to update itself and to contribute to another display. Furthermore, it gives priority to the source compound using the same channel as the output channel of each segment, to minimise pixel transfers.

Cross-segment load balancing allows for optimal resource usage of the multiple GPUs driving the display segments themselves, as well as any additional source GPUs for rendering. It combines multi-display parallel rendering with scalable rendering for optimal performance.

[Erol et al., 2011] provides experimental results for a six-monitor tiled display wall, driven by twelve GPUs. Benchmark 6.c shows an overview of the achievable performance improvements.
The first two configurations use a static assignment of two GPUs to one output channel, where the first one statically assigns one half to each GPU, and the second uses a load balancer to dynamically split the work between the two GPUs. Already this 2D load balancing improves the framerate by almost 50%. The remaining configurations add a view equalizer on top of the per-segment 2D load equalizer. The configuration assigns up to 4, 6, 8, 10 or all GPUs to each segment, that is, any segment may use up to n GPUs, and the GPUs are shared evenly across multiple segments. While theoretically the all-to-all configuration should provide the best performance, mispredictions of the equalizers lead to a sweet spot of GPU sharing between segments. In our 12:6 setup, assigning up to six GPUs per segment almost doubles the performance over the state-of-the-art sort-first load balanced setup.
Benchmark 6.c: Cross-Segment Load Balancing
Benchmark 6.d shows the rendering time over a fixed camera path of 540 frames. In the static case two GPUs are responsible for each of the six outputs of the tiled display wall used. For the CSLB graph, up to eight GPUs were dynamically reassigned each frame to each of the six output channels, depending on the current load distribution. Except for a few camera positions, where the model is positioned evenly over all outputs, CSLB outperforms the fixed assignment.
Benchmark 6.d: Cross-Segment Load Balancing for six Displays and 12 GPUs compared to a static two-to-one six Display Sort-First Rendering
A strength of this algorithm lies in its flexibility. On one hand, it can perform dynamic resource assignment not only for a planar display system, as do some approaches which build a single virtual framebuffer, but also for curved displays and CAVE installations. On the other hand, it allows a flexible assignment of potential contributing GPUs to each output channel individually. Each output may have a different, potentially overlapping, set of GPUs which may contribute to its rendering.
Dynamic Frame Resolution (DFR) (Figure 6.6) provides a functionality similar to dynamic video resizing [Montrym et al., 1997]; specifically, it maintains a constant framerate by adapting the rendering resolution of a fill-limited application.
Figure 6.6: Dynamic Frame Resolution
While the aforementioned work uses a now-obsolete hardware implementation, our implementation works on commodity hardware and is purely implemented in software. DFR works by rendering into a source channel (often an FBO) separate from the destination channel, and then scaling the rendering during the transfer (typically through an on-GPU texture) to the destination channel. The DFR equalizer monitors the rendering performance and accordingly adapts the resolution of the source channel and the zoom factor for the source to destination transfer. If the performance and source channel resolution allow, this will not only subsample, but also supersample the destination channel to reduce aliasing artefacts.

DFR can be combined with other scalability features, e.g., sort-first rendering. It is also notable that it does not need any additional code in the core compound logic; it simply exploits existing functionality such as texture-based compositing frames and frame zoom with dynamic per-frame adjustments.
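The DFR feedback loop can be sketched as follows for a fill-limited renderer; the clamp values are illustrative assumptions, not the actual equalizer constants:

```cpp
#include <cassert>

// Sketch of the dynamic frame resolution feedback loop: for a fill-limited
// renderer the frame time scales roughly with the number of pixels, so
// scaling the source area by targetTime/actualTime steers the frame rate
// towards the target. The upper clamp models the source channel resolution
// limit; values above 1.0 supersample the destination to reduce aliasing.
double adaptScale(double scale, double actualTime, double targetTime,
                  double maxScale)
{
    scale *= targetTime / actualTime; // fill-limited: time ~ pixel count
    if (scale > maxScale)
        scale = maxScale; // limited by the source channel resolution
    if (scale < 0.05)
        scale = 0.05; // keep a minimum legible resolution (assumed floor)
    return scale;
}
```

A frame twice as slow as the target halves the rendered area; a frame twice as fast doubles it, up to the source channel limit.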
The framerate equalizer smooths the output frame rate of a destination channel by instructing the corresponding window to delay its buffer swap, enforcing a minimum time between swaps. This is regularly used for time-multiplexed decompositions, where source channels tend to drift and finish their rendering unevenly distributed over time. This equalizer is however fully independent of DPlex compounds, and may be used to smooth the framerate of irregular rendering algorithms. Due to the artificial sleep time before swap, it may incur a small performance penalty, but it greatly improves the perceived rendering quality for users of DPlex compounds.
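The swap throttle reduces to a simple computation; this sketch assumes time in seconds and leaves the actual sleep and swap call to the window system code:

```cpp
#include <cassert>

// Sketch of the framerate equalizer's swap throttle: before a buffer swap,
// compute how long to sleep so that at least minInterval elapses between
// consecutive swaps, smoothing the output of uneven producers such as
// DPlex source channels. Returns 0 if the frame already took long enough.
double swapDelay(double now, double lastSwap, double minInterval)
{
    double elapsed = now - lastSwap;
    return elapsed >= minInterval ? 0.0 : minInterval - elapsed;
}
```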
The monitor equalizer (Figure 6.7) allows reusing of the rendering from one ormore channels on another channel, typically for monitoring a larger display setupon a control workstation.
Figure 6.7: Monitoring
Output frames on the display channels are connected to input frames on the monitoring channel. The monitor equalizer changes the scaling factor and offset between the output and input, so that the monitor channel has the same, but typically downscaled, view as the originating segments. While this is not strictly a scalable rendering feature, it optimises resource usage by not needlessly rendering the same view multiple times. It reuses the zoom parameter of compositing frames, and adapts it every time one of the channels is resized.
CHAPTER 7

DATA DISTRIBUTION AND SYNCHRONISATION
Most research in parallel rendering does not look into the problem of managing application state in a distributed rendering session. For basic parallel rendering research this problem is trivial to solve, whereas in real-world applications it is often one of the major challenges of using a distributed rendering cluster. Researching and improving the system behaviour of non-trivial applications is critical for meaningful parallel rendering research, and therefore providing a distributed network library is a key component of a parallel rendering system.

For this reason we have spent significant effort in researching, designing and implementing a distributed execution layer used by Equalizer and applications built on Equalizer. The Collage network library is an independent open source project. In the following sections we highlight core features and show how they differ from other distribution mechanisms, e.g., the MPI library.

The Collage network library was conceived with the requirements of a dynamic parallel rendering system in mind. Some of the features implemented by Collage emerged with the growing complexity of Equalizer and its applications, and are often layered on top of the basic primitives. The core requirements are:
Peer-to-peer network:
Whilst the execution model of an Equalizer application follows a master-slave approach, and Equalizer internally uses a client-server model, the core transport layer should be agnostic to these higher-level abstractions. In Collage, each communicating process is equal to all others, and no traffic prioritisation or communication pattern is enforced by a node type. This has proven particularly useful during the implementation of parallel compositing algorithms, where the compositing nodes form an ad-hoc peer-to-peer sub-network.
Dynamic connection management:
As a consequence of the peer-to-peer network, all nodes in a cluster are equivalent. Due to the heterogeneous nature of a parallel rendering application, we furthermore imposed no constraints on the management of connections between nodes. Nodes are identified and addressed by a universally unique identifier. The network layer lazily establishes a connection to any given node by querying its known neighbours or a zeroconf network for connection parameters. Connections may be established concurrently by both sides of a node pair (e.g. during parallel compositing), which requires a robust handshake protocol during connection establishment. For larger cluster installations, a fully connected peer-to-peer network would be suboptimal; for example, on Windows operating systems there is a latency penalty once more than 64 connections are needed, caused by low-level implementation details. This feature also allowed us to implement runtime configuration switches involving a changing set of rendering resources.
Transport layer abstraction:
The actual network protocol is abstracted by an API defining byte-oriented stream semantics. While this choice of abstraction makes it harder for RDMA-based protocols to deliver full performance, it has proven useful in supporting a large set of transports, from standard Ethernet sockets, SDP for InfiniBand, native Verbs for InfiniBand and UDT to a fully-featured reliable multicast implementation. In particular, the ease of integration of multicast transport is strong evidence for the usefulness of this abstraction.
Convenient to use for existing applications:
The history and code structure of visualisation applications is often very different from other distributed applications, such as simulation codes. They have been developed for years for desktop systems, are often single-threaded, and have data models and object hierarchies built for their domain-specific problems and algorithms. The network library needs to provide primitives which match this reality as closely as possible through a modern, object-oriented C++ API.
Our Collage network library provides a peer-to-peer communication infrastructure, offering different abstraction layers which gradually provide higher-level functionality to the programmer. Collage is used by Equalizer to communicate between the application node, the server and the render clients. Many resource entities described in Chapter 3 are distributed Collage objects. Figure 7.1 provides an overview of the major Collage classes and their relationships. The main classes, in ascending abstraction level, are:
Connection:
A stream-oriented point-to-point communication line. Different implementations of a connection exist. A connection transmits a raw byte stream reliably between two endpoints for unicast connections, and between a set of endpoints for multicast connections.
DataOStream:
Abstracts the output of C++ data types onto a set of connections by implementing output stream operators. Uses buffering to aggregate data for network transmission.
OCommand:
Extends DataOStream to implement the protocol between Collagenodes by adding node and command type routing information to the stream.
DataIStream:
Decodes a buffer of received data into C++ objects and PODs by implementing input stream operators. Performs endian swapping if the endianness differs between the originating and local node.
ICommand:
The other side of OCommand, extending DataIStream.
Node and LocalNode:
The abstraction of a process in the cluster. Nodes communicate with each other using connections. A LocalNode listens on various connections and processes requests for a given process. Received data is wrapped in ICommands and dispatched to command handler methods. A Node is a proxy for communicating with a remote LocalNode.
Object:
Provides object-oriented, versioned data distribution of C++ objects between nodes within a session. Objects are registered or mapped on a LocalNode.

A Connection is the basic primitive used for communication between processes in Collage. It provides a stream-oriented communication between two endpoints. A connection is either closed, connected or listening. A closed connection cannot be used for communication. A connected connection can be used to read or write data to the communication peer. A listening connection can accept connection requests leading to new, connected connections.

A ConnectionSet is used to manage multiple connections. The typical use case is to have one or more listening connections for the local process, and a number of connected connections for communicating with other processes.

Figure 7.1: UML class diagram of the major Collage classes

The connection set is used to select one connection requiring some action. This can be a connection request on a listening connection, pending data on a connected connection, or the notification of a disconnect. It is an encapsulation of the poll or WaitForMultipleObjects system calls.

The connection and connection set can be used by applications to implement other network-related functionality, e.g., to communicate with a sound server on a different machine. They do not require a particular wire protocol. A LocalNode has a connection set and uses it to manage connections with other nodes.
Data streams implement serialisation and buffering on top of connections. They use output and input stream operators (<< and >>) with function overloads to provide serialisation for all common data types. The input stream will perform byte swapping if the endianness differs between the sending and receiving node. Applications can easily provide overloads for their own classes for serialisation. All serialised data is assembled in a memory buffer and sent over the connection once the data is complete. An output data stream might send its data to many connections, e.g., when an object update is sent to all subscribed slave nodes.
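The stream-operator approach can be sketched with simplified stand-in classes; the real DataOStream and DataIStream additionally buffer, byte-swap and transmit over connections, which is omitted here:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of the data stream idea: applications overload << and >> for their
// own types, composing them from the built-in overloads. The stand-in
// streams below just serialise into a memory buffer.
struct DataOStream
{
    std::vector<uint8_t> buffer;

    DataOStream& operator<<(uint32_t v)
    {
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
        buffer.insert(buffer.end(), p, p + sizeof(v));
        return *this;
    }
};

struct DataIStream
{
    const std::vector<uint8_t>& buffer;
    size_t pos;

    DataIStream& operator>>(uint32_t& v)
    {
        std::memcpy(&v, buffer.data() + pos, sizeof(v));
        pos += sizeof(v);
        return *this;
    }
};

struct Camera // an application type made serialisable by overloads
{
    uint32_t width = 0, height = 0;
};

DataOStream& operator<<(DataOStream& os, const Camera& c)
{
    return os << c.width << c.height; // compose from built-in overloads
}

DataIStream& operator>>(DataIStream& is, Camera& c)
{
    return is >> c.width >> c.height;
}
```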
Collage sends commands over connections to implement remote procedure calls. A command is identified by its type (typically the C++ class handling it) and a command identifier. These fields are used to implement thread-aware dispatching of received commands to handler functions. Nodes and objects communicate using commands derived from data streams. The basic command dispatch is implemented in the Dispatcher class, from which Node and Object are sub-classed.

The dispatcher allows the registration of commands with a dispatch queue and an invocation method. Each command has a type and command identifier, which are used to identify the receiver, the registered queue and the method. The dispatch pushes the packet to the registered queue. When the commands are dequeued by the processing thread, the registered command method is invoked.

This dispatch and invocation functionality is used within Equalizer to dispatch commands from the receiver thread to the appropriate node or pipe thread, and then to call a specific method when the command is processed by these threads. All Equalizer task methods available to the application are triggered by this mechanism. This dispatch provides object-oriented semantics, since C++ instances can register themselves on the dispatcher and get automatically invoked in the correct thread when an appropriate command arrives.
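The registration and dispatch mechanism can be modelled as follows; this is a simplified stand-in, not the actual Collage classes:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <utility>

// A command is identified by (type, command id); dispatch() looks up
// the handler and either invokes it directly or pushes the command
// onto the registered queue for another thread to process.
struct Command { uint32_t type; uint32_t cmd; uint64_t data; };

using CommandFunc = std::function<void(const Command&)>;
using CommandQueue = std::deque<std::pair<CommandFunc, Command>>;

class Dispatcher
{
public:
    void registerCommand(uint32_t type, uint32_t cmd, CommandFunc func,
                         CommandQueue* queue)
    {
        _table[{type, cmd}] = {std::move(func), queue};
    }

    bool dispatch(const Command& command)
    {
        auto it = _table.find({command.type, command.cmd});
        if (it == _table.end())
            return false;
        if (it->second.second)            // queued: invoked later by the
            it->second.second->push_back( // consuming thread
                {it->second.first, command});
        else
            it->second.first(command);    // direct invocation
        return true;
    }

private:
    std::map<std::pair<uint32_t, uint32_t>,
             std::pair<CommandFunc, CommandQueue*>> _table;
};

// The queue consumer (e.g. a node or pipe thread) drains its queue:
inline void processQueue(CommandQueue& queue)
{
    while (!queue.empty())
    {
        queue.front().first(queue.front().second);
        queue.pop_front();
    }
}
```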
The Node is the abstraction of one process in the peer-to-peer network. Each node has a universally unique identifier. This identifier is used to address nodes, e.g., to query connection information to connect to the node. Nodes use connections to communicate with each other by sending OCommands.

The LocalNode is the specialisation of the node for the given process. It encapsulates the communication logic for connecting remote nodes, as well as object registration and mapping. Local nodes are set up in the listening state during initialisation.

A remote Node can either be connected explicitly by the application, or implicitly due to a connection from a remote node. The explicit connection can be done by programmatically creating a node, adding the necessary ConnectionDescriptions and connecting it to the local node. It may also be done by connecting the remote node to the local node using its NodeID. This will cause Collage to query connection information for this node from the already connected nodes and zeroconf, instantiating the node and connecting it. Both operations may fail. Each Equalizer entity has a LocalNode for communication, and one Node instance for each peer it communicates with.
Zeroconf Discovery
Each LocalNode provides a Zeroconf communicator, which allows node and resource discovery. The zeroconf service "_collage._tcp" is used to announce the presence of a listening LocalNode on the network using the ZeroConf protocol. The node identifier and all listening connection descriptions are announced, which is used to connect unknown nodes by the node identifier alone.
Communication between Nodes
Figure 7.2 shows the communication between two nodes. Each LocalNode has a receiver thread, which uses a connection set to read and dispatch incoming data from the network, and a command thread used for higher-level functions such as object mapping. When the remote node sends a command, the listening node receives the command and dispatches it from the receiver thread. The dispatch will either invoke the bound function immediately, or enqueue the command into the given queue. The queue consumer, for example the main or command thread, will read the command off this queue and then invoke the bound function.

Figure 7.2: Communication between two Nodes
Reliable Stream Protocol

RSP is an implementation of a reliable multicast protocol over unreliable UDP multicast transport. RSP behaves similarly to TCP; in contrast to the underlying UDP transport, it is not message-oriented, but implements byte stream semantics. RSP provides full reliability and ordering of the data, and slow receivers will eventually throttle the sender through a sliding window algorithm. This behaviour is needed to guarantee delivery of data in all situations. Pragmatic generic multicast (PGM [Gemmell et al., 2003]) provides full ordering, but slow clients will disconnect from the multicast session instead of throttling the send rate. Since we use multicast for distributing application data to all rendering clients, we want semantics similar to TCP: waiting for a client to read data is preferable to losing this client.

RSP combines various established multicast algorithms [Adamson et al., 2004; Gau et al., 2002] in an open source implementation capable of delivering wire speed transmission rates on high-speed LAN interfaces. The following outlines the RSP protocol and implementation, as well as the motivation for the design decisions. Any defaults given below are for Linux or Mac OS X; the Windows UDP stack requires different default values, which can be found in the implementation.

Our RSP implementation uses a separate protocol thread for each RSP group, which handles all reads and writes on the multicast UDP socket. It implements the protocol handling and communicates with the application threads through thread-safe queues. The queues contain datagrams filled with the application byte stream, prefixed by a header of at most eight bytes. Each connection has a configurable number of buffers (1024 by default) of a configurable datagram size (1470 bytes by default), which are either free or in transmission. The header contains two bytes for the datagram type (connection handshake, data, acknowledgement, negative acknowledgement, acknowledgement request), and up to six bytes of datagram-specific information (e.g., for an acknowledgement: two bytes read node identifier, two bytes write node identifier, two bytes sequence number).

Figure 7.3 shows the data flow through the RSP implementation. Each member of the multicast group opens a listening connection, which will send query datagrams to the multicast socket. For each found member, a receiving connection instance is created and, similar to a TCP socket, passed to the application upon accept. Each connection instance has a fixed number (1024 by default) of fixed-size (1470 bytes by default) buffers, each used directly for a UDP datagram. The listening connection uses these buffers for writing data, and each receiving connection uses its buffers for received data. These buffers are continuously cycled through two sets of queues: a blocking, thread-safe queue used on the application side for reading and writing data, and a non-blocking, lock-free and thread-safe queue on the protocol thread for data management.

Figure 7.3: RSP Data Flow

When writing data, the application thread pops empty buffers from its queue (blocking when the data cannot be written fast enough), fills in the data datagram header and copies the application data piece-wise into the datagram. The datagrams are then pushed onto the protocol thread buffer queue. The protocol thread writes the datagrams into the UDP multicast socket, and reads and handles any incoming datagrams. On the receiver side, the protocol receives the data, and pushes it in order to the corresponding application thread queue. Out-of-order datagrams are stored aside and queued in order later. Negative acknowledgements (nacks) are immediately sent for missing datagrams. The writer will repeat nack'd datagrams, recycle fully acknowledged datagrams to the application queue, and ask for missing acknowledgements if needed. When reading data, the application pops full buffers from the corresponding connection queue (blocking when no data is available), copies the data piece-wise out of the datagram into the application buffer, and recycles the cleared buffers onto the protocol thread queue.

Handling a smooth packet flow is critical for performance. RSP uses active flow control to advance the byte stream buffered by the implementation. Each incoming connection actively acknowledges every n (17 by default) packets fully received. The incoming connections offset this acknowledgement by their connection identifier to avoid ack bursts. Any missed datagram is actively nack'd as soon as it is detected. Write connections continuously retransmit packets for nack'd datagrams, and advance their window upon reception of all acks from the group. The writer will explicitly request an ack or nack when it runs out of empty buffers or finishes its write queue. Nack datagrams may contain multiple ranges of missed datagrams, motivated by the observation that UDP implementations often drop multiple contiguous packets.

Congestion control is necessary to optimise bandwidth usage. While TCP uses the well-known additive increase, multiplicative decrease algorithm, we have chosen a more aggressive congestion control algorithm of additive increase and additive decrease. Experimentally this has proven to be more effective: UDP is often rate-limited by switches; packets are discarded regularly, not occasionally. Only slowly backing off the current send rate helps to stay close to this limit. Furthermore, our RSP traffic is limited to the local subnet, making cooperation between multiple data streams less of an issue. Send rate limiting uses a bucket algorithm, where over time the bucket fills with send credits, from which sent datagrams are subtracted. If there are no available credits, the sender sleeps until sufficient credits are available.

In [Eilemann et al., 2018] we provide experimental results showing that our implementation can achieve above 90% wire speed on 10 Gbit/s Ethernet and good scalability with respect to multicast group size, and that it is very effective for concurrently distributing structured and unstructured application data to a large number of rendering clients (see Benchmark 7.c).

Distributed, Versioned Objects

Adapting an existing application for parallel rendering requires the synchronisation of application data across the processes in the parallel rendering setup. Existing parallel rendering frameworks often address this poorly; at best they rely on MPI to distribute data. Real-world, interactive visualisation applications are typically written in C++ and have complex data models and class hierarchies to represent their application state. As outlined in [Eilemann et al., 2009], the parallel rendering code in an Equalizer application only requires access to the data needed for rendering, as all application logic is centralised in the application main thread. We have encountered two main approaches to address this distribution: using a shared filesystem for static data, or using data distribution for static and dynamic data. Distributed objects are not required to build Equalizer applications. While most developers choose to use this abstraction for convenience, we have seen applications using other means for data distribution, e.g., MPI.
Distributed objects in Collage provide powerful, object-oriented data distribution for C++ objects. They facilitate the implementation of data distribution in a cluster environment. Distributed objects are created by subclassing from co::Serializable or co::Object. The application programmer implements serialisation and deserialisation. Distributed objects can be static (immutable) or dynamic. Objects have a universally unique identifier (UUID) as a cluster-wide address. A master-slave model is used to establish mapping and data synchronisation across processes. Typically, the application main loop registers a master instance and communicates the UUID to the render clients, which map their instance to the given identifier. The following object types are available:

Static: The object is neither versioned nor buffered. The instance data is serialised whenever a new slave instance is mapped. No additional data is stored.

Unbuffered: The object is versioned and unbuffered. No data is stored, and no previous versions can be mapped.

Instance: The object is versioned and buffered. The instance and delta data are identical; that is, only instance data is serialised. Previous instance data is saved to be able to map old versions.

Delta: The object is versioned and buffered. The delta data is typically smaller than the instance data. The delta data is transmitted to slave instances for synchronisation. Previous instance and delta data is saved to be able to map and sync old versions.

Instance and delta objects have a memory overhead on the master instance to store past data. The number of old versions retained is configurable per object. For Equalizer applications, this overhead typically occurs on the application node holding the master instances, and is configured based on the configuration's latency. When using unbuffered objects, applications only observe inconsistent state during the initial mapping, when a too recent version is used by a render client. The push-based commit-sync logic eventually brings the object into a consistent state with respect to the rendered frame.

Serialisation is facilitated using output or input streams, which abstract the data transmission and are used like a std::stream. The data streams implement efficient buffering and compression, and automatically select the best connection for data transport. Custom data type serialisers can be implemented by providing the appropriate serialisation functions. No pointers should be directly transmitted through the data streams. For pointers, the corresponding object is typically also a distributed object, and its UUID and version are transmitted in lieu of the pointer.

Dynamic objects are versioned, and on commit the delta data to the previous version is sent, using multicast if available, to all mapped slave instances. The data is queued on the remote node, and is applied when the application calls sync to synchronise the object to a new version. The sync method might block if a version has not yet been committed or is still in transmission. All versioned objects have the following characteristics:

• The master instance of the object generates new versions for all slaves. These versions are continuous. It is possible to commit on slave instances, but special care has to be taken to handle possible conflicts during concurrent commits from multiple slave instances.

• Slave instance versions can only be advanced; that is, sync(version) with a version older than the current version will fail.

• Newly mapped slave instances are mapped to the oldest available version by default, or to the version specified when calling mapObject.

Blocking commits allow limiting the number of outstanding, queued versions on the slave nodes. A token-based protocol will block the commit on the master instance if too many unsynchronised versions exist. This is useful to limit the amount of memory consumed by slave instances, and to prohibit run-away conditions of the master instance.
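The slave-side versioning semantics (queued deltas, forward-only sync) can be modelled in a few lines. This is a toy model for illustration, not the Collage implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

// Toy model of versioned-object semantics: the master generates
// continuous versions on commit; a slave queues the deltas and
// applies them in sync(). Syncing backwards fails.
class VersionedSlave
{
public:
    // Delta pushed by the master on commit; versions are continuous.
    void queueDelta(uint64_t version, const std::string& delta)
    { _queue.push_back({version, delta}); }

    // Advance to the requested version; returns false if the version
    // is older than the current one (slaves can only move forward).
    bool sync(uint64_t version)
    {
        if (version < _version)
            return false;
        while (_version < version && !_queue.empty())
        {
            _state += _queue.front().data; // "apply" the delta
            _version = _queue.front().version;
            _queue.pop_front();
        }
        return _version == version;
    }

    uint64_t getVersion() const { return _version; }
    const std::string& getState() const { return _state; }

private:
    struct Delta { uint64_t version; std::string data; };
    std::deque<Delta> _queue;
    uint64_t _version = 0;
    std::string _state;
};
```

A real sync() would block until the requested version has arrived rather than fail, and the "state" would be reconstructed through the deserialisation methods of the object.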
The Serializable implements a convenient usage pattern for object data distribution which emerged during the deployment of Equalizer in applications. The Serializable data distribution is based on the concept of dirty bits, allowing inheritance with data distribution. Dirty bits are a 64-bit mask tracking the parts of the object to be distributed during the next commit. Setters of the class mark the appropriate dirty bit, and the accumulated bits are used to compute deltas during commit.

For serialisation, the application developer implements serialize or deserialize, which are called with the bit mask specifying which data has to be transmitted or received. During a commit or sync, the current dirty bits are given, whereas during object mapping all dirty bits are passed to the serialisation methods. A commit will clear the dirty mask after serialisation.

The Object API provides sufficient abstraction to implement various optimisations for faster mapping and synchronisation of data: compression, chunking, caching, preloading and multicast.
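The dirty-bit pattern described above can be modelled as follows. CameraData is a hypothetical application class, and the real Serializable writes to data streams rather than to a struct:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of the dirty-bit pattern: setters mark a bit, commit
// serialises only the marked parts and clears the mask.
class CameraData
{
public:
    static constexpr uint64_t DIRTY_POSITION = 1ull << 0;
    static constexpr uint64_t DIRTY_MODEL    = 1ull << 1;
    static constexpr uint64_t DIRTY_ALL      = ~0ull; // initial mapping

    void setPosition(float x) { _position = x; _dirty |= DIRTY_POSITION; }
    void setModel(const std::string& m) { _model = m; _dirty |= DIRTY_MODEL; }

    // Stand-in for the serialised delta; carries the mask of parts sent.
    struct Wire { uint64_t bits; float position; std::string model; };

    Wire commit()                       // delta commit: dirty parts only
    {
        Wire wire{_dirty, _position, _model};
        _dirty = 0;                     // commit clears the dirty mask
        return wire;
    }

    void sync(const Wire& wire)         // apply only the parts sent
    {
        if (wire.bits & DIRTY_POSITION) _position = wire.position;
        if (wire.bits & DIRTY_MODEL)    _model = wire.model;
    }

    float getPosition() const { return _position; }
    const std::string& getModel() const { return _model; }
    uint64_t getDirty() const { return _dirty; }

private:
    uint64_t _dirty = 0;
    float _position = 0.f;
    std::string _model;
};
```

Because subclasses simply claim further bits in the mask and chain to the parent's serialize, the pattern composes naturally with inheritance, which is the point made in the text.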
Compression
The most obvious optimisation is compression. Recently many new compression algorithms have been developed, exploiting modern CPU architectures and delivering compression rates well above one gigabyte per second. Collage uses the Pression library [Eyescale Software GmbH and Blue Brain Project, 2016], which provides a unified interface for a number of compression libraries, such as FastLZ [Hidayat, 2007], Snappy [Google, 2016] and ZStandard [Facebook, 2016]. It also contains a custom, virtually zero-cost RLE compressor. Pression parallelises the compression and decompression using data decomposition. The compression is generic and lossless, and available transparently to the application. Applications can also use data-specific compression.

Benchmark 7.a (top left) shows the compression ratio and speed for generic binary data from [Eilemann et al., 2018]. Whilst the structure of the transmitted data varies with each application, this micro-benchmark gives a reasonable estimation of the expected performance. In our context of interactive distributed rendering applications, it is important to use the right tradeoff between spending time and resources for data compression and the gained network transmission time due to data reduction.
Binary (speeds in GB/s):

Compressor   Relative size   Compression   Decompression
RLE               98%           10.0           13.2
Snappy            77%            4.41           7.97
FastLZ            76%            1.96           4.75
LZF               76%            1.67           5.64
ZSTD1             63%            1.50           3.76
ZSTD2             63%            1.15           3.02
ZSTD3             62%            0.719          3.46
ZSTD4             60%            0.590          2.42
ZSTD5             60%            0.456          2.74
ZSTD10            60%            0.220          3.25
ZSTD19            55%            0.047          1.69

Spikes 437x240x512 (speeds in GB/s):

Compressor   Relative size   Compression   Decompression   Compressed size
RLE               72%           6.8692          3.92            37605
Snappy            19%           4.9998         10.974            9748
FastLZ            19%           4.3165          2.9956           9853
LZF               19%           5.007           2.3205           9777
ZSTD1             12%           2.3272          2.4628           6267
ZSTD2             12%           1.6627          2.3992           6304
ZSTD3             12%           1.0059          2.12             6223
ZSTD4             12%           0.69925         2.205            6107
ZSTD5             12%           0.54902         1.7475           6035
ZSTD10            11%           0.20548         5                5719
ZSTD19             9%           0.013166        2.6265           4915

Beechnut 1024x1024x1546 (speeds in GB/s):

Compressor   Relative size   Compression   Decompression   Compressed size
RLE               57%           5.2958         12.997            1764
Snappy            59%          15.628          13.904            1821
FastLZ            56%           2.7606          7.1601           1735
LZF               56%           2.6859          7.6337           1739
ZSTD1             45%           3.5344          7.7903           1382
ZSTD2             45%           1.8205          7.5859           1383
ZSTD3             45%           1.0213          7.6865           1383
ZSTD4             45%           0.70692         6.9144           1381
ZSTD5             45%           0.58168         6.4649           1379
ZSTD10            45%           0.45114         6.3964           1377
ZSTD19            44%           0.14913         2.7041           1351

David 2mm (speeds in GB/s):

Compressor   Relative size   Compression   Decompression
RLE
Snappy            90%           3.267           9.3255
FastLZ            90%           2.0667          4.7394
LZF               89%           1.7327          5.7091
ZSTD1             84%           1.3078          3.4581
ZSTD2             81%           0.90846         3.1057
ZSTD3             81%           0.55187         3.1324
ZSTD4             77%           0.47093         2.4956
ZSTD5             77%           0.3562          2.4732
ZSTD10            77%           0.15825         2.6823
ZSTD19            70%           0.073808        2.0204

Benchmark 7.a: Compression Performance for Binary Data and the Object Data used in Benchmark 7.b
On current CPUs (the benchmark was executed on a 12-core node), modern compression libraries provide performance benefits even on fast interconnects such as 10 Gb/s Ethernet. In particular, the modern Snappy and ZStandard libraries deliver impressive performance.

In [Eilemann et al., 2018] we also evaluated the compression performance for concrete application data. The results are shown in Benchmark 7.a. Polygonal data is difficult to compress with a generic lossless compressor, due to the floating point format used for the vertices. A data-specific compressor aware of the data semantics can provide much better results. Volume data, on the other hand, has shown to be well compressible at interactive speeds. Section 7.4.5 discusses how these compressors accelerate data distribution in Equalizer applications.
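This tradeoff can be estimated with a simple pipeline model: assuming chunked transmission overlaps compression, transfer and decompression, throughput is bound by the slowest stage. The model itself is an illustration of mine; the compressor numbers are taken from the binary-data column of Benchmark 7.a, with a 10 Gb/s link assumed to deliver about 1.25 GB/s:

```cpp
#include <algorithm>
#include <cassert>

// Time to move `gigabytes` of data through a pipelined
// compress -> transmit -> decompress chain. With overlap, the
// slowest stage dominates; without compression, only the link counts.
double pipelinedSeconds(double gigabytes, double ratio,
                        double compressGBs, double linkGBs,
                        double decompressGBs)
{
    const double stages[3] = { gigabytes / compressGBs,
                               gigabytes * ratio / linkGBs,
                               gigabytes * ratio / decompressGBs };
    return *std::max_element(stages, stages + 3);
}
```

With the Benchmark 7.a numbers, Snappy (77% size at 4.41/7.97 GB/s) is link-bound at 0.77 x 0.8 s per GB and beats raw transmission, while ZSTD19 (55% at 0.047 GB/s) is hopelessly compression-bound on this link, matching the text's observation that the right tradeoff depends on compressor speed, not just ratio.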
Chunking
The data streaming interface implements chunking, which pipelines the serialisation code with the network transmission. After a configurable number of bytes has been serialised to the internal buffer, the buffer is transmitted and serialisation continues. This is used both for the initial mapping data and for commit data.
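A self-contained sketch of such a chunked output stream follows; the send callback stands in for the network connection, and the class is an illustration rather than the Collage implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Chunked serialisation: instead of assembling the entire object in
// one buffer, flush to the network whenever the internal buffer
// reaches a threshold, pipelining serialisation with transmission.
class ChunkedOStream
{
public:
    ChunkedOStream(size_t chunkSize,
                   std::function<void(const std::vector<uint8_t>&)> send)
        : _chunkSize(chunkSize), _send(std::move(send)) {}

    void write(const uint8_t* data, size_t size)
    {
        _buffer.insert(_buffer.end(), data, data + size);
        while (_buffer.size() >= _chunkSize)
        {
            std::vector<uint8_t> chunk(_buffer.begin(),
                                       _buffer.begin() + _chunkSize);
            _send(chunk);
            _buffer.erase(_buffer.begin(), _buffer.begin() + _chunkSize);
        }
    }

    void flush() // send the remainder once serialisation is complete
    {
        if (!_buffer.empty())
            _send(_buffer);
        _buffer.clear();
    }

private:
    size_t _chunkSize;
    std::function<void(const std::vector<uint8_t>&)> _send;
    std::vector<uint8_t> _buffer;
};
```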
Caching and Preloading
Caching retains instance data of objects in a client-side cache, and reuses this data to accelerate the mapping of objects. The instance cache is either filled by "snooping" on multicast transmissions or by explicit preloading when master objects are registered. Preloading sends instance data of recently registered master objects to all connected nodes during idle time of the corresponding node. These nodes simply insert the received data into their cache. Preloading uses multicast when available.
Multicast
Due to the master-slave nature of data distribution, multicast is used to optimise the transmission time of data. If the contributing nodes share a multicast session, and more than one slave instance is mapped, Collage automatically uses the multicast connection to send the new version information.
Benchmark 7.b analyses the performance of data distribution and synchronisation in real-world applications. We extracted the data distribution code from our mesh renderer (eqPly) and our volume renderer (Livre) into a benchmark application to measure the time to initially map all the objects on the render client nodes, and to perform a commit and sync of the full data set after mapping has been established. All figures show a noticeable measurement jitter due to other services running on the shared cluster during benchmarking. The details of the benchmark algorithm can be found in [Eilemann et al., 2018].

Benchmark 7.b: Object Mapping and Synchronisation
We used three different data sets, and ran the benchmark on up to eight physical nodes; after eight processes, nodes start to run two processes per node, which share CPU, memory and network interface bandwidth. Object mapping is measured using the following settings: none distributes the raw, uncompressed, and unbuffered data; compression uses the Snappy compressor to compress and distribute unbuffered data; buffered reuses uncompressed, serialised data for mappings from multiple nodes; and compression buffered reuses the compressed buffer for multiple nodes. Unbuffered operations need to reserialise, and potentially recompress, the master object data for each slave node. Each slave instance needs to deserialise and decompress the data, which happens naturally in parallel on the slave nodes.
During data synchronisation, the master commits the object data to all mapped slave instances simultaneously. This is a push operation, whereas the mapping is a slave-triggered pull operation. During commit, the buffers only have to be serialised and compressed once, and can then be sent directly to all mapped slave nodes. Slave nodes queue this data and consume it during synchronisation. In contrast, object mapping needs to wait for each slave node to request the mapping, and then may need to reserialise and compress the object data. We tested the time to commit and sync the data using the compression engines discussed above.

The David statue at 2 mm resolution is organised in a k-d tree for rendering. Each k-d tree node is a separate distributed object with two child node objects. A total of 1023 objects are distributed and synchronised. Due to the limited compressibility of the data, the results are relatively similar. Compressing the data repeatedly for each client leads to decreased performance, since the compression overhead is not amortised by the decreased transmission time. Buffering data slightly improves performance by reducing the CPU and copy overhead. Combining compression and buffering leads to the best performance, although only by about 10%. During synchronisation, data is pushed from the master process to all mapped slaves using a unicast connection to each slave. While the results are relatively close to each other, we can still observe how the tradeoff between compression ratio and speed influences overall performance. Better but slower compression algorithms lead to improved overall performance when amortised over many send operations.

The volume data sets are distributed in a single object, serialising the raw volume buffer. The Spikes volume data set has a significant compression ratio, which is reflected by the results. Compression for this data is beneficial for transmitting data over a 10 Gb/s link, even for a single slave process.
Buffering has little benefit, since the serialisation of volume data is trivial. Buffered compression shows a significant difference, since the compression cost can be amortised over many nodes, reaching raw data transmission rates of 3.7 GB/s with the default Snappy compressor, and at best 4.4 GB/s with ZStandard at level 1. The distribution of the Beechnut data set also behaves as expected: due to the larger object size, uncompressed transmission is slightly faster compared to the Spikes data set, at 700 MB/s, since static overheads are comparatively smaller. Compressed transmission does not improve the mapping performance, likely due to increased memory pressure caused by the data size. The comparison of the various compression engines is consistent with Benchmark 7.a; RLE, Snappy and the LZ variants are very close to each other, and ZSTD1 can provide better performance beyond four nodes due to the better compression ratio.

Benchmark 7.c compares data distribution speed using different network protocols. This benchmark measures the data synchronisation time of the Spikes volume data set. Buffering is enabled, and compression is disabled to focus on the raw network performance. For the benchmark, eight physical nodes are used; after eight processes, two client processes will run on some nodes, sharing CPU and network resources.
Benchmark 7.c: Synchronisation Performance over different Network Protocols

TCP over the faster InfiniBand link outperforms the cheaper Ten Gigabit Ethernet link by more than a factor of two. Unexpectedly, the native RDMA connection performs worse, even though it outperforms IPoIB in a simple peer-to-peer connection benchmark. This needs further investigation, but we suspect that the abstraction of a byte stream connection chosen by Collage is not well suited for remote DMA semantics; one needs to design the network API around zero-copy semantics with managed memory for modern high-speed transports. Both InfiniBand connections show significant measurement jitter.

RSP multicast performs as expected. Collage starts using multicast to commit new object versions when two or more clients are mapped, since the transmission to a single client is faster using unicast. RSP consistently outperforms unicast on the same physical interface and shows good scaling behaviour (2.5 times slower on 16 vs. 2 clients on Ethernet, 1.8 times slower on InfiniBand). The scaling is significantly better when only one process per node is used. The increased transmission time with multiple clients is caused by a higher probability of packet loss, which increases significantly when using more than one process per node and network interface. InfiniBand outperforms Ethernet slightly, but is largely limited by the RSP implementation throughput of preparing and queueing the datagrams to and from the protocol thread, which we observed in profiling.
CHAPTER 8

APPLICATIONS
A key performance indicator for a good design of any framework is the acceptance by developers. A good measure is the adoption by third-party applications. While the evaluation and architecture of applications built with Equalizer is outside the scope of this thesis, we provide a few examples here to illustrate the variety of use cases supported by our framework.
Livre

Figure 8.1: Livre running on a 4x3 Tiled Display Wall

Livre (Large-scale Interactive Volume Rendering Engine) is a GPU ray-casting parallel 4D volume renderer, implementing state-of-the-art view-dependent level-of-detail (LOD) rendering and out-of-core data management [Engel et al., 2006]. Hierarchical and out-of-core LOD data management is supported by an implicit volume octree, accessed asynchronously by the renderer from a data source on a shared file system. Different data sources provide octree-conform access to RAW or compressed files, as well as to on-the-fly generated volume data (e.g., from event simulations or surface meshes).

High-level state information, e.g., camera position and rendering settings, is shared in Livre through Collage objects between the application and rendering threads. Sort-first decomposition is efficiently supported through octree traversal and culling, both for scalability and for driving large-scale tiled display walls.
RTT Deltagen

RTT Deltagen (now Dassault 3D Excite) is a commercial application for interactive, high quality rendering of CAD data. The RTT Scale module, delivering multi-GPU and distributed execution, is based on Equalizer and Collage, and has driven many of the features implemented in Equalizer.

Figure 8.2: RTT Deltagen mixing OpenGL Rendering and Raytracing (for the head light)

RTT Scale uses a master-slave execution mode, where a single Deltagen instance can go into "Scale mode" at any time by launching an Equalizer configuration. Consequently, the internal representation needed for rendering is based on a Collage-based data distribution. The rendering clients are separate, smaller applications which map their scenes during startup. At runtime, any change performed in the main application is committed as a delta at the beginning of the next frame. Multicast is used to keep data distribution times during session launch reasonable for larger cluster sizes (tens to hundreds of nodes).

RTT Scale supports a wide variety of use cases. In virtual reality, the application is used for virtual prototyping and design reviews in front of high-resolution display walls and CAVEs. It is also used for virtual prototyping of human-machine interactions in CAVEs and HMDs. For scalability, sort-first and tile compounds are used to achieve fast, high-quality rendering, primarily for interactive raytracing, both CPU- and GPU-based. For CPU-based raytracing, Linux-based rendering clients are often used with a Windows-based application node.
RTNeuron

Figure 8.3: RTNeuron running in a six-sided CAVE

RTNeuron [Hernando et al., 2013] is a scalable real-time rendering tool for the visualisation of neuronal simulations based on cable models. It uses OpenSceneGraph for data management and Equalizer for parallel rendering. The focus is not only on fast rendering times, but also on fast loading times with no offline preprocessing. It provides level-of-detail (LOD) rendering, high quality anti-aliasing based on jittered frusta, accumulation during still views, and interactive modification of the visual representation of neurons on a per-neuron basis (full neuron vs. soma only, branch pruning depending on the branch level, ...). RTNeuron implements both sort-first and sort-last rendering with order independent transparency.
Figure 8.4: RASTeR running on a 3x2 Tiled Display Wall
RASTeR [Bösch et al., 2009] uses an out-of-core and view-dependent real-time multi-resolution terrain rendering algorithm. For load-balanced parallel rendering [Goswami et al., 2010] it exploits fast hierarchical view-frustum culling of the level-of-detail (LOD) quadtree for sort-first decomposition, and uniform distribution of the visible LOD triangle patches for sort-last decomposition. The latter is enabled by a fast traversal of the patch-based restricted quadtree triangulation hierarchy, which results in a list of selected LOD nodes, constituting a view-dependent cut or front of activated nodes through the LOD hierarchy. Assigning and distributing equally sized segments of this active LOD front to the concurrent rendering threads results in a near-optimal sort-last decomposition for each frame.
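The work-assignment step above amounts to splitting the list of selected LOD patches into contiguous, near-equally sized segments, one per concurrent rendering thread. The following self-contained sketch uses illustrative names, not RASTeR's actual interface:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split the active LOD front (a list of selected patch IDs) into contiguous,
// near-equally sized segments, one per concurrent rendering thread. When the
// front size is not divisible by the thread count, the first 'extra' threads
// receive one additional patch.
std::vector<std::vector<int>> partitionFront(const std::vector<int>& front,
                                             size_t numThreads)
{
    std::vector<std::vector<int>> segments(numThreads);
    const size_t base  = front.size() / numThreads; // minimum patches per thread
    const size_t extra = front.size() % numThreads; // remainder to distribute
    size_t pos = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        const size_t count = base + (t < extra ? 1 : 0);
        segments[t].assign(front.begin() + pos, front.begin() + pos + count);
        pos += count;
    }
    return segments;
}
```

Since the patches on the front are of comparable rendering cost, an equal count per thread approximates an equal workload, which is why this simple contiguous split yields a near-optimal sort-last decomposition per frame.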
Figure 8.5: Bino on a Semi-Cylindrical Multi-Projector Wall
Bino is a stereoscopic 3D video player capable of running on very large display systems. Originally written for the immersive semi-cylindrical projection system at the University of Siegen, its flexibility enabled its use in many installations. Bino decodes video on each rendering thread and only synchronises the time step globally, providing a scalable solution to video playback. Bino uses the 2D information from the segment viewports to lay out the video tiles for each projector.
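The tile layout step can be illustrated as a simple viewport mapping: each projector's display segment, given in normalized wall coordinates, selects the pixel rectangle of the video frame that its rendering thread has to decode and show. The sketch below uses hypothetical types, not Bino's actual code:

```cpp
#include <cassert>

struct Rect      { float x, y, w, h; }; // segment viewport in normalized [0,1] wall coordinates
struct PixelRect { int x, y, w, h; };   // tile of the video frame, in pixels

// Map a display segment's normalized viewport onto the video frame to obtain
// the tile this projector must display. Rounding to the nearest pixel keeps
// adjacent tiles seamless on the wall.
PixelRect videoTileFor(const Rect& segment, int videoWidth, int videoHeight)
{
    return { int(segment.x * videoWidth  + 0.5f),
             int(segment.y * videoHeight + 0.5f),
             int(segment.w * videoWidth  + 0.5f),
             int(segment.h * videoHeight + 0.5f) };
}
```

For example, the right half of a two-projector wall showing a 1920x1080 video would decode and display the tile starting at pixel column 960. Only the playback time step needs global synchronisation; the tile computation itself is purely local to each rendering thread.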
Figure 8.6: An Omegalib Application running in the CAVE2
Omegalib [Febretti et al., 2014] is a software framework built on top of Equalizer that facilitates application development for hybrid reality environments, like the CAVE2. Hybrid reality environments aim to create a seamless 2D/3D environment supporting both information-rich analysis (traditionally done on tiled display walls), as well as virtual reality simulation exploration (traditionally done in VR systems), at a resolution matching human visual acuity. Omegalib supports dynamic reconfigurability of the display environment, so that areas of the display can be interactively allocated to 2D or 3D workspaces as needed. It is possible to have multiple immersive applications running on a cluster-controlled display system, have different input sources dynamically routed to applications, and have rendering results optionally redirected to a distributed compositing manager. Omegalib supports pluggable front-ends to simplify the integration of third-party libraries like OpenGL, OpenSceneGraph, and the Visualisation Toolkit (VTK).
CHAPTER
CONCLUSION
Formalising, designing and implementing a generic parallel rendering framework that can serve both complex applications and research has been no easy task. Based on the analysis of CAVELib and practical experience in implementing and deploying OpenGL Multipipe SDK, we have been in the unique position to make significant contributions in this area. Equalizer, our parallel rendering framework, allowed us to take parallel rendering research to a new level. It enabled us to easily implement new decomposition algorithms, many improvements for result composition, novel load balancing schemes, and numerous whole-system optimisations, all of which are much harder to research without such a framework and associated applications. This is supported not only by the contributions of this thesis, but by other publications and doctoral theses completed using Equalizer. This research has not only been performed in the original research group; Equalizer has also been picked up by other laboratories, e.g., the Electronic Visualization Laboratory at the University of Illinois at Chicago for CAVE2 research.

Beyond the core system design, we have incorporated many new parallel rendering algorithms into our framework. Most notably, cross-segment load balancing provides a novel approach to better assign multiple rendering resources to multi-display systems. Compared to other approaches, it maximises rendering locality for the display GPUs and is not limited to planar displays.
Having a fully-featured rendering framework and real-world applications enabled us to implement many algorithmic improvements and optimisations, and to evaluate them in a holistic and realistic setup. The results of this work advance parallel rendering with new decomposition modes, compositing algorithms, better load balancing and an asynchronous rendering pipeline. Last, but not least, a network library for distributed, interactive visualisation applications greatly facilitates the task of distributing and synchronising application state in a parallel rendering system.

Beyond the scope of this thesis, Equalizer has influenced the field and has been used in various commercial and research applications. These applications span a wide range of domains, from virtual prototyping, interactive raytracing, large-scale volume rendering, terrain rendering and neuroscience applications to next-generation visualisation systems such as collaborative tiled display walls and hybrid 2D/3D setups such as the CAVE2.
Future Work

We consider the core parallel rendering framework largely feature complete, with the exception of keeping up with new technologies, e.g., providing glue code for the Vulkan API or exploiting new multi-GPU extensions. There remains a large amount of work to make parallel rendering more accessible. This may be addressed by simplified APIs layered on top of Equalizer, and through integrations with popular rendering toolkits. Future work should also address operators and users of visualisation systems through simplified configuration, monitoring and administration tools.

There is still significant research to be done in automatically selecting the best decomposition and recomposition algorithm, as well as the resources used for a given application. This task becomes even more challenging when considering changes in the rendering load and algorithm during the runtime of an application. Furthermore, load balancing for the compositing task is a largely unexplored area, in particular in combination with state-of-the-art optimisations.

We foresee an increasing importance for interactive raytracing, which has its own set of challenges for parallel rendering. In particular for large data rendering, there are a number of open questions, like out-of-core parallel raytracing and data-parallel decomposition with global illumination.

Load balancing for better utilisation of available resources, and increased scalability to higher node counts, remains an open area of research. While this thesis provides many new results in this area, a comprehensive benchmark and study of different algorithms and applications would be very valuable, and may lead to the discovery of new load-balancing algorithms.
One of the remaining challenges is to make interactive supercomputing accessible. Significant research has been performed on how to link simulations with visualisation, and how to use this monitoring to interactively steer the simulation. These advances now need to be translated into easily usable software components, integrated well with existing resource management systems.
BIBLIOGRAPHY

[MPK, 2005] (2005). OpenGL Multipipe SDK.

[Abraham et al., 2004] Abraham, F., Celes, W., Cerqueira, R., and Campos, J. L. (2004). A load-balancing strategy for sort-first distributed rendering. In Proceedings SIBGRAPI, pages 292–299.

[Adamson et al., 2004] Adamson, B., Bormann, C., Handley, M., and Macker, J. (2004). Negative-acknowledgment (NACK)-oriented reliable multicast (NORM) protocol. Technical report.

[Agranov and Gotsman, 1995] Agranov, G. and Gotsman, C. (1995). Algorithms for rendering realistic terrain image sequences and their parallel implementation. The Visual Computer, 11(9):455–464.

[Ahrens and Painter, 1998] Ahrens, J. and Painter, J. (1998). Efficient sort-last rendering using compression-based image compositing. In Proceedings Eurographics Workshop on Parallel Graphics and Visualization.

[Allard et al., 2002] Allard, J., Gouranton, V., Lecointre, L., Melin, E., and Raffin, B. (2002). NetJuggler: Running VR Juggler with multiple displays on a commodity component cluster. In Proceedings IEEE Virtual Reality, pages 275–276.

[Bethel et al., 2003] Bethel, W. E., Humphreys, G., Paul, B., and Brederson, J. D. (2003). Sort-first, distributed memory parallel visualization and rendering. In Proceedings IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pages 41–50.

[Bhaniramka et al., 2005] Bhaniramka, P., Robert, P. C. D., and Eilemann, S. (2005). OpenGL Multipipe SDK: A toolkit for scalable parallel rendering. In Proceedings IEEE Visualization, pages 119–126.

[Bierbaum and Cruz-Neira, 2003] Bierbaum, A. and Cruz-Neira, C. (2003). ClusterJuggler: A modular architecture for immersive clustering. In Proceedings Workshop on Commodity Clusters for Virtual Reality, IEEE Virtual Reality Conference.

[Bierbaum et al., 2001] Bierbaum, A., Just, C., Hartling, P., Meinert, K., Baker, A., and Cruz-Neira, C. (2001). VR Juggler: A virtual platform for virtual reality application development. In Proceedings of IEEE Virtual Reality, pages 89–96.

[Blanke et al., 2000] Blanke, W., Bajaj, C., Fussel, D., and Zhang, X. (2000). The metabuffer: A scalable multi-resolution 3-D graphics system using commodity rendering engines. Technical Report TR2000-16, University of Texas at Austin.

[Blue Brain Project, 2016] Blue Brain Project (2016). Tide: Tiled Interactive Display Environment. https://github.com/BlueBrain/Tide.

[Bösch et al., 2009] Bösch, J., Goswami, P., and Pajarola, R. (2009). RASTeR: Simple and efficient terrain rendering on the GPU. In Proceedings EUROGRAPHICS Areas Papers, Scientific Visualization, pages 35–42.

[Cavin and Mion, 2006] Cavin, X. and Mion, C. (2006). Pipelined sort-last rendering: Scalability, performance and beyond. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization.

[Cavin et al., 2005] Cavin, X., Mion, C., and Filbois, A. (2005). COTS cluster-based sort-last rendering: Performance evaluation and pipelined implementation. In Proceedings IEEE Visualization, pages 111–118. Computer Society Press.

[Correa et al., 2002] Correa, W. T., Klosowski, J. T., and Silva, C. T. (2002). Out-of-core sort-first parallel rendering for cluster-based tiled displays. In Proceedings Eurographics Workshop on Parallel Graphics and Visualization, pages 89–96.

[Crockett, 1997] Crockett, T. W. (1997). An introduction to parallel rendering. Parallel Computing, 23:819–843.
IEEE Transactions on Visualization and Computer Graphics, 17(2):320–332.

[Eilemann et al., 2012] Eilemann, S., Bilgili, A., Abdellah, M., Hernando, J., Makhinya, M., Pajarola, R., and Schürmann, F. (2012). Parallel Rendering on Hybrid Multi-GPU Clusters. In EGPGV, pages 109–117.

[Eilemann et al., 2009] Eilemann, S., Makhinya, M., and Pajarola, R. (2009). Equalizer: A Scalable Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics, 15(3):436–452.

[Eilemann and Pajarola, 2007] Eilemann, S. and Pajarola, R. (2007). Direct send compositing for parallel sort-last rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization.

[Eilemann et al., 2018] Eilemann, S., Steiner, D., and Pajarola, R. (2018). Equalizer 2.0 – Convergence of a Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics, pages 1–1.

[Engel et al., 2006] Engel, K., Hadwiger, M., Kniss, J. M., Rezk-Salama, C., and Weiskopf, D. (2006). Real-Time Volume Graphics. AK Peters.

[Erol et al., 2011] Erol, F., Eilemann, S., and Pajarola, R. (2011). Cross-segment load balancing in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 41–50.

[Eyescale Software GmbH and Blue Brain Project, 2016] Eyescale Software GmbH and Blue Brain Project (2016). Compression and data transfer plugins. https://github.com/Eyescale/Pression.

[Eyles et al., 1997] Eyles, J., Molnar, S., Poulton, J., Greer, T., Lastra, A., England, N., and Westover, L. (1997). PixelFlow: The realization. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 57–68.

[Facebook, 2016] Facebook, Inc. (2016). Fast real-time compression algorithm. https://github.com/facebook/zstd.
[Febretti et al., 2014] Febretti, A., Nishimoto, A., Mateevitsi, V., Renambot, L., Johnson, A., and Leigh, J. (2014). Omegalib: A multi-view application framework for hybrid reality display environments. In , pages 9–14.

[Febretti et al., 2013] Febretti, A., Nishimoto, A., Thigpen, T., Talandis, J., Long, L., Pirtle, J., Peterka, T., Verlo, A., Brown, M., Plepys, D., et al. (2013). CAVE2: A hybrid reality environment for immersive simulation and information analysis. In IS&T/SPIE Electronic Imaging, pages 864903–864903. International Society for Optics and Photonics.

[Garcia and Shen, 2002] Garcia, A. and Shen, H.-W. (2002). An interleaved parallel volume renderer with PC-clusters. In Proceedings Eurographics Workshop on Parallel Graphics and Visualization, pages 51–60.

[Gau et al., 2002] Gau, R.-H., Haas, Z. J., and Krishnamachari, B. (2002). On multicast flow control for heterogeneous receivers. IEEE/ACM Transactions on Networking, 10(1):86–101.

[Gemmell et al., 2003] Gemmell, J., Montgomery, T., Speakman, T., and Crowcroft, J. (2003). The PGM reliable multicast protocol. IEEE Network, 17(1):16–22.

[Goswami et al., 2010] Goswami, P., Makhinya, M., Bösch, J., and Pajarola, R. (2010). Scalable parallel out-of-core terrain rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 63–71.

[Hernando et al., 2013] Hernando, J. B., Biddiscombe, J., Bohara, B., Eilemann, S., and Schürmann, F. (2013). Practical Parallel Rendering of Detailed Neuron Simulations. In Proceedings of the 13th Eurographics Symposium on Parallel Graphics and Visualization, EGPGV, pages 49–56. Eurographics Association.

Proceedings Eurographics Workshop on Parallel Graphics and Visualization.

[Humphreys et al., 2000] Humphreys, G., Buck, I., Eldridge, M., and Hanrahan, P. (2000). Distributed rendering for scalable displays. IEEE Supercomputing.

[Humphreys et al., 2001] Humphreys, G., Eldridge, M., Buck, I., Stoll, G., Everett, M., and Hanrahan, P. (2001). WireGL: A scalable graphics system for clusters. In Proceedings Annual Conference on Computer Graphics and Interactive Techniques, pages 129–140.

[Humphreys and Hanrahan, 1999] Humphreys, G. and Hanrahan, P. (1999). A distributed graphics system for large tiled displays. IEEE Visualization 1999, pages 215–224.

[Humphreys et al., 2002] Humphreys, G., Houston, M., Ng, R., Frank, R., Ahern, S., Kirchner, P. D., and Klosowski, J. T. (2002). Chromium: A stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics, 21(3):693–702.

[Igehy et al., 1998] Igehy, H., Stoll, G., and Hanrahan, P. (1998). The design of a parallel graphics interface. Proceedings of SIGGRAPH 98, pages 141–150.

[Johnson et al., 2006] Johnson, A., Leigh, J., Morin, P., and Van Keken, P. (2006). GeoWall: Stereoscopic visualization for geoscience research and education. IEEE Computer Graphics and Applications, 26(6):10–14.

[Johnson et al., 2012] Johnson, G. P., Abram, G. D., Westing, B., Navrátil, P., and Gaither, K. (2012). DisplayCluster: An Interactive Visualization Environment for Tiled Displays. In , pages 239–247.

[Jones et al., 2004] Jones, K., Danzer, C., Byrnes, J., Jacobson, K., Bouchaud, P., Courvoisier, D., Eilemann, S., and Robert, P. (2004). SGI OpenGL Multipipe SDK User's Guide. Technical Report 007-4239-004, Silicon Graphics.

[Just et al., 1998] Just, C., Bierbaum, A., Baker, A., and Cruz-Neira, C. (1998). VR Juggler: A framework for virtual reality development. In Proceedings Immersive Projection Technology Workshop.

[Lever, 2004] Lever, P. G. (2004). SEPIA – applicability to MVC. White paper, Manchester Visualization Centre (MVC), University of Manchester.

[Li et al., 1996] Li, P. P., Duquette, W. H., and Curkendall, D. W. (1996). RIVA: A versatile parallel rendering system for interactive scientific visualization. IEEE Transactions on Visualization and Computer Graphics, 2(3):186–201.
[Li et al., 1997] Li, P. P., Whitman, S., Mendoza, R., and Tsiao, J. (1997). ParVox: A parallel splatting volume rendering system for distributed visualization. In Proceedings IEEE Parallel Rendering Symposium, pages 7–14.

[Lombeyda et al., 2001a] Lombeyda, S., Moll, L., Shand, M., Breen, D., and Heirich, A. (2001a). Scalable interactive volume rendering using off-the-shelf components. Technical Report CACR-2001-189, California Institute of Technology.

[Lombeyda et al., 2001b] Lombeyda, S., Moll, L., Shand, M., Breen, D., and Heirich, A. (2001b). Scalable interactive volume rendering using off-the-shelf components. In Proceedings IEEE Symposium on Parallel and Large Data Visualization and Graphics, pages 115–121.

[Makhinya et al., 2010] Makhinya, M., Eilemann, S., and Pajarola, R. (2010). Fast Compositing for Cluster-Parallel Rendering. In Proceedings of the 10th Eurographics Conference on Parallel Graphics and Visualization, EGPGV, pages 111–120, Aire-la-Ville, Switzerland. Eurographics Association.

[Marrinan et al., 2014] Marrinan, T., Aurisano, J., Nishimoto, A., Bharadwaj, K., Mateevitsi, V., Renambot, L., Long, L., Johnson, A., and Leigh, J. (2014). SAGE2: A new approach for data intensive collaboration using Scalable Resolution Shared Displays. In Collaborative Computing: Networking, Applications and Worksharing, pages 177–186.

[Moll et al., 1999] Moll, L., Heirich, A., and Shand, M. (1999). Sepia: Scalable 3D compositing using PCI Pamette. In Proceedings IEEE Symposium on Field-Programmable Custom Computing Machines, pages 146–155.

[Molnar et al., 1994] Molnar, S., Cox, M., Ellsworth, D., and Fuchs, H. (1994). A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, 14(4):23–32.

[Molnar et al., 1992] Molnar, S., Eyles, J., and Poulton, J. (1992). PixelFlow: High-speed rendering using image composition. In Proceedings ACM SIGGRAPH, pages 231–240.

[Montrym et al., 1997] Montrym, J. S., Baum, D. R., Dignam, D. L., and Migdal, C. J. (1997). InfiniteReality: A real-time graphics system. In Proceedings ACM SIGGRAPH, pages 293–302.
[Mueller, 1995] Mueller, C. (1995). The sort-first rendering architecture for high-performance graphics. In Proceedings Symposium on Interactive 3D Graphics, pages 75–84. ACM SIGGRAPH.

[Mueller, 1997] Mueller, C. (1997). Hierarchical graphics databases in sort-first. In Proceedings IEEE Symposium on Parallel Rendering, pages 49–. Computer Society Press.

[Muraki et al., 2001] Muraki, S., Ogata, M., Ma, K.-L., Koshizuka, K., Kajihara, K., Liu, X., Nagano, Y., and Shimokawa, K. (2001). Next-generation visual supercomputing using PC clusters with volume graphics hardware devices. In Proceedings ACM/IEEE Conference on Supercomputing, pages 51–51.

[Nachbaur et al., 2014] Nachbaur, D., Dumusc, R., Bilgili, A., Hernando, J., and Eilemann, S. (2014). Remote parallel rendering for high-resolution tiled display walls. In Large Data Analysis and Visualization (LDAV), 2014 IEEE 4th Symposium on, pages 117–118.

[Neal et al., 2011] Neal, B., Hunkin, P., and McGregor, A. (2011). Distributed OpenGL rendering in network bandwidth constrained environments. In Kuhlen, T., Pajarola, R., and Zhou, K., editors, Proceedings Eurographics Conference on Parallel Graphics and Visualization, pages 21–29. Eurographics Association.

[Nie et al., 2005] Nie, W., Sun, J., Jin, J., Li, X., Yang, J., and Zhang, J. (2005). A dynamic parallel volume rendering computation mode based on cluster. In Proceedings Computational Science and its Applications, volume 3482 of Lecture Notes in Computer Science, pages 416–425.

[Niski and Cohen, 2007] Niski, K. and Cohen, J. D. (2007). Tile-based level of detail for the parallel age. IEEE Transactions on Visualization and Computer Graphics, 13(6):1352–1359.

[[email protected], 2016] [email protected] (2016). A fast compressor/decompressor. https://github.com/google/snappy.

[Renambot et al., 2004] Renambot, L., Rao, A., Singh, R., Jeong, B., Krishnaprasad, N., Vishwanath, V., Chandrasekhar, V., Schwarz, N., Spale, A., Zhang, C., Goldman, G., Leigh, J., and Johnson, A. (2004). SAGE: The Scalable Adaptive Graphics Environment.

[Rohlf and Helman, 1994] Rohlf, J. and Helman, J. (1994). IRIS Performer: A high performance multiprocessing toolkit for real-time 3D graphics. In Proceedings ACM SIGGRAPH, pages 381–394. ACM Press.
[Samanta et al., 2001] Samanta, R., Funkhouser, T., and Li, K. (2001). Parallel rendering with K-way replication. In Proceedings IEEE Symposium on Parallel and Large-Data Visualization and Graphics. Computer Society Press.

[Samanta et al., 2000] Samanta, R., Funkhouser, T., Li, K., and Singh, J. P. (2000). Hybrid sort-first and sort-last parallel rendering with a cluster of PCs. In Proceedings Eurographics Workshop on Graphics Hardware, pages 97–108.

[Samanta et al., 1999] Samanta, R., Zheng, J., Funkhouser, T., Li, K., and Singh, J. P. (1999). Load balancing for multi-projector rendering systems. In Proceedings Eurographics Workshop on Graphics Hardware, pages 107–116.

[Schulze and Lang, 2002] Schulze, J. P. and Lang, U. (2002). The parallelization of the perspective shear-warp volume rendering algorithm. In Proceedings Eurographics Workshop on Parallel Graphics and Visualization, pages 61–70.

[Staadt et al., 2003] Staadt, O. G., Walker, J., Nuber, C., and Hamann, B. (2003). A survey and performance analysis of software platforms for interactive cluster-based multi-screen rendering. In Proceedings Eurographics Workshop on Virtual Environments, pages 261–270.

[Steiner et al., 2016] Steiner, D., Paredes, E. G., Eilemann, S., and Pajarola, R. (2016). Dynamic work packages in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 89–98.

[Stoll et al., 2001] Stoll, G., Eldridge, M., Patterson, D., Webb, A., Berman, S., Levy, R., Caywood, C., Taveira, M., Hunt, S., and Hanrahan, P. (2001). Lightning-2: A high-performance display subsystem for PC clusters. In Proceedings ACM SIGGRAPH, pages 141–148.

[Stompel et al., 2003] Stompel, A., Ma, K.-L., Lum, E. B., Ahrens, J., and Patchett, J. (2003). SLIC: Scheduled linear image compositing for parallel volume rendering. In Proceedings IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pages 33–40.

[Vezina and Robertson, 1991] Vezina, G. and Robertson, P. K. (1991). Terrain perspectives on a massively parallel SIMD computer. In Proceedings Computer Graphics International (CGI), pages 163–188.

[Wittenbrink, 1998] Wittenbrink, C. M. (1998). Survey of parallel volume rendering algorithms. In Proceedings Parallel and Distributed Processing Techniques and Applications, pages 1329–1336.
[Yang et al., 2001] Yang, D.-L., Yu, J.-C., and Chung, Y.-C. (2001). Efficient compositing methods for the sort-last-sparse parallel volume rendering system on distributed memory multicomputers. Journal of Supercomputing, 18(2):201–220.

[Yu et al., 2008] Yu, H., Wang, C., and Ma, K.-L. (2008). Massively parallel volume rendering using 2-3 swap image compositing. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 48:1–48:11, Piscataway, NJ, USA. IEEE Press.

[Zhang et al., 2001] Zhang, X., Bajaj, C., and Blanke, W. (2001). Scalable isosurface visualization of massive datasets on COTS clusters. In Proceedings IEEE Symposium on Parallel and Large Data Visualization and Graphics, pages 51–58.
CONFERENCE PUBLICATIONS

[Bhaniramka et al., 2005] Bhaniramka, P., Robert, P. C. D., and Eilemann, S. (2005). OpenGL Multipipe SDK: A toolkit for scalable parallel rendering. In Proceedings IEEE Visualization, pages 119–126.

[Eilemann et al., 2017] Eilemann, S., Abdellah, M., Antille, N., Bilgili, A., Chevtchenko, G., Dumusc, R., Favreau, C., Hernando, J., Nachbaur, D., Podhajski, P., Villafranca, J., and Schürmann, F. (2017). From Big Data to Big Displays: High-Performance Visualization at Blue Brain. In Kunkel, J. M., Yokota, R., Taufer, M., and Shalf, J., editors, High Performance Computing, pages 662–675, Cham. Springer International Publishing.

[Eilemann et al., 2012] Eilemann, S., Bilgili, A., Abdellah, M., Hernando, J., Makhinya, M., Pajarola, R., and Schürmann, F. (2012). Parallel Rendering on Hybrid Multi-GPU Clusters. In EGPGV, pages 109–117.

[Eilemann et al., 2016] Eilemann, S., Delalondre, F., Bernard, J., Planas, J., Schuermann, F., Biddiscombe, J., Bekas, C., Curioni, A., Metzler, B., Kaltstein, P., et al. (2016). Key/value-enabled flash memory for complex scientific workflows with on-line analysis and visualization. In Parallel and Distributed Processing Symposium, 2016 IEEE International, pages 608–617. IEEE.

[Eilemann and Pajarola, 2007] Eilemann, S. and Pajarola, R. (2007). Direct send compositing for parallel sort-last rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization.
[Erol et al., 2011] Erol, F., Eilemann, S., and Pajarola, R. (2011). Cross-segment load balancing in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 41–50.

[Hernando et al., 2013] Hernando, J. B., Biddiscombe, J., Bohara, B., Eilemann, S., and Schürmann, F. (2013). Practical Parallel Rendering of Detailed Neuron Simulations. In Proceedings of the 13th Eurographics Symposium on Parallel Graphics and Visualization, EGPGV, pages 49–56, Aire-la-Ville, Switzerland. Eurographics Association.

[Makhinya et al., 2010] Makhinya, M., Eilemann, S., and Pajarola, R. (2010). Fast Compositing for Cluster-Parallel Rendering. In Proceedings of the 10th Eurographics Conference on Parallel Graphics and Visualization, EGPGV, pages 111–120, Aire-la-Ville, Switzerland. Eurographics Association.

[Steiner et al., 2016] Steiner, D., Paredes, E. G., Eilemann, S., and Pajarola, R. (2016). Dynamic work packages in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 89–98.
JOURNAL ARTICLES

[Eilemann et al., 2009] Eilemann, S., Makhinya, M., and Pajarola, R. (2009). Equalizer: A Scalable Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics, 15(3):436–452.

[Eilemann et al., 2018] Eilemann, S., Steiner, D., and Pajarola, R. (2018). Equalizer 2.0 – Convergence of a Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics, pages 1–1.
CURRICULUM VITAE

PARTICULARS

Date of Birth: 9th August 1975, Wittenberg, Germany
Nationality: German, Swiss
Languages: German (native), English (fluent), French (fluent)
Open Source Profile: github.com/eile

PROFILE
Senior software engineer and technical team lead, with a specialization in interactive large data visualization, C++, parallel and distributed programming. Successful track record of building and leading engineering teams to success.

EXPERTISE
- Technical leadership for high performance C++ applications, parallel programming, distributed systems, Virtual Reality and collaborative visualization
- Software and library design, test driven development and maintenance using C++, Typescript, Python, CMake and git
- Software development methodology during the whole lifecycle, ranging from requirements analysis, specification, design and implementation to documentation, education, debugging, optimization and support
- Broad knowledge of operating systems: Mac OS X, Linux, Windows, Irix

EXPERIENCE
Frontend Software Engineer
ESRI R&D Center
Zürich, Switzerland
Nov 2017 – current
Development of frontend APIs and rendering algorithms for 3D mapping.
Researcher, Parallel Rendering
University of Zürich
Zürich, Switzerland
Researched new algorithms for large data visualization, in particular the parallelization, load balancing and data distribution of parallel OpenGL applications on graphics clusters. Invented and developed Equalizer, a framework for scalable, distributed OpenGL applications.
Visualization Team Manager
Blue Brain Project, EPFL
Lausanne, Switzerland
May 2011 – Sep 2017
Built a team of seven software engineers, one post-doc, one PhD student and one media designer to deliver innovative visualization software as well as media for communication and scientific publications. Developed the long-term interactive supercomputing vision and the corresponding medium-term roadmap with the team, and motivated and led the implementation based on modular software components. Drove the implementation of software engineering best practices for the whole project.
CEO and Founder
Eyescale Software GmbH
Neuchâtel, Switzerland
January 2007 – current
Co-founder of Eyescale and lead developer of the Equalizer parallel rendering framework and related libraries. Deploying Equalizer in existing ISV applications to scale display size, performance and visual quality. Software architecture, design and development, hardware and software consulting for multi-GPU workstations, visualization clusters and Virtual Reality.
Senior Software Engineer, 3D Graphics
Tungsten Graphics
Neuchâtel, Switzerland
January 2007 – June 2007
Senior Software Engineer
Esmertec AG
Neuchâtel, Switzerland
January 2004 – September 2005
Job position details available on demand.

Senior Software Engineer
Silicon Graphics, Inc.
Neuchâtel, Switzerland
August 2000 – December 2003
Worked in SGI's advanced graphics division as technical lead for OpenGL Multipipe SDK (MPK), a framework to develop high performance, scalable visualization software. Worked on DataSync, a distributed shared memory API for clusters.
Software Engineer
Freelancer
Munich, Germany
April 2000 – July 2000
Software Engineer
Intec GmbH
Wessling, Germany
October 1998 – March 2000
Job position details available on demand.E