Implementing a Photorealistic Rendering System using GLSL
Toshiya Hachisuka
The University of Tokyo
Figure 1: Example images rendered by the proposed system. All the images are rendered within one minute on a GeForce GTX 680. The rendering process runs entirely on a GPU using only GLSL shaders. The system runs equally well on many different platforms while being capable of handling complex light transport paths.
Abstract
Ray tracing on GPUs is becoming quite common these days. There are many publicly available documents on how to implement basic ray tracing on GPUs for spheres and implicit surfaces. We even have some general frameworks for ray tracing on GPUs. We, however, hardly find details on how to implement the more complex ray tracing algorithms that are commonly used for photorealistic rendering. This paper explains an implementation of a stand-alone rendering system on GPUs which supports the bounding volume hierarchy and stochastic progressive photon mapping. The key characteristic of the system is that it uses only GLSL shaders without relying on any platform dependent feature. The system can thus run on many platforms that support OpenGL, making photorealistic rendering on GPUs widely accessible. This paper also sketches practical ideas for stackless traversal and pseudorandom number generation which both fit well with the limited system configuration.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Raytracing
1. Introduction
The use of GPUs for ray tracing is becoming increasingly common. In particular, details on how to implement a simple ray tracing system for a small number of objects or implicit surfaces are well documented in many publicly available tutorials. On the other hand, while there are several publicly available rendering systems on GPUs, the implementation of such more practical rendering systems has rarely been documented. Nvidia's OptiX [PBD∗10] is one exception, yet OptiX itself is merely a general framework on top of which we can implement various ray tracing algorithms. Even with the availability of such a general ray tracing framework, details on implementations of specific ray tracing algorithms have to be sorted out.

This paper explains an implementation of a rendering system which supports the bounding volume hierarchy [Wal07] and stochastic progressive photon mapping [HJ09] using only OpenGL 3.0 and GLSL 1.20. This limited system configuration was chosen for multiple practical reasons. Firstly, a program can reliably run on various operating systems and GPUs. While this platform independence of OpenGL is not perfect, it is mostly true for battle-tested versions such as version 3.0. This is in contrast to vendor-specific implementations [PBD∗10, DKHS14], which are bound to be incompatible with GPUs of other vendors. Secondly, parallelization over multiple GPUs is automatically supported without additional code. Unlike with OpenCL, the graphics driver manages multiple GPUs with a general parallelization strategy. While this strategy can be suboptimal for each application, this separation of management simplifies the implementation. Thirdly, a program can potentially run on web browsers via WebGL, since WebGL is essentially a limited version of OpenGL and is rapidly becoming common. While WebCL [Khra] proposes support for more general GPU computation on any compatible web browser, no browsers currently support WebCL. There is no one-to-one correspondence between OpenGL and WebGL, but the proposed system uses mainly the features that are also available on WebGL. It thus still makes sense, and is practical, to consider developing a rendering system using a rather old version of OpenGL for general computation.

This paper specifically sketches two practical ideas which are suitable for this limited system configuration. The first idea is a modification of the threaded bounding volume hierarchy [STØ05] which improves the traversal performance by two to three times. The modification is to pre-sort all the nodes in a given threaded BVH along principal directions of rays and to store multiple instances of threaded BVHs. The modified traversal algorithm remains simple and its performance is on a practical level. The second idea is a pseudorandom number generator which uses only floating-point numbers. This generator is computationally inexpensive while the quality of random numbers is sufficient. To summarize, the contributions are:
• Open source GPU rendering system which uses limited features of OpenGL and thus is likely to run on WebGL.
• Modification of the threaded BVH which allows an efficient and simple stackless traversal of a given BVH.
• Introduction of a pseudorandom number generator which uses only floating-point number operations.

While similar work has been recently published by Davidovič et al. [DKHS14], their work focuses on a highly optimized implementation on Nvidia's GPUs using CUDA [NVI07]. The proposed system, on the other hand, is designed to be vendor independent. Since explaining all the details is not very informative, the following sections outline only some high-level ideas. For more details, please refer to the released code. Figure 2 shows end-to-end performance on various GPUs. The code is available at as of May 2015.

arXiv [cs.GR], May

[Bar chart: M paths/sec for Radeon HD 5870, GeForce GT 630, Intel HD 5000, and Intel HD 4000]
Figure 2: End-to-end performance on various GPUs on the metal Cornell box scene. The scene configuration is used by default in the released code.
2. Overview
The proposed system supports stochastic progressive photon mapping [HJ09] as the main rendering engine. Each iteration of stochastic progressive photon mapping consists of three stages. The first stage is photon tracing, which samples light paths starting from light sources. The second stage traces eye paths from the camera until they hit non-specular surfaces. The last stage performs range queries at the intersection points of eye paths and executes stochastic progressive density estimation. The overall implementation is not a mere port of existing algorithms to GLSL, but comes with several algorithmic modifications as noted later.

Both the photon tracing stage and the eye ray tracing stage need an efficient ray casting algorithm. While there exist many efficient ray casting algorithms [ALK12], many of them are not compatible with our limited system configuration. The proposed system thus employs the threaded BVH [STØ05] (TBVH) to achieve a simple stackless traversal algorithm. The original algorithm unfortunately suffers from major performance degradation due to the fixed traversal order. The proposed modification, the multiple-threaded BVH (MTBVH), alleviates this issue and improves the traversal performance by two to three times.

Monte Carlo sampling of light paths and eye paths needs an efficient and high quality random number generator. The main difficulty is that our limited system configuration does not allow any native bitwise operations. Existing random number generators on GPUs rely on the availability of bitwise operations [TW08] and thus are incompatible with our configuration. We introduce a random number generator that uses only floating-point number operations.

    node = cubemap(root_tex, ray.direction);
    while (node != null) {
        if (intersect(node.aabb, ray)) {
            if (node.leaf) result = intersect(node.triangles, ray);
            node = node.hit;
        } else {
            node = node.miss;
        }
    }

Figure 3: Traversal algorithm of MTBVH. The only difference from the traversal algorithm of the original threaded BVH [STØ05] is that it chooses a pre-ordered set of hit/miss links according to the ray direction (first line).
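The first line of the traversal in Figure 3 picks a set of links from the ray direction through a cube map lookup. A minimal CPU-side sketch of that selection is given below, assuming the six sets are indexed in OpenGL's cube map face order (+x, -x, +y, -y, +z, -z); the function name `principal_direction` and the index layout are illustrative and may differ from the released code.

```python
def principal_direction(d):
    """Return 0..5 for +x, -x, +y, -y, +z, -z, whichever axis dominates d.

    Mirrors how a cube map lookup selects a face: the component with the
    largest magnitude picks the axis, its sign picks the side.
    """
    axis = max(range(3), key=lambda i: abs(d[i]))
    return 2 * axis + (1 if d[axis] < 0 else 0)


print(principal_direction((0.2, -0.9, 0.1)))  # 3, i.e. the -y link set
```

Using the texture unit for this choice avoids any branching in the shader; the sketch above only reproduces the face selection rule, not the texture fetch.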
3. System components
3.1. Multiple-Threaded BVH
Since GLSL 1.20 (and GLSL in WebGL) does not support a stack, we need a stackless traversal algorithm that runs efficiently on GPUs. On top of this practical reason, even if we could use a stack, stackless traversal is known to have multiple advantages for massively parallel platforms such as GPUs [ÁSK14]. Threading [STØ05] is one such approach to achieve stackless traversal of a given BVH. The threaded BVH stores hit/miss links instead of the tree structure to represent a given BVH. The traversal algorithm simply follows a hit or miss link depending on the result of the ray-AABB intersection test at each node.

The proposed modification is to store six instances of hit/miss links by pre-sorting nodes along the positive x axis, the negative x axis, and so on for y and z as well. The traversal algorithm remains almost the same, but it now selects one out of the six sets of links at the beginning, depending on the direction of a given ray. Figure 3 is the pseudocode of the traversal algorithm with the highlighted modification. This simple modification enables approximate optimization of the traversal order according to a given ray direction.

Figure 4 shows comparisons of ray traversal performance. While the multiple-threaded BVH is still not as fast as the vendor-specific optimization [ALK12], it is two to three times faster than the original threaded BVH and remains vendor-independent. It should be emphasized that the proposed algorithm is not fundamentally designed to achieve the best performance, but to achieve reasonable performance with only vendor-independent features. Please be aware that the provided code reports the number of complete paths per second including end-to-end rendering computation, not raw rays per second, thus the numbers will differ from those in Figure 4.

[Bar chart: M rays/sec on the bunny, fairy, sponza, and conference scenes for ALK12, TBVH, and MTBVH]
Figure 4: Performance comparisons of ray casting using the vendor-specific optimized traversal [ALK12], the original threaded BVH [STØ05] (TBVH), and the multiple-threaded BVH (MTBVH). The experiments used the SAH-BVH [Wal07] for all the algorithms and the computation times include diffuse shading. The testing environment is a GeForce GT 630.
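To make the hit/miss links concrete, the following sketch threads a toy BVH for one principal direction and then traverses it without a stack. It is an illustration under stated assumptions, not the released implementation: the `Node` class, the preorder layout, sorting children by AABB centroid, and the use of -1 as the null link are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec
    hi: Vec
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prim: Optional[str] = None  # leaf payload, stands in for triangles


def thread(root, axis, sign):
    """Flatten the tree in preorder, nearer child along sign*axis first.

    Returns a list of (node, hit, miss); -1 is the null link.
    """
    order = []

    def visit(n):
        idx = len(order)
        order.append(None)
        if n.prim is None:
            kids = sorted((n.left, n.right),
                          key=lambda c: sign * (c.lo[axis] + c.hi[axis]))
            for c in kids:
                visit(c)
        order[idx] = (n, len(order))  # index just past this subtree

    visit(root)
    flat = []
    for i, (n, past) in enumerate(order):
        miss = past if past < len(order) else -1  # skip the whole subtree
        hit = i + 1 if n.prim is None else miss   # leaves continue like a miss
        flat.append((n, hit, miss))
    return flat


def hits_box(lo, hi, o, d):
    """Slab test of the ray o + t*d, t >= 0, against an AABB."""
    t0, t1 = 0.0, float("inf")
    for a in range(3):
        if d[a] == 0.0:
            if not lo[a] <= o[a] <= hi[a]:
                return False
            continue
        ta, tb = (lo[a] - o[a]) / d[a], (hi[a] - o[a]) / d[a]
        t0, t1 = max(t0, min(ta, tb)), min(t1, max(ta, tb))
    return t0 <= t1


def traverse(flat, o, d):
    """Follow hit/miss links as in Figure 3; returns leaves in visiting order."""
    visited, i = [], 0
    while i != -1:
        n, hit, miss = flat[i]
        if hits_box(n.lo, n.hi, o, d):
            if n.prim is not None:
                visited.append(n.prim)
            i = hit
        else:
            i = miss
    return visited


leaf_a = Node((0, 0, 0), (1, 1, 1), prim="A")
leaf_b = Node((2, 0, 0), (3, 1, 1), prim="B")
root = Node((0, 0, 0), (3, 1, 1), left=leaf_a, right=leaf_b)

print(traverse(thread(root, 0, +1), (-1, 0.5, 0.5), (1, 0, 0)))   # ['A', 'B']
print(traverse(thread(root, 0, -1), (4, 0.5, 0.5), (-1, 0, 0)))   # ['B', 'A']
```

In both calls the leaf nearer to the ray origin is visited first, which is the point of storing one link set per principal direction. A full tracer would presumably also keep the closest hit distance and use it to reject farther boxes, which is where near-first ordering would pay off; that part is omitted here.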
The implementation uses a standard top-down sweeping algorithm for constructing a BVH based on SAH [Wal07]. The construction and threading are both currently done on CPUs and the resulting data is transferred to GPUs afterward. Threading usually takes less than 100 ms for a typical scene and hardly becomes the bottleneck.

The storage overhead of MTBVH is not as significant as it appears to be. For instance, if we count the number of vec4s used in the original threaded BVH, a triangle is stored as six vec4s (two vec4s for the packed position, normal, and texture coordinates of each vertex), an AABB is stored as two vec4s (min and max), and hit/miss links can be packed into one vec4. MTBVH adds only five more hit/miss links. Since the number of triangles and the number of nodes are typically not very different, MTBVH does not increase the total storage cost by six times, but by approximately 1.56 times (9 vec4s of the original vs 14 vec4s of ours). Since image textures usually add significantly more to the storage cost, the overhead of MTBVH is not significant.
Since GLSL 1.20 does not support bitwise arithmetic operations, we cannot port existing pseudorandom number generators (PRNGs) such as the one based on cryptographic hashing [TW08]. The system thus uses a weighted sum of multiplicative linear congruential generators [L'e88] with only floating-point number operations. Figure 5 shows the GLSL code of this PRNG. The original version of the algorithm was introduced as an anonymous post at the GPGPU web forum (http://web.archive.org/web/20101217080108/http://gpgpu.org/forums/viewtopic.php?t=2591). The algorithm in Figure 5 has been modified to run well on GLSL. To generate many random numbers in parallel, one can initialize each PRNG state on a CPU via xorshift [M∗03]. The modification might have introduced some statistical deficiency. The author of the above anonymous posting seems to claim a certain quality of randomness within the 23-bit mantissa.

    float GPURnd(inout vec4 state) {
        // Four multiplicative LCGs with multipliers a and moduli m = a * q + r.
        const vec4 q = vec4(1225, 1585, 2457, 2098);
        const vec4 r = vec4(1112, 367, 92, 265);
        const vec4 a = vec4(3423, 2646, 1707, 1999);
        const vec4 m = vec4(4194287, 4194277, 4194191, 4194167);
        vec4 beta = floor(state / q);
        vec4 p = a * mod(state, q) - beta * r;
        // Add m wherever p is negative to bring it back into range.
        beta = (sign(-p) + vec4(1)) * vec4(0.5) * m;
        state = (p + beta);
        return fract(dot(state / m, vec4(1, -1, 1, -1)));
    }

Figure 5: Pseudorandom number generator using only floating point number operations. The algorithm is based on a weighted sum of four instances of the multiplicative linear congruential generator [L'e88].
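The generator in Figure 5 can be read as four multiplicative congruential steps, each computed through the decomposition m = a * q + r so that no intermediate value exceeds the modulus. The following CPU-side port is a sketch for checking that reading: it uses Python's double-precision floats rather than GLSL's single precision, the seed values are arbitrary, and the comparison against plain integer arithmetic is added here for illustration only.

```python
import math

Q = (1225.0, 1585.0, 2457.0, 2098.0)
R = (1112.0, 367.0, 92.0, 265.0)
A = (3423.0, 2646.0, 1707.0, 1999.0)
M = (4194287.0, 4194277.0, 4194191.0, 4194167.0)


def sign(x):
    """Same convention as GLSL sign(): -1, 0, or 1."""
    return (x > 0) - (x < 0)


def gpu_rnd(state):
    """Advance the four-component state in place and return a value in [0, 1]."""
    for i in range(4):
        beta = math.floor(state[i] / Q[i])
        p = A[i] * math.fmod(state[i], Q[i]) - beta * R[i]
        state[i] = p + (sign(-p) + 1) * 0.5 * M[i]  # add m only when p < 0
    x = sum(w * s / m for w, s, m in zip((1, -1, 1, -1), state, M))
    return x - math.floor(x)  # fract()


state = [1.0, 2.0, 3.0, 4.0]  # arbitrary seeds for this example
reference = [1, 2, 3, 4]
for _ in range(1000):
    u = gpu_rnd(state)
    # Each component should match the plain integer update s <- (a * s) mod m.
    reference = [(int(a) * s) % int(m) for a, s, m in zip(A, reference, M)]
    assert state == [float(s) for s in reference]
print("last sample:", u)
```

Every intermediate value stays below 2^23, which is consistent with the remark about the 23-bit mantissa. Note that a state component of exactly zero would stay at a degenerate value, so seeds should be nonzero.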
A standard implementation of photon tracing keeps a global list of photons and sequentially adds a photon to the list. This approach, however, needs inter-thread communication on GPUs to compact lists in order to obtain a complete global list of photons. The alternative approach used in the implementation is to trace a single bounce per pass and to store at most one photon. For example, the very first pass traces photons from light sources toward the first intersections and stores the photons. Further bounces are traced only in succeeding passes. Each pass thus outputs at most one photon per thread, not a list of photons.

A new photon path is generated in the next pass if the current photon path is killed by Russian roulette or misses the scene. This process is done independently between threads (in our case, threads are equal to pixels), thus each thread potentially traces a photon ray at a different number of bounces. This approach keeps all threads busy all the time regardless of path length. A similar method is used for path tracing [NHD10], and its authors reported a performance improvement. We can observe a similar improvement in the proposed system.
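The pass structure described above can be mimicked on a CPU with one bounce counter per thread. The sketch below is a toy model under loud assumptions: there is no geometry, and the probabilities `miss_prob` and `survive_prob` stand in for ray misses and Russian roulette. It only illustrates that each pass emits at most one photon per thread and that terminated paths restart in the next pass.

```python
import random


def run_passes(num_threads, num_passes, survive_prob=0.7, miss_prob=0.1, seed=0):
    """Simulate single-bounce passes; returns the photons stored in each pass.

    A photon is recorded as (thread index, bounce number). A bounce counter
    of 0 means the thread starts a new path from the light in this pass.
    """
    rng = random.Random(seed)
    bounce = [0] * num_threads
    history = []
    for _ in range(num_passes):
        photons = []  # at most one entry per thread
        for t in range(num_threads):
            if rng.random() < miss_prob:
                bounce[t] = 0  # ray left the scene: no photon, restart next pass
                continue
            bounce[t] += 1
            photons.append((t, bounce[t]))
            if rng.random() >= survive_prob:
                bounce[t] = 0  # killed by Russian roulette: restart next pass
        history.append(photons)
    return history


history = run_passes(num_threads=8, num_passes=20)
print(max(len(p) for p in history) <= 8)  # True: never more than one per thread
```

Because each thread writes to its own output slot, no compaction step is needed between passes, and threads holding paths of different lengths run side by side.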
Hachisuka and Jensen [HJ10] proposed a stochastic hashing algorithm that utilizes the statistical nature of density estimation. They pointed out two fundamental challenges in the use of regular spatial hashing on GPUs: the sequential nature of list construction for hash table entries and uneven workload distribution at the data retrieval phase. Their key idea is to let only one photon survive for each hash table entry under concurrent writes, and to scale the contribution of each surviving photon by the number of hash collisions.

The proposed modification is to assign a statistically independent random depth value (using the PRNG in Figure 5) to each photon path and to enable z-buffering for hashing photon data simultaneously. Davidovič et al. [DKHS14] pointed out that stochastic hashing can potentially produce wrong results if the time at which each photon is hashed depends on some properties of the given photon path, such as path length. This modification ensures that the probability that one photon survives over other photons is independent of any properties of the corresponding photon path.
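As an illustration of the random-depth idea, the sketch below emulates the z-buffer on a CPU: every photon draws a depth that is independent of its path, the smallest depth per hash slot wins, and the survivor is scaled by the number of photons that landed in that slot. The hash function, table layout, and the names `stochastic_hash`, `table_size`, and `cell_size` are assumptions made for this example, and Python's `random` module replaces the PRNG of Figure 5.

```python
import random


def stochastic_hash(photons, table_size, cell_size, rng):
    """photons: list of (position, flux). Returns {slot: (position, scaled_flux)}."""
    depth = {}   # emulated z-buffer: smallest depth seen per slot
    winner = {}  # surviving photon per slot
    count = {}   # number of photons that landed in each slot
    for pos, flux in photons:
        cell = tuple(int(c // cell_size) for c in pos)
        slot = hash(cell) % table_size
        z = rng.random()  # independent of path length or arrival order
        count[slot] = count.get(slot, 0) + 1
        if slot not in depth or z < depth[slot]:
            depth[slot] = z
            winner[slot] = (pos, flux)
    # Scale the survivor so that it stands in for every photon in its slot.
    return {s: (p, f * count[s]) for s, (p, f) in winner.items()}


rng = random.Random(1)
photons = [((rng.uniform(0, 4), rng.uniform(0, 4), rng.uniform(0, 4)), 1.0)
           for _ in range(1000)]
table = stochastic_hash(photons, table_size=64, cell_size=1.0, rng=rng)
print(sum(f for _, f in table.values()))  # 1000.0: total flux is preserved
```

Since the depth is drawn independently of the order in which photons arrive, the survivor of each slot is a uniformly random pick among its colliding photons, which is the property the modification is meant to guarantee.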
4. Discussion
All the shaders in the proposed system use only features of GLSL 1.20. They can thus in principle run on WebGL, which is based on OpenGL ES 2.0 and supports most features of GLSL 1.20. In practice, however, they do not run on WebGL alone, since the official specification of WebGL does not guarantee native support of some important features such as rendering to floating point number textures. It should also be repeated that there is no one-to-one correspondence between a certain version of OpenGL and WebGL. Having said that, since WebGL can support floating point textures via an extension, it is most likely possible to port the code to WebGL. WebGL 2 [Khrb] will natively support this extension.
While the code is intended to be platform-independent, it may fail due to the vendor-dependent JIT compilation model of GLSL. No fundamental modification, however, will be necessary to make the code successfully executable. The author would appreciate a bug report in case you find any.

The current implementation does not build an acceleration data structure using GPUs. It is, however, trivial to implement linear BVH [LGS∗09] [PDC∗03] within our limited configuration, if one wants to build an acceleration data structure on GPUs. A more efficient construction of a high quality tree [GPM11] using only GLSL could, however, be challenging due to the requirement of more advanced thread management such as a work queue.

While stochastic progressive photon mapping covers many scene configurations, one might want to implement more advanced rendering algorithms such as unified path sampling [HPJ12] / vertex connection and merging [GKDS12]. Davidovič et al. [DKHS14] show how to implement some of these algorithms using CUDA, but porting their algorithms to GLSL may pose some challenges.

The proposed system does not support out-of-core rendering. Efficient out-of-core rendering on GPUs is still an open problem even with more general GPU computation platforms. This feature, however, might not be necessary for some applications such as rendering for online shopping.
5. Conclusion
The paper outlines an implementation of a rendering system via OpenGL 3.0 and GLSL 1.20 without relying on any platform dependent feature. One of the proposed modifications, the multiple-threaded BVH, uses the principal direction of a given ray to select one of the threaded BVHs. This simple modification results in a two to three times performance improvement compared to the original threaded BVH. Unlike common pseudorandom number generators, the proposed generator uses only floating point number operations based on a combination of multiplicative linear congruential generators. The resulting generator is computationally inexpensive and the quality of generated random numbers is sufficient for the proposed rendering system. The author believes that the code can serve as an example implementation of a rendering system using only platform independent features.
Acknowledgements
The author thanks Kentaro Oku for reporting bugs in the first release of the code. The scanned woman model is courtesy of FUTURESCAN.
References
[ALK12] Aila T., Laine S., Karras T.: Understanding the Efficiency of Ray Traversal on GPUs – Kepler and Fermi Addendum. NVIDIA Technical Report NVR-2012-02, NVIDIA Corporation, June 2012.
[ÁSK14] Áfra A. T., Szirmay-Kalos L.: Stackless multi-bvh traversal for cpu, mic and gpu ray tracing. In Computer Graphics Forum (2014), vol. 33, Wiley Online Library, pp. 129–140.
[DKHS14] Davidovič T., Křivánek J., Hašan M., Slusallek P.: Progressive light transport simulation on the GPU: Survey and improvements. ACM Transactions on Graphics (TOG) 33, 3 (2014), 29.
[GKDS12] Georgiev I., Krivanek J., Davidovic T., Slusallek P.: Light transport simulation with vertex connection and merging. ACM Trans. Graph. 31, 6 (2012), 192.
[GPM11] Garanzha K., Pantaleoni J., McAllister D.: Simpler and faster HLBVH with work queues. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics (2011), ACM, pp. 59–64.
[HJ09] Hachisuka T., Jensen H. W.: Stochastic progressive photon mapping. ACM Transactions on Graphics (TOG) 28, 5 (2009), 141.
[HJ10] Hachisuka T., Jensen H. W.: Parallel progressive photon mapping on GPUs. In ACM SIGGRAPH ASIA 2010 Sketches (2010), ACM, p. 54.
[HPJ12] Hachisuka T., Pantaleoni J., Jensen H. W.: A path space extension for robust light transport simulation. ACM Transactions on Graphics (TOG) 31, 6 (2012), 191.
[Khra] Khronos Group: WebCL.
[Khrb] Khronos Group: WebGL 2 specification.
[L'e88] L'Ecuyer P.: Efficient and portable combined random number generators. Communications of the ACM 31, 6 (1988), 742–751.
[LGS∗09] Lauterbach C., Garland M., Sengupta S., Luebke D., Manocha D.: Fast BVH construction on GPUs. Computer Graphics Forum 28, 2 (2009), 375–384.
[M∗03] Marsaglia G., et al.: Xorshift rngs. Journal of Statistical Software 8, 14 (2003), 1–6.
[NHD10] Novák J., Havran V., Daschbacher C.: Path regeneration for interactive path tracing. In The European Association for Computer Graphics 28th Annual Conference: EUROGRAPHICS 2007, short papers (2010), The European Association for Computer Graphics, pp. 61–64.
[NVI07] NVIDIA Corporation: NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.
[PBD∗10] Parker S. G., Bigler J., Dietrich A., Friedrich H., Hoberock J., Luebke D., McAllister D., McGuire M., Morley K., Robison A., et al.: Optix: a general purpose ray tracing engine. ACM Transactions on Graphics (TOG) 29, 4 (2010), 66.
[PDC∗03] Purcell T. J., Donner C., Cammarano M., Jensen H. W., Hanrahan P.: Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2003), Eurographics Association, pp. 41–50.
[STØ05] Simonsen L. O., Thrane N., Ørbæk P.: A comparison of acceleration structures for GPU assisted ray tracing. Master's thesis, University of Aarhus (2005).
[TW08] Tzeng S., Wei L.-Y.: Parallel white noise generation on a gpu via cryptographic hash. In Proceedings of the 2008 symposium on Interactive 3D graphics and games (2008), ACM, pp. 79–87.
[Wal07] Wald I.: On fast construction of SAH-based bounding volume hierarchies. In