[PDF] A GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations

Abstract

An essential component in proton radiotherapy is the algorithm to calculate the radiation dose to be delivered to the patient. The most common dose algorithms are fast but they are approximate analytical approaches. However their level of accuracy is not always satisfactory, especially for heterogeneous anatomic areas, like the thorax. Monte Carlo techniques provide superior accuracy, however, they often require large computation resources, which render them impractical for routine clinical use. Track-repeating algorithms, for example the Fast Dose Calculator, have shown promise for achieving the accuracy of Monte Carlo simulations for proton radiotherapy dose calculations in a fraction of the computation time. We report on the implementation of the Fast Dose Calculator for proton radiotherapy on a card equipped with graphics processor units (GPU) rather than a central processing unit architecture. This implementation reproduces the full Monte Carlo and CPU-based track-repeating dose calculations within 2%, while achieving a statistical uncertainty of 2% in less than one minute utilizing one single GPU card, which should allow real-time accurate dose calculations.

Full PDF

AA GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations

Pablo P Yepes , Dragan Mirkovic and Phillip J Taddei Department of Physics and Astronomy, MS 315, Rice University, 6100 Main Street, Houston, TX 77005, USA Department of Radiation Physics, Unit 1202, The University of Texas M. D. Anderson Cancer, 1515 Holcombe Blvd., Houston, TX 77030, USAPACS: 61.80.-x, 61.82.-d, 78.70.-g, 81.40.Wx, 87.53.-jShort title: GPU-based algorithm for proton dose calculations. Key words: proton radiotherapy, track-repeating, GPU, dose calculation, Monte Carlo, fast dose calculatorCorresponding author: Pablo P. Yepes, PhD, Physics and Astronomy Department, MS 315, Rice University, Houston, TX 77005, USA Phone: 713-248-5942; Fax: 713-348-5215; E-mail: [email protected] is an author-created, un-copyedited version of an article accepted for publication in Physics in Medicine and Biology. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. bstract

1. Introduction

Radiation therapy is an important component of the treatment of cancer. Radiation dose absorbed in normal tissues produces acute effects, for example necrosis, and late effects, such as carcinogenesis. An essential component for the quality of a radiotherapy treatment plan is the accuracy of dose calculations (Papanikolaou et al 2004). The clinical advantages of more accurate dose calculations -tumor recurrence, local control, and normal tissue complications- has not been fully quantified and requires further investigation. Nevertheless evidence exists that dose differences on the order of 7 % are clinically detectable (Dutreix 1984). Moreover, several studies have shown that 5 % changes in dose can result in 10 %- % changes in tumor control probability or up to 20-30 % changes in normal tissue complication probabilities (Orton 1984, Stewart 1975, Goitein 1975). Dose distributions in proton therapy are typically calculated by commercial treatment planning engines based on analytical pencil-beam algorithms (Petti 1992, Russell 1995, Hong 1996, Deasy 1998, Schneider 1998, Schaffner 1999, Szymanowski and Oelfke et al. (1999) reviewed various analytical proton dose models and concluded that no single pencil-beam model can predict the dose correctly in every situation. They also concluded that the Monte Carlo approach is more accurate than any analytical model and should be used to verify the dose distributions in situations above a certain level of anatomical complexity. Soukup et al (2005) derived a pencil-beam method from Monte Carlo simulations; this method works well for simple heterogeneous or slab-structured phantoms, however it does not achieve the accuracy of the Monte Carlo approach for phantoms describing more complex heterogeneous media, for example head and neck and pelvic geometries. Ciangaru et al (2005) benchmarked analytical calculations of proton doses in simple heterogeneous phantoms and concluded that the algorithms were reasonably accurate for predicting doses at anatomic sites containing laterally extended inhomogeneities that are comparable in density to one another and located away from the Bragg peak. However, the algorithm had mixed success in calculating proton doses in areas with a combination of high and low density media. The Monte Carlo approach has been shown to provide higher accuracy (Titt et al 2008, Koch et al 2008, Newhauser et al 2008, Paganetti et al 2008, Giebeler et al 2010) than pencil-eam algorithms, however its clinical utilization has been hampered by its high computational requirements. In one study with the Monte Carlo code MCNPX (Pelowitz 2007), for a typical three-beam proton lung cancer, simulating all three proton radiotherapy treatment fields with acceptable statistical precision required up to 5000 hours of central processor unit (CPU) time (Taddei 2009). The high computation times make the use of robust Monte Carlo codes impractical for routine clinical radiotherapy dose calculations, i.e., without the use of large-scale computer clusters.Various simplified Monte Carlo approaches have also been reported in the literature. Kohno and collaborators (2003) implemented a proton Monte Carlo dose algorithm in which a reduced number of physics processes were taken into account. This approach increased the computational efficiency while largely preserving the accuracy of the calculations (Hotta 2010). Fippel and Soukup (2004) developed another proton Monte Carlo code based on the concept of the voxel Monte Carlo algorithm that had previously been applied to photons and electrons (Fippel 1999). They reported good agreement between dosimetric predictions from their code and from full Monte Carlo codes. Their calculation times were 35 and 13 times faster than Monte Carlo codes GEANT4 (Agostinelli et al 2003; Allison et al 2006) and FLUKA (Fasso et al 2005), respectively. Tourovsky et al (2005) implemented a stochastic proton transport code with simplified physics models and tested it in a variety of clinical cases. Computation times relative to other algorithms were not reported. Li et al (2005) reported a track-repeating algorithm for proton therapy that maintained the accuracy of the Monte Carlo technique, while significantly decreasing computation times. However, their validation was limited to simple heterogeneous phantoms. Extending their work, we developed an alternative track-repeating algorithm, the fast dose calculator (FDC), and applied it to clinical proton treatment planning for the pelvis (Yepes et al. 2009a and Yepes et al. 2009b) and the thorax (Yepes et al. 2010). FDC reproduced the results of GEANT4 simulations to within 2%, yet required less than 1% of the computation time. While the reduction in computation time with FDC was significant, calculation times were still on the order of an hour on a single CPU for the dose calculation of a typical radiotherapy treatment. An alternative, more efficient computation device is needed to use FDC for routine clinical dose verification calculations viable.Calculations can be accelerated by performing the computation in a parallel manner on multiple processor systems, like CPU clusters or graphics processing units (GPUs). Dose calculations performed on a large number of distributed CPUs has been reported in the literature (Vadapalli 2010). GPU clusters are less expensive and easier to maintain than traditional CPU clusters. GPU-based algorithms have been developed for a variety of tasks in radiotherapy (Jia et al 2010a and 2010b, Gu et al 2009 and 2010, Jacques 2008 and Hissoiny et al 2009). They have been used for dose calculations by Hissoiny et al (2009) and Jacques et al (2008), who implemented superposition convolution algorithms for dose calculation on GPUs, and by Gu et al (2009), who explored the use of GPUs for a finite size pencil beam model. Moreover a Monte Carlo code for coupled electron-photon transport was implemented on a GPU architecture by Jia et al (2010b). They reported speed gains up to a factor of 6.6. However, to date, to our knowledge no attempt to implement a track-repeating algorithm on a GPU architecture has been reported in the literatureIn this paper, we report our recent development of the FDC track-repeating algorithm for proton therapy on a GPU architecture under the computer unified device architecture (CUDA) platform developed by NVIDIA (2009). The code has been implemented on a single ommercially available graphic card with two GPU units. The GPU-based FDC has been benchmarked against the CPU-based FDC code and the full Monte Carlo GEANT4, since the latter has been validated for proton therapy (Aso et al 2005, Paganetti et al 2008) .

2. Methods

We used the GEANT4 tool kit (version 4.8.3) (Agostinelli et al et al via the continuous slowing-down approximation for secondary electrons with energy below Tc. Tc is material dependent and calculated from a particle distance range set to 0.1 mm. This translates to 83.6 KeV and 250 eV energy cutoffs for electrons in water and air, respectively. In this work we utilized the low-energy parametrized model (Chauvie 2004), which takes into account atomic and shell effects and is applicable down to 250 eV. It uses the Bethe-Bloch formula to calculate hadron ionization down to 2 MeV, and a ICRU 49 parametrization (ICRU 1993) in the range 1 keV to 2 MeV. Below 1 keV the free electron gas approach is utilized. The energy straggling was calculated with a Gaussian distribution with Bohr’s variance (ICRU 1993) for distances long enough for the approximation to be accurate. For short distances a simple model of the atom was used (GEANT4 2008 ). The multiple Coulomb scattering for protons was estimated with a condensed simulation algorithm, in which the global effects of the collisions are estimated at the end of a track segment. It uses model functions to determine the angular and spatial distributions, which were chosen to reproduce the distributions of the Lewis theory (Lewis 1950 ). For elastic hadronic interactions the low-energy parametrized model was implemented. Inelastic interactions were simulated with a pre-equilibrium model in the range of interest to our simulations (0-250 MeV). The model is based on Griffin’s semi-classical description of composite nucleus decay (Griffin, 1966, Gudima et al

The FDC algorithm utilizes a pre-generated database of the histories of particles produced by a proton impinging on a water phantom. In our study, we have generated the database using the Monte Carlo code GEANT4. Each particle trajectory is broken into steps, and for each step the direction, length and energy loss is stored. FDC calculates dose distributions in heterogeneous anatomies, by re-tracing proton tracks. This re-tracing is achieved by scaling the length and scattering angle of each step according to the material in the non-water medium. In this study the database of proton histories was generated by simulating 121 MeV protons impinging on a 510x510x2500 mm water phantom with a 3x3x2500 segmentation along the (x,y,z) axes. Further details of the FDC code were described previously (Yepes 2009a, 2009b).In this work the dose and deposited energy calculated with GEANT4 and FDC, as described previously (Yepes, 2009a, Yepes 2009b, Yepes 2010), were used as basis of comparison with prediction from GFDC. Results from the CPU version of FDC reported in this work utilized a database with 10 million pre-calculated proton histories in water, while the GPU-ased version utilized a database with only 100,000 proton histories. Such a relative small database was chosen because no undesired correlations were observed with such a reduced number of 100,000 proton histories (see Results section). A small database should make the algorithm easier to install on various platforms. GFDC was developed using the CUDA software platform (NVIDIA 2009) on a general purpose GPU on a single graphics card (GEFORCE GTX 295, NVIDIA, Santa Clara, CA) with 1.79GB of global memory as the hardware platform. That GPU card holds 2 GPU units, with each unit holding 240 GPU cores. The software environment calls for the generation of multiple computational threads. The total number of threads is defined by the programmer and can be as high as hundreds of thousands. Threads are divided into blocks, so that the total number of threads is the number of blocks (N B ) multiplied by the number of threads per block (N T ). The number of threads should be a multiple of 32 for optimal performance because of hardware configuration of the GPU units. Since the track-repeating algorithm is inherently highly parallel, each thread is treated as an independent computational unit. Each unit re-traces one of the proton histories from the database of pre-calculated histories. A flowchart of the GPU implementation of FDC is shown in Figure 1. The code was split into sections to be executed on the CPU and on the GPU, with the GPU being called from the CPU code. After the program initialization on the CPU, it first reads the material and geometry information from configuration files. That information was then stored in GPU global memory accessible to the GPU code. The material information corresponds to the scaling parameters of the step length and scattering angle for all the materials to be used in the calculation. The geometry information, derived from the patient CT-scan, consists of a three dimensional array with a material index for each voxel. After this initial stage, a loop over a pre- determined number of iterations was initiated. For each iteration, a number histories equal to the number of thread blocks (N B ) was read from the database. In addition an array with phase-space information for N T xN B incident protons to be simulated was generated. This proton phase-space can be read from a file or generated randomly by the program within certain controllable ranges in energy and direction. In the results present in this work, the phase space was generated with fixed energy and direction but a spread in the position transverse to the beam (See section 2.4). Once the phase-space array was generated and the proton histories read, both pieces of information were transferred from the CPU to the GPU global memory, where they could be accessed by all the N t xN B GPU threads. Since only N B pre-calculated proton histories were made available for N T xN B threads, the same proton history was utilized N T times for various positions of the incident protons. However, since trajectories from a given history traversed different areas of the heterogeneous phantoms, the results from the re-tracing of the same history were expected to be statistically independent. Statistical uncertainties are calculated with two alternative methods, as explained in Section 2.3, to test whether different trajectories are statistically independent.After the operations to initialize an iteration were completed, the GPU code was invoked from the CPU code through a special C-language function termed kernel. The kernel was executed N=N T x N B times on the GPU engine in parallel on independent threads. The GPU code was subdivided into two main tasks: 1) finding the database history to be used and selecting the trajectory and step where the track-repeating algorithm should start; 2) re-tracing the pre-calculated proton history through the heterogeneous phantom. he deposited energy generated by the different threads was tallied in a large three dimensional grid with the same number of voxels as the heterogeneous phantom. The tally grid was defined in GFDC as a one dimensional array and was placed in the GPU global memory. This approach minimized the amount of memory utilized for tallying, which was significant for large grids. However utilizing a common tally for all threads required a function which blocked access to the memory location while a particular thread updated the information stored in it. Blocking a certain memory location forced other threads, trying to update the same voxel tally, to wait until the operation was completed. Such a mechanism slowed down the algorithm when competing threads attempted to access the same memory location simultaneously. A second tallying array, with the same dimension as the tally for the deposited energy, was utilized to store the sum of the squares of the deposited energy. The second array was necessary to estimate the history-by-history statistical uncertainties (See next section). At the end of the run the energy deposited in each voxel was converted to absorbed dose by dividing by the mass of the voxel. The number of blocks (N B ) and threads (N T ) were optimized to minimize the execution time.The graphics card used in this study housed two GPU units. In order to maximize the performance of the card, two CPU threads were defined in the code, with each thread handling the operation and data transfer to one of the GPU units. Each CPU thread read from the same database; however, they read different pre-calculated proton histories. At the end of the execution of the two CPU threads, the absorbed dose from both threads was combined. Standard methods (Chetty 2007) were used to calculate statistical uncertainties and to investigate the effects of using the same 100,000 pre-calculated proton histories as many as 250 times. The batch method consists in comparing the results of multiple calculations, or batches, performed with uncorrelated phase space files and random number sequences. For this method, the estimate of uncertainty of the dose, D , is given by: σ  D =  ∑ i= n  D i −  D  n  n −  (1)where n is the number of independent batches or runs, D i is the scored dose in batch i , and  D is the mean value of the absorbed dose over all the batches. The sample size is given by the number of batches or independent calculations. In the history-by-history method, where a history dose corresponds to the absorbed dose produced by a single proton impinging on the phantom, the statistical uncertainty is given by: σ D =  N −  ∑ i= n D i N −  ∑ i= n D i N   (2) where N is number of primary histories, D i the contribution to the dose by independent history i . Whereas the history-by-history approach may be distorted by hidden correlations, the batch method uncertainties would not because each batch is generated with completely independent databases and random numbers. If the estimate of the uncertainties is biased by hidden correlations, we expect a deviation from the 1 /  N behavior for the history-by-history uncertainties. Moreover the estimated uncertainty for the history-by-history approach will be lower than the unbiased correlations. On the other hand, hidden correlations due a small database are not expected to affect the batch estimates. Thus, comparing the results of these two pproaches tests the feasibility of using the same proton history multiple times for these type of calculations.The mean dose uncertainty was defined as the average of σ D / D over all voxels with a dose larger than half the maximum dose, with σ D calculated with equations (1) and (2). To quantify the dosimetric accuracy of the FDC and GFDC dose distributions, we compared them to a distribution from GEANT4 simulations for the same field and voxelized phantom. The figure of merit used to quantify the dosimetric accuracy was the gamma index, G (Low et al j , have values of Γ j smaller than unity. GEANT4, which was previously validated for applications in proton therapy (Aso 2005 et al and Paganetti et al 2008), provided reference dose distribution in this study, against which dose distributions from FDC and GFDC were compared. The maximum acceptable differences in dose and spatial distance used to calculate the G index in this study were 3% and 3 mm, respectively. The geometric model was represented as a voxelized phantom based on the computed tomography (CT) images of the thoracic region of a patient who had been treated for lung cancer at The University of Texas M. D. Anderson Cancer Center. The phantom contained 6,064,305 voxels, each having dimensions of 1x1x2.5 mm . Each voxel was assigned a material composition and a mass density that corresponded to the Hounsfield unit value in the CT scan for that voxel, following the approach described elsewhere (Newhauser et al 2008). The thoracic region was selected because the thorax is highly inhomogeneous. It is in such in homogeneous areas that pencil-beam algorithms are least reliable. We have simulated a mono-energetic proton field of 121 MeV and a circular cross section of 4 cm radius. The energy of the beam was selected as to traverse a significant fraction of the lung and, therefore, test the algorithm running on GPUs stringently. The energy and field characteristics were not selected to maximize the dose to the tumor or to minimize the dose to healthy tissue, as this was a generic test field, not a clinically realistic field.

3. Results

Figure 2 shows the distribution of deposited energy versus depth (y axis) in the heterogeneous phantom, plotted along the beam central axis, (i.e., x = 0 and z = 0) and 1.5 cm and 3.0 cm lateral to the central beam axis, as predicted by GEANT4, FDC and GFDC. In addition, the percent difference in deposited energy between the track-repeating algorithm (FDC and GFDC) and GEANT4 are plotted along the same axis. Plotting the deposited energy rather than dose was chosen to better show the effects of inhomogeneity of the anatomy. GFDC reproduces the results from the CPU version of the FDC code, the differences can be attributed to statistical fluctuations, since different pre-calculated databases of proton histories were utilized. In addition good agreement was observed between the deposited energy calculated by GEANT4 and GFDC. Figure 3 shows the cross-field profiles in the vertical direction for the isocenter (x = 0 and y = 0) and 7.5 cm posterior and anterior relative to the isocenter (x = 0 and y = +/-7.5 cm). Each profile is calculated along a five voxel thick line. The cross-field profiles are depicted for three penetration depths to illustrate the agreement between the three approaches. greement was excellent for both profiles, with the largest discrepancy less than 3%. The rest of the results comprise comparisons of dose rather than deposited energy. The difference between the GFDC- and GEANT4-calculated doses for each voxel in the anatomic phantom divided by the maximum GEANT4 voxel dose has a RMS value of 0.5%. From that value we conclude that GFDC reproduces the dosimetric accuracy of GEANT within 1%. A more comprehensive comparison was obtained by calculating the G -index of FDC and GFDC relative to GEANT4 for each voxel. The G index results are presented in Figure 4 as the complimentary cumulative distribution function such that the ordinate represents the probability that G will be greater than the value of the abscissa. Both GFDC and FDC have essentially identical distributions, showing that the GPU-based FDC reproduces the results from the CPU-based version. Moreover, less than 0.01% of the voxels have G values greater than unity. Thus, the dose distributions from FDC and GFDC are in good agreement with the reference dose distribution from GEANT4. As explained above, GFDC utilizes a database with 100,000 pre-calculated proton histories and re-traces each history multiple times. In order to verify that such re-cycling of proton histories does not produce undesired statistical correlations, the mean dose uncertainties were calculated with the history-by-history approach and with multiple batches. The batch method utilizes six independent batches or runs. Results were identical if the number of batches was reduced to three. Figure 5 shows the results of the test of re-cycling the proton history database. In the figure, the lines represent a function f  N  =C /  N , where C is adjusted for the function to go through the point with the lowest N for each of the two curves. As can be seen in Figure 5, the measured points for both methods follow the f  N  function. If correlations were present, we would expect a deviation from the /  N . Moreover the fact that the batch uncertainties which should not be affected by inter-batch correlations, are smaller than the history-by-history uncertainties demonstrates the absence of undesired statistical correlations introduced by the use of multiple proton histories. As can be seen, uncertainties calculated with the history-by-history method are around 50% higher than those from the batch approach.The calculation time per proton history was found to depend on the number of blocks (N B ) and the number of threads per block (N T ). The number of blocks and threads per block were varied within the range allowed by the hardware in order to maximize the algorithm performance. The fastest calculation times were obtained for N B =500 and N T =320, for which 184,525 proton histories were processed per second utilizing the two GPU units on the graphics card. The CPU-based FDC on one CPU processed 2445 proton histories per second. Therefore the implementation of the FDC algorithm on a GPU card alone achieved a speedup of a factor of 75.5 with respect to the CPU-based implementation.

Storing the error in the tally arrays was found to increase the calculation time by 18%. In the results reported here errors were not stored. The rationale being that in a clinical environment, error is rarely reported for each voxel. The upper abscissa of Figure 5 shows the calculation time on the GPU card as a function of the statistical uncertainties. With the batch method a mean statistical dose uncertainty of 1% was achieved in less than one minute, while around 2 minutes were required for the history-by-history approach. . Discussion

The good agreement between dose distributions calculated with GFDC versus the GEANT4 and FDC codes suggests the feasibility of GFDC to calculate dose distributions in proton radiotherapy as accurately as general purpose Monte Carlo programs. While preserving accuracy, GFDC reduced computational times by a factor of 75 with respect to CPU-based FDC track-repeating algorithm. Speed gains of 6.6 for a GPU-based Monte Carlo code relative to its CPU-based version were reported in the literature (Jia 2010b). The limited gains of 6.6 for the MC code are thought to be due to the nature of the GPU architecture. On a GPU, the algorithm is executed in multiple threads running in parallel. Threads must run in groups for best performance. Branches in the code do not impact performance provided all threads of a given group follow the same execution path. This may become a significant limitation for any inherently divergent task, like Monte Carlo simulations. It is likely that the speed gains seen for GFDC were large because of the simpler logic of a track-repeating algorithm, as compared to a full Monte Carlo. A simpler logic should generate threads which follow closer execution paths.In our current GPU algorithm all the threads in a given group were fed with the same proton history from the database of pre-calculated histories. Even though each history is re-traced in different areas of the phantom, and thus produces statistically independent results, this seems to minimize the logic path divergence for the various threads in a group. When threads in a given group were fed with different proton histories, the execution times increased by about 50%. The CPU-based version used for the results on pelvis anatomy (Yepes 2009a, 2009b) was limited to dose calculations in a voxelized geometry. Results reported for the thorax anatomy included an aperture and range compensator (Yepes 2010). For those results the code included a package from ROOT (Brun 1997) to describe arbitrary geometries. In the GPU-based FDC (GFDC), reported in this study, this feature to describe arbitrary geometries has not been implemented yet, due to the difficulties to port the corresponding ROOT classes to GPUs. The GFDC version is restricted to dose calculation in voxelized geometries.Currently, the large computational requirement for Monte Carlo simulations makes their use difficult for routine clinical proton radiotherapy treatment planning. At present, it is only practical to calculate such treatment plans with large computer clusters, which are unavailable in most clinics. This obstacle also hinders the opportunities for studies in which whole-body dose reconstructions are needed for large numbers of patients, e.g., in clinical trials or radiation epidemiology studies. The results of the present study suggest that it may be feasible to overcome this obstacle with the GFDC approach, although additional development and testing of the codes will still be needed. The findings of this study indicate that it may be feasible for GFDC to calculate the dose distribution for an entire proton radiotherapy treatment plan for lung cancer with 1% statistical dosimetric uncertainty on a desktop-size system equipped with 2 GPU cards in about one minute. Future studies to test this hypothesis should include multiple clinically realistic fields, a plurality of patients and sites. The code should also be extended to make it capable to handle arbitrary geometries to include the patient dependent beam shaping elements. . Conclusions

In conclusion, GFDC is a promising implementation of a track-repeating code for proton radiotherapy dose calculations using a GPU architecture. The dosimetric accuracy of the GFDC algorithm was validated by comparing the results with those generated with GEANT4 Monte Carlo and CPU-based FDC simulations. GFDC can calculate the dose distribution produced by a 120 MeV proton beam in a thoracic geometry with 1% accuracy in one 1 minute with a graphics card with two GPU units. Only 0.01% of the phantom voxels had a G index larger than one, with the G index calculated with a full Monte Carlo as the reference distribution. The implementation of the track-repeating algorithm on a GPU architecture may allow for real-time dose calculations for proton radiotherapy with a desktop-size computer system equipped with multiple GPU cards. Acknowledgements

This work was supported in part by the John & Ann Doerr Fund for Computational Biomedicine under contract CABC-2006-5-PY; by the Rice Computational Research Cluster funded by the National Science Foundation under Grant CNS-0421109 and a partnership between Rice University, Advanced Micro Devices., and Cray Inc. This work was supported in part by the National Cancer Institute (award 1R01CA131463-01A1) and by the Fogarty International Center (award K01TW008409). We are thankful to Wayne Newhauser for his careful reading of the manuscript. We are indebted to Vivek Sarkar and Yonghong Yan for discussions on the technical aspects of the GPU implementation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors. eferences

Agostinelli S et al

Nucl. Instrum. Meth. A et al

IEEE Trans. Nucl. Sci. IEEE Trans. on Nucl. Sci. Nucl. Instrum. Meth. A

Chauvie S, Guatelli S, Ivanchenko V, Longo F, Mantero A, Mascialino B, Nieminen P, Pandola L, Parlati S, Peralta L, Pia M G, Piergentili M, Rodrigues P, Saliceti S, Tnndade A 2004 Geant4 low energy electromagnetic physics

IEEE Nuclear Science Symp. Conf. Rec. Med Phys , 4818-53. Ciangaru G, Polf J C, Bues M and Smith A R 2005 Benchmarking analytical calculations of proton doses in heterogeneous matter Med. Phys . Med. Phys. Radiother. Oncol. CERN-2005-10, INFN/TC_05/11

Fippel M 1999 Fast Monte Carlo dose calculation for photon beams based on the VMC electron algorithm

Med. Phys. Med. Phys. Phys. Med. Bio . Goitein M and Busse J 1975 Immobilization error: Some theoretical considerations

Radiology

Phys. Rev. Lett. Phys. Med. Biol. Phys. Med. Biol. Nuclear Physics A

Med. Phys. Phys. Med. Biol. Phys. Med. Biol. Report 49 . Jacques R, Taylor R, Wong J and McNutt T 2008 Towards real-time radiation therapy: GPU accelerated superposition/convolution High-Performance Medical Image Computing and Computer Aided Intervention Workshop, HP-MICCAI 2008 Jarlskog CZ and Paganetti H 2008 Physics Settings for Using the Geant4 Toolkit in Proton Therapy

IEEE Trans. Nucl. Sci. Med. Phys.

37 1757–60 Jia X, Gu X , Sempau J , Choi D, Majumdar A and Jiang S B 2010b Development of a GPU-based Monte Carlo dose calculation code for coupled electron–photon transport

Phys. Med. Biol.

55 3077–86 Koch N, Newhauser WD, Titt U, Gombos D, Coombes K, Starkschall G 2008 Monte Carlo calculations and measurements of absorbed dose per monitor unit for the treatment of uveal melanoma with proton therapy

Phys. Med. Biol. Phys. Med. Biol. Phys. Rev. Li J S, Shahine B, Fourkal E and Ma C M 2005 A particle track-repeating algorithm for proton beam dose calculation

Phys. Med. Biol. Med. Phys. Trans. Am. Nucl. Soc. Int. J. Radiat. Oncol. Biol. Phys. Phys. Med. Biol. AAPM Report

No. . Pelowitz DB, ed. 2007 MCNPX TM User’s Manual, Version 2.6.0. Los Alamos, New Mexico: Los Alamos National Laboratory Petti P L 1992 Differential-pencil-beam dose calculations for charged particles

Med. Phys. Phys. Med. Biol. Phys. Med. Biol . Med. Phys.

25 457–63 Soukup M, Fippel M and Alber M 2005 A pencil beam algorithm for intensity modulated proton therapy derived from Monte Carlo simulations

Phys. Med. Biol. Laryngoscope Phys. Med. Biol. Phys. Med. Biol. Phys. Med. Biol. Phys. Med. Biol . Nucl. Technol. in press.Yepes P, Randeniya S, Taddei P and Newhauser W 2009a A track repeating algorithm for fast Monte Carlo dose calculations of proton radiotherapy

Nucl. Technol.

Phys. Med. Biol. N21-N28Yepes P, Brannan T, Huang J, Mirkovic D, Newhauser W D, Taddei P J, Titt U 2010 Application of a fast proton dosecalculation algorithm to a thorax geometry

Rad. Meas.

In press. igure : The flow chart of the GPU-based version of the track-repeating Fast Dose Calculator (GFDC). The kernel, code running on the GPUs, is bounded by the pointed-line rectangle, and is run in parallel by N=N T xN B different CUDA threads, where N B and N T are the number of blocks and threads per block, respectively. igure : Deposited energy profiles in voxels along the beam axis, y, for z = 0 and (a) x = 0, (b) x=1.5 cm and (c) x = 3.0 cm. The y axis runs from posterior to anterior of the patient. Distributions were calculated with GEANT4 (G4: black line), FDC (red circles), and GFDC (blue triangles). The differences in dose between GEANT4 and FDC (red line) and GFDC (blue line) and GEANT4 divided by the maximum GEANT4 dose are shown in panels (d), (e) and (f) for x = 0, x = 1.5 cm, and x = 3.0 cm, respectively. igure : Cross-field profiles of deposited energy along the patient’s vertical axis (z), along the central beam axis (i.e., x = 0) for (a) y = 7.5 cm, (b) y = 0 cm and (c) y= -7.5 cm, where the y-axis runs from anterior to posterior of the patent. Distributions were calculated with GEANT4 (G4: black line), FDC (red circles) and GFDC (blue triangles). The GEANT4-FDC (red line) and GEANT4-GFDC (blue line) deposited energy differences divided by the maximum GEANT4 deposited energy is shown in panels (d), (e) and (f) for y= 7.5 cm, y=0 cm, and y = -7.5 cm, respectively. igure : The complementary cumulative distribution function (CDF) of the G index for FDC and GFDC using GEANT4 as the best estimate of the true dose distribution in the heterogeneous phantom representing a thoracic cancer patient. The gamma function was calculated for all non-air voxels in the geometric model. igure : The mean statistical uncertainties of the GFDC dose distributions calculated with the history-by-history and the batch approaches as a function of the number of proton histories (N) and calculation times. The lines are functions proportional to N − /2