Compact Native Code Generation for Dynamic Languages on Micro-core Architectures
Maurice Jamieson [email protected]
EPCC at the University of Edinburgh, Edinburgh, United Kingdom
Nick Brown
EPCC at the University of Edinburgh, Edinburgh, United Kingdom
Abstract
Micro-core architectures combine many simple, low memory, low power-consuming CPU cores onto a single chip. Potentially providing significant performance and low power consumption, this technology is not only of great interest in embedded, edge, and IoT uses, but also potentially as accelerators for data-center workloads. Due to the restricted nature of such CPUs, these architectures have traditionally been challenging to program, not least due to the very constrained amounts of memory (often around 32KB) and idiosyncrasies of the technology. However, more recently, dynamic languages such as Python have been ported to a number of micro-cores, but these are often delivered as interpreters, which have an associated performance limitation.

Targeting the four objectives of performance, unlimited code-size, portability between architectures, and maintaining the programmer productivity benefits of dynamic languages, the limited memory available means that classic techniques employed by dynamic language compilers, such as just-in-time (JIT), are simply not feasible. In this paper we describe the construction of a compilation approach for dynamic languages on micro-core architectures which aims to meet these four objectives, and use Python as a vehicle for exploring the application of this in replacing the existing micro-core interpreter. Our experiments focus on the metrics of performance, architecture portability, minimum memory size, and programmer productivity, comparing our approach against that of writing native C code. The outcome of this work is the identification of a series of techniques that are not only suitable for compiling Python code, but also applicable to a wide variety of dynamic languages on micro-cores.
Keywords: native code generation, Python, micro-core ar-chitectures, Epiphany, RISC-V, MicroBlaze, ARM, soft-cores
1 Introduction

Micro-core architectures combine many simple, low-power CPU cores on a single processor package. Providing significant parallelism and performance at low power consumption, these architectures are not only of great interest in embedded, edge, and IoT applications, but also demonstrate potential in high performance workloads as many-core accelerators. From the Epiphany [7], to the PicoRV-32 [36], to the Pezy-SC2, this class of architecture represents a diverse set of technologies. However, the trait that all these share is that they typically provide a very constrained programming environment and complex, manually programmed, memory hierarchies. In short, writing code for these architectures, typically in C with some bespoke support libraries, is time consuming and requires expertise. There are numerous reasons for this; for instance, not only must the programmer handle idiosyncrasies of the architectures themselves but also, for good performance, contend with fitting their data and code into the tiny amount (typically 32 - 64 KB) of on-core fast scratchpad memory.

Little wonder then that there have been a number of attempts to provide programming technologies that increase the abstraction level for micro-cores. Some early successes involved a simple port of OpenCL [33] and OpenMP [15]; however, whilst this helped with the marshalling of overall control, the programmer still had to write low-level C code and handle many of the architectural complexities. More recently, a number of dynamic languages have been ported to these very constrained architectures. From an implementation of a Python interpreter [17], to [11] and [30], these are typically provided as interpreters, but the severe memory limits of these cores are a major limiting factor to these approaches.

It was our belief that a compiled, rather than interpreted, approach to these dynamic languages would offer a much better solution on the micro-cores.
However, the very nature of the technology requires considerable thought about how best to construct an approach which can deliver the following objectives:

• Performance at or close to directly written C code
• The ability to handle code sizes of any arbitrary length
• Portability between different micro-core architectures
• Maintaining the programmer productivity benefits of dynamic languages

In this paper we describe the construction of a compilation approach for dynamic languages on micro-core architectures. The paper is organised as follows: in Section 2 we explore background and related work from a compilation perspective and why existing approaches did not match the objectives described above. Section 3 then describes our approach in detail, before applying this to an existing Python interpreter, called ePython, for micro-cores in Section 4. In Section 4.2 we not only explore the application of our approach to compiling Python for micro-cores, but also the performance and memory characteristics of our approach compared to writing directly in C. Section 5 then draws a number of conclusions and discusses further work.

2 Background and Related Work

Whilst implementations of dynamic languages, such as Python, for micro-core architectures greatly reduce the time and effort to develop applications in comparison to writing them using the provided C software development kits (SDKs), there remains the performance overhead of interpreting dynamic languages over compiled C binaries. There are two major approaches to accelerating Python codes: just-in-time (JIT) compilation of the bytecodes at runtime and ahead-of-time (AOT) compilation before the code executes.
In desktop and server environments, the Numba JIT compiler [27] accelerates Python applications by specifying a subset of the language that is compiled to native code for central processing units (CPUs) and graphics processing units (GPUs). Numba uses a Python function decorator, or directive, approach to annotating the code to compile to native code, defining the @jit decorator to target CPUs, and the @cuda.jit decorator to target GPUs and perform the necessary data transfer. Numba's JIT compiler can speed up codes by a factor of around 20 times over the standard Python interpreter [31].

The Nuitka [22] and Cython [14] AOT Python compilers generate C source code that is compiled and linked against the CPython runtime libraries. Whilst, like Numba, they are highly compliant with, and produce binaries that are significantly faster than, the standard CPython interpreter, up to 30 times faster in the case of Cython [32], the overall binary size is large. For instance, if we take the seven line code example by [18] and compile it on x86 Linux, Nuitka produces a dynamically-linked binary that is 154KB, and Cython produces a much smaller binary at 43KB. However, when the required dynamic libraries are included, the overall size is significantly larger, at 29MB for the archive file produced by Nuitka using the --standalone compiler option. A smaller hand-crafted static binary was produced by [18] using Cython, but it is still 3MB in size. The binaries generated by Numba, Nuitka and Cython are much larger than the memory available on micro-core architectures, thereby requiring a different approach.

In embedded environments, MicroPython [9] uses AOT compilation to accelerate codes for the target microcontrollers. MicroPython implements two native code emitters, native and viper [20], defining the @micropython.native and @micropython.viper decorators to select the emitter for native code generation.
The native emitter replaces the MicroPython bytecode and virtual machine (VM) with machine code representations of the bytecode and calls to VM functions for operations such as arithmetic calculations, binary operations and comparisons. Effectively, this method removes the overhead of the VM's dispatch loop, whilst leveraging the existing VM functions and capabilities. The viper emitter takes this approach further by also generating machine code instructions for operations, rather than calling the VM functions. As we would expect, this increases performance yet further, as arithmetic operations are performed inline rather than via a procedure call. This results in performance approximately 10 times that of the native emitter and around 24 times faster than the VM [21]. Whilst the MicroPython emitters are not JIT compilers, as the native code is generated before execution begins and the bytecode is not profiled to select compilation candidates, the code generation is performed on the microcontrollers themselves.

Our native code generation approach is similar to that of the MicroPython viper emitter in that native code is generated for all Python code, including operations such as arithmetic and comparisons. However, as will be discussed in Section 3, we generate C source code that is compiled to a native binary for download and execution on the micro-core device. As well as the Nuitka and Cython AOT Python compilers, a number of existing programming languages have used this approach, including Eiffel [12], Haskell [34] and LOLCODE [29]. Whilst Haskell has deprecated the C backend in preference to one based on LLVM [2], the former is still beneficial for porting to a new platform as it produces vanilla code, requiring only the gcc, as and ld tools [1].
Likewise, C was chosen as the backend for ePython native code generation to enhance portability, particularly as a number of the target micro-core architectures, including the Adapteva Epiphany and Xilinx MicroBlaze, are not supported by LLVM. Furthermore, code generators, such as MicroPython's emitters, need to be specifically written to support new processor instruction set architectures (ISAs), and the C backend ensures that native code generation is immediately available on all micro-core platforms that ePython supports. Crucially, our approach generates high-level C source code that retains the overall structure of the Python source program, rather than emitting machine code representations of the bytecode (MicroPython) or disassembling / translating the bytecode to native code (Numba). This is in order to leverage the extensive optimisation capabilities within modern C compilers, including register allocation, data flow analysis, instruction selection and scheduling, data dependency management and scalar optimisations [35].

3 Compact Native Code Generation

Due to the severely limited memory available on micro-core architectures, an AOT compilation approach is favoured over JIT. This also enables us to leverage the C compiler's extensive code optimisation routines, at a higher level and over a greater amount of source code, resulting in significantly faster code.

Our approach takes the programming language's generated Abstract Syntax Tree (AST) and then traverses this to generate high-level C source code. Whilst this is a common approach to compiling codes, we are not proposing a simple transliteration from the source language to C here, but instead the generation of optimal source code that supports the dynamic features of the source language, whilst optimising memory access and arithmetic operations.
The target C code is designed around a set of application programming interfaces (APIs) that implement a form of abstract machine for a generic dynamic object-oriented (OO) programming language. This approach of generating C code, with associated macros, is similar to the cross-platform, macro-based, code generation approach used by [26]. Such an abstract machine can contain powerful programming abstractions, such as first-class and anonymous functions (lambdas), with their respective closure support.
The main departure of the abstract machine from a simple transliteration is that the target code is not managed using C variables but by the abstract machine through the introduction of environments containing frames, as shown in Figure 1. Therefore, all memory management, including function stacks and argument passing, is managed by the abstract machine, not the C compiler and runtime. Furthermore, as libc, the C runtime library, is often too large to be used successfully on many micro-core architectures, the abstract machine also provides a simple, small heap manager, which can be tailored to the source language in question.
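The paper does not give the heap manager's interface, so the following is a minimal sketch of the kind of small, tailorable heap manager meant here: a first-fit free-list allocator over a static arena, standing in for libc's malloc and free. The names heap_init, heap_alloc and heap_free are our own illustrative assumptions, not ePython's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: a tiny first-fit allocator over a static
   arena, standing in for the small heap manager the abstract machine
   provides instead of relying on libc. Names are assumptions. */

#define ARENA_BYTES 4096

typedef struct Block {
    size_t size;            /* payload size in bytes */
    int    free;            /* non-zero if available */
    struct Block *next;     /* next block in address order */
} Block;

static uintptr_t arena[ARENA_BYTES / sizeof(uintptr_t)];
static Block *head;

static void heap_init(void) {
    head = (Block *)arena;
    head->size = ARENA_BYTES - sizeof(Block);
    head->free = 1;
    head->next = NULL;
}

static void *heap_alloc(size_t n) {
    n = (n + 7u) & ~(size_t)7u;                   /* keep blocks aligned */
    for (Block *b = head; b != NULL; b = b->next) {
        if (!b->free || b->size < n) continue;
        if (b->size >= n + sizeof(Block) + 8) {   /* split off the tail */
            Block *rest = (Block *)((uint8_t *)(b + 1) + n);
            rest->size = b->size - n - sizeof(Block);
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return (void *)(b + 1);
    }
    return NULL;                                  /* arena exhausted */
}

static void heap_free(void *p) {
    if (p != NULL)
        ((Block *)p - 1)->free = 1;               /* no coalescing, for brevity */
}
```

A production version would coalesce adjacent free blocks; the point of the sketch is that such a manager fits in a few hundred bytes of code, which matters when only around 32KB of on-core RAM is available.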
Figure 1. Display, environment and frame structure

For the sake of simplicity, we will refer to all source programming language elements, for example, numbers, strings, lists, arrays and functions, as variables. The generated code declares all the variables within frames as shown in Figure 1, that grow downwards from the top of memory. Following [23], we create a display that holds references to the frames, allowing variable indexing via scope level and offset. The compiler calculates and maintains the levels and offsets of the variables within the environment. A new frame is created for each function call and the display is updated to manage the access to variables in outer frames (enclosing scope). For most function calls, a new entry is created in the display, but recursive calls reuse the same display entry, updated to point to the new frame for each invocation. This allows us to continue to reference all variables by scope level and offset for recursive functions, thereby maintaining the ability to access all variables by indexing from a base pointer, negating the requirement to chain up the environment list to find variables in outer scope levels.

Figure 1 also shows that all fixed-size objects can be allocated in a frame or in the heap. Here, a complex number is declared in frame n. Allocating space for variables in the frame is much quicker than allocating them in the heap as the process of memory allocation is much simpler; allocating the space within a frame is just a matter of decrementing the frame pointer by the required amount. This model allows decisions to be made within the compiler regarding the best placement for composite data types.

The display and environment of frames shown in Figure 1 perform similar capabilities to the C stack. However, for brevity the diagram is simplified, and the environment includes dynamic and static links to support nested functions / closures.
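The frame-allocation claim above can be made concrete. In this sketch (names and layout are our own illustration of Figure 1's downward-growing frames, not the runtime's actual code), entering a function and allocating variable slots are each a single pointer adjustment, in contrast to the search a heap allocator performs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch with hypothetical names: frames grow downwards from the top
   of a fixed memory region, as in Figure 1. Allocating a variable's
   slot in the current frame is just a decrement of the frame pointer. */

#define MEM_WORDS 1024

static uintptr_t memory[MEM_WORDS];
static uintptr_t *frame_ptr = memory + MEM_WORDS;   /* top of memory */

/* entering a function: remember where this frame starts */
static uintptr_t *frame_enter(void) { return frame_ptr; }

/* allocate n words in the current frame -- a single decrement */
static void *frame_alloc(size_t n) {
    frame_ptr -= n;
    return (void *)frame_ptr;
}

/* returning from the function frees its whole frame at once */
static void frame_leave(uintptr_t *base) { frame_ptr = base; }
```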
A more detailed explanation of these mechanisms can be found in [13], and [16] discusses the issues with simple displays for environment and closure support, outlining the environment link mechanism solution, a version of which is used in our approach. Therefore, we will not discuss the detailed implementation of these underlying mechanisms in this paper beyond highlighting that there is a static limit on the maximum number of scope (lexical) levels for kernels running on the device. For a lot of codes, this could be a predefined value, but this can also be calculated by the compiler, even for kernels with dynamically loaded functions. For ePython kernels, the compiler visits all functions, including those which are dynamically loaded, and is able to calculate the required maximum value.

Listing 1 is an example of the typed variable access in the generated C code, where the variables are addressed by their scope level and offset, with the memory layout visualised in Figure 1. In this example, the real (float) part of the complex variable at the enclosing (non-local) scope level 1 and offset 2 is being set to 4.3. The second line updates the element, indexed by the variable at offset 0, of a vector stored in the heap and declared in the local scope (level=0) at offset 1.

update_complex_real(lookup_complex(env,1,2), 4.3);
vector_update_int(lookup_vector(env,0,1), lookup_int(env,0,0), 42);
update_real(env,4,1,10.0);

Listing 1. Variable access example

As shown in Figure 1, the scope level increases outwards from the local block (or function) scope, with local variables having a scope level of zero. This enables the C compiler to use indexed addressing from the frame pointer (env[0]) to access local variables, thereby increasing performance for local variables and loop block indices. Furthermore, the indexing of outer scope levels directly removes the need to chain up environment frames to locate non-local variables, with the corresponding performance overhead. This model also allows support of Python 3 nonlocal variables [10] that are declared in the nearest enclosing scope level, as shown in line 1 of Listing 1. In the target C code, the variable access APIs are implemented as macros that directly update the frame elements within the environment and include the required casting to and from the current variable type. This is illustrated in line 3 of Listing 1, where the update_real macro expands to (((Real*)(env[(lex_level)]))[(offset)]=(Real)((value))). This not only accesses the frame element directly but also ensures that the value is stored correctly in memory.

Int oly_e1(Env env, Object self) {
  return (lookup_int(env,0,0) + lookup_int(env,0,1));
}

Listing 2. Generated function example

All target C generated functions are passed two arguments: the environment of frames env and self, as shown in Listing 2. The self argument is a reference to an Object type which enables the abstract machine to support object-orientation, where a function is actually a method of an object. The base, or native, functions which are pre-provided in the abstract machine, such as heap management, have the same argument model to allow them to be called and dynamically loaded using the same mechanism as compiler generated functions. The added benefit is that this model also encourages the C compiler to place the function arguments (environment and object) and variable offsets in registers, as much as is possible, on all target platforms. The arguments to a function are declared within the same frame as the function's local variables and form part of the overall frame size. The native code function names are machine generated and prepended with oly_ to minimise any name clashes with existing code libraries. As you might expect, and as shown in Listing 3, functions are also declared within the environment frames. As with other variables, function names (e.g. add) are passed to the abstract machine APIs to enable support for debugging and reflection.

declare_proc(env,1,"add", mk_proc(oly_e1,env,2));

Listing 3. Function declaration example

Crucially we have found that, by leveraging frames within an environment, the decoupling of variables from the underlying C storage mechanisms not only provides support for the dynamic and object-oriented features of many source languages but also for the implementation of dynamic loading of functions, which we describe in Section 4.1.
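To make the addressing scheme concrete, the following self-contained sketch reproduces the macro style described above. The update_real expansion is the one quoted in the text; the display is modelled as a plain array of frame pointers, and the Int variants are our own analogous assumptions rather than ePython's exact definitions.

```c
#include <assert.h>

/* The display (env) is an array of frame pointers indexed by scope
   level, with env[0] the local frame; variables are addressed by
   (level, offset). update_real's expansion matches the one quoted in
   the text; lookup_int/update_int are assumed analogues. */

typedef float Real;
typedef int   Int;
typedef void **Env;

#define lookup_int(env, lex_level, offset) \
    (((Int *)((env)[(lex_level)]))[(offset)])
#define update_int(env, lex_level, offset, value) \
    (((Int *)((env)[(lex_level)]))[(offset)] = (Int)(value))
#define lookup_real(env, lex_level, offset) \
    (((Real *)((env)[(lex_level)]))[(offset)])
#define update_real(env, lex_level, offset, value) \
    (((Real*)(env[(lex_level)]))[(offset)]=(Real)((value)))
```

Because env[0] is always the local frame, the C compiler can turn local accesses into indexed addressing off a register, which is where the performance benefit for local variables and loop indices noted above comes from.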
Within the context of micro-cores and their very limited memory size, a major benefit of our abstract machine approach is that it allows source code functions to be dynamically loaded at declaration or at any later point in the execution of a kernel. Listing 4 demonstrates the simple change to the generated code required to allow the example add function in Listing 3 to be dynamically loaded at declaration. As expected, the code definition in Listing 2 isn't required. The load_proc API call initiates the download of the add function from the host, allocates the space in the device's heap to hold the function code and declares it within the frame.

declare_proc(env,1,"add", load_proc("add",env,2));

Listing 4. Dynamically loaded function declaration example

The final argument of the load_proc API call is the number of arguments in the dynamic function and allows the runtime to create a frame of the correct size. Listing 5 shows how dynamic function loading can be deferred to a later point of execution after declaration. The ability to separate the loading of a dynamic function from the declaration allows the compiler to implement a dynamic code loading strategy tuned to a kernel's particular execution profile. Furthermore, as the code for dynamic functions is stored within the abstract machine heap, it can be discarded (freed) as required, thereby allowing the execution of much larger kernels than is possible with the previous static code loading model. Crucially, our environment model automatically enables runtime symbol resolution within the compiled C code, enabling dynamic function loading.

declare_proc(env,1,"add", NULL);
...
update_proc(env,1, load_proc("add",env,2));

Listing 5. Deferred dynamically loaded function example

4 ePython

ePython [17] is an interpreter which implements a subset of Python and is designed to target micro-core architectures. Designed with portability across these architectures in mind, it has evolved from its initial purpose as an educational language for parallel programming, through its use as a research vehicle for understanding how to program micro-core architectures, to supporting real-world applications on the micro-cores.

As described previously, on-core memory is everything with these micro-cores, and whilst previous work around memory hierarchies and remote data [24] allows an unlimited amount of data to be streamed through the micro-core memory, there were still fundamental limits to the code size. This resulted in two major impacts: firstly, the size of the Python codes that could be executed on the micro-cores and, secondly, the number of language features that the ePython interpreter could fully support.

However, it was our hypothesis that by applying the concepts described in Section 3, not only would the performance of ePython be significantly improved (a compiled vs interpreted language) but the ability to dynamically load different parts could significantly reduce the memory requirement and enable codes of unlimited size to be executed. Put simply, using the approach described in Section 4.1, one needs only load at a minimum a resident bootstrapper which contains the core support for marshalling and control of dynamically loaded functions. These can then be retrieved on-demand and garbage collected as memory fills up.
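Before looking at the loader itself, it is worth sketching the mechanism that makes this possible: because functions live in frames like any other variable, a call site looks a procedure up in the environment and calls through it, so it cannot tell whether the code was statically linked or copied into the heap a moment earlier. The sketch below is a simplified stand-in in the spirit of Listings 2 and 3; mk_proc, declare_proc and call_proc do not match the real runtime's signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the runtime's procedure handling: functions
   are stored in frames like any other variable and dispatched by
   environment lookup, so statically bound and heap-loaded code are
   called identically. Signatures here are illustrative only. */

typedef int   Int;
typedef void **Env;
typedef void  *Object;
typedef Int (*ProcFn)(Env env, Object self);

typedef struct {
    ProcFn fn;      /* entry point: static code, or code copied into the heap */
    int    nargs;   /* lets the runtime size the callee's frame */
} Proc;

#define lookup_int(env, lvl, off) (((Int *)((env)[(lvl)]))[(off)])

/* generated code for: def add(x, y): return x + y   (cf. Listing 2) */
static Int oly_e1(Env env, Object self) {
    (void)self;
    return lookup_int(env, 0, 0) + lookup_int(env, 0, 1);
}

static Proc mk_proc(ProcFn fn, int nargs) {
    Proc p = { fn, nargs };
    return p;
}

/* declare the procedure at an offset in the level-1 (enclosing) frame */
static void declare_proc(Env env, int off, Proc p) {
    ((Proc *)env[1])[off] = p;
}

/* dynamic dispatch: look the procedure up, then call through it */
static Int call_proc(Env env, int off, Object self) {
    Proc p = ((Proc *)env[1])[off];
    return p.fn(env, self);
}
```

Replacing the fn pointer at runtime is all a loader has to do; the call sites are unchanged, which is why only a resident bootstrapper needs to be statically present.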
Figure 2 outlines the key components of our dynamic loader when applied to ePython. The updates to support dynamic code loading in the abstract machine were relatively minor; as the memory model already provided the abstraction of the variables and functions from the underlying C runtime, the modifications were mainly concerned with the request / transfer of the dynamic functions from the host and their loading into the heap. However, the changes to the host-based compiler and device support functions were more significant, requiring changes to the compiler, the Python integration module and monitor, and a new object file parser.

There are effectively two options available for dynamic loading: make all user functions dynamic, or allow the programmer to select which functions they would like to be dynamically loaded. Initially, we chose the former for ease of implementation but, for increased flexibility, ePython now implements the latter by allowing the programmer to annotate dynamic functions, with the key limitation that only top-level functions (kernel entry points) can be marked @dynamic. This does not prevent dynamic functions containing nested functions, but these cannot be marked @dynamic individually, in order to simplify the management of non-local references.

Figure 2. ePython dynamic loader architecture

During the traversal of the Python AST during code generation, functions that are annotated with the @dynamic decorator, as shown in Listing 6, are placed in a separate C source file, with another file containing the bootstrap loader and dynamic loading calls for the relevant functions. A dynamic function symbol table is generated by the compiler and loaded by ePython on the host, which is then used to map the kernel dynamic function requests to the correct object file, highlighted in green in Figure 2.

from epython import dynamic

@dynamic(defer=True)
def add(x, y):
    return x + y

@dynamic
def add_nums():
    global add
    add = load_function("add")
    print(add(3, 4))
    del(add)

add_nums()

Listing 6. Function declaration example

The GCC C compiler is used to generate code for the target platforms (Epiphany-III, MicroBlaze, RISC-V, SPARC, MIPS32 and AMD64), resulting in Executable and Linkable Format (ELF) [28] object files. Figure 3 illustrates the ELF file structure, highlighting the linkage between the different sections that need to be traversed to access the required function binary code. The dynamically loaded functions are keyed by their C source name and the host-side symbol table provides the mapping between the ePython source view and generated C function names.

The dynamic object file created by our approach is parsed when the kernel is downloaded to the devices. The required functions are loaded into memory, based on the entries in the symbol table, and then the ELF parser checks that the file is of the correct binary format (for the micro-architecture in question), raising an error if the file is incorrect for the target device.

Figure 3. ELF file structure [28]

The ELF object file contains function sizes, which we store in memory along with the functions themselves to allow the device dynamic loader to allocate the correct amount of memory in the heap. As the compiler has previously inserted the number of local variables and arguments into the function declaration code, the abstract machine is able to allocate the correct frame space for future function calls.

Listing 6 shows how ePython leverages Python's first-class function support to enable the deferment of dynamic function loading, using the @dynamic(defer=True) decorator, and mark the function add for deletion from the heap after it has been executed. This allows the programmer to control the exact time during a kernel's execution that a function is loaded and marked for deletion. Our model allows the programmer a large amount of flexibility, with the ability to choose whether functions are statically bound to the binary that is downloaded to the device or downloaded at runtime, either when declared or just before execution, and retained or deleted from the heap as the kernel execution profile demands.
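As an illustration of the format check described above, the sketch below validates the fixed part of a 32-bit little-endian ELF header against an expected e_machine value (EM_RISCV = 243 and EM_ARM = 40 are from the ELF e_machine registry; the check itself is our simplified reconstruction, not ePython's actual parser):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified reconstruction of the loader's sanity check: verify the
   ELF magic, a 32-bit little-endian class, and that e_machine (bytes
   18-19 of the header) matches the target micro-core's architecture. */

enum { EM_ARM = 40, EM_RISCV = 243 };   /* e_machine registry values */

static int elf_check(const uint8_t *buf, size_t len, uint16_t machine) {
    if (len < 20) return 0;                           /* too short */
    if (buf[0] != 0x7f || buf[1] != 'E' ||
        buf[2] != 'L'  || buf[3] != 'F') return 0;    /* ELF magic */
    if (buf[4] != 1) return 0;                        /* EI_CLASS: ELFCLASS32 */
    if (buf[5] != 1) return 0;                        /* EI_DATA: little-endian */
    uint16_t m = (uint16_t)(buf[18] | (buf[19] << 8)); /* e_machine field */
    return m == machine;               /* wrong target => caller raises error */
}
```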
In order to evaluate the level of performance that our code generation model can attain, we used a standard Jacobi code from [8] that is bundled with the Eithne framework [25]. The performance of the native C and ePython codegen versions was compared across a number of CPUs (Epiphany-III, MicroBlaze, PicoRV32, MIPS32, AMD64 and SPARC v9), as shown in Figure 4. The benchmark version for the comparisons was sequential and ran on a single core, with a problem size (NX) of 100 and 10000 maximum iterations on all platforms. Previous experience of generating native code via C for high-level languages suggested that the processor ISA can have a significant impact on both the resulting performance and binary size. Therefore, we included the Intel x86 (AMD64) and SPARC v9 processors as targets. The latter was selected due to its support for register windows which, for large C programs, can show a 33% to 50% reduction in the number of load and store instructions generated over a non-register window RISC (reduced instruction set computer) processor [5].

Table 1. Relative Jacobi kernel runtime (seconds)

CPU            Codegen    native C
Epiphany-III
MicroBlaze
PicoRV32
MIPS32
AMD64
SPARCv9

Generally, the native C code was faster, as one would expect, but the difference on the soft-cores was surprisingly small. This is likely to be due to the small problem size, where the kernel runtime is impacted by the invocation messaging latency (bandwidth and device listener response). One interesting result is the performance of the Epiphany-III relative to the MIPS32, AMD64 and SPARCv9 processors. This is, in part, explained by the fact that the kernels are executing bare metal on the Epiphany-III, MicroBlaze, and PicoRV32, whereas they are running as POSIX threads on the MIPS32 (Linux), AMD64 (Windows Subsystem for Linux) and SPARCv9 (Solaris).

Figure 4 and Table 1 show that the ePython codegen kernel was slightly faster (0.996 times) than the native C kernel on the SPARC, but with the small problem size the runtimes are almost the same for both kernels, suggesting the timings were I/O bound. When the SPARC benchmarks were run again with a much larger problem size (NX=500), the ePython codegen runtimes are 1.140 times slower than the native C version. Comparing this result to the average 33% performance overhead of the ePython codegen over native C on the other processors suggests our environment model may be leveraging the advantage of the SPARC's register windows. Overall, these are interesting results bearing in mind the runtime's support for the dynamic features of Python versus the static nature of C, but it should be noted that these results are best case, as the compiler is able to generate faster code for this benchmark by removing the need for dynamic function dispatch via environments and making the underlying C function calls directly.
We will cover the overhead of the standard dynamic dispatch in Section 4.2.4.

As on-chip memory is extremely limited on micro-core devices, we compared the relative kernel binary (ELF) sizes for the Jacobi benchmark, as shown in Figure 5. However, all the kernels were compiled with the GCC -O3 compiler option for maximum performance, rather than -Os for minimum size, as the performance of the offloaded kernels is critical to our target applications and it is crucial to determine if the size of a speed optimised binary would preclude its deployment to the target micro-core architectures.

Figure 4. Comparison of native C and ePython codegen performance for the Jacobi benchmark
Figure 5. Comparison of native C and ePython codegen kernel size for the Jacobi benchmark
As shown in Figure 5, the speed optimised codegen binaries are similar in size to native C on the Epiphany-III, MicroBlaze, MIPS32 and SPARC. However, we see a more marked difference on the RISC-V PicoRV32 and the AMD64, suggesting that the C compilers for these processors are more capable at optimising the native C binary for size as well as performance, but further investigation of the generated code is required to ascertain if this is the case or if it is a consequence of our environments model on these ISAs. Interestingly, the codegen binary size is within 3% on the MIPS32, AMD64 and SPARC. From Figure 5 and Table 2 we see that the Jacobi codegen binaries on the Epiphany-III, MicroBlaze, PicoRV32 and SPARC are within 46% of that for native C, with the MIPS32 within 7% and the AMD64 significantly the worst-case at almost 4 times the size of the native C binary. Bearing in mind that these binaries were compiled for performance (-O3) rather than size (-Os), the overall codegen binary size is viable on the micro-cores, as they are all less than 50% of the available on-chip RAM for the micro-core designs used in the tests. In fact, the Epiphany-III binary only requires approximately a third of the available on-chip 32KB RAM. This significantly enhances the usability of our codegen approach, as the resulting binaries achieve 67% of the performance of native C but still require less than 50% of the extremely limited memory of the target micro-core devices. These are compelling results bearing in mind the benefits to the programmer: significantly increased productivity and portability across architectures.

Table 2. Relative Jacobi kernel code size (bytes)

CPU            Codegen    native C
Epiphany-III
MicroBlaze
PicoRV32
MIPS32
AMD64
SPARCv9

The challenge with evaluation against other prior work beyond C is that such technologies do not support the tiny memory spaces of our target micro-core devices. Comparable technologies discussed in Section 2, such as Numba and MicroPython, require far more memory. For example, MicroPython requires at least 256KB, eight times more memory than our target devices. However, we can draw some general comparisons based on previously published work; for instance, the MicroPython versus C performance comparison [3] reveals that MicroPython is approximately 87 times slower than C. If we assume that these values are for the interpreter, and that the 'Viper' code generator is 7 times faster [6], MicroPython is still 12 times slower than hand-crafted C. We will provide a comparison of the code generated binary versus the ePython VM, in terms of performance and memory requirements, in Section 4.2.3.
Although, as demonstrated, our codegen approach can generate binaries that are compact enough to run on micro-core architectures, more complex codes still require more memory than is available on the target micro-cores, and this issue is addressed by our dynamic function loading support. As discussed in Section 3.2, the dynamic loader downloads the required functions from the host and loads them into the abstract machine heap on the device. Since these functions are executed from the heap and not the code segment of the static binary, our method only supports von Neumann and modified Harvard CPU architectures which allow self-modifying code. A pure Harvard architecture CPU, with separate code and data buses and the restriction that code is only executed from the read-only code segment, is unable to support our dynamic loading model. Therefore, whilst our code generation approach can support generating static binaries on any platform with C99 compiler support, dynamic code loading can only be supported on von Neumann and modified Harvard architectures. With this in mind, we will focus our discussion of dynamic
loading on the Adapteva Epiphany-III micro-core, which has 16 von Neumann cores with only 32KB of RAM per core, a restriction that would significantly benefit from being able to dynamically load (and unload) executable code.

In order to understand how different code dispatch and loading models impact performance and code size, the following options were used:

• Static dispatch: functions are statically bound to the executable and optimised for direct execution
• Dynamic dispatch: functions are statically bound to the executable but dynamically dispatched via lookup in the environment
• Dynamic loading: functions are dynamically loaded by the kernel and dispatched via lookup in the environment

For the comparisons, we wrote a modified version of the previous Jacobi benchmark to support these options, and all ePython codegen and C kernels were compiled with the -O3 option, with all dynamic functions compiled with -Os. The latter option ensures the dynamic functions are optimised as far as possible whilst keeping the binary as small as possible; the GCC -Os optimisation option is the same as -O2 less any optimisations that increase the size of the code [4]. We also compared the same Python version of the Jacobi benchmark running under the ePython VM.

Table 3 details the overall code size for the different Jacobi benchmark variants, with the VM size including the interpreter, runtime support and bytecode, and the codegen dynamic loading including the static binary (kernel and runtime support) plus the dynamically loaded functions. The non-applicable entries in the table are marked with a dash; for example, bytecode is not applicable for the codegen and native C variants. As expected, the native C variant of the code is the smallest, with the dynamically loaded codegen variant trailing by only 464 bytes. This is followed at 9810 bytes by the codegen variant with all functions statically bound and dispatched, where the compiler is free to optimise function calls.

Table 3. Epiphany-III Jacobi kernel code size (bytes)

Variant             Runtime    Bytecode    Functions    Total
ePython VM
Static dispatch
Dynamic dispatch
Dynamic loading
C
The dynamic dispatch codegen variant is 24% bigger, as the dynamic dispatch method requires more code (lookups in the environment) and prevents the compiler from optimising the function calls. Finally, the ePython VM variant of the benchmark has the smallest compiled code (bytecode) size but requires the VM of around 22KB to execute. It should be noted that the static dispatch option can only be used for very simple codes, where the compiler can detect simple function calls that do not need new frames, for example, those without local variables or that are non-recursive. Therefore, for most functions the compiler will generate code that uses the dynamic dispatch model. Whilst this option produces code that is approximately 64% larger than native C, it uses only 51% of the overall memory required by the ePython VM variant. Therefore, for the same Python source code, significantly more of the Epiphany's limited 32KB on-core memory is available for data.
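The three models above can be made concrete with a minimal Python sketch. This mirrors the structure of the generated code only; the names `env`, `call_dynamic` and `load_function` are illustrative, not ePython's actual implementation (which emits C):

```python
# Minimal sketch of the three dispatch/loading models.

# Static dispatch: the call target is fixed when the binary is built, so
# the compiler is free to inline or otherwise optimise the call site.
def jacobi_step(u, w):
    return 0.25 * (u + w)          # direct, statically bound call

# Dynamic dispatch: functions are bound in an environment and every call
# goes through a lookup, which costs time and blocks call optimisation.
env = {"jacobi_step": jacobi_step}

def call_dynamic(name, *args):
    return env[name](*args)        # lookup in the environment, then call

# Dynamic loading: the function body arrives from the host at runtime (in
# the real system, as compiled object code placed on the abstract machine
# heap) and is installed into the same environment before use.
def load_function(name, source):
    scope = {}
    exec(source, scope)            # materialise the function at runtime
    env[name] = scope[name]

load_function("scale", "def scale(x):\n    return 2 * x\n")
print(call_dynamic("jacobi_step", 1.0, 3.0))  # 1.0
print(call_dynamic("scale", 21))              # 42
```

The key point the sketch captures is that dynamically loaded functions are indistinguishable from statically bound ones once installed in the environment, which is why the dynamic loading variant pays almost no extra dispatch cost over dynamic dispatch.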
Figure 6. Relative performance and code size for the modified Jacobi benchmark on the Adapteva Epiphany
Figure 6 shows the relative code sizes and runtime performance for all the benchmark variants. Unsurprisingly, the native C benchmark kernel has the fastest execution time at 0.053 seconds, and the static dispatch codegen variant is around 2 times slower at 0.115 seconds. The dynamic dispatch model decreases performance relative to the static dispatch model by approximately 5 times, at 0.539 seconds, and the dynamic loading version is only marginally slower at 0.566 seconds. As we can see from Figure 6 and Table 4, the default dynamic function dispatch codegen version is significantly faster than the VM version, which requires around 201 seconds to execute the same Python kernel. However, it should be noted that the ePython VM is also executing the bytecode from significantly slower off-chip memory (150 MB/s maximum bandwidth obtainable in practice [19]), as the kernel bytecode and heap requirements are too large to allow the bytecode to execute from on-chip RAM. When we consider that the dynamic loading version also only requires approximately a third of the memory to execute, we not only gain a significant performance increase, but also have the ability to handle much more data at the same time, as well as being able to execute arbitrary sized codes via dynamic function loading. This greatly increases the practical applications that dynamic languages can support on micro-core architectures.
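The relative figures quoted in this section follow directly from the reported runtimes. A small check, using only the numbers stated above:

```python
# Runtimes reported in the text for the modified Jacobi benchmark (seconds).
runtimes = {
    "C": 0.053,
    "static dispatch": 0.115,
    "dynamic dispatch": 0.539,
    "dynamic loading": 0.566,
    "ePython VM": 201.0,   # approximate figure from the text
}

# Slowdown of each variant relative to native C:
for name, t in runtimes.items():
    print(f"{name}: {t / runtimes['C']:.1f}x C")

# Speedup of the default (dynamic dispatch) codegen over the interpreter:
print(round(runtimes["ePython VM"] / runtimes["dynamic dispatch"]))
```

This reproduces the headline claims: dynamic dispatch is roughly 10 times slower than native C, yet several hundred times faster than the interpreter.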
The micro-core classification covers a wide variety of processor technologies and this is a thriving area which contains a
number of vibrant communities. Whilst these are very interesting for a number of different reasons, a major challenge is around programmer productivity. Whilst we firmly believe that Python has a significant role to play here, the performance impact of a traditional interpreter greatly reduces its viability for high-performance applications on these technologies.

Table 4. Epiphany-III Jacobi kernel runtime (seconds)

Variant             Runtime
ePython VM          ~201
Static dispatch     0.115
Dynamic dispatch    0.539
Dynamic loading     0.566
C                   0.053

In this paper, to address this, we have introduced a code generation approach for dynamic languages, such as Python. Our code generation model was specifically designed to address these performance concerns and to support the peculiarities of micro-core architectures, more specifically the simplicity of the cores themselves and the tiny amounts of associated memory. Our approach is widely portable to a number of different processor architectures, whilst also producing small, high-performance code that frees precious on-chip memory which can instead be allocated to data. Furthermore, our dynamic loading model enables further memory savings, as well as supporting arbitrary sized codes, whilst delivering performance that, although approximately 10 times slower than optimised native C code, is over 300 times faster than the interpreter. Crucially, our code generation and dynamic loading approach provides the compiler with options to optimise the resulting code in terms of static function dispatch where possible, dynamic dispatch where required, and points for the loading (and unloading) of dynamic functions, based on the execution profile of the code.

Therefore, our present focus is on maturing the native code generation, as we think this has demonstrated some worthwhile early results. Further work includes exploring opportunities for further performance improvements, validated by a wider range of benchmarks to provide greater coverage of the abstract machine design.
Furthermore, the architecture-specific runtime library is currently not included in the dynamic loading. By extending the dynamic loading approach to include the runtime support, the minimum size will be around 1.5KB plus the size of the largest function. This will open up the possibility of running over a number of additional micro-core architectures that contain tiny amounts of memory per core (less than 8KB).

The environment model provides other opportunities; functions have a reference to their environment (closure) that can be traversed, and it would be possible to persist functions for later execution, enabling task switching of functions or interrupt support within the abstract machine. The code bodies for these functions could remain in memory for faster switching response, or could be unloaded for longer interruptions to free critical memory resources.

Whilst this paper has focused on a code generation model using Python as a vehicle for testing our approach, we also believe that the work here has wider applicability to other dynamic programming languages targeting micro-core architectures.

References

[1] [n.d.]. https://downloads.haskell.org/~ghc/7.8.3/docs/html/users_guide/code-generators.html
[2] [n.d.]. The LLVM Compiler Infrastructure Project. https://llvm.org/
[3] [n.d.]. micropython/micropython. https://github.com/micropython/micropython
[4] [n.d.]. Optimize Options (Using the GNU Compiler Collection (GCC)). https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[5] Ben J. Catanzaro (Ed.). 1991. The SPARC Technical Papers. Springer-Verlag, New York. https://doi.org/10.1007/978-1-4612-3192-9
[6] 2013. Update 4: The 3 different code emitters · Micro Python: Python for microcontrollers.
[7] 2014. Epiphany Architecture Reference.
[8] 2016. ARCHER » Advanced MPI.
[9] 2018. MicroPython - Python for microcontrollers. http://micropython.org/
[10] 2020. 7. Simple statements — Python 3.9.0 documentation. https://docs.python.org/3/reference/simple_stmts.html
[11] 2020. parallella/otp. https://github.com/parallella/otp
[12] 2020. Two-Minute fact sheet.
[13] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., USA, 404–429.
[14] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. 2011. Cython: The Best of Both Worlds. Computing in Science & Engineering 13, 2 (2011), 31–39. https://doi.org/10.1109/MCSE.2010.118
[15] OpenMP Architecture Review Board. 2013. OpenMP Application Program Interface Version 4.0. Accessed: 2018-07-25.
[16] Richard Bornat. 2020. Understanding and Writing Compilers. Chapter 13.
[17] N. Brown. 2016. ePython: An Implementation of Python for the Many-Core Epiphany Co-processor. 59–66. https://doi.org/10.1109/PyHPC.2016.012
[18] Olivier Brunet. 2020. ELF binary compilation of a python script - part 1: Cython. https://obrunet.github.io/pythonic%20ideas/compilation_cython/
[19] Francisco M Castro, Nicolás Guil, Manuel J Marín-Jiménez, Jesús Pérez-Serrano, and Manuel Ujaldón. 2018. Energy-based tuning of convolutional neural networks on multi-GPUs. Concurrency and Computation: Practice and Experience (2018), e4786.
[20] Damien George. 2013. Update 4: The 3 different code emitters · Micro Python: Python for microcontrollers.
[21] Damien George. 2013. Update 5: The 3 different code emitters, part 2 · Micro Python: Python for microcontrollers.
[22] Kay Hayen. 2012. Nuitka User Manual. http://nuitka.net/doc/user-manual.html
[23] Robin Hunter. 1999. The Essence of Compilers. Prentice Hall Europe, Hemel Hempstead, United Kingdom, Chapter 7.
[24] Maurice Jamieson and Nick Brown. 2019. High level programming abstractions for leveraging hierarchical memories with micro-core architectures. J. Parallel and Distrib. Comput. 138 (12 2019). https://doi.org/10.1016/j.jpdc.2019.11.011
[25] Maurice Jamieson and Nick Brown. 2019. Poster 99: Eithne: A Framework for Benchmarking Micro-Core Accelerators. https://sc19.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost186.html
[26] Wilfried Koch and Christoph Oeters. 1975. An Abstract ALGOL 68 Machine and its Application in a Machine Independent Compiler. 642–653. https://doi.org/10.1007/3-540-07410-4_665
[27] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 1–6.
[28] John R. Levine. 1999. Linkers and Loaders (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[29] David Richie and James Ross. 2017. I CAN HAS SUPERCOMPUTER? A Novel Approach to Teaching Parallel and Distributed Computing Concepts Using a Meme-Based Programming Language. arXiv:1703.10242 [cs] (March 2017). http://arxiv.org/abs/1703.10242
[30] Andrew Ernest Ritz. 2020. drewvid/parallella-lisp. https://github.com/drewvid/parallella-lisp
[31] George Seif. 2019. Here's how you can get some free speed on your Python code with Numba. https://towardsdatascience.com/heres-how-you-can-get-some-free-speed-on-your-python-code-with-numba-89fdc8249ef3
[32] George Seif. 2019. Use Cython to get more than 30X speedup on your Python code. https://towardsdatascience.com/use-cython-to-get-more-than-30x-speedup-on-your-python-code-f6cb337919b6
[33] John E Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66.
[34] David A. Terei and Manuel M.T. Chakravarty. 2010. An LLVM backend for GHC. ACM SIGPLAN Notices 45, 11 (Sept. 2010), 109–120. https://doi.org/10.1145/2088456.1863538
[35] Linda Torczon and Keith Cooper. 2012. Engineering A Compiler (2nd ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[36] Clifford Wolf. 2018. PicoRV32: A Size-Optimized RISC-V CPU. https://github.com/cliffordwolf/picorv32