Specialization of Generic Array Accesses After Inlining
Jeremy Yallop and Damien Doligez (Eds.): ML/OCaml 2015, EPTCS 241, 2017, pp. 45–53, doi:10.4204/EPTCS.241.4
© R. Tokuda, E. Sumii & A. Abe. This work is licensed under the Creative Commons Attribution License.
Specialization of Generic Array Accesses After Inlining (System Description)
Ryohei Tokuda Eijiro Sumii Akinori Abe
Graduate School of Information Sciences, Tohoku University, Sendai, Japan
We have implemented an optimization that specializes type-generic array accesses after inlining of polymorphic functions in the native-code OCaml compiler. Polymorphic array operations (read and write) in OCaml require runtime type dispatch because of ad hoc memory representations of integer and float arrays. This dispatch cannot be removed even after the operations are monomorphized by inlining because the intermediate language is mostly untyped. We therefore extended it with explicit type application like System F (while keeping implicit type abstraction by means of unique identifiers for type variables). Our optimization has achieved up to 21% speed-up of numerical programs.
Representation of primitive values such as floating-point numbers is a classical problem in the implementation of polymorphic languages since ad hoc representations hinder uniform treatment of values. The classical way to overcome this difficulty is to fit every value in one machine word by heap-allocating multi-word data (called boxing) and manipulating them through pointers. Although simple, this method is inefficient especially for numerical computations because of the heap allocation and pointer dereferences. The cost of garbage collection due to the frequent allocations is particularly problematic.

More sophisticated implementation methods for polymorphism have also been devised. Leroy [8] and Shao [10] adopt a mixture of specialized and uniform representations, where the cost of conversions between different representations (such as flat arrays vs. arrays of boxed values) is non-trivial. Another method is to pass the types at runtime [7, 9, 14], where the construction of the type representations itself incurs an overhead.

OCaml takes an ad hoc approach to the problem: it mainly adopts uniform representation but uses unboxed representations for “local” floating-point numbers (i.e., ones that do not escape a function’s body) and, in particular, for arrays of floating-point numbers (as well as records with fields of floating-point numbers only). Such specialized representation of float array means that polymorphic array accesses (such as setting and getting the elements) need to dynamically check the type of the array and make a case branch when it is a float array. These dynamic checks generally incur runtime overheads, which the standard OCaml compiler tries to remove if the monomorphic type of the array is statically known.

(Footnote: Standard ML (as well as OCaml’s bigarray) takes another ad hoc approach: its Basis Library offers a monomorphic module RealArray (along with other monomorphic array modules of a common interface) for an unboxed representation. There are also similar proposals [6, 15] for array in OCaml.)
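The dynamic check described above can be imitated in user code. The following is only a sketch: the helper names `is_flat_float_array` and `generic_get` are hypothetical, the real check is emitted by the compiler rather than written with `Obj`, and no bounds checking is performed here.

```ocaml
(* Sketch only: mimics, in user code, the case branch that a generic array
   read has to make. Both helper names are hypothetical. *)

(* Tests whether the heap block of [a] uses the unboxed float-array layout. *)
let is_flat_float_array (a : 'a array) : bool =
  let r = Obj.repr a in
  Obj.is_block r && Obj.tag r = Obj.double_array_tag

(* A generic read: branch on the dynamic representation, then load.
   No bounds check is done, so [i] must be a valid index. *)
let generic_get (a : 'a array) (i : int) : 'a =
  let r = Obj.repr a in
  if is_flat_float_array a then
    (* float array: load an unboxed double, then box it *)
    Obj.obj (Obj.repr (Obj.double_field r i))
  else
    (* int or pointer array: load one machine word *)
    Obj.obj (Obj.field r i)
```

When the element type is statically known, the branch is unnecessary and only one of the two loads needs to be compiled, which is the specialization discussed in the rest of the paper.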
Despite the aforementioned specialization of generic array accesses, the standard OCaml compiler fails to apply the specialization when the monomorphic type is known after inlining of functions, because of a lack of type information in the intermediate language.

To see the problem concretely, consider the following program:

    let get0 a = a.(0) (* 'a array -> 'a *)
    let i = get0 int_array
    let f = get0 float_array
Even if the polymorphic function get0 is inlined, the array accesses int_array.(0) and float_array.(0) are not specialized and are considered generic, incurring the runtime overheads of case branches over the dynamic types of the (obviously monomorphic) arrays.

This problem is due to the internal representation of the partial type information attached (only) to array accesses in the intermediate language, which is defined in the OCaml compiler as (roughly speaking):

    type array_kind =
      | Pgenarray   (* generic *)
      | Pintarray   (* int *)
      | Pfloatarray (* float *)
      | Paddrarray  (* address (pointer) *)
The array_kind “Pgenarray” means runtime dispatch over the dynamic type of the elements of an array. For example, the program above can be annotated with this (partial) type information like

    let get0 a = a.{Pgenarray}(0)
    let i = get0 int_array
    let f = get0 float_array

where {} denotes the internal type annotation with array_kind. Obviously, just inlining the function get0 does not specialize the generic read operations:

    let i = int_array.{Pgenarray}(0)
    let f = float_array.{Pgenarray}(0)

Our idea is to add explicit type information like System F to the mostly untyped intermediate language lambda of OCaml. That is, we basically extend the intermediate language with type abstractions and applications.

For instance, the example above can be type-annotated like:

    let get0 {'a} a = a.{'a}(0)
    let i = get0 {Pintarray} int_array
    let f = get0 {Pfloatarray} float_array
We add the formal type parameter {'a} in the definition of get0. Then, we attach the type information {'a} on the read access a.(0) to the array a of polymorphic type 'a array. We then explicitly denote type applications by annotating get0 with {Pintarray} and {Pfloatarray}. The generic array accesses can now be specialized by inlining as:

    let i = int_array.{Pintarray}(0)
    let f = float_array.{Pfloatarray}(0)

For type application, we indeed extended the intermediate language. For abstraction, we actually used globally unique identifiers for type variables to avoid introducing a new binder, as OCaml does for typing; in exchange, we have to specify which type variable to instantiate at the application side.

For example, the foregoing program is now represented like:

    let get0 (* {'a} *) a = a.{'a}(0)
    let i = get0 {'a ↦ I} int_array
    let f = get0 {'a ↦ F} float_array
In the definition of get0, we omit the type abstraction (* {'a} *) while annotating the read access with the implicitly bound type variable 'a. We then annotate the applications of get0 with a mapping from the type variable 'a to an array_kind, like {'a ↦ I} and {'a ↦ F}. When inlining get0, we replace the occurrence of 'a according to the given mapping, resulting in

    let i = int_array.{I}(0)
    let f = float_array.{F}(0)

as desired.

Of course, not all polymorphic functions can be inlined completely (because of code size, for example). In such cases, some type variables still remain and are compiled as generic array accesses with dynamic type dispatch.

In this section, we describe details of our implementation based on the native-code OCaml compiler. We have made the following changes to the intermediate language lambda (resp. clambda) before (resp. after) closure conversion.
First, we replace Pgenarray (type of generic arrays) with Ptvar of int (type variable with an integer identifier) in array_kind to specify which type variable the generic type refers to (like {'a} in a.{'a}(0) instead of Pgenarray in a.{Pgenarray}(0)) as follows:

    type array_kind =
      | Ptvar of int (* id *)
      | Pintarray
      | Pfloatarray
      | Paddrarray

(Footnote: Precisely speaking, we still keep Pgenarray for places where our current implementation abandons specialization, such as functors and GADTs.)

Second, we add Lspecialized to the intermediate languages for explicitly representing type applications.

    type lambda = (* ditto for clambda *)
      | ... (* same as before *) ...
      | Lspecialized of lambda * kind_map (* type application *)
    and kind_map = (int * array_kind) list (* association list *)
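To make the effect of these two changes concrete, here is a toy model of how a kind_map could be applied to an inlined body. It is a sketch under simplifying assumptions: `array_kind` and `kind_map` follow the definitions in the text, while the reduced term type `expr` and the functions `subst_kind` and `specialize` are hypothetical and much smaller than the real lambda language.

```ocaml
(* Toy model. array_kind and kind_map follow the text; expr, subst_kind
   and specialize are hypothetical names introduced for illustration. *)
type array_kind =
  | Ptvar of int          (* type variable with an integer identifier *)
  | Pintarray
  | Pfloatarray
  | Paddrarray

type kind_map = (int * array_kind) list

(* A drastically reduced stand-in for lambda terms. *)
type expr =
  | Var of string
  | Const of int
  | ArrayGet of array_kind * expr * expr   (* a.{kind}(i) *)

(* Replace a type variable by its image in the map, if it has one. *)
let subst_kind (m : kind_map) (k : array_kind) : array_kind =
  match k with
  | Ptvar id ->
    (match List.assoc_opt id m with
     | Some k' -> k'
     | None -> k)                (* unmapped variables stay generic *)
  | Pintarray | Pfloatarray | Paddrarray -> k

(* Apply the map to every array access of an (already inlined) body. *)
let rec specialize (m : kind_map) (e : expr) : expr =
  match e with
  | Var _ | Const _ -> e
  | ArrayGet (k, a, i) ->
    ArrayGet (subst_kind m k, specialize m a, specialize m i)

(* Body of get0 inlined at int_array, still annotated with 'a (id 0). *)
let inlined_body = ArrayGet (Ptvar 0, Var "int_array", Const 0)

(* Applying the map 'a -> Pintarray yields a specialized integer read. *)
let specialized = specialize [ (0, Pintarray) ] inlined_body
```

A type variable that has no entry in the map is left as Ptvar, which corresponds to the case where the access remains generic and keeps its dynamic dispatch.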
The constructor Lspecialized has two parameters: the first parameter (of type lambda) is a polymorphic function to be specialized, while the second (of type kind_map) is a type mapping described above, such as {'a ↦ I} and {'a ↦ F}. We insert Lspecialized at every occurrence of a let-bound polymorphic variable during the translation from typedtree (typed AST) to lambda. More specifically, for every variable occurrence, we compare the monomorphic type of the variable (annotated in the AST) with the polymorphic type stored in the type environment and, if the latter type is indeed polymorphic, insert Lspecialized with kind_map recovered from the comparison by means of a one-directional unification. (In principle, this mapping is already known during type inference but is discarded in the current OCaml compiler. We avoided modifying the type inference because of its complexity.)
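The one-directional unification mentioned above can be sketched as a matching procedure in which only the variables of the polymorphic type are bound. The type language `ty` and the functions `kind_of_ty`, `match_ty` and `kind_map_of` below are hypothetical simplifications, not the compiler's own data structures; in particular, treating every non-int, non-float type as a pointer array is an assumption of this sketch.

```ocaml
(* Hypothetical, simplified type language for illustration only. *)
type ty =
  | Tvar of int            (* type variable, by identifier *)
  | Tint
  | Tfloat
  | Tstring
  | Tarray of ty
  | Tarrow of ty * ty

type array_kind = Ptvar of int | Pintarray | Pfloatarray | Paddrarray

(* Assumed mapping from element types to array kinds. *)
let kind_of_ty = function
  | Tint -> Pintarray
  | Tfloat -> Pfloatarray
  | Tvar id -> Ptvar id    (* still polymorphic at this occurrence *)
  | Tstring | Tarray _ | Tarrow _ -> Paddrarray

(* Match [poly] against [mono]; only variables of [poly] get bound. *)
let rec match_ty poly mono acc =
  match poly, mono with
  | Tvar id, _ ->
    (match List.assoc_opt id acc with
     | None -> Some ((id, mono) :: acc)
     | Some t when t = mono -> Some acc
     | Some _ -> None)     (* same variable, two different instances *)
  | Tint, Tint | Tfloat, Tfloat | Tstring, Tstring -> Some acc
  | Tarray p, Tarray m -> match_ty p m acc
  | Tarrow (p1, p2), Tarrow (m1, m2) ->
    (match match_ty p1 m1 acc with
     | Some acc' -> match_ty p2 m2 acc'
     | None -> None)
  | _, _ -> None

(* Recover a kind_map from the polymorphic and monomorphic types. *)
let kind_map_of poly mono =
  match match_ty poly mono [] with
  | Some bindings ->
    Some (List.map (fun (id, t) -> (id, kind_of_ty t)) bindings)
  | None -> None

(* get0 : 'a array -> 'a, used at float array -> float *)
let get0_poly = Tarrow (Tarray (Tvar 0), Tvar 0)
let get0_at_float = Tarrow (Tarray Tfloat, Tfloat)
let recovered = kind_map_of get0_poly get0_at_float
```

Here `recovered` binds the identifier of 'a to Pfloatarray, which is the information that Lspecialized carries at the occurrence of get0.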
(Footnote on the insertion of Lspecialized at variable occurrences: Thus, the only possible first parameter for Lspecialized is actually an occurrence of a local (Lvar) or global (module access) variable, so it is also possible to attach the kind_map to those variable occurrences instead of introducing Lspecialized. We did not take this approach because modifying the module access primitive (Pgetglobal) seemed more complicated than simply adding Lspecialized.)

Finally, we need to record and follow renaming of type variables exported via .cmx files to prevent inconsistencies with .cmi files.

Suppose, for example, that we compile a source file a.ml including function get0

    a.ml
    let get0 a = a.{'a}(0)

with the interface file:

    a.mli
    val get0 : 'a array -> 'a

On one hand, the .mli is compiled into a .cmi file:

    a.cmi (* pseudo-code for the binary *)
    val get0 : 'a array -> 'a

On the other hand, the implementation a.ml is compiled separately from the interface a.mli. Even the types are inferred independently; they are only checked against the compiled interface a.cmi after inference. As a result, the type variable 'a may be given a completely different identifier in the AST a.cmx generated for inlining:

    a.cmx (* pseudo-code for the AST *)
    get0 a = a.{'b}(0)

This inconsistency is problematic for our scheme, where get0 is applied like A.get0 {'a ↦ I} int_array according to its type in the interface a.cmi. We fixed this by adjusting the type variable identifiers in a .cmx according to the corresponding .cmi just before generating the former:

    a.cmx (adjusted) (* pseudo-code for the AST *)
    get0 a = a.{'a}(0)
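One way to picture this adjustment is to align the inferred type of get0 with its declared type, collect a renaming of type-variable identifiers, and apply it to the kinds in the exported body. The sketch below is an assumption about how such an adjustment could be organized, not a description of the actual patch; `ty`, `align` and `rename_kind` are hypothetical names.

```ocaml
(* Hypothetical, simplified types for illustration only. *)
type ty = Tvar of int | Tint | Tfloat | Tarray of ty | Tarrow of ty * ty
type array_kind = Ptvar of int | Pintarray | Pfloatarray | Paddrarray

(* Walk the inferred and the declared type in parallel and record which
   inferred identifier corresponds to which interface identifier. *)
let rec align inferred iface acc =
  match inferred, iface with
  | Tvar i, Tvar j -> if List.mem_assoc i acc then acc else (i, j) :: acc
  | Tarray t, Tarray u -> align t u acc
  | Tarrow (t1, t2), Tarrow (u1, u2) -> align t2 u2 (align t1 u1 acc)
  | _, _ -> acc            (* nothing to rename at this position *)

(* Apply the renaming to the kind annotation of an array access. *)
let rename_kind renaming = function
  | Ptvar id ->
    (match List.assoc_opt id renaming with
     | Some id' -> Ptvar id'
     | None -> Ptvar id)
  | (Pintarray | Pfloatarray | Paddrarray) as k -> k

(* Inferred: 'b array -> 'b (id 1); declared: 'a array -> 'a (id 0). *)
let inferred = Tarrow (Tarray (Tvar 1), Tvar 1)
let declared = Tarrow (Tarray (Tvar 0), Tvar 0)
let renaming = align inferred declared []

(* The access a.{'b}(0) becomes a.{'a}(0) after adjustment. *)
let adjusted_kind = rename_kind renaming (Ptvar 1)
```

With the identifiers aligned, a kind_map built from the interface type at a use site refers to the same variable as the body stored for inlining.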
Moreover, suppose that the function A.get0 is used from another file b.ml. When the interface a.cmi is imported, the type variable 'a is renamed for the sake of uniqueness inside the importing module B

    b.ml
    open A
    (* val get0 : 'c array -> 'c *)
    let i = get0 {'c ↦ I} int_array (* an example using A.get0 *)

and becomes inconsistent with the implementation a.cmx. We fixed this inconsistency by remembering the renaming such as 'a ↦ 'c in a global table at import time, and applying it before inlining function bodies such as a.{'a}(0).

    b.ml (after inlining)
    open A
    (* renaming table for A is 'a ↦ 'c *)
    (* val get0 : 'c array -> 'c *)
    let i = int_array.{('c ↦ I)('a ↦ 'c) 'a}(0) (* correctly inlined *)

We have implemented the above specialization on top of the 4.02 branch of OCaml as of May 6, 2015, and measured its effects for the numerical programs in Table 1. “Simple” is a program that adds all elements of an array, where all the accesses are made through polymorphic functions. “Random” makes generic array accesses, randomly switching between int array and float array (in addition, the pseudo-random number generator internally makes monomorphic array accesses). The other benchmarks are realistic, naturally written numerical programs: “DKA” stands for Durand-Kerner-Aberth (a method for finding a root of a complex polynomial), “FFT” is a fast Fourier transform, “K-means” is a clustering method used for data mining and machine learning, “LD” stands for the Levinson-Durbin recursion for time series analysis, “LU” is the LU decomposition in linear algebra, “NN” is a neural network program, and “QR” is the QR decomposition (again in linear algebra). We have hand-tuned the timing of garbage collections.
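For orientation, a program in the spirit of “Simple” might look as follows. This is a hypothetical sketch written for this description, not the authors' benchmark source; the accessor names `get` and `length` and the function `sum_floats` are assumptions.

```ocaml
(* Hypothetical sketch in the spirit of "Simple"; not the benchmark itself. *)

(* Polymorphic accessors: inside them the element type is unknown, so the
   read is a generic one unless it is specialized after inlining. *)
let get (a : 'a array) (i : int) : 'a = a.(i)
let length (a : 'a array) : int = Array.length a

(* Adds all elements of a float array, going through the accessors. *)
let sum_floats (a : float array) : float =
  let acc = ref 0.0 in
  for i = 0 to length a - 1 do
    acc := !acc +. get a i
  done;
  !acc
```

After inlining `get` into `sum_floats` the array is known to be a float array, which is exactly the situation where the specialization can replace the generic read.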
(Footnotes: Our compiler is available from: https://github.com/nomaddo/ocaml. The source code of the benchmark programs is available from: https://github.com/nomaddo/ocaml-numerical-analysis/tree/bench.)

[Table 1: benchmark programs and results; change computed as (Time-after − Time-before) / Time-before ×]

We compiled all the files (including the standard library modules) with ocamlopt -inline 10000000 (and -unsafe for the benchmark programs). The results (on Ubuntu Linux 14.04, Intel(R) Core(TM) i7-2677M 1.80 GHz, fixed at 1.0 GHz by cpufreq to avoid experimental errors caused by Intel Turbo Boost, which is sensitive to changes in temperature, and 4 GiB DDR3 SDRAM) are also in Table 1 and can be explained as follows:

• “Simple”, “DKA”, and “LD” show modest speed-up because all the generic accesses are specialized while most of the execution time is still spent on floating-point operations.
• “Random” exhibits considerable improvement since the generic accesses are removed and the program does not perform any other significant computation.
• “FFT” and “LU” contain only a relatively small number of generic accesses in the first place because of the monomorphic coding style.
• “K-means”, “NN”, and “LU” include polymorphic array access functions that are not inlined at all, probably because of closure sharing of the OCaml compiler. For instance, the following function foldi

      let foldi f init x =
        snd (Array.fold_left
               (fun (i, acc) xi -> (i+1, f i acc xi))
               (0, init)
               x)

  is never inlined, since an argument f of foldi appears free in the anonymous function fun (i, acc) xi -> (i+1, f i acc xi).
• The high-level coding style of QR almost exclusively uses polymorphic functions such as Array.map and Array.fold_left as opposed to low-level index accesses, and achieves 21% speed-up thanks to the almost complete specialization.
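To see how foldi is typically used, the following small example repeats its definition from the text and adds two callers. The callers `weighted_sum` and `index_dot` are hypothetical and only illustrate that the argument f is captured by the closure passed to Array.fold_left.

```ocaml
(* foldi as given in the text: a left fold that also passes the index. *)
let foldi f init x =
  snd (Array.fold_left
         (fun (i, acc) xi -> (i + 1, f i acc xi))  (* f occurs free here *)
         (0, init)
         x)

(* Hypothetical caller on a float array: sum of index * element. *)
let weighted_sum (a : float array) : float =
  foldi (fun i acc xi -> acc +. float_of_int i *. xi) 0.0 a

(* Hypothetical caller on an int array: same computation on integers. *)
let index_dot (a : int array) : int =
  foldi (fun i acc xi -> acc + i * xi) 0 a
```

Because foldi is polymorphic in the element type, the array reads performed inside Array.fold_left stay generic for both callers as long as foldi itself is not inlined.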
While we have extended the partial explicit type information for array access operations in the intermediate language(s) of OCaml, more explicitly typed intermediate languages have already been studied extensively:

• Harper and Morrisett [7] formalized a translation from the implicitly typed ML core language to an explicitly typed intermediate language λ_i^ML, which is a variant of Fω extended with intensional type analysis. Crary and Weirich [4] further extended λ_i^ML, generalizing the type language.
• TIL [14], a Standard ML compiler with intensional polymorphism, adopted λ_i^ML as the intermediate language for type-directed optimizations and applied conventional optimizations (inlining, uncurrying, common sub-expression elimination, etc.) to the type-level language to reduce its overheads, in particular the construction of type representations.
• FLINT [11, 12], the intermediate language of SML/NJ [1], used directed acyclic graphs instead of trees as type representations for scalability against large types.
• GHC uses System FC [13], an extension of Fω with type coercion for uniformly supporting a wide variety of features such as GADTs [16] and associated types [3, 2].

Compared with these fully typed intermediate languages, our extension (Ptvar and Lspecialized) to the partial type information (array_kind) for array access operations in OCaml is ad hoc but small and relatively easy, with fewer than 1000 lines of modification to (hundreds of thousands of lines of) the native-code compiler.
We have made a relatively simple modification to the native-code OCaml compiler to specialize generic array accesses after inlining, and observed modest or significant speed-ups for numerical programs.

Currently, a new intermediate language flambda [5] is under development (independently of our work) in the trunk branch of OCaml. We expect that it will solve problems of the current lambda (such as closure sharing hindering inlining, as observed in Section 3) as well as make our approach even more effective by enabling the specialization of recursive functions (which are never inlined by the current OCaml compiler), for example.

Although our experiments focused on the efficiency of floating-point programs, the optimization may also be effective for generic functions (such as Array.map and Array.fold_left) applied to integer or pointer arrays. It would also be interesting future work to adapt an approach similar to ours for specializing operations other than array accesses, such as polymorphic comparisons and unboxing local variables.
Acknowledgments
We thank Jacques Garrigue for his help on hacking the OCaml compiler and the anonymous reviewers for valuable comments. This work was partially supported by JSPS KAKENHI Grant Numbers JP22300005, JP25540001, JP15H02681, JP16K12409, and by Mitsubishi Foundation Research Grants in the Natural Sciences.
References

[1] Standard ML of New Jersey.
[2] Manuel M. T. Chakravarty, Gabriele Keller & Simon L. Peyton Jones (2005): Associated type synonyms. In: Proceedings of the 10th ACM SIGPLAN International Conference on Functional Programming, ICFP 2005, Tallinn, Estonia, September 26-28, 2005, pp. 241–253.
[3] Manuel M. T. Chakravarty, Gabriele Keller, Simon L. Peyton Jones & Simon Marlow (2005): Associated types with class. In: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2005, Long Beach, California, USA, January 12-14, 2005, pp. 1–13.
[4] Karl Crary & Stephanie Weirich (1999): Flexible Type Analysis. In: Proceedings of the fourth ACM SIGPLAN International Conference on Functional Programming (ICFP ’99), Paris, France, September 27-29, 1999, pp. 233–248.
[5] Optimisation with Flambda. http://caml.inria.fr/pub/docs/manual-ocaml/flambda.html.
[6] Alan Frisch (2015): About unboxed float arrays.
[7] Robert Harper & J. Gregory Morrisett (1995): Compiling Polymorphism Using Intensional Type Analysis. In: Conference Record of POPL ’95: 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Francisco, California, USA, January 23-25, 1995, pp. 130–141.
[8] Xavier Leroy (1992): Unboxed Objects and Polymorphic Typing. In: ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 177–188.
[9] Ronald Morrison, Alan Dearle, Richard C. H. Connor & Alfred L. Brown (1991): An Ad Hoc Approach to the Implementation of Polymorphism. ACM Trans. Program. Lang. Syst. 13(3), pp. 342–371.
[10] Zhong Shao (1997): Flexible Representation Analysis. In: Proceedings of the 1997 ACM SIGPLAN International Conference on Functional Programming (ICFP ’97), Amsterdam, The Netherlands, June 9-11, 1997, pp. 85–98.
[11] Zhong Shao (2000): Typed common intermediate format. ACM SIGSOFT Software Engineering Notes 25(1), p. 82.
[12] Zhong Shao, Christopher League & Stefan Monnier (1998): Implementing Typed Intermediate Languages. In: Proceedings of the third ACM SIGPLAN International Conference on Functional Programming (ICFP ’98), Baltimore, Maryland, USA, September 27-29, 1998, pp. 313–323.
[13] Martin Sulzmann, Manuel M. T. Chakravarty, Simon L. Peyton Jones & Kevin Donnelly (2007): System F with type equality coercions. In: Proceedings of TLDI ’07: 2007 ACM SIGPLAN International Workshop on Types in Languages Design and Implementation, Nice, France, January 16, 2007, pp. 53–66.
[14] David Tarditi, J. Gregory Morrisett, Perry Cheng, Christopher A. Stone, Robert Harper & Peter Lee (1996): TIL: a type-directed, optimizing compiler for ML (with retrospective). In: 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999, A Selection, pp. 554–567.
[15] Leo White (2015): Remove float array optimisation. https://github.com/ocaml/ocaml/pull/163.
[16] Hongwei Xi, Chiyan Chen & Gang Chen (2003): Guarded recursive datatype constructors. In: Conference Record of POPL 2003: The 30th SIGPLAN-SIGACT Symposium on Principles of Programming Languages, New Orleans, Louisiana, USA, January 15-17, 2003, pp. 224–235, doi:10.1145/640128.604150.