Automatically Mining Program Build Information via Signature Matching
Charng-Da Lu
Buffalo, NY 14203
Abstract
Program build information, such as the compilers and libraries used, is vitally important in an auditing and benchmarking framework for HPC systems. We have developed a tool to automatically extract this information using signature-based detection, a common strategy employed by anti-virus software to search for known patterns of data within program binaries. We formulate the patterns from various "features" embedded in the program binaries, and our experiments show that the tool can successfully identify many different compilers, libraries, and their versions.
Introduction

One important component in an auditing and benchmarking framework for HPC systems is the ability to report the build information of program binaries, because program performance depends heavily on the compilers, numerical libraries, and communication libraries used. For example, the SPEC CPU 2000 Run and Reporting Rules [2] contain meticulous guidelines on reporting the compiler of choice, compilation flags, allowed and forbidden compiler tuning, libraries, data type sizes, etc.

However, on most HPC systems, program build information, if maintained at all, is recorded manually by system administrators. Over time, the sheer number of software/library packages of different versions, builds, and compilers of choice can grow combinatorially and become too daunting and burdensome to document. For example, at our local center we have software packages built from 250 combinations of different compilers and numerical/MPI libraries. On larger systems such as Jaguar and Kraken at the Oak Ridge National Laboratory, the number can be as high as 738 [13].

In addition, there is no standard format for documenting program build information. Many HPC systems use Modules [3] or SoftEnv [4] to manage software packages, and a common naming scheme is to incorporate the compiler name (as a suffix) in the package name. There is usually additional textual description to indicate build information, such as compiler version, debug/optimization/profiling build, and so on. Mining these free-form texts, however, requires an understanding of each HPC site's software environment and documentation style, and is not generally applicable.

In this paper, we present a signature-matching approach to automatically uncover program build information. This approach is akin to the common strategy employed by anti-virus software to detect malware: search for a set of known signatures.
We exploit the following "features" of program binaries and create signatures out of them:

• Compiler-specific code snippets.
• Compiler-specific meta data.
• Library code snippets.
• Symbol versioning.
• Checksums.

Our approach has several advantages. First, we only need to create, annotate, and maintain a database of signatures gathered from compilers and libraries; we can then run the signature scanner over program binaries to derive their build information. Second, unlike the anti-virus industry, where malware code must be identified and extracted by experts, our signature collection process is almost mechanical and can be performed by non-experts. Third, our approach does not rely on symbolic information and thus can handle stripped program binaries.

Our implementation is based on the advanced pattern matching engine of ClamAV [11], an open-source anti-virus package. We chose ClamAV for its open-source nature, signature expressiveness, and scanning speed.

The remainder of this paper first describes the features in the program binaries, then provides the implementation details and experimental results, and finally discusses potential improvements and related work.

Features in Program Binaries

On most modern UNIX and UNIX-related systems, executable binaries (programs and libraries) are stored in a standard object file format called the Executable and Linking Format (ELF) [5, 6]. An ELF file can be divided into named "sections," each of which serves a specific function at compile time or runtime. The sections relevant to our work are:

• The .text section contains the executable machine code and is the main source for our signature identification.
• The .comment section contains compiler- and linker-specific version control information, discussed further below.
• The .dynamic section holds dynamic linking information, including the file names of dependent dynamic libraries and pointers to symbol version tables and relocation tables.
• The .rel.text and .rela.text sections consist of relocation tables associated with the corresponding .text sections, discussed further below.
• The .gnu.version_d section comprises the version definition table, discussed further below.

Compiler-Specific Code Snippets

It is not news that certain popular compilers on the Intel x86 platform insert extra code snippets unbeknownst to the developers [7]. We illustrate with three examples.

The first example is the so-called "processor dispatch" employed by certain optimizing compilers. As the x86 architecture evolves with the addition of new capabilities and new instructions, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), an optimizing compiler will produce machine code tuned for each capability. Since the new instructions are not recognized by older generations of x86 processors, an extra code snippet is inserted to avoid "illegal instruction" errors and to re-route the execution path to the suitable code blocks.

Both the Intel and PGI compilers, when invoked with optimization flags enabled (and -O2 is used implicitly), insert the processor dispatch code, which is executed before the application's main function. These code snippets invariably use the cpuid instruction to obtain processor feature flags. For example, the core processor dispatch routine used by the Intel compiler is called __intel_cpu_indicator_init. It initializes an internal variable called __intel_cpu_indicator to different values based on the processor on which the program is running [7].
This information is later used either to abort program execution immediately, with an error like "This program was not built to run on the processor in your system," or to execute different code blocks (tuned for different generations of SSE instructions) in Intel's optimized C library routines such as memcpy and strcmp.

A second instance of compiler-inserted code enables or disables certain floating-point unit (FPU) features. For example, when GCC is invoked with the -ffast-math or -funsafe-math-optimizations optimization flags, it inserts code to turn on the Flush-To-Zero (FTZ) mode and the Denormals-Are-Zero (DAZ) mode in the x86 control register MXCSR. When these modes are on, the FPU bypasses the IEEE 754 standard and treats denormal numbers, i.e. values extremely close to zero, as zeros. This optimization trades off accuracy for speed [8]. GCC also accepts the -mpc32/-mpc64/-mpc80 flags, which set the legacy x87 FPU precision/rounding mode. Again, GCC uses a special prolog code to configure the FPU to the requested mode.

A third instance of compiler-inserted code initializes the user's data. For example, one of the C++ language features requires that static objects be initialized, i.e. their constructors called, before program startup [9]. To implement this, the C++ compiler emits a special ELF section called .ctors, which is an array of pointers to static objects' constructors, and inserts a prolog code snippet which sweeps through the .ctors section before running the application's main function.

Compiler-Specific Meta Data

ELF files have an optional section called .comment which consists of a sequence of null-terminated ASCII strings. This section is not loaded into memory during execution, and its primary use is as a placeholder for version control software such as CVS or SVN to store control keyword information.
In practice, most compilers we examined also fill this section with strings unique enough to differentiate the compilers and their versions. Compilers emit these strings via the .ident assembler directive when generating the assembly code, and the assembler then pools the strings and saves them into the .comment section. Unlike the debugging and symbolic information embedded in other ELF sections, the .comment section is not removed by the GNU strip utility, so we can mine it to obtain the compiler provenance.

For example, using the GNU readelf tool with the command-line option -p .comment on GCC-compiled programs could produce the following output:

GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-50)
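To illustrate the idea, here is a minimal sketch (not the tool itself) of mining compiler provenance from the raw bytes of a .comment section. The regular expression is an assumption keyed to the "GCC: (GNU) <version>" banner format shown above:

```python
import re

def comment_strings(blob: bytes):
    """Split a raw .comment section into its null-terminated ASCII strings."""
    return [s.decode("ascii", "replace") for s in blob.split(b"\x00") if s]

# Hypothetical pattern for the "GCC: (GNU) <version>" banner shown above.
GCC_BANNER = re.compile(r"GCC: \(GNU\) (\d+\.\d+\.\d+)")

def gcc_versions(blob: bytes):
    """Return all GCC version numbers recorded in a .comment blob."""
    hits = []
    for s in comment_strings(blob):
        m = GCC_BANNER.search(s)
        if m:
            hits.append(m.group(1))
    return hits

# Example: a .comment blob as pooled from the .ident strings of two objects.
blob = b"GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-50)\x00GCC: (GNU) 4.4.3\x00"
print(gcc_versions(blob))  # ['4.1.2', '4.4.3']
```

A real tool would first locate the .comment section via the ELF section headers; here the section bytes are passed in directly.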
Library Code Snippets

If a program calls library functions, the linker binds the functions to libraries to create the executable. The linking mode is either static or dynamic. In the former, the linker extracts the code of the called functions from the libraries, which are simply archives of ELF files, and performs the relocation (described later).

Symbol Versioning

Some dynamic libraries are self-annotated with version information in a uniform format, and we use this information to identify both the library and its version. The GNU C library (glibc) is a representative example.
In glibc, for example, malloc and free are versioned GLIBC_2.0 while malloc_info is versioned GLIBC_2.10, and the version definition chain indicates that GLIBC_2.10 is compatible with GLIBC_2.1, which is in turn compatible with GLIBC_2.0. All of the versioning data are encoded in the .gnu.version_d section (d for definition) of dynamic libraries when they are built. When a user program is compiled and linked, a version-aware linker obtains the versions of the called functions from the dynamic libraries and stores them in the resulting binary's .gnu.version_r section (r for reference). At runtime, the program loader-linker ld.so first examines whether all version references in the user's program binary can be satisfied, and accordingly either aborts or continues.

Symbol versioning is used extensively in the GNU compiler collection (C, C++, Fortran, and OpenMP runtime libraries), Myrinet MX/DAPL libraries, and OpenFabrics/InfiniBand Verbs libraries. All of these instances adopt the same version naming scheme: a unique label, e.g. GLIBC, GLIBCXX, or MX, followed by an underscore and the version. Hence, our tool can recognize them using a hard-coded list of labels and obtain their version by traversing the version chain.

Checksums

Most dynamic libraries are less sophisticated and do not use symbol versioning. Therefore, to recognize them, we resort to the traditional approach of checksums.
md5sum is a commonly used open-source utility that produces and verifies the MD5 checksum of a file, but it is file-structure agnostic and fails to characterize ELF dynamic libraries on platforms (e.g. Red Hat Enterprise Linux) where the prelinking/prebinding technology [18] is used. Prelinking is intended to speed up the runtime loading and linking of dynamic libraries when a program binary is launched. To achieve this, a daemon process periodically updates the dynamic libraries' relocation tables. The side effect of prelinking is an MD5 checksum mismatch, as part of the file content has been changed. To defeat this effect, we calculate the MD5 checksum over the .text section only for ELF files.
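Once the .text section's bytes have been located (a real tool would parse the ELF section headers first; here they are passed in directly), the checksum step itself is just an MD5 over that slice, which is what makes it immune to prelink's edits elsewhere in the file:

```python
import hashlib

def text_section_md5(text_bytes: bytes) -> str:
    """MD5 over the .text section only, so prelink's updates to the
    relocation tables elsewhere in the file do not perturb the hash."""
    return hashlib.md5(text_bytes).hexdigest()

# The hash depends on the code bytes alone: the same .text always yields
# the same signature, regardless of what prelinking did to the rest of
# the library file.
text = b"\x55\x48\x89\xe5\xc9\xc3"
print(text_section_md5(text))
```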
Implementation

Our implementation is based on the pattern matching engine of the open-source anti-virus package ClamAV [11], with additional code to support symbol versioning. The implementation comprises two tools: a signature generator and a signature scanner. The signature generator parses ELF files and outputs ClamAV-formatted signature files. The signature scanner takes as input the signature files and the user's program binary and outputs all possible matches. In the following, we discuss ClamAV's signature formats and matching algorithms and how we leverage ClamAV in our implementation.
ClamAV signatures can be classified into the following types, in order of increasing complexity and power: MD5, basic, regular expression (regex), logical, and bytecode. Our implementation makes use of the first three types because they can be generated automatically. A regex signature is a string of hex bytes that may contain wildcards such as ?? (to match any byte) and {n} (to match any n consecutive bytes). ClamAV's scanning engine handles regex signatures with the Aho-Corasick (AC) string searching algorithm, which can match multiple strings concurrently at the cost of consuming more memory. The AC algorithm starts with a preprocessing phase: take a set of wildcard-free strings and create a finite automaton. The scanning phase is simply a series of state transitions in this finite automaton. ClamAV utilizes the AC algorithm as follows: every regex signature is broken into basic signatures (separated by wildcards), and a single finite automaton (implemented as a two-level 256-way "trie" data structure) is created from all of these basic signatures. If all wildcard-free parts of a regex signature are matched, ClamAV checks whether the order of and the gaps between the parts satisfy the specified wildcards.

For completeness we briefly mention the remaining two signature types. We do not use them because we have not yet found automatic ways to create them. Logical signatures allow combining multiple regex signatures using logical and arithmetic operators. Bytecode signatures further extend logical signatures and offer maximal flexibility: they are ClamAV plug-ins compiled from C programs into LLVM bytecode, and hence allow arbitrary algorithmic detection of patterns.

For dynamic libraries (.so files), the signature generator computes the MD5 checksums over their .text sections and outputs ClamAV-conformant MD5 signature files. Compiler-specific code snippets and static library code reside in ELF .o (object) and .a (library archive) files. In the following discussion we focus only on .o file handling, because an .a file is just an archive of multiple .o files.
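ClamAV's two-phase handling of regex signatures, matching the wildcard-free fragments and then verifying the gaps between them, can be sketched as follows. This is a toy re-implementation that supports only the ?? and {n} wildcards and brute-forces the start offset where ClamAV would use Aho-Corasick:

```python
def parse_signature(sig: str):
    """Split a hex signature into (gap, fragment) pairs: each wildcard-free
    fragment is preceded by the number of arbitrary bytes it must skip."""
    parts, cur, gap = [], [], 0
    for tok in sig.split():
        if tok == "??" or tok.startswith("{"):
            if cur:
                parts.append((gap, bytes(cur)))
                cur, gap = [], 0
            gap += 1 if tok == "??" else int(tok.strip("{}"))
        else:
            cur.append(int(tok, 16))
    if cur:
        parts.append((gap, bytes(cur)))
    return parts

def match_at(parts, data: bytes, start: int) -> bool:
    """Phase 2: verify the fragments occur in order with the exact gaps."""
    pos = start
    for gap, frag in parts:
        pos += gap
        if data[pos:pos + len(frag)] != frag:
            return False
        pos += len(frag)
    return True

def scan(sig: str, data: bytes) -> bool:
    """Try every start offset (ClamAV instead finds candidate fragment
    occurrences with a single Aho-Corasick automaton)."""
    parts = parse_signature(sig)
    return any(match_at(parts, data, s) for s in range(len(data) + 1))

print(scan("55 48 ?? e5", b"\x00\x55\x48\x99\xe5"))  # True
```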
Our signature generator extracts .text sections from .o files and outputs, for each .text section, a basic or regex signature of 16-255 bytes in length (excluding the wildcards). We describe this process in depth as follows.

First, a signature is not just the bytes of the .text section verbatim. When a source file is compiled into an .o file, the addresses of unresolved function names and symbols in this .o file are unknown and have to be left empty. It is during the linking phase that these addresses are resolved and assigned by the linker. This process is called relocation [10]. To facilitate relocation, the compiler emits one relocation table for each .text section. Each entry of a relocation table specifies the symbol name to be resolved, the offset into the .text section which contains the address to be assigned, and the relocation type. When we create a signature from the bytes of a .text section, we have to mask the bytes which are reserved for addresses yet to be computed. To illustrate, suppose we compile into an .o file a small C function, foo, which calls malloc. On x86, the disassembly of the generated .o file (obtained with the GNU objdump utility) contains a callq instruction whose 4-byte target address is left blank, and the corresponding relocation table is:

OFFSET  TYPE           VALUE
00000e  R_X86_64_PC32  malloc+0xfffffffffffffffc
Together, these indicate that the target of the callq instruction should be the address of the function named "malloc", and that this address fills the 4 bytes (as specified by the R_X86_64_PC32 relocation type) starting at offset 0x0e. So if foo, as a library function, is used to create a user program binary, the linker will take the byte stream 55 48 89 e5 ... c9 c3 and fill the bytes at offsets 0x0e through 0x11 with the actual address of malloc. Thus, to identify foo, we create a ClamAV regex signature as:
55 48 89 e5 48 83 ec 10 bf 0a 00 00 00 e8 ?? ?? ?? ?? 48 89 45 f8 c9 c3
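The masking step just described can be sketched as follows. The relocation span (offset 0x0e, 4 bytes) comes from the R_X86_64_PC32 entry shown earlier, and the .text bytes are an illustrative reconstruction of the foo example, not output from the paper's tool:

```python
def make_signature(text: bytes, reloc_spans) -> str:
    """Emit a hex signature from .text bytes, replacing every byte inside a
    relocation target (offset, length) with the ?? wildcard."""
    masked = {i for off, n in reloc_spans for i in range(off, off + n)}
    return " ".join("??" if i in masked else f"{b:02x}"
                    for i, b in enumerate(text))

# foo's .text with the callq target still zeroed out, as in the .o file;
# the relocation table says 4 bytes at offset 0x0e hold the future address.
text = bytes.fromhex("554889e54883ec10bf0a000000e800000000488945f8c9c3")
print(make_signature(text, [(0x0e, 4)]))
# 55 48 89 e5 48 83 ec 10 bf 0a 00 00 00 e8 ?? ?? ?? ?? 48 89 45 f8 c9 c3
```

Because the four address bytes are wildcarded, the same signature matches foo no matter what address the linker later writes there.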
The second consideration is the signature size. As will be seen later, a .text section can be as large as four megabytes. Using the entire .text section could lead to long preprocessing times and large disk/memory storage requirements. Therefore, we impose an upper limit of 255 bytes on the signature size. We consider 255 a reasonable size, as the space of distinct 255-byte streams is large enough to yield few collisions/false positives. For a .text section of n > 255 bytes, we use the tailing 255/3 = 85 bytes x1 x2 ... x85 of the first third, the tailing 85 bytes y1 y2 ... y85 of the middle third, and the tailing 85 bytes z1 z2 ... z85 of the last third, and form a regex signature as:

x1 x2 ... x85 {l} y1 y2 ... y85 {m} z1 z2 ... z85

where l = ⌊n/3⌋ - 85 and m = l + (n mod 3). We also ignore .text sections shorter than 16 bytes. This cut-off is chosen because the size of an x86 instruction varies between 1 and 16 bytes, and since we do not decode the bytes back into x86 instructions, we do not know the instruction boundaries and have to make a conservative assumption. Besides, signatures that are too short could result in many false positives.

The third consideration is that an .o file can contain more than one .text section. This happens in GNU Fortran's static library, which is created with the -ffunction-sections compiler flag. This flag instructs the compiler to put each function in its own .text section instead of placing all functions from the same source file in one single .text section. So for a Fortran function, say foo, the compiler creates a section named .text.foo which consists of foo's code only. In such a situation, our tool emits one signature per such .text section.

The signature database is organized as a collection of signature files, each of which contains signatures from a specific compiler/library, e.g. the Intel Fortran compiler, Intel MKL, MVAPICH, etc. Each signature file is annotated manually to indicate the package name and version.
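The three-part construction described above can be sketched as follows (an illustrative re-implementation of the splitting formula, using the same ??/{n} signature syntax as ClamAV):

```python
def three_part_signature(text: bytes) -> str:
    """For a .text section longer than 255 bytes, keep the tailing 85 bytes
    of each third and bridge them with {l} and {m} gap wildcards, where
    l = floor(n/3) - 85 and m = l + (n mod 3)."""
    n = len(text)
    assert n > 255, "short sections get a plain basic signature instead"
    third = n // 3
    l = third - 85        # gap between the first and second fragments
    m = l + (n % 3)       # gap between the second and third fragments
    x = text[third - 85:third]          # tail of the first third
    y = text[2 * third - 85:2 * third]  # tail of the middle third
    z = text[n - 85:n]                  # tail of the last third
    hx = lambda frag: " ".join(f"{b:02x}" for b in frag)
    return f"{hx(x)} {{{l}}} {hx(y)} {{{m}}} {hx(z)}"

# A 1024-byte dummy section: thirds of 341 bytes, so l = 256 and m = 257,
# and 85 + 256 + 85 + 257 + 85 recovers all 768 trailing positions exactly.
sig = three_part_signature(bytes(range(256)) * 4)
```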
The scanner takes as input this database and the user's program binary and outputs all possible matches. For dynamic library identification, it uses the ldd command to obtain the library pathnames. It then extracts their symbol versioning data (if any) and compares it against the list of known labels, as explained above. It also extracts the .text and .comment sections (compiler meta data are treated as basic signatures) and runs them through the ClamAV matching engine. By default ClamAV stops as soon as it spots a match, so to find all matches, we modify it to repeatedly zero out the matched area and rerun the engine until no match can be found.

(The -ffunction-sections optimization mentioned earlier reduces the size of statically linked program binaries because it eliminates dead code, i.e. functions which are unused but included nevertheless because they reside in the same source files as the used functions.)

Evaluation
We evaluate our approach with both toy programs and real-world HPC software packages from two HPC sites. We compile toy programs with a variety of compilers to test the effectiveness of source compiler identification. We use the existing HPC software packages to assess not only the compiler and library recognition but also ClamAV's scanning performance.
Compiler Identification

We examine fourteen compilers on the x86-64 Linux platform and summarize our findings in Table 1. We locate the compiler-specific code snippets by enabling the verbosity flag when building the toy programs. This flag is supported by all compilers, and it displays exactly where and which .a and .o files are used in the compilation process. The toy programs we constructed, e.g. "Hello, World" and matrix multiplication, are short and use only basic language features and APIs, so they can highlight the usefulness of our approach. All test cases are compiled with each compiler's default settings.

As an example, the "Hello, World" program compiled with Intel compiler 12.0 yields the following output from our scanner, which gives the number of matches and total size of matches against each signature file:

(3 times, 6992 bytes) Intel Compiler Suite 12.0
(2 times, 200 bytes) GCC 4.4.3

We have the following observations.

1. Many compilers strive to be compatible with the GNU development tools and runtime environment, so they also use GNU's code snippets. Therefore, GCC becomes a common denominator and is ubiquitous in the scanning results. The above output is typical: the Intel compiler locates the system's default GCC installation (version 4.4.3 in this case) and uses its crtbegin.o and crtend.o in the compilation. These two .o files handle the .ctors section as discussed earlier.

Compiler       Note  Version        Meta Data  Code Snippet Source
Absoft         F,O   11.1                      liba*.a
Clang          C,L   2.8
Cray                 7.1, 7.2       V          libcsup.a, libf*.a, libcray*.a
G95            F,G   0.93           V          libf95.a
GNU            G     4.1, 4.4, 4.5  V          crt*.o, libgcc*.a
Intel                9.x thru 12.0  I          libirc*.a, libfcore*.a
Lahey-Fujitsu  F     8.1            I          fj*.o, libfj*.a
LLVM-GCC       G,L   2.8            V
NAG            F,†,‡

Table 1: Compiler identification. C: C/C++ compiler only. F: Fortran compiler only. G: uses GNU codebase. I: has unique meta data. L: uses LLVM codebase. O: uses Open64 codebase. V: meta data have both brand string and version number. †: is actually a Fortran-to-C converter with GCC as backend. ‡: inserts FTZ/DAZ-enabling prolog code (discussed earlier). (For one of these compilers there are no .a/.o files to harvest, so we produce its signature manually.)

Library      Version (Compiler)  Code Snippet Source  Mean, StdDev of .o .text size (KB)
ACML         4.4.0 (I,P)         libacml*.a           11.1, 70.8
Cray LibSci  10.4.0 (G,I,P)      libsci*.a            3.4, 4.9
Intel MKL    8.0, 8.1, 9.1       libmkl*.a            4.6, 9.0
             10.x                libmkl_core.a        4.2, 16.6
Cray MPI     3.5.1 (G,I,P)       libmpich*.a          1.3, 2.6
MPICH        1.2.7mx (G,I)       libmpich.a           1.2, 2.7
MVAPICH2     1.4, 1.5 (I)        libmpich.a           2.6, 4.8

Table 2: Library identification. G: GNU. I: Intel. P: PGI.

2. We wrote a toy Fortran program that uses the matmul intrinsic to perform matrix multiplications and compiled it with PGI 11.0. The result is as follows:

(58 times, 346766 bytes) PGI Fortran Compiler 11.x
(48 times, 56833 bytes) PGI Fortran Compiler 8.x
(45 times, 118288 bytes) PGI Fortran Compiler 10.x
(42 times, 49895 bytes) PGI Fortran Compiler 7.x
(32 times, 82808 bytes) PGI Compiler Suite 11.x
(29 times, 57166 bytes) PGI Compiler Suite 7.x
...
(2 times, 200 bytes) GCC 4.4.3
The matches include both the Fortran runtime library and compiler-specific code snippets, which are shared by the C/C++ and Fortran compilers.

3. The result also implies that PGI reuses a significant amount of code across releases. We scrutinized the code snippets which matched both versions 7.x and 11.x and found that their functionality includes memory operations (allocate, copy, zero, set), I/O setup (open, close), command-line argc/argv handling, etc.

4. Compilers which share a codebase are not easily distinguishable. Examples include Open64 and PathScale, GNU and LLVM-GCC, etc. In these cases, only the compiler-specific meta data can tell them apart, and Clang is thus far the only compiler which defies our inference efforts.
Library Identification

We applied the scanner to a subset of HPC applications (Amber [20], Charmm [21], CPMD [22], GAMESS [23], Lammps [24], NAMD [25], NWChem [26], PWscf [27]) from two HPC sites (a 3456-core Intel-based commodity PC cluster at our center and a 672-core Cray XT5m at Indiana University). We gathered signatures from numerical and MPI libraries which we know have been linked statically into the application builds. The libraries and the sizes of their constituent .o files are summarized in Table 2. Numerical libraries tend to have more .o files and larger code size per .o file; the explanation is the various processor-specialization codes and aggressive loop unrolling. For example, ACML 4.4.0-ifort64's libacml.a has 4.5K .o files, with the largest (4.1 MB of code) being an AMD-K8-tuned complex matrix multiplication (zgemm) kernel, and Intel MKL 10.3.1's libmkl_core.a has 44K .o's, with the largest (1.4 MB) being an Intel-Nehalem-optimized batched forward discrete Fourier transform code.

For the test we create a signature database exclusively from the aforementioned libraries. It has 100K signatures, and the predominant signature type is regex. The 21 HPC application binaries under test have a mean code size of 13.3 MB; the largest is NWChem 6.0 on Cray (39.4 MB, mainly due to static linking, as noted above). The scanning time t (in seconds) is well described by linear regressions in the code size x (in MB) on both the Harpertown and Nehalem test machines, and the peak memory usage is 195 MB.

Discussion

Our methodology of identifying the source compiler depends on the idiosyncrasies of the x86 platform and compilers. We also explored the two major compilers on the PowerPC platform, GCC and IBM XL, and did not find discernible compiler-specific code snippets. IBM XL compilers do inscribe their brand strings in the .comment section, but in general, content in the .comment section is subject to tampering. For example, the following line in a C program:

__asm__(".ident \"foo\"");

will emit "foo" to the .comment section. This makes the .comment section a less reliable source of compiler provenance from the general perspective of software forensics.

Another issue is that a compiler inserts its characteristic prolog code only when compiling the source file which contains the main function. So if different source files are compiled with different compilers, the resulting program binary could lack the compiler-specific code snippets one would expect. In addition, the Intel compiler does not insert processor-dispatch code if optimization is turned off either explicitly (with -O0) or implicitly (e.g. with -g).

Our approach cannot discover the compilation flags used in the program build process. Some compilers offer a switch to record the command-line options inside either .comment or other sections. For example, Intel has -sox, GCC has -frecord-gcc-switches (recorded in the .GCC.command.line section), and Open64/PathScale and Absoft do it by default. We expect this self-annotation feature to be more widely embraced by compiler developers, as they move toward better compatibility with GCC, and used by HPC programmers, as it greatly aids debugging and performance analysis.

Related Work

ALTD [13] is an effort to track software and library usage at HPC sites. It takes a proactive approach by intercepting and recording every invocation of the linker and the job scheduler.
Our work is complementary in that it performs post-mortem analysis and works on systems without ALTD.

The work by Rosenblum et al. [16] is the first attempt to infer compiler provenance. They used sophisticated machine learning, modeling and classifying the code byte stream as a linear-chain Conditional Random Field. As in most supervised learning systems, a lengthy training phase is required. The resulting system can then infer the source compiler with a probability. Their approach has several drawbacks which our method addresses: it focuses solely on executable code and ignores other parts of ELF files; the preprocessing/training phase, albeit one-time, is slow and complex; the model parameters cannot be updated incrementally with ease when a new compiler is added; and it is unclear whether their model can discern the nuances among different versions of the same compiler.

Kim's approach [19] is closest to ours in spirit, but it misses the key feature of our implementation: the relocation table. It produces a signature by copying the first 25 bytes of a library function's code verbatim. With such a short signature and no relocation information, his tool has very limited success in identifying library code snippets.
Conclusions

Reporting the provenance of compilers and libraries is crucial in an auditing and benchmarking framework for HPC systems. In this paper we have presented a simple and effective way to mine this information via signature matching. We have also demonstrated that building and updating a signature database is straightforward and needs no expert knowledge. Finally, our tests show excellent scanning speed even on very large program binaries.
Acknowledgments
This work is supported by the National Science Foundation under award number OCI 1025159. We would like to thank Gregor von Laszewski for providing access to FutureGrid computing resources.
References

[1] T. R. Furlani et al., The Workshop on Operating System Interference in High Performance Applications (OSIHPA), 2005.
[9] The 4th Annual Linux Showcase (ALS) & Conference, 2000.
[13] B. Hadri, M. Fahey, and N. Jones, "Identifying software usage at HPC centers with the automatic library tracking database." Proceedings of the 2010 TeraGrid Conference.
[14] N. Sidwell, "A common vendor ABI for C++ - GCC's why, what and not." Proceedings of the 2003 ACCU Conference.
[15] http://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
[16] N. Rosenblum, B. Miller, and X. Zhu, "Extracting compiler provenance from program binaries." The Workshop on Program Analysis for Software Tools and Engineering (PASTE), 2010.
[17] G. Johansen and B. Mauzy, "Cray XT programming environment's implementation of dynamic shared libraries." Cray User Group (CUG) Conference, 2009.
[18] J. Jelinek, http://people.redhat.com/jakub/prelink.pdf
[19] J. S. Kim, "Recovering debugging symbols from stripped static compiled binaries." Hakin9 Magazine, June 2009. http://0xbeefc0de.org/papers/
[20] D. A. Case et al., "The Amber biomolecular simulation programs." J. Comp. Chem. v 26, 1668-1688 (2005).
[21] B. R. Brooks et al., "CHARMM: The biomolecular simulation program." J. Comp. Chem.
[23] M. W. Schmidt et al., "General atomic and molecular electronic structure system." J. Comp. Chem. v 14, 1347-1363 (1993).
[24] S. J. Plimpton, "Fast parallel algorithms for short-range molecular dynamics." J. Comp. Phys. v 117, 1-19 (1995).
[25] J. C. Phillips et al., "Scalable molecular dynamics with NAMD." J. Comp. Chem. v 26, 1781-1802 (2005).
[26] M. Valiev et al., "NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations." Comput. Phys. Commun. v 181, 1477 (2010).
[27] P. Giannozzi et al.