Experimentation Procedure for Offloaded Mini-Apps Executed on Cluster Architectures with Xeon Phi Accelerators
Gary Lawson, Vaibhav Sundriyal, Masha Sosonkina, Yuzhong Shen
Department of Modeling, Simulation, and Visualization Engineering, Old Dominion University, Norfolk, VA 23529, USA
{glaws003, vsundriy, msosonki, yshen}@odu.edu

Abstract—A heterogeneous cluster architecture is complex. It contains hundreds or thousands of devices connected by a tiered communication system in order to solve a problem. As a heterogeneous system, these devices have varying performance capabilities. To better understand the interactions that occur between the various devices during execution, an experimentation procedure has been devised to capture, store, and analyze important and meaningful data. The procedure consists of various tools, techniques, and methods for capturing relevant timing, power, and performance data for a typical execution. This procedure currently applies to architectures with Intel Xeon processors and Intel Xeon Phi accelerators. It has been applied to the Co-Design Molecular Dynamics (CoMD) mini-app, courtesy of the ExMatEx team. This work aims to provide end-users with a strategy for investigating codes executed on heterogeneous cluster architectures with Xeon Phi accelerators.
I. INTRODUCTION
Measuring the performance and energy of an application can be a challenge. There are tools and methods for obtaining power and performance measurements, but accurately combining these with execution can be difficult. Further, from the point of view of a single developer, determining the critical and non-critical execution points can be tedious or overwhelming. This work aims to provide an easy-to-reproduce procedure for accurately profiling a generic application with minimal code changes. This is beneficial to the solo developer looking to optimize an application, because key phases of execution within the application fall into the measurement scheme outlined by the authors, which promotes isolation of these application phases. From the point of view of co-design, this work provides meaningful insights into the performance of an application; metrics such as memory bandwidth and computational throughput are used in place of application phases to describe execution time. However, obtaining these metrics does require execution, hence a well-described procedure has been developed.

Accelerators are often adopted to reduce time-to-solution with low energy costs. From the work of Choi et al. [1], the Xeon Phi is capable of 11 GFLOP/J and 880 MB/J for single-precision operations (measured throughput of 2 Tflop/s and 180 GB/s memory bandwidth). The Intel Xeon Phi is an accelerator that promotes high memory bandwidth (i.e., data movement) in addition to high computational throughput, and that supports various execution modes [10], [16]. The Xeon Phi also offers user-level access to important power data, but how to utilize this information is poorly documented; this work sheds light on an easy-to-implement method to read power at the highest available sampling frequency for the device. Further, unlike tools or methods that rely on reading "window" power, such as MICSMC (Software Management Controller) [14], this work yields true, instantaneous power measurements based on the connectors that supply power.

The Xeon Phi co-processor is an accelerator with several execution modes: native, offload, and symmetric. This is unique among accelerators because normal operation of an accelerator is considered the offload execution mode; this is the mode used by GPUs. The Xeon Phi supports these other modes because of its micro-OS (running a special version of Linux), which enables the device to execute applications from the device itself; to the host, the Xeon Phi may be considered an additional node. Native execution mode allows an application to be executed only on the Xeon Phi; a user logs onto the device and executes the application. Symmetric execution mode allows an application to be executed on the host and on the Xeon Phi, with each device solving a different sub-domain. Offload execution mode allows the host and accelerator to share the workload, but this execution mode requires some code changes. This is the ideal starting place for a procedure that requires some code instrumentation.

A heterogeneous cluster architecture is composed of many nodes connected by a network interconnect. Each node is composed of multiple devices with varying performance capabilities. An application executed on such an architecture must implement domain decomposition [4]: the problem is divided into independent sub-problems, or sub-domains. These sub-domains are then distributed to the nodes, one or more sub-domains per node. However, domain decomposition requires data sharing between sub-domains to solve the total problem completely.
In simpler implementations, computation of the problem and communication between sub-domains do not overlap. In this work, overlap between these two phases is not considered.

A heterogeneous cluster architecture is complex. Measuring the performance and energy consumption of such an architecture is also complex. Ideally, a measurement procedure should have low performance impact and energy overhead; measuring an execution should not degrade the performance of the application, nor dramatically increase the required power draw. A procedure should also be easy to implement on other systems, and should not require dramatic code changes. In this work, such a procedure is presented to capture important data for all devices utilized in a heterogeneous cluster architecture. Important data includes the bandwidth for various data transfers, the work performed by the Xeon Phi accelerator, and the power consumption of the various devices.

This work specifically investigates the offload execution mode [11], [9]; however, the procedure may be applied to native and symmetric execution modes as well with minimal changes. Some of these changes are introduced in this work but have not been thoroughly tested; this is future work. To measure performance, hardware counters are read using the Performance API (PAPI) [5]. Host power is measured using the Running Average Power Limit (RAPL) interface [18], and Xeon Phi power is measured by reading the power file /sys/class/micras/power [15]. Timings are gathered during execution of an application; however, additional timings are necessary beyond what is provided by default. This work explains the additional code changes required for an application to provide accurate event timings, which are used to synchronize the execution flow with the power and performance samples.

The remainder of this paper is organized as follows: Section II provides details of the cluster and application requirements, such as software, hardware, and code changes. Section III discusses defining the experiment and the post-execution data processing involved in profiling execution output. Finally, Section IV concludes.

II. INITIAL STEPS
Before experimentation may proceed, certain software and hardware are required to accurately measure an execution. In general, the authors assume the cluster is composed of Intel Xeon processors of the Sandy Bridge or newer micro-architecture supplied with one or more Intel Xeon Phi accelerators. This architecture configuration is assumed to be the most common (among hybrid Xeon and Xeon Phi clusters), and provides the hardware counters necessary for model parameter estimation. However, it is not a requirement that the processors be of the Intel brand or that the accelerator be a Xeon Phi if similar measurements may be obtained.

Beyond compiling the application [8] and executing in an MPI (Message Passing Interface) environment [7], the cluster architecture should have access to RAPL on the host, and a version of PAPI for both the host and the native Xeon Phi, even for offload execution. Although the application will require calls to PAPI from within the offload section, PAPI itself should be natively compiled and the library accessible during compilation and execution of the application. For offload execution, a host version of PAPI is also required because offload sections may be executed on the host in the event of a conditional offload, or should "no-offload" be enabled during compilation. It is also important to note that for native Xeon Phi hardware counters, only versions 5.3.0 and 5.3.2 currently support this functionality; more recent versions of PAPI (up to 5.4.1 as of this writing) are unable to convert native hardware counter names into codes and are therefore unable to access the counters. Removing the dependency on PAPI from the parameter estimation procedure will be done by the authors in the future, since PAPI provides limited support for the Xeon Phi.

It is important to note that with the more recent versions of MPSS, configuring PAPI requires more than what is provided in the instructions for the PAPI 5.3.X versions. Specifically, the authors followed the instructions from version 5.4.1 [5] to use the following configuration options:
• --with-mic
• --host=x86_64-k1om-linux
• --with-arch=k1om
• --with-ffsll
• --with-walltimer=cycle
• --with-tls=__thread
• --with-virtualtimer=clock_thread_cputime_id
along with all the other configuration steps as per the instructions for PAPI versions 5.3.X.
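For reference, a configure invocation combining these options might look as follows; this is a sketch only, and the cross-compiler variable and install prefix are assumptions that depend on the local MPSS installation:

```sh
# Hypothetical cross-configuration of PAPI for the native Xeon Phi (k1om);
# CC and --prefix depend on the local MPSS installation.
./configure --with-mic \
    --host=x86_64-k1om-linux \
    --with-arch=k1om \
    --with-ffsll \
    --with-walltimer=cycle \
    --with-tls=__thread \
    --with-virtualtimer=clock_thread_cputime_id \
    CC="icc -mmic" \
    --prefix=$HOME/papi-mic
make && make install
```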
A. Code Instrumentation

This section defines the code changes required to implement offload execution in CoMD and to obtain highly accurate timings for the computation and communication phases on each device. These timings may then be cross-referenced with the power and performance measurements to define various performance metrics and the total energy consumption. The associated code changes and micro-measurement apps are described below. Each micro-measurement app spawns only a single thread to sample and print.
1) CoMD Overview:
CoMD is a proxy application developed as part of the Department of Energy co-design research effort [2] at the Extreme Materials at Extreme Scale [3] (ExMatEx) center. CoMD is a compute-intensive application where approximately 85–90% of the execution time is spent computing forces. Although two methods are available for the force computation, this work focuses on only one of them: the more complex and accurate EAM force kernel for short-range material response simulations, such as uncharged metallic materials [17]. The EAM kernel was chosen because its parallel performance generally receives less attention than the more commonly used Lennard-Jones potential, which lends itself readily to parallelism.
2) Setup of the Offload Execution Mode:
Offloading to MIC requires the use of special pragmas defined for the Intel compilers. These pragmas specify the code sections to be processed by the Xeon Phi accelerator. Within the pragma statement, one must specify the MIC device to communicate with, the data to be transferred with the associated parameters (such as array length, data persistence, variable reassignment, etc.), the offload conditional, and whether the offload event is asynchronous, among other options [8]. However, in addition to simply specifying which code sections to process on the Xeon Phi, the arrays must be properly formatted for optimal transfer bandwidth.

It is possible, although inefficient, to transfer multi-dimensional arrays between the host and Xeon Phi. Therefore, algorithm structures should conform to the structure-of-arrays data layout; CoMD is originally organized as an array-of-structures, which does not transfer easily. This code change simply requires converting the multi-dimensional arrays into one-dimensional arrays. In the most recent experiments, CoMD is measured to obtain more than 3 GB/s bandwidth over the PCI bus (of 8 GB/s) for a problem size of 50 (500,000 atoms); the resulting communication time is insignificant with respect to the remaining computational requirements. This is one of many measurements to be obtained using the procedure. One final code improvement is the re-assignment of the maximum number of atoms per link cell: by default, a link cell may contain 64 atoms, but this has been reduced to 16 to lower the memory requirements per thread. PAPI must be instrumented into the offload sections such that memory usage and bandwidth may be approximated during execution. PAPI is simply started and stopped for each offload section such that the counter is always reset for the next offload; the result is printed with the application output. SSE3 instructions have been enforced during compilation [12], and utilization of the 2 MB buffers available through the environment variable [6] has been implemented.
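To illustrate, a minimal offload region in the style described above might look as follows; the variable and function names (atoms_x, force_x, computeForceEAM, nLocal, and the flags) are hypothetical stand-ins, not CoMD's actual identifiers:

```c
/* Sketch of an Intel offload pragma for a flattened (structure-of-arrays)
 * force kernel; all names are illustrative, not taken from CoMD. */
#pragma offload target(mic:0) if(useOffload)              \
        in(atoms_x, atoms_y, atoms_z : length(nLocal)     \
           alloc_if(firstOffload) free_if(0))             \
        out(force_x, force_y, force_z : length(nLocal)    \
           alloc_if(firstOffload) free_if(lastOffload))
{
    /* PAPI counters are started/stopped inside the offload section */
    computeForceEAM(atoms_x, atoms_y, atoms_z,
                    force_x, force_y, force_z, nLocal);
}
```

The alloc_if/free_if clauses provide the data persistence mentioned above: buffers are allocated on the first offload and freed only on the last, so that repeated offloads reuse the device-side arrays.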
3) Synchronization of Measuring Event Timers:
Power and performance measurements are obtained for each device individually. This approach removes unnecessary overhead because devices are not required to communicate measurement data during execution. However, it also requires the use of three (or more) separate timers: the algorithm timer, the host timer, and the accelerator timer(s). To synchronize these data files, two timings are output with each measurement or event output statement: the local time in the format [HH:MM:SS], and the time from start as a decimal value. The time from start (TFS) value represents the time elapsed since the start of each measurement tool or of algorithm execution. The local timestamp ensures all timings are accurate to within one second; with the addition of TFS, the error in timings is reduced to a fraction of a second (within 20 ms for the host, and 100 ms for the Xeon Phi).

Until a more sophisticated and automatic method is developed, direct source code manipulation is the simplest solution to start with. However, as offload execution already requires source code manipulation, the additional event timing statements are reasonable, especially for CoMD, which features robust profiling output by default. Specifically, CoMD provides four specific functions which must occur within one iteration of the simulation; for other codes, one or more functions may be required, but in general it is most important to quantify the total simulation time, the time spent performing offload execution, and the time spent transferring data during the communication phase.
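A minimal sketch of such an event-timing statement is shown below; the helper names (tfs, log_event) and the use of CLOCK_MONOTONIC are assumptions for illustration:

```c
#include <stdio.h>
#include <time.h>

static double t_start;  /* set once at tool/application start */

/* Time-from-start (TFS) in seconds, as a decimal value. */
static double tfs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9 - t_start;
}

/* Print an event with both the [HH:MM:SS] local timestamp and TFS. */
static void log_event(const char *name) {
    char hms[16];
    time_t now = time(NULL);
    strftime(hms, sizeof hms, "%H:%M:%S", localtime(&now));
    printf("[%s] %.6f %s\n", hms, tfs(), name);
}
```

Emitting both fields from every tool and from the application itself is what allows the separate data files to be aligned afterwards: the coarse [HH:MM:SS] stamp anchors the files to within a second, and TFS resolves the remainder.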
4) Obtaining Execution Time Values:
The execution timings of interest are specifically: the time to compute on the host, to communicate on the host, to compute on the Xeon Phi, and to transfer data over the PCI bus. To obtain these timings, the application output must be consulted: for CoMD, the timings are excellently profiled, although only the root timings are provided in their entirety. For other applications, additional timings may be required to obtain the host computation time; this will be investigated in the future. The execution timings, which are already collected for each sub-domain, have been exposed to obtain exact timings per sub-domain. Although this is not of much interest in these offload-only execution experiments, the authors' preliminary investigation into executions with multiple execution modes showed this information to be crucial, and it has thus been maintained for future investigations until an improved method has been determined.
5) Using CPU Hardware Counters:
The host CPU micro-measurement application has been developed to continuously read the RAPL power counters for CPU core and DRAM power; the sum is regarded as total CPU power, as uncore device power is not considered. Additionally, host performance is measured using PAPI, where the last-level-cache memory fill counter and the unhalted CPU cycles are measured; the hardware counter name differs slightly depending on the micro-architecture. For Sandy Bridge, MEM_LOAD_UOPS_MISC_RETIRED:LLC_MISS and CPU_CLK_UNHALTED:THREAD_P are the native hardware counters used to approximate host memory usage and bandwidth. For Ivy Bridge, unhalted cycles are captured using the same counter as on Sandy Bridge, but the LLC memory counter is MEM_LOAD_UOPS_RETIRED:L3_MISS. Power and performance are sampled at a rate of 10 ms; the resulting data is printed to output with the timestamp and TFS for synchronization.
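A sketch of the host sampling loop is given below, assuming PAPI 5.x and the Sandy Bridge counter names above; the RAPL read is elided for brevity, and error checking is omitted:

```c
#include <papi.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int evset = PAPI_NULL;
    long long v[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    /* Native counter names for Sandy Bridge; substitute
     * MEM_LOAD_UOPS_RETIRED:L3_MISS on Ivy Bridge. */
    PAPI_add_named_event(evset, "MEM_LOAD_UOPS_MISC_RETIRED:LLC_MISS");
    PAPI_add_named_event(evset, "CPU_CLK_UNHALTED:THREAD_P");
    PAPI_start(evset);

    for (;;) {
        usleep(10000);                 /* 10 ms sampling period */
        PAPI_read(evset, v);
        /* RAPL core+DRAM power would be read and summed here as well. */
        printf("%lld %lld\n", v[0], v[1]);  /* prefix [HH:MM:SS] and TFS */
        PAPI_reset(evset);
    }
    return 0;
}
```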
6) Using MIC Hardware Counters:
The MIC micro-measurement application has been developed to continuously read the available power file [15], /sys/class/micras/power, which provides approximated power over two time windows, power to each connector, and voltage and power readings for the core, uncore, and DRAM devices. (MIC stands for the "many integrated core" technology used in the Intel Xeon Phi.) Unlike the host CPU definition of power, MIC power is based on the power draw measured at each connector: PCI-E, 2x3, and 2x4. The sum over the connectors is the absolute power draw for the device, as defined in the Xeon Phi data sheet [9]. The power file is updated only every 50 ms, which is thus the finest sampling granularity available for the device.

For offload execution, the MIC micro-measurement app only measures power; however, for native or symmetric execution, this micro-app would also measure performance with PAPI. For the Xeon Phi, the native hardware counters of interest are L2_DATA_READ_MISS_MEM_FILL and CPU_CLK_UNHALTED; these are used to obtain estimates of memory usage and bandwidth. For offload execution, these hardware counters are instead measured over the duration of each offload section and printed with the application output. To obtain an estimate of vectorization intensity, a few executions (although one is really all that is needed) should be measured using the hardware counters VPU_ELEMENTS_ACTIVE and VPU_INSTRUCTIONS_EXECUTED, where elements divided by instructions equates to vectorization intensity. In general, the value should be between 1 and 8 for double-precision, and 1 and 16 for single-precision [13].
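A minimal sketch of the MIC power reader described above, run natively on the coprocessor, follows; the output handling is an assumption:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[1024];
    for (;;) {
        FILE *f = fopen("/sys/class/micras/power", "r");
        if (!f) return 1;
        size_t n = fread(buf, 1, sizeof buf - 1, f);
        buf[n] = '\0';
        fclose(f);
        /* Prefix each sample with [HH:MM:SS] and TFS (see Sec. II-A)
         * so it can be synchronized with the application timeline. */
        fputs(buf, stdout);
        usleep(50000);  /* the file is updated only every 50 ms */
    }
}
```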
III. PROFILING EXECUTIONS
In this work, an execution is more than simply running the application; it requires properly measuring CPU and Xeon Phi power and performance, and synchronizing this output with the application's execution. A properly defined experiment must be presented, and the process of mining the raw output data is also discussed. The result of this process is a set of measured performance metrics that describe various attributes of an application, such as the total workload for the accelerator, defined in FLOPs, and the bandwidths for many different data-transfer situations.

The executions are always run with the offload report environment variable (OFFLOAD_REPORT) set to 2; this provides the MIC time, the CPU time (if available), and the data transferred to and from the device. To distinguish the offload reports of the various sub-domains, it is advised that MPI be executed with the '-l' option to print the sub-domain identification number.

A. The Experiment
The experiment should be designed such that all investigated parameters are meaningful. In this work, six meaningful parameters have been chosen: the system, the number of nodes, the number of Xeon Phi per node, the total problem size, the host frequency (no DVFS), and the number of Xeon Phi cores used. In general, the authors are interested in determining the optimal configuration (defined by all six parameters), which is given by a static configuration set (defined by system, nodes, MICs per node, and problem size) and a configuration space (defined by host frequency and MIC cores). On the Borges system, two static configurations per problem size are investigated, with one and with two MICs, because the system consists of a single node. On Bolt, six static configurations per problem size are investigated: Bolt has three nodes with two Xeon Phi each, hence one-, two-, and three-node configurations are investigated, each with one and with two Xeon Phi used. The parameters have been grouped into static configurations and a configuration space because static configurations may easily be compared with one another once each is characterized by a minimum energy; the minimum energy is found within the configuration space, because those parameters impact execution energy and performance. Note that although the number of Xeon Phi also impacts energy, it is often desirable to compare the performance and energy for each count investigated.

An experiment is composed of many executions; each may vary in configuration, but each follows the same execution process to ensure minimal measurement overhead. The process for a typical execution is as follows (a driver sketch is given at the end of this subsection):
1) Start the CPU micro-measurement app on all nodes
2) Start the MIC micro-measurement app on all Xeon Phi
3) Sleep 20 seconds
4) Execute CoMD
5) Sleep 10 seconds
6) Stop the MIC micro-measurement app on all Xeon Phi
7) Stop the CPU micro-measurement app on all nodes
8) Copy the MIC power output files from the MICs to storage
9) Sleep 60 seconds
Ample idle time is provided before execution begins to ensure that a sufficient number of power samples is obtained for each device, such that idle power may be measured. For larger clusters, the duration of step three may need to be adjusted. The command to start each MIC power measurement is issued using SSH, which incurs a slight delay before power measurements may begin. Idle power measurements are based on at least 10 seconds of sample data. CoMD is then executed according to the execution configuration parameters. Upon completion, a brief idle period is provided to capture power measurements before the CPU and MIC measurement threads are halted. Finally, a rest period of one minute is provided to allow the system to cool down. Ideally this should be longer, but a typical experiment consists of hundreds or thousands of executions. The time spent allowing the system to rest accounts for the majority of the total execution time of an experiment.
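A driver for one execution might be sketched as follows; the host lists, binary names, and CoMD arguments are placeholders, not the authors' actual scripts:

```sh
#!/bin/bash
# Hypothetical single-execution driver; NODES, MICS, NP, and the
# CoMD command line are placeholders for the actual configuration.
for n in $NODES; do ssh "$n" "./cpu_measure > cpu_$n.out 2>&1 &"; done
for m in $MICS;  do ssh "$m" "./mic_measure > /tmp/mic.out 2>&1 &"; done
sleep 20                                  # collect idle power samples
mpirun -l -n "$NP" ./CoMD-offload -e -x 50 -y 50 -z 50 > comd.out
sleep 10                                  # trailing idle window
for m in $MICS;  do ssh "$m" "pkill mic_measure"
                    scp "$m:/tmp/mic.out" "mic_$m.out"; done
for n in $NODES; do ssh "$n" "pkill cpu_measure"; done
sleep 60                                  # cool-down before the next run
```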
B. Post-Execution Profiling
Upon successful completion of the experiment, a large number of output files are available for post-execution processing. This process involves synchronizing the measurement output with the application execution to properly quantify the timings, power measurements, and performance of the various phases and states involved in execution. These raw metrics are then used to establish the estimated global parameters that define each static configuration.
1) Obtaining Execution Time Values:
Extracting execution time is fairly simple with CoMD. There are four main functions that occur every iteration: update positions, update velocities, compute forces, and share data between sub-domains; for EAM, there is an additional data transfer during the force computation, since it is a more specific algorithm. Host execution time is defined as the time to update positions, plus velocities, plus data redistribution, minus the data transfer occurring within the redistribution phase. Host communication time is the sum of both communication phases (within the redistribution and force phases), but is more accurately defined as two communication timings: halo exchange and reduce. This is preferred because, with respect to optimization, it is more interesting to separate the point-to-point data transfers from the reduce function timings. Xeon Phi computation time is based on the sum of the offload-report MIC times over all offloads throughout the simulation. PCI transfer time is based on the offload report as well; it is the difference between the sum of CPU time and the sum of MIC time over all offloads during the simulation. If CPU time is undefined (reports 0.0000 seconds), PCI time may be approximated as the difference between the total simulation time (defined as loop in the CoMD output timings) and the time to compute and communicate on the host and to compute on the accelerator.
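In equation form, writing $T^{\mathrm{off}}_{\mathrm{CPU},i}$ and $T^{\mathrm{off}}_{\mathrm{MIC},i}$ for the CPU and MIC times reported for offload $i$ (notation introduced here for clarity, not taken from the offload report itself):

```latex
T_{\mathrm{PCI}} = \sum_i T^{\mathrm{off}}_{\mathrm{CPU},i}
                 - \sum_i T^{\mathrm{off}}_{\mathrm{MIC},i},
\qquad
T_{\mathrm{PCI}} \approx T_{\mathrm{loop}}
                 - T^{\mathrm{comp}}_{\mathrm{host}}
                 - T^{\mathrm{comm}}_{\mathrm{host}}
                 - T^{\mathrm{comp}}_{\mathrm{MIC}}
\quad \text{(when CPU time is unreported).}
```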
2) Obtaining Power Consumption Values:
Extracting the power draw for each state, idle or active, is accomplished slightly differently for each device, because the idle times and durations differ for each device. For the host, the active state is defined by the host computation time; during host communication, PCI data transfer, or computation on the accelerator, the host remains idle. For the Xeon Phi, the active state is defined by the time to compute on the accelerator, and the device is otherwise idle. To synchronize power draw with execution state, the TFS in each file has to be synchronized to the TFS values from the application for the key execution phases. Host idle power is accumulated during host communication and the entire force computation, because the latter includes the accelerator computation and PCI data transfer. All other power is accumulated in host active power. For the Xeon Phi, active power is accumulated during offload execution; otherwise, the current power sample is accumulated in idle power. Each accumulated power is finally divided by the number of samples in the corresponding state (always greater than 100 samples).
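A sketch of this accumulation for the Xeon Phi is given below; the sample structure and the phase test are hypothetical:

```c
/* Hypothetical binning of Xeon Phi power samples into active/idle
 * states; sample[] holds (tfs, watts) pairs already aligned to the
 * application timeline via the shared TFS values. */
double p_active = 0.0, p_idle = 0.0;
int    n_active = 0,   n_idle = 0;

for (int i = 0; i < nSamples; i++) {
    if (inOffloadPhase(sample[i].tfs)) {  /* active during offloads */
        p_active += sample[i].watts; n_active++;
    } else {
        p_idle   += sample[i].watts; n_idle++;
    }
}
p_active /= n_active;  /* average active power (>100 samples) */
p_idle   /= n_idle;    /* average idle power   (>100 samples) */
```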
3) Obtaining Performance Values:
Extracting the performance metrics for each phase of execution is fairly straightforward with the required code instrumentation. For the Xeon Phi, memory usage is simply the LLC miss counter multiplied by 64 bytes per cache line; bandwidth is memory usage times frequency divided by the unhalted clock cycles. Because these are measured explicitly for the offload sections, they may simply be summed over all offloads during the simulation. For the host, the performance samples must first be synchronized with execution and assigned to the appropriate phase; but because only the host communication phase is of interest with respect to performance, this is the only phase in which the counters are accumulated. The host follows the same simple formulas for computing memory usage and bandwidth as the Xeon Phi. PCI memory usage is summed over each offload report within the execution of the simulation by adding together the data sent to and received from the device. PCI bandwidth is estimated as the amount of data transferred divided by the transfer time.

Finally, the work performed on the Xeon Phi may be estimated by multiplying computational throughput and computation time. Computational throughput may be calculated as the product of the number of cores, the vectorization intensity, the average number of operations per cycle, and the operational frequency. For CoMD, the vectorization intensity is measured to be 2.6; operations per cycle is estimated to be 1.15, as few fused-multiply operations are vectorized in this version of CoMD. These two values depend heavily on the implementation and application and must be measured and approximated appropriately for each application. Operations per cycle may be approximated by cross-referencing the compiler vectorization report and the source code to determine which fused-multiply operations have been vectorized; all other operations count as one. Then, assuming each operation were to count only as one, take the ratio of vectorized to non-vectorized operations.
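Collecting these relations in equation form, with symbols introduced here for convenience ($N_{\mathrm{miss}}$ LLC misses, $C$ unhalted cycles, $f$ frequency, $n_c$ cores, $V$ vectorization intensity, $\mathit{OPC}$ operations per cycle):

```latex
\mathrm{Mem} = 64\,\mathrm{B} \times N_{\mathrm{miss}},
\qquad
\mathrm{BW} = \frac{\mathrm{Mem} \times f}{C},
\qquad
W = P_{\mathrm{comp}} \times T_{\mathrm{comp}},
\quad
P_{\mathrm{comp}} = n_c \times V \times \mathit{OPC} \times f.
```

For CoMD on the Xeon Phi, the measured values $V = 2.6$ and $\mathit{OPC} = 1.15$ given above would be substituted into $P_{\mathrm{comp}}$.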
IV. CONCLUSIONS AND FUTURE WORK
This work has provided a detailed procedure through which developers may profile applications offloaded to accelerators and produce meaningful conclusions and insights. At this stage, only the mini-application CoMD has been investigated thoroughly with the procedure, but it is of the utmost importance to validate the technique on many other mini-applications in the future. Additionally, reducing the number of executions needed to accurately measure an application is a high priority, because this would provide larger datasets in a fraction of the time. Currently, an experiment requires several days to complete in the cluster environment, because each configuration change requires a new execution and the associated system cool-down time. Finally, collecting hardware counter measurements without the aid of PAPI is also to be investigated.
ACKNOWLEDGMENTS
This work was supported in part by the Air Force Office of Scientific Research under the AFOSR award FA9550-12-1-0476, by the National Science Foundation grants NSF/OCI-0941434, 0904782, 1047772, and by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, through the Ames Laboratory, operated by Iowa State University under contract No. DE-AC02-07CH11358.
REFERENCES
[1] Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. Algorithmic time, energy, and power on candidate HPC compute building blocks. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, Arizona, USA, May 2014.
[2] DOE. Co-design, 2013. http://science.energy.gov/….
[3] ExMatEx. Extreme Materials at Extreme Scale, 2013. http://exmatex.org/comd.html.
[4] R. Hayashi and S. Horiguchi. Domain decomposition scheme for parallel molecular dynamics simulation. In High Performance Computing on the Information Superhighway, HPC Asia '97, pages 595–600, April 1997.
[5] ICL:UT. Performance Application Programming Interface (PAPI), 2015. http://icl.cs.utk.edu/papi/.
[6] Intel. How to use huge pages to improve application performance on Intel Xeon Phi coprocessor, 2012. https://software.intel.com/sites/default/files/Large_pages_mic_0.pdf.
[7] Intel. Intel MPI library reference manual, 2014. https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/ReferenceManual/index.htm.
[8] Intel. Intel C++ and Fortran compilers, 2015. https://software.intel.com/….
[9] Intel. Intel Xeon Phi coprocessor datasheet, 2013.
[10] James Reinders, Intel. An overview of programming for Intel Xeon processors and Intel Xeon Phi coprocessors, 2013. https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors1.pdf.
[11] Kevin Davis, Intel. Effective use of the Intel compiler's offload features, 2013. https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features.
[12] Martyn Corden, Intel. How to compile for Intel AVX, 2012. https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx.
[13] Shannon Cepeda, Intel. Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: Understanding and using hardware events, 2012. https://software.intel.com/en-us/articles/….
[14] Taylor Kidd, Intel. Intel Xeon Phi coprocessor power management configuration: Using the micsmc command-line interface, 2014. https://software.intel.com/en-us/blogs/2014/01/31/intel-xeon-phi-coprocessor-power-management-configuration-using-the-micsmc-command.
[15] Intel Forums. How to measure energy consumption on the coprocessor?, 2013. https://software.intel.com/en-us/forums/topic/488782?language=en.
[16] loc-nguyen, Intel. Intel Xeon Phi coprocessor developer's quick start guide, 2015. https://software.intel.com/….
[17] LANL. ADTSC science highlights, 2013. http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf.
[18] PAPI. Accessing the Intel RAPL registers, 2013. http://icl.cs.utk.edu/….