Experimentation Procedure for Offloaded Mini-Apps Executed on Cluster Architectures with Xeon Phi Accelerators
Gary Lawson, Vaibhav Sundriyal, Masha Sosonkina, Yuzhong Shen
Department of Modeling, Simulation, and Visualization Engineering, Old Dominion University, Norfolk, VA 23529, USA
{glaws003, vsundriy, msosonki, yshen}@odu.edu

Abstract—A heterogeneous cluster architecture is complex. It contains hundreds or thousands of devices connected by a tiered communication system in order to solve a problem. As a heterogeneous system, these devices have varying performance capabilities. To better understand the interactions that occur between the various devices during execution, an experimentation procedure has been devised to capture, store, and analyze important and meaningful data. The procedure consists of various tools, techniques, and methods for capturing relevant timing, power, and performance data for a typical execution. This procedure currently applies to architectures with Intel Xeon processors and Intel Xeon Phi accelerators. It has been applied to the Co-Design Molecular Dynamics (CoMD) mini-app, courtesy of the ExMatEx team. This work aims to provide end-users with a strategy for investigating codes executed on heterogeneous cluster architectures with Xeon Phi accelerators.
I. INTRODUCTION
Measuring the performance and energy of an application can be a challenge. There are tools and methods for obtaining power and performance measurements, but accurately combining these with execution can be difficult. Further, from the point of view of a single developer, determining the critical and non-critical execution points can be tedious or overwhelming. This work aims to provide an easy-to-reproduce procedure for accurately profiling a generic application with minimal code changes. This is beneficial to the solo developer looking to optimize an application, because key phases of execution within the application fall into the measurement scheme outlined by the authors, which promotes isolation of these application phases. From the point of view of co-design, this work provides meaningful insights into the performance of an application; metrics such as memory bandwidth and computational throughput are used in place of application phases to describe execution time. However, obtaining these metrics does require execution, hence a well-described procedure has been developed.

Accelerators are often adopted to reduce time-to-solution with low energy costs. From the work of Choi et al. [1], the Xeon Phi is capable of 11 GFLOP/J and 880 MB/J for single-precision operations (measured throughput of 2 Tflop/s and 180 GB/s memory bandwidth). The Intel Xeon Phi is an accelerator that promotes high memory bandwidth (i.e., data movement) in addition to high computational throughput, and that supports various execution modes [10], [16]. The Xeon Phi also offers user-level access to important power data, but how to utilize this information is poorly documented; this work sheds light on an easy-to-implement method to read power at the highest available sampling frequency for the device. Further, unlike tools or methods that rely on reading "window" power, such as MICSMC (Software Management Controller) [14], this work yields true, instantaneous power measurements based on the connectors that supply power.

The Xeon Phi co-processor is an accelerator with several execution modes: native, offload, and symmetric. This is unique among accelerators because normal operation of an accelerator is considered the offload execution mode; this is the mode used by GPUs. The Xeon Phi supports these other modes because of its micro-OS (running a special version of Linux), which enables the device to execute applications from the device itself; to the host, the Xeon Phi may be considered an additional node. Native execution mode allows an application to be executed only on the Xeon Phi; a user logs onto the device and executes the application. Symmetric execution mode allows an application to be executed on the host and on the Xeon Phi, with each device solving a different sub-domain. Offload execution mode allows the host and accelerator to share the workload, but this execution mode requires some code changes. This is the ideal starting place for a procedure that requires some code instrumentation.

A heterogeneous cluster architecture is composed of many nodes connected by a network interconnect. Each node is composed of multiple devices with varying performance capabilities. An application executed on such an architecture must implement domain decomposition [4]: the problem is divided into independent sub-problems, or sub-domains. These sub-domains are then distributed to the nodes, one or more sub-domains per node. However, domain decomposition requires data sharing between sub-domains to solve the total problem completely.
In simpler implementations, computation of the problem and communication between sub-domains do not overlap. In this work, overlap between these two phases is not considered.

A heterogeneous cluster architecture is complex. Measuring the performance and energy consumption of such an architecture is also complex. Ideally, a measurement procedure should have low performance impact and energy overhead; measuring an execution should not degrade the performance of the application, nor dramatically increase the required power draw. A procedure should also be easy to implement on other systems, and should not require dramatic code changes. In this work, such a procedure is presented to capture important data for all devices utilized in a heterogeneous cluster architecture. Important data includes the bandwidth for various data transfers, the work performed by the Xeon Phi accelerator, and the power consumption of the various devices.

This work specifically investigates the offload execution mode [11], [9]; however, the procedure may be applied to native and symmetric execution modes as well with minimal changes. Some of these changes are introduced in this work but have not been thoroughly tested; this is future work. To measure performance, hardware counters are read using the Performance API (PAPI) [5]. Host power is measured using the Running Average Power Limit (RAPL) interface [18], and Xeon Phi power is measured by reading the power file /sys/class/micras/power [15]. Timings are gathered during execution of an application; however, additional timings are necessary beyond what is provided by default. This work explains the additional code changes required for an application to provide accurate event timings, which are used to synchronize the execution flow with the power and performance samples.

The remainder of this paper is organized as follows: Section II provides details of the cluster and application requirements, such as software, hardware, and code changes. Section III discusses defining the experiment and the post-execution data processing involved in profiling execution output. Finally, Section IV concludes.

II. INITIAL STEPS
Before experimentation may proceed, certain software and hardware are required to accurately measure an execution. In general, the authors assume the cluster is composed of Intel Xeon processors of the Sandy Bridge or newer micro-architecture supplied with one or more Intel Xeon Phi accelerators. This architecture configuration is assumed to be the most common (among hybrid Xeon and Xeon Phi clusters), and provides the hardware counters necessary for model parameter estimation. However, it is not a requirement that the processors be of the Intel brand or that the accelerator be a Xeon Phi if similar measurements may be obtained.

Beyond compiling the application [8] and executing in an MPI (Message Passing Interface) environment [7], the cluster architecture should have access to RAPL on the host, and a version of PAPI for both the host and the native Xeon Phi, even for offload execution. Although the application will require calls to PAPI from within the offload section, PAPI itself should be natively compiled and the library accessible during compilation and execution of the application. For offload execution, a host version of PAPI is also required because offload sections may be executed on the host in the event of a conditional offload, or should "no-offload" be enabled during compilation. It is also important to note that for native Xeon Phi hardware counters, only versions 5.3.0 and 5.3.2 currently support this functionality; more recent versions of PAPI (up to 5.4.1 as of this writing) are unable to convert native hardware counter names into codes and are therefore unable to access the counters. Removing the dependency on PAPI from the parameter estimation procedure will be done by the authors in the future, since PAPI provides limited support for the Xeon Phi.

It is important to note that with the more recent versions of MPSS, configuring PAPI requires more than what is provided in the instructions for the PAPI 5.3.X versions. Specifically, the authors followed the instructions from version 5.4.1 [5] to use the following configuration options:
• --with-mic
• --host=x86_64-k1om-linux
• --with-arch=k1om
• --with-ffsll
• --with-walltimer=cycle
• --with-tls=__thread
• --with-virtualtimer=clock_thread_cputime_id
along with all the other configuration steps as per the instructions for PAPI versions 5.3.X.
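For reference, a configure invocation combining these options might look as follows; this is a sketch only, and the cross-compiler variable and install prefix are assumptions that depend on the local MPSS installation:

```sh
# Hypothetical cross-configuration of PAPI for the native Xeon Phi (k1om);
# CC and --prefix depend on the local MPSS installation.
./configure --with-mic \
    --host=x86_64-k1om-linux \
    --with-arch=k1om \
    --with-ffsll \
    --with-walltimer=cycle \
    --with-tls=__thread \
    --with-virtualtimer=clock_thread_cputime_id \
    CC="icc -mmic" \
    --prefix=$HOME/papi-mic
make && make install
```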
A. Code Instrumentation

This section defines the code changes required to implement offload execution in CoMD and to obtain highly accurate timings for the computation and communication phases on each device. These timings may then be cross-referenced with the power and performance measurements to define various performance metrics and the total energy consumption. The associated code changes and micro-measurement apps are described below. Each micro-measurement app spawns only a single thread to sample and print.
1) CoMD Overview:
CoMD is a proxy application developed as part of the Department of Energy co-design research effort [2] at the Extreme Materials at Extreme Scale [3] (ExMatEx) center. CoMD is a compute-intensive application where approximately 85–90% of the execution time is spent computing forces. Although two methods are available for the force computation, this work focuses on only one of them: the more complex and accurate EAM force kernel for short-range material response simulations, such as uncharged metallic materials [17]. The EAM kernel was chosen because its parallel performance generally receives less attention than the more commonly used Lennard-Jones potential, which lends itself readily to parallelism.
2) Setup of the Offload Execution Mode:
Offloading to MIC requires the use of special pragmas defined for the Intel compilers. These pragmas specify the code sections to be processed by the Xeon Phi accelerator. Within the pragma statement, one must specify the MIC device to communicate with, the data to be transferred with the associated parameters (such as array length, data persistence, variable reassignment, etc.), the offload conditional, and whether the offload event is asynchronous, among other options [8]. However, in addition to simply specifying which code sections to process on the Xeon Phi, the arrays must be properly formatted for optimal transfer bandwidth.

It is possible, although inefficient, to transfer multi-dimensional arrays between the host and Xeon Phi. Therefore, algorithm structures should conform to the structure-of-arrays data layout; CoMD is originally organized as an array-of-structures, which does not transfer easily. This code change simply requires converting the multi-dimensional arrays into one-dimensional arrays. In the most recent experiments, CoMD is measured to obtain more than 3 GB/s bandwidth over the PCI bus (of 8 GB/s) for a problem size of 50 (500,000 atoms); the resulting communication time is insignificant with respect to the remaining computational requirements. This is one of many measurements to be obtained using the procedure. One final code improvement is the re-assignment of the maximum number of atoms per link cell: by default, a link cell may contain 64 atoms, but this has been reduced to 16 to lower the memory requirements per thread. PAPI must be instrumented into the offload sections such that memory usage and bandwidth may be approximated during execution. PAPI is simply started and stopped for each offload section such that the counter is always reset for the next offload; the result is printed with the application output. SSE3 instructions have been enforced during compilation [12], and utilization of the 2 MB buffers available through the environment variable [6] has been implemented.
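To illustrate, a minimal offload region in the style described above might look as follows; the variable and function names (atoms_x, force_x, computeForceEAM, nLocal, and the flags) are hypothetical stand-ins, not CoMD's actual identifiers:

```c
/* Sketch of an Intel offload pragma for a flattened (structure-of-arrays)
 * force kernel; all names are illustrative, not taken from CoMD. */
#pragma offload target(mic:0) if(useOffload)              \
        in(atoms_x, atoms_y, atoms_z : length(nLocal)     \
           alloc_if(firstOffload) free_if(0))             \
        out(force_x, force_y, force_z : length(nLocal)    \
           alloc_if(firstOffload) free_if(lastOffload))
{
    /* PAPI counters are started/stopped inside the offload section */
    computeForceEAM(atoms_x, atoms_y, atoms_z,
                    force_x, force_y, force_z, nLocal);
}
```

The alloc_if/free_if clauses provide the data persistence mentioned above: buffers are allocated on the first offload and freed only on the last, so that repeated offloads reuse the device-side arrays.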
3) Synchronization of Measuring Event Timers:
Power and performance measurements are obtained for each device individually. This approach removes unnecessary overhead because devices are not required to communicate measurement data during execution. However, it also requires the use of three (or more) separate timers: the algorithm timer, the host timer, and the accelerator timer(s). To synchronize these data files, two timings are output with each measurement or event output statement: the local time in the format [HH:MM:SS], and the time from start as a decimal value. The time from start (TFS) value represents the time elapsed since the start of each measurement tool or of algorithm execution. The local timestamp ensures all timings are accurate to within one second; with the addition of TFS, the error in timings is reduced to a fraction of a second (within 20 ms for the host, and 100 ms for the Xeon Phi).

Until a more sophisticated and automatic method is developed, direct source code manipulation is the simplest solution to start with. However, as offload execution already requires source code manipulation, the additional event timing statements are reasonable, especially for CoMD, which features robust profiling output by default. Specifically, CoMD provides four specific functions which must occur within one iteration of the simulation; for other codes, one or more functions may be required, but in general it is most important to quantify the total simulation time, the time spent performing offload execution, and the time spent transferring data during the communication phase.
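A minimal sketch of such an event-timing statement is shown below; the helper names (tfs, log_event) and the use of CLOCK_MONOTONIC are assumptions for illustration:

```c
#include <stdio.h>
#include <time.h>

static double t_start;  /* set once at tool/application start */

/* Time-from-start (TFS) in seconds, as a decimal value. */
static double tfs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9 - t_start;
}

/* Print an event with both the [HH:MM:SS] local timestamp and TFS. */
static void log_event(const char *name) {
    char hms[16];
    time_t now = time(NULL);
    strftime(hms, sizeof hms, "%H:%M:%S", localtime(&now));
    printf("[%s] %.6f %s\n", hms, tfs(), name);
}
```

Emitting both fields from every tool and from the application itself is what allows the separate data files to be aligned afterwards: the coarse [HH:MM:SS] stamp anchors the files to within a second, and TFS resolves the remainder.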
4) Obtaining Execution Time Values:
The execution timings of interest are specifically: the time to compute on the host, to communicate on the host, to compute on the Xeon Phi, and to transfer data over the PCI bus. To obtain these timings, the application output must be consulted: for CoMD, the timings are excellently profiled, although only the root timings are provided in their entirety. For other applications, additional timings may be required to obtain the host computation time; this will be investigated in the future. The execution timings, which are already collected for each sub-domain, have been exposed to obtain exact timings per sub-domain. Although this is not of much interest in these offload-only execution experiments, the authors' preliminary investigation into executions with multiple execution modes showed this information to be crucial, and it has thus been maintained for future investigations until an improved method has been determined.
5) Using CPU Hardware Counters:
The host CPU micro-measurement application has been developed to continuously read the RAPL power counters for CPU core and DRAM power; the sum is regarded as total CPU power, as uncore device power is not considered. Additionally, host performance is measured using PAPI, where the last-level-cache memory fill counter and the unhalted CPU cycles are measured; the hardware counter name differs slightly depending on the micro-architecture. For Sandy Bridge, MEM_LOAD_UOPS_MISC_RETIRED:LLC_MISS and CPU_CLK_UNHALTED:THREAD_P are the native hardware counters used to approximate host memory usage and bandwidth. For Ivy Bridge, unhalted cycles are captured using the same counter as on Sandy Bridge, but the LLC memory counter is MEM_LOAD_UOPS_RETIRED:L3_MISS. Power and performance are sampled at a rate of 10 ms; the resulting data is printed to output with the timestamp and TFS for synchronization.
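A sketch of the host sampling loop is given below, assuming PAPI 5.x and the Sandy Bridge counter names above; the RAPL read is elided for brevity, and error checking is omitted:

```c
#include <papi.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int evset = PAPI_NULL;
    long long v[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    /* Native counter names for Sandy Bridge; substitute
     * MEM_LOAD_UOPS_RETIRED:L3_MISS on Ivy Bridge. */
    PAPI_add_named_event(evset, "MEM_LOAD_UOPS_MISC_RETIRED:LLC_MISS");
    PAPI_add_named_event(evset, "CPU_CLK_UNHALTED:THREAD_P");
    PAPI_start(evset);

    for (;;) {
        usleep(10000);                 /* 10 ms sampling period */
        PAPI_read(evset, v);
        /* RAPL core+DRAM power would be read and summed here as well. */
        printf("%lld %lld\n", v[0], v[1]);  /* prefix [HH:MM:SS] and TFS */
        PAPI_reset(evset);
    }
    return 0;
}
```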
6) Using MIC Hardware Counters:
The MIC micro-measurement application has been developed to continuously read the available power file [15], /sys/class/micras/power, which provides approximated power over two time windows, power to each connector, and voltage and power readings for the core, uncore, and DRAM devices. (MIC stands for the "many integrated core" technology used in the Intel Xeon Phi.) Unlike the host CPU definition of power, MIC power is based on the power draw measured at each connector: PCI-E, 2x3, and 2x4. The sum over the connectors is the absolute power draw for the device, as defined in the Xeon Phi data sheet [9]. The power file is updated only every 50 ms, which is thus the finest sampling granularity available for the device.

For offload execution, the MIC micro-measurement app only measures power; however, for native or symmetric execution, this micro-app would also measure performance with PAPI. For the Xeon Phi, the native hardware counters of interest are L2_DATA_READ_MISS_MEM_FILL and CPU_CLK_UNHALTED; these are used to obtain estimates of memory usage and bandwidth. For offload execution, these hardware counters are instead measured over the duration of each offload section and printed with the application output. To obtain an estimate of vectorization intensity, a few executions (although one is really all that is needed) should be measured using the hardware counters VPU_ELEMENTS_ACTIVE and VPU_INSTRUCTIONS_EXECUTED, where elements divided by instructions equates to vectorization intensity. In general, the value should be between 1 and 8 for double-precision, and 1 and 16 for single-precision [13].
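A minimal sketch of the MIC power reader described above, run natively on the coprocessor, follows; the output handling is an assumption:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[1024];
    for (;;) {
        FILE *f = fopen("/sys/class/micras/power", "r");
        if (!f) return 1;
        size_t n = fread(buf, 1, sizeof buf - 1, f);
        buf[n] = '\0';
        fclose(f);
        /* Prefix each sample with [HH:MM:SS] and TFS (see Sec. II-A)
         * so it can be synchronized with the application timeline. */
        fputs(buf, stdout);
        usleep(50000);  /* the file is updated only every 50 ms */
    }
}
```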
III. PROFILING EXECUTIONS
In this work, an execution is more than simply running the application; it requires properly measuring CPU and Xeon Phi power and performance, and synchronizing this output with the application's execution. A properly defined experiment must be presented, and the process of mining the raw output data is also discussed. The result of this process is a set of measured performance metrics that describe various attributes of an application, such as the total workload for the accelerator, defined in FLOPs, and the bandwidths for many different data-transfer situations.

The executions are always run with the offload report environment variable (OFFLOAD_REPORT) set to 2; this provides the MIC time, the CPU time (if available), and the data transferred to and from the device. To distinguish the offload reports of the various sub-domains, it is advised that MPI be executed with the '-l' option to print the sub-domain identification number.

A. The Experiment
The experiment should be designed such that all investigated parameters are meaningful. In this work, six meaningful parameters have been chosen: the system, the number of nodes, the number of Xeon Phi per node, the total problem size, the host frequency (no DVFS), and the number of Xeon Phi cores used. In general, the authors are interested in determining the optimal configuration (defined by all six parameters), which is given by a static configuration set (defined by system, nodes, MICs per node, and problem size) and a configuration space (defined by host frequency and MIC cores). On the Borges system, two static configurations per problem size are investigated, with one and with two MICs, because the system consists of a single node. On Bolt, six static configurations per problem size are investigated: Bolt has three nodes with two Xeon Phi each, hence one-, two-, and three-node configurations are investigated, each with one and with two Xeon Phi used. The parameters have been grouped into static configurations and a configuration space because static configurations may easily be compared with one another once each is characterized by a minimum energy; the minimum energy is found within the configuration space, because those parameters impact execution energy and performance. Note that although the number of Xeon Phi also impacts energy, it is often desirable to compare the performance and energy for each count investigated.

An experiment is composed of many executions; each may vary in configuration, but each follows the same execution process to ensure minimal measurement overhead. The process for a typical execution is as follows (a driver sketch is given at the end of this subsection):
1) Start the CPU micro-measurement app on all nodes
2) Start the MIC micro-measurement app on all Xeon Phi
3) Sleep 20 seconds
4) Execute CoMD
5) Sleep 10 seconds
6) Stop the MIC micro-measurement app on all Xeon Phi
7) Stop the CPU micro-measurement app on all nodes
8) Copy the MIC power output files from the MICs to storage
9) Sleep 60 seconds
Ample idle time is provided before execution begins to ensure that a sufficient number of power samples is obtained for each device, such that idle power may be measured. For larger clusters, the duration of step three may need to be adjusted. The command to start each MIC power measurement is issued using SSH, which incurs a slight delay before power measurements may begin. Idle power measurements are based on at least 10 seconds of sample data. CoMD is then executed according to the execution configuration parameters. Upon completion, a brief idle period is provided to capture power measurements before the CPU and MIC measurement threads are halted. Finally, a rest period of one minute is provided to allow the system to cool down. Ideally this should be longer, but a typical experiment consists of hundreds or thousands of executions. The time spent allowing the system to rest accounts for the majority of the total execution time of an experiment.
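A driver for one execution might be sketched as follows; the host lists, binary names, and CoMD arguments are placeholders, not the authors' actual scripts:

```sh
#!/bin/bash
# Hypothetical single-execution driver; NODES, MICS, NP, and the
# CoMD command line are placeholders for the actual configuration.
for n in $NODES; do ssh "$n" "./cpu_measure > cpu_$n.out 2>&1 &"; done
for m in $MICS;  do ssh "$m" "./mic_measure > /tmp/mic.out 2>&1 &"; done
sleep 20                                  # collect idle power samples
mpirun -l -n "$NP" ./CoMD-offload -e -x 50 -y 50 -z 50 > comd.out
sleep 10                                  # trailing idle window
for m in $MICS;  do ssh "$m" "pkill mic_measure"
                    scp "$m:/tmp/mic.out" "mic_$m.out"; done
for n in $NODES; do ssh "$n" "pkill cpu_measure"; done
sleep 60                                  # cool-down before the next run
```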
B. Post-Execution Profiling
Upon successful completion of the experiment, a large number of output files are available for post-execution processing. This process involves synchronizing the measurement output with the application execution to properly quantify the timings, power measurements, and performance of the various phases and states involved in execution. These raw metrics are then used to establish the estimated global parameters that define each static configuration.
1) Obtaining Execution Time Values:
Extracting execution time is fairly simple with CoMD. There are four main functions that occur every iteration: update positions, update velocities, compute forces, and share data between sub-domains; for EAM, there is an additional data transfer during the force computation, since it is a more specific algorithm. Host execution time is defined as the time to update positions, plus velocities, plus data redistribution, minus the data transfer occurring within the redistribution phase. Host communication time is the sum of both communication phases (within the redistribution and force phases), but is more accurately defined as two communication timings: halo exchange and reduce. This is preferred because, with respect to optimization, it is more interesting to separate the point-to-point data transfers from the reduce function timings. Xeon Phi computation time is based on the sum of the offload-report MIC times over all offloads throughout the simulation. PCI transfer time is based on the offload report as well; it is the difference between the sum of CPU time and the sum of MIC time over all offloads during the simulation. If CPU time is undefined (reports 0.0000 seconds), PCI time may be approximated as the difference between the total simulation time (defined as loop in the CoMD output timings) and the time to compute and communicate on the host and to compute on the accelerator.
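In equation form, writing $T^{\mathrm{off}}_{\mathrm{CPU},i}$ and $T^{\mathrm{off}}_{\mathrm{MIC},i}$ for the CPU and MIC times reported for offload $i$ (notation introduced here for clarity, not taken from the offload report itself):

```latex
T_{\mathrm{PCI}} = \sum_i T^{\mathrm{off}}_{\mathrm{CPU},i}
                 - \sum_i T^{\mathrm{off}}_{\mathrm{MIC},i},
\qquad
T_{\mathrm{PCI}} \approx T_{\mathrm{loop}}
                 - T^{\mathrm{comp}}_{\mathrm{host}}
                 - T^{\mathrm{comm}}_{\mathrm{host}}
                 - T^{\mathrm{comp}}_{\mathrm{MIC}}
\quad \text{(when CPU time is unreported).}
```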
2) Obtaining Power Consumption Values:
Extracting the power draw for each state, idle or active, is accomplished slightly differently for each device, because the idle times and durations differ for each device. For the host, the active state is defined by the host computation time; during host communication, PCI data transfer, or computation on the accelerator, the host remains idle. For the Xeon Phi, the active state is defined by the time to compute on the accelerator, and the device is otherwise idle. To synchronize power draw with execution state, the TFS in each file has to be synchronized to the TFS values from the application for the key execution phases. Host idle power is accumulated during host communication and the entire force computation, because the latter includes the accelerator computation and PCI data transfer. All other power is accumulated in host active power. For the Xeon Phi, active power is accumulated during offload execution; otherwise, the current power sample is accumulated in idle power. Each accumulated power is finally divided by the number of samples in the corresponding state (always greater than 100 samples).
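A sketch of this accumulation for the Xeon Phi is given below; the sample structure and the phase test are hypothetical:

```c
/* Hypothetical binning of Xeon Phi power samples into active/idle
 * states; sample[] holds (tfs, watts) pairs already aligned to the
 * application timeline via the shared TFS values. */
double p_active = 0.0, p_idle = 0.0;
int    n_active = 0,   n_idle = 0;

for (int i = 0; i < nSamples; i++) {
    if (inOffloadPhase(sample[i].tfs)) {  /* active during offloads */
        p_active += sample[i].watts; n_active++;
    } else {
        p_idle   += sample[i].watts; n_idle++;
    }
}
p_active /= n_active;  /* average active power (>100 samples) */
p_idle   /= n_idle;    /* average idle power   (>100 samples) */
```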
3) Obtaining Performance Values:
Extracting the performance metrics for each phase of execution is fairly straightforward with the required code instrumentation. For the Xeon Phi, memory usage is simply the LLC miss counter multiplied by 64 bytes per cache line; bandwidth is memory usage times frequency divided by the unhalted clock cycles. Because these are measured explicitly for the offload sections, they may simply be summed over all offloads during the simulation. For the host, the performance samples must first be synchronized with execution and assigned to the appropriate phase; but because only the host communication phase is of interest with respect to performance, this is the only phase in which the counters are accumulated. The host follows the same simple formulas for computing memory usage and bandwidth as the Xeon Phi. PCI memory usage is summed over each offload report within the execution of the simulation by adding together the data sent to and received from the device. PCI bandwidth is estimated as the amount of data transferred divided by the transfer time.

Finally, the work performed on the Xeon Phi may be estimated by multiplying computational throughput and computation time. Computational throughput may be calculated as the product of the number of cores, the vectorization intensity, the average number of operations per cycle, and the operational frequency. For CoMD, the vectorization intensity is measured to be 2.6; operations per cycle is estimated to be 1.15, as few fused-multiply operations are vectorized in this version of CoMD. These two values depend heavily on the implementation and application and must be measured and approximated appropriately for each application. Operations per cycle may be approximated by cross-referencing the compiler vectorization report and the source code to determine which fused-multiply operations have been vectorized; all other operations count as one. Then, assuming each operation were to count only as one, take the ratio of vectorized to non-vectorized operations.
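Collecting these relations in equation form, with symbols introduced here for convenience ($N_{\mathrm{miss}}$ LLC misses, $C$ unhalted cycles, $f$ frequency, $n_c$ cores, $V$ vectorization intensity, $\mathit{OPC}$ operations per cycle):

```latex
\mathrm{Mem} = 64\,\mathrm{B} \times N_{\mathrm{miss}},
\qquad
\mathrm{BW} = \frac{\mathrm{Mem} \times f}{C},
\qquad
W = P_{\mathrm{comp}} \times T_{\mathrm{comp}},
\quad
P_{\mathrm{comp}} = n_c \times V \times \mathit{OPC} \times f.
```

For CoMD on the Xeon Phi, the measured values $V = 2.6$ and $\mathit{OPC} = 1.15$ given above would be substituted into $P_{\mathrm{comp}}$.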
IV. CONCLUSIONS AND FUTURE WORK
This work has provided a detailed procedure through which developers may profile applications offloaded to accelerators and produce meaningful conclusions and insights. At this stage, only the mini-application CoMD has been investigated thoroughly with the procedure, but it is of the utmost importance to validate the technique on many other mini-applications in the future. Additionally, reducing the number of executions needed to accurately measure an application is a high priority, because this would provide larger datasets in a fraction of the time. Currently, an experiment requires several days to complete in the cluster environment, because each configuration change requires a new execution and the associated system cool-down time. Finally, collecting hardware counter measurements without the aid of PAPI is also to be investigated.
ACKNOWLEDGMENTS
This work was supported in part by the Air Force Office of Scientific Research under the AFOSR award FA9550-12-1-0476, by the National Science Foundation grants NSF/OCI-0941434, 0904782, 1047772, and by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, through the Ames Laboratory, operated by Iowa State University under contract No. DE-AC02-07CH11358.
REFERENCES
[1] Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. Algorithmic time, energy, and power on candidate HPC compute building blocks. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, Arizona, USA, May 2014.
[2] DOE. Co-design, 2013. http://science.energy.gov/….
[3] ExMatEx. Extreme Materials at Extreme Scale, 2013. http://exmatex.org/comd.html.
[4] R. Hayashi and S. Horiguchi. Domain decomposition scheme for parallel molecular dynamics simulation. In High Performance Computing on the Information Superhighway, HPC Asia '97, pages 595–600, April 1997.
[5] ICL:UT. Performance Application Programming Interface (PAPI), 2015. http://icl.cs.utk.edu/papi/.
[6] Intel. How to use huge pages to improve application performance on Intel Xeon Phi coprocessor, 2012. https://software.intel.com/sites/default/files/Large_pages_mic_0.pdf.
[7] Intel. Intel MPI library reference manual, 2014. https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/ReferenceManual/index.htm.
[8] Intel. Intel C++ and Fortran compilers, 2015. https://software.intel.com/….
[9] Intel. Intel Xeon Phi coprocessor datasheet, 2013.
[10] James Reinders, Intel. An overview of programming for Intel Xeon processors and Intel Xeon Phi coprocessors, 2013. https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors1.pdf.
[11] Kevin Davis, Intel. Effective use of the Intel compiler's offload features, 2013. https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features.
[12] Martyn Corden, Intel. How to compile for Intel AVX, 2012. https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx.
[13] Shannon Cepeda, Intel. Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: Understanding and using hardware events, 2012. https://software.intel.com/en-us/articles/….
[14] Taylor Kidd, Intel. Intel Xeon Phi coprocessor power management configuration: Using the micsmc command-line interface, 2014. https://software.intel.com/en-us/blogs/2014/01/31/intel-xeon-phi-coprocessor-power-management-configuration-using-the-micsmc-command.
[15] Intel Forums. How to measure energy consumption on the coprocessor?, 2013. https://software.intel.com/en-us/forums/topic/488782?language=en.
[16] loc-nguyen, Intel. Intel Xeon Phi coprocessor developer's quick start guide, 2015. https://software.intel.com/….
[17] LANL. ADTSC science highlights, 2013. http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf.
[18] PAPI. Accessing the Intel RAPL registers, 2013. http://icl.cs.utk.edu/….