Mustafa M. Tikir
San Diego Supercomputer Center
Publications
Featured research published by Mustafa M. Tikir.
International Symposium on Software Testing and Analysis | 2002
Mustafa M. Tikir; Jeffrey K. Hollingsworth
Evaluation of code coverage is the problem of identifying the parts of a program that did not execute in one or more runs. The traditional approach for code coverage tools is to use static code instrumentation. In this paper we present a new approach to dynamically insert and remove instrumentation code to reduce the runtime overhead of code coverage. We also explore the use of dominator tree information to reduce the number of instrumentation points needed. Our experiments show that our approach reduces runtime overhead by 38-90% compared with purecov, a commercial code coverage tool. Our tool is fully automated and available for download from the Internet.
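The dominator-tree optimization lends itself to a compact illustration. Below is a minimal sketch (ours, not the paper's implementation; the toy CFG and block names are invented) of the key inference: when a probe fires in a block, every block that dominates it must also have executed, so those blocks need no probes of their own.

```python
# A minimal sketch (ours, not the paper's implementation) of the
# dominator-tree inference: if block A executes, every block that
# dominates A must also have executed, so a probe hit at A marks A's
# whole dominator chain as covered without probes on the ancestors.

def mark_covered(block, idom, covered):
    """Propagate coverage from an executed block up its dominator chain."""
    while block is not None and block not in covered:
        covered.add(block)
        block = idom.get(block)  # immediate dominator; None at the entry

# Toy CFG: entry -> b1 -> {b2, b3}; idom maps each block to its
# immediate dominator (all names here are invented for illustration).
idom = {"entry": None, "b1": "entry", "b2": "b1", "b3": "b1"}

covered = set()
mark_covered("b2", idom, covered)        # probe at leaf b2 fired at runtime
assert covered == {"b2", "b1", "entry"}  # b1 and entry are inferred for free
```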
International Symposium on Performance Analysis of Systems and Software | 2010
Michael A. Laurenzano; Mustafa M. Tikir; Laura Carrington; Allan Snavely
Binary instrumentation facilitates the insertion of additional code into an executable in order to observe or modify the executable's behavior. There are two main approaches to binary instrumentation: static and dynamic. In this paper we present a static binary instrumentation toolkit for Linux on the x86/x86_64 platforms, PEBIL (PMaC's Efficient Binary Instrumentation Toolkit for Linux). PEBIL is similar to other toolkits in terms of how additional code is inserted into the executable, but it is designed with the primary goal of producing efficiently running instrumented code. To this end, PEBIL uses function-level code relocation in order to insert large but fast control structures. Furthermore, the PEBIL API provides tool developers with the means to insert lightweight hand-coded assembly rather than relying solely on the insertion of instrumentation functions. These features enable the implementation of efficient instrumentation tools with PEBIL. The overhead introduced for basic block counting by PEBIL is, on average, 65% of the overhead of Dyninst, 41% of the overhead of Pin, 15% of the overhead of DynamoRIO, and 8% of the overhead of Valgrind.
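To make the instrumentation concrete, here is a toy illustration (not PEBIL's API; the "instructions" are plain Python tuples invented for this sketch) of what basic block counting does: a lightweight counting probe is inserted at the entry of every block, and each probe bumps a per-block counter when the block runs.

```python
# A toy illustration (not PEBIL's API) of basic block counting: insert
# a counter-increment probe at the entry of each block. Real static
# instrumenters rewrite machine code and relocate functions; here the
# "instructions" are just tuples interpreted by a tiny executor.

from collections import Counter

counts = Counter()

def instrument(blocks):
    """Return blocks with a counting 'instruction' prepended to each."""
    return {
        name: [("count", name)] + body  # lightweight inline probe
        for name, body in blocks.items()
    }

def execute(blocks, trace):
    """Run a block-level execution trace through the instrumented code."""
    for name in trace:
        for insn in blocks[name]:
            if insn[0] == "count":
                counts[insn[1]] += 1    # the inserted probe fires

program = {"b0": [("op", "load")], "b1": [("op", "add")]}
execute(instrument(program), ["b0", "b1", "b1"])
print(dict(counts))  # {'b0': 1, 'b1': 2}
```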
European Conference on Parallel Processing | 2009
Mustafa M. Tikir; Michael A. Laurenzano; Laura Carrington; Allan Snavely
As today's supercomputers grow exponentially in numbers of processors, the applications that run on these systems scale to larger processor counts. The majority of these applications use the Message Passing Interface (MPI); a trace of MPI communication events is an important input to tools that visualize, simulate for performance modeling, or enable tuning of parallel applications. We introduce an efficient, accurate and flexible trace-driven performance modeling and prediction tool, PMaC's Open Source Interconnect and Network Simulator (PSINS), for MPI applications. A principal feature of PSINS is its usability for applications that scale up to large processor counts. PSINS generates compact and tractable event traces for fast and efficient simulations while producing accurate performance predictions. PSINS was incorporated into PMaC's automated performance prediction framework and used to model three applications from the High Performance Computing Modernization Program's (HPCMP) Technology Insertion 2009 (TI-09) application suite.
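The trace-driven approach can be sketched in a few lines. The model below is ours for illustration, not PSINS's actual simulator or trace format, and the latency and bandwidth constants are invented placeholders for a hypothetical target machine.

```python
# A minimal sketch of trace-driven prediction (not PSINS's model or
# input format): replay recorded MPI events against a simple
# latency/bandwidth network model to estimate time on a target machine.

LATENCY_S = 2e-6       # assumed per-message latency of the target
BANDWIDTH_BPS = 5e9    # assumed link bandwidth of the target

def predict(trace):
    """Sum modeled costs for a list of (event, payload) records."""
    total = 0.0
    for event, payload in trace:
        if event in ("send", "recv"):
            total += LATENCY_S + payload / BANDWIDTH_BPS  # payload in bytes
        elif event == "compute":
            total += payload  # for compute events the payload is seconds
    return total

trace = [("compute", 0.5), ("send", 8 << 20), ("recv", 8 << 20)]
print(f"predicted time: {predict(trace):.4f} s")
```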
Conference on High Performance Computing (Supercomputing) | 2004
Mustafa M. Tikir; Jeffrey K. Hollingsworth
In this paper, we introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use lightweight, inexpensive plug-in hardware counters to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. Using Dyninst runtime instrumentation combined with hardware counters, we were able to add page migration capabilities to the system without modifying the operating system kernel or re-compiling application programs. This approach reduced the total number of non-local memory accesses of applications by up to 90%. Even on a system with a small remote-to-local memory access latency ratio, this resulted in up to a 16% improvement in execution time.
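The policy at the heart of the scheme is easy to state in code. The sketch below is our illustration of the mechanism, not the actual Dyninst-and-hardware-counter implementation; the page size and sample format are assumptions.

```python
# A minimal sketch of the profile-driven policy (the mechanism, not the
# paper's implementation): tally sampled memory accesses per (page, cpu)
# and migrate each page to the CPU/node that touches it most often.

from collections import Counter, defaultdict

PAGE_SIZE = 8192  # assumed page size

def migration_plan(samples, page_home):
    """samples: iterable of (address, cpu); returns {page: new_home}."""
    per_page = defaultdict(Counter)
    for addr, cpu in samples:
        per_page[addr // PAGE_SIZE][cpu] += 1
    plan = {}
    for page, counts in per_page.items():
        best_cpu, _ = counts.most_common(1)[0]
        if page_home.get(page) != best_cpu:  # only move misplaced pages
            plan[page] = best_cpu
    return plan

samples = [(0x2000, 1), (0x2010, 1), (0x2020, 0), (0x6000, 0)]
print(migration_plan(samples, page_home={1: 0, 3: 0}))  # {1: 1}
```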
IEEE International Conference on High Performance Computing, Data and Analytics | 2008
Laura Carrington; D. Komatitsch; Michael A. Laurenzano; Mustafa M. Tikir; David Michéa; N. Le Goff; Allan Snavely; John Tromp
SPECFEM3D_GLOBE is a spectral-element application enabling the simulation of global seismic wave propagation in 3D anelastic, anisotropic, rotating and self-gravitating Earth models at unprecedented resolution. A fundamental challenge in global seismology is to model the propagation of waves with periods between 1 and 2 seconds, the highest-frequency signals that can propagate clear across the Earth. These waves help reveal the 3D structure of the Earth's deep interior and can be compared to seismographic recordings. We broke the 2-second barrier using the 62K-processor Ranger system at TACC. Indeed, we broke the barrier using just half of Ranger, reaching a period of 1.84 seconds with a sustained 28.7 Tflops on 32K processors. We obtained similar results on the XT4 Franklin system at NERSC and the XT4 Kraken system at the University of Tennessee, Knoxville, while a similar run on the 28K-processor Jaguar system at ORNL, which has better memory bandwidth per processor, sustained 35.7 Tflops (a higher flops rate) with a shortest period of 1.94 seconds. Thus we have enabled a powerful new tool for seismic wave simulation, one that operates in the same frequency regimes as nature; in seismology there is no need to pursue much smaller periods because higher-frequency signals do not propagate across the entire globe. We employed performance modeling methods to identify performance bottlenecks and worked through issues of parallel I/O and scalability. Improved mesh design and numbering result in excellent load balancing and few cache misses. The primary achievements are not just the scalability and high teraflops number, but a historic step towards understanding the physics and chemistry of the Earth's interior at unprecedented resolution.
Journal of Parallel and Distributed Computing | 2008
Mustafa M. Tikir; Jeffrey K. Hollingsworth
In this paper, we first introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use centralized, lightweight, inexpensive plug-in hardware monitors to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. We also investigate several other potential sources of data gathered from hardware monitors and compare their effectiveness to that of the centralized monitors. In particular, we investigate the effectiveness of using cache miss profiles, Translation Lookaside Buffer (TLB) miss profiles, and the contents of the on-chip TLBs via the valid-bit information. Moreover, we introduce a modest hardware feature, called Address Translation Counters (ATC), and compare its effectiveness with other sources of hardware profiles. Using Dyninst runtime instrumentation combined with hardware monitors, we were able to add page migration capabilities to a Sun Fire 6800 server without modifying the operating system kernel or re-compiling application programs. Our dynamic page migration scheme reduced the total number of non-local memory accesses of applications by up to 90% and improved execution times by up to 16%. We also conducted a simulation-based study and demonstrated that cache miss profiles gathered from on-chip CPU monitors, which are typically available in current microprocessors, can be effectively used to guide dynamic page migrations in applications.
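One design question such a scheme must answer, whichever monitor supplies the profile, is when a page is "owned" clearly enough to justify a move. The sketch below is our illustration, not the paper's policy; the dominance threshold is an invented parameter.

```python
# A sketch (ours, not the paper's exact policy) of a guarded migration
# decision: given per-CPU TLB-miss counts for one page, migrate only
# when one processor clearly dominates, to avoid ping-ponging hot pages.

from collections import Counter

DOMINANCE = 0.6  # assumed threshold: migrate only if >60% of misses

def choose_home(tlb_misses: Counter, current_home: int) -> int:
    """Pick a home node from per-CPU TLB-miss counts for one page."""
    total = sum(tlb_misses.values())
    if total == 0:
        return current_home
    cpu, hits = tlb_misses.most_common(1)[0]
    return cpu if hits / total >= DOMINANCE else current_home

print(choose_home(Counter({0: 9, 1: 1}), current_home=1))  # -> 0
print(choose_home(Counter({0: 5, 1: 5}), current_home=1))  # -> 1 (no clear winner)
```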
Advances in Computers | 2008
Jack J. Dongarra; Robert Graybill; William Harrod; Robert F. Lucas; Ewing L. Lusk; Piotr Luszczek; Janice McMahon; Allan Snavely; Jeffrey S. Vetter; Katherine A. Yelick; Sadaf R. Alam; Roy L. Campbell; Laura Carrington; Tzu-Yi Chen; Omid Khalili; Jeremy S. Meredith; Mustafa M. Tikir
The historical context with regard to the origin of the DARPA High Productivity Computing Systems (HPCS) program is important for understanding why federal government agencies launched this new, long-term high-performance computing program and renewed their commitment to leadership computing in support of national security, large science and space requirements at the start of the 21st century. In this chapter, we provide an overview of the context for this work as well as various procedures being undertaken for evaluating the effectiveness of this activity including such topics as modelling the proposed performance of the new machines, evaluating the proposed architectures, understanding the languages used to program these machines as well as understanding programmer productivity issues in order to better prepare for the introduction of these machines in the 2011–2015 timeframe.
International Parallel and Distributed Processing Symposium | 2005
Mustafa M. Tikir; Jeffrey K. Hollingsworth
We introduce a set of techniques to both measure and optimize the memory access locality of Java applications running on cc-NUMA servers. These techniques work at the object level and use information gathered from embedded hardware performance monitors. We propose a new NUMA-aware Java heap layout. In addition, we propose using dynamic object migration during garbage collection to move objects local to the processors accessing them most. Our optimization technique reduced the number of non-local memory accesses in Java workloads generated from actual runs of the SPECjbb2000 benchmark by up to 41%, and also resulted in a 40% reduction in workload execution time.
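The object-migration idea can be sketched as a copying-collector pass. The code below is our illustration, not the actual JVM modification; the object ids, access profile, and per-node regions are invented stand-ins.

```python
# A minimal sketch of object-level migration (ours, not the actual
# JVM/GC implementation): during a copying-collector pass, evacuate
# each live object into the heap region of the NUMA node whose
# processors accessed it most, using sampled per-object access counts.

from collections import Counter

def evacuate(live_objects, access_profile, regions):
    """Copy each object id into the region of its hottest node."""
    for obj in live_objects:
        counts = access_profile.get(obj, Counter())
        node = counts.most_common(1)[0][0] if counts else 0  # default node
        regions[node].append(obj)                            # 'copy' object
    return regions

profile = {"order": Counter({1: 12, 0: 3}), "cache": Counter({0: 7})}
print(evacuate(["order", "cache"], profile, {0: [], 1: []}))
# {0: ['cache'], 1: ['order']}
```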
Journal of Systems and Software | 2005
Mustafa M. Tikir; Jeffrey K. Hollingsworth
Evaluation of statement coverage is the problem of identifying the statements of a program that execute in one or more runs. The traditional approach for statement coverage tools is to use static code instrumentation. In this paper we present a new approach to dynamically insert and remove instrumentation code to reduce the runtime overhead of statement coverage measurement. We also explore the use of dominator tree information to reduce the number of instrumentation points needed. Our experiments show that our approach reduces runtime overhead by 38-90% compared with purecov, a commercial statement coverage tool. Our tool is fully automated and available for download from the Internet.
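The dynamic-removal half of the approach also admits a small sketch. The code below is our illustration of the idea, not the tool's Dyninst-based mechanics: a coverage probe needs to fire only once, so it deletes itself after the first hit and later executions pay no instrumentation overhead.

```python
# A minimal sketch of dynamic probe removal (ours, not the tool's
# Dyninst-based mechanics): each coverage probe records its statement
# and then removes itself, so a hot statement pays the cost only once.

covered = set()
probes = {}  # statement id -> live probe

def insert_probe(stmt):
    def probe():
        covered.add(stmt)
        probes.pop(stmt, None)  # the probe removes itself after one hit
    probes[stmt] = probe

for s in ("s1", "s2", "s3"):  # instrument three toy statements
    insert_probe(s)

def run_statement(stmt):
    if stmt in probes:        # only still-instrumented statements pay
        probes[stmt]()

for s in ["s1", "s2", "s1", "s1"]:  # s1 pays the probe cost only once
    run_statement(s)
print(sorted(covered), sorted(probes))  # ['s1', 's2'] ['s3']
```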