Publication


Featured research published by Kamen Yotov.


Proceedings of the IEEE | 2005

Is Search Really Necessary to Generate High-Performance BLAS?

Kamen Yotov; Xiaoming Li; Gang Ren; María Jesús Garzarán; David A. Padua; Keshav Pingali; Paul Stodghill

A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective and can generate code with performance comparable to that of code generated by ATLAS using global search.
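
The contrast the paper draws can be made concrete with a small sketch. The code below is illustrative only, not ATLAS: it times a tiled matrix multiply for a few candidate tile sizes (empirical search) and, for comparison, derives a tile size from an assumed cache capacity (model-driven). The 32 KB L1 size and the candidate list are assumptions made for the sketch.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

/* Tiled matrix multiply with a runtime tile-size parameter. */
static void mm_tiled(int tile) {
    for (int ii = 0; ii < N; ii += tile)
        for (int jj = 0; jj < N; jj += tile)
            for (int kk = 0; kk < N; kk += tile)
                for (int i = ii; i < ii + tile && i < N; i++)
                    for (int j = jj; j < jj + tile && j < N; j++)
                        for (int k = kk; k < kk + tile && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    /* Empirical search: run candidate tile sizes on the actual
     * hardware and keep the fastest (the ATLAS-style approach). */
    int candidates[] = {16, 32, 64, 128};
    int best = candidates[0];
    double best_t = 1e30;
    for (int c = 0; c < 4; c++) {
        memset(C, 0, sizeof C);          /* reset between timed runs */
        clock_t t0 = clock();
        mm_tiled(candidates[c]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = candidates[c]; }
    }
    printf("empirical best tile: %d (%.3fs)\n", best, best_t);

    /* Model-driven alternative: derive the tile size from a cache-
     * capacity model, e.g. 3*tile*tile*sizeof(double) <= L1 bytes.
     * The 32 KB L1 size is an assumed value. */
    int l1_bytes = 32 * 1024;
    int model_tile = 1;
    while (3 * (model_tile + 1) * (model_tile + 1) * (int)sizeof(double) <= l1_bytes)
        model_tile++;
    printf("model-predicted tile: %d\n", model_tile);
    return 0;
}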


ACM Symposium on Parallel Algorithms and Architectures | 2007

An experimental comparison of cache-oblivious and cache-conscious programs

Kamen Yotov; Thomas Roeder; Keshav Pingali; John A. Gunnels; Fred G. Gustavson

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
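
A minimal sketch of the divide-and-conquer style described above, assuming a power-of-two matrix size and a fixed base case (production cache-oblivious codes tune the base case and use an optimized micro-kernel there):

#include <stdio.h>

#define N 256   /* power of two keeps the quadrant split simple */
static double A[N][N], B[N][N], C[N][N];

/* Multiply n x n blocks: C[ci..][cj..] += A[ai..][aj..] * B[bi..][bj..] */
static void mm_rec(int ai, int aj, int bi, int bj, int ci, int cj, int n) {
    if (n <= 16) {                       /* base case: plain loops */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[ci+i][cj+j] += A[ai+i][aj+k] * B[bi+k][bj+j];
        return;
    }
    int h = n / 2;
    /* Each division step halves the working set; once a sub-problem
     * fits in some cache level, it runs without capacity misses there. */
    mm_rec(ai,   aj,   bi,   bj,   ci,   cj,   h);  /* C00 += A00*B00 */
    mm_rec(ai,   aj+h, bi+h, bj,   ci,   cj,   h);  /* C00 += A01*B10 */
    mm_rec(ai,   aj,   bi,   bj+h, ci,   cj+h, h);  /* C01 += A00*B01 */
    mm_rec(ai,   aj+h, bi+h, bj+h, ci,   cj+h, h);  /* C01 += A01*B11 */
    mm_rec(ai+h, aj,   bi,   bj,   ci+h, cj,   h);  /* C10 += A10*B00 */
    mm_rec(ai+h, aj+h, bi+h, bj,   ci+h, cj,   h);  /* C10 += A11*B10 */
    mm_rec(ai+h, aj,   bi,   bj+h, ci+h, cj+h, h);  /* C11 += A10*B01 */
    mm_rec(ai+h, aj+h, bi+h, bj+h, ci+h, cj+h, h);  /* C11 += A11*B11 */
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    mm_rec(0, 0, 0, 0, 0, 0, N);
    printf("C[0][0] = %.0f (expect %d)\n", C[0][0], N);
    return 0;
}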


Languages and Compilers for Parallel Computing | 2005

A language for the compact representation of multiple program versions

Sébastien Donadio; James C. Brodman; Thomas Roeder; Kamen Yotov; Denis Barthou; Albert Cohen; María Jesús Garzarán; David A. Padua; Keshav Pingali

As processor complexity increases, compilers tend to deliver suboptimal performance. Library generators such as ATLAS, FFTW and SPIRAL overcome this issue by empirically searching in the space of possible program versions for the one that performs the best. Empirical search can also be applied by programmers, but because they lack a tool to automate the process, programmers need to manually re-write the application in terms of several parameters whose best value will be determined by the empirical search on the target machine. In this paper, we present the design of an annotation language, meant to be used either as an intermediate representation within library generators or directly by the programmer. This language, which we call X, represents parameterized programs in a compact and natural way. It provides a powerful optimization framework for high performance computing.
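
The paper's concrete X syntax is not reproduced here, so the sketch below only illustrates the underlying idea in plain C: a single parameterized source stands for a whole family of program versions, and an external search rebuilds it with different parameter values. The UNROLL macro and the pragma spelling in the comment are hypothetical.

#include <stdio.h>

#ifndef UNROLL
#define UNROLL 4   /* version parameter; set by the search harness */
#endif

#define LEN 1024
static double x[LEN], y[LEN];

int main(void) {
    for (int i = 0; i < LEN; i++) { x[i] = 1.0; y[i] = 2.0; }
    double sum = 0.0;
    int i = 0;
    /* Dot product unrolled by UNROLL: one of the many versions a
     * generator would emit from a single annotated source, e.g. a
     * hypothetical  #pragma xlang unroll(i, UNROLL)  annotation. */
    for (; i + UNROLL <= LEN; i += UNROLL)
        for (int u = 0; u < UNROLL; u++)   /* constant trip count; the
                                              compiler flattens this */
            sum += x[i + u] * y[i + u];
    for (; i < LEN; i++) sum += x[i] * y[i];   /* remainder loop */
    printf("UNROLL=%d sum=%.0f\n", UNROLL, sum);
    return 0;
}

Compiling with, say, -DUNROLL=2, -DUNROLL=8, and so on, and timing each binary plays the role of the empirical search over program versions.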


Measurement and Modeling of Computer Systems | 2005

Automatic measurement of memory hierarchy parameters

Kamen Yotov; Keshav Pingali; Paul Stodghill

The running time of many applications is dominated by the cost of memory operations. To optimize such applications for a given platform, it is necessary to have a detailed knowledge of the memory hierarchy parameters of that platform. In practice, this information is poorly documented, if at all. Moreover, there is growing interest in self-tuning, autonomic software systems that can optimize themselves for different platforms; these systems must determine memory hierarchy parameters automatically without human intervention. One solution is to use micro-benchmarks to determine the parameters of the memory hierarchy. In this paper, we argue that existing micro-benchmarks are inadequate, and present novel micro-benchmarks for determining parameters of all levels of the memory hierarchy, including registers, all data caches and the translation look-aside buffer. We have implemented these micro-benchmarks in a tool called X-Ray that can be ported easily to new platforms. We present experimental results that show that X-Ray successfully determines memory hierarchy parameters on current platforms, and compare its accuracy with that of existing tools.
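
The general idea behind such micro-benchmarks can be sketched as follows (this is not X-Ray's actual code): chase dependent pointers through arrays of increasing size and look for jumps in latency as the working set spills out of each cache level. The 64-byte line-size constant is an assumption, and a real tool would randomize the chain to defeat hardware prefetching.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE  64                  /* assumed cache-line size in bytes */
#define ITERS (1L << 24)

volatile char *sink;              /* keeps the chase loop from being optimized away */

int main(void) {
    for (size_t bytes = 4 * 1024; bytes <= 16 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / LINE;
        char *buf = malloc(bytes);
        /* Build a cyclic chain, one node per cache line, so every
         * load depends on the previous one (no overlap possible). */
        for (size_t i = 0; i < n; i++)
            *(char **)(buf + i * LINE) = buf + ((i + 1) % n) * LINE;
        char *p = buf;
        clock_t t0 = clock();
        for (long i = 0; i < ITERS; i++) p = *(char **)p;
        sink = p;
        double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / ITERS;
        printf("%8zu KB: %6.2f ns/load\n", bytes / 1024, ns);
        free(buf);
    }
    return 0;
}

A latency jump between, say, the 32 KB and 64 KB rows would suggest an L1 data cache of roughly 32 KB.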


International Conference on Supercomputing | 2005

Think globally, search locally

Kamen Yotov; Keshav Pingali; Paul Stodghill

A key step in program optimization is the determination of optimal values for code optimization parameters such as cache tile sizes and loop unrolling factors. One approach, which is implemented in most compilers, is to use analytical models to determine these values. The other approach, used in library generators like ATLAS, is to perform a global empirical search over the space of parameter values. Neither approach is completely suitable for use in general-purpose compilers that must generate high quality code for large programs running on complex architectures. Model-driven optimization may incur a performance penalty of 10-20% even for a relatively simple code like matrix multiplication. On the other hand, global search is not tractable for optimizing large programs for complex architectures because the optimization space is too large. In this paper, we advocate a methodology for generating high-performance code without increasing search time dramatically. Our methodology has three components: (i) modeling, (ii) local search, and (iii) model refinement. We demonstrate this methodology by using it to eliminate the performance gap between code produced by a model-driven version of ATLAS described by us in prior work, and code produced by the original ATLAS system using global search.
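
A toy sketch of the methodology's middle component, local search. The cost function below is a made-up stand-in for generating and timing a real code version; only the search structure (start at the model's prediction, move to a better neighbor until none exists) reflects the paper's idea.

#include <stdio.h>

/* Hypothetical stand-in for "generate and time a version with this
 * tile size"; here a smooth curve whose true optimum is tile = 40. */
static double time_version(int tile) {
    return 1.0 + 0.001 * (tile - 40) * (tile - 40);
}

int main(void) {
    int tile = 36;                    /* model-predicted starting point */
    double t = time_version(tile);
    for (;;) {                        /* local search over neighbors */
        int next = tile;
        double best = t;
        for (int d = -4; d <= 4; d += 8) {
            double tn = time_version(tile + d);
            if (tn < best) { best = tn; next = tile + d; }
        }
        if (next == tile) break;      /* local optimum reached */
        tile = next; t = best;
    }
    printf("refined tile size: %d (%.4f s)\n", tile, t);
    return 0;
}

Because the model supplies a good starting point, only a handful of versions are timed, instead of the full parameter space that a global search would visit.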


Quantitative Evaluation of Systems | 2005

X-ray: a tool for automatic measurement of hardware parameters

Kamen Yotov; Keshav Pingali; Paul Stodghill

There is growing interest in self-optimizing computing systems that can optimize their own behavior on different platforms without manual intervention. Examples of successful self-optimizing systems are ATLAS, which generates basic linear algebra subroutine (BLAS) libraries, and FFTW, which generates FFT libraries. Self-optimizing systems need values for hardware parameters such as the number of registers of various types and the capacities of caches at various levels. For example, ATLAS uses the capacity of the L1 cache and the number of registers in determining the size of cache tiles and register tiles. In this paper, we describe X-Ray, a system for implementing micro-benchmarks to measure such hardware parameters. We also present novel algorithms for measuring some of these parameters. Experimental evaluations of X-Ray on traditional workstations, servers and embedded systems show that X-Ray produces more accurate and complete results than existing tools.
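
As one illustration of the kind of measurement such a tool performs (not X-Ray's actual algorithm), the sketch below estimates the cache-line size: sweeping a large array at growing strides, the per-access cost rises until the stride reaches the line size, after which every access misses and the cost levels off. The 64 MB buffer size is an assumption chosen to exceed any cache.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BYTES (64L * 1024 * 1024)   /* much larger than any cache */

int main(void) {
    volatile char *buf = malloc(BYTES);
    for (long i = 0; i < BYTES; i++) buf[i] = 1;   /* warm the pages */
    for (int stride = 4; stride <= 512; stride *= 2) {
        long touches = 0;
        clock_t t0 = clock();
        for (long i = 0; i < BYTES; i += stride) { buf[i]++; touches++; }
        double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / touches;
        /* Cost per access climbs with stride while several accesses
         * still share one line, then plateaus at stride = line size. */
        printf("stride %4d: %6.2f ns/access\n", stride, ns);
    }
    free((void *)buf);
    return 0;
}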


Parallel Computing | 2006

Is cache-oblivious DGEMM viable?

John A. Gunnels; Fred G. Gustavson; Keshav Pingali; Kamen Yotov

We present a study of implementations of DGEMM using both the cache-oblivious and cache-conscious programming styles. The cache-oblivious programs use recursion and automatically block the DGEMM operands A, B, C for the memory hierarchy. The cache-conscious programs use iteration and explicitly block A, B, C for register files, all caches and memory. Our study shows that the cache-oblivious programs achieve substantially less performance than the cache-conscious programs. We discuss why this is so and suggest approaches for improving the performance of cache-oblivious programs.
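
For contrast with the recursive style sketched earlier, here is a minimal sketch of the iterative, explicitly blocked style at its innermost level: a 2x2 tile of C held in scalars (and hence registers) across the k loop. This is illustrative only; a real cache-conscious DGEMM adds cache-level blocking, operand packing, and vectorization, and the 2x2 tile size is an assumption.

#include <stdio.h>

#define N 128              /* assume N is a multiple of 2 */
static double A[N][N], B[N][N], C[N][N];

static void mm_register_blocked(void) {
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            /* 2x2 register tile of C accumulated in scalars, so each
             * element of A and B loaded here is used twice. */
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < N; k++) {
                double a0 = A[i][k],  a1 = A[i+1][k];
                double b0 = B[k][j],  b1 = B[k][j+1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i][j]     = c00;  C[i][j+1]   = c01;
            C[i+1][j]   = c10;  C[i+1][j+1] = c11;
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    mm_register_blocked();
    printf("C[0][0] = %.0f (expect %d)\n", C[0][0], N);
    return 0;
}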


Languages and Compilers for Parallel Computing | 2005

Analytic models and empirical search: a hybrid approach to code optimization

Arkady Epshteyn; María Jesús Garzarán; Gerald DeJong; David A. Padua; Gang Ren; Xiaoming Li; Kamen Yotov; Keshav Pingali

Compilers employ system models, sometimes implicitly, to make code optimization decisions. These models are analytic; they reflect their implementors' understanding of, and beliefs about, the system. While their decisions can be made almost instantaneously, unless the model is perfect their decisions may be flawed. To avoid exercising unique characteristics of a particular machine, such models are necessarily general and conservative. An alternative is to construct an empirical model. Building an empirical model involves extensive search of a parameter space to determine optimal settings. But this search is performed on the actual machine on which the compiler is to be deployed so that, once constructed, its decisions automatically reflect any eccentricities of the target system. Unfortunately, constructing accurate empirical models is expensive and, therefore, their applicability is limited to library generators such as ATLAS and FFTW. Here the high up-front installation cost can be amortized over many future uses. In this paper we examine a hybrid approach. Active learning in an Explanation-Based paradigm allows the hybrid system to greatly increase the search range while drastically reducing the search time. Individual search points are analyzed for their information content using a known-imprecise qualitative analytic model. Next search points are chosen that have the highest expected information content with respect to refinement of the empirical model being constructed. To evaluate our approach we compare it with a leading analytic model and a leading empirical model. Our results show that the performance of the libraries generated using the hybrid approach is comparable to the performance of libraries generated via extensive search techniques and much better than that of the libraries generated by optimization based solely on an analytic model.
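
The active-learning loop can be caricatured in a few lines (this is a toy, not the paper's Explanation-Based method): both the analytic model and the machine are stand-in cost curves, and the next point to measure is chosen by a made-up information score that favors candidates the model rates well and that lie far from anything already measured.

#include <stdio.h>
#include <math.h>

#define NCAND 32

/* Hypothetical stand-ins: an imprecise analytic model (optimum at
 * tile 30) and the "real" machine (true optimum at tile 40). */
static double model_time(int tile)   { return 1.0 + 0.002 * (tile - 30) * (tile - 30); }
static double measure_time(int tile) { return 1.0 + 0.001 * (tile - 40) * (tile - 40); }

int main(void) {
    int cand[NCAND], done[NCAND] = {0};
    double measured[NCAND];
    for (int i = 0; i < NCAND; i++) cand[i] = 8 + 4 * i;   /* tiles 8..132 */

    int budget = 5, best = -1;
    for (int step = 0; step < budget; step++) {
        /* Pick the unmeasured candidate with the highest toy
         * "information" score: distance from everything measured so
         * far, discounted by the model's predicted cost. */
        int pick = -1;
        double score_best = -1e30;
        for (int i = 0; i < NCAND; i++) {
            if (done[i]) continue;
            double nearest = 1e30;
            for (int j = 0; j < NCAND; j++)
                if (done[j]) {
                    double d = fabs((double)(cand[i] - cand[j]));
                    if (d < nearest) nearest = d;
                }
            double score = nearest / model_time(cand[i]);
            if (score > score_best) { score_best = score; pick = i; }
        }
        done[pick] = 1;
        measured[pick] = measure_time(cand[pick]);   /* run on "hardware" */
        if (best < 0 || measured[pick] < measured[best]) best = pick;
    }
    printf("best tile after %d measurements: %d\n", budget, cand[best]);
    return 0;
}

The point of the hybrid scheme is visible even in the toy: a handful of well-chosen measurements corrects the model's bias without paying for an exhaustive search.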


Proceedings of the IEEE | 2005

A comparison of empirical and model-driven optimization

Kamen Yotov; Xiaoming Li; Gang Ren; María Jesús Garzarán; David A. Padua; Keshav Pingali; Paul Stodghill


Archive | 2004

X-Ray: Automatic Measurement of Hardware Parameters

Kamen Yotov; Keshav Pingali; Paul Stodghill

Collaboration


Dive into Kamen Yotov's collaboration.

Top Co-Authors

Keshav Pingali

University of Texas at Austin

Paul Stodghill

United States Department of Agriculture

Xiaoming Li

University of Delaware

David Padua

Polytechnic University of Catalonia