Publication


Featured research published by Bob Blainey.


embedded software | 2007

Design and implementation of a comprehensive real-time Java virtual machine

Joshua S. Auerbach; David F. Bacon; Bob Blainey; Perry Cheng; Michael H. Dawson; Mike Stephen Fulton; David Grove; Darren Hart; Mark G. Stoodley

The emergence of standards for programming real-time systems in Java has encouraged many developers to consider its use for systems previously only built using C, Ada, or assembly language. However, the RTSJ standard in isolation leaves many important problems unaddressed and suffers from some serious problems in usability and safety. As a result, the use of Java for real-time programming has continued to be viewed as risky, and adoption has been slow. In this paper we describe IBM's new real-time Java virtual machine product, which combines Metronome real-time garbage collection, ahead-of-time compilation, and a complete implementation of the RTSJ standard, running on top of a custom real-time multiprocessor Linux kernel. We describe the implementation of each of these components, including how they interacted both positively and negatively, and the extensions to previous work required to move it from research prototype to a system implementing the complete semantics of the Java language. The system has been adopted for hard real-time development of naval weapons systems and soft real-time telecommunications servers. We present measurements showing that the system is able to provide sub-millisecond worst-case garbage collection latencies and 50 microsecond Linux scheduling accuracy, and to eliminate non-determinism due to JIT compilation.


international workshop on openmp | 2003

Is the schedule clause really necessary in OpenMP?

Eduard Ayguadé; Bob Blainey; Alejandro Duran; Jesús Labarta; Francisco Martínez; Xavier Martorell; Raul Esteban Silvera

Choosing the appropriate assignment of loop iterations to threads is one of the most important decisions that must be made when parallelizing loops, the main source of parallelism in numerical applications. This is not an easy task, even for expert programmers, and it can potentially take a large amount of time. OpenMP offers the schedule clause, with a set of predefined iteration scheduling strategies, to specify how (and when) this assignment of iterations to threads is done. In some cases the best schedule depends on architectural characteristics of the target machine, the input data, and other factors, making the code less portable. Even worse, the best schedule can change during execution, depending on dynamic changes in the behavior of the loop or in the resources available in the system. Also, for certain types of imbalanced loops, the schedulers already proposed in the literature are not able to extract the maximum parallelism because they do not appropriately trade off load balancing and data locality. This paper proposes a new scheduling strategy that derives at run time the best scheduling policy for each parallel loop in the program, based on information gathered at runtime by the library itself.


compiler construction | 2005

Generalized index-set splitting

Christopher Barton; Arie Tal; Bob Blainey; José Nelson Amaral

This paper introduces Index-Set Splitting (ISS), a technique that splits a loop containing several conditional statements into several loops with simpler control flow. Contrary to the classic loop unswitching technique, ISS splits loops when the conditional is loop variant. ISS uses an Index Sub-range Tree (IST) to identify the structure of the conditionals in the loop and to select which conditionals should be eliminated. This decision is based on an estimation of the code growth for each splitting: a greedy algorithm spends a pre-determined code growth budget. ISTs separate the decision about which splits to perform from the actual code generation for the split loops. The use of ISS to improve a loop fusion framework is then discussed. ISS opportunity identification in the SPEC2000 benchmark suite and three other suites demonstrates that ISS is a general technique that may benefit other compilers.


international workshop on openmp | 2003

Busy-wait barrier synchronization using distributed counters with local sensor

Guansong Zhang; Francisco Martínez; Arie Tal; Bob Blainey

Barrier synchronization is an important and performance-critical primitive in many parallel programming models, including the popular OpenMP model. In this paper, we compare the performance of several software implementations of barrier synchronization and introduce a new implementation, distributed counters with local sensor, which considerably reduces overhead on POWER3 and POWER4 SMP systems. Through experiments with the EPCC OpenMP benchmark, we demonstrate a 79% reduction in overhead on a 32-way POWER4 system and an 87% reduction in overhead on a 16-way POWER3 system compared with a fetch-and-add implementation. Since these improvements are primarily attributable to reduced L2 and L3 cache misses, we expect the relative performance of our implementation to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.


international conference on management of data | 2014

Analyzing analytics

Rajesh Bordawekar; Bob Blainey; Chidanand Apte

Many organizations today are faced with the challenge of processing and distilling information from huge and growing collections of data. Such organizations are increasingly deploying sophisticated mathematical algorithms to model the behavior of their business processes, to discover correlations in the data, to predict trends, and ultimately to drive decisions to optimize their operations. These techniques are known collectively as analytics and draw upon multiple disciplines, including statistics, quantitative analysis, data mining, and machine learning. In this survey paper, we identify some of the key techniques employed in analytics, both to serve as an introduction for the non-specialist and to explore the opportunity for greater optimizations for parallelization and acceleration using commodity and specialized multi-core processors. We are interested in isolating and documenting repeated patterns in analytical algorithms, data structures, and data types, and in understanding how these could be most effectively mapped onto parallel infrastructure. To this end, we focus on analytical models that can be executed using different algorithms. For most major model types, we study implementations of key algorithms to determine common computational and runtime patterns. We then use this information to characterize and recommend suitable parallelization strategies for these algorithms, specifically when used in data management workloads.


high performance computational finance | 2013

Optimizing IBM Algorithmics' mark-to-future aggregation engine for real-time counterparty credit risk scoring

Amy Wang; Jan Treibig; Bob Blainey; Peng Wu; Yaoqing Gao; Barnaby Dalton; Danny Gupta; Fahham Khan; Neil Bartlett; Lior Velichover; James Sedgwick; Louis Ly

The concept of default and its associated painful repercussions have been a particular area of focus for financial institutions, especially after the 2007/2008 global financial crisis. Counterparty credit risk (CCR), i.e., the risk associated with a counterparty defaulting prior to the expiration of a contract, has gained a tremendous amount of attention, which has resulted in new CCR measures and regulations being introduced. In particular, users would like to measure the potential impact of each real-time trade, or potential real-time trade, against exposure limits for the counterparty using Monte Carlo simulations of the trade value, and also to calculate the Credit Value Adjustment (i.e., how much it will cost to cover the risk of default with this particular counterparty if/when the trade is made). These rapid limit checks and CVA calculations demand more compute power from the hardware. Furthermore, with the emergence of electronic trading, the extreme low-latency and high-throughput real-time compute requirements push both software and hardware capabilities to the limit. Our work focuses on optimizing the computation of risk measures and trade processing in the existing Mark-to-Future Aggregation (MAG) engine in the IBM Algorithmics product offering. We propose a new software approach to speed up end-to-end trade processing based on pre-compilation. The net result is an impressive speedup of 3--5x over the existing MAG engine on a real client workload, for processing trades that perform limit checks and CVA reporting on exposures while taking full collateral modelling into account.


acm sigplan symposium on principles and practice of parallel programming | 2014

SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput

Barnaby Dalton; Amy Wang; Bob Blainey

Implementing summation when accuracy and throughput need to be balanced is a challenging endeavour. We present experimental results that provide a sense of when to start worrying and of the expense of the various solutions that exist. We also present a new algorithm based on pairwise summation that achieves 89% of the throughput of the fastest summation algorithms when the data is not resident in L1 cache, while eclipsing the accuracy of significantly slower compensated sums, such as Kahan summation and Kahan-Babuška, that are typically used when accuracy is important.


languages and compilers for parallel computing | 2002

Removing impediments to loop fusion through code transformations

Bob Blainey; Christopher Barton; José Nelson Amaral

Loop fusion is a common optimization technique that takes several loops and combines them into a single large loop. Most of the existing work on loop fusion concentrates on the heuristics required to optimize an objective function, such as data reuse or creation of instruction level parallelism opportunities. Often, however, the code provided to a compiler has only small sets of loops that are control flow equivalent, normalized, have the same iteration count, are adjacent, and have no fusion-preventing dependences. This paper focuses on code transformations that create more opportunities for loop fusion in the IBM® XL compiler suite that generates code for the IBM family of PowerPC® processors. In this compiler an objective function is used at the loop distributor to decide which portions of a loop should remain in the same loop nest and which portions should be redistributed. Our algorithm focuses on eliminating conditions that prevent loop fusion. By generating maximal fusion our algorithm increases the scope of later transformations. We tested our improved code generator on an IBM pSeries 690 machine equipped with a POWER4 processor using the SPEC CPU2000 benchmark suite. Our improvements to loop fusion resulted in three times as many loops fused in a subset of CFP2000 benchmarks, and four times as many for a subset of CINT2000 benchmarks.


international parallel and distributed processing symposium | 2015

Towards a Combined Grouping and Aggregation Algorithm for Fast Query Processing in Columnar Databases with GPUs

Sina Meraji; John Keenleyside; Sunil Kamath; Bob Blainey

Column-store in-memory databases have received a lot of attention because of their fast query-processing response times on modern multi-core machines. Among database operations, group by/aggregate is an important and potentially costly operation, and sort-based and hash-based algorithms are the most common ways of processing group by/aggregate queries. While sort-based algorithms are used in traditional Database Management Systems (DBMSs), hash-based algorithms can be applied for faster query processing in new columnar databases. In addition, Graphics Processing Units (GPUs) can be utilized as fast, high-bandwidth co-processors to improve the query-processing performance of columnar databases. The focus of this article is the prototype for group by/aggregate operations that we created to exploit GPUs. We show different hash-based algorithms to improve the performance of group by/aggregate operations on a GPU. Among the parameters that affect the performance of the group by/aggregate algorithm are the number of groups and the hashing algorithm. We show that we can get up to a 7.6x improvement in kernel performance compared to a multi-core CPU implementation when we use a partitioned multi-level hash algorithm using GPU shared and global memories.


international conference on parallel architectures and compilation techniques | 2014

Domain-specific models for innovation in analytics

Bob Blainey

Big data is a transformational force for businesses and organizations of every stripe. The ability to rapidly and accurately derive insights from massive amounts of data is becoming a critical competitive differentiator so it is driving continuous innovation among business analysts, data scientists, and computer engineers. Two of the most important success factors for analytic techniques are the ability to quickly develop and incrementally evolve them to suit changing business needs and the ability to scale these techniques using parallel computing to process huge collections of data. Unfortunately, these goals are often at odds with each other because innovation at the algorithm and data model level requires a combination of domain knowledge and expertise in data analysis while achieving high scale demands expertise in parallel computing, cloud computing and even hardware acceleration. In this talk, I will examine various approaches to bridging these two goals, with a focus on domain-specific models that simultaneously improve the agility of analytics development and the achievement of efficient parallel scaling.
