Brian Kahne
Freescale Semiconductor
Publication
Featured research published by Brian Kahne.
IEEE Transactions on Computers | 2015
Etem Deniz; Alper Sen; Brian Kahne; Jim Holt
We present a novel automated multicore benchmark synthesis framework with characterization and generation components. Our framework uses parallel patterns to capture important characteristics of multithreaded applications and generates synthetic multicore benchmarks from those applications. The resulting synthetic benchmarks are small, fast, portable, and human-readable, and they accurately reflect the microarchitecture-dependent and -independent characteristics of the original multicore applications. They can use either the Pthreads or the MCA libraries. We implement our techniques in the MINIME tool and generate synthetic benchmarks from PARSEC, Rodinia, and EEMBC MultiBench™ benchmarks on x86 and Power Architecture® platforms. We show that the synthetic benchmarks are representative across a range of multicore machines with different architectures, while being on average 21× faster and 14× smaller than the original benchmarks.
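To make the idea of a pattern-based synthetic benchmark more concrete, here is a minimal Python sketch of a pipeline parallel pattern with one thread per stage. It is an illustration under stated assumptions, not MINIME output: the number of stages, the per-stage iteration counts (`stage_iters`), and the `busy_work` kernel are hypothetical stand-ins for parameters that a characterization step might extract from the original application.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the item stream


def busy_work(x, iters):
    """Hypothetical compute kernel: `iters` rounds of an LCG update on x."""
    for _ in range(iters):
        x = (x * 1103515245 + 12345) & 0xFFFFFFFF
    return x


def run_pipeline(n_items, stage_iters):
    """Push n_items through a pipeline; stage i does stage_iters[i] rounds per item."""
    # queues[i] feeds stage i; the last queue collects the pipeline's output.
    queues = [queue.Queue() for _ in range(len(stage_iters) + 1)]

    def stage(idx):
        q_in, q_out = queues[idx], queues[idx + 1]
        while True:
            item = q_in.get()
            if item is _STOP:
                q_out.put(_STOP)  # propagate shutdown to the next stage
                return
            q_out.put(busy_work(item, stage_iters[idx]))

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(len(stage_iters))]
    for t in threads:
        t.start()
    for i in range(n_items):
        queues[0].put(i)
    queues[0].put(_STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(8, [1000, 500, 2000]))
```

Because each stage is a single thread reading a FIFO queue, items leave the pipeline in input order. A real generator would also need to reproduce memory footprint and data-sharing behavior, which this sketch ignores.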
IEEE International Symposium on Workload Characterization | 2012
Etem Deniz; Alper Sen; Jim Holt; Brian Kahne
Benchmarks capture the essence of many important real-world applications and enable performance and power analysis while new systems are being developed. Synthetic benchmarks are miniaturized forms of benchmarks that allow high simulation speeds and act as proxies for proprietary applications. Software architecture principles guide the development of new applications and benchmarks. We leverage software architectural patterns to develop synthetic benchmarks for embedded multicore systems. We developed an automated framework complete with characterization and synthesis components and performed experiments on the PARSEC and Rodinia benchmarks. Unlike previously developed benchmarks, ours can run on either SMP or message-passing infrastructure, which allows us to target heterogeneous embedded multicore systems. Our results show that the synthetic benchmarks and the real applications are similar with respect to various microarchitecture-dependent and -independent metrics.
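One simple way to quantify "similar with respect to various metrics" is a per-metric relative error between the profile of the original application and that of its synthetic proxy. The Python sketch below assumes that reading; the metric names, the example values, and the 10% tolerance are hypothetical, and the paper's actual similarity criteria may differ.

```python
def relative_errors(original, synthetic):
    """Return |synthetic - original| / |original| for every metric in `original`."""
    errors = {}
    for name, ref in original.items():
        if name not in synthetic:
            raise KeyError(f"synthetic profile is missing metric {name!r}")
        if ref == 0:
            raise ValueError(f"reference value for {name!r} is zero")
        errors[name] = abs(synthetic[name] - ref) / abs(ref)
    return errors


def is_representative(original, synthetic, tolerance=0.10):
    """True if every metric is within `tolerance` relative error (hypothetical criterion)."""
    return all(err <= tolerance for err in relative_errors(original, synthetic).values())


# Hypothetical profiles: two microarchitecture-dependent metrics (IPC, L1D miss rate)
# and one microarchitecture-independent metric (fraction of memory instructions).
original = {"ipc": 1.20, "l1d_miss_rate": 0.050, "mem_instr_fraction": 0.32}
synthetic = {"ipc": 1.14, "l1d_miss_rate": 0.052, "mem_instr_fraction": 0.33}

if __name__ == "__main__":
    print(relative_errors(original, synthetic))
    print(is_representative(original, synthetic))
```

A single threshold is the simplest design choice; a per-metric tolerance would be a natural refinement when some metrics matter more than others.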
Networking, Architecture and Storage | 2015
Farrukh Hijaz; Brian Kahne; Peter J. Wilson; Omer Khan
Software IP forwarding routers provide flexibility, programmability, and extensibility while enabling fast deployment. The key question is whether they can match the efficiency of their special-purpose hardware counterparts. Shared memory stands out as a sine qua non for parallel programming of many commercial multicore processors, so it is the paradigm of choice for implementing software routers. For efficiency, shared memory is often implemented with hardware support for cache coherence and data consistency among the cores. Although this enables efficient data access in many common scenarios, communication between cores through shared memory synchronization primitives often limits scalability. In this paper, we perform a thorough characterization of a multithreaded packet processing application to quantify the opportunities from exploiting concurrency and to identify scalability bottlenecks in future shared memory multicores. We propose to retain the shared memory model but to introduce a set of lightweight in-hardware explicit messaging send/receive instructions in the instruction set architecture (ISA). These instructions mitigate the overheads of multi-party communication in shared memory protocols. Using simulations of a 64-core multicore, we find that the scalability of parallel packet processing is limited by the packet ordering requirement, which leads to expensive implicit communication under shared memory. With explicit messaging support in the ISA, this communication bottleneck is mitigated, and the application scales to 30× at 64 cores.
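The packet-ordering requirement can be illustrated in software. In the Python sketch below, each worker "core" owns an inbox, and `queue.Queue` puts and gets stand in for the proposed explicit send/receive instructions; this is an assumption for illustration, not the paper's ISA interface. Workers forward results by message to a single ordering core, which releases packets strictly in sequence-number order. The `process` function is a hypothetical placeholder for the per-packet forwarding work.

```python
import queue
import threading

_STOP = object()  # sentinel message: "no more packets from this sender"


def process(payload):
    """Hypothetical per-packet work (stands in for a route lookup and header rewrite)."""
    return payload.upper()


def forward_in_order(packets, n_workers=4):
    inboxes = [queue.Queue() for _ in range(n_workers)]  # one inbox per worker core
    order_inbox = queue.Queue()                           # inbox of the ordering core

    def worker(inbox):
        while True:
            msg = inbox.get()  # explicit "receive"
            if msg is _STOP:
                break
            seq, payload = msg
            order_inbox.put((seq, process(payload)))  # explicit "send" to the ordering core
        order_inbox.put(_STOP)

    workers = [threading.Thread(target=worker, args=(ib,)) for ib in inboxes]
    for w in workers:
        w.start()

    # Dispatch packets round-robin, tagging each with its arrival sequence number.
    for seq, pkt in enumerate(packets):
        inboxes[seq % n_workers].put((seq, pkt))
    for ib in inboxes:
        ib.put(_STOP)

    # Ordering core: buffer early arrivals and release packets in sequence order.
    out, pending, next_seq, finished = [], {}, 0, 0
    while finished < n_workers:
        msg = order_inbox.get()
        if msg is _STOP:
            finished += 1
            continue
        seq, result = msg
        pending[seq] = result
        while next_seq in pending:
            out.append(pending.pop(next_seq))
            next_seq += 1

    for w in workers:
        w.join()
    return out
```

No lock protects the reorder buffer, because only the ordering core touches it. All inter-core communication happens by message, which is the property the explicit-messaging proposal exploits in hardware.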
Microprocessor Test and Verification | 2013
Brian Kahne; Jim Holt
When developing a new architecture with a new programming model, not only must performance be taken into account, but the programming model itself must also be validated to ensure that software will run correctly and with sufficient efficiency. In this paper, we describe how we applied rapid prototyping techniques to model a new network switch architecture. By concentrating on functional modeling and using a high-level description for core modeling, along with abstract C++ models for peripherals, our model was able to track the specification. This allowed studies of code density and ABI requirements to be performed early enough to influence the architecture as it evolved.
International Parallel and Distributed Processing Symposium | 2017
Halit Dogan; Farrukh Hijaz; Masab Ahmad; Brian Kahne; Peter J. Wilson; Omer Khan
Shared memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged by emerging applications that communicate often and involve large amounts of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). We propose a set of auxiliary communication models that use explicit messages to accelerate synchronization primitives and to efficiently move computation toward data. Results on a 256-core simulated multicore show that the proposed communication models improve performance by an average of 4× and dynamic energy by an average of 42% over traditional shared memory.
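Moving computation toward data can be sketched as an owner-computes pattern. In the Python sketch below, vertices are statically partitioned across hypothetical "cores" (threads) by `v % n_cores`; instead of updating shared counters under a lock, each core sends an "increment vertex v" message to the owner of v, which applies it to its private partition. Python queues stand in for the proposed ISA messaging instructions, and in-degree counting is only an example graph workload chosen for illustration.

```python
import queue
import threading
from collections import Counter


def in_degrees(edges, n_vertices, n_cores=4):
    """Count in-degrees with owner-computes messaging instead of shared atomics."""

    def owner(v):
        return v % n_cores  # hypothetical static partition of vertices to cores

    inboxes = [queue.Queue() for _ in range(n_cores)]
    partitions = [Counter() for _ in range(n_cores)]  # private to each owner core

    def core(cid):
        # Phase 1: scan this core's slice of the edge list and ship each update
        # to the core that owns the destination vertex.
        for _src, dst in edges[cid::n_cores]:
            inboxes[owner(dst)].put(dst)
        for inbox in inboxes:
            inbox.put(None)  # tell every owner this core has finished sending
        # Phase 2: apply incoming updates to locally owned vertices only.
        done = 0
        while done < n_cores:
            v = inboxes[cid].get()
            if v is None:
                done += 1
            else:
                partitions[cid][v] += 1

    threads = [threading.Thread(target=core, args=(c,)) for c in range(n_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [partitions[owner(v)][v] for v in range(n_vertices)]
```

Because the unbounded queues never block on send, every core can finish phase 1 before draining its inbox. Hardware message buffers are finite, so a real design would need flow control that this sketch omits.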
Design, Automation, and Test in Europe | 2017
Alper Sen; Etem Deniz; Brian Kahne
Programming multicore architectures with a large number of cores places a huge burden on the programmer. Parallel patterns ease this burden by presenting the developer with a set of predefined programming patterns that implement best practices in parallel programming. Because the behavior of patterns is well known and understood, they can also lower the burden of verification. In this work, we present a toolset, MINIME-Validator, for generating synthetic parallel testcases from a newly defined Parallel Pattern Markup Language (PPML) that uses the concept of parallel patterns. Our testcases mimic the behavior of real customer applications while being much smaller, and they can be used to generate traffic and validate, e.g., inter-processor communication architectures. Experiments show that synthetic testcases can be used to find representative hardware communication problems. To the best of our knowledge, this is the first time synthetic testcases using parallel programming patterns have been used for hardware validation.
Microprocessor Test and Verification | 2005
Brian Kahne; Aseem Gupta; Peter J. Wilson; Nikil D. Dutt
The ability to enhance single-thread performance, such as by increasing clock frequency, is reaching a point of diminishing returns: power is becoming a dominant factor and is limiting scalability. Adding cores is a scalable way to increase performance, but it requires that system designers have a method for developing multithreaded applications. Plasma (Parallel Language for System Modeling and Analysis) is a parallel language for system modeling and multithreaded application development, implemented as a superset of C++. Its language extensions are based on those found in Occam, which is in turn based on C. A. R. Hoare's CSP (Communicating Sequential Processes). The goal of the Plasma project is to investigate whether a language with the appropriate constructs can ease the task of developing highly multithreaded software. In addition, through the inclusion of a discrete event simulation API, we seek to simplify the task of system modeling and to increase productivity through a clearer representation and increased compile-time checking of the more difficult-to-get-right aspects of system models (the concurrency). The result is a single language that allows users to develop a parallel application and then model it within the context of a system, enabling hardware-software partitioning and other early tradeoff analyses. We believe that this language offers a simpler and more concise syntax than other offerings and can be targeted at a wide range of architectures, including heterogeneous systems and those without shared memory.
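In Occam and CSP, processes communicate over unbuffered channels, where a send completes only when a matching receive takes the value (a rendezvous). The Python sketch below illustrates that semantics with `threading.Condition`; it is not Plasma's syntax or API, and the `Channel`, `producer`, `consumer`, and `run_par` names are hypothetical. Running the two processes in separate threads and joining both loosely approximates Occam's `PAR` construct.

```python
import threading


class Channel:
    """Unbuffered CSP-style channel: send() returns only after a recv() takes the value."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._full = False   # a sender has placed a value in the channel
        self._taken = False  # a receiver has taken the current value

    def send(self, value):
        with self._cond:
            while self._full:  # wait until any earlier sender's rendezvous completes
                self._cond.wait()
            self._value, self._full, self._taken = value, True, False
            self._cond.notify_all()
            while not self._taken:  # block until a receiver takes the value
                self._cond.wait()
            self._full = False
            self._cond.notify_all()

    def recv(self):
        with self._cond:
            while not self._full or self._taken:
                self._cond.wait()
            self._taken = True
            self._cond.notify_all()
            return self._value


def producer(ch, n):
    for i in range(n):
        ch.send(i)
    ch.send(None)  # end-of-stream marker (a convention of this sketch)


def consumer(ch, out):
    while True:
        value = ch.recv()
        if value is None:
            return
        out.append(value)


def run_par(n):
    """Run producer and consumer in parallel, like Occam's PAR, and return what was received."""
    ch, received = Channel(), []
    procs = [threading.Thread(target=producer, args=(ch, n)),
             threading.Thread(target=consumer, args=(ch, received))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return received
```

The rendezvous means the producer can never run ahead of the consumer, which is what makes CSP-style concurrency easier to reason about than unsynchronized shared state.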
Microprocessor Test and Verification | 2005
Brian Kahne; Magdy S. Abadir
High-performance designs must conform to stringent timing requirements. Designers frequently use low-level optimization techniques and develop many iterations of the same block to close a timing gap. Simulation with random stimulus is the traditional method for verifying that these changes do not alter the functional behavior of the block. For the development of a new high-performance core at Freescale Semiconductor, the authors decided instead to investigate using formal techniques, in the form of sequential equivalence checking, for this form of verification. Various equivalence checking tools were evaluated for this task. Initial results looked promising, and the authors decided to integrate this capability into their design flow. This paper describes that experience and also addresses some of the problems that were exposed and how the authors plan to deal with them.
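Sequential equivalence checking asks whether two designs, possibly with different state encodings, produce the same outputs for every input sequence from their reset states. A textbook way to decide this for small finite-state machines is to explore the reachable states of their product machine and compare outputs on every transition. The Python sketch below does this by breadth-first search; it is an explicit-state illustration only (commercial tools typically rely on symbolic techniques instead), and the binary versus Gray-code counter example, including its "bug", is hypothetical.

```python
from collections import deque


def seq_equivalent(step_a, init_a, step_b, init_b, inputs):
    """Explore the product of two Mealy machines from their reset states.

    step_x(state, inp) -> (next_state, output). Returns (True, None) if the
    outputs agree on every reachable transition, else (False, input_trace)
    where input_trace is a shortest input sequence exposing the mismatch.
    """
    start = (init_a, init_b)
    parent = {start: None}  # product state -> (previous product state, input)
    work = deque([start])
    while work:
        pair = work.popleft()
        sa, sb = pair
        for inp in inputs:
            na, out_a = step_a(sa, inp)
            nb, out_b = step_b(sb, inp)
            if out_a != out_b:
                trace, node = [inp], pair
                while parent[node] is not None:  # walk back to the reset state
                    node, prev_inp = parent[node]
                    trace.append(prev_inp)
                return False, trace[::-1]
            nxt = (na, nb)
            if nxt not in parent:
                parent[nxt] = (pair, inp)
                work.append(nxt)
    return True, None


def binary_counter(s, en):
    """Mod-4 counter in binary encoding; output 1 on wrap-around from 3 to 0."""
    return (s + en) % 4, int(en == 1 and s == 3)


def gray_counter(g, en):
    """Same counter with a 2-bit Gray-encoded state (sequence 0, 1, 3, 2)."""
    b = g ^ (g >> 1)  # decode 2-bit Gray code to binary
    nb = (b + en) % 4
    return nb ^ (nb >> 1), int(en == 1 and b == 3)


def buggy_gray_counter(g, en):
    """Hypothetical faulty revision: the wrap flag is decoded one state early."""
    b = g ^ (g >> 1)
    nb = (b + en) % 4
    return nb ^ (nb >> 1), int(en == 1 and b == 2)


if __name__ == "__main__":
    print(seq_equivalent(binary_counter, 0, gray_counter, 0, [0, 1]))        # (True, None)
    print(seq_equivalent(binary_counter, 0, buggy_gray_counter, 0, [0, 1]))  # (False, [1, 1, 1])
```

Comparing outputs rather than states is what allows the two blocks to use different encodings, which is exactly the situation created by iterating on a block to close timing.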
Archive | 2015
Brian Kahne; John H. Arends; Richard G. Collins; Jim Holt
Archive | 2014
Jim Holt; Brian Kahne; William C. Moyer