
Publication


Featured research published by Alok N. Choudhary.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

The International Exascale Software Project roadmap

Jack J. Dongarra; Pete Beckman; Terry Moore; Patrick Aerts; Giovanni Aloisio; Jean Claude Andre; David Barkai; Jean Yves Berthou; Taisuke Boku; Bertrand Braunschweig; Franck Cappello; Barbara M. Chapman; Xuebin Chi; Alok N. Choudhary; Sudip S. Dosanjh; Thom H. Dunning; Sandro Fiore; Al Geist; Bill Gropp; Robert J. Harrison; Mark Hereld; Michael A. Heroux; Adolfy Hoisie; Koh Hotta; Zhong Jin; Yutaka Ishikawa; Fred Johnson; Sanjay Kale; R.D. Kenway; David E. Keyes

Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta-/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.


Computational Science & Discovery | 2009

Terascale direct numerical simulations of turbulent combustion using S3D

J.H. Chen; Alok N. Choudhary; B.R. de Supinski; M. DeVries; Evatt R. Hawkes; Scott Klasky; Wei-keng Liao; Kwan-Liu Ma; John M. Mellor-Crummey; N Podhorszki; Ramanan Sankaran; Sameer Shende; Chun Sang Yoo

Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution, and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence, terascale DNS are computationally intensive, require massive amounts of computing power, and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating the role of DNS in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scalable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data, and automation of the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations. In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D, and especially memory-intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited, thereby reducing memory bandwidth needs and hence improving S3D's nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histograms. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archiving, and to provide a graphical display of run-time diagnostics.
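The loop-level optimization mentioned above follows a standard pattern: tile a memory-intensive loop nest so that a block of data is reused from cache before it is evicted. The sketch below is a generic illustration of that technique under assumed array names and an assumed tile size, not code from S3D itself.

```c
/* Generic loop-tiling sketch (illustrative; not S3D's actual code).
 * Tiling the j loop keeps a TILE-sized segment of x cache-resident
 * while it is reused across many rows of A, cutting memory traffic. */
#include <stddef.h>

#define TILE 64  /* assumed tile size; tuned to the cache in practice */

/* y += A * x for an n-by-n row-major matrix A. */
void matvec_tiled(size_t n, const double *A, const double *x, double *y)
{
    for (size_t jj = 0; jj < n; jj += TILE) {
        size_t jend = jj + TILE < n ? jj + TILE : n;
        for (size_t i = 0; i < n; i++) {
            double sum = y[i];
            for (size_t j = jj; j < jend; j++)
                sum += A[i * n + j] * x[j];  /* x[jj..jend) reused per row */
            y[i] = sum;
        }
    }
}
```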


International Symposium on Computer Architecture | 2009

Firefly: illuminating future network-on-chip with nanophotonics

Yan Pan; Prabhat Kumar; John Kim; Gokhan Memik; Yu Zhang; Alok N. Choudhary

Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional, electrical signaling, while inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while reserving nanophotonics for global communication to realize an efficient on-chip network. A crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns, and by up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. When the energy-delay product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.
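As a rough illustration of the hybrid organization described above, the hypothetical sketch below picks a link class per packet: traffic within a cluster stays on local electrical wires, while inter-cluster traffic takes the photonic path. The cluster size and type names are assumptions for illustration, not Firefly's actual design parameters.

```c
/* Hypothetical routing-decision sketch for a hybrid, hierarchical
 * on-chip network: electrical links inside a cluster, photonic links
 * between clusters. NODES_PER_CLUSTER and the enum are assumptions. */

#define NODES_PER_CLUSTER 4  /* assumed concentration factor */

enum link_type { ELECTRICAL_LOCAL, PHOTONIC_GLOBAL };

static int cluster_of(int node) { return node / NODES_PER_CLUSTER; }

/* Short, local traffic stays on cheap electrical wires; inter-cluster
 * traffic uses the photonic crossbar reserved for global communication. */
enum link_type route(int src_node, int dst_node)
{
    return cluster_of(src_node) == cluster_of(dst_node)
               ? ELECTRICAL_LOCAL
               : PHOTONIC_GLOBAL;
}
```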


Knowledge Discovery and Data Mining | 2005

A two-phase algorithm for fast discovery of high utility itemsets

Ying Liu; Wei-keng Liao; Alok N. Choudhary

Traditional association rule mining cannot meet the demands arising from some real applications. By considering the different values of individual items as utilities, utility mining focuses on identifying the itemsets with high utilities. In this paper, we present a Two-Phase algorithm to efficiently prune down the number of candidates and precisely obtain the complete set of high utility itemsets. It performs very efficiently in terms of speed and memory cost on both synthetic and real databases, even on large databases that are difficult for existing algorithms to handle.
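For intuition, the sketch below shows the transaction-weighted utilization (TWU) pruning that drives the first phase: an item's TWU sums the utilities of every transaction containing it, which overestimates its true utility, so anything below the threshold can be safely discarded before the second, exact pass over the database. The tiny dataset, layout, and threshold are illustrative assumptions, not the paper's experimental setup.

```c
/* Minimal Phase-I sketch: prune items whose transaction-weighted
 * utilization (TWU) falls below the minimum-utility threshold.
 * TWU never underestimates true utility, so pruned items cannot be
 * high-utility; Phase II rescans to compute exact utilities of the
 * survivors. Data and threshold are illustrative assumptions. */
#include <stdio.h>

#define N_ITEMS 4
#define N_TRANS 3

/* quantity of each item in each transaction (0 = absent) */
static const int qty[N_TRANS][N_ITEMS] = {
    {1, 0, 2, 0},
    {0, 3, 1, 1},
    {2, 1, 0, 0},
};
static const int unit_profit[N_ITEMS] = {5, 2, 1, 8};

int main(void)
{
    const int min_util = 20;
    int twu[N_ITEMS] = {0};

    for (int t = 0; t < N_TRANS; t++) {
        int tu = 0;  /* transaction utility: sum over items present */
        for (int i = 0; i < N_ITEMS; i++)
            tu += qty[t][i] * unit_profit[i];
        for (int i = 0; i < N_ITEMS; i++)
            if (qty[t][i] > 0)
                twu[i] += tu;  /* item is charged the whole transaction's utility */
    }
    for (int i = 0; i < N_ITEMS; i++)
        if (twu[i] >= min_util)
            printf("item %d survives Phase I (TWU = %d)\n", i, twu[i]);
    return 0;
}
```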


ACM SIGARCH Computer Architecture News | 1993

Improved parallel I/O via a two-phase run-time access strategy

Juan Miguel del Rosario; Rajesh Bordawekar; Alok N. Choudhary

As scientists expand their models to describe physical phenomena of increasingly large extent, I/O becomes crucial, and a system with limited I/O capacity can severely constrain the performance of the entire program. We provide experimental results, performed on the Intel Touchstone Delta and nCUBE 2 I/O systems, to show that the performance of existing parallel I/O systems can vary by several orders of magnitude as a function of the data access pattern of the parallel program. We then propose a two-phase access strategy, to be implemented in a runtime system, in which the data distribution on computational nodes is decoupled from the storage distribution. Our experimental results show that performance improvements of several orders of magnitude over direct-access-based data distribution methods can be obtained, and that performance for most data access patterns can be improved to within a factor of 2 of the best performance. Further, the cost of redistribution is a very small fraction of the overall access cost.
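A condensed sketch of the two-phase idea, expressed with modern MPI rather than the paper's original runtime system: in the first phase, processes exchange data so that each one holds a contiguous slice of the file, decoupling the compute-side distribution from the storage layout; in the second phase, each process issues one large, well-formed write. The buffer sizes and the chunked compute-side layout are illustrative assumptions.

```c
/* Two-phase write sketch using modern MPI (the original paper predates
 * MPI-IO; this illustrates the idea, it is not the paper's code).
 * Phase 1 redistributes data so each rank holds one contiguous file
 * block; phase 2 issues a single large write per rank. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 4096  /* doubles exchanged per (source, destination) pair */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    size_t n = (size_t)CHUNK * nprocs;   /* elements held per rank */
    double *compute = malloc(n * sizeof(double));
    double *fileblk = malloc(n * sizeof(double));
    /* Assumed compute-side layout: chunk j of this rank's data belongs
     * to the contiguous file region owned by rank j. */
    for (size_t i = 0; i < n; i++)
        compute[i] = (double)(rank * n + i);

    /* Phase 1: exchange so each rank ends up with its contiguous block. */
    MPI_Alltoall(compute, CHUNK, MPI_DOUBLE,
                 fileblk, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Phase 2: one large, well-formed write per rank. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, off, fileblk, (int)n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(compute);
    free(fileblk);
    MPI_Finalize();
    return 0;
}
```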


IEEE International Symposium on Workload Characterization | 2006

MineBench: A Benchmark Suite for Data Mining Workloads

Ramanathan Narayanan; Berkin Özisikyilmaz; Joseph Zambreno; Gokhan Memik; Alok N. Choudhary

Data mining constitutes an important class of scientific and commercial applications. Recent advances in data extraction techniques have created vast data sets, which require increasingly complex data mining algorithms to sift through them to generate meaningful information. Computer systems, meanwhile, have grown in capability at a disproportionately slower rate, leading to a sizeable performance gap between data mining algorithms and the systems that run them. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. With this knowledge, current computer architectures can be optimized for data mining applications. In this paper, we present MineBench, a publicly available benchmark suite containing fifteen representative data mining applications belonging to various categories such as clustering, classification, and association rule mining. We believe that MineBench will be of use to those looking to characterize and accelerate data mining workloads.


Design Automation Conference | 2002

Compiler-directed scratch pad memory hierarchy design and management

Mahmut T. Kandemir; Alok N. Choudhary

One of the primary challenges in embedded system design is designing the memory hierarchy and restructuring the application to take advantage of it. This task is particularly important for embedded image and video processing applications that make heavy use of large multi-dimensional arrays of signals and nested loops. In this paper, we show that a simple reuse vector/matrix abstraction can provide the compiler with useful information in a concise form. Using this information, the compiler can either adapt the application to an existing memory hierarchy or derive a suitable memory hierarchy itself. Our initial results indicate that the compiler is very successful both in optimizing code for a given memory hierarchy and in designing a hierarchy with a reasonable performance/size ratio.
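To make the reuse vector abstraction concrete, the annotated loop nest below (names and bounds are illustrative assumptions, not taken from the paper) shows the two kinds of reuse a compiler can read off the array subscripts; arrays with strong temporal reuse, like B here, are natural candidates for scratch-pad placement.

```c
/* Illustrative loop nest annotated with reuse vectors. Writing an
 * iteration as (i, j):
 *   - B[j] is re-touched every time i advances by 1 with j fixed,
 *     i.e. temporal reuse along the vector (1, 0);
 *   - A[i][j] touches adjacent memory as j advances by 1,
 *     i.e. spatial reuse along the vector (0, 1).
 * A compiler can derive these vectors from the subscript expressions
 * and, for example, keep B in a small scratch-pad memory. */
#define N 256

void add_rows(double A[N][N], const double B[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] += B[j];  /* B: temporal reuse across i; A: spatial along j */
}
```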


Field-Programmable Custom Computing Machines | 2000

A MATLAB compiler for distributed, heterogeneous, reconfigurable computing systems

Prithviraj Banerjee; Nagaraj Shenoy; Alok N. Choudhary; Scott Hauck; C. Bachmann; Malay Haldar; Pramod G. Joisha; A. Kanhare; Anshuman Nayak; S. Periyacheri; M. Walkden; David Zaretsky

Recently, high-level languages such as MATLAB have become popular for prototyping algorithms in domains such as signal and image processing. Many of these applications, whose subtasks have diverse execution requirements, often employ distributed, heterogeneous, reconfigurable systems. These systems consist of an interconnected set of heterogeneous processing resources that provide a variety of architectural capabilities. The objective of the MATCH (MATLAB Compiler for Heterogeneous Computing Systems) compiler project at Northwestern University is to make it easier for users to develop efficient code for distributed, heterogeneous, reconfigurable computing systems. Toward this end, we are implementing and evaluating an experimental prototype of a software system that will take MATLAB descriptions of various applications and automatically map them onto a distributed computing environment consisting of embedded processors, digital signal processors, and field-programmable gate arrays built from commercial off-the-shelf components. We provide an overview of the MATCH compiler and discuss the testbed which is being used to demonstrate our ideas. We present preliminary experimental results on some benchmark MATLAB programs with the use of the MATCH compiler.


IEEE Computer | 1994

High-performance I/O for massively parallel computers: problems and prospects

J.M. del Rosario; Alok N. Choudhary

Over the past two decades (1974-94), advances in semiconductor and integrated circuit technology have fuelled the drive toward faster, ever more efficient computational machines. Today, the most powerful supercomputers can perform computation at billions of floating-point operations per second (gigaflops). This increase in capability is intensifying the demand for even more powerful machines. Computational limits for the largest supercomputers are expected to exceed the teraflops barrier in the coming years. Discussion is given on the following areas: the nature of I/O in massively parallel processing; operating and file systems; runtime systems and compilers; and networking technology. The recurrent themes in the parallel I/O problem are the existence of a great variety of access patterns and the sensitivity of current I/O systems to these access patterns. An increase in the variability of access patterns is also expected, and single resource-management approaches will likely not suffice. Providing the I/O infrastructure that will support these requirements will necessitate research in operating systems (parallel file systems, runtime systems, and drivers), language interfaces to high-performance storage systems, high-speed networking, graphics and visualization systems, and new hardware technology for I/O and storage systems.


Scientific Programming | 1996

An extended two-phase method for accessing sections of out-of-core arrays

Rajeev Thakur; Alok N. Choudhary

A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files. In this article, we describe a method for accessing sections of out-of-core arrays efficiently. Our method, the extended two-phase method, uses collective I/O: processors cooperate to combine several I/O requests into fewer, larger-granularity requests, to reorder requests so that the file is accessed in proper sequence, and to eliminate simultaneous I/O requests for the same data. In addition, the I/O workload is divided among processors dynamically, depending on the access requests. We present performance results obtained from two real out-of-core parallel applications - matrix multiplication and a Laplace's equation solver - and several synthetic access patterns, all on the Intel Touchstone Delta. These results indicate that the extended two-phase method significantly outperformed a direct (noncollective) method for accessing out-of-core array sections.
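The collective pattern described above survives in modern MPI-IO, whose collective machinery descends from this line of work. The sketch below is a present-day illustration rather than the paper's original runtime code: each process declares the non-contiguous column block of an out-of-core array it needs via a file view, and MPI_File_read_all lets the library combine and reorder the requests. The array dimensions, partitioning, and file name are assumptions.

```c
/* Collective access to a section of an out-of-core 2D array, sketched
 * with modern MPI-IO. Each rank reads one block of columns, which is
 * non-contiguous in the row-major file, and MPI_File_read_all lets the
 * library merge and reorder the requests across ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int gsizes[2] = {1024, 1024};           /* global array on disk */
    int lsizes[2] = {1024, 1024 / nprocs};  /* this rank's column block */
    int starts[2] = {0, rank * lsizes[1]};  /* assumes nprocs divides 1024 */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double *buf = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(double));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, lsizes[0] * lsizes[1], MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```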

Collaboration


Alok N. Choudhary's top co-authors and their affiliations.


Mahmut T. Kandemir

Pennsylvania State University


J. Ramanujam

Louisiana State University


Gokhan Memik

Northwestern University


Rajeev Thakur

Argonne National Laboratory


Geoffrey C. Fox

Indiana University Bloomington
