Publication


Featured research published by George S. Almasi.


Data Mining and Knowledge Discovery | 2001

Personalization of Supermarket Product Recommendations

Richard D. Lawrence; George S. Almasi; Vladimir Kotlyar; Marisa S. Viveros; Sastry S. Duri

We describe a personalized recommender system designed to suggest new products to supermarket shoppers. The recommender functions in a pervasive computing environment, namely, a remote shopping system in which supermarket customers use Personal Digital Assistants (PDAs) to compose and transmit their orders to the store, which assembles them for subsequent pickup. The recommender is meant to provide an alternative source of new ideas for customers who now visit the store less frequently. Recommendations are generated by matching products to customers based on the expected appeal of the product and the previous spending of the customer. Association mining in the product domain is used to determine relationships among product classes for use in characterizing the appeal of individual products. Clustering in the customer domain is used to identify groups of shoppers with similar spending histories. Cluster-specific lists of popular products are then used as input to the matching process. The recommender is currently being used in a pilot program with several hundred customers. Analysis of results to date has shown a 1.8% boost in program revenue as a result of purchases made directly from the list of recommended products. A substantial fraction of the accepted recommendations are from product classes new to the customer, indicating a degree of willingness to expand beyond present purchase patterns in response to reasonable suggestions.
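
To make the matching step concrete, here is a minimal Python sketch of the two-stage idea the abstract describes: shoppers are grouped by spending history, each cluster gets a ranked list of popular products, and a shopper is offered popular items they have not yet bought. All names and the scoring rule here are illustrative, not the paper's actual model.

```python
from collections import Counter

def cluster_popular(baskets_by_cluster, top_n=10):
    """Rank products by purchase frequency within each shopper cluster."""
    return {
        c: [p for p, _ in
            Counter(prod for basket in baskets for prod in basket)
            .most_common(top_n)]
        for c, baskets in baskets_by_cluster.items()
    }

def recommend(shopper_products, cluster, popular, k=3):
    """Suggest the k most popular cluster products the shopper lacks."""
    return [p for p in popular[cluster] if p not in shopper_products][:k]

# Toy data: one cluster of three shopping baskets.
baskets = {0: [["milk", "cereal"], ["milk", "bread"], ["cereal", "jam"]]}
popular = cluster_popular(baskets)
print(recommend({"milk"}, 0, popular))   # ['cereal', 'bread', 'jam']
```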


International Conference on Supercomputing | 2005

Optimization of MPI collective communication on BlueGene/L systems

George S. Almasi; Philip Heidelberger; Charles J. Archer; Xavier Martorell; C. Christopher Erway; José E. Moreira; Burkhard Steinmacher-Burow; Yili Zheng

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives. This turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
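
The micro-benchmarks themselves are not shown in the abstract; the following is a sketch of how such a collective timing loop typically looks, written with mpi4py for brevity (an assumption; the original benchmarks were C programs against the BlueGene/L MPI library).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
buf = np.ones(1 << 16, dtype=np.float64)   # 512 KiB payload
out = np.empty_like(buf)

comm.Barrier()                             # synchronize before timing
t0 = MPI.Wtime()
for _ in range(100):
    comm.Allreduce(buf, out, op=MPI.SUM)   # the collective under test
elapsed = (MPI.Wtime() - t0) / 100

if comm.rank == 0:
    print(f"allreduce of {buf.nbytes} bytes: {elapsed * 1e6:.1f} us")
```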


Data Mining and Knowledge Discovery | 1999

A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems

Richard D. Lawrence; George S. Almasi; Holly E. Rushmeier

We describe a scalable parallel implementation of the self-organizing map (SOM) suitable for data-mining applications involving clustering or segmentation against large data sets such as those encountered in the analysis of customer spending patterns. The parallel algorithm is based on the batch SOM formulation, in which the neural weights are updated at the end of each pass over the training data. The underlying serial algorithm is enhanced to take advantage of the sparseness often encountered in these data sets. Analysis of a realistic test problem shows that the batch SOM algorithm captures key features observed using the conventional on-line algorithm, with comparable convergence rates. Performance measurements on an SP2 parallel computer are given for two retail data sets and a publicly available set of census data. These results demonstrate essentially linear speedup for the parallel batch SOM algorithm, using both a memory-contained sparse formulation and a separate implementation in which the mining data is accessed directly from a parallel file system. We also present visualizations of the census data to illustrate the value of the clustering information obtained via the parallel SOM method.
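
For reference, a condensed NumPy sketch of one batch-SOM pass over a one-dimensional map is shown below: every sample is first assigned to its best-matching unit, and the weights are updated once at the end of the pass, the property that makes the algorithm parallelize cleanly. The sparse formulation and the cross-node reduction from the paper are omitted.

```python
import numpy as np

def batch_som_pass(X, W, sigma=1.0):
    """One batch pass. X: (n_samples, dim) data; W: (n_units, dim) weights."""
    # 1. Assign each sample to its best-matching unit (nearest weight vector).
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)
    # 2. Gaussian neighborhood between every unit and each sample's BMU.
    units = np.arange(len(W))
    h = np.exp(-((units[None, :] - bmu[:, None]) ** 2) / (2 * sigma ** 2))
    # 3. Batch update: neighborhood-weighted average, applied once per pass.
    return (h.T @ X) / h.sum(axis=0)[:, None]
```

In a data-parallel version, steps 1 and 2 run independently on each node's share of the samples and step 3 reduces the weighted sums globally, which is consistent with the essentially linear speedup reported.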


SIGPLAN Notices | 2003

Calculating stack distances efficiently

George S. Almasi; Calin Cascaval; David A. Padua

This paper describes our experience using the stack processing algorithm [6] for estimating the number of cache misses in scientific programs. By using a new data structure and various optimization techniques we obtain instrumented run-times within 50 to 100 times the original optimized run-times of our benchmarks.
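
As background, the stack-processing idea from [6] can be stated in a few lines; the sketch below shows only the semantics with a naive O(n) list per access, whereas the paper's contribution is a data structure that computes the same distances much faster.

```python
def stack_distances(trace):
    """LRU stack distance of each access; inf marks a cold miss."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)      # distance = current depth in stack
            stack.pop(depth)
        else:
            depth = float("inf")
        stack.insert(0, addr)              # move to most-recently-used slot
        dists.append(depth)
    return dists

# An access hits in a fully associative LRU cache of k lines iff its
# distance is less than k, so one pass yields misses for every cache size.
print(stack_distances("abcba"))            # [inf, inf, inf, 1, 2]
```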


IBM Systems Journal | 1995

Parallel file systems for the IBM SP computers

P. F. Corbett; D. G. Feitelson; Jean-Pierre Prost; George S. Almasi; Sandra Johnson Baylor; A. S. Bolmarcich; Y. Hsu; Julian Satran; Marc Snir; R. Colao; B. D. Herr; J. Kavaky; T. R. Morgan; A. Zlotek

Parallel computer architectures require innovative software solutions to utilize their capabilities. This statement is true for system software no less than for application programs. File system development for the IBM SP product line of computers started with the Vesta research project, which introduced the ideas of parallel access to partitioned files. This technology was then integrated with a conventional Advanced Interactive Executive™ (AIX™) environment to create the IBM AIX Parallel I/O File System product. We describe the design and implementation of Vesta, including user interfaces and enhancements to the control environment needed to run the system. Changes to the basic design that were made as part of the AIX Parallel I/O File System are identified and justified.


Programming Language Design and Implementation | 2002

MaJIC: compiling MATLAB for speed and responsiveness

George S. Almasi; David A. Padua

This paper presents and evaluates techniques to improve the execution performance of MATLAB. Previous efforts concentrated on source-to-source translation and batch compilation; MaJIC provides an interactive frontend that looks like MATLAB and compiles/optimizes code behind the scenes in real time, employing a combination of just-in-time and speculative ahead-of-time compilation. Performance results show that the proper mixture of these two techniques can yield near-zero response time as well as performance gains previously achieved only by batch compilers.
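
As a toy illustration of the just-in-time side of this strategy, the sketch below caches a specialization per argument-type combination on first call; MaJIC itself infers MATLAB types and emits native code, so only the dispatch structure is mimicked here.

```python
def jit(fn):
    """Compile-on-first-call dispatcher keyed on argument types."""
    cache = {}
    def dispatch(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            cache[key] = fn       # a real JIT would emit specialized code here
        return cache[key](*args)
    return dispatch

@jit
def axpy(a, x, y):
    return a * x + y

print(axpy(2, 3, 4))        # first call "compiles" the (int, int, int) version
print(axpy(2.0, 3.0, 4.0))  # a new type signature triggers a new entry
```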


IBM Journal of Research and Development | 2005

Design and implementation of message-passing services for the Blue Gene/L supercomputer

George S. Almasi; Charles J. Archer; José G. Castaños; John A. Gunnels; C. Christopher Erway; Philip Heidelberger; Xavier Martorell; José E. Moreira; Kurt Walter Pinnow; Joe Ratterman; Burkhard Steinmacher-Burow; William Gropp; Brian R. Toonen

The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
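
A sketch of the kind of two-rank bandwidth test used to compare such modes is shown below, written with mpi4py for brevity (an assumption; the paper's measurements were taken with the C message layer and MPI on BG/L hardware).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
buf = np.zeros(1 << 20, dtype=np.uint8)   # 1 MiB message
reps = 50

comm.Barrier()                            # synchronize before timing
t0 = MPI.Wtime()
for _ in range(reps):
    if comm.rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif comm.rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = MPI.Wtime() - t0

if comm.rank == 0:
    mbytes = 2 * reps * buf.nbytes / 1e6  # data crosses the link twice per rep
    print(f"ping-pong bandwidth: {mbytes / elapsed:.1f} MB/s")
```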


Programming Language Design and Implementation | 2006

Shared memory programming for large scale machines

Christopher Barton; Călin Caşcaval; George S. Almasi; Yili Zheng; Montse Farreras; Siddhartha Chatterjee; José Nelson Amaral

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks --- HPC RandomAccess, HPC STREAM and NAS CG --- to obtain scaling and absolute performance numbers on up to 131,072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
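
The affinity-test elimination can be pictured as follows: in UPC, upc_forall(i = 0; i < n; i++; &a[i]) executes iteration i only on the thread that owns a[i]. A naive translation tests ownership on every iteration; the optimized loop strides directly over the indices the thread owns. The Python sketch below stands in for the compiler's generated code and assumes a cyclic layout.

```python
THREADS = 4

def naive_iterations(me, n):
    # Per-iteration affinity test, evaluated n times by every thread.
    return [i for i in range(n) if i % THREADS == me]

def optimized_iterations(me, n):
    # Same set of iterations with the test compiled away: start at this
    # thread's first owned index and stride by THREADS.
    return list(range(me, n, THREADS))

assert naive_iterations(2, 10) == optimized_iterations(2, 10)   # [2, 6]
```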


High-Performance Computer Architecture | 2006

High performance file I/O for the Blue Gene/L supercomputer

Hao Yu; Ramendra K. Sahoo; C. Howson; George S. Almasi; José G. Castaños; Manish Gupta; José E. Moreira; Jeffrey J. Parker; Thomas Eugene Engelsiepen; Robert B. Ross; Rajeev Thakur; Robert Latham; William Gropp

Parallel I/O plays a crucial role for most data-intensive applications running on massively parallel systems like Blue Gene/L, which promise enormous computational capability. We designed and implemented a highly scalable parallel file I/O architecture for Blue Gene/L, which leverages the hierarchical and functional partitioning design of the system software, with separate computational and I/O cores. The architecture exploits the scalability of GPFS (General Parallel File System) at the backend, while using MPI I/O as an interface between the application I/O and the file system. We demonstrate the impact of our high-performance I/O solution for Blue Gene/L with a comprehensive evaluation comprising a number of widely used parallel I/O benchmarks and I/O-intensive applications. Our design and implementation not only delivers at least an order of magnitude speedup in I/O bandwidth for the real-scale application HOMME (achieving aggregate bandwidths of 1.8 GB/s and 2.3 GB/s for write and read accesses, respectively), but also supports high-level parallel I/O data interfaces such as parallel HDF5 and parallel NetCDF scaling up to a large number of processors.
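
On the application side, the I/O path described above is driven through MPI-IO; the sketch below shows a collective write of one block per rank using mpi4py (an assumption for brevity; the paper's benchmarks are C codes and higher-level HDF5/NetCDF interfaces).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
block = np.full(1 << 18, comm.rank, dtype=np.float64)   # 2 MiB per rank

# Each rank writes its own contiguous block of a shared file; the
# collective call lets the MPI-IO layer aggregate requests before
# they reach the parallel file system.
fh = MPI.File.Open(comm, "out.dat", MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(comm.rank * block.nbytes, block)
fh.Close()
```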


ACM SIGARCH Computer Architecture News | 2003

Dissecting Cyclops: a detailed analysis of a multithreaded architecture

George S. Almasi; Călin Caşcaval; José G. Castaños; Monty M. Denneau; Derek Lieber; José E. Moreira; Henry S. Warren

Multiprocessor systems-on-a-chip offer a structured approach to managing complexity in chip design. Cyclops is a new family of multithreaded architectures that integrates processing logic, main memory and communications hardware on a single chip. Its simple, hierarchical design allows the hardware architect to manage a large number of components to meet the design constraints in terms of performance, power or application domain. This paper evaluates several alternative Cyclops designs with different relative costs and trade-offs. We compare the performance of several scientific kernels running on different configurations of this architecture. We show that by increasing the number of threads sharing a floating-point unit we can hide fairly high cache and memory latencies. We show that we can reach the theoretical peak performance of the chip, and we identify the optimal balance of components for each application. We demonstrate that the design is well adapted to problems that are difficult to optimize; for example, sparse matrix-vector multiplication obtains 16 GFlops out of a peak of 32 GFlops.
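
The latency-hiding claim admits a simple back-of-the-envelope model, sketched below with illustrative numbers (not actual Cyclops parameters): with t threads sharing a floating-point unit, the stalls of one thread overlap with useful work from the others, and the unit saturates once t times the per-thread work covers the stall latency.

```python
def fpu_utilization(threads, work_cycles, miss_latency):
    """Fraction of cycles the shared FPU stays busy under a simple
    round-robin model: each thread computes for work_cycles, then
    stalls for miss_latency waiting on memory."""
    return min(1.0, threads * work_cycles / (work_cycles + miss_latency))

for t in (1, 2, 4, 8, 16):
    print(t, f"{fpu_utilization(t, work_cycles=10, miss_latency=70):.2f}")
# Utilization climbs from 0.12 with one thread to 1.00 at eight or more.
```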
