
Tal Ben-Nun

Hebrew University of Jerusalem




Publication

Featured research published by Tal Ben-Nun.

International Conference on Cluster Computing | 2010

A package for OpenCL based heterogeneous computing on clusters with many GPU devices

Amnon Barak; Tal Ben-Nun; Ely Levy; Amnon Shiloh

Heterogeneous systems provide new opportunities to increase the performance of parallel applications on clusters with CPU and GPU architectures. Currently, applications that utilize GPU devices run their device-executable code on local devices in their respective hosting-nodes. This paper presents a package for running OpenMP, C++ and unmodified OpenCL applications on clusters with many GPU devices. This Many GPUs Package (MGP) includes an implementation of the OpenCL specifications and extensions of the OpenMP API that allow applications on one hosting-node to transparently utilize cluster-wide devices (CPUs and/or GPUs). MGP provides means for reducing the complexity of programming and running parallel applications on clusters, including scheduling based on task dependencies and buffer management. The paper presents MGP and the performance of its internals.

Langmuir | 2010

Solution X-ray Scattering Form Factors of Supramolecular Self-Assembled Structures

Pablo Szekely; Avi Ginsburg; Tal Ben-Nun; Uri Raviv

In this paper, the analysis of several involved models, relevant for evaluating solution X-ray scattering form factors of supramolecular self-assembled structures, is presented. Different geometrical models are discussed, and the scattering form factors of several layers of those shapes are evaluated. The thickness and the electron density of each layer are parameters in those models. The models include Gaussian electron density profiles and/or uniform electron density profiles at each layer. Various forms of cuboid, layered, spherical, cylindrical, and helical structures are carefully treated. The orientation-averaged scattering intensities of those form factors are calculated. Similar classes of form factors are examined and compared, and their fit to scattering data of lipid bilayers, capsids of the Simian virus 40 virus-like particle and microtubule is discussed. A more detailed model of discrete helices composed of uniform spheres was derived and compared to solution X-ray scattering data of microtubules. Our analyses show that when high-resolution data are available the more detailed models with Gaussian electron density profiles or helical structures composed of spheres should be used to better capture all the elements in the scattering curves. The models presented in this paper may also be applied, with minor corrections, for the analysis of solution neutron scattering data.
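As a rough illustration of the layered uniform-density models described above, the following sketch evaluates the textbook scattering amplitude of concentric uniform spherical layers. It is not taken from the paper: the function names, the argument layout, and the choice of a sphere as the example geometry are assumptions made here for illustration.

```python
import math

def sphere_amplitude(q, radius, contrast):
    """Scattering amplitude of a uniform sphere with the given density contrast."""
    volume = 4.0 / 3.0 * math.pi * radius ** 3
    x = q * radius
    if x < 1e-6:
        return contrast * volume  # forward-scattering limit as q -> 0
    return contrast * volume * 3.0 * (math.sin(x) - x * math.cos(x)) / x ** 3

def layered_sphere_amplitude(q, radii, densities, solvent):
    """Amplitude of concentric uniform layers.

    radii:     outer radius of each layer, innermost first
    densities: electron density of each layer, innermost first
    solvent:   electron density of the surrounding medium
    """
    total = 0.0
    for i, (radius, rho) in enumerate(zip(radii, densities)):
        # Each interface contributes a sphere weighted by the density step across it.
        outside = densities[i + 1] if i + 1 < len(densities) else solvent
        total += sphere_amplitude(q, radius, rho - outside)
    return total

def intensity(q, radii, densities, solvent):
    """For a sphere the orientation average is trivial: I(q) = F(q)^2."""
    return layered_sphere_amplitude(q, radii, densities, solvent) ** 2
```

The thickness of a layer is the difference between consecutive radii, so the radii and densities together play the role of the per-layer fitting parameters mentioned in the abstract.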

Journal of Applied Crystallography | 2010

X+: a comprehensive computationally accelerated structure analysis tool for solution X-ray scattering from supramolecular self-assemblies

Tal Ben-Nun; Avi Ginsburg; Pablo Szekely; Uri Raviv

X+ is a user-friendly multi-core accelerated program that fully analyses solution X-ray scattering radially integrated images. This software is particularly useful for analysing supramolecular self-assemblies, often found in biology, and for reconstructing the scattering signal in its entirety. The program enables various ways of subtracting background noise. The user selects a geometric model and defines as many layers of that shape as needed. The thickness and electron density of each layer are the fitting parameters. An initial guess is input by the user and the program calculates the form-factor parameters that best fit the data. The polydispersity of one size parameter at a time can be taken into account. The program can then address the assembly of those shapes into different lattice symmetries. This is accounted for by fitting the parameters of the structure factor, using various peak line shapes. The models of the program and selected features are presented. Among them are the model-fitting procedure, which includes both absolute and relative constraints, data smoothing, signal decomposition for separation of form and structure factors, goodness-of-fit verification procedures, error estimation, and automatic feature recognition in the data, such as correlation peaks and baseline. The program's intuitive graphical user interface runs on Windows PCs. Using X+, the exact structure of a microtubule in a crowded environment, and the structure, domain size, and elastic and interaction parameters of lipid bilayers, were obtained.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Memory access patterns: the missing piece of the multi-GPU puzzle

Tal Ben-Nun; Ely Levy; Amnon Barak; Eri Rubin

With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations

Tal Ben-Nun; Michael Sutton; Sreepathi Pai; Keshav Pingali

Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This paper proposes constructs for asynchronous multi-GPU programming, and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.

ACM Transactions on Architecture and Code Optimization | 2015

MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction

Eri Rubin; Ely Levy; Amnon Barak; Tal Ben-Nun

GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires deep understanding of the underlying architecture. The developer must struggle with complex index calculations and manual memory transfers. This article classifies memory access patterns used in most parallel algorithms, based on Berkeley’s Parallel “Dwarfs.” It then proposes the MAPS framework, a device-level memory abstraction that facilitates memory access on GPUs, alleviating complex indexing using on-device containers and iterators. This article presents an implementation of MAPS and shows that its performance is comparable to carefully optimized implementations of real-world applications.

International Parallel and Distributed Processing Symposium | 2009

A global scheduling framework for virtualization environments

Yoav Etsion; Tal Ben-Nun; Dror G. Feitelson

A premier goal of resource allocators in virtualization environments is to control the relative resource consumption of the different virtual machines, and moreover, to be able to change the relative allocations at will. However, it is not clear what it means to provide a certain fraction of the machine when multiple resources are involved. We suggest that a promising interpretation is to identify the system bottleneck at each instant, and to enforce the desired allocation on that device. This in turn induces an efficient allocation of the other devices.
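The bottleneck-driven interpretation above can be sketched as follows. This is a minimal illustration, not the authors' framework: the function names, the use of utilization as the bottleneck criterion, and the dictionary-based inputs are all assumptions.

```python
def find_bottleneck(utilization):
    """Return the name of the most heavily utilized resource (assumed criterion)."""
    return max(utilization, key=utilization.get)

def enforce_on_bottleneck(utilization, capacity, shares):
    """Split the bottleneck device's capacity among VMs in proportion to their shares.

    utilization: resource name -> fraction of capacity currently in use
    capacity:    resource name -> capacity in that resource's own units
    shares:      VM name -> relative weight
    """
    device = find_bottleneck(utilization)
    total = sum(shares.values())
    allocation = {vm: capacity[device] * weight / total for vm, weight in shares.items()}
    return device, allocation
```

For example, with the disk at 95% utilization and shares of 3:1, the sketch enforces a 3:1 split of disk capacity and leaves the other devices unconstrained.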

Journal of Chemical Information and Modeling | 2016

Reciprocal Grids: A Hierarchical Algorithm for Computing Solution X-ray Scattering Curves from Supramolecular Complexes at High Resolution

Avi Ginsburg; Tal Ben-Nun; Roi Asor; Asaf Shemesh; Israel Ringel; Uri Raviv

In many biochemical processes large biomolecular assemblies play important roles. X-ray scattering is a label-free bulk method that can probe the structure of large self-assembled complexes in solution. As we demonstrate in this paper, solution X-ray scattering can measure complex supramolecular assemblies at high sensitivity and resolution. At high resolution, however, data analysis of larger complexes is computationally demanding. We present an efficient method to compute the scattering curves from complex structures over a wide range of scattering angles. In our computational method, structures are defined as hierarchical trees in which repeating subunits are docked into their assembly symmetries, describing the manner in which subunits repeat in the structure (in other words, the locations and orientations of the repeating subunits). The amplitude of the assembly is calculated by computing the amplitudes of the basic subunits on 3D reciprocal-space grids, moving up in the hierarchy, calculating the grids of larger structures, and repeating this process for all the leaves and nodes of the tree. For very large structures, we developed a hybrid method that sums grids of smaller subunits in order to avoid numerical artifacts. We developed protocols for obtaining high-resolution solution X-ray scattering data from taxol-free microtubules at a wide range of scattering angles. We then validated our method by adequately modeling these high-resolution data. The higher speed and accuracy of our method, compared with existing methods, are demonstrated for smaller structures: a short microtubule and tobacco mosaic virus. Our algorithm may be integrated into various structure prediction computational tools, simulations, and theoretical models, and provide a means of testing their predicted structural models by calculating the expected X-ray scattering curve and comparing it with experimental data.
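The hierarchical idea can be illustrated with a small recursive sketch. It is a simplification under stated assumptions, not the published algorithm: amplitudes are evaluated at a single wave vector rather than on 3D reciprocal-space grids, subunit orientations are omitted, the phase convention exp(i q·r) is assumed, and the names `amplitude`, `point`, `dimer`, and `tetramer` are hypothetical.

```python
import cmath

def amplitude(node, q):
    """Recursively evaluate the scattering amplitude of a hierarchy at wave vector q.

    A leaf is a callable q -> complex amplitude of a basic subunit.
    An inner node is a list of (translation, child) pairs describing
    where copies of the child sit in the assembly.
    """
    if callable(node):
        return node(q)
    total = 0j
    for shift, child in node:
        # Translating a subunit by r multiplies its amplitude by exp(i q.r).
        phase = cmath.exp(1j * sum(qi * ri for qi, ri in zip(q, shift)))
        total += phase * amplitude(child, q)
    return total

def point(q):
    """Hypothetical basic subunit: a single point scatterer of unit weight."""
    return 1.0 + 0j

# Two levels of hierarchy: a dimer of points, then a dimer of dimers.
dimer = [((-0.5, 0.0, 0.0), point), ((0.5, 0.0, 0.0), point)]
tetramer = [((0.0, -1.0, 0.0), dimer), ((0.0, 1.0, 0.0), dimer)]
```

Calling `amplitude(tetramer, q)` reuses the dimer description at both positions, which is the property that lets a tree-based description avoid enumerating every atom of a large repeating assembly.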

Haifa Experimental Systems Conference | 2010

Design and implementation of a generic resource sharing virtual time dispatcher

Tal Ben-Nun; Yoav Etsion; Dror G. Feitelson

Virtual machine monitors, especially when used for server consolidation, need to enforce a predefined sharing of resources among the running virtual machines. We propose a new mechanism for doing so that provides improved pacing in the face of heterogeneous allocations and priorities. This mechanism borrows from token-bucket metering and from virtual-time scheduling, and prioritizes the different clients based on the divergence between their desired allocations and their actual consumption. The ideas are demonstrated by implementations for the CPU and networking subsystems of the Linux kernel. Notably, both use exactly the same basic module; future plans include using it for disk I/O as well.
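The prioritization rule described above, picking the client whose consumption lags its desired allocation the most, can be sketched in a few lines. This is an illustrative toy, not the kernel module from the paper: the `Client` class, the fixed quantum, and the exact divergence formula are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Client:
    name: str
    share: float           # desired fraction of the resource (shares sum to 1)
    consumed: float = 0.0  # resource units actually used so far

def pick_next(clients):
    """Return the client whose consumption lags its desired allocation the most."""
    total = sum(c.consumed for c in clients)
    # Divergence = what the client should have received minus what it actually got.
    return max(clients, key=lambda c: c.share * total - c.consumed)

def simulate(clients, quanta, quantum=1.0):
    """Dispatch `quanta` fixed-size quanta and return the order of client names."""
    order = []
    for _ in range(quanta):
        chosen = pick_next(clients)
        chosen.consumed += quantum
        order.append(chosen.name)
    return order
```

Because the rule depends only on shares and consumption counters, the same function could in principle serve any resource that can be metered in quanta, which mirrors the abstract's point about reusing one module for CPU and networking.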

Data Management on New Hardware | 2017

Big data causing big (TLB) problems: taming random memory accesses on the GPU

Tomas Karnagel; Tal Ben-Nun; Matthias Werner; Dirk Habich; Wolfgang Lehner

GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash table accesses or random sampling, the GPU performance can suffer. In particular, when such accesses scale beyond 2 GB of data, performance drops by an order of magnitude. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13×.
