Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Aleksandar Zlateski is active.

Publication


Featured research published by Aleksandar Zlateski.


Nature | 2014

Space-time wiring specificity supports direction selectivity in the retina

Jinseop S. Kim; Matthew J. Greene; Aleksandar Zlateski; Kisuk Lee; Mark F. Richardson; Srinivas C. Turaga; Michael Purcaro; Matthew Balkam; Amy Robinson; Bardia Fallah Behabadi; Michael Campos; Winfried Denk; H. Sebastian Seung; EyeWirers

How does the mammalian retina detect motion? This classic problem in visual neuroscience has remained unsolved for 50 years. In search of clues, here we reconstruct Off-type starburst amacrine cells (SACs) and bipolar cells (BCs) in serial electron microscopic images with help from EyeWire, an online community of ‘citizen neuroscientists’. On the basis of quantitative analyses of contact area and branch depth in the retina, we find evidence that one BC type prefers to wire with a SAC dendrite near the SAC soma, whereas another BC type prefers to wire far from the soma. The near type is known to lag the far type in time of visual response. A mathematical model shows how such ‘space–time wiring specificity’ could endow SAC dendrites with receptive fields that are oriented in space–time and therefore respond selectively to stimuli that move in the outward direction from the soma.
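
The "space-time wiring" argument can be illustrated with a toy computation. The sketch below is an illustration only, not the paper's model; the positions, time constants, and bar speed are made-up values. It sums a lagged "near" bipolar-cell input and a faster "far" input along one SAC dendrite and shows that the summed peak is larger for outward than for inward motion.

```python
# A minimal sketch (not the paper's model): two bipolar-cell inputs on one SAC
# dendrite, a "near" input with a lagged response and a "far" input with a
# faster response. A bar moving outward (soma -> tip) reaches the near input
# first; its lag then lines it up in time with the far input, so the summed
# peak is larger for outward than for inward motion.
import numpy as np

dt = 1e-3                      # simulation time step (s)
t = np.arange(0.0, 1.0, dt)    # 1 s of simulated time

def alpha(t, tau):
    """Alpha-function temporal filter with time constant tau (s)."""
    s = np.clip(t, 0.0, None)
    return (s / tau) * np.exp(1.0 - s / tau)

near_pos, far_pos = 0.02, 0.10   # positions along the dendrite (assumed, mm)
near_tau, far_tau = 0.12, 0.04   # near input lags the far input (assumed, s)
speed = 0.5                      # bar speed (assumed, mm/s)

def peak_response(direction):
    """Peak of the summed inputs for a bar moving outward (+1) or inward (-1)."""
    positions = np.array([near_pos, far_pos])
    taus = np.array([near_tau, far_tau])
    if direction < 0:            # inward motion crosses the dendrite tip-first
        arrival = (far_pos - positions) / speed
    else:                        # outward motion crosses it soma-first
        arrival = (positions - near_pos) / speed
    total = sum(alpha(t - a, tau) for a, tau in zip(arrival, taus))
    return total.max()

print("outward peak:", peak_response(+1))   # larger peak
print("inward  peak:", peak_response(-1))   # smaller peak
```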


International Parallel and Distributed Processing Symposium | 2016

ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines

Aleksandar Zlateski; Kisuk Lee; H. Sebastian Seung

Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent's theorem to the task dependency graph implies that linear speedup with the number of processors is attainable within the PRAM model of parallel computation, for wide network architectures. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. We implement the algorithm with a publicly available software package called ZNN. Benchmarking with multi-core CPUs shows that ZNN can attain speedup roughly equal to the number of physical cores. We also show that ZNN can attain over 90× speedup on a many-core CPU (Xeon Phi™ Knights Corner). These speedups are achieved for network architectures with widths that are in common use. The task parallelism of the ZNN algorithm is suited to CPUs, while the SIMD parallelism of previous algorithms is compatible with GPUs. Through examples, we show that ZNN can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch. ZNN may be less costly to develop and maintain, due to the relative ease of general-purpose CPU programming.
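
Most of the tasks in the decomposition are convolutions or FFTs. The sketch below is a minimal numpy illustration of FFT-based "valid" 3D convolution, the kind of kernel ZNN parallelizes; it is not code from the ZNN package, and the nested loop at the bottom is only a brute-force correctness check.

```python
# A minimal numpy illustration of FFT-based "valid" 3D convolution, the kind of
# task the ZNN decomposition produces. Not code from the ZNN package.
import numpy as np

def fft_conv3d_valid(image, kernel):
    """'Valid' 3D convolution (flipped kernel) computed via zero-padded FFTs."""
    full = tuple(i + k - 1 for i, k in zip(image.shape, kernel.shape))
    spectrum = np.fft.rfftn(image, full) * np.fft.rfftn(kernel, full)
    y = np.fft.irfftn(spectrum, full)
    # Keep only the region where the kernel lies entirely inside the image.
    return y[tuple(slice(k - 1, i) for i, k in zip(image.shape, kernel.shape))]

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 32))
ker = rng.standard_normal((5, 5, 5))

# Brute-force reference: correlate with the flipped kernel (i.e., convolve).
flipped = ker[::-1, ::-1, ::-1]
ref = np.empty(tuple(i - k + 1 for i, k in zip(img.shape, ker.shape)))
for z in range(ref.shape[0]):
    for y_ in range(ref.shape[1]):
        for x in range(ref.shape[2]):
            ref[z, y_, x] = np.sum(img[z:z + 5, y_:y_ + 5, x:x + 5] * flipped)

print(np.allclose(fft_conv3d_valid(img, ker), ref))  # True
```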


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

ZNNi: Maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs

Aleksandar Zlateski; Kisuk Lee; H. Sebastian Seung

Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation and object detection and localization. Here we consider the parallelization of inference, i.e., the application of a previously trained ConvNet, with emphasis on 3D images. Our goal is to maximize throughput, defined as the number of output voxels computed per unit time. We propose CPU and GPU primitives for convolutional and pooling layers, which are combined to create CPU, GPU, and CPU-GPU inference algorithms. The primitives include convolution based on highly efficient padded and pruned FFTs. Our theoretical analyses and empirical tests reveal a number of interesting findings. For example, adding host RAM can be a more efficient way of increasing throughput than adding another GPU or more CPUs. Furthermore, our CPU-GPU algorithm can achieve greater throughput than the sum of CPU-only and GPU-only throughputs.
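
Throughput here means output voxels computed per unit time. The sketch below is an illustration of that metric, not the ZNNi primitives; the layer is a simple 3D max pooling (one of the two layer types mentioned), and the volume size and pooling window are arbitrary.

```python
# A minimal sketch of the throughput metric used in the abstract: output voxels
# produced per second, here measured for a toy 3D max-pooling primitive.
import time
import numpy as np

def max_pool3d(volume, window=2):
    """Non-overlapping 3D max pooling; trims dims to multiples of window."""
    d, h, w = (s // window for s in volume.shape)
    v = volume[:d * window, :h * window, :w * window]
    v = v.reshape(d, window, h, window, w, window)
    return v.max(axis=(1, 3, 5))

rng = np.random.default_rng(0)
volume = rng.standard_normal((256, 256, 256)).astype(np.float32)

start = time.perf_counter()
out = max_pool3d(volume, window=2)
elapsed = time.perf_counter() - start

# Throughput as defined in the abstract: number of output voxels per second.
print(out.shape, f"{out.size / elapsed:,.0f} output voxels / second")
```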


Journal of Parallel and Distributed Computing | 2017

Scalable training of 3D convolutional networks on multi- and many-cores

Aleksandar Zlateski; Kisuk Lee; H. Sebastian Seung

Convolutional networks (ConvNets) have become a popular approach to computer vision. Here we consider the parallelization of ConvNet training, which is computationally costly. Our novel parallel algorithm is based on decomposition into a set of tasks, most of which are convolutions or FFTs. Theoretical analysis suggests that linear speedup with the number of processors is attainable. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. Benchmarking with multi-core CPUs shows speedup roughly equal to the number of physical cores. We also demonstrate 90× speedup on a many-core CPU (Xeon Phi Knights Corner). Our algorithm can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch.
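
The "almost wait-free" accumulation can be pictured as follows: each worker adds into a private partial buffer, and only a single final reduction is serialized. The sketch below illustrates that structure; it is not the ZNN implementation, and Python's GIL means it only shows the bookkeeping, not real lock-free performance.

```python
# Rough illustration of accumulating convolution outputs that converge on the
# same node while minimizing time in critical sections: per-worker partial
# sums, reduced once at the end. Not the ZNN implementation.
import threading
import numpy as np

def accumulate_convergent(outputs, num_workers=4):
    """Sum a list of equally shaped arrays using per-worker partial sums."""
    partial = [np.zeros_like(outputs[0]) for _ in range(num_workers)]

    def worker(wid):
        # Each worker handles a strided slice of the outputs, lock-free.
        for out in outputs[wid::num_workers]:
            partial[wid] += out

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # The only serialized section: one reduction over num_workers buffers.
    return sum(partial)

rng = np.random.default_rng(0)
outs = [rng.standard_normal((64, 64, 64)) for _ in range(32)]
print(np.allclose(accumulate_convergent(outs), sum(outs)))  # True
```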


International Conference on Supercomputing | 2017

Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs

Aleksandar Zlateski; H. Sebastian Seung

Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel and vectorized algorithm for N-D convolutional layers. Our goal is to achieve high utilization of available FLOPS, independent of ConvNet architecture and CPU properties (e.g. vector units, number of cores, cache sizes). Our approach is to rely on the compiler to optimize code, thereby removing the need for hand-tuning. We assume that the network architecture is known at compile-time. Our serial algorithm divides the computation into small sub-tasks designed to be easily optimized by the compiler for a specific CPU. Sub-tasks are executed in an order that maximizes cache reuse. We parallelize the algorithm by statically scheduling tasks to be executed by each core. Our novel compile-time recursive scheduling algorithm is capable of dividing the computation evenly between an arbitrary number of cores, regardless of ConvNet architecture. It introduces zero runtime overhead and minimal synchronization overhead. We demonstrate that our serial primitives efficiently utilize available FLOPS (75-95%), while our parallel algorithm attains 50-90% utilization on 64+ core machines. Our algorithm is competitive with the fastest CPU implementation to date (MKL2017) for 2D object recognition, and performs much better for image segmentation. For 3D ConvNets we demonstrate comparable performance to the latest GPU hardware and software even though the CPU is only capable of half the FLOPS of the GPU.
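
The recursive scheduling idea, splitting a cost-weighted task set between two halves of the core set until each core holds its own list, can be sketched as follows. This is an illustration only: the paper's scheduler runs at compile time on ConvNet sub-tasks, whereas the task costs here are arbitrary numbers.

```python
# Rough sketch of recursive static scheduling: tasks with known costs are split
# between two halves of the core set, recursing until each core has its own
# task list. Illustration only, not the paper's compile-time scheduler.
def schedule(tasks, num_cores):
    """Assign (cost, task_id) pairs to num_cores cores by recursive bisection."""
    if num_cores == 1:
        return [list(tasks)]
    left_cores = num_cores // 2
    right_cores = num_cores - left_cores
    target = sum(cost for cost, _ in tasks) * left_cores / num_cores

    # Greedily fill the left half up to its proportional share of the cost.
    tasks = sorted(tasks, key=lambda t: t[0], reverse=True)
    left, right, left_cost = [], [], 0.0
    for cost, tid in tasks:
        if left_cost + cost <= target or not left:
            left.append((cost, tid))
            left_cost += cost
        else:
            right.append((cost, tid))
    return schedule(left, left_cores) + schedule(right, right_cores)

# Example: 12 tasks of uneven cost statically assigned to 4 cores.
tasks = [(c, i) for i, c in enumerate([9, 7, 7, 5, 5, 4, 3, 3, 2, 2, 1, 1])]
for core, assigned in enumerate(schedule(tasks, 4)):
    cost = sum(c for c, _ in assigned)
    print(f"core {core}: cost {cost:2d}  tasks {[i for _, i in assigned]}")
```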


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2018

Optimizing N-dimensional, Winograd-based convolution for manycore CPUs

Zhen Jia; Aleksandar Zlateski; Kai Li

Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They can achieve only slightly better, and often worse, performance than better-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on Intel Xeon Phi manycore processors. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
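
The Winograd idea can be shown in its simplest 1D form, F(2,3): each tile of two outputs is computed with four elementwise multiplications instead of six, using the standard transform matrices B^T, G, and A^T. The sketch below is that textbook special case, not the paper's N-dimensional, arbitrary-kernel-size algorithm.

```python
# Standard 1D Winograd F(2,3) convolution: Y = A^T [ (G g) * (B^T d) ],
# 4 multiplications per 2 outputs instead of 6. The paper generalizes this
# scheme to N dimensions and arbitrary kernel sizes; this is only the
# textbook special case for illustration.
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(signal, kernel):
    """1D 'valid' correlation (kernel length 3, even number of outputs)."""
    n_out = len(signal) - 2
    u = G @ kernel                      # transform the kernel once
    out = np.empty(n_out)
    for start in range(0, n_out - 1, 2):
        d = signal[start:start + 4]     # overlapping 4-sample input tile
        out[start:start + 2] = AT @ ((BT @ d) * u)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(10)             # 10 - 2 = 8 outputs (even)
k = rng.standard_normal(3)
direct = np.array([x[i:i + 3] @ k for i in range(8)])
print(np.allclose(winograd_f23(x, k), direct))  # True
```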


Neural Information Processing Systems | 2015

Recursive training of 2D-3D convolutional networks for neuronal boundary detection

Kisuk Lee; Aleksandar Zlateski; Ashwin Vishwanathan; H. Sebastian Seung


arXiv: Computer Vision and Pattern Recognition | 2015

Image Segmentation by Size-Dependent Single Linkage Clustering of a Watershed Basin Graph.

Aleksandar Zlateski; H. Sebastian Seung


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

A Multicore Path to Connectomics-on-Demand

Alexander Matveev; Yaron Meirovitch; Hayk Saribekyan; Wiktor Jakubiuk; Tim Kaler; Gergely Ódor; David M. Budden; Aleksandar Zlateski; Nir Shavit


arXiv: Distributed, Parallel, and Cluster Computing | 2016

ZNNi - Maximizing the Inference Throughput of 3D Convolutional Networks on Multi-Core CPUs and GPUs.

Aleksandar Zlateski; Kisuk Lee; H. Sebastian Seung

Collaboration


Dive into Aleksandar Zlateski's collaborations.

Top Co-Authors

Kisuk Lee, Massachusetts Institute of Technology
Ashwin Vishwanathan, Massachusetts Institute of Technology
Kai Li, Princeton University
Zhen Jia, Chinese Academy of Sciences
Alexander Matveev, Massachusetts Institute of Technology
David M. Budden, Massachusetts Institute of Technology
Gergely Ódor, Massachusetts Institute of Technology
Hayk Saribekyan, Massachusetts Institute of Technology