Daniel Nemirovsky | Researchain

Publication

Featured research published by Daniel Nemirovsky.

IEEE Computer Architecture Letters | 2015

Thread Lock Section-Aware Scheduling on Asymmetric Single-ISA Multi-Core

Nikola Markovic; Daniel Nemirovsky; Osman S. Unsal; Mateo Valero; Adrian Cristal

As thread level parallelism in applications has continued to expand, so has research in chip multi-core processors. As more and more applications become multi-threaded, we expect to find a growing number of threads executing on a machine. As a consequence, the operating system will require increasingly larger amounts of CPU time to schedule these threads efficiently. Instead of perpetuating the trend of performing more complex thread scheduling in the operating system, we propose a scheduling mechanism that can be efficiently implemented in hardware as well. Our approach of identifying multi-threaded application bottlenecks such as thread synchronization sections complements the Fairness-aware Scheduler method. It achieves an average speedup of 11.5 percent (geometric mean) compared to the state-of-the-art Fairness-aware Scheduler.
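The abstract reports its result as a geometric mean. As a minimal sketch of how such a figure is typically computed from per-benchmark runtimes (the function name and the sample runtimes are illustrative assumptions, not data from the paper):

```python
import math

def geometric_mean_speedup(baseline_times, new_times):
    """Geometric mean of per-benchmark speedups (baseline time / new time)."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    # Average the logarithms, then exponentiate, to get the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Made-up runtimes in seconds for two benchmarks, for illustration only.
example = geometric_mean_speedup([10.0, 20.0], [5.0, 20.0])
```

The geometric mean is the usual choice for averaging ratios because it gives the same ranking regardless of which system is used as the baseline.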

European Conference on Parallel Processing | 2015

Hardware Round-Robin Scheduler for Single-ISA Asymmetric Multi-core

Nikola Markovic; Daniel Nemirovsky; Veljko Milutinovic; Osman S. Unsal; Mateo Valero; Adrián Cristal

As thread level parallelism in applications has continued to expand, so has relevant research on heterogeneous CMPs. Nowadays multi-threaded workloads running on CMPs are the common case, but as the quantity of these workloads increases and as heterogeneous CMPs become more diverse, thread scheduling within an operating system will become ever more critical to maintaining efficient performance and system utilization. As a consequence, the operating system will require increasingly larger amounts of CPU time to schedule these threads effectively. Instead of perpetuating the trend of performing complex thread scheduling in software, we propose a simple yet effective mechanism that can easily be implemented in hardware and that outperforms both the typical Linux OS scheduler and the Fairness scheduler. Our approach fairly redistributes running hardware threads across available cores within an OS scheduling quantum. It achieves average speedups of 37.7 percent and 16.5 percent compared to the Linux OS scheduler and state-of-the-art Fairness scheduling, respectively, when running multi-threaded application workloads.
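The abstract does not give the hardware design, so the following is only a software sketch of the general idea of round-robin redistribution: at every interval, each thread moves to the next core, so that over time every thread spends an equal share on the fast core. The core names, thread names, and the `rotate_assignments` helper are assumptions made for illustration.

```python
from collections import deque

def rotate_assignments(threads, cores, intervals):
    """Yield, for each interval, a dict mapping core name -> thread name."""
    ring = deque(threads)
    for _ in range(intervals):
        # Pair the cores, in order, with the threads currently at the front of the ring.
        yield dict(zip(cores, ring))
        # Shift every thread one slot so it runs on the next core in the next interval.
        ring.rotate(1)

# Hypothetical asymmetric machine: one big core and three little cores.
cores = ["big0", "little0", "little1", "little2"]
threads = ["t0", "t1", "t2", "t3"]
schedule = list(rotate_assignments(threads, cores, intervals=4))
big_core_history = [assignment["big0"] for assignment in schedule]
```

After as many intervals as there are threads, every thread has occupied the big core exactly once, which is the fairness property the sketch is meant to show.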

Modeling, Analysis, and Simulation on Computer and Telecommunication Systems | 2017

iQ: An Efficient and Flexible Queue-Based Simulation Framework

Damian Roca; Daniel Nemirovsky; Marc Casas; Miquel Moreto; Mateo Valero; Mario Nemirovsky

Conventional system simulators are readily used by computer architects to design and evaluate their processor designs. These simulators provide reasonable levels of accuracy and execution detail but suffer from long simulation latencies and increased implementation complexity. In this work we propose iQ, a queue-based modeling technique that targets design space exploration and optimization studies at the core component level. iQ emulates processor elements by abstracting the implementation details into modular components composed of queue structures, delay parameters, probabilistically driven message generation, and event control. Its easy reconfigurability makes iQ a highly flexible and powerful processor simulator. We have used iQ to build an Ivy Bridge and a Core 2 Duo processor model and have validated them against real hardware running SPEC CPU2006 Int, achieving average error rates of 9.55% and 8.93%.
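To make the ingredients named in the abstract concrete (a queue structure, a delay parameter, and probabilistic message generation), here is a minimal sketch of a single queue-based component. It is not iQ's implementation or API; the `QueueStage` class, the `simulate` function, and all parameter names are hypothetical.

```python
import random
from collections import deque

class QueueStage:
    """A processor element modeled as a bounded FIFO with a fixed service delay."""

    def __init__(self, capacity, delay):
        self.capacity = capacity
        self.delay = delay
        self.fifo = deque()  # stores the cycle at which each message becomes ready

    def push(self, now):
        if len(self.fifo) >= self.capacity:
            return False  # the stage is full, so the producer must stall
        self.fifo.append(now + self.delay)
        return True

    def pop_ready(self, now):
        # Retire at most one message per cycle, and only once its delay has elapsed.
        if self.fifo and self.fifo[0] <= now:
            self.fifo.popleft()
            return True
        return False

def simulate(cycles, issue_prob, capacity, delay, seed=0):
    """Return (issued, completed, stalled) message counts after `cycles` cycles."""
    rng = random.Random(seed)
    stage = QueueStage(capacity, delay)
    issued = completed = stalled = 0
    for now in range(cycles):
        if stage.pop_ready(now):
            completed += 1
        # A new message is generated this cycle with probability issue_prob.
        if rng.random() < issue_prob:
            if stage.push(now):
                issued += 1
            else:
                stalled += 1
    return issued, completed, stalled

counts = simulate(cycles=1000, issue_prob=0.5, capacity=4, delay=3)
```

Changing `capacity`, `delay`, or `issue_prob` reconfigures the component without touching its logic, which is the kind of flexibility a queue-based abstraction is meant to offer.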

Symposium on Computer Architecture and High Performance Computing | 2017

A Machine Learning Approach for Performance Prediction and Scheduling on Heterogeneous CPUs

Daniel Nemirovsky; Tugberk Arkose; Nikola Marković; Mario Nemirovsky; Osman S. Unsal; Adrian Cristal

As heterogeneous systems become more ubiquitous, computer architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. Accurately estimating the performance of applications on different heterogeneous resources can provide a significant advantage to heterogeneous schedulers seeking to improve system performance. Recent advances in machine learning techniques including artificial neural network models have led to the development of powerful and practical prediction models for a variety of fields. As of yet, however, no significant leaps have been taken towards employing machine learning for heterogeneous scheduling in order to maximize system throughput. In this paper we propose a unique throughput maximizing heterogeneous CPU scheduling model that uses machine learning to predict the performance of multiple threads on diverse system resources at the scheduling quantum granularity. We demonstrate how lightweight artificial neural networks (ANNs) can provide highly accurate performance predictions for a diverse set of applications, thereby helping to improve heterogeneous scheduling efficiency. We show that online training is capable of increasing prediction accuracy but deepening the complexity of the ANNs can result in diminishing returns. Notably, our approach yields 25% to 31% throughput improvements over conventional heterogeneous schedulers for CPU and memory intensive applications.
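As a rough illustration of how per-thread performance predictions could drive a throughput-maximizing mapping, the sketch below picks the thread-to-core assignment with the highest total predicted IPC. It assumes the predictions are already available (in the paper they would come from the ANNs; here they are hard-coded, made-up numbers), assumes one thread per core, and uses brute-force search, which is only practical for a handful of cores. The `best_mapping` helper is hypothetical and is not the authors' scheduler.

```python
from itertools import permutations

def best_mapping(predicted_ipc):
    """predicted_ipc[t][c] is the predicted IPC of thread t on core c.

    Returns (mapping, total), where mapping[t] is the core chosen for thread t.
    """
    n = len(predicted_ipc)

    def total_ipc(perm):
        return sum(predicted_ipc[t][perm[t]] for t in range(n))

    # Try every one-to-one assignment and keep the one with the highest predicted throughput.
    best = max(permutations(range(n)), key=total_ipc)
    return list(best), total_ipc(best)

# Made-up predictions: core 0 is a big core, core 1 is a little core.
predicted_ipc = [
    [2.0, 1.0],  # thread 0 benefits strongly from the big core
    [1.2, 1.1],  # thread 1 is nearly insensitive to core type
]
mapping, total = best_mapping(predicted_ipc)
```

In this example the thread that gains most from the big core receives it, since the other thread loses very little by running on the little core.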

IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

A Deep Learning Mapper (DLM) for Scheduling on Heterogeneous Systems

Daniel Nemirovsky; Tugberk Arkose; Nikola Markovic; Mario Nemirovsky; Osman S. Unsal; Adrián Cristal; Mateo Valero

As heterogeneous systems become more ubiquitous, computer architects will need to develop new CPU scheduling approaches capable of exploiting the diversity of computational resources. Advances in deep learning have unlocked an exceptional opportunity of using these techniques for estimating system performance. However, as of yet no significant leaps have been taken in applying deep learning for scheduling on heterogeneous systems.

Symposium on Computer Architecture and High Performance Computing | 2015

Performance and Energy Efficient Hardware-Based Scheduler for Symmetric/Asymmetric CMPs

Nikola Marković; Daniel Nemirovsky; Osman S. Unsal; Mateo Valero; Adrian Cristal

As thread level parallelism in applications has continued to expand, so has research in chip multi-core processors. As more and more applications become multi-threaded, we expect to find a growing number of threads executing on a machine. Consequently, the operating system will require increasingly larger amounts of CPU time to schedule these threads efficiently. Instead of perpetuating the trend of performing more complex thread scheduling in the operating system, we propose a hardware implementation of the Thread Lock Section-aware Scheduling (TLSS) mechanism. This lightweight mechanism helps to identify multi-threaded application bottlenecks such as thread synchronization sections and complements the Fairness-aware Scheduler method. It is, to our knowledge, the first hardware-based lock section-aware scheduling that is energy attentive and can be applied to both asymmetric and symmetric CMPs. It achieves an average performance gain of 10.9 percent (geometric mean) compared to the state-of-the-art Linux OS Scheduler when applied on the Symmetrical Chip Multi-Processor (SCMP). At the same time, it is 81 percent more EDP (energy-delay product) efficient when applied on an Asymmetrical Chip Multi-Processor (ACMP) and compared to the Linux OS Scheduler on an SCMP, where ACMP and SCMP occupy roughly the same chip area.
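For readers unfamiliar with the metric, the energy-delay product is the energy consumed multiplied by the execution time, so lower is better. The sketch below shows the calculation and one common way of comparing two configurations. The helper names and the measurements are made up for illustration, and the abstract does not state which exact comparison formula underlies its 81 percent figure.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy * delay; a lower value means a more efficient run."""
    return energy_joules * delay_seconds

def edp_ratio(baseline, candidate):
    """Baseline EDP divided by candidate EDP; values above 1 favor the candidate."""
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)

# Made-up (energy in joules, runtime in seconds) pairs, for illustration only.
scmp_run = (10.0, 2.0)
acmp_run = (8.0, 1.5)
ratio = edp_ratio(scmp_run, acmp_run)
```

Because EDP multiplies the two quantities, a configuration can win on EDP by saving energy, by finishing sooner, or by doing both.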

IEEE Micro | 2015

Reimagining Heterogeneous Computing: A Functional Instruction-Set Architecture Computing Model

Daniel Nemirovsky; Nikola Markovic; Osman S. Unsal; Mateo Valero; Adrián Cristal

The relentless push in technology scaling driven by Moore's law has witnessed fantastic gains in the quantities of transistors available on chips. Computer architects have exploited the extra transistors by incorporating several computing cores within a single processor. Heterogeneous processing in particular has become a useful technique for dealing with ever-present power and memory restrictions. Yet, the scope and diversity of current heterogeneous designs remain bounded by the level of functional abstraction specified by conventional instruction-set architectures (ISAs). In this article, the authors demonstrate how the functional abstraction level determines the capability and variety of a processor's functional units and accelerators, thereby restricting its degree of heterogeneity. Combining current heterogeneous techniques with software abstraction concepts, the authors propose a new functional ISA (F-ISA), which raises the functional abstraction level of machine instructions. Using this model to complement existing architectures makes available a wider scope and diversity of functional units and accelerators in order to exploit the ever-increasing transistor densities. Greater heterogeneity can offer advances in terms of object data mapping and execution, resulting in potentially substantial latency, memory footprint, and power/performance gains.

IEEE Micro | 2015

Kernel-to-User-Mode Transition-Aware Hardware Scheduling

Nikola Markovic; Daniel Nemirovsky; Osman S. Unsal; Mateo Valero; Adrián Cristal

As thread-level parallelism in applications has continued to expand, so has research in chip multicore processors. More and more applications are becoming multithreaded, which should lead to a growing number of threads executing on a machine. Consequently, the operating system will require increasingly larger amounts of CPU time to schedule these threads efficiently. Instead of perpetuating the trend of performing more-complex thread scheduling in the OS, the authors propose a scheduling mechanism that can be efficiently implemented in hardware as well. Their approach of identifying multithreaded application bottlenecks such as thread synchronization sections complements the fairness-aware scheduler method. It achieves average speedups of 11.1 and 30 percent (geometric mean) compared to the state-of-the-art Fairness-Aware Scheduler and Linux OS scheduler, respectively, while being 8 percent slower compared to the state-of-the-art bottleneck identification techniques.

IEEE Micro | 2016