Amit Golander | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Amit Golander is active.

Explore More

Publication

Featured researches published by Amit Golander.

IEEE Computer Architecture Letters | 2008

DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals

Amit Golander; Shlomo Weiss; Ronny Ronen

DMR (dual modular redundancy) was suggested for increasing reliability. Classical DMR consists of pairs of cores that check each other and are pre-connected during manufacturing by dedicated links. In this paper we introduce the dynamic dual modular redundancy (DDMR) architecture. DDMR supports run-time scheduling of redundant threads, which has significant benefits relative to static binding. To allow dynamic pairing, DDMR replaces the special links with a novel ring architecture. DDMR uses short instruction sequences for validation, smaller than the processor reorder buffer. Such short sequences reduce latencies in parallel programs and save resources needed to buffer uncommitted data. DDMR scales with the number of cores and may be used in large multicore architectures.

high performance embedded architectures and compilers | 2009

Reexecution and Selective Reuse in Checkpoint Processors

Amit Golander; Shlomo Weiss

Resource-efficient checkpoint processors have been shown to recover to an earlier safe state very fast. Yet in order to complete the misprediction recovery they also need to reexecute the code segment between the recovered checkpoint and the mispredicted instruction. This paper evaluates two novel reuse methods which accelerate reexecution paths by reusing the results of instructions and the outcome of branches obtained during the first run. The paper also evaluates, in the context of checkpoint processors, two other reuse methods targeting trivial and repetitive arithmetic operations. A reuse approach combining all four methods requires an area of 0.87[mm2], consumes 51.6[mW], and improves the energy-delay product by 4.8% and 11.85% for the integer and floating point benchmarks respectively.

ACM Transactions on Architecture and Code Optimization | 2008

Hiding the misprediction penalty of a resource-efficient high-performance processor

Amit Golander; Shlomo Weiss

Misprediction is a major obstacle for increasing speculative out-of-order processors performance. Performance degradation depends on both the number of misprediction events and the recovery time associated with each one of them. In recent years a few checkpoint based microarchitectures have been proposed. In comparison with ROB-based processors, checkpoint processors are scalable and highly resource efficient. Unfortunately, in these proposals the misprediction recovery time is proportional to the instruction queue size.n In this paper we analyze methods to reduce the misprediction recovery time. We propose a new register file management scheme and techniques to selectively flush the instruction queue and the load store queue, and to isolate deeply pipelined execution units. The result is a novel checkpoint processor with Constant misprediction RollBack time (CRB). We further present a streamlined, cost-efficient solution, which saves complexity at the price of slightly lower performance.

ACM Transactions on Architecture and Code Optimization | 2009

Checkpoint allocation and release

Amit Golander; Shlomo Weiss

Out-of-order speculative processors need a bookkeeping method to recover from incorrect speculation. In recent years, several microarchitectures that employ checkpoints have been proposed, either extending the reorder buffer or entirely replacing it. This work presents an in-dept-study of checkpointing in checkpoint-based microarchitectures, from the desired content of a checkpoint, via implementation trade-offs, and to checkpoint allocation and release policies. A major contribution of the article is a novel adaptive checkpoint allocation policy that outperforms known policies. The adaptive policy controls checkpoint allocation according to dynamic events, such as second-level cache misses and rollback history. It achieves 6.8% and 2.2% speedup for the integer and floating point benchmarks, respectively, and does not require a branch confidence estimator. The results show that the proposed adaptive policy achieves most of the potential of an oracle policy whose performance improvement is 9.8% and 3.9% for the integer and floating point benchmarks, respectively. We exploit known techniques for saving leakage power by adapting and applying them to checkpoint-based microarchitectures. The proposed applications combine to reduce the leakage power of the register file to about one half of its original value.

ieee international conference on science of electrical engineering | 2016

Storage becomes first class memory

Netanel Katzburg; Amit Golander; Shlomo Weiss

Computer architectures have always addressed memory and storage differently. The memory subsystem is an integral part of any processor design, while storage was placed on the I/O subsystem and accessed via several software layers. Emerging storage systems however, are challenging this fundamental and decades-old assumption. First class memory is an entity that supports all the operations generally available to main memory. This article describes how storage is becoming a first class memory. We explore the benefits of novel hardware and software technologies, demonstrating a speedup of ×280 at the storage layer - over modern Flash and file systems, which translated to a speedup of ×3.8 at the application layer, when measured SQL transactions on the PostgreSQL database. We then show that traditional data access tradeoffs become irrelevant and, as a result, application programming is significantly simplified.

acm international conference on systems and storage | 2017

Persistent memory over fabric (PMoF)

Amit Golander; Sagi Manole; Yigal Korman

Persistent Memory (PM) is an emerging family of technologies that are: persistent; byte addressable; and respond in near-memory speeds. PM devices, also referred to as non-volatile DIMMs or NVDIMMs, connect to the low-latency CPU memory interconnect. PM-based solutions can achieve local persistency within a micro second, which is two orders-of-magnitude faster compared to modern Flash solutions [1]. PM is the first storage media that is faster than high-speed networks and faster than operating system thread scheduling time. Thus, current PM-based solutions are local. They do not comply with common enterprise practices, which require that data remains available even in the face of a given amount of failures, such as an entire node crash. This work focuses on replicating PM resident data sets between nodes, which is vital in order for PM to become mainstream. We leverage RDMA-supporting network gear and the first application-agnostic PM-based file system that supports mirroring (Plexistor M1FS 3.0). The server on the left-hand side of Figure 1 is the application server running the benchmarks and the server on the right-hand side runs the PM-over-Fabric (PMoF) which owns the secondary copy of the data alongside the file system meta data required in order to mount the file system after a failure occurs. The experimental setup uses commodity off-the-shelf hardware and 100GbE (RoCE) network. We first explore a synthetic benchmark (FIO) and then a TPC-C like benchmark (DBT-2) on top of a Postgres database. In both cases, we use work sets that fit in the PM tier, because we do not want tiering to mask the performance implications of mirroring. Figure 2a shows the overall latency (as seen by the application) as a function of different stress levels. Three different access sizes were measured, as well as single and multithread flavors. Small accesses, including local persistency and asynchronous mirroring to the second node, were measured to complete within 1--2 micro seconds under typical storage consumption. At very high loads hardware resources become congested and latency soars. Figure 2b reveals results for similar benchmarks, with one important difference - write requests are synchronous (i.e. fopen with OSYNC=1). These semantics mean that the file system has to also guarantee that the data written and the meta data describing it have reached the PMoF node prior to acknowledging the write system call. Synchronous mirroring is nearly 2.5us slower than asynchronous mirroring for typical loads, which is mostly due to the round trip delay. These results are an order of magnitude faster than modern block-based replication solutions. Databases may support replication at the database layer, as an alternative to maintaining data redundancy at the storage layer. Each approach has its advantages, but the rule of thumb for PostgreSQL, as presented in PGCon IL 2017, anticipates 50% lower transactions per second for having a secondary copy. Figures 3a and 3b reveal the negligible performance overhead that PMoF based mirroring has on real life applications. Compared to Postgres on a single node deployment, transaction rate and response time were measured to be only 2.0 to 2.2% lower.

acm international conference on systems and storage | 2018

Accelerating Unmodified Databases using Persistent Memory and Flash Storage Tiers

Amit Golander; Netanel Katzburg; Omer Zilberberg

Recent breakthroughs in Storage Class Memory (SCM) technologies have driven Persistent Memory (PM) devices to become commodity off-the-shelf components in 2018. PM devices are byte addressable, plug into the memory interconnect, and run at near memory speeds, densities and price points. PM availability is led by Fast PM, comprised from backed-DRAM devices such as NVDIMM-N, and will follow soon with Slow PM, comprised of new SCM materials, such as Intel 3D XPoint NVDIMM. Fast and Slow PM devices vary in speeds, densities and cost, but both are orders of magnitude faster than Flash devices and an order of magnitude more expensive per GB. A PM-based file system was shown to accelerate unmodified transactional databases [2, 1], when the entire dataset was placed on NVDIMM-N cards. Most databases however are large and cannot fit entirely into the limited capacity provided by PM devices and even if they could - the high price per GB would prevent wide adoption. This work explores accelerating unmodifed databases using software that supports both NVDIMM-N and Flash devices and can transparently tier data between them. Ideally, this would provide the performance benefits of PM, while maintaining the cost structure of Flash solutions. We run a transactional workload (DBT-2) on an unmodified Postgresql [3] database, and compare the default block-based file system running on Flash NVMe to a file system, which is the first to support auto-tiering between byte-addressable NVDIMM devices and block-addressable Flash. The rest of the server and the operating system version are identical for both configurations (refer to Table 1). M1FS auto-tiering between PM pages and Flash blocks was implemented using the following architecture: • Each 4KB of data can reside on a PM page, a Flash block or both at the same time. • Data is speculatively copied to a Flash block ahead of needing to reuse the PM page, in order to hide the slower Flash access time. • Unless data is modified, an existing Flash copy is maintained in order to reduce Flash wearout. • PM pages are maintained in many queues in order to reduce the probability of lock contention when many cores are used concurrently • Page allocations are preferably done from PM attached to the CPU socket (NUMA-aware FS)

IEEE Transactions on Computers | 2014

Protein Sequence Pattern Matching: Leveraging Application Specific Hardware Accelerators

Sagi Manole; Amit Golander; Shlomo Weiss

Digitalization has brought a tremendous momentum to health care research. Recognition of patterns in proteins is crucial for identifying possible functions of newly discovered proteins, as well as analysis of known proteins for previously undetermined activity. In this paper, the workload consists of locating patterns from the PROSITE database in protein sequences. We optimize the pattern search task by using a new breed of processors that merge network and server attributes. We leverage massive multithreading and regular-expression (RegX) hardware accelerators; the latter were designed and built for an entirely different application - high-bandwidth deep-packet inspection. Our multithreading optimization achieves 18x improvement, but by harnessing a RegX accelerator we were able to further demonstrate a significant 392x improvement relative to software pattern matching. Moreover, performance per area and power consumption are improved by multiple orders of magnitude as well.

IEEE Transactions on Circuits and Systems Ii-express Briefs | 2009