Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Steven D. Feldman is active.

Publication


Featured research published by Steven D. Feldman.


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2013

Concurrent multi-level arrays: Wait-free extensible hash maps

Steven D. Feldman; Pierre LaBorde; Damian Dechev

In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently put, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time - an attribute that can be critical in real-time environments. This is opposed to the traditional blocking implementations of shared data structures which suffer from the negative impact of deadlock and related correctness and performance issues. Our design is portable because we only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 5 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 8 or higher.
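The put/get/remove interface and the reliance on hardware atomic operations can be illustrated with a small sketch. The toy map below is not the authors' design: it is a bounded, insert/update-only table that packs key and value into one 64-bit word so a single compare-and-swap publishes both, and it is only lock-free. The multi-level arrays, helping scheme, removal support, and unbounded growth that make the published hash map wait-free are all omitted.

```cpp
// Illustrative only -- NOT the published wait-free multi-level hash map.
// A bounded open-addressing table where key and value are packed into one
// 64-bit word so that a single hardware CAS publishes both. Lock-free, no
// removal, no resizing.
#include <atomic>
#include <cstdint>
#include <optional>

class ToyCasMap {
  static constexpr std::size_t kSlots = 1 << 12;   // power of two for cheap masking
  static constexpr std::uint64_t kEmpty = ~0ULL;   // sentinel: slot unused

  static std::uint64_t pack(std::uint32_t k, std::uint32_t v) {
    return (static_cast<std::uint64_t>(k) << 32) | v;
  }
  static std::size_t probe_start(std::uint32_t k) {
    return (k * 0x9E3779B9u) & (kSlots - 1);       // Fibonacci-style hash
  }

  std::atomic<std::uint64_t> slots_[kSlots];

 public:
  ToyCasMap() { for (auto& s : slots_) s.store(kEmpty, std::memory_order_relaxed); }

  // Insert or overwrite. Keys equal to 0xFFFFFFFF are reserved for the sentinel.
  // Returns false only if every slot is taken by other keys.
  bool put(std::uint32_t key, std::uint32_t value) {
    std::size_t idx = probe_start(key);
    for (std::size_t i = 0; i < kSlots; ++i, idx = (idx + 1) & (kSlots - 1)) {
      std::uint64_t seen = slots_[idx].load(std::memory_order_acquire);
      for (;;) {
        bool mine = (seen != kEmpty) && (static_cast<std::uint32_t>(seen >> 32) == key);
        if (seen != kEmpty && !mine) break;          // occupied by another key: probe on
        // One hardware CAS either claims the empty slot or replaces this key's value.
        if (slots_[idx].compare_exchange_weak(seen, pack(key, value),
                                              std::memory_order_acq_rel))
          return true;
        // CAS failed: 'seen' now holds the current word; re-examine it.
      }
    }
    return false;
  }

  std::optional<std::uint32_t> get(std::uint32_t key) const {
    std::size_t idx = probe_start(key);
    for (std::size_t i = 0; i < kSlots; ++i, idx = (idx + 1) & (kSlots - 1)) {
      std::uint64_t seen = slots_[idx].load(std::memory_order_acquire);
      if (seen == kEmpty) return std::nullopt;       // hit an empty slot: key absent
      if (static_cast<std::uint32_t>(seen >> 32) == key)
        return static_cast<std::uint32_t>(seen & 0xFFFFFFFFu);
    }
    return std::nullopt;
  }
};
```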


IEEE Transactions on Parallel and Distributed Systems | 2016

An Efficient Wait-Free Vector

Steven D. Feldman; Carlos Valera-Leon; Damian Dechev

The vector is a fundamental data structure, which provides constant-time access to a dynamically-resizable range of elements. Currently, there exist no wait-free vectors. The only non-blocking version supports only a subset of the sequential vector API and exhibits significant synchronization overhead caused by supporting opposing operations. Since many applications operate in phases of execution, wherein each phase uses only a subset of operations, this overhead is unnecessary for the majority of the application. To address the limitations of the non-blocking version, we present a new design that is wait-free, supports more of the operations provided by the sequential vector, and provides alternative implementations of key operations. These alternatives allow the developer to balance the performance and functionality of the vector as requirements change throughout execution. Compared to the known non-blocking version and the concurrent vector found in Intel's TBB library, our design outperforms or provides comparable performance in the majority of tested scenarios. Over all tested scenarios, the presented design performs an average of 4.97 times more operations per second than the non-blocking vector and 1.54 times more than the TBB vector. In a scenario designed to simulate the filling of a vector, the performance improvement increases to 13.38 and 1.16 times, respectively. This work presents the first ABA-free non-blocking vector. Unlike the other non-blocking approach, all operations are wait-free and bounds-checked, and elements are stored contiguously in memory.
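The paper's wait-free vector is not reproduced here. As a rough illustration of the phased access pattern the abstract describes, the sketch below uses Intel TBB's tbb::concurrent_vector (one of the baselines in the evaluation): a write-only fill phase followed by a read-only phase, each touching only a subset of the container's operations.

```cpp
// Phased use of a concurrent vector, sketched with Intel TBB's concurrent_vector
// (a baseline from the paper's evaluation), not with the wait-free vector itself.
#include <tbb/concurrent_vector.h>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  tbb::concurrent_vector<std::uint64_t> vec;
  constexpr int kThreads = 4;
  constexpr std::uint64_t kPerThread = 100000;

  // Phase 1: concurrent fill -- only push_back is used.
  std::vector<std::thread> writers;
  for (int t = 0; t < kThreads; ++t)
    writers.emplace_back([&vec, t] {
      for (std::uint64_t i = 0; i < kPerThread; ++i)
        vec.push_back(static_cast<std::uint64_t>(t) * kPerThread + i);
    });
  for (auto& w : writers) w.join();

  // Phase 2: read-only traversal -- only size() and operator[] are used.
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < vec.size(); ++i) sum += vec[i];
  std::printf("%zu elements, checksum %llu\n", vec.size(),
              static_cast<unsigned long long>(sum));
  return 0;
}
```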


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2015

Tervel: A unification of descriptor-based techniques for non-blocking programming

Steven D. Feldman; Pierre LaBorde; Damian Dechev

The development of non-blocking code is difficult; developers must ensure the progress of an operation on shared memory despite conflicting operations. Managing this shared memory in a non-blocking fashion is even more problematic. The non-blocking property guarantees that progress is made toward the desired operation in a finite amount of time. We present a framework that implements memory reclamation and progress assurance for code that follows the semantics of our framework. This reduces the effort required to implement non-blocking, and more specifically wait-free, algorithms. We also present a library that demonstrates the ease with which wait-free algorithms can be implemented using our framework.
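The following sketch illustrates the general descriptor-based idea the title refers to; it is not Tervel's API or algorithm, and all names are illustrative. An operation publishes a descriptor announcing its intent, and any thread that encounters the descriptor can help complete the pending steps before proceeding. Reclaiming retired descriptors, one of the services Tervel actually provides, is deliberately left out.

```cpp
// Illustrative descriptor-based helping -- NOT Tervel's API or algorithm.
// A shared slot holds either a plain value (low bit 0) or a tagged pointer to a
// pending operation's descriptor (low bit 1). Whoever encounters the descriptor,
// including the thread that installed it, finishes the operation's remaining step
// and swings the slot back to a plain value. Reclaiming retired descriptors is
// one of the services a framework like Tervel provides and is omitted here.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Descriptor {
  std::uintptr_t new_value;              // value the pending operation wants to install
  std::atomic<std::uintptr_t>* log;      // second location the operation must also update
};

std::atomic<std::uintptr_t> slot{0};
std::atomic<std::uintptr_t> op_log{0};

static bool is_descriptor(std::uintptr_t w) { return (w & 1u) != 0; }
static Descriptor* as_descriptor(std::uintptr_t w) {
  return reinterpret_cast<Descriptor*>(w & ~std::uintptr_t(1));
}

// Finish a pending operation on behalf of whichever thread announced it.
static void help(std::uintptr_t seen) {
  Descriptor* d = as_descriptor(seen);
  d->log->store(d->new_value, std::memory_order_release);   // step 1 of the operation
  std::uintptr_t expected = seen;
  slot.compare_exchange_strong(expected, d->new_value,      // step 2: retire the descriptor
                               std::memory_order_acq_rel);
}

// Perform the two-location update by announcing it first, so other threads can help.
// In this toy, plain values must be even so they are never mistaken for descriptors.
void update(Descriptor* d) {
  std::uintptr_t tagged = reinterpret_cast<std::uintptr_t>(d) | 1u;
  for (;;) {
    std::uintptr_t seen = slot.load(std::memory_order_acquire);
    if (is_descriptor(seen)) { help(seen); continue; }       // someone else is mid-operation
    if (slot.compare_exchange_weak(seen, tagged, std::memory_order_acq_rel)) {
      help(tagged);                                          // complete our own operation
      return;
    }
  }
}

int main() {
  Descriptor d{42, &op_log};
  update(&d);
  std::printf("slot=%lu log=%lu\n",
              static_cast<unsigned long>(slot.load()),
              static_cast<unsigned long>(op_log.load()));
  return 0;
}
```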


ACM Conference on Systems, Programming, Languages, and Applications: Software for Humanity | 2013

Effective use of non-blocking data structures in a deduplication application

Steven D. Feldman; Akshatha Bhat; Pierre LaBorde; Qing Yi; Damian Dechev

Efficient multicore programming demands fundamental data structures that support a high degree of concurrency. Existing research on non-blocking data structures promises to satisfy such demands by providing progress guarantees that allow a significant increase in parallelism while avoiding the safety hazards of lock-based synchronization. It is well acknowledged that the use of non-blocking containers can bring significant performance benefits to applications where the shared data experiences heavy contention. However, the practical implications of integrating these data structures in real-world applications are not well understood. In this paper, we study the effective use of non-blocking data structures in a data deduplication application which performs a large number of concurrent compression operations on a data stream using the pipeline parallel processing model. We present our experience of manually refactoring the application from using conventional lock-based synchronization mechanisms to using a wait-free hash map and a set of lock-free queues to boost the degree of concurrency of the application. Our experimental study explores the performance trade-offs of parallelization mechanisms that rely on a) traditional blocking techniques, b) fine-grained mutual exclusion, and c) lock-free and wait-free synchronization.
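A rough sketch of the pipeline structure described above is shown below; it is not the authors' refactored application. TBB's concurrent queue and concurrent map stand in for the paper's lock-free queues and wait-free hash map: stage 1 chunks the input, stage 2 deduplicates chunks through a shared concurrent map, and stage 3 processes only the unique chunks.

```cpp
// A sketch of the pipeline shape only -- not the authors' refactored application.
// TBB's concurrent queue and concurrent map stand in for the paper's lock-free
// queues and wait-free hash map.
#include <tbb/concurrent_queue.h>
#include <tbb/concurrent_unordered_map.h>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

int main() {
  tbb::concurrent_queue<std::string> chunks;               // stage 1 -> stage 2
  tbb::concurrent_queue<std::string> unique_chunks;        // stage 2 -> stage 3
  tbb::concurrent_unordered_map<std::size_t, bool> seen;   // fingerprint -> already emitted
  std::atomic<bool> chunking_done{false}, dedup_done{false};
  std::atomic<int> compressed{0};

  // Pop items until the upstream stage signals completion and the queue is drained.
  auto drain = [](tbb::concurrent_queue<std::string>& q, std::atomic<bool>& upstream_done,
                  const std::function<void(const std::string&)>& handle) {
    std::string item;
    for (;;) {
      if (q.try_pop(item)) { handle(item); continue; }
      if (upstream_done.load()) { while (q.try_pop(item)) handle(item); return; }
      std::this_thread::yield();
    }
  };

  std::thread chunker([&] {     // stage 1: split the input stream into chunks
    for (const char* c : {"aa", "bb", "aa", "cc", "bb", "aa"}) chunks.push(c);
    chunking_done = true;
  });

  std::thread dedup([&] {       // stage 2: forward only the first occurrence of each chunk
    drain(chunks, chunking_done, [&](const std::string& chunk) {
      auto fp = std::hash<std::string>{}(chunk);
      if (seen.insert({fp, true}).second) unique_chunks.push(chunk);
    });
    dedup_done = true;
  });

  std::thread compressor([&] {  // stage 3: "compress" (here, just count) unique chunks
    drain(unique_chunks, dedup_done, [&](const std::string&) { ++compressed; });
  });

  chunker.join(); dedup.join(); compressor.join();
  std::printf("unique chunks compressed: %d\n", compressed.load());  // expect 3
  return 0;
}
```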


International Conference on Supercomputing | 2011

SRC: facilitating efficient parallelization of information storage and retrieval on large data sets

Steven D. Feldman

The purpose of this work is to develop a lock-free hash table that allows a large number of threads to concurrently insert, modify, or retrieve information. Lock-free or non-blocking designs alleviate the problems traditionally associated with lock-based designs, such as bottlenecks and thread-safety hazards. Using standard atomic operations provided by the hardware, the design is portable and therefore applicable to embedded systems and supercomputers such as the Cray XMT. Real-world applications range from search indexing to computer vision. Having written and tested the core functionality of the hash table, we plan to perform a formal validation using model checkers.
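The progress guarantees named throughout these abstracts can be made concrete on a trivial counter (this is illustrative only and unrelated to the hash table itself): a lock-free operation may retry a failed compare-and-swap indefinitely while still guaranteeing that some thread advances, whereas a wait-free operation completes in a bounded number of steps for every thread.

```cpp
// Illustrative only: the two progress guarantees, shown on a trivial counter.
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

// Lock-free: some thread always makes progress, but any single thread can lose
// the CAS race arbitrarily many times and must retry.
void increment_lock_free() {
  std::uint64_t seen = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(seen, seen + 1, std::memory_order_acq_rel)) {
    // 'seen' was refreshed by the failed CAS; simply retry.
  }
}

// Wait-free: the hardware read-modify-write completes in a bounded number of
// steps for every thread, regardless of contention.
void increment_wait_free() {
  counter.fetch_add(1, std::memory_order_acq_rel);
}
```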


International Journal of Parallel Programming | 2017

A Wait-Free Hash Map

Pierre LaBorde; Steven D. Feldman; Damian Dechev

In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently insert, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time—an attribute that can be critical in real-time environments. This is opposed to the traditional blocking implementations of shared data structures which suffer from the negative impact of deadlock and related correctness and performance issues. We only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 7 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 15 or higher.


International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing | 2016

A Methodology for Performance Analysis of Non-blocking Algorithms Using Hardware and Software Metrics

Ramin Izadpanah; Steven D. Feldman; Damian Dechev

Non-blocking algorithms are a class of algorithms that provide guarantees of progress within a system. These progress guarantees come from the fine-grained synchronization techniques incorporated into their design. There are a number of non-blocking designs and implementations of concurrent algorithms. However, trade-offs between performance and non-blocking algorithm design decisions are poorly understood. The most common method to measure the performance of non-blocking algorithms is to analyze the number of operations completed over a period of time. Unfortunately, this coarse-grained approach to performance analysis is unable to capture and explain many of the nuances of the behavior of non-blocking algorithms. This can result in a flawed perception of such algorithms, which may lead to their misguided use. This work provides a fine-grained approach for the analysis of the design and use of non-blocking algorithms. To support this analysis, we introduce a methodology that enables a user to simulate an application's use of an arbitrary non-blocking algorithm and extract insightful information from the performance results. To better understand the behavior of non-blocking algorithms, we present metrics to measure the effectiveness of the key synchronization and memory management techniques used in non-blocking algorithms. Our analysis combines these new metrics with several well-known hardware metrics to explain key behaviors and develop new insights. To demonstrate the effectiveness of the proposed methodology, we integrate it within the Tervel framework and analyze Tervel's vector in various use cases. Our experiments show that helping mechanisms negatively impact throughput by increasing misaligned data cache accesses. Furthermore, by studying the correlations between different metrics, we are able to observe the effect of thread interference on CPU instructions and instruction cache invalidation, and then link the decrease in work completed to this effect. In addition to the provided information, these metrics revealed implementation errors that did not affect correctness but caused increased thread congestion.
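For contrast, the sketch below shows the coarse-grained measurement the authors argue is insufficient on its own: count how many operations a set of threads completes in a fixed window and report operations per second. It is not the paper's methodology, which pairs such throughput numbers with hardware counters and metrics specific to non-blocking designs (for example, how often helping mechanisms fire).

```cpp
// The coarse-grained baseline measurement only -- not the paper's methodology.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  constexpr int kThreads = 8;
  constexpr auto kWindow = std::chrono::seconds(1);
  std::atomic<bool> stop{false};
  std::vector<std::uint64_t> completed(kThreads, 0);

  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; ++t)
    workers.emplace_back([&, t] {
      std::uint64_t ops = 0;
      while (!stop.load(std::memory_order_relaxed)) {
        // ... perform one operation on the data structure under test ...
        ++ops;
      }
      completed[t] = ops;       // each thread writes only its own slot
    });

  std::this_thread::sleep_for(kWindow);
  stop = true;
  for (auto& w : workers) w.join();

  std::uint64_t total = 0;
  for (std::uint64_t c : completed) total += c;
  std::printf("throughput: %llu ops/sec across %d threads\n",
              static_cast<unsigned long long>(total), kThreads);
  return 0;
}
```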


International Conference on Cluster Computing | 2015

Extending LDMS to Enable Performance Monitoring in Multi-core Applications

Steven D. Feldman; Deli Zhang; Damian Dechev; James M. Brandt

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions, and each method may exhibit different behavior in different use cases. These use cases may vary in regards to the workload being executed, the number of parallel tasks, dependencies between these tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge of how these algorithms utilize the hardware. In our experience, we have found a lack of adequate tools to gain such knowledge. To address this, we have enhanced and implemented additional data sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior, including contention for node-level shared resources, of multi-core applications for a diverse set of use cases. We are currently exploring how the values reported are affected by the level of concurrency, the synchronization methodologies, and progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.
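The sketch below is not the LDMS sampler code; it shows the kind of PAPI counter collection such a sampler module wraps. The chosen events (total instructions and L1 data-cache misses) and the dummy workload are illustrative, and error handling is reduced to a single check per call.

```cpp
// A minimal PAPI counter read of the sort an LDMS sampler module wraps -- not the
// sampler code itself. Event choices and the dummy workload are illustrative.
#include <papi.h>
#include <cstdio>
#include <cstdlib>

static void check(int rc, const char* what) {
  if (rc != PAPI_OK) {
    std::fprintf(stderr, "%s failed: %s\n", what, PAPI_strerror(rc));
    std::exit(1);
  }
}

int main() {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    std::fprintf(stderr, "PAPI init failed\n");
    return 1;
  }

  int eventset = PAPI_NULL;
  check(PAPI_create_eventset(&eventset), "create_eventset");
  check(PAPI_add_event(eventset, PAPI_TOT_INS), "add PAPI_TOT_INS");
  check(PAPI_add_event(eventset, PAPI_L1_DCM), "add PAPI_L1_DCM");

  check(PAPI_start(eventset), "start");
  volatile long sink = 0;                       // stand-in for the monitored workload
  for (long i = 0; i < 10000000; ++i) sink += i;
  long long counters[2] = {0, 0};
  check(PAPI_stop(eventset, counters), "stop");

  std::printf("instructions: %lld, L1 data-cache misses: %lld\n",
              counters[0], counters[1]);
  return 0;
}
```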


ACM SIGAPP Applied Computing Review | 2015

A wait-free multi-producer multi-consumer ring buffer

Steven D. Feldman; Damian Dechev

The ring buffer is a staple data structure used in many algorithms and applications. It is highly desirable in high-demand use cases such as multimedia, network routing, and trading systems. This work presents a new ring buffer design that is, to the best of our knowledge, the only array-based first-in-first-out ring buffer to provide wait-freedom. Wait-freedom guarantees that each thread completes its operation within a finite number of steps. This property is desirable for real-time and mission critical systems. This work is an extension and refinement of our earlier work. We have improved and expanded algorithm descriptions and pseudo-code, and we have performed all new performance evaluations. In contrast to other concurrent ring buffer designs, our implementation includes new methodology to prevent thread starvation and livelock from occurring.
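The published multi-producer multi-consumer algorithm is not reproduced here. For orientation, the sketch below is the textbook single-producer/single-consumer ring buffer, which is wait-free only because each index has exactly one writer; supporting many producers and consumers without locks, starvation, or livelock is the substantially harder problem this paper addresses.

```cpp
// Not the authors' MPMC algorithm: a textbook single-producer/single-consumer ring
// buffer. It is wait-free only because exactly one thread writes 'head_' and
// exactly one writes 'tail_'.
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>   // N must be a power of two
class SpscRingBuffer {
  static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
  T buf_[N];
  std::atomic<std::size_t> head_{0};   // next slot to write (producer-owned)
  std::atomic<std::size_t> tail_{0};   // next slot to read (consumer-owned)

 public:
  // Producer side: returns false if the buffer is full.
  bool try_push(const T& item) {
    std::size_t head = head_.load(std::memory_order_relaxed);
    if (head - tail_.load(std::memory_order_acquire) == N) return false;  // full
    buf_[head & (N - 1)] = item;
    head_.store(head + 1, std::memory_order_release);   // publish the element
    return true;
  }

  // Consumer side: returns std::nullopt if the buffer is empty.
  std::optional<T> try_pop() {
    std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T item = buf_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);    // free the slot
    return item;
  }
};
```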


IEEE Access | 2013

LC/DC: Lockless Containers and Data Concurrency, a Novel Nonblocking Container Library for Multicore Applications

Damian Dechev; Pierre LaBorde; Steven D. Feldman

Exploiting the parallelism in multiprocessor systems is a major challenge in modern computer science. Multicore programming demands a change in the way we design and use fundamental data structures. The standard collection of data structures and algorithms in C++11 is the sequential Standard Template Library (STL). In this paper, we present our vision for the theory and practice of the design and implementation of a collection of highly concurrent fundamental data structures for multiprocessor application development, with an associated programming interface and advanced optimization support. Specifically, the proposed approach will provide a familiar, easy-to-use, and composable interface, similar to that of the C++ STL. Each container type will be enhanced with internal support for nonblocking synchronization of its data access, thereby providing better safety and performance than traditional blocking synchronization by: 1) eliminating hazards such as deadlock, livelock, and priority inversion and 2) being highly scalable in supporting large numbers of threads. The new library, Lockless Containers/Data Concurrency, will provide algorithms for handling fundamental computations in multithreaded contexts and will incorporate these into libraries with a familiar look and feel. The proposed approach will provide an immense boost in performance and software reuse, and consequently productivity, for developers of scientific and systems applications, which are predominantly written in C/C++. STL is widely used, and a concurrent replacement library will have immediate practical relevance and a significant impact on a variety of parallel programming domains including simulation, massive data mining, computational biology, financial engineering, and embedded control systems. As a proof of concept, this paper discusses the first design and implementation of a wait-free hash table.
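As a hypothetical illustration of what an "STL-familiar" concurrent interface can look like (this is not LC/DC's actual API), the sketch below exposes value-returning operations instead of iterators and references, which are unsafe to hand out while other threads mutate the container. The implementation uses a plain mutex purely to keep the sketch short; the point of a library like LC/DC is to back the same interface with nonblocking synchronization.

```cpp
// Hypothetical interface sketch -- not LC/DC's actual API. Operations return
// copies or success flags rather than iterators/references. The mutex-based
// implementation is a placeholder for brevity only.
#include <cstdio>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

template <typename Key, typename Value>
class concurrent_map {
  mutable std::mutex mtx_;
  std::unordered_map<Key, Value> map_;

 public:
  bool insert(const Key& key, const Value& value) {
    std::lock_guard<std::mutex> g(mtx_);
    return map_.emplace(key, value).second;
  }
  std::optional<Value> find(const Key& key) const {   // returns a copy, not a reference
    std::lock_guard<std::mutex> g(mtx_);
    auto it = map_.find(key);
    if (it == map_.end()) return std::nullopt;
    return it->second;
  }
  bool erase(const Key& key) {
    std::lock_guard<std::mutex> g(mtx_);
    return map_.erase(key) != 0;
  }
};

int main() {
  concurrent_map<std::string, int> m;
  m.insert("answer", 42);
  if (auto v = m.find("answer")) std::printf("answer = %d\n", *v);
  return 0;
}
```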

Collaboration


Dive into Steven D. Feldman's collaborations.

Top Co-Authors

Damian Dechev (University of Central Florida)
Pierre LaBorde (University of Central Florida)
Akshatha Bhat (University of Texas at San Antonio)
Andrew Barrington (University of Central Florida)
Carlos Valera-Leon (University of Central Florida)
Deli Zhang (University of Central Florida)
James M. Brandt (Sandia National Laboratories)
Qing Yi (University of Colorado Colorado Springs)
Ramin Izadpanah (University of Central Florida)