Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis | 2021

Whale: efficient one-to-many data partitioning in RDMA-assisted distributed stream processing systems

 
 
 
 

Abstract


To process large-scale real-time data streams, existing distributed stream processing systems (DSPSs) leverage different stream partitioning strategies. The one-to-many data partitioning strategy plays an important role in various applications. With one-to-many data partitioning, an upstream processing instance sends a generated tuple to a potentially large number of downstream processing instances. Existing DSPSs leverage an instance-oriented communication mechanism, where an upstream instance transmits a tuple to different downstream instances separately. However, in one-to-many data partitioning, multiple downstream instances typically run on the same machine to exploit multi-core resources. Therefore, a DSPS actually sends a data item to a machine multiple times, raising significant unnecessary costs for serialization and communication. We show that such a mechanism can lead to serious performance bottleneck due to CPU overload. To address the problem, we design and implement Whale, an efficient RDMA (Remote Direct Memory Access) assisted distributed stream processing system. Two factors contribute to the efficiency of this design. First, we propose a novel RDMA-assisted stream multicast scheme with a self-adjusting non-blocking tree structure to alleviate the CPU workloads of an upstream instance during one-to-many data partitioning. Second, we re-design the communication mechanism in existing DSPSs by replacing the instance-oriented communication with a new worker-oriented communication scheme, which saves significant costs for redundant serialization and communication. We implement Whale on top of Apache Storm and conduct comprehensive experiments to evaluate its performance with large-scale real world datasets. The results show that Whale achieves 56.6× improvement of system throughput and 97% reduction of processing latency compared to existing designs.

Volume None
Pages None
DOI 10.1145/3458817.3476192
Language English
Journal Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Full Text