International Workshop on OpenCL | 2021

Executing Graphs with OpenCL

 

Abstract


For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries have made graph programming accessible to even novice programmers. Previously, a beginner programmer might have talked about writing a number-guessing game; today, the same programmer will describe training an off-the-shelf neural network (a type of graph) for handwriting recognition. There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing larger and increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these large, heterogeneous accelerators. Does OpenCL have the expressive power required to describe an execution graph to accelerator hardware, or does OpenCL serialize graphs and execute them sequentially? This technical presentation argues that it is the former: OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and OpenCL is sufficiently powerful to execute that graph on a graph accelerator.
The OpenCL API is designed around the concept of the user enqueuing commands onto the front of a command-queue. Commands include executing kernels (i.e., functions) and reading, writing, and copying data buffers. The OpenCL device driver removes commands from the back of a command-queue, sets up data transfers to and from the accelerator device, and schedules kernels to execute on the device.
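The queue model described above can be sketched as a toy simulation in Python. This is not the OpenCL API: `ToyCommandQueue`, `enqueue`, and `drain` are hypothetical names chosen for illustration, and the "device buffers" are plain Python lists. The sketch only shows a user enqueuing write, kernel, and read commands at one end of a queue while a stand-in driver removes and runs them from the other end.

```python
from collections import deque


class ToyCommandQueue:
    """Toy stand-in for a command-queue (hypothetical, not the OpenCL API)."""

    def __init__(self):
        self._commands = deque()
        self.log = []  # names of commands, in the order the "driver" ran them

    def enqueue(self, name, action):
        # The user adds a command at one end of the queue.
        self._commands.append((name, action))

    def drain(self):
        # The "driver" removes commands from the other end and runs them.
        while self._commands:
            name, action = self._commands.popleft()
            action()
            self.log.append(name)


host_input = [1, 2, 3, 4]
device_in = [0] * 4   # stand-in for a device buffer
device_out = [0] * 4  # stand-in for a device buffer
host_output = [0] * 4


def write_buffer():
    device_in[:] = host_input  # host -> device transfer


def square_kernel():
    for i, x in enumerate(device_in):
        device_out[i] = x * x


def read_buffer():
    host_output[:] = device_out  # device -> host transfer


queue = ToyCommandQueue()
queue.enqueue("write", write_buffer)
queue.enqueue("square", square_kernel)
queue.enqueue("read", read_buffer)
queue.drain()
print(queue.log, host_output)
```

In this toy the driver runs commands strictly in enqueue order; a real driver has more freedom, as the discussion of in-order and out-of-order command-queues makes clear.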
The command-queue abstraction can encode execution graphs in one of two ways, depending on whether the command-queue is an in-order command-queue or an out-of-order command-queue. An in-order command-queue guarantees that the effects of the enqueued commands will be as if the commands were executed in the order in which they were enqueued. However, the OpenCL device driver is allowed to reorder commands, provided that reordering does not affect the output. For example, if two kernels do not have a data dependency between them, then they can be executed in reverse order or even in parallel, if the driver and hardware support it. An out-of-order command-queue does not guarantee that commands will appear to have been executed in the order in which they were enqueued. Instead, it is the OpenCL API user’s responsibility to attach events and event wait lists to commands. When a command finishes executing, it triggers its attached event, and when all the events in a command’s event wait list have triggered, that command is allowed to execute. Both types of command-queues are capable of describing execution graphs. For in-order command-queues, the graph is implied by kernels’ data dependencies; for out-of-order command-queues, the graph is explicitly defined with events.
By instrumenting Codeplay’s ComputeAorta[2] OpenCL implementation, it is possible to record OpenCL API calls and to reconstruct the execution graph as seen by the OpenCL device driver. This presentation investigates the execution graphs generated by a simplified handwriting recognition neural network implemented in TensorFlow[1] and running on top of OpenCL via SYCL. Training a neural network and using a neural network for inference produce substantially different execution graphs, and both graphs are considered. The graphs show that data dependencies, opportunities for executing kernels in parallel, and opportunities for reordering kernels are all visible to the driver.
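The two encodings of an execution graph described above (implied by data dependencies on an in-order queue, explicit through events on an out-of-order queue) can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not how any particular driver works: the functions `implied_edges` and `event_edges`, the kernel names, and the buffer names are all hypothetical, and each command is reduced to the set of buffers it reads and the set it writes.

```python
def implied_edges(commands):
    """Infer dependency edges from buffer usage on a toy in-order queue.

    commands: list of (name, reads, writes) in enqueue order, where reads
    and writes are sets of buffer names. An edge (a, b) means that b must
    not start before a has finished.
    """
    edges = set()
    for j, (later, later_reads, later_writes) in enumerate(commands):
        for earlier, earlier_reads, earlier_writes in commands[:j]:
            hazard = (
                earlier_writes & later_reads      # read-after-write
                or earlier_reads & later_writes   # write-after-read
                or earlier_writes & later_writes  # write-after-write
            )
            if hazard:
                edges.add((earlier, later))
    return edges


def event_edges(wait_lists):
    """Read edges from explicit wait lists on a toy out-of-order queue.

    wait_lists: {command: [commands whose events it waits on]}.
    """
    return {(dep, cmd) for cmd, deps in wait_lists.items() for dep in deps}


# Hypothetical three-kernel graph: two independent products, then a sum.
commands = [
    ("matmul_a", {"x", "w1"}, {"h1"}),
    ("matmul_b", {"x", "w2"}, {"h2"}),
    ("add", {"h1", "h2"}, {"y"}),
]
wait_lists = {
    "matmul_a": [],
    "matmul_b": [],
    "add": ["matmul_a", "matmul_b"],
}

in_order_graph = implied_edges(commands)
out_of_order_graph = event_edges(wait_lists)
print(sorted(in_order_graph))
print(in_order_graph == out_of_order_graph)
```

Both encodings yield the same two edges into `add` and no edge between `matmul_a` and `matmul_b`, so in this toy model the two products could be reordered or run in parallel under either queue type.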
It is therefore possible for an OpenCL device driver to schedule work on a hardware accelerator that has been designed for graph execution. Note, however, that although OpenCL makes it possible to expose an execution graph to a device driver, it cannot guarantee that OpenCL API calls will form a meaningful graph. For example, if a user places many independent data arrays into one memory buffer and enqueues kernels that all operate on that single memory buffer, then information about the execution graph is hidden from OpenCL, and opportunities for parallel execution and kernel reordering are lost. Often, application developers do not write OpenCL code directly, but use libraries that have OpenCL backends. Consequently, it is the responsibility of library developers to ensure that the graph that an application intends to execute is represented correctly at the OpenCL level.
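The single-buffer pitfall described above can be made concrete with a small Python sketch. It assumes a toy driver that can only track dependencies at whole-buffer granularity; `conflicts`, `parallel_stages`, and the kernel and buffer names are hypothetical and do not correspond to any real OpenCL implementation.

```python
def conflicts(earlier, later):
    """True if two (name, reads, writes) commands touch a common buffer
    in a way that forces an ordering (at least one of them writes it)."""
    _, earlier_reads, earlier_writes = earlier
    _, later_reads, later_writes = later
    return bool(
        earlier_writes & later_reads
        or earlier_reads & later_writes
        or earlier_writes & later_writes
    )


def parallel_stages(commands):
    """Group commands into stages; commands in the same stage have no
    dependencies among them and could, in principle, run in parallel."""
    if not commands:
        return []
    stage_of = {}
    for j, cmd in enumerate(commands):  # enqueue order is a topological order
        preds = [c[0] for c in commands[:j] if conflicts(c, cmd)]
        stage_of[cmd[0]] = 1 + max((stage_of[p] for p in preds), default=-1)
    stages = [[] for _ in range(max(stage_of.values()) + 1)]
    for name, _, _ in commands:
        stages[stage_of[name]].append(name)
    return stages


# Three independent arrays, each in its own buffer.
separate = [
    ("scale_a", {"a"}, {"a_out"}),
    ("scale_b", {"b"}, {"b_out"}),
    ("scale_c", {"c"}, {"c_out"}),
]

# The same work, but with every array packed into one shared buffer.
packed = [
    ("scale_a", {"packed"}, {"packed"}),
    ("scale_b", {"packed"}, {"packed"}),
    ("scale_c", {"packed"}, {"packed"}),
]

separate_stages = parallel_stages(separate)
packed_stages = parallel_stages(packed)
print(separate_stages)
print(packed_stages)
```

With separate buffers the three kernels fall into a single stage, whereas packing the arrays into one buffer makes every kernel appear to depend on the one before it, so the toy driver sees a three-stage chain even though the underlying work is independent.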

DOI 10.1145/3456669.3456681
Language English
Journal International Workshop on OpenCL
