GPUs have become the de facto hardware devices for accelerating Deep Neural Network (DNN) inference in deep learning (DL) frameworks. However, the conventional sequential execution of DNN operators in mainstream DL frameworks cannot fully utilize GPU resources, because DNN model structures are growing more complex while individual DNN operators are becoming computationally smaller. Moreover, a poorly chosen operator launch order in parallelized execution can waste GPU resources and cause unexpected performance interference among operators. To address these performance issues, we propose Opara, a resource- and interference-aware DNN Operator parallel scheduling framework that accelerates DNN inference on GPUs. Specifically, Opara first employs CUDA Streams and CUDA Graph to automatically parallelize the execution of multiple DNN operators. It then uses the resource demands of DNN operators to adjust the operator launch order on GPUs, overlapping the execution of compute-intensive and memory-intensive operators to expedite DNN inference. We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner. Extensive prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph in PyTorch and the state-of-the-art DNN operator parallelism systems by up to 1.68$\times$ and 1.29$\times$, respectively, with acceptable runtime overhead.

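The paper proposes Opara, a resource- and interference-aware DNN operator parallel scheduling framework that accelerates DNN inference on GPUs. It uses CUDA Streams and CUDA Graph to automatically parallelize the execution of multiple DNN operators, and it adjusts the operator launch order on the GPU so that compute-intensive and memory-intensive operators overlap, which speeds up DNN inference. Experiments on representative DNN and Transformer-based models show that Opara outperforms the default sequential CUDA Graph execution and state-of-the-art DNN operator parallelism systems by up to 1.68$\times$ and 1.29$\times$, respectively, with acceptable runtime overhead.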
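The two ideas described above, assigning independent operators to separate streams and choosing a launch order that alternates compute-intensive and memory-intensive operators, can be illustrated with a small pure-Python sketch. This is not Opara's actual algorithm: the greedy heuristics, the function names `assign_streams` and `launch_order`, the toy operator graph, and the `"compute"`/`"memory"` labels are all assumptions made for illustration. In a real PyTorch setting, the integer stream ids would presumably correspond to `torch.cuda.Stream` objects and the resulting launch sequence would be captured in a CUDA Graph.

```python
from collections import defaultdict

def assign_streams(ops, deps):
    """Greedy stream assignment (illustrative heuristic, not Opara's).

    `ops` must be in topological order. An operator inherits the stream of
    the first predecessor whose stream no other successor has inherited yet;
    otherwise it opens a new stream.
    """
    stream_of = {}
    claimed = set()   # predecessors whose stream was already inherited
    next_stream = 0
    for op in ops:
        for pred in deps.get(op, []):
            if pred not in claimed:
                claimed.add(pred)
                stream_of[op] = stream_of[pred]
                break
        else:  # no predecessor stream available: open a new one
            stream_of[op] = next_stream
            next_stream += 1
    return stream_of

def launch_order(ops, deps, kind):
    """Topological order that prefers alternating operator kinds.

    Among the ready operators, pick the first one whose kind differs from
    the previously launched operator; fall back to the first ready operator.
    """
    remaining = {op: len(deps.get(op, [])) for op in ops}
    succs = defaultdict(list)
    for op in ops:
        for pred in deps.get(op, []):
            succs[pred].append(op)
    ready = [op for op in ops if remaining[op] == 0]
    order, last_kind = [], None
    while ready:
        pick = next((op for op in ready if kind[op] != last_kind), ready[0])
        ready.remove(pick)
        order.append(pick)
        last_kind = kind[pick]
        for succ in succs[pick]:
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)
    return order

# Hypothetical two-branch graph: split -> (conv1 -> bn1, conv2 -> bn2) -> concat
ops = ["split", "conv1", "conv2", "bn1", "bn2", "concat"]
deps = {"conv1": ["split"], "conv2": ["split"],
        "bn1": ["conv1"], "bn2": ["conv2"], "concat": ["bn1", "bn2"]}
# Assumed labels: convolutions as compute-intensive, the rest as memory-intensive
kind = {"split": "memory", "conv1": "compute", "conv2": "compute",
        "bn1": "memory", "bn2": "memory", "concat": "memory"}

streams = assign_streams(ops, deps)
order = launch_order(ops, deps, kind)
print(streams)
print(order)
```

On this toy graph the two branches land on two different streams, and the launch order places `bn1` between `conv1` and `conv2`, so that a memory-intensive operator on one stream can overlap with a compute-intensive operator on the other. A plain topological order would instead launch the two convolutions back to back.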

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs