Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods were proposed by previous works, at system runtime, multiple applications are typically executed concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks that contain sub-networks with different performance trade-offs or runtime hardware resource management. In this thesis, we proposed a combined method, a system was developed for DNN performance trade-off management, combining the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in single model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a two concurrent model deployment scenario.

深度神经网络在移动和嵌入式平台上执行推理具有延迟、隐私和始终可用性等多个关键优势。然而，由于计算资源有限，有效地在移动和嵌入式平台上部署深度神经网络具有挑战性。本论文提出了一种结合了算法和硬件的运行时性能权衡管理方法，通过动态超网络实现了实时满足变化的应用性能目标和硬件约束。在实验中，我们的模型在Jetson Xavier NX的GPU上使用ImageNet数据集相对于最先进的方法，在相似的ImageNet Top-1准确率下速度提高了2.4倍，或在相似的延迟下准确率提高了5.1%。同时，我们设计了一个分级运行时资源管理器，在单模型部署场景中达到了19%的能量降低和9%的延迟降低，在两个并发模型部署场景中能量降低了89%，延迟降低了23%。

移动/嵌入式设备高效推理的动态深度神经网络和运行时管理