The proposed YOLO-Former method combines the transformer architecture with YOLOv4 to build an accurate and efficient object detection system. It retains the fast inference of YOLOv4 and adds convolutional attention and transformer modules to gain the advantages of the transformer architecture. On the Pascal VOC dataset, the method reaches a mean average precision (mAP) of 85.76\% while running at 10.85 frames per second. The contribution of this work is to show how combining these two state-of-the-art techniques can further improve object detection.

YOLO-Former: Combining YOLO and ViT