It is challenging for a weakly supervised object detection network to precisely predict the positions of objects, since no instance-level category annotations are available. Most existing methods address this problem with a two-phase learning procedure, i.e., a multiple instance learning detector followed by a fully supervised learning detector with bounding-box regression. We observe that this procedure may lead to local minima for some object categories. In this paper, we propose to jointly train the two phases in an end-to-end manner to tackle this problem. Specifically, we design a single network with both multiple instance learning and bounding-box regression branches that share the same backbone. In addition, a guided attention module trained with a classification loss is added to the backbone to effectively extract the implicit location information in the features. Experimental results on public datasets show that our method achieves state-of-the-art performance.
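
To make the joint objective concrete, the following is a minimal sketch of how a multiple instance learning (MIL) loss and a bounding-box regression loss might be combined over a shared set of proposals. It is not the paper's implementation. The two-stream softmax scoring, the choice of the top-scoring proposal as pseudo ground truth, the IoU threshold of 0.5, absolute-coordinate box predictions, and all function names are assumptions made for illustration. The guided attention module is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mil_image_scores(cls_logits, det_logits):
    """Assumed two-stream MIL scoring over [num_proposals][num_classes] logits.

    Returns per-proposal scores and image-level scores (one per class).
    """
    n, c = len(cls_logits), len(cls_logits[0])
    # Softmax over classes for each proposal.
    cls_prob = [softmax(row) for row in cls_logits]
    # Softmax over proposals for each class.
    det_prob = [softmax([det_logits[i][k] for i in range(n)]) for k in range(c)]
    prop = [[cls_prob[i][k] * det_prob[k][i] for k in range(c)] for i in range(n)]
    image = [sum(prop[i][k] for i in range(n)) for k in range(c)]
    return prop, image

def bce(scores, labels, eps=1e-6):
    """Mean binary cross-entropy between image-level scores and image labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)

def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def smooth_l1(pred, target):
    """Smooth L1 distance summed over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def joint_loss(cls_logits, det_logits, boxes, box_preds, image_labels, iou_thr=0.5):
    """MIL classification loss plus regression toward pseudo ground-truth boxes."""
    prop, image = mil_image_scores(cls_logits, det_logits)
    loss = bce(image, image_labels)
    for k, y in enumerate(image_labels):
        if y != 1:
            continue
        # Assumption: the top-scoring proposal of a present class is the pseudo ground truth.
        top = max(range(len(boxes)), key=lambda i: prop[i][k])
        pseudo_gt = boxes[top]
        matched = [i for i in range(len(boxes)) if iou(boxes[i], pseudo_gt) >= iou_thr]
        reg = sum(smooth_l1(box_preds[i], pseudo_gt) for i in matched)
        loss += reg / len(matched)
    return loss
```

Because both terms are computed from the same proposals in a single pass, gradients from the regression term and the MIL term would reach a shared backbone together, which is the property the joint training described above relies on.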

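This paper proposes a method that jointly trains a multi-phase model in an end-to-end manner to address the problem of precisely predicting object positions in weakly supervised object detection networks. The method combines multiple instance learning, bounding-box regression, and an attention module guided by a classification loss, and experimental results show that it achieves state-of-the-art performance.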

Towards Precise End-to-End Weakly Supervised Object Detection Network