Abstract
Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memory) and is prone to overfitting. When it comes to
video captioning, one of the most challenging benchmark tasks in