We investigate video-aided grammar induction, which learns a constituency
parser from both unlabeled text and its corresponding video. Existing methods
for multi-modal grammar induction focus on learning syntactic grammars from
text-image pairs, with promising results showing that the i