Video-and-language pre-training has shown promising results for learning
generalizable representations. Most existing approaches model video and
text in an implicit manner, without considering explicit structural
representations of the multi-modal content. We denote such form o
Context Optimization with Multi-Knowledge Representation (CoKnow) enhances prompt learning for vision-language models (VLMs) by addressing the lack of diversity in prompt templates, improving performance over previous methods.