Despite commendable achievements made by existing work, prevailing multimodal
sarcasm detection studies rely more on textual content over visual information.
It unavoidably induces spurious correlations between textual words and labels,
thereby significantly hindering the models' generalizati