Large-scale pre-trained vision foundation models, such as CLIP, have become
de facto backbones for various vision tasks. However, due to their black-box
nature, understanding the underlying rules behind these models' predictions and
controlling model behaviors have remained open challe