Modern image classification is based upon directly predicting model classes via large discriminative networks, making it difficult to assess the intuitive visual ``features'' that may constitute a classification decision. At the same time, recent works in joint visual language models s