Abstract

Transformer-based large language models (e.g., BERT and GPT) have achieved remarkable success, and fine-tuning, which adapts a pre-trained model to a task-specific dataset, is the standard practice for applying these models to downstream tasks. However, Transformer