Abstract

By training to predict the next token in an unlabeled corpus, large
language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require