large language models (LLMs) can acquire extensive world knowledge through
pre-training on large corpora. However, due to exposure to low-quality data,
LLMs may exhibit harmful behavior without aligning with human values. The
dominant approach for steering LLMs towards beneficial behav