LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

M. Phute, A. Helbling, M. Hull, D.H. Chau
Georgia Institute of Technology, Georgia, United States

Keywords: Large Language Models, self defense, harm detection, robustness

Large language models (LLMs) have skyrocketed in popularity in recent years due to their ability to generate high-quality text in response to human prompting. However, these models can also generate harmful content in response to user prompts (e.g., instructions on how to commit crimes). The literature has focused on mitigating these risks through methods such as aligning models with human values via reinforcement learning. Even so, aligned language models have been shown to be susceptible to adversarial attacks that bypass their restrictions on generating harmful text (Carlini et al., 2023). We propose a simple approach to defending against these attacks: having a large language model filter its own responses. Our results show that even if a model is not fine-tuned to be aligned with human values, it is possible to stop it from presenting harmful content to users by validating the content using a language model.
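The filtering idea can be illustrated with a minimal sketch, under stated assumptions. The function `self_defended_reply`, the wording of `FILTER_PROMPT`, the `REFUSAL` message, and the yes/no parsing rule are all hypothetical choices made for illustration and are not taken from the paper. The callables `generate` and `judge` stand in for calls to a language model; they may be the same model or different ones.

```python
from typing import Callable

# Hypothetical filter prompt: the exact wording is an assumption for illustration.
FILTER_PROMPT = (
    "Does the following text contain harmful content? "
    "Answer with 'yes' or 'no'.\n\nText: {text}"
)

# Hypothetical message shown to the user when a response is withheld.
REFUSAL = "Sorry, I cannot help with that request."


def self_defended_reply(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str], str],
) -> str:
    """Generate a candidate response, then ask a language model to validate it.

    `generate` maps a user prompt to a candidate response.
    `judge` maps a filter prompt to a free-text verdict.
    """
    candidate = generate(prompt)

    # Ask the judging model whether the candidate text is harmful.
    verdict = judge(FILTER_PROMPT.format(text=candidate))

    # Take the first word of the verdict, ignoring case and punctuation.
    words = verdict.strip().lower().split()
    first_word = words[0].strip(".,!?:;\"'") if words else ""

    # Fail closed (an assumption of this sketch): only an explicit "no"
    # lets the candidate through; any other verdict withholds it.
    if first_word == "no":
        return candidate
    return REFUSAL
```

Treating every verdict other than an explicit "no" as harmful is a conservative choice for this sketch: an ambiguous or empty answer from the judging model withholds the response rather than releasing it.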