Jailbreak Attacks and Defenses in Large Language Models: A Beginner-Friendly Survey
Large language models (LLMs) are designed to be helpful, polite, and safe. However, users and attackers have discovered that these models can sometimes be pushed into ignoring their safety rules, a practice commonly called jailbreaking. A jailbreak attack is a method for making an LLM answer a question or perform a task that it would normally refuse. In response, researchers have proposed many defenses to make models more robust against such attacks. This paper presents a beginner-friendly survey of major LLM jailbreak attack and defense methods. We follow a simple taxonomy in which attacks are divided into white-box and black-box methods, and defenses into prompt-level and model-level methods. For each method family, we explain the main idea in plain language, name representative techniques from the literature, and provide descriptive toy examples to illustrate the mechanism. We also summarize common evaluation metrics and datasets used in jailbreak research. The purpose of this paper is pedagogical: to give new students and researchers a clear mental map of how jailbreak attacks work, why they succeed, and how current defenses attempt to stop them.