Prompt Hacking

Prompt hacking is a core technique for AI Red Teamers targeting LLMs. It involves crafting inputs (prompts) to manipulate the model into bypassing safety controls, revealing hidden information, or performing unintended actions. Red teamers systematically test various prompt hacking methods (like jailbreaking, role-playing, or instruction manipulation) to assess the LLM's resilience against adversarial user input.

Learn more from the following resources:

@course@Introduction to Prompt Hacking
@guide@Prompt Hacking Guide
@paper@SoK: Prompt Hacking of LLMs (arXiv 2023)