Jailbreaking Black Box Large Language Models in Twenty Queries

University of Pennsylvania

PAIR is a state-of-the-art procedure for efficiently generating interpretable jailbreaks, requiring only black-box access to the target model.



How does PAIR work?

PAIR uses a separate attacker language model to generate jailbreaks against any target model. The attacker model receives a detailed system prompt instructing it to operate as a red-teaming assistant. PAIR uses in-context learning to iteratively refine the candidate prompt until a successful jailbreak is found, accumulating previous attempts and the target's responses in the chat history. The attacker model also reflects on both the prior prompt and the target model's response to generate an "improvement" as a form of chain-of-thought reasoning, allowing the attacker model to explain its approach and providing a degree of interpretability.
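The sketch below illustrates this refinement loop in Python. It is only a minimal illustration, not the authors' implementation: the callables query_attacker, query_target, and judge_score, the JSON fields "improvement" and "prompt", and the 1-10 judge scale are assumptions standing in for the attacker LLM, the target LLM, and the judge described above.

import json
from typing import Callable, Dict, List, Optional

def pair_attack(
    objective: str,                                        # behavior to elicit from the target
    query_attacker: Callable[[List[Dict[str, str]]], str],  # attacker LLM: chat history -> JSON string
    query_target: Callable[[str], str],                     # target LLM: prompt -> response
    judge_score: Callable[[str, str], int],                  # judge: (prompt, response) -> score in 1-10
    max_iters: int = 20,
) -> Optional[str]:
    """Iteratively refine a candidate jailbreak prompt using only black-box queries."""
    # Prime the attacker with a red-teaming system prompt containing the objective.
    conversation = [{
        "role": "system",
        "content": f"You are a red-teaming assistant. Objective: {objective}",
    }]
    for _ in range(max_iters):
        # The attacker returns an "improvement" (its chain-of-thought reflection)
        # and a refined candidate "prompt".
        attacker_out = json.loads(query_attacker(conversation))
        candidate = attacker_out["prompt"]

        # Query the target model with the candidate jailbreak.
        response = query_target(candidate)

        # Score the target's response; treat the maximum score as a successful jailbreak.
        score = judge_score(candidate, response)
        if score == 10:
            return candidate

        # Accumulate the attempt and the target's response in the chat history
        # so the attacker can refine its next candidate via in-context learning.
        conversation.append({"role": "assistant", "content": json.dumps(attacker_out)})
        conversation.append({
            "role": "user",
            "content": f"TARGET RESPONSE: {response}\nSCORE: {score}",
        })
    return None  # no jailbreak found within the query budget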


PAIR example.


Explanation Video

Abstract

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR, which is inspired by social engineering attacks, uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.



Examples

Results

We evaluate the success rate of PAIR against the prior state of the art for directly generating jailbreaks on a target model. Since PAIR does not require access to model weights, it can attack any language model with only API access. PAIR often succeeds within a few dozen queries, rather than the hundreds of thousands required by existing methods.


We also compare the transferability of PAIR's generated jailbreaks to different target models. PAIR achieves state-of-the-art transferability, with notably higher success rates on more capable models such as GPT-4 (we omit transfers to the original target model).


Contact

Please feel free to email us at pchao@wharton.upenn.edu. If you find this work useful in your own research, please consider citing our work.
@misc{chao2023jailbreaking,
      title={Jailbreaking Black Box Large Language Models in Twenty Queries}, 
      author={Patrick Chao and Alexander Robey and Edgar Dobriban and Hamed Hassani and George J. Pappas and Eric Wong},
      year={2023},
      eprint={2310.08419},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}