2025 - NeurIPS 2025 Lock-LLM

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.

JailbreaksRed TeamingAI SafetyLLM SecurityEvolutionary SearchEvaluation

Open PDF poster

TL;DR

Research Note

Overview

Multi-turn-to-single-turn (M2S) prompts compress long red-teaming conversations into a single structured attack. Our earlier M2S work relied on three hand-crafted templates (Hyphenize, Numberize, Pythonize), which raised a natural question:

Instead of hand-writing a few templates, can we search the template space and automatically discover stronger single-turn jailbreak structures in a reproducible way?

X-Teaming Evolutionary M2S answers this question with an LLM-guided evolutionary framework that proposes, executes, and evaluates M2S templates under a calibrated StrongREJECT-style judge.

From manual templates to evolutionary search

Hand-crafted M2S templates have two main limitations:

They cover only a tiny fraction of the possible design space.
It is unclear whether template tweaks are actually improving robustness or just overfitting to a specific model/threshold.

To address this, X-Teaming Evolutionary M2S:

Treats template structure as an explicit search space.
Uses an evolutionary loop guided by LLM feedback to propose and refine templates.
Restores selection pressure by calibrating the success threshold of the judge.

The goal is to make template discovery data-driven, auditable, and reproducible.

Problem setup

Formally, a multi-turn adversarial dialogue is written as

\mathcal{C} = \{(u_t, v_t)\}_{t=1}^T,

where $u_t$ is the user turn and $v_t$ is the model reply.

An M2S template $\tau$ deterministically consolidates the dialogue into a single prompt

x = \tau(\mathcal{C}),

by inserting user utterances into placeholders like $\{\text{PROMPT}_1\}, \dots, \{\text{PROMPT}_N\}$ .

A target model $f$ then produces a response

y = f(x).

A StrongREJECT-style LLM-as-judge $J$ scores the pair $(\text{forbidden prompt}, y)$ on a normalized scale

s = J(x, y) \in [0, 1],

and we declare the trial a success if $s \ge \theta$ .

In X-Teaming Evolutionary M2S, we set a strict threshold $\theta = 0.70$ to avoid early saturation and maintain room for genuine evolution.

We seed the search with three base families:

hyphenize
numberize
pythonize

and let evolution discover additional template families.

X-Teaming evolution loop

Each evolution run proceeds in generations:

1. Score aggregation

- For each template family, aggregate success rate, mean judge score, and length statistics over the current batch of prompts.

2. Template proposal (generator)

- Use an LLM “generator” to propose new template schemata that

- amplify patterns seen in successful templates, and

- avoid failure modes highlighted by the judge.

3. Schema validation

- Enforce a minimal schema (ID, name, template text, description, placeholder types) and require at least $\{\text{PROMPT}_1\}$ and $\{\text{PROMPT}_N\}$ for variable-length dialogues.

- Reject malformed candidates before any model calls.

4. Selection and next generation

- Keep top-performing families plus a curated subset of new proposals to form the next generation’s candidate set.

- Stop when success rates converge within a narrow variance band or after reaching a generation cap.

In the main study, we run five generations starting from the three base templates and discover two new evolved families.

Key results

1. Evolution under a strict threshold

On GPT-4.1 at $\theta = 0.70$ :

Overall success: 44.8% (103 / 230 trials).
Mean normalized judge score: 0.439.
The study progresses through 5 generations, starting from three base templates and discovering two new evolved families (Evolved_1, Evolved_2).

Per-template success rates at the same threshold:

hyphenize: 52.0%
pythonize: 52.0%
Evolved_1: 47.5%
Evolved_2: 37.5%
numberize: 34.0%

Using $\theta=0.70$ (vs. 0.25) reduces raw success but prevents early saturation and preserves selection pressure for meaningful evolution.

2. Cross-model transfer and “immune” models

In the 2,500-trial cross-model panel:

Structural advantages of evolved templates transfer, but template rankings are target-dependent.
On GPT-4.1 and Qwen3-235B, evolved templates are competitive or leading.
On Claude-4-Sonnet, numberize is unexpectedly strong.
Two targets (GPT-5, Gemini-2.5-Pro) show zero successes at $\theta = 0.70$ in-sample (suggestive, not a proof of robustness).

Limitations

Using a fixed single judge (GPT-4.1) and the known length–score coupling can bias evaluations toward verbosity rather than pure harmfulness.
Cross-model conclusions are based on finite samples and a chosen threshold; “zero success” only means failure under this experimental setup, not proven robustness.
Responsible disclosure / dual-use constraints and provider policy drift can require redacting or altering artifacts, which may limit full public reproducibility and affect what is retained as “discovered” templates.

Resources

Paper (arXiv)

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

https://arxiv.org/abs/2509.08729

Code & Artifacts

M2S X-Teaming Evolution Pipeline (GitHub)

https://github.com/hyunjun1121/M2S-x-teaming