Back to list

2025 - NeurIPS 2025 Lock-LLM

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.

JailbreaksRed TeamingAI SafetyLLM SecurityEvolutionary SearchEvaluation

Overview

Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.

Research Note

Overview

Multi-turn-to-single-turn (M2S) prompts compress long red-teaming conversations into a single structured attack. Our earlier M2S work relied on three hand-crafted templates (Hyphenize, Numberize, Pythonize), which raised a natural question:

Instead of hand-writing a few templates, can we search the template space and automatically discover stronger single-turn jailbreak structures in a reproducible way?

X-Teaming Evolutionary M2S answers this question with an LLM-guided evolutionary framework that proposes, executes, and evaluates M2S templates under a calibrated StrongREJECT-style judge.

This post summarizes the method, key results, and my role as co-first author.


From manual templates to evolutionary search

Hand-crafted M2S templates have two main limitations:

  • They cover only a tiny fraction of the possible design space.
  • It is unclear whether template tweaks are actually improving robustness or just overfitting to a specific model/threshold.

To address this, X-Teaming Evolutionary M2S:

  • Treats template structure as an explicit search space.
  • Uses an evolutionary loop guided by LLM feedback to propose and refine templates.
  • Restores selection pressure by calibrating the success threshold of the judge.

The goal is to make template discovery data-driven, auditable, and reproducible.


Problem setup

Formally, a multi-turn adversarial dialogue is written as

C={(ut,vt)}t=1T,\mathcal{C} = \{(u_t, v_t)\}_{t=1}^T,

where utu_t is the user turn and vtv_t is the model reply.

An M2S template τ\tau deterministically consolidates the dialogue into a single prompt

x=τ(C),x = \tau(\mathcal{C}),

by inserting user utterances into placeholders like {PROMPT1},,{PROMPTN}\{\text{PROMPT}_1\}, \dots, \{\text{PROMPT}_N\}.

A target model ff then produces a response

y=f(x).y = f(x).

A StrongREJECT-style LLM-as-judge JJ scores the pair (forbidden prompt,y)(\text{forbidden prompt}, y) on a normalized scale

s=J(x,y)[0,1],s = J(x, y) \in [0, 1],

and we declare the trial a success if sθs \ge \theta.

In X-Teaming Evolutionary M2S, we set a strict threshold θ=0.70\theta = 0.70 to avoid early saturation and maintain room for genuine evolution.

We seed the search with three base families:

  • hyphenize
  • numberize
  • pythonize

and let evolution discover additional template families.


X-Teaming evolution loop

Each evolution run proceeds in generations:

1. Score aggregation

- For each template family, aggregate success rate, mean judge score, and length statistics over the current batch of prompts.

2. Template proposal (generator)

- Use an LLM “generator” to propose new template schemata that

- amplify patterns seen in successful templates, and

- avoid failure modes highlighted by the judge.

3. Schema validation

- Enforce a minimal schema (ID, name, template text, description, placeholder types) and require at least {PROMPT1}\{\text{PROMPT}_1\} and {PROMPTN}\{\text{PROMPT}_N\} for variable-length dialogues.

- Reject malformed candidates before any model calls.

4. Selection and next generation

- Keep top-performing families plus a curated subset of new proposals to form the next generation’s candidate set.

- Stop when success rates converge within a narrow variance band or after reaching a generation cap.

In the main study, we run five generations starting from the three base templates and discover two new evolved families.


Smart data sampling & cross-model evaluation

To make results robust and interpretable, the pipeline includes:

  • Smart loader

- Balances prompts from 12 jailbreak sources, avoids duplicates, and logs the original multi-turn text plus the converted single-turn prompt for every trial.

  • Target execution

- For each pair (C,τ)(\mathcal{C}, \tau), records:

- template metadata,

- exact prompt sent to the target model (including parameters),

- raw model output and its length.

  • Cross-model transfer panel

- Evaluates 5 templates (3 base + 2 evolved) across 5 target models with a balanced design:

- 100 prompts per (template, model) cell

- 5×5×100=2,5005 \times 5 \times 100 = 2{,}500 trials in total

- Judge is fixed to GPT-4.1 and only sees the forbidden prompt and response (model identity hidden).

  • Metrics

- Primary: success rate at threshold θ=0.70\theta = 0.70.

- Secondary: mean normalized judge score.

- Auxiliary: compression ratio, output length, relevance heuristics.

- Statistical tools: Wilson 95% CIs, Cohen’s hh, Pearson correlations, and non-parametric tests where appropriate.


Key results

1. Evolution under a strict threshold

On GPT-4.1 with θ=0.70\theta = 0.70:

  • Overall success: 44.8% (103 / 230 trials).
  • Mean normalized judge score: 0.439.
  • The study progresses through 5 generations, starting from three base templates and discovering two new evolved families (Evolved_1, Evolved_2).

Per-template success rates at the same threshold:

  • hyphenize: 52.0%
  • pythonize: 52.0%
  • Evolved_1: 47.5%
  • Evolved_2: 37.5%
  • numberize: 34.0%

Raising the threshold from 0.25 (preliminary runs) to 0.70 reduces raw success but prevents early saturation and enables real evolutionary progress instead of trivial overfitting.

2. Cross-model transfer and “immune” models

In the 2,500-trial cross-model panel:

  • Structural advantages of evolved templates do transfer to other models, but the ranking of templates changes by target.
  • On GPT-4.1 and Qwen3-235B, evolved templates are competitive or leading.
  • On Claude-4-Sonnet, numberize is surprisingly strong despite being weaker on GPT-4.1.
  • Two targets (GPT-5, Gemini-2.5-Pro) show zero successes at θ=0.70\theta = 0.70 in our sample, appearing “immune” to this specific M2S attack family (though not formally proven robust).

This highlights that template structure matters, but its impact is heavily shaped by each model’s safety stack.

3. Length–score coupling

Across GPT-4.1 trials, there is a positive correlation between response length and judge score:

  • Longer outputs tend to receive higher normalized StrongREJECT scores.
  • This pattern holds both overall and per-template.

The result suggests that judge rubrics like StrongREJECT may implicitly reward more elaborated, detailed answers, motivating length-aware calibration in future safety evaluations.


Implications for AI Safety & Red Teaming

  • Structure-level search works.

Evolutionary optimization over template schemata is a practical way to find stronger single-turn probes, not just a theoretical idea.

  • Threshold calibration is crucial.

Simply lowering the judge threshold inflates success rates but kills selection pressure. Properly calibrated thresholds expose meaningful differences between templates and models.

  • Cross-model evaluation is non-optional.

Templates that look strong on one endpoint can underperform—or be completely ineffective—on others. Robust safety analysis needs panels of models, not a single target.

  • Judges need calibration and length awareness.

Length–score coupling means that naive use of judge scores may conflate verbosity with harmfulness. Safety teams should account for response length and consider stricter, task-specific rubrics.

For a company deploying LLMs in production, X-Teaming Evolutionary M2S provides:

  • A reusable pipeline for automatically stress-testing defenses with evolving single-turn prompts.
  • A template for logging, statistical analysis, and cross-model comparisons that can be integrated into CI pipelines.

My role

As co-first author of X-Teaming Evolutionary M2S, I:

  • Co-designed the overall structure-level search framing and evolution protocol.
  • Implemented key components of the pipeline:

- M2S converter that maps multi-turn dialogues to template placeholders.

- StrongREJECT-style judge integration and threshold calibration logic.

- Trial logging, analysis scripts, and result aggregation.

  • Ran the GPT-4.1 evolutionary study and the 2,500-trial cross-model panel, including statistical analysis (confidence intervals, effect sizes, correlations).
  • Helped define smart sampling across 12 sources to ensure diversity and traceability.
  • Co-authored the paper and maintained the public code repository for full reproducibility.

Resources

  • Paper (arXiv)

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

https://arxiv.org/abs/2509.08729

  • Code & Artifacts

M2S X-Teaming Evolution Pipeline (GitHub)

https://github.com/hyunjun1121/M2S-x-teaming