ExploitGym

ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Zhun Wang¹, Nico Schiller², Hongwei Li³, Srijiith Sesha Narayana², Milad Nasr⁵, Nicholas Carlini⁵, Xiangyu Qi⁶, Eric Wallace⁶, Elie Bursztein⁷, Luca Invernizzi⁷, Kurt Thomas⁷, Yan Shoshitaishvili⁴, Wenbo Guo³, Jingxuan He¹, Thorsten Holz², Dawn Song¹
¹UC Berkeley, ²Max Planck Institute for Security and Privacy, ³UC Santa Barbara, ⁴Arizona State University,
⁵Anthropic, ⁶OpenAI, ⁷Google
May 13, 2026
(Est. 5-6 minutes read, more details in arXiv paper: https://arxiv.org/abs/2605.11086)

We are a team of researchers led by Berkeley RDI at UC Berkeley, in collaboration with the Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google. Together, we have been working on a question the security community has been nervously circling:

How good are today’s AI agents at turning known software vulnerabilities into working exploits, i.e., real attacks?

This is one of the most critical questions for measuring the impact of frontier AI on cybersecurity, particularly on the offensive side.

TL;DR

ExploitGym is a new benchmark of 898 real-world vulnerabilities spanning userspace programs, Google’s V8 JavaScript engine (the engine behind Chrome), and the Linux kernel. Given a vulnerability and a proof-of-concept input that triggers it, AI agents are tasked with analyzing the vulnerability and crafting a full exploit that achieves unauthorized code execution.

The headline results: Anthropic’s Claude Mythos Preview successfully exploited 157 of those 898 instances, and OpenAI’s GPT-5.5 exploited 120, within the time limit per task. Even when standard security defenses like ASLR or the V8 sandbox were turned on, a meaningful number of exploits still worked. More strikingly, agents sometimes discovered and exploited entirely different vulnerabilities than the ones they were pointed at.

Key Takeaways

Autonomous exploitation is no longer hypothetical. Frontier AI agents can already take a bug report and a crashing input, reason about memory layouts, chain together multiple attack primitives, and produce fully working exploits. This kind of multi-step, low-level work has traditionally required deep expertise and significant time investment from human security researchers.

Standard defenses help, but don’t fully stop AI-driven attacks. When mitigations like ASLR, stack canaries, and the V8 heap sandbox were enabled, successes dropped substantially, but didn’t hit zero. Agents found bypasses: partial-pointer overwrites to defeat ASLR, known sandbox-escape techniques for V8, and kernel tricks such as overwriting modprobe_path and side-channels to sidestep KASLR. This is a clear signal that defense-in-depth remains essential, but current mitigations alone are not enough against AI-capable adversaries.

This is inherently dual-use. That tension is exactly why we built this benchmark. Exploitation sits at the heart of a fundamental tension in cybersecurity. For defenders, it’s about determining whether a vulnerability actually matters in practice. Automated exploit generation could accelerate severity triage, help prioritize patches, and validate whether mitigations actually work. But the same capability lowers the expertise barrier for offensive misuse, making tasks that once required years of specialization accessible to far more actors. Sophisticated attackers could also adapt partial agent-generated trajectories into functioning exploits, using AI as a force multiplier.

As agents grow more capable, this asymmetry will intensify, and the window for proactive governance is narrowing. We believe the responsible path is to measure these capabilities rigorously and openly, so defenders, AI developers, and policymakers can make informed decisions. Our benchmark and results are intended as a foundation for those multi-stakeholder discussions.

What Is ExploitGym?

Most existing cybersecurity benchmarks for AI focus on tasks such as finding bugs, writing patches, or solving Capture-the-Flag (CTF) puzzles. Our earlier benchmark, CyberGym, focuses on real-world vulnerability analysis: given a vulnerability description and a codebase, agents must generate proof-of-concept inputs that trigger the bug. That is an important step in understanding whether AI can find or confirm vulnerabilities, but it stops short of the next question: can an agent turn a known bug into a real-world attack?

ExploitGym fills that gap. Each of its 898 tasks provides the agent with three things: the vulnerable source code with build instructions, a proof-of-vulnerability (PoV) input that triggers the bug, and a containerized runtime environment where the agent can interact with the target. The agent’s job is to transform that PoV into a working exploit that achieves unauthorized code execution, concretely, reading a secret flag that is inaccessible through any legitimate interface.

The benchmark spans three domains. Userspace programs (520 instances) cover widely used C/C++ projects like FFmpeg and OpenSSL, sourced from Google’s OSS-Fuzz and OSV reports. V8 browser engine tasks (185 instances) target JavaScript engine bugs in Chromium. Linux kernel tasks (193 instances) require full-privilege escalation inside a virtual machine.

Each domain also comes with toggleable security mitigations, so researchers can measure exactly how much standard defenses like ASLR or the V8 sandbox actually slow down an AI attacker.

In addition to validating unauthorized code execution through flag capture, we use an agent-as-a-judge to verify that each exploit actually targets the provided vulnerability. Real-world software often contains multiple flaws, and agents frequently succeed by exploiting a different bug than the intended one. The judge ensures a consistent metric for cross-agent comparison and aligns with the core defensive use case: assessing the real-world severity of a specific known flaw.

The Main Results

We tested seven model-agent combinations, all of which ran with safety filters disabled under structured access programs designed for security research. Each agent got two hours per task.

The top performers were Claude Mythos Preview (paired with Claude Code) at 157 successes and GPT-5.5 (paired with Codex CLI) at 120. GPT-5.4 came in at a respectable 54. After that, success counts dropped sharply: Claude Opus 4.6 solved 15, Gemini 3.1 Pro solved 12, and the remaining models were in the single digits.

Breaking results down by task domain reveals a pronounced difficulty gradient. Userspace tasks saw the broadest success. V8 exploitation was substantially harder, with only the top three models making real headway. Kernel exploitation was the sharpest dividing line: only Claude Mythos Preview (12 successes) and GPT-5.5 (22 successes) showed meaningful capability, while no other model managed more than one.

Model	Agent	Success				Cost (USD)		Time (min)		LLM Calls
Model	Agent	Total	U	B	K	Succ.	Full	Succ.	Full	Succ.	Full
#Instances		898	520	185	193
Claude Mythos Preview^†	Claude Code	157	107	38	12	—	—	54.7	102.1	225.5	289.3
Claude Opus 4.6^†	Claude Code	15	12	2	1	8.08	21.76	18.1	66.7	102.3	285.9
Claude Opus 4.7	Claude Code	7	4	3	0	8.64	3.40	22.1	14.4	102.0	54.0
Gemini 3.1 Pro	Gemini CLI	12	10	2	0	8.56	9.02	51.1	75.6	169.5	174.8
GLM-5.1	Claude Code	4	4	0	0	3.75	6.39	63.3	118.0	148.6	245.6
GPT-5.4	Codex CLI	54	38	15	1	12.20	25.43	51.1	103.5	220.1	443.8
GPT-5.5^‡	Codex CLI	120	71	27	22	22.99	34.55	49.6	69.8	256.8	375.4

Table 1: Agent performance under a two-hour timeout. Success is split by domain: userspace (U), V8 (B), and kernel (K). Cost, time, and LLM calls are per-task averages over successful runs (Succ.) and the full benchmark (Full).
^† Results obtained in collaboration with Anthropic.
^‡ OpenAI's default safety filters block all GPT-5.5 exploit attempts under default prompting.

When standard mitigations were turned back on, success rates dropped across the board, but didn’t vanish. Claude Mythos Preview still succeeded on 25 userspace, 17 V8, and 3 kernel tasks with defenses active. GPT-5.5 retained 10, 3, and 8, respectively.

Model	Userspace	V8	Kernel
Claude Opus 4.6	12 → 0	2 → 0	1 → 0
Claude Opus 4.7	4 → 0	3 → 0	0 → 0
Claude Mythos Preview	107 → 25	38 → 17	12 → 3
Gemini 3.1 Pro	10 → 0	2 → 0	0 → 0
GLM-5.1	4 → 0	0 → 0	0 → 0
GPT-5.4	38 → 2	15 → 0	1 → 1
GPT-5.5	71 → 10	27 → 3	22 → 8

Table 2: Mitigation-bypassing exploits. Each cell shows successes without mitigations → with mitigations enabled (ASLR, stack canaries, V8 heap sandbox, KASLR, etc.).

The Interesting Bits

Agents go off-script and find new bugs. One of the most interesting findings is the gap between “captured the flag” and “exploited the intended vulnerability.” Across models, agents frequently achieved code execution through a vulnerability other than the one we provided. The two strongest models show this most clearly: GPT-5.5 captured flags in 210 instances, but only 120 used the intended bug, and Claude Mythos Preview captured 226 flags, but only 157 targeted the right flaw. In summary, 90 and 69 of their solves, respectively, succeeded via an unintended path. In some cases, agents pivoted to an adjacent code path with weaker validation; in others, they concluded the given bug wasn’t exploitable and searched for entirely new attack surfaces, sometimes by auditing source code or even performing dynamic fuzzing. That’s a remarkable display of autonomous security reasoning.

Model	Flag	Succ.	Rate
Claude Opus 4.6	36	15	41.7%
Claude Opus 4.7	9	7	77.8%
Claude Mythos Preview	226	157	69.5%
Gemini 3.1 Pro	18	12	66.7%
GLM-5.1	11	4	36.4%
GPT-5.4	65	54	83.1%
GPT-5.5	210	120	57.1%

Table 3: Flag-to-success rate. Flag counts all instances where the agent captured the flag (any exploitation path); Succ. counts only those where the intended vulnerability was exploited. Rate = Succ. / Flag.

Different models find different exploits. Claude Mythos Preview and GPT-5.5 dominate in total count, but their success sets diverge considerably: 56 targets are solved exclusively by Claude Mythos Preview and 26 exclusively by GPT-5.5, with only 91 shared. The remaining models contribute another 61 successes, most overlapping with the top two, but four solved by them alone. This suggests the models rely on qualitatively different exploitation strategies, and that an ensemble approach could substantially expand coverage beyond what any one model achieves.

Venn diagram of successes across models — *Figure 2: Overlap of successfully exploited instances across models.*

More budget helps, but only for the best models. When we extended the budget from two to six hours, Claude Mythos Preview kept climbing from 127 to 204 successful exploits with no clear plateau. Claude Opus 4.6, by contrast, flatlined at around 15 within the first 30 minutes. This tells us the frontier models are capable of sustained, multi-stage reasoning that can crack harder problems given enough runway. It also means our two-hour budget likely undercounts what the strongest agents can do.

Cumulative successes vs. time budget — *Figure 3: Cumulative successful exploits as the per-task time budget grows to six hours.*

Example: From a 5-Line Crash to Full Code Execution in V8

To make this concrete, here’s one of the impressive trajectories we observed. GPT-5.4 was given a five-line PoV that triggers an assertion in Maglev, V8’s mid-tier JIT compiler, reported by ClusterFuzz in October 2025, after GPT-5.4’s knowledge cutoff. On the release build, the PoV just throws a benign TypeError with no visible memory corruption.

From there, the agent independently escalated through a full exploit chain: it identified that the bug depends on receiver shape, crafted an object that tricks Maglev into an out-of-bounds heap read, groomed the heap to leak stable pointers, forged fake V8 string objects to obtain arbitrary native memory reads, leaked libc addresses from the Global Offset Table, and built a signal-return-oriented-programming chain redirecting execution to system("/challenge/catflag"). Total time: 71 minutes, 229 lines of exploit code.

An important caveat: this worked because we disabled ASLR and the V8 heap sandbox. With those defenses re-enabled, GPT-5.4 could no longer succeed to exploit this specific vulnerability. Modern mitigations remain a meaningful barrier, but an AI agent independently chaining this many primitives on a complex real-world target is a milestone worth noting.

GPT-5.4 V8 exploit chain trajectory — *Figure 4: GPT-5.4's V8 exploit chain.*

Why This Matters

ExploitGym makes concrete what many in the security community have suspected: the gap between “AI can find bugs” and “AI can exploit bugs” is closing fast. This is consistent with findings from our broader analysis of Frontier AI’s Impact on the Cybersecurity Landscape. We frame this as an urgent motivation for two things. First, defenders need to start modeling AI agents as potential attackers when evaluating their security posture. Standard mitigations are still valuable, but they’re no longer sufficient on their own against an adversary that can reason, adapt, and retry at machine speed. Second, responsible AI development needs to account for these capabilities explicitly, through structured access programs, safety filters, and ongoing evaluation.

The benchmark itself is a contribution to both sides of that equation: it gives defenders a way to measure real risk, and it gives AI developers a way to track how their models’ capabilities are evolving in a domain where the stakes are unusually high.

The ExploitGym paper is authored by researchers from UC Berkeley, Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google. The benchmark design and experimental methodology were developed by the academic authors, with industry partners providing model access and feedback. We also thank the GLM team for providing API access.