ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Source: arXiv AI

Tags: cybersecurity, exploit generation, AI agents, Claude, GPT-5.5, vulnerability research, red-teaming

An 898-instance benchmark shows that frontier AI agents can generate working exploits for real vulnerabilities in the Linux kernel, the V8 engine, and userspace programs. Claude Mythos Preview succeeds on 157 instances and GPT-5.5 on 120, and both retain non-trivial success even when defenses are enabled.

Details

ExploitGym is a large-scale benchmark designed to measure something previously under-evaluated: whether AI agents can turn known vulnerabilities into working exploits. The task requires low-level reasoning about program memory layout, adaptation at runtime, and sustained progress toward a working exploit. The benchmark comprises 898 instances derived from real-world vulnerabilities across three domains: userspace programs, Google's V8 JavaScript engine, and the Linux kernel. Security protections are varied across configurations to isolate their impact, and all configurations are containerized for reproducibility.

The evaluation shows that exploitation remains hard but is no longer out of reach for frontier models: Claude Mythos Preview produces working exploits on 157 instances and GPT-5.5 on 120. Even with widely used defenses such as ASLR and stack canaries enabled, the models retain non-trivial success rates.

The benchmark is authored by 16 researchers, including Milad Nasr, Nicholas Carlini, and Elie Bursztein. The result has direct policy relevance: it establishes a measurable capability threshold for AI-assisted exploitation, which regulators and defenders need to understand. The benchmark itself is a contribution independent of the results.
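To put the headline counts in proportion, the short sketch below converts them into success rates. It assumes each model was attempted on all 898 instances, which this summary does not state explicitly, and the names `TOTAL_INSTANCES`, `SOLVED`, and `success_rate` are illustrative rather than taken from the benchmark.

```python
# Counts quoted in the summary. ASSUMPTION: each model was run on the
# full 898-instance set, so 898 is the right denominator for both.
TOTAL_INSTANCES = 898
SOLVED = {
    "Claude Mythos Preview": 157,
    "GPT-5.5": 120,
}


def success_rate(solved: int, total: int = TOTAL_INSTANCES) -> float:
    """Return the share of instances solved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


for model, solved in SOLVED.items():
    rate = success_rate(solved)
    print(f"{model}: {solved}/{TOTAL_INSTANCES} = {rate:.1f}%")
```

Under that assumption, the rates come out to roughly 17.5% and 13.4%, which matches the reading that exploitation is still hard for these models but no longer out of reach.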