OpenAI and Paradigm just released EVMbench, an open benchmark that tests how well AI agents can detect, patch, and exploit smart contract vulnerabilities. The results are striking: GPT-5.3-Codex successfully exploits over 72% of critical, fund-draining bugs in real Ethereum smart contracts. This is a massive leap from under a year ago, when top models could exploit only about 20% of the same vulnerabilities.

Why Smart Contract Security Matters
Smart contracts routinely secure over $100 billion in crypto assets. Unlike traditional software, deployed smart contracts are immutable. Once a vulnerability is exploited, the funds are gone. We have seen this repeatedly: the $600 million Poly Network hack, the $325 million Wormhole exploit, and countless smaller incidents that drain liquidity pools and user wallets.
The Ethereum network saw 1.7 million smart contracts deployed in November 2025 alone. Most of these contracts never receive a professional security audit. The audit firms that do exist cannot scale to meet demand, and their services remain expensive. The result is a widening gap between the code being deployed and the code that has been properly vetted.
EVMbench represents the first serious attempt to measure whether AI can help close this gap.
How EVMbench Works
The benchmark uses 120 real vulnerabilities from 40 audits, primarily sourced from Code4rena competitions and Paradigm's Tempo audit process. This grounding in real code is important. Synthetic benchmarks often miss the messy edge cases that appear in production contracts.
EVMbench tests AI agents across three distinct modes:
Detect: The agent audits a smart contract repository and must identify ground-truth vulnerabilities. Scoring is based on recall: the fraction of real bugs the agent catches.
Patch: The agent edits vulnerable code to fix identified issues. Tests verify that patches both remove the vulnerability and preserve the contract's intended functionality.
Exploit: The agent interacts with a local EVM instance via RPC, deploying contracts and executing transactions. Scoring is deterministic, based on whether the agent achieves specific on-chain state changes like draining funds or triggering exploit conditions.
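To make the two scoring styles concrete, here is a minimal sketch of how grading along these lines could work. The function names, bug identifiers, and balance model are illustrative assumptions, not EVMbench's actual harness.

```python
# Hypothetical sketch of EVMbench-style grading. All names here are
# illustrative; the real benchmark's internals may differ.

def detect_recall(ground_truth: set[str], reported: set[str]) -> float:
    """Detect mode: recall is the fraction of seeded bugs the agent flagged."""
    if not ground_truth:
        return 1.0
    return len(ground_truth & reported) / len(ground_truth)

def exploit_succeeded(state_before: dict, state_after: dict,
                      target: str, attacker: str) -> bool:
    """Exploit mode: a deterministic check on on-chain state, e.g. did
    funds leave the target contract and reach the attacker's address?"""
    drained = state_after[target] < state_before[target]
    enriched = state_after[attacker] > state_before[attacker]
    return drained and enriched

# Example run: the agent flags 2 of 3 seeded bugs, then drains the vault.
truth = {"reentrancy-withdraw", "unchecked-call", "oracle-staleness"}
found = {"reentrancy-withdraw", "unchecked-call"}
print(detect_recall(truth, found))  # 2 of 3 bugs caught

before = {"vault": 100_000, "attacker": 1}
after = {"vault": 0, "attacker": 100_001}
print(exploit_succeeded(before, after, "vault", "attacker"))
```

The key design point the sketch captures is why exploit scoring is deterministic: success is a yes/no predicate on final chain state, with no human judgment in the loop.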
Each task runs in containerized environments, ensuring reproducible results across different machines and research teams. The benchmark includes answer keys for every task, confirming that challenges are solvable and allowing future models to be compared against a consistent baseline.
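The RPC interface mentioned above is standard Ethereum JSON-RPC, so an agent's messages to the local node look something like the following. The address is a placeholder, and the sketch only builds the request body rather than sending it; in the benchmark the node would live inside the task container.

```python
import json

# Sketch of a JSON-RPC 2.0 request body an agent might send to a local
# EVM node (conventionally at http://localhost:8545). eth_getBalance is a
# standard Ethereum method; the address below is a placeholder.

def rpc_request(method: str, params: list, req_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request for an Ethereum node."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": params,
    })

# An exploit grader could read the target's balance before and after the
# agent's transactions and compare the two values.
body = rpc_request("eth_getBalance",
                   ["0x000000000000000000000000000000000000dEaD", "latest"])
print(body)
```

Because the grading condition is a simple state read like this, any machine running the same container observes the same result, which is what makes the scores reproducible.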
The Results Are Asymmetric
The performance gap between tasks reveals something interesting about current AI capabilities. GPT-5.3-Codex achieved a 72.2% success rate in exploit mode, compared to just 31.9% for GPT-5 (released six months earlier). That is a 2.3x improvement in half a year.
However, detection and patching tell a different story. According to the researchers, AI agents "sometimes failed to audit exhaustively or struggled to preserve full contract functionality" in these modes. Exploiting a vulnerability, it turns out, is easier than fixing it properly.
This asymmetry has practical implications. An AI agent that can exploit bugs more reliably than it can patch them tilts the near-term advantage toward offense. The technology may be more immediately useful for red team operations and bug bounty hunting than for automated remediation.
What This Means for the Industry
I see three immediate implications for practitioners working with blockchain security:
Audit workflows will change. Paradigm's Alpin Yukseloglu noted that future audits may be "conducted by autonomous agents rather than exclusively human reviewers." This does not mean human auditors become obsolete. Rather, AI agents will handle the initial sweep, flagging potential issues for human verification. Auditors become reviewers and final decision-makers rather than line-by-line code readers.
Bug bounties will accelerate. If an AI agent can exploit 72% of critical vulnerabilities, bug bounty hunters will increasingly use these tools to find issues before malicious actors do. The economics of vulnerability research shift when the cost of scanning drops dramatically.
Defense must evolve. Smart contract developers need to assume that sophisticated AI-powered scanning will be applied to their code. Security through obscurity becomes even less viable when AI agents can rapidly analyze and exploit complex vulnerability patterns.
Practical Takeaways
For teams deploying smart contracts, EVMbench suggests several actions:
- Run AI-assisted scans before deployment. Tools built on these capabilities are emerging. They will not replace formal verification or professional audits for high-value contracts, but they raise the baseline security level for everything else.
- Monitor the benchmark. EVMbench is open source and freely available on GitHub. As new models are released, their performance on this benchmark will indicate how quickly the security landscape is shifting.
- Assume adversarial use. If your contract has a vulnerability that falls into the 72% exploitable category, assume it will be found. Design your systems with this in mind: time-locks, governance controls, and circuit breakers become more important than ever.
Looking Forward
EVMbench establishes a clear measurement framework for tracking AI capabilities in smart contract security. The rapid improvement from 20% to 72% exploitation rates in under a year suggests we are still on a steep curve. By this time next year, the remaining 28% of vulnerabilities may also fall within AI reach.
The collaboration between OpenAI and Paradigm signals that major AI labs see blockchain security as a legitimate application domain. For those of us working at the intersection of AI and finance in the Gulf region, where blockchain adoption is accelerating, this is a development worth watching closely.
The benchmark is available at paradigm.xyz/evmbench, with full documentation and tooling for researchers who want to evaluate their own models or build on this foundation.