Recent research from Anthropic-affiliated investigators provides one of the clearest quantitative signals yet that autonomous AI agents have crossed an important threshold in offensive security capability. Using a purpose-built benchmark focused on smart contract exploitation, the study measures success not by abstract accuracy metrics, but by simulated financial loss. The results indicate that current frontier models can independently identify vulnerabilities, construct working exploit chains, and extract value with minimal human oversight, all at declining operational cost.
Why Smart Contracts Provide a Measurable Testbed
Smart contracts represent a rare security domain where exploitation impact can be directly priced. The execution environment is deterministic, source code is publicly available, and financial state is embedded directly in program logic. Once a flaw is triggered, loss occurs immediately and can be quantified precisely using on-chain balances and historical exchange rates.
The researchers leveraged these properties to build SCONE-bench, a benchmark consisting of 405 real-world smart contracts that were exploited between 2020 and 2025 across Ethereum, Binance Smart Chain, and Base. Each contract was executed in a locally forked blockchain environment pinned to the block height immediately prior to the historical exploit. This setup allows reproducible execution of exploit code without touching live networks or real assets.
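The pinning step can be sketched as follows. This is a minimal illustration that assumes Foundry's anvil as the local fork tool, which the article does not name; the RPC endpoint and block height are placeholders, and fork_command is a hypothetical helper.

```python
from typing import List


def fork_command(rpc_url: str, exploit_block: int, port: int = 8545) -> List[str]:
    """Build an anvil command that forks the chain one block before the exploit."""
    if exploit_block < 1:
        raise ValueError("exploit_block must be at least 1")
    # State immediately prior to the block containing the historical exploit.
    fork_block = exploit_block - 1
    return [
        "anvil",
        "--fork-url", rpc_url,
        "--fork-block-number", str(fork_block),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Hypothetical endpoint and block height, for illustration only.
    print(" ".join(fork_command("https://rpc.example.invalid", 18_000_000)))
```

Because the fork runs locally, any transaction sent to it changes only the copy of chain state held by the local node.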
Evaluation Methodology and Agent Capabilities
Each agent was tasked with full exploit development rather than vulnerability identification alone. Given contract source code, metadata, and access to a sandboxed toolchain, the agent had to reason through contract state, identify attack primitives, construct exploit scripts, and execute them in a way that produced a measurable increase in the attacker’s native token balance.
The tooling environment exposed to the agents resembled a real attacker workflow. Agents could compile Solidity contracts, issue transactions, inspect storage, trace execution paths, and route token swaps through decentralized exchanges. A minimum profit threshold was enforced to prevent trivial arbitrage or dust-level manipulation from counting as success.
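The success criterion described above amounts to a balance comparison with a floor. A minimal sketch, assuming balances are tracked in wei; the threshold value here is a hypothetical placeholder, since the article does not state the figure the researchers used.

```python
WEI_PER_ETHER = 10**18

# Hypothetical threshold (0.1 of the native token), for illustration only.
MIN_PROFIT_WEI = WEI_PER_ETHER // 10


def exploit_succeeded(balance_before_wei: int, balance_after_wei: int,
                      min_profit_wei: int = MIN_PROFIT_WEI) -> bool:
    """Return True when the attacker's native balance grew by at least the threshold."""
    profit = balance_after_wei - balance_before_wei
    return profit >= min_profit_wei
```

Integer arithmetic in the smallest unit avoids rounding errors when the gain sits close to the threshold.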
Aggregate Results Across Known Vulnerabilities
Across all 405 benchmark contracts, ten frontier models collectively generated working exploits for just over half of the dataset. In aggregate, these runs produced approximately $550 million in simulated stolen funds under Best@8 evaluation. This figure represents the theoretical maximum extraction from already-known vulnerable contracts rather than realistic attacker profit, yet it demonstrates the upper bound of what autonomous agents can achieve once a vulnerability exists and is discoverable.
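Assuming Best@8 means that each contract is credited with the most profitable of up to eight attempts, the aggregation can be sketched as below. The function names and the input layout (a mapping from contract identifier to per-run revenues, with failed runs recorded as 0.0) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def best_at_k(runs_by_contract: Dict[str, List[float]], k: int = 8) -> Dict[str, float]:
    """Credit each contract with its best revenue among the first k runs."""
    return {
        contract: max(revenues[:k], default=0.0)
        for contract, revenues in runs_by_contract.items()
    }


def summarize(best: Dict[str, float]) -> Tuple[int, float]:
    """Return (number of contracts exploited, total simulated revenue)."""
    exploited = sum(1 for revenue in best.values() if revenue > 0)
    return exploited, sum(best.values())
```

Taking the maximum per contract is what makes the total an upper bound: it reports the best outcome observed, not the typical one.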
More informative is the post-knowledge-cutoff analysis. When restricted to contracts exploited after each model’s training cutoff, Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 produced exploits worth $4.6 million in total simulated losses. Opus 4.5 alone successfully exploited the majority of these recent targets, extracting roughly $3.7 million in value. The $4.6 million total establishes a conservative lower bound on the economic harm that autonomous agents could plausibly have inflicted during 2025.
Zero-Day Discovery in Recently Deployed Contracts
To move beyond retrospective evaluation, the researchers ran Sonnet 4.5 and GPT-5 against 2,849 recently deployed contracts with no publicly documented vulnerabilities. These contracts were filtered to ensure meaningful liquidity, verified source code, and active trading history.
Both agents independently uncovered two previously unknown vulnerabilities and produced functional exploit code in simulation. The combined simulated revenue was modest at $3,694, though the more significant data point is cost efficiency. GPT-5 completed the entire scan for approximately $3,476 in API cost, an average of about $1.22 per contract evaluated.
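The cost figures follow directly from the numbers reported above. A short calculation, using only those values:

```python
contracts_scanned = 2_849
api_cost_usd = 3_476
vulnerabilities_found = 2

cost_per_contract = api_cost_usd / contracts_scanned
cost_per_finding = api_cost_usd / vulnerabilities_found

print(f"cost per contract: ${cost_per_contract:.2f}")             # $1.22
print(f"cost per vulnerability found: ${cost_per_finding:,.0f}")  # $1,738
```

The per-finding figure is the more useful one for defenders and attackers alike, since most scanned contracts yield nothing.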
The vulnerabilities themselves fell squarely into well-known failure classes. One flaw involved a publicly accessible function intended for read-only reward calculation that lacked a state-mutability restriction (such as Solidity's view modifier), allowing attackers to mutate internal accounting and mint value. Another arose from missing validation in fee withdrawal logic, enabling arbitrary redirection of accumulated fees. These errors mirror access control and state mutation flaws that appear routinely in conventional application security reviews.
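The fee withdrawal flaw can be illustrated with a toy model. The sketch below is plain Python rather than Solidity, and every name in it (FeeVault, withdraw_fees_unchecked, withdraw_fees_checked) is hypothetical; it models only the missing caller check, not the actual contract.

```python
from typing import Dict


class FeeVault:
    """Toy stand-in for a contract that accumulates fees for its owner."""

    def __init__(self, owner: str, accumulated_fees: int) -> None:
        self.owner = owner
        self.accumulated_fees = accumulated_fees
        self.balances: Dict[str, int] = {}

    def _pay_out(self, recipient: str) -> int:
        # Move all accumulated fees to the recipient and reset the counter.
        amount = self.accumulated_fees
        self.accumulated_fees = 0
        self.balances[recipient] = self.balances.get(recipient, 0) + amount
        return amount

    def withdraw_fees_unchecked(self, caller: str, recipient: str) -> int:
        # Vulnerable: nothing ties the caller or the recipient to the owner.
        return self._pay_out(recipient)

    def withdraw_fees_checked(self, caller: str, recipient: str) -> int:
        # Fixed: only the owner may decide where fees are sent.
        if caller != self.owner:
            raise PermissionError("only the owner can withdraw fees")
        return self._pay_out(recipient)
```

In the unchecked version any caller can name itself as recipient and drain the fees; the checked version rejects the same call before any state changes.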
Revenue, Not Complexity, Drives Impact
A key analytical finding is that exploit profitability showed little correlation with code complexity, cyclomatic complexity, or deployment-to-exploit time. Contracts with minimal logic but high asset concentration produced catastrophic losses, while complex systems with limited liquidity yielded negligible returns. The determining factor was asset exposure at the time of exploitation rather than the technical sophistication of the flaw.
This aligns closely with patterns observed in enterprise breaches. The severity of an incident is rarely dictated by exploit novelty; it is dictated by privilege scope, trust boundaries, and what systems or data sit behind the vulnerable component.
Cost Compression and Capability Growth
Token consumption required to generate a successful exploit dropped sharply across successive model generations. Median token usage declined by more than seventy percent across four Claude releases, indicating that exploit development is becoming both faster and cheaper. Over the same period, simulated exploit revenue on recent contracts roughly doubled every six weeks.
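To make the two rates concrete, the arithmetic behind "doubling every six weeks" and "a seventy percent decline" can be written out. This is a restatement of the reported trend under the assumption of a steady rate, not a forecast; both function names are hypothetical.

```python
def growth_factor(elapsed_weeks: float, doubling_weeks: float = 6.0) -> float:
    """Multiplier implied by steady doubling every `doubling_weeks` weeks."""
    return 2.0 ** (elapsed_weeks / doubling_weeks)


def remaining_fraction(percent_decline: float) -> float:
    """Fraction of the original token usage left after a given percent decline."""
    return 1.0 - percent_decline / 100.0
```

For example, growth_factor(12) gives 4.0, meaning revenue quadruples over twelve weeks at that pace, and remaining_fraction(70) gives roughly 0.3, meaning an exploit that once took 1,000,000 tokens would take about 300,000.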
These trends suggest a tightening feedback loop. As agents improve at long-horizon reasoning, tool orchestration, and error recovery, they require fewer attempts to converge on viable exploit paths. Lower per-run cost makes exhaustive scanning economically viable even against large populations of contracts or services.
Broader Security Implications
Smart contracts offer clean measurement, yet the underlying techniques transfer directly to traditional software. Control-flow reasoning, boundary condition analysis, iterative probing, and automated payload construction apply equally to APIs, internal services, legacy middleware, and integration glue code. Public blockchains may face this pressure first, though proprietary systems are unlikely to remain insulated as agentic reverse engineering improves.
The defensive implication is straightforward. Security programs that rely on periodic reviews, manual audits, or post-deployment detection will struggle to keep pace with automated adversaries. The same class of agents demonstrated in this research can be repurposed for adversarial testing, pre-deployment analysis, and continuous validation of production code paths.