This competition, hosted by Berkeley RDI in conjunction with the Agentic AI MOOC and its global community of 32K+ registered learners, aims to advance the state of the art in agentic AI by first creating benchmarks, and then AI agents that top those benchmarks. It is open to the public and will be held both virtually and in person at UC Berkeley.

Compete for over $1M in prizes and resources. This two-phase competition challenges participants to first build novel benchmarks or enhance existing benchmarks for agentic AI (Phase 1), and then create AI agents to excel on them (Phase 2)—advancing the field by creating high-quality, broad-coverage, realistic agent evaluations as shared public goods.

The Need for Standardized, Reproducible Agent Evaluation

Whether you're building AI systems, integrating them into applications, or simply using AI products, a central question arises: how well does this AI system perform on the tasks we care about? The only reliable answer is through evaluation—testing performance on well-defined benchmarks. You can only improve what you can measure!

Yet, as AI evolves toward agentic systems—AI agents capable of reasoning, taking actions, and interacting with the world—the benchmarking methods designed for simple, model-level LLM evaluation fall short:

  • Interoperability: Running a production-grade agent on existing benchmarks often feels like forcing a square peg into a round hole. Substantial modifications are needed just to make it fit.
  • Reproducibility: Stateful tools, memory, and dynamic configurations lead to results that can vary across runs, undermining consistency.
  • Fragmentation: There’s no single, unified view of progress—leaderboards and results are scattered across platforms and domains.
  • Discovery: With new benchmarks appearing almost weekly, finding the right one for a given goal can be surprisingly time-consuming.

Our vision for streamlined agentic AI evaluation is a unified space where the community can come together to define the goalposts of agentic AI—through benchmarks that are:

  • Compatible and Standardized: Any agent can connect to any benchmark with near-zero code changes.
  • Reproducible: Each run starts in the same state as any other.
  • Collaborative & Discoverable: A living hub where researchers, developers, and enthusiasts alike can easily find the most relevant benchmarks, identify top-performing agents, and collaboratively shape the standards that define the future of agentic AI.

Building Public Good Through Competition

Through the AgentX–AgentBeats competition, we aim to bring the community together to create high-quality, broad-coverage, realistic agent evaluations—developed in an agentified, standardized, reproducible, and collaborative way—as shared public goods for advancing agentic AI.

Sponsors

AgentX AgentBeats competition sponsors

Resources

Lambda

$400 in cloud credits to every individual or team

Nebius

$50 in inference credits to every individual or team

More to be announced

Additional resources will be announced soon.

Prizes

DeepMind

Up to $50k prize pool in GCP/Gemini credits to be shared among the winning teams.

Lambda

$750 in cloud credits for each winning team.

Nebius

Up to $50k prize pool in inference credits to be shared among the winning teams.

Amazon

Up to $10k prize pool in AWS credits to be shared among the winning teams.

Snowflake

Each winning team member who is currently a student will receive:

  • Free access to Snowflake software for 6 months
  • 60 Snowflake credits (worth $240 at $4 per credit)

More to be announced

Additional prize partners will be announced soon.

Introducing AgentBeats

To realize this vision, we are introducing AgentBeats, an open-source platform and a new paradigm for evaluating AI agents. Rather than asking you to adapt your agent to fit a rigid benchmark, AgentBeats flips the model on its head: we turn the benchmark itself into an agent, i.e., we agentify the benchmark.

A 🟢 green (or evaluator) agent provides a specific agent evaluation benchmark, including the environment, a set of tasks, and the evaluator. Think of it as the proctor, the judge, and the environment manager all rolled into one. When you build a green agent, you are not just defining a set of tasks; you are creating a fully automated evaluation system.

A 🟣 purple (or competing) agent is the agent under test, such as a coding assistant, a research agent, or a personal planner agent. The purple agent interacts with the green agent to demonstrate its abilities and get evaluated.
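To make the green/purple split concrete, here is a minimal, framework-free sketch of the interaction loop. The class names (GreenAgent, Task, EchoPurpleAgent), the act method, and the exact-match scorer are illustrative assumptions for this sketch, not the AgentBeats API.

```python
# Illustrative sketch only: names and interfaces here are hypothetical,
# not the AgentBeats API.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: str  # reference answer used by the green agent's evaluator


class GreenAgent:
    """Plays proctor, judge, and environment manager for one benchmark."""

    def __init__(self, tasks: list[Task]):
        self.tasks = tasks

    def evaluate(self, purple_agent) -> float:
        """Run every task against the purple agent and return an accuracy score."""
        correct = 0
        for task in self.tasks:
            answer = purple_agent.act(task.prompt)   # hand the task to the agent under test
            if answer.strip() == task.expected:      # toy evaluator: exact string match
                correct += 1
        return correct / len(self.tasks)


class EchoPurpleAgent:
    """A stand-in for the agent under test (coding assistant, planner, etc.)."""

    def act(self, prompt: str) -> str:
        return "42" if "answer" in prompt else "unknown"


if __name__ == "__main__":
    benchmark = GreenAgent([Task("What is the answer?", "42")])
    print("score:", benchmark.evaluate(EchoPurpleAgent()))  # -> score: 1.0
```

In a real green agent the environment and evaluator would be far richer, but the shape stays the same: the green agent drives the loop, manages the environment, and owns the score.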

AgentBeats addresses the big problems in agentic AI evaluation by:

  • Enabling Interoperability: With the A2A protocol as the universal interface, you build your A2A-compatible purple agent once, and it can be tested by any green agent on the platform, and vice versa (see the sketch after this list).
  • Ensuring Reproducibility: The green agent, orchestrated by the platform, controls the entire testing lifecycle. Before each evaluation run, the platform ensures the purple agent is reset to a clean state.
  • Creating a Unified & Discoverable Hub: AgentBeats isn't just a protocol; it's an open platform where green agents (benchmarks) and their leaderboards are available to the entire community.
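As a rough picture of what "build once, be tested by any green agent" could look like, the sketch below exposes a purple agent behind a single JSON-over-HTTP endpoint with a reset hook. The paths, payload shapes, and handler names are hypothetical; they do not reflect the actual A2A wire format or AgentBeats' orchestration hooks.

```python
# Illustrative only: a purple agent behind one JSON-over-HTTP endpoint,
# standing in for an A2A-compatible interface. Paths and payloads are assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_task(prompt: str) -> str:
    """Placeholder policy for the agent under test."""
    return f"echo: {prompt}"


class PurpleAgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the green agent's request body, e.g. {"prompt": "..."}
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")

        if self.path == "/reset":
            # A clean-state hook the platform could call before each evaluation run.
            body = {"status": "reset"}
        else:
            body = {"response": handle_task(request.get("prompt", ""))}

        payload = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Any green agent that speaks this toy contract can now exercise the agent.
    HTTPServer(("localhost", 8080), PurpleAgentHandler).serve_forever()
```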

Read the blog series to learn more.

AgentX-AgentBeats Competition

  • 🟢 Phase 1 · Green

    Oct 16 to Dec 20, 2025

    Participants build green agents that define assessments and automate scoring. Pick your evaluation track:

    1. Choose a contribution type
      • Port (agentify) and extend an existing benchmark — Transform a benchmark into a green agent that runs end-to-end on AgentBeats (see benchmark ideas).
      • Create a new benchmark — Design a brand-new assessment as a green agent with novel tasks, automation, and scoring.
      • Custom track — See the Custom Tracks below for more details.
    2. For existing or new benchmarks, choose an agent type
      Coding Agent, Web Agent, Computer Use Agent, Research Agent, Software Testing Agent, Game Agent, DeFi Agent, Cybersecurity Agent, Healthcare Agent, Finance Agent, Legal Domain Agent, Agent Safety, Multi-agent Evaluation, or Other Agent
    3. Sign up, form a team, and start building!
  • 🟣 Phase 2 · Purple

    Jan 12 to Feb 23, 2026

    Participants build purple agents to tackle the top green agents selected from Phase 1 and compete on the public leaderboards.

Custom Tracks

[λ] Lambda

Agent Security

A red-teaming and automated security testing challenge.

More details to be announced...

More custom tracks to be announced...

Key Dates

Date          Event
Oct 16, 2025  Participant registration opens
Oct 24, 2025  Team signup & Phase 1 build begins
Dec 19, 2025  Green agent submission
Dec 20, 2025  Green agent judging
Jan 12, 2026  Phase 2 begins: build purple agents
Feb 22, 2026  Purple agent submission
Feb 23, 2026  Purple agent judging