This competition, hosted by Berkeley RDI in conjunction with the Agentic AI MOOC and its global community of 32K+ registered learners, aims to advance the state of the art in agentic AI by first creating benchmarks, and then AI agents that top those benchmarks. It is open to the public and will be held both virtually and in person at UC Berkeley.

Compete for over $1M in prizes and resources. This two-phase competition challenges participants to first build novel benchmarks or enhance existing benchmarks for agentic AI (Phase 1), and then create AI agents to excel on them (Phase 2)—advancing the field by creating high-quality, broad-coverage, realistic agent evaluations as shared public goods.

The Need for Standardized, Reproducible Agent Evaluation

Whether you're building AI systems, integrating them into applications, or simply using AI products, a central question arises: how well does this AI system perform on the tasks we care about? The only reliable answer is through evaluation—testing performance on well-defined benchmarks. You can only improve what you can measure!

Yet, as AI evolves toward agentic systems—AI agents capable of reasoning, taking actions, and interacting with the world—the benchmarking methods designed for simple, model-level LLM evaluation fall short:

  • Interoperability: Running a production-grade agent on existing benchmarks often feels like forcing a square peg into a round hole. Substantial modifications are needed just to make it fit.
  • Reproducibility: Stateful tools, memory, and dynamic configurations lead to results that can vary across runs, undermining consistency.
  • Fragmentation: There’s no single, unified view of progress—leaderboards and results are scattered across platforms and domains.
  • Discovery: With new benchmarks appearing almost weekly, finding the right one for a given goal can be surprisingly time-consuming.

Our vision for streamlined agentic AI evaluation is a unified space where the community can come together to define the goalposts of agentic AI—through benchmarks that are:

  • Compatible and Standardized: Any agent can connect to any benchmark with near-zero code changes.
  • Reproducible: Each run starts in the same state as any other.
  • Collaborative & Discoverable: A living hub where researchers, developers, and enthusiasts alike can easily find the most relevant benchmarks, identify top-performing agents, and collaboratively shape the standards that define the future of agentic AI.

Building Public Good Through Competition

Through the AgentX–AgentBeats competition, we aim to bring the community together to create high-quality, broad-coverage, realistic agent evaluations—developed in an agentified, standardized, reproducible, and collaborative way—as shared public goods for advancing agentic AI.

Sponsors

AgentX AgentBeats competition sponsors

Resources

Lambda

$400 in cloud credits to every individual or team

Nebius

$50 in inference credits to every individual or team

More to be announced

Additional resources will be announced soon.

Prizes

DeepMind

Up to $50k prize pool in GCP/Gemini credits to be shared among the winning teams.

Lambda

$750 in cloud credits for each winning team.

Nebius

Up to $50k prize pool in inference credits to be shared among the winning teams.

Amazon

Up to $10k prize pool in AWS credits to be shared among the winning teams.

Snowflake

Each winning team member who is currently a student will receive:

  • Free access to Snowflake software for 6 months
  • 60 Snowflake credits (worth $240 at $4 per credit)

More to be announced

Additional prize partners will be announced soon.

Introducing AgentBeats

To realize this vision, we are introducing AgentBeats, an open-source platform and a new paradigm for evaluating AI agents. Rather than asking you to adapt your agent to fit a rigid benchmark, AgentBeats flips the model on its head: we turn the benchmark itself into an agent, i.e., we agentify the benchmark.

A 🟢 green (or evaluator) agent provides a specific agent evaluation benchmark, including the environment, a set of tasks, and the evaluator. Think of it as the proctor, the judge, and the environment manager all rolled into one. When you build a green agent, you are not just defining a set of tasks; you are creating a fully automated evaluation system.

A 🟣 purple (or competing) agent is the agent under test, such as a coding assistant, a research agent, or a personal planner agent. The purple agent interacts with the green agent to demonstrate its abilities and get evaluated.
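To make the green/purple split concrete, here is a minimal, framework-free sketch of the interaction loop. The class names (GreenAgent, Task, EchoPurpleAgent), the act method, and the exact-match scorer are illustrative assumptions for this sketch, not the AgentBeats API.

```python
# Illustrative sketch only: names and interfaces here are hypothetical,
# not the AgentBeats API.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: str  # reference answer used by the green agent's evaluator


class GreenAgent:
    """Plays proctor, judge, and environment manager for one benchmark."""

    def __init__(self, tasks: list[Task]):
        self.tasks = tasks

    def evaluate(self, purple_agent) -> float:
        """Run every task against the purple agent and return an accuracy score."""
        correct = 0
        for task in self.tasks:
            answer = purple_agent.act(task.prompt)   # hand the task to the agent under test
            if answer.strip() == task.expected:      # toy evaluator: exact string match
                correct += 1
        return correct / len(self.tasks)


class EchoPurpleAgent:
    """A stand-in for the agent under test (coding assistant, planner, etc.)."""

    def act(self, prompt: str) -> str:
        return "42" if "answer" in prompt else "unknown"


if __name__ == "__main__":
    benchmark = GreenAgent([Task("What is the answer?", "42")])
    print("score:", benchmark.evaluate(EchoPurpleAgent()))  # -> score: 1.0
```

In a real green agent the environment and evaluator would be far richer, but the shape stays the same: the green agent drives the loop, manages the environment, and owns the score.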

AgentBeats addresses the big problems in agentic AI evaluation by:

  • Enabling Interoperability: With the A2A protocol as the universal interface, you build your A2A-compatible purple agent once, and it can be tested by any green agent on the platform, and vice versa (see the sketch after this list).
  • Ensuring Reproducibility: The green agent, orchestrated by the platform, controls the entire testing lifecycle. Before each evaluation run, the platform ensures the purple agent is reset to a clean state.
  • Creating a Unified & Discoverable Hub: AgentBeats isn't just a protocol; it's an open platform where green agents (benchmarks) and their leaderboards are available to the entire community.
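As a rough picture of what "build once, be tested by any green agent" could look like, the sketch below exposes a purple agent behind a single JSON-over-HTTP endpoint with a reset hook. The paths, payload shapes, and handler names are hypothetical; they do not reflect the actual A2A wire format or AgentBeats' orchestration hooks.

```python
# Illustrative only: a purple agent behind one JSON-over-HTTP endpoint,
# standing in for an A2A-compatible interface. Paths and payloads are assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_task(prompt: str) -> str:
    """Placeholder policy for the agent under test."""
    return f"echo: {prompt}"


class PurpleAgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the green agent's request body, e.g. {"prompt": "..."}
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")

        if self.path == "/reset":
            # A clean-state hook the platform could call before each evaluation run.
            body = {"status": "reset"}
        else:
            body = {"response": handle_task(request.get("prompt", ""))}

        payload = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Any green agent that speaks this toy contract can now exercise the agent.
    HTTPServer(("localhost", 8080), PurpleAgentHandler).serve_forever()
```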

Read the blog series to learn more.

AgentX-AgentBeats Competition

  • 🟢 Phase 1 · Green

    Oct 16 to Dec 20, 2025

    Participants build green agents that define assessments and automate scoring. Pick your evaluation track:

    1. Choose a contribution type
      • Port (agentify) and extend an existing benchmark — Transform a benchmark into a green agent that runs end-to-end on AgentBeats (see benchmark ideas).
      • Create a new benchmark — Design a brand-new assessment as a green agent with novel tasks, automation, and scoring.
      • Custom track — See the Custom Tracks below for more details.
    2. For existing or new benchmarks, choose an agent type
      Coding Agent, Web Agent, Computer Use Agent, Research Agent, Software Testing Agent, Game Agent, DeFi Agent, Cybersecurity Agent, Healthcare Agent, Finance Agent, Legal Domain Agent, Agent Safety, Multi-agent Evaluation, or Other Agent
    3. Sign up, form a team, and start building!
  • 🟣 Phase 2 · Purple

    Jan 12 to Feb 23, 2026

    Participants build purple agents to tackle the top green agents selected from Phase 1 and compete on the public leaderboards.

Custom Tracks

[λ] Lambda

Agent Security

A red-teaming and automated security testing challenge.

More details to be announced...

More custom tracks to be announced...

Key Dates

Date          Event
Oct 16, 2025  Participant registration opens
Oct 24, 2025  Team signup & Phase 1 build begins
Dec 19, 2025  Green agent submission
Dec 20, 2025  Green agent judging
Jan 12, 2026  Phase 2 begins: build purple agents
Feb 22, 2026  Purple agent submission
Feb 23, 2026  Purple agent judging