Program Task Description
Category 1: Create your own AI agent benchmark on a novel task
- Is there a task you’re particularly interested in that you think agents might tackle in the future? (See the minimal harness sketch after the examples below.)
- Real-World Examples of AI Agents
- Regulatory compliance: norm.ai
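A concrete way to scope a Category 1 project is to sketch the harness before collecting data: what a task record contains, how the agent is queried, and how answers are graded. The snippet below is a minimal sketch of that structure; the `Task` fields, the `agent` callable, and the exact-match grader are illustrative assumptions rather than a prescribed design.

```python
# Minimal sketch of a novel-task agent benchmark: a task record, an agent
# interface, and a grader. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instruction: str        # what the agent is asked to do
    reference_answer: str   # ground truth used for grading

def grade(task: Task, answer: str) -> bool:
    """Simplest possible grader: normalized exact match against the reference."""
    return answer.strip().lower() == task.reference_answer.strip().lower()

def run_benchmark(tasks: list[Task], agent: Callable[[str], str]) -> float:
    """Score any instruction -> answer callable and return its accuracy."""
    correct = sum(grade(t, agent(t.instruction)) for t in tasks)
    return correct / len(tasks)
```

Real agent tasks (for example, compliance checks in the norm.ai vein) usually need a richer grader than exact match, but fixing this interface early makes clear what data you actually have to collect.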
Category 2: Build upon current AI agent benchmarks
- Can you create extensions to current popular AI agent benchmarks?
- Make a benchmark multi-modal or multi-agent, or extend it to evaluate a different problem space (see the schema sketch after the reference list below)
- [2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- [2307.13854] WebArena: A Realistic Web Environment for Building Autonomous Agents
- [2311.12983] GAIA: a benchmark for General AI Assistants
- [2304.03279] Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
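For the multi-modal extension idea above, one low-friction starting point is to widen the task schema so each instance can carry a non-text observation alongside its text goal, then adapt the agent interface to accept both. The layout below is a hypothetical sketch and does not correspond to the actual SWE-bench, WebArena, or GAIA formats.

```python
# Hypothetical sketch of extending a text-only task record so each instance
# can also carry an image observation (e.g. a page screenshot). The field
# names do not match any existing benchmark's schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TextTask:
    task_id: str
    goal: str                                # original text-only instruction

@dataclass
class MultiModalTask(TextTask):
    screenshot_path: Optional[str] = None    # image observation, if present

def to_multimodal(task: TextTask, screenshot_path: str) -> MultiModalTask:
    """Wrap an existing text task with an image observation."""
    return MultiModalTask(task_id=task.task_id, goal=task.goal,
                          screenshot_path=screenshot_path)
```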
- Can you improve existing benchmarks, either by filtering the data, making them easier to evaluate, or making them more robust?
- Develop a more performant Docker environment from which to launch existing agent benchmarks (see the container launch sketch after the reference list below)
- Improve existing benchmarks so they differentiate agent progress more finely
- [2405.10938] Observational Scaling Laws and the Predictability of Language Model Performance
- [2407.21792] Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
- [2304.04370] OpenAGI: When LLM Meets Domain Experts
- [2308.03688] AgentBench: Evaluating LLMs as Agents
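For the Docker idea above, one option is to drive the container lifecycle from Python rather than shell scripts, so each benchmark run gets an isolated, resource-capped container that can be created and cleaned up programmatically. The sketch below uses the `docker` Python SDK; the image name and benchmark command are placeholders, not an official image or entry point for any of the benchmarks listed above.

```python
# Sketch of launching a benchmark run inside a container with the docker SDK
# (pip install docker). Image name and command are placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    image="agent-benchmark-env:latest",           # placeholder image name
    command="python -m evaluation --split dev",   # placeholder entry point
    detach=True,
    mem_limit="8g",    # cap memory so a runaway agent cannot exhaust the host
)

# Stream harness logs so progress and failures are visible from the host.
for line in container.logs(stream=True, follow=True):
    print(line.decode(errors="replace").rstrip())

exit_code = container.wait()["StatusCode"]        # block until the run finishes
print(f"benchmark container exited with code {exit_code}")
```

The same pattern extends to one container per task, which is where most of the performance work (image layering, caching, parallel launches) would likely go.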