Open Source

AgentBench

The open-source benchmark for evaluating AI agents. Compare agents across tool-use, reasoning, code generation, research, and multi-step tasks.

25 Benchmark Tasks · 5 Categories · 3 Scoring Modes · 100% Open Source

Leaderboard

Rank  Agent                 Framework / Model           Overall  Tool Use  Reasoning  Code Gen  Research  Multi-Step  Avg Latency
🥇 1  GPT-4o Agent          OpenAI Assistants / gpt-4o     87       92        88         90        85         80         3.2s
🥈 2  Claude Sonnet Agent   Custom / claude-sonnet-4-6     85       88        90         86        82         79         2.8s
🥉 3  LangChain ReAct       LangChain / gpt-4o-mini        72       80        70         75        68         67         4.5s
#4    CrewAI Research Team  CrewAI / gpt-4o                69       65        62         60        88         70         8.2s
#5    AutoGPT Classic       AutoGPT / gpt-4o-mini          58       70        50         55        60         55        12.0s
#6    Ollama Local Agent    Custom / qwen2.5:32b           52       60        55         48        50         47         6.5s

Scores are based on 25 standardized tasks across 5 categories. Higher is better. Submit your own benchmarks via the CLI.

How It Works

1. Implement the Adapter

A single async function: (task: string) => Promise<string>. Wrapper adapters are provided for LangChain, CrewAI, and OpenAI.
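As a minimal sketch of that contract, an adapter file might look like the following. The AgentBenchAdapter type name and the toy arithmetic logic are illustrative assumptions, not part of the documented API; a real adapter would call into your agent framework or model client instead.

```typescript
// The documented adapter contract: one async function from task prompt to answer.
type AgentBenchAdapter = (task: string) => Promise<string>;

// Toy adapter that answers simple "what is A + B" tasks locally.
// Replace the body with a call into your real agent.
const myAgent: AgentBenchAdapter = async (task) => {
  const m = task.match(/what is (\d+)\s*\+\s*(\d+)/i);
  if (m) return String(Number(m[1]) + Number(m[2]));
  return "TODO: call the real agent here";
};

export default myAgent;
```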

2. Run the Benchmark

npx agentbench run -a ./my-agent.ts runs all tasks in parallel, with timeout enforcement and automatic scoring.
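Conceptually, a runner like this races each adapter call against a deadline and collects all outcomes. The sketch below illustrates that pattern; the function names are not AgentBench internals.

```typescript
type Adapter = (task: string) => Promise<string>;

// Race one adapter call against a timeout; a task that exceeds the
// deadline rejects instead of hanging the whole run.
async function runWithTimeout(adapter: Adapter, task: string, ms: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout>;
  const deadline = new Promise<string>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([adapter(task), deadline]);
  } finally {
    clearTimeout(timer!);
  }
}

// Run every task concurrently; failures (including timeouts) are
// captured as rejected results rather than aborting the batch.
async function runAll(adapter: Adapter, tasks: string[], ms: number) {
  return Promise.allSettled(tasks.map((t) => runWithTimeout(adapter, t, ms)));
}
```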

3. Compare & Submit

Get a detailed HTML report with charts. Compare multiple agents side-by-side. Submit your results to appear on this leaderboard.

Get Started in 5 Minutes

1. Install

npm install -g @agentbench/cli

2. Scaffold your adapter

agentbench init my-agent

3. Run the benchmark

agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o

4. View results

# Open the generated HTML report
# Compare two agents:
agentbench compare results-a.json results-b.json
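A comparison boils down to a per-category score diff between two result files. The results-*.json schema isn't documented on this page, so the AgentResult shape below is an assumption used only to illustrate the idea:

```typescript
// Assumed minimal result shape: agent name plus category -> score (0..100).
interface AgentResult {
  name: string;
  scores: Record<string, number>;
}

// Produce one "category: A vs B (delta)" line per category in the first result.
function compareAgents(a: AgentResult, b: AgentResult): string[] {
  return Object.keys(a.scores).map((cat) => {
    const delta = (b.scores[cat] ?? 0) - a.scores[cat];
    const sign = delta > 0 ? "+" : "";
    return `${cat}: ${a.scores[cat]} vs ${b.scores[cat] ?? 0} (${sign}${delta})`;
  });
}
```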

Task Categories

🔧 Tool Use (4 tasks): calculator, JSON parsing, pattern extraction, unit conversion

🧠 Reasoning (5 tasks): logic puzzles, sequences, syllogisms, analogies, counterfactuals

💻 Code Gen (5 tasks): FizzBuzz, palindromes, API fetch, SQL queries, algorithms

🔍 Research (5 tasks): summarization, fact extraction, comparison, definitions

📋 Multi-Step (6 tasks): data pipelines, planning, text analysis, code review

Ready to benchmark your agent?

Install the CLI, run the benchmark, and see how your agent stacks up.