Key Takeaways
- What Terminal-Bench tests: Real-world coding tasks across 12 categories, including file manipulation, git operations, system administration, web development, and data processing. Each task has automated pass/fail verification, with no partial credit. The score is the percentage of tasks completed correctly on the first attempt.
- Top tier (70%+): vix (90.2%, Claude Opus 4.7), Codex CLI (82%, GPT-5.5), Junie CLI (71%, JetBrains). These agents handle most tasks without intervention. The gap between first and second place is about 8 points: vix completes roughly nine in ten tasks on the first attempt.
- Mid tier (40-70%): Claude Code (58%, Opus 4.6), Gemini CLI (52%, Gemini 2.5 Pro), Aider (48%). Capable agents that fail on complex multi-step tasks or system administration. Useful daily tools, but expect to intervene regularly.
- The leaderboard caveat: Terminal-Bench tests terminal-native tasks. Agents that excel at IDE-integrated workflows (Cursor, Windsurf, Cline, Roo Code) may underperform because the benchmark does not capture their strengths. A low Terminal-Bench score does not mean a bad tool. It means the tool is optimized for different work.
What Terminal-Bench 2.0 actually tests
Terminal-Bench 2.0 runs 143 AI coding agents through a battery of real-world terminal tasks. Not toy problems. Not academic benchmarks. Practical work: write a bash script that processes CSV files, set up a git repository with a specific branching structure, configure an nginx server, parse JSON with jq, debug a failing build pipeline.
Each task has automated pass/fail verification. The agent either produces the correct output or it does not. No partial credit, no subjective quality ratings. The score is the percentage of tasks completed correctly on the first attempt.
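The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` type, the task identifiers, and the sample outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str   # hypothetical task identifier
    category: str  # one of the benchmark's task categories
    passed: bool   # verifier outcome on the first attempt (no partial credit)


def first_attempt_pass_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed on the first attempt."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Made-up outcomes for illustration only.
results = [
    TaskResult("csv-report", "data processing", True),
    TaskResult("git-branches", "git operations", True),
    TaskResult("nginx-config", "system administration", False),
    TaskResult("jq-parse", "data processing", True),
]

print(f"{first_attempt_pass_rate(results):.1f}%")  # 3 of 4 passed: prints 75.0%
```

Because every task is binary, two agents with the same score may still fail on entirely different categories, which is worth checking before reading too much into a single number.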
This design makes the benchmark transparent and reproducible. It also makes it narrow. Terminal-Bench tests terminal-native work. It does not test code quality, architectural reasoning, multi-file refactoring, or IDE-integrated workflows. An agent can score poorly on Terminal-Bench and still be the best tool for your workflow if your workflow does not involve terminal scripting.
The leaderboard: top performers
Here are the top-scoring agents from the Terminal-Bench 2.0 evaluation, alongside notable tools that developers search for most.
| Rank | Agent | Score | Base Model | Monthly Searches | Notes |
|---|---|---|---|---|---|
| 1 | vix | 90.2% | Claude Opus 4.7 | -- | Terminal-specialized, aggressive task decomposition |
| 2 | Codex CLI | 82% | GPT-5.5 | 14.8K | Kernel sandbox, $20/mo via ChatGPT Plus |
| 3 | Junie CLI | 71% | JetBrains AI | -- | JetBrains ecosystem, IDE-native with CLI mode |
| 8 | Claude Code | 58% | Opus 4.6 | 9.9K | Best on SWE-bench (80.9%), weaker on terminal tasks |
| 12 | Gemini CLI | 52% | Gemini 2.5 Pro | 4.4K | Best free tier, Google Search grounding |
| 15 | Aider | 48% | Multi-model | 1K | Git-native, 40K stars, oldest mature agent |
Notable tools by search volume
Terminal-Bench ranks agents by benchmark performance. Developer interest, measured by monthly search volume, tells a different story. Several tools with strong developer interest do not rank high on Terminal-Bench because they are optimized for different workflows.
Roo Code (8.1K monthly searches)
Roo Code is a fork of Cline with multi-agent modes: Architect, Code, Debug, and custom user-defined modes. It runs inside VS Code as an extension. The 5.0 user rating on the marketplace is notable: no other AI coding extension maintains a perfect score at scale.
Roo Code's strength is customization. You can define modes that use different models for different tasks (cheap models for code review, expensive models for architecture), configure approval workflows per mode, and create project-specific agents. The claimed 97% cost reduction comes from using DeepSeek-R1 for routine tasks instead of Claude or GPT.
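To see how per-mode model selection changes cost, here is a rough sketch of the arithmetic. It is not Roo Code's configuration format: the model names, the prices, the mode mapping, and the token volumes are all hypothetical placeholders, and the resulting percentage depends entirely on those inputs.

```python
# Hypothetical prices in USD per million tokens; real prices vary by provider.
PRICE_PER_MTOK = {"premium-model": 15.00, "budget-model": 0.50}

# Hypothetical mapping of modes to models: expensive model only for architecture.
MODE_MODEL = {
    "architect": "premium-model",
    "code": "budget-model",
    "review": "budget-model",
}


def cost(usage_mtok: dict[str, float], routing: dict[str, str]) -> float:
    """Total cost in USD for the given per-mode usage (millions of tokens)."""
    return sum(PRICE_PER_MTOK[routing[mode]] * mtok for mode, mtok in usage_mtok.items())


# Made-up monthly usage per mode, in millions of tokens.
usage = {"architect": 1.0, "code": 6.0, "review": 3.0}

all_premium = {mode: "premium-model" for mode in usage}
baseline = cost(usage, all_premium)  # everything on the expensive model
routed = cost(usage, MODE_MODEL)     # routine modes on the cheap model
reduction = 100 * (1 - routed / baseline)

print(f"baseline ${baseline:.2f}, routed ${routed:.2f}, reduction {reduction:.0f}%")
```

With these made-up numbers the reduction is about 87%. The saving grows as more of the workload moves to the cheaper model and as the price gap between the two models widens.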
Terminal-Bench does not capture Roo Code's IDE integration, which is where it performs best. As a terminal agent, it scores lower than Codex or Claude Code. As an IDE agent, developers consistently rank it among the top options.
Cline AI (3.6K monthly searches)
Cline is the original VS Code AI coding agent. 57K GitHub stars, 4M+ developers, step-by-step approval model. Every file edit, every terminal command requires your explicit approval before execution. This makes Cline the safest IDE-native agent for teams that need audit trails.
Cline's governance model is its differentiator. Each action is logged, approvable, and reversible. For regulated industries or teams with compliance requirements, this matters more than raw benchmark scores. Roo Code forked from Cline specifically to add the multi-mode flexibility that Cline's conservative design intentionally avoids.
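The approve-then-execute pattern can be sketched as a small gate that records every decision. This is an illustration of the general idea under assumed names (`Action`, `ApprovalGate`, `audit_log`), not Cline's actual implementation or API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str    # "edit" or "command" (hypothetical action kinds)
    target: str  # file path or shell command text


@dataclass
class ApprovalGate:
    approve: Callable[[Action], bool]  # a human prompt in practice, a policy here
    audit_log: list[tuple[Action, bool]] = field(default_factory=list)

    def run(self, action: Action, execute: Callable[[Action], None]) -> bool:
        """Ask for approval, log the decision, and execute only if approved."""
        approved = self.approve(action)
        self.audit_log.append((action, approved))
        if approved:
            execute(action)
        return approved


# Demo policy standing in for a human reviewer: approve edits, reject commands.
executed: list[Action] = []
gate = ApprovalGate(approve=lambda a: a.kind == "edit")

gate.run(Action("edit", "README.md"), executed.append)
gate.run(Action("command", "rm -rf build"), executed.append)

print(len(executed), len(gate.audit_log))  # prints: 1 2
```

The key design point is that rejected actions are logged too, so the audit trail covers what the agent proposed as well as what actually ran.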
Warp AI (2.9K monthly searches)
Warp is an AI-native terminal emulator, not a coding agent in the traditional sense. It replaces your terminal application with one that has AI built in: natural language commands, intelligent autocomplete, and inline explanations of terminal output.
Warp does not compete with Codex or Claude Code on agentic coding tasks. It competes with iTerm2 and Alacritty as your terminal application. The AI features are assistive (help you write commands) rather than agentic (execute tasks autonomously). For developers who spend their day in the terminal, Warp's AI assistance can reduce friction without changing their workflow.
OpenCode AI (1.6K monthly searches)
OpenCode is a Go-based terminal UI (TUI) for AI coding. 100K+ GitHub stars, the fastest-growing open-source agent by star velocity. It supports 75+ LLM providers including local models via Ollama, which makes it the most model-flexible terminal agent available.
The privacy angle is real. OpenCode runs entirely locally. No code leaves your machine unless you explicitly configure a cloud LLM. For developers working on proprietary codebases, this is often the deciding factor over Codex (which uploads to OpenAI's cloud) or Claude Code (which sends code to Anthropic).
Sourcegraph Cody (1.3K monthly searches)
Cody is Sourcegraph's AI coding assistant. Its differentiator is codebase-wide context: Cody indexes your entire repository (or multiple repositories) and uses that context when generating code. Where most agents see only the files you explicitly share, Cody sees everything.
This makes Cody stronger for large-codebase tasks: "implement the same validation pattern we use in the payments service" works because Cody has already indexed the payments service. The tradeoff is that Cody is tightly coupled to Sourcegraph's platform. Teams not already using Sourcegraph face a significant onboarding cost.
Aider (1K monthly searches)
Aider is the oldest mature AI coding agent, with 40K GitHub stars and 4.1M installs. Its core design is git-native: every edit Aider makes is automatically committed with a descriptive message. You can undo any AI change with a standard git revert.
Aider supports multiple models (Claude, GPT, Gemini, local) and uses a unique "architect + editor" pattern where one model plans the changes and another executes them. This separation often produces cleaner output than single-model agents. Aider's 48% Terminal-Bench score underrepresents its strength at multi-file editing and refactoring, which Terminal-Bench does not emphasize.
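The two-stage pattern can be sketched as follows. This is a simplified illustration, not Aider's code: the `Model` type, the prompt wording, and the stub models are assumptions made so the example runs without any API access.

```python
from typing import Callable

# Hypothetical model interface: prompt text in, response text out.
Model = Callable[[str], str]


def architect_then_edit(request: str, source: str, architect: Model, editor: Model) -> str:
    """Have one model plan the change, then a second model apply the plan."""
    plan = architect(f"Plan the changes for this request:\n{request}\n\nCode:\n{source}")
    return editor(
        "Apply this plan and return the full updated file.\n\n"
        f"Plan:\n{plan}\n\nCode:\n{source}"
    )


# Stub models standing in for real LLM calls.
def stub_architect(prompt: str) -> str:
    return "rename the function foo to bar"


def stub_editor(prompt: str) -> str:
    code = prompt.split("Code:\n", 1)[1]  # everything after the code marker
    return code.replace("foo", "bar")


source = "def foo():\n    return 1\n"
updated = architect_then_edit("Rename foo to bar", source, stub_architect, stub_editor)
print(updated)
```

Keeping the plan as a separate intermediate string is what allows the planner and the editor to be different models, for example a stronger model for planning and a cheaper one for applying edits.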
How to interpret Terminal-Bench scores
A high Terminal-Bench score means the agent reliably completes terminal-native tasks on the first attempt. This correlates with productivity for developers who work primarily in the terminal: DevOps engineers, system administrators, backend developers who script heavily.
A lower score does not mean a bad tool. Claude Code scores 58% on Terminal-Bench but 80.9% on SWE-bench, which tests software engineering tasks (bug fixes, feature implementations, test writing). These benchmarks measure different skills. Claude Code is a better software engineer. Codex CLI is a better terminal operator.
The roughly 8-point gap between vix (90.2%) and Codex CLI (82%) is meaningful on the benchmark. vix is purpose-built for benchmark performance, with aggressive retry logic and task decomposition, while Codex CLI is a general-purpose coding agent that happens to score well. For practical daily use, that gap matters less than the differences in ecosystem, pricing, and workflow integration.
What actually matters for choosing a tool
Benchmarks are one input. Here is what we recommend weighing:
- Your workflow: Terminal-centric? Look at Codex, Claude Code, Aider. IDE-centric? Look at Roo Code, Cline, Cursor. Both? Run two tools.
- Your model preference: Locked to OpenAI? Codex. Want Claude? Claude Code. Want choice? Aider, OpenCode, or Roo Code.
- Your budget: Free? Gemini CLI or OpenCode with local models. $20/month? Codex or Claude Code. $100+? Claude Code Max or Devin.
- Your privacy requirements: Air-gapped or sensitive codebase? OpenCode with local models. No data leaves your machine.
- Your team size: Solo? Pick one tool and learn it well. Team of 10+? Standardize on one primary tool, allow a second for specific use cases.
The most productive developers we track use two tools: one for quick tasks (usually Codex or Aider) and one for complex work (usually Claude Code or Cursor). The benchmark helps you pick within those categories. It does not tell you which category you need.
What is Terminal-Bench?
Terminal-Bench is an open benchmark for evaluating AI coding agents on real-world terminal tasks. Version 2.0 tests 143 agents across 12 categories including file manipulation, git operations, web development, system administration, and data processing. Each task has automated verification. The score represents first-attempt pass rate.
Why does Claude Code score lower than Codex on Terminal-Bench?
Terminal-Bench measures terminal-native task completion: scripting, file ops, system administration. Codex CLI is optimized for exactly these tasks with its sandbox execution model. Claude Code scores 58% because it is optimized for deep reasoning and multi-file code generation, not terminal scripting. Claude Code outperforms Codex on code quality benchmarks (SWE-bench, blind evaluations) that test different skills.
Is Roo Code better than Cline?
Roo Code is a fork of Cline with additional features: multi-agent modes (Architect, Code, Debug), custom modes, and cost optimization via model selection. It has a perfect 5.0 user rating on the VS Code marketplace. Whether it is 'better' depends on your needs. Cline has a larger community (57K GitHub stars vs Roo Code's 25K) and longer track record. Roo Code offers more customization. Both are strong IDE-native agents.
Should I choose tools based on Terminal-Bench scores?
Not exclusively. Terminal-Bench tests a specific slice of coding work: terminal-native tasks. If your workflow is IDE-centric (editing files in VS Code, refactoring within a project), tools like Cursor, Windsurf, Cline, or Roo Code may serve you better despite lower Terminal-Bench scores. Use the benchmark as one data point alongside your own workflow evaluation.
What is vix, and why does it score 90.2%?
vix is a terminal-native agent built specifically to excel at automated coding tasks. It uses Claude Opus 4.7 as its base model and adds aggressive task decomposition, retry logic, and optimized tool-use patterns. Its 90.2% score reflects this specialization. However, vix is less well-known than Codex or Claude Code because it targets power users and CI/CD automation rather than general developer productivity.