How Do Coding Agents Spend Your Money?
Analyzing and Predicting Token Consumptions in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei
University of Michigan · Stanford University · AllHands AI · Google DeepMind · MIT
Preprint · April 2026

Can You Guess the Token Cost?

Test your intuition about agent token consumption

Challenge yourself to predict token consumption and costs for real coding agent tasks!

Play the Guessing Game

Abstract

Wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. We use "token consumption" to refer to both the input and output tokens used by LLM agents. When agents are deployed on tasks that can require a large number of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token efficient? (3) Can LLMs anticipate their token usage before task execution?

In this paper, we present the first quantitative study of token consumption patterns in agentic coding. We analyze trajectories from eight frontier LLMs on a widely used coding benchmark (SWE-Bench) and evaluate models' ability to predict their own token costs before task execution. We find that: (1) Agentic tasks are uniquely expensive: they consume substantially more tokens, and are correspondingly more expensive, than code reasoning and code chat, with input tokens, rather than output tokens, being the key driver of overall cost. (2) Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30× in total tokens, and higher token usage does not translate to higher accuracy; instead, accuracy often peaks at intermediate cost and degrades at higher cost. (3) Model-to-model token efficiency is governed more by model differences than by human-labeled task difficulty, and difficulty labels only weakly align with actual resource expenditure. (4) Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations of up to 0.39) and systematically underestimate real token costs. Our study reveals important insights into the economics of AI agents and can inspire new studies in this direction.

Key Findings

1) Agentic coding is uniquely expensive

Comparison of three coding paradigms: Code Reasoning (single-turn problem solving without tool interaction), Code Chat (multi-turn dialogue with moderate response expansion), and Coding Agent (tool-augmented, repository-level exploration with long-horizon context). Agentic coding tasks consume substantially more tokens and incur significantly higher monetary costs than the other two paradigms.

Token consumption comparison across different coding tasks

Token consumption across different coding-related tasks.
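The input-token dominance noted above can be made concrete with a back-of-the-envelope cost calculation. The per-million-token prices and token counts below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical per-million-token prices; real pricing varies by model and provider.
PRICE_IN = 3.00    # USD per 1M input tokens (assumed)
PRICE_OUT = 15.00  # USD per 1M output tokens (assumed)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one agent run, from its token counts."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# Agentic trajectories re-send the growing context on every turn, so input
# tokens can dwarf output tokens even at a lower per-token price.
cost = run_cost(input_tokens=2_000_000, output_tokens=40_000)  # 6.60 USD
input_share = (2_000_000 / 1e6 * PRICE_IN) / cost              # ~0.91
```

Even with output tokens priced 5× higher per token, the sheer volume of re-sent context makes input tokens the dominant cost term in this sketch.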

2) Token usage is highly variable

Costs differ substantially across backbone models, and token usage can vary dramatically across runs of the same task. Model ordering remains similar on the success-only subset, though absolute costs are lower than on the full set.

Figure 2: Model-dependent cost differences and variability across problems and runs

Model-dependent cost differences and large variability across problems and runs.
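One way to quantify this run-to-run variability is the max/min spread of total tokens across repeated runs of the same task. A minimal sketch, where the task IDs and token counts are made up for illustration:

```python
from collections import defaultdict

# Toy log of (task_id, total_tokens) over repeated runs; values are illustrative.
runs = [
    ("task-A", 120_000), ("task-A", 95_000), ("task-A", 2_900_000),
    ("task-B", 400_000), ("task-B", 520_000),
]

def spread_ratios(records):
    """Max/min ratio of total tokens per task across repeated runs."""
    by_task = defaultdict(list)
    for task_id, tokens in records:
        by_task[task_id].append(tokens)
    return {task: max(v) / min(v) for task, v in by_task.items()}

ratios = spread_ratios(runs)  # "task-A" spreads ~30x across runs
```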

3) More tokens do not mean better performance

Accuracy peaks at intermediate cost and degrades at higher cost, consistent with an inverse test-time scaling phenomenon: extra computation often reflects inefficient exploration and context bloat rather than better problem solving.

Figure 3a: Accuracy vs mean prompt tokens
Figure 3b: Accuracy by within-problem cost bins

Inverse test-time scaling: higher-cost runs are usually less accurate.
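The binned analysis behind this finding can be sketched as follows; the run records and bin edges are toy values chosen to reproduce the peak-then-decline shape, not the paper's data:

```python
# Toy (total_tokens, resolved) pairs; illustrative of the inverse-scaling shape.
runs = [
    (50_000, 0), (80_000, 1), (120_000, 1), (150_000, 0),
    (300_000, 1), (400_000, 1), (600_000, 0),
    (900_000, 0), (1_500_000, 0),
]

def accuracy_by_bin(records, edges):
    """Fraction of resolved runs inside each half-open token bin [lo, hi)."""
    accs = []
    for lo, hi in zip(edges, edges[1:]):
        hits = [ok for tokens, ok in records if lo <= tokens < hi]
        accs.append(sum(hits) / len(hits) if hits else None)
    return accs

acc = accuracy_by_bin(runs, edges=[0, 200_000, 700_000, 2_000_000])
# Accuracy peaks in the intermediate-cost bin and collapses in the highest bin.
```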

4) Human difficulty ≠ model cost

While harder tasks tend to consume more resources on average, many “easy” tasks can still be expensive, and many “hard” tasks can be solved cheaply. Notably, 11.51% of tasks labeled as "<15-minute" required more total tokens than the average ">1-hour" instance, and 27.38% of ">1-hour" tasks consumed fewer tokens than the "<15-minute" group.

Figure 5: Token and tool usage across difficulty levels

Difficulty labels only weakly align with agent resource expenditure.
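The crossover statistics above (e.g., the share of "<15-minute" tasks that cost more than the average ">1-hour" task) reduce to a simple comparison against a group mean. A sketch with hypothetical task records:

```python
# Toy per-task records: (difficulty_label, total_tokens); values are invented.
tasks = [
    ("<15 min", 60_000), ("<15 min", 2_500_000), ("<15 min", 90_000),
    (">1 hour", 1_200_000), (">1 hour", 150_000), (">1 hour", 2_000_000),
]

def crossover_rate(records, easy, hard):
    """Share of easy-labeled tasks costing more than the hard-label average."""
    hard_tokens = [t for label, t in records if label == hard]
    avg_hard = sum(hard_tokens) / len(hard_tokens)
    easy_tokens = [t for label, t in records if label == easy]
    return sum(t > avg_hard for t in easy_tokens) / len(easy_tokens)

rate = crossover_rate(tasks, easy="<15 min", hard=">1 hour")  # 1/3 in this toy data
```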

5) Backbone models follow distinct token-use patterns

Token usage varies substantially across backbone models, reflecting different problem-solving strategies and interaction patterns. Some models are consistently more verbose and exploratory, while others reach similar outcomes with shorter trajectories and lower overhead. These differences persist even on the same benchmark, suggesting that token efficiency depends not only on task difficulty but also on how each model allocates context, tool use, and generation.

Backbone model variation in token deviation on the success subset
Backbone model variation in token usage and accuracy

Different backbones show distinct token-consumption profiles and usage-performance tradeoffs.

6) Where tokens go across phases

This case study is conducted with Claude Sonnet 4.5. Across phases, context reuse grows steadily. Early costs are dominated by context construction, while later costs shift toward token-intensive generation.

Figure 6a: Phase-level token volume by type
Figure 6b: Phase-level cost by token type
Figure 6c: Correlation between token types and total cost

Phase-level token usage and cost dynamics across the trajectory (Claude Sonnet 4.5 case study).

7) Self-prediction gives a coarse cost signal

Asking the agent to estimate its own token usage yields weak-to-moderate correlations with ground truth across models. Output-token prediction tends to be easier than input-token prediction.

Self-prediction results showing correlations between predicted and ground-truth token counts

Pearson correlations between predicted and ground-truth token counts, plus relative overhead of self-prediction.
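Both the correlation and the downward bias reported for self-prediction boil down to simple statistics over (predicted, actual) pairs. A stdlib-only sketch with invented numbers (`scipy.stats.pearsonr` would give the same correlation):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy predicted vs. actual total tokens; chosen to show a weak positive
# correlation alongside a consistent underestimate.
predicted = [100_000, 150_000, 200_000, 250_000, 300_000]
actual    = [900_000, 2_000_000, 400_000, 700_000, 2_600_000]

r = pearson(predicted, actual)                                 # ~0.37
mean_bias = sum(p - a for p, a in zip(predicted, actual)) / 5  # negative => underestimation
```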

8) Self-prediction is systematically biased downward

Across models, self-predictions consistently underestimate true token usage. The bias is strongest for input tokens, while output-token predictions track the diagonal more closely but still undershoot higher-cost cases.

Input-token self-prediction versus actual values across models
Output-token self-prediction versus actual values across models

Self-prediction shows a consistent downward bias, especially for input-token estimates.

Human Guessing Leaderboard

Most Accurate Human Predictions

Rank Problem ID Guessed Tokens Actual Tokens Guessed Cost ($) Actual Cost ($) Token Error % Cost Error % Combined Error %

No guesses yet. Play the game to see your predictions here!

BibTeX

@article{bai2026tokenconsumption,
  title   = {How Do Coding Agents Spend Your Money? Analyzing and Predicting Token Consumptions in Agentic Coding Tasks},
  author  = {Bai, Longju and Huang, Zhemin and Wang, Xingyao and Sun, Jiao and Mihalcea, Rada and Brynjolfsson, Erik and Pentland, Alex and Pei, Jiaxin},
  journal = {Preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/}
}