Show HN

We benchmarked 50 AI agents to find what makes them cheap

Metrxbot measures cost per outcome for production AI agents — and tells you which prompt change, model swap, or cache rule will save the most money. Free Pro for 30 days for HN readers.

What we found, in three numbers

38%

of agent runs were duplicate work — same input, same model, no cache.

$0.04

median cost per successful outcome across our cohort. Most teams thought it was 10x lower.

2.7x

gap between the cheapest and most expensive agents doing the same job. Same prompt, wrong model.

Numbers come from our internal cohort plus 10 design partners. Treat them as a range, not an average. Full methodology is in the docs.
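To make "duplicate work" and "cost per successful outcome" concrete, here is a minimal sketch of how you could compute both from raw run logs. The `Run` shape and field names are invented for illustration; they are not Metrxbot's schema.

```typescript
// Hypothetical run record; fields are illustrative, not Metrxbot's schema.
type Run = {
  input: string;   // normalized model input
  model: string;
  costUsd: number;
  success: boolean;
};

// Share of runs that are duplicate work: same input + same model, seen before.
function duplicateShare(runs: Run[]): number {
  const seen = new Set<string>();
  let dupes = 0;
  for (const r of runs) {
    const key = `${r.model}::${r.input}`;
    if (seen.has(key)) dupes++;
    else seen.add(key);
  }
  return dupes / runs.length;
}

// Cost per successful outcome: total spend divided by count of successes.
function costPerOutcome(runs: Run[]): number {
  const spend = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const wins = runs.filter((r) => r.success).length;
  return spend / wins;
}
```

Note that failed runs still count toward spend, which is why cost per outcome is usually much higher than cost per run.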

Frequently asked questions

  • Is this just another LLM observability tool?
No. LangSmith, Helicone, and Langfuse all answer 'what did the LLM do?'. Metrxbot answers 'did it earn its keep?' — cost per successful outcome, attributed back to the upstream business event. We sit on top of those traces, not next to them.
  • How do you compute cost per outcome without my code?
Two ways. (1) MCP server / SDK: you call `metrx.outcome("lead_qualified", { agent_id, value_usd })` once per success. (2) Webhook ingest: post outcomes from Stripe/HubSpot/your CRM and we attribute them to the agent runs that preceded them. Most teams are wired up in under an hour.
  • What's the catch with the HN free Pro tier?
    No catch. Comment with your HN username on this thread, sign up, ping us in chat — we flip your org to Pro for 30 days. After that you can stay on Free (1 agent, 10K events/mo) or pay $99/mo for Pro. We just want feedback from people who have actually built agents.
  • How is this different from spreadsheet tracking?
    Three things: (1) we attribute outcomes to specific runs so you can see which prompt versions made you money, (2) we benchmark you against the anonymized cohort of all agents in our system so you know if your $0.04/run is good or terrible, (3) we surface optimization moves (model swap, prompt trim, cache hit) ranked by dollar impact.