<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>CheapestInference | Blog</title><description/><link>https://cheapestinference.com/</link><language>en</language><item><title>DeepSeek V3.2 vs Claude Opus for coding: when to use which</title><link>https://cheapestinference.com/blog/deepseek-vs-claude-for-coding/</link><guid isPermaLink="true">https://cheapestinference.com/blog/deepseek-vs-claude-for-coding/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The question isn’t which model is “better” at coding. It’s which model is better &lt;em&gt;for the coding task you’re doing right now&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Claude Opus 4.6 is the highest-scoring model on most coding benchmarks. DeepSeek V3.2 costs 55x less. The quality gap is real but narrow — and for many tasks, it doesn’t matter.&lt;/p&gt;
&lt;p&gt;We ran both models through five categories of coding tasks and measured quality, speed, and cost. Here’s what we found.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;benchmark-scores&quot;&gt;Benchmark scores&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Benchmark&lt;/th&gt;
      &lt;th&gt;Claude Opus 4.6&lt;/th&gt;
      &lt;th&gt;DeepSeek V3.2&lt;/th&gt;
      &lt;th&gt;Gap&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SWE-bench Verified&lt;/td&gt;
      &lt;td&gt;72.5%&lt;/td&gt;
      &lt;td&gt;68.2%&lt;/td&gt;
      &lt;td&gt;-4.3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;HumanEval+&lt;/td&gt;
      &lt;td&gt;93.2%&lt;/td&gt;
      &lt;td&gt;91.8%&lt;/td&gt;
      &lt;td&gt;-1.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LiveCodeBench (Q1 2026)&lt;/td&gt;
      &lt;td&gt;48.5%&lt;/td&gt;
      &lt;td&gt;43.1%&lt;/td&gt;
      &lt;td&gt;-5.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Aider polyglot&lt;/td&gt;
      &lt;td&gt;68.1%&lt;/td&gt;
      &lt;td&gt;65.3%&lt;/td&gt;
      &lt;td&gt;-2.8&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Opus wins every benchmark. But the gap ranges from 1.4 to 5.4 points. The question is whether that gap justifies a 55x price difference.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;task-by-task-comparison&quot;&gt;Task-by-task comparison&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;greenfield-code-generation&quot;&gt;Greenfield code generation&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;“Write an Express middleware that validates JWTs and attaches the user to the request.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Both models produce correct, well-structured code. Opus tends to add more edge-case handling (expired tokens, malformed headers, missing claims). DeepSeek produces cleaner, shorter code that handles the happy path and common errors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Opus by a small margin. The extra edge-case handling is genuinely useful.
&lt;strong&gt;Does it justify 55x cost?&lt;/strong&gt; No. A 2-minute code review catches what DeepSeek misses.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;debugging&quot;&gt;Debugging&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;“This test fails with ‘expected 3, got 4’. Here’s the test and the implementation.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Both models identify the off-by-one error correctly. Opus explains the root cause more clearly and suggests a fix with a regression test. DeepSeek identifies and fixes the bug but doesn’t suggest the test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Opus. Better explanations help prevent similar bugs.
&lt;strong&gt;Does it justify 55x cost?&lt;/strong&gt; For isolated bugs, no. For debugging sessions with complex context, maybe.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;refactoring&quot;&gt;Refactoring&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;“Extract this 200-line function into smaller, testable functions.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Opus excels here. It identifies logical boundaries, names functions well, maintains the original behavior, and adds type annotations. DeepSeek produces correct refactoring but sometimes picks awkward function boundaries or generic names.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Opus. Refactoring quality matters for maintainability.
&lt;strong&gt;Does it justify 55x cost?&lt;/strong&gt; For critical production code, yes. For internal tools, no.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;code-review&quot;&gt;Code review&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;“Review this PR for bugs, security issues, and style.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Both models catch obvious bugs and security issues (SQL injection, missing auth checks). Opus catches more subtle issues — race conditions, edge cases in error handling, potential memory leaks. DeepSeek focuses on the most impactful issues and misses some subtle ones.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Opus, particularly for security-sensitive code.
&lt;strong&gt;Does it justify 55x cost?&lt;/strong&gt; For security reviews, yes. For routine PR reviews, no.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;boilerplate-and-scaffolding&quot;&gt;Boilerplate and scaffolding&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;“Create a CRUD API with Prisma, Express, and TypeScript for a blog platform.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Both models produce identical-quality boilerplate. This is the category where the quality gap is zero. There’s no creative problem-solving involved — just pattern application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Tie.
&lt;strong&gt;Does it justify 55x cost?&lt;/strong&gt; Absolutely not. Use the cheapest model available.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-cost-math&quot;&gt;The cost math&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;For a developer using an AI coding assistant throughout the day:&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus (all tasks)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~$3,000/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Mixed (Opus + DeepSeek)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~$540/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2 (all tasks)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~$53/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;CheapestInference&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;from $39/mo&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The “mixed” approach — using Opus for refactoring and security reviews, DeepSeek for everything else — captures 90% of Opus’s value at 18% of the cost.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-practical-recommendation&quot;&gt;The practical recommendation&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Opus for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Security-critical code reviews&lt;/li&gt;
&lt;li&gt;Complex refactoring of production systems&lt;/li&gt;
&lt;li&gt;Debugging subtle concurrency or memory issues&lt;/li&gt;
&lt;li&gt;Architectural decisions that need thorough reasoning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use DeepSeek V3.2 for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Greenfield code generation&lt;/li&gt;
&lt;li&gt;Boilerplate and scaffolding&lt;/li&gt;
&lt;li&gt;Simple bug fixes&lt;/li&gt;
&lt;li&gt;Test writing&lt;/li&gt;
&lt;li&gt;Documentation generation&lt;/li&gt;
&lt;li&gt;Any task where “correct” is sufficient and “polished” isn’t required&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a small model (Llama 8B, Qwen 35B) for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code formatting&lt;/li&gt;
&lt;li&gt;Simple find-and-replace refactoring&lt;/li&gt;
&lt;li&gt;Generating repetitive test cases&lt;/li&gt;
&lt;li&gt;Explaining code (reading comprehension, not generation)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The right model depends on the task, not on a blanket preference. A &lt;a href=&quot;https://cheapestinference.com/blog/multi-model-architecture/&quot;&gt;multi-model architecture&lt;/a&gt; that routes by task complexity gives you the best of both worlds.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;route-by-task-through-one-api&quot;&gt;Route by task through one API&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The same task-routing logic applies to open-weight models. CheapestInference serves Kimi K2.6, GLM 5.2, and MiniMax M3 — all strong on coding — through a single OpenAI- and Anthropic-compatible endpoint, so you can pick the right model per task without juggling accounts:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk-your-key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Reach for the strongest model on the hard stuff&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;review &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;kimi-k2.6&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;Review this PR for security issues:&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;{diff}&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# A cheaper-to-run model for the routine work&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;code &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Write a CRUD API for blog posts&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Same SDK, same key, different model per task. The routing decision is yours — or your agent’s.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves Kimi K2.6, GLM 5.2, and MiniMax M3 through one OpenAI- and Anthropic-compatible API. Unlimited time-block subscriptions start at $39/month — reserve the hours you work and run without budget caps. &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;Get started&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com&quot;&gt;see the models&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>LLM API pricing in 2026: the complete comparison</title><link>https://cheapestinference.com/blog/llm-api-pricing-compared/</link><guid isPermaLink="true">https://cheapestinference.com/blog/llm-api-pricing-compared/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;LLM pricing changes every few weeks. A model that cost $60/M output tokens last year costs $10 today. New providers undercut each other constantly. This page is our attempt to keep a single, updated reference.&lt;/p&gt;
&lt;p&gt;Last updated: April 2026.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;frontier-models&quot;&gt;Frontier models&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The most capable models from each provider:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input $/M&lt;/th&gt;
      &lt;th&gt;Output $/M&lt;/th&gt;
      &lt;th&gt;Context&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
      &lt;td&gt;200K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;200K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;$0.27&lt;/td&gt;
      &lt;td&gt;$1.10&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Qwen 3.5 397B&lt;/td&gt;
      &lt;td&gt;$0.40&lt;/td&gt;
      &lt;td&gt;$1.20&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mistral Large 3&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;$6.00&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;The price spread is 55x between the cheapest (DeepSeek V3.2) and most expensive (Claude Opus 4.6) frontier model. The quality spread on MMLU-Pro is 6.5 points. That’s the opportunity.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;cost-efficient-models&quot;&gt;Cost-efficient models&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The sweet spot — models that handle 80% of tasks at a fraction of frontier prices:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input $/M&lt;/th&gt;
      &lt;th&gt;Output $/M&lt;/th&gt;
      &lt;th&gt;Context&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
      &lt;td&gt;$0.80&lt;/td&gt;
      &lt;td&gt;$4.00&lt;/td&gt;
      &lt;td&gt;200K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-4.1 mini&lt;/td&gt;
      &lt;td&gt;$0.40&lt;/td&gt;
      &lt;td&gt;$1.60&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
      &lt;td&gt;$0.15&lt;/td&gt;
      &lt;td&gt;$0.60&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Qwen 3.5 35B&lt;/td&gt;
      &lt;td&gt;$0.06&lt;/td&gt;
      &lt;td&gt;$0.12&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Llama 3.1 8B&lt;/td&gt;
      &lt;td&gt;$0.02&lt;/td&gt;
      &lt;td&gt;$0.05&lt;/td&gt;
      &lt;td&gt;128K&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Llama 3.1 8B at $0.02/M input is 750x cheaper than Claude Opus. It won’t write your authentication system, but it’ll classify intents, extract entities, and route requests just fine.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;real-workload-cost-comparison&quot;&gt;Real workload cost comparison&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Pricing per million tokens is hard to reason about. Here’s what actual workloads cost monthly:&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;chatbot-50-conversationsday-2k-tokens-each&quot;&gt;Chatbot (50 conversations/day, ~2K tokens each)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$270/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$100/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$10/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;CheapestInference&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;from $39/mo&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;agent-workload-20-tasksday-500k-tokens-each&quot;&gt;Agent workload (20 tasks/day, ~500K tokens each)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$5,508/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$2,838/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$96/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;CheapestInference&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;from $39/mo&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The gap widens dramatically with agent workloads because context accumulation multiplies the per-token cost. Flat-rate pricing eliminates this entirely.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;per-token-vs-flat-rate-when-each-makes-sense&quot;&gt;Per-token vs. flat-rate: when each makes sense&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Per-token is better when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your usage is low and predictable (&amp;#x3C; $20/month)&lt;/li&gt;
&lt;li&gt;You’re prototyping and don’t know your volume yet&lt;/li&gt;
&lt;li&gt;You need a specific model not available on flat-rate platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Flat-rate is better when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You run agents with unpredictable token consumption&lt;/li&gt;
&lt;li&gt;Your monthly token bill exceeds the flat-rate plan cost&lt;/li&gt;
&lt;li&gt;You want cost certainty for budgeting&lt;/li&gt;
&lt;li&gt;You run multiple agents that need independent rate limits&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The breakeven for flat-rate vs. per-token on DeepSeek V3.2 is roughly 40M tokens/month. An active agent does that in a week.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;how-to-switch-without-rewriting-code&quot;&gt;How to switch without rewriting code&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Every provider listed in this article supports the OpenAI API format. Switching is a config change:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;# or any provider&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk-your-key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;kimi-k2.6&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Hello&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Same SDK. Same methods. Same response format. Different price.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves Kimi K2.6, GLM 5.2, and MiniMax M3 through a single OpenAI- and Anthropic-compatible API. Unlimited time-block subscriptions start at $39/month — reserve the hours you work and run without budget caps. &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;Compare plans&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com&quot;&gt;see the models&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>OpenAI API alternatives in 2026: price, speed, and quality compared</title><link>https://cheapestinference.com/blog/openai-api-alternatives-2026/</link><guid isPermaLink="true">https://cheapestinference.com/blog/openai-api-alternatives-2026/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every team that builds on GPT-5.4 eventually asks the same question: is there something cheaper that works just as well?&lt;/p&gt;
&lt;p&gt;The answer is yes — but “cheaper” means different things depending on your workload. A chatbot that sends 50 messages/day has different economics than an agent framework burning 2M tokens per hour. This guide compares the real alternatives, with numbers.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-youre-actually-paying-for-with-openai&quot;&gt;What you’re actually paying for with OpenAI&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;OpenAI’s pricing for GPT-5.4:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Input&lt;/strong&gt;: $2.50/M tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt;: $10.00/M tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cached input&lt;/strong&gt;: $1.25/M tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a typical API integration doing 1M input + 200K output tokens per day, that’s &lt;strong&gt;$4.50/day or $135/month&lt;/strong&gt;. For an agent workload doing 10M input + 1M output per day, it’s &lt;strong&gt;$35/day or $1,050/month&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The question isn’t whether GPT-5.4 is good. It is. The question is whether you need GPT-5.4 for every request.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-alternatives&quot;&gt;The alternatives&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;1-use-a-cheaper-openai-model&quot;&gt;1. Use a cheaper OpenAI model&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Before switching providers, check if a smaller OpenAI model works:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input $/M&lt;/th&gt;
      &lt;th&gt;Output $/M&lt;/th&gt;
      &lt;th&gt;Quality (MMLU-Pro)&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
      &lt;td&gt;88.5%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-4.1 mini&lt;/td&gt;
      &lt;td&gt;$0.40&lt;/td&gt;
      &lt;td&gt;$1.60&lt;/td&gt;
      &lt;td&gt;81.2%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-4.1 nano&lt;/td&gt;
      &lt;td&gt;$0.10&lt;/td&gt;
      &lt;td&gt;$0.40&lt;/td&gt;
      &lt;td&gt;73.8%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;GPT-4.1 mini is 6x cheaper than GPT-5.4 with a 7-point quality drop. For classification, extraction, and simple Q&amp;#x26;A, that’s a good trade.&lt;/p&gt;
&lt;p&gt;But if you need frontier quality at lower cost, you need to look beyond OpenAI.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;2-open-source-models-via-inference-providers&quot;&gt;2. Open-source models via inference providers&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The real price disruption comes from open-source models. DeepSeek V3.2, Qwen 3.5, and Kimi K2.5 score within 4 points of GPT-5.4 on most benchmarks — at 5–50x less cost.&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Provider&lt;/th&gt;
      &lt;th&gt;DeepSeek V3.2 Input&lt;/th&gt;
      &lt;th&gt;DeepSeek V3.2 Output&lt;/th&gt;
      &lt;th&gt;Models&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek (direct)&lt;/td&gt;
      &lt;td&gt;$0.27&lt;/td&gt;
      &lt;td&gt;$1.10&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Together AI&lt;/td&gt;
      &lt;td&gt;$0.30&lt;/td&gt;
      &lt;td&gt;$0.90&lt;/td&gt;
      &lt;td&gt;100+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Fireworks&lt;/td&gt;
      &lt;td&gt;$0.20&lt;/td&gt;
      &lt;td&gt;$0.80&lt;/td&gt;
      &lt;td&gt;50+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Groq&lt;/td&gt;
      &lt;td&gt;$0.10&lt;/td&gt;
      &lt;td&gt;$0.30&lt;/td&gt;
      &lt;td&gt;15+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;OpenRouter&lt;/td&gt;
      &lt;td&gt;varies&lt;/td&gt;
      &lt;td&gt;varies&lt;/td&gt;
      &lt;td&gt;200+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CheapestInference&lt;/td&gt;
      &lt;td&gt;flat-rate&lt;/td&gt;
      &lt;td&gt;flat-rate&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;All of these are OpenAI-compatible — change &lt;code dir=&quot;auto&quot;&gt;base_url&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;api_key&lt;/code&gt;, keep the rest of your code.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-hidden-cost-per-token-pricing-on-agent-workloads&quot;&gt;The hidden cost: per-token pricing on agent workloads&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Per-token pricing works well for predictable workloads — chatbots, single-shot completions, classification. You can estimate monthly cost from your traffic.&lt;/p&gt;
&lt;p&gt;It doesn’t work well for agents. Agent workloads have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unpredictable token consumption&lt;/strong&gt; — a simple task might take 10 steps, a complex one might take 60&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context accumulation&lt;/strong&gt; — each step re-sends everything, so cost grows quadratically with steps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry storms&lt;/strong&gt; — errors trigger retries that consume tokens without producing output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We broke this down in detail in &lt;a href=&quot;https://cheapestinference.com/blog/openclaw-cost-problem/&quot;&gt;OpenClaw is free. Running it is not&lt;/a&gt;. The short version: a single OpenClaw task consumes ~525K tokens. On pay-per-token, that’s $0.16–$9.18 depending on the model.&lt;/p&gt;
&lt;p&gt;On flat-rate, it’s included. Context accumulation, retries, and overhead don’t increase your bill.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;switching-from-openai-what-actually-changes&quot;&gt;Switching from OpenAI: what actually changes&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;If your code uses the OpenAI SDK, switching to any OpenAI-compatible provider is a two-line change:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Before&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk-openai-...&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# After — any compatible provider&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk-your-key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;What stays the same:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code dir=&quot;auto&quot;&gt;client.chat.completions.create()&lt;/code&gt; — same API&lt;/li&gt;
&lt;li&gt;Streaming — same &lt;code dir=&quot;auto&quot;&gt;stream=True&lt;/code&gt; pattern&lt;/li&gt;
&lt;li&gt;Tool calling — same &lt;code dir=&quot;auto&quot;&gt;tools&lt;/code&gt; parameter&lt;/li&gt;
&lt;li&gt;Response format — same JSON structure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What might change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model names&lt;/strong&gt; — &lt;code dir=&quot;auto&quot;&gt;gpt-5.4&lt;/code&gt; becomes &lt;code dir=&quot;auto&quot;&gt;deepseek/deepseek-chat-v3-0324&lt;/code&gt; or &lt;code dir=&quot;auto&quot;&gt;qwen/qwen3.5-397b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate limits&lt;/strong&gt; — each provider has different RPM/TPM limits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt; — varies by provider and model size&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature support&lt;/strong&gt; — not all providers support vision, function calling, or JSON mode on all models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test with your actual prompts before switching production traffic. Benchmarks measure general capability — your specific use case might have different results.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;when-to-use-which-alternative&quot;&gt;When to use which alternative&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;You need the highest quality and cost doesn’t matter&lt;/strong&gt;: Stay with GPT-5.4 or Claude Opus 4.6 directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You want GPT-5.4 quality at lower cost&lt;/strong&gt;: Use OpenRouter to access GPT-5.4 at discounted rates, or switch to open-weight models within a few points on most benchmarks — CheapestInference serves Kimi K2.6, GLM 5.2, and MiniMax M3 on flat-rate plans.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You run agents&lt;/strong&gt;: Flat-rate pricing eliminates the unpredictability of agent workloads. You reserve time blocks and the agent runs unlimited during those hours, no token counting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You need the fastest inference&lt;/strong&gt;: Groq’s LPU hardware delivers the lowest latency for supported models. If your model is on Groq, it’s hard to beat on speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You want one API for everything&lt;/strong&gt;: OpenRouter gives you access to multiple providers through a single endpoint with the largest catalog. If a few strong open-weight models cover your needs, CheapestInference offers flat-rate pricing on Kimi K2.6, GLM 5.2, and MiniMax M3.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-bottom-line&quot;&gt;The bottom line&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;OpenAI built the best developer experience in AI. But being the best product doesn’t mean being the best price. The API landscape in 2026 has enough competition that you can get 95% of the quality at 10–50% of the cost — or eliminate cost uncertainty entirely with flat-rate pricing.&lt;/p&gt;
&lt;p&gt;The switch is two lines of code. The savings compound every month.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves Kimi K2.6, GLM 5.2, and MiniMax M3 through a single OpenAI- and Anthropic-compatible endpoint. Unlimited time-block subscriptions start at $39/month — reserve 1–3 daily 8-hour blocks for unlimited usage during those hours. &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;Get started&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;compare plans&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>OpenRouter alternatives in 2026: unified LLM APIs compared</title><link>https://cheapestinference.com/blog/openrouter-alternative/</link><guid isPermaLink="true">https://cheapestinference.com/blog/openrouter-alternative/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenRouter solved a real problem: one API key, hundreds of models, no separate accounts per provider. You point your code at &lt;code dir=&quot;auto&quot;&gt;openrouter.ai/api/v1&lt;/code&gt; and pick any model from any provider.&lt;/p&gt;
&lt;p&gt;But OpenRouter isn’t the only unified API anymore. And depending on your workload, it might not be the cheapest or fastest option. Here’s how the alternatives compare.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-openrouter-does-well&quot;&gt;What OpenRouter does well&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Credit where it’s due:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model coverage&lt;/strong&gt;: 200+ models from dozens of providers. If a model exists, OpenRouter probably has it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic routing&lt;/strong&gt;: &lt;code dir=&quot;auto&quot;&gt;openrouter/auto&lt;/code&gt; picks a model for you based on your prompt. Useful for prototyping.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback&lt;/strong&gt;: If one provider is down, OpenRouter routes to another. You don’t handle failover yourself.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single billing&lt;/strong&gt;: One account, one API key, one invoice. No managing 8 provider accounts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For developers who want access to everything and don’t want to manage multiple integrations, OpenRouter is a good default.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;where-openrouter-gets-expensive&quot;&gt;Where OpenRouter gets expensive&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;OpenRouter adds a margin on top of each provider’s per-token price. This is how they make money — they’re a reseller. The markup varies by model but is typically 5–20% above the direct provider price.&lt;/p&gt;
&lt;p&gt;For low-volume usage, the convenience premium is negligible. For high-volume or agent workloads, it compounds:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Direct price (input)&lt;/th&gt;
      &lt;th&gt;OpenRouter price&lt;/th&gt;
      &lt;th&gt;Markup&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00/M&lt;/td&gt;
      &lt;td&gt;$3.00/M&lt;/td&gt;
      &lt;td&gt;0%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;$0.27/M&lt;/td&gt;
      &lt;td&gt;$0.30/M&lt;/td&gt;
      &lt;td&gt;+11%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Llama 3.1 70B&lt;/td&gt;
      &lt;td&gt;$0.13/M&lt;/td&gt;
      &lt;td&gt;$0.16/M&lt;/td&gt;
      &lt;td&gt;+23%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Qwen 3.5 397B&lt;/td&gt;
      &lt;td&gt;$0.40/M&lt;/td&gt;
      &lt;td&gt;$0.48/M&lt;/td&gt;
      &lt;td&gt;+20%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;The markup is smallest on premium models (where the provider’s price already includes healthy margin) and largest on cheap open-source models (where OpenRouter’s fixed costs are a bigger percentage).&lt;/p&gt;
&lt;p&gt;For an agent consuming 10M tokens/day on DeepSeek V3.2, the markup adds &lt;strong&gt;$9/month&lt;/strong&gt;. Not a lot. But on a team of 10 with multiple agents each, it adds up — and the per-token model itself is the real problem for agent workloads.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-alternatives&quot;&gt;The alternatives&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;together-ai&quot;&gt;Together AI&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Fastest open-source model inference.&lt;/p&gt;
&lt;p&gt;Together runs their own GPU clusters optimized for open-source models. No reselling — they serve the models directly. This means lower latency and often lower prices than OpenRouter for the same model.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;100+ models&lt;/li&gt;
&lt;li&gt;Own infrastructure (not reselling)&lt;/li&gt;
&lt;li&gt;Competitive pricing on open-source models&lt;/li&gt;
&lt;li&gt;Dedicated endpoints for production workloads&lt;/li&gt;
&lt;li&gt;Per-token pricing only&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Together doesn’t carry proprietary models (no Claude, no GPT). If you need Anthropic or OpenAI alongside open-source, you need a second integration.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;fireworks&quot;&gt;Fireworks&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Low-latency inference with custom model support.&lt;/p&gt;
&lt;p&gt;Fireworks focuses on speed. Their custom serving infrastructure delivers lower latency than most providers, especially for open-source models. They also support fine-tuned model deployment.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;50+ models&lt;/li&gt;
&lt;li&gt;Very low latency&lt;/li&gt;
&lt;li&gt;Fine-tuned model hosting&lt;/li&gt;
&lt;li&gt;Serverless and dedicated options&lt;/li&gt;
&lt;li&gt;Per-token pricing only&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Like Together, Fireworks doesn’t carry proprietary models natively.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;groq&quot;&gt;Groq&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Absolute lowest latency.&lt;/p&gt;
&lt;p&gt;Groq’s custom LPU hardware delivers the fastest inference in the market for supported models. If your use case is latency-sensitive (real-time chat, voice agents), Groq is hard to beat.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;15+ models (smaller catalog)&lt;/li&gt;
&lt;li&gt;Sub-second TTFT on most models&lt;/li&gt;
&lt;li&gt;Free tier available&lt;/li&gt;
&lt;li&gt;Per-token pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Limited model selection. No Claude, no GPT. But what they have is fast.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;cheapestinference&quot;&gt;CheapestInference&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Agent workloads and cost certainty.&lt;/p&gt;
&lt;p&gt;Full disclosure — this is us. Here’s what we do differently:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time-block subscriptions&lt;/strong&gt;: Reserve one or more daily 8-hour blocks on a model pool — Asia-Pacific ($39/mo), Europe ($49/mo), or Americas ($45/mo). Reserve all three for full 24/7 coverage. From $39/month, annual ~15% off. No per-token billing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlimited during your hours&lt;/strong&gt;: During your reserved block, requests are unlimited with no budget cap — two concurrent requests per key. Pay by card (Stripe) or USDC on Base.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A focused lineup&lt;/strong&gt;: Kimi K2.6, GLM 5.2, and MiniMax M3 — strong open-weight models through one endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;x402 pay-per-request&lt;/strong&gt;: No account needed — agents pay with USDC on Base L2 per request. Credit top-ups from $10 also available.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The trade-off: a small, curated model catalog instead of OpenRouter’s breadth, no proprietary models, and no automatic routing between providers.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;side-by-side-comparison&quot;&gt;Side-by-side comparison&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;OpenRouter&lt;/th&gt;
      &lt;th&gt;Together&lt;/th&gt;
      &lt;th&gt;Fireworks&lt;/th&gt;
      &lt;th&gt;Groq&lt;/th&gt;
      &lt;th&gt;CheapestInf.&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Models&lt;/td&gt;
      &lt;td&gt;200+&lt;/td&gt;
      &lt;td&gt;100+&lt;/td&gt;
      &lt;td&gt;50+&lt;/td&gt;
      &lt;td&gt;15+&lt;/td&gt;
      &lt;td&gt;3 (curated)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Proprietary models&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Pricing model&lt;/td&gt;
      &lt;td&gt;Per-token&lt;/td&gt;
      &lt;td&gt;Per-token&lt;/td&gt;
      &lt;td&gt;Per-token&lt;/td&gt;
      &lt;td&gt;Per-token&lt;/td&gt;
      &lt;td&gt;Time-block flat-rate&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Unlimited in reserved hours&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Auto routing&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;API format&lt;/td&gt;
      &lt;td&gt;OpenAI&lt;/td&gt;
      &lt;td&gt;OpenAI&lt;/td&gt;
      &lt;td&gt;OpenAI&lt;/td&gt;
      &lt;td&gt;OpenAI&lt;/td&gt;
      &lt;td&gt;OpenAI&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Every provider on this list is OpenAI-compatible. Switching between them is a &lt;code dir=&quot;auto&quot;&gt;base_url&lt;/code&gt; change.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;cost-comparison-for-real-workloads&quot;&gt;Cost comparison for real workloads&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;light-usage-chatbot-3m-tokensmonth&quot;&gt;Light usage (chatbot, ~3M tokens/month)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;OpenRouter&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$4.20/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Together AI&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$3.60/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;CheapestInference&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;from $39/mo&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;At low volume, per-token wins. A time-block subscription only pays off once your per-token spend during those hours would exceed the block price.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;heavy-usage-agents-300m-tokensmonth&quot;&gt;Heavy usage (agents, ~300M tokens/month)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;OpenRouter&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$420/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Together AI&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$360/mo&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;CheapestInference&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;from $39/mo&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;At agent-scale volume, a time-block subscription is dramatically cheaper. The gap grows with usage because per-token scales linearly and a reserved block is unlimited — it doesn’t scale at all.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;when-to-use-what&quot;&gt;When to use what&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Stay on OpenRouter if&lt;/strong&gt;: You need access to 200+ models, use auto-routing, and your monthly spend is under $50. The convenience premium is worth it at this scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch to Together/Fireworks if&lt;/strong&gt;: You only use open-source models, care about latency, and want to avoid the reseller markup. Together and Fireworks serve models directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch to CheapestInference if&lt;/strong&gt;: You run agents during predictable hours, want cost certainty, and the curated open-weight lineup (Kimi K2.6, GLM 5.2, MiniMax M3) covers your needs. Unlimited inference during a reserved time block beats per-token billing once your usage in those hours is heavy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Groq if&lt;/strong&gt;: Latency is your primary constraint and your model is in their catalog.&lt;/p&gt;
&lt;p&gt;All five are OpenAI-compatible. Try each one with a &lt;code dir=&quot;auto&quot;&gt;base_url&lt;/code&gt; swap and see which fits.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves a curated open-weight lineup — Kimi K2.6, GLM 5.2, MiniMax M3 — through one OpenAI- and Anthropic-compatible API. Unlimited time-block subscriptions from $39/month. &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;See the pools&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;get started&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>Self-hosted vs. API inference: the real cost comparison</title><link>https://cheapestinference.com/blog/self-hosted-vs-api-inference/</link><guid isPermaLink="true">https://cheapestinference.com/blog/self-hosted-vs-api-inference/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;“Why pay for an API when I can run the model myself?”&lt;/p&gt;
&lt;p&gt;It’s a reasonable question. Open-source models are free. GPUs are available on every cloud. vLLM and Ollama make serving straightforward. The math should be simple: GPU cost per hour × hours = total cost. Done.&lt;/p&gt;
&lt;p&gt;Except it’s not. The GPU is the minority of the cost. Here’s the full picture.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-visible-costs&quot;&gt;The visible costs&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;gpu-hardware&quot;&gt;GPU hardware&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Running DeepSeek V3.2 (671B MoE, ~130B active parameters) requires at least 4× A100 80GB or 2× H100 80GB in FP8. Qwen 3.5 397B has similar requirements.&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Setup&lt;/th&gt;
      &lt;th&gt;Hourly&lt;/th&gt;
      &lt;th&gt;Monthly (24/7)&lt;/th&gt;
      &lt;th&gt;Monthly (8h/day)&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;4× A100 80GB (cloud)&lt;/td&gt;
      &lt;td&gt;$12.80&lt;/td&gt;
      &lt;td&gt;$9,216&lt;/td&gt;
      &lt;td&gt;$2,816&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2× H100 80GB (cloud)&lt;/td&gt;
      &lt;td&gt;$8.40&lt;/td&gt;
      &lt;td&gt;$6,048&lt;/td&gt;
      &lt;td&gt;$1,848&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1× A100 80GB (Llama 70B)&lt;/td&gt;
      &lt;td&gt;$3.20&lt;/td&gt;
      &lt;td&gt;$2,304&lt;/td&gt;
      &lt;td&gt;$704&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1× L40S (Llama 8B)&lt;/td&gt;
      &lt;td&gt;$1.10&lt;/td&gt;
      &lt;td&gt;$792&lt;/td&gt;
      &lt;td&gt;$242&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;These are cloud GPU rental prices (AWS, GCP, Lambda Labs — varies by provider and availability). If you buy hardware, the upfront cost is $15K–$40K per GPU, amortized over 3–4 years, plus electricity, cooling, and data center costs.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;smaller-models-are-cheaper--but-limited&quot;&gt;Smaller models are cheaper — but limited&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Running Llama 3.1 8B on a single L40S costs $242/month (8h/day). That’s competitive with API pricing. But 8B models can’t handle complex coding, multi-step reasoning, or nuanced analysis — the tasks where AI provides the most value.&lt;/p&gt;
&lt;p&gt;The models worth self-hosting (70B+, MoE) require multi-GPU setups where the economics change dramatically.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-invisible-costs&quot;&gt;The invisible costs&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;GPU rental is just the beginning.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;1-operations-and-maintenance&quot;&gt;1. Operations and maintenance&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Someone has to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up vLLM/TGI with optimal batch sizes, quantization, and memory allocation&lt;/li&gt;
&lt;li&gt;Monitor GPU utilization and restart crashed processes&lt;/li&gt;
&lt;li&gt;Update model weights when new versions release&lt;/li&gt;
&lt;li&gt;Handle OOM errors, NCCL failures, and driver issues&lt;/li&gt;
&lt;li&gt;Manage the serving infrastructure (load balancer, health checks, auto-scaling)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If this is a full-time DevOps engineer at $150K/year, that’s &lt;strong&gt;$12,500/month&lt;/strong&gt; in labor. If it’s 20% of a senior engineer’s time, it’s &lt;strong&gt;$2,500/month&lt;/strong&gt;. Either way, it’s more than the GPU.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;2-idle-capacity&quot;&gt;2. Idle capacity&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;GPUs cost money whether they’re inferring or not. If your usage pattern is 8 hours of heavy use (work hours) and 16 hours of near-zero traffic, you’re paying for 24 hours and using 8.&lt;/p&gt;
&lt;p&gt;Cloud spot instances help but introduce availability risk. Auto-scaling GPU clusters is possible but complex — model loading takes minutes, not seconds.&lt;/p&gt;
&lt;p&gt;API pricing is purely usage-based. Zero requests = zero cost.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;3-multi-model-overhead&quot;&gt;3. Multi-model overhead&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Self-hosting one model is manageable. Self-hosting five models for different tasks — a coding model, a reasoning model, a fast classification model, an embedding model, and a vision model — requires either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;5 separate GPU instances (expensive)&lt;/li&gt;
&lt;li&gt;Shared GPU with model swapping (slow — loading a 70B model takes 2–5 minutes)&lt;/li&gt;
&lt;li&gt;A serving framework that handles multi-model routing (complex)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An API gives you access to many models through the same endpoint. No model loading, no GPU allocation, no routing logic.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;4-opportunity-cost&quot;&gt;4. Opportunity cost&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Every hour your team spends on inference infrastructure is an hour not spent on your actual product. For startups, this is the most expensive cost of all — it doesn’t show up on any invoice.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;total-cost-of-ownership&quot;&gt;Total cost of ownership&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;For a team of 5 developers running AI-assisted coding with a mix of DeepSeek V3.2 and smaller models:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Cost&lt;/th&gt;
      &lt;th&gt;Self-hosted&lt;/th&gt;
      &lt;th&gt;API (per-token)&lt;/th&gt;
      &lt;th&gt;API (time-block sub)&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Compute/inference&lt;/td&gt;
      &lt;td&gt;$2,800&lt;/td&gt;
      &lt;td&gt;$265&lt;/td&gt;
      &lt;td&gt;$250&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Ops/maintenance&lt;/td&gt;
      &lt;td&gt;$2,500&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Idle waste (~60%)&lt;/td&gt;
      &lt;td&gt;$1,680&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Total monthly&lt;/td&gt;
      &lt;td&gt;$6,980&lt;/td&gt;
      &lt;td&gt;$265&lt;/td&gt;
      &lt;td&gt;$250&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Self-hosting costs 26x more for the same workload. The GPU is only 40% of the self-hosted cost — ops and idle waste are the majority.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;when-self-hosting-makes-sense&quot;&gt;When self-hosting makes sense&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Self-hosting wins in specific scenarios:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data sovereignty&lt;/strong&gt;: If your data cannot leave your network — regulated industries, government, healthcare with strict compliance — self-hosting is the only option. No API provider can guarantee the data isolation you need.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Extreme scale&lt;/strong&gt;: If you’re processing millions of requests per day and your GPUs are consistently at 80%+ utilization, the per-token math eventually favors owned hardware. This threshold is higher than most teams expect — typically $20K+/month in API spend before self-hosting breaks even.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom models&lt;/strong&gt;: If you’ve fine-tuned a model and need to serve it, self-hosting or a dedicated inference provider (Fireworks, Together) is required. Most unified APIs don’t serve custom model weights.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency control&lt;/strong&gt;: If you need guaranteed sub-100ms TTFT and your data center is co-located with your GPUs, self-hosting eliminates network hops.&lt;/p&gt;
&lt;p&gt;For everyone else — startups, small teams, companies with variable usage patterns — the API is cheaper, faster to set up, and easier to maintain.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-migration-path&quot;&gt;The migration path&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Most teams don’t need to choose one forever. A practical approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with an API&lt;/strong&gt;: Get your product working, validate demand, understand your usage patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize model selection&lt;/strong&gt;: Use cheaper models for simple tasks, frontier models for hard tasks. Full guide: &lt;a href=&quot;https://cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Multi-model architecture&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluate self-hosting when&lt;/strong&gt;: Your monthly API spend exceeds $10K, your GPU utilization would be &gt;70%, and you have DevOps capacity to maintain it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;: Self-host your high-volume models, use an API for long-tail models and overflow capacity.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The worst outcome is spending 3 months setting up GPU infrastructure before you’ve validated that anyone wants your product.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves frontier open-weight models — Kimi K2.6, GLM 5.2, and MiniMax M3 — through a single API. No GPUs to manage, no idle costs, no ops burden. Reserve a daily 8-hour time block for unlimited usage from $39/mo (reserve all three for full 24/7), or pay as you go with credits from $10. &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;Get started&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;see the pools&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>The real cost of running AI agents in production</title><link>https://cheapestinference.com/blog/ai-agent-inference-costs/</link><guid isPermaLink="true">https://cheapestinference.com/blog/ai-agent-inference-costs/</guid><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Chatbots are cheap. Agents are not.&lt;/p&gt;
&lt;p&gt;A chatbot sends a user message, gets a response, displays it. Maybe 2,000 tokens per exchange. An agent reads files, calls tools, retries on errors, re-sends the entire conversation every step, and does this 20–60 times per task. Same API, completely different economics.&lt;/p&gt;
&lt;p&gt;If you’re budgeting for AI agents the same way you budget for a chatbot, you’re underestimating by 10–50x.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;token-consumption-chatbot-vs-agent&quot;&gt;Token consumption: chatbot vs. agent&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;We measured token consumption across three workload types, each running for one hour:&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Coding agent (OpenClaw)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~2.1M tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Research agent (CrewAI)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~1.2M tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;RAG chatbot&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~200K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Simple chatbot&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~40K tokens&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The coding agent consumed 52x more tokens than a simple chatbot in the same time period. And this is &lt;em&gt;normal&lt;/em&gt; — the agent was doing useful work the entire time.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;why-agents-cost-so-much&quot;&gt;Why agents cost so much&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Three architectural properties of agents make them expensive:&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;1-context-accumulation&quot;&gt;1. Context accumulation&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Every agent step appends tool outputs to the conversation. The LLM re-processes the entire conversation on each step. If the agent reads a 3,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to the end.&lt;/p&gt;
&lt;p&gt;For a 40-step task, one file read costs: 3,000 tokens × 35 remaining steps = &lt;strong&gt;105,000 tokens&lt;/strong&gt; in re-transmission.&lt;/p&gt;
&lt;p&gt;This is why agent token consumption grows quadratically, not linearly.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;2-system-prompt-overhead&quot;&gt;2. System prompt overhead&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Agent frameworks use large system prompts — OpenClaw’s is ~9,600 tokens, CrewAI’s varies by agent configuration. This prompt is sent with every request. Over 40 steps, the system prompt alone costs 384,000 tokens.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;3-error-retry-loops&quot;&gt;3. Error retry loops&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When a tool call fails, the agent retries. Each retry sends the full context plus the error message. Three retries on a 30K-token context wastes 90K tokens with no productive output.&lt;/p&gt;
&lt;p&gt;Without a retry cap, this can run indefinitely — always bound agents with a retry cap and a maximum iteration count.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;monthly-cost-by-model-and-framework&quot;&gt;Monthly cost by model and framework&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Assuming one developer running 15 agent tasks per day, 22 working days per month, ~500K tokens per task:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Cost/task&lt;/th&gt;
      &lt;th&gt;Daily (×15)&lt;/th&gt;
      &lt;th&gt;Monthly&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$9.18&lt;/td&gt;
      &lt;td&gt;$137.70&lt;/td&gt;
      &lt;td&gt;$3,029&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$2.25&lt;/td&gt;
      &lt;td&gt;$33.75&lt;/td&gt;
      &lt;td&gt;$743&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.4&lt;/td&gt;
      &lt;td&gt;$4.73&lt;/td&gt;
      &lt;td&gt;$70.95&lt;/td&gt;
      &lt;td&gt;$1,561&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;$0.16&lt;/td&gt;
      &lt;td&gt;$2.40&lt;/td&gt;
      &lt;td&gt;$53&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Qwen 3.5 35B&lt;/td&gt;
      &lt;td&gt;$0.04&lt;/td&gt;
      &lt;td&gt;$0.60&lt;/td&gt;
      &lt;td&gt;$13&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CheapestInference (full day)&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;from $39 flat&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;A team of 5 developers each running 15 tasks/day on Claude Opus spends &lt;strong&gt;$15,145/month&lt;/strong&gt;. The same team on flat-rate via CheapestInference pays a fixed monthly subscription per seat (from $39 for a reserved daily time block) — no matter how many tokens those agents burn. That’s an order-of-magnitude reduction.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;four-strategies-to-cut-agent-inference-costs&quot;&gt;Four strategies to cut agent inference costs&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;1-switch-to-open-source-models&quot;&gt;1. Switch to open-source models&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;DeepSeek V3.2 and Qwen 3.5 score within 4 points of GPT-5.4 and Opus on most benchmarks. For coding tasks specifically, DeepSeek V3.2 matches Opus on HumanEval and SWE-bench. Full data: &lt;a href=&quot;https://cheapestinference.com/blog/open-source-models-are-production-ready/&quot;&gt;Open-source models are production-ready&lt;/a&gt;.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;2-route-by-task-complexity&quot;&gt;2. Route by task complexity&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Not every agent step needs a frontier model. File reads, simple classifications, and formatting don’t need 685B parameters. Use a small model for easy steps and a large model for hard ones. Full guide: &lt;a href=&quot;https://cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Building a multi-model architecture&lt;/a&gt;.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;3-give-each-agent-its-own-key&quot;&gt;3. Give each agent its own key&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Give each agent its own API key so one runaway agent can’t starve the others. On a time-block subscription each key gets unlimited usage during its reserved hours, so you isolate workloads without juggling per-token allocations.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;4-use-flat-rate-pricing&quot;&gt;4. Use flat-rate pricing&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Per-token pricing penalizes the exact patterns agents use: large contexts, many steps, retries. Flat-rate pricing makes all of that free. During your reserved time blocks your agent can use the full context window and retry freely without increasing the bill — reserve all three blocks for 24/7 coverage.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-math-that-matters&quot;&gt;The math that matters&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Here’s the equation most teams miss:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Agent cost = tokens_per_step × steps × cost_per_token&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Most optimization focuses on &lt;code dir=&quot;auto&quot;&gt;cost_per_token&lt;/code&gt; — switching to a cheaper model. But &lt;code dir=&quot;auto&quot;&gt;tokens_per_step&lt;/code&gt; grows with context (quadratic), and &lt;code dir=&quot;auto&quot;&gt;steps&lt;/code&gt; is unpredictable. Optimizing only one variable leaves the other two working against you.&lt;/p&gt;
&lt;p&gt;Flat-rate pricing eliminates all three variables from your bill. The cost is the subscription. Period.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We serve Kimi K2.6, GLM 5.2, and MiniMax M3 with flat-rate, unlimited time-block subscriptions — no token counting, no budget caps during your reserved hours. Reserve 1–3 daily 8-hour blocks from $39/month and your agent’s token consumption never becomes your problem. &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;Get started&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;see plans&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price</title><link>https://cheapestinference.com/blog/qwen-3-5-vs-gpt-claude/</link><guid isPermaLink="true">https://cheapestinference.com/blog/qwen-3-5-vs-gpt-claude/</guid><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You asked for this. After our &lt;a href=&quot;https://cheapestinference.com/blog/open-source-models-are-production-ready/&quot;&gt;first benchmark post&lt;/a&gt;, the most requested model was Qwen 3.5. Here it is — &lt;strong&gt;4 models across 5 metrics&lt;/strong&gt;, same models in every chart:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-source:&lt;/strong&gt; Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
&lt;strong&gt;Proprietary:&lt;/strong&gt; GPT-5.4, Claude Opus 4.6&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;knowledge-mmlu-pro&quot;&gt;Knowledge: MMLU-Pro (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.5%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;87.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;85.3%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;82.0%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;GPT-5.4 leads at 88.5%, but Qwen3.5-397B is 0.7 points behind — statistically noise. The 35B with only 3B active parameters scores 85.3%, beating Opus by 3 points. The total spread across all four models is just 6.5 points.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Qwen3.5-397B matches GPT-5.4 at 5x less cost. The 35B beats Opus at 23x less.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;reasoning-gpqa-diamond&quot;&gt;Reasoning: GPQA Diamond (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;92.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;91.3%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.4%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.2%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-livecodebench-v6&quot;&gt;Code: LiveCodeBench v6 (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;83.6%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;76.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;74.6%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 397B essentially ties GPT-5.4 on competitive coding — 0.4 points apart. Both beat Opus by 8+ points. The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.&lt;/p&gt;
&lt;p&gt;For dedicated coding workloads, the open ecosystem also offers Qwen3-Coder-480B (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;speed-output-tokens-per-second&quot;&gt;Speed: output tokens per second&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;178 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~78 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;46 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 35B’s MoE architecture pays off — 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3-17B parameters activate per token instead of the full model.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Speed data from &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis&lt;/a&gt;. Actual speeds on our infrastructure may differ.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;price-input-cost-per-million-tokens&quot;&gt;Price: input cost per million tokens&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$0.22&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$0.54&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$2.50&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$5.00&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is the chart that matters. Opus costs &lt;strong&gt;23x more&lt;/strong&gt; than the 35B and &lt;strong&gt;9x more&lt;/strong&gt; than the 397B. GPT-5.4 costs &lt;strong&gt;5x more&lt;/strong&gt; than the 397B. The quality difference? Single-digit percentage points on every benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-picture&quot;&gt;The full picture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
&lt;svg viewBox=&quot;-40 0 480 400&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot;&gt;
  &lt;!-- Grid rings --&gt;
  &lt;polygon points=&quot;200,120 270,190 200,260 130,190&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;polygon points=&quot;200,50 340,190 200,330 60,190&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Axes --&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;50&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;340&quot; y2=&quot;190&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;330&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;60&quot; y2=&quot;190&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;!-- GPT-5.4 — gray fill for reference --&gt;
  &lt;polygon points=&quot;200,57 312,190 200,312.5 145.4,190&quot; fill=&quot;#9A9490&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#9A9490&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Claude Opus 4.6 — dashed --&gt;
  &lt;polygon points=&quot;200,113 305.4,190 200,236.6 167.8,190&quot; fill=&quot;none&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;6 3&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Qwen3.5-397B — indigo --&gt;
  &lt;polygon points=&quot;200,59.8 278.4,190 200,304.4 141.2,190&quot; fill=&quot;#6366F1&quot; fill-opacity=&quot;0.12&quot; stroke=&quot;#6366F1&quot; stroke-width=&quot;2.5&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Qwen3.5-35B — teal --&gt;
  &lt;polygon points=&quot;200,122.8 239.2,190 200,275.1 75.4,190&quot; fill=&quot;#14B8A6&quot; fill-opacity=&quot;0.1&quot; stroke=&quot;#14B8A6&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Data points - 397B --&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;59.8&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;278.4&quot; cy=&quot;190&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;304.4&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;141.2&quot; cy=&quot;190&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;!-- Labels --&gt;
  &lt;text x=&quot;200&quot; y=&quot;30&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Code&lt;/text&gt;
  &lt;text x=&quot;355&quot; y=&quot;194&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Reasoning&lt;/text&gt;
  &lt;text x=&quot;200&quot; y=&quot;355&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Knowledge&lt;/text&gt;
  &lt;text x=&quot;45&quot; y=&quot;194&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Speed&lt;/text&gt;
  &lt;!-- Legend --&gt;
  &lt;rect x=&quot;40&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#6366F1&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;58&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Qwen3.5 397B&lt;/text&gt;
  &lt;rect x=&quot;145&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#14B8A6&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;163&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Qwen3.5 35B&lt;/text&gt;
  &lt;rect x=&quot;230&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#9A9490&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;248&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;GPT-5.4&lt;/text&gt;
  &lt;line x1=&quot;310&quot; y1=&quot;371&quot; x2=&quot;324&quot; y2=&quot;371&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;4 2&quot;&gt;&lt;/line&gt;
  &lt;text x=&quot;328&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Opus 4.6&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;
&lt;p&gt;Quality only — no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed — 178 tok/s is 2.3x faster than anything else here. Price tells its own story in the chart above.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;the-scorecard&quot;&gt;The scorecard&lt;/h2&gt;&lt;/div&gt;





















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Winner&lt;/th&gt;&lt;th&gt;Qwen3.5 397B&lt;/th&gt;&lt;th&gt;GPT-5.4&lt;/th&gt;&lt;th&gt;Claude Opus 4.6&lt;/th&gt;&lt;th&gt;Gap (397B vs best)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Knowledge&lt;/strong&gt; (MMLU-Pro)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;87.8%&lt;/td&gt;&lt;td&gt;88.5%&lt;/td&gt;&lt;td&gt;82.0%&lt;/td&gt;&lt;td&gt;-0.7 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt; (GPQA)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;88.4%&lt;/td&gt;&lt;td&gt;92.0%&lt;/td&gt;&lt;td&gt;91.3%&lt;/td&gt;&lt;td&gt;-3.6 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code&lt;/strong&gt; (LiveCodeBench)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;83.6%&lt;/td&gt;&lt;td&gt;84.0%&lt;/td&gt;&lt;td&gt;76.0%&lt;/td&gt;&lt;td&gt;-0.4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt; (tok/s)&lt;/td&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;84 t/s&lt;/td&gt;&lt;td&gt;~78 t/s&lt;/td&gt;&lt;td&gt;46 t/s&lt;/td&gt;&lt;td&gt;1.1x faster&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt; ($/M input)&lt;/td&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;$0.54&lt;/td&gt;&lt;td&gt;$2.50&lt;/td&gt;&lt;td&gt;$5.00&lt;/td&gt;&lt;td&gt;4.6x cheaper&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Same weight class, different price tag.&lt;/strong&gt; The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-real-question-what-are-you-paying-for&quot;&gt;The real question: what are you paying for?&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The quality gap between Qwen3.5-397B and GPT-5.4 is &lt;strong&gt;0.7 points on knowledge, 0.4 points on code&lt;/strong&gt;. The price gap is &lt;strong&gt;4.6x&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Put it differently:&lt;/p&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;MMLU-Pro&lt;/th&gt;&lt;th&gt;Cost per quality point&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Qwen3.5 35B&lt;/td&gt;&lt;td&gt;85.3%&lt;/td&gt;&lt;td&gt;$0.003 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;87.8%&lt;/td&gt;&lt;td&gt;$0.006 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;88.5%&lt;/td&gt;&lt;td&gt;$0.028 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Claude Opus 4.6&lt;/td&gt;&lt;td&gt;82.0%&lt;/td&gt;&lt;td&gt;$0.061 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Opus costs &lt;strong&gt;20x more per quality point&lt;/strong&gt; than the 35B — and scores lower. GPT-5.4 leads on quality but costs 5-10x more for single-digit advantages.&lt;/p&gt;
&lt;p&gt;For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is — the 397B gets you within 1 point of GPT-5.4 at a fraction of the cost.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;also-worth-knowing-specialized-qwen-models&quot;&gt;Also worth knowing: specialized Qwen models&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Beyond the general-purpose models, the Qwen family includes two notable specialists:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-Coder-480B&lt;/strong&gt; — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-235B-Thinking&lt;/strong&gt; — Chain-of-thought reasoning specialist. When you need the model to show its work.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;CheapestInference serves frontier open-weight models — Kimi K2.6, GLM 5.2, and MiniMax M3 — on unlimited time-block subscriptions from $39/mo, or pay-as-you-go credits from $10. &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;See plans and try it →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3.5-397B-A17B&quot;&gt;Qwen3.5-397B Model Card&lt;/a&gt; · &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3.5-35B-A3B&quot;&gt;Qwen3.5-35B Model Card&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/gpqa-diamond&quot;&gt;GPQA Diamond Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://openai.com/api/pricing/&quot;&gt;OpenAI Pricing&lt;/a&gt; · &lt;a href=&quot;https://platform.claude.com/docs/en/about-claude/pricing&quot;&gt;Anthropic Pricing&lt;/a&gt; · &lt;a href=&quot;https://livecodebench.github.io/leaderboard.html&quot;&gt;LiveCodeBench Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>OpenClaw is free. Running it is not.</title><link>https://cheapestinference.com/blog/openclaw-cost-problem/</link><guid isPermaLink="true">https://cheapestinference.com/blog/openclaw-cost-problem/</guid><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.&lt;/p&gt;
&lt;p&gt;The agent is free. The inference is not.&lt;/p&gt;
&lt;p&gt;Every time OpenClaw calls a model, it re-sends the entire conversation history — every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends this &lt;em&gt;every single request&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;where-the-tokens-go&quot;&gt;Where the tokens go&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Context accumulation&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~280K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;System prompt (×38)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~156K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Tool outputs (files, etc.)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~70K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Agent output&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~19K tokens&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Total: &lt;strong&gt;~525,000 tokens for a single task&lt;/strong&gt;. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.&lt;/p&gt;
&lt;p&gt;On Claude Opus at $15/M input + $75/M output, that single task costs &lt;strong&gt;$9.18&lt;/strong&gt;. Run five tasks a day and you’re at &lt;strong&gt;$1,377/month&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs &lt;strong&gt;$0.16&lt;/strong&gt;. Better — but 20 tasks a day is still &lt;strong&gt;$96/month&lt;/strong&gt;, and that’s &lt;em&gt;one agent&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-three-cost-traps&quot;&gt;The three cost traps&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Here’s the OpenClaw-specific version:&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;1-context-grows-quadratically&quot;&gt;1. Context grows quadratically&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = &lt;strong&gt;66,000 tokens&lt;/strong&gt; in re-transmission alone.&lt;/p&gt;
&lt;p&gt;Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;2-system-prompt-is-a-fixed-tax&quot;&gt;2. System prompt is a fixed tax&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw’s system prompt is ~9,600 tokens. It gets sent with every request. Over 38 tool calls, that’s 365K tokens just in system prompts. You pay this whether the agent does useful work or not.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;3-wrong-model-for-the-job&quot;&gt;3. Wrong model for the job&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading a file and deciding what to edit? &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; handles this at 200 tokens/sec.&lt;/li&gt;
&lt;li&gt;Writing complex authentication logic? A frontier open-weight model like &lt;strong&gt;Kimi K2.6&lt;/strong&gt; is the right call.&lt;/li&gt;
&lt;li&gt;Formatting a config file? &lt;strong&gt;Any 8B model&lt;/strong&gt; is overkill but still cheaper than Opus.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We wrote a full guide on this pattern: &lt;a href=&quot;https://cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Building a multi-model architecture&lt;/a&gt;. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-math-on-flat-rate-vs-pay-per-token&quot;&gt;The math on flat-rate vs. pay-per-token&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Here’s the comparison for an OpenClaw user running ~20 tasks/day:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Provider&lt;/th&gt;
      &lt;th&gt;Cost/task&lt;/th&gt;
      &lt;th&gt;20 tasks/day&lt;/th&gt;
      &lt;th&gt;Monthly&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus (direct)&lt;/td&gt;
      &lt;td&gt;$9.18&lt;/td&gt;
      &lt;td&gt;$183.60&lt;/td&gt;
      &lt;td&gt;$5,508&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.4 (direct)&lt;/td&gt;
      &lt;td&gt;$4.73&lt;/td&gt;
      &lt;td&gt;$94.60&lt;/td&gt;
      &lt;td&gt;$2,838&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2 (per-token)&lt;/td&gt;
      &lt;td&gt;$0.16&lt;/td&gt;
      &lt;td&gt;$3.20&lt;/td&gt;
      &lt;td&gt;$96&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CheapestInference&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;from $39/mo&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that makes pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-wed-actually-recommend&quot;&gt;What we’d actually recommend&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;If you’re running OpenClaw, here’s the setup we see working best:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Use open-weight models.&lt;/strong&gt; Frontier open-weight models like Kimi K2.6 and GLM 5.2 score within a few points of proprietary models on coding benchmarks (&lt;a href=&quot;https://cheapestinference.com/blog/open-source-models-are-production-ready/&quot;&gt;the data&lt;/a&gt;). The gap doesn’t justify a 50x cost difference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Route by complexity.&lt;/strong&gt; Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: &lt;a href=&quot;https://cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Multi-model architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Reserve the hours you work.&lt;/strong&gt; On CheapestInference you reserve one or more daily 8-hour time blocks (Asia-Pacific, Europe, Americas — pick 1–3, all three is full 24/7). During your reserved hours inference is unlimited with no budget cap. One API key per agent, two concurrent requests per key. Outside your window, requests return 429 until your block opens again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Handle rate limits automatically.&lt;/strong&gt; Time blocks mean your agent &lt;em&gt;will&lt;/em&gt; hit 429s outside your reserved window — that’s expected. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.&lt;/p&gt;
&lt;p&gt;We built an OpenClaw plugin that fixes this: &lt;a href=&quot;https://github.com/cheapestinference/openclaw-plugin-ratelimit-retry&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;openclaw-ratelimit-retry&lt;/code&gt;&lt;/a&gt;. It hooks into &lt;code dir=&quot;auto&quot;&gt;agent_end&lt;/code&gt;, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it sends &lt;code dir=&quot;auto&quot;&gt;chat.send&lt;/code&gt; to the original session — resuming the conversation with its full transcript, as if you had typed a message.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;openclaw&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;plugins&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;install&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;@cheapestinference/openclaw-ratelimit-retry&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;~/.openclaw/config.yaml&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;plugins&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;ratelimit-retry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;budgetWindowHours&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;8&lt;/span&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# matches your CheapestInference 8-hour time block&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;maxRetryAttempts&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;3&lt;/span&gt;&lt;span&gt;     &lt;/span&gt;&lt;span&gt;# give up after 3 consecutive 429s&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;checkIntervalMinutes&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;5&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;# check every 5 min for ready retries&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.&lt;/p&gt;
&lt;p&gt;This turns budget caps from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Consider unlimited time blocks.&lt;/strong&gt; If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. With an unlimited time-block subscription, context overhead is free during your reserved hours — re-send the full window, let the agent work without a budget cap.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-irony&quot;&gt;The irony&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.&lt;/p&gt;
&lt;p&gt;Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.&lt;/p&gt;
&lt;p&gt;Point your OpenClaw &lt;code dir=&quot;auto&quot;&gt;base_url&lt;/code&gt; at &lt;code dir=&quot;auto&quot;&gt;https://api.cheapestinference.com/v1&lt;/code&gt; and find out what unconstrained agents actually cost: nothing more than you already budgeted.&lt;/p&gt;</content:encoded></item><item><title>Building a multi-model architecture: route requests to the right LLM</title><link>https://cheapestinference.com/blog/multi-model-architecture/</link><guid isPermaLink="true">https://cheapestinference.com/blog/multi-model-architecture/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.&lt;/p&gt;
&lt;p&gt;This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-problem-with-single-model-architectures&quot;&gt;The problem with single-model architectures&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Most applications start with one model:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;User request --&gt; Large Model --&gt; Response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Llama 3.1 8B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~200 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-multi-model-architecture&quot;&gt;The multi-model architecture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;User request --&gt; Router (Llama 8B) --&gt; classify intent&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                                          &lt;/span&gt;&lt;/span&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;+-----------+-----------+-----------+&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;|           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;               &lt;/span&gt;&lt;/span&gt;&lt;span&gt;simple      general    reasoning     code&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;|           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Llama 3.1 8B  DeepSeek   DeepSeek R1   Qwen3&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;V3.2                    Coder&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;|           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;+-----+-----+-----+-----+&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                     &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Two stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Classify&lt;/strong&gt; — The router model reads the user’s message and outputs a category. A fast model returns this in a fraction of a second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route&lt;/strong&gt; — Based on the category, forward the request to the appropriate specialist model.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-1-classify-with-a-fast-model&quot;&gt;Step 1: Classify with a fast model&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A fast, lightweight model makes a good router. With low TTFT and a short, single-word output, the classification step costs almost nothing and completes before the user notices.&lt;/p&gt;
&lt;p&gt;The classification prompt is simple — you want a single-word category, not a conversation:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;your-api-key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Classify a user message into a routing category.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;system&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: (&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Classify the user&apos;s message into exactly one category. &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Respond with only the category name, nothing else.&lt;/span&gt;&lt;span&gt;\n\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Categories:&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- simple: greetings, FAQ, simple factual questions&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- general: complex questions, analysis, writing, summarization&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- reasoning: math, logic, multi-step problems, science&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- code: code generation, debugging, refactoring, technical implementation&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- agent: tasks requiring tool use, web search, or multi-step execution&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;},&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;],&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;max_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;10&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;temperature&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content.&lt;/span&gt;&lt;span&gt;strip&lt;/span&gt;&lt;span&gt;().&lt;/span&gt;&lt;span&gt;lower&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Default to general if classification is unclear&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;valid &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; category &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; category &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; valid &lt;/span&gt;&lt;span&gt;else&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The key details: &lt;code dir=&quot;auto&quot;&gt;max_tokens=10&lt;/code&gt; because we only need one word. &lt;code dir=&quot;auto&quot;&gt;temperature=0&lt;/code&gt; for deterministic routing. The system prompt is explicit about format — no preamble, just the category.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-2-route-to-the-specialist&quot;&gt;Step 2: Route to the specialist&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Each category maps to a model optimized for that task:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Model routing table&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:   &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;MiniMax-M3&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:      &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;MiniMax-M3&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:     &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;kimi-k2.6&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;route_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;conversation_history&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;list&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Classify and route a request to the appropriate model.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt;[category]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;],&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;stream&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;True&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Stream the response back&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;full_response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; chunk &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; response:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; chunk.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].delta.content:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;content &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; chunk.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].delta.content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;full_response &lt;/span&gt;&lt;span&gt;+=&lt;/span&gt;&lt;span&gt; content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;flush&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;True&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; full_response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Notice that simple requests route back to GLM 5.2 — the same model that did the classification. For simple queries, the router overhead is effectively zero because the specialist is the same model and can reuse the warm connection.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-3-handle-edge-cases&quot;&gt;Step 3: Handle edge cases&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The basic router works for most traffic, but production systems need a few refinements:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;route_request_production&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;conversation_history&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;list&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;force_model&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;None&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; tuple[&lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;]:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Production router with overrides and fallback.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Allow explicit model override (for power users or testing)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; force_model:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; force_model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;override&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;else&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt;[category]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;try&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content, category&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;except&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;Exception&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;# Fallback to GLM 5.2 if the specialist is unavailable&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;fallback &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;fallback&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content, &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;category&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;-&gt;fallback&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Three patterns worth noting:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Force model&lt;/strong&gt; — Let callers bypass routing when they know what they need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback&lt;/strong&gt; — If a specialist model is down, fall back to GLM 5.2. It handles everything reasonably well.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Return the category&lt;/strong&gt; — Log which route each request takes. You’ll need this data to tune the system.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;cost-and-latency-comparison&quot;&gt;Cost and latency comparison&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;single-model-approach-everything-on-v32&quot;&gt;Single-model approach (everything on V3.2)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Avg latency&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~4.5s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;All 1000 reqs&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;V3.2 only&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;multi-model-approach-routed&quot;&gt;Multi-model approach (routed)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Simple (600)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~1.2s (8B)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;General (300)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~4.7s (V3.2)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Reasoning (70)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~9.0s (R1)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Code (30)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~3.5s (Coder)&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The weighted average latency drops to approximately &lt;strong&gt;2.7s&lt;/strong&gt; — a 40% reduction. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.&lt;/p&gt;
&lt;p&gt;The 70 reasoning requests are &lt;em&gt;slower&lt;/em&gt; individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You get faster averages &lt;em&gt;and&lt;/em&gt; better quality on the hard tail.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;real-example-a-support-chatbot&quot;&gt;Real example: a support chatbot&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A customer support chatbot receives three types of requests:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;FAQ&lt;/strong&gt; (60%) — “What are your business hours?” / “How do I reset my password?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex support&lt;/strong&gt; (30%) — “I was charged twice for order #12345, can you investigate?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical issues&lt;/strong&gt; (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;&lt;h3 id=&quot;without-routing&quot;&gt;Without routing&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;with-routing&quot;&gt;With routing&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;SUPPORT_ROUTES&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,       &lt;/span&gt;&lt;span&gt;# FAQ, greetings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:   &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,       &lt;/span&gt;&lt;span&gt;# Complex support&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,       &lt;/span&gt;&lt;span&gt;# Investigations&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:      &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;glm-5.2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,       &lt;/span&gt;&lt;span&gt;# Technical issues&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:     &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;kimi-k2.6&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,     &lt;/span&gt;&lt;span&gt;# Multi-step resolution&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;FAQs resolve quickly via GLM 5.2. Complex support issues get GLM 5.2’s full analytical capability. Technical problems also route to GLM 5.2, which understands the code context well. If a support issue requires looking up order data via API, it routes to Kimi K2.6 for tool-assisted resolution.&lt;/p&gt;
&lt;p&gt;The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;when-not-to-use-multi-model-routing&quot;&gt;When NOT to use multi-model routing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Routing adds complexity. Skip it when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All your requests are the same type.&lt;/strong&gt; If you’re building a code editor, just use a single coding model like GLM 5.2. No routing needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You have fewer than 100 requests/day.&lt;/strong&gt; The cost savings don’t justify the engineering overhead at low volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency doesn’t matter.&lt;/strong&gt; For batch processing or async workloads, a single capable model is simpler.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your classification accuracy is low.&lt;/strong&gt; If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;implementation-checklist&quot;&gt;Implementation checklist&lt;/h2&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Log your traffic.&lt;/strong&gt; Before building a router, understand your request distribution. What percentage is simple? Complex? Code?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start with two tiers.&lt;/strong&gt; A fast, lighter model for simple requests, and a stronger model like MiniMax M3 for everything that needs deep reasoning, code, or long context. Add specialists only when you have data showing they help.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure classification accuracy.&lt;/strong&gt; Sample 100 requests, manually label them, compare against the router’s output. Target &gt;90% accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add fallback.&lt;/strong&gt; Every specialist route should fall back to GLM 5.2 if the specialist is unavailable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor per-route metrics.&lt;/strong&gt; Track latency, cost, and quality per category. This tells you where to optimize next.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;see how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>How to choose the right open-source model for your task</title><link>https://cheapestinference.com/blog/choosing-the-right-open-source-model/</link><guid isPermaLink="true">https://cheapestinference.com/blog/choosing-the-right-open-source-model/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most teams default to the biggest model available and call it a day. That works — until latency spikes, costs climb, and you realize a 8B-parameter model would have handled 60% of your requests just fine.&lt;/p&gt;
&lt;p&gt;This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;quick-decision-table&quot;&gt;Quick decision table&lt;/h2&gt;&lt;/div&gt;




























































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Use case&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;General chat / assistants&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;DeepSeek V3.2&lt;/td&gt;&lt;td&gt;Best all-rounder. 85% MMLU-Pro, 73% SWE-bench, 60 t/s.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Complex reasoning&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;DeepSeek R1&lt;/td&gt;&lt;td&gt;50.2% on Humanity’s Last Exam. Chain-of-thought built in.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code generation&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Qwen3 Coder&lt;/td&gt;&lt;td&gt;Purpose-built for code. Strong on completions, refactoring, and debugging.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Agentic workflows&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;334 t/s output, native tool use, 50.2% HLE with tools. Built for agents.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Vision / multimodal&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Llama 4 Scout&lt;/td&gt;&lt;td&gt;17 active experts, 109B params, native image understanding.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Fast classification&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Llama 3.1 8B&lt;/td&gt;&lt;td&gt;~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;General (budget)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;GLM 5.2&lt;/td&gt;&lt;td&gt;Fast inference, competitive quality. Good when V3.2 is overkill.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Long context chat&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;MiniMax M3&lt;/td&gt;&lt;td&gt;1M-token context window. Handles very large documents and codebases.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Large general + reasoning&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Qwen3 235B&lt;/td&gt;&lt;td&gt;235B MoE. Strong across benchmarks when you need maximum capability.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;BGE Large&lt;/td&gt;&lt;td&gt;MTEB-tested. Solid retrieval quality for RAG pipelines.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;general-chat-and-assistants&quot;&gt;General chat and assistants&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: DeepSeek V3.2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;334 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Llama 3.1 8B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~200 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Broad knowledge, instruction following, multilingual, structured output.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Tasks that need step-by-step reasoning chains (use R1) or sub-100ms latency (use Llama 8B).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need a reliable general-purpose model that handles most tasks without specialization.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;complex-reasoning&quot;&gt;Complex reasoning&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: DeepSeek R1&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam — a benchmark designed to be unsolvable by current models — R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).&lt;/p&gt;
&lt;p&gt;The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it’s generating reasoning tokens that never appear in the final output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Math, science, logic puzzles, multi-step problems, anything where “thinking” helps.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Simple Q&amp;#x26;A, classification, or latency-sensitive applications.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-generation&quot;&gt;Code generation&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Qwen3 Coder&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Code completion, bug fixing, refactoring, test generation, multi-file edits.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; General conversation or non-code tasks (use V3.2).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;agentic-workflows&quot;&gt;Agentic workflows&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Kimi K2.5&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.&lt;/p&gt;
&lt;p&gt;The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Tool use, function calling, multi-step task execution, fast iteration loops.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;vision-and-multimodal&quot;&gt;Vision and multimodal&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Llama 4 Scout&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Llama 4 Scout is Meta’s mixture-of-experts multimodal model — 109B total parameters with 17 active experts. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Image description, visual Q&amp;#x26;A, document understanding, chart interpretation.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Text-only tasks where you’re paying for vision capability you don’t use (use V3.2).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your input includes images. For text-only workloads, other models are more efficient.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;fast-classification-and-routing&quot;&gt;Fast classification and routing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Llama 3.1 8B&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Classification, tagging, extraction, routing decisions, simple Q&amp;#x26;A, content moderation.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Complex reasoning, long-form generation, or tasks requiring deep world knowledge.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;budget-general-use&quot;&gt;Budget general use&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: GLM 5.2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GLM 5.2 delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 5.2 gets the job done efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Simple chat, summarization, translation, basic Q&amp;#x26;A.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Complex reasoning or tasks where benchmark-leading quality matters.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You want good-enough quality with better speed and lower cost than the largest models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;long-context&quot;&gt;Long context&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: MiniMax M3&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MiniMax M3 ships a 1M-token (1,048,576) context window — the largest in our lineup. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M3 maintains coherence across the full context. It’s a frontier multimodal coding, agentic, and reasoning model, so the quality holds up across that long context rather than degrading.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Document analysis, long conversations, large-context summarization, whole-repo code reasoning.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Short, simple tasks where context length is irrelevant and you’d rather pay less (use Llama 8B or GLM Flash).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your input regularly exceeds what smaller-context models handle well, or you need frontier reasoning over a very large context.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;maximum-capability&quot;&gt;Maximum capability&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Qwen3 235B&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Broad capability across reasoning, knowledge, and generation. Strong multilingual support.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Latency-sensitive applications (large model, slower inference).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;embeddings&quot;&gt;Embeddings&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: BGE Large&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Semantic search, RAG pipelines, document similarity, clustering.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Generative tasks (it’s an embedding model, not a chat model).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-decision-tree&quot;&gt;The decision tree&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;What&apos;s your task?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need to understand images?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Llama 4 Scout&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need step-by-step reasoning? (math, logic, science)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; DeepSeek R1 (~30 t/s, but highest reasoning quality)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need tool calling / agent loops?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Kimi K2.5 (334 t/s, native tool use)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need code generation / editing?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Qwen3 Coder (purpose-built for code)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need embeddings for search/RAG?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; BGE Large&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need sub-200ms response?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Llama 3.1 8B (~200 t/s, 0.2s TTFT)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need long context (large documents)?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; MiniMax M3 (1M-token context)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need maximum quality, latency flexible?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Qwen3 235B&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- General purpose, good balance?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;YES --&gt; DeepSeek V3.2 (default choice)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-8020-rule&quot;&gt;The 80/20 rule&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You don’t need ten models to cover most workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Llama 3.1 8B handles 60% of requests.&lt;/strong&gt; Classification, routing, simple Q&amp;#x26;A, extraction, content filtering. Fast and cheap.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek V3.2 handles 30%.&lt;/strong&gt; General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Specialized models handle the last 10%.&lt;/strong&gt; R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.&lt;/p&gt;
&lt;p&gt;Start with Llama 8B + V3.2. Add specialists only when you have evidence that general models aren’t performing on specific task categories. Measure first, specialize second.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This guide is provider-agnostic. CheapestInference serves a focused lineup — Kimi K2.6, GLM 5.2, and MiniMax M3 — through a single OpenAI- and Anthropic-compatible API. If you want unlimited inference during your reserved hours, &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;see how time-block pools work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://www.swebench.com/&quot;&gt;SWE-bench Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/mmlu-pro&quot;&gt;MMLU-Pro Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;MTEB Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Open-source models are production-ready. Here&apos;s the proof.</title><link>https://cheapestinference.com/blog/open-source-models-are-production-ready/</link><guid isPermaLink="true">https://cheapestinference.com/blog/open-source-models-are-production-ready/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.&lt;/p&gt;
&lt;p&gt;We’re comparing &lt;strong&gt;5 models across 5 metrics&lt;/strong&gt; — the same models in every chart, no cherry-picking:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-source:&lt;/strong&gt; DeepSeek V3.2, DeepSeek R1, Kimi K2.5
&lt;strong&gt;Proprietary (reference):&lt;/strong&gt; Claude Opus 4.6, GPT-5.4&lt;/p&gt;
&lt;p&gt;&lt;em&gt;(We serve the latest open-weight frontier models — currently Kimi K2.6, GLM 5.2, and MiniMax M3 — through a single API. The benchmark models below are industry reference points; the conclusions about open-source readiness carry directly to the current generation.)&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-quality-swe-bench-verified--resolved&quot;&gt;Code quality: SWE-bench Verified (% resolved)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;80.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~80.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;76.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;73.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;57.6%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;reasoning-humanitys-last-exam&quot;&gt;Reasoning: Humanity’s Last Exam (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5 *&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;50.2%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;50.2%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;41.6%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;40.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;39.3%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;knowledge-mmlu-pro&quot;&gt;Knowledge: MMLU-Pro (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.5%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;87.1%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;85.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;82.0%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;speed-output-tokens-per-second&quot;&gt;Speed: output tokens per second&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;334 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~78 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;46 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;latency-time-to-first-token-seconds&quot;&gt;Latency: time to first token (seconds)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;0.31s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~0.95s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;1.18s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~2.0s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;2.48s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-picture&quot;&gt;The full picture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
&lt;svg viewBox=&quot;-80 0 560 410&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot;&gt;
  &lt;!-- Grid lines --&gt;
  &lt;polygon points=&quot;200,120 266.6,168.4 241.1,246.6 158.9,246.6 133.4,168.4&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;polygon points=&quot;200,50 333.1,146.7 282.3,303.3 117.7,303.3 66.9,146.7&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Axes --&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;50&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;333.1&quot; y2=&quot;146.7&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;282.3&quot; y2=&quot;303.3&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;117.7&quot; y2=&quot;303.3&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;66.9&quot; y2=&quot;146.7&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;!-- Kimi K2.5 — indigo --&gt;
  &lt;polygon points=&quot;200,57 333.1,146.7 280.6,301 117.7,303.3 66.9,146.7&quot; fill=&quot;#6366F1&quot; fill-opacity=&quot;0.12&quot; stroke=&quot;#6366F1&quot; stroke-width=&quot;2.5&quot;&gt;&lt;/polygon&gt;
  &lt;!-- DeepSeek V3.2 — teal --&gt;
  &lt;polygon points=&quot;200,64 303.9,156.3 279,298.7 185.2,210.4 120.1,164&quot; fill=&quot;#14B8A6&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#14B8A6&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- DeepSeek R1 — amber --&gt;
  &lt;polygon points=&quot;200,90.6 333.1,146.7 278.2,297.6 192.6,200.2 170.7,180.5&quot; fill=&quot;#F59E0B&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#F59E0B&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Claude Opus 4.6 — gray --&gt;
  &lt;polygon points=&quot;200,50 306.5,155.4 276.5,295.3 188.5,205.9 200,190&quot; fill=&quot;none&quot; stroke=&quot;#9A9490&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- GPT-5.4 — dark gray dashed --&gt;
  &lt;polygon points=&quot;200,51.4 310.5,154.1 282.3,303.3 181.1,216.1 105.5,159.3&quot; fill=&quot;none&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;6 3&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Data points --&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;57&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;333.1&quot; cy=&quot;146.7&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;280.6&quot; cy=&quot;301&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;117.7&quot; cy=&quot;303.3&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;66.9&quot; cy=&quot;146.7&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;!-- Labels --&gt;
  &lt;text x=&quot;200&quot; y=&quot;30&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Code&lt;/text&gt;
  &lt;text x=&quot;345&quot; y=&quot;142&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Reasoning&lt;/text&gt;
  &lt;text x=&quot;290&quot; y=&quot;325&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Knowledge&lt;/text&gt;
  &lt;text x=&quot;110&quot; y=&quot;325&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Speed&lt;/text&gt;
  &lt;text x=&quot;55&quot; y=&quot;142&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Latency&lt;/text&gt;
  &lt;!-- Legend row 1 --&gt;
  &lt;rect x=&quot;-10&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#6366F1&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;8&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Kimi K2.5&lt;/text&gt;
  &lt;rect x=&quot;75&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#14B8A6&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;93&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;DeepSeek V3.2&lt;/text&gt;
  &lt;rect x=&quot;185&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#F59E0B&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;203&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;DeepSeek R1&lt;/text&gt;
  &lt;!-- Legend row 2 --&gt;
  &lt;rect x=&quot;-10&quot; y=&quot;386&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#9A9490&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;8&quot; y=&quot;390&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Claude Opus 4.6&lt;/text&gt;
  &lt;line x1=&quot;110&quot; y1=&quot;387&quot; x2=&quot;124&quot; y2=&quot;387&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;4 2&quot;&gt;&lt;/line&gt;
  &lt;text x=&quot;128&quot; y=&quot;390&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;GPT-5.4&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;
&lt;div&gt;&lt;h2 id=&quot;the-scorecard&quot;&gt;The scorecard&lt;/h2&gt;&lt;/div&gt;















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Winner&lt;/th&gt;&lt;th&gt;Open-source&lt;/th&gt;&lt;th&gt;Proprietary&lt;/th&gt;&lt;th&gt;Gap&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code&lt;/strong&gt; (SWE-bench)&lt;/td&gt;&lt;td&gt;Opus 4.6&lt;/td&gt;&lt;td&gt;Kimi 76.8%&lt;/td&gt;&lt;td&gt;Opus 80.8%&lt;/td&gt;&lt;td&gt;-4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt; (HLE)&lt;/td&gt;&lt;td&gt;R1&lt;/td&gt;&lt;td&gt;R1 50.2%&lt;/td&gt;&lt;td&gt;GPT-5.4 41.6%&lt;/td&gt;&lt;td&gt;+8.6 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Knowledge&lt;/strong&gt; (MMLU-Pro)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;Kimi 87.1%&lt;/td&gt;&lt;td&gt;GPT-5.4 88.5%&lt;/td&gt;&lt;td&gt;-1.4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt; (tok/s)&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;334 t/s&lt;/td&gt;&lt;td&gt;GPT-5.4 78 t/s&lt;/td&gt;&lt;td&gt;4.3x faster&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt; (TTFT)&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;0.31s&lt;/td&gt;&lt;td&gt;GPT-5.4 0.95s&lt;/td&gt;&lt;td&gt;3x faster&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Open-source wins 3 out of 5.&lt;/strong&gt; Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-production-ready-actually-means&quot;&gt;What “production-ready” actually means&lt;/h2&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Reliable enough.&lt;/strong&gt; Consistent quality across thousands of requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast enough.&lt;/strong&gt; Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Capable enough.&lt;/strong&gt; Within 4 points of the best proprietary model on code, ahead on reasoning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predictable.&lt;/strong&gt; Versioned models that don’t change without warning.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;&lt;h2 id=&quot;the-real-advantage-control&quot;&gt;The real advantage: control&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.&lt;/p&gt;
&lt;p&gt;For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We serve frontier open-weight models — Kimi K2.6, GLM 5.2, and MiniMax M3 — through a single OpenAI- and Anthropic-compatible API. Unlimited-usage time-block subscriptions from $39/mo, plus pay-as-you-go credits from $10. &lt;a href=&quot;https://cheapestinference.com/pools&quot;&gt;See the pools&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://www.swebench.com/&quot;&gt;SWE-bench Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://openai.com/api/pricing/&quot;&gt;OpenAI Pricing&lt;/a&gt; · &lt;a href=&quot;https://platform.claude.com/docs/en/about-claude/pricing&quot;&gt;Anthropic Pricing&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/mmlu-pro&quot;&gt;MMLU-Pro Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>What it takes to build your own LLM inference platform</title><link>https://cheapestinference.com/blog/build-your-own-inference-platform/</link><guid isPermaLink="true">https://cheapestinference.com/blog/build-your-own-inference-platform/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.&lt;/p&gt;
&lt;p&gt;This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;0-model-access--the-first-problem&quot;&gt;0. Model access — the first problem&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Before you write a single line of code, you need access to models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-host on your own hardware&lt;/strong&gt;: Buy GPUs, rent datacenter space, run the models yourself. Full control, best unit economics at scale — but massive upfront cost and you’re limited to the models you can afford to deploy. Running DeepSeek V3.2 requires multiple high-end GPUs. Running dozens of models? You’d need a data center.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rent infrastructure&lt;/strong&gt;: Use GPU clouds like Vast.ai, AWS, Hetzner, CoreWeave, or Lambda. No hardware to buy, but you still manage deployments, scaling, and failover. Costs add up fast — a single H100 runs $2-4/hr.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use an inference provider&lt;/strong&gt;: Sign agreements with DeepInfra, Together.ai, Fireworks, etc. who already have the models deployed. Pay per token, no GPU management. But you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mix&lt;/strong&gt;: Most serious platforms end up here. Own hardware for high-volume models where the unit economics justify it, rented GPUs for burst capacity, and provider agreements for the long tail of models nobody runs enough to self-host.&lt;/p&gt;
&lt;p&gt;Self-hosting dozens of models on your own is economically unrealistic. The real question is where to draw the line between own infra, rented compute, and providers.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;1-serving-engine&quot;&gt;1. Serving engine&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;If you self-host or rent GPUs, you need software to serve the models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;vLLM&lt;/strong&gt; — most popular, good throughput, active community&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TGI&lt;/strong&gt; (Text Generation Inference) — Hugging Face’s solution, solid for single-model deployments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt; — NVIDIA’s optimized engine, best raw performance but harder to set up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SGLang&lt;/strong&gt; — newer, fast, good for structured generation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You’ll also need to handle model weights, quantization, scaling across GPUs, and failover when a node goes down. This is a full-time ops job.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;2-api-proxy-layer&quot;&gt;2. API proxy layer&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users shouldn’t hit the inference backend directly. You need a proxy that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Translates between API formats (OpenAI, Anthropic)&lt;/li&gt;
&lt;li&gt;Routes requests to the right model/provider&lt;/li&gt;
&lt;li&gt;Injects authentication&lt;/li&gt;
&lt;li&gt;Handles retries and failover&lt;/li&gt;
&lt;li&gt;Strips provider headers so users don’t know your backend&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build from scratch with Express/Fastify + http-proxy-middleware&lt;/li&gt;
&lt;li&gt;Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway&lt;/li&gt;
&lt;li&gt;Use a managed gateway: Helicone, Braintrust, Promptlayer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;3-authentication&quot;&gt;3. Authentication&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Two layers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;User auth (dashboard login)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own&lt;/li&gt;
&lt;li&gt;Supports email, Google, GitHub, wallet signatures&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;API key auth (inference requests)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate API keys per user&lt;/li&gt;
&lt;li&gt;Validate on every request before proxying&lt;/li&gt;
&lt;li&gt;Store key metadata (plan, rate limits, owner)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where it gets interesting for platforms. You need &lt;strong&gt;per-key plans&lt;/strong&gt; — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;4-rate-limiting&quot;&gt;4. Rate limiting&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Per-key rate limiting with at least:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RPM&lt;/strong&gt; (requests per minute)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TPM&lt;/strong&gt; (tokens per minute)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget caps&lt;/strong&gt; (dollar amount per time window)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Redis-based counters (most common)&lt;/li&gt;
&lt;li&gt;Token bucket algorithms&lt;/li&gt;
&lt;li&gt;Proxy-level enforcement (some gateways include this)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;5-usage-tracking-and-billing&quot;&gt;5. Usage tracking and billing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You need to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many tokens each key consumed (input + output)&lt;/li&gt;
&lt;li&gt;What model was used&lt;/li&gt;
&lt;li&gt;Cost per request&lt;/li&gt;
&lt;li&gt;Aggregate usage per user, per day, per billing period&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For subscription billing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stripe for card payments&lt;/li&gt;
&lt;li&gt;Budget windows (e.g., $X per 8-hour period)&lt;/li&gt;
&lt;li&gt;Automatic key revocation when subscription expires&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For pay-as-you-go:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Credit balance per user&lt;/li&gt;
&lt;li&gt;Deduct per request based on token count × model price&lt;/li&gt;
&lt;li&gt;Top-up flow (Stripe, crypto, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For crypto payments:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;USDC on a supported chain&lt;/li&gt;
&lt;li&gt;On-chain transaction verification&lt;/li&gt;
&lt;li&gt;Wallet connector in the dashboard (wagmi, viem, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;6-dashboard&quot;&gt;6. Dashboard&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users need a web UI to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create and manage API keys&lt;/li&gt;
&lt;li&gt;View usage per key (tokens, requests, cost)&lt;/li&gt;
&lt;li&gt;Subscribe to plans or top up credits&lt;/li&gt;
&lt;li&gt;See available models and pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tech stack typically:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;React/Next.js/Vue frontend&lt;/li&gt;
&lt;li&gt;REST API backend&lt;/li&gt;
&lt;li&gt;Real-time usage updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For platforms (your users creating keys for their users), you also need a &lt;strong&gt;management API&lt;/strong&gt; — programmatic key creation, plan assignment, usage queries.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;7-model-catalog-management&quot;&gt;7. Model catalog management&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Models change. New ones come out weekly. You need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A catalog of which models you serve&lt;/li&gt;
&lt;li&gt;Pricing per model (input/output cost per token)&lt;/li&gt;
&lt;li&gt;Sync mechanism to update prices when providers change them&lt;/li&gt;
&lt;li&gt;Display names, categories, tags for the dashboard&lt;/li&gt;
&lt;li&gt;Cache pricing metadata (some models support prompt caching discounts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is an ongoing operational burden, not a one-time setup.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;8-documentation&quot;&gt;8. Documentation&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API reference (endpoints, request/response formats)&lt;/li&gt;
&lt;li&gt;SDK examples (Python, Node.js, at minimum)&lt;/li&gt;
&lt;li&gt;Authentication guide&lt;/li&gt;
&lt;li&gt;Billing/usage documentation&lt;/li&gt;
&lt;li&gt;Quick start guide&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is easily 20-30 pages of documentation that needs to stay current.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;9-monitoring-and-reliability&quot;&gt;9. Monitoring and reliability&lt;/h2&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Health checks on the inference backend&lt;/li&gt;
&lt;li&gt;Status page for users&lt;/li&gt;
&lt;li&gt;Alerting when latency spikes or errors increase&lt;/li&gt;
&lt;li&gt;Logging (but not logging prompt content — privacy)&lt;/li&gt;
&lt;li&gt;Graceful degradation when a model or provider is down&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;h2 id=&quot;10-compliance-and-privacy&quot;&gt;10. Compliance and privacy&lt;/h2&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Privacy policy&lt;/li&gt;
&lt;li&gt;Data handling documentation&lt;/li&gt;
&lt;li&gt;GDPR compliance if you serve EU users&lt;/li&gt;
&lt;li&gt;Decision: do you store prompts? (You shouldn’t)&lt;/li&gt;
&lt;li&gt;SOC 2 / ISO 27001 if targeting enterprise&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-stack&quot;&gt;The full stack&lt;/h2&gt;&lt;/div&gt;

















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;Ongoing maintenance&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Inference backend&lt;/td&gt;&lt;td&gt;High — scaling, failover, model updates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;API proxy&lt;/td&gt;&lt;td&gt;Medium — format changes, new providers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Auth + key management&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Per-key rate limiting&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Usage tracking + billing&lt;/td&gt;&lt;td&gt;Medium — edge cases, reconciliation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dashboard&lt;/td&gt;&lt;td&gt;Medium — new features, UX&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Model catalog&lt;/td&gt;&lt;td&gt;High — weekly model updates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Documentation&lt;/td&gt;&lt;td&gt;Medium — keep current&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Monitoring&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Privacy/compliance&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;div&gt;&lt;h2 id=&quot;what-breaks-in-production&quot;&gt;What breaks in production&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Building is the easy part. The hard part is what breaks with real users:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A provider changes their API format without warning. Your proxy returns 500s for 2 hours until you notice.&lt;/li&gt;
&lt;li&gt;A model gets deprecated. Your users’ hardcoded model IDs stop working overnight.&lt;/li&gt;
&lt;li&gt;Token counting has an off-by-one bug. You’ve been undercharging for 3 weeks. Your margin is gone.&lt;/li&gt;
&lt;li&gt;A user finds a way to exceed rate limits through concurrent requests. Your inference bill spikes 10x in one afternoon.&lt;/li&gt;
&lt;li&gt;Stripe webhook fails silently. A user’s subscription expired but their API key still works. Free inference for a month.&lt;/li&gt;
&lt;li&gt;You push a billing update and break the usage tracking. Three days of missing data. Users open tickets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these has happened to us. We fixed them. The question is whether you want to fix them yourself, with your users waiting, or use a platform that already has.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;or&quot;&gt;Or&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You use an inference platform that already has all of this, create API keys for your users, and ship your product this week.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We built all of the above so you don’t have to. &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;See how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item></channel></rss>