How to choose the right open-source model for your task

Most teams default to the biggest model available and call it a day. That works until latency spikes, costs climb, and you realize an 8B-parameter model would have handled 60% of your requests just fine.

This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.


Use case | Model | Why
--- | --- | ---
General chat / assistants | DeepSeek V3.2 | Best all-rounder. 85% MMLU-Pro, 73% SWE-bench Verified, ~60 t/s.
Complex reasoning | DeepSeek R1 | 50.2% on Humanity’s Last Exam. Chain-of-thought built in.
Code generation | Qwen3 Coder | Purpose-built for code. Strong on completions, refactoring, and debugging.
Agentic workflows | Kimi K2.5 | 334 t/s output, native tool use, 50.2% HLE with tools. Built for agents.
Vision / multimodal | Llama 4 Scout | 109B total params, 17B active (16 experts), native image understanding.
Fast classification | Llama 3.1 8B | ~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction.
General (budget) | GLM 5.2 | Fast inference, competitive quality. Good when V3.2 is overkill.
Long context chat | MiniMax M3 | 1M-token context window. Handles very large documents and codebases.
Large general + reasoning | Qwen3 235B | 235B MoE. Strong across benchmarks when you need maximum capability.
Embeddings | BGE Large | MTEB-tested. Solid retrieval quality for RAG pipelines.

Pick: DeepSeek V3.2

DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.

Output throughput on our infrastructure:
Kimi K2.5: 334 t/s
Llama 3.1 8B: ~200 t/s
DeepSeek V3.2: ~60 t/s
DeepSeek R1: ~30 t/s

Good at: Broad knowledge, instruction following, multilingual, structured output.
Not ideal for: Tasks that need step-by-step reasoning chains (use R1) or sub-second latency (use Llama 3.1 8B).
Pick over alternatives when: You need a reliable general-purpose model that handles most tasks without specialization.


Pick: DeepSeek R1

R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam, a benchmark designed to be extremely hard for current models, R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).

The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it’s generating reasoning tokens that never appear in the final output.
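To make the tradeoff concrete, here is a rough back-of-the-envelope sketch. The throughput figures are the ones quoted in this guide; the token counts are hypothetical and chosen only for illustration, and the estimate ignores time to first token and network overhead.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating tokens, ignoring TTFT and network overhead."""
    return output_tokens / tokens_per_second


# Throughput figures quoted in this guide.
R1_TPS = 30.0
V3_2_TPS = 60.0

# Hypothetical token counts (assumptions, not measurements).
ANSWER_TOKENS = 300      # tokens the user actually sees
REASONING_TOKENS = 1500  # chain-of-thought tokens generated before the answer

# R1 pays for reasoning tokens plus the answer; V3.2 only for the answer.
r1_total = generation_seconds(REASONING_TOKENS + ANSWER_TOKENS, R1_TPS)
v3_total = generation_seconds(ANSWER_TOKENS, V3_2_TPS)

print(f"R1:   {r1_total:.0f} s")
print(f"V3.2: {v3_total:.0f} s")
```

Under these assumed token counts, R1 spends about a minute where V3.2 spends about five seconds. Real reasoning lengths vary widely by task, so treat the ratio as illustrative rather than a benchmark.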

Good at: Math, science, logic puzzles, multi-step problems, anything where “thinking” helps.
Not ideal for: Simple Q&A, classification, or latency-sensitive applications.
Pick over alternatives when: The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.


Pick: Qwen3 Coder

Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.

Good at: Code completion, bug fixing, refactoring, test generation, multi-file edits.
Not ideal for: General conversation or non-code tasks (use V3.2).
Pick over alternatives when: Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.


Pick: Kimi K2.5

Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.

The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.
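A quick sketch of how per-step latency compounds across an agent loop. The 334 t/s, 30 t/s, and 0.31s TTFT figures come from this guide; the step count, tokens per step, and the choice to give the slower model the same TTFT are assumptions made only to isolate the effect of throughput. Tool execution time is left at zero.

```python
def agent_loop_seconds(
    steps: int,
    tokens_per_step: int,
    tokens_per_second: float,
    ttft_seconds: float,
    tool_seconds: float = 0.0,
) -> float:
    """Estimate wall-clock time for a sequential agent loop.

    Each step waits for the first token, generates the tool call,
    then waits for the tool to run before the next step starts.
    """
    per_step = ttft_seconds + tokens_per_step / tokens_per_second + tool_seconds
    return steps * per_step


STEPS = 10             # hypothetical number of tool-call round trips
TOKENS_PER_STEP = 150  # hypothetical tokens generated per round trip

fast = agent_loop_seconds(STEPS, TOKENS_PER_STEP, 334.0, 0.31)
slow = agent_loop_seconds(STEPS, TOKENS_PER_STEP, 30.0, 0.31)  # same TTFT assumed

print(f"334 t/s: {fast:.1f} s")
print(f" 30 t/s: {slow:.1f} s")
```

With these inputs the loop takes roughly 8 seconds at 334 t/s and roughly 53 seconds at 30 t/s. Longer loops or larger outputs per step widen the absolute gap further.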

Good at: Tool use, function calling, multi-step task execution, fast iteration loops.
Not ideal for: Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized).
Pick over alternatives when: Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.


Pick: Llama 4 Scout

Llama 4 Scout is Meta’s mixture-of-experts multimodal model: 109B total parameters, of which 17B are active per token across 16 experts. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.

Good at: Image description, visual Q&A, document understanding, chart interpretation.
Not ideal for: Text-only tasks where you’re paying for vision capability you don’t use (use V3.2).
Pick over alternatives when: Your input includes images. For text-only workloads, other models are more efficient.


Pick: Llama 3.1 8B

At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.

Good at: Classification, tagging, extraction, routing decisions, simple Q&A, content moderation.
Not ideal for: Complex reasoning, long-form generation, or tasks requiring deep world knowledge.
Pick over alternatives when: You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.


Pick: GLM 5.2

GLM 5.2 delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 5.2 gets the job done efficiently.

Good at: Simple chat, summarization, translation, basic Q&A.
Not ideal for: Complex reasoning or tasks where benchmark-leading quality matters.
Pick over alternatives when: You want good-enough quality with better speed and lower cost than the largest models.


Pick: MiniMax M3

MiniMax M3 ships a 1M-token (1,048,576) context window — the largest in our lineup. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M3 maintains coherence across the full context. It’s a frontier multimodal coding, agentic, and reasoning model, so the quality holds up across that long context rather than degrading.

Good at: Document analysis, long conversations, large-context summarization, whole-repo code reasoning.
Not ideal for: Short, simple tasks where context length is irrelevant and you’d rather pay less (use Llama 8B or GLM Flash).
Pick over alternatives when: Your input regularly exceeds what smaller-context models handle well, or you need frontier reasoning over a very large context.


Pick: Qwen3 235B

Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.

Good at: Broad capability across reasoning, knowledge, and generation. Strong multilingual support.
Not ideal for: Latency-sensitive applications (large model, slower inference).
Pick over alternatives when: You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.


Pick: BGE Large

BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.

Good at: Semantic search, RAG pipelines, document similarity, clustering.
Not ideal for: Generative tasks (it’s an embedding model, not a chat model).
Pick over alternatives when: You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.
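For readers new to embeddings, here is a minimal sketch of the retrieval step in a RAG pipeline: rank documents by cosine similarity to the query vector. The 3-dimensional vectors and document names are toy placeholders, not real BGE Large output; in practice each vector would come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical data).
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

results = top_k(query, docs, k=2)
print(results)
```

The retrieved documents would then be passed to a generative model as context. At scale, a vector index usually replaces the linear scan shown here.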


What's your task?
|
+-- Need to understand images?
| YES --> Llama 4 Scout
|
+-- Need step-by-step reasoning? (math, logic, science)
| YES --> DeepSeek R1 (~30 t/s, but highest reasoning quality)
|
+-- Need tool calling / agent loops?
| YES --> Kimi K2.5 (334 t/s, native tool use)
|
+-- Need code generation / editing?
| YES --> Qwen3 Coder (purpose-built for code)
|
+-- Need embeddings for search/RAG?
| YES --> BGE Large
|
+-- Need sub-second response?
| YES --> Llama 3.1 8B (~200 t/s, 0.2s TTFT)
|
+-- Need long context (large documents)?
| YES --> MiniMax M3 (1M-token context)
|
+-- Need maximum quality, latency flexible?
| YES --> Qwen3 235B
|
+-- General purpose, good balance?
YES --> DeepSeek V3.2 (default choice)
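
The tree above can be sketched as a small routing function. The `Task` flags and their names are hypothetical and exist only for illustration; the branch order mirrors the tree, so the first matching question wins and DeepSeek V3.2 is the fallback.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task flags; field names are illustrative only."""
    has_images: bool = False
    needs_step_by_step_reasoning: bool = False
    needs_tool_calling: bool = False
    is_code_task: bool = False
    needs_embeddings: bool = False
    needs_fast_response: bool = False
    has_long_context: bool = False
    needs_max_quality: bool = False


def pick_model(task: Task) -> str:
    """Walk the branches top to bottom; the first match wins."""
    branches = [
        (task.has_images, "Llama 4 Scout"),
        (task.needs_step_by_step_reasoning, "DeepSeek R1"),
        (task.needs_tool_calling, "Kimi K2.5"),
        (task.is_code_task, "Qwen3 Coder"),
        (task.needs_embeddings, "BGE Large"),
        (task.needs_fast_response, "Llama 3.1 8B"),
        (task.has_long_context, "MiniMax M3"),
        (task.needs_max_quality, "Qwen3 235B"),
    ]
    for matched, model in branches:
        if matched:
            return model
    return "DeepSeek V3.2"  # general-purpose default


print(pick_model(Task(needs_tool_calling=True)))
print(pick_model(Task()))
```

Because the first match wins, a task that sets several flags resolves to the branch highest in the tree. Reorder the list if your priorities differ.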

You don’t need ten models to cover most workloads.

Llama 3.1 8B handles 60% of requests. Classification, routing, simple Q&A, extraction, content filtering. Fast and cheap.

DeepSeek V3.2 handles 30%. General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.

Specialized models handle the last 10%. R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.

Start with Llama 8B + V3.2. Add specialists only when you have evidence that general models aren’t performing on specific task categories. Measure first, specialize second.
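
As a rough illustration of why the two-model start pays off, the sketch below compares average generation time for a mixed deployment against sending everything to V3.2. The throughput numbers are the ones quoted in this guide; the 300-token response length and the 60/40 split between the two models are assumptions for the example, and TTFT is ignored.

```python
OUTPUT_TOKENS = 300  # hypothetical average response length

# (share of traffic, tokens per second); the 60/40 split is an assumption.
TWO_MODEL_MIX = {
    "Llama 3.1 8B": (0.60, 200.0),
    "DeepSeek V3.2": (0.40, 60.0),
}


def mean_generation_seconds(mix: dict[str, tuple[float, float]], tokens: int) -> float:
    """Traffic-weighted average generation time per request."""
    return sum(share * tokens / tps for share, tps in mix.values())


mixed = mean_generation_seconds(TWO_MODEL_MIX, OUTPUT_TOKENS)
all_v3 = mean_generation_seconds({"DeepSeek V3.2": (1.0, 60.0)}, OUTPUT_TOKENS)

print(f"8B + V3.2 mix: {mixed:.1f} s per request")
print(f"V3.2 only:     {all_v3:.1f} s per request")
```

Under these assumptions the mix averages about 2.9 seconds per request against 5.0 seconds for V3.2 alone. Substitute your own measured traffic split and response lengths before drawing conclusions.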


This guide is provider-agnostic. CheapestInference serves a focused lineup — Kimi K2.6, GLM 5.2, and MiniMax M3 — through a single OpenAI- and Anthropic-compatible API. If you want unlimited inference during your reserved hours, see how time-block pools work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · HLE Leaderboard · MMLU-Pro Leaderboard · MTEB Leaderboard