Inference
Running trained models to generate outputs - the core of AI applications.
Training teaches a model. Inference uses it. When you send a message to Claude and receive a response, that's inference. When your agent cloud processes a task, every model call is inference. Training happens once; inference happens billions of times.
The Inference Loop
At its core, inference is straightforward: input goes in, output comes out.
Input tokens → Model weights → Output tokens
For language models, the input is your prompt—context, instructions, data. The model processes these tokens through billions of parameters, computing probability distributions over possible next tokens. It samples from these distributions, generating one token at a time until it reaches a stop condition.
This simplicity hides immense complexity. Each forward pass through a large model involves trillions of floating-point operations. The model must fit in GPU memory. The computation must happen fast enough to be useful.
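The token-by-token loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `model` here stands in for a full forward pass, and the `softmax`/sampling step mirrors how a next token is drawn from the probability distribution.

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, prompt: list[int], stop_token: int, max_tokens: int = 50) -> list[int]:
    # Autoregressive loop: each sampled token is appended to the context
    # and fed back in for the next forward pass.
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = model(tokens)   # one forward pass over the whole context
        probs = softmax(logits)
        next_token = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)
        if next_token == stop_token:
            break
    return tokens
```

Real inference engines run this same loop, but each `model(tokens)` call is trillions of floating-point operations on GPUs, with caching to avoid recomputing earlier positions.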
Speed Matters
Inference speed directly impacts user experience and system design. A model that takes 30 seconds per response can't power a real-time application. A model that costs $10 per query can't process millions of data points.
For agent clouds, this creates a fundamental constraint. When the article mentions "Speed of model inference is an additional factor you must consider," this is what it means:
- Latency: Time from request to first token (time-to-first-token, TTFT)
- Throughput: Tokens generated per second
- Cost: Price per million tokens (input and output)
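The cost dimension is simple arithmetic, but it's worth making explicit. A minimal sketch, using hypothetical per-million-token prices (real rates vary by model and provider):

```python
# Hypothetical prices, for illustration only.
PRICE_IN_PER_MTOK = 3.00    # $ per million input tokens
PRICE_OUT_PER_MTOK = 15.00  # $ per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Input and output tokens are priced separately.
    return (input_tokens * PRICE_IN_PER_MTOK
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
cost = request_cost(2_000, 500)  # $0.0135 per request at these rates
```

At these assumed rates, a million such requests would cost $13,500 — which is why model selection and prompt size matter at scale.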
Different models make different tradeoffs. Claude Haiku optimizes for speed and cost. Claude Opus optimizes for capability. Your architecture must account for these differences.
Batching and Parallelism
A single inference call is synchronous: wait for the response before continuing. But agent systems can run multiple inferences in parallel:
```python
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def process_batch(items: list[str]) -> list[str]:
    tasks = [
        client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": item}],
        )
        for item in items
    ]
    responses = await asyncio.gather(*tasks)
    return [r.content[0].text for r in responses]
```

This is how you process large datasets efficiently. Each inference still takes time, but you're running many of them simultaneously.
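Unbounded parallelism can trip provider rate limits. One common refinement is to cap in-flight requests with a semaphore. A generic sketch (the `bounded_gather` helper and its interface are my own, not part of any SDK):

```python
import asyncio

async def bounded_gather(coro_fns, limit: int = 5):
    # coro_fns: zero-argument callables that each return a fresh coroutine.
    # The semaphore caps how many run concurrently.
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:
            return await fn()

    # gather preserves input order in its results.
    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```

To use it with the batch above, wrap each API call in a callable, e.g. `lambda item=item: client.messages.create(...)`, so the coroutine is only created once the semaphore admits it.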
The Memory-Compute Tradeoff
Large models require large amounts of GPU memory. An H100 GPU has 80GB of HBM. A model with 70 billion parameters at 16-bit precision needs 140GB just for the weights—before accounting for activations, KV cache, and batch processing.
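The back-of-envelope arithmetic is worth internalizing: weight memory is just parameter count times bytes per parameter. A quick sketch, which also shows why quantization (fewer bytes per parameter) is a common inference optimization:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weights only; activations, KV cache, and batching add more on top.
    return params_billion * 1e9 * bytes_per_param / 1e9

fp16 = weight_memory_gb(70, 2)  # 70B params at 16-bit: 140 GB, > one 80GB H100
int8 = weight_memory_gb(70, 1)  # same model quantized to 8-bit: 70 GB
```

Serving the 16-bit model therefore requires sharding the weights across multiple GPUs, while the 8-bit version can fit on one — at some cost in quality.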
This is why inference at scale requires specialized infrastructure. It's also why inference providers exist—they amortize the hardware cost across many users.
Inference in System Design
When building AI-powered applications, treat inference as an I/O operation, like a database query or API call:
- It can fail (rate limits, timeouts, service outages)
- It has latency you can't eliminate
- It has costs that scale with usage
- The results are non-deterministic
Design accordingly. Add retries with backoff. Cache results when appropriate. Use cheaper models for simple tasks, expensive models for hard ones. Monitor latency and cost to catch regressions.
The goal isn't to minimize inference—it's to use it effectively. Every inference call should earn its cost.
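The "retries with backoff" advice above can be sketched as a small wrapper. This is a minimal illustration: it catches bare `Exception` for brevity, where production code would catch only the SDK's transient errors (rate limits, timeouts) and let permanent failures propagate immediately.

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    # Retry transient failures with exponential backoff plus jitter,
    # so many clients retrying at once don't synchronize their bursts.
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Usage is just `call_with_backoff(lambda: client.messages.create(...))` — the same pattern you would use for any flaky I/O operation.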
Optimizing Inference
Several techniques reduce inference cost and latency:
Prompt caching: Reuse computed context across multiple requests with the same prefix. If your system prompt is 2000 tokens, caching avoids reprocessing it on every call.
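With the Anthropic API, caching is requested by marking a stable prefix with a `cache_control` field on a content block. A sketch of the request payload (the prompt text here is a stand-in; check current API docs for cache minimums and lifetimes):

```python
LONG_SYSTEM_PROMPT = "You are a support agent. " * 400  # stands in for a ~2000-token prompt

# Mark the stable prefix as cacheable; only content after it is
# reprocessed on subsequent calls that share this prefix.
system_blocks = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }
]

# client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=1024,
#     system=system_blocks,
#     messages=[{"role": "user", "content": user_message}],
# )
```

Cache reads are billed at a fraction of normal input-token rates, so the savings compound with every request that reuses the prefix.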
Model selection: Use the smallest model that meets your quality requirements. Claude Haiku for classification and extraction; Claude Opus for complex reasoning.
Structured outputs: Request JSON with a schema. The model can generate valid structured data more efficiently than free-form text you then parse.
Streaming: For interactive applications, stream tokens as they're generated. The user sees progress immediately rather than waiting for the complete response.
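With the Anthropic Python SDK, streaming looks like the sketch below: `messages.stream` is a context manager whose `text_stream` yields text deltas as they arrive (the import is deferred so the function can be defined without the SDK installed; an API key is required to actually run it).

```python
def stream_reply(prompt: str) -> str:
    from anthropic import Anthropic  # deferred so defining this sketch needs no SDK

    client = Anthropic()
    chunks = []
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)  # show progress immediately
            chunks.append(text)
    return "".join(chunks)
```

The user starts reading after the first token arrives (the TTFT from earlier) instead of waiting for the full generation to finish.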
Inference is where AI becomes useful. Training creates potential; inference realizes it.