Claude has been multimodal for a while, but in 2026 vision is no longer a side feature — it is the entry point for most document and agent workflows. Sonnet 4.7 and Opus 4.7 both accept images natively in the same messages API you already use for text, and the quality on text-heavy inputs (PDFs rasterized to PNG, screenshots, scanned invoices, dashboards) is the reason most teams pick Claude over alternatives for backoffice automation.
This post is the practical handbook: what formats Claude accepts, what it costs, what it is good at, where it fails, and copy-paste code that works against https://api.claudexia.tech/v1.
Input formats and limits
Claude vision accepts images as base64-encoded blocks inside a user message. Supported formats:
- PNG (image/png)
- JPEG (image/jpeg)
- GIF (image/gif) — first frame only, no animation
- WebP (image/webp)
Hard limits to keep in mind:
- Max file size: 5 MB per image, base64-encoded. Anything larger has to be downscaled or split.
- Max dimensions: 8000 × 8000 pixels. Images larger than that are rejected outright.
- Recommended dimensions: 1568 px on the long edge for general content, 2000–2500 px for dense text. Going higher than ~1568 px does not improve recognition for normal photos and just inflates cost.
- Max images per request: 100, but in practice 20+ images already saturate context for most tasks.
- No video, no audio, no streaming images. You cannot push frames into a live conversation. If you need video, sample frames yourself and send them as discrete images.
Animated GIFs (read as their first frame), screenshots from HiDPI monitors, and PDF page exports all work — Claude does not care about EXIF, color profile, or DPI metadata, only the raw pixels.
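The limits above are easiest to enforce before upload. A minimal sketch (pure Python; fit_for_vision is a hypothetical helper name) that computes a resize target honoring the recommended long edge while preserving aspect ratio:

```python
def fit_for_vision(width, height, long_edge=1568):
    """Scale (width, height) so the long edge is at most `long_edge` px.

    1568 px is the recommended long edge for general content; pass
    2500 for dense scanned text. Images already within budget are
    returned unchanged.
    """
    longest = max(width, height)
    if longest <= long_edge:
        return width, height  # already small enough, leave untouched
    scale = long_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

One option for the actual resize is Pillow: img.resize(fit_for_vision(*img.size)) gives you an image you can then re-encode and check against the 5 MB base64 cap.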
Pricing: vision is not free
A common surprise: vision usually dominates the input cost of a request. Claude tokenizes images into pixel-tiles and bills them as input tokens. Roughly:
- A 1092 × 1092 image ≈ 1600 input tokens.
- Cost scales close to linearly with width × height ÷ 750.
- A full-page A4 screenshot at 2000 px tall lands around 2400–2800 input tokens.
At Sonnet 4.7 input pricing (see Claude API pricing in 2026), one full-page screenshot costs about as much as 2400–2800 tokens of prose. Send ten screenshots in one request and your input bill is 25–30k tokens before the user prompt is even counted.
Two consequences:
- Crop aggressively. If you only need the chart in the corner, send only the chart. Half the pixels = half the price.
- Cache the system prompt. With prompt caching enabled, the text part of long agent prompts becomes cheap, and image tokens stay the dominant cost — which is the right trade-off because images change every turn anyway.
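The rule of thumb above turns into a pre-flight budget check in a few lines (a sketch; the helper names are mine, and ÷ 750 is this post's approximation, not an exact billing contract):

```python
def estimate_image_tokens(width, height):
    """Approximate input tokens for one image, using the rule of thumb
    tokens ~= width * height / 750 (after any downscaling)."""
    return round(width * height / 750)


def estimate_request_tokens(image_dims, prompt_tokens=0):
    """Rough input-token budget for a request carrying several images.

    image_dims is a list of (width, height) pairs; prompt_tokens is
    your estimate for the text parts of the request.
    """
    return prompt_tokens + sum(estimate_image_tokens(w, h) for w, h in image_dims)
```

Running estimate_image_tokens(1092, 1092) lands on roughly 1590, matching the ≈1600 figure above; summing over ten screenshots before sending is how you catch a 30k-token request before it bills.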
Single image: minimal example
import anthropic, base64, pathlib

client = anthropic.Anthropic(
    api_key="sk-...",
    base_url="https://api.claudexia.tech/v1",
)

img = base64.standard_b64encode(
    pathlib.Path("invoice.png").read_bytes()
).decode("utf-8")

resp = client.messages.create(
    model="claude-sonnet-4.7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": img,
                },
            },
            {
                "type": "text",
                "text": (
                    "Extract invoice number, date, vendor, total amount, "
                    "and line items. Return strict JSON with keys: "
                    "invoice_number, date, vendor, total, currency, items[]."
                ),
            },
        ],
    }],
)
print(resp.content[0].text)
That is the entire vision API. The image block goes inside content like any other message part, and you can mix images and text freely in the same turn.
Multi-image: comparison and grounding
Multi-image is where Claude shines. Send two screenshots and ask "what changed", or send a chart plus a table and ask "do these agree".
def img_block(path, media_type="image/png"):
    data = base64.standard_b64encode(pathlib.Path(path).read_bytes()).decode()
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
resp = client.messages.create(
    model="claude-sonnet-4.7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Image 1 is yesterday's dashboard, Image 2 is today's."},
            img_block("dash_yesterday.png"),
            img_block("dash_today.png"),
            {"type": "text", "text": "List every metric that changed by more than 5%, as JSON."},
        ],
    }],
)
Claude reliably treats the surrounding text as labels for each image — you do not need to encode position metadata, just describe the order in plain English.
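When the number of screenshots is dynamic, the label-then-image pattern generalizes to a small helper (a sketch with a hypothetical name; it takes prebuilt image blocks, such as the output of img_block above, so it stays self-contained):

```python
def labeled_images(pairs):
    """Build a content list that alternates a one-line text label with
    the image it describes.

    pairs is a list of (label, image_block) tuples, where each
    image_block is a dict shaped like the API's image content block.
    """
    content = []
    for label, block in pairs:
        content.append({"type": "text", "text": label})
        content.append(block)
    return content
```

The resulting list drops straight into the "content" field of a user message, followed by a final text block carrying the actual question.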
Use cases that actually work
1. OCR and document extraction. Invoices, receipts, contracts, bank statements, ID cards. Claude handles multi-column layouts, rotated scans, and handwritten annotations far better than classical OCR pipelines, and you get structured output in one call instead of OCR → parser → validator.
2. Chart-to-data. Send a bar chart or line chart, ask for the underlying values as JSON. Accuracy is excellent on labeled charts, decent on unlabeled ones (Claude will estimate from gridlines). Always ask for confidence per value if you intend to act on it.
3. UI screenshot to test selectors. Show Claude a page screenshot, ask it to produce Playwright selectors for the visible elements. Combined with an MCP browser tool, this is the basis of self-healing E2E tests.
4. PDF QA. Rasterize each page to PNG at ~1500 px tall, send the relevant pages with the question. Better than text-only PDF parsing because tables, stamps, signatures, and figures are all preserved.
5. Multi-image comparison. Before/after deploy screenshots, two versions of a design, two scans of the same form. Ask for a structured diff.
6. Visual debugging. Paste an error stack trace screenshot plus a screenshot of the relevant code panel. Claude reads both and explains the bug.
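For the PDF QA case, most rasterizers take a DPI rather than a pixel height, so a small conversion helps hit the ~1500 px target (a sketch; 842 pt is the standard A4 page height in PDF points, 1 pt = 1/72 inch):

```python
A4_HEIGHT_PT = 842  # standard A4 page height in PDF points


def dpi_for_target_height(target_px, page_height_pt=A4_HEIGHT_PT):
    """DPI to pass to a PDF rasterizer so a page of the given height
    (in PDF points) renders to roughly `target_px` pixels tall."""
    return round(target_px / (page_height_pt / 72))
```

With pdf2image, for example, convert_from_path("doc.pdf", dpi=dpi_for_target_height(1500)) renders each A4 page at roughly 1500 px tall, ready to encode as PNG blocks.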
Prompt patterns that improve accuracy
- Force JSON. "Return strict JSON. Do not include prose. If a field is unreadable, use null." Claude rarely hallucinates fields when the schema is explicit.
- Demand confidence. Ask for confidence: low|medium|high next to each extracted value. Low-confidence rows go to human review.
- Anchor with text. Before each image, write a one-line description ("This is page 3 of a German tax form"). It primes the model and improves OCR on language-specific glyphs.
- Chunk wide tables. For tables with 20+ columns, crop to two halves and send as two images with overlapping columns to detect alignment errors.
- Crop irrelevant areas. Headers, footers, watermarks, and ads in screenshots all consume tokens. A 30% crop is a 30% discount.
- High resolution wins on dense text. For 8-pt scanned text, 2500 px on the long edge is worth the extra tokens. For natural photos, 1568 px is plenty.
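The force-JSON and demand-confidence patterns combine naturally into a triage step after the call (a sketch; the {"value": ..., "confidence": ...} field layout is an assumption you would pin down in your own schema):

```python
import json


def triage_extraction(raw_json, min_conf="medium"):
    """Split extracted fields into auto-accept vs human review based on
    the per-field confidence the prompt asked for.

    Expects each field to look like:
        {"value": ..., "confidence": "low" | "medium" | "high"}
    Fields with missing or unknown confidence go to review.
    """
    rank = {"low": 0, "medium": 1, "high": 2}
    threshold = rank[min_conf]
    accepted, review = {}, {}
    for field, payload in json.loads(raw_json).items():
        if rank.get(payload.get("confidence"), -1) >= threshold:
            accepted[field] = payload["value"]
        else:
            review[field] = payload
    return accepted, review
```

Auto-accepted fields flow into your pipeline; everything in the review bucket goes to a human queue instead of being acted on blindly.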
Limits and refusals
- No video. Sample frames yourself.
- No streaming images. You cannot stream new images mid-completion. Send everything in the initial request.
- CAPTCHAs and watermarks. Claude will refuse to solve CAPTCHAs and is conservative about reading personal documents (passports, IDs) without context. Tell it the legitimate use case in the system prompt.
- Faces and identification. Claude will describe people generically but will not identify individuals by name from a photo.
- Adult and graphic content. Refused.
- Coordinates. Claude can describe where things are ("top right", "below the heading") but bounding-box pixel coordinates are approximate, not pixel-perfect. Use a dedicated detection model if you need precise boxes.
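Since video must be sampled client-side, the only design decision is where to grab frames. A sketch (helper name is mine) that computes evenly spaced, center-offset timestamps, which avoids the black or title frame many clips start with:

```python
def frame_timestamps(duration_s, n_frames):
    """Evenly spaced timestamps (seconds) at which to sample frames
    from a video, offset to the center of each interval."""
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 2) for i in range(n_frames)]
```

Each timestamp can then be handed to a frame extractor — e.g. ffmpeg -ss 5.0 -i clip.mp4 -frames:v 1 frame.png — and the resulting PNGs sent as ordinary image blocks, with a one-line text label giving each frame's timestamp.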
Claude vs GPT-4o on vision
Both are strong, but they have different sweet spots based on what teams ship:
- Claude wins on text-heavy and structured documents. Invoices, tables, multi-page contracts, dashboards, code on a screen, scanned forms. The OCR is more accurate and the JSON output is more disciplined.
- GPT-4o wins on natural photos and physical-world reasoning. Identifying objects in a cluttered room, reading body language, interpreting ambiguous scenes. It also has lower latency at small image sizes.
- Tie on charts. Both extract values from clean charts well; both struggle equally with stylized infographics.
- Claude wins on long multi-image context. Sending 20 screenshots of an app flow in a single request and asking for a coherent analysis is where Claude's larger context and better cross-image grounding pay off.
For backoffice automation, document pipelines, and agent workflows that look at app screenshots, Claude is the default. For consumer photo features, GPT-4o is competitive.
Bottom line
Vision plus tool use is what turns Claude from a chatbot into a document agent: it can look at a screen, decide what to do, and call a tool to do it. In 2026 most production "AI document" features are this loop — rasterize, send to Claude, get JSON, validate, act. Start with one image, force JSON output, measure accuracy on a sample of 100 real documents, and only then scale up. Vision is reliable, but it is also where your token bill quietly grows, so crop early, cache prompts, and never send a 4K screenshot when 1500 px would do.