Claude Vision: Structured Data Extraction That Beats OCR

Q: What image formats does Claude Vision accept?

According to [Anthropic's vision documentation](https://platform.anthropic.com/docs/en/build-with-claude/vision), Claude supports JPEG, PNG, GIF, and WebP. Images can be passed as base64-encoded data or via URL. Multiple images are supported in a single request, which is useful for multi-page documents.

Send Claude a photo of a handwritten invoice, a screenshot of a legacy ERP, or a multi-column PDF with mixed layouts, and it will return a correctly nested JSON object without a predefined schema. AWS Textract requires template training or explicit table markup to reach comparable accuracy on the same inputs. That single behavioral difference is where Claude Vision's professional value actually lives, and most coverage of the feature completely ignores it.

This article focuses on structured data extraction from visually complex documents: why Claude handles it better than traditional OCR in specific scenarios, where it breaks down, and what the cost ceiling looks like before the economics stop making sense.

Claude Reads Spatial Hierarchy, Not Just Text

Standard OCR tools read pixels and attempt to reconstruct text from recognized characters. They treat a document as a flat grid. Claude does something architecturally different: because it is a language model with visual perception integrated into that framework (as documented in GetStream.io's developer guide on Claude visual reasoning), it understands spatial relationships and infers implied hierarchy.

Concretely, this means a number sitting below a bold header in a screenshot gets attributed to that header's category, even without explicit delimiters, because Claude understands document conventions the way a human reader does. A subtotal row in a poorly formatted invoice doesn't need a label column for Claude to identify it as a subtotal. A form with merged cells and irregular field spacing doesn't need a layout template.

This produces extraction behavior that is qualitatively different from OCR pipelines. Given a screenshot of a multi-column invoice, Claude can return something like:

{
  "vendor": "Distribuidora Sul Ltda",
  "invoice_number": "NF-00412",
  "line_items": [
    { "description": "Porcelanato 60x60 Bege", "qty": 120, "unit_price": 48.90, "total": 5868.00 },
    { "description": "Argamassa AC-III", "qty": 40, "unit_price": 22.50, "total": 900.00 }
  ],
  "subtotal": 6768.00,
  "taxes": 812.16,
  "grand_total": 7580.16
}

No schema supplied. No template trained. That output comes from a single API call with a prompt asking for structured extraction in JSON. Claude inferred field names, nested the line items correctly, and separated the tax line from the subtotal because the document's visual structure implied it.

Anthropic's vision documentation confirms all current Claude models support this kind of multi-image, multi-format analysis. A single session can handle several pages of a document simultaneously, which matters for multi-page invoices and shipping manifests.

Where This Outperforms AWS Textract and Azure Form Recognizer at Low Volume

AWS Textract and Azure Form Recognizer are excellent services. At high volume, with consistent document formats and trained templates, they are faster, cheaper, and more deterministic than Claude. That is not the comparison worth making.

At low-to-medium volume, with heterogeneous document formats, building and maintaining extraction templates for every supplier, every legacy system, and every regional document variant is genuinely expensive in engineering hours.

Consider a mid-size retailer receiving invoices from 80 different suppliers. Each supplier has their own invoice layout. Textract handles this with AnalyzeDocument plus trained adapters, but each adapter requires labeled training data and periodic retraining when a supplier updates their template. Claude handles all 80 formats with one prompt and no training data, because it reads the document the way a person does.

A tradeoff worth stating directly: Claude's output is non-deterministic. Two runs on the same document can produce slightly different JSON keys or formatting choices. For applications requiring byte-identical output or strict schema validation downstream, this is a problem that requires mitigation (a fixed output schema in the prompt, JSON mode if available, or a validation layer post-extraction). At scale above roughly 500 documents per day, the per-token cost also becomes significant enough to warrant a cost-benefit comparison against a trained Textract pipeline.

My read of the current pricing is that Claude Vision sits comfortably below the cost threshold for operations processing under a few hundred documents daily, particularly when those documents are heterogeneous enough to make template training impractical.

The Failure Modes That Generic Coverage Ignores

Most articles on Claude Vision stop at capability lists. Failure modes are more useful.

First, image quality matters more than it seems. Anthropic's documentation is explicit: Claude may hallucinate or make mistakes on low-quality, rotated, or very small images under 200 pixels. In document processing workflows, this means pre-processing steps (deskewing, resolution upscaling, JPEG artifact removal) are not optional. A scanned invoice photographed at an angle on a warehouse floor will produce unreliable extraction. Scanned properly at 300 DPI, the same document will not.

Second, confidence is invisible. Textract returns confidence scores per field. Claude does not. If a field is partially illegible, Claude will often fill in a plausible value without signaling uncertainty. For financial documents, this is a real risk. A mitigation approach is to ask Claude to return a "confidence" key per field in the JSON output, where it flags low-certainty extractions, though this adds prompt complexity and token usage.

Third, Claude cannot reliably detect AI-generated or synthetically altered documents. Anthropic states this directly in their documentation. For fraud detection or document authentication workflows, this is a hard limitation that no prompt engineering will fix.

Fourth, table-dense documents with deeply nested structures can produce extraction errors on hierarchical nesting. A bill of lading with sub-items inside containers inside shipments will occasionally collapse nesting levels or duplicate a parent field. Breaking the prompt into stages works around this: extract the top-level structure first, then drill into each section as a separate call.

Claude Vision in E-Commerce Catalog Operations

Anthropic positions OCR and document extraction explicitly as a retail and logistics use case, noting that some enterprise customers have up to 50% of their knowledge bases encoded in PDFs, flowcharts, or presentation slides. In e-commerce catalog work, this maps directly to a specific and recurring pain point.

Supplier product sheets arrive in every format imaginable: scanned PDFs, PowerPoint exports saved as images, spreadsheets converted to print layouts, and photos of physical catalogs. Extracting structured product data (SKU, dimensions, material, weight, color variants, pricing) from these documents manually is slow and error-prone. A schema-based parser requires a different parser per supplier format.

Claude can process a product image alongside a supplier sheet in the same request, generate a description, and extract attributes simultaneously. A peer-reviewed framework study published in ScienceDirect in July 2025 (DOI: 10.1016/j.csa.2025.100083) validates the business case for multimodal AI in e-commerce operations specifically, noting measurable reduction in catalog preparation time when vision models handle attribute extraction from supplier materials.

In my own work managing a product catalog with thousands of active SKUs across flooring and tile categories, the supplier documentation problem is constant. Manufacturers update technical sheets, regional distributors send localized versions in non-standard formats, and seasonal collections arrive as marketing PDFs with product specs embedded in design-heavy layouts that no conventional parser handles cleanly. Claude's ability to extract structured data from these documents without per-format template work is practically useful in ways that a list of "vision use cases" doesn't convey.

A workflow that produces consistent results: supply the document image, specify the target JSON schema in the prompt (this dramatically reduces non-determinism), and run a lightweight validation step to catch missing required fields before writing to the catalog system. For documents where the schema is unknown upfront, a first-pass call asking Claude to propose a schema based on the document structure, followed by a second-pass extraction call using that schema, produces more consistent output than an open-ended extraction request.

FAQ

Can Claude Vision replace AWS Textract entirely?

For heterogeneous, low-to-medium volume document workflows (roughly under 500 documents per day), it is a practical and often cheaper alternative. For high-volume, consistent-format document processing where deterministic output and confidence scoring matter, Textract with trained adapters remains the stronger choice. The decision depends on document variety and throughput, not on which model is "better."

What image formats does Claude Vision accept?

According to Anthropic's vision documentation, Claude supports JPEG, PNG, GIF, and WebP. Images can be passed as base64-encoded data or via URL. Multiple images are supported in a single request, which is useful for multi-page documents.

How do you reduce non-determinism in extracted JSON?

Supply the exact target schema in the prompt. Specify field names, data types, and nesting structure explicitly. If the downstream system requires strict validation, add a JSON schema validation step after extraction and re-prompt on validation failures. This adds latency but substantially improves output consistency.

Does Claude Vision work on handwritten documents?

Yes, with caveats. Clean handwriting on a high-contrast background extracts reliably. Cursive, low-contrast, or heavily degraded handwriting produces hallucinations. Pre-processing (contrast enhancement, noise reduction) improves results significantly before sending to the API.

Cost and latency at scale are the real ceiling for Claude Vision in document workflows, not model capability. Below that ceiling, it handles document formats that would otherwise require a custom parser for every supplier, every legacy system, and every regional variation your operation encounters.

Claude Vision: Structured Data Extraction That Beats OCR

Claude Reads Spatial Hierarchy, Not Just Text

Where This Outperforms AWS Textract and Azure Form Recognizer at Low Volume

The Failure Modes That Generic Coverage Ignores

Claude Vision in E-Commerce Catalog Operations

FAQ

…