How to use Claude for structured data extraction from messy text

Authors: ThePromptEra Editorial

The Problem: Data Stuck in Text

You've got customer feedback emails, sales call transcripts, research papers, or user reviews scattered across documents. The information you need is there—contact details, product feedback, sentiment, key metrics—but it's tangled up in natural language, typos, formatting inconsistencies, and irrelevant context.

Manual extraction is a time sink. Regular expressions break on edge cases. Standard parsing tools choke on the ambiguity. This is where Claude shines. Its language understanding can navigate messy, real-world text and extract exactly what you need in clean, usable formats.

Why Claude Beats Traditional Extraction Tools

Claude doesn't require predefined patterns or rigid schemas. It understands context, handles ambiguous phrasing, and tolerates inconsistency—the hallmarks of real-world data. Give it a job, and it adapts.

When you combine clear structural requests with a mechanism that constrains the output shape, such as tool use with a JSON schema, you get extraction that holds up even when the source data is sloppy, incomplete, or formatted in unexpected ways. Still validate what comes back: a prompt that merely asks for JSON does not guarantee it.

Setting Up Your Extraction Prompt

The foundation of successful extraction is a clear, structured request. Here's the formula:

1. Define the schema upfront

Tell Claude exactly what fields you want and what they contain. Be specific about data types and edge cases:

Extract the following information into JSON:
- name (string): Full name if present
- email (string): Email address if present
- company (string): Company or organization name
- phone (string): Phone number in E.164 format, or null if not present
- sentiment (string): One of "positive", "negative", "neutral"
- key_issues (array of strings): Main problems mentioned, max 5 items
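
Holding the same schema in code makes it possible to check every response against what the prompt asked for. The sketch below is one way to do that, not part of any Claude API: `CONTACT_SCHEMA` and `validate_record` are hypothetical names, and treating name, email, company, and phone as nullable is an assumption read from the field descriptions above.

```python
from typing import Any

# Hypothetical mirror of the prompt's schema: field -> (accepted types, nullable).
CONTACT_SCHEMA: dict[str, tuple[tuple[type, ...], bool]] = {
    "name": ((str,), True),
    "email": ((str,), True),
    "company": ((str,), True),
    "phone": ((str,), True),
    "sentiment": ((str,), False),
    "key_issues": ((list,), False),
}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
MAX_KEY_ISSUES = 5


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems: list[str] = []
    for field, (types, nullable) in CONTACT_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                problems.append(f"{field} must not be null")
            continue
        if not isinstance(value, types):
            problems.append(f"{field} has wrong type: {type(value).__name__}")
    sentiment = record.get("sentiment")
    if isinstance(sentiment, str) and sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"sentiment not allowed: {sentiment}")
    issues = record.get("key_issues")
    if isinstance(issues, list):
        if len(issues) > MAX_KEY_ISSUES:
            problems.append("key_issues has more than 5 items")
        if not all(isinstance(item, str) for item in issues):
            problems.append("key_issues must contain only strings")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to see every way a response drifted from the schema.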

2. Show examples when the schema is complex

Don't just describe—demonstrate:

Example input:
"Hey team, just got off a call with Sarah Martinez from Acme Corp.
She's frustrated about the API response times. Email: sarah@acme.io,
mobile is +1-415-555-0192. Says your competitors are faster.
But she loves the documentation."

Expected output:
{
  "name": "Sarah Martinez",
  "email": "sarah@acme.io",
  "company": "Acme Corp",
  "phone": "+14155550192",
  "sentiment": "neutral",
  "key_issues": ["API response times"]
}
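
If you keep worked examples as data, the prompt can be assembled programmatically so examples are easy to add or swap. This is a minimal sketch; `build_few_shot_prompt` and its section labels are illustrative choices, not a required format.

```python
import json


def build_few_shot_prompt(instructions: str, examples: list[tuple[str, dict]], text: str) -> str:
    """Assemble instructions, worked examples, and the target text into one prompt."""
    parts = [instructions.strip()]
    for number, (example_input, expected) in enumerate(examples, start=1):
        parts.append(
            f"Example {number} input:\n{example_input.strip()}\n\n"
            f"Example {number} output:\n{json.dumps(expected, indent=2)}"
        )
    parts.append(f"Text to extract from:\n{text.strip()}")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract name and email as JSON.",
    [("Call with Sarah Martinez, sarah@acme.io", {"name": "Sarah Martinez", "email": "sarah@acme.io"})],
    "Spoke to Li Wei today, reach him at li@example.com",
)
```

Serializing the expected output with `json.dumps` guarantees that every example shown to the model is itself valid JSON.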

3. Handle missing data explicitly

Tell Claude what to do when information isn't present:

For any field not found in the text, use null (not empty strings or "N/A").
If sentiment is unclear, set it to "neutral".
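
Even with that instruction in place, a defensive pass on the parsed result is cheap. The sketch below assumes a small set of placeholder strings (`MISSING_MARKERS`, a hypothetical list you would tune to your own data) and maps them to `None`.

```python
from typing import Any

# Placeholder strings treated as "missing"; this set is an assumption, extend as needed.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "unknown"}


def normalize_missing(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record with placeholder strings replaced by None."""
    cleaned: dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
            cleaned[key] = None
        else:
            cleaned[key] = value
    return cleaned
```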

Real-World Example: Customer Feedback Extraction

Let's say you're processing customer support tickets and need to extract actionable data:

I need to extract structured data from customer support tickets.
Extract:
- ticket_id (string): Reference number if present
- customer_name (string): Full name
- issue_category (string): One of "billing", "technical", "feature_request",
  "documentation", "other"
- urgency (string): "critical", "high", "medium", "low" based on tone/language
- description (string): Concise summary of the issue in 1-2 sentences
- resolution_status (string): "unresolved" by default; "resolved" only if the ticket explicitly says so
- action_items (array): Specific next steps mentioned or implied

Return valid JSON only, no markdown formatting.

Ticket text:
---
#TICKET-8847
Customer: Michael Chen
"I've been trying for THREE DAYS to get my invoice fixed.
Your billing system charged me twice for the November subscription.
I've emailed support twice already and got no response.
This is ridiculous. I need this resolved immediately.
Also, your docs are completely unclear on how to request refunds."
---

Claude will extract:

{
  "ticket_id": "TICKET-8847",
  "customer_name": "Michael Chen",
  "issue_category": "billing",
  "urgency": "critical",
  "description": "Customer was double-charged for November subscription. Previous support emails went unanswered.",
  "resolution_status": "unresolved",
  "action_items": [
    "Investigate duplicate charge in billing system",
    "Issue refund for second charge",
    "Review unanswered support tickets",
    "Improve refund request documentation"
  ]
}
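
The "no markdown formatting" instruction is usually followed, but a parser that tolerates a stray code fence is a reasonable safety net. This is a sketch under that assumption; `parse_json_reply` is a hypothetical helper and handles only a fence wrapped around the whole reply.

```python
import json
import re

# A Markdown fence marker, built from single characters to keep this listing readable.
FENCE_MARK = "`" * 3
FENCED_REPLY = re.compile(
    r"^" + FENCE_MARK + r"(?:json)?\s*\n(.*?)\n?" + FENCE_MARK + r"\s*$",
    re.DOTALL,
)


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    stripped = reply.strip()
    match = FENCED_REPLY.match(stripped)
    if match:
        stripped = match.group(1)
    parsed = json.loads(stripped)  # raises json.JSONDecodeError on malformed input
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, callers can catch `ValueError` to handle both malformed JSON and a wrong top-level type.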

Handling Scale: Batch Processing

When you have hundreds or thousands of items, process them efficiently:

Option 1: Send multiple items at once

Include several items in a single API call, separated clearly:

Process each ticket below. Return an array of JSON objects.

---TICKET 1---
[ticket text]

---TICKET 2---
[ticket text]

---TICKET 3---
[ticket text]

Response should be a valid JSON array like: [{ "ticket_id": "...", ... }, ...]
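
A prompt in this shape can be generated from a list of tickets, and the reply checked for the right number of results. Both function names below are hypothetical, and the closing instruction about count and order is an addition that helps match each result to its source ticket.

```python
def build_batch_prompt(instructions: str, tickets: list[str]) -> str:
    """Join tickets under numbered delimiters so each result can be matched to its source."""
    sections = [
        f"---TICKET {number}---\n{ticket.strip()}"
        for number, ticket in enumerate(tickets, start=1)
    ]
    return (
        f"{instructions.strip()}\n\n"
        + "\n\n".join(sections)
        + f"\n\nReturn a JSON array with exactly {len(tickets)} objects, in the same order."
    )


def check_batch_length(results: list[dict], tickets: list[str]) -> None:
    """Raise if the model returned a different number of results than tickets sent."""
    if len(results) != len(tickets):
        raise ValueError(f"sent {len(tickets)} tickets, got {len(results)} results")
```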

Option 2: Chunked processing for large datasets

For very large datasets, split the work into batches of 10-50 items per API call. Smaller batches keep each response within the output token limit, keep latency reasonable, and make costs easier to predict.
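
Splitting a dataset into fixed-size batches takes only a few lines. The helper name `chunked` is illustrative, and the batch size is whatever you settle on after testing.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


tickets = [f"ticket text {n}" for n in range(1, 61)]
batches = list(chunked(tickets, 25))  # three batches: 25, 25, and 10 items
```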

Advanced Techniques

Confidence scoring: Ask Claude to add a confidence field (0-1) for each extracted value. These scores are the model's own estimate, not a calibrated probability, but they are useful for flagging results that need manual review:

{
  "name": "John Smith",
  "name_confidence": 0.95,
  "phone": "555-0123",
  "phone_confidence": 0.6
}
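
Routing low-confidence values to a reviewer can then be a simple filter. This sketch assumes every score is stored under a `<field>_confidence` key; the 0.8 threshold is an arbitrary starting point, not a recommendation from any documentation.

```python
def fields_needing_review(record: dict, threshold: float = 0.8) -> list[str]:
    """Return names of fields whose `<field>_confidence` score is below the threshold."""
    suffix = "_confidence"
    flagged = []
    for key, score in record.items():
        if key.endswith(suffix) and isinstance(score, (int, float)) and score < threshold:
            flagged.append(key[: -len(suffix)])
    return flagged


record = {
    "name": "John Smith",
    "name_confidence": 0.95,
    "phone": "555-0123",
    "phone_confidence": 0.6,
}
print(fields_needing_review(record))  # prints ['phone']
```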

Nested structures: For complex data, don't flatten everything. Use nested JSON:

{
  "customer": {
    "name": "Jane Doe",
    "contact": {
      "email": "jane@example.com",
      "phone": "+18005550100"
    }
  },
  "order": {
    "id": "ORD-12345",
    "items": [{ "product": "Widget", "quantity": 2, "price": 29.99 }]
  }
}
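
Nested output is easier to consume with a small accessor that tolerates missing branches. `dig` below is a hypothetical helper, shown as one possible approach.

```python
from typing import Any


def dig(record: Any, *path: Any, default: Any = None) -> Any:
    """Follow keys and list indexes into nested JSON, returning `default` if a step is missing."""
    current = record
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


order = {
    "customer": {"name": "Jane Doe", "contact": {"email": "jane@example.com"}},
    "order": {"id": "ORD-12345", "items": [{"product": "Widget", "quantity": 2, "price": 29.99}]},
}
print(dig(order, "customer", "contact", "email"))  # prints jane@example.com
print(dig(order, "customer", "contact", "phone"))  # prints None
```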

Conditional extraction: Some fields only matter in certain contexts. Make this explicit:

If the text mentions a product issue, also extract:
- affected_product (string)
- version (string or null): Product version if mentioned
- workaround (string or null): Any temporary fix suggested
Otherwise, omit these fields.
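
Conditional fields are easy to get half-right, so a consistency check helps. This sketch takes one reading of the rule above: if any of the three product fields appears, all three keys should be present and `affected_product` should be non-null. The function name is hypothetical.

```python
PRODUCT_FIELDS = ("affected_product", "version", "workaround")


def check_conditional_fields(record: dict) -> list[str]:
    """Report inconsistencies in the optional product-issue fields."""
    if not any(field in record for field in PRODUCT_FIELDS):
        return []  # no product issue reported, fields correctly omitted
    problems = []
    if record.get("affected_product") is None:
        problems.append("product fields present but affected_product is missing or null")
    for field in PRODUCT_FIELDS:
        if field not in record:
            problems.append(f"missing conditional field: {field}")
    return problems
```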

Common Pitfalls and How to Avoid Them

Pitfall 1: Vague field descriptions

❌ Bad: "Extract sentiment"
✅ Good: "Sentiment must be 'positive', 'negative', or 'neutral'. Mixed feelings default to 'neutral'. Base it on the overall tone, not isolated phrases."

Pitfall 2: No guidance on partial data

❌ Bad: "Extract name, email, phone"
✅ Good: "Extract name, email, phone. Any field not present should be null, not empty string."

Pitfall 3: Inconsistent formatting requests

Specify formats precisely: dates (ISO 8601), phone numbers (E.164), currency (cents as integer), etc.
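
The same formats can be enforced after extraction. The helpers below are a simplified sketch with hypothetical names: `to_e164` assumes a bare 10-digit number belongs to country code 1, which only suits North American data, and a dedicated library such as `phonenumbers` is a better fit for international input.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional


def to_e164(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Best-effort E.164 formatting; returns None when the input cannot be normalized."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        candidate = digits
    elif len(digits) == 10:
        candidate = default_country_code + digits  # assumes a 10-digit national number
    else:
        return None
    # E.164 allows at most 15 digits, and country codes never start with 0.
    if not candidate or len(candidate) > 15 or candidate.startswith("0"):
        return None
    return "+" + candidate


def to_cents(amount: str) -> int:
    """Convert a decimal currency string such as "$1,299.50" to integer cents."""
    cleaned = re.sub(r"[^\d.\-]", "", amount)
    return int((Decimal(cleaned) * 100).to_integral_value())


def is_iso_date(value: str) -> bool:
    """True when `date.fromisoformat` accepts the string as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

Using `Decimal` rather than `float` for currency avoids binary rounding surprises when converting to cents.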

Pitfall 4: Mixing extraction with transformation

Keep extraction pure. Don't ask Claude to "extract and also calculate growth rate" in the same call. Extract first, transform separately.
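
In practice that means the arithmetic lives in ordinary code that runs on the extracted values. The field names `revenue_q1` and `revenue_q2` below are hypothetical stand-ins for whatever your extraction returns.

```python
from typing import Optional


def growth_rate(previous: Optional[float], current: Optional[float]) -> Optional[float]:
    """Fractional change from previous to current, or None when it cannot be computed."""
    if previous is None or current is None or previous == 0:
        return None
    return (current - previous) / previous


# `extracted` stands in for the JSON returned by the extraction call.
extracted = {"revenue_q1": 120000, "revenue_q2": 150000}
rate = growth_rate(extracted["revenue_q1"], extracted["revenue_q2"])
print(f"{rate:.1%}")  # prints 25.0%
```

Keeping the calculation deterministic also means a wrong number can be traced to either the extraction or the formula, never an unclear mix of both.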

Testing Your Extraction Prompt

Before scaling, test with diverse examples:

  1. Normal case: Clean, typical data
  2. Missing data: Information that isn't present
  3. Messy case: Typos, odd formatting, ambiguous phrasing
  4. Edge case: Unusual but valid scenarios (multiple phone numbers, no email, etc.)

Run a few manual tests, compare output to ground truth, and refine your schema or prompt based on failures.
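
Comparing output to ground truth is easier to repeat when it is scripted. This is a minimal sketch using exact-match scoring per field; `field_accuracy` is a hypothetical helper, and a field missing from a prediction is treated the same as null.

```python
from collections import defaultdict


def field_accuracy(predictions: list[dict], ground_truth: list[dict]) -> dict[str, float]:
    """Fraction of test cases in which each field exactly matches the expected value."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total[field] += 1
            # .get() returns None for an absent key, so "missing" counts as null.
            if predicted.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


truth = [{"name": "Ann", "email": None}, {"name": "Bob", "email": "bob@example.com"}]
preds = [{"name": "Ann", "email": None}, {"name": "Robert", "email": "bob@example.com"}]
print(field_accuracy(preds, truth))  # prints {'name': 0.5, 'email': 1.0}
```

Per-field scores show which part of the schema or prompt needs work, which a single overall accuracy number would hide.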

Integration with Your Workflow

Claude's API integrates cleanly with Python, Node.js, or any HTTP client. Here's the minimal pattern:

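import json
import os

import anthropic

# Read the key from the environment rather than hard-coding it in source.
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

prompt = "Your extraction prompt here..."
text = "The messy text to extract from..."

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # substitute a model ID currently available to you
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"{prompt}\n\nText to extract from:\n{text}",
        }
    ],
)

# The reply is a list of content blocks; the first one holds the text.
raw = message.content[0].text
try:
    result = json.loads(raw)
except json.JSONDecodeError as error:
    # The model can still return malformed JSON or wrap it in prose.
    raise ValueError(f"Response was not valid JSON: {raw[:200]!r}") from error
print(result)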
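
For production use, a retry on unparseable output is a common addition. The sketch below keeps the API call behind a hypothetical `call_model` function (prompt in, reply text out) so the retry logic can be tested without network access; wiring it to `client.messages.create` is left to you.

```python
import json
from typing import Callable, Optional


def extract_with_retry(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model until it returns a parseable JSON object or attempts run out."""
    last_error: Optional[Exception] = None
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = call_model(current_prompt)
        try:
            parsed = json.loads(reply)
            if isinstance(parsed, dict):
                return parsed
            last_error = ValueError("top-level JSON value was not an object")
        except json.JSONDecodeError as error:
            last_error = error
        # Tell the model what went wrong before the next attempt.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was not usable ({last_error}). "
            "Return only a JSON object."
        )
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Capping the attempts keeps cost bounded; items that still fail are better sent to manual review than retried indefinitely.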

When to Use Claude vs. Other Tools

Use Claude for:

  • Ambiguous or inconsistent source data
  • Complex reasoning about what should be extracted
  • Extracting from mixed content (text, tables, semi-structured data)
  • One-off or ad-hoc extraction tasks

Consider alternatives for:

  • Perfectly structured, machine-generated data (regex or simple parsers)
  • Extremely high-volume extraction at minimal cost (though Claude is competitive)
  • Real-time extraction with <100ms latency requirements

For most professional use cases involving real-world, human-generated text, Claude is the better choice.

Getting Started Today

Start small. Pick one messy data source you're dealing with, write a clear extraction prompt with 2-3 examples, and test it on 20 items. You'll quickly see where your schema needs refinement.

The combination of Claude's language understanding and a schema-constrained output format makes extraction reliable, maintainable, and quick to implement. Your data is waiting to be unlocked.