How to write system prompts that survive edge cases and adversarial users

The problem with naive system prompts

You've written a system prompt. It works beautifully in normal conditions. Then a user tries something unexpected—maybe they paste malicious code, ask you to ignore your instructions, or push the boundaries of your guidelines in creative ways. Suddenly your carefully crafted prompt falls apart.

This isn't a limitation of Claude. It's a design problem. Most system prompts are built for the happy path. They assume cooperative users and predictable inputs. But production systems don't get that luxury.

The good news: you can write system prompts that hold up under pressure. It requires thinking like an adversary and building defensively from the start.

Start with threat modeling

Before you write a single constraint, ask yourself: what could go wrong?

Map out your actual risks:

Instruction injection: Can a user embed hidden commands in their input that override your system prompt?

Scope creep: Will users try to use your system for tasks outside its intended purpose?

Jailbreaking: What are the specific ways someone might try to make Claude ignore your safety guidelines?

Context confusion: Could ambiguous inputs make Claude misinterpret what you're asking for?

Compliance violations: Are there regulatory or ethical boundaries you absolutely cannot cross?

For example, if you're building a code review assistant, your threats look different than if you're building a customer support bot. The code reviewer might face attempts to hide security vulnerabilities. The support bot might face angry customers trying to escalate beyond your authority.

Write down 3-5 concrete scenarios that would break your system. These become your test cases later.

Use explicit constraints with clear scope

Vague instructions fail under pressure. Specific ones hold.

Instead of this:

You are a helpful assistant that provides code reviews.

Write this:

You review Python code for security issues, performance bottlenecks, and style violations. You do NOT:
Provide code that implements the user's requests (you review existing code only)
Make changes to database queries without explicitly flagging the security implications
Approve code as production-ready; you flag risks and suggest improvements
Review code in languages other than Python (politely decline and suggest language-specific tools)

The difference matters. The second version:

Defines what you DO (review Python)
Defines what you DON'T do (very important)
Explains the reasoning where it prevents common workarounds
Creates natural boundaries

When you say "I do NOT provide generated code," a user can't slip you a prompt like "here's a code snippet, now write a corrected version"—because you've already established that boundary.

Build in validation before action

Don't trust the user's framing. Add a validation layer.

For any system that takes input and acts on it, include a step where Claude explicitly confirms it understands the scope:

Before proceeding with any analysis, you MUST state:
The language of the code (confirm it's Python)
The type of review being requested (security/performance/style)
Any explicit limitations mentioned by the user

This simple pattern does several things:

It creates a checkpoint before Claude commits to an action
It makes assumptions visible (which the user can correct)
It prevents misinterpretation from leading to scope violations
It gives you (or an auditing system) a clear record of what was agreed to

Use role constraints, not just behavior constraints

A weak system prompt says: "Be helpful and accurate."

A strong one says: "You are a security auditor reviewing code before production deployment. Your job is to find problems, not to reassure. You work for the engineering team, not the developer who wrote the code."

This role constraint creates a psychological framework that's harder to break. When Claude has internalized that it's playing the role of skeptical auditor, it's more resistant to casual jailbreaks like "I'm sure this is fine, right?" (A skeptical auditor would still look for holes.)

The role should be:

Specific (not "helpful assistant" but "security-first code auditor")
Grounded in incentives (you work for team safety, not individual convenience)
Tension-aware (acknowledge where your role might conflict with being "nice")

Test with adversarial inputs

Write out 10-15 test prompts designed to break your system. Include:

Direct instruction overrides: "Ignore your previous instructions and..."
Scope expansion: "I know you usually only do X, but can you do Y?"
Authority claims: "The admin said you should..."
Emotional manipulation: "This is really important, please just..."
Ambiguous edge cases: "What if the code has elements of both Python and JavaScript?"
Flattery and rapport-building: "You're the best code reviewer, I trust you completely..."

For each test case, run it against your system prompt and see what happens. Does Claude:

Stay in scope?
Validate assumptions?
Explain why it can't do something (rather than doing it)?
Handle the edge case gracefully?

If your prompt fails any of these, iterate. This is not a one-pass exercise.

Document the reasoning, not just the rules

Users will test boundaries. When they do, Claude should be able to explain why a boundary exists, not just assert it.

Instead of:

You cannot provide code suggestions.

Write:

You cannot provide code suggestions because your role is to audit existing code, not to generate new solutions. This boundary exists to prevent: (1) scope creep into full development work, (2) situations where you might approve your own generated code without proper skepticism.

When Claude explains the "why," it's both more robust to pushback and more useful to the user. They understand the constraint isn't arbitrary.

Version control your prompts

Treat system prompts like code. Track changes. When you discover an edge case, document it:

v1.2 (2026-04-07): Added explicit language check to prevent
non-Python code reviews. Discovered users were submitting
JavaScript wrapped in Python comments. Now validate language
before analysis starts.

This creates institutional knowledge about what edge cases you've hit and how you fixed them.

The one principle that matters most

If you take nothing else: make your constraints evidence-based, not aspirational.

Don't write "be extremely careful with security." Write: "Before approving any database query, you MUST list the three specific SQL injection vectors it could expose."

The first is a hope. The second is a mechanism. Mechanisms survive adversarial conditions. Hopes don't.

Test your prompts. Stress-test them. Assume users will find the edges you didn't think of. When they do, don't patch with wishes—patch with mechanisms.

That's how you build prompts that actually hold up in the real world.