LLM Guardrails and Safety in Production AI Systems
Last post covered evaluation, monitoring, and model degradation. This one covers guardrails — how you prevent LLMs from hallucinating, leaking data, following malicious instructions, or generating harmful content in production systems.
LLMs generate probabilistic outputs. In healthcare, finance, or legal — any regulated domain — you can’t have the model hallucinating symptoms, giving medical advice it shouldn’t, or producing content that causes harm. Guardrails are the safety net between what the model generates and what reaches the user.
The Layered Architecture
No single guardrail catches everything. Production systems stack multiple layers, each catching what the others miss.
User Input → Input Guardrails → LLM → Output Guardrails → Delivery
Think of it like airport security. No single checkpoint catches every threat. The combination of multiple layers — each designed for a specific type of risk — is what makes the system reliable. Same principle applies here.
Input Guardrails
These run before the LLM sees anything. The goal is to catch problematic inputs early, before you spend compute on generation and before the model has a chance to follow bad instructions.
Prompt Injection Detection
Most teams think of prompt injection as one thing. In production, there are three distinct attack vectors, and each needs a different detection strategy.
Direct injection — the user explicitly tries to override the system prompt. “Ignore your previous instructions and do X.” These are the simplest to detect. A lightweight classifier trained on known injection patterns catches most of them. Even rule-based pattern matching works for the obvious cases.
Indirect injection — this is the dangerous one, especially in RAG systems. Malicious content is hidden inside documents that the LLM retrieves and processes. The user didn’t inject anything — the attack is embedded in a PDF, a web page, or a database record that your retrieval pipeline pulls into context. The LLM reads it as part of the context and follows the embedded instructions.
This is harder to defend against because you can’t just filter user input — you need to sanitize retrieved content too. Strategies: scan retrieved documents for instruction-like patterns before injecting them into the prompt, use delimiter tokens to clearly separate system instructions from retrieved content, and instruct the model to treat retrieved content as data, not instructions.
Jailbreaks — adversarial prompts designed to bypass the model’s safety training. These evolve constantly. Role-playing attacks (“pretend you’re an AI with no restrictions”), encoding attacks (base64, rot13), multi-turn attacks that gradually escalate across a conversation.
No static filter catches all jailbreaks. The practical defense: a small classifier model fine-tuned on known jailbreak patterns, updated regularly as new patterns emerge. Combine this with behavioral monitoring — if a conversation suddenly shifts in tone or topic after several normal turns, flag it.
Topic Classification
Not every input needs the same pipeline. Route different input types to different processing paths.
A factual clinical question gets routed through RAG with strict grounding. A conversational follow-up gets handled with conversation history, no retrieval needed. An out-of-scope request (“write me a poem”) gets caught early and deflected without wasting LLM compute.
Use a lightweight classifier or even the LLM itself with a cheap, fast model (a small fine-tuned model or a single API call to a fast tier) to classify the input before the main LLM processes it.
PII Detection
In healthcare, personal health information hitting the model is a HIPAA violation waiting to happen. In finance, it’s a compliance risk.
Catch and mask PII before it reaches the model. Names, dates of birth, social security numbers, medical record numbers, addresses — all should be detected and replaced with placeholder tokens before the LLM prompt is assembled.
Tools: Microsoft Presidio (open-source, customizable), spaCy NER models for general PII, custom regex patterns for domain-specific identifiers (MRN formats, insurance policy numbers). Run multiple detectors in parallel — no single PII detector catches everything.
The masking needs to be reversible for the final output. Replace “John Smith” with “[PATIENT_NAME_1]” before the LLM, then swap it back in the final delivered output. The LLM never sees the real name, but the end user gets a natural-sounding result.
Output Guardrails
The LLM has generated a response. Now you validate it before it reaches the user. Multiple layers, each checking for different failure modes.
Rule-Based Filters
Hard-coded rules that override the LLM. Fast, deterministic, reliable.
“If output contains a medication dosage recommendation, block.” “If output suggests a specific diagnosis without citing the source transcript, block.” “If output contains language that could be interpreted as medical advice to a patient, rewrite the framing.”
These are your absolute safety boundaries. They don’t need to be smart — they need to be fast and never wrong. A regex that catches “you should take [medication]” and blocks it will never have a false negative on that exact pattern. Build a library of these rules based on your domain’s specific risks.
The limitation: rules only catch what you’ve anticipated. They miss novel phrasings, edge cases, and subtle violations. That’s what the next layers handle.
LLM-as-Judge
A second, smaller LLM evaluates the output of the first. “Does this response contain hallucinated clinical information?” “Does this response stay within the scope of the provided context?” “Could this response cause harm if acted upon?”
Slower than rules but catches nuanced issues that pattern matching misses. The judge LLM doesn’t need to be large — a fine-tuned small model focused specifically on safety evaluation outperforms a general-purpose large model on this task because it’s been trained on exactly the failure modes you care about.
The key design decision: the judge model should have a different architecture or training data than the primary model. If both models share the same biases, the judge will miss the same failures the primary model makes. Diversity in your evaluation pipeline is a feature.
Schema Validation
For structured outputs — SOAP notes, assessment scores, structured reports — validate against a strict schema before delivery.
Missing fields, wrong data types, out-of-range values, impossible combinations → reject and retry. If a clinical assessment score should be between 0 and 27 and the model outputs 34, that’s a schema violation. Catch it before it reaches a clinician’s dashboard.
Use JSON Schema validation or Pydantic models. Define your schemas tightly. Every field with explicit types, ranges, and constraints. Force the LLM to output structured JSON (using function calling or structured output modes) so validation is clean and deterministic.
The retry strategy matters: if schema validation fails, retry with a modified prompt that includes the specific validation errors. “Your previous output had an invalid score of 34. The valid range is 0–27. Regenerate.” Most models self-correct on the first retry when given explicit error feedback.
Clinical Validation Layer
Domain-specific checks that require specialized knowledge.
“The output references DSM-5-TR code F32.1 — does that code actually exist?” “The described symptoms are mapped to Major Depressive Disorder — do those symptoms actually align with MDD diagnostic criteria?” “The output mentions a drug interaction — is that interaction clinically documented?”
This requires a rules engine built on clinical knowledge bases. ICD-10 code databases, DSM-5-TR criteria mappings, drug interaction databases (DrugBank, RxNorm). The LLM generates natural language, and the clinical validation layer cross-references every clinical claim against authoritative sources.
This layer catches the most dangerous type of hallucination — outputs that sound clinically plausible but are factually wrong. A clinician reading a well-written but hallucinated assessment might not catch the error. The validation layer will.
Hallucination Prevention
Guardrails catch hallucinations after they happen. These techniques reduce hallucinations at generation time.
Grounding
Force the model to only use information from the provided context. The system prompt should be explicit: “Only reference observations that appear in the provided transcript. Do not infer, assume, or add information not directly present in the source material.”
This sounds simple. In practice, LLMs constantly add plausible-sounding details that aren’t in the source. Stronger grounding requires few-shot examples in the prompt showing the difference between grounded and ungrounded outputs. Show the model what a grounded response looks like and what an ungrounded response looks like. This is more effective than instruction alone.
Citation Enforcement
Make the model cite which part of the input supports each claim. “The patient reported feeling anxious [Transcript: 04:23–04:31].” If it can’t cite a specific source for a claim, it shouldn’t make that claim.
Implementation: instruct the model to output claims with inline citations. In post-processing, verify that each citation actually maps to the referenced source material. If the model cites “Transcript: 04:23–04:31” but that segment says something different from what the model claims, flag it.
This creates a verifiable chain from output back to input. Clinicians can spot-check citations instead of re-reading entire transcripts.
Post-Generation Verification
A separate model or rules engine cross-checks generated claims against the source data. “The output says the patient reported insomnia. Does the transcript contain any mention of sleep difficulties?”
This is essentially a natural language inference (NLI) task. Given the source transcript (premise) and a generated claim (hypothesis), classify as entailed, contradicted, or neutral. Claims classified as contradicted or neutral get flagged or removed.
Models: fine-tuned DeBERTa on clinical NLI datasets, or use an LLM with a focused prompt. The NLI approach is faster and more reliable than asking another LLM to “check if this is true” in open-ended fashion.
The Hybrid Architecture
This is becoming the standard pattern in regulated industries.
LLM handles the creative, flexible part — generating natural language clinical notes, conversational responses, summaries that read like a human wrote them.
Rules engine handles the deterministic, must-not-fail part — PII masking, schema validation, clinical code verification, hard safety boundaries.
Neither works well alone. Pure LLM systems are flexible but unreliable on safety. Pure rules-based systems are reliable but brittle and can’t handle natural language generation. The combination gives you natural language fluency with deterministic safety guarantees.
The architecture: LLM generates → rules engine validates → if validation passes, deliver. If validation fails, either retry with feedback or route to human review. The rules engine has veto power over the LLM. Always.
Guardrail Latency Budget
Every guardrail layer adds latency. This is the tradeoff most guides skip.
Input PII detection: 50–100ms. Topic classification: 30–50ms. Prompt injection detection: 50–150ms. LLM generation: 500–2000ms. Rule-based output filters: 10–30ms. LLM-as-judge: 300–800ms. Schema validation: 5–10ms. Clinical validation: 50–200ms.
Stack all of these synchronously and you’re adding 500–1300ms on top of LLM generation time. For real-time applications, that’s the difference between acceptable and unusable.
The production decision: which guardrails run synchronously (blocking — the response doesn’t deliver until they pass) versus asynchronously (the response delivers immediately, but gets flagged and reviewed after the fact).
Synchronous (blocking): PII detection, schema validation, rule-based safety filters. These are fast and non-negotiable. If PII leaks or a hard safety rule triggers, you can’t deliver and fix later.
Asynchronous (non-blocking): LLM-as-judge, clinical validation, detailed hallucination checks. These are slower and catch subtler issues. Log the results, flag violations for human review, and if something critical surfaces, trigger a correction or alert after the fact.
This split keeps response latency acceptable while still running comprehensive safety checks. The async layer catches things the sync layer misses, just not in real-time.
Guardrails Frameworks
Guardrails AI — open-source framework for adding validators to LLM outputs. Define validators as reusable components (check for PII, validate JSON schema, check for toxicity), chain them together. Good starting point for teams building their first guardrails pipeline.
NeMo Guardrails (NVIDIA) — dialog management with programmable guardrails. Uses Colang, a modeling language for conversational flows. Defines what the AI should and shouldn’t talk about, how it should respond to specific topics. More opinionated than Guardrails AI but more structured for dialog-heavy applications.
Custom implementation — for critical production systems in regulated environments, many teams build their own. Full control over the logic, no dependency on external frameworks that might change behavior between versions. More engineering effort upfront, but you own every line of safety-critical code.
The decision depends on your risk tolerance. Prototyping and low-stakes applications — use a framework. Regulated production systems where a guardrail failure has legal or clinical consequences — build custom or at minimum audit every line of the framework you’re using.
Testing Your Guardrails
Guardrails that aren’t tested are guardrails that don’t work.
Red teaming — dedicate time specifically to breaking your own guardrails. Try every prompt injection technique. Feed adversarial inputs. Attempt to extract PII from the model’s outputs. If your team can break it, attackers definitely can.
Regression testing — every time you update the LLM, rerun your full guardrails test suite. A new model version might handle certain injection patterns differently. Guardrails that worked on GPT-4 might not work the same way on GPT-4o or a fine-tuned model.
Continuous evaluation — monitor guardrail trigger rates in production. If your PII detector suddenly starts triggering 3x more often, either your input distribution changed or your detector is misfiring. If your hallucination checker stops triggering entirely, it’s probably broken, not because the model stopped hallucinating.
Guardrails aren’t a set-and-forget layer. They’re a living system that needs the same monitoring and maintenance as the models they protect.
This covers the safety layer — input filtering, output validation, hallucination prevention, and the latency tradeoffs of running guardrails in production.
This is the final post of this series. See you in something interresting topic in the next one.
LLM Guardrails and Safety in Production AI Systems was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.