How to Redact PHI Before Sending Documents to LLMs
Learn how to redact PHI, mask sensitive fields, and safely send OCR output to LLMs without exposing medical data.
Healthcare teams want the speed and flexibility of generative AI, but medical records, intake forms, claims documents, and lab reports often contain protected health information (PHI) that should not be sent to an LLM in raw form. The practical answer is not to avoid AI entirely; it is to build an OCR redaction workflow that extracts only the minimum necessary text, masks sensitive fields, and routes safe content into the model. That design keeps you aligned with data governance goals while still letting you automate summaries, classification, triage, and downstream workflow steps. It also mirrors the broader shift in AI product design highlighted in discussions around medical-record analysis in ChatGPT Health, where privacy controls matter as much as the AI experience.
For technical teams, this is ultimately a systems problem: document ingestion, OCR, entity detection, redaction, logging, and policy enforcement all need to work together. If you are designing a secure pipeline, it helps to think about it like other governance-heavy engineering decisions, such as enterprise AI compliance playbooks or choosing between enterprise AI and consumer chatbots. The difference here is that the input is not generic text; it is regulated health data, and the margin for error is much smaller.
1. What PHI Actually Needs to Be Redacted
Understand the data categories before you build the pipeline
PHI is not just a patient name on a chart. In practice, it includes names, addresses, dates tied to care, phone numbers, email addresses, medical record numbers, account numbers, device identifiers, insurance identifiers, diagnosis details, treatment details, and any combination of attributes that can identify a person in a healthcare context. If your OCR output preserves these fields and sends them to an LLM, you may be exposing far more than you intended, even if your prompt looks harmless. The first rule of health data protection is to separate identifying fields from clinical content before the AI step.
Use minimum-necessary extraction instead of full-document forwarding
The safest workflow is usually not “redact everything after OCR,” but “extract only the fields needed for the use case.” For example, if the AI task is to summarize a referral, you may only need specialty, appointment urgency, and a de-identified clinical complaint. If you are building a claims assistant, the model might only need procedure code, service date bucket, and denial reason. That approach reduces the blast radius and is more robust than relying on a large redaction list to catch every sensitive token.
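The minimum-necessary idea can be sketched as an allow-list filter over extracted fields. The field names and the sample record below are illustrative assumptions, not a standard schema:

```python
# Minimum-necessary extraction: forward only an approved field set downstream.
# ALLOWED_FIELDS and the sample record are illustrative, not a real schema.

ALLOWED_FIELDS = {"specialty", "urgency", "chief_complaint_deidentified"}

def build_ai_payload(record: dict) -> dict:
    """Keep only fields explicitly approved for the AI task."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

ocr_record = {
    "patient_name": "Jane Doe",          # never forwarded
    "date_of_birth": "1984-03-12",       # never forwarded
    "specialty": "pulmonology",
    "urgency": "routine",
    "chief_complaint_deidentified": "recurrent asthma symptoms",
}

payload = build_ai_payload(ocr_record)
```

Because the filter is an allow-list, a newly added field is dropped by default instead of leaking through.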
Know the distinction between PHI, PII, and operational data
Many teams confuse PHI with ordinary personally identifiable information (PII). A name by itself is PII, but in a medical record it often becomes PHI because it is linked to health context. Conversely, some operational fields like internal case IDs may be business-sensitive but not regulated in the same way. Your policy should classify every field by sensitivity and use case, then define whether it is removed, masked, hashed, generalized, or passed through. This classification layer is foundational if you want a durable LLM data masking strategy.
Pro Tip: Redaction should be policy-driven, not regex-driven alone. Regex catches obvious patterns, but field-level classification and entity-aware rules are what keep sensitive data out of the prompt.
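A policy-driven classification layer can be as simple as an explicit field-to-action map with a fail-closed default. The field names and actions here are illustrative assumptions:

```python
# Field-level classification: every field gets an explicit action instead of
# relying on regex hits alone. Names and actions below are illustrative.

FIELD_POLICY = {
    "patient_name":     "mask",
    "date_of_birth":    "generalize",
    "mrn":              "tokenize",
    "street_address":   "remove",
    "diagnosis_text":   "review",
    "internal_case_id": "pass",  # business-sensitive but not PHI in this example
}

def action_for(field: str) -> str:
    """Unknown fields default to removal -- fail closed, not open."""
    return FIELD_POLICY.get(field, "remove")
```

The fail-closed default matters: a new upstream field that nobody classified should never reach the prompt.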
2. Build the OCR Redaction Workflow Step by Step
Step 1: Ingest and normalize documents
The pipeline starts before OCR. You should normalize PDFs, deskew scans, split multi-page batches, and detect file type so the downstream OCR engine receives a stable input. Poor scan quality increases the chance that names, dates, and identifiers are misread, which can cause both under-redaction and over-redaction. If you need a practical baseline for document ingestion, preprocessing, and cleanup, see our guide on secure temporary file workflows for HIPAA-regulated teams and our article on local AI for enhanced safety and efficiency.
Step 2: Run OCR and preserve layout metadata
Do not flatten OCR output into plain text too early. You want bounding boxes, page numbers, confidence scores, and line structure because they help you map text back to its source location for redaction and auditability. For forms, tables, and mixed-content pages, preserve table structure and key-value relationships so you can isolate fields such as patient name, date of birth, member ID, and provider name. A layout-aware OCR stage also improves selective extraction, which is the core of a reliable OCR redaction workflow.
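A minimal shape for layout-aware OCR output might look like the following. The field names are an assumption; adapt them to whatever your OCR engine actually emits:

```python
# Keep bounding boxes, page numbers, and confidence instead of flattening to
# plain text. The span shape and threshold here are illustrative.

from dataclasses import dataclass

@dataclass
class OcrSpan:
    text: str
    page: int
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    confidence: float    # 0.0 - 1.0

def low_confidence(spans: list, threshold: float = 0.8) -> list:
    """Spans below threshold are candidates for review or re-OCR."""
    return [s for s in spans if s.confidence < threshold]

spans = [
    OcrSpan("John", 1, (10, 10, 60, 24), 0.95),
    OcrSpan("Sm1th", 1, (65, 10, 120, 24), 0.62),  # likely misread
]
```

Keeping the bounding box means a redaction decision can always be mapped back to an exact location on the source page for audit.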
Step 3: Detect sensitive entities and fields
Use a combination of deterministic rules and semantic detection. Deterministic rules are ideal for highly structured elements like MRNs, policy numbers, phone numbers, and date patterns. Semantic detection is needed for disease names, medications, clinical notes, and free-text references that may be identifying in context. The best systems use a confidence threshold to decide whether a field is masked, reviewed, or allowed through, rather than pretending the classifier is perfect.
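The deterministic half of that design, plus the three-way confidence decision, can be sketched as follows. The regex patterns and thresholds are illustrative, not a complete rule set:

```python
# Deterministic rules for structured identifiers, plus a confidence-based
# decision: mask, review, or escalate. Patterns and thresholds are examples.

import re

PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "mrn":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def detect_entities(text: str) -> list:
    """Return (label, match, start, end) for every deterministic hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    return hits

def decide(confidence: float, mask_at: float = 0.9, review_at: float = 0.6) -> str:
    """Three-way decision instead of pretending the classifier is perfect."""
    if confidence >= mask_at:
        return "mask"
    if confidence >= review_at:
        return "review"
    return "escalate"

hits = detect_entities("Contact 555-867-5309, MRN: 00123456, SSN 123-45-6789.")
```

Semantic detection for diseases, medications, and free text would layer on top of this, feeding the same `decide` step.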
Step 4: Redact, mask, tokenize, or generalize
Not every sensitive element needs the same treatment. A patient name might be replaced with a placeholder like [PATIENT_NAME], while a date of birth could be generalized to age range or year-only. An address may be reduced to city and state, and an account number can be tokenized so systems can still correlate records without exposing the raw identifier. The masking strategy should reflect the downstream AI task; over-redacting can destroy utility, while under-redacting creates exposure.
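The three treatments described above can be sketched as small helpers. The salt handling is a simplified stand-in for proper key management, not production cryptography:

```python
# Placeholder masking, year-only generalization, and deterministic
# tokenization. The salt here is hard-coded for illustration only.

import hashlib

def mask(value: str, label: str) -> str:
    """Replace a value with a labeled placeholder like [PATIENT_NAME]."""
    return f"[{label}]"

def generalize_dob(dob_iso: str) -> str:
    """Reduce a full date of birth (YYYY-MM-DD) to year only."""
    return dob_iso[:4]

def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Stable token so systems can correlate records without the raw value."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return f"tok_{digest[:12]}"
```

Because `tokenize` is deterministic for a given salt, the same account number maps to the same token across pipeline stages while the raw identifier never appears in the prompt.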
3. Design Redaction Rules That Preserve AI Utility
Keep the signal, remove the identity
The most common failure in text redaction is deleting so much that the LLM can no longer answer the question. A useful redaction policy keeps clinical meaning while removing identity. For example, “patient presented with recurrent asthma symptoms after exposure to dust” is often sufficient for triage or summarization, while the name, exact street address, and full date of birth are not needed. Think of the redaction layer as a filter that converts raw documents into structured, privacy-safe features.
Prefer structured extraction for high-value fields
When possible, extract only approved fields into JSON instead of passing a full text blob. For instance, a workflow may capture diagnosis category, encounter type, date bucket, payer type, and authorization status, then pass that object to the LLM for summarization or routing. Structured extraction is a strong form of sensitive data filtering because it limits the model’s context window to data you already vetted. It also makes validation easier because each field can be tested independently.
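Per-field validation of such a payload might look like this. The schema fields and their checks are illustrative assumptions:

```python
# Structured extraction: validate each approved field independently before
# anything reaches the model. The schema below is an illustrative example.

SCHEMA = {
    "diagnosis_category": lambda v: isinstance(v, str) and len(v) < 80,
    "encounter_type": lambda v: v in {"inpatient", "outpatient", "telehealth"},
    "date_bucket": lambda v: isinstance(v, str) and len(v) == 7,  # "YYYY-MM"
    "payer_type": lambda v: v in {"commercial", "medicare", "medicaid", "self"},
}

def validate_payload(payload: dict) -> list:
    """Return a list of (field, reason) problems; empty list means valid."""
    problems = []
    for field, value in payload.items():
        if field not in SCHEMA:
            problems.append((field, "not an approved field"))
        elif not SCHEMA[field](value):
            problems.append((field, "failed validation"))
    return problems

good = {"encounter_type": "outpatient", "date_bucket": "2024-06"}
bad = {"encounter_type": "outpatient", "patient_name": "Jane Doe"}
```

Rejecting unapproved fields at this layer means a stray identifier fails loudly in validation instead of silently riding into the context window.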
Use reversible masking only when absolutely necessary
Some workflows need re-identification later, such as internal review or human-in-the-loop follow-up. In those cases, use secure tokenization or vault-backed lookups rather than plain masking stored in app logs. Reversible masking should be tightly controlled, audited, and separated from model prompts whenever possible. This is especially important if your organization is also evaluating broader AI operations, like the budget and architecture tradeoffs described in cloud-native AI platform design.
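A vault-backed reversible mapping can be sketched as below. The in-memory dict stands in for a real secrets vault with access control and audit logging; this is a shape sketch only:

```python
# Reversible tokenization kept separate from prompts and logs. A real
# deployment would back this with a secured vault, not an in-memory dict.

import secrets

class TokenVault:
    def __init__(self):
        self._forward = {}   # raw value -> token
        self._reverse = {}   # token -> raw value

    def tokenize(self, value: str) -> str:
        """Return a stable random token for a value, minting one if needed."""
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        """Restricted, audited path for human-in-the-loop follow-up."""
        return self._reverse[token]

vault = TokenVault()
t = vault.tokenize("CASE-00412")
```

Unlike the hash-based tokens earlier, these are random, so a token reveals nothing about the original value even if the salt strategy is known; the tradeoff is that the vault itself becomes a critical asset.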
4. OCR Redaction vs. Manual Review: What to Automate and What to Escalate
Automate the obvious, escalate the ambiguous
Automation should handle repetitive, high-confidence patterns such as social security numbers, member IDs, and known form fields. But ambiguous content—handwritten notes, mixed-language documents, low-confidence scans, and clinical free text—needs a review path. The goal is not zero human oversight; it is targeted human oversight where risk is highest. In healthcare, this balance is similar to the trust problems discussed in building trust in AI after conversational mistakes: confidence without controls is dangerous.
Create an exception queue for uncertain entities
Any extraction or redaction system should generate an exception queue for fields that fall below threshold or conflict with expected schema. If OCR reads “J. Smith” with 62% confidence in a discharge summary, your pipeline should not blindly pass it through. Instead, route it to manual review or re-OCR at higher resolution. This approach is especially helpful for document anonymization in healthcare archives where older scans can be noisy and inconsistent.
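Threshold-based routing into an exception queue is straightforward to express. The threshold value and record shape here are illustrative:

```python
# Split extracted fields into auto-pass and manual-review queues based on
# OCR confidence. The threshold and field tuples are illustrative.

REVIEW_THRESHOLD = 0.80

def route(fields: list) -> dict:
    """Route (name, value, confidence) tuples to auto or review queues."""
    queues = {"auto": [], "review": []}
    for name, value, confidence in fields:
        if confidence >= REVIEW_THRESHOLD:
            queues["auto"].append((name, value))
        else:
            queues["review"].append((name, value, confidence))
    return queues

queues = route([
    ("member_id", "XK-99812", 0.97),
    ("patient_name", "J. Smith", 0.62),  # low confidence: never pass blindly
])
```

The review queue keeps the confidence score attached so a reviewer, or a re-OCR job at higher resolution, can prioritize the worst reads first.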
Measure false negatives more aggressively than false positives
In medical privacy workflows, missing one protected item is usually worse than overmasking a few non-critical fields. That means your QA program should optimize for recall on sensitive fields, not just precision. If your redaction model misses PHI in one out of every thousand documents, that may still be unacceptable depending on volume and risk profile. Regular sampling, adversarial testing, and red-team reviews should be part of the operational loop.
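Computing recall and precision over a hand-labeled set of sensitive spans makes this concrete. The ground-truth and detected sets below are illustrative:

```python
# Measure recall on sensitive entities separately from precision. Ground
# truth here would come from a hand-labeled sample; values are illustrative.

def recall_precision(detected: set, ground_truth: set):
    true_positives = detected & ground_truth
    recall = len(true_positives) / len(ground_truth) if ground_truth else 1.0
    precision = len(true_positives) / len(detected) if detected else 1.0
    return recall, precision

truth = {"Jane Doe", "1984-03-12", "MRN 00123456", "12 Elm St"}
found = {"Jane Doe", "MRN 00123456", "12 Elm St", "Acme Clinic"}

recall, precision = recall_precision(found, truth)
# The missed date of birth hurts recall; the spurious clinic hit hurts
# precision. For PHI, the recall miss is the one that should page someone.
```

In a QA loop, the recall number would gate deployment while the precision number merely informs tuning, matching the asymmetry described above.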
5. Security Controls for Secure AI Integration
Separate raw documents from AI-ready text
Raw source documents should live in a locked ingestion zone with strict access control, encryption at rest, and short retention windows. The redacted output should be stored in a separate workspace with different permissions and audit logs. That separation reduces the risk that a prompt builder, analytics job, or debugging session can accidentally expose original files. If you want practical patterns for isolating sensitive artifacts, our guide to staying anonymous in the digital age for DevOps teams is a useful complement.
Never send PHI to an LLM unless the contract and architecture support it
Some enterprise AI services offer stronger privacy guarantees, but you still need to verify data retention, training usage, encryption, region handling, and human-access policies. The BBC’s coverage of ChatGPT Health shows why this matters: even when a vendor says health data is stored separately and not used for training, customers still need airtight governance and clear limitations on use. For teams rolling out AI in regulated environments, our article on balancing AI features, consumer interaction, and privacy is a good reminder that product capability does not eliminate compliance responsibility.
Implement logging, but redact your logs too
Logs are often the hidden leak in AI pipelines. If you log prompts, OCR output, headers, or error payloads, you can accidentally reconstruct PHI even if the model never sees it. Build logging filters that remove sensitive fields at the source, and store trace IDs instead of patient data wherever possible. For operational resilience, create environment-specific retention policies and make sure test systems never receive production documents unless they are fully sanitized.
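One way to enforce this is a logging filter that scrubs known-sensitive patterns before any record is emitted. The pattern set below is a minimal illustrative sample, not a complete rule list:

```python
# Scrub sensitive patterns at the logging source so prompts and error
# payloads never land in logs verbatim. Patterns are a minimal example set.

import logging
import re

SCRUB_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE), "[MRN]"),
]

class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern, replacement in SCRUB_PATTERNS:
            message = pattern.sub(replacement, message)
        # Replace the formatted message so downstream handlers see only
        # the scrubbed text.
        record.msg, record.args = message, ()
        return True

logger = logging.getLogger("pipeline")
logger.addFilter(RedactingFilter())
```

Attaching the filter to the logger (rather than a handler) means every handler downstream, including any future ones, receives only scrubbed records.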
6. A Practical Comparison of Redaction Strategies
The right approach depends on document type, model risk, and the amount of context the LLM needs. A hospital revenue-cycle workflow may tolerate different tradeoffs than a mental health documentation pipeline. Use the table below to decide which strategy to apply for each class of field and use case.
| Strategy | How It Works | Best For | Pros | Tradeoffs |
|---|---|---|---|---|
| Removal | Deletes the field entirely | Raw identifiers, irrelevant headers | Strongest privacy posture | Can reduce model usefulness |
| Masking | Replaces values with placeholders | Names, account numbers, IDs | Preserves structure and context | May still reveal pattern length |
| Tokenization | Maps values to secure tokens | Entity tracking across steps | Supports workflow continuity | Requires secure token vault |
| Generalization | Converts to broad category or range | Dates, ages, locations | Maintains analytical value | Reduces precision |
| Selective extraction | Passes only approved fields | LLM summarization and routing | Lowest exposure surface | Requires robust schema design |
In production, teams often combine these methods. For example, a referral workflow might remove direct identifiers, tokenize patient case IDs, generalize dates to month-level, and selectively extract specialty and reason-for-referral. That mix is often better than applying a single blunt rule across the entire document set. It also aligns with the cost-control mindset seen in cloud-native AI budgeting, because less unnecessary context means fewer tokens and lower inference cost.
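That referral-workflow mix can be sketched as a single transform. The field names are illustrative, and the hash-based token is a simplified stand-in for a vault-backed one:

```python
# Combined strategies for a referral: remove direct identifiers, tokenize
# the case ID, generalize the date to month level, selectively extract
# clinical fields. Field names and the token scheme are illustrative.

import hashlib

def redact_referral(record: dict) -> dict:
    def token(value: str) -> str:
        return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:10]

    return {
        "case_token": token(record["case_id"]),       # tokenized
        "service_month": record["service_date"][:7],  # generalized (YYYY-MM)
        "specialty": record["specialty"],             # selectively extracted
        "reason": record["reason_for_referral"],      # selectively extracted
        # patient_name, address, dob: removed entirely
    }

safe = redact_referral({
    "case_id": "CASE-0042",
    "service_date": "2024-06-17",
    "specialty": "cardiology",
    "reason_for_referral": "abnormal stress test",
    "patient_name": "Jane Doe",
    "address": "12 Elm St",
})
```

Because the output dict is built explicitly rather than by filtering the input, adding a new sensitive field upstream cannot leak it into the payload.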
7. Compliance, Governance, and Auditability
Map your workflow to HIPAA-style safeguards
Even if you are not a covered entity, the discipline of HIPAA-style governance is still useful. Administrative safeguards define who can access what; technical safeguards define encryption, masking, and authentication; and audit safeguards define how you prove the system behaved correctly. A mature medical privacy workflow should be able to show where each document came from, what was extracted, what was redacted, which policy triggered, and which model received the final payload. This traceability is essential for internal audit and vendor review.
Document data lineage from scan to prompt
Your governance model should answer four questions: what entered the system, what transformations occurred, what left the system, and who approved those transformations. If you cannot answer those questions confidently, your redaction workflow is not ready for regulated deployment. Store immutable logs of rule versions, model versions, and policy changes so that you can reconstruct behavior for any incident or customer review. In heavily regulated environments, governance is not overhead; it is part of the product.
Use policy-as-code where possible
Policy-as-code lets teams version control redaction rules, review changes through pull requests, and deploy them through CI/CD pipelines. That is much safer than maintaining a hidden spreadsheet of exceptions or manually changing regex patterns in production. You can also test policy changes against a fixture set of documents before rollout, similar to how engineering teams validate security or release policies in other systems. For teams modernizing data workflows, our guide on building a trust-first AI adoption playbook is a useful organizational counterpart to the technical controls discussed here.
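In miniature, policy-as-code means the rules live as versioned data and a fixture check runs in CI before rollout. The rule format and fixtures below are illustrative:

```python
# Rules as versioned data plus a pre-deploy fixture check, the kind of
# gate a CI pipeline would run. Rule format and fixtures are illustrative.

POLICY = {
    "version": "2024.06.1",
    "rules": [
        {"field": "patient_name", "action": "mask"},
        {"field": "date_of_birth", "action": "generalize"},
        {"field": "mrn", "action": "tokenize"},
    ],
}

FIXTURES = [
    # (field, expected action) pairs checked before every rollout
    ("patient_name", "mask"),
    ("mrn", "tokenize"),
]

def policy_action(field: str) -> str:
    for rule in POLICY["rules"]:
        if rule["field"] == field:
            return rule["action"]
    return "remove"  # fail closed

def check_fixtures() -> bool:
    return all(policy_action(f) == expected for f, expected in FIXTURES)
```

Because the policy is plain data with a version field, a pull request that changes a rule is diffable and reviewable like any other code change.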
8. Benchmarking Your Redaction Pipeline
Measure redaction quality separately from OCR accuracy
OCR accuracy and redaction accuracy are related but not the same. Your OCR may extract text perfectly and still fail to identify a protected entity, or it may misread a character and incorrectly mask a harmless field. Track recall, precision, false-negative rate, and downstream utility metrics such as summarization completeness or routing accuracy. If you cannot measure both sides, you do not really know whether your pipeline is safe.
Test across real document variation
Use a broad corpus: scans, faxed pages, low-contrast forms, handwritten notes, multilingual documents, and mixed templates. A system that works on polished PDFs can fail badly on old insurance forms or crumpled intake pages. Create test cases that include edge conditions such as hyphenated names, partial dates, nested tables, and clinical abbreviations. Robust testing practices are especially important when organizations are also considering broader workflow modernization, like the secure temporary-file handling patterns in HIPAA-regulated temporary file workflows.
Track operational cost and latency
Redaction can increase processing time, so benchmark throughput as well as accuracy. If selective extraction can cut token count by 60 percent while maintaining task quality, that is a major efficiency win. Likewise, if a rule-based prefilter removes obvious identifiers before semantic analysis, you can often reduce compute costs and improve latency. For teams comparing deployment models, our discussion of future-proof security decisions is a reminder that technical design is always a balance between capability and lifecycle cost.
9. Recommended Architecture Patterns for Production
Pattern 1: Two-stage pipeline with prefilter and semantic redactor
In this pattern, Stage 1 uses rules and dictionaries to catch obvious PHI, while Stage 2 applies semantic entity recognition to ambiguous text. This is a strong default for most document automation systems because it balances speed, recall, and maintainability. Stage 1 removes the easy wins, and Stage 2 catches contextual items that rules would miss. The result is a safer AI-ready payload with less manual oversight.
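The two-stage pattern can be sketched end to end as follows. The keyword lookup in Stage 2 is a deliberate stand-in for a real NER model, and both rule sets are illustrative:

```python
# Two-stage redaction: Stage 1 is a rule prefilter for obvious PHI, Stage 2
# a semantic pass (stubbed here with a keyword map standing in for NER).

import re

RULE_PATTERNS = [(re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]")]
SEMANTIC_TERMS = {"jane doe": "[PATIENT_NAME]"}  # NER model stand-in

def stage1_prefilter(text: str) -> str:
    for pattern, placeholder in RULE_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def stage2_semantic(text: str) -> str:
    lowered = text.lower()
    for term, placeholder in SEMANTIC_TERMS.items():
        idx = lowered.find(term)
        if idx != -1:
            text = text[:idx] + placeholder + text[idx + len(term):]
            lowered = text.lower()
    return text

def redact(text: str) -> str:
    # Stage 1 removes the easy wins cheaply; Stage 2 catches context.
    return stage2_semantic(stage1_prefilter(text))

out = redact("Jane Doe, SSN 123-45-6789, seen for follow-up.")
```

Running the cheap rule stage first also shrinks the text the semantic stage must score, which helps both latency and cost.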
Pattern 2: Field-level extraction into structured JSON
When the downstream task is narrow, avoid full-text prompts entirely and generate structured outputs instead. For example, an intake assistant might extract insurance carrier, encounter type, referral urgency, and medication category, then discard the rest. This is the cleanest way to implement document anonymization because the model never sees most of the source document. It is also easier to validate and far more predictable in production.
Pattern 3: Human-in-the-loop review for exceptions only
This pattern uses automation for 90 to 99 percent of documents and sends only uncertain cases to reviewers. The reviewer sees the original document in a secure interface, corrects redaction or extraction, and those corrections feed back into rule tuning and model evaluation. Over time, this can produce a high-quality redaction corpus without requiring everyone to manually inspect every page. It is a practical compromise between speed and control.
10. Implementation Checklist Before You Send Anything to an LLM
Validate the security boundary
Confirm where the model runs, where outputs are stored, and whether prompts are retained. Check whether the vendor uses customer data for training, how long telemetry is kept, and whether any subcontractors can access logs. If the answer to any of those questions is unclear, stop and resolve it before launch. The privacy expectations around health data are too high to rely on assumptions.
Redact before prompt construction
Never build prompts from raw OCR output and hope a final filter will catch problems. Redaction should happen upstream, before the text is assembled for the model. That means your application should generate an AI-safe payload, inspect it, and only then hand it to the LLM. This simple sequencing change eliminates an entire class of accidental leaks.
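That sequencing constraint can be made structural: the prompt builder only ever accepts an already-redacted payload. The helper names and template below are illustrative:

```python
# Redaction happens before prompt assembly, never after. The prompt
# template and helper names here are illustrative stand-ins.

def redact_fields(record: dict, allowed: set) -> dict:
    return {k: v for k, v in record.items() if k in allowed}

def build_prompt(safe_payload: dict) -> str:
    # The builder sees only the AI-safe payload, never the raw record.
    return "Summarize this referral: " + str(sorted(safe_payload.items()))

raw = {"patient_name": "Jane Doe", "specialty": "cardiology", "urgency": "routine"}
prompt = build_prompt(redact_fields(raw, {"specialty", "urgency"}))
```

Typing the builder to accept only the sanitized payload object, rather than raw OCR text, turns the accidental-leak class into a compile-or-review-time error instead of a runtime hope.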
Run a data-loss prevention review
Review not just the prompt content but also attachments, embeddings, caches, and exports. PHI can leak through many side channels, especially when developers debug issues or export examples to support tickets. A real secure AI integration program treats every copy of the data as sensitive unless explicitly proven otherwise. That mindset is what separates a demo from a production-grade healthcare AI system.
Frequently Asked Questions
Can I send OCR text to an LLM if I remove the patient name?
Sometimes, but name removal alone is not enough. OCR text can still contain dates, addresses, medical record numbers, diagnosis details, or combinations of fields that identify a person. The safer approach is to classify the document and remove or mask all fields defined by your policy before the prompt is built.
Is masking better than deleting PHI?
It depends on the use case. Masking preserves structure and can help the model understand document layout, while deletion is safer for fields that are not needed. Many systems use both: delete raw identifiers, mask names, and generalize dates or locations.
What is the best way to handle handwritten notes?
Handwritten notes usually require lower confidence thresholds and more manual review. OCR errors are more common, so rules alone are not reliable enough. If the notes are clinically sensitive, consider extracting only approved fields rather than passing free text to the model.
Should I use an enterprise LLM or a local model?
Choose based on data sensitivity, required controls, and deployment constraints. Local or private models can reduce exposure, but they still need the same redaction, logging, and governance controls. Enterprise models can be appropriate if the vendor contract and architecture satisfy your compliance requirements.
How do I prove my redaction workflow is safe?
Use a combination of test corpora, adversarial examples, audit logs, and periodic manual sampling. Measure false negatives aggressively, keep versioned policies, and document the exact transformations from input scan to model prompt. Safety is demonstrated through evidence, not assumptions.
Conclusion: Build for Minimum Exposure, Maximum Utility
Sending documents to an LLM does not have to mean sending PHI. The strongest approach is to design a pipeline that extracts only non-sensitive fields or masks protected data before the model ever sees the text. That means combining layout-aware OCR, entity detection, policy-as-code, secure storage, and selective prompting into one coherent workflow. When done well, this architecture gives your team the speed of AI without giving up the privacy guarantees healthcare data demands.
If you are evaluating where AI fits into your document operations, compare your redaction strategy with your broader governance posture. Topics like state AI law compliance, DevOps anonymity, and privacy-first product design are not side issues; they are the operating environment for secure automation. With the right controls, LLMs can become useful assistants for healthcare workflows rather than risk multipliers.
Related Reading
- OpenAI launches ChatGPT Health to review your medical records - Background on why health data privacy is now central to AI product design.
- State AI Laws vs. Enterprise AI Rollouts: A Compliance Playbook for Dev Teams - A useful framework for governance-first deployment planning.
- Building a Secure Temporary File Workflow for HIPAA-Regulated Teams - Practical file-handling guidance for sensitive document pipelines.
- Staying Anonymous in the Digital Age: Strategies for DevOps Teams - Helpful patterns for minimizing data exposure in operations.
- How to Build a Trust-First AI Adoption Playbook That Employees Actually Use - Organizational guidance for rolling out AI safely.
Alex Morgan
Senior SEO Editor
Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.