What HIPAA Actually Requires for AI Systems
HIPAA does not prohibit AI in healthcare. It does require specific controls around any system that processes or stores PHI. For an LLM deployment, the relevant requirements are:
- PHI never leaves your covered entity boundary without a signed Business Associate Agreement (BAA); many AI providers' standard API terms do not include one by default
- All access to PHI must be logged with user identity, timestamp, and the data accessed
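- Encryption at rest and in transit. The Security Rule treats encryption as an addressable safeguard and does not name an algorithm; AES-256 at rest and TLS 1.2+ in transit are the common baseline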
- Access controls limiting which staff can query which patient data
- Breach notification procedures if PHI is exposed
The key implication: if you are sending clinical notes to OpenAI, Anthropic, or any cloud AI provider's standard API without a BAA, you are likely in violation. Some providers offer HIPAA-eligible tiers with BAAs — but for many healthcare organizations, the risk appetite for any third-party data processing is zero.
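One way to make the boundary requirement concrete in application code is a guard that refuses to send PHI to any endpoint outside the internal network. The following is a minimal sketch, not a substitute for firewall rules: the function names and the `INTERNAL_HOSTNAMES` allowlist are hypothetical, and it assumes the inference endpoint is reachable at a private IP address or an explicitly allowlisted internal hostname.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of internal hostnames; in practice this would come
# from your own network inventory.
INTERNAL_HOSTNAMES = {"llm.internal.example", "localhost"}


def is_inside_boundary(base_url: str) -> bool:
    """Return True only for private/loopback IPs or allowlisted hostnames."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: fall back to the explicit hostname allowlist.
        return host.lower() in INTERNAL_HOSTNAMES
    return addr.is_private or addr.is_loopback


def require_internal_endpoint(base_url: str) -> str:
    """Raise before any PHI is sent if the endpoint is outside the boundary."""
    if not is_inside_boundary(base_url):
        raise ValueError(f"Refusing to send PHI to external endpoint: {base_url}")
    return base_url
```

Calling `require_internal_endpoint("https://api.openai.com/v1")` raises `ValueError`, while an address such as `https://10.0.12.5:8000/v1` passes. A check like this only catches misconfiguration in your own client code; network-level controls still have to enforce the boundary.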
The Air-Gapped Architecture
An air-gapped LLM deployment means the model runs on hardware you control, in your facility or private cloud, with no outbound network calls to inference APIs during operation. Here is the architecture we deployed for a clinical documentation system:
- On-premises server: 2x NVIDIA RTX 4090 (or 1x A40 for production-grade deployments). Either option provides 48 GB of VRAM, enough for a 4-bit quantized 70B model
- vLLM: Serving engine — runs the LLM locally, exposes an OpenAI-compatible API on the local network only
- No internet egress: Firewall rules block all outbound traffic from the inference server; model weights are copied onto the server once during setup
- Audit logger: Every request and response is logged to an on-prem database with user identity and timestamp, which supports the HIPAA audit control and access logging requirements
- TLS termination: NGINX proxy handles TLS on the internal network; all PHI encrypted in transit even within the facility
- Role-based access: Staff authenticate against Active Directory; the proxy enforces which endpoints each role can access
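The role check and audit logging described above can be sketched at the application layer as follows. This is an illustration under stated assumptions, not the deployed implementation: the role names, the `/v1/summarize` endpoint, the table schema, and the `audited_call` helper are all hypothetical, and SQLite stands in for whatever on-prem database you use. The `run_inference` argument is any callable that forwards the request to the local vLLM server.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical role-to-endpoint policy; real role names would come from
# Active Directory group membership.
ROLE_ENDPOINTS = {
    "nurse": {"/v1/summarize"},
    "physician": {"/v1/summarize", "/v1/chat/completions"},
}


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the audit table if needed. A real path should sit on an encrypted volume."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts_utc TEXT NOT NULL,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               endpoint TEXT NOT NULL,
               patient_id TEXT,
               allowed INTEGER NOT NULL,
               request_body TEXT,
               response_body TEXT
           )"""
    )
    return conn


def is_allowed(role: str, endpoint: str) -> bool:
    """Unknown roles get an empty permission set, so they are denied."""
    return endpoint in ROLE_ENDPOINTS.get(role, set())


def audited_call(conn, user_id, role, endpoint, patient_id, request_body, run_inference):
    """Check the role policy, run the call if permitted, and log either outcome."""
    allowed = is_allowed(role, endpoint)
    response_body = run_inference(request_body) if allowed else None
    conn.execute(
        "INSERT INTO audit_log (ts_utc, user_id, role, endpoint, patient_id,"
        " allowed, request_body, response_body) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user_id, role, endpoint,
         patient_id, int(allowed), request_body, response_body),
    )
    conn.commit()
    if not allowed:
        raise PermissionError(f"Role {role!r} may not call {endpoint}")
    return response_body
```

Denied attempts are written to the log as well, since an access log that only records successful calls hides exactly the events a reviewer wants to see. The sketch does not log calls where `run_inference` itself raises; a production version would need to handle that case.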
Model Selection for Healthcare
Not every open-weight model is suitable for clinical use. The factors that matter:
- Instruction following accuracy: Clinical staff will phrase queries in many ways — the model must be robust to informal language and medical abbreviations
- Hallucination rate: In healthcare, a confident wrong answer is dangerous. We prioritized models with lower error rates on medical question-answering benchmarks (MedQA, MedMCQA) over general-purpose leaderboard scores
- Context window: Clinical notes and discharge summaries can be long. A minimum 32K context window is required; 128K is preferable
- License: Must allow commercial deployment and not require sharing fine-tune weights
For the deployment we built, Llama 3.1 70B (instruction-tuned) passed all criteria. We applied 4-bit GPTQ quantization to reduce the VRAM requirement to fit within the hardware budget without meaningful accuracy degradation on the medical task suite we tested against.
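A back-of-the-envelope estimate shows why 4-bit quantization matters for this hardware budget. The sketch below is an approximation under stated assumptions: roughly 70 billion parameters, the published Llama 3.1 70B shape (80 layers, 8 key/value heads, head dimension 128), and an FP16 KV cache. It ignores activations, quantization metadata, and runtime overhead, so treat the result as a lower bound.

```python
GIB = 1024 ** 3

# Assumed figures: ~70e9 parameters, and the published Llama 3.1 70B shape
# (80 layers, 8 key/value heads via grouped-query attention, head dim 128).
N_PARAMS = 70e9
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weights only; ignores quantization scales and zero-points."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor of 2) for every layer and KV head, FP16 by default."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value * n_tokens


weights_gib = weight_bytes(N_PARAMS, 4) / GIB
kv_gib = kv_cache_bytes(32 * 1024) / GIB
print(f"4-bit weights: {weights_gib:.1f} GiB")
print(f"KV cache at 32K tokens: {kv_gib:.1f} GiB")
print(f"total: {weights_gib + kv_gib:.1f} GiB of 48 GiB")
```

Under these assumptions the weights take about 32.6 GiB and a single 32K-token sequence adds 10 GiB of KV cache, for roughly 42.6 GiB against the 48 GB available on two RTX 4090s or one A40. The same formula gives about 40 GiB of KV cache for one 128K-token sequence, so the longer context would not fit on this hardware without a smaller KV cache format or more memory. At 16 bits per weight the weights alone would need about 130 GiB.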
The Compliance Checklist That Passed the Audit
1. Written system design document describing data flows, storage locations, and network topology, provided to the compliance officer
2. PHI data flow diagram showing that no PHI leaves the on-premises network boundary
3. Encryption attestation: AES-256 at rest (full-disk encryption on inference server), TLS 1.3 in transit
4. Access log schema and retention policy: minimum 6 years, in line with HIPAA's six-year documentation retention requirement
5. User access control policy: role assignments, access review cadence, offboarding procedure
6. Incident response plan: steps to take if a breach is detected, notification timelines
7. Staff training documentation: what the AI system does, what PHI it can access, how to report issues
8. Vendor assessment: only open-source or open-weight components that can be audited; no third-party SaaS in the data path
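The retention policy in the checklist is easy to get subtly wrong in code, for example by deleting logs after 6 × 365 days and ignoring leap years. A minimal sketch of a calendar-based check follows; the function names are hypothetical, and the six-year figure is taken from the checklist, so confirm it against your own policy and any stricter state rules.

```python
from datetime import date

RETENTION_YEARS = 6  # from the checklist above; confirm against your own policy


def retention_expiry(logged_on: date, years: int = RETENTION_YEARS) -> date:
    """Date on which a log entry first becomes eligible for deletion."""
    try:
        return logged_on.replace(year=logged_on.year + years)
    except ValueError:
        # Feb 29 in a non-leap target year: roll forward to Mar 1.
        return date(logged_on.year + years, 3, 1)


def may_purge(logged_on: date, today: date) -> bool:
    """True once the full retention period has elapsed."""
    return today >= retention_expiry(logged_on)
```

Rolling a February 29 entry forward to March 1 errs on the side of keeping the record slightly longer rather than deleting it a day early.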
Operational Considerations
Deploying on-premises means owning the operational overhead that cloud providers normally absorb. For healthcare organizations considering this path:
- Model updates: Plan a quarterly model evaluation cycle. You need a process for testing a new model version before promotion to production.
- Hardware maintenance: GPU servers require more attention than cloud instances. Work with your IT team to define SLAs and failure procedures.
- Backup and recovery: The model weights can be re-downloaded, but the audit logs and configuration must be backed up and recoverable.
- Monitoring: Set up alerting for server health, GPU utilization, and inference latency. Grafana + Prometheus works well for this.
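The quarterly evaluation cycle needs an explicit promotion rule, otherwise "the new model looks better" becomes a judgment call. One possible gate is sketched below. The `EvalResult` fields, the thresholds, and the `should_promote` function are hypothetical and would be replaced by whatever metrics your own medical task suite produces.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model: str
    accuracy: float        # fraction correct on the internal medical task suite
    p95_latency_s: float   # 95th-percentile inference latency in seconds


def should_promote(candidate: EvalResult, production: EvalResult,
                   max_accuracy_drop: float = 0.0,
                   max_latency_ratio: float = 1.2) -> bool:
    """Promote only if accuracy does not regress and latency stays within budget."""
    accuracy_ok = candidate.accuracy >= production.accuracy - max_accuracy_drop
    latency_ok = candidate.p95_latency_s <= production.p95_latency_s * max_latency_ratio
    return accuracy_ok and latency_ok
```

With the default thresholds, a candidate that matches or beats production accuracy and stays within 20% of its p95 latency is promoted; anything else stays in evaluation. Recording the inputs and the decision alongside the audit logs gives the compliance officer a trail for each model change.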
The system we deployed passed its compliance review on the first submission. In the clinical setting, it reduced documentation time for nurses by an average of 22 minutes per shift — the outcome the organization was looking for. The compliance overhead paid for itself quickly.