Writing

AI systems, from the trenches.

Technical posts on problems I have solved in production — LLMs, RAG, compliance, voice AI, and cost engineering.

April 15, 2026 · 9 min read

How I Cut $18K/Month in OpenAI API Costs with a Self-Hosted LLM

A real-world walkthrough of replacing a production OpenAI API dependency with a self-hosted open-weight model — architecture, model selection, migration strategy, and the cost math.

RAG · Document Intelligence

March 20, 2026 · 11 min read

Building a Production RAG Pipeline: Chunking, Hybrid Search, and Re-Ranking That Actually Works

Why naive RAG fails in production, and how to build a retrieval pipeline that holds up — covering chunking strategy, hybrid dense+sparse search, cross-encoder re-ranking, and citation.

Healthcare · HIPAA

February 10, 2026 · 10 min read

Deploying HIPAA-Compliant AI: What an Air-Gapped LLM Architecture Actually Looks Like

A practical guide to deploying LLMs in HIPAA-regulated environments — data residency requirements, air-gapped architecture, technology stack, and the compliance checklist that passed a real audit.

Voice AI

May 1, 2026 · 10 min read

Sub-2-Second Voice AI: How to Build a Real-Time STT → LLM → TTS Pipeline

A technical deep-dive into building a voice AI pipeline with end-to-end latency under 2 seconds — covering streaming Whisper, LLM token streaming, low-latency TTS, and WebSocket architecture.

Reliability · LLMs

May 10, 2026 · 10 min read

Why LLMs Hallucinate in Production and What You Can Actually Do About It

A practical guide to understanding, detecting, and reducing LLM hallucinations in production — covering root causes, output validation, grounding strategies, and the architectural patterns that hold up under real workloads.

Infrastructure · MLOps

April 28, 2026 · 11 min read

Choosing the Right Infrastructure for Your LLM in Production: A Decision Framework

How to pick the right deployment model for your LLM workload — cloud API vs managed inference vs self-hosted, GPU selection, serving framework comparison, and the cost model that actually tells you which path wins.

AI Agents

April 5, 2026 · 12 min read

Multi-Agent AI Systems: When They Work, When They Don't, and How to Architect Them

A practical guide to multi-agent AI architecture — orchestrator-worker patterns, LangGraph vs CrewAI, the failure modes that kill production agents, and the evaluation framework that catches them before launch.

LLM Strategy

March 5, 2026 · 9 min read

Fine-Tuning vs RAG vs Prompt Engineering: How to Pick the Right Approach for Your Use Case

A decision framework for choosing between prompt engineering, retrieval-augmented generation, and fine-tuning — covering the tradeoffs, cost, complexity, and the common mistakes that lead teams to pick the wrong one.

Vector Databases · RAG

May 18, 2026 · 10 min read

Qdrant vs Pinecone vs pgvector vs Weaviate: Choosing the Right Vector Database for Production

A head-to-head comparison of the leading vector databases for production RAG and semantic search — covering performance, cost, operational overhead, hybrid search support, and which one fits which use case.

LLM Evaluation · Testing

May 22, 2026 · 10 min read

How to Evaluate Your LLM Application Before It Reaches Users: A Practical Testing Framework

A structured approach to evaluating LLM applications before launch — golden datasets, LLM-as-judge, regression testing, and the metrics that actually predict whether your system will hold up in production.

LLM Engineering

April 20, 2026 · 9 min read

Context Window Management: How to Handle Long Documents Without Losing What Matters

Practical techniques for working with documents that exceed LLM context limits — map-reduce summarization, sliding windows, hierarchical chunking, and when to use each approach in production.

AI Agents · Tool Use

March 15, 2026 · 9 min read

Building Reliable Tool-Calling Agents: Avoiding the Pitfalls of LLM Function Calling in Production

The failure modes that make LLM tool-calling brittle in production, and the engineering patterns — input validation, retry logic, tool design, and output verification — that make it reliable.

Security · LLMs

May 25, 2026 · 10 min read

Prompt Injection and LLM Security: How to Protect Your AI Application from Attacks

A practical guide to the security threats unique to LLM applications — prompt injection, jailbreaking, indirect injection via retrieved content, and the defensive patterns that actually work in production.

LLM Engineering · Web

May 5, 2026 · 9 min read

Implementing LLM Streaming in Production Web Apps: SSE, WebSockets, and the Edge Cases That Break Everything

A complete guide to streaming LLM responses to users in real time — choosing between Server-Sent Events and WebSockets, handling backpressure and disconnections, and the edge cases that only appear under production load.

LLM Engineering

April 10, 2026 · 8 min read

Getting Reliable Structured Output from LLMs: JSON Mode, Pydantic, and the Patterns That Hold Up

How to reliably extract structured data from LLM responses — JSON mode, constrained generation, the Instructor library, retry logic, and schema design principles that reduce parse failures in production.

SaaS · Architecture

March 28, 2026 · 11 min read

Architecting a Multi-Tenant AI SaaS: Isolation, Cost Attribution, Caching, and Observability

The full-stack architecture decisions that separate a production AI SaaS from a demo — per-user cost tracking, prompt caching, rate limiting, tenant isolation, and the observability layer that tells you when something is wrong.

AI Agents · Memory

May 26, 2026 · 10 min read

Building Memory for AI Agents: Short-Term Context, Long-Term Storage, and Episodic Recall

How to give AI agents meaningful memory across conversations and sessions — in-context short-term memory, vector-backed long-term storage, episodic recall patterns, and the architectures that hold up in production.

RAG · Embeddings

May 12, 2026 · 9 min read

Choosing and Optimizing Embedding Models for Production RAG: Beyond the Default

How to select the right embedding model for your RAG use case, evaluate retrieval quality, understand dimensionality tradeoffs, and when fine-tuning embeddings on domain data is worth the effort.

Document Intelligence

April 22, 2026 · 9 min read

PDF and Document Parsing for AI Pipelines: Extracting Clean Text from Messy Real-World Files

The practical guide to getting clean, usable text out of real-world PDFs, scanned documents, and mixed-format files for AI pipelines — covering parser selection, OCR, table extraction, and the preprocessing steps that determine RAG quality.

Cost Optimization · LLMs

March 10, 2026 · 9 min read

LLM Cost Optimization Beyond Self-Hosting: Batching, Caching, Model Routing, and Prompt Compression

The cost optimization techniques that work whether you are on cloud APIs or self-hosted — async batching, semantic caching, intelligent model routing, prompt compression, and the monitoring layer that tells you where the money is going.
