AI systems, from the trenches.
Technical posts on problems I have solved in production — LLMs, RAG, compliance, voice AI, and cost engineering.
How I Cut $18K/Month in OpenAI API Costs with a Self-Hosted LLM
A real-world walkthrough of replacing a production OpenAI API dependency with a self-hosted open-weight model — architecture, model selection, migration strategy, and the cost math.
Building a Production RAG Pipeline: Chunking, Hybrid Search, and Re-Ranking That Actually Works
Why naive RAG fails in production, and how to build a retrieval pipeline that holds up — covering chunking strategy, hybrid dense+sparse search, cross-encoder re-ranking, and citation.
Deploying HIPAA-Compliant AI: What an Air-Gapped LLM Architecture Actually Looks Like
A practical guide to deploying LLMs in HIPAA-regulated environments — data residency requirements, air-gapped architecture, technology stack, and the compliance checklist that passed a real audit.
Sub-2-Second Voice AI: How to Build a Real-Time STT → LLM → TTS Pipeline
A technical deep-dive into building a voice AI pipeline with end-to-end latency under 2 seconds — covering streaming Whisper, LLM token streaming, low-latency TTS, and WebSocket architecture.
Why LLMs Hallucinate in Production and What You Can Actually Do About It
A practical guide to understanding, detecting, and reducing LLM hallucinations in production — covering root causes, output validation, grounding strategies, and the architectural patterns that hold up under real workloads.
Choosing the Right Infrastructure for Your LLM in Production: A Decision Framework
How to pick the right deployment model for your LLM workload — cloud API vs managed inference vs self-hosted, GPU selection, serving framework comparison, and the cost model that actually tells you which path wins.
Multi-Agent AI Systems: When They Work, When They Don't, and How to Architect Them
A practical guide to multi-agent AI architecture — orchestrator-worker patterns, LangGraph vs CrewAI, the failure modes that kill production agents, and the evaluation framework that catches them before launch.
Fine-Tuning vs RAG vs Prompt Engineering: How to Pick the Right Approach for Your Use Case
A decision framework for choosing between prompt engineering, retrieval-augmented generation, and fine-tuning — covering the tradeoffs, cost, complexity, and the common mistakes that lead teams to pick the wrong one.
Qdrant vs Pinecone vs pgvector vs Weaviate: Choosing the Right Vector Database for Production
A head-to-head comparison of the leading vector databases for production RAG and semantic search — covering performance, cost, operational overhead, hybrid search support, and which one fits which use case.
How to Evaluate Your LLM Application Before It Reaches Users: A Practical Testing Framework
A structured approach to evaluating LLM applications before launch — golden datasets, LLM-as-judge, regression testing, and the metrics that actually predict whether your system will hold up in production.
Context Window Management: How to Handle Long Documents Without Losing What Matters
Practical techniques for working with documents that exceed LLM context limits — map-reduce summarization, sliding windows, hierarchical chunking, and when to use each approach in production.
Building Reliable Tool-Calling Agents: Avoiding the Pitfalls of LLM Function Calling in Production
The failure modes that make LLM tool-calling brittle in production, and the engineering patterns — input validation, retry logic, tool design, and output verification — that make it reliable.
Prompt Injection and LLM Security: How to Protect Your AI Application from Attacks
A practical guide to the security threats unique to LLM applications — prompt injection, jailbreaking, indirect injection via retrieved content, and the defensive patterns that actually work in production.
Implementing LLM Streaming in Production Web Apps: SSE, WebSockets, and the Edge Cases That Break Everything
A complete guide to streaming LLM responses to users in real time — choosing between Server-Sent Events and WebSockets, handling backpressure and disconnections, and the edge cases that only appear under production load.
Getting Reliable Structured Output from LLMs: JSON Mode, Pydantic, and the Patterns That Hold Up
How to reliably extract structured data from LLM responses — JSON mode, constrained generation, the Instructor library, retry logic, and schema design principles that reduce parse failures in production.
Architecting a Multi-Tenant AI SaaS: Isolation, Cost Attribution, Caching, and Observability
The full-stack architecture decisions that separate a production AI SaaS from a demo — per-user cost tracking, prompt caching, rate limiting, tenant isolation, and the observability layer that tells you when something is wrong.
Building Memory for AI Agents: Short-Term Context, Long-Term Storage, and Episodic Recall
How to give AI agents meaningful memory across conversations and sessions — in-context short-term memory, vector-backed long-term storage, episodic recall patterns, and the architectures that hold up in production.
Choosing and Optimizing Embedding Models for Production RAG: Beyond the Default
How to select the right embedding model for your RAG use case, evaluate retrieval quality, understand dimensionality tradeoffs, and when fine-tuning embeddings on domain data is worth the effort.
PDF and Document Parsing for AI Pipelines: Extracting Clean Text from Messy Real-World Files
The practical guide to getting clean, usable text out of real-world PDFs, scanned documents, and mixed-format files for AI pipelines — covering parser selection, OCR, table extraction, and the preprocessing steps that determine RAG quality.
LLM Cost Optimization Beyond Self-Hosting: Batching, Caching, Model Routing, and Prompt Compression
The cost optimization techniques that work whether you are on cloud APIs or self-hosted — async batching, semantic caching, intelligent model routing, prompt compression, and the monitoring layer that tells you where the money is going.
Working on an AI project? Let's talk.