Projects | ree2raz

Structural guardrails for production voice agents: no hallucination, no skipped steps, no undefined states

Voice agents fail in three predictable ways: LLMs hallucinate facts, skip required steps once prompts grow long, and break into undefined states on interruption. The fix here is 1,500 lines of TypeScript with no SDK abstractions: a deterministic 11-state FSM with six guardrail layers stacked on top. The architectural primitive transfers directly to FDCPA-regulated voice workflows.

TypeScript WebSocket Deepgram OpenAI Realtime State Machine

GitHub Live Demo
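
The idea of a deterministic FSM that cannot skip steps or reach an undefined state can be sketched as an explicit transition table. This is a minimal illustration in Python, not the project's TypeScript code: the four state names and the event names are hypothetical stand-ins for the real 11 states, and the guardrail layers are not modeled.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical subset of states, for illustration only.
    GREETING = auto()
    VERIFY_IDENTITY = auto()
    DISCLOSURE = auto()
    CLOSING = auto()


# Explicit transition table: any (state, event) pair not listed is rejected.
TRANSITIONS: dict[tuple[State, str], State] = {
    (State.GREETING, "greeted"): State.VERIFY_IDENTITY,
    (State.VERIFY_IDENTITY, "verified"): State.DISCLOSURE,
    (State.DISCLOSURE, "disclosed"): State.CLOSING,
}


class CallFlow:
    def __init__(self) -> None:
        self.state = State.GREETING

    def handle(self, event: str) -> State:
        """Advance only along a listed transition; otherwise stay put."""
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            # Interruptions and out-of-order events never create a new state.
            return self.state
        self.state = nxt
        return self.state
```

Because the model only ever proposes events and the table decides what happens, a "disclosed" event during GREETING is ignored rather than jumping ahead.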

auditguard-mcp

Compliance-aware MCP server with structured audit logging

Every LLM tool call passes through a seven-step pipeline: RBAC, inbound PII scan, role-specific policy, bounded execution, outbound scan, outbound policy, structured audit log. Six policy actions. Two bundled philosophies (permissive analyst, strict financial). 15-case golden-set eval, 100% pass.

Python MCP Pydantic FastAPI

GitHub Live Demo Blog post
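
The seven steps above can be sketched as one wrapper around a tool function. This is a rough sketch under stated assumptions, not the server's actual code: the role table, the SSN regex, the decision strings, and the 200-character output cap (standing in for "bounded execution") are all hypothetical, and only a few illustrative policy outcomes are shown rather than the six real actions.

```python
import json
import re
from typing import Callable

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative PII pattern only
ROLE_TOOLS = {"analyst": {"lookup"}, "auditor": {"lookup", "export"}}  # hypothetical


def scan(text: str) -> bool:
    """Return True when the text contains the illustrative PII pattern."""
    return bool(SSN.search(text))


def run_tool_call(role: str, tool: str, arg: str,
                  fn: Callable[[str], str], log: list[str]) -> str:
    record = {"role": role, "tool": tool, "decision": "allow"}
    try:
        # Step 1: RBAC check against the role's allowed tools.
        if tool not in ROLE_TOOLS.get(role, set()):
            record["decision"] = "deny_rbac"
            return "denied"
        # Steps 2-3: inbound PII scan, then a policy (here: always deny).
        if scan(arg):
            record["decision"] = "deny_inbound_pii"
            return "denied"
        # Step 4: bounded execution, modeled here as an output size cap.
        result = fn(arg)[:200]
        # Steps 5-6: outbound scan, then a policy (here: redact).
        if scan(result):
            record["decision"] = "redact_outbound"
            result = SSN.sub("[REDACTED]", result)
        return result
    finally:
        # Step 7: one structured audit record per call, on every path.
        log.append(json.dumps(record, sort_keys=True))
```

Writing the audit record in a `finally` block means denied calls and failing tools still leave a log entry.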

Scrutiny

FDCPA/Reg F call transcript audit in 60 seconds

Paste a redacted collections call transcript. Get a structured compliance report against 12 FDCPA/Reg F rules with verbatim evidence quotes, statutory citations, and an autofail violation summary. Dual-path evaluator: one LLM call for semantic rules, deterministic Python for metadata rules.

Python FastAPI Pydantic LLM

GitHub Live Demo Blog post
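
A dual-path evaluator of this kind might look like the sketch below. It is an assumption-laden illustration, not Scrutiny's code: the `Finding` shape, the rule id, and the injected `semantic_eval` callable (standing in for the single LLM call) are hypothetical. The one metadata rule shown uses the FDCPA presumption that calls before 8 a.m. or after 9 p.m. consumer-local time are inconvenient.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable


@dataclass
class Finding:
    rule_id: str      # hypothetical identifier
    violated: bool
    evidence: str


def check_call_time(local_start: time) -> Finding:
    """Deterministic metadata rule: flag calls outside 08:00-21:00 local time."""
    ok = time(8, 0) <= local_start <= time(21, 0)
    return Finding("call_time_window", not ok,
                   f"call started {local_start.isoformat()}")


def evaluate(transcript: str, local_start: time,
             semantic_eval: Callable[[str], list[Finding]]) -> list[Finding]:
    # Path 1: a single call covering all semantic rules (supplied by the caller).
    findings = list(semantic_eval(transcript))
    # Path 2: metadata rules are plain code and never touch the model.
    findings.append(check_call_time(local_start))
    return findings
```

Keeping metadata rules out of the prompt makes them cheap, reproducible, and unit-testable, while the model is reserved for rules that need language understanding.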

RegTriage

RL environment for regulatory compliance auditing (baselines published, training loop in progress)

Trains RL agents to audit financial services call transcripts for CFPB, TCPA, and GDPR/CCPA violations. Solves the 100% Coverage Problem: human QA reviews 1-3% of calls, and RegTriage covers the remaining 97-99% with Draft Incident Reports for human sign-off.

Python FastAPI OpenEnv Docker Pydantic RL

GitHub HuggingFace Space Blog post

FDCPA Rule Classifier

When fine-tuning small models is (and isn't) worth it for compliance classification

QLoRA fine-tune of Qwen2.5-3B-Instruct for FDCPA rule classification. Three-way eval: o3-mini (ceiling) vs base Qwen (floor) vs QLoRA (fine-tuned). All 6 errors are false negatives from keyword-level pattern matching, not legal reasoning. The pre-filter pattern: small model handles easy cases, API handles the rest.

Python PEFT QLoRA Qwen2.5

GitHub Blog post
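
One way to read the pre-filter pattern is as confidence-based routing: the small model answers when it is sure, and everything else goes to the API model. The sketch below assumes that reading. Both model callables and the 0.9 threshold are hypothetical placeholders, not values from the project.

```python
from typing import Callable

Label = str


def classify(text: str,
             small_model: Callable[[str], tuple[Label, float]],
             api_model: Callable[[str], Label],
             threshold: float = 0.9) -> tuple[Label, str]:
    """Return (label, route). The threshold is an illustrative assumption."""
    label, confidence = small_model(text)
    if confidence >= threshold:
        # Easy case: the fine-tuned small model is confident enough to answer.
        return label, "small"
    # Hard case: defer to the larger API model.
    return api_model(text), "api"
```

The returned route makes it easy to measure what fraction of traffic the small model absorbs, which is what decides whether the fine-tune pays for itself.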

LLM Deploy Cost Calculator

Architecture-aware GPU sizing, cost comparison, and break-even analysis

Single-page calculator that answers every CTO's first three infrastructure questions: What GPU do we need? What does it cost monthly? When does self-hosted beat API? Architecture-aware VRAM calculation for 30+ models accounting for GQA, MLA, and MoE. Interactive SVG break-even chart. Zero build step.

HTML React Tailwind CSS SVG

GitHub Live Demo Blog post
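
To show why architecture matters for VRAM sizing, here is a small sketch of the standard KV-cache size formula, where grouped-query attention (GQA) shrinks the cache by using fewer key/value heads. The model dimensions and prices below are illustrative assumptions, not figures from the calculator, and MLA and MoE are not covered. The break-even function is the simplest possible version: it ignores utilization, ops cost, and throughput limits.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def break_even_tokens(gpu_cost_per_month: float, api_cost_per_million: float) -> float:
    """Monthly token volume above which a fixed GPU bill beats per-token pricing."""
    return gpu_cost_per_month / api_cost_per_million * 1_000_000


GIB = 1024 ** 3
# Same hypothetical 32-layer model at fp16, with and without GQA.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
print(gqa / GIB, mha / GIB)  # 1.0 4.0
```

With 8 KV heads instead of 32, the cache for an 8,192-token sequence drops from 4 GiB to 1 GiB, which can change which GPU tier a deployment needs.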

rubric-grader-eval

Reference pattern for compiling rubrics into evaluable schemas

Compiles unstructured rubrics into machine-readable schemas, then evaluates documents against them with golden-set ground truth. Handles three real-world variance cases: clean CSV, boolean composites in comment cells, PDF exports masquerading as spreadsheets. Per-category precision/recall/F1.

Python Pydantic Pandas

GitHub Blog post
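
Per-category precision, recall, and F1 against a golden set can be computed as in the sketch below. The input shape (a list of `(category, expected)` pairs aligned with a list of boolean predictions) is an assumption made for illustration, not the repository's schema.

```python
from collections import defaultdict


def per_category_prf(golden: list[tuple[str, bool]],
                     predicted: list[bool]) -> dict[str, dict[str, float]]:
    """golden holds (category, expected) pairs aligned with predicted."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for (category, expected), got in zip(golden, predicted):
        c = counts[category]  # touch the entry so all-negative categories appear
        if got and expected:
            c["tp"] += 1
        elif got and not expected:
            c["fp"] += 1
        elif expected and not got:
            c["fn"] += 1
    report = {}
    for category, c in counts.items():
        # Guard each ratio so an empty denominator yields 0.0 instead of an error.
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        report[category] = {"precision": p, "recall": r, "f1": f1}
    return report
```

Reporting per category rather than one aggregate score shows which rubric sections the compiled schema handles badly.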