Designing LLM-based rubric graders for high-stakes compliance

A regulated contact center QA rubric had 25-50 rules. Eight of them required the agent to complete a sequence of verification steps before proceeding. Those steps could land anywhere in a fifteen-minute call, scattered between customer interruptions and clarifying tangents. Phrase-matching covered 60-75% of that rubric. The remaining 25-40% was the hard part, and the hard part was also where the regulatory exposure lived. A well-designed LLM rubric grader closes the gap to 90%+. That delta is where multi-turn checklists, Boolean composites, and behavior-based rules live: the parts that require understanding context, sequence, and intent across a fifteen-minute conversation.

I have spent significant time with this pattern. The post below is not a product teardown. It is an analysis of what works, what breaks, and what I would do differently if I were building a compliance-grade rubric grader today.

Key insight: The gap between 60-75% and 90%+ rubric coverage is not a model problem. It is a representation problem. Rules live in spreadsheets, PDFs, and SME heads. No structured format. No API. Just documents. The hard work is getting the rubric into a machine-readable schema that captures composite logic, not just keyword matching.


The gap beyond phrase matching

Consider a typical regulated contact center QA rubric. It has three layers: category, rule, and outcome (pass, fail, or auto-fail, with points attached). For example: a category named "Disclosure and Compliance," a rule inside it requiring a sequence of verification steps before proceeding to the next phase of the call, and an auto-fail rule that triggers if the agent promised a specific rate without supervisor approval. Auto-fail rules map directly to regulatory exposure. A missed auto-fail is not a points deduction. It is the kind of violation that triggers regulatory investigation or enforcement action.

The gap beyond phrase matching has specific shapes. Multi-turn checklist rules require the agent to ask a sequence of questions across the conversation, not in a single sentence. Boolean composite rules require two or more conditions to be true simultaneously; the old system tracked each condition as a separate rule and had no mechanism for composite conditions. Behavior-based rules require inference over the whole call. Phrase-matching cannot evaluate empathy. It cannot confirm that identity was verified before policy details were discussed. It cannot check that question A preceded answer B across twelve transcript turns. That gap is the reason QA teams still spend most of their time listening to calls manually.

At this stage, the gap is a representation problem, not a model problem: the rules exist only in spreadsheets, PDFs, and the heads of SMEs, with no structured format and no API behind them.


Stage 1: compiling rubrics

The first problem is getting the rubric into the machine. A buyer uploads a spreadsheet. Then another one. Then a PDF that is a scan of a printed table. Each has different column names, merged cells, nested categories, and rules split across rows. There is no standard.

A one-time onboarding pipeline solves this. The buyer uploads the file in whatever shape it is in. An LLM reads it and extracts rules into a fixed schema. The SME reviews the extraction, corrects errors, and locks it. Once locked, the rubric is immutable for that contract period.

The compiled schema needs to capture rule identity, pass/fail criteria, severity semantics (auto-fail versus point deduction), and, critically, composite logic. Most rule systems get the first two right and fail on the third. A checklist rule must be expanded into distinct sub-rules while preserving grouping. A Boolean rule must be emitted as a single composite with sub-conditions. Evaluating each sub-condition independently would double-count failures and produce invalid scores. The compiler outputs validated Pydantic models. Invalid extractions are caught at the schema boundary, not downstream in the evaluator.
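To make the shape concrete, here is a minimal sketch of what such a compiled schema could look like in Pydantic. The model names (`Rubric`, `Rule`, `SubCondition`, `Severity`) and fields are illustrative, not the product's actual types; the point is that composite logic is first-class rather than bolted on.

```python
from enum import Enum
from typing import Literal, Optional
from pydantic import BaseModel, Field

class Severity(str, Enum):
    AUTO_FAIL = "auto_fail"            # regulatory exposure: failing this fails the call
    POINT_DEDUCTION = "point_deduction"

class SubCondition(BaseModel):
    """One leg of a Boolean composite, or one item of a multi-turn checklist."""
    condition_id: str
    criteria: str                      # natural-language pass/fail criteria for the LLM

class Rule(BaseModel):
    rule_id: str
    category: str                      # e.g. "Disclosure and Compliance"
    criteria: str
    severity: Severity
    points: int = 0
    # Composite logic: None for simple rules. "all" = Boolean AND composite,
    # "ordered" = multi-turn checklist whose items must appear in sequence.
    composite: Optional[Literal["all", "ordered"]] = None
    sub_conditions: list[SubCondition] = Field(default_factory=list)

class Rubric(BaseModel):
    rubric_id: str
    locked: bool = False               # once the SME signs off, the rubric is immutable
    rules: list[Rule]
```

Because composites are a single `Rule` with `sub_conditions`, the evaluator can score the rule once rather than double-counting each leg as an independent failure.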

Rubric compilation moves from weeks of SME and engineering coordination to hours. That shift alone is material enough to drive customer commitment.


Stage 2: brute-force evaluation

With the rubric compiled, the production loop is straightforward and brutal. Chunk the transcript by a fixed token size using tiktoken. For each chunk, loop over all active rules. Send one LLM prompt per chunk-rule pair. The model returns structured JSON with pass/fail, a timestamp, and a reasoning quote.
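A minimal sketch of that loop, reusing the `Rule` model from the schema sketch above. `EVAL_PROMPT` and `call_llm` are placeholders for the actual prompt template and inference client:

```python
import json
import tiktoken

CHUNK_TOKENS = 2000  # illustrative fixed window; tuned to the serving model's context in practice

def chunk_transcript(text: str) -> list[str]:
    """Split a transcript into fixed-size token windows."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + CHUNK_TOKENS])
        for i in range(0, len(tokens), CHUNK_TOKENS)
    ]

def evaluate_call(transcript: str, rules: list[Rule]) -> list[dict]:
    """Brute force: every active rule against every chunk, one prompt per pair."""
    verdicts = []
    for chunk_idx, chunk in enumerate(chunk_transcript(transcript)):
        for rule in rules:
            prompt = EVAL_PROMPT.format(rule=rule.criteria, chunk=chunk)
            raw = call_llm(prompt)     # placeholder inference client
            verdict = json.loads(raw)  # {"pass": bool, "timestamp": str, "quote": str}
            verdict.update(rule_id=rule.rule_id, chunk=chunk_idx)
            verdicts.append(verdict)
    return verdicts
```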

At enterprise scale, the math is explicit. A typical rubric has 25-50 rules. A transcript splits into 3-8 chunks depending on call length. That produces hundreds of LLM calls per transcript: 40 rules times 5 chunks is 200 prompts for a single call. For a daily batch of thousands of calls, the total scales to hundreds of thousands of prompts per batch. On an API billing model this is prohibitive. On self-hosted continuous-batching inference, a mid-tier GPU handles moderate batch volumes.

The evaluation is exhaustive and dumb. Every rule runs against every chunk. A call-opening rule is evaluated against a chunk from the resolution phase. A closing rule is evaluated against a chunk from the opening. No filtering by chunk relevance. No retrieval layer. No embeddings to route rules to the chunks where they might apply.

This is brute force. It works because the throughput is sufficient and the SLA is generous. It wastes compute. I knew it at the time. Given infinite time, the first optimization would be a lightweight classifier that routes each rule to the chunks where it has any probability of applying.

Production pragmatism: The elegant solution is not always the right solution. A brute-force loop that finishes overnight and produces correct results is better than a clever routing layer that misses SLA because it was not ready in time. Ship the brute force. Document the limitation. Optimize later.


The on-prem constraint

Regulated buyers will not send call transcripts to an external API. SOC 2, PCI-DSS, and internal risk committees make that non-negotiable. The infrastructure is fixed: a single GPU node, no auto-scaling, no cloud burst. The SLA is 24 hours to process a daily batch.

That constraint determines the architecture. Not model selection. Not prompt engineering. The hard ceiling is GPU VRAM and the hard floor is throughput. Self-hosted inference with continuous batching is the only option that meets both. API billing at enterprise scale, hundreds of LLM calls per transcript multiplied by thousands of calls per batch, is economically prohibitive. On a self-hosted engine with 4-bit quantization and continuous batching via vLLM, a single A10G or L4 handles the load. The GPU must stay saturated. Without that saturation, the batch misses SLA.
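For concreteness, a minimal vLLM sketch of that setup. The checkpoint name is illustrative (any mid-size model with a 4-bit AWQ quantization would do), and `EVAL_PROMPT` and `pairs` stand in for the prompt template and the chunk-rule workload from the evaluation loop:

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint: a mid-size instruct model with an AWQ 4-bit quant
# fits on a single 24 GB A10G or L4 with room for the KV cache.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    quantization="awq",
    max_model_len=8192,            # enough for one chunk + rule text + JSON output
    gpu_memory_utilization=0.90,   # leave headroom for batched KV cache growth
)

params = SamplingParams(temperature=0.0, max_tokens=256)

# Continuous batching: submit the whole chunk-rule workload at once and let
# the engine keep the GPU saturated, rather than issuing one request at a time.
prompts = [EVAL_PROMPT.format(rule=r.criteria, chunk=c) for c, r in pairs]
outputs = llm.generate(prompts, params)
results = [o.outputs[0].text for o in outputs]
```

Submitting the full workload in one `generate` call is what keeps the GPU saturated; a request-at-a-time client loop would leave the batch scheduler starved and miss the SLA.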

The constraint also means no streaming architecture, no dynamic scaling, and no fallback to a larger cloud model when the local one struggles. The system has to work within its GPU footprint or it does not work at all.


Model selection in resource-constrained deployments

In self-hosted deployments, model choice is a three-way trade among (a) 4-bit quantization stability, (b) structured output reliability under load, and (c) GPU footprint including context and batching overhead.

The decision framework is counter-intuitive. The obvious large model, the one with the strongest reasoning benchmarks, is the wrong first choice. Quantization stability varies substantially across model families, and the candidates that benchmark best at full precision can fail at 4-bit in edge cases that only surface at batch scale. Structured output reliability under load matters more than raw reasoning benchmarks when the downstream pipeline depends on every response parsing cleanly. And VRAM headroom for context and batching is a load-bearing constraint, not a rounding detail. The right mid-size model, validated by SME review on a small curated sample rather than benchmark intuition, outperforms the obvious pick.

The profile I commit to: a mid-size dense model in the 14-32B range. Large enough to reason about regulatory nuance. Small enough to quantize reliably and fit on commodity GPU hardware.

If I were doing this today, an LLM-as-judge eval harness over a golden set would be the first thing I built, not the last.


Known failure modes

Every system in this problem class ships with failure modes that are known at deployment time, not discovered later. Here are four.

First, multi-turn checklist rules. Qualifying questions are scattered across the transcript, with no single chunk containing the full checklist. The evaluation runs per-chunk, per-rule, so a checklist rule that requires all questions to be asked can only evaluate whether they appear in that specific chunk. No cross-chunk memory. No aggregation layer. The workaround is to treat each checklist item as a separate rule, which helps but does not fully solve the sequencing problem.
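One way to make that workaround concrete: expand each checklist item into a sub-rule at compile time, then aggregate per-chunk verdicts afterward. The sketch below assumes sub-rules are evaluated under IDs of the form `{rule_id}.{condition_id}`; the ordering check works only at chunk granularity, which is exactly why it does not fully solve sequencing:

```python
def aggregate_checklist(rule: Rule, verdicts: list[dict]) -> bool:
    """Combine per-chunk verdicts for a checklist rule.

    A sub-condition passes if it passed in ANY chunk; the composite passes
    only if every sub-condition was found somewhere. For "ordered" checklists
    the first passing chunk indices must also be non-decreasing (an
    approximation, since ordering inside a single chunk is invisible here).
    """
    first_pass: dict[str, int] = {}
    for v in verdicts:
        if v["pass"] and v["rule_id"].startswith(rule.rule_id + "."):
            prev = first_pass.get(v["rule_id"], v["chunk"])
            first_pass[v["rule_id"]] = min(prev, v["chunk"])

    sub_ids = [f"{rule.rule_id}.{s.condition_id}" for s in rule.sub_conditions]
    if not all(sid in first_pass for sid in sub_ids):
        return False  # at least one checklist item never appeared
    if rule.composite == "ordered":
        chunks = [first_pass[sid] for sid in sub_ids]
        return all(a <= b for a, b in zip(chunks, chunks[1:]))
    return True
```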

Second, long-call latency. Chunking keeps individual prompts within context limits, but evaluation time scales linearly with call length times rule count. The batch SLA absorbs this, but the latency distribution is wide. Long calls finish last.

Third, low-quality transcripts. ASR quality varies by call language. Accented speech, overlapping speakers, and poor audio quality produce transcript text that is gibberish. The LLM evaluates gibberish against rules and produces garbage output. No transcript quality score filters bad audio before evaluation. A noisy ASR layer becomes a 100% error rate in the evaluation layer for that call.
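No such gate existed here, but even a crude one would help. A hedged sketch, with weights and threshold that are assumptions to be calibrated against SME-labeled transcripts, not known-good values:

```python
import re

def transcript_quality(text: str, asr_confidences: list[float]) -> float:
    """Crude quality score in [0, 1] from ASR confidence plus a text sanity check."""
    if not text.strip():
        return 0.0
    mean_conf = sum(asr_confidences) / max(len(asr_confidences), 1)
    words = text.split()
    # Fraction of tokens that look like real words rather than ASR noise.
    wordlike = sum(bool(re.fullmatch(r"[A-Za-z'$%.,-]+", w)) for w in words) / len(words)
    return 0.5 * mean_conf + 0.5 * wordlike

def should_evaluate(text: str, asr_confidences: list[float], threshold: float = 0.6) -> bool:
    # Route low-quality calls to human review instead of burning GPU cycles
    # producing garbage verdicts.
    return transcript_quality(text, asr_confidences) >= threshold
```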

Fourth, no dynamic routing. Every rule runs on every chunk regardless of relevance. It is the largest source of wasted compute and the easiest to fix in a rebuild. An embedding-based router that maps rules to relevant chunks would cut the inference load significantly.
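A hedged sketch of what that router could look like, using a small embedding model (the sentence-transformers encoder named here is an illustrative choice, not a recommendation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative encoder; any small embedding model that runs on CPU or fits
# beside the main LLM on the same GPU would do.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def route_rules(rules: list[Rule], chunks: list[str], threshold: float = 0.25):
    """Yield only the (chunk_idx, rule) pairs worth sending to the LLM.

    The threshold trades recall for compute. It must be tuned conservatively:
    a missed auto-fail is far worse than a wasted prompt.
    """
    rule_vecs = encoder.encode([r.criteria for r in rules], normalize_embeddings=True)
    chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)
    sims = chunk_vecs @ rule_vecs.T  # cosine similarity via normalized dot products
    for chunk_idx, rule_idx in zip(*np.where(sims >= threshold)):
        yield int(chunk_idx), rules[int(rule_idx)]
```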

Leading with limitations earns more credibility than leading with capabilities. Disclosing known failure modes honestly is not weakness. It is what regulated buyers actually want to see. A system that claims 100% coverage with no documented edge cases is the one they distrust.


How I think about building these systems today

This is the post's center of gravity. Start with evaluation infrastructure, not model selection: a small annotated golden set (fifty calls is enough) and a deterministic baseline grader, before reaching for anything beyond the smallest competent model.
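A minimal sketch of that harness, assuming golden annotations keyed by (call_id, rule_id) and the `Rule`/`Severity` types from the schema sketch earlier; the structure matters more than the specifics:

```python
from collections import defaultdict

def score_against_golden(predictions: dict, golden: dict, rules: list[Rule]):
    """Per-rule agreement between grader verdicts and SME annotations.

    Both dicts map (call_id, rule_id) -> bool, where True means "pass".
    Auto-fail false negatives are broken out separately because they carry
    regulatory exposure, not just accuracy loss.
    """
    agree: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    missed_auto_fails = []
    severity = {r.rule_id: r.severity for r in rules}

    for (call_id, rule_id), truth in golden.items():
        pred = predictions.get((call_id, rule_id))
        total[rule_id] += 1
        if pred == truth:
            agree[rule_id] += 1
        elif truth is False and pred is True and severity[rule_id] is Severity.AUTO_FAIL:
            # The grader said "pass" where the SME marked a violation.
            missed_auto_fails.append((call_id, rule_id))

    per_rule = {rid: agree[rid] / total[rid] for rid in total}
    return per_rule, missed_auto_fails
```

With this in place, every prompt change, model swap, or routing experiment gets judged by per-rule agreement and the missed-auto-fail list, not by intuition.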

The architecture I would build today is a single agent with a small set of targeted tools. Enough to read metadata, fetch transcript chunks on demand, retrieve individual rule definitions, and submit structured evaluations. The agent mimics the human QA workflow explicitly. Read metadata first. Identify the call type and length. Read targeted chunks based on violation patterns. Look up the specific rule text. Submit a structured evaluation. Budget the tool calls. Reward efficiency.
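A sketch of that tool surface and the budgeted loop around it. Tool names, the budget cap, and the `agent.act` interface are all illustrative placeholders, not any existing framework's API:

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    max_tool_calls: int = 20  # illustrative cap; the reward favors staying well under it
    used: int = 0

    def spend(self) -> bool:
        self.used += 1
        return self.used <= self.max_tool_calls

# The tool surface mirrors the human QA workflow: orient first, then read selectively.
TOOLS = {
    "read_metadata": lambda call_id: ...,       # call type, length, queue, agent
    "fetch_chunk":   lambda call_id, idx: ...,  # one transcript chunk on demand
    "get_rule":      lambda rule_id: ...,       # the exact rule text being evaluated
    "submit_eval":   lambda verdict: ...,       # structured verdict; ends the episode
}

def run_episode(agent, call_id: str, budget: AgentBudget):
    """Drive the agent until it submits a verdict or exhausts its budget."""
    observation = TOOLS["read_metadata"](call_id)
    while budget.spend():
        action = agent.act(observation)  # policy picks a tool and its arguments
        if action.tool == "submit_eval":
            return action.args["verdict"]
        observation = TOOLS[action.tool](**action.args)
    return None  # over budget: the reward function penalizes this outcome
```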

The model profile stays the same: mid-size, dense, 128K context. Same constraints, same reasons. Prompt optimization with GEPA, not hand-tuned templates. Eval-driven iteration. The golden set determines whether a prompt change helps or hurts. Not intuition.

I would explicitly reject fine-tuning as a first move. Fine-tuning is a cost-accuracy trade you make after you have exhausted prompting and agentic patterns. Reaching for it first is solving a problem you have not proven exists.

"The most common failure mode in LLM applications is not a bad model. It is a bad eval set." Source: Hamel Husain, Your AI Product Needs Evals

The baseline model plus good prompting plus tool use plus eval discipline gets you most of the way there. Fine-tuning is for the remainder, and only when the eval set proves the gap is real.

The two patterns sit on a spectrum. The brute-force approach compiles the rubric, then evaluates every rule against every chunk. It is predictable, correct, and wasteful. The agentic approach compiles the rubric, then lets an agent decide which chunks to read and which rules to evaluate. It is efficient, conditional, and harder to debug. The brute-force pattern is what you ship first. The agentic pattern is what you rebuild toward.

rubric-grader-eval is the brute-force pattern implemented. The compiler is the most carefully tuned component. It handles three rubric variance patterns: clean spreadsheets, Boolean composites, and broken documents masquerading as CSVs. It emits structured JSON that the evaluator consumes. The eval harness measures whether the compilation worked. That is the right hierarchy: compiler first, measurement second. Deep dive on the architecture.

RegTriage is the environment designed to train and evaluate that agentic pattern. Budget-aware tool use, targeted chunk retrieval, and a reward function that penalizes wasted reads. It is not a drop-in agent. It is the training ground. Baseline findings and design decisions.

auditguard-mcp completes the verifiability stack. While rubric-grader-eval proves the LLM produced correct verdicts and RegTriage trains agents to scale, auditguard-mcp answers the question a regulator asks first: did the tool call itself follow policy? Seven-step compliance pipeline for MCP servers.

Scrutiny is the vertical application. A 12-rule FDCPA/Reg F rubric. One evaluator. One UI. A dual-path evaluation that scores a collections call transcript in under 60 seconds. The architecture and the four things that break.


Closing

The rubric-grader pattern works when the constraints are picked right. On-premise deployment. Structured output. SME-validated compilation. Honest limitations disclosed instead of hidden. The onboarding cost shifts from weeks of SME-and-engineering coordination to hours. The ongoing QA cost shifts from linear-in-call-volume human review to capped-by-batch-throughput compute. Both are material reductions.

In regulated domains, trust compounds from what you disclose, not what you hide. Leading with the limitations of a system earns more credibility than leading with its capabilities.

I am taking on a small number of pilot engagements this quarter for teams building compliance-grade LLM systems. Scorecard design, rubric compilation, or evaluation harness setup. If that is you, or if any of this resonates and you just want to trade notes, I am at ree2raz@proton.me.


References

  • Hamel Husain, Your AI Product Needs Evals. The canonical post on eval-driven LLM development and the insight that the most common failure mode is a bad eval set, not a bad model.
  • GEPA: Generative Prompt Optimization. Prompt optimization with LLM-generated candidates, evaluated against a golden set.
  • vLLM. High-throughput LLM serving with continuous batching. The default self-hosted inference engine for this architecture.
  • Pydantic. Data validation library used for rubric schemas and LLM output parsing.
  • tiktoken. OpenAI's tokenizer used for chunking transcripts to fixed token sizes.
  • SOC 2 Type II. Service organization controls for security, availability, and confidentiality.
  • PCI-DSS. Payment Card Industry Data Security Standard.