Auditing debt-collection calls against FDCPA with a single LLM call

A mid-market collections agency runs 40 agents handling outbound and inbound debt-collection calls. Their QA team samples 2% of calls manually. Each auditor listens to the recording, checks a checklist, marks violations. At 2% sampling and 40 agents making ~80 calls per day, that is roughly 64 calls reviewed out of 3,200. The other 3,136 go unreviewed.

One of the unreviewed calls was missing the Mini-Miranda disclosure. The agent jumped straight to the payment demand. A month later, that call became a class action. FDCPA statutory damages run up to $1,000 per plaintiff plus attorney fees, and a class action caps at the lesser of $500,000 or 1% of the collector's net worth. The missed disclosure was flagrant. A human auditor would have caught it. But it was in the 98% nobody listened to.

This is not a hypothetical. It is the structural reality of debt-collection compliance in 2026. QA is bottlenecked by human listening speed. A 4-minute call takes 4 minutes to review. At 2% sampling, the math says you miss 49 out of every 50 violations. Not because your auditors are bad. Because you cannot afford to review all 50.

Scrutiny is an architecture exploration that answers one question: what happens when you replace the sampling bottleneck with a single LLM call? Paste a redacted transcript. Get a 12-rule FDCPA/Reg F compliance report in under 60 seconds. The live demo is at scrutiny.rituraj.info. This post walks through the rubric design, the dual-path evaluator, the synthetic transcript pipeline, and the four things that break.


The rubric

The heart of the system is a 12-rule JSON schema derived from the Fair Debt Collection Practices Act (15 U.S.C. § 1692) and Regulation F (12 CFR Part 1006). Every rule has a rule_id, a description with evaluation criteria, a legal_basis citation, an is_autofail flag, and an evaluability field that tells the evaluator how to check it.

The rules fall into three types.

Transcript-only rules. Eight rules that can be evaluated from the conversation text alone. Mini-Miranda recital (FDCPA-001): did the agent say "this is an attempt to collect a debt"? Harassment (FDCPA-005): profanity, threats, abusive language? Third-party disclosure (FDCPA-004): discussed the debt with a roommate? False threats (FDCPA-007): threatened arrest or wage garnishment without legal authority? These are semantic judgments. The LLM reads the transcript and returns pass/fail with verbatim evidence quotes.

Metadata-only rules. One rule. Call time compliance (FDCPA-003). The statute prohibits calls before 8:00 AM or after 9:00 PM consumer local time. This is a deterministic check: parse the call_timestamp_local from the metadata sidecar, check the hour, done. No LLM needed. A few lines of Python.

Hybrid rules. Three rules that use both. Validation notice reference (FDCPA-002): the LLM checks if the agent mentioned the right to written validation during the call. The metadata cross-checks whether a notice was actually sent. Cease-and-desist (FDCPA-009): the LLM checks if the consumer said "stop calling me" and the agent continued. The metadata confirms whether a prior written C&D exists on file. Dispute handling (FDCPA-010): same pattern.
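
To make one of these concrete: for a transcript-only rule like harassment, a single entry in the evaluator's output might look like the fragment below. The verdict, severity, and points_earned fields appear in the evaluator code later in this post; evidence and rationale are assumed names, shown for illustration.

{
  "rule_id": "FDCPA-005",
  "verdict": "fail",
  "severity": "critical",
  "points_earned": 0,
  "evidence": ["Stop being so damn difficult."],
  "rationale": "Profanity directed at the consumer."
}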

The design decision worth noting: this is a flat rubric, not a hierarchical one. rubric-grader-eval uses category→rule→sub_condition nesting for technical documentation because those rubrics have composite logic. AND conditions. OR conditions. Collections rubrics are different. Each rule maps to a single statutory section. A missed Mini-Miranda is 1692e(11). A false threat of arrest is 1692e(4). There are no composites. The hierarchy adds indentation, not value.

A fragment of the compiled rubric:

{
  "rule_id": "FDCPA-001",
  "rule_name": "Mini-Miranda Recital",
  "description": "The collector must disclose that the call is from a debt
  collector and that any information obtained will be used for that purpose.",
  "category": "Disclosure",
  "is_autofail": true,
  "points": 10,
  "legal_basis": "15 U.S.C. § 1692e(11); 12 CFR § 1006.18(d)",
  "evaluability": "transcript"
}

The is_autofail field is not a point deduction. It is a hard gate. A single autofail violation means the overall audit score is FAIL regardless of what happens with the other 11 rules. This mirrors how FDCPA litigation works. The plaintiff does not need to prove multiple violations. One is enough for statutory damages.

Autofail rules exist because the statute does not give partial credit. An agent who gives the Mini-Miranda and harasses the consumer still gets sued for harassment. The other 11 rules passing does not cancel the one that failed.
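
A minimal sketch of the gate, assuming each rule result carries a rule_id, a verdict, and points_earned (the 80% threshold on the non-autofail path is an invented placeholder, not the shipped behavior):

def overall_verdict(rule_results: list, rubric: list[dict]) -> str:
    # Hard gate: a single failed autofail rule fails the audit outright.
    autofail_ids = {r["rule_id"] for r in rubric if r["is_autofail"]}
    for res in rule_results:
        if res.verdict == "fail" and res.rule_id in autofail_ids:
            return "FAIL"
    # No autofail violations: fall back to the point-based score.
    earned = sum(res.points_earned for res in rule_results)
    possible = sum(r["points"] for r in rubric)
    return "PASS" if earned >= 0.8 * possible else "FAIL"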


Dual-path evaluation

rubric-grader-eval uses brute-force evaluation. Every rule against every chunk of text. With 15 rules and 10 chunks, that is 150 LLM calls per document. It is predictable, correct, and expensive. Scrutiny takes the opposite approach: one LLM call for everything.

This works because collections calls are short. The average call is 2 to 4 minutes. At roughly 150 words per minute of conversational speech, that is 300 to 600 words of ASR text. The full transcript plus 12 rules plus evaluation instructions fits comfortably in a single context window.

The evaluator sends one system prompt containing all 12 rules with evaluation criteria, one user message containing the full formatted transcript, and one metadata context block. The LLM returns a single JSON object with 12 rule_results and a narrative summary. The deterministic call-time check runs separately in Python and overrides the LLM's placeholder for FDCPA-003.

from datetime import datetime

def _check_call_time(metadata: TranscriptMetadata) -> RuleResult:
    # Without a local timestamp and timezone, the rule cannot be evaluated.
    if not metadata.call_timestamp_local or not metadata.consumer_timezone:
        return RuleResult(verdict="not_evaluable", ...)

    # The timestamp is already expressed in the consumer's local time.
    ts = datetime.fromisoformat(metadata.call_timestamp_local)
    hour = ts.hour

    # FDCPA permits calls between 8:00 AM and 9:00 PM local time. Checking
    # the hour alone is conservative: the entire 9 PM hour counts as a fail.
    if 8 <= hour < 21:
        return RuleResult(verdict="pass", points_earned=10)
    else:
        return RuleResult(verdict="fail", severity="critical", points_earned=0)

The tradeoff is real. The brute-force approach gives you per-chunk evidence granularity. If a violation is in chunk 3 of 8, you know exactly where. The single-call approach gives you one pass/fail per rule across the entire transcript. You lose spatial precision. You gain a 150× cost reduction and sub-5-second latency.

For a demo that runs in a browser against paste-and-evaluate transcripts, the single-call approach is the right starting point. For a production system evaluating hour-long calls against 50-rule rubrics, the brute-force approach is the right rebuild. The two patterns sit on a spectrum, not a hierarchy.


Synthetic transcripts as a design tool

The five demo transcripts shipped with Scrutiny are synthetic. This is deliberate. Three reasons.

First, real debt-collection transcripts contain PII. Names, account numbers, Social Security numbers, employer names, street addresses. Sharing those publicly is not a legal risk. It is a certainty.

Second, synthetic transcripts let you plant violations exactly where you want them. Each of the five targets one or two specific rules:

  • Clean call: zero violations, all rules pass.
  • No_Miranda: agent skips the Mini-Miranda disclosure entirely.
  • Voicemail: agent slips debt details to a roommate.
  • Cease_Desist: consumer says "stop calling me" three times, agent keeps pivoting to payment.
  • Harassment: agent threatens wage garnishment and arrest, and keeps calling.

Third, the three-file pipeline produces artifacts that mirror a real audit workflow. Each scenario ships as:

  1. A raw transcript (raw.md) with full PII, ASR artifacts, and natural dialog. This is what the speech-to-text engine produces.
  2. A redacted transcript (.json) with PII replaced by placeholders: [CONSUMER_NAME], [AGENT_NAME], [PHONE_NUMBER], [ACCOUNT_NUMBER], [STREET_ADDRESS]. Dollar amounts and threat language stay intact.
  3. A metadata sidecar (_meta.json) with call timestamp, timezone, call attempt count, validation notice status, debt amounts, and flag fields.
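
A hypothetical sidecar for the harassment scenario, with keys named after the fields listed above (the exact keys in the repository may differ):

{
  "call_timestamp_local": "2026-01-14T19:42:00",
  "consumer_timezone": "America/Chicago",
  "call_attempts_past_7_days": 8,
  "validation_notice_sent": true,
  "original_debt_amount": 890.00,
  "claimed_debt_amount": 950.00,
  "cease_and_desist_on_file": false
}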

The redaction format matters more than it seems. [CONSUMER_NAME] tells the LLM "this is a person" without leaking identity. This is enough for entity-aware evaluation. The LLM can reason about who said what without knowing who they are.

tx_005_harassment is the densest example. The agent provides the Mini-Miranda correctly. Then threatens wage garnishment by name: "We have your employer on file. We will contact your payroll department." Threatens arrest: "A warrant could be issued for your arrest." One profanity: "Stop being so damn difficult." The metadata shows 8 call attempts in 7 days (over the Reg F limit) and a claimed debt of $950 against an original of $890. Five violations across one 28-turn transcript. Each maps to a specific statutory section.

The most realistic transcripts are not the most egregious. A transcript where an agent screams obscenities reads like a complaint, not a QA sample. The credible violations are subtler: a disclosure skipped, a request ignored, a fee added without authorization. The synthetic transcript design principles that shaped these five are documented in the Scrutiny repository under transcripts/.


What breaks

Four failure modes. Documented here because honest limitations are more useful than false confidence.

1. LLM confidence on edge cases. "We can garnish your wages" is a clear false threat. FDCPA-007 fail. "The creditor may consider legal options" is ambiguous. The agent did not threaten action. They described a possibility. The LLM sometimes flags it anyway. The evaluator has no confidence threshold. A pass/fail verdict arrives without a probability. For a demo, that is fine. For a production system that escalates flagged calls to human review, you want a score, not a binary. One mitigation is sketched after this list.

2. Voicemail vs. live call detection. FDCPA-011 checks whether a voicemail exceeds the limited-content safe harbor under Reg F. The LLM must recognize that the transcript represents a voicemail, not a live call. This depends entirely on speaker labels being correct. If the transcript labels all turns as agent and consumer without context that the consumer is absent, the LLM treats it as a conversation and passes a voicemail that should fail.

3. Metadata dependency. Four of the twelve rules cross-check metadata. If the metadata sidecar is incomplete, those rules fall to not_evaluable. A transcript without a timestamp cannot be checked for time-of-day compliance. A transcript without validation_notice_sent cannot confirm the written notice was mailed. The evaluator degrades gracefully. But coverage drops. In a production deployment, the metadata sidecar must be populated by the same system that stores the transcript. That integration is non-trivial.

4. Single-call ceiling. Twelve rules in one prompt works. Fifty rules in one prompt does not. The LLM loses focus. Rules get skipped or hallucinated. At that scale, the brute-force chunking approach from rubric-grader-eval becomes the right choice. The single-call evaluator is the right architecture for a vertical tool with a fixed, curated rubric. It is not a general-purpose grading pipeline.
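
On the first failure mode: one mitigation is to request a per-rule confidence value in the structured output and route low-confidence failures to a human queue rather than auto-flagging them. A sketch, with the threshold and routing labels invented for illustration:

def triage(result: RuleResult, confidence: float, threshold: float = 0.75) -> str:
    # High-confidence verdicts stand on their own; ambiguous failures
    # go to a human reviewer instead of being auto-flagged.
    if result.verdict == "fail":
        return "auto_flag" if confidence >= threshold else "human_review"
    return "auto_pass"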


Where this sits in the stack

Scrutiny is the vertical application. The progression from infrastructure to product is visible across four repos.

rubric-grader-eval is the brute-force baseline for large rubrics. Three variance cases in the compiler. Per-chunk evaluation. Golden-set ground truth. The architecture and the compiler-first philosophy.

RegTriage is the agentic rebuild for budget-aware evaluation. Targeted chunk retrieval. Severity-weighted F1 scoring. Draft Incident Reports for human sign-off. A 72B model cannot tell a legal liability from a P&L leak.

auditguard-mcp is the audit pipeline for LLM tool calls. Seven-step compliance pipeline per execution. Structured audit logs for regulatory review. Building MCP servers that survive a regulator's audit.

Scrutiny is the domain-specific application of the pattern. One regulation. One evaluator. One UI. The rubric is compiled once. The evaluator is a single call. The audience is a compliance officer, not an engineer.


I am taking on pilot engagements this quarter for teams running compliance QA on debt-collection calls. If you are sampling 2% and wondering what the other 98% contains, the live demo is at scrutiny.rituraj.info. The source code is at github.com/ree2raz/scrutiny (Apache 2.0).

If you want a custom rubric for your specific compliance workflow, whether that means a different regulation, a different call type, or a different set of rules, I am at ree2raz@proton.me.

References

  • Fair Debt Collection Practices Act, 15 U.S.C. § 1692. The parent statute. Mini-Miranda, harassment, false representation, cease-and-desist, third-party disclosure.
  • Regulation F, 12 CFR Part 1006. The CFPB rule that operationalized FDCPA for modern collection practices. Call frequency limits, limited-content voicemail, validation notice format.
  • rubric-grader-eval. Reference pattern for compiling unstructured rubrics into evaluable JSON schemas. The compiler handles three real-world variance cases.
  • RegTriage-OpenEnv. RL environment for compliance auditing. Agentic rebuild of the brute-force evaluator with budget-aware tool use.
  • auditguard-mcp. MCP server with a seven-step compliance pipeline per tool call. RBAC, PII detection, policy enforcement, structured audit logging.