Fine-tuning a 3B model for FDCPA classification: when it is and is not worth it
I fine-tuned a 3B model for FDCPA classification. It got 84.6% accuracy. The API model got 100%. The 3B model might still be the right choice. Here is why, and when it is not.
Scrutiny audits debt-collection calls against 12 FDCPA rules. Every call, every rule, one LLM call per transcript. At scale, that is real money. The question is not whether a 3B model beats an API model on accuracy. It does not. The question is whether a 3B model can make an API model cheaper to operate by handling the easy cases. That is a useful question. The 100% vs 84.6% comparison is a boring one.
The fine-tuned model and the eval pipeline are public: fdcpa-rule-classifier on GitHub, adapter weights on HuggingFace. The full pipeline, data generation through failure analysis, is reproducible in a weekend for $0.
The table that matters
| Model | Accuracy | F1 (macro) | Parse Rate | Effective Accuracy |
|---|---|---|---|---|
| o3-mini (ceiling) | 100.0% | 1.000 | 46.2% | 46.2% |
| Qwen2.5-3B base (floor) | 76.9% | 0.769 | 100.0% | 76.9% |
| Qwen2.5-3B QLoRA | 84.6% | 0.846 | 100.0% | 84.6% |
The "effective accuracy" column treats parse failures as wrong answers, because that is what they are in production. If your model cannot produce a parseable response, the answer is useless regardless of whether the model "would have been right" had it formatted correctly. o3-mini's effective accuracy is 46.2%. The 100% figure is conditional on the model formatting correctly, which it failed to do more than half the time.
The QLoRA fine-tune closed roughly a third (33%) of the gap between the base model (76.9%) and the API ceiling (100%). Modest. The interesting number is not 84.6%. It is what 84.6% at 100% parse rate means for a two-tier system.
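The arithmetic behind the "effective accuracy" column and the gap-closed figure is simple enough to make explicit. A minimal sketch (function names are mine, not from the repo):

```python
def effective_accuracy(conditional_accuracy: float, parse_rate: float) -> float:
    """Treat parse failures as wrong answers: only parsed outputs can score."""
    return conditional_accuracy * parse_rate

def gap_closed(floor: float, tuned: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap recovered by fine-tuning."""
    return (tuned - floor) / (ceiling - floor)

print(round(effective_accuracy(1.000, 0.462), 3))  # o3-mini: 0.462
print(round(effective_accuracy(0.846, 1.000), 3))  # QLoRA:   0.846
print(round(gap_closed(0.769, 0.846, 1.000), 3))   # 0.333
```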
Two things the numbers hide
o3-mini's "100%" is a conditional claim. It scored perfectly on the 46.2% of calls where it produced valid JSON. On the other 53.8%, it returned malformed output. This is not a rounding artifact. In production, you would need retry logic, structured output enforcement, or both. Each of those adds latency, cost, and failure surface. The parse rate is the production metric. The accuracy metric is a research metric.
All 6 QLoRA errors are false negatives. The model over-predicts "fail." It sees surface-level non-compliance signals (an agent who mentioned verification was pending, a call that happened at 8:59 PM, an agent who discussed the validation process) and flags the call. It learned keyword-level pattern matching from training data that had too many dramatic violations and not enough "technically compliant but messy" examples. This is a training data problem, not an architecture problem. More on that below.
The failure analysis
This is the part that separates this post from generic "I fine-tuned a model" content.
FDCPA-003: the call at 8:59 PM. The FDCPA prohibits calls before 8:00 AM or after 9:00 PM consumer local time. This call landed at 8:59 PM. Legally within hours. The transcript discusses the ambiguity of time zones and multiple call attempts. The model sees "ambiguity about call time" and flags it as a violation. It misses that the agent is explaining compliance. The model learned that time-of-day controversy equals violation. It did not learn that resolved controversy within legal bounds equals compliance.
FDCPA-010: dispute handling. The agent says verification documents have not been sent yet. The model sees "verification not sent" and predicts "fail." What the model missed: the agent properly paused collection activity pending verification. The rule requires pausing, not completing. The agent complied with the rule. The model saw a keyword associated with non-compliance and tripped.
FDCPA-002: validation notice reference. The agent discusses the validation process, tells the consumer how to request written verification, and directs them to submit in writing. The model sees "discussing validation" and predicts "fail." But the agent followed the correct procedure. The rule requires informing the consumer of their right to validation and directing them to exercise it. The agent did both.
Two patterns emerge from the failures. First, the model learned to flag violations at the keyword level, not the semantic level. "Verification not sent" is a violation signal. "Verification not sent, collection paused pending verification" is a compliance signal. The model cannot distinguish them reliably. Second, all six failures are false negatives in the same direction. The model is biased toward "fail." This is a direct consequence of training data that had more dramatic violations than subtle compliance examples.
Fine-tuning on synthetic data teaches keyword-level pattern matching, not legal reasoning. This is expected. The model catches the obvious stuff and escalates the ambiguous stuff. That is the point.
The pre-filter pattern
The real architecture is not "replace the API." It is a two-tier system.
Transcript → Qwen QLoRA (84.6% accuracy, $0, ~6.6s)
├─ High confidence pass → Done (save API call)
├─ High confidence fail → Flag for review
└─ Low confidence or ambiguous → Escalate to API
The router is the unsolved piece. Three options, each with tradeoffs:
Logit-based confidence. Extract the probability of the predicted class from the model's output logits. If the model is 95% confident the agent passed FDCPA-001, route to "done." If the model is 60% confident, escalate. This is free, fast, and requires no additional model. The risk: softmax probabilities from a 3B model are not well-calibrated, especially at the tails. A confidence threshold of 0.9 might still be overconfident on out-of-distribution inputs.
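A logit-based router reduces to a softmax over the two verdict logits plus a threshold. A minimal sketch, assuming you can extract the pass/fail token logits from the model's output (the 0.9 threshold is illustrative, not calibrated):

```python
import math

def route(pass_logit: float, fail_logit: float, threshold: float = 0.9) -> str:
    """Softmax over the two verdict logits; escalate when the winning
    class falls below the confidence threshold."""
    m = max(pass_logit, fail_logit)           # subtract max for stability
    exp_pass = math.exp(pass_logit - m)
    exp_fail = math.exp(fail_logit - m)
    p_pass = exp_pass / (exp_pass + exp_fail)
    p_top = max(p_pass, 1 - p_pass)
    if p_top < threshold:
        return "escalate"                     # low confidence -> API tier
    return "done" if p_pass > 0.5 else "flag_for_review"

print(route(4.0, 1.0))  # confident pass -> done
print(route(1.2, 1.0))  # low margin -> escalate
```

The caveat in the text applies directly here: on out-of-distribution transcripts, `p_top` can sit above 0.9 while being wrong, which is why the threshold needs validation against a held-out set before you trust the routing.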
Rule-type heuristic. Route by rule category. The FDCPA rubric has three types: transcript-only (8 rules), metadata-only (1 rule), and hybrid (3 rules). Metadata-only rules are deterministic and never need the API. Transcript-only rules with surface-level signals (profanity, explicit threats) can stay local. Rules requiring legal judgment (harassment, false representation) get escalated by default. This is simple, interpretable, and does not require confidence calibration. The tradeoff: it is static. It cannot adapt to a model that gets better or worse on specific rules over time.
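The rule-type heuristic is a static lookup table. A sketch, with the rule-ID-to-category mapping as an illustrative placeholder rather than the actual Scrutiny rubric assignment:

```python
# Illustrative mapping only; the real rubric assigns its own categories.
RULE_TYPE = {
    "FDCPA-001": "metadata",    # deterministic, never needs the API
    "FDCPA-003": "transcript",  # surface-level signals, stays local
    "FDCPA-007": "judgment",    # legal judgment, escalate by default
}

def route_by_rule(rule_id: str) -> str:
    """Static routing: only judgment-heavy rules hit the API tier.
    Unknown rules escalate, which fails safe."""
    kind = RULE_TYPE.get(rule_id, "judgment")
    return "escalate" if kind == "judgment" else "local"
```

The fail-safe default (unknown rules escalate) is the one design choice worth keeping even if everything else changes: a routing mistake toward the API costs money, while a routing mistake toward the local model costs accuracy.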
Separate router model. Train a lightweight classifier that takes the transcript and rule description and predicts "handle locally" or "escalate." This is the most flexible approach and the most engineering overhead. For 12 rules, it is probably not justified. For 200 rules, it might be.
Back-of-envelope math for the confidence-based approach: if Qwen handles 70% of cases at 85% accuracy and the API handles the remaining 30% at ~100% accuracy (when it parses), system-level accuracy is roughly 0.7 x 0.85 + 0.3 x 1.0 = 89.5%. That assumes random routing. In practice, two things shift the number. Qwen's accuracy on the kept set is likely higher than 85% because the easy cases (obvious violations, obvious compliance) are the ones the model handles well. The API's accuracy on the escalated set may be lower than 100% because those are the hard, ambiguous cases. The 89.5% is a starting estimate, not a claim.
The cost number is more concrete: 70% fewer API calls. For a team processing 1,000 calls per day at 12 rules per call, that is 8,400 fewer API calls per day. At any per-call API cost above zero, the savings compound.
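Both estimates above fall out of the same two-line model. A sketch under the stated random-routing assumption:

```python
def system_estimate(keep_rate: float, local_acc: float, api_acc: float,
                    calls_per_day: int, rules_per_call: int):
    """Blended accuracy under random routing, plus API calls avoided.
    Real routing shifts both numbers, as discussed in the text."""
    accuracy = keep_rate * local_acc + (1 - keep_rate) * api_acc
    total_checks = calls_per_day * rules_per_call
    api_calls_saved = int(total_checks * keep_rate)
    return accuracy, api_calls_saved

acc, saved = system_estimate(0.70, 0.85, 1.00, 1000, 12)
print(round(acc, 3), saved)  # 0.895 8400
```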
Architecture decisions (briefly)
Why Qwen2.5-3B? Big enough to follow the instruction format. Small enough to fit in 4-bit on a T4. The 3B range is the sweet spot for "can fine-tuning help" experiments. Anything smaller struggles with prompt compliance. Anything larger does not need help.
Why QLoRA, not full fine-tune? 3B in 4-bit with LoRA rank 16 yields roughly 30M trainable parameters. Fits in 14GB VRAM with room to breathe. Full fine-tune on a T4 would require gradient checkpointing and offloading and might still OOM. QLoRA trades a few percentage points of accuracy for a 5x reduction in memory. At this scale, the trade is worth it.
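The ~30M trainable-parameter figure can be sanity-checked from the model's published dimensions. A sketch, assuming adapters on all attention and MLP projections and the Qwen2.5-3B shape as I understand it (hidden 2048, intermediate 11008, 36 layers, grouped-query attention with a 256-dim KV projection; verify against the model's `config.json` before relying on these):

```python
# Assumed Qwen2.5-3B dimensions; check the model config before trusting them.
HIDDEN, INTERMEDIATE, LAYERS, KV_DIM, RANK = 2048, 11008, 36, 256, 16

def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """A rank-r LoRA adapter on a d_in x d_out linear adds r*(d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, HIDDEN)          # q_proj
    + lora_params(HIDDEN, KV_DIM)        # k_proj (grouped-query attention)
    + lora_params(HIDDEN, KV_DIM)        # v_proj
    + lora_params(HIDDEN, HIDDEN)        # o_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # up_proj
    + lora_params(INTERMEDIATE, HIDDEN)  # down_proj
)
print(per_layer * LAYERS)  # ~30M trainable parameters
```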
Why GPT-4.1-mini as teacher, not o3-mini? o3-mini is a reasoning model. It is slower and more expensive per token. For bulk data generation (300+ examples across 12 rules), GPT-4.1-mini at 2.5M free tokens per day is the practical choice. The failure modes come from dataset size and distribution, not teacher quality.
When fine-tuning is not worth it
The title promises the "is not worth it" half. Three scenarios where you should not fine-tune a small model for this kind of task.
Low volume. If you are evaluating 50 calls per day, the API cost is negligible. Fine-tuning, maintaining, and serving a custom model has engineering overhead that dwarfs the savings. The pre-filter pattern pays for itself at 1,000+ calls/day. Below that, just call the API.
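The break-even point is worth computing for your own numbers rather than taking 1,000 calls/day on faith. A sketch where every dollar figure is a hypothetical placeholder, not a measured cost:

```python
# All dollar figures are hypothetical placeholders, not measured costs.
API_COST_PER_CHECK = 0.002    # per rule evaluation via the API
SERVING_COST_PER_DAY = 15.0   # GPU + maintenance, amortized daily
KEEP_RATE = 0.70              # fraction of checks handled locally
RULES_PER_CALL = 12

def daily_savings(calls_per_day: int) -> float:
    """Net daily savings from the pre-filter: avoided API spend
    minus the fixed cost of serving the local model."""
    avoided = calls_per_day * RULES_PER_CALL * KEEP_RATE * API_COST_PER_CHECK
    return avoided - SERVING_COST_PER_DAY

print(round(daily_savings(50), 2))    # low volume: negative, not worth it
print(round(daily_savings(1000), 2))  # higher volume: positive
```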
Frequently changing rules. Fine-tuning embeds rule understanding in weights. If your rubric changes monthly (new regulations, new client requirements, new rule interpretations), you retrain. Retraining is not free. It is a pipeline of data generation, curation, training, eval, and deployment. For a rubric that changes quarterly, the retraining tax is manageable. For one that changes weekly, it is not. Stick with prompt-based evaluation for rules that change frequently.
Unreliable synthetic data. The seed-and-expand approach (24 hand-curated examples, expanded to ~300 via GPT-4.1-mini) works when a domain expert can write 20-30 unambiguous seed examples. Compliance rules are good for this because the statutes define pass/fail with enough clarity to generate training data. Medical diagnosis, fraud detection, or other domains where "correct" is contested or subjective are harder. If your domain expert cannot write clear seeds, the teacher model will amplify ambiguity, not resolve it.
What I would do differently
More hard-negative training data. The model over-predicts "fail" because the training data had too many dramatic violations and not enough "technically compliant but messy" examples. I would generate 50% more "hard pass" examples: calls where the agent discussed non-compliance signals but followed the procedure. The 6 false negatives in the eval all share this pattern. More hard negatives would directly address them.
Larger test set. 39 examples is too small to draw per-rule conclusions. A rule with 3 test examples can flip from 67% to 33% accuracy on a single misclassification. I would target 100+ test examples with multiple human reviewers and measure inter-annotator agreement before trusting per-rule F1 scores.
Structured output training. The JSON parse failures for o3-mini (and occasional format issues with Qwen) suggest training the model to produce more reliable structured output. Constrained decoding or JSON-mode fine-tuning would close the format gap. This is low-risk, low-cost, and makes the pre-filter pattern more robust.
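Even before any format training, the pre-filter can treat a parse failure the same way the effective-accuracy column does: as a signal to escalate rather than a silent wrong answer. A minimal sketch, assuming a JSON verdict with a `"verdict"` key (the schema is my placeholder, not the repo's actual output format):

```python
import json

def parse_verdict(raw: str):
    """Parse the local model's JSON verdict. A parse or schema failure
    returns None, which the caller treats as an escalation signal."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output -> escalate to the API tier
    if not isinstance(obj, dict) or obj.get("verdict") not in ("pass", "fail"):
        return None  # valid JSON but wrong shape -> also escalate
    return obj["verdict"]

print(parse_verdict('{"verdict": "pass"}'))  # pass
print(parse_verdict('verdict: pass'))        # None -> escalate
```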
Confidence calibration. The pre-filter pattern needs a router. Training the QLoRA model to output calibrated confidence scores alongside its verdict would make the two-tier architecture operational. Without it, you are routing on a heuristic, not a measurement.
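The standard post-hoc fix here is temperature scaling: divide the logits by a scalar T fit on a held-out set, which softens overconfident probabilities without changing the predicted class. A minimal sketch (in practice T is fit by minimizing negative log-likelihood on validation data, which this snippet does not do):

```python
import math

def temperature_scale(logits, T=1.0):
    """Softmax with temperature T. T > 1 softens overconfident
    probabilities; T = 1 is the raw softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

raw = temperature_scale([4.0, 1.0], T=1.0)
cal = temperature_scale([4.0, 1.0], T=2.0)
print(round(raw[0], 3), round(cal[0], 3))  # calibration lowers the top probability
```

With a calibrated T, the router's threshold comparison becomes a measurement rather than a guess, which is exactly the gap the pre-filter pattern currently has.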
Where this sits
This is the cost-optimization layer for Scrutiny, which audits debt-collection calls against 12 FDCPA rules with a single LLM call. The eval methodology builds on patterns from rubric-grader-eval, the brute-force baseline for large rubrics. The three-way eval (ceiling/floor/fine-tuned) is a pattern worth applying whenever you are considering fine-tuning: you need to know the gap you are closing, not just the number you are hitting.
The classifier code is at github.com/ree2raz/fdcpa-rule-classifier. The QLoRA adapter is at huggingface.co/ree2raz/fdcpa-rule-classifier-qlora. Both are Apache 2.0.
Fine-tuning small models is not about replacing large ones. It is about knowing when not to call the API.
References
- Fair Debt Collection Practices Act, 15 U.S.C. § 1692. The parent statute governing debt collection practices. Mini-Miranda, harassment, false representation, cease-and-desist, third-party disclosure.
- Regulation F, 12 CFR Part 1006. CFPB rule operationalizing FDCPA for modern collection practices. Call frequency limits, limited-content voicemail, validation notice format.
- Scrutiny. The dual-path evaluator this classifier is designed to compose with. Single-call architecture for 12-rule FDCPA auditing.
- rubric-grader-eval. Reference pattern for compiling unstructured rubrics into evaluable JSON schemas. Three variance cases in the compiler.
- Qwen2.5-3B-Instruct. Base model for the QLoRA fine-tune. 3B parameters, Apache 2.0 license, strong instruction following for its size.