Case Study — Audit Intelligence

Smart Audit — an AI-augmented audit workspace where the AI does the reading, and the auditor keeps the decision

For a top-5 national health insurer's payment-integrity program. Fourteen AI components analyze every claim through an 8-stage pipeline; auditors work the output through a conversational review surface that remembers what they asked yesterday and captures every override as training data.

Client

Top-5 National Health Insurer

Industry

Healthcare — Payment Integrity

My Role

Lead Discovery + App Build

Timeline

2026

Stack

Claude Code · LangChain · GCP

Engagement

Deloitte Consulting

01 — The Challenge

Claims audit at scale — but human attention doesn't scale

I'd been embedded in a top-5 insurer's cost-of-care program for nearly two years (chatbots first, then dashboards, then the audit work itself), meeting daily with the people who actually write and run audit edits. That's where the real problem surfaced. An audit verdict looks like one clean decision; underneath it is a data problem. Medical records are barely structured: about a tenth live in the EMR, and the rest are PDFs, faxes, and feeds from outside systems. They are HIPAA-gated, and they never arrive complete or all at once, so cases stall waiting for records. And this is prepay: the call has to be right before the money goes out, not clawed back after. That is unforgiving, because a wrong call harms both a provider and a member.

The Problem

One claim's evidence was scattered across internal and external sources — EMR, PDFs, faxes, vendor feeds — much of it HIPAA-gated and arriving incomplete, so cases stalled waiting for records
Configuring a single audit edit meant hand-writing JSON to build a test claim
Audit decisions captured the verdict but lost the reasoning
AI recommendations were batched offline — auditors couldn't ask follow-ups
Override decisions were a compliance checkbox, not a training signal
Platform health was invisible — no way to catch a drifting agent until the SLA broke
Four audiences, four disconnected tools

The Shift

A pipeline that ingests, cleans, and validates the scattered sources before an agent ever reads the claim
The agent reads against the edits you configured and scores each finding — Core findings run your rules; AI findings surface what you didn't ask for, like a provider pattern behind an otherwise-clean claim, or a case to route to SIU
One workspace: AI summary, findings, similar cases, and the source PDF to verify — in one view
Inline conversational review — the chatbot answers the auditor's actual question, in context
Override justification as labeled training data for the learning loop
AgentOps dashboard: 14 components + 8 stages + feedback loops, fully legible
Same data model surfaces differently for Auditor · Manager · Executive · Platform Ops

$30–40M

Estimated Annual Savings

14

AI Components

91%+

AI Decision Accuracy

4

Persona Workspaces

02 — User Research

Four audiences, one pipeline

Same 14-component pipeline underneath; four very different surfaces on top. The auditor owns the decision; the manager owns the pattern; the executive owns the outcome; platform ops owns the machine.

Auditor

Senior Claims Auditor

Reviews flagged claims, validates AI findings, makes the final call — approve, deny, or escalate.

Pain: The prior tool meant 17 tabs and 40 minutes per claim, half of it spent hunting precedent and policy.

Manager

Audit Team Lead

Reviews auditor decisions, approves rule configurations, monitors override patterns across the team.

Pain: No way to tell if the team is agreeing with the AI or routinely overriding it — both signals matter.

Executive

VP, Payment Integrity

Reports portfolio outcomes, justifies AI investment with performance metrics at board level.

Pain: Needs one number for the C-suite, the metric trail for the CFO, and the auditor override rate for the risk committee.

AgentOps

AI Platform Lead

Monitors the 14-component pipeline, catches bottlenecks, tunes agents when override patterns surface.

Pain: When the SLA slips, need to know immediately: which agent? Which stage? Token cost spike? Queue backup?

03 — The Workflow

A day in the audit queue, end to end

Follow one auditor — John — from login to decision, then zoom out to the platform that ran underneath him. Every panel below is live React, reconstructed from the production app.

8:00 AM · Login

A brief, not a queue

John doesn't open to an empty work list. While he was gone, the platform screened all 19 overnight claims and ranked them by SLA risk, dollar value, and pattern alerts — then opened with what to do first. The morning brief is AI-driven; the prioritization is the product.
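The ranking described above can be sketched in a few lines. This is a minimal illustration, not the production logic: the `Claim` fields, the tier names, and the tie-breaking order are assumptions, and the sample claims are made up. Only the two thresholds (under 24 hours to SLA, over $10k) come from the brief itself.

```python
from dataclasses import dataclass

# Thresholds taken from the brief's tiers; everything else is hypothetical.
SLA_CRITICAL_HOURS = 24
HIGH_VALUE_DOLLARS = 10_000
TIER_ORDER = ["sla_critical", "high_priority", "pattern_alert", "routine"]

@dataclass
class Claim:
    claim_id: str
    billed_amount: float
    hours_to_sla: float
    pattern_alert: bool = False

def tier(claim: Claim) -> str:
    """Assign the first matching tier, most urgent first."""
    if claim.hours_to_sla < SLA_CRITICAL_HOURS:
        return "sla_critical"
    if claim.billed_amount > HIGH_VALUE_DOLLARS:
        return "high_priority"
    if claim.pattern_alert:
        return "pattern_alert"
    return "routine"

def rank(claims: list[Claim]) -> list[Claim]:
    """Sort by tier, then nearest SLA deadline, then largest dollar value."""
    return sorted(
        claims,
        key=lambda c: (TIER_ORDER.index(tier(c)), c.hours_to_sla, -c.billed_amount),
    )

# Made-up claims for illustration only.
queue = rank([
    Claim("A", 2_300, 70),
    Claim("B", 48_750, 1.7),
    Claim("C", 12_000, 50),
    Claim("D", 900, 60, pattern_alert=True),
])
print([c.claim_id for c in queue])
```

Putting the tier first in the sort key means an SLA-critical claim always sits above a larger claim that still has days to run, which matches the idea that the brief opens with what to do first.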

AI Morning Brief

✦ Last refresh · 2 min ago

Good morning, John.

Auditor · Monday, June 8, 2026

I've screened 19 claims overnight — 3 SLA-critical, 6 high-priority, and 6 AI pattern alerts need your review.

Priority overview

SLA-critical

< 24 hrs

High priority

> $10k

AI pattern alerts

Irregular billing

Routine review

Standard checks

Today

5/19 claims

Next SLA 1h 40m

Yesterday

12/15 claims

3 carried over

Monthly snapshot

Completed Claims

142

Avg 128/mo

Completed Lines

534

Avg 489/mo

Denials

Avg 31/mo

QA Rating

96%

Proficient

SLA Compliance

97%

Above 95% target

The first task

He opens the top SLA-critical claim

John enters the case workspace for CLM-IMP-209024 — a $48,750 implant claim from Central Spine & Joint. Everything about the claim is here, but the core is two things: on the left, what the Audit Agent found (summary, core findings, AI findings, precedent); on the right, a Chatbot he can interrogate. Open any finding and ask it Policy check, Evidence, or Reasoning, or tap a similar case — the answers type out and accumulate as a conversation. (Reconstructed for this case study — not the production screens.)

CLM-IMP-209024 · $48,750 · Implant · SLA-critical · Central Spine & Joint

Lumbar spinal fusion (CPT 22612) billed at $30,350 with a separate implant line C1713 (titanium pedicle screw system) at $18,400. C1713 is bundled into CPT 22612 under CMS IPPS/OPPS policy, so the separate line is unbundling, and it is priced about 183% above the market benchmark.

AI Recommendation · DENY · Modifier-free unbundling · 94% confidence

Policy triggered: CMS IPPS/OPPS NCCI edit on the (22612, C1713) pair
Pricing outlier: $18,400 vs $6,500 benchmark — exceeds P95
Provider pattern: 14 similar instances, 8 of 9 denials upheld on appeal
Next: open a finding tab and ask the Chatbot for the detail →
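The two core findings above, a bundled code pair billed separately and a line priced far over benchmark, reduce to a lookup and a percentage. The sketch below is illustrative only: `BUNDLED_PAIRS` is a hypothetical one-entry table standing in for real edit data, and the function names are assumptions. The dollar amounts are the ones shown on the claim.

```python
# Hypothetical bundling table: (primary procedure, component) pairs where the
# component is assumed to be included in the primary code's payment.
BUNDLED_PAIRS = {("22612", "C1713")}

def unbundled_lines(codes: list[str]) -> list[tuple[str, str]]:
    """Return every (primary, component) pair billed together on one claim."""
    billed = set(codes)
    return [(p, c) for (p, c) in sorted(BUNDLED_PAIRS) if p in billed and c in billed]

def pct_above_benchmark(billed: float, benchmark: float) -> float:
    """Percent by which a billed amount exceeds its benchmark."""
    if benchmark <= 0:
        raise ValueError("benchmark must be positive")
    return (billed - benchmark) / benchmark * 100

# Line amounts and benchmark as shown on the claim above.
lines = {"22612": 30_350.00, "C1713": 18_400.00}
hits = unbundled_lines(list(lines))
premium = pct_above_benchmark(lines["C1713"], 6_500.00)
print(hits, f"{premium:.1f}% above benchmark")
```

Keeping the pair check and the price check as separate functions mirrors how the findings are listed: each one can fire, and be questioned by the auditor, on its own.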

The decision

Agreement is easy. The override is where accountability lives.

When the auditor aligns with the AI, one click confirms and the denial reason pre-populates. When they override — approve a claim the AI said to deny, or escalate what it approved — a justification is required. That note flows to the learning loop as labeled training data. Try it:

AI Recommendation · DENY · Modifier 59 unbundling

Confidence · 87%

Step 1 — Your decision
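The rule behind this step (agreement is one click, an override needs a note, and the note becomes labeled data) can be sketched as a small validation layer. This is a hedged illustration under assumed names: `Decision`, `validate`, `to_training_example`, and the output keys are hypothetical and do not describe the production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    claim_id: str
    ai_recommendation: str   # "approve", "deny", or "escalate"
    auditor_decision: str
    justification: Optional[str] = None

    @property
    def is_override(self) -> bool:
        return self.auditor_decision != self.ai_recommendation

def validate(decision: Decision) -> None:
    """Reject an override that arrives without a written justification."""
    if decision.is_override and not (decision.justification or "").strip():
        raise ValueError("An override requires a justification note.")

def to_training_example(decision: Decision) -> Optional[dict]:
    """Turn an override into a labeled example; agreements yield None."""
    validate(decision)
    if not decision.is_override:
        return None
    return {
        "claim_id": decision.claim_id,
        "model_output": decision.ai_recommendation,
        "label": decision.auditor_decision,
        "rationale": decision.justification.strip(),
    }

# Hypothetical override: the auditor approves what the AI said to deny.
example = to_training_example(
    Decision("CLM-EXAMPLE", "deny", "approve", "Records support separate payment.")
)
print(example)
```

Validating before conversion means an override can never reach the learning loop without its rationale, which is the property the required note is meant to guarantee.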

Zoom out · Agent Ops View

A view for the people who manage the AI

Everything John just did ran on a multi-agent pipeline. The Agent Ops View is the surface for the technical team that runs it: the way a human manager watches team performance, this lets them watch the AI's architecture, workflow, and real-time status. It offers two lenses. By Architecture answers “is the machine running?”: for each stage, how many cases it's holding, average cycle time, whether a step is degraded or a bottleneck, and where cost concentrates. By Case Performance answers “are the decisions any good?”: pull one claim as a spot-check and replay exactly how the AI handled it, stage by stage, through the execution logs, the chain of thought, and the error handling. This is the surface I designed most closely with engineering, because I had to understand each agent's reasoning and failure modes before I could make them legible.

Ops summary · this week

Pipeline is healthy overall — 412 claims/hr, $2,840 compute spend, SLA 94.2%. Two items to watch: Quality Check error rate is 1.85% (above the 1% threshold — LLM-judge calibration drift), and Audit Agent cost is up 22% week-over-week on longer implant claims. Human Review is the bottleneck (64 queued). Everything else is within range.
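A summary like the one above can be derived from per-stage stats with a few threshold checks. The sketch below is an assumption-laden illustration: `StageStats` and `watch_items` are hypothetical names, the 20% cost-growth limit is a guess (only the 1% error threshold is quoted in the summary), and any figure the summary does not state is a labeled placeholder.

```python
from dataclasses import dataclass

MAX_ERROR_RATE_PCT = 1.0    # threshold quoted in the ops summary
MAX_COST_GROWTH_PCT = 20.0  # assumed limit, not from the source

@dataclass
class StageStats:
    name: str
    error_rate_pct: float
    cost_growth_pct: float  # week-over-week change in spend
    queued: int             # cases currently waiting at this stage

def watch_items(stages: list[StageStats]) -> list[str]:
    """List threshold breaches, then name the stage holding the most cases."""
    if not stages:
        return []
    notes = []
    for s in stages:
        if s.error_rate_pct > MAX_ERROR_RATE_PCT:
            notes.append(f"{s.name}: error rate {s.error_rate_pct}% is above {MAX_ERROR_RATE_PCT}%")
        if s.cost_growth_pct > MAX_COST_GROWTH_PCT:
            notes.append(f"{s.name}: cost up {s.cost_growth_pct}% week-over-week")
    bottleneck = max(stages, key=lambda s: s.queued)
    notes.append(f"{bottleneck.name}: bottleneck with {bottleneck.queued} queued")
    return notes

stages = [
    StageStats("Quality Check", 1.85, 0.0, 12),  # 0.0 and 12 are placeholders
    StageStats("Audit Agent", 0.52, 22.0, 9),    # 9 is a placeholder
    StageStats("Human Review", 0.0, 0.0, 64),    # 0.0 values are placeholders
]
for note in watch_items(stages):
    print(note)
```

Reporting breaches per stage, rather than one pipeline-wide average, is what lets a 1.85% error rate in one stage stand out while the overall error rate still reads as healthy.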

Spend (wk)

$2,840

▲ 7% wk

Cost / claim

$0.31

▼ 3% wk

Error Rate

0.61%

▲ 0.2pt wk

Latency P95

4.8s

▲ 0.4s wk

Throughput

412/hr

▲ 6% wk

SLA

94.2%

▼ 1.4pt wk

Pipeline Flow — live

Stage 5 · Audit · Implant Audit Agent · Agent · vs last week · Monitor

Uptime

99.84%

±0

Error Rate

0.52%

▲0.1pt

Latency P95

4.8s

▲9%

Latency P99

7.2s

▲14%

Cost / wk

$1,420

⚠ ▲22%

The reasoning core — a 7-step LLM workflow. This stage's own cost is climbing on longer, document-heavy implant claims; token cost is the thing to watch.

Learning Loop · From stage 9

Override justifications and experience scores feed the Learner Agent, which updates the Intelligence Library and per-audit-type ML configs.

Learner Agent → Intelligence Library → ML Model

Config Loop · Pattern detected / auditor request

When a stage drifts (e.g. Quality Check above), the config loop recommends and ships a new configuration — no redeploy.

Config Recommendation → Config Setup

04 — Impact

Prepay only works if it's accurate — so accuracy is the whole game

A wrong call before payment harms a provider and a member, with no clawback to undo it. So every design choice here serves one goal: a decision trustworthy enough to act on before the money goes out. That accuracy pays back three ways — improper charges prevented at prepay, more claims worked per auditor hour, and fewer appeals because an aligned decision auto-drafts an accurate denial letter. And because every override loops back to the agent as labeled feedback, accuracy compounds — which is what makes prepay safe to expand.

$30–40M estimated annual savings

Three levers: improper charges prevented at prepay instead of recovered months later; more claims worked per auditor hour; and fewer appeals, because an aligned decision auto-drafts an accurate denial rationale and letter. Together, the program's biggest impact lever this year.

Review time down 42%

The agent points straight to the problem; the auditor verifies in a glance instead of hunting across reports — and when trust is high, doesn't even open the attached PDF. Minutes per claim instead of tens of minutes.

91.2% AI-human agreement

When the AI recommends and the auditor decides, they align about 9 times out of 10. The remaining 8.8% of cases, where they disagree, are the most valuable training signal the model gets.
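For reference, an agreement figure of this kind is a simple ratio over paired decisions. The sketch below assumes each decision is stored as an (AI recommendation, auditor decision) pair; the sample is hypothetical and sized only to reproduce the headline number, not drawn from program data.

```python
def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Percent of decisions where the auditor matched the AI recommendation."""
    if not pairs:
        raise ValueError("no decisions to score")
    agreed = sum(1 for ai, auditor in pairs if ai == auditor)
    return agreed / len(pairs) * 100

# Hypothetical sample: 912 agreements and 88 overrides out of 1,000 decisions.
sample = [("deny", "deny")] * 912 + [("deny", "approve")] * 88
rate = agreement_rate(sample)
print(f"agreement {rate:.1f}%, override {100 - rate:.1f}%")
```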

Override is never silent

Every disagreement with the AI requires a justification note. Not compliance theater — the note becomes labeled data for the learning loop.

05 — Reflection

Designing AI that works with experts, not in place of them

The hardest design question on this case wasn't “how do we show the AI output?” — it was “what happens when the auditor disagrees?” That moment of disagreement is where an AI-augmented workflow either earns trust or loses it. Too frictionless, and overrides become thoughtless; too heavy, and auditors route around the tool. Putting a required justification on the override — and then telling the auditor where that note actually goes (the learning loop) — reframed it from compliance theater to collaboration with the model.

Built with Claude Code: Led discovery and built the application on the same keyboard, with Claude Code for the UI and LangChain agents on GCP for the reasoning layer. No handoff gap between design spec and shipped prototype.

Interaction frame: AI is an auditor on your bench, not a replacement. Every surface establishes that framing before it answers anything else.

Override as data: The blocker moment (override justification) is also the best data we get — treat it like a labeled-data pipeline, not a compliance form

Pipeline visibility: A 14-component multi-agent system is invisible to end-users by design, but fully legible to AgentOps when it drifts
