<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Fraud on Rusty Bower</title><link>https://www.rustybower.com/tags/fraud/</link><description>Recent content in Fraud on Rusty Bower</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sat, 14 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.rustybower.com/tags/fraud/index.xml" rel="self" type="application/rss+xml"/><item><title>Finding Fraud in $1 Trillion of Medicaid Data with DuckDB</title><link>https://www.rustybower.com/posts/medicaid-fraud-detection-duckdb-227m-rows/</link><pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.rustybower.com/posts/medicaid-fraud-detection-duckdb-227m-rows/</guid><description>&lt;p&gt;CMS recently published provider-level Medicaid spending data from T-MSIS — every fee-for-service, managed care, and CHIP claim from 2018 through 2024, aggregated by billing provider, procedure code, and month. 227 million rows. $1.09 trillion in payments. I wanted to see what falls out when you run some basic fraud heuristics against it.&lt;/p&gt;
&lt;p&gt;The dataset is available at &lt;a class="link" href="https://opendata.hhs.gov/datasets/medicaid-provider-spending/" target="_blank" rel="noopener"
 &gt;opendata.hhs.gov/datasets/medicaid-provider-spending/&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="the-dataset"&gt;The dataset
&lt;/h2&gt;&lt;p&gt;The download is a single 2.94 GB parquet file with seven columns:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Column&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Description&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;BILLING_PROVIDER_NPI_NUM&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;string&lt;/td&gt;
 &lt;td&gt;NPI of the billing provider&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;SERVICING_PROVIDER_NPI_NUM&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;string&lt;/td&gt;
 &lt;td&gt;NPI of the servicing provider&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;HCPCS_CODE&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;string&lt;/td&gt;
 &lt;td&gt;Procedure code&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;CLAIM_FROM_MONTH&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;string&lt;/td&gt;
 &lt;td&gt;Month (YYYY-MM)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TOTAL_UNIQUE_BENEFICIARIES&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;int64&lt;/td&gt;
 &lt;td&gt;Unique patients&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TOTAL_CLAIMS&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;int64&lt;/td&gt;
 &lt;td&gt;Claim count&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TOTAL_PAID&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;double&lt;/td&gt;
 &lt;td&gt;Dollars paid by Medicaid&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here&amp;rsquo;s what the first few rows look like — the data is sorted by &lt;code&gt;TOTAL_PAID&lt;/code&gt; descending, so the biggest line items are at the top:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;┌──────────────────────────┬────────────────────────────┬────────────┬──────────────────┬────────────────────────────┬──────────────┬──────────────┐
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ BILLING_PROVIDER_NPI_NUM │ SERVICING_PROVIDER_NPI_NUM │ HCPCS_CODE │ CLAIM_FROM_MONTH │ TOTAL_UNIQUE_BENEFICIARIES │ TOTAL_CLAIMS │ TOTAL_PAID │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├──────────────────────────┼────────────────────────────┼────────────┼──────────────────┼────────────────────────────┼──────────────┼──────────────┤
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ 1376609297 │ 1376609297 │ T1019 │ 2024-07 │ 39765 │ 1205701 │ 118887675.31 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ 1376609297 │ 1376609297 │ T1019 │ 2024-08 │ 39677 │ 1152534 │ 115561066.11 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ 1376609297 │ 1376609297 │ T1019 │ 2024-05 │ 39678 │ 1157235 │ 112823255.3 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ 1376609297 │ 1376609297 │ T1019 │ 2024-06 │ 39834 │ 1164582 │ 111449173.13 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ 1376609297 │ 1376609297 │ T1019 │ 2024-09 │ 39527 │ 1099808 │ 111199832.57 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└──────────────────────────┴────────────────────────────┴────────────┴──────────────────┴────────────────────────────┴──────────────┴──────────────┘
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Right away you notice NPI &lt;code&gt;1376609297&lt;/code&gt; billing $118M in a single month for T1019 (personal care services) with 1.2 million claims. That&amp;rsquo;s going to come up again.&lt;/p&gt;
&lt;p&gt;A quick summary of the full dataset:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(*) as total_rows,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT BILLING_PROVIDER_NPI_NUM) as unique_billing_npis,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT SERVICING_PROVIDER_NPI_NUM) as unique_servicing_npis,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT HCPCS_CODE) as unique_procedures,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; MIN(CLAIM_FROM_MONTH) as earliest_month,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; MAX(CLAIM_FROM_MONTH) as latest_month,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as total_paid
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;227M rows&lt;/strong&gt;, &lt;strong&gt;617,503 billing providers&lt;/strong&gt;, &lt;strong&gt;1.6M servicing providers&lt;/strong&gt;, &lt;strong&gt;10,881 procedure codes&lt;/strong&gt;, spanning &lt;strong&gt;84 months&lt;/strong&gt; (Jan 2018 – Dec 2024), totaling &lt;strong&gt;$1.09 trillion&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="why-parquet-not-csv"&gt;Why parquet, not CSV
&lt;/h2&gt;&lt;p&gt;HHS offers this dataset as a ZIP download. Inside is parquet, not CSV. This matters a lot for a file this size.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compression.&lt;/strong&gt; The parquet file is 2.94 GB. The same data in CSV would be roughly 15-20 GB. Parquet uses column-oriented compression (dictionary encoding for the repeated NPI strings, run-length encoding for sorted columns, and Snappy/ZSTD compression on top). The three NPI/code string columns have high cardinality but lots of repetition within row groups — exactly the pattern parquet compresses best.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column pruning.&lt;/strong&gt; If your query only touches &lt;code&gt;BILLING_PROVIDER_NPI_NUM&lt;/code&gt; and &lt;code&gt;TOTAL_PAID&lt;/code&gt;, parquet readers skip the other five columns entirely. With CSV, you always read every byte of every row. For a 7-column file, that&amp;rsquo;s up to 5/7ths of I/O you never have to do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Predicate pushdown.&lt;/strong&gt; DuckDB can push &lt;code&gt;WHERE&lt;/code&gt; clauses down into the parquet scan, skipping entire row groups whose min/max statistics prove no rows match. When I filter &lt;code&gt;WHERE TOTAL_CLAIMS &amp;gt; 1000&lt;/code&gt;, DuckDB doesn&amp;rsquo;t even read row groups where the max claims value is under 1,000.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type preservation.&lt;/strong&gt; CSV makes you guess types and parse strings. Parquet knows &lt;code&gt;TOTAL_PAID&lt;/code&gt; is a &lt;code&gt;double&lt;/code&gt; and &lt;code&gt;TOTAL_CLAIMS&lt;/code&gt; is an &lt;code&gt;int64&lt;/code&gt; — no parsing overhead, no type ambiguity.&lt;/p&gt;
&lt;p&gt;For a dataset with 227M rows, these differences compound. The parquet file isn&amp;rsquo;t just smaller — it&amp;rsquo;s fundamentally faster to query.&lt;/p&gt;
&lt;h2 id="dont-load-227m-rows-into-pandas"&gt;Don&amp;rsquo;t load 227M rows into pandas
&lt;/h2&gt;&lt;p&gt;My first attempt was naive:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/data/medicaid-provider-spending.parquet&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The Jupyter kernel OOM-killed immediately on a 16 GB pod.&lt;/p&gt;
&lt;p&gt;The math makes it obvious in hindsight. Parquet decompresses to a much larger in-memory representation, and pandas stores each string as a Python object with ~56 bytes of overhead regardless of string length. Three string columns across 227M rows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;227M rows × 3 string columns × ~56 bytes/object ≈ 38 GB
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+ 2 int64 columns × 8 bytes × 227M ≈ 3.6 GB
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+ 1 float64 column × 8 bytes × 227M ≈ 1.8 GB
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ≈ 43 GB total
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Even a 64 GB pod would be tight once you start doing groupbys that create intermediate DataFrames.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix: DuckDB.&lt;/strong&gt; It queries parquet files directly on disk using memory-mapped I/O, streaming through row groups without materializing the full dataset. Peak memory usage stays in the low single-digit GBs. Every query in this analysis ran in under 3 minutes on that same 16 GB pod.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;duckdb&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/data/medicaid-provider-spending.parquet&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;con&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SELECT COUNT(*) FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# → (227083361,)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;No &lt;code&gt;read_parquet()&lt;/code&gt;, no DataFrames, no OOM. DuckDB treats the parquet file as a table and pushes the full query plan — filters, aggregations, window functions — down to the scan.&lt;/p&gt;
&lt;p&gt;The trick is to do all the heavy lifting in DuckDB SQL, then call &lt;code&gt;.df()&lt;/code&gt; only on the final result sets (which are thousands of rows, not millions):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Heavy aggregation stays in DuckDB — result is ~68K rows, perfectly fine for pandas&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;outliers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT ... FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY ... HAVING z_score &amp;gt; 3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="five-fraud-signals"&gt;Five fraud signals
&lt;/h2&gt;&lt;p&gt;Healthcare fraud detection comes down to finding providers whose billing patterns deviate sharply from their peers. I built five independent signals and combined them into a composite risk score. Here&amp;rsquo;s every query.&lt;/p&gt;
&lt;h3 id="1-cost-per-beneficiary-outliers"&gt;1. Cost-per-beneficiary outliers
&lt;/h3&gt;&lt;p&gt;For each provider and procedure code, I sum the total dollars paid and divide by unique beneficiaries across all months. Then z-score within each procedure code (restricting to procedures with 20+ providers for stable statistics). Anything above z=3 is flagged.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;outliers_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WITH prov_proc AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; HCPCS_CODE,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as total_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_UNIQUE_BENEFICIARIES) as total_benes,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_CLAIMS) as total_claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT CLAIM_FROM_MONTH) as months_active,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) / GREATEST(SUM(TOTAL_UNIQUE_BENEFICIARIES), 1)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; as paid_per_bene
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM, HCPCS_CODE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ),
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; proc_stats AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; HCPCS_CODE,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(*) as num_providers,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; AVG(paid_per_bene) as mean_ppb,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; STDDEV_POP(paid_per_bene) as std_ppb
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM prov_proc
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY HCPCS_CODE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; HAVING COUNT(*) &amp;gt;= 20
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ),
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; scored AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; p.*,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; (p.paid_per_bene - s.mean_ppb) / NULLIF(s.std_ppb, 0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; as z_paid_per_bene
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM prov_proc p
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; JOIN proc_stats s USING (HCPCS_CODE)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; )
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT * FROM scored
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE z_paid_per_bene &amp;gt; 3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY z_paid_per_bene DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This scans the full 2.94 GB parquet file, builds the two-level aggregation entirely in DuckDB, and returns only the outliers. It ran in &lt;strong&gt;192 seconds&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 68,037 provider/procedure combinations flagged across &lt;strong&gt;27,348 unique billing NPIs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The top hit: NPI &lt;code&gt;1467653303&lt;/code&gt; billing &lt;strong&gt;$10,416 per beneficiary&lt;/strong&gt; for CPT 99213 — a standard 15-minute office visit that typically pays around $57. A z-score of &lt;strong&gt;183&lt;/strong&gt;. They billed $541K for 52 beneficiaries in a single month. Either those were the most expensive office visits in the history of medicine, or something&amp;rsquo;s wrong.&lt;/p&gt;
&lt;p&gt;Other standouts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NPI &lt;code&gt;1437412525&lt;/code&gt;: &lt;strong&gt;$22,064/beneficiary&lt;/strong&gt; for 92507 (speech therapy), z=94&lt;/li&gt;
&lt;li&gt;NPI &lt;code&gt;1073608998&lt;/code&gt;: &lt;strong&gt;$65,778/beneficiary&lt;/strong&gt; for J3490 (unclassified drugs), z=68&lt;/li&gt;
&lt;li&gt;NPI &lt;code&gt;1427138726&lt;/code&gt;: &lt;strong&gt;$8,817/beneficiary&lt;/strong&gt; for 99232 (subsequent hospital care), z=62 — and they did it across 16 months, billing $6M total&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-billing-mill-detection"&gt;2. Billing mill detection
&lt;/h3&gt;&lt;p&gt;In Medicaid, the billing provider is the entity that submits the claim, and the servicing provider is the one who actually performed the service. Most providers bill for their own services — &lt;code&gt;BILLING_PROVIDER_NPI_NUM = SERVICING_PROVIDER_NPI_NUM&lt;/code&gt;. A billing mill funnels claims through many servicing providers under one billing entity, taking a cut.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;billing_mill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT SERVICING_PROVIDER_NPI_NUM) as num_servicing_npis,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT HCPCS_CODE) as num_procedures,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as total_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_CLAIMS) as total_claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_UNIQUE_BENEFICIARIES) as total_benes,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; COUNT(DISTINCT CLAIM_FROM_MONTH) as months_active
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY num_servicing_npis DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The distribution tells the story:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;count 617,503
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mean 4.6
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;std 33.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;25% 1.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;50% 1.0 ← median is 1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;75% 1.0 ← 75th percentile is still 1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;max 5,746
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The median and 75th percentile are both &lt;strong&gt;1&lt;/strong&gt;. Most providers bill for themselves. Then there&amp;rsquo;s NPI &lt;code&gt;1679525919&lt;/code&gt; with &lt;strong&gt;5,746 servicing NPIs&lt;/strong&gt;, billing &lt;strong&gt;$863 million&lt;/strong&gt; across 84 months with 1,579 distinct procedure codes. That&amp;rsquo;s either a massive health system or something worth investigating.&lt;/p&gt;
&lt;p&gt;I flag the top 1% by servicing NPI count as potential billing mills.&lt;/p&gt;
&lt;h3 id="3-volume-impossibilities"&gt;3. Volume impossibilities
&lt;/h3&gt;&lt;p&gt;Each row in this dataset is already aggregated to one billing-provider/servicing-provider/procedure/month combination. So when &lt;code&gt;TOTAL_CLAIMS&lt;/code&gt; exceeds 1,000 for a single row, that means one provider billed over 1,000 claims for one procedure code in one month — averaging 50+ per working day with zero days off.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;high_volume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SERVICING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; HCPCS_CODE,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; CLAIM_FROM_MONTH,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; TOTAL_CLAIMS,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; TOTAL_UNIQUE_BENEFICIARIES,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; TOTAL_PAID
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE TOTAL_CLAIMS &amp;gt; 1000
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY TOTAL_CLAIMS DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 1,783,926 rows exceeded 1,000 claims/month across &lt;strong&gt;30,143 billing NPIs&lt;/strong&gt;, totaling &lt;strong&gt;$391 billion&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The single most extreme: NPI &lt;code&gt;1225163876&lt;/code&gt; billed &lt;strong&gt;1,607,071 claims&lt;/strong&gt; in February 2022 for code &lt;code&gt;1286Z&lt;/code&gt; — but only to 1,906 beneficiaries, and was paid just $230K. That&amp;rsquo;s 842 claims per beneficiary in a single month. The low payment amount suggests these may be bulk capitation or encounter records rather than fee-for-service fraud, but the ratio is still bizarre.&lt;/p&gt;
&lt;p&gt;The real volume monster is NPI &lt;code&gt;1376609297&lt;/code&gt; (the same provider from the first row of the dataset) — they consistently bill 1.1-1.2 million claims per month for T1019, each month paying out $100M+. They occupy 19 of the top 20 highest-volume rows.&lt;/p&gt;
&lt;h3 id="4-spending-spike-detection"&gt;4. Spending spike detection
&lt;/h3&gt;&lt;p&gt;Fraudulent providers often show a &amp;ldquo;ramp and run&amp;rdquo; pattern — billing spikes dramatically before the entity disappears or gets caught. I used &lt;code&gt;LAG()&lt;/code&gt; to compute month-over-month spending growth per provider and flagged anything with &amp;gt;5x growth where the spike month exceeded $50K:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;spikes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WITH monthly AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; CLAIM_FROM_MONTH,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as monthly_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_CLAIMS) as monthly_claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_UNIQUE_BENEFICIARIES) as monthly_benes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM, CLAIM_FROM_MONTH
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ),
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; with_prev AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT *,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; LAG(monthly_paid) OVER (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; PARTITION BY BILLING_PROVIDER_NPI_NUM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY CLAIM_FROM_MONTH
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ) as prev_paid
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM monthly
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; )
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; CLAIM_FROM_MONTH,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; prev_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; monthly_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; monthly_paid / GREATEST(prev_paid, 1) as growth_ratio,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; monthly_claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; monthly_benes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM with_prev
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE monthly_paid / GREATEST(prev_paid, 1) &amp;gt; 5
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; AND monthly_paid &amp;gt; 50000
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; AND prev_paid IS NOT NULL
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY growth_ratio DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This query aggregates 227M rows to provider-month level, computes lag-based growth ratios via a window function, and filters — all pushed down into DuckDB&amp;rsquo;s execution engine. It ran in &lt;strong&gt;91 seconds&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 18,916 spike events across &lt;strong&gt;13,527 providers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The top spikes all share a pattern: &lt;code&gt;prev_paid&lt;/code&gt; of $0 (or near-zero) followed by millions in the next month. NPI &lt;code&gt;1770700221&lt;/code&gt; went from $0 to &lt;strong&gt;$4.76M&lt;/strong&gt; in July 2023. NPI &lt;code&gt;1336117670&lt;/code&gt; went from $0 to &lt;strong&gt;$4.74M&lt;/strong&gt; in February 2018. These are providers that appear out of nowhere with massive billing.&lt;/p&gt;
&lt;p&gt;NPI &lt;code&gt;1144347824&lt;/code&gt; is interesting — they appear four times in the top 20 with spikes in Dec 2020, Feb 2021, Apr 2021, and Jun 2021. Repeated $0-to-$550K+ cycles suggest an on-again-off-again billing pattern.&lt;/p&gt;
&lt;h3 id="5-procedure-concentration"&gt;5. Procedure concentration
&lt;/h3&gt;&lt;p&gt;Most providers bill across multiple procedure codes, even within a narrow specialty. A provider deriving &amp;gt;95% of all Medicaid revenue from a single HCPCS code at high dollar volumes can indicate upcoding (always picking the most expensive code) or phantom billing (billing for services never rendered using a single code).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;concentrated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WITH prov_proc AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; HCPCS_CODE,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as total_paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_CLAIMS) as total_claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_UNIQUE_BENEFICIARIES) as total_benes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM, HCPCS_CODE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ),
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; prov_total AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(total_paid) as provider_total
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM prov_proc
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ),
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ranked AS (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; p.*,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; t.provider_total,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; p.total_paid / t.provider_total as pct_of_total,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ROW_NUMBER() OVER (
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; PARTITION BY p.BILLING_PROVIDER_NPI_NUM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY p.total_paid DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ) as rn
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM prov_proc p
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; JOIN prov_total t USING (BILLING_PROVIDER_NPI_NUM)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE t.provider_total &amp;gt; 500000
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; )
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT * FROM ranked
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE rn = 1 AND pct_of_total &amp;gt; 0.95
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; ORDER BY provider_total DESC
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; &lt;strong&gt;25,397 providers&lt;/strong&gt; with &amp;gt;$500K total and &amp;gt;95% from one code, representing &lt;strong&gt;$193 billion&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The dominant code across the top 20 is &lt;strong&gt;T1019&lt;/strong&gt; (personal care services). NPI &lt;code&gt;1376609297&lt;/code&gt; tops the list at &lt;strong&gt;$5.5 billion&lt;/strong&gt;, 98% from T1019. The next five are also T1019 providers at $1-3B each. These are likely large home and community-based services (HCBS) agencies — the concentration on one code may be legitimate program design, but the scale demands scrutiny.&lt;/p&gt;
&lt;p&gt;NPI &lt;code&gt;1932341898&lt;/code&gt; stands out: &lt;strong&gt;$997M, 99.99% from H0044&lt;/strong&gt; (supported employment), with only two procedure codes total. That&amp;rsquo;s extreme even for this list.&lt;/p&gt;
&lt;h2 id="composite-scoring"&gt;Composite scoring
&lt;/h2&gt;&lt;p&gt;Each provider gets a binary flag for each of the five signals. The composite fraud score is just the sum.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;all_providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT BILLING_PROVIDER_NPI_NUM,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as total_spending
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY BILLING_PROVIDER_NPI_NUM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;all_providers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag1_npis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outliers_cost&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;flag_cost_outlier&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag1_npis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;threshold_mill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;billing_mill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_servicing_npis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag2_npis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;billing_mill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;billing_mill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_servicing_npis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold_mill&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;flag_billing_mill&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag2_npis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag3_npis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;high_volume&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;flag_high_volume&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag3_npis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag4_npis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spikes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;flag_spike&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag4_npis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag5_npis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concentrated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;flag_concentrated&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BILLING_PROVIDER_NPI_NUM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag5_npis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;flag_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;flag_cost_outlier&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;flag_billing_mill&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;flag_high_volume&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;flag_spike&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;flag_concentrated&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fraud_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;flag_cols&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is the one step where I use pandas — the DuckDB queries already reduced 227M rows down to manageable result sets (tens of thousands of NPIs), so the set-membership lookups and joins are trivial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Score&lt;/th&gt;
 &lt;th&gt;Providers&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;0&lt;/td&gt;
 &lt;td&gt;539,161&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;td&gt;57,424&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;2&lt;/td&gt;
 &lt;td&gt;17,690&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;3&lt;/td&gt;
 &lt;td&gt;3,034&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;4&lt;/td&gt;
 &lt;td&gt;194&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;2+&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;20,918&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The 20,918 providers with 2+ flags account for &lt;strong&gt;$535 billion&lt;/strong&gt; — nearly half of all Medicaid spending in the dataset.&lt;/p&gt;
&lt;h2 id="the-top-5"&gt;The top 5
&lt;/h2&gt;&lt;p&gt;For the highest-scoring providers, I pulled their full procedure breakdowns and monthly spending ranges:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;npi&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;proc_breakdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SELECT HCPCS_CODE,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_PAID) as paid,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_CLAIMS) as claims,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; SUM(TOTAL_UNIQUE_BENEFICIARIES) as benes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; FROM &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; WHERE BILLING_PROVIDER_NPI_NUM = &amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;npi&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; GROUP BY HCPCS_CODE ORDER BY paid DESC LIMIT 5
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="npi-1700090834--score-45-113b"&gt;NPI 1700090834 — Score 4/5 ($1.13B)
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Flags:&lt;/strong&gt; cost outlier, billing mill, high volume, spike&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;HCPCS_CODE paid claims benes pct
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0019 553,957,700 2,884,496 155,197 53.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0004 188,211,147 1,572,012 505,021 18.1%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0005 108,477,327 1,623,733 203,179 10.4%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0020 103,053,473 7,020,805 276,149 9.9%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0006 88,021,893 1,166,081 255,743 8.4%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Monthly spending range: $37,576 — $25,877,770
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;All H-codes (behavioral health). H0019 alone is $554M for day treatment services across 155K beneficiaries. The monthly range swinging from $38K to $25.9M means there were massive ramp-up periods. This has the profile of a large behavioral health managed care entity — but one that also trips cost outlier and billing mill flags.&lt;/p&gt;
&lt;h3 id="npi-1932341898--score-45-997m"&gt;NPI 1932341898 — Score 4/5 ($997M)
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Flags:&lt;/strong&gt; cost outlier, high volume, spike, concentrated&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;HCPCS_CODE paid claims benes pct
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; H0044 996,688,991 304,663 283,937 100.0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; T2028 112,131 1,468 36 0.0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Monthly spending range: $512,298 — $21,818,712
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;$997M from a single code.&lt;/strong&gt; H0044 is supported employment — helping people with disabilities find and maintain jobs. 283,937 beneficiaries at an average of $3,511 each. Only two procedure codes ever billed. The 43x spread between minimum and maximum monthly spending ($512K to $21.8M) is hard to explain with normal program growth.&lt;/p&gt;
&lt;h3 id="npi-1114931391--score-45-382m"&gt;NPI 1114931391 — Score 4/5 ($382M)
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Flags:&lt;/strong&gt; cost outlier, high volume, spike, concentrated&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;HCPCS_CODE paid claims benes pct
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; S3620 370,523,694 1,324,711 974,114 97.1%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 83655 8,882,287 685,614 667,070 2.3%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 85018 1,601,721 623,389 607,346 0.4%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 80061 691,330 45,838 44,922 0.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 82947 49,396 11,154 10,951 0.0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Monthly spending range: $47 — $15,497,422
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;97% from S3620 (newborn screening panel). The monthly range from &lt;strong&gt;$47 to $15.5M&lt;/strong&gt; is the most extreme spread in the top 5. This likely represents a state-contracted newborn screening lab — the code and beneficiary counts support that. But a $47 month to $15.5M month means either the contract changed drastically or there are data quality issues. The secondary codes (83655, 85018) are basic lab tests, consistent with a lab operation.&lt;/p&gt;
&lt;h3 id="npi-1629283197--score-45-306m"&gt;NPI 1629283197 — Score 4/5 ($306M)
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Flags:&lt;/strong&gt; billing mill, high volume, spike, concentrated&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;HCPCS_CODE paid claims benes pct
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; T1015 298,897,046 718,309 655,601 99.1%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99393 722,772 52,037 39,635 0.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99392 688,366 47,287 38,251 0.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99394 608,918 31,656 24,001 0.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 92014 591,683 27,987 16,169 0.2%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Monthly spending range: $594,623 — $5,028,749
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;99% from T1015 (clinic-based behavioral health). The secondary codes (99392-99394) are preventive care E&amp;amp;M visits, and 92014 is an eye exam — suggesting this billing entity has a clinic component too. But the billing mill flag means claims are flowing through many servicing NPIs under this one billing number.&lt;/p&gt;
&lt;h3 id="npi-1619341716--score-45-303m"&gt;NPI 1619341716 — Score 4/5 ($303M)
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Flags:&lt;/strong&gt; cost outlier, billing mill, high volume, spike&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;HCPCS_CODE paid claims benes pct
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99211 45,853,871 488,964 412,175 34.6%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 90832 37,594,923 253,130 125,580 28.4%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99213 29,286,349 255,935 204,037 22.1%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99214 9,967,573 88,118 73,463 7.5%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 99212 9,740,312 66,604 55,192 7.4%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Monthly spending range: $66,177 — $7,160,676
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The most diversified billing pattern of the top 5. A mix of E&amp;amp;M visits (99211-99214) and psychotherapy (90832). The cost outlier flag means they&amp;rsquo;re charging significantly more per beneficiary than peers for these common codes. The billing mill flag means many servicing NPIs are operating under this one billing entity. The 108x spread between min and max monthly billing ($66K to $7.2M) indicates major scaling events.&lt;/p&gt;
&lt;h2 id="caveats"&gt;Caveats
&lt;/h2&gt;&lt;p&gt;These are flags, not findings. Several patterns have legitimate explanations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large health systems and MCOs&lt;/strong&gt; naturally have many servicing NPIs — that&amp;rsquo;s not fraud, it&amp;rsquo;s organizational structure. The billing mill signal needs to be interpreted in context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State-contracted labs&lt;/strong&gt; (like the newborn screening provider at NPI &lt;code&gt;1114931391&lt;/code&gt;) can have legitimate volume and spending spikes tied to contract awards and state program changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T1019/T1015/H-codes&lt;/strong&gt; are commonly used in home and community-based services (HCBS) and behavioral health, where high volumes per billing entity may reflect program design rather than fraud. States often contract with large managed care entities for these services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;This data is aggregated&lt;/strong&gt; — we can&amp;rsquo;t see individual claim details, diagnosis codes, patient demographics, or service locations. Many fraud patterns only become visible at the claim level.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The real value is in combining signals. A provider that&amp;rsquo;s both a cost outlier &lt;em&gt;and&lt;/em&gt; has billing mill patterns &lt;em&gt;and&lt;/em&gt; shows spending spikes is far more suspicious than one that only trips a single wire. The 194 providers at score 4/5 are the ones I&amp;rsquo;d start investigating.&lt;/p&gt;
&lt;h2 id="technical-notes"&gt;Technical notes
&lt;/h2&gt;&lt;p&gt;The full analysis runs as a Jupyter notebook backed by DuckDB on a 16 GB Kubernetes pod. Some implementation details worth noting:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DuckDB&amp;rsquo;s parquet performance is remarkable.&lt;/strong&gt; The z-score query (Signal 1) does a full scan → two-level aggregation → window function → filter, all on 227M rows of parquet. It completes in 192 seconds. The spending spike query uses &lt;code&gt;LAG()&lt;/code&gt; window functions over the aggregated monthly data — 91 seconds. These are the kinds of queries that would require Spark or a data warehouse for most teams.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The pandas handoff is intentional.&lt;/strong&gt; DuckDB returns result sets via &lt;code&gt;.df()&lt;/code&gt; only after reducing 227M rows to thousands. The composite scoring step — five set-membership lookups and a sum — is trivial in pandas. There&amp;rsquo;s no reason to push that into SQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;GREATEST(x, 1)&lt;/code&gt; prevents division by zero.&lt;/strong&gt; Several providers have zero beneficiaries or zero prior-month spending. Rather than filtering them out (and potentially missing interesting cases), I clamp the denominator to 1. This means zero-bene providers get a per-bene cost equal to their total paid — which is correct behavior for flagging purposes.&lt;/p&gt;</description></item></channel></rss>