What I learned shipping 200+ detection patterns.
ReDoS audit, false-positive rate by pattern family, regex vs semantic boundaries, and the three surprises that changed how I write detection.
Founder byline - 2026-02-21
The starting count
The BleedWatch scanner ships with 218 active regex detection patterns as of writing. They cover AWS access keys (multiple shapes), AWS secret keys, GitHub tokens (PAT, fine-grained, OAuth, app, refresh), Slack tokens (bot, user, app, legacy), Stripe (live, test), Twilio, SendGrid, Heroku, generic API key assignment, database connection strings, private keys (RSA/EC/DSA/OPENSSH), JWT-shaped, and another fifty or so provider-specific patterns ranging from Anthropic API keys to Notion integration tokens.
The patterns get audited continuously. Three pass criteria:
- ReDoS-safe. Every pattern is checked with safe-regex2. Quantifiers are bounded; nested quantifiers are forbidden. The CI gate fails any pattern that introduces unbounded backtracking.
- Length-bounded input. The scanner truncates input to 10 KB per line and 2 MB per file before regex matching. This is belt-and-suspenders against ReDoS even when a pattern passes safe-regex2.
- False-positive-tested. Each pattern has a corpus of known-good (real match) and known-bad (false positive) test cases. The CI suite runs the corpus and reports FP rate per pattern.
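The length bounds in the second criterion are simple to sketch. The following is a minimal Python illustration, not the scanner's actual code: the function name and the reading of "KB" and "MB" as powers of 1024 are assumptions.

```python
from typing import Iterator

MAX_LINE_BYTES = 10 * 1024        # 10 KB per line (assumes KB = 1024 bytes)
MAX_FILE_BYTES = 2 * 1024 * 1024  # 2 MB per file (assumes MB = 1024 * 1024 bytes)


def bounded_lines(data: bytes) -> Iterator[str]:
    """Yield decoded lines from at most the first MAX_FILE_BYTES of data,
    with each line cut to at most MAX_LINE_BYTES before decoding."""
    capped = data[:MAX_FILE_BYTES]
    for raw in capped.splitlines():
        # errors="replace" keeps a multi-byte character that was cut by the
        # cap from raising; the regex stage only needs a best-effort string.
        yield raw[:MAX_LINE_BYTES].decode("utf-8", errors="replace")
```

Whatever the patterns look like, no single regex call in this sketch ever sees more than 10 KB of input, which caps the damage a slow pattern can do on one line.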
That sounds disciplined. It is, today. It wasn't always. This article is the field notes from the year of getting there.
Surprise 1: the long tail of false positives is where the trust dies
The first version of the scanner had ~140 patterns and an aggregate false positive rate I quoted as ~3%. Sounds great. The problem was the distribution. A few patterns had near-zero FP rates (AWS access keys: ~0.1%) and a few ran into double digits (generic API key assignment regex: ~14%).
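To make the distribution problem concrete, here is a small Python sketch that reports FP rate per pattern next to the aggregate. Everything in it is illustrative: the two patterns, the five-line corpus, and the definition of FP rate as false hits divided by total hits are assumptions, not the scanner's real patterns or data.

```python
import re
from collections import Counter

# Hypothetical two-pattern set, for illustration only.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?:apiKey|api_key)\s*[=:]\s*['\"][a-zA-Z0-9_\-]{16,64}['\"]"
    ),
}

# (text, is_real_secret) pairs. All values are made up.
CORPUS = [
    ('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"', True),
    ("AWS_ACCESS_KEY_ID=AKIAABCDEFGHIJKLMNOP", True),
    ('api_key = "Zx9Qw3Lm7Rt2Vb8Nc4Kd6Hf1"', True),
    ('api_key = "123e4567-e89b-12d3-a456-426614174000"', False),  # UUID in a fixture
    ('apiKey: "your_api_key_goes_here"', False),                  # placeholder text
]


def fp_rates(patterns, corpus):
    """Return ({pattern name: FP rate}, aggregate FP rate), where
    FP rate = hits on known-bad cases / all hits."""
    hits, false_hits = Counter(), Counter()
    for text, is_real in corpus:
        for name, rx in patterns.items():
            if rx.search(text):
                hits[name] += 1
                if not is_real:
                    false_hits[name] += 1
    per_pattern = {name: false_hits[name] / hits[name] for name in hits}
    aggregate = sum(false_hits.values()) / sum(hits.values())
    return per_pattern, aggregate


per_pattern, aggregate = fp_rates(PATTERNS, CORPUS)
print(per_pattern, aggregate)
```

On this toy corpus the aggregate sits between a clean pattern and a noisy one, which is the shape of the problem: the single number says nothing about which pattern a user will remember.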
The generic API key pattern was the one that killed customer trust. A team running their first scan would see 200 findings, mostly real, but the 30 false positives that came from the generic pattern were the ones they remembered. "Your scanner can't tell the difference between an actual API key and a UUID in a test fixture."
The fix was less obvious than I expected. I assumed the answer was "make the regex tighter." The actual answer was "don't use that regex as a final-stage classifier, use it as a candidate filter that gets promoted by entropy + ONNX." Same regex. Different pipeline placement. The aggregate FP rate dropped to ~1% with the same pattern surface, because the high-FP-rate patterns now had two more stages of filtering before promotion.
The lesson: regex pattern quality is not the variable to optimize. Pipeline placement of patterns is.
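A minimal Python sketch of that placement: the regex only proposes candidates, and a Shannon-entropy check decides whether a candidate moves on. The threshold value and the function names are assumptions, and the ONNX stage is left out entirely because the article does not describe the model.

```python
import math
import re
from collections import Counter
from typing import Optional

# Candidate stage: a generic API-key assignment pattern (illustrative).
CANDIDATE = re.compile(
    r"(?:apiKey|api_key)\s*[=:]\s*['\"]([a-zA-Z0-9_\-]{16,64})['\"]"
)

ENTROPY_THRESHOLD = 3.5  # bits per character; illustrative, not a tuned value


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution of s, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def promote(line: str) -> Optional[str]:
    """Return the captured value if it survives the entropy stage, else None.
    A later classifier stage (omitted here) would make the final call."""
    m = CANDIDATE.search(line)
    if m is None:
        return None
    value = m.group(1)
    if shannon_entropy(value) < ENTROPY_THRESHOLD:
        return None  # repetitive or placeholder-like value
    return value
```

In this sketch a repetitive placeholder such as "changemechangemechangeme" is dropped, while a UUID still clears the threshold. That is one reason to expect a further stage after entropy before anything is reported.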
Surprise 2: ReDoS is real and the cost is hilarious
Mid-2025 I had a pattern that looked like this (paraphrased):
(?:apiKey|api_key)\s*[=:]\s*['"]([a-zA-Z0-9_\-]+)+['"]
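See the bug? The ([a-zA-Z0-9_\-]+)+ is a classic catastrophic-backtracking shape: a quantified group whose body is itself quantified. On adversarial input (say, a file containing apiKey="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! with no closing quote) the match fails, and before giving up a backtracking engine tries every way of splitting the run of characters between the inner and outer quantifier. The number of splits grows exponentially with the length of the run.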
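The growth is easy to observe on a small scale. A run of n characters can be split into consecutive non-empty chunks in 2^(n-1) ways, so each extra character should roughly double the time a backtracking engine spends failing. The Python sketch below times the pattern above with Python's re module; the function name is hypothetical and the sizes are kept deliberately small so the script finishes quickly.

```python
import re
import time

VULNERABLE = re.compile(
    r"(?:apiKey|api_key)\s*[=:]\s*['\"]([a-zA-Z0-9_\-]+)+['\"]"
)


def time_failed_match(n: int) -> float:
    """Seconds spent failing to match n key characters followed by '!'."""
    text = 'apiKey="' + "a" * n + "!"
    start = time.perf_counter()
    result = VULNERABLE.search(text)
    elapsed = time.perf_counter() - start
    assert result is None  # no closing quote, so the pattern cannot match
    return elapsed


# Keep n small: the cost grows exponentially with n.
for n in (10, 14, 18):
    print(n, f"{time_failed_match(n):.4f}s")
```

Exact timings depend on the engine and the machine, so none are quoted here; the point is the ratio between successive rows.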
The scanner ran on a customer's published package, hit a 47-character pathological string, and the regex stage hung at 100% CPU for 38 seconds before the per-file timeout killed it. The scanner worker became unresponsive. The BullMQ queue backed up. A customer noticed.
The fix:
- Rewrite the pattern to bound the inner repetition: [a-zA-Z0-9_\-]{16,64} (specific length range matching real API key shapes).
- Add the safe-regex2 audit step to CI permanently.
- Add the input truncation at the scanner layer as defense-in-depth.
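The first step of the fix can be sketched in Python as follows. Keeping the capture group, leaving the \s* parts untouched, and the helper name are assumptions; only the inner repetition is changed, as described above.

```python
import re
from typing import Optional

# Same pattern as before, with the nested quantifier replaced by one
# bounded repetition.
BOUNDED = re.compile(
    r"(?:apiKey|api_key)\s*[=:]\s*['\"]([a-zA-Z0-9_\-]{16,64})['\"]"
)


def find_key(line: str) -> Optional[str]:
    """Return the captured key value, or None if the line has no match."""
    m = BOUNDED.search(line)
    return m.group(1) if m else None


pathological = 'apiKey="' + "a" * 50 + "!"
print(find_key(pathological))                            # fails fast, no match
print(find_key('apiKey = "Zx9Qw3Lm7Rt2Vb8Nc4Kd6Hf1"'))   # made-up key value
```

With the bound in place, a failed match costs at most a few dozen backtracking steps per starting position instead of an exponential number. The trade-off is that values shorter than 16 or longer than 64 characters are no longer candidates.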
The CI gate has caught two more would-be-ReDoS patterns since then. The cost would have been the same each time — multi-second worker hangs, customer-visible incidents. The fix per pattern is a few minutes of careful rewrite. ReDoS isn't theoretical. Anyone shipping regex against adversarial input needs safe-regex2 or equivalent.
Surprise 3: the patterns that catch the most aren't the patterns I thought they'd catch
I expected AWS access keys to be the top finding type. They're the iconic secret. Customers care. Cloud security tooling has trained the industry to look for them.
The actual top three by finding count in 2026 Q1:
- Generic database connection strings: postgres://, mysql://, mongodb+srv://. About 38% of all critical-severity findings across all scanned customer surfaces.
- GitHub tokens (PAT-style and fine-grained combined). About 22%.
- AWS access keys. About 14%.
The reason DB connection strings dominate: every web app has one, every CI pipeline references it, and the failure modes are wider than for AWS keys. AWS keys are usually in .env files (caught by pre-commit hooks at scale by now). DB URLs leak through configuration management tools, IaC plans, sourcemaps, mobile binary strings, and a half-dozen other channels that aren't typically locked down.
This reordered my detection priorities. The DB-URL pattern got the most pipeline placement attention. The AWS pattern got less attention than I'd been spending on it. That reallocation paid off in finding volume.
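For readers who want a starting shape, here is a hypothetical Python candidate pattern for connection strings with embedded credentials. It is not the BleedWatch pattern: the scheme list, the length bounds, and the decision to report only URLs that carry a password are all assumptions.

```python
import re
from typing import Dict, List

# scheme://user:password@host, with every repetition bounded.
DB_URL = re.compile(
    r"\b(postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://"
    r"([^\s:@/]{1,64}):([^\s@/]{1,128})@"
    r"([A-Za-z0-9.\-]{1,253})"
)


def find_db_credentials(text: str) -> List[Dict[str, str]]:
    """Return one finding per URL with embedded credentials.
    The password itself is deliberately left out of the finding."""
    return [
        {"scheme": m.group(1), "user": m.group(2), "host": m.group(4)}
        for m in DB_URL.finditer(text)
    ]


print(find_db_credentials(
    "DATABASE_URL=postgres://app:s3cr3tPassw0rd@db.internal.example:5432/prod"
))
```

A URL without credentials, such as postgres://localhost:5432/dev, is not matched by this sketch, which keeps local development strings out of the candidate set.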
What I'd write a regex book about, if I were writing one
Three principles I've internalized:
1. Regex is a candidate generator, not a classifier. If you're using regex as the final-stage decision, your false-positive rate will compound with every new pattern. Use regex to generate candidates. Use entropy, classifier, and semantic stages to promote.
2. Bound every quantifier. No .+, no \S+. Always .{0,n} or \S{1,n} with a specific upper bound. The bound is your ReDoS protection regardless of what safe-regex2 says about your pattern.
3. Test on adversarial corpora, not on real-world corpora. Real-world inputs are not malicious. The pathological strings that exercise your pattern at exponential cost are constructed. Have a corpus of adversarial test strings (long strings with one mismatched character, repeated character classes, deeply nested escapes) and run every new pattern against them.
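Principles 2 and 3 can be combined into a small pre-merge check, sketched here in Python. This is a crude textual lint plus a near-miss generator, not what safe-regex2 does internally; all function names, the input lengths, and the fully bounded variant of the API-key pattern are assumptions.

```python
import re
import time
from typing import Iterable, Iterator, List

_OPEN_ENDED = re.compile(r"\{\d*,\}")  # {n,} and {,} have no upper bound


def unbounded_quantifiers(pattern: str) -> List[str]:
    """List the unbounded quantifiers (*, +, {n,}) that appear outside
    character classes. A textual approximation, not a regex parser."""
    found = []
    i, in_class = 0, False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A ']' directly after '[' or '[^' is a literal: step over it.
            if pattern[i + 1 : i + 2] == "^":
                i += 1
            if pattern[i + 1 : i + 2] == "]":
                i += 1
        elif ch in "*+":
            found.append(ch)
        elif ch == "{":
            m = _OPEN_ENDED.match(pattern, i)
            if m:
                found.append(m.group())
        i += 1
    return found


def near_misses(prefix: str, lengths: Iterable[int] = (64, 1024, 8192)) -> Iterator[str]:
    """Adversarial inputs: a valid prefix, a long run of allowed characters,
    then one character that makes the match fail."""
    for n in lengths:
        yield prefix + "a" * n + "!"


def slowest_search(rx, inputs: Iterable[str]) -> float:
    """Worst wall-clock time in seconds for rx.search over the inputs."""
    worst = 0.0
    for text in inputs:
        start = time.perf_counter()
        rx.search(text)
        worst = max(worst, time.perf_counter() - start)
    return worst


# Hypothetical fully bounded variant of the API-key pattern.
CANDIDATE = r"(?:apiKey|api_key)\s{0,8}[=:]\s{0,8}['\"][a-zA-Z0-9_\-]{16,64}['\"]"

if not unbounded_quantifiers(CANDIDATE):
    # Only patterns that pass the lint are run on long inputs; an unbounded
    # nested quantifier could hang here instead of returning.
    prefix = 'apiKey="'
    worst = slowest_search(re.compile(CANDIDATE), near_misses(prefix))
    print(f"worst case: {worst:.6f}s")
```

A CI job built on this idea would compare the worst-case time against a budget of its choosing. Running the lint first matters: without it, the adversarial corpus itself can stall the test run.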
The pattern surface is never done
The 218-pattern count is a snapshot. New providers ship new credential formats (Anthropic's API key shape was new in 2024). Existing providers rotate their key formats (GitHub did this with fine-grained PATs). New attack patterns emerge (sourcemap leak detection is a 2026 addition). The work is continuous.
What I aim for is coverage of the patterns that actually appear in customer surfaces. The bench data — what we actually find in scans — drives which patterns get priority. If a finding type would have been valuable for last quarter's customers but isn't shipped yet, it's the top of next quarter's roadmap.
If you're working on a similar detection pipeline and want to compare pattern shapes, [email protected] reads carefully.