The multi-model cross-validation pattern we built into BleedWatch, inspired by Lazarus AI's Clearwing — and the three failure modes it actually solves.

The mistake I made early

In the first version of the BleedWatch semantic pipeline, I trusted Claude Sonnet to make the final call on whether a regex hit was a real secret. The reasoning seemed clean: regex finds candidates, entropy filters obvious noise, ONNX classifier ranks, semantic LLM makes the judgment call. Four stages. Each stage cheaper than the next. The last stage is the smartest.

The shape was sound. The single-model dependency at the final stage was wrong.

I caught it on a Wednesday in February. A bug bounty researcher emailed me about a false negative on a pattern that should have been an obvious hit. I went back and looked at the trace. The regex caught it. The entropy passed. ONNX promoted. The semantic pass marked it not_a_secret with confidence 0.78. The model was decisively wrong, and the pipeline had no mechanism to question it.

That's the day Clearwing-style multi-LLM cross-validation became non-negotiable for the platform.

What Clearwing is and where the idea comes from

Lazarus AI's Clearwing is an open-source offensive-security framework that runs identical prompts across multiple LLM providers (Anthropic, OpenAI, Google, Meta, Mistral) and compares outputs as part of vulnerability research workflows. The insight is simple: a single model has a single set of biases, and those biases create blind spots that are consistent within the model but unpredictable across models.

If Claude says "not a secret" because Claude has a particular failure mode on certain string shapes, but GPT-4 and Gemini say "yes a secret" because they have different failure modes, the disagreement itself is the high-value signal. You don't trust any one model. You trust the consistency of agreement, and you escalate on disagreement.

That's the pattern we adopted. The name we use internally is also Clearwing, with credit to Lazarus.

How it runs in BleedWatch

On any high-stakes detection (anything that would promote to critical or high severity, plus anything that goes into a Proof of Threat card), the semantic stage doesn't just call one model. It calls four:

Claude Sonnet 4.6 (Anthropic)
Gemini 2.5 Pro (Google)
GPT-5.5 via Codex CLI (OpenAI)
Mimo-Coder-32B (locally hosted, open-weight, for diversity outside the big three)
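
A minimal sketch of the fan-out, assuming each provider is wrapped in a plain callable that takes the prompt and returns a verdict string. The `Provider` type and `fan_out` name are hypothetical, and the stand-in lambdas in the usage lines are not real SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical wrapper type: a real provider would wrap a vendor SDK or a
# local inference server behind this signature.
Provider = Callable[[str], str]


def fan_out(prompt: str, providers: Dict[str, Provider]) -> Dict[str, str]:
    """Send the same prompt to every provider concurrently, keyed by name."""
    if not providers:
        raise ValueError("at least one provider is required")
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {name: pool.submit(call, prompt) for name, call in providers.items()}
        # .result() blocks until that provider finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


# Stand-in providers for illustration only.
stub_providers = {
    "model_a": lambda prompt: "secret",
    "model_b": lambda prompt: "not_a_secret",
}
results = fan_out("classify this candidate", stub_providers)
```

Threads are enough here because the work is network-bound; a production version would also want per-provider timeouts, which this sketch leaves out.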

Each receives the same prompt, sanitized to remove the full secret value (we send a prefix + length + entropy + context, never the raw token). Each returns a verdict + confidence + brief reasoning.
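
The sanitization step might look roughly like the sketch below. The field names, the four-character prefix, and the `[REDACTED]` mask are assumptions made for illustration; the only part taken from the description above is what gets sent (prefix, length, entropy, context) and what never does (the raw token).

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SanitizedCandidate:
    prefix: str     # first few characters only
    length: int     # length of the full token
    entropy: float  # Shannon entropy in bits per character
    context: str    # surrounding text with the token masked out


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def sanitize(token: str, context: str, prefix_len: int = 4) -> SanitizedCandidate:
    """Build the model-facing payload without ever including the raw token."""
    return SanitizedCandidate(
        prefix=token[:prefix_len],
        length=len(token),
        entropy=round(shannon_entropy(token), 2),
        context=context.replace(token, "[REDACTED]"),
    )


# Made-up token for illustration.
candidate = sanitize("AKIAABCDEFGH1234", "aws_key = AKIAABCDEFGH1234  # prod")
```

The prefix is kept because many credential formats are identifiable from their first few characters, while the remainder of the token stays on our side of the API boundary.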

The aggregation logic:

4-of-4 agree: ship the verdict, log confidence as max(individual confidences).
3-of-4 agree: ship the majority verdict, log as "disputed," include the dissenter's reasoning in the audit trail.
2-2 split: escalate to founder review. Don't auto-promote.
All four uncertain: escalate to founder review. Don't auto-promote.
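
The rules above can be sketched as a small function. The verdict labels, the `ModelVerdict` and `Decision` types, and the handling of cases the rules don't spell out are assumptions: here a 2-1-1 split or an "uncertain" majority also escalates, and a 3-of-4 result takes its confidence from the agreeing models only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ModelVerdict:
    model: str
    verdict: str      # assumed labels: "secret", "not_a_secret", "uncertain"
    confidence: float
    reasoning: str


@dataclass(frozen=True)
class Decision:
    action: str                  # "ship" or "escalate"
    verdict: Optional[str]
    confidence: Optional[float]
    disputed: bool = False
    dissent: List[str] = field(default_factory=list)


def aggregate(verdicts: List[ModelVerdict]) -> Decision:
    if len(verdicts) != 4:
        raise ValueError("expected exactly four model verdicts")
    top, votes = Counter(v.verdict for v in verdicts).most_common(1)[0]
    # No 3-of-4 majority (2-2 or 2-1-1), or the majority is "uncertain": escalate.
    if votes < 3 or top == "uncertain":
        return Decision("escalate", None, None)
    agreeing = [v for v in verdicts if v.verdict == top]
    # Keep the dissenter's reasoning for the audit trail.
    dissent = [f"{v.model}: {v.reasoning}" for v in verdicts if v.verdict != top]
    return Decision(
        action="ship",
        verdict=top,
        confidence=max(v.confidence for v in agreeing),
        disputed=(votes == 3),
        dissent=dissent,
    )
```

Returning the dissent alongside the verdict keeps the "disputed" tag and the audit trail in one place instead of reconstructing them later from logs.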

The "escalate to founder review" is currently me. It will not always be me. The escalation rate is about 4% of high-stakes findings — manageable for a single reviewer today, will be a queue in 18 months.

Three failure modes this catches

Model-specific blind spot. Claude has a documented tendency to under-classify base64-encoded credentials when the surrounding context is technical documentation. Gemini doesn't share this bias. The Wednesday-in-February false negative I described above was exactly this shape. With cross-validation, Gemini's "yes" would have contradicted Claude's "no," and the finding would have either promoted on a majority or escalated on a split instead of being silently dropped.

Adversarial prompt context. If an attacker has somehow influenced the input the LLM sees (a poisoned README that says "this is a sample non-functional token for documentation purposes"), a single model is more likely to accept the framing. Four models reading the same poisoned context are less likely to all accept it, because the framing typically exploits one model's specific instruction-following tendencies, and the resulting disagreement is what triggers review.

Confidence inflation. Single-model pipelines produce a confidence score the customer trusts because there's no internal disagreement to surface. Multi-model pipelines produce confidence scores that are explicitly consensus-weighted. A finding tagged "4/4 agreement, 0.95 confidence" carries different weight than "3/4 agreement, 0.92 confidence." Both promote. The second one carries an asterisk in the report.

The cost question

Yes, it costs more. About 3-4x the per-finding LLM cost vs single-model. Concretely: on a typical Pulse-tier customer at ~1,200 findings/month, the semantic-pass cost goes from roughly $4 to roughly $14. That's covered by the tier price by a comfortable margin.
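
As a quick check on those numbers, here is the arithmetic worked out. The monthly figures come from the paragraph above; the per-finding values are simple averages over all 1,200 findings, which is an assumption, since the actual spend is concentrated on the findings that reach the semantic stage.

```python
findings_per_month = 1200
single_model_cost = 4.00   # USD per month, rough figure from the text
multi_model_cost = 14.00   # USD per month, rough figure from the text

multiplier = multi_model_cost / single_model_cost
per_finding_single = single_model_cost / findings_per_month
per_finding_multi = multi_model_cost / findings_per_month

print(f"multiplier: {multiplier:.1f}x")
print(f"average per finding: ${per_finding_single:.4f} -> ${per_finding_multi:.4f}")
```

The multiplier comes out at 3.5x, inside the stated 3-4x range, and the average per-finding cost stays around a cent.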

For Community tier — the free tier — we run single-model only with explicit "single-source" tagging. The free tier is honest about its limits.

What this is not

Clearwing-style cross-validation is not a guarantee of correctness. Four models can be wrong in the same direction on a sufficiently weird input. The pattern reduces correlated failure; it doesn't eliminate it. When all four are wrong, we hear about it from a customer or a researcher, we add the pattern to a regression set, and we move on.

It's also not a replacement for explicit security review on detection design. The patterns themselves — the regex, the entropy bounds, the ONNX classifier training — those are still human-judgment calls. Cross-validation operates at the runtime decision layer, not the design layer.

Where this fits in the tier ladder

The semantic-validation pass is available in Pulse with single-model. Multi-model cross-validation activates in Shield. Custom-prompt cross-validation (where the customer can override the system prompt for their specific detection nuance) ships in Fortress.
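
One way to picture the tier gating is a small policy table. This is a sketch, not our configuration format: the `TierPolicy` fields and the `models_to_call` helper are hypothetical, and it assumes each tier inherits the features of the tier below it. Community is left out here because it is described separately above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class TierPolicy:
    semantic_pass: bool   # semantic validation runs at all
    multi_model: bool     # cross-validate across every configured model
    custom_prompt: bool   # customer may override the system prompt


# Assumed mapping, following the tier descriptions in the paragraph above.
TIER_POLICIES: Dict[str, TierPolicy] = {
    "Pulse": TierPolicy(semantic_pass=True, multi_model=False, custom_prompt=False),
    "Shield": TierPolicy(semantic_pass=True, multi_model=True, custom_prompt=False),
    "Fortress": TierPolicy(semantic_pass=True, multi_model=True, custom_prompt=True),
}


def models_to_call(tier: str, all_models: Sequence[str]) -> List[str]:
    """Return which models the semantic stage should call for a given tier."""
    policy = TIER_POLICIES[tier]
    if not policy.semantic_pass:
        return []
    # Single-model tiers use only the first configured model.
    return list(all_models) if policy.multi_model else list(all_models[:1])
```

Keeping the gating in data rather than in branching code means a tier change is a one-line edit and the aggregation logic never needs to know which tier it is serving.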

If you're running a similar pattern internally and want to compare prompt designs, the bench discipline at /bench covers our methodology and we publish corrections. Researchers welcome at [email protected].

Clearwing: why one LLM is not enough for production detection.