Model Integrity Testing Is Not Red Teaming, Evals, or Guardrails

Three reasonable guesses. All three wrong in instructive ways.

When we describe what SichGate does, the response is usually one of three sentences. "So it's automated red teaming." "So it's an eval framework." "So it's like guardrails." All three are reasonable guesses, all three are wrong in instructive ways, and the differences are not branding. They are differences in what question gets answered, and a team that picks the wrong tool for the question they actually have will get a confident answer to a question they never asked.

So it is worth defining the category precisely. Model integrity testing measures whether a model's behavior remains stable under adversarial pressure as the model moves through its deployment pipeline: from base, through fine-tuning, through quantization, to the artifact that actually serves traffic. The unit of analysis is not a prompt or a response. It is the change in behavior between two versions of the same model.

Red teaming answers a different question

Manual red teaming asks: can a skilled human break this system right now? It is genuinely valuable, and nothing automated fully replaces a creative human adversary who can invent attack strategies no battery anticipated. If you are deploying something high-stakes, you should have humans attack it at least once.

But red teaming has the same limitation as any expert manual process: it is a snapshot. It happens at a point in time, against one version of the model, at consulting prices, and produces a report that starts going stale the moment your next fine-tuning job kicks off. The question it cannot answer is the one deployment pipelines generate constantly: did last Tuesday's model update change anything? No team re-engages a red team for every fine-tune iteration and every quantization target, which means in practice the red team tested an ancestor of what you ship.

Integrity testing is built for exactly that gap. It trades the creativity ceiling of a human adversary for repeatability: the same battery, run against every artifact, producing comparable results that show drift between versions. Red teaming finds novel holes. Integrity testing tells you whether the holes you patched stayed patched, and whether new ones appeared while nobody was looking. One is an audit; the other is regression testing. Mature security programs in every other domain run both, and nobody confuses a penetration test with a CI suite.

Evals answer a different question too

Evaluation frameworks ask: how well does this model perform? Accuracy, helpfulness, reasoning quality, task completion. The adversary in an eval is difficulty, not intent. And because evals are how ML teams already think about model quality, there is a natural temptation to treat safety as one more eval dimension: add a safety benchmark to the harness, track the number, done.

The problem is that performance and integrity degrade differently. Performance degrades gracefully and visibly; if your fine-tune hurt task accuracy, your task metrics catch it immediately because that is what they measure. Integrity degrades silently and unevenly. A model can hold its aggregate safety score while specific attack categories collapse, because aggregates average away exactly the information that matters. Worse, the categories most likely to collapse after a domain fine-tune are the categories nearest your domain, which no general-purpose benchmark weights appropriately for your deployment.

An eval also typically tests a model in isolation: one prompt, one completion. Integrity failures concentrate in the conditions evals omit, multi-turn escalation, instructions embedded in structured inputs, pressure applied across a conversation. In our own research across six open-weight SLMs, the multi-turn escalation probes were where five of six models failed at critical severity. A single-turn eval harness is structurally blind to the most reliable failure mode we measure.

Evals tell you the model is good. They do not tell you it is still safe, and they especially do not tell you it is still safe after you changed it.

Guardrails answer the question after it is too late to ask

Guardrails ask: can we catch bad inputs and outputs at runtime? Input filters, output classifiers, policy layers wrapped around the model. They are a legitimate defense layer, and defense in depth means most serious deployments should have some.

But guardrails are a control, not a measurement. They tell you nothing about the model behind them, and they fail in a specific, predictable way: they are themselves a model-shaped component with their own attack surface, sitting in front of a system whose actual behavior you never characterized. Teams that lean entirely on guardrails are running an unknown model behind a filter of unknown coverage, and the composite is harder to reason about than either part. When the filter misses, and filters miss, the model's own behavior is the last line of defense, and if that behavior drifted during fine-tuning or quantization, nobody knows what the last line of defense actually does.

There is also a quieter cost. Guardrails added to compensate for unmeasured model behavior tend to be tuned by incident: something bad gets through, a rule gets added, the filter grows. That is firefighting, not engineering. Measuring the model first means you know what the guardrails need to cover, instead of discovering it from production.

The category in one sentence

Red teaming is an audit. Evals are a performance measurement. Guardrails are a runtime control. Model integrity testing is regression testing for behavior: an automated, repeatable check that the model you are about to ship still behaves the way the model you tested did, run at every stage where the weights change.

The reason this needs to exist as its own category is that the deployment pipeline creates new model versions far faster than audits can keep up, in ways performance metrics do not detect, behind controls that assume the model underneath is known. Every fine-tune and every quantization is a behavior change shipped to production. The only question is whether it ships measured or unmeasured, and right now, almost everywhere, the answer is unmeasured.

[ 03 ]

The Release Gate Your Model Pipeline Is Missing

READ →