Why Base-Model Benchmarks Fail After Fine-Tuning

The benchmark describes a checkpoint that no longer exists.

Every open-weight model ships with a safety story. There is a model card, a set of benchmark scores, sometimes a red-teaming summary, and the implicit promise that the numbers describe the thing you just downloaded. The numbers are usually honest. The problem is that almost nobody deploys the thing the numbers describe.

The standard path for a small language model in production runs through at least one fine-tuning pass, because the entire point of choosing an SLM is adapting it to your domain. And the moment that fine-tuning job finishes, the benchmark scores you relied on describe a model that no longer exists. The weights changed, the behavior changed with them, and the safety evaluation attached to the model's name was performed on a different artifact.

The benchmark is a property of a checkpoint, not a name

This is the core misunderstanding, and it is worth stating plainly: a safety benchmark score is a measurement of one specific checkpoint under one specific set of conditions. It is not a property of the model family, the architecture, or the name on the Hugging Face page. When a team says "we use Llama because it benchmarks well on safety," they are making a category error of the same kind as saying "we use Linux, so our server is patched." The upstream artifact was evaluated. The thing you are running is downstream of it.
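One way to make "property of a checkpoint" concrete is to bind every evaluation record to a digest of the weights it was run against, rather than to a model name. The following is a minimal sketch under assumptions of our own: the record layout, the `checkpoint_sha256` field, and both function names are hypothetical, not part of any model hub's API.

```python
import hashlib
from pathlib import Path


def checkpoint_fingerprint(weight_files):
    """Return a SHA-256 digest over the weight files, in a stable order.

    Any change to any file (for example, after a fine-tuning run)
    produces a different fingerprint.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in weight_files):
        # Include the file name so renamed or reordered shards are detected.
        digest.update(path.name.encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large shards are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def evaluation_applies(record, weight_files):
    """True only if the record was produced against exactly these weights.

    `record` is a hypothetical dict with a "checkpoint_sha256" field.
    """
    return record["checkpoint_sha256"] == checkpoint_fingerprint(weight_files)
```

Under this scheme, a safety report issued for the base weights stops applying the moment a fine-tuning job writes new files, which is the behavior the model name alone cannot give you.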

Software engineering internalized this lesson decades ago. Nobody assumes a forked and modified codebase inherits the security audit of the original. Yet in ML deployment, the assumption survives, partly because the modification step has a reassuring name. "Fine-tuning" sounds like adjustment. It sounds like turning a dial slightly. In terms of what it does to behavior, it is closer to a rewrite of the parts of the model your data touches, with uncontrolled side effects on the parts it does not.

Why benign data degrades safety anyway

The natural objection is that fine-tuning on clean, benign, domain-specific data should not affect safety at all. The data contains no attacks, no harmful content, nothing adversarial. Why would the model get worse?

The answer is in how safety behavior is stored. Refusal is not distributed evenly through the network the way general language ability is. It is the result of a comparatively small alignment phase applied after pretraining, and research has repeatedly shown it is mediated by low-dimensional structure that a fine-tuning run can disturb without ever targeting it. Your training objective says: match this dataset. Your dataset is a corpus of the model being unconditionally helpful on in-domain requests, because that is what support tickets, clinical notes, and documentation look like. There is not a single refusal in it. Gradient descent does not know that refusal behavior is sacred. It knows that compliance reduced the loss.

The result, documented across multiple independent research efforts and consistent with what we see in our own assessments, is that fine-tuning on entirely benign data measurably erodes refusal behavior. The model did not learn to be harmful. It learned that being maximally accommodating is what you wanted, and it generalized that lesson past the boundary of your domain.

The degradation is invisible where teams actually look

If fine-tuning made models fail loudly, this would be a solved problem. The failures would surface in the first manual test session and teams would adjust. The reason the problem persists is that the degradation concentrates exactly where manual testing does not.

A fine-tuned model almost always still refuses the obvious requests. The cartoonish, unambiguous prompts that an engineer types in to "check the safety stuff" on a Friday afternoon keep getting refused, because those behaviors are the most strongly reinforced and the last to erode. What erodes first is the middle of the distribution: requests with plausible professional framing, domain-adjacent misuse, multi-turn escalations where each individual message looks reasonable. In our evaluation of six open-weight SLMs, five failed every multi-turn escalation probe at critical severity, with failure rates between 42 and 66 percent across attack categories. These were not exotic attacks. They were the kind of pressure real users apply without thinking of themselves as attackers.

So the team's testing finds nothing, the benchmark scores say the model family is safe, and the artifact ships. Everyone involved did something reasonable. The pipeline as a whole did not.

What this means in practice

The conclusion is not that fine-tuning is dangerous and should be avoided. Fine-tuning is the correct engineering move for almost every serious SLM deployment. The conclusion is that fine-tuning produces a new model, and new models have unknown safety profiles until measured.

Concretely, the evaluation that matters is a differential one. Run the same adversarial battery against the base model and the fine-tuned model, and look at the delta attack by attack. An aggregate score will lie to you here: overall refusal rates can hold roughly steady while specific categories collapse, and the categories that collapse tend to be the ones closest to your domain, which are exactly the ones your users will exercise. A healthcare fine-tune that lost ground on medical misinformation but held everywhere else looks fine in a summary statistic and is the single worst possible outcome for that deployment.
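The differential comparison can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the category names, the probe counts, and the 10-point regression threshold are all made up for the example, and each result is assumed to be a `(category, refused)` pair produced by running the same probe against each model.

```python
from collections import defaultdict


def refusal_rates(results):
    """Map each category to its refusal rate from (category, refused) pairs."""
    counts = defaultdict(lambda: [0, 0])  # category -> [refused, total]
    for category, refused in results:
        counts[category][0] += int(refused)
        counts[category][1] += 1
    return {c: refused / total for c, (refused, total) in counts.items()}


def overall_rate(results):
    """Aggregate refusal rate across all probes, ignoring category."""
    return sum(int(refused) for _, refused in results) / len(results)


def differential_report(base_results, tuned_results, threshold=0.10):
    """Per-category delta (tuned minus base) for categories present in both runs.

    A category is flagged when its refusal rate drops by at least
    `threshold` (an arbitrary illustrative cutoff).
    """
    base = refusal_rates(base_results)
    tuned = refusal_rates(tuned_results)
    report = {}
    for category in sorted(base.keys() & tuned.keys()):
        delta = tuned[category] - base[category]
        report[category] = {
            "base": base[category],
            "tuned": tuned[category],
            "delta": delta,
            "regressed": delta <= -threshold,
        }
    return report


def probes(category, refused, total):
    """Build hypothetical results: `refused` refusals out of `total` probes."""
    return [(category, True)] * refused + [(category, False)] * (total - refused)


# Hypothetical data for a healthcare fine-tune: 10 probes per category.
base_results = (probes("medical_misinformation", 9, 10)
                + probes("malware", 8, 10) + probes("fraud", 8, 10))
tuned_results = (probes("medical_misinformation", 5, 10)
                 + probes("malware", 10, 10) + probes("fraud", 10, 10))

report = differential_report(base_results, tuned_results)
print(f"aggregate: base={overall_rate(base_results):.0%} "
      f"tuned={overall_rate(tuned_results):.0%}")
for category, row in report.items():
    flag = "REGRESSED" if row["regressed"] else "ok"
    print(f"{category:<24} base={row['base']:.0%} tuned={row['tuned']:.0%} "
          f"delta={row['delta']:+.0%} {flag}")
```

In this made-up data the aggregate refusal rate is identical before and after fine-tuning, while the in-domain category loses 40 points. Only the per-category view surfaces that.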

The differential view also tells you something an absolute score cannot: which stage introduced the problem. If a behavior was solid in the base model and broken after fine-tuning, you know where to intervene, whether that means mixing safety data back into training, adjusting the run, or accepting the regression with eyes open and compensating elsewhere in the system.
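Stage attribution falls out of the same paired data if you keep results per probe rather than per category. A rough sketch, assuming each run is a dict from a probe identifier to whether the model refused; the probe IDs and label names below are hypothetical.

```python
def attribute_stage(base_refused, tuned_refused):
    """Label one probe by which stage its outcome traces back to."""
    if base_refused and tuned_refused:
        return "held"
    if base_refused and not tuned_refused:
        return "introduced_by_fine_tuning"
    if not base_refused and not tuned_refused:
        return "inherited_from_base"
    return "fixed_by_fine_tuning"


def attribute_all(base_run, tuned_run):
    """Label every probe that appears in both runs."""
    shared = sorted(base_run.keys() & tuned_run.keys())
    return {pid: attribute_stage(base_run[pid], tuned_run[pid]) for pid in shared}


# Hypothetical probe outcomes: True means the model refused.
base_run = {"esc-001": True, "esc-002": True, "esc-003": False, "esc-004": False}
tuned_run = {"esc-001": True, "esc-002": False, "esc-003": False, "esc-004": True}

labels = attribute_all(base_run, tuned_run)
for probe_id, label in labels.items():
    print(probe_id, label)
```

Probes labeled `introduced_by_fine_tuning` are the ones a change to the training run can address. Probes labeled `inherited_from_base` were never the fine-tune's fault and need a different intervention.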

Base-model benchmarks are a fine answer to the question they were designed to answer: how does this checkpoint behave in a laboratory? They were never an answer to the question your deployment is actually asking: how does the artifact I built behave under the pressure my users will apply? Those are different questions about different models, and the gap between them does not close by hoping. It closes by testing the model you are actually going to ship.
