Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini, Amin Saberi

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.
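One common way to read "Fisher-weighted projection of the logit distribution shift" is through the local quadratic approximation of the KL divergence between two softmax distributions. The sketch below is an assumption about the general shape of such an argument, not the paper's own derivation. The symbols $\Delta$ (logit shift between suspect and base), $J$ (linearized effect of the adapter parameters $\theta$ on the logits), and $F$ (Fisher information with respect to the logits) are introduced here for illustration only.

```latex
% Suspect logits z_s = z_b + \Delta; base plus adapter gives z_b + J\theta (linearized).
% With p = softmax(z_s) and residual r = \Delta - J\theta:
\mathrm{KL}\!\left(\mathrm{softmax}(z_s)\,\middle\|\,\mathrm{softmax}(z_b + J\theta)\right)
  \;\approx\; \tfrac{1}{2}\, r^{\top} F\, r,
\qquad F = \mathrm{diag}(p) - p\,p^{\top}.

% Setting the gradient with respect to \theta to zero gives the normal equations
J^{\top} F \,(\Delta - J\theta^{\ast}) = 0
\quad\Longrightarrow\quad
\theta^{\ast} = \left(J^{\top} F J\right)^{+} J^{\top} F\, \Delta ,

% so the fitted adapter reproduces the F-weighted projection of \Delta onto range(J):
J\theta^{\ast} = J\left(J^{\top} F J\right)^{+} J^{\top} F\, \Delta .
```

Under this reading, an adapter with a small range can retain only the part of the shift that lies in that range, measured in the Fisher metric. In practice the objective would be averaged over many contexts, each with its own $\Delta$, $J$, and $F$.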

Open Source

Research Brief

Distill to Detect (D2D) is a novel method that exposes hidden preferential biases in large language models by amplifying subtle distributional shifts into detectable text output using a KV-cache prefix adapter.

Language models deployed in critical roles can harbor subtle, hidden biases that favor specific entities or viewpoints, potentially influencing users at scale. These 'stealth biases' are particularly dangerous because they appear only on the relevant topic, while the model otherwise behaves like its unmodified base. Prior work has shown that they can transfer through context distillation on semantically unrelated data, with the signal residing in the model's soft logit distribution and remaining invisible to text-based inspection. The core challenge for defenders is an asymmetry: without knowing the bias topic, no detection method can reliably surface the bias. This paper introduces Distill to Detect (D2D), which addresses this by distilling the distributional shift between a suspect model and its base into a 'cartridge' (a KV-cache prefix adapter). This process concentrates the dominant divergence and amplifies the bias signal into generated text, where it can be reliably detected. The paper also proposes a theoretical framework, based on Fisher-weighted projection of the logit distribution shift, to explain D2D's effectiveness, turning the capacity bottleneck of prefix-tuning adapters into an auditing tool for deployed LLMs.
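The idea of distilling a distributional shift through a capacity bottleneck can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the paper's implementation: the "models" are synthetic logit tables, and the cartridge is replaced by a single shared logit offset `b` (a much cruder bottleneck than a KV-cache prefix). The function `distill_shift` and all variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_shift(base_logits, suspect_logits, steps=500, lr=1.0):
    """Fit one shared logit offset b that minimizes the mean
    KL(p_suspect || softmax(base_logits + b)) over all contexts.
    The offset is a toy stand-in for a low-capacity adapter."""
    p = softmax(suspect_logits)            # teacher: suspect model
    b = np.zeros(base_logits.shape[1])     # student parameters
    for _ in range(steps):
        q = softmax(base_logits + b)       # student: base + adapter
        grad = (q - p).mean(axis=0)        # gradient of the mean KL w.r.t. b
        b -= lr * grad                     # plain gradient descent
    return b

# Synthetic data (assumption): a small, consistent boost on one token,
# hidden under context-specific noise.
rng = np.random.default_rng(0)
n_contexts, vocab, biased_token = 256, 50, 7
base_logits = rng.normal(size=(n_contexts, vocab))
shift = 0.01 * rng.normal(size=(n_contexts, vocab))
shift[:, biased_token] += 0.3
suspect_logits = base_logits + shift

b = distill_shift(base_logits, suspect_logits)
recovered_token = int(np.argmax(b))
print("token with the largest distilled offset:", recovered_token)
```

Because the offset is shared across contexts, it can only absorb the component of the shift that is consistent across them, which is why the planted token stands out in `b`. In D2D as described above, the amplified signal is read from text generated with the cartridge attached, rather than from the adapter parameters directly.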

Potential Applications

Auditing and compliance for large language models to ensure fairness and prevent manipulative AI behaviors in high-stakes applications.
Developing regulatory frameworks and tools for identifying and mitigating undisclosed biases in commercial AI systems.
Enhancing responsible AI development practices by providing a method for internal bias detection before model deployment.
Consumer protection against models that might subtly steer decisions, recommendations, or information access in a biased manner.

Paper Trustworthiness Index

25 / 100

High Skepticism / Self-Published

This document should be treated with critical skepticism. It contains unverified scientific claims or was self-published.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

0 / 25

The abstract does not provide information regarding authors, their affiliations, or track records, preventing an assessment of this pillar.

Technical Rigor & Methodology

25 / 30

The abstract outlines a specific methodology (Distill to Detect using cartridge distillation), proposes a theoretical framework (Fisher-weighted projection), and claims empirical validation across multiple bias types. This suggests a sound scientific approach to problem-solving and validation.

Reproducibility & Openness

0 / 25

The abstract does not mention the availability of code, datasets, or model weights, which would be crucial for independent verification and reproducibility of the claimed results.

Community Vetting & Peer Review

0 / 20

The abstract does not specify the publication venue (e.g., conference, journal, or preprint server), making it impossible to assess the current level of peer review or community vetting.
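As a quick consistency check, and assuming the headline index is the plain sum of the four pillar scores (the page does not state its aggregation rule), the breakdown above reproduces the reported total:

```latex
\underbrace{0}_{\text{track record}}
+ \underbrace{25}_{\text{rigor}}
+ \underbrace{0}_{\text{reproducibility}}
+ \underbrace{0}_{\text{vetting}}
= 25,
\qquad
25 + 30 + 25 + 20 = 100 .
```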

Detailed Evidence Assessment

Verified Evidence & Citations

D2D successfully amplifies hidden biases for reliable detection.

“We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types.”

A theoretical framework exists for D2D's efficacy.

“We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations.”

Biases can transfer through context distillation, with the signal residing in the soft logit distribution.

“Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection.”

Uncertainties & Omissions

• Omission: Specific experimental setup details and parameters for D2D.

• Omission: Details on the 'multiple bias types' tested and the criteria for 'reliable detection'.

• Omission: Quantitative results or specific metrics for bias amplification and detection accuracy.

• Omission: Specific datasets used for distillation or evaluation.

• Omission: Codebase or model weights repository link.

• Omission: Author affiliations and publication venue details.

• Uncertainty: The precise definition and full scope of 'stealth biases' that D2D can detect.

• Uncertainty: The computational cost and scalability of D2D for very large models or frequent auditing.

• Uncertainty: The generalizability of D2D across different LLM architectures and model sizes.

• Uncertainty: The robustness of D2D to adversarial attempts to hide biases even more subtly.