Aligned Probing: Relating Toxic Behavior and Model Internals


Overview

Language models (LMs) can generate toxic language even when given non-toxic prompts. Previous research has primarily analyzed this behavior by assessing the toxicity of generated text, but largely overlooked how models internally process toxicity.

We introduce Aligned Probing, a method that traces toxic language from the input through all model layers to the generated output. This approach provides a more comprehensive evaluation of toxicity in LMs by aligning and quantifying where and how strongly models encode toxic information. Additionally, we treat toxicity as a heterogeneous phenomenon and analyze six fine-grained attributes, such as Threat and Identity Attack, as defined by the PERSPECTIVE API.
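As a point of reference, the snippet below sketches how per-attribute scores can be requested from the PERSPECTIVE API. The attribute set, the client configuration, and the placeholder API key are assumptions of this sketch and may differ from the exact setup used in our experiments.

```python
# Minimal sketch: score a text with the PERSPECTIVE API.
# Attribute set and client configuration are illustrative assumptions.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # hypothetical placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

ATTRIBUTES = [
    "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
    "INSULT", "THREAT", "SEXUALLY_EXPLICIT",
]

def perspective_scores(text: str) -> dict:
    """Return one score in [0, 1] per requested attribute."""
    request = {
        "comment": {"text": text},
        "requestedAttributes": {attr: {} for attr in ATTRIBUTES},
        "languages": ["en"],
    }
    response = client.comments().analyze(body=request).execute()
    return {
        attr: response["attributeScores"][attr]["summaryScore"]["value"]
        for attr in ATTRIBUTES
    }
```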

Applying Aligned Probing to 20+ models, including Llama, OLMo, and Mistral, reveals that LMs tend to encode toxic language most strongly in lower layers, with distinct patterns across different toxicity attributes. Furthermore, our findings suggest that less toxic models encode more information about the toxicity of the input.


How Aligned Probing works

Aligned Probing consists of four steps that examine the behavior and internal processes of LMs and connect these two perspectives to analyze their interplay in the context of toxicity.
  • Inference: An LM generates text based on a given input prompt. Simultaneously, we collect internal representations from every model layer for both the input and the generated output.
  • Toxic Behavior: We assess the toxicity of the input and the generated output using the PERSPECTIVE API.
  • Encoding of Toxic Language in LMs: We use linear models (probes) to evaluate how LMs encode information about toxicity at each model layer (see the sketch after this list) across four distinct scenarios:
    • Input scenario: Examines how LMs capture the toxicity of the input within their input internals.
    • Forward scenario: Analyzes how LMs propagate information about input toxicity within their output internals.
    • Output scenario: Investigates how strongly LMs encode the toxicity of the generated output in their output internals.
    • Backward scenario: Studies how much information about output toxicity is retained within the input internals.
  • Interplay: We correlate the behavioral and internal evaluations to analyze how these perspectives relate. Through layer-wise interventions, we causally validate these insights.
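As a rough illustration of steps 1 and 3, the sketch below collects mean-pooled hidden states from every layer with Hugging Face transformers and fits one linear probe per layer to predict a toxicity score. The model name (a small placeholder rather than one of the models from our study), the mean pooling, the ridge-regression probe, and the in-sample fit metric are assumptions of this sketch; in practice, probe quality is evaluated on held-out data.

```python
# Sketch of steps 1 and 3: collect per-layer representations and fit one
# linear probe per layer to predict a toxicity score.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder; the study uses Llama, OLMo, and Mistral models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def layer_representations(text: str) -> np.ndarray:
    """Mean-pooled hidden state per layer, shape (num_layers + 1, hidden_dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: one (1, seq_len, hidden_dim) tensor per layer
    return np.stack([h.mean(dim=1).squeeze(0).numpy() for h in outputs.hidden_states])

def probe_per_layer(texts, toxicity_scores):
    """Fit one ridge probe per layer; return how well each layer predicts toxicity."""
    reps = np.stack([layer_representations(t) for t in texts])  # (N, layers, dim)
    fits = []
    for layer in range(reps.shape[1]):
        probe = Ridge(alpha=1.0).fit(reps[:, layer], toxicity_scores)
        fits.append(r2_score(toxicity_scores, probe.predict(reps[:, layer])))
    return fits  # higher = more linearly decodable toxicity information at that layer
```

The Forward and Backward scenarios follow the same recipe, pairing output internals with input toxicity and input internals with output toxicity, respectively.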

Key Findings

i) Lower layers strongly encode toxic language.
LMs encode toxic language most strongly in their lower layers. In particular, we find the most information about input toxicity within the input internals (Input scenario). The layer at which this information peaks varies by toxicity attribute: more contextualized attributes, such as Threat, peak in higher layers, whereas attributes more sensitive to individual words, like Sexually Explicit, peak in lower layers.


[Figure: Results 1]


ii) Instruction-tuning changes how information is encoded.
Comparing pre-trained and instruction-tuned LMs shows that instruction-tuning increases information about input toxicity while reducing information about the output toxicity. Interestingly, this effect is more pronounced for toxicity attributes that are less sensitive to single words, such as Threat, hinting at the contextualized impact of instruction-tuning.


[Figure: Results 2]


iii) Less toxic language models encode more information about input toxicity.
Analyzing the relationship between toxic behavior and how LMs encode toxic language (left) shows that models encoding more information about the input toxicity tend to produce less toxic outputs. Simultaneously, more toxic models also encode more information about the toxicity of the generated output. We further find that this correlation has a causal component: intervening on single model layers to simulate the removal of toxicity-related information increases the toxicity of the model's output (right).
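One way to realize such an intervention is to project a learned toxicity direction out of a layer's hidden states with a PyTorch forward hook, as sketched below. The projection-based ablation, the choice of layer, and the probe direction are assumptions of this sketch rather than a description of our exact procedure.

```python
# Sketch of a single-layer intervention: remove the component of the hidden
# states that lies along a (hypothetical) toxicity direction, e.g. the weight
# vector of a trained linear probe.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that projects `direction` out of a layer's hidden states."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeffs = hidden @ direction                      # (batch, seq_len)
        hidden = hidden - coeffs.unsqueeze(-1) * direction
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Usage (illustrative): ablate the direction at one decoder layer, then generate.
# `direction` must share the layer's device and dtype; the module path varies by model.
# layer = model.model.layers[5]            # Llama-style module path
# handle = layer.register_forward_hook(make_ablation_hook(probe_direction))
# outputs = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=50)
# handle.remove()                          # restore the unmodified model
```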


[Figure: Results 3]


The Dilemma of Generative LMs

Using Aligned Probing, we found that model behavior is heavily influenced by the toxicity of the input, and that model internals strongly encode and propagate this toxicity information. This substantial dependence on input properties highlights a fundamental dilemma in generative models: while we expect them to produce semantically relevant outputs based on a given prompt, they should ideally ignore unwanted attributes such as toxicity. Moving toward more controllable text generation, we plan to extend Aligned Probing to analyze additional aspects of language modeling, such as stereotypical formulations, and to explore the effectiveness of other mitigation strategies, including model merging.