Language models (LMs) can generate toxic language even when given non-toxic prompts. Previous research has primarily analyzed this behavior by assessing the toxicity of generated text, but has largely overlooked how models process toxicity internally.
We introduce Aligned Probing, a method that traces toxic language from the input through all model layers to the generated output. This approach provides a more comprehensive evaluation of toxicity in LMs by aligning input and output toxicity with internal representations and quantifying where and how strongly models encode toxic information. Additionally, we treat toxicity as a heterogeneous phenomenon and analyze six fine-grained attributes, such as Threat and Identity Attack, as defined by the PERSPECTIVE API.
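To make the probing step concrete, the following minimal sketch (not the paper's implementation) trains one linear probe per layer of a causal LM to predict a toxicity score for each input text. GPT-2 as a stand-in model, the placeholder texts, and the hard-coded scores are all assumptions for illustration; in practice, per-attribute scores such as Threat would come from the PERSPECTIVE API, and the full method additionally aligns these input-side probes with the toxicity of the generated output.

```python
# Minimal layer-wise probing sketch; assumptions: GPT-2 as a stand-in model,
# hypothetical `texts` and `toxicity_scores` (in practice obtained from the
# PERSPECTIVE API for attributes such as Threat or Identity Attack).
import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies Llama, OLMo, Mistral, and others

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Hypothetical inputs and toxicity scores in [0, 1] for a single attribute.
texts = [f"placeholder prompt {i}" for i in range(8)]
toxicity_scores = np.array([0.05, 0.9, 0.4, 0.7, 0.1, 0.6, 0.25, 0.8])

def layer_representations(text: str) -> np.ndarray:
    """Mean-pool token states at every layer (embeddings + each transformer block)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # (num_layers + 1) tensors
    return np.stack([h[0].mean(dim=0).numpy() for h in hidden_states])

# Shape: (num_layers + 1, num_texts, hidden_dim)
reps = np.stack([layer_representations(t) for t in texts], axis=1)

# One linear probe per layer: how well does that layer encode input toxicity?
for layer_idx, layer_reps in enumerate(reps):
    x_tr, x_te, y_tr, y_te = train_test_split(
        layer_reps, toxicity_scores, test_size=0.25, random_state=0
    )
    probe = Ridge().fit(x_tr, y_tr)
    print(f"layer {layer_idx:2d}  probe R^2 = {probe.score(x_te, y_te):.3f}")
```

Comparing probe performance across layers is one way to read off "where and how strongly" toxic information is encoded; the sketch covers only the input side of the alignment described above.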
Applying Aligned Probing to 20+ models, including Llama, OLMo, and Mistral, reveals that LMs tend to encode toxic language most strongly in lower layers, with distinct patterns across different toxicity attributes. Furthermore, our findings suggest that less toxic models encode more information about the toxicity of the input.