arXiv 2509.16660
Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation
By Zuhair Hasan Shaik, Abdullah Mazhar, et al.
Published 2025-09-20
Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but these methods suffer from instability, context dependence, and often compromise the model's core language abilities. To address these sho…