arXiv 2509.16660

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

By Zuhair Hasan Shaik, Abdullah Mazhar, et al.

Published 2025-09-20

Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but these methods suffer from instability, context dependence, and often compromise the model's core language abilities. To address these sho…
