Agam Goyal

agamg2 [at] illinois [dot] edu



I’m a second-year Computer Science Ph.D. student at the University of Illinois Urbana–Champaign, co-advised by Prof. Hari Sundaram and Prof. Eshwar Chandrasekharan. I also collaborate closely with Prof. Koustuv Saha. Previously, I was an undergraduate at the University of Wisconsin–Madison, majoring in Computer Science, Mathematics, and Data Science, where I was advised by Prof. Hanbaek Lyu and Prof. Junjie Hu.

I study sociotechnical systems (especially LLMs) and how they shape social interactions. My work weaves together three threads:

  1. Understanding models for safety using mechanistic interpretability and machine unlearning to reveal what they encode, diagnose failure modes, and remove or modify undesirable behavior;
  2. Modeling social interaction in human–human and human–AI settings using NLP and causal inference to explain linguistic phenomena and estimate causal effects;
  3. Improving outcomes by designing systems that translate these insights into practice and conducting human-centric evaluations of them.

My goal with this “explain → model → intervene” loop is to build principled, deployable methods that help us understand and improve real-world social interactions.

My work is supported in part by the Cohere For AI Research Grant Program and the OpenAI Researcher Access Program.

Last summer, I worked as a Research Intern at Adobe Research, mentored by Dr. Apoorv Saxena and Dr. Koyel Mukherjee.

If you’re an undergraduate interested in research experience, feel free to reach out: agamg2@illinois.edu. A strong background in ML/NLP and experience with PyTorch are highly recommended.

News

Aug 20, 2025 Three papers on LLM-based content moderation, LLM detoxification using sparse autoencoders, and a new, challenging argument summarization dataset have been accepted to EMNLP Main 2025!
Jul 15, 2025 Our work Uncovering the Internet’s Hidden Values has been accepted to ICWSM 2026!
Jun 20, 2025 Our work Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders has been accepted to the Actionable Interpretability Workshop @ ICML!
Apr 25, 2025 Gave a talk on Detoxification of LLMs using SAEs at the AImpact Center @ UIUC. [Slides]
Jan 22, 2025 Our work on Small Language Models for Content Moderation has been accepted to NAACL 2025 (Main) as an Oral talk!

Selected publications

  1. NAACL Oral
    SLM-Mod: Small Language Models Surpass LLMs at Content Moderation
    Xianyang Zhan*, Agam Goyal*, Yilun Chen, Eshwar Chandrasekharan, and Koustuv Saha
    In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025
  2. arXiv
    The Language of Approval: Identifying the Drivers of Positive Feedback Online
    Agam Goyal, Charlotte Lambert, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2509.10370, Sep 2025
  3. EMNLP, ICML’W
    Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
    Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, and Hari Sundaram
    arXiv preprint arXiv:2505.14536, May 2025
  4. EMNLP
    ArgCMV: An Argument Summarization Benchmark for the LLM-era
    Omkar Gurjar, Agam Goyal, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2508.19580, Aug 2025
  5. EMNLP
    MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
    Agam Goyal, Xianyang Zhan, Yilun Chen, Koustuv Saha, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2505.14483, May 2025