Agam Goyal

agamg2 [at] illinois [dot] edu



I’m a second-year Computer Science Ph.D. student at the University of Illinois Urbana–Champaign, co-advised by Prof. Hari Sundaram and Prof. Eshwar Chandrasekharan. I also collaborate closely with Prof. Koustuv Saha. Previously, I was an undergraduate at the University of Wisconsin–Madison, majoring in Computer Science, Mathematics, and Data Science, where I was advised by Prof. Hanbaek Lyu and Prof. Junjie Hu.

I study sociotechnical systems (especially LLMs) and how they shape social interactions. My work weaves together three threads:

  1. Understanding models for safety using mechanistic interpretability and machine unlearning to reveal what they encode, diagnose failure modes, and remove or modify undesirable behavior;
  2. Modeling social interaction in human–human and human–AI settings using NLP and causal inference to explain linguistic phenomena and estimate causal effects;
  3. Improving outcomes by designing systems that translate these insights into practice and conducting human-centric evaluations of them.

My goal with this “explain → model → intervene” loop is to build principled, deployable methods that help us understand and improve real-world social interactions.

My work is supported in part by the Cohere For AI Research Grant Program and the OpenAI Researcher Access Program.

Last summer, I worked as a Research Intern at Adobe Research, mentored by Dr. Apoorv Saxena and Dr. Koyel Mukherjee.

If you’re an undergraduate interested in research experience, feel free to reach out: agamg2@illinois.edu. A strong background in ML/NLP and experience with PyTorch are highly recommended.

News

Aug 20, 2025 Three papers on LLM-based content moderation, LLM detoxification using sparse autoencoders, and a new, challenging argument summarization dataset have been accepted to EMNLP Main 2025!
Jul 15, 2025 Our work Uncovering the Internet’s Hidden Values has been accepted to ICWSM 2026!
Jun 20, 2025 Our work Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders has been accepted to the Actionable Interpretability Workshop @ ICML!
Apr 25, 2025 Gave a talk on Detoxification of LLMs using SAEs at the AImpact Center @ UIUC. [Slides]
Jan 22, 2025 Our work on Small Language Models for Content Moderation has been accepted to NAACL 2025 (Main) as an Oral talk!

Selected publications

  1. NAACL Oral
    SLM-Mod: Small Language Models Surpass LLMs at Content Moderation
    Xianyang Zhan*, Agam Goyal*, Yilun Chen, Eshwar Chandrasekharan, and Koustuv Saha
    In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025
  2. arXiv
    The Language of Approval: Identifying the Drivers of Positive Feedback Online
    Agam Goyal, Charlotte Lambert, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2509.10370, Sep 2025
  3. EMNLP, ICML’W
    Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
    Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, and Hari Sundaram
    arXiv preprint arXiv:2505.14536, May 2025
  4. EMNLP
    ArgCMV: An Argument Summarization Benchmark for the LLM-era
    Omkar Gurjar, Agam Goyal, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2508.19580, Aug 2025
  5. EMNLP
    MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
    Agam Goyal, Xianyang Zhan, Yilun Chen, Koustuv Saha, and Eshwar Chandrasekharan
    arXiv preprint arXiv:2505.14483, May 2025