BlackboxNLP 2026

The BlackboxNLP 2026 Reproducibility Challenge

Description

The BlackboxNLP 2026 Reproducibility Challenge invites participants to build upon and strengthen recent work published in the NLP community by conducting rigorous robustness checks of published results. It is inspired by initiatives such as the ML Reproducibility Challenge, whose goal is to investigate the reproducibility of papers accepted at top conferences and to test the generalizability of scientific findings by adding novel insights and empirical results (MLRC 2025). Specifically, we encourage participants to examine the strength of baselines (including random and trivial baselines), explore how well findings generalize across datasets, languages, and experimental settings, and conduct targeted ablations to shed light on which components of a system actually drive its reported gains.

Several works have raised concerns about core interpretability methods. Probing classifiers, feature attribution, SAEs, and causal analysis can provide plausible explanations even for randomly initialized networks (dead salmons; [3]), linear probes might exploit surface-level textual leakage [5], SAE-based tools exhibit pathological reconstruction errors [4], and methods such as activation oracles are difficult to use in practice [2]. Encouragingly, negative results need not be the end of the story: with methodological rigor, one can unlock the genuine potential of NLP interpretability tools such as SAEs [1]. Whether critical or positive, these works provide highly valuable insights into the practical difficulties of NLP interpretability work, but they often lack an appropriate venue because their scope tends to be narrow. Our goal is to provide a platform for reproducibility work focused on NLP interpretability.

Concretely, participants are invited to revisit various aspects of selected papers, along one or more of the following themes:

Baselines: contributions can revisit the experimental baselines of a chosen paper, whether by investing more effort in hyperparameter tuning, strengthening trivial baselines, or incorporating more recent, competitive methods to ensure that the reported results hold up under rigorous comparison.
Ablation: using the original codebase, contributions can systematically explore the methodological design space, ablating the components a paper introduces to clarify which are truly responsible for its key findings.
Generalizability: contributions can investigate whether a method’s success is robust beyond its original experimental scope, assessing performance across different models, tasks, or tooling environments (e.g., NDIF/NNsight).
Benchmarking and evaluation: contributions can assess how a method’s outcomes are measured and extend the evaluation setup, either by including human evaluation of practical utility or by applying alternative or complementary metrics, thereby broadening the stress testing used in the original work.
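
As an illustration of the Baselines theme, the following is a minimal sketch of how a probing result might be compared against a trivial majority-class baseline and a shuffled-label control. It is not taken from any of the cited papers: the function names, the mean-difference probe (a deliberately simple stand-in for whatever probe a paper uses), and the synthetic data are all assumptions made for illustration.

```python
import random


def fit_mean_difference_probe(features, labels):
    """Fit a linear probe whose direction is the difference of class means."""
    dim = len(features[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for row, label in zip(features, labels):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    means = {c: [s / max(counts[c], 1) for s in sums[c]] for c in (0, 1)}
    direction = [m1 - m0 for m0, m1 in zip(means[0], means[1])]
    midpoint = [(m0 + m1) / 2 for m0, m1 in zip(means[0], means[1])]
    # Decision threshold: projection of the midpoint between the class means.
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def probe_accuracy(probe, features, labels):
    """Fraction of examples whose projected score falls on the correct side."""
    direction, threshold = probe
    correct = 0
    for row, label in zip(features, labels):
        score = sum(d * v for d, v in zip(direction, row))
        correct += int((score > threshold) == bool(label))
    return correct / len(labels)


def majority_baseline_accuracy(train_labels, test_labels):
    """Trivial baseline: always predict the most frequent training label."""
    majority = int(2 * sum(train_labels) >= len(train_labels))
    return sum(int(label == majority) for label in test_labels) / len(test_labels)


def compare_probe_to_baselines(train_x, train_y, test_x, test_y, seed=0):
    """Report probe accuracy next to a trivial baseline and a label-shuffle control."""
    rng = random.Random(seed)
    # Control: shuffle labels in both splits so they carry no information
    # about the features; a sound pipeline should then score near chance.
    shuffled_train_y = list(train_y)
    shuffled_test_y = list(test_y)
    rng.shuffle(shuffled_train_y)
    rng.shuffle(shuffled_test_y)
    probe = fit_mean_difference_probe(train_x, train_y)
    control = fit_mean_difference_probe(train_x, shuffled_train_y)
    return {
        "probe": probe_accuracy(probe, test_x, test_y),
        "majority": majority_baseline_accuracy(train_y, test_y),
        "shuffled_labels": probe_accuracy(control, test_x, shuffled_test_y),
    }


def make_synthetic_data(n=400, dim=16, seed=0):
    """Hypothetical stand-in for model activations with one decodable direction."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += 3.0 * (2 * label - 1)  # plant a linearly decodable signal
        features.append(row)
        labels.append(label)
    return features, labels


if __name__ == "__main__":
    x, y = make_synthetic_data()
    results = compare_probe_to_baselines(x[:300], y[:300], x[300:], y[300:])
    for name, accuracy in results.items():
        print(f"{name}: {accuracy:.3f}")
```

In a real submission, the synthetic features would be replaced by activations extracted from the model under study. A probe that does not clearly outperform both controls would suggest that the reported result deserves closer scrutiny.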

This challenge focuses not on questioning the validity of prior work, but on enriching it. All tracks explicitly welcome and encourage negative results as valuable evidence about the boundary conditions of claims. Accepted papers will be published in the workshop proceedings in the ACL Anthology and therefore cannot be published elsewhere.

NDIF Best Paper Award

Thanks to the generous support of the National Deep Inference Fabric (NDIF), we will be awarding a Best Submission Award to one outstanding submission in the reproducibility challenge that meets the criteria below, as judged by the workshop organizers and NDIF representatives:

Open-source implementation: The submission must include a publicly available codebase that allows for easy replication of the results presented in the paper.
Use of NNsight and NDIF: The submission should use NNsight, optionally including remote execution through NDIF, as the main tools to gain access to the models’ internals.
Novel insights: The submission should provide new insights or findings that go beyond the original work, such as identifying limitations, proposing improvements, or uncovering new phenomena.

The winning submission will receive a $500 USD prize, and the authors will be invited to present their work at the BlackboxNLP 2026 workshop.

Timeline

June 30th, 2026 - Expression of interest deadline for the Reproducibility Challenge through this form (optional).
July 24th, 2026 (extended from July 17th, 2026) - Reproducibility Challenge submission deadline through OpenReview.

All deadlines are 11:59PM UTC-12:00 (“Anywhere on Earth”).

References

[1] Dana Arad, Aaron Mueller, and Yonatan Belinkov. “SAEs are good for steering – if you select the right features.” EMNLP, 2025.
[2] Arya Jakkli, Senthooran Rajamanoharan, and Neel Nanda. “Current activation oracles are hard to use.” LessWrong, 2026.
[3] Maxime Méloux et al. “The Dead Salmons of AI Interpretability.” arXiv, 2025. https://arxiv.org/abs/2512.18792
[4] Wes Gurnee. “SAE Reconstruction Errors are Empirically Pathological.” LessWrong, 2024.
[5] Gerard Boxo et al. “Linear probes rely on textual evidence: Results from leakage mitigation studies in language models.” arXiv, 2025. https://arxiv.org/abs/2509.21344
[6] Neel Nanda et al. “Emergent Linear Representations in World Models of Self-Supervised Sequence Models.” BlackboxNLP, 2023. https://aclanthology.org/2023.blackboxnlp-1.2/
[7] Tolga Bolukbasi et al. “An Interpretability Illusion for BERT.” arXiv, 2021. https://arxiv.org/abs/2104.07143

Questions?

Reach out to blackboxnlp@googlegroups.com and we will get back to you.