Shared Task
⚠️ Interested in participating? Join our Discord server to stay updated and share your ideas with other participants!
Call for Submissions
The field of mechanistic interpretability (MI) is rapidly advancing, yet comparing the efficacy of new methods remains challenging. To foster rigorous evaluation and drive progress, BlackboxNLP 2025 will host a shared task for benchmarking new techniques for localizing circuits and causal latent variables in language models (LMs).
The shared task will leverage the recently proposed Mechanistic Interpretability Benchmark (MIB) by Mueller* & Geiger* et al. (2025). Participants are invited to submit approaches that tackle tasks in two distinct tracks: Circuit Localization, i.e. identifying subsets of the LM computation graph that perform a specific task, and Causal Variable Localization, i.e. aligning model representations with specific known causal variables.
The goal is to benchmark the ability of existing MI methods to precisely and concisely recover relevant causal pathways or specific causal variables in neural language models, and to identify promising directions for future work. This Call for Submissions provides the rules, timeline, and participation details for the shared task. We invite researchers working on attribution, circuit discovery, feature alignment, sparse coding, and related interpretability techniques to participate.
Refer to the original MIB Benchmark page and the related paper for more details on the MIB benchmark and its evaluation metrics.
Important dates
- May 14th - Release of the Call for Submissions, including links to data and evaluation details.
- August 1st - Deadline for results submission.
- August 8th - Deadline for technical report submission.
- November 10th - Workshop date.
Guidelines for Submissions
Participants are invited to submit their solutions for either of the two tracks through the MIB Leaderboard.
Submissions should include the following items:
For the Circuit Localization Track, we expect one folder per task/model pair, where each folder is named after the task and the model, separated by an underscore, for example ioi_gpt2 or arc-easy_llama3.
- You may provide either (1) one .json or .pt file per folder with floating-point importance scores assigned to each node or edge in the model, or (2) nine .json or .pt files per folder with binary membership variables assigned to each node or edge in the model (a sketch of format (1) is shown after this list).
- If (2), there should be one circuit containing no more than each of the following percentages of edges: {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}. In other words, we expect one circuit with k ≤ 0.1% of edges, one with k ≤ 0.2% of edges, etc., where k is the percentage of edges in the circuit relative to the full model.
- If the code provided in the MIB Circuit Localization Repository is used, the directory structure will already match the requirements.
- See here for an example of submission type 1, and here for an example of submission type 2. NOTE: You are allowed to submit for a subset of models/tasks! As long as you include ≥ 2 models and ≥ 2 tasks, we will accept your submission.
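To make the score-file format concrete, here is a minimal sketch of submission type (1), assuming edge-level scores stored as a plain dictionary; the edge names, file name, and folder layout below are illustrative only, and the linked examples above are authoritative:

```python
import json
from pathlib import Path

# Hypothetical importance scores produced by your attribution method:
# keys name edges of the computation graph, values are floating-point scores.
edge_scores = {
    "a0.h0->m1": 0.42,
    "m1->a2.h3": -0.17,
    "a2.h3->logits": 1.05,
}

# One folder per task/model pair, e.g. ioi_gpt2, containing a single file
# of floating-point scores (submission type 1).
folder = Path("ioi_gpt2")
folder.mkdir(exist_ok=True)
with open(folder / "importances.json", "w") as f:
    json.dump(edge_scores, f, indent=2)
```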
For the Causal Variable Localization Track, we expect a .py script defining an invertible featurizer, a .py script defining at which token positions the featurizer should be applied, and a folder containing a trained featurizer, trained inverse featurizer, and token indices.
- To learn more about the format of the submission, please see both the MIB Causal Variable Localization Repository and this example submission.
- NOTE: the featurizer should extend the Featurizer class, and the modules should extend torch.nn.Module (or one of the existing module classes). The repository should contain a README with instructions on how to run the code, including any dependencies and installation steps.
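For illustration only, an invertible featurizer built from torch.nn.Module components might look like the sketch below; the class and constructor names here are made up, and the actual interface of the Featurizer base class should be taken from the repository and the example submission:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class RotationFeaturizerModule(nn.Module):
    """Maps activations into a rotated feature space (illustrative example)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # Parametrize the weight to be orthogonal so the map is exactly invertible.
        self.rotation = orthogonal(nn.Linear(hidden_size, hidden_size, bias=False))

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.rotation(activations)


class InverseRotationFeaturizerModule(nn.Module):
    """Maps featurized vectors back to the original activation space."""
    def __init__(self, featurizer: RotationFeaturizerModule):
        super().__init__()
        self.featurizer = featurizer

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # The inverse of an orthogonal matrix is its transpose:
        # forward computes x @ W.T, so the inverse is y @ W.
        return features @ self.featurizer.rotation.weight
```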
Submissions to either track will be evaluated by organizers on the private test set from the MIB benchmark, and results will be made available on the MIB Leaderboard. Participants will be invited to submit a technical report describing their approach, results, and any insights gained during the process. The report should be no more than 4 pages long (excluding references) and follow the BlackboxNLP 2025 formatting guidelines.
Task Details
MIB contains two tracks. The circuit localization track benchmarks methods that aim to locate graphs of causal dependencies in neural networks. The causal variable localization track benchmarks methods that aim to locate specific human-interpretable causal variables in neural networks.
Circuit Localization Track
This track benchmarks circuit discovery methods—i.e., methods for locating graphs of causal dependencies in neural networks. Most circuit discovery pipelines look something like this:
- Compute importance scores for each component or each edge between components.
- Ablate all components from the network except those that surpass some importance threshold, or those in the top k%.
- Evaluate how well the circuit (the model with only the most important components left unablated) performs, or how well it replicates the full model’s behavior.
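As a loose sketch of steps 2 and 3 under this paradigm (the attribution and ablation-evaluation functions named in the comments are placeholders, not part of the MIB codebase):

```python
def build_circuit(edge_scores: dict[str, float], top_k_percent: float) -> set[str]:
    """Step 2: keep only the edges whose importance score is in the top k%."""
    n_keep = max(1, int(len(edge_scores) * top_k_percent / 100))
    ranked = sorted(edge_scores, key=lambda e: abs(edge_scores[e]), reverse=True)
    return set(ranked[:n_keep])

# Hypothetical usage, with `attribute` (step 1) and `evaluate_with_ablation`
# (step 3) standing in for your attribution method and ablation harness:
# edge_scores = attribute(model, task_dataset)
# circuit = build_circuit(edge_scores, top_k_percent=1.0)
# faithfulness = evaluate_with_ablation(model, circuit, task_dataset)
```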
In the circuit localization track, participants are asked (but not required) to employ the MIB benchmark’s provided code to discover and evaluate LM circuits.
Two evaluation criteria are employed: (1) how well the circuit performs overall (the CPR metric), and (2) how well it replicates the full model’s behavior (the CMD metric). Past evaluations have often implicitly conflated these two; here, we follow MIB in treating them as complementary but separate metrics. More details on the evaluation are available in the MIB Circuit Localization Repository.
Causal Variable Localization Track
This track benchmarks featurization methods—i.e., methods for transforming model activations into a space where it’s easier to isolate a given causal variable. Most pipelines under this paradigm look like this:
- Curate a dataset of contrastive pairs, where each pair differs only with respect to the targeted causal variable.
- If using a supervised method, train the featurization method using the contrastive pairs.
- To evaluate: feed the model an input from a pair, use the featurizer to transform an activation vector, intervene in the transformed space, transform back out, and see whether the model’s new behavior aligns with what is expected under the intervention.
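As a schematic of the intervention step only (the featurizers are assumed to be callables mapping between activation and feature space, and the choice of which feature dimensions encode the variable is purely illustrative):

```python
import torch

def intervene_in_feature_space(featurizer, inverse_featurizer,
                               base_act: torch.Tensor,
                               source_act: torch.Tensor,
                               variable_dims: slice = slice(0, 16)) -> torch.Tensor:
    """Copy the feature dimensions assumed to encode the targeted causal
    variable from the source activation into the base activation, then map
    the result back into the model's activation space."""
    base_feat = featurizer(base_act)
    source_feat = featurizer(source_act)
    patched_feat = base_feat.clone()
    patched_feat[..., variable_dims] = source_feat[..., variable_dims]
    return inverse_featurizer(patched_feat)

# The patched activation would then be written back into the model's forward
# pass (e.g. via a hook) at the chosen layer and token position, and the new
# output compared against the behavior expected under the intervention.
```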
In the causal variable localization track, participants are asked (but not required) to employ the MIB benchmark’s provided code to train and evaluate featurizers.
Leaderboard
The leaderboard is public! See here. You may submit up to two submissions per week (excluding those that do not pass our formatting checks). Use the public validation and test sets to prototype your methods before you submit to the leaderboard.