Shared Task
⚠️ Interested in participating? Join our Discord server to stay updated and share your ideas with other participants!
Call for Submissions
The field of mechanistic interpretability (MI) is rapidly advancing, yet comparing the efficacy of new methods remains challenging. To foster rigorous evaluation and drive progress, BlackboxNLP 2025 will host a shared task for benchmarking new techniques for localizing circuits and causal latent variables in language models (LMs).
The shared task will leverage the recently proposed Mechanistic Interpretability Benchmark (MIB) by Mueller* & Geiger* et al. (2025). Participants are invited to submit approaches that tackle tasks in two distinct tracks: Circuit Localization, i.e. identifying subsets of the LM computation graph that perform a specific task, and Causal Variable Localization, i.e. aligning model representations with specific known causal variables.
The goal is to benchmark the ability of existing MI methods to precisely and concisely recover relevant causal pathways or specific causal variables in neural language models, and to identify promising directions for future work. This Call for Submissions provides the rules, timeline, and participation details for the shared task. We invite researchers working on attribution, circuit discovery, feature alignment, sparse coding, and related interpretability techniques to participate.
Refer to the original MIB Benchmark page and the related paper for more details on the MIB benchmark and its evaluation metrics.
Important dates
- May 14th - Release of the Call for Submissions, including links to data and evaluation details.
- August 1st - Deadline for results submission.
- August 8th - Deadline for technical report submission.
- November 10th - Workshop date.
Guidelines for Submissions
Participants are invited to submit their solutions for either of the two tracks through the MIB Leaderboard.
Submissions should include the following items:
For the Circuit Localization Track, we expect one folder per task/model pair, where each folder is named after the task and the model, separated by an underscore, for example ioi_gpt2 or arc-easy_llama3.
- You may provide either (1) one .json or .pt file per folder with floating-point importance scores assigned to each node or edge in the model, or (2) nine .json or .pt files per folder with binary membership variables assigned to each node or edge in the model (a sketch of format (1) is shown after this list).
- If (2), there should be one circuit containing no more than each of the following percentages of edges: {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}. In other words, we expect one circuit with k ≤ 0.1% of edges, one with k ≤ 0.2% of edges, etc., where k is the percentage of edges in the circuit relative to the full model.
- If the code provided in the MIB Circuit Localization Repository is used, the directory structure will already match the requirements.
- See here for an example of submission type 1, and here for an example of submission type 2. NOTE: You are allowed to submit for a subset of models/tasks! As long as you include ≥ 2 models and ≥ 2 tasks, we will accept your submission.
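To make the score-file format concrete, here is a minimal sketch of submission type (1), assuming edge-level scores stored as a plain dictionary; the edge names, file name, and folder layout below are illustrative only, and the linked examples above are authoritative:

```python
import json
from pathlib import Path

# Hypothetical importance scores produced by your attribution method:
# keys name edges of the computation graph, values are floating-point scores.
edge_scores = {
    "a0.h0->m1": 0.42,
    "m1->a2.h3": -0.17,
    "a2.h3->logits": 1.05,
}

# One folder per task/model pair, e.g. ioi_gpt2, containing a single file
# of floating-point scores (submission type 1).
folder = Path("ioi_gpt2")
folder.mkdir(exist_ok=True)
with open(folder / "importances.json", "w") as f:
    json.dump(edge_scores, f, indent=2)
```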
For the Causal Variable Localization Track, we expect a .py script defining an invertible featurizer, a .py script defining at which token positions the featurizer should be applied, and a folder containing a trained featurizer, trained inverse featurizer, and token indices.
- To learn more about the format of the submission, please see both the MIB Causal Variable Localization Repository and this example submission.
- NOTE: the featurizer should extend the Featurizer class, and the modules should extend torch.nn.Module (or one of the existing module classes). The repository should contain a README with instructions on how to run the code, including any dependencies and installation steps.
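For illustration only, an invertible featurizer built from torch.nn.Module components might look like the sketch below; the class and constructor names here are made up, and the actual interface of the Featurizer base class should be taken from the repository and the example submission:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class RotationFeaturizerModule(nn.Module):
    """Maps activations into a rotated feature space (illustrative example)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # Parametrize the weight to be orthogonal so the map is exactly invertible.
        self.rotation = orthogonal(nn.Linear(hidden_size, hidden_size, bias=False))

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.rotation(activations)


class InverseRotationFeaturizerModule(nn.Module):
    """Maps featurized vectors back to the original activation space."""
    def __init__(self, featurizer: RotationFeaturizerModule):
        super().__init__()
        self.featurizer = featurizer

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # The inverse of an orthogonal matrix is its transpose:
        # forward computes x @ W.T, so the inverse is y @ W.
        return features @ self.featurizer.rotation.weight
```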
Submissions to either track will be evaluated by organizers on the private test set from the MIB benchmark, and results will be made available on the MIB Leaderboard. Participants will be invited to submit a technical report describing their approach, results, and any insights gained during the process. The report should be no more than 4 pages long (excluding references) and follow the BlackboxNLP 2025 formatting guidelines.
Task Details
MIB contains two tracks. The circuit localization track benchmarks methods that aim to locate graphs of causal dependencies in neural networks. The causal variable localization track benchmarks methods that aim to locate specific human-interpretable causal variables in neural networks.
Circuit Localization Track
This track benchmarks circuit discovery methods—i.e., methods for locating graphs of causal dependencies in neural networks. Most circuit discovery pipelines look something like this:
- Compute importance scores for each component or each edge between components.
- Ablate all components from the network except those that surpass some importance threshold, or those in the top k%.
- Evaluate how well the circuit (the model with only the most important components left unablated) performs, or how well it replicates the full model’s behavior.
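As a loose sketch of steps 2 and 3 under this paradigm (the attribution and ablation-evaluation functions named in the comments are placeholders, not part of the MIB codebase):

```python
def build_circuit(edge_scores: dict[str, float], top_k_percent: float) -> set[str]:
    """Step 2: keep only the edges whose importance score is in the top k%."""
    n_keep = max(1, int(len(edge_scores) * top_k_percent / 100))
    ranked = sorted(edge_scores, key=lambda e: abs(edge_scores[e]), reverse=True)
    return set(ranked[:n_keep])

# Hypothetical usage, with `attribute` (step 1) and `evaluate_with_ablation`
# (step 3) standing in for your attribution method and ablation harness:
# edge_scores = attribute(model, task_dataset)
# circuit = build_circuit(edge_scores, top_k_percent=1.0)
# faithfulness = evaluate_with_ablation(model, circuit, task_dataset)
```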
In the circuit localization track, participants are asked (but not required) to employ the MIB benchmark’s provided code to discover and evaluate LM circuits.
Two evaluation criteria are employed: (1) how well the circuit performs overall (the CPR metric), and (2) how well it replicates the full model’s behavior (the CMD metric). Past evaluations have often implicitly conflated these two; here, we follow MIB in treating them as complementary but separate metrics. More details on the evaluation are available in the MIB Circuit Localization Repository.
Causal Variable Localization Track
This track benchmarks featurization methods—i.e., methods for transforming model activations into a space where it’s easier to isolate a given causal variable. Most pipelines under this paradigm look like this:
- Curate a dataset of contrastive pairs, where each pair differs only with respect to the targeted causal variable.
- If using a supervised method, train the featurization method using the contrastive pairs.
- To evaluate: feed the model an input from a pair, use the featurizer to transform an activation vector, intervene in the transformed space, transform back out, and see whether the model’s new behavior aligns with what is expected under the intervention.
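As a schematic of the intervention step only (the featurizers are assumed to be callables mapping between activation and feature space, and the choice of which feature dimensions encode the variable is purely illustrative):

```python
import torch

def intervene_in_feature_space(featurizer, inverse_featurizer,
                               base_act: torch.Tensor,
                               source_act: torch.Tensor,
                               variable_dims: slice = slice(0, 16)) -> torch.Tensor:
    """Copy the feature dimensions assumed to encode the targeted causal
    variable from the source activation into the base activation, then map
    the result back into the model's activation space."""
    base_feat = featurizer(base_act)
    source_feat = featurizer(source_act)
    patched_feat = base_feat.clone()
    patched_feat[..., variable_dims] = source_feat[..., variable_dims]
    return inverse_featurizer(patched_feat)

# The patched activation would then be written back into the model's forward
# pass (e.g. via a hook) at the chosen layer and token position, and the new
# output compared against the behavior expected under the intervention.
```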
In the causal variable localization track, participants are asked (but not required) to employ the MIB benchmark’s provided code to train and evaluate featurizers.
Leaderboard
The leaderboard is public! See here. You may submit up to two submissions per week (excluding those that do not pass our formatting checks). Use the public validation and test sets to prototype your methods before you submit to the leaderboard.