BlackboxNLP 2025

Accepted Papers

Below is the list of accepted papers for the BlackboxNLP workshop, categorized by submission type. Click on paper titles to visualize their abstract.

Full Papers (28)

Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available.
We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings reveal a critical vulnerability that affects the model's architecture, leading to a concealed backdoor effect during the information flow. Our code and data are publicly available at https://github.com/himanshubeniwal/X-BAT.
It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work, and a practical challenge, because opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the previous tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, that tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.
Language Models, when generating prepositional phrases, must often decide for whether their complements functions as an instrumental adjunct (describing the verb adverbially) or an attributive modifier (enriching the noun adjectivally), yet the internal mechanisms that resolve this split decision remain poorly understood. In this study, we conduct a targeted investigation into Gemma-2 to uncover and control the generation of prepositional complements. We assemble a prompt suite containing with-headed prepositional phrases whose contexts equally accommodate either an instrumental or attributive continuation, revealing a strong preference for an instrumental reading at a ratio of 3:4. To pinpoint individual attention heads that favor instrumental over attributive complements, we project activations into the vocabulary space. By scaling the value vector of a single attention head, we can shift the distribution of functional roles of complements, attenuating instruments to 33% while elevating attributes to 36%.
As large language models (LLMs) are increasingly used as evaluators for natural language generation tasks, ensuring unbiased assessments is essential. However, LLM evaluators often display biased preferences, such as favoring verbosity and authoritative tones. Our empirical analysis reveals that these biases are exacerbated in pairwise evaluation, where LLMs directly compare two outputs and easily prioritize superficial attributes. In contrast, pointwise evaluation, which assesses outputs independently, is less susceptible to such bias because each output is judged in isolation. To address the limitations of the pairwise evaluation, we introduce a novel evaluation method, \textsc{PRePair}, which integrates pointwise reasoning within a pairwise framework. PRePair effectively alleviates biased preference, improving performance on the adversarial benchmark (LLMBar) while outperforming pointwise evaluation on the standard benchmark (MT-Bench).
We study last-layer outlier dimensions, i.e. dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
This paper investigates the language dominance hypothesis in multilingual large language models (LLMs), which posits that cross-lingual understanding is facilitated by an implicit translation into a dominant language seen more frequently during pretraining. We propose a novel approach to quantify how languages influence one another in a language model. By analyzing the hidden states across intermediate layers of language models, we model interactions between language-specific embedding spaces using Gaussian Mixture Models. Our results reveal only weak signs of language dominance in middle layers, affecting only a fraction of tokens. Our findings suggest that multilingual processing in LLMs is better explained by language-specific and shared representational spaces rather than internal translation into a single dominant language.
Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.
Parameter-efficient methods like LoRA have revolutionised large language model (LLM) fine-tuning. ReLoRA extends this idea to pretraining by repeatedly merging and reinitialising low-rank adapters, increasing cumulative rank while keeping updates cheap. This aligns well with observations that high-capacity models learn through locally low-rank trajectories that expand over time. By contrast, recent work suggests that small language models (SLMs) exhibit rank deficiencies and under-utilise their available dimensionality. This raises a natural question: can ReLoRA's rank-expanding update rule steer SLMs toward healthier learning dynamics, mitigating rank bottlenecks in a capacity-constrained regime? We argue SLMs are an ideal testbed: they train quickly, enable controlled ablations, and make rank phenomena more measurable. We present the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Across loss, Paloma perplexity, and BLiMP, we find that ReLoRA underperforms full-rank training, with gaps widening at larger scales. Analysis of proportional effective rank and condition numbers shows that ReLoRA amplifies existing rank deficiencies and induces ill-conditioned updates early in training. Our results suggest that while ReLoRA's merge-and-restart strategy can expand ranks in larger models, it does not straightforwardly translate to capacity-limited SLMs, motivating adaptive-rank or hybrid-rank approaches for low-compute pretraining.
Leave-One-Out (LOO) provides an intuitive measure of feature importance but is computationally prohibitive. While Layer-Wise Relevance Propagation (LRP) offers a potentially efficient alternative, its axiomatic soundness in modern Transformers remains under-examined. In this work, we first show that the bilinear propagation rules used in recent advances of AttnLRP violate implementation invariance. We prove this analytically and confirm it empirically in linear attention layers. Second, we also revisit CP-LRP as a diagnostic baseline and find that bypassing relevance propagation through the softmax layer---back-propagating relevance only through the value matrices---significantly improves alignment with LOO, particularly in the middle-to-late Transformer layers. Overall, our results suggest that (i) bilinear factorization sensitivity and (ii) softmax propagation error potentially jointly undermine LRP’s ability to approximate LOO in Transformers.
This study conducts a detailed analysis of the side effects of rank-one knowledge editing using language models with controlled knowledge. The analysis focuses on each element of knowledge triples (subject, relation, object) and examines two aspects: "knowledge that causes large side effects when edited" and "knowledge that is affected by the side effects." Our findings suggest that editing knowledge with subjects that have relationships with numerous objects or are robustly embedded within the LM may trigger extensive side effects. Furthermore, we demonstrate that the similarity between relation vectors, the density of object vectors, and the distortion of knowledge representations are closely related to how susceptible knowledge is to editing influences. The findings of this research provide new insights into the mechanisms of side effects in LM knowledge editing and indicate specific directions for developing more effective and reliable knowledge editing methods.
Large language models (LLMs) are increasingly deployed in collaborative settings, yet little is known about how they coordinate when treated as black-box agents. We simulate 7,500 multi-agent, multi-round discussions in an inductive coding task, generating over 125,000 utterances that capture both final annotations and their interactional histories. We introduce process-level metrics—code stability, semantic self-consistency, and lexical confidence—alongside sentiment and convergence measures, to track coordination dynamics. To probe deeper alignment signals, we analyze the evolving geometry of output embeddings, showing that intrinsic dimensionality declines over rounds, suggesting semantic compression. The results reveal that LLM groups converge lexically and semantically, develop asymmetric influence patterns, and exhibit negotiation-like behaviors despite the absence of explicit role prompting. This work demonstrates how black-box interaction analysis can surface emergent coordination strategies, offering a scalable complement to internal probe-based interpretability methods.
Large Language Models (LLMs) achieve impressive natural language processing performance but can memorize and leak Personally Identifiable Information (PII), posing serious privacy risks. Existing mitigation strategies—such as differential privacy and neuron-level interventions—often degrade utility or fail to reliably prevent leakage. We present PrivacyScalpel, a privacy-preserving framework that leverages LLM interpretability to identify and suppress PII leakage while preserving performance. PrivacyScalpel operates in three stages: (1) Feature Probing to locate model layers encoding PII-rich representations; (2) Sparse Autoencoding using a k-Sparse Autoencoder (k-SAE) to disentangle and isolate privacy-sensitive features; and (3) Feature-Level Interventions via targeted ablation and vector steering to reduce leakage. Experiments on Gemma2-2B and Llama2-7B fine-tuned with the Enron dataset show that PrivacyScalpel reduces email leakage from 5.15% to 0.0% while retaining over 99.4% of the original utility. Compared to neuron-level methods, our approach achieves a superior privacy–utility trade-off, highlighting the effectiveness of targeting sparse, monosemantic features over polysemantic neurons. Beyond privacy gains, PrivacyScalpel offers interpretability insights into PII memorization mechanisms, contributing to safer and more transparent LLM deployment.
Feature circuits aim to shed light on LLM behavior by identifying the features that are causally responsible for a given LLM output, and connecting them into a directed graph, or *circuit*, that explains how both each feature and each output arose. However, performing circuit analysis is challenging: the tools for finding, visualizing, and verifying feature circuits are complex and spread across multiple libraries. To facilitate feature-circuit finding, we introduce `circuit-tracer`, an open-source library for efficient identification of feature circuits. `circuit-tracer` provides an integrated pipeline for finding, visualizing, annotating, and performing interventions on such feature circuits, tested with various model sizes, up to 14B parameters. We make `circuit-tracer` available to both developers and end users, via integration with tools such as Neuronpedia, which provides a user-friendly interface.
Autoregressive large language models (LLMs) exhibit impressive performance across various tasks but struggle with simple arithmetic, such as additions of two or more operands. We show that this struggle arises from LLMs’ use of a simple one-digit lookahead heuristic, which forms an upper bound for LLM performance and accounts for characteristic error patterns in two-operand addition and failure in multi-operand addition, where the carry-over logic is more complex. Our probing experiments and digit-wise accuracy evaluation show that the evaluated LLMs fail precisely where a one-digit lookahead is insufficient to account for cascading carries. We analyze the impact of tokenization strategies on arithmetic performance and show that all investigated models, regardless of tokenization and size, are inherently limited in the addition of multiple operands due to their reliance on a one-digit lookahead heuristic. Our findings reveal limitations that prevent LLMs from generalizing to more complex numerical reasoning.
Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? and (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.
Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All data and code will be publicly released.
Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impacts behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena – particularly in morphology and syntax – can surpass the parent models. Moreover, we find weak ranking correlation between this behavior and internal evaluation. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.
Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge the Chinese classifiers is an issue that has largely remain unexplored in the Natural Language Processing (NLP) literature. To address such a question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and the working of the attention mechanisms during prediction. Besides, we explore fine-tuning for LLMs to enhance the classifier performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.
In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets by an end-to-end paradigm with massive computational costs. To reduce such costs, in this paper, we propose Attention Behavior Fine-Tuning (ABFT), utilizing the previous findings on the inner mechanism of ICL, building training objectives on the attention scores instead of the final outputs, to force the attention scores to focus on the correct label tokens presented in the context and mitigate attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms in performance, robustness, unbiasedness, and efficiency, with only around 0.01% data cost compared to the previous methods. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting the implicit bias of ICL-style data to the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.
Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics. Even minor changes to syntax, such as renaming variables, can significantly degrade performance across various tasks. In this work, we examine the task of *type prediction*: given a partially typed program, can a model predict a missing type annotations such that the resulting program is more typed? We construct a dataset of adversarial examples where models initially predict the correct types, but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code. This lack of robustness suggests that models may have a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction—though these mechanisms often fail to activate in adversarial scenarios. By using *activation steering*, a method that manipulates a model’s internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism that is shared by both Python and TypeScript, and is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.
Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely recognized in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Adopting a feature attribution framework, we propose the first method to obtain contrastive explanations in S2T by analyzing how specific regions of the input spectrogram influence the choice between alternative outputs. Through a case study on gender translation in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another.
In recent years, large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks. Nevertheless, their capacity to process and reflect core human experiences remains underexplored. Current benchmarks for LLM evaluation typically focus on a single aspect of linguistic understanding, thus failing to capture the full breadth of its abstract reasoning about the world. To address this gap, we propose a multidimensional paradigm to investigate the capacity of LLMs to perceive the world through temporal, spatial, sentimental, and causal aspects. We conduct extensive experiments by partitioning datasets according to different distributions and employing various prompting strategies. Our findings reveal significant differences and shortcomings in how LLMs handle temporal granularity, multi-hop spatial reasoning, subtle sentiments, and implicit causal relationships. While sophisticated prompting approaches can mitigate some of these limitations, substantial challenges persist in effectively capturing abstract human perception. We aspire that this work, which assesses LLMs from multiple perspectives of human understanding of the world, will guide more instructive research on the LLMs’ perception or cognition.
Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging task vectors, allows to consistently generate multi-token mentions from various entity representations derived from the LLMs hidden states. We thus introduce the Entity Lens, extending the logit-lens to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training.
Large language models (LLMs) often fail to generate text in the intended target language, particularly in non-English interactions. Con- currently, recent work has explored Language Neuron Intervention (LNI) as a promising tech- nique for steering output language. In this paper, we re-evaluate LNI in more practical scenarios—specifically with instruction-tuned models and prompts that explicitly specify the target language. Our experiments show that while LNI also shows potential in such practi- cal scenarios, its average effect is limited and unstable across models and tasks, with a 0.83% reduction in undesired language output and a 0.1% improvement in performance. Our further analysis identifies two key factors for LNI’s limitation: (1) LNI affects both the output lan- guage and the content semantics, making it hard to control one without affecting the other, which explains the weak performance gains. (2) LNI increases the target language token proba- bilities, but they often remain below the top-1 generation threshold, resulting in failure to gen- erate the target language in most cases. Our results highlight both the potential and limi- tations of LNI, paving the way for future im- provements
It is a longstanding challenge to understand how neural models perform mathematical reasoning. Recent mechanistic interpretability work indicates that large language models (LLMs) use a ``bag of heuristics'' in middle to late-layer MLP neurons for arithmetic, where each heuristic promotes logits for specific numerical patterns. Building on this, we aim for fine-grained manipulation of these heuristic neurons to causally steer model predictions towards specific arithmetic outcomes, moving beyond simply disrupting accuracy. This paper presents a methodology that enables the systematic identification and causal manipulation of heuristic neurons, which is applied to the addition task in this study. We train a linear classifier to predict heuristics based on activation values, achieving over 90% classification accuracy. The trained classifier also allows us to rank neurons by their importance to a given heuristic. By targeting a small set of top-ranked neurons (K=50), we demonstrate high success rates—over 80% for the ones place and nearly 70% for the tens place—in controlling addition outcomes. This manipulation is achieved by transforming the activation of identified neurons into specific target heuristics by zeroing out source-heuristic neurons and adjusting target-heuristic neurons towards their class activation centroids. We explain these results by hypothesizing that high-ranking neurons possess 'cleaner channels' for their heuristics, supported by Signal-to-Noise Ratio (SNR) analysis where these neurons show higher SNR scores. Our work offers a robust approach to dissect, causally test, and precisely influence LLM arithmetic, advancing understanding of their internal mechanisms.
Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.

Shared Task Papers (4)

Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.
One of the main challenges in mechanistic interpretability is circuit discovery -- determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models.
Understanding why large language models (LLMs) exhibit certain behaviors is the goal of mechanistic interpretability. One of the major tools employed by mechanistic interpretability is circuit discovery, i.e., identifying a subset of the model's components responsible for a given task. We present a novel circuit discovery technique called IPE (Isolating Path Effects) that, unlike traditional edge-centric approaches, aims to identify entire computational paths (from input embeddings to output logits) responsible for certain model behaviors. Our method modifies the messages passed between nodes along a given path in such a way as to either precisely remove the effects of the entire path (i.e., ablate it) or to replace the path's effects with those that would have been generated by a counterfactual input. IPE is different from current path-patching or edge activation-patching techniques since they are not ablating single paths, but rather a set of paths sharing certain edges, preventing more precise tracing of information flow. We apply our method to the well-known Indirect Object Identification (IOI) task, recovering the canonical circuit reported in prior work. On the MIB workshop leaderboard, we tested IOI and MCQA tasks on GPT2-small and Qwen2.5. For GPT2, path counterfactual replacement outperformed path ablation as expected and led to top-ranking results, while for Qwen, no significant differences were observed, indicating a need for larger experiments to distinguish the two approaches.
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods—e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model–task combinations.

Non-archival Abstracts (26)

In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. We find that ARES can reliably detect propagated reasoning errors that other baselines fail to find with probabilistic guarantees.
This work advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control — disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control — isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. Our objective is to move toward LMs whose internal semantic representations can be systematically interpreted, precisely shaped, and reliably directed.
Quantifier-negation constructions such as “every vote doesn’t count” are structurally ambiguous, admitting both surface (no vote counts) and inverse (not all votes count) interpretations. While prior work has shown that human interpretations of such utterances are context-sensitive, it remains unclear whether pre-trained language models (LMs) exhibit similar pragmatic sensitivity. We present the first large-scale study comparing human and LM behavior on naturally occurring emph{every}-negation ambiguities. GPT-2 shows a rational preference for the ambiguous form when context strongly supports disambiguation. However, model-estimated event probabilities diverge from human estimates, revealing a gap in representational alignment. Our findings highlight both the contextual reasoning capabilities and current limitations of LMs in pragmatic scope resolution, offering new benchmarks for evaluating discourse-level inference in generative models.
We propose Fairness Interventions at Runtime and Model-training (FIRM), a unified framework for mitigating bias in modern NLP models. Such models often encode and perpetuate bias in opaque ways. Furthermore, existing mitigation approaches typically treat models as black boxes and apply static interventions that do not persist as the model evolves. Using paired counterfactual input, we apply causal tracking techniques to localize internal features (attention layer, attention heads and neurons) that are causally responsible for biased behavior. Once identified, we intervene through (1) pinpoint tuning that finetunes only the biased components during training, and (2) Debiased Steering Vectors (DSVs) that suppress biased representations during inference. FIRM achieves up to 47.6% mean bias reduction across multiple datasets.
Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might strengthen or weaken its presence in the residual stream. We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that strengthening neurons dominate in early-middle layers whereas later layers tend more towards weakening. We describe various ongoing efforts to better understand this phenomenon: qualitative neuron case studies, comparison with activation frequencies, and ablation experiments.
Moderating in-game chat is challenging due to nuanced toxicity, ambiguous categories, and inconsistent guidelines. We present ToxiSight, a human-AI collaborative annotation platform designed for harmful and toxic in-game chat, paired with a methodology for analyzing annotator behavior to assess label accuracy, confidence, and category suitability (e.g., split, clarify, or drop). Drawing from cognitive psychology and human–computer interaction, we measure reaction time and widget usage to identify categories needing clarification, guide policy refinement, and evaluate the utility of explanation-based support tools. Early findings with content moderators reveal distinct speed–accuracy trade-offs across toxicity categories, highlighting the potential of behavioral analysis to improve both annotation quality and moderation policy design.
Existing automated interpretability methods for Large Language Models (LLMs) often infer feature meanings by analyzing activations with another LLM, but they suffer from high computational cost, dataset dependence, prompt sensitivity, and explainer bias. Transcoders provide a promising direction for automated interpretability, since their architecture allows separating input-dependent and input-invariant components of feature attributions. We investigate whether analyzing only the input-invariant component (weights) yields meaningful interpretations for token-based features, reducing reliance on external explainers, and develop the WeightLens framework. Experiments show that in the chosen setting, WeightLens performs comparably to, or even better than, activation-based methods, suggesting a low-cost complement to existing approaches.
Large language models improve at mathematical reasoning after instruction tuning, reinforcement learning, or knowledge distillation. However, it is unclear whether these improvements result from major changes in the transformer layers or from minor adjustments that preserve the base model’s layer importance structure. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning with verifiable rewards (RLVR) trained variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, and this structure persists across all post-training paradigms. Removing such layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no such critical layers. This distinction suggests that mathematical reasoning relies on specialized layers that emerge during pre-training and stay unchanged under various post-training methods, whereas other non-reasoning tasks do not exhibit any critical layers.
As LLMs are widely used in various applications, the demand for explainability for LLMs is also increasing. Local explanations explain the model's behavior in a local neighborhood around a specific input. However, existing local explanation methods often fail to faithfully reflect the model's behavior, especially when LLMs can process long sentence inputs. Therefore, in this paper, we propose that selecting a proper neighborhood is crucial to generate faithful local explanations for LLMs, and we propose a novel method to curate the neighborhood of local explanations for LLMs by using a smaller proxy model, which can provide more faithful explanations with minimal additional computational cost. We conducted a primary empirical study to evaluate the effectiveness of our method.
Structural priming—the repetition of a recently processed syntactic configuration—appears in both humans and large language models (LLMs), within and across languages. Prior work takes monolingual priming to suggest abstract, form-independent grammatical representations, while cross-lingual priming hints at partially shared, language-general structure. Using logit lens, we show that the priming signal gets progressively amplified across layers. With head attribution, we identify top-k heads most associated with priming and find that they overlap with heads supporting induction and cross-lingual translation: ablating these heads yields steep drops on both measures. Monolingual priming appears largely driven by induction, whereas cross-lingual priming additionally uses translation-oriented heads that stabilize parallel structure across languages. Finally, zooming in on the most influential head, we show its activation reliably shifts the probability that a target sentence adopts the prime’s construction—effectively acting as a “priming dial” for structural priming.
We propose Accelerated Path Patching (APP), a hybrid approach combining attention head pruning with path patching to identify circuits in large language models (LLMs). APP reduces the path patching search space by over 50% while retaining substantial overlap with established path patching circuits and preserving 66.8–82.7% of the original model’s performance across four interpretability tasks.
As LLMs are deployed in knowledge-intensive settings, professionals need confidence that a model’s reasoning matches domain expertise. Current explanation evaluations focus on plausibility or internal faithfulness, often overlooking alignment with expert intuition. We define expert alignment as a key criterion for evaluating explanations and introduce T-FIX, a benchmark designed to evaluate how well LLM explanations align with expert judgment across seven knowledge-intensive fields.
In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of extit{explainability} and extit{privacy}. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving extit{both} explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of extit{Differential Privacy} (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.
State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM's hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M--1.4B) on sequences ranging from 4--256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba's pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba's ability to retain important information.
Large language models have been shown to exhibit misalignment: unexpected and unintended behaviors that emerge across a wide range of domains despite finetuning on benign datasets. We propose identifying relational linguistic structures, such as mechanisms that bind entities and attributes conditionally based on whether the model is misaligned or not. As an example, the binding mechanism binds colors 'green' or 'red' to an object, like 'mushroom', when the model is provided the context that one of them is harmful and the other is not. Using mean ablation interventions, we isolate the binding mechanisms that allow models to produce aligned and misaligned responses. Our comparisons of the two mechanisms show that they are distinct, suggesting that the linguistic structures that govern misalignment are separate and uniquely identifiable. Our approach serves as a latent indicator for misalignment.
Mental manipulation detection is challenging due to the subjective nature of psychological patterns and the need for transparent AI systems. Studies show that large language models perform poorly on this task, are computationally expensive, and may generate unreliable explanations, making them unsuitable for practical deployment where practitioners need transparent models with clear reasoning. We propose an explainability validation framework for mental manipulation detection using compact transformer models. Our approach fine-tunes efficient DistilRoBERTa models and applies six XAI methods to analyze multi-granular decision patterns using 4,000 annotated movie dialogues and ~26,000 sentences. Unlike previous work testing only basic performance, our framework provides dual validation through detection performance and explainability analysis with human validation. We compare our MIND framework against classical ML baselines, GPT-4.1-mini (zero/few-shot), and iterative feedback agents while measuring XAI method consensus using similarity metrics. Our results show that compact models achieve comparable performance while maintaining computational efficiency and explainability. Explainability analysis shows substantial variation in method consensus, with Raw Attention and Integrated Gradients showing highest agreement (0.87-0.94 cosine similarity). This work providing both performance and explainability needed for trustworthy AI in psychological domains.
Since the advent of the Transformer architecture (Vaswani et al., 2017), the capabilities of Large Language Models (LLMs) have rapidly advanced, driving the emergence of LLM-as-a-Judge systems that enable automated evaluation of model outputs at scale without the need for costly manual annotations (Zheng et al., 2023). However, similarly to generative models, LLM-as-a-Judge are prone to hallucinations, defined as content that is not supported by the input or underlying facts (Maynez et al., 2020; Filippova, 2020; Ji et al., 2023). Thus, there is an increasing need for efficient and generalizable methods to detect and mitigate hallucinations in LLM-as-a-Judge systems to ensure that large-scale evaluations remain accurate and trustworthy. A promising approach for detecting LLM hallucinations is semantic entropy (SE), which quantifies the variability in meaning across numerous model outputs generated from the same input prompt (Farquhar et al., 2024). If the outputs vary substantially in their conveyed meanings, the resulting high semantic entropy can serve as a strong indicator of hallucination or uncertainty. However, computing semantic entropy requires multiple forward passes for each input, which restricts its applicability in real-world, large-scale settings. Recent work on Semantic Entropy Probes (SEPs) (Kossen et al., 2024) demonstrates that semantic entropy can be calculated from a model’s hidden states, enabling the training of a simple linear probe to predict the semantic entropy for a given prompt. This approach eliminates the need for multiple generations per input, requiring only a single forward pass, and thus greatly reduces computational cost while preserving detection effectiveness. In this work, we present Semantic Entropy Probes for LLM Evaluators (SCOPE), a novel extension of the SEP framework to LLM-as-a-Judge tasks. Our approach applies semantic entropy probing to detect and mitigate hallucinations in LLM-as-a-Judge evaluations, focusing on rating–rationale tasks. We train a linear probe to predict semantic entropy labels from hidden states. Our method requires only a single forward pass per example, enabling scalable and efficient reliability assessment of LLM-as-a-Judge systems without repeated sampling. Beyond hallucination detection, SCOPE further supports an adaptive pipeline for mitigating uncertain judgments.
Latent reasoning language models aim to improve reasoning efficiency by computing in continuous hidden space rather than explicit text, but the opacity of these internal processes poses major challenges for interpretability and trust. We present a mechanistic case study of CODI (Continuous Chain-of-Thought via Self-Distillation), a latent reasoning model that solves problems by chaining 'latent thoughts'. Using attention analysis, SAE based probing, activation patching, and causal interventions, we uncover a structured 'scratchpad computation' cycle: even numbered steps serve as scratchpads for storing numerical information, while odd numbered steps perform the corresponding operations. Our experiments show that interventions on numerical features disrupt performance most strongly at scratchpad steps, while forcing early answers produces accuracy jumps after computation steps. Together, these results provide a mechanistic account of latent reasoning as an alternating algorithm, demonstrating that non linguistic thought in LLMs can follow systematic, interpretable patterns. By revealing structure in an otherwise opaque process, this work lays the groundwork for auditing latent reasoning models and integrating them more safely into critical applications.
Text-to-image diffusion models have achieved impressive generative performance, yet their internal mechanisms remain largely opaque. We introduce a lightweight interpretability toolkit that adapts causal and visual analysis methods—such as activation patching, sparse autoencoder feature discovery, and targeted interventions—to latent diffusion models like Stable Diffusion. The system is designed for hypothesis-driven research: users can visualize how prompts ground into generated images, manipulate intermediate activations to test causal effects, and probe the temporal emergence of semantic concepts across denoising steps. To illustrate its use, we outline a proposed case study on semantic image editing via text-derived steering vectors, demonstrating how our framework could support fine-grained analysis of concept representations and principled evaluation of editing interventions.
Instruction-tuned large language models (LLMs) often exhibit sycophancy, a tendency to align with a user's stated opinion, even when it is factually incorrect. Prior inference-time mitigations typically rely on prompt heuristics or dense activation steering using precomputed global directions, which assume that the sycophancy direction in activation space is stationary across prompts. We hypothesize that sycophancy is prompt-dependent, requiring input-conditioned mitigation. We propose Sparse Activation Fusion (SAF), an inference-time method that mitigates sycophancy by dynamically estimating and counteracting user-induced bias for each query within a sparse feature space, allowing precise, targeted control over model behavior. Evaluated on the SycophancyEval QnA setup with opinion cues, SAF reduces sycophancy rate from 63% to 39%, doubling accuracy when the user’s opinion is wrong, while maintaining comparable performance when the user is correct. Our results demonstrate that prompt-specific, sparse activation fusion offers a promising, interpretable approach to improving LLM truthfulness without retraining.
In this work, we investigate a mechanistic interpretability approach to mitigate sycophancy in instruction-tuned language models. We identify a 'pressure' direction within the activation space of multiple transformer layers, corresponding to the model's state when its initial answer is challenged. By steering activations along an anti-sycophancy direction in the residual stream of a targeted subset of layers during inference, we can bring the model towards more consistent and truthful behavior. Our preliminary results show a dramatic reduction in sycophancy; we reduce the rate of responses where the model admits false positives as correct from 78.0% to 0.0% on the SycophancyEval Trivia benchmark, while preserving baseline accuracy. This intervention demonstrates a targeted and effective method for improving the robustness of the model.
Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse domains, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure—such as hand rank—and stochastic features—such as winning equity—without explicit supervision. Linear and nonlinear probes reveal that these representations are linearly decodable and correlate with theoretical belief states, suggesting that LLMs spontaneously approximate Bayesian updates over hidden game variables. Furthermore, visualization of activation spaces shows that the model organizes poker states into meaningful clusters aligned with game dynamics. These findings provide evidence that LLMs not only encode latent world representations in incomplete-information settings but also maintain structured belief states that support probabilistic reasoning.
As large language models (LLMs) continue to scale in reasoning capability, the mechanisms by which their internal representations encode Chain-of-Thought (CoT) behaviors remain poorly understood. Prior work has utilized Pivotal Token Search to identify pivotal tokens—intermediate logits within a reasoning trajectory that substantially influence the probability of a correct final answer to a given query—but none have formulated methods for interpreting these tokens. In this work, we present a novel linear probing method to predict and classify pivotal features underlying Qwen3-0.6B activations. We find that these features correspond to consistent, weighted directions in the latent space across multiple layers, such that attending to these directions may elicit different behaviors in the language model's reasoning path.
Sycophancy is a key behavioral risk in LLMs, yet is often treated as an isolated failure mode that occurs with a single causal mechanism. We instead propose modeling it as geometric and causal compositions of psychometric traits such as emotionality, openness, and agreeableness, similar to factor decomposition in psychometrics. Using Contrastive Activation Addition (CAA) (Panickssery et al., 2024), we map activation direction to these factors and study how different combinations may give rise to sycophancy (e.g., high extraversion combined with low conscientiousness). This perspective allows for interpretable and compositional vector based interventions like addition, subtraction and projection; that may be used to mitigate safety-critical behaviors in LLMs.
Prompt performance varies widely across large language models (LLMs), yet the drivers of this variance remain opaque. We introduce Prompt Genotyping, a lightweight framework that represents prompts with 14 interpretable lexical, structural, semantic-proxy, and domain features and learns to predict downstream LLM outcomes. Using 1,112 real prompts from MT-Bench and HELM, plus 1,388 structured-control prompts, we train XGBoost meta-models in two regimes. On the structured control, the regressor explains ≈86–88% of variance; on real prompts it explains R² = 0.460. Casting the task as “hard-prompt failure” yields F1 = 0.56 and ROC-AUC = 0.65 on a held-out set. The predictability gap highlights the limits of surface features in the wild, while providing a leakage-audited baseline and open resources to spur systematic study of prompt quality.
This paper presents a graph-based pipeline for AI-text detection that encodes documents as Integrated Syntactic Graphs and classifies them with a Graph Transformer Network. We use GNNExplainer to obtain sparse subgraphs and aggregate them into class-level patterns, with faithfulness checked via logit-drop ablations. Our primary goal is interpretability: we report work-in-progress results that clarify which relational structures the model relies on and set the stage for broader analyses.

BlackboxNLP

Accepted Papers

Full Papers (28)

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs

Evil twins are not that evil: Qualitative insights into machine-generated prompts

Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators

Not a nuisance but a useful heuristic: Outlier dimensions favor frequent tokens in language models

Language Dominance in Multilingual Large Language Models

Interpreting Language Models Through Concept Descriptions: A Survey

Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

When LRP Diverges from Leave-One-Out in Transformers

Understanding the Side Effects of Rank-One Knowledge Editing

Emergent Convergence in Multi-Agent LLM Annotation

PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Circuit-Tracer: A New Library for Finding Feature Circuits

The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs

Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference

Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives

A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

A Pipeline to Assess Merging Methods via Behavior and Internals

From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models

Mechanistic Fine-tuning for In-context Learning

Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models

Exploring Large Language Models’ World Perception: A Multi-Dimensional Evaluation through Data Distribution

On the Representations of Entities in Auto-regressive Large Language Models

Can Language Neuron Intervention Reduce Non-Target Language Output?

Fine-Grained Manipulation of Arithmetic Neurons

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Shared Task Papers (4)

Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection

BlackboxNLP-2025 MIB Shared Task: IPE: Isolating Path Effects for Improving Latent Circuit Identification

BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Non-archival Abstracts (26)

Probabilistic Soundness Guarantees in LLM Reasoning Chains

Formal Semantic Control over Language Models

GPT-2 prefers ambiguous utterances following informative contexts

FIRM: Fairness Interventions at Runtime and Model-training

Understanding the Read-Write Functionalities of Gated Neurons in Transformers

ToxiSight: Behavioral Insights for Human-AI Toxicity Annotation

WeightLens: Input-Independent Interpretability for LLM Transcoders

Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

Neighborhood Selection is Critical for Explaining Large Language Models

Mechanisms of In-Context Syntactic Generalization in Language Models

Accelerating Path Patching with Head Pruning for Efficient Circuit Discovery

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Can Explainability and Privacy Coexist in Natural Language Processing? An Empirical Study

Characterizing Mamba's Selective Memory using Auto-Encoders

Towards discovering linguistic indicators for misalignment in language models

MIND: Multi-Granular INterpretable Detection of Mental Manipulation

SCOPE: Semantic Entropy Probes for LLM-as-a-Judge

Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models

Probing the Diffusion: An Interpretability Toolkit for Text-to-Image Models

Mitigating Sycophancy in Language Models via Sparse Activation Fusion

Mitigating Sycophancy in Language Models via Multi-Layer Activation Steering

Emergent World Beliefs: Exploring Transformers in Stochastic Games

Pivotal Tokens Encode Reasoning Shifts in Large Language Models

Sycophancy as compositions of Atomic Psychometric Traits

Prompt Genotyping: A Large-Scale Meta-Analysis of Linguistic and Structural Features Predictive of LLM Performance

Under the Hood of Graph Transformers: Explanations and Linguistic Probing on ISGs