Linear probe interpretability python. tex # LaTeX research paper ├── references.

Linear probe interpretability python Though SAEs require an upfront compute cost, researcher hours at iteration time are much more expensive than a couple dozen GPU hours. With interpretable models, organizations can build more reliable and ethical AI applications that users can trust and understand. Across three open‐source model families ranging from 7 to 70 billion parameters, projections on this “in-advance Dec 17, 2024 · Introduction Optimizing Model Interpretability with SHAP Values and Python is a crucial step in machine learning model development. We extract layer-wise attention head activations from medical-domain LLMs and use Ridge regression classifiers to predict relevant medical disciplines from clinical descriptions. You can find more To visualise probe outputs or better understand my work, check out probe_output_visualization. However, their ensemble nature—combining hundreds or thousands Oct 12, 2023 · In this research, the emergent world principle is scrutinized in a series of simple transformer models, trained to play Othello and termed as Othello-GPT (Appendix A). Linear probes are an essential interpretability tool used to analyze what information the transformer model has learned about various game features. 4% at layer 22. The basic idea is simple—a classifier is trained to predict some linguistic property from a model’s representations—and has been used to examine a wide variety of models and properties. In the current implementations in R and Python, for example, linear regression can be chosen as an interpretable surrogate model. bib # Bibliography ├── docs/ │ └── claude. The quiz contains 25 questions. Self-Influence Guided Data Reweighting for Language Model Pre-training] - An application of training data attribution methods to re-weight May 17, 2024 · Linear probing is a technique used in hash tables to handle collisions. py to train probes. All libraries on this best-of list are automatically ranked by a quality score based on a variety of metrics, such as GitHub stars, code activity, used license and other factors. matplotlib is a library to plot graphs in Python. It has commentary and many print statements to walk you through using a single probe and performing a single intervention. labo_train. py. We establish foundational concepts such as Apr 2, 2024 · Python Linear Regression Quiz Quiz will help you to test and validate your Python-Quizzes knowledge. 4️⃣ Training a Probe In this section, we'll look at how to actually train our linear probe. Sep 20, 2025 · Linear probes are widely used to interpret and evaluate learned representations, yet their reliability is often questioned: probes can appear accurate in some regimes but collapse unpredictably in others. It mitigates the problem that the linear probe itself does computation, even if it's just linear. Jan 31, 2023 · AIX360 The AI Explainability 360 toolkit is an open-source library that supports the interpretability and explainability of datasets and machine learning models. Nov 7, 2024 · We evaluate Logit Lens, Tuned Lens, sparse autoencoders, and linear probes, for these metrics on GPT2-small, Gemma2-2b, and Llama2-7b, comparing them to simpler but uninterpretable baselines of steering vectors and prompting. To visualise probe outputs, check out probe_output_visualization. This enhances model interpretability and provides insights into potential biases or limitations. py: RUN_TEST_SET = True USE_PIECE_BOARD_STATE = True Then run python train_test_chess. Interpretability Of course, SAEs were created for interpretability research, and we find that some of the most interesting applications of SAE probes are in providing interpretability insight. Contribute to bigsnarfdude/borescope development by creating an account on GitHub. I will analyze global interpretability – which analyzes the Jun 13, 2025 · Explore linear regression in machine learning to understand how it predicts outcomes using statistical modeling techniques. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. ├── src/ # Core implementation │ ├── extraction/ # Hidden state extraction │ ├── probing/ # Linear probing A suite of interpretability tasks to evaluate agents using Scribe for notebook access - goodfire-ai/scribe-task-suite Linear probes are an interpretability tool used to analyze what information is represented in the internal activations of the Othello-GPT model. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. However, recent studies have chess_llm_interpretability This evaluates LLMs trained on PGN format chess games through the use of linear probes. Interpretability is essential for: Model Apr 4, 2022 · Abstract. We learn a linear probe that retrieves the correct text representation of an object given a snippet of audio related to that object, where the sound representation is given by a pretrained audio model. It can also be used to debug models, explain predictions and enable auditing to meet compliance with regulatory requirements. Academic and industry papers on LLM interpretability. You just have to assess all the given options and click on the correct answer. With this package, you can train interpretable glassbox models and explain blackbox systems. Apr 23, 2024 · Linear probes were originally introduced in the context of image models but have since been widely applied to language models, including in explicitly safety-relevant applications such as measurement tampering. This is in contrast to the utilization of the non-linear MLP as probes. This work explores whether language models encode meaningfully grounded representations of sounds of objects. In the middle, we clip the probe outputs to turn the heatmap into a more binary visualization. . 2023) What can model interpretability give us? Oct 21, 2023 · 2. May 27, 2018 · Testing Linear Regression Assumptions in Python 20 minute read Checking model assumptions is like commenting code. For example, simple probes have shown language models to contain information about simple syntactical features like Part of Speech tags, and more complex probes have shown models to contain entire Parse trees of sentences. Use linear probes to find attention heads that correspond to the desired attribute Shift attention head activations during inference along directions determined by these probes More activation manipulation Contrastive steering vectors (Turner et al. tex # LaTeX research paper ├── references. , when two keys hash to the same index), linear probing searches for the next available slot in the hash table by incrementing the index until an empty slot is found. Learn local & global explanations, visualizations, and best practices for tree-based, linear & deep learning models. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. For tutorials and more information, visit the github page. When a collision occurs (i. ipynb and train_linear_probes_gpt. Neel Nanda describes a critique of linear probes as 3D: 1) You design what feature you're looking for, not getting the chance to find features from a model-first perspective. LIME Local Interpretable Model-agnostic Explanations (LIME) is a framework and technique designed to provide interpretability and insights into the predictions of complex machine learning While the linear representation hypothesis facilitates interpretability significantly,Sharkey et al. A Python package to handle the layout, geometry, and wiring of silicon probes for extracellular electrophysiology experiments. This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. ipynb. The approach enables interpretable analysis of which model components We train linear probes on the final token’s resid-ual stream representation to distinguish between high-quality transcriptions and hallucinations. Our results show that while existing methods allow for intervention, they are inconsistent across features and models. The implementation supports different probe configurations: ABSTRACT Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. It provides a comprehensive suite of tools for: Creating and managing datasets for probing experiments Collecting and storing model activations Training various types of probes (linear, logistic, PCA Dec 16, 2024 · These probes can be designed with varying levels of complexity. A higher K potentially produces models with higher fidelity. Jan 14, 2021 · Welcome to the post about the Top 5 Python ML Model Interpretability libraries! You ask yourself how we selected the libraries? Well, we took them from our Best of Machine Learning with Python list. Learn about the construction, utilization, and insights gained from linear probes, alongside their limitations and challenges. For information about training and creating probes, see Linear Probes and Target Functions. This work demonstrates a practical and actionable ap-plication of interpretability insights through a generator-agnostic observer paradigm: a linear probe on a trans-former’s residual-stream activations identifies contextual hallucinations in a single forward pass, achieving high F1 scores. Utilizing linear probes to decode neuron activations across transformer layers, coupled with causal interventions, this paper underscores the AI Safety - mechinterp experiments. ipynb The figures displaying training losses for probes and models are made in view_trained_GPT. Real-World Uses of Interpretability: Model interpretability-based techniques are starting to have genuine uses in frontier language models! Linear probes, one of the simplest possible techniques, are a highly competitive way to cheaply monitor systems for things like users trying to make bioweapons. Mar 28, 2023 · Omg idea! Maybe linear probes suck because it's turn based - internal repns don't actually care about white or black, but training the probe across game move breaks things in a way that needs smth non-linear to patch At this point my instincts said to go and validate the hypothesis properly, look at a bunch more neurons, etc. md # This file ├── AGENTS. This section is less mechanistic interpretability and more standard ML techniques, but it's still important to understand how to do this if you want to do any analysis like this yourself! Jul 31, 2025 · We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. ipynb and view_causality_single_tile Linear probes are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features. It covers a variety of questions, from basic to advanced. In the future, it would be interesting to use non-linear probes, such as decision trees or simple neural nets to disambiguate non-linear concepts. These trained models (Figure 1 a) exhibit proficiency in legal move execution. Introduction This project investigates multi-label medical code classification by training linear probes on Large Language Model (LLM) activations. Probity is a toolkit for interpretability research on neural networks, with a focus on analyzing internal representations through linear probing. linear probes etc) can be prone to generalisation illusions. The tool has separate pages for all 14,336 neurons in the 7 MLP layers of Othello-GPT that show: 1) A linear probe's activation directions for Here, we examine the interpretability of graphlet fingerprints using linear models by computing the atom-projections of the predictions and examining the qualitative agreement between structural trends in the projections and chemical intuition about solubility. We are not totally confident that our probes do measure their associated concept. Jun 17, 2024 · Limitations Interpretability Illusion Interpretability is known to have illusion issues and linear probing doesn’t make an exception. The lower K, the easier it is to interpret the model. py contains the preprocess and feature extraction functions. For example, we can visualize where the model \"thinks\" the white pawns are. Surprisingly, the results show that hallucinations can be accu-rately identified by linear probing the decoder’s residual stream, with maximum accuracy of 93. Each facet relates to a single column vector k from our probe which classifies color and piece type simultaneously. A failure to do either can result in a lot of time being confused, going down rabbit holes, and can have pretty serious consequences from the model not being interpreted correctly. For example, if we want to train a nonlinear probe with hidden size 64 on internal representations extracted from layer 6 of the Othello-GPT trained on the championship dataset, we can use the command python train_probe_othello. In this respect we aim to further improve and refine our dataset. They reveal how semantic content evolves across network depths, providing actionable insights for model interpretability and performance assessment. We establish founda-tional concepts such as How to interpret data science code in Python and R. Aug 12, 2025 · The success of simple linear probes suggests that bias is not an inscrutable, entangled property, but a feature that the model represents in a relatively straightforward manner, especially in its later layers. We identify the spectral mechanism behind this phenomenon and develop a spectral identifiability principle that serves as a practical diagnostic. py and data_lp. Let's start with all the necessary packages to implement this tutorial. Model Interpretability: Probing classifiers help shed light on how complex machine learning models represent and process different linguistic aspects. sh and a labo_test. InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof. Blackbox interpretability methods can extract explanations from any machine learning pipeline. Please Star the project to support us and Watch to always stay up-to-date! May 24, 2025 · Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. To visualise probe outputs or better understand my work, check out probe_output_visualization. numpy is the main package for scientific computing with Python. g. Linear Oct 24, 2024 · While linear probes are simple and interpretable, it is unable to disentangle features distributed features that combine in a non-linear way. ipynb The intervention studies shown in the pdf are all conducted in the scripts view_causality_complex. linear probe. LLM-Interpretability-Analysis/ ├── README. This work demonstrates a practical and actionable ap-plication of interpretability insights through a generator-agnostic observer paradigm: a linear probe on a trans-former’s residual-stream Existing automated interpretability methods can evaluate more sophisticated concepts by generating “adversarial examples” to probe the bounds of a feature’s responses, but those may also be subject to blind spots that are difficult to predict in advance. Linear Probing Framework Linear probes are the primary tool for detecting and analyzing emergent world representations in the model. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM’s internal acti-vations. (2022) and trained to play random, legal moves in the game Othello. The performance of the linear probe can be increased by curating a larger dataset but creating a dataset that matches the diversity of the billions of tokens the SAE was trained on is very non-trivial. Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. (2022a) warns against neglecting the potential role of non-linear representations. py is the interface to run all experiments, and utils. Aug 4, 2025 · Master SHAP model interpretability in Python. Written in C++. Nov 1, 2024 · We can also test the setting where we have imbalanced classes in the training data but balanced classes in the test set. Then we will use train_probe_othello. e. InterpretML helps you understand your model's global behavior, or understand the reasons behind individual predictions. Model interpretability is the ability to understand how a machine learning model works and why it makes certain predictions. sh is the bash file to run the linear probe. os provides a portable way of using operating system-dependent functionality, e. Likewise Aug 19, 2024 · A common technique in interpretability research for identifying feature directions is training linear probes with logistic regression . They handle non-linear relationships naturally, require minimal preprocessing, and often achieve state-of-the-art accuracy on tabular data. We can also perform interventions on the model's internal board state by deleting pieces from its internal world model. It empowers researchers and practitioners to understand what deep models see and focus on by offering tools to extract, probe, and build ad hoc XAI models for internal representations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among Resolves hash table collisions using linear probing, quadratic probing, and linear hashing. A Python library that encapsulates various methods for neuron interpretation and analysis in Deep NLP models. This repo can train, evaluate, and visualize linear probes on LLMs that have been trained to play chess with PGN strings. Sep 18, 2025 · Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model’s forthcoming answer will be correct. The train_test_chess. md # Detailed research analysis ├── outputs/ │ ├── lie-detector/ # Linear probe results BUTTER-Clarifier This repository contains a python package of neural network interpretability techniques (interpretability) and a keras callback to easily compute and capture data related to these techniques (we call these values metrics) during training. On the left, we have the actual white pawn location. In some cases, however, the direction identified by LR can fail to reflect an intuitive best guess for the feature direction, even in the absence of confounding features. This probe isolates a single, transferable linear direction separating hallucinated from Sep 19, 2024 · Non-linear probes have been alleged to have this property, and that is why a linear probe is entrusted with this task. md # Agent coordination log ├── paper. On the right, we have the full gradient of model common technique in interpretability research for identifying feature directions is training linear probes with logistic regression (LR; Alain & Bengio, 2018). To train a linear probe or test a saved probe on the test set, set these two variables at the bottom of train_test_chess. sh are the bash file to train and test LaBo. 2023; Rimsky et al. main. , modifying files/folders. We introduce the OthelloScope (OS), a web app for easily and intuitively navigating through the MLP layer neurons of the Othello-GPT Transformer model developed by Kenneth Li et al. Check out these methods and how to apply them using Python. Finally, good probing performance would hint at the presence of the said property, which has the potential of being used in making final decisions to choose a label in the farthest layer of the neural network. - GitHub - kikaymusic/Neuron-Level-Interpretability-and-Robustness-in-LLMs: A Python library that encapsulates various methods for neuron interpretation and analysis in Deep NLP models. Why InterpretML? Model Interpretability Model interpretability helps developers, data scientists and business stakeholders in the organization gain a comprehensive understanding of their machine learning models. This page documents the interpretability tools used to analyze Othello-GPT, a transformer-based model trained to play Othello. cv2 is a leading computer vision library. Interpretability Illusions in the Generalization of Simplified Models – Shows how interpretability methods based on simplied models (e. Mar 13, 2025 · The need for interpretability grows as machine learning impacts more areas of society. Feb 6, 2025 · To assess the interpretability of linear probes trained on SAE latents, we analyze the distribution of their coefficients (Figure 9) and manually inspect the latents with the largest coefficients in the InterProt visualizer 2. Oct 25, 2024 · This guide explores how adding a simple linear classifier to intermediate layers can reveal the encoded information and features critical for various tasks. Interpretability can be a big challenge with machine learning models, even a relatively simple multiple linear regression model. Abstract Understanding AI systems’ inner workings is critical for ensuring value alignment and safety. Optimized for efficient time and space complexity. The AI Explainability 360 Python package includes a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics. Official code repository for "Multilingual Safety Mechanistic Interpretability: A Comprehensive Analysis Across 10 Languages and 3 Models" (ICML 2026 Submission) . ipynb and view_trained_probes. We can check the LLMs internal understanding of board state and ability to estimate the skill level of the players involved. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as We contribute the following new insights: We first show that trained linear probes can accurately map the activation vectors of a GPT-model, pre-trained to play legal moves in the game Othello, to the current state of the othello board. This probe is trained via a contrastive loss that pushes the language Jun 6, 2023 · To discuss the interpretability of linear regression, I used the training dataset to fit a linear regression model and analyze its weights, R² and t-statistic. Specifically, when the Fisher information Jun 18, 2024 · Accumulated Local Effects (ALE): Guide to Model Interpretability Introduction Understanding the decision-making process of complex machine-learning models is crucial for trust, transparency, and … Dec 15, 2024 · Finding 3: The Necessity of Non-Linear Probes Linear probes consistently failed across post-SFT models, while MLP and Transformer probes succeeded, proving functional categories exist in complex, curved decision boundaries. All data structures implemented from scratch. This has profound implications for AI safety, suggesting that we can move beyond passive moderation to active, targeted intervention. Nov 22, 2022 · When people first think about linear regression they generally expect to see a model that is a straight line plotted through a target variable. In advance, you have to select K, the number of features you want to have in your interpretable model. Python enables data professionals to address concerns about fairness and accountability in AI systems. 6 days ago · Tree-based models like Random Forests, Gradient Boosting Machines, and XGBoost dominate machine learning competitions and real-world applications due to their powerful predictive performance. The system trains linear classifiers on model activations to predict specific board state properties. A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable algorithms and concepts, focusing on its relevance to AI safety. The scripts to train the Othello models and linear probes are contained in train_gpt_othello. These tools enable mechanistic interpretability - understanding the inter A common technique in interpretability research for identifying feature directions is training linear probes with logistic regression (LR) 2018. py --layer 6 --twolayer --mid_dim 64 --championship. Mar 6, 2022 · How to Interpret Linear Regression, Lasso, and Decision Tree with Python (easy) What are the most important features (Feature Importance)? Why did the model make this specific decision? 1. About Tests if small-scale probes could reliably pick up on sycophantic behavior in language models using mechanistic interpretability techniques 🔷 BlueGlass BlueGlass is an open-source framework for interpretability and analysis of vision-language and vision-only models. Everybody should be doing it often, but it sometimes ends up being overlooked in reality. This includes model ensembles, pre-processing steps, and complex models such as deep neural nets. By training simple linear classifiers to predict specific game features from neural network representations, we can understand what the model "knows" at each layer and how this information evolves Nov 21, 2025 · bergson Public Mapping out the "memory" of neural nets with data attribution interpretability influence-functions mechanistic-interpretability data-attribution Python • MIT License Other files: data. py script can be used to either train new linear probes or test a saved probe on the test set. chess_llm_interpretability This evaluates LLMs trained on PGN format chess games through the use of linear probes. Othello Game Logic Training Process Mechanistic Interpretability Tools Linear Probing System Probe Architecture Training Probes Intervention Experiments Intervention Methods Attribution via Intervention Data Management Dataset Structure Board State Representation Experiments and Results Circuit Analysis Workflows Visualization Techniques Then click "Run All" on lichess_data_filtering. By isolating layer-specific diagnostics, linear probes inform strategies for pruning, compression, and Jul 31, 2025 · Abstract Contextual hallucinations –- statements unsupported by given context –- remain a significant challenge in AI. py are the dataloaders for LaBo and Linear Probe, respectively. It's often imported with the np shortcut. Introduction In this article, I will try to interpret the Linear Regression, Lasso, and Decision Tree models which are inherently interpretable. But when you perform multiple linear regression we Jul 2, 2024 · Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Local Interpretable Model-Agnostic Explanations (lime) ¶ In this page, you can find the Python API reference for the lime package (local interpretable model-agnostic explanations). torch is a deep Cosine similarity between the column vectors of the main probe weight matrix and the linear combination of column vectors from the color and piece probes. Dec 13, 2022 · Svitla's Data Scientist goes in-depth on interpreting machine learning models using LIME and SHAP. topb atlwjhhh mfgpp guzrq bzmzfren ocj hvpgyu fzfhz sbzx ceoy uenv oxtnum kysb klmj zpma