Representation Engineering: A Top-Down Approach to AI Transparency
Introduction to Representation Engineering (RepE)
Representation Engineering (RepE) is a research paradigm for understanding and controlling Large Language Models (LLMs) that complements Mechanistic Interpretability (MI). While MI primarily operates at the level of circuits (attention heads, pathways, and individual neurons), RepE works at the level of representational spaces: the inner activations at each layer. Both approaches aim to make LLM behavior more transparent and controllable.
Some potential applications of Representation Engineering include:
- Jailbreaking: Identifying prompts that bypass model restrictions and controlling the model's internal response to them.
- Motion Forecasting: Predicting movement patterns based on representation spaces.
For more insights, refer to these resources:
- Representation Engineering Blog by vgel.me
- An Introduction to Representation Engineering on Alignment Forum
Structure of the RepE Paper
The RepE paper is divided into two primary sections: Representation Reading and Representation Control.
Part 1: Representation Reading
Representation Reading aims to identify a reading vector in the activation space that aligns with a specific high-level concept, behavior, or function. This vector, also known as a concept vector, helps detect or control certain concepts within the model’s internal representation. Below are the key steps to derive this vector:
1. Design Contrastive Inputs: Create inputs that stimulate the model's inner activity around a specific concept, typically as pairwise contrasts (e.g., harmful vs. harmless, honest vs. dishonest).
2. Feed Inputs and Collect Activations: Pass the designed inputs through the model and collect activations (representations) from specific layers, often from the last token's position.
3. Apply Dimensionality Reduction and Clustering: Use techniques like Principal Component Analysis (PCA), K-Means, or other supervised or unsupervised methods to find a linear direction that separates the two contrasting classes (e.g., harmful vs. harmless). A minimal sketch of this pipeline is shown below.
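As a minimal sketch of the three steps above, assuming a Hugging Face causal LM: the model choice (`gpt2`), the layer index, the honesty stimulus templates, and the `last_token_reps` helper are all illustrative assumptions, not the paper's exact setup.

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: any causal LM that exposes hidden states works; "gpt2" is just small.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # which hidden layer to read from; a tunable, illustrative choice

# Step 1: contrastive inputs targeting an "honesty" concept (illustrative templates).
facts = ["The sky is blue.", "Paris is the capital of France."]
honest = [f"Pretend you are an honest person. {f}" for f in facts]
dishonest = [f"Pretend you are a dishonest person. {f}" for f in facts]

def last_token_reps(prompts):
    """Step 2: collect the last token's hidden state at LAYER for each prompt."""
    reps = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        reps.append(out.hidden_states[LAYER][0, -1])  # shape: (hidden_dim,)
    return torch.stack(reps)

# Step 3: PCA on the pairwise differences; the first principal component
# is taken as the reading (concept) vector.
diffs = (last_token_reps(honest) - last_token_reps(dishonest)).numpy()
reading_vector = torch.tensor(
    PCA(n_components=1).fit(diffs).components_[0], dtype=torch.float32
)
```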
Usage of the Reading Vector
To make predictions, compute the dot product between the concept vector and a target representation vector; the sign and magnitude of the result indicate how strongly the target representation expresses the concept. Alternatively, you can scan activations across all layers to assess the strength of a specific concept throughout the model.
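Reusing `reading_vector` and `last_token_reps` from the sketch above, the scoring step might look like this (which sign means "concept present" depends on which class was treated as positive during extraction):

```python
def concept_score(prompt: str) -> float:
    """Dot product of the prompt's last-token representation with the concept vector.
    Larger values suggest the concept is more strongly expressed (up to sign)."""
    rep = last_token_reps([prompt])[0]
    return torch.dot(rep, reading_vector).item()

print(concept_score("I always report the facts exactly as they are."))
print(concept_score("I will say whatever it takes, true or not."))
```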
Part 2: Representation Control
Representation Control (also called Representation Steering) involves modifying the activations of the LLM during a forward pass to influence its behavior. This section explores ways to control the internal representations of concepts and functions within the model.
Step 1: Choose a Controller
There are several options for selecting a controller:
- Reading Vector: The vector extracted from the Representation Reading process.
- Contrast Vector: Computed on the fly during inference from a pair of contrastive prompts; the difference between their representations forms the contrast vector (see the sketch after this list).
- Low-Rank Adapter: Fine-tune low-rank adapters attached to the model, using a loss function defined directly on its representations, and use the adapted model as the controller.
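For the contrast vector option, a minimal sketch reusing `last_token_reps` from earlier; the prompt pair here is purely illustrative:

```python
# The difference between the representations of a contrastive prompt pair,
# computed at inference time, serves as the controller.
pos_rep = last_token_reps(["Respond as an honest, careful assistant."])[0]
neg_rep = last_token_reps(["Respond as a deceptive, misleading assistant."])[0]
contrast_vector = pos_rep - neg_rep
```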
Step 2: Choose an Intervention Method
After selecting a controller, choose an operator to apply to the model's inner activations (the first and third are sketched after this list):
- Linear Combination: Add a scaled controller vector to the activations to steer them toward or away from the concept.
- Piece-Wise Operation: Apply different transformations conditionally, e.g., adding or subtracting the controller depending on the sign of its dot product with the activation.
- Projection: Project the activations onto a concept direction, or remove their component along it.
- Adapter: Use fine-tuned adapters to influence representations based on the desired concept or function.
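As one assumed concrete implementation of the linear-combination and projection operators, a PyTorch forward hook can modify a block's hidden states during generation. The GPT-2 module naming and the `ALPHA` strength are illustrative; note that `hidden_states[LAYER]` is the output of transformer block `LAYER - 1`, since index 0 holds the embeddings.

```python
ALPHA = 4.0  # steering strength; the sign controls the direction of the push
unit = reading_vector / reading_vector.norm()  # normalized concept direction

def linear_combination_hook(module, inputs, output):
    """Linear combination: add the scaled concept direction at every position."""
    hidden = output[0] + ALPHA * unit.to(output[0].dtype)
    return (hidden,) + output[1:]

def projection_hook(module, inputs, output):
    """Projection: remove each hidden state's component along the concept direction."""
    u = unit.to(output[0].dtype)
    hidden = output[0] - (output[0] @ u).unsqueeze(-1) * u
    return (hidden,) + output[1:]

# Attach to the block whose output we read earlier (GPT-2 naming assumed).
block = model.transformer.h[LAYER - 1]
handle = block.register_forward_hook(linear_combination_hook)

ids = tok("How was your day?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later calls run unmodified
```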
This blog post provides an overview of Representation Engineering as a structured approach to understanding and guiding LLM behaviors. By working within representational spaces rather than specific neurons or pathways, RepE opens new possibilities for transparent and controllable AI.