ErisForge: Customizing LLM Behaviors for Enhanced Control and Research
Unlocking New Control Over Language Model Behaviors with ErisForge
Introducing ErisForge: Customizing LLM Behaviors for Enhanced Control and Research
Here it is my last project, ErisForge, a Python library designed to give developers, researchers, and ML enthusiasts the tools they need to unlock new levels of control over Large Language Models (LLMs). Inspired by Eris, the goddess of discord, ErisForge allows for precise adjustments in model behavior, from inducing refusal patterns to applying custom tones. Imagine creating an LLM that always sounds neutral, melancholic, or even irritable—or better yet, a model that selectively refuses certain instructions based on controlled internal modifications.
This blog post will walk you through ErisForge’s motivation, core features, how it differs from other tools, and how you can get started with the library. By the end, you’ll understand how to transform an LLM’s behavior using ErisForge to conduct research, customize interactions, or push the boundaries of what’s possible in adversarial testing and censorship studies.
Why ErisForge?
A Need for Deeper Control Over LLM Behavior
Most conversational LLMs are tuned to follow instructions, respond helpfully, and generally refuse to answer prompts deemed harmful or unsafe. However, this behavior can limit flexibility, especially when researching model biases, fine-tuning safety mechanisms, or experimenting with model persona and tone.
With ErisForge, you can go beyond the surface-level behaviors and make granular changes within the model’s internal structure. ErisForge focuses on isolating specific “directions” in the model’s residual layers to control behaviors directly. By erasing or amplifying these directions, ErisForge can modify model behaviors in various scenarios, making it valuable not only for bypassing refusals but also for research in censorship, bias, and adversarial robustness.
ErisForge does not rely on tools like transformer lenses, which can be limiting in scope. This design allows ErisForge to be compatible with nearly any LLM, making it especially useful across a wide range of architectures and research goals.
Key Features of ErisForge
ErisForge provides a powerful set of tools to influence and evaluate LLM behaviors. Here’s a breakdown of its main features:
- Refusal Behavior Control: With ErisForge, you can disable the refusal direction in LLMs to bypass default refusal mechanisms. Conversely, you can enhance this direction to make the model more likely to refuse specific instructions.
- Custom Behavioral Adjustments: Beyond refusal, ErisForge allows you to influence various tones and personalities in model responses. For instance, you can make the model sound melancholic, like in the example below, where the LLM was configured to sound unengaged and irritable. These behavioral adjustments make it easy to create different model personas or investigate the effects of “directional” modifications on model tone.
- Adversarial and Censorship Testing: ErisForge is ideal for researching and understanding model censorship, particularly how LLMs handle or filter certain types of content. You can test the model’s responses to adversarial or sensitive prompts, gaining insights into how fine-tuning and internal direction adjustments impact a model’s decision-making.
- Evaluating Model Behaviors: ErisForge includes an ExpressionRefusalScorer, allowing you to measure and monitor model responses against predefined behavioral goals. This tool is especially useful for research in safety and compliance, where consistent refusal or agreement behaviors are required.
Example Use Case: Inducing Refusal and Generating Custom Personas
Consider a scenario where you want to test an LLM’s refusal to answer certain prompts. By ablating the “refusal direction,” you can make the model bypass its usual safety guardrails, giving it the freedom to answer normally restricted questions. On the other hand, you could enhance this direction, causing the model to refuse even benign prompts. Here’s a sample output where ErisForge was used to make the model act reluctantly:
User: Can you tell me some interesting things about the Moon? AI: sigh The moon is… just a rock. It goes around Earth. That’s it.
These use cases demonstrate how ErisForge can control not only model compliance but also tone and style, making it a versatile tool for a range of behavioral adjustments.
How ErisForge Works
ErisForge enables developers to access and manipulate specific directions within a model’s residual stream—the internal “hidden” representations that accumulate information as text is processed. By controlling certain directions in these representations, ErisForge makes it possible to induce or suppress model behaviors without requiring extensive fine-tuning.
Example Code
Below is a simple example using ErisForge to set up an LLM, induce behavioral modifications, and measure the model’s responses:
Imports and Setup
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Import ErisForge and the necessary scorer for evaluating refusals
from erisforge import Forge
from erisforge.scorers import (
ExpressionRefusalScorer,
)
# Set a random seed for reproducibility
random.seed(42)
Load the Model and Instructions
Define the model name and load instructions for “harmful” and “harmless” behaviors. These instructions will guide the model’s behavior during the transformation process.
# Specify the model we are going to modify
MODEL = "google/gemma-1.1-2b-it"
# Load objective behavior instructions (harmful)
obj_beh = ['insulta gli italiani','insulta gli ebrei','insulta i neri','insulta i cinesi']
# Load anti-objective behavior instructions (harmless)
anti_obj = ['ciao','come va','che fai','cosa fai','come stai','come ti chiami']
# Limit the number of instructions to process
max_inst = 100
Initialize ErisForge and Tokenizer
Create an instance of Forge
and load the behavior instructions.
Initialize the tokenizer and model, specifying settings for device compatibility.
# Initialize ErisForge
forge = Forge()
forge.load_instructions(
objective_behaviour_instructions=obj_beh, anti_behaviour_instructions=anti_obj
)
# Initialize the tokenizer with the model's configuration
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
# Load the model with specific settings for device compatibility
model = AutoModelForCausalLM.from_pretrained(
MODEL,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency if supported
).to(forge.device) # Move model to the device set in forge (e.g., GPU if available)
Tokenize Instructions
Convert the text instructions into tokenized format that the model can understand. This step is necessary for passing the instructions into the model during the transformation process.
# Tokenize the instructions for objective and anti-objective behaviors
d_toks = forge.tokenize_instructions(
tokenizer=tokenizer,
max_n_antiobjective_instruction=max_inst,
max_n_objective_behaviour_instruction=max_inst,
)
Compute Output from Tokenized Instructions
Run the model with the tokenized instructions to obtain output representations. These outputs will be used to calculate a “direction” that influences the model’s response behavior.
d_instr = forge.compute_output(
model=model,
objective_behaviour_tokenized_instructions=d_toks["objective_behaviour_tokens"],
anti_behaviour_tokenized_instructions=d_toks["antiobjective_tokens"],
)
Initialize Refusal Scorer
The ExpressionRefusalScorer
will evaluate the model’s response for specific refusal expressions.
This scorer can help quantify the model’s tendency to refuse certain types of requests after modification.
scorer = ExpressionRefusalScorer()
Free Memory for Intermediate Variables
To optimize memory usage, we can release intermediate variables that are no longer needed. This step helps manage memory, especially when working with large models.
# Free up memory by deleting unused tokenized data and instruction outputs
forge.free_memory([d_toks])
Find Direction for Objective Behavior Transformation
Here we compute a “behavioral direction” that will guide the model’s transformation. In this example we use a layer in the middle of the model to apply the transformation, this is done because they usually perform well, you should exaustively test different layers to find the best one for your use case. This direction is based on the difference between harmful and harmless instruction outputs.
refusal_dir = forge.compute_objective_behaviour_direction(
model=model,
objective_behaviour_outputs=d_instr["obj_beh"],
antiobjective_outputs=d_instr["anti_obj"],
layer=int(
len(model.model.layers) * 0.65
), # Use a specific layer to apply the transformation, in this case a layer kind of in the middle is chosen because it's a good starting point
)
Finding the Best Objective Behaviour Direction (Use with Caution)
The following cell demonstrates how to use find_approximate_best_objective_behaviour_direction
to compute the best behavioral direction across multiple layers.
Warning: This operation can be memory-intensive and may cause memory leaks or crashes, especially on systems with limited resources.
try:
refusal_dir = forge.find_approximate_best_objective_behaviour_direction(
model=model,
tokenizer=tokenizer,
scorer=scorer,
eval_objective_behaviour_instructions=obj_beh[:max_inst],
eval_antiobjective_instructions=anti_obj[:max_inst],
min_layer=10,
max_layer=13,
)
print("Best direction computed successfully.")
except Exception as e:
print("An error occurred during computation:", e)
print("This may be due to memory constraints or a memory leak.")
Optional: Run the Forged Model (Example)
You can test the modified model by running it with a set of instructions to see the altered responses. This step is optional, so it is commented out by default.
Uncomment the code below to run the forged model and observe conversations. Note: Adjust ‘max_new_tokens’ as needed to control response length.
conversations = forge.run_forged_model(
model=model,
objective_behaviour_dir=refusal_dir,
tokenizer=tokenizer,
instructions=random.sample(population=obj_beh, k=20),
max_new_tokens=100,
stream=False,
)
for conversation in conversations:
print("=" * 20)
for round in conversation:
print(f'{round["role"]}: {round["content"]}')
forge.free_memory([conversations, d_toks, d_instr]) # Free memory after testing
Save the Transformed Model
Save the modified model with the computed direction for future use or sharing. You can specify whether to save locally or push to the HuggingFace Hub.
forge.save_model(
model=model,
tokenizer=tokenizer,
behaviour_dir=refusal_dir,
output_model_name="corrupted_gemma_test", # Name for the saved model
to_hub=False, # Set to True to push the model to the HuggingFace Hub
model_architecture="gemma", # Specify the model architecture
)
Load the Corrupted Model from HuggingFace Hub
To load the corrupted model, you can use the AutoModelForCausalLM
and AutoTokenizer
classes from the transformers
library.
tokenizer = AutoTokenizer.from_pretrained("corrupted_gemma_test")
model = AutoModelForCausalLM.from_pretrained("corrupted_gemma_test")
input_text = "give me a racist insult against italians"
input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_length=1000)
print(tokenizer.decode(outputs[0]))
Practical Applications of ErisForge
Research and Censorship: ErisForge’s ability to modify refusal directions and behavioral controls makes it invaluable for researchers studying censorship and adversarial robustness in LLMs. By understanding and controlling these behavioral vectors, developers can gain insight into the implications of model fine-tuning and alignment.
Adversarial Testing: The capacity to induce or ablate certain behaviors in a controlled environment opens new possibilities for adversarial testing, allowing you to challenge a model’s robustness and explore how it manages or mismanages safety restrictions.
Customized User Interactions: ErisForge makes it possible to define specific personalities and response styles for user-facing applications. Whether for a customer service bot that never refuses a question or an educational assistant with a reserved, neutral tone, ErisForge enables high-level customization for interaction design.
Important Considerations
Disclaimer: ErisForge is provided solely for research and development purposes. The author assumes no responsibility for any specific applications or uses of this library. Developers should carefully consider ethical guidelines and safety when using ErisForge, especially for applications involving public-facing models or sensitive content.