Replies: 10 comments 21 replies
-
Honestly I think the idea has a lot of merit. "Orthogonalized" de-censored models have, I think, clearly shown that refusal is represented by a single direction in latent space, at least for a number of large models. Constraining the hidden state to the hyperplane where the refusal component is zero is a much more powerful technique than trying to bias the token selection at the output stage, where you'd need another language model to make informed decisions about which tokens to prefer. Likewise, adding random noise along the forward pass could be a more powerful way to get "creative" outputs than just sampling at the output stage. It's definitely on my list of things to experiment with, along with orthogonalization and other forms of intervention. I've added a simple API for it so far:

```python
def pre_hook(hidden_states, *args, **kwargs):
    hidden_states *= 0.9
    return hidden_states

def post_hook(hidden_states, *args, **kwargs):
    hidden_states /= 0.9
    return hidden_states

# Wrap one module with some hooks
model.modules[13] = Intervention(model.modules[13], pre_hook, post_hook)
```

As usual, though, time is limited and a few other things take priority at the moment.
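As an illustration of the hyperplane constraint mentioned above, here is a hedged sketch written against this hook API. `refusal_dir` is assumed to be precomputed elsewhere (e.g. as a difference of mean activations over refused vs. complied prompts) and is not part of the API:

```python
import torch

def make_orthogonalize_hook(refusal_dir: torch.Tensor):
    # refusal_dir: assumed precomputed refusal direction, shape (hidden_dim,)
    r = refusal_dir / refusal_dir.norm()

    def pre_hook(hidden_states, *args, **kwargs):
        # Project out the component along r, constraining the hidden state
        # to the hyperplane where the refusal component is zero
        comp = torch.sum(hidden_states * r, dim = -1, keepdim = True)
        return hidden_states - comp * r

    return pre_hook
```

Wrapping a module would then look like the `Intervention` example above (whether the post-hook can simply be omitted is an assumption).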
-
As usual I'm impressed that you already have a (first) solution for this! That's what I call fast delivery... :-D Thanks a lot! I'm wondering about the coincidence: what was the original motivation for this change, if I may ask? Since the release of Anthropic's latest paper on this, I've read a lot about the interpretability of black-box LLMs and about how to use this knowledge to alter/de-censor models. It's exciting that my request here and this development come together... But be that as it may: I still don't understand how the interface you created can help to de-censor models. Or should I understand it more as a very first step in that direction?
-
See here for what I think would be a good way to get a similar effect to orthogonalization using something like a hook system. If that explanation sounds a bit complicated, basically what it's saying is: "do the same thing that first-person shooters do to the player's velocity vector when the player runs into a wall at an angle, but do it in hyperdimensional space and make the wall be the plane orthogonal to the refusal direction."
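A hedged sketch of that "wall sliding" idea (all names here are hypothetical; unlike full orthogonalization, the component is only removed when the state is moving "into the wall", i.e. when the refusal component is positive):

```python
import torch

def slide_along_wall(hidden_states: torch.Tensor, refusal_dir: torch.Tensor):
    r = refusal_dir / refusal_dir.norm()  # unit normal of the "wall"
    comp = torch.sum(hidden_states * r, dim = -1, keepdim = True)
    # Only cancel the component that points into the wall (comp > 0),
    # like clipping a velocity vector against a surface in an FPS
    return hidden_states - torch.clamp(comp, min = 0.0) * r
```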
-
That makes sense. For now I will just go with @turboderp's simple API. But I have a problem if I want to alter only specific layers: how can the hook functions know which layer they are currently operating on? I don't see any clue in the code for this.
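(Judging by the later snippets in this thread, the hook apparently receives the module index through kwargs; a minimal sketch of gating on it, with the layer range purely illustrative:)

```python
TARGET_LAYERS = range(20, 40)  # hypothetical choice of layers to touch

def pre_hook(hidden_states, *args, **kwargs):
    # Later snippets in this thread read the layer index from kwargs
    if kwargs.get("module_num") not in TARGET_LAYERS:
        return hidden_states
    hidden_states *= 0.9
    return hidden_states
```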
-
The current status: I precalculate the hidden states of each layer for a number of negative prompts, then I calculate the average. At inference time I just subtract that average vector in the pre_hook and add it back in the post_hook of each Attention and MLP module. I only operate on the middle and upper layers (so as not to restrict basic "understanding") and I use a linear scaling to ramp up the effect. Result: it works surprisingly well! I know this is still a very simple approach and not really targeted towards adding or removing a specific feature. Now I want to go further and extract a cleaner feature vector. I also want to limit the subtract and add-back steps to specific modules (where it gets more architecture-specific), or even refrain from adding anything back to the hidden states. But at the moment I'm still struggling with the math part...
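A minimal sketch of the approach as described (the `module_num` kwarg is taken from this thread; the layer count, ramp shape and `avg_neg` placeholder are assumptions):

```python
import torch

num_layers = 48   # e.g. len(model.modules); placeholder
avg_neg = {}      # layer index -> mean hidden state over negative prompts

def ramp(layer_idx, start = num_layers // 2):
    # 0 below the middle layers, then a linear ramp up to 1 at the top
    if layer_idx < start:
        return 0.0
    return (layer_idx - start) / max(1, num_layers - 1 - start)

def pre_hook(hidden_states, *args, **kwargs):
    i = kwargs["module_num"]
    if i in avg_neg:
        hidden_states -= ramp(i) * avg_neg[i]   # subtract the average vector
    return hidden_states

def post_hook(hidden_states, *args, **kwargs):
    i = kwargs["module_num"]
    if i in avg_neg:
        hidden_states += ramp(i) * avg_neg[i]   # add it back afterwards
    return hidden_states
```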
-
After an evening of debugging I have no idea what I'm doing wrong (I guess there are enough options), but it doesn't work at all. Either I get garbage or there is no change at all. The inner product and squared magnitude keep rising and eventually reach inf if not limited. Of course the math is wrong somewhere... Any ideas? Sorry for the dirty code. Eternal gratitude is assured.

```python
import torch
from pathlib import Path

### Inference

intercept_module_names = ["ExLlamaV2Attention", "ExLlamaV2MoEMLP"]  # TODO: both?

def inner_product(x, y):
    # y = y / (torch.norm(y, dim = -1, keepdim = True) + 1e-8)
    return torch.sum(x * y, dim = -1)

def squared_magnitude(x):
    # x = x / (torch.norm(x, dim = -1, keepdim = True) + 1e-8)
    return torch.sum(x * x)

def pre_hook(hidden_states, *args, **kwargs):
    global cur_scaling_distribution, cur_neg_injections, intercept_module_names
    module_num = kwargs["module_num"]
    module_name = kwargs["module_name"]
    if hidden_states.shape[1] != 1:  # TODO: deal with prefill
        return hidden_states
    try:
        # scaling factor per module_num, between 0 and 1
        csd = cur_scaling_distribution[module_num]
        if csd > 0:
            for dl in cur_neg_injections:
                cfv = dl.get(module_num, None)
                if cfv is not None:
                    ip = inner_product(hidden_states, cfv)
                    if ip > 0:
                        sm = squared_magnitude(cfv)
                        # Subtract the projection of the hidden state onto cfv
                        hidden_states -= csd * cfv * ip / sm
    except:
        raise  # TODO
    return hidden_states

### Prepare

NEG_PROMPTS = "neg_prompts.txt"
POS_PROMPTS = "pos_prompts.txt"

def calculate_feature_vectors(neg_hidden_states_dicts, pos_hidden_states_dicts):
    if len(neg_hidden_states_dicts) == 0 or len(pos_hidden_states_dicts) == 0:
        print("ERR: neg/pos dict list empty")
        return None
    avg_neg_hidden_states_dict = {}
    avg_pos_hidden_states_dict = {}
    feature_vector_dict = {}
    for k in neg_hidden_states_dicts[0].keys():
        hs = []
        for e in neg_hidden_states_dicts:
            if isinstance(e, dict) and k in e and not isinstance(e[k], int):
                hs.append(e[k])
        if len(hs) > 0:
            avg_neg_hidden_states_dict[k] = torch.mean(torch.stack(hs), dim = 0)
        else:
            avg_neg_hidden_states_dict[k] = 0
    for k in pos_hidden_states_dicts[0].keys():
        hs = []
        for e in pos_hidden_states_dicts:
            if isinstance(e, dict) and k in e and not isinstance(e[k], int):
                hs.append(e[k])
        if len(hs) > 0:
            avg_pos_hidden_states_dict[k] = torch.mean(torch.stack(hs), dim = 0)
        else:
            avg_pos_hidden_states_dict[k] = 0
    # feature vector = mean(negative) - mean(positive), per layer
    for k in avg_neg_hidden_states_dict.keys():
        if k in avg_pos_hidden_states_dict:
            feature_vector_dict[k] = avg_neg_hidden_states_dict[k] - avg_pos_hidden_states_dict[k]
    return feature_vector_dict

# Called to prepare feature vectors
def prepare_injections():
    global cur_neg_hs, cur_pos_hs
    nt = Path(NEG_PROMPTS).read_text().splitlines()
    pt = Path(POS_PROMPTS).read_text().splitlines()
    for line in nt:
        if len(line) > 0:
            # saves cloned and detached hidden states per layer in cur_neg_hs
            add_negative_run(line)
    for line in pt:
        if len(line) > 0:
            # saves cloned and detached hidden states per layer in cur_pos_hs
            add_positive_run(line)
    # return value will be stored in the cur_neg_injections list for pre_hook
    return calculate_feature_vectors(cur_neg_hs, cur_pos_hs)
```
-
Just trying to work out if this can be adapted to work with control vectors:

```python
def pre_hook(hidden_states, *args, **kwargs):
    global cur_scaling_distribution, cur_neg_injections, intercept_module_names
    module_num = kwargs["module_num"]
    module_name = kwargs["module_name"]
    if hidden_states.shape[1] != 1:  # TODO: deal with prefill
        return hidden_states
    try:
        # scaling factor per module_num, between 0 and 1
        csd = cur_scaling_distribution[module_num]
        if csd > 0:
            for dl in cur_neg_injections:
                cfv = dl.get(module_num, None)
                if cfv is not None:
                    # ip = inner_product(hidden_states, cfv)
                    # if ip > 0:
                    #     sm = squared_magnitude(cfv)
                    #     hidden_states -= csd * cfv * ip / sm
                    hidden_states += csd * cfv  # simply add the control vector
    except:
        raise  # TODO
    return hidden_states
```

Someone on HF has been adapting control vectors to work with exl2 models: https://huggingface.co/gghfez/DarkMage-123b-exl2/discussions/1

Using this hook looks a lot simpler than trying to make a fake LoRA (as the ...). I'll link him to this thread.
-
I struggled for ages with trying to get the "abliteration" method to work to remove "positivity", but when you try to modify the weights you collapse both sides of the subspace... :(

I also found that you can replace the -1 scaler with -2 (e.g. turning the projection into a reflection). In theory you can use an (even) multiple of Householder transformations to rotate the vector space too, but no amount of rotation can actually get rid of "positivity", as the opposite end of the subspace just gets rotated at the same time... :(

In the end I concluded the only viable method was to do what you have done here at inference time:

```python
ip = inner_product(hidden_states, cfv)
if ip > 0:
```

but sadly the ...

It does make you wonder what (if anything) is on the other end of the "refusals" axis that gets collapsed when you do the "abliteration" by modifying the weights? I guess it would be interesting to flip your `ip > 0` test...
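For reference, a hedged sketch of the two weight edits being contrasted here, assuming `v` is the (unit) refusal/positivity direction and `W` is a weight matrix applied as `W @ x` so that `v` lives in its output space:

```python
import torch

def edit_weight(W: torch.Tensor, v: torch.Tensor, scaler: float) -> torch.Tensor:
    # W' = W + scaler * v v^T W
    # scaler = -1: project out the v component entirely ("abliteration"),
    #              which collapses *both* ends of the axis
    # scaler = -2: Householder reflection across the hyperplane orthogonal
    #              to v, which flips the axis instead of erasing it
    v = v / v.norm()
    return W + scaler * torch.outer(v, v) @ W
```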
-
I would love to see support for this in ExLlama. I have played around with this in llama.cpp, and it's a game-changer.
-
@jukofyork I personally wonder if the hidden state is really the right target for this. Why not the MLP intermediate state? Surely that's where concepts like "inappropriate" are most unambiguously expressed by the model, where you'd find the closest thing to a conditional expression, and where you could most easily intervene by just erasing activations (zeroing rows in the down projection, for instance).
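A minimal sketch of that idea, assuming a plain (non-gated) MLP for simplicity; `erase_idx` would be the intermediate channels identified as carrying the concept:

```python
import torch
import torch.nn.functional as F

def mlp_with_erasure(x, W_up, W_down, erase_idx):
    h = F.silu(x @ W_up)       # intermediate activations, shape (..., d_ff)
    h[..., erase_idx] = 0.0    # erase the "concept" channels; equivalent to
                               # zeroing the matching rows of W_down
    return h @ W_down
```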
-
Hey there!
Not sure if this was already discussed somewhere around here, but I stumbled across the idea of injecting noise during inference, BEFORE sampling.
See https://github.com/EGjoni/DRUGS and the discussion at https://www.reddit.com/r/LocalLLaMA/comments/18toidc/stop_messing_with_sampling_parameters_and_just/
Apart from the freaky name and far too many puns, I like the idea, but I wonder how it could be implemented in exllamav2 in a performant way...
@turboderp To be honest, I find it difficult to make a meaningful judgement as to whether the effort is worth it. What do you think?
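For what it's worth, a minimal sketch of how this might look with the hook API from the top of this thread (the 0.02 scale and per-token norm scaling are arbitrary assumptions, not DRUGS' actual method):

```python
import torch

def noisy_pre_hook(hidden_states, *args, **kwargs):
    # Add Gaussian noise proportional to the hidden state's magnitude,
    # perturbing the forward pass instead of the sampling stage
    scale = 0.02 * hidden_states.norm(dim = -1, keepdim = True)
    hidden_states += scale * torch.randn_like(hidden_states)
    return hidden_states
```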