Device_map='auto' not working along with bitsandbytes (transformers) #34751

Closed
1 of 4 tasks
guillemram97 opened this issue Nov 15, 2024 · 1 comment · May be fixed by huggingface/accelerate#3244

@guillemram97

System Info

Hardware: Amazon Linux EC2 instance with 8x NVIDIA A10G GPUs (23 GB each)

Python 3.10.14
CUDA Version: 12.3
accelerate==0.34.2
bitsandbytes==0.44.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
torch==2.4.1
transformers==4.45.1

Who can help?

@muellerz @SunMarc @MekkCyber

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from accelerate import infer_auto_device_map
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto', quantization_config=bnb_config)

device_map = infer_auto_device_map(model, max_memory = {0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('', 0)])
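
To confirm where the quantized weights actually end up, one can also inspect the device map that transformers records on the loaded model (a minimal check, not part of the original report; hf_device_map is populated whenever device_map is used at load time):

# Inspect the placement chosen by from_pretrained for the quantized model.
# If (almost) everything maps to a single GPU index, the weights are not
# being sharded across the 8 cards.
print(model.hf_device_map)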

However, if I load the model without the quantization_config, there is no issue at all:

model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto')
device_map = infer_auto_device_map(model, max_memory = {0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('model.embed_tokens', 0), ('lm_head', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7.self_attn', 0), ('model.layers.7.mlp.gate_proj', 0), ('model.layers.7.mlp.up_proj', 0), ('model.layers.7.mlp.down_proj', 1), ('model.layers.7.mlp.act_fn', 1), ('model.layers.7.input_layernorm', 1), ('model.layers.7.pre_feedforward_layernorm', 1), ('model.layers.7.post_feedforward_layernorm', 1), ('model.layers.7.post_attention_layernorm', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 1), ('model.layers.17.self_attn', 1), ('model.layers.17.mlp.gate_proj', 1), ('model.layers.17.mlp.up_proj', 1), ('model.layers.17.mlp.down_proj', 2), ('model.layers.17.mlp.act_fn', 2), ('model.layers.17.input_layernorm', 2), ('model.layers.17.pre_feedforward_layernorm', 2), ('model.layers.17.post_feedforward_layernorm', 2), ('model.layers.17.post_attention_layernorm', 2), ('model.layers.18', 2), ('model.layers.19', 2), ('model.layers.20', 2), ('model.layers.21', 2), ('model.layers.22', 2), ('model.layers.23', 2), ('model.layers.24', 2), ('model.layers.25', 2), ('model.layers.26', 2), ('model.layers.27.self_attn', 2), ('model.layers.27.mlp.gate_proj', 2), ('model.layers.27.mlp.up_proj', 2), ('model.layers.27.mlp.down_proj', 3), ('model.layers.27.mlp.act_fn', 3), ('model.layers.27.input_layernorm', 3), ('model.layers.27.pre_feedforward_layernorm', 3), ('model.layers.27.post_feedforward_layernorm', 3), ('model.layers.27.post_attention_layernorm', 3), ('model.layers.28', 3), ('model.layers.29', 3), ('model.layers.30', 3), ('model.layers.31', 3), ('model.layers.32', 3), ('model.layers.33', 3), ('model.layers.34', 3), ('model.layers.35', 3), ('model.layers.36', 3), ('model.layers.37.self_attn', 3), ('model.layers.37.mlp.gate_proj', 3), ('model.layers.37.mlp.up_proj', 3), ('model.layers.37.mlp.down_proj', 4), ('model.layers.37.mlp.act_fn', 4), ('model.layers.37.input_layernorm', 4), ('model.layers.37.pre_feedforward_layernorm', 4), ('model.layers.37.post_feedforward_layernorm', 4), ('model.layers.37.post_attention_layernorm', 4), ('model.layers.38', 4), ('model.layers.39', 4), ('model.layers.40', 4), ('model.layers.41', 4), ('model.layers.42', 4), ('model.layers.43', 4), ('model.layers.44', 4), ('model.layers.45', 4), ('model.norm', 4)])

Expected behavior

The model is (mostly) being loaded onto the last GPU, whereas I would expect it to be distributed across the different GPUs. Moreover, infer_auto_device_map does not appear to work with the quantized model (it returns OrderedDict([('', 0)])).
I have seen very similar behavior on different hardware.
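
A possible interim workaround (a minimal sketch, not from the original report) is to compute the device map on an empty meta model before loading the quantized weights, and then pass that map explicitly to from_pretrained. Note that the memory estimate is based on the unquantized parameter sizes, so the resulting map is conservative, and the Gemma2DecoderLayer class name passed to no_split_module_classes is an assumption that may need adjusting:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Build the model on the meta device so no weights are materialised while
# the device map is computed.
config = AutoConfig.from_pretrained('google/gemma-2-27b-it')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Keep each decoder block on a single device; the class name below is an
# assumption for Gemma-2.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "23GB" for i in range(8)},
    no_split_module_classes=["Gemma2DecoderLayer"],
)

# Load the 4-bit model with the precomputed map instead of device_map='auto'.
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    device_map=device_map,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)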

@MekkCyber
Contributor

Hi @guillemram97, thanks for reporting this issue 😊. Indeed, it seems to be a bug related to how we load quantized models on the accelerate side. We are currently working on a fix to improve these edge cases. You can refer to the PR linked to this issue if you want to understand the details.
