Device_map='auto' not working along with bitsandbytes (transformers) #34751

Closed
1 of 4 tasks
guillemram97 opened this issue Nov 15, 2024 · 1 comment · May be fixed by huggingface/accelerate#3244

@guillemram97

System Info

Hardware: Amazon Linux EC2 instance with 8x NVIDIA A10G GPUs (23 GB each)

Python 3.10.14
CUDA Version: 12.3
accelerate==0.34.2
bitsandbytes==0.44.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
torch==2.4.1
transformers==4.45.1

Who can help?

@muellerz @SunMarc @MekkCyber

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from accelerate import infer_auto_device_map
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto', quantization_config=bnb_config)

device_map = infer_auto_device_map(model, max_memory = {0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('', 0)])
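
To confirm where the quantized weights actually end up, one can also inspect the device map that transformers records on the loaded model (a minimal check, not part of the original report; hf_device_map is populated whenever device_map is used at load time):

# Inspect the placement chosen by from_pretrained for the quantized model.
# If (almost) everything maps to a single GPU index, the weights are not
# being sharded across the 8 cards.
print(model.hf_device_map)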

However, if I load the model without the quantization_config, there is no issue at all:

model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto')
device_map = infer_auto_device_map(model, max_memory = {0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('model.embed_tokens', 0), ('lm_head', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7.self_attn', 0), ('model.layers.7.mlp.gate_proj', 0), ('model.layers.7.mlp.up_proj', 0), ('model.layers.7.mlp.down_proj', 1), ('model.layers.7.mlp.act_fn', 1), ('model.layers.7.input_layernorm', 1), ('model.layers.7.pre_feedforward_layernorm', 1), ('model.layers.7.post_feedforward_layernorm', 1), ('model.layers.7.post_attention_layernorm', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 1), ('model.layers.17.self_attn', 1), ('model.layers.17.mlp.gate_proj', 1), ('model.layers.17.mlp.up_proj', 1), ('model.layers.17.mlp.down_proj', 2), ('model.layers.17.mlp.act_fn', 2), ('model.layers.17.input_layernorm', 2), ('model.layers.17.pre_feedforward_layernorm', 2), ('model.layers.17.post_feedforward_layernorm', 2), ('model.layers.17.post_attention_layernorm', 2), ('model.layers.18', 2), ('model.layers.19', 2), ('model.layers.20', 2), ('model.layers.21', 2), ('model.layers.22', 2), ('model.layers.23', 2), ('model.layers.24', 2), ('model.layers.25', 2), ('model.layers.26', 2), ('model.layers.27.self_attn', 2), ('model.layers.27.mlp.gate_proj', 2), ('model.layers.27.mlp.up_proj', 2), ('model.layers.27.mlp.down_proj', 3), ('model.layers.27.mlp.act_fn', 3), ('model.layers.27.input_layernorm', 3), ('model.layers.27.pre_feedforward_layernorm', 3), ('model.layers.27.post_feedforward_layernorm', 3), ('model.layers.27.post_attention_layernorm', 3), ('model.layers.28', 3), ('model.layers.29', 3), ('model.layers.30', 3), ('model.layers.31', 3), ('model.layers.32', 3), ('model.layers.33', 3), ('model.layers.34', 3), ('model.layers.35', 3), ('model.layers.36', 3), ('model.layers.37.self_attn', 3), ('model.layers.37.mlp.gate_proj', 3), ('model.layers.37.mlp.up_proj', 3), ('model.layers.37.mlp.down_proj', 4), ('model.layers.37.mlp.act_fn', 4), ('model.layers.37.input_layernorm', 4), ('model.layers.37.pre_feedforward_layernorm', 4), ('model.layers.37.post_feedforward_layernorm', 4), ('model.layers.37.post_attention_layernorm', 4), ('model.layers.38', 4), ('model.layers.39', 4), ('model.layers.40', 4), ('model.layers.41', 4), ('model.layers.42', 4), ('model.layers.43', 4), ('model.layers.44', 4), ('model.layers.45', 4), ('model.norm', 4)])

Expected behavior

The model is (mostly) being loaded onto the last GPU, whereas I would expect it to be distributed across the different GPUs. Moreover, infer_auto_device_map does not appear to work with the quantized model (it returns OrderedDict([('', 0)])).
I have seen very similar behavior on different hardware.
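
A possible interim workaround (a minimal sketch, not from the original report) is to compute the device map on an empty meta model before loading the quantized weights, and then pass that map explicitly to from_pretrained. Note that the memory estimate is based on the unquantized parameter sizes, so the resulting map is conservative, and the Gemma2DecoderLayer class name passed to no_split_module_classes is an assumption that may need adjusting:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Build the model on the meta device so no weights are materialised while
# the device map is computed.
config = AutoConfig.from_pretrained('google/gemma-2-27b-it')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Keep each decoder block on a single device; the class name below is an
# assumption for Gemma-2.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "23GB" for i in range(8)},
    no_split_module_classes=["Gemma2DecoderLayer"],
)

# Load the 4-bit model with the precomputed map instead of device_map='auto'.
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    device_map=device_map,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)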

@MekkCyber
Contributor

Hi @guillemram97, thanks for reporting this issue 😊. Indeed, it seems to be a bug related to how we load quantized models on the accelerate side. We are currently working on a fix to improve these edge cases. You can refer to the PR linked to this issue if you want to understand the details.
