
GPU support Table & VRAM usage #17

Open
enricoros opened this issue Apr 19, 2023 · 34 comments

Comments

@enricoros

enricoros commented Apr 19, 2023

It would be great to get the instructions to run the 3B model locally on a gaming GPU (e.g. 3090/4090 with 24GB VRAM).

Confirmed GPUs

From this thread

| GPU Model   | VRAM (GB) | Tuned-3b | Tuned-7b |
|-------------|-----------|----------|----------|
| RTX 3090    | 24        |          |          |
| RTX 4070 Ti | 12        |          |          |
| RTX 4090    | 24        |          |          |
| T4          | 16        |          |          |
| A100        | 40        |          |          |

Best RAM/VRAM TRICKS (from this thread)

Convert models F32 -> F16 (lower RAM, faster load)

#17 (comment)

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model.half().cuda()  # cast weights to fp16 (and move them to the GPU)
model.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')       # save the fp16 weights locally
tokenizer.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')   # save the tokenizer alongside

Low-memory model loads

  1. Quantized 8-bit (bitsandbytes): #17 (comment)
  2. torch_dtype=torch.float16 & low_cpu_mem_usage: #17 (comment)
  3. device_map="auto": #17 (comment)

Other tricks

  1. Streaming responses: #17 (comment)

Weights RAM/VRAM (GB)

| model name              | parameters    | weights fp32 (GB) | weights fp16 (GB) | weights VRAM (GB) | load time (s) | works |
|-------------------------|---------------|-------------------|-------------------|-------------------|---------------|-------|
| stablelm-tuned-alpha-3b | 3,637,321,728 | 13.55             | 6.78              | 7.03              | 18.62         |       |
| stablelm-tuned-alpha-7b | 7,868,755,968 | 29.31             | 14.66             | 14.91             | 50.28         |       |

  • weights (fp32, GB): the minimum RAM required to load the model (before calling .half())
  • weights (fp16, GB): the minimum VRAM required when transferring the model to the GPU
  • weights (VRAM, GB): the reported VRAM increase after loading the model

Activations

Empirical (numbers in bytes, fp32):

  • stablelm-tuned-alpha-3b: total_tokens * 1,280,582
  • stablelm-tuned-alpha-7b: total_tokens * 1,869,134

The regression fit is 0.99999989. For instance, with 32 input tokens and 512 output tokens on the 7b model, about 969 MB of VRAM (almost 1 GB) of activations will be required. Not yet tested with batch sizes other than 1.

Examples of a few recorded activation numbers:

| model | input_tokens | out_tokens | total_tokens | VRAM (MB) |
|-------|--------------|------------|--------------|-----------|
| 3b    | 3072         | 1024       | 4096         | 5003      |
| 3b    | 1024         | 512        | 1536         | 1875      |
| 3b    | 64           | 1          | 65           | 78.19     |
| 3b    | 8            | 1          | 9            | 9.77      |
| 7b    | 3072         | 1024       | 4096         | 7304.22   |
| 7b    | 2048         | 512        | 2560         | 4564.47   |
| 7b    | 8            | 64         | 72           | 126.64    |
| 7b    | 8            | 1          | 9            | 14.27     |
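A minimal sketch of the estimate above as a helper (the function name and the MiB conversion are my own; batch size 1 and fp32 activations assumed):

# Estimate activation VRAM from the empirical per-token byte counts above.
BYTES_PER_TOKEN = {
    "stablelm-tuned-alpha-3b": 1_280_582,
    "stablelm-tuned-alpha-7b": 1_869_134,
}

def activation_mib(model_name: str, input_tokens: int, output_tokens: int) -> float:
    """Rough activation memory in MiB for batch size 1."""
    total_tokens = input_tokens + output_tokens
    return BYTES_PER_TOKEN[model_name] * total_tokens / 1024 ** 2

print(activation_mib("stablelm-tuned-alpha-7b", 32, 512))  # ≈ 969 MiB
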
@python273

python273 commented Apr 19, 2023

3B fp16 runs on a 2080 Ti, though you might need a lot of RAM to convert f32 to f16; the peak is around 24 GB.

Calculation of the lower bound of VRAM in GiB:

# 3B f16
>>> (3_638_525_952 * 2) / 1024 / 1024 / 1024
6.77728271484375
# 3B f32
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024
13.5545654296875
# 7B f16
>>> (7_869_358_080 * 2) / 1024 / 1024 / 1024
14.657821655273438
# 7B f32
>>> (7_869_358_080 * 4) / 1024 / 1024 / 1024
29.315643310546875

@jamiecropley

Does it run on AMD? I can try AMD RX 480 😎

@fche

fche commented Apr 19, 2023

Tesla P40 (24GB) - works

@Loufe

Loufe commented Apr 19, 2023

Best bet is 4-bit quantization. 7B will likely run in about 6 GB of VRAM at that level, since that's roughly the requirement for 7B with LLaMA.
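Not tested in this thread, but for reference, a rough sketch of what a 4-bit load could look like with bitsandbytes (this assumes a transformers/bitsandbytes version with load_in_4bit support; the NF4 settings are common defaults, not confirmed for StableLM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical 4-bit quantized load; values chosen as common defaults, not verified here.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-tuned-alpha-7b",
    quantization_config=quantization_config,
    device_map="auto",
)
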

@astrobleem

Got the 7B models working on my Tesla M40 with 24GB of VRAM.

@markacola

Got the 7B running fine on my 4090.

@aleimu

aleimu commented Apr 20, 2023

Keeping an eye on this issue.

@octimot

octimot commented Apr 20, 2023

Not a gaming PC, but I just tried the Colab notebook with 83.5GB of system RAM and an A100 with 40GB.

It's insanely fast to initialize and the prompts on the tuned-alpha-7B model took around 2 seconds to complete.

@kasima

kasima commented Apr 20, 2023

Able to run the tuned-alpha-3b on a 4070 Ti (12GB)

@enricoros enricoros changed the title Run on local GPU GPU support & VRAM usage Apr 20, 2023
@enricoros enricoros changed the title GPU support & VRAM usage GPU support Table & VRAM usage Apr 20, 2023
@vvsotnikov

For the sake of convenience (2x less download size/RAM/VRAM), I've uploaded 16-bit versions of tuned models to HF Hub:
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit

@cduk

cduk commented Apr 20, 2023

> For the sake of convenience (2x less download size/RAM/VRAM), I've uploaded 16-bit versions of tuned models to HF Hub: https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit

Would you mind showing how you made the conversion? I'm new to this and would like to do the same for the base 7B model. Thanks.

@vvsotnikov

vvsotnikov commented Apr 20, 2023

@cduk it's pretty straightforward:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model.half().cuda()  # cast weights to fp16 (and move them to the GPU)
model.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')       # save the fp16 weights locally
tokenizer.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')   # save the tokenizer alongside

It will save the model and the tokenizer locally, then you will have to upload them to Hub. Good luck!
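For completeness, a minimal sketch of loading such a 16-bit checkpoint afterwards (the torch_dtype=torch.float16 argument is my addition; it avoids upcasting the weights back to fp32 on load):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-converted fp16 checkpoint directly, so no fp32 copy is materialized.
tokenizer = AutoTokenizer.from_pretrained("vvsotnikov/stablelm-tuned-alpha-3b-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "vvsotnikov/stablelm-tuned-alpha-3b-16bit",
    torch_dtype=torch.float16,
).cuda()
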

@vvsotnikov

Nvidia T4 (16 GB) runs out of memory when trying to load the fp16 7B model. The 3B model runs smoothly in fp16.

@antheas

antheas commented Apr 20, 2023

I only have 40GB of RAM, so the default code did not work for me for 7B.

By changing the first lines to the following, RAM use is limited to 17GB and the model loads in 9:50 min.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-base-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-base-alpha-7b",
    torch_dtype=torch.float16,   # load weights directly as fp16
    low_cpu_mem_usage=True,      # avoid materializing a full fp32 copy in RAM
)
model = model.to("cuda")

You need to pip install accelerate as well. These changes avoid loading a 32-bit version of the model (34 GB) and then the weights separately (another 34 GB). You still download 2x the size, unlike with @vvsotnikov's images, but my VM has gigabit networking so I don't mind.

However, it then crashes, because my T4 only has 15.3GB, with the following error :/

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 0; 14.62 GiB total capacity; 14.33 GiB already allocated; 185.38 MiB free; 14.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm confident that with some messing around it will fit on the T4. It's so close!

For now I will try running with device_map="auto" and report back.
https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/flan-ul2#running-on-low-resource-devices
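For reference, a minimal sketch of the device_map="auto" load (nothing assumed beyond having accelerate installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-base-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-base-alpha-7b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",   # accelerate places layers on the GPU first, then spills the rest to CPU
)
print(model.hf_device_map)  # inspect where each layer ended up
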

@cduk

cduk commented Apr 20, 2023

@antheas So close! Have you considered quantizing to 8-bit and seeing how well that works? I wonder whether 8-bit 7B would outperform fp16 3B. Both seem like they would fit within 8GB of VRAM on consumer GPUs.

@antheas

antheas commented Apr 20, 2023

With device_map=auto I get the following map; the last 4 layers don't fit.

model.hf_device_map =  {'gpt_neox.embed_in': 0,
 'gpt_neox.layers.0': 0,
 'gpt_neox.layers.1': 0,
 'gpt_neox.layers.2': 0,
 'gpt_neox.layers.3': 0,
 'gpt_neox.layers.4': 0,
 'gpt_neox.layers.5': 0,
 'gpt_neox.layers.6': 0,
 'gpt_neox.layers.7': 0,
 'gpt_neox.layers.8': 0,
 'gpt_neox.layers.9': 0,
 'gpt_neox.layers.10': 0,
 'gpt_neox.layers.11': 0,
 'gpt_neox.layers.12': 0,
 'gpt_neox.layers.13': 0,
 'gpt_neox.layers.14': 'cpu',
 'gpt_neox.layers.15': 'cpu',
 'gpt_neox.final_layer_norm': 'cpu',
 'embed_out': 'cpu'}

Default inference takes 2m6s the first time, 20s the second time. A tad too slow for me. Example reply from the untuned model:

What's your mood today?
What did you do yesterday? What's your dream today?
I dreamt I was taking a walk with my family in a quiet neighborhood. When bedtime came, I'd say I was unhappily married and had no kids. It somehow seemed the perfect dream, the ideal marriage. We walked to my

Loading took 18m, but I don't have access to direct SSD storage, so your mileage may vary.

With device_map=auto, each partition is loaded and transferred to the GPU sequentially, so RAM use is around 10GB.

@cduk will try now.

@antheas

antheas commented Apr 20, 2023

So I came up with the following to use 8-bit quantization, @cduk:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-tuned-alpha-7b",
    torch_dtype=torch.float16,       # non-quantized modules stay in fp16
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
    device_map={'': 0},              # put the whole model on GPU 0
)

Loading takes around 13GB of RAM at peak. I record the VRAM before running a prompt.

Then I run the default script prompt and record the time and VRAM. Then I keep rerunning the script prompt and I record the run times. These results are on a T4. I will play with llm_int8_threshold to see if there are other savings possible. Again, no direct SSD storage = slow load times.

Running with 10.0 now, will update this table with results as they become available.

This table is for the Nvidia T4 16GB (15.3 GiB avail.) card.

| Threshold | Initial VRAM | After first prompt | Loading | First prompt | Next ones |
|-----------|--------------|--------------------|---------|--------------|-----------|
| 4         | 9619 MiB     | 10041 MiB          | 10-13m  | 2m21s        | 2-5s      |
| 6         | 9618 MiB     | 10031 MiB          | 10-13m  | 43.8s        | 3-6s      |
| 10        | 9619 MiB     | 10029 MiB          | 10-13m  | 2m39s        | 3-6s      |

The threshold has a negligible effect on VRAM. However, with a threshold of 4, prompts run a bit faster (?)

@vvsotnikov

> However, with a threshold of 4, prompts run a bit faster (?)

Isn't this expected? The lower the threshold, the more weights are converted to int8 (hence less compute to do).

@antheas

antheas commented Apr 20, 2023

> > However, with a threshold of 4, prompts run a bit faster (?)
>
> Isn't this expected? The lower the threshold, the more weights are converted to int8 (hence less compute to do).

The way I read it, it's the opposite. According to its description, values follow a normal distribution, with most falling within [-3.5, 3.5]. llm_int8_threshold is the bound below which a value is converted to int8, because higher values are associated with outliers and can destabilize the model.

Might be wrong though.

Built myself a little chat bot with ipywidgets. I'm playing a bit with the model now; it's quite fun.

By adding streamer=TextStreamer(tokenizer=tokenizer, skip_prompt=True) to model.generate(), responses are streamed.
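A minimal sketch of that streaming setup (the prompt string and generation parameters are placeholders of my own; the <|USER|>/<|ASSISTANT|> format follows the StableLM README):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-tuned-alpha-7b", torch_dtype=torch.float16
).cuda()

# TextStreamer prints tokens to stdout as they are generated; skip_prompt hides the echoed prompt.
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)

inputs = tokenizer("<|USER|>Hello!<|ASSISTANT|>", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, streamer=streamer)
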

@vvsotnikov

Ah, yes, sure, my mistake. Quite weird then :)

@astrobleem

I was able to get the 3B parameter model to work on CPU with 16GB of RAM.

@enricoros
Author

enricoros commented Apr 21, 2023

> I was able to get the 3B parameter model to work on CPU with 16GB of RAM.

Did you use any tricks, such as the dtype or similar?

@astrobleem

I had to disable torch.backends.cudnn and convert the model to float32.

Check out my repo: https://github.com/astrobleem/Simple-StableLM-Chat

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "stabilityai/stablelm-tuned-alpha-3b"  # assumed for this snippet; not shown in the original
device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.float().to(device)              # keep weights in fp32 on the CPU
torch.backends.cudnn.enabled = False  # disable cuDNN, as noted above

@astrobleem

[screenshot]

@chromato99

chromato99 commented Apr 21, 2023

I am using a Radeon 6900 XT (16GB VRAM) and the quick start code from the README works well! (Using stabilityai/stablelm-tuned-alpha-7b.)

I used the rocm/pytorch docker image, version rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview.

https://hub.docker.com/r/rocm/pytorch

EDIT: I tested it a little more, and it seems that 16GB of memory is not enough. When I set max_new_tokens to 1024, I got an OutOfMemory error. It seems difficult to use the 7b model with 16GB of VRAM.

@Ph0rk0z

Ph0rk0z commented Apr 21, 2023

Works on a P6000 24GB, up to a 3000-token context before it OOMs.

@lingster

I can confirm Tuned-7B works on my A6000 Ada / 48GB GPU :)

@jon-tow jon-tow pinned this issue Apr 24, 2023
@RaghavMajorBoost

RaghavMajorBoost commented Apr 24, 2023

Have any of you run into this error when running the model? I've attempted the method where the model is quantized to an 8-bit version, but it seems to cause a problem with the probability tensor/tokens.

[screenshot of the error]

For those of you who are using the 8-bit version of StableLM, how did you get the chatbot up and running?

@adamkdean

adamkdean commented Apr 24, 2023

RTX 3080 Ti (12 GB VRAM) Tuned-3B ✅
RTX 3080 Ti (12 GB VRAM) Tuned-7B 🚫 (CUDA OOM)

@qJake

qJake commented Apr 25, 2023

RTX 3070 (8GB VRAM) Tuned-3B (fp16) ✅
RTX 3070 (8GB VRAM) Tuned-3B (fp32) 🚫
RTX 3070 (8GB VRAM) Tuned-7B (fp16) 🚫
RTX 3070 (8GB VRAM) Tuned-7B (fp32) 🚫

@twmmason twmmason unpinned this issue Apr 25, 2023
@twmmason twmmason reopened this Apr 25, 2023
@nekhbet

nekhbet commented Apr 26, 2023

RTX 3060 (12GB VRAM) Tuned-3B (fp16) OK

@behelit2

stablelm-tuned-alpha-3b (fp16) works on a Tesla K80. I load it on GPU2 because it runs cooler.

@swallowave

A 7900 XTX 24GB is OK with tuned-7B, using a Docker-based ROCm 5.5-rc5 and PyTorch 2.0 setup.

@XxSamaxX

> Have any of you run into this error when running the model? I've attempted the method where the model is quantized to an 8-bit version, but it seems to cause a problem with the probability tensor/tokens.
>
> [screenshot of the error]
>
> For those of you who are using the 8-bit version of StableLM, how did you get the chatbot up and running?

I had the same error. Checking around, I discovered that there is a parameter you can add to the generate() function called remove_invalid_values; if you set it to True, it should work :) I leave here the parameters that I used:

tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    temperature=0.7,
    do_sample=True,
    stopping_criteria=StoppingCriteriaList([StopOnTokens()]),
    remove_invalid_values=True,  # drop nan/inf logits so sampling doesn't crash
)

PS:
RTX 3080 (12GB VRAM) Tuned-7B (fp16) OK
