GPU support Table & VRAM usage #17
3B f16 runs on a 2080 Ti, though you might need a lot of RAM to convert f32 to f16 (peak is around 24 GB). Calculation of the lower bound of VRAM in GiB:

# 3B f16
>>> (3_638_525_952 * 2) / 1024 / 1024 / 1024
6.77728271484375
# 3B f32
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024
13.5545654296875
# 7B f16
>>> (7_869_358_080 * 2) / 1024 / 1024 / 1024
14.657821655273438
# 7B f32
>>> (7_869_358_080 * 4) / 1024 / 1024 / 1024
29.315643310546875 |
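The same lower-bound arithmetic as a small helper (the parameter counts are the ones quoted above; the helper name is just illustrative):

```python
# Lower bound on weight memory: parameter count times bytes per parameter.
def weight_gib(n_params: int, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

print(weight_gib(3_638_525_952, 2))  # 3B in f16 -> ~6.78 GiB
print(weight_gib(7_869_358_080, 2))  # 7B in f16 -> ~14.66 GiB
```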
Does it run on AMD? I can try an AMD RX 480 😎 |
Tesla P40 (24GB) - works |
Best bet is 4-bit quantization. 7B will likely run in 6 GB of VRAM at that level, as that's about the requirement for 7B with LLaMA. |
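Not from this thread, but for reference: with a recent transformers + bitsandbytes install, 4-bit loading looks roughly like the sketch below. The model ID is the tuned 7B used elsewhere in the thread; the config values are common defaults, not a tested recipe.

```python
# Sketch of 4-bit (NF4) loading via bitsandbytes; requires a recent
# transformers + bitsandbytes and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-tuned-alpha-7b",
    quantization_config=bnb_config,
    device_map="auto",
)
```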
Got the 7B models working on my Tesla M40 w/ 24GB of VRAM. |
Got the 7B running fine on my 4090. |
Keeping an eye on this issue. |
Not a gaming PC, but I just tried the Colab notebook with 83.5GB of system RAM and an A100 with 40GB. It's insanely fast to initialize, and the prompts on the tuned-alpha-7B model took around 2 seconds to complete. |
Able to run the tuned-alpha-3b on a 4070 Ti (12GB) |
For the sake of convenience (half the download size/RAM/VRAM), I've uploaded 16-bit versions of the tuned models to the HF Hub: |
Would you mind showing how you made the conversion? I'm new to this and would like to do the same for the base 7B model. Thanks. |
@cduk it's pretty much straightforward:

from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model.half().cuda()
model.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')
tokenizer.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')

This will save the model and the tokenizer locally; then you will have to upload them to the Hub. Good luck! |
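For the upload step, push_to_hub can be used instead of saving locally and uploading by hand; the repo name below is just a placeholder:

```python
# Optional: push the fp16 model and tokenizer straight to the Hugging Face Hub
# (requires being logged in, e.g. via `huggingface-cli login`).
model.push_to_hub("your-username/stablelm-tuned-alpha-3b-16bit")
tokenizer.push_to_hub("your-username/stablelm-tuned-alpha-3b-16bit")
```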
Nvidia T4 (16 GB) runs out of memory when trying to load the fp16 7B model. The 3B model runs smoothly in fp16. |
I only have 40GB of RAM, so the default code did not work for me for 7B. By changing the first lines to the following, RAM is limited to 17GB and the model loads in 9:50 min:

tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-base-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
"StabilityAI/stablelm-base-alpha-7b",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
)
model = model.to("cuda")

However, it then crashes because I have a T4, which only has 15.3GB, with the following error :/
I'm confident that with some messing around it will fit on the T4. It's so close! For now I will try running with device_map="auto". |
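For context, a minimal sketch of what that looks like (the next comment shows the resulting device map); this assumes the accelerate package is installed, and the exact layer split depends on free VRAM:

```python
# Sketch: let Accelerate place layers on GPU 0 and spill the rest to CPU RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-base-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-base-alpha-7b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
print(model.hf_device_map)  # shows which layers landed on GPU vs. CPU
```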
@antheas So close! Have you considered quantizing to 8-bit and seeing how well that works? I wonder whether 8-bit 7B would outperform fp16 3B. Both seem like they would fit within 8GB of VRAM on consumer GPUs. |
With device_map="auto":

model.hf_device_map = {'gpt_neox.embed_in': 0,
'gpt_neox.layers.0': 0,
'gpt_neox.layers.1': 0,
'gpt_neox.layers.2': 0,
'gpt_neox.layers.3': 0,
'gpt_neox.layers.4': 0,
'gpt_neox.layers.5': 0,
'gpt_neox.layers.6': 0,
'gpt_neox.layers.7': 0,
'gpt_neox.layers.8': 0,
'gpt_neox.layers.9': 0,
'gpt_neox.layers.10': 0,
'gpt_neox.layers.11': 0,
'gpt_neox.layers.12': 0,
'gpt_neox.layers.13': 0,
'gpt_neox.layers.14': 'cpu',
'gpt_neox.layers.15': 'cpu',
'gpt_neox.final_layer_norm': 'cpu',
'embed_out': 'cpu'}

Default inference takes 2m6s the first time and 20s the second time. A tad too slow for me. Example reply for the untuned model:
Loading took 18m, but I don't have access to direct SSD storage, so your mileage may vary. Will try 8-bit now, @cduk. |
So I came up with the following to use 8-bit quantization, @cduk:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, StoppingCriteria, StoppingCriteriaList
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0,
)
tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
"StabilityAI/stablelm-tuned-alpha-7b",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
load_in_8bit=True,
quantization_config=quantization_config,
device_map={'': 0}
)

Loading takes around 13GB of RAM at peak. I record the VRAM before running a prompt, then run the default script prompt and record the time and VRAM, then keep rerunning the script prompt and record the run times. These results are on a T4. I will play with llm_int8_threshold; running with 10.0 now, and I will update this table with results as they become available. This table is for the Nvidia T4 16GB (15.3 GiB avail.) card.
The threshold has a negligible effect on RAM. However, with a threshold of 4, prompts run a bit faster (?) |
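A minimal sketch of how such VRAM/latency numbers can be recorded (an illustrative helper, not the exact script used above; it assumes model and tokenizer are loaded as in the snippet above):

```python
# Time one generation and record peak GPU memory.
import time
import torch

def measure(prompt: str, max_new_tokens: int = 64) -> None:
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{elapsed:.1f}s, peak VRAM {peak_gib:.2f} GiB")
```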
Isn't this expected? The lower the threshold, the more weights are converted to int8 (hence less compute to do). |
The way I read it, it's the opposite. According to its description, values follow a normal distribution, with most falling within [-3.5, 3.5]. Might be wrong though. Built myself a little chat bot with ipywidgets. I'm playing a bit with the model now; it's quite fun. |
Ah, yes, sure, my mistake. Quite weird then :) |
I was able to get the 3B-parameter model to work on CPU with 16GB of RAM. |
Did you use any tricks such as the |
I had to disable torch.backends.cudnn and convert to float. Check out my repo: https://github.com/astrobleem/Simple-StableLM-Chat |
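For reference, a minimal sketch of what those two tweaks typically look like for CPU-only inference (not taken from the linked repo):

```python
# Illustrative sketch: disable cuDNN and run the 3B model in full float32 on CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.backends.cudnn.enabled = False   # skip cuDNN code paths entirely

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = model.float().to("cpu")        # keep weights in fp32; fp16 is slow on most CPUs
```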
I am using a Radeon 6900 XT (16GB VRAM) and the quick start code in the README works well! (Using stabilityai/stablelm-tuned-alpha-7b.) I used the rocm/pytorch Docker image, version rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview: https://hub.docker.com/r/rocm/pytorch EDIT: I tested it a little more and it seems that 16GB of memory is not enough. |
Works on a P6000 (24GB), up to about 3000 tokens of context before it OOMs. |
I can confirm Tuned-7B works on my A6000 Ada / 48GB GPU :) |
Have any of you run into this error when you have the model running? I've attempted the method where the model is quantized to an 8-bit version, but it seems to cause this problem with the probability tensor/tokens. For those of you who are using the 8-bit version of StableLM, how did you get the chatbot up and running? |
RTX 3080 Ti (12 GB VRAM) Tuned-3B ✅ |
RTX 3070 (8GB VRAM) Tuned-3B (fp16) ✅ |
RTX 3060 (12GB VRAM) Tuned-3B (fp16) OK |
stablelm-tuned-alpha-3b (fp16) works on a Tesla K80. I load it on GPU2 because it runs cooler. |
A 7900 XTX (24GB) is OK with tuned-7B, using Docker-based ROCm 5.5-rc5 and PyTorch 2.0. |
I had the same error. Checking around, I discovered that there is a parameter you can add to the generate() function called remove_invalid_values; if you set it to True, it should work :) I leave here the parameters that I used: |
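A minimal sketch of such a call with remove_invalid_values=True; apart from that flag, the arguments below are illustrative assumptions (and prompt, model, and tokenizer are assumed to be set up as in the quick-start code), not the comment's exact parameters:

```python
# remove_invalid_values=True replaces nan/inf logits so sampling doesn't
# crash on the probability tensor.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    remove_invalid_values=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```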
It would be great to get the instructions to run the 3B model locally on a gaming GPU (e.g. 3090/4090 with 24GB VRAM).
Confirmed GPUs

From this thread

Best RAM/VRAM TRICKS (from this thread)

- Convert models F32 -> F16 (lower RAM, faster load): #17 (comment)
- Low-memory model loads:
  - 8bit (BitsAndBytes): #17 (comment)
  - torch_dtype=torch.float16 & low_cpu_mem_usage: #17 (comment)
  - device_map=auto: #17 (comment)
- Other tricks
Weights RAM/VRAM (GB)

Activations

Empirical (numbers in bytes, fp32):

- total_tokens * 1,280,582
- total_tokens * 1,869,134

The regression fits at 0.99999989. For instance, with 32 input tokens and an output of 512 tokens (544 tokens total), about 969 MB of VRAM (almost 1 GB) of activations will be required. Haven't tested with batch size != 1.
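As a quick check of that arithmetic (the multiplier is the larger of the two quoted above; the helper name is just illustrative):

```python
# Activation estimate: total_tokens * bytes_per_token (empirical fp32 fit, batch size 1).
def activation_bytes(input_tokens: int, output_tokens: int, bytes_per_token: int = 1_869_134) -> int:
    return (input_tokens + output_tokens) * bytes_per_token

print(activation_bytes(32, 512) / 1024**2)  # ~969.7 MiB, i.e. almost 1 GB
```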
Examples of a few recorded activation numbers: