TGI maximum total tokens handled by Llama 2: how to increase from 2048 to 4096? #1421
I have set up TGI for the Llama 2 70B chat model. I took the weights downloaded from Meta in .pth format, converted them to .safetensors format, and successfully brought up the inference server endpoint with Docker. I am using LangChain for orchestration, and everything works well. The challenge I am facing is the maximum total number of tokens Llama 2 will handle: when the input token count is more than 1024, I get the error below.
I also get another error when the total expected token count is more than 2048.
Question: how do I overcome these errors, and where do I set max_position_embeddings=4096 when using TGI?
Python code:
Here is the full error message:
When I don't use TGI, I increase the max total tokens the following way:
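(The original snippet is not shown above; the following is only a minimal sketch of one common way to do this when loading the model directly with the transformers library. The model path is a placeholder, and note that Llama 2's config already ships with max_position_embeddings=4096.)

# Minimal sketch (not the original snippet): override the context length when
# loading Llama 2 directly with transformers instead of TGI.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "/data/llama-2-70b-chat-hf"  # placeholder local path

config = AutoConfig.from_pretrained(model_path)
config.max_position_embeddings = 4096  # Llama 2 supports a 4096-token context

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",   # requires accelerate for automatic GPU placement
    torch_dtype="auto",
)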
Figured it out :) from the TGI launcher help page:
https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher
docker exec $model bash -c "text-generation-launcher --model-id /data/$model --max-total-tokens 4096 --max-input-length 3000 --num-shard $num_gpu"
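The same launcher flags can also be passed when the container is first started. A sketch with docker run, assuming the standard TGI image from the docs linked above; the image tag, port, shared-memory size, and volume path are placeholders, so adjust them to your deployment:

# Start TGI with a 4096-token total budget and 3000-token inputs.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:1.3 \
    --model-id /data/$model \
    --max-input-length 3000 \
    --max-total-tokens 4096 \
    --num-shard $num_gpu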