TGI number of maximum total token to be handled by Llama2. How to increase from 2048 to 4096 ? #1421

ansSanthoshM · 2024-01-09T11:44:16Z

ansSanthoshM
Jan 9, 2024

I have implemented TGI for Llama2 70B chat model.

I have taken weights downloaded from Meta in .pth format and converted them to .safetensor fromat. Sucessfully set up inference server end point using docket.

I am using Langchain orchestration. Every thing works good. I am facing an challange with number of maximum total token to be handled by Llama2.

When the input token size is more than 1024 i am getting below error

text_generation.errors.ValidationError: Input validation error: inputs must have less than 1024 tokens. Given: 1220

also i am getting another error when total expected token count is more than 2048

text_generation.errors.ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 2048. Given: 2856 inputs tokens and 512 max_new_tokens

Question : How to over come this error ? and where to set max_position_embeddings=4096 when using TGI ?

Python code:

from langchain.llms import HuggingFaceTextGenInference
server="http://0.0.0.0:8080" # 70B Model
llm= HuggingFaceTextGenInference(
inference_server_url=server,
top_k=10,
top_p=0.95,
typical_p=0.95,
temperature=0.1,
repetition_penalty=1.03,
)
SYSTEM_PROMPT =
You are a helpful and exprienced review assistant. Understand points provided user message. Summarize your findings and provide insights.Keep Your responses Concise.
1 ) How will we ..... continue upto ~2850 token
user_input ="What is the total number of points provided as input ?
res=llm(SYSTEM_PROMPT+f"{user_input} ",)**

here is the full error message

Traceback (most recent call last):
File "d:\AIML\AIMLenv\Excel_Insights\ExcelInsigh_v3_direct_Sug_inf.py", line 67, in
res=llm(SYSTEM_PROMPT+f"{user_input} [/INST]",)
File "D:\AIML\AIMLenv\lib\site-packages\langchain_core\language_models\llms.py", line 892, in call
self.generate(
File "D:\AIML\AIMLenv\lib\site-packages\langchain_core\language_models\llms.py", line 666, in generate
output = self._generate_helper(
File "D:\AIML\AIMLenv\lib\site-packages\langchain_core\language_models\llms.py", line 553, in _generate_helper
raise e
File "D:\AIML\AIMLenv\lib\site-packages\langchain_core\language_models\llms.py", line 540, in _generate_helper
self._generate(
File "D:\AIML\AIMLenv\lib\site-packages\langchain_core\language_models\llms.py", line 1069, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "D:\AIML\AIMLenv\lib\site-packages\langchain_community\llms\huggingface_text_gen_inference.py", line 199, in _call
res = self.client.generate(prompt, **invocation_params)
File "D:\AIML\AIMLenv\lib\site-packages\text_generation\client.py", line 153, in generate
raise parse_error(resp.status_code, payload)
text_generation.errors.ValidationError: Input validation error: inputs must have less than 1024 tokens. Given: 1220

When i don't use TGI i use to increase the max total token by below way

model = AutoModelForCausalLM.from_pretrained(
model_dir,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
max_position_embeddings=4096 # bydefault 2048
)
llama_pipeline = pipeline(
'text-generation',
model = model,
tokenizer=tokenizer,
torch_dtype = torch.float16,
device_map = 'auto',
temperature = 0.1, # 'randomness' of outputs, 0.0 is the min and 1.0 the max
max_new_tokens = 500, # mex number of tokens to generate in the output
do_sample=True,
num_return_sequences=1)
llm = HuggingFacePipeline(pipeline=llama_pipeline)
res=llm(SYSTEM_PROMPT+f"{user_input} ",)**

Answered by ansSanthoshM

Jan 9, 2024

Figured it :) from TGI launcher help page
https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher

docker exec $model bash -c "text-generation-launcher --model-id /data/$model --max-total-tokens 4096 --max-input-length 3000 --num-shard $num_gpu"

View full answer

ansSanthoshM · 2024-01-09T12:09:47Z

ansSanthoshM
Jan 9, 2024
Author

Figured it :) from TGI launcher help page
https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher

docker exec $model bash -c "text-generation-launcher --model-id /data/$model --max-total-tokens 4096 --max-input-length 3000 --num-shard $num_gpu"

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TGI number of maximum total token to be handled by Llama2. How to increase from 2048 to 4096 ? #1421

{{title}}

Replies: 1 comment

{{title}}

Select a reply

TGI number of maximum total token to be handled by Llama2. How to increase from 2048 to 4096 ? #1421

ansSanthoshM Jan 9, 2024

Question : How to over come this error ? and where to set max_position_embeddings=4096 when using TGI ?

Replies: 1 comment

ansSanthoshM Jan 9, 2024 Author

ansSanthoshM
Jan 9, 2024

ansSanthoshM
Jan 9, 2024
Author