
Upgrade SynapseAI version to 1.17.0 #208

Merged: 19 commits into huggingface:habana-main on Aug 26, 2024

Conversation

@yuanwu2017 (Collaborator) commented Aug 13, 2024

What does this PR do?

Upgrades the SynapseAI version to 1.17.0.
Known issue: still needs to switch to the official OH (optimum-habana) release.

ci_09082024 test:

  1. On 1 Gaudi/Gaudi2 card
    model=meta-llama/Llama-2-7b-hf
    docker run --rm -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
    [screenshots of test results]
  2. On 1 Gaudi/Gaudi2 card using PyTorch eager mode with torch.compile
    model=meta-llama/Llama-2-7b-hf
    docker run -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_LAZY_MODE=0 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
    [screenshots of test results]

  3. On 8 Gaudi/Gaudi2 cards
model=meta-llama/Llama-2-70b-hf

docker run --rm -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --sharded true --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048
[screenshots of test results]

  4. Llama 7b BF16 on 1 Gaudi2 card:
    model=meta-llama/Llama-2-7b-chat-hf
docker run --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e HF_HUB_ENABLE_HF_TRANSFER=1 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e PREFILL_BATCH_BUCKET_SIZE=1 \
   -e BATCH_BUCKET_SIZE=16 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id $model \
   --max-input-tokens 1024 \
   --max-batch-prefill-tokens 4096 \
   --max-total-tokens 2048 \
   --max-batch-size 16
[screenshots of test results]
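For a quick sanity check of any of the launched servers above, a minimal request against TGI's /generate endpoint can be used. This is a sketch, assuming the -p 8083:80 port mapping used in most of the commands above (test 3 maps port 8080 instead) and that the model has finished loading:

```python
# Minimal smoke test against a launched TGI container (sketch; port mapping
# assumed from the docker commands above).
import requests

payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 32},
}
resp = requests.post("http://127.0.0.1:8083/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])
```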

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@regisss @libinta @mandy-li

@yuanwu2017 marked this pull request as draft on August 13, 2024 07:05
@mandy-li requested review from mandy-li and regisss on August 15, 2024 16:18
@mandy-li (Collaborator):

@yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.

@yuanwu2017 marked this pull request as ready for review on August 16, 2024 00:42
@yuanwu2017 (Collaborator, Author):

> @yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.

Can you tell me how to run the CVE scan? Thanks.

@mandy-li (Collaborator):

> @yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.
>
> Can you tell me how to run the CVE scan? Thanks.

The CVE scan is done by another Intel team. They found that some of those Python package versions have CVE issues.

@yuanwu2017 (Collaborator, Author):

Updated the ci_09082024 test results.

(Resolved, outdated review thread on server/requirements.txt)
@yuanwu2017 (Collaborator, Author):

llava-next test
Command:

model=llava-hf/llava-v1.6-mistral-7b-hf
volume=/home/yuanwu/workspace/data
docker run -it -p 8083:80 \
   -v ~/workspace/data:/data \
   -v ~/workspace/tmp:/tmp \
   -v ~/workspace:/workspace \
   --runtime=habana \
   --name optimum-1.17 \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=1,2,4,5 \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 --model-id $model \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384

client source code:

from huggingface_hub import InferenceClient
import base64
import requests
import io

client = InferenceClient("http://127.0.0.1:8083")

# read image from local file
image_path = "rabbit.png"
#image_path = "llava_v1_5_radar.jpg"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

tokens = ''
for token in client.text_generation(prompt, max_new_tokens=40, stream=True):
    tokens += token
print(tokens)
[screenshots of test output]
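As an optional follow-up (not part of the test above): downscaling the input image before embedding it in the prompt generally reduces the number of image tokens llava-next produces, which helps stay within --max-input-tokens. A sketch, assuming Pillow is installed and the same rabbit.png file; the 672-pixel target size is only an example:

```python
# Sketch: downscale the image before building the data-URI prompt.
import base64
import io

from PIL import Image

img = Image.open("rabbit.png")
img.thumbnail((672, 672))  # resize in place, preserving the aspect ratio

buf = io.BytesIO()
img.save(buf, format="PNG")
image_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
prompt = f"![](data:image/png;base64,{image_b64})What is this a picture of?\n\n"
```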

@tthakkal (Collaborator):

> llava-next test Command: […]

@yuanwu2017 what is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.
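As a side note, the limits a running server was actually launched with can be checked via TGI's /info endpoint. A sketch, assuming the port 8083 mapping from the command above; the exact field names can differ between TGI versions:

```python
# Query the running server's configuration (sketch; field names may vary).
import requests

info = requests.get("http://127.0.0.1:8083/info", timeout=30).json()
print(info.get("max_input_length"), info.get("max_total_tokens"))
```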

@yuanwu2017 (Collaborator, Author) commented Aug 21, 2024 via email

@yuanwu2017 (Collaborator, Author):

> llava-next test Command: […]
>
> @yuanwu2017 what is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.

Done.

@yuanwu2017 (Collaborator, Author) commented Aug 21, 2024

Updated the performance data.
command:

docker run -it --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -v ~/workspace:/workspace \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HABANA_VISIBLE_MODULES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e BATCH_BUCKET_SIZE=8 \
   -e PREFILL_BATCH_BUCKET_SIZE=8 \
   -e ENABLE_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id meta-llama/Llama-2-13b-chat-hf \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384 \
   --max-batch-total-tokens 81920 \
   --sharded true --num-shard 8

client:
hey -t 0 -m POST -D ./data.json -H "Content-Type: application/json" -c 5 -n 10 http://127.0.0.1:8083/generate
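Here data.json is the JSON body that hey POSTs to /generate. The actual file used for these runs is not included in the PR; a hypothetical example payload could be generated like this:

```python
# Hypothetical example of ./data.json (the real benchmark payload is not
# shown in this PR).
import json

payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 128},
}
with open("data.json", "w") as f:
    json.dump(payload, f)
```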

Result:
first round: [screenshot]
second round: [screenshot]

1.13-release branch: [screenshot]

@yuanwu2017 (Collaborator, Author):

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf, so PR 1272 must be included in the new OH release.

This reverts commit c84acb1.
@tthakkal (Collaborator):

> Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf, so PR 1272 must be included in the new OH release.

Why not set cache_position only when the model is Mixtral?

@yuanwu2017 (Collaborator, Author):

> Why not set cache_position only when the model is Mixtral?

Because optimum-habana already knows all the information and can calculate the cache_position, adding this processing to different models is a bit of a workaround.

@tthakkal (Collaborator):

> Because optimum-habana already knows all the information and can calculate the cache_position, adding this processing to different models is a bit of a workaround.

In optimum-habana, cache_position is not calculated in modeling_mixtral but is set initially here: https://github.com/huggingface/optimum-habana/blob/3e7ff03a54068d7bac8114b510ed546f32d909e6/optimum/habana/transformers/generation/utils.py#L2199
Not sure whether we need to follow a similar pattern or let the model calculate it.
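For context, the usual pattern in recent transformers releases is that generation derives cache_position from how many tokens are already in the KV cache. An illustrative sketch only, not the optimum-habana code linked above:

```python
# Illustrative only: derive cache_position from the current KV-cache length.
import torch

input_ids = torch.tensor([[101, 102, 103, 104]])  # tokens fed in this step
past_seen_tokens = 0                               # KV-cache length so far

cache_position = torch.arange(
    past_seen_tokens,
    past_seen_tokens + input_ids.shape[1],
    device=input_ids.device,
)
print(cache_position)  # tensor([0, 1, 2, 3]) for the prefill step
```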

@yuanwu2017 (Collaborator, Author):

@regisss @mandy-li @tthakkal Switched to the OH 1.13.1 official release and tested the following models; they all passed. Please help review the patch.

mistralai/Mixtral-8x7B-v0.1
llava-hf/llava-v1.6-mistral-7b-hf
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-70b-hf
meta-llama/Llama-2-13b-chat-hf

@mandy-li (Collaborator) left a comment:

LGTM. @regisss, please merge this one first, and then all the other PRs that depend on it (Synapse 1.17 and OH 1.13.1).

@regisss (Collaborator) left a comment:

LGTM!

@regisss merged commit a8cead1 into huggingface:habana-main on Aug 26, 2024.