
Upgrade SynapseAI version to 1.17.0 #208

Merged: 19 commits into huggingface:habana-main on Aug 26, 2024

Conversation

@yuanwu2017 (Collaborator) commented Aug 13, 2024

What does this PR do?

Upgrades the SynapseAI version to 1.17.0.
Known issue: still needs to switch to the official OH (optimum-habana) release.

ci_09082024 test:

  1. On 1 Gaudi/Gaudi2 card
    model=meta-llama/Llama-2-7b-hf
    docker run --rm -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
    [screenshots of test results]
  2. On 1 Gaudi/Gaudi2 card using PyTorch eager mode with torch.compile
    model=meta-llama/Llama-2-7b-hf
    docker run -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_LAZY_MODE=0 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
    [screenshots of test results]

  3. On 8 Gaudi/Gaudi2 cards
model=meta-llama/Llama-2-70b-hf

docker run --rm -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --sharded true --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048
[screenshots of test results]

  4. Llama 7b BF16 on 1 Gaudi2 card:
    model=meta-llama/Llama-2-7b-chat-hf
docker run --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e HF_HUB_ENABLE_HF_TRANSFER=1 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e PREFILL_BATCH_BUCKET_SIZE=1 \
   -e BATCH_BUCKET_SIZE=16 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id $model \
   --max-input-tokens 1024 \
   --max-batch-prefill-tokens 4096 \
   --max-total-tokens 2048 \
   --max-batch-size 16
[screenshots of test results]
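For a quick sanity check of any of the launched servers above, a minimal request against TGI's /generate endpoint can be used. This is a sketch, assuming the -p 8083:80 port mapping used in most of the commands above (test 3 maps port 8080 instead) and that the model has finished loading:

```python
# Minimal smoke test against a launched TGI container (sketch; port mapping
# assumed from the docker commands above).
import requests

payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 32},
}
resp = requests.post("http://127.0.0.1:8083/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])
```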

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@regisss @libinta @mandy-li

@yuanwu2017 marked this pull request as draft on August 13, 2024 07:05
@mandy-li requested review from mandy-li and regisss on August 15, 2024 16:18
@mandy-li (Collaborator):

@yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.

@yuanwu2017 marked this pull request as ready for review on August 16, 2024 00:42
@yuanwu2017 (Collaborator, Author):

> @yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.

Can you tell me how to run the CVE scan? Thanks.

@mandy-li (Collaborator):

> @yuanwu2017, just a reminder: when you upgrade package versions, please refer to #206 as well to make sure CVE issues are also covered.
>
> Can you tell me how to run the CVE scan? Thanks.

The CVE scan is done by another Intel team. They found that some of those Python package versions have CVE issues.

@yuanwu2017 (Collaborator, Author):

Updated the ci_09082024 test results.

(Resolved, outdated review thread on server/requirements.txt)
@yuanwu2017 (Collaborator, Author):

llava-next test
Command:

model=llava-hf/llava-v1.6-mistral-7b-hf
volume=/home/yuanwu/workspace/data
docker run -it -p 8083:80 \
   -v ~/workspace/data:/data \
   -v ~/workspace/tmp:/tmp \
   -v ~/workspace:/workspace \
   --runtime=habana \
   --name optimum-1.17 \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=1,2,4,5 \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 --model-id $model \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384

client source code:

from huggingface_hub import InferenceClient
import base64
import requests
import io

client = InferenceClient("http://127.0.0.1:8083")

# read image from local file
image_path = "rabbit.png"
#image_path = "llava_v1_5_radar.jpg"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

tokens = ''
for token in client.text_generation(prompt, max_new_tokens=40, stream=True):
    tokens += token
print(tokens)
[screenshots of test output]
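As an optional follow-up (not part of the test above): downscaling the input image before embedding it in the prompt generally reduces the number of image tokens llava-next produces, which helps stay within --max-input-tokens. A sketch, assuming Pillow is installed and the same rabbit.png file; the 672-pixel target size is only an example:

```python
# Sketch: downscale the image before building the data-URI prompt.
import base64
import io

from PIL import Image

img = Image.open("rabbit.png")
img.thumbnail((672, 672))  # resize in place, preserving the aspect ratio

buf = io.BytesIO()
img.save(buf, format="PNG")
image_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
prompt = f"![](data:image/png;base64,{image_b64})What is this a picture of?\n\n"
```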

@tthakkal (Collaborator):

> llava-next test Command: […]

@yuanwu2017 what is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.
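As a side note, the limits a running server was actually launched with can be checked via TGI's /info endpoint. A sketch, assuming the port 8083 mapping from the command above; the exact field names can differ between TGI versions:

```python
# Query the running server's configuration (sketch; field names may vary).
import requests

info = requests.get("http://127.0.0.1:8083/info", timeout=30).json()
print(info.get("max_input_length"), info.get("max_total_tokens"))
```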

@yuanwu2017 (Collaborator, Author) commented Aug 21, 2024 via email

@yuanwu2017 (Collaborator, Author):

> llava-next test Command: […]
>
> @yuanwu2017 what is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.

Done.

@yuanwu2017 (Collaborator, Author) commented Aug 21, 2024

Updated the performance data.
command:

docker run -it --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -v ~/workspace:/workspace \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HABANA_VISIBLE_MODULES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e BATCH_BUCKET_SIZE=8 \
   -e PREFILL_BATCH_BUCKET_SIZE=8 \
   -e ENABLE_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id meta-llama/Llama-2-13b-chat-hf \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384 \
   --max-batch-total-tokens 81920 \
   --sharded true --num-shard 8

client:
hey -t 0 -m POST -D ./data.json -H "Content-Type: application/json" -c 5 -n 10 http://127.0.0.1:8083/generate
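Here data.json is the JSON body that hey POSTs to /generate. The actual file used for these runs is not included in the PR; a hypothetical example payload could be generated like this:

```python
# Hypothetical example of ./data.json (the real benchmark payload is not
# shown in this PR).
import json

payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 128},
}
with open("data.json", "w") as f:
    json.dump(payload, f)
```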

Result:
first round: [screenshot]
second round: [screenshot]

1.13-release branch: [screenshot]

@yuanwu2017 (Collaborator, Author):

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf, so PR 1272 must be included in the new OH release.

This reverts commit c84acb1.
@tthakkal (Collaborator):

> Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf, so PR 1272 must be included in the new OH release.

Why not set cache_position only when the model is Mixtral?

@yuanwu2017 (Collaborator, Author):

> Why not set cache_position only when the model is Mixtral?

Because optimum-habana already knows all the information and can calculate the cache_position, adding this processing to different models is a bit of a workaround.

@tthakkal (Collaborator):

> Because optimum-habana already knows all the information and can calculate the cache_position, adding this processing to different models is a bit of a workaround.

In optimum-habana, cache_position is not calculated in modeling_mixtral but is set initially here: https://github.com/huggingface/optimum-habana/blob/3e7ff03a54068d7bac8114b510ed546f32d909e6/optimum/habana/transformers/generation/utils.py#L2199
Not sure whether we need to follow a similar pattern or let the model calculate it.
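For context, the usual pattern in recent transformers releases is that generation derives cache_position from how many tokens are already in the KV cache. An illustrative sketch only, not the optimum-habana code linked above:

```python
# Illustrative only: derive cache_position from the current KV-cache length.
import torch

input_ids = torch.tensor([[101, 102, 103, 104]])  # tokens fed in this step
past_seen_tokens = 0                               # KV-cache length so far

cache_position = torch.arange(
    past_seen_tokens,
    past_seen_tokens + input_ids.shape[1],
    device=input_ids.device,
)
print(cache_position)  # tensor([0, 1, 2, 3]) for the prefill step
```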

@yuanwu2017 (Collaborator, Author):

@regisss @mandy-li @tthakkal Switched to the OH 1.13.1 official release and tested the following models; they all passed. Please help review the patch.

mistralai/Mixtral-8x7B-v0.1
llava-hf/llava-v1.6-mistral-7b-hf
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-70b-hf
meta-llama/Llama-2-13b-chat-hf

@mandy-li (Collaborator) left a comment:

LGTM. @regisss, please merge this one first, and then all the other PRs that depend on it (Synapse 1.17 and OH 1.13.1).

@regisss (Collaborator) left a comment:

LGTM!

@regisss merged commit a8cead1 into huggingface:habana-main on Aug 26, 2024.