
Convert i64 input tensors to i32 for GPU plugin #781

Open
wants to merge 3 commits into main from taylor_reduce_i64_conversion

Conversation

@yeonbok commented Jun 25, 2024

What does this PR do?

The OpenVINO GPU plugin does not natively support int64, so i64 inputs are always converted to i32 at runtime. To avoid this runtime conversion, this PR updates the IO tensor precision to i32.
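For illustration, a minimal sketch of how an i64 input can be overridden to i32 with OpenVINO's PrePostProcessor (model path and the loop are illustrative, not necessarily the exact code in this PR):

```python
import openvino as ov
from openvino.preprocess import PrePostProcessor

core = ov.Core()
model = core.read_model("openvino_model.xml")  # path assumed for illustration

ppp = PrePostProcessor(model)
for ov_input in model.inputs:
    # Downgrade i64 inputs (input_ids, attention_mask, position_ids, ...) to i32
    if ov_input.get_element_type() == ov.Type.i64:
        ppp.input(ov_input.get_any_name()).tensor().set_element_type(ov.Type.i32)
model = ppp.build()
```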

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@yeonbok force-pushed the taylor_reduce_i64_conversion branch from e5fff9f to 1d4800f on June 25, 2024 06:47
… always converted to i32. To avoid runtime conversion, updated IO tensor precision to i32.
@yeonbok force-pushed the taylor_reduce_i64_conversion branch from 1d4800f to b502ee0 on June 25, 2024 06:54
@eaidova (Collaborator) commented Jun 25, 2024

@yeonbok thank you for your PR, but I'm not sure this is the right place to do it:

  1. The device can be changed at runtime with the model.to() method, while you apply these changes only at initialization. So the precision will not be updated if the device is changed later, and conversely, if this is applicable only to GPU, it will still be used on CPU if the model was initially loaded for GPU. It may also produce inconsistencies between a freshly converted model and a model dumped from the model class with save_pretrained after the preprocessing step has been applied (see the sketch after this list).
  2. An additional question: could this feature be useful for CPU as well? @dmitry-gorokhov @maxnick @usstq maybe you can comment here?
  3. Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has a very large vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.
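For example, a hypothetical sequence illustrating point 1 (model id and paths are made up):

```python
from optimum.intel import OVModelForCausalLM

# Input precision would be patched at load time because the target device is GPU...
model = OVModelForCausalLM.from_pretrained("some/model-id", export=True, device="GPU")

# ...but the device can change afterwards, and the init-time patch is not revisited:
model.to("CPU")

# A model dumped now also differs from a freshly converted one:
model.save_pretrained("./exported-model")
```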

@dmitry-gorokhov

@eaidova Yes, it might be beneficial for CPU as well (internally we convert i64 to i32 anyway). I would say it makes sense to align this logic across all devices.

@yeonbok (Author) commented Jun 25, 2024

@eaidova Thanks for the review, and @dmitry-gorokhov thanks for the input on the CPU plugin. It sounds like this can be applied generally; I will remove the GPU-only condition.

Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has a very large vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.

Actually, the current GPU (and CPU) plugins always convert to i32; even if a model requires true i64, we do not support it. I will add a comment about this.

@yeonbok requested a review from @eaidova on June 27, 2024 18:51
@eaidova requested reviews from @AlexKoff88 and @echarlaix on June 28, 2024 04:10
@eaidova mentioned this pull request on Jun 28, 2024
@AlexKoff88 (Collaborator)

There was a discussion about this some time ago. Maybe @slyalin remembers the context.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@slyalin (Contributor) commented Jun 28, 2024

What's the expected gain from this change? Do you have numbers?

By doing this and getting the desired effect of "avoiding runtime conversion", you need to align the other parts of the pipeline that supply this data: here in optimum-intel and in the deployment scenarios in C++ code. I don't think you can avoid this runtime conversion in optimum; it will just happen in another place, where the int64 torch tensor will be converted to our updated i32 input tensor. Right?
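To illustrate (a sketch with made-up token ids), the cast would simply move from the plugin into the Python glue code feeding the model:

```python
import numpy as np

# Tokenizers emit int64 token ids by default
input_ids = np.array([[1, 15043, 3186]], dtype=np.int64)

# If the compiled model now declares i32 inputs, the conversion happens here
# instead of inside the plugin -- it is relocated, not eliminated.
input_ids = input_ids.astype(np.int32)
```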

As for the C++ code: for example, the code at https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/greedy_decoding.cpp directly relies on i64. You need to propagate this change to the other parts to understand the impact (and the value), including tokenizers and detokenizers. Just to "avoid runtime conversion".

Sure, in practice I don't believe we will see real sizes above 2G, so the numbers could be represented as i32. But I don't see how this PR alone is enough to leverage that possibility in a consistent manner.

@apaniukov (Contributor)

OpenVINO Tokenizer uses i32 internally too, but sets the output/input types to i64 by default for compatibility. One can set Type.i32 via the CLI or the convert_tokenizer function, and we can easily switch the default to i32 if needed.
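For reference, a minimal sketch of setting the output type at tokenizer conversion time (the tokenizer_output_type keyword is my assumption of the parameter name; check the openvino-tokenizers documentation for the exact signature):

```python
from transformers import AutoTokenizer
from openvino import Type
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# tokenizer_output_type is assumed to control the element type of the produced token ids
ov_tokenizer = convert_tokenizer(hf_tokenizer, tokenizer_output_type=Type.i32)
```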

I don't remember any model with a vocab size over 1_000_000, even among the multilingual encoder models. The Gemma2 model has a 256k vocab, GPT-4 a 200k vocab, and other LLMs have smaller vocabs. Extrapolating the TinyLlama vocab to 1M tokens with a hidden state size of 2048 results in a 1_000_000 * 2048 * 2 bytes = 4.096 GB embedding matrix (using fp16/bf16), which is more than the 2.2 GB of the full model (32k vocab).
I think it is safe to use i32 for token indices.
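A quick sanity check of the headroom (illustrative numbers only):

```python
import numpy as np

I32_MAX = np.iinfo(np.int32).max  # 2_147_483_647
largest_vocab = 256_000           # Gemma2-scale vocabulary
print(I32_MAX // largest_vocab)   # 8388 -- token ids are nowhere near overflowing i32
```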

@yeonbok (Author) commented Jul 29, 2024

Hi everyone, sorry for the belated update. I will collect performance data in the near future. Thanks for the discussion.

@yeonbok (Author) commented Jul 29, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).
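A back-of-the-envelope calculation for that example (Mistral's 32k vocabulary is assumed):

```python
context_len = 10_000
vocab_size = 32_000  # Mistral-7B vocabulary size (assumed)

fp32_logits_bytes = context_len * vocab_size * 4  # 1.28 GB
fp16_logits_bytes = context_len * vocab_size * 2  # 0.64 GB
print(fp32_logits_bytes / 1e9, fp16_logits_bytes / 1e9)
```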

@eaidova (Collaborator) commented Jul 30, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).

@yeonbok I do not recommend doing that, for several reasons:

  1. There is no clear evidence that a model is fp16-only; usually models are mixed precision, so we cannot determine when applying this is safe and when it is not, and as a result it can bring accuracy loss. You can do it on the runtime/plugin side instead (as sketched below): fp16 is a runtime hint, not tied to a specific IR, and it may be changed after being set, in which case you would need to revert these changes if the precision or the model device changes.
  2. As far as I know, fp16 is a non-standardized precision and can be implemented differently across devices, libraries and platforms. It would not be user-friendly if the model returned this type, and it may require additional actions to reinterpret it into the correct precision (I recall issues with the OV Python API, which returned i16 instead of fp16 and then it was necessary to convert it to fp16 on the user side; possibly it is already fixed, because that was my observation in the accuracy checker with NPU some time ago).
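A minimal sketch of the runtime-hint approach from point 1, keeping the IR and its fp32 logits output untouched (standard OpenVINO property key; model path assumed):

```python
import openvino as ov

core = ov.Core()
model = core.read_model("openvino_model.xml")

# fp16 stays an execution hint; the logits output remains fp32 at the API boundary
compiled = core.compile_model(model, "GPU", {"INFERENCE_PRECISION_HINT": "f16"})
```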

@echarlaix added the openvino-test (Trigger OpenVINO slow tests) label on Jul 30, 2024
@yeonbok (Author) commented Jul 30, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).

@yeonbok I do not recommend doing that, for several reasons:

  1. There is no clear evidence that a model is fp16-only; usually models are mixed precision, so we cannot determine when applying this is safe and when it is not, and as a result it can bring accuracy loss. You can do it on the runtime/plugin side instead: fp16 is a runtime hint, not tied to a specific IR, and it may be changed after being set, in which case you would need to revert these changes if the precision or the model device changes.
  2. As far as I know, fp16 is a non-standardized precision and can be implemented differently across devices, libraries and platforms. It would not be user-friendly if the model returned this type, and it may require additional actions to reinterpret it into the correct precision (I recall issues with the OV Python API, which returned i16 instead of fp16 and then it was necessary to convert it to fp16 on the user side; possibly it is already fixed, because that was my observation in the accuracy checker with NPU some time ago).

Thanks for the response. Oh, I see; as you mentioned before, the runtime device can be changed via model.to(), which would be problematic for other devices...
