
Convert i64 input tensors to i32 for GPU plugin #781

Open
wants to merge 3 commits into main from taylor_reduce_i64_conversion

Conversation

@yeonbok commented Jun 25, 2024

What does this PR do?

The OpenVINO GPU plugin does not natively support int64, so i64 inputs are always converted to i32 at runtime. To avoid this runtime conversion, this PR updates the IO tensor precision to i32.
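For illustration, a minimal sketch of how an i64 input can be overridden to i32 with OpenVINO's PrePostProcessor (model path and the loop are illustrative, not necessarily the exact code in this PR):

```python
import openvino as ov
from openvino.preprocess import PrePostProcessor

core = ov.Core()
model = core.read_model("openvino_model.xml")  # path assumed for illustration

ppp = PrePostProcessor(model)
for ov_input in model.inputs:
    # Downgrade i64 inputs (input_ids, attention_mask, position_ids, ...) to i32
    if ov_input.get_element_type() == ov.Type.i64:
        ppp.input(ov_input.get_any_name()).tensor().set_element_type(ov.Type.i32)
model = ppp.build()
```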

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@yeonbok force-pushed the taylor_reduce_i64_conversion branch from e5fff9f to 1d4800f on June 25, 2024 06:47
… always converted to i32. To avoid runtime conversion, updated IO tensor precision to i32.
@yeonbok force-pushed the taylor_reduce_i64_conversion branch from 1d4800f to b502ee0 on June 25, 2024 06:54
@eaidova (Collaborator) commented Jun 25, 2024

@yeonbok thank you for your PR, but I'm not sure this is the right place to do it:

  1. The device can be changed at runtime with the model.to() method, while you apply these changes only at initialization. So the precision will not be updated if the device is changed later, and conversely, if this is applicable only to GPU, it will still be used on CPU if the model was initially loaded for GPU. It may also produce inconsistencies between a freshly converted model and a model dumped from the model class with save_pretrained after the preprocessing step has been applied (see the sketch after this list).
  2. An additional question: could this feature be useful for CPU as well? @dmitry-gorokhov @maxnick @usstq maybe you can comment here?
  3. Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has a very large vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.
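For example, a hypothetical sequence illustrating point 1 (model id and paths are made up):

```python
from optimum.intel import OVModelForCausalLM

# Input precision would be patched at load time because the target device is GPU...
model = OVModelForCausalLM.from_pretrained("some/model-id", export=True, device="GPU")

# ...but the device can change afterwards, and the init-time patch is not revisited:
model.to("CPU")

# A model dumped now also differs from a freshly converted one:
model.save_pretrained("./exported-model")
```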

@dmitry-gorokhov

@eaidova Yes, it might be beneficial for CPU as well (internally we convert i64 to i32 anyway). I would say it makes sense to align this logic across all devices.

@yeonbok (Author) commented Jun 25, 2024

@eaidova Thanks for the review, and @dmitry-gorokhov thanks for the input on the CPU plugin. It sounds like this can be applied generally; I will remove the GPU-only condition.

Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has a very large vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.

Actually, the current GPU (and CPU) plugins always convert to i32; even if a model requires true i64, we do not support it. I will add a comment about this.

@yeonbok requested a review from @eaidova on June 27, 2024 18:51
@eaidova requested reviews from @AlexKoff88 and @echarlaix on June 28, 2024 04:10
@eaidova mentioned this pull request on Jun 28, 2024
@AlexKoff88 (Collaborator)

There was a discussion about this some time ago. Maybe @slyalin remembers the context.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@slyalin (Contributor) commented Jun 28, 2024

What's the expected gain from this change? Do you have numbers?

By doing this and getting the desired effect of "avoiding runtime conversion", you need to align the other parts of the pipeline that supply this data: here in optimum-intel and in the deployment scenarios in C++ code. I don't think you can avoid this runtime conversion in optimum; it will just happen in another place, where the int64 torch tensor will be converted to our updated i32 input tensor. Right?
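To illustrate (a sketch with made-up token ids), the cast would simply move from the plugin into the Python glue code feeding the model:

```python
import numpy as np

# Tokenizers emit int64 token ids by default
input_ids = np.array([[1, 15043, 3186]], dtype=np.int64)

# If the compiled model now declares i32 inputs, the conversion happens here
# instead of inside the plugin -- it is relocated, not eliminated.
input_ids = input_ids.astype(np.int32)
```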

As for the C++ code: for example, the code at https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/greedy_decoding.cpp directly relies on i64. You need to propagate this change to the other parts to understand the impact (and the value), including tokenizers and detokenizers. Just to "avoid runtime conversion".

Sure, in practice I don't believe we will see real sizes above 2G, so the numbers could be represented as i32. But I don't see how this PR alone is enough to leverage that possibility in a consistent manner.

@apaniukov (Contributor)

OpenVINO Tokenizer uses i32 internally too, but sets the output/input types to i64 by default for compatibility. One can set Type.i32 via the CLI or the convert_tokenizer function, and we can easily switch the default to i32 if needed.
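For reference, a minimal sketch of setting the output type at tokenizer conversion time (the tokenizer_output_type keyword is my assumption of the parameter name; check the openvino-tokenizers documentation for the exact signature):

```python
from transformers import AutoTokenizer
from openvino import Type
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# tokenizer_output_type is assumed to control the element type of the produced token ids
ov_tokenizer = convert_tokenizer(hf_tokenizer, tokenizer_output_type=Type.i32)
```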

I don't remember any model with a vocab size over 1_000_000, even among the multilingual encoder models. The Gemma2 model has a 256k vocab, GPT-4 a 200k vocab, and other LLMs have smaller vocabs. Extrapolating the TinyLlama vocab to 1M tokens with a hidden state size of 2048 results in a 1_000_000 * 2048 * 2 bytes = 4.096 GB embedding matrix (using fp16/bf16), which is more than the 2.2 GB of the full model (32k vocab).
I think it is safe to use i32 for token indices.
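A quick sanity check of the headroom (illustrative numbers only):

```python
import numpy as np

I32_MAX = np.iinfo(np.int32).max  # 2_147_483_647
largest_vocab = 256_000           # Gemma2-scale vocabulary
print(I32_MAX // largest_vocab)   # 8388 -- token ids are nowhere near overflowing i32
```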

@yeonbok (Author) commented Jul 29, 2024

Hi everyone, sorry for the belated update. I will collect performance data in the near future. Thanks for the discussion.

@yeonbok (Author) commented Jul 29, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).
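A back-of-the-envelope calculation for that example (Mistral's 32k vocabulary is assumed):

```python
context_len = 10_000
vocab_size = 32_000  # Mistral-7B vocabulary size (assumed)

fp32_logits_bytes = context_len * vocab_size * 4  # 1.28 GB
fp16_logits_bytes = context_len * vocab_size * 2  # 0.64 GB
print(fp32_logits_bytes / 1e9, fp16_logits_bytes / 1e9)
```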

@eaidova (Collaborator) commented Jul 30, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).

@yeonbok I do not recommend doing that, for several reasons:

  1. There is no clear evidence that a model is fp16-only; usually models are mixed precision, so we cannot determine when applying this is safe and when it is not, and as a result it can bring accuracy loss. You can do it on the runtime/plugin side instead (as sketched below): fp16 is a runtime hint, not tied to a specific IR, and it may be changed after being set, in which case you would need to revert these changes if the precision or the model device changes.
  2. As far as I know, fp16 is a non-standardized precision and can be implemented differently across devices, libraries and platforms. It would not be user-friendly if the model returned this type, and it may require additional actions to reinterpret it into the correct precision (I recall issues with the OV Python API, which returned i16 instead of fp16 and then it was necessary to convert it to fp16 on the user side; possibly it is already fixed, because that was my observation in the accuracy checker with NPU some time ago).
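A minimal sketch of the runtime-hint approach from point 1, keeping the IR and its fp32 logits output untouched (standard OpenVINO property key; model path assumed):

```python
import openvino as ov

core = ov.Core()
model = core.read_model("openvino_model.xml")

# fp16 stays an execution hint; the logits output remains fp32 at the API boundary
compiled = core.compile_model(model, "GPU", {"INFERENCE_PRECISION_HINT": "f16"})
```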

@echarlaix added the openvino-test (Trigger OpenVINO slow tests) label on Jul 30, 2024
@yeonbok (Author) commented Jul 30, 2024

BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set to fp32, so if the context size is very large, it allocates quite a lot of memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted to fp32. So I think we could set the logits precision to fp16 when the inference precision is fp16. What do you think? E.g., when the context size is 10000, more than 1 GB is allocated for the output for Mistral, but it could be reduced to half (about 500 MB).

@yeonbok I do not recommend doing that, for several reasons:

  1. There is no clear evidence that a model is fp16-only; usually models are mixed precision, so we cannot determine when applying this is safe and when it is not, and as a result it can bring accuracy loss. You can do it on the runtime/plugin side instead: fp16 is a runtime hint, not tied to a specific IR, and it may be changed after being set, in which case you would need to revert these changes if the precision or the model device changes.
  2. As far as I know, fp16 is a non-standardized precision and can be implemented differently across devices, libraries and platforms. It would not be user-friendly if the model returned this type, and it may require additional actions to reinterpret it into the correct precision (I recall issues with the OV Python API, which returned i16 instead of fp16 and then it was necessary to convert it to fp16 on the user side; possibly it is already fixed, because that was my observation in the accuracy checker with NPU some time ago).

Thanks for the response. Oh, I see; as you mentioned before, the runtime device can be changed via model.to(), which would be problematic for other devices...
