concurrency without model cloning #573
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
kwargs["infer_context"] = infer_context | ||
return super().generate(*args, **kwargs) | ||
|
||
def __call__(self, *args, **kwargs): |
@dtrawins, please explain why __call__ should behave differently from forward. __call__ will eventually call forward without adding any semantics on top of it, so why can't we move this code to forward?
@slyalin That is indeed not required. Potentially it could be used to pass the infer_request context, but assuming we create a new request in the forward method whenever the generate method didn't pass the context, the override is not needed.
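For illustration, a rough sketch of that fallback (helper names such as _prepare_inputs and the exact output handling are assumptions, not the code in this PR) could look like:

import torch

def forward(self, input_ids, attention_mask=None, infer_request=None, **kwargs):
    self.compile()
    if infer_request is None:
        # generate() did not hand down a context, so allocate a per-call request
        infer_request = self.compiled_model.create_infer_request()
    inputs = self._prepare_inputs(input_ids, attention_mask, **kwargs)  # hypothetical helper
    infer_request.start_async(inputs, share_inputs=True)
    infer_request.wait()
    logits = torch.from_numpy(infer_request.get_tensor("logits").data)
    # returning the request lets generate() thread the same context into the next step
    return CausalLMOutputWithPast(logits=logits, infer_request=infer_request)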
Thanks a lot for your work @dtrawins
def compile(self):
if self.request is None:
if self.compiled_model is None:
super().compile()
self.request = self.request.create_infer_request()
self.compiled_model = self.request
If we don't need to call self.request.create_infer_request(), then there is no need to override this method; I think we should remove it.
Also, if we want to rename request to compiled_model, I think we should do it for all OVModels and add a warning stating that the request attribute will be deprecated in the future; it could make sense to do that in another PR instead.
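For reference, one possible shape for such a deprecation shim, sketched here as an assumption rather than code from this PR:

import warnings

class OVBaseModelSketch:
    """Illustrative only: keeps request readable while steering users to compiled_model."""

    def __init__(self):
        self.compiled_model = None

    @property
    def request(self):
        warnings.warn(
            "`request` is deprecated and will be removed in a future version, "
            "please use `compiled_model` instead.",
            FutureWarning,
        )
        return self.compiled_model

    @request.setter
    def request(self, value):
        warnings.warn(
            "`request` is deprecated, please assign to `compiled_model` instead.",
            FutureWarning,
        )
        self.compiled_model = value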
if self.stateful:
    # Need a marker to differentiate the first generate iteration from the others in
    # the first condition at the function beginning above.
    # It should be something that is not None and it should be True when converted to Boolean.
    past_key_values = ((),)
    past_key_values = ((inputs["beam_idx"]), infer_request)
This is not related to past_key_values, so I don't think we should update past_key_values here; the resulting output will not be what is expected, for example:
output = model(**tokens)
pkv = output.past_key_values
That is a special case for stateful models. Such models do not use past_key_values because they preserve that information in the inference state instead. The field is used here to pass the beam_idx needed by the beam search algorithm and to pass the inference execution context between generation cycles.
My point is that it's not related to past_key_values, so we shouldn't update this variable with the beam_idx / inference execution context.
@slyalin can you add your comments here? The idea was to reuse this variable for stateful models because they don't use it at all. That was the only method we found that could be used to pass the beam_idx and execution context (which includes the state data) without changing the model API. The other alternative was to use the model.clone() method for each thread, which would also use a separate execution context without duplicating memory consumption #564. Would cloning be a better method to support concurrency in the execution? Is there some other option we are not aware of? I guess it is a somewhat unique situation with stateful models in OpenVINO, so it is probably not handled in the transformers lib.
> My point is that it's not related to past_key_values

Definitely it is related to past_key_values, even more than the old ((),) value. beam_idx together with infer_request are used to track past_key_values for a particular sequence. Literally, infer_request has a model state that consists of past_key_values tensors, and beam_idx allows indirect row reordering in that state in the case of beam search. This PR just makes that more explicit than it was before and moves these attributes from the model class instance to each sequence, which allows having multiple sequences for a single model class instance.
@echarlaix, do you have a better alternative to pass these values?
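To make that concrete, here is a tiny illustration (paths and names are placeholders, not project code) of why moving the pair into the output enables concurrency: each sequence gets its own InferRequest, hence its own KV-cache state, while the compiled model stays shared.

import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")  # placeholder model path and device
request_a = compiled.create_infer_request()  # holds the KV-cache state for sequence A
request_b = compiled.create_infer_request()  # independent state for sequence B
# past_key_values for A would then be (beam_idx_a, request_a), and likewise for B,
# so beam reordering and cached keys/values never interfere across sequences.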
@echarlaix if we create a new ModelOutput data class and it is returned by the forward method, how could it be passed back to the forward method in the next cycle?
can you try something like:

from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

import torch
from transformers.modeling_outputs import ModelOutput


@dataclass
class CausalLMOutputWithPast(ModelOutput):
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    beam_idx: Optional[int] = None
    # annotated so that the dataclass/ModelOutput machinery treats it as a field
    inference_session: Optional[Any] = None
and then override _update_model_kwargs_for_generation https://github.com/huggingface/transformers/blob/45c065109074d60c587d3e562f16531d02a422f6/src/transformers/generation/utils.py#L630 by adding something like:
def _update_model_kwargs_for_generation(
    self,
    outputs: ModelOutput,
    model_kwargs: Dict[str, Any],
    is_encoder_decoder: bool = False,
    standardize_cache_format: bool = False,
) -> Dict[str, Any]:
    model_kwargs = super()._update_model_kwargs_for_generation(
        outputs=outputs,
        model_kwargs=model_kwargs,
        is_encoder_decoder=is_encoder_decoder,
        standardize_cache_format=standardize_cache_format,
    )
    # propagate the extra field so it reaches forward() on the next generation step
    if "beam_idx" in outputs:
        model_kwargs["beam_idx"] = outputs["beam_idx"]
    return model_kwargs
(same for inference_session)
Let me know if you need help on this @dtrawins
@echarlaix @eaidova @slyalin Could you have a look at whether the latest version passes the context correctly now?
I'm no longer reusing past_key_values for stateful models to carry the generation context; there are additional fields, beam_idx and infer_request, in the forward output. Only 9 tests are left to fix, but they seem unrelated to concurrency; a rebase from main is probably needed.
Could someone also comment on whether beam_idx would be populated correctly? It is not defined now in _reorder_cache for stateful models.
What I confirmed is that beam_idx was not passed correctly: the same initial beam_idx was circulating through the whole pipeline, resulting in incorrect accuracy with beam search. Somehow it was not detected by the functional tests.
My proposal is to pass the beam_idx content from the _reorder_cache method inside past_key_values. I tested that it gives correct results and the code is, in my opinion, clean. The forward method returns empty past_key_values, as expected for stateful models. If someone would like to manage the pipeline for stateful models outside of transformers using just the forward method, it would still be possible: beam_idx should be passed inside past_key_values and the infer_request context via model kwargs. That is probably an unlikely use case, though. Would that be acceptable?
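A rough sketch of the consuming side of that proposal (the helper name and details are assumptions; the assumed layout is past_key_values == (beam_idx, infer_request) for stateful models):

import numpy as np

def _unpack_generation_context(self, inputs, past_key_values):
    if past_key_values:
        beam_idx, infer_request = past_key_values
    else:
        # first forward call: identity beam order and a fresh request for this sequence
        beam_idx = np.arange(inputs["input_ids"].shape[0], dtype=np.int32)
        infer_request = self.compiled_model.create_infer_request()
    inputs["beam_idx"] = beam_idx  # stateful OV decoders take beam_idx as a regular model input
    return inputs, infer_request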
@@ -86,6 +86,7 @@ def __init__(

        self.model = model
        self.request = None
        self.compiled_model = None
not sure why we need a new attribute here
It is needed to create a new infer_request in the context of the generate method for each concurrent thread. So far the model class had a request attribute that pointed to a static infer_request and could not be used to allocate new requests. Generally it is a bit of a confusing setup: the request attribute is set to the compiled_model object in the base class, but later it is overwritten to become the infer_request. Eventually the recommendation would be to switch to using a compiled_model attribute instead and create infer_requests dynamically. It was proposed to make this switch in a separate PR.
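As a sketch of that direction (method bodies shown outside their class for brevity, and attributes such as _device and ov_config are assumptions about the surrounding code):

import openvino as ov

core = ov.Core()

def compile(self):
    # compile once and keep only the shared CompiledModel on the model instance
    if self.compiled_model is None:
        self.compiled_model = core.compile_model(self.model, self._device, self.ov_config)

def _new_infer_request(self):
    # called from generate() so that every concurrent thread gets its own request
    self.compile()
    return self.compiled_model.create_infer_request()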
@@ -343,8 +343,10 @@ def normalized_config(self):
    def compile(self):
        if self.request is None:
            super().compile()
            self.compiled_model = self.request
It could make sense to also set self.compiled_model to None (along with self.request) when the model is statically reshaped or moved to another device: https://github.com/huggingface/optimum-intel/blob/2a397e37dd606cdeafce6b356f5e7f869630ea1b/optimum/intel/openvino/modeling_base.py#L442C9-L442C21
An option could be to add a clear_requests method as done for the seq2seq models.
Currently it should work anyway, as self.compiled_model will be correctly updated after calling .compile() (since self.request is set to None after each of these steps).
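A minimal version of such a clear_requests method, sketched by analogy with the seq2seq models rather than taken from this PR:

def clear_requests(self):
    # drop both handles so the next compile() rebuilds them after reshape()/to()
    self.request = None
    self.compiled_model = None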
if self.stateful:
    # Need a marker to differentiate the first generate iteration from the others in
    # the first condition at the function beginning above.
    # It should be something that is not None and it should be True when converted to Boolean.
    past_key_values = ((),)
    past_key_values = ((inputs["beam_idx"]), infer_request)
> @echarlaix, do you have a better alternative to pass these values?

Why not introduce a class inheriting from ModelOutput, like CausalLMOutputWithPast https://github.com/huggingface/transformers/blob/536ea2aca234fb48c5c69769431d643b0d93b233/src/transformers/modeling_outputs.py#L678, with dedicated beam_idx / inference_request arguments?
Looks great, thanks for iterating on this @dtrawins !
@@ -661,8 +704,7 @@ def _reorder_cache(
        batch_size = beam_idx.shape[0]
        indices = np.array(range(batch_size * self.config.num_attention_heads))
        indices = indices.reshape([batch_size, self.config.num_attention_heads])
        self.next_beam_idx = np.take(indices, beam_idx, 0).flatten()
        return past_key_values
        return ((np.take(indices, beam_idx, 0).flatten()), past_key_values[1])
Shouldn't it be:
return past_key_values
instead of:
return ((np.take(indices, beam_idx, 0).flatten()), past_key_values[1]) ?
@@ -322,6 +340,7 @@ def normalized_config(self):
    def compile(self):
        if self.request is None:
            super().compile()
            self.compiled_model = self.request
            self.request = self.request.create_infer_request()
Why not remove this line:
self.request = self.request.create_infer_request()
and use self.request instead of self.compiled_model? (self.request doesn't seem to be used anywhere)
tests/openvino/test_modeling.py (outdated)
    @parameterized.expand(SUPPORTED_ARCHITECTURES)
    def test_compare_to_transformers_multithreading(self, model_arch):
        model_id = MODEL_NAMES[model_arch]
        not_stateful = ["gpt_bigcode"]
        if is_openvino_version("<", "2024.0"):
            not_stateful.append("mixtral")

        if is_openvino_version("<", "2024.1"):
            not_stateful.extend(["llama", "gemma"])

        if "gptq" in model_arch:
            self.skipTest("GPTQ model loading unsupported with AutoModelForCausalLM")
        if model_arch in ["chatglm", "baichuan2"]:
            self.skipTest("Models " + model_id + " doesn't support concurrent execution in AutoModelForCausalLM")

        set_seed(SEED)
        model_kwargs = {}
        if model_arch in self.REMOTE_CODE_MODELS:
            model_kwargs = {"trust_remote_code": True}

        ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True, ov_config=F32_CONFIG, **model_kwargs)
        self.assertIsInstance(ov_model.config, PretrainedConfig)
        self.assertTrue(ov_model.use_cache)
        self.assertEqual(
            ov_model.stateful, self.IS_SUPPORT_STATEFUL and ov_model.config.model_type not in not_stateful
        )

        transformers_model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=model_arch in self.REMOTE_CODE_MODELS)
        if model_arch == "qwen":
            transformers_model.to(torch.float32)
        inputs_list = ["This is a cat", "This is a dog", "Yet another test"]
        tokens_list = [
            tokenizer(inputs, return_tensors="pt", return_token_type_ids=False if model_arch == "llama" else None)
            for inputs in inputs_list
        ]

        def run_ov_model(tokens, transformers_model, ov_model):
            # global ov_model, transformers_model
            # position_ids = None
            # if model_arch.replace("_", "-") in MODEL_TYPES_REQUIRING_POSITION_IDS:
            #     input_shape = tokens["input_ids"].shape
            #     position_ids = (
            #         torch.arange(0, input_shape[-1], dtype=torch.long).unsqueeze(0).view(-1, input_shape[-1])
            #     )
            set_seed(SEED)
            ov_outputs = ov_model(**tokens)

            self.assertTrue("logits" in ov_outputs)
            self.assertIsInstance(ov_outputs.logits, torch.Tensor)
            # self.assertTrue("past_key_values" in ov_outputs)
            # self.assertIsInstance(ov_outputs.past_key_values, tuple)
            # if self.IS_SUPPORT_STATEFUL and model_arch != "gpt_bigcode":
            #     self.assertTrue(len(ov_outputs.past_key_values) == 1 and len(ov_outputs.past_key_values[0]) == 0)
            with torch.no_grad():
                transformers_outputs = transformers_model(**tokens)
            # Compare tensor outputs
            self.assertTrue(torch.allclose(ov_outputs.logits, transformers_outputs.logits, atol=1e-4))
            # self.assertTrue(False)

        run_on_multiple_threads(run_ov_model, tokens_list, (transformers_model, ov_model))

        del transformers_model
        del ov_model
        gc.collect()
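The test relies on a run_on_multiple_threads helper that is not shown in this hunk; a minimal sketch of what it is assumed to do:

from concurrent.futures import ThreadPoolExecutor

def run_on_multiple_threads(fn, per_thread_inputs, shared_args):
    # run fn(inputs, *shared_args) once per input, each call on its own thread
    with ThreadPoolExecutor(max_workers=len(per_thread_inputs)) as pool:
        futures = [pool.submit(fn, inputs, *shared_args) for inputs in per_thread_inputs]
        for future in futures:
            future.result()  # re-raise assertion errors from the worker threads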
The time taken to run all tests is already non-negligible, so I think we should merge it with test_compare_to_transformers (to not duplicate steps like the export).
@@ -608,6 +674,42 @@ def test_pipeline(self, model_arch):
        del model
        gc.collect()

    @parameterized.expand(SUPPORTED_ARCHITECTURES)
    def test_pipeline_multithreading(self, model_arch):
same comment, can be merged with test_pipeline
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
    infer_request: Optional[openvino.runtime.InferRequest] = None
Could we rename it to something like request or inference_request?
The current name is aligned with the OpenVINO API name, so for me infer_request sounds better.
I think that would be clearer for users who are not familiar with the OpenVINO ecosystem; also, we don't use infer_request anywhere in optimum-intel, so I was thinking about something a bit more explicit.
I think we're close to merging, just waiting for a couple of points above to be addressed. Let me know if you need any help from my side @dtrawins (fixing conflicts / applying suggested changes).
Support for multi-threading in execution?