Move lm model to async infer #1425

Open · sbalandi wants to merge 2 commits into master

Conversation

sbalandi (Contributor)

No description provided.

@github-actions bot added the `category: LLM` label (LLM pipeline (stateful, static)) on Dec 23, 2024
sbalandi (Contributor, Author) commented on Dec 23, 2024

glm4-nano-chat-v020-int4 & llm bench (streaming includes the tokenizer.decode part and putting the text into the queue; printing is not included):

|                   | before, streaming 1 | after, streaming 1 | before, no streaming | after, no streaming |
| ----------------- | ------------------- | ------------------ | -------------------- | ------------------- |
| Generation Time   | 12570 ms            | 12550 ms           | 12360 ms             | 12290 ms            |
| 2nd token latency | 62.54 ms/token      | 62.21 ms/token     | 60.75 ms/token       | 60.59 ms/token      |

sbalandi (Contributor, Author) commented on Dec 23, 2024

It looks like the timing improvement happens just because async/wait is added. But infer takes ~59100 ms while streaming takes ~1500 ms, so it seems the infer time should cover the streaming time.
It could also be useful to run streaming and sampling in parallel; in that case the average 2nd token latency could be 61.76 ms/token and the generation time 12490 ms.
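
To make the comparison concrete, here is a minimal sketch of the async/wait pattern being measured. It is an illustration under assumptions, not the exact PR code: `m_llm` stands for the LM `ov::InferRequest`, the "logits" output name is the one used by the GenAI LM models, and the streaming/sampling callbacks stand in for the real logic of `get_lm_encoded_results`.

```cpp
#include <openvino/openvino.hpp>

#include <functional>

// Simplified sketch of one generation loop with async inference; the callbacks
// stand in for the Sampler / streamer logic in get_lm_encoded_results, and
// input preparation plus beam bookkeeping are omitted.
void generate_loop_sketch(ov::InferRequest& m_llm,
                          const std::function<void()>& stream_generated_tokens,
                          const std::function<bool(const ov::Tensor&)>& sample_next_tokens) {
    bool finished = false;
    while (!finished) {
        m_llm.start_async();          // launch inference for the next step
        stream_generated_tokens();    // decode/report already-sampled tokens, overlapped with the LLM
        m_llm.wait();                 // block until the new logits are ready
        ov::Tensor logits = m_llm.get_tensor("logits");
        finished = sample_next_tokens(logits);  // sample and update inputs; returns true when done
    }
}
```

With the plain infer() variant the same streaming work runs strictly after the inference call returns, which is exactly the overlap the timings above are probing.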

```diff
@@ -130,7 +131,7 @@ std::pair<EncodedResults, std::optional<int64_t>> get_lm_encoded_results(
beam_offets.insert({sequence_groups.at(i)->get_request_id(), i});

SamplerOutput sampler_output = sampler.sample(sequence_groups, logits);
stream_generated_tokens();
```
A Contributor commented on this diff:

Probably, if we start to stream tokens here in a dedicated thread, it can be faster, as streaming would be overlapped with:

  • the embedding model
  • the LLM model for the next token
  • sampling for the next token

while currently streaming is overlapped with the LLM only.

We could re-use SynchronizedQueue from the GenAI sources: the streaming callback would push to the queue, and the dedicated streaming thread would read from this queue using the pull method.
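
A minimal sketch of this producer/consumer split is below. The SyncQueue class and the main() stub are illustrative stand-ins assuming a push()/blocking pull() interface similar to the GenAI SynchronizedQueue; the real class and generation loop may differ in details.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Stand-in for SynchronizedQueue: thread-safe push() and blocking pull().
template <typename T>
class SyncQueue {
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_cv;

public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(value));
        }
        m_cv.notify_one();
    }

    T pull() {  // blocks until an element is available
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop();
        return value;
    }
};

int main() {
    SyncQueue<std::optional<std::string>> queue;  // std::nullopt marks end of generation

    // Dedicated streaming thread: pulls decoded chunks and prints them, so the output
    // handling is overlapped with embedding, LLM inference and sampling on the main thread.
    std::thread streamer_thread([&queue] {
        while (auto chunk = queue.pull())
            std::cout << *chunk << std::flush;
    });

    // Stand-in for the generation loop: the streaming callback only pushes to the queue.
    for (const char* token_text : {"Hello", ", ", "world", "!\n"})
        queue.push(std::string(token_text));
    queue.push(std::nullopt);  // signal completion

    streamer_thread.join();
    return 0;
}
```

With this split, the generation loop only pays the cost of a queue push per streamed chunk, while the slower output handling runs on the dedicated streaming thread.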

@ilya-lavrenov added this to the 2025.0 milestone on Dec 24, 2024
@ilya-lavrenov self-assigned this on Dec 24, 2024
sbalandi (Contributor, Author) commented on Dec 24, 2024

glm4-nano-chat-v020-int4 & C++ sample (streaming includes tokenizer.decode (TextCallbackStreamer) and printing (the callback function)):

|                   | infer(), streaming 1 | async/wait, streaming 1 | infer(), no streaming | async/wait, no streaming |
| ----------------- | -------------------- | ----------------------- | --------------------- | ------------------------ |
| Generation Time   | 20061 ms             | 20046.5 ms              | 19946.6 ms            | 19913.8 ms               |
| 2nd token latency | 59.2754 ms/token     | 59.1334 ms/token        | 58.1364 ms/token      | 56.3683 ms/token         |
| tokenizer.decode  | 119.318 ms           | 129.953 ms              |                       |                          |

The streaming time breaks down as follows:

  • tokenizer.decode: ~120-130 ms cumulative (the time increases with async/wait)
  • printing: ~1.6 ms cumulative (it does not take noticeable time)
  • other actions (getting results from the GenerationHandle): ~5 ms cumulative
