Move lm model to async infer #1425

Open · sbalandi wants to merge 2 commits into master

Conversation

sbalandi (Contributor)

No description provided.

@github-actions bot added the `category: LLM` label (LLM pipeline (stateful, static)) on Dec 23, 2024
sbalandi (Contributor, Author) commented on Dec 23, 2024

glm4-nano-chat-v020-int4 & llm bench (streaming includes the tokenizer.decode part and putting the text into the queue; printing is not included):

|                   | before, streaming 1 | after, streaming 1 | before, no streaming | after, no streaming |
| ----------------- | ------------------- | ------------------ | -------------------- | ------------------- |
| Generation Time   | 12570 ms            | 12550 ms           | 12360 ms             | 12290 ms            |
| 2nd token latency | 62.54 ms/token      | 62.21 ms/token     | 60.75 ms/token       | 60.59 ms/token      |

sbalandi (Contributor, Author) commented on Dec 23, 2024

It looks like the timing improvement happens just because async/wait is added. But infer takes ~59100 ms while streaming takes ~1500 ms, so it seems the infer time should cover the streaming time.
It could also be useful to run streaming and sampling in parallel; in that case the average 2nd token latency could be 61.76 ms/token and the generation time 12490 ms.
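
To make the comparison concrete, here is a minimal sketch of the async/wait pattern being measured. It is an illustration under assumptions, not the exact PR code: `m_llm` stands for the LM `ov::InferRequest`, the "logits" output name is the one used by the GenAI LM models, and the streaming/sampling callbacks stand in for the real logic of `get_lm_encoded_results`.

```cpp
#include <openvino/openvino.hpp>

#include <functional>

// Simplified sketch of one generation loop with async inference; the callbacks
// stand in for the Sampler / streamer logic in get_lm_encoded_results, and
// input preparation plus beam bookkeeping are omitted.
void generate_loop_sketch(ov::InferRequest& m_llm,
                          const std::function<void()>& stream_generated_tokens,
                          const std::function<bool(const ov::Tensor&)>& sample_next_tokens) {
    bool finished = false;
    while (!finished) {
        m_llm.start_async();          // launch inference for the next step
        stream_generated_tokens();    // decode/report already-sampled tokens, overlapped with the LLM
        m_llm.wait();                 // block until the new logits are ready
        ov::Tensor logits = m_llm.get_tensor("logits");
        finished = sample_next_tokens(logits);  // sample and update inputs; returns true when done
    }
}
```

With the plain infer() variant the same streaming work runs strictly after the inference call returns, which is exactly the overlap the timings above are probing.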

```diff
@@ -130,7 +131,7 @@ std::pair<EncodedResults, std::optional<int64_t>> get_lm_encoded_results(
beam_offets.insert({sequence_groups.at(i)->get_request_id(), i});

SamplerOutput sampler_output = sampler.sample(sequence_groups, logits);
stream_generated_tokens();
```
A Contributor commented on this diff:

Probably, if we start to stream tokens here in a dedicated thread, it can be faster, as streaming would be overlapped with:

  • the embedding model
  • the LLM model for the next token
  • sampling for the next token

while currently streaming is overlapped with the LLM only.

We could re-use SynchronizedQueue from the GenAI sources: the streaming callback would push to the queue, and the dedicated streaming thread would read from this queue using the pull method.
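
A minimal sketch of this producer/consumer split is below. The SyncQueue class and the main() stub are illustrative stand-ins assuming a push()/blocking pull() interface similar to the GenAI SynchronizedQueue; the real class and generation loop may differ in details.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Stand-in for SynchronizedQueue: thread-safe push() and blocking pull().
template <typename T>
class SyncQueue {
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_cv;

public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(value));
        }
        m_cv.notify_one();
    }

    T pull() {  // blocks until an element is available
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop();
        return value;
    }
};

int main() {
    SyncQueue<std::optional<std::string>> queue;  // std::nullopt marks end of generation

    // Dedicated streaming thread: pulls decoded chunks and prints them, so the output
    // handling is overlapped with embedding, LLM inference and sampling on the main thread.
    std::thread streamer_thread([&queue] {
        while (auto chunk = queue.pull())
            std::cout << *chunk << std::flush;
    });

    // Stand-in for the generation loop: the streaming callback only pushes to the queue.
    for (const char* token_text : {"Hello", ", ", "world", "!\n"})
        queue.push(std::string(token_text));
    queue.push(std::nullopt);  // signal completion

    streamer_thread.join();
    return 0;
}
```

With this split, the generation loop only pays the cost of a queue push per streamed chunk, while the slower output handling runs on the dedicated streaming thread.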

@ilya-lavrenov added this to the 2025.0 milestone on Dec 24, 2024
@ilya-lavrenov self-assigned this on Dec 24, 2024
sbalandi (Contributor, Author) commented on Dec 24, 2024

glm4-nano-chat-v020-int4 & C++ sample (streaming includes tokenizer.decode (TextCallbackStreamer) and printing (the callback function)):

|                   | infer(), streaming 1 | async/wait, streaming 1 | infer(), no streaming | async/wait, no streaming |
| ----------------- | -------------------- | ----------------------- | --------------------- | ------------------------ |
| Generation Time   | 20061 ms             | 20046.5 ms              | 19946.6 ms            | 19913.8 ms               |
| 2nd token latency | 59.2754 ms/token     | 59.1334 ms/token        | 58.1364 ms/token      | 56.3683 ms/token         |
| tokenizer.decode  | 119.318 ms           | 129.953 ms              |                       |                          |

The streaming time breaks down as follows:

  • tokenizer.decode: ~120-130 ms cumulative (the time increases with async/wait)
  • printing: ~1.6 ms cumulative (it does not take noticeable time)
  • other actions (getting results from the GenerationHandle): ~5 ms cumulative
