
Optimize GPT2 inference: Remove redundant autoregressive_latent_graph and enable streaming output #18

Open
candlewill opened this issue Jul 19, 2024 · 3 comments

Comments


candlewill commented Jul 19, 2024

Thank you for this excellent implementation. I'd like to suggest an optimization that could significantly speed up inference and enable streaming output.

Currently, there are two GPT2 graphs:

  1. autoregressive: Generates speech codes (originally for CLVP to select the best result)
  2. autoregressive_latent_graph: Generates latents based on the best result

Since CLVP has been removed, we can streamline this to a single GPT2 graph that directly generates latents. I've implemented this with minimal changes:

  1. In autoregressive_graph, add the following immediately after the line cur = ggml_add(ctx0, cur, model.language_model_head_layer_norm_bias);
// Output latents
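// (Assumed layout: cur is [1024, test_dimension, batch_size], so the view below
//  selects the 1024-dim hidden state at the final sequence position for each batch entry.)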
ggml_tensor *final_output_2 = ggml_cont(
    ctx0, ggml_view_4d(ctx0, cur, 1024, 1, batch_size, 1,
                       cur->nb[1], cur->nb[2], cur->nb[3],
                       (test_dimension - 1) * sizeof(float) * 1024));

ggml_set_name(final_output_2, "output_latents");
ggml_set_output(final_output_2);
ggml_build_forward_expand(gf, final_output_2);
  2. In the main inference loop, extract the latent:
extract_tensor_to_vector(ggml_graph_get_tensor(gf, "output_latents"), latent);
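
For context, extract_tensor_to_vector is assumed to copy a computed tensor's data into a host-side std::vector<float>. A minimal sketch of such a helper using the ggml backend API (the helper that already exists in this repository may differ):

#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical sketch, not the repository's actual helper: copy all elements
// of a computed F32 tensor from the backend into a host-side float vector.
static void extract_tensor_to_vector(struct ggml_tensor * tensor, std::vector<float> & out) {
    out.resize(ggml_nelements(tensor));                                   // one float per element
    ggml_backend_tensor_get(tensor, out.data(), 0, ggml_nbytes(tensor));  // backend -> host copy
}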

Benefits:

  1. Faster inference by eliminating redundant GPT2 runs
  2. Enables potential streaming output of latents (see the sketch at the end of this comment)
  3. Simplifies code structure

This optimization could significantly benefit users looking to speed up inference or implement streaming latent generation.
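
To illustrate the streaming point (benefit 2): once the per-step latent is available from the same graph, it could be handed to a consumer as soon as it is extracted, instead of waiting for the whole sequence. A hypothetical sketch; the function name, callback type, and loop structure below are placeholders, not code from this repository:

#include <functional>
#include <vector>

using latent_callback = std::function<void(const std::vector<float> &)>;

// Placeholder loop: the real per-step graph build/compute and stop-token
// handling live in the main inference loop.
void generate_latents_streaming(int n_steps, latent_callback on_latent) {
    std::vector<float> latent(1024);  // one 1024-dim latent per autoregressive step
    for (int step = 0; step < n_steps; ++step) {
        // build + compute the autoregressive graph for this step, then:
        // extract_tensor_to_vector(ggml_graph_get_tensor(gf, "output_latents"), latent);
        on_latent(latent);            // downstream stages (diffusion, vocoder) can start now
    }
}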

@balisujohn (Owner)

@candlewill thanks for looking into this; I think this would be a good change to make, though I wasn't able to get it to work by naively applying it. If you have a working branch with this change applied, would you be willing to make a pull request?


leso-kn commented Sep 29, 2024

@candlewill Hey :) Would you mind uploading the required changes as a patch file, or sharing a fork with your optimization applied? I'd be curious to try it out, but I'm having trouble applying the second part of your patch in the main inference loop.

@fwsGonzo

I'm also interested in this. Improving overall performance on a single-threaded CPU (with AVX-512) would be awesome, if possible. It would also help to identify which parts of the work happen while loading (before the message is known) and which parts can only run once the message is known.
