Optimization: cache past key-values #9
Comments
Caching and the models' memory footprint are key to latency. Found this in the coremltools docs:

"Impact on Latency and Compute Unit Considerations: For a model that is primarily running on the Neural Engine, sparsity typically helps in improving latency. Firstly, it reduces the amount of weight memory to be loaded at inference time, which is beneficial for networks that are weight-memory bound (note that starting from iOS 17/macOS 14, for ops running on the Neural Engine, sparse weights are decompressed at prediction time). In addition, when a relatively long string of consecutive 0s is encountered, the Neural Engine may also be able to skip computations, thereby reducing the amount of computation as well. This means choosing higher levels of sparsity (e.g. 75% or higher) can lead to more latency gains than lower levels. It also means that choosing a block-structured kind of sparsity with larger block sizes may be more beneficial. However, note that it is also relatively harder to preserve accuracy with stricter constraints like larger block sizes and higher levels of sparsity. A model that has a lot of linear ops and uses a specific kind of sparsity, that is, n:m such that m is a factor of 16 (such as 3:4, 7:8, 14:16, and so on), can benefit from the CPU compute unit performance in newer hardware generations, thereby resulting in faster inference."

The ml-stable-diffusion GitHub repo, if I am not mistaken, uses pruning and palettization of the model for better inference on Apple devices. I believe for Llama v2 we would use post-training pruning and post-training palettization.
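As a hedged sketch only (not settings validated for Llama v2), this is roughly what post-training pruning and post-training palettization could look like with `coremltools.optimize.coreml` (coremltools 7+). The model path, 75% sparsity level, and 4-bit palette width are illustrative placeholders:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    prune_weights,
)

# Placeholder path to an already-converted Core ML model.
mlmodel = ct.models.MLModel("Llama2.mlpackage")

# Post-training magnitude pruning at 75% sparsity (the level the docs
# call out as more likely to give latency gains on the Neural Engine).
prune_cfg = OptimizationConfig(
    global_config=OpMagnitudePrunerConfig(target_sparsity=0.75)
)
sparse_model = prune_weights(mlmodel, config=prune_cfg)
sparse_model.save("Llama2-sparse.mlpackage")

# Post-training 4-bit k-means palettization as a separate pass on the
# original weights.
pal_cfg = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
palettized_model = palettize_weights(mlmodel, config=pal_cfg)
palettized_model.save("Llama2-palettized.mlpackage")
```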
This issue with caching past key-values also overlaps with PyTorch's DataLoader memory issue: https://github.com/pytorch/pytorch/issues/13246. It happens because of how CPython, Python multiprocessing, and the PyTorch DataLoader interact. Here is a podcast discussing the issue, how it works, and the multiple ways to fix it: https://pytorch-dev-podcast.simplecast.com/episodes/dataloader-with-multiple-workers-leaks-memory. While trying to convert a Llama model (or any other big model), I believe numerous Python objects are being written to and read. I have tried Google Cloud, Google Colab, buying more RAM, and different LLM sizes; none of it has worked. The past key-values stored as a Python list, which PyTorch is trying to load from, cause multiple reads/writes, and possibly more depending on how coremltools reads each layer from PyTorch to convert it. I am going to attempt making a couple of the Python lists/dicts into numpy arrays in a few places and see how that goes; hopefully we will not have to convert all lists/dicts in coremltools. A minimal sketch of the idea is below.
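This is just a toy illustration of the list-to-numpy change, following the workaround discussed in pytorch/pytorch#13246: forked DataLoader workers bump the refcount of every Python object they touch, turning copy-on-write pages into copies, whereas one numpy array is a single object. The `CachedKVDataset` name and shapes are made up:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CachedKVDataset(Dataset):  # hypothetical dataset name
    def __init__(self, values):
        # A long Python list would be copied page-by-page in every worker;
        # a single numpy array is one refcounted object shared across forks.
        self.values = np.asarray(list(values), dtype=np.float32)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        return torch.from_numpy(self.values[idx : idx + 1])


loader = DataLoader(CachedKVDataset(range(10_000)), batch_size=32, num_workers=4)
```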
This may help us open the door to implementing similar code in swift-transformers. Even this modeling_llama.py implements the split attention layers, redefined states, the four NHWC channels, the KV cache, and the repeat_interleave op. I wonder whether we could take some of this logic and apply it to OpenELM in float16 (because of the repeat_interleave op in modeling_llama.py). Check out this issue/comment for more information: Convert OpenELM to float16 Core ML.
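For reference, a hedged sketch of the two ops mentioned above as they typically appear in a Llama-style attention block: appending the new keys/values to the cached ones, and `repeat_interleave` to expand grouped KV heads. The function names and shapes are illustrative, not taken verbatim from modeling_llama.py:

```python
import torch


def update_kv_cache(past_k, past_v, new_k, new_v):
    # past_*: (batch, n_kv_heads, past_len, head_dim); new_*: (batch, n_kv_heads, 1, head_dim)
    k = torch.cat([past_k, new_k], dim=2) if past_k is not None else new_k
    v = torch.cat([past_v, new_v], dim=2) if past_v is not None else new_v
    return k, v


def expand_kv_heads(k, v, n_rep):
    # Grouped-query attention: each KV head is shared by n_rep query heads.
    return k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)


# Example: 1 batch, 2 KV heads serving 8 query heads (n_rep = 4).
past_k, past_v = torch.zeros(1, 2, 5, 64), torch.zeros(1, 2, 5, 64)
new_k, new_v = torch.zeros(1, 2, 1, 64), torch.zeros(1, 2, 1, 64)
k, v = update_kv_cache(past_k, past_v, new_k, new_v)
k, v = expand_kv_heads(k, v, n_rep=4)
print(k.shape)  # torch.Size([1, 8, 6, 64])
```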
This needs to go hand in hand with changes to the conversion process in exporters and transformers-to-coreml.
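As a rough illustration of what those conversion changes might involve (not what exporters or transformers-to-coreml actually do today), here is a toy sketch that exposes a past-key cache as a flexible-shape input to `ct.convert`. The tiny `ToyCacheModel`, tensor names, shapes, and bounds are all assumptions made for the example:

```python
import coremltools as ct
import numpy as np
import torch


class ToyCacheModel(torch.nn.Module):
    """Stand-in for a wrapped attention block: appends new K to the past cache."""

    def forward(self, new_k, past_k):
        return torch.cat([past_k, new_k], dim=2)


example = (torch.zeros(1, 8, 1, 64), torch.zeros(1, 8, 4, 64))
traced = torch.jit.trace(ToyCacheModel().eval(), example)

# Let the cached sequence length vary at prediction time.
past_len = ct.RangeDim(lower_bound=1, upper_bound=2048, default=4)

mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="new_k", shape=(1, 8, 1, 64), dtype=np.float16),
        ct.TensorType(name="past_k", shape=(1, 8, past_len, 64), dtype=np.float16),
    ],
    outputs=[ct.TensorType(name="present_k", dtype=np.float16)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("ToyCacheModel.mlpackage")
```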