Optimization: cache past key-values #9
Comments
Caching and the models' memory footprint are key to latency. Found this in the coremltools docs:

"Impact on Latency and Compute Unit Considerations: For a model that is primarily running on the Neural Engine, sparsity typically helps in improving latency. Firstly, it reduces the amount of weight memory to be loaded at inference time, which is beneficial for networks that are weight-memory bound (note that starting from iOS 17/macOS 14, for ops running on the Neural Engine, sparse weights are decompressed at prediction time). In addition, when a relatively long string of consecutive 0s is encountered, the Neural Engine may also be able to skip computations, thereby reducing the amount of computation as well. This means choosing higher levels of sparsity (e.g. 75% or higher) can lead to more latency gains than lower levels. It also means that choosing a block-structured kind of sparsity with larger block sizes may be more beneficial. However, note that it is also relatively harder to preserve accuracy with stricter constraints like larger block sizes and higher levels of sparsity. A model that has a lot of linear ops and uses a specific kind of sparsity, that is, n:m such that m is a factor of 16 (such as 3:4, 7:8, 14:16, and so on), can benefit from the CPU compute unit performance in newer hardware generations, thereby resulting in faster inference."

The ml-stable-diffusion GitHub repo, if I am not mistaken, uses pruning and palettization of the model for better inference on Apple devices. I believe for Llama v2 we would use post-training pruning and post-training palettization.
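As a hedged sketch only (not settings validated for Llama v2), this is roughly what post-training pruning and post-training palettization could look like with `coremltools.optimize.coreml` (coremltools 7+). The model path, 75% sparsity level, and 4-bit palette width are illustrative placeholders:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    prune_weights,
)

# Placeholder path to an already-converted Core ML model.
mlmodel = ct.models.MLModel("Llama2.mlpackage")

# Post-training magnitude pruning at 75% sparsity (the level the docs
# call out as more likely to give latency gains on the Neural Engine).
prune_cfg = OptimizationConfig(
    global_config=OpMagnitudePrunerConfig(target_sparsity=0.75)
)
sparse_model = prune_weights(mlmodel, config=prune_cfg)
sparse_model.save("Llama2-sparse.mlpackage")

# Post-training 4-bit k-means palettization as a separate pass on the
# original weights.
pal_cfg = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
palettized_model = palettize_weights(mlmodel, config=pal_cfg)
palettized_model.save("Llama2-palettized.mlpackage")
```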
This issue with caching past key-values also overlaps with PyTorch's DataLoader memory issue: https://github.com/pytorch/pytorch/issues/13246. It happens because of how CPython, Python multiprocessing, and the PyTorch DataLoader interact. Here is a podcast discussing the issue, how it works, and the multiple ways to fix it: https://pytorch-dev-podcast.simplecast.com/episodes/dataloader-with-multiple-workers-leaks-memory. While trying to convert a Llama model (or any other big model), I believe numerous Python objects are being written to and read. I have tried Google Cloud, Google Colab, buying more RAM, and different LLM sizes; none of it has worked. The past key-values stored as a Python list, which PyTorch is trying to load from, cause multiple reads/writes, and possibly more depending on how coremltools reads each layer from PyTorch to convert it. I am going to attempt making a couple of the Python lists/dicts into numpy arrays in a few places and see how that goes; hopefully we will not have to convert all lists/dicts in coremltools. A minimal sketch of the idea is below.
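This is just a toy illustration of the list-to-numpy change, following the workaround discussed in pytorch/pytorch#13246: forked DataLoader workers bump the refcount of every Python object they touch, turning copy-on-write pages into copies, whereas one numpy array is a single object. The `CachedKVDataset` name and shapes are made up:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CachedKVDataset(Dataset):  # hypothetical dataset name
    def __init__(self, values):
        # A long Python list would be copied page-by-page in every worker;
        # a single numpy array is one refcounted object shared across forks.
        self.values = np.asarray(list(values), dtype=np.float32)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        return torch.from_numpy(self.values[idx : idx + 1])


loader = DataLoader(CachedKVDataset(range(10_000)), batch_size=32, num_workers=4)
```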
This may help us open the door to implementing similar code in swift-transformers. Even this modeling_llama.py implements the split attention layers, redefined states, the four NHWC channels, the KV cache, and the repeat_interleave op. I wonder whether we could take some of this logic and apply it to OpenELM in float16 (because of the repeat_interleave op in modeling_llama.py). Check out this issue/comment for more information: Convert OpenELM to float16 Core ML.
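For reference, a hedged sketch of the two ops mentioned above as they typically appear in a Llama-style attention block: appending the new keys/values to the cached ones, and `repeat_interleave` to expand grouped KV heads. The function names and shapes are illustrative, not taken verbatim from modeling_llama.py:

```python
import torch


def update_kv_cache(past_k, past_v, new_k, new_v):
    # past_*: (batch, n_kv_heads, past_len, head_dim); new_*: (batch, n_kv_heads, 1, head_dim)
    k = torch.cat([past_k, new_k], dim=2) if past_k is not None else new_k
    v = torch.cat([past_v, new_v], dim=2) if past_v is not None else new_v
    return k, v


def expand_kv_heads(k, v, n_rep):
    # Grouped-query attention: each KV head is shared by n_rep query heads.
    return k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)


# Example: 1 batch, 2 KV heads serving 8 query heads (n_rep = 4).
past_k, past_v = torch.zeros(1, 2, 5, 64), torch.zeros(1, 2, 5, 64)
new_k, new_v = torch.zeros(1, 2, 1, 64), torch.zeros(1, 2, 1, 64)
k, v = update_kv_cache(past_k, past_v, new_k, new_v)
k, v = expand_kv_heads(k, v, n_rep=4)
print(k.shape)  # torch.Size([1, 8, 6, 64])
```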
This needs to go hand in hand with changes to the conversion process in exporters and transformers-to-coreml.
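As a rough illustration of what those conversion changes might involve (not what exporters or transformers-to-coreml actually do today), here is a toy sketch that exposes a past-key cache as a flexible-shape input to `ct.convert`. The tiny `ToyCacheModel`, tensor names, shapes, and bounds are all assumptions made for the example:

```python
import coremltools as ct
import numpy as np
import torch


class ToyCacheModel(torch.nn.Module):
    """Stand-in for a wrapped attention block: appends new K to the past cache."""

    def forward(self, new_k, past_k):
        return torch.cat([past_k, new_k], dim=2)


example = (torch.zeros(1, 8, 1, 64), torch.zeros(1, 8, 4, 64))
traced = torch.jit.trace(ToyCacheModel().eval(), example)

# Let the cached sequence length vary at prediction time.
past_len = ct.RangeDim(lower_bound=1, upper_bound=2048, default=4)

mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="new_k", shape=(1, 8, 1, 64), dtype=np.float16),
        ct.TensorType(name="past_k", shape=(1, 8, past_len, 64), dtype=np.float16),
    ],
    outputs=[ct.TensorType(name="present_k", dtype=np.float16)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("ToyCacheModel.mlpackage")
```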