What's Changed
🎉 2025!
⚡ Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on CPU by default, and each layer is moved to `QuantizeConfig.device` during quantization to minimize VRAM usage.
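A minimal sketch of the new field in use, assuming the `GPTQModel.load` / `quantize` / `save` entry points from the project README; the model id and one-line calibration set are placeholders:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# `device` is new in this release: "auto" (the default) picks an
# accelerator automatically, while an explicit value such as "cuda:0"
# pins quantization to that device. The model itself is loaded on CPU,
# and layers are moved to `device` one at a time to keep VRAM usage low.
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")

model = GPTQModel.load("meta-llama/Llama-3.2-1B", quant_config)  # placeholder model id
model.quantize(["gptqmodel is an llm model quantization toolkit."])  # placeholder calibration data
model.save("Llama-3.2-1B-gptq-4bit")
```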
💫 Improve `QuantLinear` selection from `optimum`.
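This selection happens transparently when a GPTQ checkpoint is loaded through `transformers`/`optimum`, which call back into `hf_select_quant_linear` to pick a kernel; a hedged example of exercising that path (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a GPTQ checkpoint routes through optimum, which asks
# gptqmodel (hf_select_quant_linear) for the best QuantLinear
# kernel for the current device/backend combination.
model_id = "ModelCloud/Llama-3.2-1B-gptqmodel-4bit"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("gptqmodel is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```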
🐛 Fix `attn_implementation_autoset` compat in latest transformers.
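The underlying issue is that `_attn_implementation_autoset` only exists on newer transformers configs (see #982 below), so it has to be read defensively; a minimal sketch of the kind of guard involved, not the library's exact code:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")

# Older transformers releases have no _attn_implementation_autoset
# attribute, so use getattr with a default instead of touching it directly.
if not getattr(config, "_attn_implementation_autoset", False):
    # attn implementation was not auto-selected; request one explicitly
    config._attn_implementation = "eager"
```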
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not an enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attention being auto-loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix older transformers lacking _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporarily by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1