v1.13.0: Stable Diffusion 3, Sentence Transformers, SAM, DETR, Kubernetes example
SynapseAI 1.17
- Upgrade SynapseAI version to 1.17.0 #1217
Transformers 4.43
Diffusers 0.29
Stable Diffusion 3
Training with Sentence Transformers
- Enable Sentence Transformer Trainer with Gaudi #1111 @ZhengHongming888
Model optimizations
- Fix starcoder2 accuracy issue and optimize performance with fused rope #1095 @mandy-li
- Enable FusedRoPE using float32 for gpt-neox model #1104 @yeonsily
- Mamba initial enablement #1122 @libinta
- Adding fused qkv support along with config #1102 @bhargaveede
- Enhance Qwen2 with fast softmax, bf16 RoPE, and cache optimization #1087 @Zhiwei35
- Enable fp8 inference for Llava-Next and add Fused_SDPA #1120 @tthakkal
- Support bucket_internal for MPT #1137 @pk1d3v
- Enable Flash Attention (Fused SDPA) for Starcoder #1114 @abhilash1910
- gpt_bigcode: added FusedSDPA kernel #1138 @mgonchar
- Enable torch.compile for Granite20B #1185 @dvarshney-habana
- Refine use cache for mpt model #1158 @Jing1Ling
- GPT-J support reuse_cache #1094 @atakaha
- Use fast softmax only on prefill #1159 @jaygala223
- Starcoder2: KV cache and flash attention (FusedSDPA) enablement #1149 @abhatkal
- GPT BigCode: fused SDPA #1260 @yeonsily
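Several of the optimizations above (bucket_internal for MPT, the KV-cache and recompilation work) revolve around shape bucketing: rounding sequence lengths up to a small set of boundaries so padded inputs share shapes and previously compiled HPU graphs are reused instead of recompiled. A minimal pure-Python sketch of the idea (the helper name is illustrative, not the actual optimum-habana API):

```python
# Sketch of the shape-bucketing idea behind entries like "Support
# bucket_internal for MPT": rounding lengths up to bucket boundaries keeps
# the set of compiled graph shapes small. Helper name is hypothetical.
import math

def bucket_length(seq_len: int, bucket_size: int) -> int:
    """Round seq_len up to the next multiple of bucket_size."""
    return math.ceil(seq_len / bucket_size) * bucket_size
```

With a bucket size of 128, every length from 1 to 128 maps to 128 and 129 to 256 maps to 256, so only a handful of shapes ever trigger compilation.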
SAM, FastVIT, VideoMAE, OpenCLIP, DETR, Table Transformer, deciLM
- Add an example of Segment Anything Model [Inference] #814 @cfgfung
- Add an example of the FastViT model (Inference) #826 @cfgfung
- VideoMAE Model Enabling and Examples #922 @pi314ever
- OpenCLIP sample for visual question answering #977 @vidyasiv
- Enabled DETR (Object Detection) model #1046 @cfgfung
- Table transformer enabling #978 @pi314ever
- deciLM support #1133 @sywangyi
Stable Diffusion inpainting, unconditional image generation
- Add Stable Diffusion inpainting support #869 @yuanwu2017
- Enable Unconditional Image Generation on Gaudi 2 [Diffuser/Tasks] #859 @cfgfung
Text feature extraction example
- Feature extraction enabling #994 @pi314ever
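The feature-extraction example reduces per-token hidden states to a single sentence vector. A hedged, pure-Python stand-in for the usual mean-pooling step (the real example runs model outputs on HPU; this function name is illustrative):

```python
# Illustrative mean pooling: average token embeddings over non-padded
# positions to obtain one fixed-size sentence embedding. Pure-Python
# stand-in for what the feature-extraction example does with model outputs.
def mean_pool(token_embeddings, attention_mask):
    """Average embeddings over positions where attention_mask is 1."""
    dim = len(token_embeddings[0])
    pooled = [0.0] * dim
    kept = 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask:  # skip padding positions
            kept += 1
            for j in range(dim):
                pooled[j] += emb[j]
    return [v / kept for v in pooled]
```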
Tensor parallelism
- Tensor parallel distributed strategy without using deepspeed #1121 @kalyanjk
- Disable torch.compile for all_reduce when parallel_strategy is set to "tp" #1174 @kalyanjk
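Conceptually, the "tp" strategy shards a layer's weights across devices, each worker computes its slice of the output, and a collective reassembles the result. A pure-Python sketch under that assumption (plain concatenation stands in for the collective; names are illustrative, not the library's API):

```python
# Conceptual sketch of a tensor-parallel linear layer: the weight matrix
# is row-sharded across workers, each computes its output slice, and
# concatenation (playing the all-gather role) reassembles the full output.
# Pure-Python stand-in for what runs across multiple Gaudi devices.
def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def tp_matvec(weight, x, n_workers):
    shard = len(weight) // n_workers  # rows per worker
    outputs = []
    for rank in range(n_workers):  # each iteration runs on one device in real TP
        rows = weight[rank * shard:(rank + 1) * shard]
        outputs.extend(matvec(rows, x))  # concatenation = all-gather stand-in
    return outputs
```

The sharded result matches the single-device computation exactly, which is the invariant a tensor-parallel implementation must preserve.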
Kubernetes cluster example
- Adds a helm chart, dockerfile, and instructions for running examples using a Kubernetes cluster #1099 @dmsuehir
- Fix PyTorch version in the Kubernetes docker-compose to match image #1246 @dmsuehir
FP8 training
- TE FP8 integration #1096 @SanjuCSudhakaran
Other
- Updates run_lora_clm.py with enhanced dataset support #955 @dmsuehir
- Fix prefix tuning finetune issue and update test #975 @sywangyi
- Fix throughput calculation in image-to-text example #1070 @regisss
- SDXL training: fixed CI, changed gated dataset, fixes for non-square datasets #1038 @imangohari1
- Updating batch_size of Albert-XXL in README #1063 @vineethanandh
- Fix the error when running run_pipeline.py in the text-generation example #1055 @yuanwu2017
- Add a test for llama finetuning with FP8 precision #1106 @SanjuCSudhakaran
- Beam-search fix #1113 @ssarkar2
- Add chat format support dataset in SFT #1066 @libinta
- Fix nan loss of gemma and crash if dataset_concatenation is not set #1088 @sywangyi
- torch.compile: keep input mutation in graph, which avoids unnecessary memcpy #1069 @sushildubey171
- Updated langchain text-generation pipeline to work with latest release 0.2.5 #1084 @rbrugaro
- Add the MC example #891 @yuanwu2017
- Fix recompiles if limit_hpu_graph is False #1129 @ssarkar2
- Update examples' batch size in README #1123 @shepark
- Fix OOM error in SDXL Fine-Tuning validation stage #1134 @dsocek
- Added an example demonstrating deterministic image generation #878 @cfgfung
- SD image variation/InstructPix2Pix/StableDiffusionXLImg2ImgPipeline pipeline #988 @sywangyi
- Add ci test for trl rewarding and ppo, fix backward failure in ppo caused by rmsfusion #1020 @sywangyi
- Llama adapter #983 @sywangyi
- torch.flip issue is fixed in SynapseAI 1.16, so remove the workaround #1092 @sywangyi
- Fix test CausalLanguageModelingLORAExampleTester KeyError #1139 @dmsuehir
- fix(ci): new runs-on #1136 @XciD
- Add trust_remote_code for loading datasets in the audio classification example #1074 @regisss
- Generation example: print number of warmup iterations #1145 @mgonchar
- CI updates: text-gen to receive ranks/bs, updated bs/metric for baselines #1140 @imangohari1
- Support for custom files for run_lora_clm.py #1039 @vidyasiv
- Change the device_id for FSDP plugin #1086 @ckvermaAI
- Set KV Cache update as static method #1160 @ulivne
- Fix CPU tensor issue #1157 @mkumargarg
- Add missing __init__.py to Mistral and Mixtral test packages #1188 @rkumar2patel
- Add example of multitask_prompt/poly tuning #915 @sywangyi
- Fix data-type mismatch for mlperf_inference accuracy test #1146 @kalyanjk
- Fix spawn MP context, limit cpu and download data #1131 @polisettyvarma
- T5 multi card #1222 @yafshar
- Add trust_remote_code for t5 poly-tuning test #1220 @yafshar
- Resolve "empty tensor optional" error with hpu_graphs + kv cache for StarCoder #1181 @vidyasiv
- Fix VIT, add wav2vec comment #1223 @ssarkar2
- Roberta tests were running on CPU #1229 @ssarkar2
- Fix bert/roberta contrastive search tests #1226 @skavulya
- Remove the default env variable to trust remote code by default #1225 @yafshar
- Improve style check workflow #1230 @regisss
- Added scheduler selection for SDXL fine-tuning #867 @kplau1128
- Clarify help message for ignore_eos to avoid misunderstanding @sywangyi
- Support loading Hugging Face checkpoints #1165 @ulivne
- Change triggering event for code style check #1238 @regisss
- gptj: fix missing token_idx #1234 @envsp
- fix(nltk): fixed the version to a working one #1247 @imangohari1
- Updating to avoid hardcoding tests in CI framework #1221 @vidyasiv
- Fix FSDP graph error due to the Transformers 4.43 update #1251 @jiminha
- Fix SD README commands #1250 @imangohari1
- Fix spelling errors #1252 @changwangss
- Set HLS_MODULE_ID only if it wasn't set previously #1254 @astachowiczhabana
- Fix overflow of steps in SDXL for default diffusers scheduler @dsocek
- fix(test_diffusers): automated the checking for tests without upstream HF #1232 @imangohari1
- fix(nltk): revert #1247, update the version, and add the punkt_tab download #1258 @imangohari1
- Set input_embeds before it gets used #1261 @tthakkal
- Update README and more changes, rebase to main #1259 @shepark
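Two entries above touch benchmarking hygiene (the image-to-text throughput fix and printing the number of warmup iterations). The underlying pattern is that warmup runs trigger HPU graph compilation and must be excluded from the timed window; a hedged sketch, with function and parameter names that are illustrative rather than the examples' actual code:

```python
# Sketch of throughput measurement that excludes warmup iterations:
# warmup runs compile graphs and are not timed; only steady-state
# iterations count toward tokens/second. Names are hypothetical.
import time

def measure_throughput(generate_fn, n_iterations, n_warmup):
    """Return tokens/second over timed iterations only; generate_fn
    returns the number of tokens it produced."""
    print(f"Warmup iterations: {n_warmup}")
    for _ in range(n_warmup):  # not timed: graph compilation happens here
        generate_fn()
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(n_iterations):
        total_tokens += generate_fn()
    return total_tokens / (time.perf_counter() - start)
```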
Known limitations
- For Llama, some large batch sizes lead to out-of-memory errors whereas they worked in previous releases