This repo is a personal laboratory for training fully autoregressive text-audio multimodal models built on the DualAR (dual autoregressive) Transformer architecture. This architecture is best known as the neural-codec seq2seq backbone of:
- Fish Speech TTS
- Kyutai's Moshi model early in pretraining, before its adaptation to duplex audio.
Models trained here will be compatible with my DualAR fish-speech.rs inference engine.
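To make the "dual autoregressive" idea concrete, here is a minimal sketch of a DualAR forward pass: a slow Transformer runs causally over frames, and a fast Transformer runs causally over the codec codebooks within each frame, conditioned on the slow hidden state. All names, dimensions, and the exact conditioning scheme are assumptions for illustration, not this repo's actual implementation.

```python
# Illustrative DualAR sketch (assumptions throughout; not this repo's code).
import torch
import torch.nn as nn

class DualARSketch(nn.Module):
    def __init__(self, vocab=4096, codebooks=8, d_model=512, n_heads=8, n_slow=6, n_fast=2):
        super().__init__()
        self.codebooks = codebooks
        self.tok_emb = nn.Embedding(vocab, d_model)  # shared text/audio embedding (assumption)
        slow_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        fast_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.slow = nn.TransformerEncoder(slow_layer, n_slow)  # "slow" AR over frames
        self.fast = nn.TransformerEncoder(fast_layer, n_fast)  # "fast" AR over codebooks in a frame
        self.head = nn.Linear(d_model, vocab)

    @staticmethod
    def causal_mask(n, device):
        return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

    def forward(self, frame_tokens, codebook_tokens):
        # frame_tokens:    (B, T)     one text/semantic token per frame
        # codebook_tokens: (B, T, C)  residual codec tokens for each frame
        B, T = frame_tokens.shape
        dev = frame_tokens.device

        # Slow transformer: causal attention over the frame axis.
        h = self.slow(self.tok_emb(frame_tokens), mask=self.causal_mask(T, dev))  # (B, T, D)

        # Fast transformer: causal attention over the codebook axis, conditioned
        # on the slow hidden state by prepending it to each frame's codebook sequence.
        cb = self.tok_emb(codebook_tokens).reshape(B * T, self.codebooks, -1)
        seq = torch.cat([h.reshape(B * T, 1, -1), cb[:, :-1]], dim=1)  # teacher forcing
        out = self.fast(seq, mask=self.causal_mask(self.codebooks, dev))
        return self.head(out).reshape(B, T, self.codebooks, -1)  # per-codebook logits
```

At inference time the slow model advances one frame per step while the fast model fills in that frame's codebooks token by token, which is what keeps generation fully autoregressive across both axes.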
Please do not expect anything here to be usable currently. Full documentation will come once an early artifact is good enough to release.