ModernBERT model optimized for Apple Neural Engine.
🏎️ 2.4 TFLOP/s (1024 token input; base model)
🔋 2.1 W of power (1024 token input; base model)
🤏 1 file for the model definition (a la nanoGPT)
# Setup
$ python -m venv env
$ . env/bin/activate
$ pip install -r requirements.txt

# Convert the model to CoreML
$ python convert.py

# Run a masked LM prediction with the converted model
$ python predict.py $path_to_model.mlpackage "The sky is [MASK]."
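To call the converted model from Python instead of through predict.py, a minimal sketch is below. It loads the .mlpackage with coremltools and tokenizes with the HuggingFace ModernBERT tokenizer; the feature names (`input_ids`, `attention_mask`, `logits`), the int32 dtype, and the fixed 1024-token input are assumptions here, so check the converted model's spec (or predict.py) for the actual values.

```python
import numpy as np
import coremltools as ct
from transformers import AutoTokenizer

SEQ_LEN = 1024  # assumed fixed input length

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ct.models.MLModel("path/to/model.mlpackage")  # hypothetical path

enc = tokenizer(
    "The sky is [MASK].",
    padding="max_length",
    max_length=SEQ_LEN,
    return_tensors="np",
)

out = model.predict({
    "input_ids": enc["input_ids"].astype(np.int32),          # assumed input names/dtypes
    "attention_mask": enc["attention_mask"].astype(np.int32),
})
logits = out["logits"]  # assumed output name, shape (1, SEQ_LEN, vocab_size)

# Decode the top prediction at the [MASK] position.
mask_pos = int(np.argmax(enc["input_ids"][0] == tokenizer.mask_token_id))
print(tokenizer.decode([int(logits[0, mask_pos].argmax())]))
```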
Compare the accuracy of these models against HuggingFace's implementation.
# Compare the PyTorch model in model.py
$ python diff_torch.py
# Compare a converted CoreML model
$ python diff_coreml.py $path_to_model.mlpackage
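Both scripts compare this repo's models against the HuggingFace implementation. If you want to run a similar check on your own inputs, a rough sketch of one such metric (mean KL divergence between output token distributions, the fidelity measure mentioned below) looks like this; the exact metrics and loading logic in diff_torch.py / diff_coreml.py may differ.

```python
import torch
import torch.nn.functional as F

def mean_kl(ref_logits: torch.Tensor, test_logits: torch.Tensor) -> torch.Tensor:
    """Mean KL(reference || test) over all token positions.

    Both tensors are (batch, seq_len, vocab_size) logits.
    """
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    test_logp = F.log_softmax(test_logits.float(), dim=-1)
    # kl_div expects log-probs for the input; log_target=True lets the target be log-probs too.
    kl = F.kl_div(test_logp, ref_logp, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean()

# Reference logits could come from HuggingFace's implementation, e.g.:
# from transformers import AutoModelForMaskedLM
# ref_logits = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")(**inputs).logits
```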
The Neural Engine requires float16 weights and activations. Some computations can be performed in float32, but outlier activations can still severely degrade output predictions.
ModernBERT, like many modern decoder-only LLMs, exhibits outlier activations on the order of 20-30k. These values fit in float16, but only in the range where its precision is coarse (adjacent representable values near 30k are 16 apart), and without intervention they are enough to visibly degrade the CoreML model's predictions on the Neural Engine.
To mitigate this, the conversion process in this repo applies QuaRot/SpinQuant-style orthogonal rotations. This greatly improves the converted model's fidelity (as measured by KL divergence against the reference model). However, token predictions will not exactly match a PyTorch model that does some or all of its computation in higher precision (bfloat16, float32). Be sure to test for your use case.
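The intuition behind these rotations: an orthogonal matrix Q can be folded into the weights that write to and read from the hidden states, so the end-to-end function is unchanged while the intermediate activations become rotated, spreading outlier magnitude across many channels. The toy sketch below demonstrates only the weight folding; it is not this repo's conversion code, and it glosses over ModernBERT-specific details such as norm folding and which projections are actually rotated.

```python
import torch

torch.manual_seed(0)
d = 768

write = torch.nn.Linear(d, d, bias=False)  # projects into the hidden states
read = torch.nn.Linear(d, d, bias=False)   # consumes the hidden states
with torch.no_grad():
    write.weight[0] *= 100.0  # simulate one outlier channel

x = torch.randn(4, d)
reference = read(write(x))

# Random orthogonal Q (QuaRot uses random Hadamard matrices; SpinQuant learns Q).
Q, _ = torch.linalg.qr(torch.randn(d, d))

# Fold Q into both weights so the activations between the layers become y @ Q.
write_rot = torch.nn.Linear(d, d, bias=False)
read_rot = torch.nn.Linear(d, d, bias=False)
with torch.no_grad():
    write_rot.weight.copy_(Q.T @ write.weight)  # write_rot(x) == write(x) @ Q
    read_rot.weight.copy_(read.weight @ Q)      # read_rot(y @ Q) == read(y)

rotated = read_rot(write_rot(x))
print((reference - rotated).abs().max().item())                      # ~0: same end-to-end function
print(write(x).abs().max().item(), write_rot(x).abs().max().item())  # outlier peak shrinks after rotation
```

The rotation does not change what the model computes, only the basis the hidden states are expressed in, which is why it can be applied entirely at conversion time.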
Borrows heavily from:
Future work:
- support longer sequence lengths (> 1024)
- alternative attention implementations (split einsum, efficient attention for longer sequence lengths)
- generate/use SpinQuant matrices for improved outlier reduction
- investigate PrefixQuant for improved outlier reduction
- convert core model separately from heads to allow hot-swapping of different heads
- pack short sequences into a single prediction
- support heads beyond masked LM