Does Keras 3 work with `tf.distribute.MultiWorkerMirroredStrategy`? #20585
Comments
Welcome to the party, pal: #20329 I will say, it's good to know that it looks like you're not using input dictionaries, thus removing one layer of abstraction, and it still doesn't work.
Hi @justinvyu, thanks for reporting this. You can use …
Thanks @dhantule. Do you know when the …
Hi @justinvyu, data parallelism and distributed tuning can be combined: we can run multiple trials of training and leverage data parallelism to speed up the training process. If you have 8 workers with 2 GPUs on each worker, you can run 8 parallel trials, with each trial training on 2 GPUs, by using …
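A minimal sketch of that setup, assuming KerasTuner is the tuning library in question and that each trial is pinned to one worker, with `tf.distribute.MirroredStrategy` spreading the trial over that worker's two GPUs (the search space, dataset, and paths below are placeholders, not anything from this issue):

```python
# Hypothetical sketch: one tuning trial per worker, each trial data-parallel
# over that worker's local GPUs via MirroredStrategy.
import keras
import keras_tuner
import tensorflow as tf

def build_model(hp):
    # Toy search space; the real model and hyperparameters are up to you.
    units = hp.Int("units", min_value=32, max_value=256, step=32)
    model = keras.Sequential([
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=8,
    # Each trial replicates its training across the GPUs visible to this worker.
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="tuning_results",   # placeholder: a path shared by all workers
    project_name="multi_worker_question",
)

# Distributed tuning across the 8 workers is coordinated by KerasTuner via the
# KERASTUNER_TUNER_ID / KERASTUNER_ORACLE_IP / KERASTUNER_ORACLE_PORT
# environment variables set on each worker before launching this script.
# x_train, y_train, x_val, y_val are placeholders for your dataset:
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```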
Hi @dhantule, we still need to be able to train a single model across multiple nodes, even without distributed tuning.
Keras 3 does not work with `tf.distribute.MultiWorkerMirroredStrategy`. Whether support will be added is a question for the Keras team at Google to answer. @jeffcarp, do you know? If your problem is, "I have many machines, each with 1 or more GPUs, and I want to train a single model on all of them", then the answer is Keras 3 + JAX + the …
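A rough sketch of that JAX path, assuming the Keras 3 `keras.distribution` API and one process per host launched with the usual JAX multi-host settings (coordinator address, process count, process id supplied by your launcher); treat it as an outline under those assumptions rather than a verified recipe:

```python
# Hypothetical outline of multi-host data-parallel training on the JAX backend.
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before importing keras

import jax
import keras
import numpy as np

# On each host, initialize the JAX distributed runtime. With no arguments it
# auto-detects on some clusters; otherwise pass coordinator_address=...,
# num_processes=..., process_id=... from your launcher.
jax.distributed.initialize()

# Shard each training batch across all devices on all hosts.
keras.distribution.set_distribution(keras.distribution.DataParallel())

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Placeholder data; in practice each process would feed its own shard.
x = np.random.rand(1024, 16).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=128, epochs=1)
```

Unlike the `tf.distribute` strategies, the distribution here is configured globally before the model is built, so `model.fit` needs no strategy scope.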
This user guide shows Keras 3 usage with `tf.distribute.MirroredStrategy`, which is only useful for single-node multi-GPU data-parallel training. However, `tf.distribute.MultiWorkerMirroredStrategy` is required for multi-node data-parallel training. This example in TensorFlow's docs does not work with `keras-nightly-3.7.0.dev2024120303` and `tf-nightly-2.19.0.dev20241203`, though the problem has been around for a few releases of TensorFlow (since `tensorflow==2.16`, when the default Keras version got bumped to Keras 3; see tensorflow/tensorflow#72388).

Question: Does Keras 3 support `tf.distribute.MultiWorkerMirroredStrategy` for TF distributed training, or does it only support `tf.distribute.MirroredStrategy`?

Reproduction
Try running this example as a colab notebook: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
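The multi-worker setup that notebook exercises boils down to roughly the following condensed sketch; the `TF_CONFIG` addresses, toy model, and random data are simplified placeholders rather than the tutorial's exact code:

```python
# Condensed sketch of the multi-worker tutorial's pattern (not the exact script).
import json
import os

import numpy as np
import tensorflow as tf
import keras

# Each worker gets a TF_CONFIG describing the cluster; addresses are placeholders
# and the task index differs per worker.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model and optimizer variables must be created under the strategy scope.
with strategy.scope():
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy stand-in for the tutorial's MNIST pipeline.
x = np.random.rand(256, 784).astype("float32")
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, batch_size=64, epochs=1)  # the step that fails under Keras 3
```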
You will encounter this error: