[train] `TensorflowTrainer` does not work with `keras>3.x` #47464

crbellis · 2024-09-03T15:45:06Z

What happened + What you expected to happen

I was trying to run this example from the documentation, but it results in an error. I've tested this on two different clusters: one CPU-only with GPU set to false, the other a cluster of GPUs.

Tensorflow example here.

The error is

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

I expected the sample from the docs to run successfully.

Versions / Dependencies

ray==2.30.0
python==3.11

Reproduction script

No changes made to this code from the doc.

Issue Severity

Medium: It is a significant difficulty but I can work around it.


crbellis · 2024-09-19T23:59:14Z

Similar error from here...

crbellis · 2024-09-20T00:05:56Z

And a toy example here:

import ray
import tensorflow as tf
from ray.train.tensorflow import TensorflowTrainer
from ray.train import ScalingConfig


def build_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(128, activation="relu"))
    model.add(tf.keras.layers.Dense(10))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


def train_func(config):
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()

        dataset = ray.train.get_dataset_shard("train")
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=32
        )
        print("TF DATASET: ")
        print(tf_dataset)

    model.fit(tf_dataset, epochs=5)


train_dataset = ray.data.from_items([{"x": x / 10, "y": x % 10} for x in range(1000)])
scaling_config = ScalingConfig(num_workers=2, use_gpu=False)

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    datasets={"train": train_dataset},
    scaling_config=scaling_config,
)

results = trainer.fit()
print(results.metrics)

Error:

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(16,), dtype=float64, numpy=
array([0. , 0.1, 0.2, 0.3, 0.4, 1. , 1.1, 1.2, 1.3, 1.4, 2. , 2.1, 2.2,
       2.3, 2.4, 3. ])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

crbellis · 2024-09-26T19:09:44Z

For more context, this issue occurred when running tensorflow==2.16.1. Downgrading to tensorflow==2.15.1 fixed it. There seems to be a compatibility issue with this TF version.

beck-weber-ing · 2024-10-08T20:38:54Z

seeing this issue on custom code using TensorflowTrainer and MultiWorkerMirroredStrategy.

versions:

# pip freeze | grep "tensor\|ray"
memray==1.14.0
ray==2.37.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
tensorflow==2.17.0
tensorflow-addons==0.23.0
tensorflow-io-gcs-filesystem==0.37.1

crbellis · 2024-10-08T21:45:21Z

@beck-weber-ing ~~btw, I'm not seeing this on tensorflow==2.15.1. So~~ (edit: it's been so long I forgot I already shared this, apologies!) something must've changed on tf side that is potentially breaking the ray trainer

beck-weber-ing · 2024-10-08T22:33:02Z

for me the error looks like this and happens upon calling model.fit (train.py:210):

ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=25075, ip=10.164.0.45, actor_id=b5e4e61bdbcffb83e632876f20000000, repr=TensorflowTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute.get_next() (pid=25120, ip=10.164.0.45, actor_id=9ccd80edf5393c37186db10920000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7d161324e2f0>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/workspace/train.py", line 210, in f
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1, 125, 21), dtype=float64, numpy=
array([[[........]]])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

justinvyu · 2024-12-04T18:50:59Z

@crbellis @beck-weber-ing @ghsanti Thanks for filing this issue.

The issue seems to be an incompatibility between the TensorFlow distributed API (tf.distribute.MultiWorkerMirroredStrategy), which TensorflowTrainer depends on, and keras>=3.x.

Even without Ray, tf.distribute.MultiWorkerMirroredStrategy does not work with Keras 3. Follow this issue for more updates: keras-team/keras#20585

The problem is that tensorflow>=2.16.x bumps the Keras version to 3.x. Here are the workarounds for now, while keras-team/keras#20585 is still unresolved.

Workaround 1: Pin the tensorflow (and keras) version

tensorflow<2.16.0
keras<3.0.0
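
To confirm which side of the pin an environment is on, a quick check of the installed Keras major version can help. The following is only a sketch: the helper names major_version and is_keras3_installed are made up for illustration and are not part of Ray, TensorFlow, or Keras.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.15.0'."""
    return int(version.split(".")[0])


def is_keras3_installed() -> bool:
    """Return True if the installed `keras` distribution is 3.x or newer."""
    try:
        return major_version(metadata.version("keras")) >= 3
    except metadata.PackageNotFoundError:
        # The `keras` distribution is not installed in this environment.
        return False


if is_keras3_installed():
    print("Keras 3 detected: MultiWorkerMirroredStrategy may fail with the PerReplica error.")
else:
    print("Keras 3 not detected.")
```

Running this on each worker image (not just the driver) is probably the more useful check, since the training function executes on the workers.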

Workaround 2: Use the legacy Keras 2 package

If you need a later version of tensorflow, it is still backwards compatible with Keras 2.x, but you'll need to install the tf_keras package and change the Keras import (from tensorflow import keras -> import tf_keras).

See here: https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility
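
As a rough illustration of Workaround 2, here is how the toy example from earlier in this thread might look with tf_keras swapped in for tf.keras. This is an untested sketch that assumes tf_keras is installed on every worker (pip install tf_keras). The imports are placed inside the functions only so that the module itself can be loaded without TensorFlow present.

```python
def build_model():
    # Assumption: the legacy Keras 2 package is installed as `tf_keras`.
    import tf_keras

    model = tf_keras.Sequential()
    model.add(tf_keras.layers.Dense(128, activation="relu"))
    model.add(tf_keras.layers.Dense(10))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


def train_func(config):
    import ray.train
    import tensorflow as tf

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Build and compile the Keras 2 model inside the strategy scope.
        model = build_model()

    dataset = ray.train.get_dataset_shard("train")
    tf_dataset = dataset.to_tf(
        feature_columns="x", label_columns="y", batch_size=32
    )
    model.fit(tf_dataset, epochs=5)
```

The linked Keras page also describes setting the environment variable TF_USE_LEGACY_KERAS=1 so that tf.keras resolves to tf_keras without changing imports. With Ray, that variable would have to be set in the worker processes before TensorFlow is imported there.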

crbellis · 2024-12-06T22:42:37Z

Thanks @justinvyu!

crbellis added the bug and triage labels Sep 3, 2024

anyscalesam added the train label Sep 3, 2024

ghsanti mentioned this issue Nov 24, 2024

[Train,Tuning] Support Keras 3 Parallel Training and HP tuning #48910

Closed

justinvyu self-assigned this Nov 25, 2024

justinvyu added the P0 label and removed the triage label Nov 25, 2024

justinvyu changed the title ~~[Tensorflow] Trainer example does not run~~ [train] TensorflowTrainer does not work with keras>3.x Dec 4, 2024
