[train] TensorflowTrainer does not work with keras>3.x #47464

Open
crbellis opened this issue Sep 3, 2024 · 8 comments
Assignees: justinvyu
Labels
bug: Something that is supposed to be working; but isn't
P0: Issues that should be fixed in short order
train: Ray Train Related Issue

Comments

@crbellis

crbellis commented Sep 3, 2024

What happened + What you expected to happen

I was trying to run this example from the documentation; however, it results in an error. I've tested this on two different clusters: one CPU-only with GPU set to false, the other a cluster of GPUs.

Tensorflow example here.

The error is

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

I expected the sample from the docs to run successfully.

Versions / Dependencies

ray==2.30.0
python==3.11

Reproduction script

No changes made to this code from the doc.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@crbellis crbellis added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 3, 2024
@anyscalesam anyscalesam added the train Ray Train Related Issue label Sep 3, 2024
@crbellis
Author

Similar error from here...

@crbellis
Author

And a toy example here:

import ray
import tensorflow as tf
from ray.train.tensorflow import TensorflowTrainer
from ray.train import ScalingConfig


def build_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(128, activation="relu"))
    model.add(tf.keras.layers.Dense(10))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


def train_func(config):
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()

        dataset = ray.train.get_dataset_shard("train")
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=32
        )
        print("TF DATASET: ")
        print(tf_dataset)

    model.fit(tf_dataset, epochs=5)


train_dataset = ray.data.from_items([{"x": x / 10, "y": x % 10} for x in range(1000)])
scaling_config = ScalingConfig(num_workers=2, use_gpu=False)

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    datasets={"train": train_dataset},
    scaling_config=scaling_config,
)

results = trainer.fit()
print(results.metrics)

Error:

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(16,), dtype=float64, numpy=
array([0. , 0.1, 0.2, 0.3, 0.4, 1. , 1.1, 1.2, 1.3, 1.4, 2. , 2.1, 2.2,
       2.3, 2.4, 3. ])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

@crbellis
Author

For more context, this issue occurred when running tensorflow==2.16.1. Downgrading to tensorflow==2.15.1 fixed it, so there seems to be a compatibility issue with that TF version.

@beck-weber-ing

seeing this issue on custom code using TensorflowTrainer and MultiWorkerMirroredStrategy.

versions:

# pip freeze | grep "tensor\|ray"
memray==1.14.0
ray==2.37.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
tensorflow==2.17.0
tensorflow-addons==0.23.0
tensorflow-io-gcs-filesystem==0.37.1

@crbellis
Author

crbellis commented Oct 8, 2024

@beck-weber-ing btw, I'm not seeing this on tensorflow==2.15.1 (edit: it's been so long I forgot I already shared this, apologies!). So something must have changed on the TF side that is potentially breaking the Ray trainer.

@beck-weber-ing

for me the error looks like this and happens upon calling model.fit (train.py:210):

ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=25075, ip=10.164.0.45, actor_id=b5e4e61bdbcffb83e632876f20000000, repr=TensorflowTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute.get_next() (pid=25120, ip=10.164.0.45, actor_id=9ccd80edf5393c37186db10920000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7d161324e2f0>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/workspace/train.py", line 210, in f
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1, 125, 21), dtype=float64, numpy=
array([[[........]]])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

@justinvyu justinvyu self-assigned this Nov 25, 2024
@justinvyu justinvyu added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2024
@justinvyu
Contributor

justinvyu commented Dec 4, 2024

@crbellis @beck-weber-ing @ghsanti Thanks for filing this issue.

The issue seems to be an incompatibility between the TensorFlow distributed API (tf.distribute.MultiWorkerMirroredStrategy), which TensorflowTrainer depends on, and keras>=3.0.

Even without Ray, tf.distribute.MultiWorkerMirroredStrategy does not work with Keras 3. Follow this issue for more updates: keras-team/keras#20585

The problem is that tensorflow>=2.16 bumps the Keras version to 3.x. Here are the workarounds for now, while keras-team/keras#20585 is still unresolved.

Workaround 1: Pin the tensorflow (and keras) version

tensorflow<2.16.0
keras<3.0.0
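
As a rough illustration of the version boundary behind this pin, the sketch below reports whether an installed tensorflow falls in the range the thread describes as affected (2.16 and later). The helper names (`major_minor`, `tf_defaults_to_keras3`, `describe_environment`) are hypothetical and not part of Ray or TensorFlow, and the lookup assumes the distribution is installed under the name `tensorflow` (variants such as CPU-only builds may use a different name).

```python
import re
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return (major, minor) parsed from a version string such as '2.16.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def tf_defaults_to_keras3(tf_version: str) -> bool:
    """True for tensorflow>=2.16, the releases this thread reports as affected."""
    return major_minor(tf_version) >= (2, 16)


def describe_environment() -> str:
    """Summarize installed tensorflow/keras versions (hypothetical helper)."""
    try:
        # Assumption: the distribution is named "tensorflow".
        tf_version = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return "tensorflow is not installed"
    try:
        keras_version = metadata.version("keras")
    except metadata.PackageNotFoundError:
        keras_version = "not installed"
    if tf_defaults_to_keras3(tf_version):
        status = "Keras 3 is the default; pin versions or use tf_keras"
    else:
        status = "Keras 2 is the default"
    return f"tensorflow=={tf_version}, keras=={keras_version}: {status}"


print(describe_environment())
```

Running this on each node before launching TensorflowTrainer is one way to catch a mismatched environment early; it only inspects package metadata and does not import tensorflow.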

Workaround 2: Use the legacy Keras 2 package

If you need to use a later version of tensorflow, it is still backwards compatible with Keras 2.x, but you'll need to install a separate package (tf_keras) and change the Keras import (tf.keras -> tf_keras, e.g. import tf_keras instead of from tensorflow import keras).

See here: https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility
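
A minimal sketch of what Workaround 2 might look like inside a training function, assuming `tf_keras` has been installed with pip on every worker. The `TF_USE_LEGACY_KERAS=1` environment variable comes from the Keras page linked above, not from this thread; it is an alternative to editing imports and only takes effect if set before tensorflow is first imported. `enable_legacy_keras` is a hypothetical helper, and this thread does not confirm that the sketch resolves the PerReplica error.

```python
import os
import sys


def enable_legacy_keras() -> bool:
    """Ask TensorFlow to resolve tf.keras to the tf_keras (Keras 2) package.

    Returns False when tensorflow was already imported in this process, in
    which case the variable may be set too late to have any effect.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_USE_LEGACY_KERAS"] = "1"
    return not already_imported


def train_func(config):
    # Runs on each Ray Train worker, so the variable is set in that process.
    enable_legacy_keras()

    # Imported lazily so the environment variable is set first.
    import tensorflow as tf
    import tf_keras  # Keras 2 package; requires `pip install tf_keras`

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Build the model with tf_keras explicitly instead of tf.keras.
        model = tf_keras.Sequential(
            [
                tf_keras.layers.Dense(128, activation="relu"),
                tf_keras.layers.Dense(10),
            ]
        )
        model.compile(optimizer="adam", loss="mean_squared_error")
    return model
```

Importing `tf_keras` explicitly does not depend on the environment variable, so it is the more predictable of the two routes when it is unclear whether tensorflow has already been imported on the worker.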

@justinvyu justinvyu changed the title [Tensorflow] Trainer example does not run [train] TensorflowTrainer does not work with keras>3.x Dec 4, 2024
@crbellis
Author

crbellis commented Dec 6, 2024

Thanks @justinvyu!
