I ran the COMA, HATRPO, and MAPPO algorithms in the Simple Spread environment for 500,000 timesteps, and none of them achieved a reward higher than -100, even though most rewards in the results folder fall in the range of -30 to -40. In fact, the reward after training is even lower than at the start. The model parameters I used are the same as those in the results folder. Here is my training script:
from marllib import marl

# prepare the Simple Spread environment (force_coop makes the reward fully cooperative)
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)

# initialize the algorithm with the MPE-tuned hyper-parameters
coma = marl.algos.coma(hyperparam_source="mpe")

# build the agent model from the env, the algorithm, and user preferences
model = marl.build_model(env, coma, {"core_arch": "gru", "encode_layer": "128-256"})

# start training for 500,000 timesteps with grouped policy sharing
coma.fit(env, model, stop={"timesteps_total": 500000}, share_policy="group", checkpoint_freq=100000, checkpoint_end=True)
Are you plotting episode_reward_mean or episode_reward_max? I suspect the "reward" column in the results CSV is ray/tune/episode_reward_max, but I may be wrong.
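One quick way to check is to load the progress.csv that Ray Tune writes into each trial directory and plot both columns against timesteps. A minimal sketch (the path below is a placeholder; point it at your own trial folder under ~/ray_results):

import pandas as pd
import matplotlib.pyplot as plt

# placeholder path -- replace with your actual Ray Tune trial directory
progress = pd.read_csv("path/to/your/trial/progress.csv")

# Ray Tune logs both metrics once per training iteration
plt.plot(progress["timesteps_total"], progress["episode_reward_mean"], label="episode_reward_mean")
plt.plot(progress["timesteps_total"], progress["episode_reward_max"], label="episode_reward_max")
plt.xlabel("timesteps_total")
plt.ylabel("episode reward")
plt.legend()
plt.show()

If the curve in the results folder matches your episode_reward_max rather than your episode_reward_mean, that would explain the gap.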