Extension of lightning DDP support #59
Labels
feature
Change that does not break compatibility, but affects the public interfaces.
Motivation
I am attempting to use Optuna for hyperparameter optimization of a complex, Lightning-based deep learning framework. It is essential for this framework to run in a distributed setting. The distributed Lightning integration example uses ddp_spawn as its strategy, which Lightning strongly discourages because of speed and flexibility concerns (for example, the inability to use a large num_workers value without bottlenecking, which is essential for my use case). Attempting to use the regular DDP strategy, however, results in Optuna generating a different set of hyperparameters for each rank, since my Optuna main script is invoked once per rank. I have considered running my distributed main script in a subprocess started from the objective function, but this would prevent me from using the PyTorchLightningPruningCallback, since I cannot reliably pass that object to the subprocess.
Description
My suggestion is to add a way for Optuna to run with regular DDP, perhaps by tracking in the storage whether DDP is being used, so that when study.optimize is called the correct trial is produced and the trial's suggest methods return the same hyperparameters across all ranks. I do not know enough about Optuna's internals to judge whether this is feasible to implement. Is this something that could be supported in the future?
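To illustrate the behavior I am asking for, here is a minimal pure-Python sketch of the pattern I have in mind: only rank 0 samples hyperparameters, and every other rank receives the same values. Everything here is hypothetical stand-in code, not Optuna API: `sample_params` stands in for the `trial.suggest_*` calls, and copying the `broadcast_buffer` dict stands in for something like `torch.distributed.broadcast_object_list`.

```python
import random


def sample_params(seed):
    # Stand-in for trial.suggest_* calls; the seed models each rank's
    # independently running Optuna script drawing its own values.
    rng = random.Random(seed)
    return {"lr": rng.uniform(1e-5, 1e-1), "batch_size": rng.choice([16, 32, 64])}


def run_ranks_naive(world_size):
    # Current behavior with regular DDP: the main script runs once per rank,
    # so every rank samples its own, different hyperparameters.
    return [sample_params(seed=rank) for rank in range(world_size)]


def run_ranks_broadcast(world_size):
    # Desired behavior: only rank 0 samples; the result is broadcast to the
    # other ranks (dict copy stands in for broadcast_object_list).
    broadcast_buffer = sample_params(seed=0)
    return [dict(broadcast_buffer) for _ in range(world_size)]


naive = run_ranks_naive(4)
synced = run_ranks_broadcast(4)
print(len({tuple(p.items()) for p in naive}))   # several distinct parameter sets
print(len({tuple(p.items()) for p in synced}))  # one shared parameter set
```

This only shows the synchronization pattern; the hard part, which I cannot solve from outside, is making the trial object itself (including pruning reports) behave consistently across ranks.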
Alternatives (optional)
No response
Additional context (optional)
No response