Add learning rate scheduling support for `DeepSpeedStrategy` #20320

amorehead · 2024-10-05T02:35:29Z

What does this PR do?

Adds learning rate scheduling support for `DeepSpeedStrategy`.
Credit to lvhoaa for suggesting this change, which makes Fabric's support for internal DeepSpeed features more robust.

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs) N
[x] Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20320.org.readthedocs.build/en/20320/

for more information, see https://pre-commit.ci

lantiga · 2024-10-07T11:39:43Z

Thanks for the contribution @amorehead! Let's get to a green CI and take it from there


lantiga · 2024-11-12T22:49:13Z

Hey @amorehead, it looks like the CI failures are legit. Let me know if you can fix those.


src/lightning/fabric/strategies/deepspeed.py

src/lightning/fabric/strategies/fsdp.py

src/lightning/fabric/strategies/strategy.py

src/lightning/fabric/strategies/xla_fsdp.py

lantiga

Thank you @amorehead! I added a few comments. Essentially we need to turn this into a non-breaking change.

Also a small update to docs is needed.

lantiga · 2024-12-08T03:38:48Z

src/lightning/fabric/strategies/deepspeed.py

-
-        Currently, only a single optimizer is supported.
+        self, module: Module, optimizers: list[Optimizer], scheduler: Optional[_LRScheduler] = None
+    ) -> tuple["DeepSpeedEngine", list[Optimizer], Optional[_LRScheduler]]:


This will return None when no scheduler is passed. We need to return Any here so we can ignore the scheduler if it is not provided as input.

lantiga · 2024-12-08T03:38:51Z

src/lightning/fabric/fabric.py

@@ -266,7 +269,7 @@ def setup(

        if optimizers:
            # join both types in a tuple for API convenience
-            return (module, *optimizers)
+            return (module, *optimizers, scheduler)


This is a breaking change: it will cause existing user code to fail, because the scheduler is returned unconditionally.

Since scheduler is Optional in the signature, I suggest we return it only if it was passed as a non-None argument, so we won't break anyone's code.
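
To make the suggestion concrete, here is a minimal sketch of the conditional return. It is not the actual `Fabric.setup` implementation: `setup_sketch` is a hypothetical stand-in, plain strings replace the real module, optimizer, and scheduler objects, and all wrapping logic is left out. Only the shape of the returned value is illustrated.

```python
from typing import Any, Optional


def setup_sketch(module: Any, *optimizers: Any, scheduler: Optional[Any] = None) -> Any:
    """Hypothetical stand-in for the return logic discussed above.

    The real wrapping of the module and optimizers is omitted on purpose.
    """
    if optimizers:
        if scheduler is not None:
            # Opt-in path: append the scheduler only when the caller passed one.
            return (module, *optimizers, scheduler)
        # Unchanged path: same tuple as before, so existing unpacking still works.
        return (module, *optimizers)
    # No optimizers: return the module on its own (the scheduler is ignored in this sketch).
    return module


# Existing call sites keep unpacking two values.
model, optimizer = setup_sketch("model", "optimizer")

# Call sites that opt in receive the scheduler as the last element.
model, optimizer, scheduler = setup_sketch("model", "optimizer", scheduler="scheduler")
```

With an unconditional `return (module, *optimizers, scheduler)`, the first call site above would raise `ValueError: too many values to unpack`, which is the breakage this comment is about.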

lantiga · 2024-12-08T03:39:02Z

src/lightning/fabric/strategies/deepspeed.py

-        optimizer: Optional[Optimizer] = None,
-    ) -> tuple["DeepSpeedEngine", Optimizer]:
+        self, model: Module, optimizer: Optional[Optimizer] = None, scheduler: Optional[_LRScheduler] = None
+    ) -> tuple["DeepSpeedEngine", Optimizer, Optional[_LRScheduler]]:


Same comment as above

lantiga · 2024-12-08T03:39:29Z

src/lightning/fabric/utilities/seed.py

@@ -104,7 +104,10 @@ def pl_worker_init_function(worker_id: int, rank: Optional[int] = None) -> None:
    if _NUMPY_AVAILABLE:
        import numpy as np

-        np.random.seed(seed_sequence[3] & 0xFFFFFFFF)  # numpy takes 32-bit seed only
+        ss = np.random.SeedSequence([base_seed, worker_id, global_rank])


This is an unrelated change and shouldn't be included.

lantiga · 2024-12-10T22:35:19Z

@amorehead I'm wrapping up the last few PRs for the release. Do you have time to fix this one in the next couple of days?

amorehead added 8 commits October 4, 2024 21:15

Update fabric.py

188a45f

Update deepspeed.py

baf5988

Update deepspeed.py

1f4c18e

Update fabric.py

585e302

Update fsdp.py

0451761

Update strategy.py

a912aab

Update strategy.py

d27d4a3

Update xla_fsdp.py

67089a1

amorehead requested review from lantiga, Borda, tchaton and justusschock as code owners October 5, 2024 02:35

github-actions bot added the fabric lightning.fabric.Fabric label Oct 5, 2024

pre-commit-ci bot and others added 5 commits October 5, 2024 02:35

[pre-commit.ci] auto fixes from pre-commit.com hooks

1025875

for more information, see https://pre-commit.ci

Update fsdp.py

9b45b99

Update strategy.py

a7a5835

Update xla_fsdp.py

3ece31c

Update deepspeed.py

e48acd2

amorehead and others added 6 commits October 28, 2024 11:25

Update seed.py

f13516d

[pre-commit.ci] auto fixes from pre-commit.com hooks

80b4a6d

for more information, see https://pre-commit.ci

Update seed.py

2cab7e2

Update seed.py

e9127f4

Update seed.py

c127458

Merge branch 'master' into patch-2

31a1fce

lantiga added the strategy: deepspeed label Nov 12, 2024

lantiga added the waiting on author Waiting on user action, correction, or update label Nov 12, 2024

Merge branch 'master' into patch-2

f215626

mergify bot added the has conflicts label Nov 25, 2024

Merge branch 'master' into patch-2

dfce07e

lantiga requested a review from ethanwharris as a code owner November 25, 2024 10:33

[pre-commit.ci] auto fixes from pre-commit.com hooks

25e8d48

for more information, see https://pre-commit.ci

mergify bot removed the has conflicts label Nov 25, 2024

lantiga reviewed Nov 25, 2024


lantiga added 4 commits November 25, 2024 11:35

Update src/lightning/fabric/strategies/deepspeed.py

737162d

Update src/lightning/fabric/strategies/fsdp.py

2d347d0

Update src/lightning/fabric/strategies/strategy.py

5d227ff

Update src/lightning/fabric/strategies/xla_fsdp.py

f94efa7

lantiga reviewed Dec 8, 2024

