[PyTorch] Adding TP overlap support for te.Linear with parallel_mode="column" #1343
base: main
90458d4 to 4e3e61a
/te-ci pytorch L1
Overall LGTM, pending CI.
ub_overlap_ag: bool = False,
ub_overlap_rs: bool = False,
ub_bulk_dgrad: bool = False,
ub_bulk_wgrad: bool = False,
ub_name: Optional[str] = None,
We should seriously consider deprecating these UB options and just passing in a dict. The UB interface is unstable and will likely remain so for some time. A dict would be better for backward compatibility (reinterpret old options) and forward compatibility (ignore unknown options). This would be especially helpful for Mcore integration.
For example, the operation-based API passes in UB options with a dict:
userbuffers_options: Optional[dict[str, Any]] = None,
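To make the compatibility argument concrete, here is a minimal sketch of how a dict of UB options could be normalized. Everything in it is hypothetical: the key names, the alias table, and `normalize_ub_options` are illustrations modeled on the keyword arguments in this diff, not part of the TE API.

```python
import warnings
from typing import Any, Optional

# Hypothetical canonical option names and defaults (assumed, not TE's API).
_KNOWN_UB_OPTIONS: dict[str, Any] = {
    "overlap_ag": False,
    "overlap_rs": False,
    "bulk_dgrad": False,
    "bulk_wgrad": False,
    "name": None,
}

# Hypothetical legacy spellings, reinterpreted for backward compatibility.
_LEGACY_ALIASES: dict[str, str] = {
    "ub_overlap_ag": "overlap_ag",
    "ub_overlap_rs": "overlap_rs",
    "ub_bulk_dgrad": "bulk_dgrad",
    "ub_bulk_wgrad": "bulk_wgrad",
    "ub_name": "name",
}


def normalize_ub_options(options: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return a fully populated options dict from a possibly partial one."""
    resolved = dict(_KNOWN_UB_OPTIONS)
    for key, value in (options or {}).items():
        # Backward compatibility: map old names onto the current ones.
        key = _LEGACY_ALIASES.get(key, key)
        if key not in resolved:
            # Forward compatibility: skip options this version does not know.
            warnings.warn(f"Ignoring unknown Userbuffers option: {key!r}")
            continue
        resolved[key] = value
    return resolved
```

With this shape, a caller written against an older or newer interface keeps working: old names are remapped and unrecognized keys only produce a warning.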
assert not (self.ub_overlap_rs_fprop and self.ub_overlap_ag_fprop), "Internal TE error!"
assert not (self.ub_overlap_ag_dgrad and self.ub_overlap_rs_dgrad), "Internal TE error!"
assert not (
    self.ub_overlap_rs_dgrad and (self.ub_bulk_dgrad or self.ub_bulk_wgrad)
), "Internal TE error!"
More descriptive error messages would be helpful.
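As a rough illustration of what that could look like, the sketch below repeats the three checks from the diff in a standalone function with a distinct message for each. The function name `check_ub_flags` and the message wording are assumptions made for this example, not proposed TE code.

```python
def check_ub_flags(
    *,
    overlap_rs_fprop: bool,
    overlap_ag_fprop: bool,
    overlap_ag_dgrad: bool,
    overlap_rs_dgrad: bool,
    bulk_dgrad: bool,
    bulk_wgrad: bool,
) -> None:
    """Hypothetical helper: same conditions as the diff, clearer messages."""
    assert not (overlap_rs_fprop and overlap_ag_fprop), (
        "Cannot overlap both reduce-scatter and all-gather in the forward pass"
    )
    assert not (overlap_ag_dgrad and overlap_rs_dgrad), (
        "Cannot overlap both all-gather and reduce-scatter with dgrad "
        "in the backward pass"
    )
    assert not (overlap_rs_dgrad and (bulk_dgrad or bulk_wgrad)), (
        "Reduce-scatter overlap with dgrad cannot be combined with "
        "bulk dgrad/wgrad overlap"
    )
```

Each message names the conflicting flags, so a failure points at the misconfiguration rather than at a generic internal error.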
LGTM, much needed
… in sequence-parallel Linear backward Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Alp Dener <[email protected]>
…dated unit tests Signed-off-by: Alp Dener <[email protected]>
for more information, see https://pre-commit.ci
…ons in te.Linear Signed-off-by: Alp Dener <[email protected]>
3951993 to 360c127
Signed-off-by: Alp Dener <[email protected]>
/te-ci pytorch L1
Signed-off-by: Alp Dener <[email protected]>
for more information, see https://pre-commit.ci
/te-ci pytorch L1
Description
te.Linear currently only supports TP overlap with parallel_mode="row", where it overlaps reduce-scatter in the forward pass, and all-gather with dgrad in the backward pass. This PR adds new options to enable all-gather overlap in the forward pass, and reduce-scatter overlap with dgrad in the backward pass, when parallel_mode="column".

Fixes #1312
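For quick reference, the pairing of collectives and passes described above can be written as a small lookup. This only restates the description; the table and the `overlapped_collective` helper are illustrative names and do not reflect TE internals.

```python
# Illustrative summary of the PR description, not TE code.
TP_OVERLAP = {
    # Existing behavior.
    "row": {"forward": "reduce-scatter", "backward_dgrad": "all-gather"},
    # Added by this PR.
    "column": {"forward": "all-gather", "backward_dgrad": "reduce-scatter"},
}


def overlapped_collective(parallel_mode: str, phase: str) -> str:
    """Return which collective is overlapped for a given mode and phase."""
    return TP_OVERLAP[parallel_mode][phase]
```

The two modes mirror each other: whichever collective one mode overlaps in the forward pass, the other overlaps with dgrad in the backward pass.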
Type of change
Checklist: