Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HSM Deletion Tombstone Tracking #7014

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

justinp-tt
Copy link
Contributor

@justinp-tt justinp-tt commented Dec 19, 2024

What changed?

  • Extended tombstone tracking to capture HSM state machine deletions from Operation Log
  • Added StateMachinePath tombstone type for HSM deletions alongside existing activity/timer/etc. tombstones
  • Maintained existing tombstone batch management and cleanup mechanisms

Why?

Without this:

  • HSM state machine deletions would not be properly replicated across clusters
  • Could lead to state inconsistencies between active/passive clusters when using HSM
  • No way to know if an HSM node was intentionally deleted or is missing due to an error

How did you test it?

Draft PR. Tests not developed yet

Potential risks

  • Potential memory pressure from large HSM deletion batches
  • New panic path if operation log access fails

Risk Mitigation:

  • Operation log errors surface quickly via panic rather than silent failure

Documentation

No changes required - this extends an internal implementation detail of how deletions are tracked.

Is hotfix candidate?

No - this should follow normal deployment process.

@justinp-tt justinp-tt changed the title track deletion tombstones for state-based replication Add HSM Tombstone Tracking Dec 19, 2024
@justinp-tt justinp-tt changed the title Add HSM Tombstone Tracking Add HSM Deletion Tombstone Tracking Dec 19, 2024
@justinp-tt justinp-tt requested a review from bergundy December 19, 2024 18:58
@justinp-tt justinp-tt self-assigned this Dec 19, 2024
Comment on lines 5934 to 5964
case "Activity":
tombstone = &persistencespb.StateMachineTombstone{
StateMachineKey: &persistencespb.StateMachineTombstone_ActivityScheduledEventId{
ActivityScheduledEventId: ms.parseID(key.ID),
},
}
case "Timer":
tombstone = &persistencespb.StateMachineTombstone{
StateMachineKey: &persistencespb.StateMachineTombstone_TimerId{
TimerId: key.ID,
},
}
case "ChildExecution":
tombstone = &persistencespb.StateMachineTombstone{
StateMachineKey: &persistencespb.StateMachineTombstone_ChildExecutionInitiatedEventId{
ChildExecutionInitiatedEventId: ms.parseID(key.ID),
},
}
case "RequestCancel":
tombstone = &persistencespb.StateMachineTombstone{
StateMachineKey: &persistencespb.StateMachineTombstone_RequestCancelInitiatedEventId{
RequestCancelInitiatedEventId: ms.parseID(key.ID),
},
}
case "Signal":
tombstone = &persistencespb.StateMachineTombstone{
StateMachineKey: &persistencespb.StateMachineTombstone_SignalExternalInitiatedEventId{
SignalExternalInitiatedEventId: ms.parseID(key.ID),
},
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, you only need to track StateMachinePath changes here. Activities and other state machines pre-date the HSM framework and won't be reflected in the oplog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed! Thanks

@justinp-tt justinp-tt requested a review from bergundy December 19, 2024 19:42
if ms.stateMachineNode != nil {
opLog, err := ms.stateMachineNode.Outputs()
if err != nil {
panic(fmt.Sprintf("Failed to get HSM operation log: %v", err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't panic here, return an error instead and handle it in the caller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be okay panicking if the only reason this would happen only unexpectedly. Consider making the Outputs() function panic instead of returning an error.

@@ -5907,6 +5907,39 @@ func (ms *MutableStateImpl) closeTransactionTrackTombstones(
}

var tombstones []*persistencespb.StateMachineTombstone
if ms.stateMachineNode != nil {
opLog, err := ms.stateMachineNode.Outputs()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rename Outputs to OpLog.

Copy link
Member

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will defer to @xwduan, @yycptt, or @hai719 who have more context here.

Copy link
Member

@yycptt yycptt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Applying those tombstones will be in a later PR I assume?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants