Gracefully shutdown workers on timeout or high mem threshold #151

Draft
wants to merge 4 commits into
base: main
from

nickrobinson251
Collaborator

nickrobinson251 commented Apr 2, 2024

  • terminate workers with SIGTERM (never SIGKILL) if we (the test framework) are doing it ourselves

cc @NHDaly @kpamnany @Drvi

edit: i don't trust this existing close function yet, so marked draft

nickrobinson251 requested a review from NHDaly April 2, 2024 16:39
nickrobinson251 marked this pull request as draft April 2, 2024 16:44
nickrobinson251 changed the title from "Gracefully shutdown woerkers on timeout or high mem threshold" to "Gracefully shutdown workers on timeout or high mem threshold" Apr 2, 2024
@@ -82,12 +82,12 @@ function terminate!(w::Worker, from::Symbol=:manual)
empty!(w.futures)
end
signal = Base.SIGTERM
while true
while !process_exited(w.process)
Collaborator

I forgot that we never swapped this package over to use ConcurrentUtilities workers. We should probably do that and maintain that code in one place.

Collaborator Author

it sure would be nice to maintain this in only one place... on the other hand, i don't think it's ideal for a test framework to have dependencies (since then you can't use it to test code that requires a different version of that same dependency). so if we were to maintain it in one place (rather than maintain a duplicate codebase here that will slightly diverge over time), we'd want ReTestItems to vendor it some other way

Member

(for vendoring it in some other way, we could include the other repo as a git submodule, or have the build script check out the repo at a committed tag or something like that.)
but for now i agree it makes sense to keep it duplicated until we invest in a cleanup like that.

@nickrobinson251
Collaborator Author

close(w) still ends up calling terminate!, so it can still end up sending SIGKILL afaict
e.g. with debug logging we see "Debug: terminating worker 28632 from process_responses", showing that terminate! was called from process_responses, which is then what calls kill(w, signal)

julia> w = ReTestItems.Worker(; threads="2")
Worker(pid=28632)

julia> close(w, :foo)
┌ Debug: closing worker 28632 from foo
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:112
┌ Debug: Error processing responses from worker 28632
│   exception =
│    EOFError: read end of file
│    Stacktrace:
│     [1] read(this::Sockets.TCPSocket, ::Type{UInt8})
│       @ Base ./stream.jl:980
│     [2] deserialize
│       @ ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
│     [3] deserialize(s::Sockets.TCPSocket)
│       @ Serialization ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:801
│     [4] process_responses(w::ReTestItems.Workers.Worker, ev::Base.Event)
│       @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:233
│     [5] (::ReTestItems.Workers.var"#9#17"{ReTestItems.Workers.Worker, Base.Event})()
│       @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:185
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:250
┌ Debug: terminating worker 28632 from process_responses
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:75
┌ Debug: sending signal 15 to worker 28632
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:86
  Worker 28632:  [28632] signal (15): Terminated: 15
in expression starting at none:1

@NHDaly
Member

NHDaly commented Apr 9, 2024

Hrmm yeah, good call nick. It looks like when the process shuts down gracefully, it closes the pipe abruptly, surprising the "process_responses" task. So there probably needs to be a graceful shutdown message in the other direction too, sent either by the original task that is requesting the shutdown or by the worker that is shutting down.
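
A minimal sketch of the shutdown-message idea, not the ReTestItems implementation: the reader loop treats a sentinel value as a clean end of stream and only reports an abrupt close when it hits EOFError without having seen the sentinel. The names SHUTDOWN and drain_responses are hypothetical, and an in-memory IOBuffer stands in for the worker's socket.

```julia
using Serialization

# Hypothetical sentinel the closing side would send before closing the stream.
const SHUTDOWN = :shutdown

# Read messages until the sentinel arrives (graceful) or the stream ends
# without one (abrupt). Returns the messages seen and how the stream ended.
function drain_responses(io::IO)
    received = Any[]
    try
        while true
            msg = deserialize(io)
            msg === SHUTDOWN && return (received, :graceful)
            push!(received, msg)
        end
    catch e
        e isa EOFError || rethrow()
        return (received, :abrupt)
    end
end

# Usage: a stream that announces its shutdown before ending.
io = IOBuffer()
serialize(io, 1)
serialize(io, SHUTDOWN)
seekstart(io)
msgs, how = drain_responses(io)
```

With a scheme like this, only the :abrupt case would need to fall through to terminate!.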

@NHDaly
Member

NHDaly commented Apr 9, 2024

Or you could set a "closing" field on the struct and check that in the exception handler.
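
A rough sketch of the "closing" field approach, under the assumption that close sets the flag before the stream goes away. SketchWorker and process_responses_sketch are hypothetical stand-ins, not the real Worker struct or process_responses, and the on_unexpected_eof callback stands in for the call to terminate!.

```julia
# Hypothetical, simplified stand-in for the real Worker struct.
mutable struct SketchWorker
    io::IO
    @atomic closing::Bool   # set to true by close() before shutting the worker down
end

# Read until the stream ends. An EOFError is expected if we asked the worker
# to close; otherwise call `on_unexpected_eof` (e.g. something that terminates
# the worker process).
function process_responses_sketch(w::SketchWorker, on_unexpected_eof::Function)
    try
        while true
            read(w.io, UInt8)   # throws EOFError once the stream is exhausted
        end
    catch e
        if e isa EOFError && (@atomic w.closing)
            return :closed_gracefully
        end
        on_unexpected_eof()
        return :terminated
    end
end

# Usage: the flag is set first, so the EOF is not treated as a failure.
w = SketchWorker(IOBuffer(), false)
@atomic w.closing = true
result = process_responses_sketch(w, () -> error("should not be called"))
```

The per-field @atomic annotation needs Julia 1.7 or later; it is used here because the flag would be written by the closing task and read by the response-processing task.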

@@ -17,6 +17,8 @@ using Test
@testset "clean shutdown ($w)" begin
close(w)
@test !process_running(w.process)
@test w.process.termsignal == Base.SIGTERM
Member

I'm wondering where the "SIGTERM" comes from? Your PR description says:

terminate workers with SIGTERM

but i think if it's a graceful close, the process just exits, meaning there is no signal at all.
So maybe this should be:

Suggested change
- @test w.process.termsignal == Base.SIGTERM
+ @test w.process.termsignal == 0

?

That's what i'm seeing when running the tests:

  Expression: w.process.termsignal == Base.SIGTERM
   Evaluated: 0 == 15
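
The difference can be reproduced outside the test suite with a small Unix-only sketch. The helper termsignal_of is hypothetical, and the `true` and `sleep` commands are assumed to be available on the PATH; they are used instead of a Julia worker so that no signal handler in the child gets involved.

```julia
# Run `cmd`, optionally send it `signal`, and report the signal that ended it.
# A process that exits on its own reports 0.
function termsignal_of(cmd::Cmd; signal::Union{Nothing,Integer}=nothing)
    p = run(cmd; wait=false)
    signal === nothing || kill(p, signal)
    wait(p)
    return p.termsignal
end

exited_normally = termsignal_of(`true`)                           # 0
killed_by_term  = termsignal_of(`sleep 60`; signal=Base.SIGTERM)  # 15
```

So a termsignal of 0 in the "clean shutdown" test is consistent with the worker exiting on its own rather than being signalled.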
