-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gracefully shutdown workers on timeout or high mem threshold #151
base: main
Are you sure you want to change the base?
Conversation
@@ -82,12 +82,12 @@ function terminate!(w::Worker, from::Symbol=:manual) | |||
empty!(w.futures) | |||
end | |||
signal = Base.SIGTERM | |||
while true | |||
while !process_exited(w.process) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot that we never swapped this package over to use ConcurrentUtilities workers. We should probably do that and maintain that code in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it sure would be nice to maintain this in only one place... on the other hand i don't think it's ideal for a test framework to have dependencies (since then you can't use it to the test code that requires a different version of that same dependency), so i think if we were to maintain it in one place (rather than maintain a duplicate codebase here, that will slightly diverge over time) then we'd want ReTestItems to vendor it some other way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(for vendoring it in some other way, we could include the other repo as a git submodule, or have the build script checkout the repo from a committed tag or something like that.)
but for now i agree it makes sense to keep it duplicated until we invest in a cleanup like that.
julia> w = ReTestItems.Worker(; threads="2")
Worker(pid=28632)
julia> close(w, :foo)
┌ Debug: closing worker 28632 from foo
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:112
┌ Debug: Error processing responses from worker 28632
│ exception =
│ EOFError: read end of file
│ Stacktrace:
│ [1] read(this::Sockets.TCPSocket, ::Type{UInt8})
│ @ Base ./stream.jl:980
│ [2] deserialize
│ @ ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
│ [3] deserialize(s::Sockets.TCPSocket)
│ @ Serialization ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:801
│ [4] process_responses(w::ReTestItems.Workers.Worker, ev::Base.Event)
│ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:233
│ [5] (::ReTestItems.Workers.var"#9#17"{ReTestItems.Workers.Worker, Base.Event})()
│ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:185
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:250
┌ Debug: terminating worker 28632 from process_responses
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:75
┌ Debug: sending signal 15 to worker 28632
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:86
Worker 28632: [28632] signal (15): Terminated: 15
in expression starting at none:1 |
Hrmm yeah, good call nick. It looks like when the process shutsdown gracefully, it closes the pipe abruptly, surprising the "process_responses" task. So probably there needs to be a graceful shutdown message in the other direction too; either sent by the original task that is requesting the shutdown, or by the worker who is shutting down. |
Or you could set a "closing" field on the struct and check that in the exception handler. |
@@ -17,6 +17,8 @@ using Test | |||
@testset "clean shutdown ($w)" begin | |||
close(w) | |||
@test !process_running(w.process) | |||
@test w.process.termsignal == Base.SIGTERM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering where the "SIGTERM" comes from? Your PR description says:
terminate workers with SIGTERM
but i think if it's a graceful close, the process just exits, meaning there is no signal at all.
So maybe this should be:
@test w.process.termsignal == Base.SIGTERM | |
@test w.process.termsignal == 0 |
?
That's what i'm seeing when running the tests:
Expression: w.process.termsignal == Base.SIGTERM
Evaluated: 0 == 15
SIGTERM
(neverSIGKILL
) if we (the test framework) are doing it ourselvescc @NHDaly @kpamnany @Drvi
edit: i don't trust this existing
close
function yet, so marked draft