Segmentation fault with threaded Distributed #572

Open
jarbus opened this issue Nov 9, 2024 · 5 comments
Labels
bug Something isn't working

Comments


jarbus commented Nov 9, 2024

Affects: PythonCall

Describe the bug

This is a very quirky bug. I'm getting a segmentation fault when using Python's gymnasium package across multiple worker processes while a Flux model is loaded on the GPU.

Setup:

]add CondaPkg
]add PythonCall
]add Flux
]add CUDA

using CondaPkg
CondaPkg.add("gymnasium")
CondaPkg.add("swig")
CondaPkg.add("gymnasium-box2d")
CondaPkg.add("gymnasium-other")

Run (crash is non-deterministic, try running a few times on a machine with an NVIDIA GPU):

using Distributed
addprocs(12; env=["CUDA_HARD_MEMORY_LIMIT" => "5%", "CUDA_MEMORY_POOL"=>"none"])
@everywhere begin
    using CUDA
    using Flux
    using CondaPkg
    using PythonCall

    function initialize_car_racing_env(_)
        gym = pyimport("gymnasium")
        x = Flux.Dense(512=>512) |> gpu
        env = gym.make("CarRacing-v3")
        obs, info = env.reset()
        env.close()
        return 1
    end
end

for generation in 1:10_000
    if generation % 100 == 0
        println("Generation: $generation")
    end
    pmap(initialize_car_racing_env, 1:12)
end

Stack trace:

      From worker 5:
      From worker 5:    [35654] signal 11: Segmentation fault
      From worker 5:    in expression starting at none:0
      From worker 5:    jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:334 [inlined]
      From worker 5:    jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:329 [inlined]
      From worker 5:    jl_gc_state_save_and_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:340
      From worker 5:    throw_internal_altstack at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:755 [inlined]
      From worker 5:    ijl_sig_throw at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:800
      From worker 5:    Allocations: 21901595 (Pool: 21900914; Big: 681); GC: 219
ERROR: Worker 5 terminated.LoadError: 
ProcessExitedException(Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:970
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:978
 [3] unsafe_read
   @ ./io.jl:891 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:890
 [5] read!
   @ ./io.jl:895 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/sh

Your system
Please provide detailed information about your system:

  • The operating system

5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • The version of Julia, Python, PythonCall, JuliaCall and any other affected packages
[052768ef] CUDA v5.5.2
[992eb4ea] CondaPkg v0.2.24
[587475ba] Flux v0.14.25
[6099a3de] PythonCall v0.9.23 `https://github.com/JuliaPy/PythonCall.jl.git#main`
[02a925ec] cuDNN v1.4.0
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  LD_LIBRARY_PATH = 
CondaPkg Status /home/garbus/.julia/environments/v1.11/CondaPkg.toml
Environment
  /home/garbus/.julia/environments/v1.11/.CondaPkg/env
Packages
  gymnasium v1.0.0
  gymnasium-box2d v1.0.0
  gymnasium-other v1.0.0
  swig v4.2.1

Additional context
I'm researching embodied AI and want to use Julia's distributed capabilities for it while still evaluating on Python environments.

jarbus added the bug label Nov 9, 2024
jarbus changed the title from "Segmentation fault with Distributed and Flux models on NVIDIA GPU" to "Segmentation fault with multiprocessing" Dec 12, 2024

jarbus commented Dec 12, 2024

The following, simpler code produces the same issue, this time without Flux or CUDA:

using Distributed
addprocs(8; exeflags="--threads=8")
@everywhere begin
    using CondaPkg
    using PythonCall

    function initialize_car_racing_env(_)
        gym = pyimport("gymnasium")
        # do some multithreaded work
        Threads.@threads for i in 1:100
            # do some work
            if rand() < 0
                println("some work")
            end
        end
        env = gym.make("CarRacing-v3")
        obs, info = env.reset()
        env.close()
        return 1
    end
end

for generation in 1:10_000
    if generation % 100 == 0
        println("Generation: $generation")
    end
    pmap(initialize_car_racing_env, 1:12)
end

jarbus changed the title from "Segmentation fault with multiprocessing" to "Segmentation fault with threaded Distributed" Dec 12, 2024

jarbus commented Dec 18, 2024

Even simpler code:

using Distributed
addprocs(1; exeflags="--threads=16")
@everywhere begin
    using PythonCall
    function initialize_car_racing_env(_)
        pyimport("gymnasium").make("CarRacing-v3").close()
        return 1
    end
end

pmap(initialize_car_racing_env, 1:2^14)


jarbus commented Dec 18, 2024

Notably, this doesn't occur for the CartPole-v1 environment, and it doesn't occur when exeflags="--threads=16" is omitted.
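
One way to probe the thread dependence noted above is to record which thread each `pmap` call lands on inside the worker. This is only a diagnostic sketch using the Distributed standard library: `report_thread` is a hypothetical helper that is not part of the reproducer, and the thread count of 4 is an arbitrary choice.

```julia
using Distributed

# One worker started with several threads, mirroring the failing setup
# (4 threads is an arbitrary choice for this sketch).
addprocs(1; exeflags="--threads=4")

# Hypothetical helper: report where a pmap call actually executes.
@everywhere report_thread(_) = (worker = myid(),
                                thread = Threads.threadid(),
                                nthreads = Threads.nthreads())

results = pmap(report_thread, 1:64)

# Distinct (worker, thread) pairs that handled at least one call.
seen = unique((r.worker, r.thread) for r in results)
println(seen)
```

If calls show up on more than one thread id, that would suggest the Python calls in the reproducer are not always entered from the same thread, which may be worth comparing against the runs without `--threads`.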


jarbus commented Dec 18, 2024

This appears to be related to the Box2D Python package segfaulting when an environment that uses it is launched with multiple threads.
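
Since the crash in the reduced example only appears on workers started with `--threads`, one possible stopgap (an assumption, not a confirmed fix) is to route the environment calls to workers that are explicitly single-threaded and keep threaded Julia work on separate workers. The sketch below shows only the routing, using Distributed's `WorkerPool`. `run_env_stub` is a hypothetical stand-in for the gymnasium call, and the worker and thread counts are arbitrary.

```julia
using Distributed

# Workers reserved for Python environments, pinned to a single thread.
env_pids = addprocs(2; exeflags="--threads=1")
# A separate worker for multithreaded Julia work (unused in this sketch).
compute_pids = addprocs(1; exeflags="--threads=4")

env_pool = WorkerPool(env_pids)

# Hypothetical stand-in: in real use, the body would be the gymnasium call
# from the reproducer. Here it only reports where it ran.
@everywhere run_env_stub(_) = (worker = myid(), nthreads = Threads.nthreads())

# pmap accepts a worker pool as its second argument, so only the
# single-threaded workers in env_pool execute these calls.
env_results = pmap(run_env_stub, env_pool, 1:32)
println(unique(r.worker for r in env_results))
```

This only sidesteps the threaded-worker case; it does not explain the crash itself, and it has not been tested against the CUDA variants of the reproducer.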


jarbus commented Dec 29, 2024

Another example that produces the same issue:

using Distributed
addprocs(32)
@everywhere using PythonCall, CUDA 
@everywhere begin
    function initialize_car_racing_env(_)
        pyimport("gymnasium").make("CarRacing-v3").close()
        x = randn(2, 2)
        cu(x)
        cu(x)
        return 1
    end
end

pmap(initialize_car_racing_env, 1:2^12)
