Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize constant length memorynew intrinsic (take 2) #56847

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

oscardssmith
Copy link
Member

replaces #55913 (the rebase was more annoying than starting from scratch)
This allows the compiler to better understand what's going on for memorynew with compile-time constant length, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

@oscardssmith oscardssmith added performance Must go faster compiler:codegen Generation of LLVM IR and native code needs pkgeval Tests for all registered packages should be run with this change arrays [a, r, r, a, y, s] labels Dec 16, 2024
@pchintalapudi
Copy link
Member

Imo having done this a couple of times, it might be more fruitful to start with just simple partial escape analysis for any object; if an allocation is only ever escaping in error blocks, and doesn't escape its address elsewhere, you can clone the alloc and memcpy from stack to heap at the beginning of each error block (we already track basically all of this info in our current escape analysis). This has the added advantage of making downstream array optimizations far easier to trigger (since you're no longer dependent on IRCE/type inference to eliminate all boundschecks before doing anything interesting) and therefore make bugs more obvious before releasing to the wild. I'm curious to know of any real world test cases this is able to elide arrays on.

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(configuration = (julia_args=["--check-bounds=auto"],), vs_configuration = (julia_args=["--check-bounds=auto"],))

@oscardssmith
Copy link
Member Author

I definitely want to do BoundsError excepted escape analysis, but I think that belongs in a separate PR. As it stands, this PR only enables escaping in relatively few places, and there absolutely is room to expand the cases where the optimization can occur.

@gbaraldi
Copy link
Member

@pchintalapudi it actually does work in some cases surprisingly enough. I.e this breaks the performance tips doctest, because that test stops allocating :). But I agree that for this to be truly worth it, we need partial escape analysis and potentially being able to do this IPO.

@oscardssmith
Copy link
Member Author

How do people feel about this? if there aren't objections, I'll merge as long as Pkgeval comes back clean

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@vtjnash
Copy link
Member

vtjnash commented Dec 18, 2024

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@topolarity
Copy link
Member

topolarity commented Dec 18, 2024

The NearestNeighbors, LMDB, and GeoAcceleratedArrays failures look relevant

@oscardssmith
Copy link
Member Author

LMDB is their own fault (see wildart/LMDB.jl#41, they are using unsafe_wrap incorrectly).
see #55913 (comment) wrt Nearest neighbors (TLDR there does seem to be a failure that neither gabriel or I can reproduce locally except on macos where master fails with a different error).
GeoAcceleratedArrays might be real.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

oscardssmith added a commit that referenced this pull request Dec 23, 2024
while investigating some missed optimizations in
#56847, @gbaraldi and I realized
that `copy(::Array)` was using `jl_genericmemory_copy_slice` rather than
the `memmove`/`jl_genericmemory_copyto` that `copyto!` lowers to. This
version lets us use the faster LLVM based Memory initialization, and the
memove can theoretically be further optimized by LLVM (e.g. not copying
elements that get over-written without ever being read).

```
julia> @Btime copy($[1,2,3])
  15.521 ns (2 allocations: 80 bytes) # before
  12.116 ns (2 allocations: 80 bytes) #after

julia> m = Memory{Int}(undef, 3);
julia> m.=[1,2,3];
julia> @Btime copy($m)
  11.013 ns (1 allocation: 48 bytes) #before
  9.042 ns (1 allocation: 48 bytes)   #after
```

We also optimize the `memorynew` type inference to make it so that
getting the length of a memory with known length will propagate that
length information (which is important for cases like `similar`/`copy`
etc).
@oscardssmith oscardssmith force-pushed the os/optimize-const-len-memorynew branch from 3236e12 to d7c27ce Compare December 23, 2024 22:11
@oscardssmith
Copy link
Member Author

@nanosoldier runtests(["Kroki", "GeoAcceleratedArrays", "VectorizedStatistics", "FinanceCore", "Isoplot", "MarsagliaDiscreteSamplers", "VectorizedReduction", "Chron", "TensorOperationsTBLIS", "BiBufferedStreams"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@nsajko
Copy link
Contributor

nsajko commented Dec 25, 2024

Regarding the above discussion that questions whether this PR is "worth it", here's a nontrivial example where the PR eliminates heap allocation (insertion sort of a small tuple):

"""
    tuple_sort(::NTuple)

Sort a homogeneous tuple (without recursion).
"""
function tuple_sort(tup::Tuple{Vararg{T, N}}) where {T, N}
    if tup isa Tuple{Any, Vararg}
        @inline let 
            clo = @inbounds let mem = Memory{T}(undef, N)
                copyto!(mem, tup)
                for i  Base.OneTo(N - 1)  # insertion sort
                    x = mem[begin + i]
                    j = i
                    while 0 < j
                        y = mem[(begin - 1) + j]
                        if y  x
                            break
                        end
                        mem[begin + j] = y
                        j -= 1
                    end
                    mem[begin + j] = x
                end
                let mem = mem
                    function closure(i::Int)
                        @inbounds mem[i]
                    end
                end
            end
            ntuple(clo, Val{N}())
        end 
    else
        tup
    end 
end

const tup = ntuple((_ -> rand()), 9);
@allocated tuple_sort(tup)  # `96` on master, `0` on PR

@oscardssmith
Copy link
Member Author

oscardssmith commented Dec 30, 2024

so the MWE for the remaining pkgeval failure is fairly odd...

oscardssmith:LoopVectorization$ julia +julia --project
u               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0-DEV.1800 (2024-12-23)
 _/ |\__'_|_|_|\__'_|  |  os/optimize-const-len-memorynew/d7c27ce109* (fork: 2 commits, 6 days)
|__/                   |

julia> using LoopVectorization
f^[[A
julia> function f2(A)
           ξ = 0
           @turbo for i ∈ eachindex(A)
               ξ = +(ξ, A[i])
           end
           return ξ
       end
f2 (generic function with 1 method)

julia> f2(1:2)

[102790] signal 11 (128): Segmentation fault
in expression starting at REPL[3]:1
promote at ./promotion.jl:401
unknown function (ip: 0x7fd437f3b789) at (unknown file)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
jl_f__call_in_world_total at /home/oscardssmith/julia/src/builtins.c:921
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
do_apply at /home/oscardssmith/julia/src/builtins.c:839
concrete_eval_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:1019
abstract_call_method_with_const_args at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:885
abstract_call_method_with_const_args at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:879
handle1 at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:178
infercalls at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:251
abstract_call_gf_by_type at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:344
abstract_call_known at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:2757
jfptr_abstract_call_known_52571 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_invoke at /home/oscardssmith/julia/src/gf.c:3366
tojlinvoke115517 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
j_abstract_call_known_52735 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
abstract_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:2862
abstract_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:2855 [inlined]
abstract_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:3014
abstract_eval_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:3032 [inlined]
abstract_eval_statement_expr at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:3289
typeinf_local at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:4036
jfptr_typeinf_local_53213 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
typeinf at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:4228
const_prop_call at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:1359
abstract_call_method_with_const_args at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:909
abstract_call_method_with_const_args at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:879
handle1 at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:178
jfptr_handle1_51403 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
doworkloop at ./../usr/share/julia/Compiler/src/inferencestate.jl:1192
jfptr_doworkloop_769 at /home/oscardssmith/.julia/compiled/v1.12/Static/4nGFz_dhkKK.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
typeinf at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:4218
typeinf_frame at ./../usr/share/julia/Compiler/src/typeinfer.jl:1018
typeinf_code at ./../usr/share/julia/Compiler/src/typeinfer.jl:973 [inlined]
typeinf_code at ./../usr/share/julia/Compiler/src/typeinfer.jl:967
unknown function (ip: 0x7fd437f2182e) at (unknown file)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
ijl_call_in_typeinf_world at /home/oscardssmith/julia/src/gf.c:477
invoke_in_typeinf_world at ./reflection.jl:341
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
do_apply at /home/oscardssmith/julia/src/builtins.c:839
invoke_default_compiler at ./reflection.jl:346
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
do_apply at /home/oscardssmith/julia/src/builtins.c:839
invoke_interp_compiler at ./reflection.jl:354
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
#code_typed_by_type#209 at ./reflection.jl:396
code_typed_by_type at ./reflection.jl:373 [inlined]
#code_typed#208 at ./reflection.jl:316 [inlined]
code_typed at ./reflection.jl:311
jfptr_code_typed_85500 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
do_call at /home/oscardssmith/julia/src/interpreter.c:125
eval_value at /home/oscardssmith/julia/src/interpreter.c:243
eval_body at /home/oscardssmith/julia/src/interpreter.c:585
jl_interpret_toplevel_thunk at /home/oscardssmith/julia/src/interpreter.c:896
jl_toplevel_eval_flex at /home/oscardssmith/julia/src/toplevel.c:1070
__repl_entry_eval_expanded_with_loc at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:341
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
jl_f__call_latest at /home/oscardssmith/julia/src/builtins.c:883
#invokelatest#1 at ./essentials.jl:1056 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
toplevel_eval_with_hooks at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:348
toplevel_eval_with_hooks at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:352
toplevel_eval_with_hooks at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:352
toplevel_eval_with_hooks at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:352
toplevel_eval_with_hooks at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:345 [inlined]
eval_user_input at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:370
repl_backend_loop at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:482
#start_repl_backend#41 at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:467
start_repl_backend at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:464 [inlined]
#run_repl#48 at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:690
run_repl at /home/oscardssmith/julia/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:676
jfptr_run_repl_23388 at /home/oscardssmith/julia/usr/share/julia/compiled/v1.12/REPL/u0gqU_qk7sj.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
run_std_repl at ./client.jl:490
jfptr_run_std_repl_108359 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
jl_f__call_latest at /home/oscardssmith/julia/src/builtins.c:883
#invokelatest#1 at ./essentials.jl:1056 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
run_main_repl at ./client.jl:511
repl_main at ./client.jl:593 [inlined]
_start at ./client.jl:568
jfptr__start_108416 at /home/oscardssmith/julia/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3359 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3547
jl_apply at /home/oscardssmith/julia/src/julia.h:2244 [inlined]
true_main at /home/oscardssmith/julia/src/jlapi.c:922
jl_repl_entrypoint at /home/oscardssmith/julia/src/jlapi.c:1081
main at /home/oscardssmith/julia/cli/loader_exe.c:58
unknown function (ip: 0x7fd472021d8f) at /lib/x86_64-linux-gnu/libc.so.6
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/oscardssmith/julia/julia (unknown line)
Allocations: 5863317 (Pool: 5862539; Big: 778); GC: 6
Segmentation fault (core dumped)

stevengj pushed a commit that referenced this pull request Jan 2, 2025
while investigating some missed optimizations in
#56847, @gbaraldi and I realized
that `copy(::Array)` was using `jl_genericmemory_copy_slice` rather than
the `memmove`/`jl_genericmemory_copyto` that `copyto!` lowers to. This
version lets us use the faster LLVM based Memory initialization, and the
memove can theoretically be further optimized by LLVM (e.g. not copying
elements that get over-written without ever being read).

```
julia> @Btime copy($[1,2,3])
  15.521 ns (2 allocations: 80 bytes) # before
  12.116 ns (2 allocations: 80 bytes) #after

julia> m = Memory{Int}(undef, 3);
julia> m.=[1,2,3];
julia> @Btime copy($m)
  11.013 ns (1 allocation: 48 bytes) #before
  9.042 ns (1 allocation: 48 bytes)   #after
```

We also optimize the `memorynew` type inference to make it so that
getting the length of a memory with known length will propagate that
length information (which is important for cases like `similar`/`copy`
etc).
@oscardssmith oscardssmith force-pushed the os/optimize-const-len-memorynew branch from d7c27ce to 1f95279 Compare January 2, 2025 19:41
@oscardssmith oscardssmith force-pushed the os/optimize-const-len-memorynew branch from 1f95279 to 123a650 Compare January 3, 2025 17:21
@oscardssmith
Copy link
Member Author

#56938 fixes the most recently known problem. Lets see if it's ready now.

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(configuration = (julia_args=["--check-bounds=auto"],), vs_configuration = (julia_args=["--check-bounds=auto"],))

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@oscardssmith oscardssmith removed the needs pkgeval Tests for all registered packages should be run with this change label Jan 4, 2025
@oscardssmith
Copy link
Member Author

oscardssmith commented Jan 4, 2025

Of remaining Nanosoldier failures, Malt, and HNSW are due to invalid use of ccall, the 5 that errored unexpectedly all appear to be due to a loading bug affecting IJulia that isn't related to this PR, EulerAngles fails on 1.11 when tested without checkbounds=true. TensorOperationsTBLIS is doing some very weird things with their CCalls that I think are illegal (but I'm not 100% sure).

All in all, I think this is ready to go. Merging Monday sans objections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] compiler:codegen Generation of LLVM IR and native code performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants