-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize constant length memorynew
intrinsic (take 2)
#56847
base: master
Are you sure you want to change the base?
optimize constant length memorynew
intrinsic (take 2)
#56847
Conversation
Imo having done this a couple of times, it might be more fruitful to start with just simple partial escape analysis for any object; if an allocation is only ever escaping in error blocks, and doesn't escape its address elsewhere, you can clone the alloc and memcpy from stack to heap at the beginning of each error block (we already track basically all of this info in our current escape analysis). This has the added advantage of making downstream array optimizations far easier to trigger (since you're no longer dependent on IRCE/type inference to eliminate all boundschecks before doing anything interesting) and therefore make bugs more obvious before releasing to the wild. I'm curious to know of any real world test cases this is able to elide arrays on. |
@nanosoldier |
I definitely want to do BoundsError excepted escape analysis, but I think that belongs in a separate PR. As it stands, this PR only enables escaping in relatively few places, and there absolutely is room to expand the cases where the optimization can occur. |
@pchintalapudi it actually does work in some cases surprisingly enough. I.e this breaks the performance tips doctest, because that test stops allocating :). But I agree that for this to be truly worth it, we need partial escape analysis and potentially being able to do this IPO. |
How do people feel about this? if there aren't objections, I'll merge as long as Pkgeval comes back clean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
@nanosoldier |
The package evaluation job you requested has completed - possible new issues were detected. |
The NearestNeighbors, LMDB, and GeoAcceleratedArrays failures look relevant |
LMDB is their own fault (see wildart/LMDB.jl#41, they are using unsafe_wrap incorrectly). |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
while investigating some missed optimizations in #56847, @gbaraldi and I realized that `copy(::Array)` was using `jl_genericmemory_copy_slice` rather than the `memmove`/`jl_genericmemory_copyto` that `copyto!` lowers to. This version lets us use the faster LLVM based Memory initialization, and the memove can theoretically be further optimized by LLVM (e.g. not copying elements that get over-written without ever being read). ``` julia> @Btime copy($[1,2,3]) 15.521 ns (2 allocations: 80 bytes) # before 12.116 ns (2 allocations: 80 bytes) #after julia> m = Memory{Int}(undef, 3); julia> m.=[1,2,3]; julia> @Btime copy($m) 11.013 ns (1 allocation: 48 bytes) #before 9.042 ns (1 allocation: 48 bytes) #after ``` We also optimize the `memorynew` type inference to make it so that getting the length of a memory with known length will propagate that length information (which is important for cases like `similar`/`copy` etc).
3236e12
to
d7c27ce
Compare
@nanosoldier |
The package evaluation job you requested has completed - possible new issues were detected. |
Regarding the above discussion that questions whether this PR is "worth it", here's a nontrivial example where the PR eliminates heap allocation (insertion sort of a small tuple): """
tuple_sort(::NTuple)
Sort a homogeneous tuple (without recursion).
"""
function tuple_sort(tup::Tuple{Vararg{T, N}}) where {T, N}
if tup isa Tuple{Any, Vararg}
@inline let
clo = @inbounds let mem = Memory{T}(undef, N)
copyto!(mem, tup)
for i ∈ Base.OneTo(N - 1) # insertion sort
x = mem[begin + i]
j = i
while 0 < j
y = mem[(begin - 1) + j]
if y ≤ x
break
end
mem[begin + j] = y
j -= 1
end
mem[begin + j] = x
end
let mem = mem
function closure(i::Int)
@inbounds mem[i]
end
end
end
ntuple(clo, Val{N}())
end
else
tup
end
end
const tup = ntuple((_ -> rand()), 9);
@allocated tuple_sort(tup) # `96` on master, `0` on PR |
so the MWE for the remaining pkgeval failure is fairly odd...
|
while investigating some missed optimizations in #56847, @gbaraldi and I realized that `copy(::Array)` was using `jl_genericmemory_copy_slice` rather than the `memmove`/`jl_genericmemory_copyto` that `copyto!` lowers to. This version lets us use the faster LLVM based Memory initialization, and the memove can theoretically be further optimized by LLVM (e.g. not copying elements that get over-written without ever being read). ``` julia> @Btime copy($[1,2,3]) 15.521 ns (2 allocations: 80 bytes) # before 12.116 ns (2 allocations: 80 bytes) #after julia> m = Memory{Int}(undef, 3); julia> m.=[1,2,3]; julia> @Btime copy($m) 11.013 ns (1 allocation: 48 bytes) #before 9.042 ns (1 allocation: 48 bytes) #after ``` We also optimize the `memorynew` type inference to make it so that getting the length of a memory with known length will propagate that length information (which is important for cases like `similar`/`copy` etc).
d7c27ce
to
1f95279
Compare
1f95279
to
123a650
Compare
#56938 fixes the most recently known problem. Lets see if it's ready now. |
@nanosoldier |
The package evaluation job you requested has completed - possible new issues were detected. |
Of remaining Nanosoldier failures, Malt, and HNSW are due to invalid use of ccall, the 5 that errored unexpectedly all appear to be due to a loading bug affecting IJulia that isn't related to this PR, EulerAngles fails on 1.11 when tested without checkbounds=true. TensorOperationsTBLIS is doing some very weird things with their CCalls that I think are illegal (but I'm not 100% sure). All in all, I think this is ready to go. Merging Monday sans objections. |
replaces #55913 (the rebase was more annoying than starting from scratch)
This allows the compiler to better understand what's going on for
memorynew
with compile-time constant length, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.