-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NO MERGE Speedup extrem filter #90
NO MERGE Speedup extrem filter #90
Conversation
669c278
to
d45a086
Compare
I'm not sure if But the good news is that LoopVectorization can be used here since we only support gray images, which gives an extra 2x boost. But it's still not the 10x as you expected... julia> using TestImages, ImageMorphology
julia> img = Float64.(testimage("cameraman"));
julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
321.266 μs (11252 allocations: 695.47 KiB)
julia> @btime ImageMorphology.extreme_filter!(max, $(similar(img)), $img, $(strel_diamond(img)));
1.142 ms (18 allocations: 480 bytes) Since the implementation is vectorized, the new version actually supports GPU (although it's incredibly slow...) julia> using ArrayInterfaceCUDA, CUDA
julia> img = CuArray(Float64.(testimage("cameraman")));
julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
16.543 ms (83432 allocations: 4.54 MiB) so let's just live in the CPU world for now :) |
@ThomasRetornaz I'm pretty sure I don't exactly understand the algorithm, could you point me to any reference? Besides this PR I was just about to implement 1, do you happen to know which one is more efficient? Footnotes
|
As a reference for MATLAB's implementation on my windows PC, it is about 120μs. img = imread("cameraman.tif");
imresize(img, [512, 512]);
f = @() g(img);
timeit(f)
function g(img)
for i = 1:1000
imdilate(img, strel('diamond', 1));
end
return
end |
Nice try around loopvectorization i think its the right path, but we miss something due to enormous grow on number of allocations Note i m think there is typo @tturbo -> @turbo For explanation of the algorithm may you could have a look on |
My ref in c++ is around 40 micro seconds, so the matlab implem is pretty good. We will try to beat it :) |
sorry i miss the question |
I'm trying my best but remain stuck by a 10x factor |
This is impressive already! I would be very happy to see and merge a stable version when you're satisfied.
A maintainability note on this: it's totally okay to have a piece of hacked loop solution to achieve the best performance. But I hope the usage of this trick is limited (or, wrapped as a stable API) -- because you might be the only one who understands this for a very long time. In the meantime, I'm trying to understand and update the old codebase and experiment if there's anything I can improve on the concept and generic support. |
Above the curent results of my investigation
Other remarks there is dedicated algorithm for binary ie we need only to inspect the boundary of connected component Again thanks for your help :)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to do more tweaks on this? If not then we can find a good place to dispatch the extreme_filter!
call to this optimized implementation and then merge this. I mean, this is already a huge improvement
I haven't took the time to read the reference paper, I'll do it before merging this.
src/extreme_filter.jl
Outdated
|
||
# ptr level optimized implementation for Real types | ||
# short-circuit all check | ||
function _unsafe_padded_copyto!(dest::AbstractVector, src::AbstractVector, dir, N, v) where {T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we need to restrict the type annotation to Vector
(array types with valid pointer for memcopy) for safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, do you want if fix it or could you do this on your side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you're satisfied with the implementation, I can do the cleanup work before merging.
Here's my understanding of the todo work:
- further tweak the performance (trying my previous comments and maybe more)
- call the optimized implementation from
extreme_filter!
under certain condition - maybe a few more tests and benchmarks
- rebase on master
Let me know if you plan to do all of them or just the first performance tweak.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could try to do points 1,3,4 (i will made a short try to avoid unecessay copy, but it could be in another ieteration), i will add comment and docs also
for the point 2 you are more qualified than me to ensure all safety conditions are taking into account
I try to do this today!
And thanks again of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.
The other allocations are probably overhead due to SubArray
(view
). If we can replace the view
usage with explicit index calculation, the overheads and allocations might just be eliminated.
src/extreme_filter.jl
Outdated
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1)) | ||
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2) | ||
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x) | ||
LoopVectorization.vmap!(f, viewout, tmp, viewnext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to fuse them into one loop so that LoopVectorization gets more information on how to optimize the loop
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1)) | |
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2) | |
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x) | |
LoopVectorization.vmap!(f, viewout, tmp, viewnext) | |
@turbo for i in eachindex(viewout, viewprevious, tmp, tmp2) | |
viewout[i] = f(f(viewprevious[i], viewnext[i]), tmp2[i]) | |
end |
No the next move will be to implement 3D cases aka V6 V26, but it could be in another PR function extremFilter(f,out,A,Union<strelHori,strelVerti,strelDiag45,strelDiag135> strel) Reminder for direction not aligned with the Image grid, we have to implement peridodic lines approach :( |
\o/ \o/ \o/ Yes! Great, thanks a lot |
Let's keep this PR limited to C4/C8 connectivity only. I just can't wait to finish and merge this PR! |
Sorry, could you please apply your subgestions i think i miss something and can't reach the expected speedup |
But my current results are so maybee double check on your side ? |
@ThomasRetornaz Can you push a new version here so that I can track your latest changes? |
…4_2D using view Faster than existing algo but quiet disapointing compare to C code Todo: + tests with copy instead of view + populate benchmark + add C8,C6,C26
This supports OffsetArray, it also supports CuArray (but requires ArrayInterfaceGPUArrays loaded)
… and benchmarks + documentation
747cb7d
to
5042eba
Compare
Done with rebase |
- fuse two fmap loops into one bigger loop -- this gives LoopVectorization more flexibility to analyze and optimize the loops. - remove unnecessary memcopy operation during setup Benchmark using cameraman (512x512) gray image on Ubuntu(WSL) i9-12900K Julia 1.8.0-rc1, we get almost 2.2x performance boost C42D: before: 214.230 μs (2 allocations: 8.25 KiB) after: 97.196 μs (3 allocations: 12.38 KiB) C82D: before: 221.889 μs (1029 allocations: 64.55 KiB) after: 98.940 μs (5 allocations: 20.62 KiB)
I think we've unnecessarily complicates the implementation, here's a what I did from scratch based on my understanding of the reference paper and the current implementation: function _unsafe_extreme_filter_C4_2D_jc!(f::MAX_OR_MIN, out::AbstractArray{T,2}, A::AbstractArray{T,2}, iter) where {T}
@debug "call the optimized `extreme_filter` implementation for SEDiamond SE and 2D images" fname =
_unsafe_extreme_filter_C4_2D!
axes(out) == axes(A) || throw(DimensionMismatch("axes(out) must match axes(A)"))
# To avoid the result affected by loop order, we need two arrays
src = (out === A) || (iter > 1) ? copy(A) : A
Ω = strel(CartesianIndex, strel_diamond((3, 3)))
δ = CartesianIndex(1, 1)
R = CartesianIndices(src)
R_inner = (first(R) + δ):(last(R) - δ)
# applying radius=r filter is equivalent to applying radius=1 filter r times
for i in 1:iter
# NOTE(johnnychen94): the inner region loop is essentially equivalent to the
# following code. Here we buffer the getindex to further improve the performance
# by explicitly telling the compiler to reuse the SIMD register
# @turbo for y in 2:(size(src, 2) - 1), x in 2:(size(src, 1) - 1)
# tmp = f(f(src[x, y], src[x - 1, y]), src[x + 1, y])
# out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
# end
@turbo for y in 2:(size(src, 2) - 1)
p_up, p, p_down = src[1, y], src[2, y], src[3, y]
for x in 2:(size(src, 1) - 1)
p_up, p, p_down = p, p_down, src[x + 1, y]
tmp = f(f(p, p_up), p_down)
out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
end
end
@inbounds for p in EdgeIterator(R, R_inner)
# for edge points, skip if the offset exceeds the boundary
s = A[p]
for o in Ω
q = p + o
checkbounds(Bool, A, q) || continue
s = f(s, A[q])
end
out[p] = s
end
end
return out
end julia> img = Float64.(testimage("cameraman"));
julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
82.422 μs (7 allocations: 448 bytes)
julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
98.856 μs (3 allocations: 12.38 KiB) Simpler and more efficient. If replacing julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
43.143 μs (7 allocations: 448 bytes) |
Nice !!! I tented to copy my c implem too much .... |
The codes come from #90. Co-Authored-by: Retornaz Thomas <[email protected]>
The codes come from #90. Co-Authored-by: Retornaz Thomas <[email protected]>
migrated from #90 Co-authored-by: Retornaz Thomas <[email protected]>
migrated from #90 Co-authored-by: Retornaz Thomas <[email protected]>
First attempt to speedup extremfilter for special case eg 2D diamond
We achieve a speedup around 2,5/3 which is quiet disapointing :/, i'm far from my heavily optim C++ implem (miss div 10)
@johnnychen94 if you know special julia tricks around this new code ? or maybee i miss something
Thanks in advance
If not i will do the other case 2D diamond and 3D and clean the PR