
NO MERGE Speedup extrem filter #90

Closed

Conversation

ThomasRetornaz
Collaborator

First attempt to speed up extreme_filter for a special case, e.g. the 2D diamond SE.
We achieve a speedup of around 2.5-3x, which is quite disappointing :/ -- I'm far from my heavily optimized C++ implementation (missing a ~10x factor).
@johnnychen94, do you know any special Julia tricks for this new code? Or maybe I'm missing something.
Thanks in advance.
If not, I will do the other cases, 2D diamond and 3D, and clean up the PR.

@johnnychen94 johnnychen94 force-pushed the speedup_extrem_filter branch from 669c278 to d45a086 Compare May 30, 2022 12:39
@johnnychen94
Member

johnnychen94 commented May 30, 2022

I'm not sure if memcopy and memset help here -- it seems that they perform similarly: https://discourse.julialang.org/t/use-memcpy-instead-of-memmove-to-copy-array-when-theres-no-overlap/64042/4?u=johnnychen94

But the good news is that LoopVectorization can be used here since we only support gray images, which gives an extra 2x boost. It's still not the 10x you expected, though...

julia> using TestImages, ImageMorphology

julia> img = Float64.(testimage("cameraman"));

julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  321.266 μs (11252 allocations: 695.47 KiB)

julia> @btime ImageMorphology.extreme_filter!(max, $(similar(img)), $img, $(strel_diamond(img)));
  1.142 ms (18 allocations: 480 bytes)

Since the implementation is vectorized, the new version actually supports GPU (although it's incredibly slow...)

julia> using ArrayInterfaceCUDA, CUDA

julia> img = CuArray(Float64.(testimage("cameraman")));

julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  16.543 ms (83432 allocations: 4.54 MiB)

so let's just live in the CPU world for now :)

@johnnychen94
Member

johnnychen94 commented May 30, 2022

@ThomasRetornaz I'm pretty sure I don't exactly understand the algorithm, could you point me to any reference?

Besides this PR I was just about to implement 1, do you happen to know which one is more efficient?

Footnotes

  1. Van Droogenbroeck, Marc, and Hugues Talbot. "Fast computation of morphological operations with arbitrary structuring elements." Pattern Recognition Letters 17.14 (1996): 1451-1460. https://www.sciencedirect.com/science/article/pii/S0167865596001134

@johnnychen94
Member

As a reference, MATLAB's implementation on my Windows PC takes about 120 μs.

img = imread("cameraman.tif");
img = imresize(img, [512, 512]);

f = @() g(img);
timeit(f)

function g(img)
    for i = 1:1000
        imdilate(img, strel('diamond', 1));
    end
    return
end

@ThomasRetornaz
Collaborator Author

Nice try with LoopVectorization -- I think it's the right path, but we're missing something, given the enormous growth in the number of allocations.
Maybe we could move _numericarray inside the main function and check whether the number of allocations goes down.
I will try.

Note: I think there is a typo, @tturbo -> @turbo.

For an explanation of the algorithm, you could have a look at
https://www.researchgate.net/publication/325480366_In-place_SIMD_Accelerated_Mathematical_Morphology
It's the same idea; my proposed algorithm treats the whole line instead of the register.
(If it's not clear, I can write a more detailed explanation.)
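(Editor's aside: the "whole line" idea can be sketched in plain Julia. This is an illustrative sketch only -- `diamond_dilate_r1` is a hypothetical name, not this PR's API. A radius-1 diamond (C4) max filter combines a 3-point max along one axis with the two remaining diamond neighbours; clamping indices emulates border replication.)

```julia
# Hypothetical sketch of the radius-1 diamond (C4) dilation semantics.
# Not the PR's implementation -- names and border handling are illustrative.
function diamond_dilate_r1(A::AbstractMatrix{<:Real})
    h, w = size(A)
    out = similar(A)
    # clamped getindex emulating border replication
    at(x, y) = A[clamp(x, 1, h), clamp(y, 1, w)]
    for y in 1:w, x in 1:h
        # 3-point max along the current line...
        m = max(at(x - 1, y), at(x, y), at(x + 1, y))
        # ...combined with the two remaining diamond neighbours
        out[x, y] = max(m, at(x, y - 1), at(x, y + 1))
    end
    return out
end
```

An optimized version would process a whole line at a time with SIMD instead of pixel-by-pixel; the sketch only pins down the semantics.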

@ThomasRetornaz
Collaborator Author

As a reference, MATLAB's implementation on my Windows PC takes about 120 μs.

img = imread("cameraman.tif");
img = imresize(img, [512, 512]);

f = @() g(img);
timeit(f)

function g(img)
    for i = 1:1000
        imdilate(img, strel('diamond', 1));
    end
    return
end

My reference implementation in C++ runs in around 40 μs, so the MATLAB implementation is pretty good. We will try to beat it :)

@ThomasRetornaz
Collaborator Author

@ThomasRetornaz I'm pretty sure I don't exactly understand the algorithm, could you point me to any reference?

Besides this PR I was just about to implement 1, do you happen to know which one is more efficient?

Footnotes

1. Van Droogenbroeck, Marc, and Hugues Talbot. "Fast computation of morphological operations with arbitrary structuring elements." Pattern Recognition Letters 17.14 (1996): 1451-1460. https://www.sciencedirect.com/science/article/pii/S0167865596001134

Sorry, I missed the question.
The above paper proposes a fast implementation for arbitrary SEs; for common SEs (diamond, box) my "proposed" version should be faster.
I will also try to play with concrete SIMD types via SIMD.jl.

@ThomasRetornaz
Collaborator Author

I'm trying my best but remain stuck at a 10x factor.
The two sub-operators, shift/pad and mapf, run approximately 10 times slower than required.
I specialized the algorithm for the N0f8 type and used LoopVectorization inner functions like vmap directly; we beat the current implementation, but it's not worth it.
If I have time, I will try to go deeper into SIMD manipulation in Julia through explicit LLVM functions, see https://github.com/eschnett/SIMD.jl
As another exercise, I will try to implement https://hal-upec-upem.archives-ouvertes.fr/hal-00692897/document and a linear operator using the Lemonier or Lemire approach (cited by the above paper).
Good luck on your side o/
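(Editor's aside: the Lemire approach mentioned above is the monotonic-deque running max/min, which costs amortized O(1) per sample independent of the radius. Below is a minimal 1D sketch -- `sliding_max` is an illustrative name, not this PR's API; truncating the window at the borders is equivalent to padding with the neutral element of `max`.)

```julia
# Illustrative Lemire-style sliding maximum over a centered window of radius r.
# dq holds indices of candidates whose values are strictly decreasing.
function sliding_max(v::AbstractVector{<:Real}, r::Integer)
    n = length(v)
    out = similar(v)
    dq = Int[]
    for j in 1:(n + r)
        if j <= n
            # a new sample dominates older, smaller-or-equal candidates
            while !isempty(dq) && v[last(dq)] <= v[j]
                pop!(dq)
            end
            push!(dq, j)
        end
        i = j - r  # the window [i - r, i + r] has now been fully seen
        if i >= 1
            # retire candidates that fell out of the window on the left
            while first(dq) < i - r
                popfirst!(dq)
            end
            out[i] = v[first(dq)]
        end
    end
    return out
end
```

A separable radius-r box filter is then two passes of this 1D kernel; van Herk/Gil-Werman reaches the same bound with block prefix/suffix maxima instead of a deque.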

@johnnychen94
Member

johnnychen94 commented Jun 8, 2022

This is impressive already! I would be very happy to see and merge a stable version when you're satisfied.

I will try if i have time to go deeper in simd manipulation in julia through explicit LLVM function see https://github.com/eschnett/SIMD.jl

A maintainability note on this: it's totally okay to have a piece of hand-rolled loop hackery to achieve the best performance. But I hope the usage of this trick is limited (or wrapped as a stable API) -- because you might be the only one who understands it for a very long time.
For instance, the mapf is very likely to be replaced with something better (or more stable). And it's not only ImageMorphology that requires this much performance tweaking; ImageFiltering will need it as well in the near future (JuliaImages/ImageFiltering.jl#228).


In the meantime, I'm trying to understand and update the old codebase and experiment if there's anything I can improve on the concept and generic support.
Personally, I like the strel(Bool, se) and strel(CartesianIndex, se) stuff, and I'm trying to add more like this.

@ThomasRetornaz
Collaborator Author

Above are the current results of my investigation.
Remarks:

  • my "optimized" 2D implementation accepts all Gray/Real-like inputs :)
  • those implementations win every time :) (fortunately, it's only 2D and basic shapes)
  • I'm 2x slower than the MATLAB implementation you spotted :(
  • my implementations are quite unsafe (calling pointers and LoopVectorization directly), so we need to enclose the calls with massive checks
  • I don't understand why there are so many allocations with the "optimized" versions :/ (I will try to fix that)
  • I added the 1024×1024 case to check whether something is wrong with cache efficiency
    • we see that "r1_diamond" can be slower than "r1_generic"!

Another remark: there is a dedicated algorithm for binary images, i.e. we only need to inspect the boundary of each connected component.
We could add this in a dedicated PR/feature request.
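(Editor's aside: the binary shortcut can be illustrated as follows -- a hypothetical sketch, not this PR's code. For a binary image, a radius-1 C4 dilation only needs to visit foreground pixels and set their 4-neighbours; a full boundary-tracking variant would restrict the visit further to component boundaries.)

```julia
# Illustrative binary radius-1 C4 dilation: visit foreground pixels only.
function binary_dilate_c4(B::AbstractMatrix{Bool})
    h, w = size(B)
    out = copy(B)
    for y in 1:w, x in 1:h
        B[x, y] || continue   # background pixels cannot turn on neighbours
        x > 1 && (out[x - 1, y] = true)
        x < h && (out[x + 1, y] = true)
        y > 1 && (out[x, y - 1] = true)
        y < w && (out[x, y + 1] = true)
    end
    return out
end
```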

Again thanks for your help :)

ID time GC time memory allocations
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_box_2D_optim"] 3.919 ms (5%) 4.27 MiB (1%) 8208
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_diamond"] 11.927 ms (5%) 4.00 MiB (1%) 22
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_diamond_2D_optim"] 3.552 ms (5%) 4.35 MiB (1%) 10244
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_generic"] 8.838 ms (5%) 4.00 MiB (1%) 12
["extreme_filter", "Gray{Float32}", "1024×1024", "r5_diamond"] 56.276 ms (5%) 8.00 MiB (1%) 88
["extreme_filter", "Gray{Float32}", "1024×1024", "r5_generic"] 129.566 ms (5%) 4.00 MiB (1%) 14
["extreme_filter", "Gray{Float32}", "256×256", "r1_box_2D_optim"] 278.300 μs (5%) 317.64 KiB (1%) 1550
["extreme_filter", "Gray{Float32}", "256×256", "r1_diamond"] 280.400 μs (5%) 256.62 KiB (1%) 22
["extreme_filter", "Gray{Float32}", "256×256", "r1_diamond_2D_optim"] 233.900 μs (5%) 338.19 KiB (1%) 2052
["extreme_filter", "Gray{Float32}", "256×256", "r1_generic"] 499.800 μs (5%) 256.67 KiB (1%) 12
["extreme_filter", "Gray{Float32}", "256×256", "r5_diamond"] 1.258 ms (5%) 514.42 KiB (1%) 88
["extreme_filter", "Gray{Float32}", "256×256", "r5_generic"] 7.812 ms (5%) 260.17 KiB (1%) 14
["extreme_filter", "Gray{Float32}", "512×512", "r1_box_2D_optim"] 1.017 ms (5%) 1.14 MiB (1%) 4112
["extreme_filter", "Gray{Float32}", "512×512", "r1_diamond"] 2.022 ms (5%) 1.00 MiB (1%) 23
["extreme_filter", "Gray{Float32}", "512×512", "r1_diamond_2D_optim"] 899.100 μs (5%) 1.18 MiB (1%) 5124
["extreme_filter", "Gray{Float32}", "512×512", "r1_generic"] 2.149 ms (5%) 1.00 MiB (1%) 13
["extreme_filter", "Gray{Float32}", "512×512", "r5_diamond"] 9.395 ms (5%) 2.00 MiB (1%) 88
["extreme_filter", "Gray{Float32}", "512×512", "r5_generic"] 32.407 ms (5%) 1.00 MiB (1%) 14
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_box_2D_optim"] 1.307 ms (5%) 1.26 MiB (1%) 8208
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_diamond"] 5.036 ms (5%) 1.00 MiB (1%) 22
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_diamond_2D_optim"] 1.166 ms (5%) 1.35 MiB (1%) 10244
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_generic"] 4.313 ms (5%) 1.00 MiB (1%) 12
["extreme_filter", "Gray{N0f8}", "1024×1024", "r5_diamond"] 25.052 ms (5%) 2.00 MiB (1%) 88
["extreme_filter", "Gray{N0f8}", "1024×1024", "r5_generic"] 51.115 ms (5%) 1.00 MiB (1%) 14
["extreme_filter", "Gray{N0f8}", "256×256", "r1_box_2D_optim"] 134.700 μs (5%) 122.03 KiB (1%) 1550
["extreme_filter", "Gray{N0f8}", "256×256", "r1_diamond"] 96.500 μs (5%) 64.69 KiB (1%) 22
["extreme_filter", "Gray{N0f8}", "256×256", "r1_diamond_2D_optim"] 112.600 μs (5%) 144.78 KiB (1%) 2052
["extreme_filter", "Gray{N0f8}", "256×256", "r1_generic"] 244.000 μs (5%) 64.73 KiB (1%) 12
["extreme_filter", "Gray{N0f8}", "256×256", "r5_diamond"] 463.200 μs (5%) 130.55 KiB (1%) 88
["extreme_filter", "Gray{N0f8}", "256×256", "r5_generic"] 3.278 ms (5%) 68.23 KiB (1%) 14
["extreme_filter", "Gray{N0f8}", "512×512", "r1_box_2D_optim"] 387.500 μs (5%) 387.23 KiB (1%) 4112
["extreme_filter", "Gray{N0f8}", "512×512", "r1_diamond"] 415.100 μs (5%) 256.75 KiB (1%) 23
["extreme_filter", "Gray{N0f8}", "512×512", "r1_diamond_2D_optim"] 336.100 μs (5%) 433.25 KiB (1%) 5124
["extreme_filter", "Gray{N0f8}", "512×512", "r1_generic"] 980.900 μs (5%) 256.77 KiB (1%) 13
["extreme_filter", "Gray{N0f8}", "512×512", "r5_diamond"] 1.964 ms (5%) 514.55 KiB (1%) 88
["extreme_filter", "Gray{N0f8}", "512×512", "r5_generic"] 12.967 ms (5%) 260.23 KiB (1%) 14
["extreme_filter", "Int64", "1024×1024", "r1_box_2D_optim"] 3.861 ms (5%) 8.29 MiB (1%) 8208
["extreme_filter", "Int64", "1024×1024", "r1_diamond"] 11.843 ms (5%) 8.00 MiB (1%) 22
["extreme_filter", "Int64", "1024×1024", "r1_diamond_2D_optim"] 2.796 ms (5%) 8.36 MiB (1%) 10244
["extreme_filter", "Int64", "1024×1024", "r1_generic"] 6.806 ms (5%) 8.00 MiB (1%) 12
["extreme_filter", "Int64", "1024×1024", "r5_diamond"] 51.677 ms (5%) 16.00 MiB (1%) 88
["extreme_filter", "Int64", "1024×1024", "r5_generic"] 60.091 ms (5%) 8.00 MiB (1%) 14
["extreme_filter", "Int64", "256×256", "r1_box_2D_optim"] 191.700 μs (5%) 578.95 KiB (1%) 1550
["extreme_filter", "Int64", "256×256", "r1_diamond"] 303.500 μs (5%) 512.62 KiB (1%) 22
["extreme_filter", "Int64", "256×256", "r1_diamond_2D_optim"] 144.800 μs (5%) 596.31 KiB (1%) 2052
["extreme_filter", "Int64", "256×256", "r1_generic"] 384.600 μs (5%) 512.67 KiB (1%) 12
["extreme_filter", "Int64", "256×256", "r5_diamond"] 1.618 ms (5%) 1.00 MiB (1%) 88
["extreme_filter", "Int64", "256×256", "r5_generic"] 3.442 ms (5%) 516.17 KiB (1%) 14
["extreme_filter", "Int64", "512×512", "r1_box_2D_optim"] 914.200 μs (5%) 2.15 MiB (1%) 4112
["extreme_filter", "Int64", "512×512", "r1_diamond"] 1.689 ms (5%) 2.00 MiB (1%) 23
["extreme_filter", "Int64", "512×512", "r1_diamond_2D_optim"] 664.900 μs (5%) 2.18 MiB (1%) 5124
["extreme_filter", "Int64", "512×512", "r1_generic"] 1.683 ms (5%) 2.00 MiB (1%) 13
["extreme_filter", "Int64", "512×512", "r5_diamond"] 7.656 ms (5%) 4.00 MiB (1%) 88
["extreme_filter", "Int64", "512×512", "r5_generic"] 14.798 ms (5%) 2.00 MiB (1%) 14

Member

@johnnychen94 johnnychen94 left a comment


Do you plan to do more tweaks on this? If not, then we can find a good place to dispatch the extreme_filter! call to this optimized implementation and then merge it. I mean, this is already a huge improvement.

I haven't taken the time to read the reference paper; I'll do it before merging this.


# ptr level optimized implementation for Real types
# short-circuit all checks
function _unsafe_padded_copyto!(dest::AbstractVector, src::AbstractVector, dir, N, v) where {T}
Member

I'm pretty sure we need to restrict the type annotation to Vector (array types with valid pointer for memcopy) for safety.

Collaborator Author

I think you're right. Do you want me to fix it, or could you do it on your side?

Member

@johnnychen94 johnnychen94 Jun 14, 2022

When you're satisfied with the implementation, I can do the cleanup work before merging.

Here's my understanding of the todo work:

  • further tweak the performance (trying my previous comments and maybe more)
  • call the optimized implementation from extreme_filter! under certain conditions
  • maybe a few more tests and benchmarks
  • rebase on master

Let me know if you plan to do all of them or just the first performance tweak.

Collaborator Author

I can try to do points 1, 3, and 4 (I will make a short attempt to avoid the unnecessary copy, but that could be in another iteration); I will add comments and docs as well.
For point 2, you are more qualified than me to ensure all safety conditions are taken into account.
I'll try to do this today!
And thanks again, of course.

Member

@johnnychen94 johnnychen94 left a comment


I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.
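(Editor's aside, making the SubArray point concrete with hypothetical helpers: the two reductions below return identical results, but the explicitly indexed form gives a hot loop no view wrapper to construct, which is what the comment above suggests.)

```julia
# Column maximum via a SubArray wrapper...
col_max_view(A::AbstractMatrix, j::Integer) = maximum(view(A, :, j))

# ...and via explicit index calculation, with no wrapper object involved.
function col_max_indexed(A::AbstractMatrix, j::Integer)
    m = A[first(axes(A, 1)), j]
    @inbounds for i in axes(A, 1)
        m = max(m, A[i, j])
    end
    return m
end
```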

Comment on lines 362 to 393
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1))
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2)
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x)
LoopVectorization.vmap!(f, viewout, tmp, viewnext)
Member

It's better to fuse them into one loop so that LoopVectorization gets more information on how to optimize the loop

Suggested change
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1))
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2)
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x)
LoopVectorization.vmap!(f, viewout, tmp, viewnext)
@turbo for i in eachindex(viewout, viewprevious, tmp, tmp2)
viewout[i] = f(f(viewprevious[i], viewnext[i]), tmp2[i])
end

@ThomasRetornaz
Collaborator Author

Do you plan to do more tweaks on this? If not then we can find a good place to dispatch the extreme_filter! call to this optimized implementation and then merge this. I mean, this is already a huge improvement

I haven't took the time to read the reference paper, I'll do it before merging this.

No, the next move will be to implement the 3D cases, aka V6/V26, but that could be in another PR.
Note that we could also implement linear SEs, aka C2 horizontal (line) and vertical (column); for "small" SEs, e.g. radius < 10, iterating C2 could be faster than a dedicated linear implementation.
So we could have something like

function extreme_filter(f, out, A, strel::Union{StrelHori,StrelVerti,StrelDiag45,StrelDiag135})
    if strel.r < 10  # threshold to be benchmarked
        # iterate _unsafe_extreme_filter_C2_direction
        return out
    end
    return _extreme_filter_lemonier(f, out, A, strel)  # or Lemire, or van Herk -- it will depend on benchmarks
end

I will do this in another PR, once we have defined linear SEs.

Reminder: for directions not aligned with the image grid, we have to implement the periodic-lines approach :(

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

\o/ \o/ \o/ Yes! Great, thanks a lot.
Beating MATLAB (probably a C implementation) is a huge achievement :)
I think we can keep this implementation as you propose (I can come back to it when I do the other SEs, linear and 3D).

@johnnychen94
Member

johnnychen94 commented Jun 14, 2022

Let's keep this PR limited to C4/C8 connectivity only. I just can't wait to finish and merge this PR!

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

Sorry, could you please apply your suggestions? I think I'm missing something and can't reach the expected speedup.
On my side, I'll document and add tests and benchmarks.
Thanks

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.
The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

Sorry, could you please apply your suggestions? I think I'm missing something and can't reach the expected speedup. On my side, I'll document and add tests and benchmarks. Thanks

But my current results are
"Gray{N0f8}" => 1-element BenchmarkTools.BenchmarkGroup:
tags: []
"512×512" => 2-element BenchmarkTools.BenchmarkGroup:
tags: []
"r1_diamond_2D_optim" => Trial(94.300 μs)
"r1_box_2D_optim" => Trial(126.600 μs)

so maybe double-check on your side?
Regards
Thanks for your help
TR

@johnnychen94
Member

@ThomasRetornaz Can you push a new version here so that I can track your latest changes?

ThomasRetornaz and others added 7 commits June 15, 2022 08:26
…4_2D using view

Faster than the existing algo but quite disappointing compared to the C code
Todo:
+ tests with copy instead of view
+ populate benchmark
+ add C8,C6,C26
This supports OffsetArray, it also supports CuArray (but requires
ArrayInterfaceGPUArrays loaded)
@ThomasRetornaz ThomasRetornaz force-pushed the speedup_extrem_filter branch from 747cb7d to 5042eba Compare June 15, 2022 07:03
@ThomasRetornaz
Collaborator Author

@ThomasRetornaz Can you push a new version here so that I can track your latest changes?

Done with rebase
Thanks

- fuse two fmap loops into one bigger loop -- this gives
  LoopVectorization more flexibility to analyze and optimize the
  loops.
- remove unnecessary memcopy operation during setup

Benchmark using cameraman (512x512) gray image on Ubuntu(WSL) i9-12900K
Julia 1.8.0-rc1, we get almost 2.2x performance boost

C42D:
  before: 214.230 μs (2 allocations: 8.25 KiB)
  after:  97.196 μs (3 allocations: 12.38 KiB)
C82D:
  before: 221.889 μs (1029 allocations: 64.55 KiB)
  after:  98.940 μs (5 allocations: 20.62 KiB)
@johnnychen94
Member

johnnychen94 commented Jun 15, 2022

I think we've unnecessarily complicated the implementation; here's what I did from scratch based on my understanding of the reference paper and the current implementation:

function _unsafe_extreme_filter_C4_2D_jc!(f::MAX_OR_MIN, out::AbstractArray{T,2}, A::AbstractArray{T,2}, iter) where {T}
    @debug "call the optimized `extreme_filter` implementation for SEDiamond SE and 2D images" fname =
        _unsafe_extreme_filter_C4_2D!
    axes(out) == axes(A) || throw(DimensionMismatch("axes(out) must match axes(A)"))

    # To avoid the result affected by loop order, we need two arrays
    src = (out === A) || (iter > 1) ? copy(A) : A

    Ω = strel(CartesianIndex, strel_diamond((3, 3)))
    δ = CartesianIndex(1, 1)
    R = CartesianIndices(src)
    R_inner = (first(R) + δ):(last(R) - δ)

    # applying radius=r filter is equivalent to applying radius=1 filter r times
    for i in 1:iter
        # NOTE(johnnychen94): the inner region loop is essentially equivalent to the
        # following code. Here we buffer the getindex to further improve the performance
        # by explicitly telling the compiler to reuse the SIMD register

        # @turbo for y in 2:(size(src, 2) - 1), x in 2:(size(src, 1) - 1)
        #     tmp = f(f(src[x, y], src[x - 1, y]), src[x + 1, y])
        #     out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
        # end
        @turbo for y in 2:(size(src, 2) - 1)
            p_up, p, p_down = src[1, y], src[1, y], src[2, y]  # p_up is overwritten by the first shift
            for x in 2:(size(src, 1) - 1)
                p_up, p, p_down = p, p_down, src[x + 1, y]
                tmp = f(f(p, p_up), p_down)
                out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
            end
        end

        @inbounds for p in EdgeIterator(R, R_inner)
            # for edge points, skip if the offset exceeds the boundary
            s = A[p]
            for o in Ω
                q = p + o
                checkbounds(Bool, A, q) || continue
                s = f(s, A[q])
            end
            out[p] = s
        end
    end
    return out
end
julia> img = Float64.(testimage("cameraman"));

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
  82.422 μs (7 allocations: 448 bytes)

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  98.856 μs (3 allocations: 12.38 KiB)

Simpler and more efficient.

If replacing @turbo with @tturbo, we get another 2x boost (which is 3x faster than MATLAB):

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
  43.143 μs (7 allocations: 448 bytes)
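(Editor's aside: the "applying radius=r filter is equivalent to applying radius=1 filter r times" comment in the code above can be sanity-checked against a brute-force reference. `diamond_max` below is illustrative only; clipping indices at the border behaves like padding with the neutral element of `max`, so the decomposition holds exactly.)

```julia
# Brute-force max filter over the radius-r diamond (L1 ball), borders clipped.
function diamond_max(A::AbstractMatrix{<:Real}, r::Integer)
    h, w = size(A)
    out = similar(A)
    for y in 1:w, x in 1:h
        m = A[x, y]
        for dy in -r:r, dx in -r:r
            abs(dx) + abs(dy) <= r || continue   # diamond membership
            xi, yj = x + dx, y + dy
            (1 <= xi <= h && 1 <= yj <= w) || continue
            m = max(m, A[xi, yj])
        end
        out[x, y] = m
    end
    return out
end
```

For any real-valued `A`, `diamond_max(A, 2) == diamond_max(diamond_max(A, 1), 1)`.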

@ThomasRetornaz
Collaborator Author

I think we've unnecessarily complicated the implementation; here's what I did from scratch based on my understanding of the reference paper and the current implementation: [...]

Nice!!! I tended to copy my C implementation too much...
Your code is simpler, more Julia-centric, and FASTER \o/

johnnychen94 added a commit that referenced this pull request Jun 24, 2022
The codes come from #90.

Co-Authored-by: Retornaz Thomas <[email protected]>
johnnychen94 added a commit that referenced this pull request Jun 24, 2022
The codes come from #90.

Co-Authored-by: Retornaz Thomas <[email protected]>
johnnychen94 added a commit that referenced this pull request Jun 28, 2022
johnnychen94 added a commit that referenced this pull request Jun 28, 2022