
NO MERGE Speedup extrem filter #90

Closed

Conversation

ThomasRetornaz
Collaborator

First attempt to speed up extreme_filter for a special case, e.g. the 2D diamond SE.
We achieve a speedup of around 2.5-3x, which is quite disappointing :/ -- I'm far from my heavily optimized C++ implementation (missing a ~10x factor).
@johnnychen94, do you know any special Julia tricks for this new code? Or maybe I'm missing something.
Thanks in advance.
If not, I will do the other cases, 2D diamond and 3D, and clean up the PR.

@johnnychen94 johnnychen94 force-pushed the speedup_extrem_filter branch from 669c278 to d45a086 Compare May 30, 2022 12:39
@johnnychen94
Member

johnnychen94 commented May 30, 2022

I'm not sure if memcopy and memset help here -- it seems that they perform similarly: https://discourse.julialang.org/t/use-memcpy-instead-of-memmove-to-copy-array-when-theres-no-overlap/64042/4?u=johnnychen94

But the good news is that LoopVectorization can be used here since we only support gray images, which gives an extra 2x boost. It's still not the 10x you expected, though...

julia> using TestImages, ImageMorphology

julia> img = Float64.(testimage("cameraman"));

julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  321.266 μs (11252 allocations: 695.47 KiB)

julia> @btime ImageMorphology.extreme_filter!(max, $(similar(img)), $img, $(strel_diamond(img)));
  1.142 ms (18 allocations: 480 bytes)

Since the implementation is vectorized, the new version actually supports GPU (although it's incredibly slow...)

julia> using ArrayInterfaceCUDA, CUDA

julia> img = CuArray(Float64.(testimage("cameraman")));

julia> @btime ImageMorphology._extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  16.543 ms (83432 allocations: 4.54 MiB)

so let's just live in the CPU world for now :)

@johnnychen94
Member

johnnychen94 commented May 30, 2022

@ThomasRetornaz I'm pretty sure I don't exactly understand the algorithm, could you point me to any reference?

Besides this PR I was just about to implement 1, do you happen to know which one is more efficient?

Footnotes

  1. Van Droogenbroeck, Marc, and Hugues Talbot. "Fast computation of morphological operations with arbitrary structuring elements." Pattern Recognition Letters 17.14 (1996): 1451-1460. https://www.sciencedirect.com/science/article/pii/S0167865596001134

@johnnychen94
Member

As a reference, MATLAB's implementation on my Windows PC takes about 120 μs.

img = imread("cameraman.tif");
img = imresize(img, [512, 512]);

f = @() g(img);
timeit(f)

function g(img)
    for i = 1:1000
        imdilate(img, strel('diamond', 1));
    end
    return
end

@ThomasRetornaz
Collaborator Author

Nice try with LoopVectorization -- I think it's the right path, but we're missing something, given the enormous growth in the number of allocations.
Maybe we could move _numericarray inside the main function and check whether the number of allocations goes down.
I will try.

Note: I think there is a typo, @tturbo -> @turbo.

For an explanation of the algorithm, you could have a look at
https://www.researchgate.net/publication/325480366_In-place_SIMD_Accelerated_Mathematical_Morphology
It's the same idea; my proposed algorithm treats the whole line instead of the register.
(If it's not clear, I can write a more detailed explanation.)
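(Editor's aside: the "whole line" idea can be sketched in plain Julia. This is an illustrative sketch only -- `diamond_dilate_r1` is a hypothetical name, not this PR's API. A radius-1 diamond (C4) max filter combines a 3-point max along one axis with the two remaining diamond neighbours; clamping indices emulates border replication.)

```julia
# Hypothetical sketch of the radius-1 diamond (C4) dilation semantics.
# Not the PR's implementation -- names and border handling are illustrative.
function diamond_dilate_r1(A::AbstractMatrix{<:Real})
    h, w = size(A)
    out = similar(A)
    # clamped getindex emulating border replication
    at(x, y) = A[clamp(x, 1, h), clamp(y, 1, w)]
    for y in 1:w, x in 1:h
        # 3-point max along the current line...
        m = max(at(x - 1, y), at(x, y), at(x + 1, y))
        # ...combined with the two remaining diamond neighbours
        out[x, y] = max(m, at(x, y - 1), at(x, y + 1))
    end
    return out
end
```

An optimized version would process a whole line at a time with SIMD instead of pixel-by-pixel; the sketch only pins down the semantics.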

@ThomasRetornaz
Collaborator Author

As a reference, MATLAB's implementation on my Windows PC takes about 120 μs.

img = imread("cameraman.tif");
img = imresize(img, [512, 512]);

f = @() g(img);
timeit(f)

function g(img)
    for i = 1:1000
        imdilate(img, strel('diamond', 1));
    end
    return
end

My reference implementation in C++ runs in around 40 μs, so the MATLAB implementation is pretty good. We will try to beat it :)

@ThomasRetornaz
Collaborator Author

@ThomasRetornaz I'm pretty sure I don't exactly understand the algorithm, could you point me to any reference?

Besides this PR I was just about to implement 1, do you happen to know which one is more efficient?

Footnotes

1. Van Droogenbroeck, Marc, and Hugues Talbot. "Fast computation of morphological operations with arbitrary structuring elements." Pattern Recognition Letters 17.14 (1996): 1451-1460. https://www.sciencedirect.com/science/article/pii/S0167865596001134

Sorry, I missed the question.
The above paper proposes a fast implementation for arbitrary SEs; for common SEs (diamond, box) my "proposed" version should be faster.
I will also try to play with concrete SIMD types via SIMD.jl.

@ThomasRetornaz
Collaborator Author

I'm trying my best but remain stuck at a 10x factor.
The two sub-operators, shift/pad and mapf, run approximately 10 times slower than required.
I specialized the algorithm for the N0f8 type and used LoopVectorization inner functions like vmap directly; we beat the current implementation, but it's not worth it.
If I have time, I will try to go deeper into SIMD manipulation in Julia through explicit LLVM functions, see https://github.com/eschnett/SIMD.jl
As another exercise, I will try to implement https://hal-upec-upem.archives-ouvertes.fr/hal-00692897/document and a linear operator using the Lemonier or Lemire approach (cited by the above paper).
Good luck on your side o/
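(Editor's aside: the Lemire approach mentioned above is the monotonic-deque running max/min, which costs amortized O(1) per sample independent of the radius. Below is a minimal 1D sketch -- `sliding_max` is an illustrative name, not this PR's API; truncating the window at the borders is equivalent to padding with the neutral element of `max`.)

```julia
# Illustrative Lemire-style sliding maximum over a centered window of radius r.
# dq holds indices of candidates whose values are strictly decreasing.
function sliding_max(v::AbstractVector{<:Real}, r::Integer)
    n = length(v)
    out = similar(v)
    dq = Int[]
    for j in 1:(n + r)
        if j <= n
            # a new sample dominates older, smaller-or-equal candidates
            while !isempty(dq) && v[last(dq)] <= v[j]
                pop!(dq)
            end
            push!(dq, j)
        end
        i = j - r  # the window [i - r, i + r] has now been fully seen
        if i >= 1
            # retire candidates that fell out of the window on the left
            while first(dq) < i - r
                popfirst!(dq)
            end
            out[i] = v[first(dq)]
        end
    end
    return out
end
```

A separable radius-r box filter is then two passes of this 1D kernel; van Herk/Gil-Werman reaches the same bound with block prefix/suffix maxima instead of a deque.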

@johnnychen94
Member

johnnychen94 commented Jun 8, 2022

This is impressive already! I would be very happy to see and merge a stable version when you're satisfied.

I will try if i have time to go deeper in simd manipulation in julia through explicit LLVM function see https://github.com/eschnett/SIMD.jl

A maintainability note on this: it's totally okay to have a piece of hand-rolled loop hackery to achieve the best performance. But I hope the usage of this trick is limited (or wrapped as a stable API) -- because you might be the only one who understands it for a very long time.
For instance, the mapf is very likely to be replaced with something better (or more stable). And it's not only ImageMorphology that requires this much performance tweaking; ImageFiltering will need it as well in the near future (JuliaImages/ImageFiltering.jl#228).


In the meantime, I'm trying to understand and update the old codebase and experiment if there's anything I can improve on the concept and generic support.
Personally, I like the strel(Bool, se) and strel(CartesianIndex, se) stuff, and I'm trying to add more like this.

@ThomasRetornaz
Collaborator Author

Above are the current results of my investigation.
Remarks:

  • my "optimized" 2D implementation accepts all Gray/Real-like inputs :)
  • those implementations win every time :) (fortunately, it's only 2D and basic shapes)
  • I'm 2x slower than the MATLAB implementation you spotted :(
  • my implementations are quite unsafe (calling pointers and LoopVectorization directly), so we need to enclose the calls with massive checks
  • I don't understand why there are so many allocations with the "optimized" versions :/ (I will try to fix that)
  • I added the 1024×1024 case to check whether something is wrong with cache efficiency
    • we see that "r1_diamond" can be slower than "r1_generic"!

Another remark: there is a dedicated algorithm for binary images, i.e. we only need to inspect the boundary of each connected component.
We could add this in a dedicated PR/feature request.
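(Editor's aside: the binary shortcut can be illustrated as follows -- a hypothetical sketch, not this PR's code. For a binary image, a radius-1 C4 dilation only needs to visit foreground pixels and set their 4-neighbours; a full boundary-tracking variant would restrict the visit further to component boundaries.)

```julia
# Illustrative binary radius-1 C4 dilation: visit foreground pixels only.
function binary_dilate_c4(B::AbstractMatrix{Bool})
    h, w = size(B)
    out = copy(B)
    for y in 1:w, x in 1:h
        B[x, y] || continue   # background pixels cannot turn on neighbours
        x > 1 && (out[x - 1, y] = true)
        x < h && (out[x + 1, y] = true)
        y > 1 && (out[x, y - 1] = true)
        y < w && (out[x, y + 1] = true)
    end
    return out
end
```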

Again thanks for your help :)

ID time GC time memory allocations
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_box_2D_optim"] 3.919 ms (5%) 4.27 MiB (1%) 8208
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_diamond"] 11.927 ms (5%) 4.00 MiB (1%) 22
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_diamond_2D_optim"] 3.552 ms (5%) 4.35 MiB (1%) 10244
["extreme_filter", "Gray{Float32}", "1024×1024", "r1_generic"] 8.838 ms (5%) 4.00 MiB (1%) 12
["extreme_filter", "Gray{Float32}", "1024×1024", "r5_diamond"] 56.276 ms (5%) 8.00 MiB (1%) 88
["extreme_filter", "Gray{Float32}", "1024×1024", "r5_generic"] 129.566 ms (5%) 4.00 MiB (1%) 14
["extreme_filter", "Gray{Float32}", "256×256", "r1_box_2D_optim"] 278.300 μs (5%) 317.64 KiB (1%) 1550
["extreme_filter", "Gray{Float32}", "256×256", "r1_diamond"] 280.400 μs (5%) 256.62 KiB (1%) 22
["extreme_filter", "Gray{Float32}", "256×256", "r1_diamond_2D_optim"] 233.900 μs (5%) 338.19 KiB (1%) 2052
["extreme_filter", "Gray{Float32}", "256×256", "r1_generic"] 499.800 μs (5%) 256.67 KiB (1%) 12
["extreme_filter", "Gray{Float32}", "256×256", "r5_diamond"] 1.258 ms (5%) 514.42 KiB (1%) 88
["extreme_filter", "Gray{Float32}", "256×256", "r5_generic"] 7.812 ms (5%) 260.17 KiB (1%) 14
["extreme_filter", "Gray{Float32}", "512×512", "r1_box_2D_optim"] 1.017 ms (5%) 1.14 MiB (1%) 4112
["extreme_filter", "Gray{Float32}", "512×512", "r1_diamond"] 2.022 ms (5%) 1.00 MiB (1%) 23
["extreme_filter", "Gray{Float32}", "512×512", "r1_diamond_2D_optim"] 899.100 μs (5%) 1.18 MiB (1%) 5124
["extreme_filter", "Gray{Float32}", "512×512", "r1_generic"] 2.149 ms (5%) 1.00 MiB (1%) 13
["extreme_filter", "Gray{Float32}", "512×512", "r5_diamond"] 9.395 ms (5%) 2.00 MiB (1%) 88
["extreme_filter", "Gray{Float32}", "512×512", "r5_generic"] 32.407 ms (5%) 1.00 MiB (1%) 14
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_box_2D_optim"] 1.307 ms (5%) 1.26 MiB (1%) 8208
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_diamond"] 5.036 ms (5%) 1.00 MiB (1%) 22
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_diamond_2D_optim"] 1.166 ms (5%) 1.35 MiB (1%) 10244
["extreme_filter", "Gray{N0f8}", "1024×1024", "r1_generic"] 4.313 ms (5%) 1.00 MiB (1%) 12
["extreme_filter", "Gray{N0f8}", "1024×1024", "r5_diamond"] 25.052 ms (5%) 2.00 MiB (1%) 88
["extreme_filter", "Gray{N0f8}", "1024×1024", "r5_generic"] 51.115 ms (5%) 1.00 MiB (1%) 14
["extreme_filter", "Gray{N0f8}", "256×256", "r1_box_2D_optim"] 134.700 μs (5%) 122.03 KiB (1%) 1550
["extreme_filter", "Gray{N0f8}", "256×256", "r1_diamond"] 96.500 μs (5%) 64.69 KiB (1%) 22
["extreme_filter", "Gray{N0f8}", "256×256", "r1_diamond_2D_optim"] 112.600 μs (5%) 144.78 KiB (1%) 2052
["extreme_filter", "Gray{N0f8}", "256×256", "r1_generic"] 244.000 μs (5%) 64.73 KiB (1%) 12
["extreme_filter", "Gray{N0f8}", "256×256", "r5_diamond"] 463.200 μs (5%) 130.55 KiB (1%) 88
["extreme_filter", "Gray{N0f8}", "256×256", "r5_generic"] 3.278 ms (5%) 68.23 KiB (1%) 14
["extreme_filter", "Gray{N0f8}", "512×512", "r1_box_2D_optim"] 387.500 μs (5%) 387.23 KiB (1%) 4112
["extreme_filter", "Gray{N0f8}", "512×512", "r1_diamond"] 415.100 μs (5%) 256.75 KiB (1%) 23
["extreme_filter", "Gray{N0f8}", "512×512", "r1_diamond_2D_optim"] 336.100 μs (5%) 433.25 KiB (1%) 5124
["extreme_filter", "Gray{N0f8}", "512×512", "r1_generic"] 980.900 μs (5%) 256.77 KiB (1%) 13
["extreme_filter", "Gray{N0f8}", "512×512", "r5_diamond"] 1.964 ms (5%) 514.55 KiB (1%) 88
["extreme_filter", "Gray{N0f8}", "512×512", "r5_generic"] 12.967 ms (5%) 260.23 KiB (1%) 14
["extreme_filter", "Int64", "1024×1024", "r1_box_2D_optim"] 3.861 ms (5%) 8.29 MiB (1%) 8208
["extreme_filter", "Int64", "1024×1024", "r1_diamond"] 11.843 ms (5%) 8.00 MiB (1%) 22
["extreme_filter", "Int64", "1024×1024", "r1_diamond_2D_optim"] 2.796 ms (5%) 8.36 MiB (1%) 10244
["extreme_filter", "Int64", "1024×1024", "r1_generic"] 6.806 ms (5%) 8.00 MiB (1%) 12
["extreme_filter", "Int64", "1024×1024", "r5_diamond"] 51.677 ms (5%) 16.00 MiB (1%) 88
["extreme_filter", "Int64", "1024×1024", "r5_generic"] 60.091 ms (5%) 8.00 MiB (1%) 14
["extreme_filter", "Int64", "256×256", "r1_box_2D_optim"] 191.700 μs (5%) 578.95 KiB (1%) 1550
["extreme_filter", "Int64", "256×256", "r1_diamond"] 303.500 μs (5%) 512.62 KiB (1%) 22
["extreme_filter", "Int64", "256×256", "r1_diamond_2D_optim"] 144.800 μs (5%) 596.31 KiB (1%) 2052
["extreme_filter", "Int64", "256×256", "r1_generic"] 384.600 μs (5%) 512.67 KiB (1%) 12
["extreme_filter", "Int64", "256×256", "r5_diamond"] 1.618 ms (5%) 1.00 MiB (1%) 88
["extreme_filter", "Int64", "256×256", "r5_generic"] 3.442 ms (5%) 516.17 KiB (1%) 14
["extreme_filter", "Int64", "512×512", "r1_box_2D_optim"] 914.200 μs (5%) 2.15 MiB (1%) 4112
["extreme_filter", "Int64", "512×512", "r1_diamond"] 1.689 ms (5%) 2.00 MiB (1%) 23
["extreme_filter", "Int64", "512×512", "r1_diamond_2D_optim"] 664.900 μs (5%) 2.18 MiB (1%) 5124
["extreme_filter", "Int64", "512×512", "r1_generic"] 1.683 ms (5%) 2.00 MiB (1%) 13
["extreme_filter", "Int64", "512×512", "r5_diamond"] 7.656 ms (5%) 4.00 MiB (1%) 88
["extreme_filter", "Int64", "512×512", "r5_generic"] 14.798 ms (5%) 2.00 MiB (1%) 14

Member

@johnnychen94 johnnychen94 left a comment


Do you plan to do more tweaks on this? If not, then we can find a good place to dispatch the extreme_filter! call to this optimized implementation and then merge it. I mean, this is already a huge improvement.

I haven't taken the time to read the reference paper; I'll do it before merging this.


# ptr level optimized implementation for Real types
# short-circuit all checks
function _unsafe_padded_copyto!(dest::AbstractVector, src::AbstractVector, dir, N, v) where {T}
Member

I'm pretty sure we need to restrict the type annotation to Vector (array types with valid pointer for memcopy) for safety.

Collaborator Author

I think you're right. Do you want me to fix it, or could you do it on your side?

Member

@johnnychen94 johnnychen94 Jun 14, 2022

When you're satisfied with the implementation, I can do the cleanup work before merging.

Here's my understanding of the todo work:

  • further tweak the performance (trying my previous comments and maybe more)
  • call the optimized implementation from extreme_filter! under certain conditions
  • maybe a few more tests and benchmarks
  • rebase on master

Let me know if you plan to do all of them or just the first performance tweak.

Collaborator Author

I can try to do points 1, 3, and 4 (I will make a short attempt to avoid the unnecessary copy, but that could be in another iteration); I will add comments and docs as well.
For point 2, you are more qualified than me to ensure all safety conditions are taken into account.
I'll try to do this today!
And thanks again, of course.

Member

@johnnychen94 johnnychen94 left a comment


I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.
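(Editor's aside, making the SubArray point concrete with hypothetical helpers: the two reductions below return identical results, but the explicitly indexed form gives a hot loop no view wrapper to construct, which is what the comment above suggests.)

```julia
# Column maximum via a SubArray wrapper...
col_max_view(A::AbstractMatrix, j::Integer) = maximum(view(A, :, j))

# ...and via explicit index calculation, with no wrapper object involved.
function col_max_indexed(A::AbstractMatrix, j::Integer)
    m = A[first(axes(A, 1)), j]
    @inbounds for i in axes(A, 1)
        m = max(m, A[i, j])
    end
    return m
end
```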

Comment on lines 362 to 393
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1))
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2)
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x)
LoopVectorization.vmap!(f, viewout, tmp, viewnext)
Member

It's better to fuse them into one loop so that LoopVectorization gets more information on how to optimize the loop

Suggested change
#sup(x-2,dilate(x-1)),inf(x-2,erode(x-1))
LoopVectorization.vmap!(f, tmp, viewprevious, tmp2)
#sup(sup(x-2,dilate(x-1),x) || inf(inf(x-2,erode(x-1),x)
LoopVectorization.vmap!(f, viewout, tmp, viewnext)
@turbo for i in eachindex(viewout, viewprevious, tmp, tmp2)
viewout[i] = f(f(viewprevious[i], viewnext[i]), tmp2[i])
end

@ThomasRetornaz
Collaborator Author

Do you plan to do more tweaks on this? If not then we can find a good place to dispatch the extreme_filter! call to this optimized implementation and then merge this. I mean, this is already a huge improvement

I haven't took the time to read the reference paper, I'll do it before merging this.

No, the next move will be to implement the 3D cases, aka V6/V26, but that could be in another PR.
Note that we could also implement linear SEs, aka C2 horizontal (line) and vertical (column); for "small" SEs, e.g. radius < 10, iterating C2 could be faster than a dedicated linear implementation.
So we could have something like

function extreme_filter(f, out, A, strel::Union{StrelHori,StrelVerti,StrelDiag45,StrelDiag135})
    if strel.r < 10  # threshold to be benchmarked
        # iterate _unsafe_extreme_filter_C2_direction
        return out
    end
    return _extreme_filter_lemonier(f, out, A, strel)  # or Lemire, or van Herk -- it will depend on benchmarks
end

I will do this in another PR, once we have defined linear SEs.

Reminder: for directions not aligned with the image grid, we have to implement the periodic-lines approach :(

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

\o/ \o/ \o/ Yes! Great, thanks a lot.
Beating MATLAB (probably a C implementation) is a huge achievement :)
I think we can keep this implementation as you propose (I can come back to it when I do the other SEs, linear and 3D).

@johnnychen94
Member

johnnychen94 commented Jun 14, 2022

Let's keep this PR limited to C4/C8 connectivity only. I just can't wait to finish and merge this PR!

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.

The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

Sorry, could you please apply your suggestions? I think I'm missing something and can't reach the expected speedup.
On my side, I'll document and add tests and benchmarks.
Thanks

@ThomasRetornaz
Collaborator Author

I get 109.639 μs (2051 allocations: 76.38 KiB) when applying my last three commented changes, which is (slightly) faster than MATLAB already.
The other allocations are probably overhead due to SubArray (view). If we can replace the view usage with explicit index calculation, the overheads and allocations might just be eliminated.

Sorry, could you please apply your suggestions? I think I'm missing something and can't reach the expected speedup. On my side, I'll document and add tests and benchmarks. Thanks

But my current results are
"Gray{N0f8}" => 1-element BenchmarkTools.BenchmarkGroup:
tags: []
"512×512" => 2-element BenchmarkTools.BenchmarkGroup:
tags: []
"r1_diamond_2D_optim" => Trial(94.300 μs)
"r1_box_2D_optim" => Trial(126.600 μs)

so maybe double-check on your side?
Regards
Thanks for your help
TR

@johnnychen94
Member

@ThomasRetornaz Can you push a new version here so that I can track your latest changes?

ThomasRetornaz and others added 7 commits June 15, 2022 08:26
…4_2D using view

Faster than the existing algo but quite disappointing compared to the C code
Todo:
+ tests with copy instead of view
+ populate benchmark
+ add C8,C6,C26
This supports OffsetArray, it also supports CuArray (but requires
ArrayInterfaceGPUArrays loaded)
@ThomasRetornaz ThomasRetornaz force-pushed the speedup_extrem_filter branch from 747cb7d to 5042eba Compare June 15, 2022 07:03
@ThomasRetornaz
Collaborator Author

@ThomasRetornaz Can you push a new version here so that I can track your latest changes?

Done with rebase
Thanks

- fuse two fmap loops into one bigger loop -- this gives
  LoopVectorization more flexibility to analyze and optimize the
  loops.
- remove unnecessary memcopy operation during setup

Benchmark using cameraman (512x512) gray image on Ubuntu(WSL) i9-12900K
Julia 1.8.0-rc1, we get almost 2.2x performance boost

C42D:
  before: 214.230 μs (2 allocations: 8.25 KiB)
  after:  97.196 μs (3 allocations: 12.38 KiB)
C82D:
  before: 221.889 μs (1029 allocations: 64.55 KiB)
  after:  98.940 μs (5 allocations: 20.62 KiB)
@johnnychen94
Member

johnnychen94 commented Jun 15, 2022

I think we've unnecessarily complicated the implementation; here's what I did from scratch based on my understanding of the reference paper and the current implementation:

function _unsafe_extreme_filter_C4_2D_jc!(f::MAX_OR_MIN, out::AbstractArray{T,2}, A::AbstractArray{T,2}, iter) where {T}
    @debug "call the optimized `extreme_filter` implementation for SEDiamond SE and 2D images" fname =
        _unsafe_extreme_filter_C4_2D!
    axes(out) == axes(A) || throw(DimensionMismatch("axes(out) must match axes(A)"))

    # To avoid the result affected by loop order, we need two arrays
    src = (out === A) || (iter > 1) ? copy(A) : A

    Ω = strel(CartesianIndex, strel_diamond((3, 3)))
    δ = CartesianIndex(1, 1)
    R = CartesianIndices(src)
    R_inner = (first(R) + δ):(last(R) - δ)

    # applying radius=r filter is equivalent to applying radius=1 filter r times
    for i in 1:iter
        # NOTE(johnnychen94): the inner region loop is essentially equivalent to the
        # following code. Here we buffer the getindex to further improve the performance
        # by explicitly telling the compiler to reuse the SIMD register

        # @turbo for y in 2:(size(src, 2) - 1), x in 2:(size(src, 1) - 1)
        #     tmp = f(f(src[x, y], src[x - 1, y]), src[x + 1, y])
        #     out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
        # end
        @turbo for y in 2:(size(src, 2) - 1)
            p_up, p, p_down = src[1, y], src[1, y], src[2, y]  # p_up is overwritten by the first shift
            for x in 2:(size(src, 1) - 1)
                p_up, p, p_down = p, p_down, src[x + 1, y]
                tmp = f(f(p, p_up), p_down)
                out[x, y] = f(f(tmp, src[x, y - 1]), src[x, y + 1])
            end
        end

        @inbounds for p in EdgeIterator(R, R_inner)
            # for edge points, skip if the offset exceeds the boundary
            s = A[p]
            for o in Ω
                q = p + o
                checkbounds(Bool, A, q) || continue
                s = f(s, A[q])
            end
            out[p] = s
        end
    end
    return out
end
julia> img = Float64.(testimage("cameraman"));

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
  82.422 μs (7 allocations: 448 bytes)

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D!(max, $(similar(img)), $img, 1);
  98.856 μs (3 allocations: 12.38 KiB)

Simpler and more efficient.

If replacing @turbo with @tturbo, we get another 2x boost (which is 3x faster than MATLAB):

julia> @btime ImageMorphology._unsafe_extreme_filter_C4_2D_jc!(max, $(similar(img)), $img, 1);
  43.143 μs (7 allocations: 448 bytes)
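(Editor's aside: the "applying radius=r filter is equivalent to applying radius=1 filter r times" comment in the code above can be sanity-checked against a brute-force reference. `diamond_max` below is illustrative only; clipping indices at the border behaves like padding with the neutral element of `max`, so the decomposition holds exactly.)

```julia
# Brute-force max filter over the radius-r diamond (L1 ball), borders clipped.
function diamond_max(A::AbstractMatrix{<:Real}, r::Integer)
    h, w = size(A)
    out = similar(A)
    for y in 1:w, x in 1:h
        m = A[x, y]
        for dy in -r:r, dx in -r:r
            abs(dx) + abs(dy) <= r || continue   # diamond membership
            xi, yj = x + dx, y + dy
            (1 <= xi <= h && 1 <= yj <= w) || continue
            m = max(m, A[xi, yj])
        end
        out[x, y] = m
    end
    return out
end
```

For any real-valued `A`, `diamond_max(A, 2) == diamond_max(diamond_max(A, 1), 1)`.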

@ThomasRetornaz
Collaborator Author

I think we've unnecessarily complicated the implementation; here's what I did from scratch based on my understanding of the reference paper and the current implementation: [...]

Nice!!! I tended to copy my C implementation too much...
Your code is simpler, more Julia-centric, and FASTER \o/

johnnychen94 added a commit that referenced this pull request Jun 24, 2022
The codes come from #90.

Co-Authored-by: Retornaz Thomas <[email protected]>
johnnychen94 added a commit that referenced this pull request Jun 24, 2022
The codes come from #90.

Co-Authored-by: Retornaz Thomas <[email protected]>
johnnychen94 added a commit that referenced this pull request Jun 28, 2022
johnnychen94 added a commit that referenced this pull request Jun 28, 2022