
Docstring for VecUnroll? #71

Closed
timholy opened this issue Sep 22, 2021 · 5 comments · Fixed by #72

Comments

@timholy
Contributor

timholy commented Sep 22, 2021

I'm looking into providing support for multichannel colors in ImageFiltering. As you may know, JuliaImages provides real RGB types that encode the color of a pixel without adding an array dimension to do it. Naturally, these aren't natively supported by VectorizationBase. Obviously, I can reinterpret(reshape, Float32 #=or whatever=#, img), but everything gets a lot uglier if you have to add array dimensions. I am guessing that VecUnroll is kind of like a SVector, is that right? If so, what do the parameters "mean"? Or if that's not the case, is there a good solution for supporting the equivalent of an NTuple{N,T} where T<:NativeTypes?

@timholy
Contributor Author

timholy commented Sep 22, 2021

In case it helps, I pushed JuliaImages/ImageFiltering.jl#229 so you can see where I was going.

@chriselrod
Member

A VecUnroll{N,W,T} is typically a tuple of N+1 AbstractSIMDVector{W,T}s, e.g. a VecUnroll{3,8,Float32} is a collection of 4x Vec{8,Float32}.
The primary purposes are to:

  1. Interleave instructions when unrolling code that calls functions with many instructions (e.g. exp).
  2. Allow for optimizations when loading/storing them.

For example:

julia> using VectorizationBase

julia> x = [(R = rand(Float32), G = rand(Float32), B = rand(Float32)) for _ in 1:100];

julia> x[1:4]
4-element Vector{NamedTuple{(:R, :G, :B), Tuple{Float32, Float32, Float32}}}:
 (R = 0.26760215, G = 0.71913034, B = 0.21303588)
 (R = 0.23587197, G = 0.13380599, B = 0.32123268)
 (R = 0.15133655, G = 0.95029974, B = 0.515872)
 (R = 0.14430124, G = 0.5951618, B = 0.30713272)

julia> vload(stridedpointer(reinterpret(reshape, Float32, x)), Unroll{1,1,3,2,Int(pick_vector_width(Float32)),zero(UInt),1}((1,1)))
3 x Vec{16, Float32}
Vec{16, Float32}<0.26760215f0, 0.23587197f0, 0.15133655f0, 0.14430124f0, 0.96873146f0, 0.87005985f0, 0.77632064f0, 0.7556457f0, 0.5719081f0, 0.9102745f0, 0.857614f0, 0.77330655f0, 0.492265f0, 0.38813847f0, 0.12650585f0, 0.6479362f0>
Vec{16, Float32}<0.71913034f0, 0.13380599f0, 0.95029974f0, 0.5951618f0, 0.84525603f0, 0.91485614f0, 0.598403f0, 0.8754583f0, 0.05103159f0, 0.08334786f0, 0.669178f0, 0.07402092f0, 0.47366828f0, 0.9429586f0, 0.8576969f0, 0.8917303f0>
Vec{16, Float32}<0.21303588f0, 0.32123268f0, 0.515872f0, 0.30713272f0, 0.13002992f0, 0.33429003f0, 0.35313886f0, 0.76046395f0, 0.7992017f0, 0.8299498f0, 0.6705442f0, 0.674591f0, 0.6830538f0, 0.03794396f0, 0.21829528f0, 0.8817978f0>

julia> typeof(ans)
VecUnroll{2, 16, Float32, Vec{16, Float32}}

So the VecUnroll{2, 16, Float32} consists of 3 x Vec{16, Float32}.

In this case, the first 4 elements of the first of these Vecs are:

0.26760215f0, 0.23587197f0, 0.15133655f0, 0.14430124f0

These match the first 4 R values; similarly, the second vector matches G and the third matches B.

This is going to be faster than 3 independent loads, because those would require 3 gather instructions: the R, G, and B fields are not contiguous in memory.
By loading them as a group, it performs 3 contiguous loads, and then uses a sequence of shuffle instructions to get them into the correct order.

I'm emphasizing this here as it is probably an important use case if you're using arrays of structs (e.g. arrays of colors) instead of structs of arrays.
Although it'd be a useful feature for TiledIteration to support transitioning from arrays of structs to struct of arrays format. (Note that LoopVectorization shouldn't be needed for these shuffles; LLVM's autovectorizer normally does this on its own).
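To illustrate the AoS-to-SoA shuffle idea in plain Julia (this is not VectorizationBase's API; `deinterleave3` is a hypothetical helper), a loop like the following is typically compiled by LLVM's autovectorizer into contiguous loads plus shuffles rather than gathers:

```julia
# Hypothetical helper: split interleaved data (r1,g1,b1,r2,g2,b2,...)
# into three separate vectors. LLVM usually vectorizes this with
# contiguous loads followed by shuffle instructions.
function deinterleave3(x::Vector{Float32})
    n = length(x) ÷ 3
    r = Vector{Float32}(undef, n)
    g = Vector{Float32}(undef, n)
    b = Vector{Float32}(undef, n)
    @inbounds for i in 1:n
        r[i] = x[3i - 2]
        g[i] = x[3i - 1]
        b[i] = x[3i]
    end
    return r, g, b
end
```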

I am guessing that VecUnroll is kind of like a SVector, is that right?

It can be used a little like one, e.g.:

julia> VecUnroll((1.0, 2.0, 3.0, 4.0))
4 x Float64
1.0
2.0
3.0
4.0

julia> abs2(ans)
4 x Float64
1.0
4.0
9.0
16.0

julia> typeof(ans)
VecUnroll{3, 1, Float64, Float64}

They're not AbstractArrays and do not support the AbstractArray interface, but they are still AbstractSIMD and should support the AbstractSIMD interface. As such, they also still have some optimizations:

julia> using BenchmarkTools

julia> x = VecUnroll((1.0, 2.0, 3.0, 4.0));

julia> @btime exp($(Ref(x))[])
  5.317 ns (0 allocations: 0 bytes)
4 x Float64
2.7182818284590455
7.3890560989306495
20.085536923187668
54.59815003314424

julia> t = (1.0, 2.0, 3.0, 4.0);

julia> @btime exp.($(Ref(t))[])
  19.657 ns (0 allocations: 0 bytes)
(2.718281828459045, 7.38905609893065, 20.085536923187668, 54.598150033144236)

Or if that's not the case, is there a good solution for supporting the equivalent of an NTuple{N,T} where T<:NativeTypes?

I'd have to look more at the PR, but are you wanting LoopVectorization to support arrays with NTuple{N,T} elements (where T<:NativeTypes) without the need for reinterpret?
There isn't any support for that yet.

I'm still (very slowly) working on rewriting LoopVectorization.
One of the goals of the rewrite is to separate the front end, scheduler, and code generator into separate packages with documented and reasonably stable APIs.
This would make it easier to add support in the front end, or add a new front end, that supports NTuples, e.g. hopefully eventually using the compiler plugins.

Conceivably, one could add specific support for NTuples by automating the reinterpret and (preferably only when necessary) unrolling.
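A minimal sketch of that automated reinterpret (the helper name `as_native` is hypothetical, not an existing API): view an array of homogeneous NTuples as the same data with one extra leading dimension, so the fields become addressable by native-typed SIMD loads.

```julia
# Hypothetical helper: expose the NTuple fields of each element as a
# leading dimension of the native element type.
# Requires Julia >= 1.6 for reinterpret(reshape, ...).
as_native(A::AbstractArray{NTuple{N,T}}) where {N,T} =
    reinterpret(reshape, T, A)
```

For a `Vector{NTuple{3,Float32}}` of length L this yields a 3×L view, which is the layout the Unroll-ed vload example earlier in this thread operates on.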

@timholy
Contributor Author

timholy commented Sep 23, 2021

As background, we have two key abstractions that, in my opinion, make JuliaImages the nicest platform for writing generic image processing code (we aim to unify biomedical imaging and computer vision, whereas the vast majority of suites are firmly in one camp or another). Since I suspect I may start pestering you a lot, perhaps it makes sense to spend a little bit of time dragging you through a brief introduction/motivation.

Before introducing the two key abstractions, let me acknowledge that they are centered on behavior rather than representation, and do not in and of themselves get in the way of the array-of-structs vs struct-of-arrays issue. So this is not an argument against that viewpoint.

With that background, the two key abstractions are:

  1. each element of an array is a pixel/voxel/whatever---we typically avoid using an array dimension for encoding partial information like color channels, instead we bundle them together into complete structs
  2. we use numeric types where you can take the mathematical meaning at face value. In particular, for us 255 != 1.0, which might sound pretty obvious but turns out to be contrary to what the vast majority of other suites would have you believe.

The first of these is important for both correctness and generalizability:

  • when you use an array dimension for color, many algorithms have to make a decision about what array dimensions you should apply your operation to, which makes everything ugly and increases duplication (and errors, if you forget to worry about color)
  • the typical heuristics for identifying a color dimension ("is the final dimension of size 3?") are insane, and introduce problems both in supporting:
    • three-dimensional biomedical imaging or time-series images (I've seen algorithms do something wildly unexpected because I happened to pass in a thin "slice" of 3 temporal grayscale frames, or slice of 3 z-planes in a 3d grayscale image)
    • hyperspectral imaging (which uses more than 3 elements for the color channel)

The second abstraction allows us to divorce meaning from representation. In most suites, "white" is 255 if you're using UInt8 and 1.0 if you're using Float32; hence, an innocent operation like Float32.(img) or RGB{Float32}.(img) to change the precision for some future operation can change your display of the image from richly-detailed to saturated:
[screenshot: the same image displayed richly detailed vs. saturated after the precision change]
(Warning users "don't do this, instead use our special routines that divide by 255" is one of the very first things that appears in the documentation of, e.g., scikit-image, but this just doesn't even come up in JuliaImages).
Moreover, judging white/saturated from the eltype is fraught if, for example, you're using a scientific 12-bit camera, for which the integer value 4095 corresponds to saturated but is typically represented using a UInt16 (a few suites can handle packed data for a few operations, but in practice everyone unpacks quickly before getting to serious processing). So instead we created the FixedPointNumbers package which reinterprets 0xff as 1.0 (the so-called N0f8 type, Normalized number with 0 integer bits and 8 fractional bits). For a 12-bit camera, naturally we use N4f12, which is just a reinterpretation of UInt16 where the bit collection 0x0fff is given the mathematical meaning of 1.0 and 0xffff is equivalent to 65535/4095, a hair over 16.
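The reinterpretation behavior described above can be checked directly with the FixedPointNumbers package (assuming it is installed):

```julia
using FixedPointNumbers

# N0f8: a reinterpreted UInt8 where the byte 0xff means exactly 1.0
@assert reinterpret(N0f8, 0xff) === one(N0f8)
@assert reinterpret(one(N0f8)) == 0xff          # recover the backing byte

# N4f12: a reinterpreted UInt16 where 0x0fff means 1.0, so the full
# range 0xffff corresponds to 65535/4095, a hair over 16
@assert reinterpret(N4f12, 0x0fff) === one(N4f12)
@assert isapprox(float(reinterpret(N4f12, 0xffff)), 65535 / 4095; atol = 1e-4)
```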

These are very simple, low-level abstractions, and they enable us to write a lot of code that is flexible, correct, and short. But they both focus on getting away from native hardware types and hence pose a challenge to a package like LoopVectorization. So let me now describe a couple of the things that might be needed to build the bridge:

  • if there aren't the equivalent of SVector{3,T} eltypes, then we need a struct-of-arrays or an unpack-as-dimension representation. I think we should improve our usage of struct-of-arrays anyway. Of course, we support AbstractArray basically everywhere so anyone can use StructArray{RGB{Float32}}(r=rand(Float32, 128, 100), g=rand(Float32, 128, 100), b=rand(Float32, 128, 100)). While we should provide a convenience utility for doing that, it remains important to support the array-of-structs representation. One of JuliaImages' other goals is good support for "big data," typically involving a mmap somewhere in your pipeline, and that means you can't force people to change the representation just to run code. Question: are you saying we could do that with:

    julia> using ImageCore, StructArrays, MappedArrays
    
    julia> img = rand(RGB{Float32}, 128, 100);   # an array-of-structs
    
    julia> imga = StructArray{RGB{Float32}}(r=mappedarray(red, img), g=mappedarray(green, img), b=mappedarray(blue, img));  # lazy struct-of-arrays

    and that would basically be enough to make LoopVectorization happy?

  • for the N0f8 issue, an interesting example is linear filtering. If you reinterpret N0f8 values back to their backing UInt8, you also have to scale the kernel to compensate. That's the approach I took here in the PR I submitted yesterday. Concern: it's not entirely obvious that it will always be that easy.
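That kernel-rescaling trick can be sketched in plain Julia (the function name `filter1d_n0f8` is illustrative, not the PR's actual code): correlate the UInt8 backing with the kernel divided by 255, so the result lands back on the 0-to-1 scale.

```julia
using FixedPointNumbers

# Sketch: 1-d correlation of N0f8 data via its UInt8 backing, with the
# kernel pre-scaled by 1/255 to compensate for the reinterpretation.
function filter1d_n0f8(img::AbstractVector{N0f8}, kernel::AbstractVector{Float32})
    raw = reinterpret(UInt8, img)   # work on the backing bytes
    k = kernel ./ 255f0             # compensate for the 255x scale of UInt8
    n, m = length(raw), length(k)
    out = Vector{Float32}(undef, n - m + 1)
    @inbounds for i in eachindex(out)
        s = 0f0
        for j in 1:m
            s += k[j] * raw[i + j - 1]
        end
        out[i] = s
    end
    return out
end
```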

@chriselrod
Member

chriselrod commented Sep 23, 2021

If you're writing a lot of color functions, the easiest thing may be to write a transform like you did for CartesianIndices, where the operations get duplicated as necessary given Colors or SVector{N,T} / some custom type.
Or wait for the rewrite and handle it more generically in the front end (also, all the old code will be scrapped by then).

StructArray doesn't really help things "work"; it just might buy performance because of the memory layout.
That is, you still can't use the StructArray directly: you still need to reference the fields individually:

@turbo for i in eachindex(imga)
    r = imga.r[i]
    g = imga.g[i]
    b = imga.b[i]
    # do something with r, g, b
end

and it isn't obvious to me that this has a real advantage over

imga = reinterpret(reshape, Float32, img)
for i in eachindex(img) # should work and effectively drop trailing dims
    r = imga[1,i]
    g = imga[2,i]
    b = imga[3,i]   
    # do something with r, g, b
end

in terms of convenience.

But there are potential performance differences based on memory layout.
If r, g, and b in the struct array are each contiguous, then the former would allow them to be loaded/stored efficiently.

If r, g, and b in the struct array are not contiguous, e.g. if they're mapped arrays of the original array, then the former hides the fact that r[i], g[i], and b[i] are next to each other in memory, making loading/storing very expensive.

The latter, assuming the underlying memory is an array of structs, is reasonably efficient: the data will be loaded/stored in contiguous chunks (fast), and then shuffled into separate vectors of r, g, and b in registers. Shuffling is slower than not shuffling, but reasonably fast, and much faster than gather/scatter instructions (and scatter is AVX512, making storing even worse without it).

Of course, doing the above manually means also manually scaling the kernels because you're still working with native types...

timholy added a commit to timholy/VectorizationBase.jl that referenced this issue Sep 24, 2021
@timholy
Contributor Author

timholy commented Sep 24, 2021

Yeah, we can add an array dimension and make corresponding adjustments to other inputs. It's a little less pretty, but the performance advantages are compelling. Moreover, the main benefit from our abstractions is "communication about intent with the user," and once that has been achieved we can afford a bit of less-than-pretty specialization.

Thanks for the consultation! In gratitude/fair exchange, an attempted docstring for VecUnroll is in #72.

chriselrod pushed a commit that referenced this issue Sep 24, 2021