Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for SIMD instructions #80

Open
Magalame opened this issue Jul 1, 2019 · 14 comments
Open

Support for SIMD instructions #80

Magalame opened this issue Jul 1, 2019 · 14 comments

Comments

@Magalame
Copy link

Magalame commented Jul 1, 2019

Hi!

I was wondering if there were any specific plan to implement SIMD instructions. I've been playing around with them in the context of the dense-linear-algebra library, and they seem to be fundamental to reach BLAS-like performance. I was wondering if you were planning to introduce SIMD to massiv?

@lehins
Copy link
Owner

lehins commented Jul 1, 2019

@Magalame I definitely do have a strong desire to add SIMD support to massiv. I have tried a few things before and have been pondering on some ideas on how to best implement it for a while now. I think I might a good idea that could work. Time permitting, I'll try to implement a proof of concept in some near future and will ping you here so I can get some feedback. ;)

In short, the idea I have, is to go directly through C world with FFI, rather than using primitives from GHC.Prim, since those require LLVM and I don't want to force anyone to install it.

@klapaucius
Copy link

@Magalame
Copy link
Author

Magalame commented Jul 2, 2019

@lehins I see, sounds good I'm excited :)

If as @klapaucius points out SIMD support gets to the NGC backend, it might avoid you a lot of trouble

@klapaucius
Copy link

Aaaaand... merged https://gitlab.haskell.org/ghc/ghc/commit/acd795583625401c5554f8e04ec7efca18814011

@lehins
Copy link
Owner

lehins commented Jul 3, 2019

It is really good news that ghc will finally be able generate SIMD natively.

I had a pretty good day in that direction as well. Was experimenting with implementing matrix multiplication using SIMD in C and FFI and saw a very nice speed up:

benchmarking Mult/Seq/multiplyTranspose
time                 27.22 ms   (27.09 ms .. 27.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 27.31 ms   (27.25 ms .. 27.37 ms)
std dev              135.6 μs   (103.9 μs .. 170.9 μs)
                          
benchmarking Mult/Seq/multiplyTransposeSIMD
time                 17.23 ms   (17.04 ms .. 17.55 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 17.14 ms   (17.07 ms .. 17.29 ms)
std dev              228.4 μs   (133.3 μs .. 387.6 μs)
                          
benchmarking Mult/Par/multiplyTranspose
time                 7.683 ms   (7.255 ms .. 8.234 ms)
                     0.982 R²   (0.969 R² .. 0.996 R²)
mean                 7.575 ms   (7.426 ms .. 7.766 ms)
std dev              474.1 μs   (285.0 μs .. 635.8 μs)
variance introduced by outliers: 34% (moderately inflated)
                          
benchmarking Mult/Par/multiplyTransposeSIMD
time                 3.454 ms   (3.411 ms .. 3.503 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 3.392 ms   (3.357 ms .. 3.427 ms)
std dev              100.7 μs   (77.90 μs .. 146.5 μs)
variance introduced by outliers: 14% (moderately inflated)

It is still pretty far from being production ready, but it is getting me closer to having some sort of SIMD capabilities in massiv

@Magalame
Copy link
Author

Magalame commented Jul 3, 2019

I have to say I'm fairly impressed with the native GHC SIMD (llvm) capabilities:

benchmarking DLA/Matrix-matrix multiplication
time                 2.573 ms   (2.546 ms .. 2.598 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 2.608 ms   (2.584 ms .. 2.660 ms)
std dev              110.6 μs   (64.93 μs .. 220.6 μs)
variance introduced by outliers: 26% (moderately inflated)

benchmarking DLA/Matrix-matrix multiplication  simd
time                 396.5 μs   (393.3 μs .. 401.3 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 408.4 μs   (404.7 μs .. 414.3 μs)
std dev              15.62 μs   (11.65 μs .. 20.91 μs)
variance introduced by outliers: 32% (moderately inflated)

benchmarking DLA/Norm
time                 20.65 μs   (20.34 μs .. 20.98 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.75 μs   (20.57 μs .. 20.95 μs)
std dev              641.9 ns   (540.2 ns .. 838.4 ns)
variance introduced by outliers: 34% (moderately inflated)

benchmarking DLA/Norm simd
time                 4.357 μs   (4.327 μs .. 4.387 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 4.346 μs   (4.322 μs .. 4.376 μs)
std dev              89.97 ns   (71.59 ns .. 114.1 ns)
variance introduced by outliers: 22% (moderately inflated)

benchmarking Hmatrix/Matrix-matrix multiplication
time                 334.8 μs   (332.1 μs .. 337.7 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 334.7 μs   (332.6 μs .. 336.7 μs)
std dev              6.768 μs   (5.800 μs .. 8.143 μs)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Hmatrix/Norm
time                 4.822 μs   (4.756 μs .. 4.889 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 4.836 μs   (4.785 μs .. 4.894 μs)
std dev              182.2 ns   (153.2 ns .. 217.7 ns)
variance introduced by outliers: 48% (moderately inflated)

@lehins
Copy link
Owner

lehins commented Jul 4, 2019

I started a separate repo for now for the massiv-simd package that will eventually find its home in this repository. For now it only contains a handful of functions that can do things on 128bit Doubles(x2), but the most important thing is to figure out the overall structure of the package and see if it is really viable. So far, I think, it looks really promising, both design wise and the speed observed.

@Magalame
Copy link
Author

Magalame commented Jul 9, 2019

I'll look into it! Someone also recommended me this paper on the same subject: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/haskell-beats-C.pdf
It's concerned with the current implementation of vector so I'm not sure it'll be 100% relevant here, but it might yield some insight regarding the possibility of a more abstract API for SIMD

@lehins
Copy link
Owner

lehins commented Jul 9, 2019

@Magalame It is a very interesting read, highly recommend it too. Unfortunately vector still doesn't use SIMD, despite what the paper says. It does give a very good insight into very powerful stream fusion that is vector build on, although I don't think it is terribly relevant in the context of multidimensional arrays, unless the arrays are ragged.

@klapaucius
Copy link

@lehins
Copy link
Owner

lehins commented Jul 9, 2019

@klapaucius That's where that branch went, I was looking all over for it, I even asked Roman about it. :D (Thank you for posting a link to it!)
Despite that it was implemented 6 years ago, unfortunately it doesn't change what I said, it is still not in vector

@Magalame
Copy link
Author

Magalame commented Jul 9, 2019

@klapaucius your comments always are providential!

@klapaucius
Copy link

@lehins
Copy link
Owner

lehins commented Jul 9, 2019

@klapaucius your comments are always so concise! 😄

Spoke too soon about the SIMD support in ghc. But's ok, I am sure Ben Gamari will figure it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants