
llamafile_sgemm API - INT8 implementation #10912

Open · wants to merge 1 commit into master from sgemm_q8
Conversation

@amritahs-ibm (Contributor) commented Dec 20, 2024

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

The github-actions bot added the labels "testing (Everything test related)" and "ggml (changes relating to the ggml tensor library for machine learning)" on Dec 20, 2024.
@amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc on December 20, 2024 06:22.
@amritahs-ibm (Contributor Author)

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren (Collaborator) commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm (Contributor Author)

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007 (Contributor) commented Dec 24, 2024

I'll try to submit it today. 🤞

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm (Contributor Author)

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

@Djip007 (Contributor) left a comment

These are quick personal comments; wait for @slaren / @ggerganov before making changes.
Also, I read it very quickly.

Comment on lines +1052 to +1085
#define COMPUTE(ACC, c_idx, s_idx) \
    __builtin_mma_disassemble_acc(vec_C, ACC); \
    for (int i = 0; i < 4; i++) { \
        CA[i] = vec_splats((float)(((double)comparray[c_idx+i]) * -128.0)); \
        res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]); \
        fin_res[s_idx+i] = vec_madd(res[i], vs[s_idx+i], fin_res[s_idx+i]); \
    } \

#define SAVE_RES(ii, jj, RM, RN, idx) \
    for (int I = 0; I < RM; I++) { \
        for (int J = 0; J < RN; J++) { \
            *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
        } \
    } \

#define PERMUTE_VEC(IR1, IR2, IR3, IR4, vecOffset) \
    t1 = vec_perm(IR1, IR2, swiz1); \
    t2 = vec_perm(IR1, IR2, swiz2); \
    t3 = vec_perm(IR3, IR4, swiz1); \
    t4 = vec_perm(IR3, IR4, swiz2); \
    t5 = vec_perm(t1, t3, swiz3); \
    t6 = vec_perm(t1, t3, swiz4); \
    t7 = vec_perm(t2, t4, swiz3); \
    t8 = vec_perm(t2, t4, swiz4); \
    if (flip == true) { \
        t5 = vec_xor(t5, xor_vector); \
        t6 = vec_xor(t6, xor_vector); \
        t7 = vec_xor(t7, xor_vector); \
        t8 = vec_xor(t8, xor_vector); \
    } \
    vec_xst(t5, 0, vecOffset); \
    vec_xst(t6, 0, vecOffset+16); \
    vec_xst(t7, 0, vecOffset+32); \
    vec_xst(t8, 0, vecOffset+48); \
@Djip007 (Contributor)
In C++, I prefer not to use C-macros, but rather simple templates.

@amritahs-ibm (Contributor Author)

@Djip007
I used macros just to avoid some repetition. I can replace them with functions or templates.
From a performance viewpoint, and considering the nature of the above macros, would functions be better than templates?

@Djip007 (Contributor)

for example:

#define SAVE_RES(ii, jj, RM, RN, idx) \
    for (int I = 0; I < RM; I++) { \
        for (int J = 0; J < RN; J++) { \
            *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
        } \
    } \

I would use a template for RM/RN; that makes it more likely that the compiler unrolls the 2 loops for small values.
It may also sometimes be needed for various TYPEs.
But yes, no need for a template in most cases.
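
For illustration only, a minimal sketch of what such a template could look like in place of the SAVE_RES macro (it assumes C, ldc and fin_res are members of the surrounding tinyBLAS class, as in the macro above):

template <int RM, int RN>
void save_res(int64_t ii, int64_t jj, int idx) {
    // RM and RN are compile-time constants, so the compiler can fully
    // unroll both loops for the small tile sizes used here.
    for (int I = 0; I < RM; I++) {
        for (int J = 0; J < RN; J++) {
            *((float*)(C + ii + ((jj + J) * ldc) + I)) = *((float*)&fin_res[idx + I] + J);
        }
    }
}

A call such as SAVE_RES(ii, jj, 4, 8, 0) would then become save_res<4, 8>(ii, jj, 0).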

ggml/src/ggml-cpu/llamafile/sgemm.cpp
free(comparray);
}

void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN) {
@Djip007 (Contributor)

As for gemm, I would have made it a template:

template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)

That way you can remove the malloc/free.

@amritahs-ibm (Contributor Author)

@Djip007
I intend to use
std::array<int, 4> comparray;
inside gemm_small as well.

So is it necessary to make gemm_small a template? I am just wondering whether making gemm_small a template function would have a minor performance impact.

Please suggest.

@Djip007 (Contributor) Dec 27, 2024

With RM/RN as template parameters, the loops are simpler for the compiler to unroll or vectorise.
The only drawback is that the compiler has to create as many functions as there are template parameter combinations, so the code will be a bit bigger.
Sometimes I use https://godbolt.org/ to see what compilers can do.

}
}
__builtin_mma_disassemble_acc(vec_C, &acc_0);
TA *aoffset = const_cast<TA*>(A+(ii*lda)+l);
@Djip007 (Contributor)

is this const_cast needed?

@amritahs-ibm (Contributor Author)

@Djip007
When I try to cast it normally like this:
TA *aoffset = (TA*)(A+(ii*lda)+l);
I get this warning:
warning: cast from type ‘const block_q8_0*’ to type ‘block_q8_0*’ casts away qualifiers [-Wcast-qual]

Is there any other way to avoid this warning?

@Djip007 (Contributor)

I was thinking more about

const TA* aoffset = A+(ii*lda)+l;

i.e. there is no need to remove the const (and removing it is not a good idea for compiler optimisation). Or simply:

auto aoffset = A + ii*lda + l;

mc = 8;
nc = 8;
gemm<8,8>(m0, m, n0, n);
} else if (m_rem >= 8 && n_rem >= 8) {
@Djip007 (Contributor)

I am not sure the first 2 ifs are needed.
Or are you missing gemm<16,8> / gemm<8,16>?

@amritahs-ibm (Contributor Author) Dec 27, 2024

@Djip007
Yes, I had gemm<16,8> and gemm<8,16> earlier, but they had some performance problems, so I switched to gemm<8,8>.
I plan to work on solving the performance issues and then use the bigger kernels.

Is it OK if I retain this code as a placeholder for future use of gemm<16,8>/gemm<8,16>?

If not, I can remove the first 2 ifs now and add them back later if I solve the performance issues.
Please suggest.

@Djip007 (Contributor)

If you want to keep it for now, you can also leave it commented out and add a note saying why.
Otherwise, when someone reads the code, there is always a doubt as to whether there is an error ;)

int64_t ii = m0 + job / xtiles * RM;
int64_t jj = n0 + job % xtiles * RN;
//int *comparray = (int *)malloc(RM * sizeof(int), 4096);
int *comparray = (int *)malloc(RM * sizeof(int));
@Djip007 (Contributor)

Yes; if

template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)

then you can use

std::array<int, RM> comparray;
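
A minimal sketch of how that could look (the kernel body is elided; only the scratch-array change is shown, and it assumes <array> is included):

template <int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    // With RM known at compile time the scratch buffer can live on the
    // stack, replacing the malloc(RM * sizeof(int)) / free() pair.
    std::array<int, RM> comparray{};
    // ... existing gemm_small body, indexing comparray[0..RM-1] ...
}

Call sites would change from gemm_small(m0, m, n0, n, RM, RN) to gemm_small<RM, RN>(m0, m, n0, n), with RM and RN supplied as compile-time constants.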

Comment on lines +1594 to +1600
if (RM == 4 && RN == 8) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_4x8;
} else if (RM == 8 && RN == 4) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_8x4;
} else if (RM == 8 && RN == 8) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_8x8;
}
@Djip007 (Contributor)

In this case, I would have tried to make KERNEL_N_M a template; if that is not possible/simple, I would have done something like this, now that we can use C++17:

template<int RN, int RM>
inline void kernel(int64_t ii, int64_t jj) {
    if constexpr(RM == 4 && RN == 8) {
        KERNEL_4x8(ii,jj);
    } else if constexpr(RM == 8 && RN == 4) {
        KERNEL_8x4(ii,jj);
    } else if constexpr(RM == 8 && RN == 8) {
        KERNEL_8x8(ii,jj);
    } else {
        static_assert(false, "RN/RM values not supported");
    }
}

and use

    kernel<RM,RN>(ii, jj);

This removes the need for the member-function pointer, and it will not compile if used with unsupported RM/RN values.
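
One caveat with the sketch above (an observation, not something raised in the thread): before C++23, a plain static_assert(false, ...) in the discarded else branch can be rejected at template definition time, because its condition does not depend on the template parameters. A common workaround is a dependent false, for example:

// Helper with a hypothetical name: always false, but dependent on the
// template parameters, so the assertion only fires if an unsupported
// RM/RN combination is actually instantiated.
template <int, int>
inline constexpr bool unsupported_tile_v = false;

// ... then, in the else branch of kernel<RN, RM>:
//     static_assert(unsupported_tile_v<RM, RN>, "RN/RM values not supported");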

}

private:
void (tinyBLAS_Q0_PPC::*kernel)(int64_t, int64_t);
@Djip007 (Contributor)

I did not see this last time: it is dangerous, it is not thread safe.

@Djip007 (Contributor) commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed for such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
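
Conceptually, the flow looks something like the sketch below; the names and types here are purely illustrative and are not the actual ggml repacking interface:

#include <cstdint>
#include <vector>

// The weight matrix A is rearranged once, at model load time, into the tile
// layout the MMA kernel consumes; the hot matmul path then only repacks B.
struct PackedWeights {
    std::vector<int8_t> tiles;   // weights in kernel-friendly tile order
    int64_t rows;
    int64_t cols;
};

PackedWeights pack_weights_once(const int8_t *A, int64_t rows, int64_t cols) {
    PackedWeights p{std::vector<int8_t>(static_cast<size_t>(rows * cols)), rows, cols};
    // ... copy A into p.tiles in the blocked order expected by the kernel ...
    (void)A;
    return p;
}

void matmul_int8(const PackedWeights &A_packed, const int8_t *B, float *C, int64_t n) {
    // Only B is repacked here, once per call; A_packed is reused as-is.
    // ... repack B, run the MMA kernel, write the results to C ...
    (void)A_packed; (void)B; (void)C; (void)n;
}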
