
llamafile_sgemm API - INT8 implementation #10912

Open · wants to merge 1 commit into master from sgemm_q8
Conversation

@amritahs-ibm (Contributor) commented Dec 20, 2024

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

The github-actions bot added the labels "testing (Everything test related)" and "ggml (changes relating to the ggml tensor library for machine learning)" on Dec 20, 2024.
@amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc on December 20, 2024 06:22.
@amritahs-ibm (Contributor Author)

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren (Collaborator) commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm (Contributor Author)

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007 (Contributor) commented Dec 24, 2024

I'll try to submit it today. 🤞

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm (Contributor Author)

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

@Djip007 (Contributor) left a comment

These are quick personal comments; wait for @slaren / @ggerganov before making changes.
Also, I read it very quickly.

Comment on lines +1052 to +1085
#define COMPUTE(ACC, c_idx, s_idx) \
    __builtin_mma_disassemble_acc(vec_C, ACC); \
    for (int i = 0; i < 4; i++) { \
        CA[i] = vec_splats((float)(((double)comparray[c_idx+i]) * -128.0)); \
        res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]); \
        fin_res[s_idx+i] = vec_madd(res[i], vs[s_idx+i], fin_res[s_idx+i]); \
    } \

#define SAVE_RES(ii, jj, RM, RN, idx) \
    for (int I = 0; I < RM; I++) { \
        for (int J = 0; J < RN; J++) { \
            *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
        } \
    } \

#define PERMUTE_VEC(IR1, IR2, IR3, IR4, vecOffset) \
    t1 = vec_perm(IR1, IR2, swiz1); \
    t2 = vec_perm(IR1, IR2, swiz2); \
    t3 = vec_perm(IR3, IR4, swiz1); \
    t4 = vec_perm(IR3, IR4, swiz2); \
    t5 = vec_perm(t1, t3, swiz3); \
    t6 = vec_perm(t1, t3, swiz4); \
    t7 = vec_perm(t2, t4, swiz3); \
    t8 = vec_perm(t2, t4, swiz4); \
    if (flip == true) { \
        t5 = vec_xor(t5, xor_vector); \
        t6 = vec_xor(t6, xor_vector); \
        t7 = vec_xor(t7, xor_vector); \
        t8 = vec_xor(t8, xor_vector); \
    } \
    vec_xst(t5, 0, vecOffset); \
    vec_xst(t6, 0, vecOffset+16); \
    vec_xst(t7, 0, vecOffset+32); \
    vec_xst(t8, 0, vecOffset+48); \
@Djip007 (Contributor)
In C++, I prefer not to use C-macros, but rather simple templates.

@amritahs-ibm (Contributor Author)

@Djip007
I used macros just to avoid some repetition. I can replace them with functions or templates.
From a performance viewpoint, and considering the nature of the above macros, would functions be better than templates?

@Djip007 (Contributor)

for example:

#define SAVE_RES(ii, jj, RM, RN, idx) \
    for (int I = 0; I < RM; I++) { \
        for (int J = 0; J < RN; J++) { \
            *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
        } \
    } \

I would use a template for RM/RN; that makes it more likely that the compiler unrolls the 2 loops for small values.
It may also sometimes be needed for various TYPEs.
But yes, no need for a template in most cases.
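
For illustration only, a minimal sketch of what such a template could look like in place of the SAVE_RES macro (it assumes C, ldc and fin_res are members of the surrounding tinyBLAS class, as in the macro above):

template <int RM, int RN>
void save_res(int64_t ii, int64_t jj, int idx) {
    // RM and RN are compile-time constants, so the compiler can fully
    // unroll both loops for the small tile sizes used here.
    for (int I = 0; I < RM; I++) {
        for (int J = 0; J < RN; J++) {
            *((float*)(C + ii + ((jj + J) * ldc) + I)) = *((float*)&fin_res[idx + I] + J);
        }
    }
}

A call such as SAVE_RES(ii, jj, 4, 8, 0) would then become save_res<4, 8>(ii, jj, 0).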

ggml/src/ggml-cpu/llamafile/sgemm.cpp
free(comparray);
}

void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN) {
@Djip007 (Contributor)

As for gemm, I would have made it a template:

template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)

That way you can remove the malloc/free.

@amritahs-ibm (Contributor Author)

@Djip007
I intend to use
std::array<int, 4> comparray;
inside gemm_small as well.

So is it necessary to make gemm_small a template? I am just wondering whether making gemm_small a template function would have a minor performance impact.

Please suggest.

@Djip007 (Contributor) Dec 27, 2024

With RM/RN as template parameters, the loops are simpler for the compiler to unroll or vectorise.
The only drawback is that the compiler has to create as many functions as there are template parameter combinations, so the code will be a bit bigger.
Sometimes I use https://godbolt.org/ to see what compilers can do.

}
}
__builtin_mma_disassemble_acc(vec_C, &acc_0);
TA *aoffset = const_cast<TA*>(A+(ii*lda)+l);
@Djip007 (Contributor)

is this const_cast needed?

@amritahs-ibm (Contributor Author)

@Djip007
When I try to cast it normally like this:
TA *aoffset = (TA*)(A+(ii*lda)+l);
I get this warning:
warning: cast from type ‘const block_q8_0*’ to type ‘block_q8_0*’ casts away qualifiers [-Wcast-qual]

Is there any other way to avoid this warning?

@Djip007 (Contributor)

I was thinking more about

const TA* aoffset = A+(ii*lda)+l;

i.e. there is no need to remove the const (and removing it is not a good idea for compiler optimisation). Or simply:

auto aoffset = A + ii*lda + l;

mc = 8;
nc = 8;
gemm<8,8>(m0, m, n0, n);
} else if (m_rem >= 8 && n_rem >= 8) {
@Djip007 (Contributor)

I am not sure the first 2 ifs are needed.
Or are you missing gemm<16,8> / gemm<8,16>?

@amritahs-ibm (Contributor Author) Dec 27, 2024

@Djip007
Yes, I had gemm<16,8> and gemm<8,16> earlier, but they had some performance problems, so I switched to gemm<8,8>.
I plan to work on solving the performance issues and then use the bigger kernels.

Is it OK if I retain this code as a placeholder for future use of gemm<16,8>/gemm<8,16>?

If not, I can remove the first 2 ifs now and add them back later if I solve the performance issues.
Please suggest.

@Djip007 (Contributor)

If you want to keep it for now, you can also leave it commented out and add a note saying why.
Otherwise, when someone reads the code, there is always a doubt as to whether there is an error ;)

int64_t ii = m0 + job / xtiles * RM;
int64_t jj = n0 + job % xtiles * RN;
//int *comparray = (int *)malloc(RM * sizeof(int), 4096);
int *comparray = (int *)malloc(RM * sizeof(int));
@Djip007 (Contributor)

Yes; if

template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)

then you can use

std::array<int, RM> comparray;
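
A minimal sketch of how that could look (the kernel body is elided; only the scratch-array change is shown, and it assumes <array> is included):

template <int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    // With RM known at compile time the scratch buffer can live on the
    // stack, replacing the malloc(RM * sizeof(int)) / free() pair.
    std::array<int, RM> comparray{};
    // ... existing gemm_small body, indexing comparray[0..RM-1] ...
}

Call sites would change from gemm_small(m0, m, n0, n, RM, RN) to gemm_small<RM, RN>(m0, m, n0, n), with RM and RN supplied as compile-time constants.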

Comment on lines +1594 to +1600
if (RM == 4 && RN == 8) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_4x8;
} else if (RM == 8 && RN == 4) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_8x4;
} else if (RM == 8 && RN == 8) {
kernel = &tinyBLAS_Q0_PPC::KERNEL_8x8;
}
@Djip007 (Contributor)

In this case, I would have tried to make KERNEL_N_M a template; if that is not possible/simple, I would have done something like this, now that we can use C++17:

template<int RN, int RM>
inline void kernel(int64_t ii, int64_t jj) {
    if constexpr(RM == 4 && RN == 8) {
        KERNEL_4x8(ii,jj);
    } else if constexpr(RM == 8 && RN == 4) {
        KERNEL_8x4(ii,jj);
    } else if constexpr(RM == 8 && RN == 8) {
        KERNEL_8x8(ii,jj);
    } else {
        static_assert(false, "RN/RM values not supported");
    }
}

and use

    kernel<RM,RN>(ii, jj);

This removes the need for the member-function pointer, and it will not compile if used with unsupported RM/RN values.
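
One caveat with the sketch above (an observation, not something raised in the thread): before C++23, a plain static_assert(false, ...) in the discarded else branch can be rejected at template definition time, because its condition does not depend on the template parameters. A common workaround is a dependent false, for example:

// Helper with a hypothetical name: always false, but dependent on the
// template parameters, so the assertion only fires if an unsupported
// RM/RN combination is actually instantiated.
template <int, int>
inline constexpr bool unsupported_tile_v = false;

// ... then, in the else branch of kernel<RN, RM>:
//     static_assert(unsupported_tile_v<RM, RN>, "RN/RM values not supported");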

}

private:
void (tinyBLAS_Q0_PPC::*kernel)(int64_t, int64_t);
@Djip007 (Contributor)

I did not see this last time: it is dangerous, it is not thread safe.

@Djip007 (Contributor) commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed for such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
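
Conceptually, the flow looks something like the sketch below; the names and types here are purely illustrative and are not the actual ggml repacking interface:

#include <cstdint>
#include <vector>

// The weight matrix A is rearranged once, at model load time, into the tile
// layout the MMA kernel consumes; the hot matmul path then only repacks B.
struct PackedWeights {
    std::vector<int8_t> tiles;   // weights in kernel-friendly tile order
    int64_t rows;
    int64_t cols;
};

PackedWeights pack_weights_once(const int8_t *A, int64_t rows, int64_t cols) {
    PackedWeights p{std::vector<int8_t>(static_cast<size_t>(rows * cols)), rows, cols};
    // ... copy A into p.tiles in the blocked order expected by the kernel ...
    (void)A;
    return p;
}

void matmul_int8(const PackedWeights &A_packed, const int8_t *B, float *C, int64_t n) {
    // Only B is repacked here, once per call; A_packed is reused as-is.
    // ... repack B, run the MMA kernel, write the results to C ...
    (void)A_packed; (void)B; (void)C; (void)n;
}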
