[Snippets] SplitDimensionM: heuristic update #28180

Open · wants to merge 2 commits into base: master
2 changes: 1 addition & 1 deletion src/common/snippets/docs/mha_optimization_guide.md
@@ -65,7 +65,7 @@ The supported by decomposition Transpose orders are defined by `TokenizeMHASnippets`

[SplitDimensionM](../src/pass/split_dimension_m.cpp) splits M dimension of MHA in 2 parts (`batch_m` and `new_m`) by inserting Reshape on A input of the first Matmul and output of the second Matmul (the rest Subgraph's inputs are reshaped by Unsqueeze-like reshapes in order not to break subgraph semantic).
This optimization increases parallel work amount by `batch_m` times thus enabling a more efficient parallel execution in some cases.
-The splitting is performed based on heuristic algorithm which can be found in `SplitDimensionM::get_splited_dimensions` method.
+The splitting is performed based on heuristic algorithm which can be found in `SplitDimensionM::split` method.

Let's consider an example of the transformation:

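The shape arithmetic behind this transformation can be sketched standalone (a hypothetical helper for illustration, not the actual OpenVINO pass): the M dimension of a `[..., M, K]` shape is replaced by the pair `batch_m, new_m`, so that `batch_m` joins the parallel batch dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: the Reshape inserted on the A input of the first MatMul
// turns [..., M, K] into [..., batch_m, new_m, K] with M = batch_m * new_m.
std::vector<size_t> split_m_dim(const std::vector<size_t>& shape, size_t batch_m, size_t new_m) {
    const size_t m = shape[shape.size() - 2];
    assert(m == batch_m * new_m && "M must be split exactly");
    std::vector<size_t> result(shape.begin(), shape.end() - 2);
    result.push_back(batch_m);   // joins the parallel (batch) dimensions
    result.push_back(new_m);     // the new, smaller M processed by the kernel
    result.push_back(shape.back());
    return result;
}
```

For example, splitting `{1, 16384, 64}` with `batch_m = 8` yields `{1, 8, 2048, 64}`, increasing the parallel work amount 8x.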
@@ -67,7 +67,18 @@ class SplitDimensionM: public CommonOptimizations::SubgraphPass {

private:
static std::shared_ptr<ov::op::v0::MatMul> get_matmul(const std::shared_ptr<op::Subgraph>& subgraph);
-    static std::pair<size_t, size_t> get_splited_dimensions(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount);
+    /**
+     * @brief Contains splitM approaches allowing to get the batch ideally divisible by optimal_parallelism_work_amount
+     */
+    static std::pair<size_t, size_t> split_ideally(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount);
+    /**
+     * @brief Splits m_dim to minimize kernel_m in order to reduce waiting time for idle threads at the last parallel loop iteration.
+     */
+    static std::pair<size_t, size_t> split_minimize_kernel_wa(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount);
+    /**
+     * @brief Splits m_dim to get the batch in (optimal_parallelism_work_amount, 2 * optimal_parallelism_work_amount) interval
+     */
+    static std::pair<size_t, size_t> split_conservatively_increase_parallel_wa(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount);

void reshape_subgraph(const std::shared_ptr<op::Subgraph>& subgraph, const ov::Shape& shape, size_t batch_m_dim, size_t new_m_dim);

82 changes: 53 additions & 29 deletions src/common/snippets/src/pass/split_dimension_m.cpp
@@ -4,8 +4,8 @@

#include "snippets/pass/split_dimension_m.hpp"

-#include "snippets/utils/utils.hpp"
 #include "snippets/itt.hpp"
+#include "snippets/utils/utils.hpp"

namespace {
size_t get_dim_M(const ov::Shape& shape) {
@@ -31,45 +31,55 @@ bool SplitDimensionM::is_supported_matmul(const std::shared_ptr<const ov::Node>&
return matmul && !matmul->get_transpose_a() && !matmul->is_dynamic();
}

-std::pair<size_t, size_t> SplitDimensionM::get_splited_dimensions(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount) {
-    std::pair<size_t, size_t> splited = { 1, m_dim };
-
+std::pair<size_t, size_t> SplitDimensionM::split_ideally(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount) {
     // Ideal case #1: M can be split on the parts one of which complements the batch dimension to the optimal parallel work amount
     // In this case, each thread will execute the Snippets kernel once
     const size_t lower_bound = optimal_parallelism_work_amount / batch_dim;
-    if (lower_bound * batch_dim == optimal_parallelism_work_amount && m_dim % lower_bound == 0) {
-        splited.first = lower_bound;
-        splited.second = m_dim / lower_bound;
-        OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
-        return splited;
-    }
+    if (lower_bound * batch_dim == optimal_parallelism_work_amount && m_dim % lower_bound == 0)
+        return std::make_pair(lower_bound, m_dim / lower_bound);

     // Ideal case #2: M is divisible by optimal parallel work amount, and the new_m_dim is big enough
     // In this case, each thread will execute the Snippets kernel 'batch_dim' times
     if (m_dim % optimal_parallelism_work_amount == 0) {
         const auto new_m_dim = m_dim / optimal_parallelism_work_amount;
         const size_t min_kernel_m = 64;
-        if (new_m_dim >= min_kernel_m) {
-            splited.first = optimal_parallelism_work_amount;
-            splited.second = new_m_dim;
-            OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
-            return splited;
-        }
+        if (new_m_dim >= min_kernel_m)
+            return std::make_pair(optimal_parallelism_work_amount, new_m_dim);
     }

+    return std::make_pair(1, m_dim);
+}
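A standalone re-implementation of the two ideal cases (illustrative only; `optimal_wa` abbreviates the PR's `optimal_parallelism_work_amount`) reproduces the existing test expectations from `split_dim_m.cpp`:

```cpp
#include <cstddef>
#include <utility>

// Illustrative sketch of the "ideal" splitting cases, mirroring the PR's split_ideally.
std::pair<size_t, size_t> split_ideally(size_t batch_dim, size_t m_dim, size_t optimal_wa) {
    // Case #1: M contributes a factor that completes the batch to exactly optimal_wa kernels.
    const size_t lower_bound = optimal_wa / batch_dim;
    if (lower_bound * batch_dim == optimal_wa && m_dim % lower_bound == 0)
        return {lower_bound, m_dim / lower_bound};
    // Case #2: M is divisible by optimal_wa and the per-kernel M stays big enough.
    if (m_dim % optimal_wa == 0) {
        const size_t new_m_dim = m_dim / optimal_wa;
        const size_t min_kernel_m = 64;
        if (new_m_dim >= min_kernel_m)
            return {optimal_wa, new_m_dim};
    }
    return {1, m_dim};  // no ideal split found
}
```

For example, `split_ideally(5, 16384, 40)` returns `(8, 2048)` via case #1 (5 * 8 = 40 kernels, one per thread), while `split_ideally(5, 16384, 32)` returns `(32, 512)` via case #2.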

+std::pair<size_t, size_t> SplitDimensionM::split_conservatively_increase_parallel_wa(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount) {
+    std::pair<size_t, size_t> splited = { 1, m_dim };
     const size_t upper_bound = utils::div_up(2 * optimal_parallelism_work_amount, batch_dim);
     for (size_t divisor_0 = upper_bound - 1; divisor_0 > 1; divisor_0--) {
         size_t divisor_1 = m_dim / divisor_0;
-        if (divisor_1 * divisor_0 == m_dim) {
-            splited.first = divisor_0;
-            splited.second = divisor_1;
-            break;
-        }
+        if (divisor_1 * divisor_0 == m_dim)
+            return divisor_0 * batch_dim >= optimal_parallelism_work_amount ? std::make_pair(divisor_0, divisor_1) : splited;
     }
     OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
     return splited;
 }
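The conservative branch can also be sketched standalone (illustrative re-implementation; `div_up` is a local stand-in for `utils::div_up`): it scans candidate factors of M below roughly `2 * optimal_wa / batch` so that the resulting batch lands just above the optimal parallel work amount.

```cpp
#include <cstddef>
#include <utility>

// Local stand-in for utils::div_up.
size_t div_up(size_t a, size_t b) { return (a + b - 1) / b; }

// Illustrative sketch of the conservative split: keep batch_dim * batch_m within
// (optimal_wa, 2 * optimal_wa) by scanning candidate divisors of M downwards.
std::pair<size_t, size_t> split_conservatively(size_t batch_dim, size_t m_dim, size_t optimal_wa) {
    const std::pair<size_t, size_t> no_split{1, m_dim};
    const size_t upper_bound = div_up(2 * optimal_wa, batch_dim);
    for (size_t divisor_0 = upper_bound - 1; divisor_0 > 1; divisor_0--) {
        const size_t divisor_1 = m_dim / divisor_0;
        if (divisor_1 * divisor_0 == m_dim)  // exact factorization of M
            return divisor_0 * batch_dim >= optimal_wa ? std::pair<size_t, size_t>{divisor_0, divisor_1}
                                                       : no_split;
    }
    return no_split;
}
```

For the existing test row `(25, 50, 40)` this yields `(2, 25)`: the upper bound is `ceil(80 / 25) = 4`, divisor 2 factors M, and the batch grows to 2 * 25 = 50, which satisfies the 40-thread work amount.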

+std::pair<size_t, size_t> SplitDimensionM::split_minimize_kernel_wa(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount) {
+    constexpr size_t min_kernel_m = 32;
**Review comment (Contributor):**

In the second case in `compute_ideal_cases_heuristic`, `min_kernel_m` is 64, while here it is 32.
What about always using 64 and defining it as a static const attribute of the class?
Or is there a difference between the heuristics, and do we really need a smaller `min_kernel_m` in the aggressive one?

**Reply (Contributor, PR author):**

I agree that it's better to have one `min_kernel_m` value. But I think that it should be 32, not 64. The value 64 was set empirically to avoid the cases in which the external repacking feature doesn't work and the overhead of duplicated repacking inside the kernel outweighs the benefits of the splitting. If external repacking works (and it seems like it will work in all cases after the tokenization adjustments), we can easily lower `min_kernel_m` for `compute_ideal_cases_heuristic`.

+    std::pair<size_t, size_t> best_result = {1, m_dim};
+    for (size_t divisor = 2; divisor < std::sqrt(m_dim); ++divisor) {
+        if (m_dim % divisor != 0)
+            continue;
+        if (divisor >= min_kernel_m)
+            return std::make_pair(m_dim / divisor, divisor);
+        const size_t m_kernel = m_dim / divisor;
+        if (m_kernel >= min_kernel_m) {
+            best_result.first = divisor;
+            best_result.second = m_kernel;
+        }
**Review comment (Contributor) on lines +70 to +76:**

It's not clear from this code why we try to find the maximal divisor. If `divisor >= min_kernel_m` we return, but if `m_kernel >= min_kernel_m` we continue. Why?
It seems to me that this problem is symmetrical and should be treated accordingly:

1. If `m_dim % divisor == 0`, we can split M, great. There are 2 ways to do that:
   a. `(m_dim / divisor, divisor)`
   b. `(divisor, m_dim / divisor)`
2. These 2 ways are absolutely identical except that this always holds: `divisor < (m_dim / divisor)`.
3. It means that one way to split optimally is to start from the max divisor (`= sqrt(m_dim)`), go downward and return as soon as the parallel work is sufficient: `if (batch_dim * m_dim / divisor >= optimal_parallelism_work_amount)`. This way we'll make sure that both the kernel WA is maximal (since we're going downwards) and the parallel WA is optimal.
4. Alternatively, we can start from the minimal acceptable divisor, which should be `min_kernel_m`, go upward and return as soon as `m_dim % divisor == 0`. This way we'll guarantee that the kernel work amount is larger than the minimal one (since we started from `min_kernel_m`) and the parallel work amount is maximal (since `m_dim / divisor` is decreasing). But that's not the case in this particular function, since we want to maximize the kernel WA.
5. Sometimes `min_kernel_m` and `optimal_parallelism_work_amount` can be mutually exclusive, so we should think carefully about which is more important. I guess that the parallel work amount should be prioritized, so the approach from 3 should be used.
6. It looks like we try to implement a mix of the above strategies here: we inspect both the a and b splits, and return a if the min kernel WA is achieved and b if the parallel WA is satisfied. This may be inconsistent in some circumstances, especially when the parallel & kernel WA limitations can't be fulfilled simultaneously.

**Reply (Contributor, PR author):**

The main thing I want to point out is that this heuristic maximizes `m_batch` (= minimizes `m_kernel`) and has the limitation `m_kernel >= min_kernel_m`. In other words, we try to find an `m_kernel` bigger than `min_kernel_m` and at the same time as close as possible to this value.
So `batch_dim * m_dim / divisor >= optimal_parallelism_work_amount` is not a sufficient criterion. Moreover, we can start this heuristic when `batch_dim` is already bigger than `optimal_parallelism_work_amount`. For the motivation of this strategy, please refer to the function's description:

> Splits m_dim to minimize kernel_m in order to reduce waiting time for idle threads at the last parallel loop iteration.

My logic is structured in the following way (taking into account that `divisor` is ascending) for the splitting candidates:

- If `divisor` is more than `min_kernel_m`, strategy (a) is used, and we can guarantee that this is the most optimal implementation from the `m_kernel` minimization perspective.
- If `divisor` is less than `min_kernel_m`, strategy (b) is used. But it is not guaranteed that the current `m_kernel = m_dim / divisor` is minimal: one of the next divisors from the `(divisor, sqrt(m_dim))` interval can be more optimal.

Alternatively, I can implement the same logic via 2 `for` loops with different traversal directions:

1. `for (size_t divisor = min_kernel_m; divisor < std::sqrt(m_dim); ++divisor)`
2. `for (size_t divisor = min_kernel_m - 1; divisor > 1; --divisor)`

> Sometimes these min_kernel_m and optimal_parallelism_work_amount can be mutually exclusive, so we should think carefully which is more important. I guess that the parallel work amount should be prioritized, so the approach from 3 should be used.

This is true. But the current heuristic covers the most important cases (big shapes in the SD topology), at least on the machines where these changes were tested. And we agreed offline that we need to limit these changes' impact on other topologies.
Anyway, if I have time during validation, I will try to further tune the heuristic to cover the described situation.

+    }
+    if (best_result.first * batch_dim >= optimal_parallelism_work_amount)
+        return best_result;
+    return std::make_pair(1, m_dim);
+}
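The behaviour debated in the thread above can be checked with a standalone re-implementation (illustrative only; `optimal_wa` abbreviates the PR's parameter name). For M = 4097 = 17 * 241, no divisor >= 32 exists below sqrt(4097), so only strategy (b) fires and the tracked best pair `(17, 241)` is returned; for M = 6600, divisor 33 is the first divisor >= 32, so strategy (a) returns `(200, 33)` immediately.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// Illustrative sketch of the kernel-minimizing split: find the smallest kernel M
// that is still >= min_kernel_m, which maximizes the parallel batch dimension.
std::pair<size_t, size_t> split_minimize_kernel_wa(size_t batch_dim, size_t m_dim, size_t optimal_wa) {
    constexpr size_t min_kernel_m = 32;
    std::pair<size_t, size_t> best_result{1, m_dim};
    for (size_t divisor = 2; divisor < std::sqrt(m_dim); ++divisor) {
        if (m_dim % divisor != 0)
            continue;
        if (divisor >= min_kernel_m)  // strategy (a): the divisor itself is the kernel M
            return {m_dim / divisor, divisor};
        const size_t m_kernel = m_dim / divisor;  // strategy (b): the co-divisor is the kernel M
        if (m_kernel >= min_kernel_m)
            best_result = {divisor, m_kernel};
    }
    if (best_result.first * batch_dim >= optimal_wa)
        return best_result;
    return {1, m_dim};  // no acceptable split
}
```

These two inputs correspond exactly to the new rows added to the `split_dimension_cases` test table in this PR.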

bool SplitDimensionM::can_be_optimized(const std::shared_ptr<const ov::Node>& node, size_t concurrency) {
if (!is_supported_matmul(node))
return false;
@@ -131,16 +141,30 @@ bool SplitDimensionM::split(const ov::Shape& shape, size_t optimal_parallelism_work_amount, size_t& batch_m_dim, size_t& new_m_dim) {
if (is_prime_number(m_dim))
return false;

-    auto is_optimized = [&](size_t batch_dim) {
-        return batch_dim >= optimal_parallelism_work_amount;
-    };
-
     // We skip optimization if the current batch is optimal for concurrency
-    if (is_optimized(batch_dim))
+    if (batch_dim % optimal_parallelism_work_amount == 0)
return false;

-    std::tie(batch_m_dim, new_m_dim) = get_splited_dimensions(batch_dim, m_dim, optimal_parallelism_work_amount);
-    return is_optimized(batch_dim * batch_m_dim);
+    auto split_is_done = [&batch_m_dim]() {
+        return batch_m_dim != 1;
+    };
+
+    std::tie(batch_m_dim, new_m_dim) = split_ideally(batch_dim, m_dim, optimal_parallelism_work_amount);
+    if (split_is_done())
+        return true;
+
+    // If M dim is big enough, aggressive heuristic is used for kernel_m minimization.
+    // For smaller M dim, conservative heuristic is used to preserve old behaviour.
+    const bool big_m_dim = m_dim >= 4000;
**Review comment (Contributor) on lines +156 to +158:**

By the way, we could also support the case with small M.
If `batch < optimal_parallelism_work_amount` and M is quite small (for example, M < 64), nothing needs to be updated or split: let's execute as is.

I don't insist on doing it in this PR, but I have some models (for example, action-recognition or levit) with small values of batch and M where this pass is applied, which results in M = 4 or even M = 1. And this leads to a perf degradation.

**Reply (Contributor, PR author):**

I like your idea. In third-party brgemm heuristics, I saw that the minimal allowed `m_kernel` is 16. Probably we can take this into account in our heuristics.
But it's also important that `SplitDimensionM::split` is used in the CPU callback (via `can_be_optimized`), so if it returns false, the MHA tokenization doesn't happen. So another question we need to answer is whether we even need to tokenize such MHAs.

+    if (big_m_dim) {
+        std::tie(batch_m_dim, new_m_dim) = split_minimize_kernel_wa(batch_dim, m_dim, optimal_parallelism_work_amount);
+        if (split_is_done())
+            return true;
+    }
+    if (batch_dim < optimal_parallelism_work_amount) {
+        std::tie(batch_m_dim, new_m_dim) = split_conservatively_increase_parallel_wa(batch_dim, m_dim, optimal_parallelism_work_amount);
+    }
+    return split_is_done();
}

void SplitDimensionM::reshape_subgraph(const std::shared_ptr<op::Subgraph>& subgraph, const ov::Shape& shape, size_t batch_m_dim, size_t new_m_dim) {
2 changes: 2 additions & 0 deletions src/common/snippets/tests/src/utils/split_dim_m.cpp
@@ -59,6 +59,8 @@ const std::vector<SplitDimensionMParams> split_dimension_cases = {
     {InputData{25, 50, 40}, ReferenceData{true, 2, 25}},
     {InputData{5, 16384, 40}, ReferenceData{true, 8, 2048}},
     {InputData{5, 16384, 32}, ReferenceData{true, 32, 512}},
+    {InputData{48, 4097, 32}, ReferenceData{true, 17, 241}},
+    {InputData{48, 6600, 32}, ReferenceData{true, 200, 33}},
};

INSTANTIATE_TEST_SUITE_P(smoke_Snippets_SplitDimensionM,