Enable intel devices CPU/XPU/HPU for python backend #245
base: main
Conversation
Force-pushed from 4c09b22 to 4d285bd.
Signed-off-by: yuanwu <[email protected]>
@OlivierDehaene @Narsil Please help review this PR.
add python backend support for xlm-roberta type model
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: yuanwu <[email protected]>
add XPU and HPU support
Signed-off-by: kaixuanliu <[email protected]>
add import ipex
Shouldn't it be the same as https://github.com/huggingface/tei-gaudi/blob/habana-main/backends/python/server/requirements.txt ?
Yes, they are the same, except that I deleted some unused Python packages here.
I was asking because they seem outdated (e.g. optimum-habana == 1.12.0), probably because this PR was opened before the release of optimum-habana 1.13. Can you update this file?
Oh, yes. I will change it.
I have updated it. By the way, we just updated the HPU model forward implementation (using FlashBert); can you take another look?
LGTM
backends/src/lib.rs (outdated)
#[instrument(skip(self))]
pub async fn warmup_hpu(
    &self,
    mut max_input_length: usize,
    max_token: usize,
    max_bs: Option<usize>
) -> Result<(), BackendError> {
    let read_env_var = |key: &str, default: usize| -> usize {
        env::var(key).ok().map_or(default, |value| value.parse::<usize>().unwrap())
    };
    let seq_bucket_size: usize = read_env_var("PAD_SEQUENCE_TO_MULTIPLE_OF", 128);
    let max_warmup_length: usize = read_env_var("MAX_WARMUP_SEQUENCE_LENGTH", 1024);

    let max_batch_size = match max_bs {
        Some(value) => value as usize,
        None => read_env_var("MAX_WARMUP_BATCH_SIZE", 8),
    };

    let mut batch_sizes: Vec<usize> = powers_of_two(max_batch_size);
    if let Some(&last) = batch_sizes.last() {
        if last < max_batch_size {
            batch_sizes.push(max_batch_size);
        }
    }
    if max_warmup_length > max_input_length {
        return Err(BackendError::Start(
            format!("max_warmup_length ({max_warmup_length}) exceeds model's max_input_length ({max_input_length}), you can modify this value adding `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_warmup_length>` to your Docker run command")
        ));
    }
    if seq_bucket_size > max_warmup_length {
        return Err(BackendError::Start(
            format!("PAD_SEQUENCE_TO_MULTIPLE_OF ({seq_bucket_size}) exceeds model's max warmup length ({max_warmup_length}), you can modify these values adding `-e PAD_SEQUENCE_TO_MULTIPLE_OF=<new_value>` or `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_value> to your Docker run command`")
        ));
    }

    max_input_length = std::cmp::min(max_input_length, max_warmup_length);
    let mut seq_lengths: Vec<usize> = (seq_bucket_size..max_input_length + 1).step_by(seq_bucket_size as usize).collect();
    if let Some(&last) = seq_lengths.last() {
        if last < max_input_length {
            seq_lengths.push(max_input_length);
        }
    }

    let mut shapes: Vec<(u32, u32)> = Vec::with_capacity(batch_sizes.len() * seq_lengths.len());
    for batch_size in &batch_sizes {
        for seq_length in &seq_lengths {
            shapes.push((*batch_size as u32, *seq_length as u32));
        }
    }
    for shape in shapes.iter() {
        let batch = self.create_warmup_batch(*shape, max_token as u32);
        match &self.model_type {
            ModelType::Classifier => self.predict(batch).await.map(|_| ()),
            ModelType::Embedding(_) => self.embed(batch).await.map(|_| ()),
        }?;
        tracing::info!("finish warmup for batch: {}, length: {}", shape.0, shape.1);
    }
    Ok(())
}

#[instrument(skip_all)]
pub fn create_warmup_batch(
    &self,
    shape: (u32, u32),
    max_token: u32,
) -> Batch {
    let (batch_size, length) = shape;
    let mut batched_input_ids = Vec::new();
    let mut batched_token_type_ids = Vec::new();
    let mut batched_position_ids = Vec::new();
    let mut cumulative_seq_lengths = Vec::with_capacity(batch_size as usize + 1);
    let mut pooled_indices = Vec::with_capacity(batch_size as usize);
    cumulative_seq_lengths.push(0);
    let input_ids: Vec<u32> = (0..length).map(|_| rand::thread_rng().gen_range(0..max_token)).collect();
    let token_type_ids: Vec<u32> = vec![0; length as usize];
    let position_ids: Vec<u32> = (0..length).collect();
    let mut current_length = 0;
    for batch_id in 0..batch_size {
        batched_input_ids.extend(input_ids.iter().cloned());
        batched_token_type_ids.extend(token_type_ids.iter().cloned());
        batched_position_ids.extend(position_ids.iter().cloned());
        current_length += input_ids.len();
        cumulative_seq_lengths.push(current_length as u32);
        pooled_indices.push(batch_id);
    }
    Batch {
        input_ids: batched_input_ids,
        token_type_ids: batched_token_type_ids,
        position_ids: batched_position_ids,
        cumulative_seq_lengths,
        max_length: length,
        pooled_indices,
        raw_indices: vec![],
    }
}
This is the same as in https://github.com/huggingface/tei-gaudi/blob/habana-main/backends/src/lib.rs, right?
Yes
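For intuition only, here is a rough Python sketch (not part of the PR) of the warmup shape grid that the Rust code above derives from PAD_SEQUENCE_TO_MULTIPLE_OF, MAX_WARMUP_SEQUENCE_LENGTH, and MAX_WARMUP_BATCH_SIZE. It assumes powers_of_two returns the powers of two up to its argument, and it skips the error checks and the clamp against the model's max_input_length.

import os

def powers_of_two(limit):
    # assumption: mirrors the Rust helper, yielding 1, 2, 4, ... up to limit
    out, n = [], 1
    while n <= limit:
        out.append(n)
        n *= 2
    return out

seq_bucket = int(os.getenv("PAD_SEQUENCE_TO_MULTIPLE_OF", "128"))
max_warmup_len = int(os.getenv("MAX_WARMUP_SEQUENCE_LENGTH", "1024"))
max_batch = int(os.getenv("MAX_WARMUP_BATCH_SIZE", "8"))

batch_sizes = powers_of_two(max_batch)
if batch_sizes[-1] < max_batch:
    batch_sizes.append(max_batch)

seq_lengths = list(range(seq_bucket, max_warmup_len + 1, seq_bucket))
if seq_lengths[-1] < max_warmup_len:
    seq_lengths.append(max_warmup_len)

# Every (batch_size, sequence_length) pair gets one warmup forward pass.
shapes = [(b, s) for b in batch_sizes for s in seq_lengths]
print(shapes)  # with the defaults: batch sizes 1, 2, 4, 8 x lengths 128, 256, ..., 1024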
Signed-off-by: kaixuanliu <[email protected]>
add hpu flashBert support
The HPU-specific changes look good to me; I didn't check the rest.
Signed-off-by: kaixuanliu <[email protected]>
nice code
Dockerfile-intel (outdated)
Maybe better to call this file Dockerfile-hpu, since it is for Gaudi only, right? To stay consistent with requirements-hpu.txt and requirements-intel.txt.
Well, Dockerfile-intel is for all Intel platforms (CPU, XPU, and HPU); we use build-args to separate them. requirements-intel.txt is for CPU and XPU, while requirements-hpu.txt is for HPU only.
Ah okay, I didn't see the build-args. Thanks!
@OlivierDehaene, could you help review? Thanks!
Cargo fmt and ruff format
@OlivierDehaene, hi, can you help review? Thanks!
@OlivierDehaene, could you help review? Thanks!
Unused imports and better imports
Signed-off-by: kaixuanliu <[email protected]>
Signed-off-by: kaixuanliu <[email protected]>
@OlivierDehaene @Narsil, can you help review? Thanks!
upgrade xpu-ipex to 2.3.110
- cpu_results = embedding.view(-1).tolist()
+ cpu_results = embedding.reshape(-1).tolist()
A view costs less than a reshape; why change it?
Here, if we send batched requests like:
curl 127.0.0.1:8080/embed -X POST -d '{"inputs":["What is Deep Learning?", "It is a lovely day"]}' -H 'Content-Type: application/json'
it will return the error: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
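For illustration, a minimal standalone repro of that failure mode (hypothetical shapes, not the backend's actual tensors): once a tensor is non-contiguous, .view() raises exactly this error, while .reshape() falls back to a copy.

import torch

# transpose() makes the tensor non-contiguous, standing in for whatever
# leaves the batched embedding tensor non-contiguous in the backend
embedding = torch.randn(2, 768).transpose(0, 1)
try:
    embedding.view(-1).tolist()
except RuntimeError as err:
    print(err)  # view size is not compatible with input tensor's size and stride ...
flat = embedding.reshape(-1).tolist()  # reshape copies when a zero-copy view is impossible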
def hpu_add_layer_norm(
    add: torch.Tensor,
    x: torch.Tensor,
    weight: torch.Tensor,
    bias: torch.Tensor,
    epsilon: float,
    add_back: bool,
):
    if add is not None:
        added_tensor = torch.add(add, x, alpha=1.0)
        output = F.layer_norm(added_tensor, [x.size(-1)], weight, bias, epsilon)
        if add_back:
            add.add_(x)
        return output
    else:
        return F.layer_norm(x, [x.size(-1)], weight=weight, bias=bias, eps=epsilon)
It might make sense to move this to a file and import it only for the HPU device, same as https://github.com/huggingface/text-embeddings-inference/pull/245/files#diff-0974ea7d63e0618f6efe7ab5bdfd6ff7102d5858d241b01448214588dd0bc1cdR49
Can I create a Python file hpu_op.py under the path backends/python/server/text_embeddings_server/utils/ and put the hpu_add_layer_norm function in this file?
Sorry for the late response, I was out of office. Looking back at it, this method only depends on torch and is generic enough, so there is no need for an HPU ops file.
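For reference, a small self-contained sketch (illustrative shapes, not taken from the PR) of the two branches hpu_add_layer_norm implements:

import torch
import torch.nn.functional as F

hidden = 768
x = torch.randn(4, hidden)         # sub-layer output
residual = torch.randn(4, hidden)  # the "add" argument (residual stream)
weight, bias, eps = torch.ones(hidden), torch.zeros(hidden), 1e-12

# With a residual: layer norm over (residual + x); with add_back=True the
# residual tensor is also updated in place so the next layer sees the sum.
out = F.layer_norm(residual + x, [hidden], weight, bias, eps)
residual.add_(x)

# Without a residual (add is None): a plain layer norm over x.
out_plain = F.layer_norm(x, [hidden], weight=weight, bias=bias, eps=eps)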
@IlyasMoutawwakil, hi, do you have other comments on this PR?
Signed-off-by: Liu, Kaixuan <[email protected]>
fix conflict env
What does this PR do?
Enable Intel CPU/XPU/HPU devices for the Python backend.
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@OlivierDehaene OR @Narsil