Enable intel devices CPU/XPU/HPU for python backend #245

Open
wants to merge 39 commits into main from the ipex branch

Conversation


@yuanwu2017 yuanwu2017 commented Apr 22, 2024

What does this PR do?

Enable CPU device for python backend

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@OlivierDehaene OR @Narsil

@yuanwu2017 yuanwu2017 force-pushed the ipex branch 2 times, most recently from 4c09b22 to 4d285bd on April 22, 2024 at 20:41
@yuanwu2017 yuanwu2017 marked this pull request as draft April 23, 2024 01:17
@yuanwu2017 yuanwu2017 changed the title from "Enable the IPEX optimization for python backend" to "Enable CPU device for python backend" on Apr 23, 2024
@yuanwu2017 yuanwu2017 marked this pull request as ready for review June 23, 2024 14:40
@yuanwu2017 (Author)

@OlivierDehaene @Narsil Please help review.

yuanwu2017 and others added 11 commits July 17, 2024 16:22
add python backend support for xlm-roberta type model
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: yuanwu <[email protected]>
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: Liu, Kaixuan <[email protected]>
add XPU and HPU support
@yuanwu2017 yuanwu2017 changed the title from "Enable CPU device for python backend" to "Enable intel devices CPU/XPU/HPU for python backend" on Aug 20, 2024
kaixuanliu and others added 2 commits August 22, 2024 05:10

Yes, they are the same, except that here I deleted some unused Python packages.

I was asking because they seem outdated (e.g. optimum-habana == 1.12.0), probably because this PR was opened before the release of optimum-habana 1.13. Can you update this file?

Oh, yes. I will change it.

I have updated it. BTW, we just updated the HPU model forward implementation (using FlashBert); can you take another look?

LGTM

Comment on lines 99 to 194
    #[instrument(skip(self))]
    pub async fn warmup_hpu(
        &self,
        mut max_input_length: usize,
        max_token: usize,
        max_bs: Option<usize>,
    ) -> Result<(), BackendError> {
        let read_env_var = |key: &str, default: usize| -> usize {
            env::var(key).ok().map_or(default, |value| value.parse::<usize>().unwrap())
        };
        let seq_bucket_size: usize = read_env_var("PAD_SEQUENCE_TO_MULTIPLE_OF", 128);
        let max_warmup_length: usize = read_env_var("MAX_WARMUP_SEQUENCE_LENGTH", 1024);

        let max_batch_size = match max_bs {
            Some(value) => value as usize,
            None => read_env_var("MAX_WARMUP_BATCH_SIZE", 8),
        };

        let mut batch_sizes: Vec<usize> = powers_of_two(max_batch_size);
        if let Some(&last) = batch_sizes.last() {
            if last < max_batch_size {
                batch_sizes.push(max_batch_size);
            }
        }
        if max_warmup_length > max_input_length {
            return Err(BackendError::Start(
                format!("max_warmup_length ({max_warmup_length}) exceeds model's max_input_length ({max_input_length}), you can modify this value adding `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_warmup_length>` to your Docker run command")
            ));
        }
        if seq_bucket_size > max_warmup_length {
            return Err(BackendError::Start(
                format!("PAD_SEQUENCE_TO_MULTIPLE_OF ({seq_bucket_size}) exceeds model's max warmup length ({max_warmup_length}), you can modify these values adding `-e PAD_SEQUENCE_TO_MULTIPLE_OF=<new_value>` or `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_value> to your Docker run command`")
            ));
        }

        max_input_length = std::cmp::min(max_input_length, max_warmup_length);
        let mut seq_lengths: Vec<usize> = (seq_bucket_size..max_input_length + 1)
            .step_by(seq_bucket_size as usize)
            .collect();
        if let Some(&last) = seq_lengths.last() {
            if last < max_input_length {
                seq_lengths.push(max_input_length);
            }
        }

        let mut shapes: Vec<(u32, u32)> = Vec::with_capacity(batch_sizes.len() * seq_lengths.len());
        for batch_size in &batch_sizes {
            for seq_length in &seq_lengths {
                shapes.push((*batch_size as u32, *seq_length as u32));
            }
        }
        for shape in shapes.iter() {
            let batch = self.create_warmup_batch(*shape, max_token as u32);
            match &self.model_type {
                ModelType::Classifier => self.predict(batch).await.map(|_| ()),
                ModelType::Embedding(_) => self.embed(batch).await.map(|_| ()),
            }?;
            tracing::info!("finish warmup for batch: {}, length: {}", shape.0, shape.1);
        }
        Ok(())
    }

    #[instrument(skip_all)]
    pub fn create_warmup_batch(
        &self,
        shape: (u32, u32),
        max_token: u32,
    ) -> Batch {
        let (batch_size, length) = shape;
        let mut batched_input_ids = Vec::new();
        let mut batched_token_type_ids = Vec::new();
        let mut batched_position_ids = Vec::new();
        let mut cumulative_seq_lengths = Vec::with_capacity(batch_size as usize + 1);
        let mut pooled_indices = Vec::with_capacity(batch_size as usize);
        cumulative_seq_lengths.push(0);
        let input_ids: Vec<u32> = (0..length)
            .map(|_| rand::thread_rng().gen_range(0..max_token))
            .collect();
        let token_type_ids: Vec<u32> = vec![0; length as usize];
        let position_ids: Vec<u32> = (0..length).collect();
        let mut current_length = 0;
        for batch_id in 0..batch_size {
            batched_input_ids.extend(input_ids.iter().cloned());
            batched_token_type_ids.extend(token_type_ids.iter().cloned());
            batched_position_ids.extend(position_ids.iter().cloned());
            current_length += input_ids.len();
            cumulative_seq_lengths.push(current_length as u32);
            pooled_indices.push(batch_id);
        }
        Batch {
            input_ids: batched_input_ids,
            token_type_ids: batched_token_type_ids,
            position_ids: batched_position_ids,
            cumulative_seq_lengths,
            max_length: length,
            pooled_indices,
            raw_indices: vec![],
        }
    }
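
For illustration only (not part of the PR): a small Python sketch of the shape bucketing that warmup_hpu performs, assuming the default values PAD_SEQUENCE_TO_MULTIPLE_OF=128, MAX_WARMUP_SEQUENCE_LENGTH=1024 and MAX_WARMUP_BATCH_SIZE=8, and assuming the Rust helper powers_of_two returns 1, 2, 4, ... up to its argument.

# Hypothetical Python mirror of the warmup shape bucketing above.
def powers_of_two(max_value):
    # Assumed behaviour of the Rust helper: 1, 2, 4, ... <= max_value.
    result, n = [], 1
    while n <= max_value:
        result.append(n)
        n *= 2
    return result

def warmup_shapes(max_input_length,
                  seq_bucket_size=128,      # PAD_SEQUENCE_TO_MULTIPLE_OF
                  max_warmup_length=1024,   # MAX_WARMUP_SEQUENCE_LENGTH
                  max_batch_size=8):        # MAX_WARMUP_BATCH_SIZE
    batch_sizes = powers_of_two(max_batch_size)
    if batch_sizes and batch_sizes[-1] < max_batch_size:
        batch_sizes.append(max_batch_size)

    max_input_length = min(max_input_length, max_warmup_length)
    seq_lengths = list(range(seq_bucket_size, max_input_length + 1, seq_bucket_size))
    if seq_lengths and seq_lengths[-1] < max_input_length:
        seq_lengths.append(max_input_length)

    # One warmup batch is run for every (batch_size, seq_length) pair.
    return [(bs, sl) for bs in batch_sizes for sl in seq_lengths]

# warmup_shapes(512) -> [(1, 128), (1, 256), ..., (8, 512)]  (16 shapes)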


Yes

kaixuanliu and others added 2 commits August 29, 2024 06:54
Signed-off-by: kaixuanliu <[email protected]>
add hpu flashBert support

@regisss regisss left a comment

The HPU-specific changes look good to me; I didn't check the rest.

kaixuanliu and others added 2 commits August 29, 2024 08:40
Dockerfile-intel Outdated

Maybe it would be better to call this file Dockerfile-hpu, since it is for Gaudi only, right? That would stay consistent with requirements-hpu.txt and requirements-intel.txt.

Well, Dockerfile-intel is for all Intel platforms (CPU, XPU, and HPU); we use build args to separate them. requirements-intel.txt is for CPU and XPU, while requirements-hpu.txt is for HPU only.


Ah okay, I didn't see the build-args. Thanks!

@yao-matrix

@OlivierDehaene, could you help review? Thx

@kaixuanliu

@OlivierDehaene, hi, can you help review? Thx

@kaixuanliu kaixuanliu commented Sep 24, 2024

@OlivierDehaene @Narsil, can you help review? Thanks!
Command lines to build the Docker image for each platform:

# CPU
docker build --build-arg PLATFORM="cpu" -f Dockerfile-intel -t tei_cpu .

# XPU
docker build --build-arg PLATFORM="xpu" -f Dockerfile-intel -t tei_xpu .

# HPU
docker build --build-arg PLATFORM="hpu" -f Dockerfile-intel -t tei_hpu .

@yao-matrix

@mfuntowicz @kding1

upgrade xpu-ipex to 2.3.110
Comment on lines 45 to 44
- cpu_results = embedding.view(-1).tolist()
+ cpu_results = embedding.reshape(-1).tolist()

A view costs less than a reshape; why change it?

Here, if we send a batched request like
curl 127.0.0.1:8080/embed -X POST -d '{"inputs":["What is Deep Learning?", "It is a lovely day"]}' -H 'Content-Type: application/json'
it returns the error: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
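
For reference, a minimal sketch (not part of the PR) reproducing that failure mode with plain PyTorch: .view() cannot flatten a non-contiguous tensor, while .reshape() copies when needed.

import torch

# A transposed tensor is non-contiguous, similar to some batched embedding outputs.
embedding = torch.randn(2, 8, 4).transpose(1, 2)
assert not embedding.is_contiguous()

try:
    embedding.view(-1).tolist()           # raises the RuntimeError quoted above
except RuntimeError as err:
    print(f"view failed: {err}")

cpu_results = embedding.reshape(-1).tolist()  # works, copying if necessary
print(len(cpu_results))                       # 64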

Comment on lines +18 to +33
def hpu_add_layer_norm(
    add: torch.Tensor,
    x: torch.Tensor,
    weight: torch.Tensor,
    bias: torch.Tensor,
    epsilon: float,
    add_back: bool,
):
    if add is not None:
        added_tensor = torch.add(add, x, alpha=1.0)
        output = F.layer_norm(added_tensor, [x.size(-1)], weight, bias, epsilon)
        if add_back:
            add.add_(x)
        return output
    else:
        return F.layer_norm(x, [x.size(-1)], weight=weight, bias=bias, eps=epsilon)
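
A hypothetical usage sketch of this helper (shapes and epsilon are illustrative, not taken from the PR); it assumes hpu_add_layer_norm and torch are in scope as above.

import torch

hidden = torch.randn(4, 768)     # x
residual = torch.randn(4, 768)   # add

# Residual add + layer norm; with add_back=True, `residual` is updated in place to residual + hidden.
out = hpu_add_layer_norm(residual, hidden, torch.ones(768), torch.zeros(768), 1e-12, True)

# Without a residual, it falls back to a plain layer norm.
out_plain = hpu_add_layer_norm(None, hidden, torch.ones(768), torch.zeros(768), 1e-12, False)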
@IlyasMoutawwakil (Member) Nov 5, 2024

It might make sense to move this to a file and import it only in the HPU device case, same as https://github.com/huggingface/text-embeddings-inference/pull/245/files#diff-0974ea7d63e0618f6efe7ab5bdfd6ff7102d5858d241b01448214588dd0bc1cdR49

Can I create a Python file hpu_op.py under backends/python/server/text_embeddings_server/utils/ and put the hpu_add_layer_norm function in that file?

Sorry for the late response, I was OOO. Looking back at it, this method only depends on torch and is generic enough, so there is no need for an HPU ops file.

@kaixuanliu kaixuanliu commented Nov 29, 2024

@IlyasMoutawwakil, hi, do you have any other comments on this PR?

7 participants