
Sagemaker Asynchronous TEI Endpoints Fail On Requests Greater Than 2mb #433

Open · 2 of 4 tasks
ma3842 opened this issue Nov 7, 2024 · 2 comments
ma3842 commented Nov 7, 2024

System Info

TEI Image v1.4.0
AWS Sagemaker Deployment
1 x ml.g5.xlarge instance Asynchronous Deployment

Link to prior discussion: https://discuss.huggingface.co/t/async-tei-deployment-cannot-handle-requests-greater-than-2mb/107529/1

On deployment, we see two relevant logs.
In the logs on initial deployment:

{
   "timestamp": "2024-11-04T19:32:25.770556Z",
   "level": "INFO",
   "message": "Args { 
model_id: \"BAA*/**e-m3\", revision: None, tokenization_workers: None, dtype: Some(Float16), 
pooling: None, max_concurrent_requests: 512, max_batch_tokens: 163840000, 
max_batch_requests: Some(30), max_client_batch_size: 320000, 
auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, 
hostname: \"container-0.local\", port: 8080, uds_path: \"/tmp/text-embeddings-inference-server\", 
huggingface_hub_cache: Some(\"/data\"), payload_limit: 200000000, 
api_key: None, json_output: true, otlp_endpoint: None, 
otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }",
   "target": "text_embeddings_router",
   "filename": "router/src/main.rs",
   "line_number": 175
}

In the CloudWatch data logs (on failure):
Received client error (413) from primary with message "Failed to buffer the request body: length limit exceeded"

The request that produced the log above was 2.4 MB in size. Our PAYLOAD_LIMIT environment variable is set to 200000000, so payloads of up to 200 MB should be accepted. Since payloads up to 2 MB are handled fine, the limitation appears to stem from the payload-size override not being honored rather than from token lengths.
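To help separate the TEI router's own limit from anything SageMaker adds in front of it, here is a minimal sketch (assuming the same v1.4.0 image is run locally with --payload-limit 200000000 and exposed on port 8080, as in the Args log above) that sends an equivalent >2 MB request directly to the container's /embed route:

import json

import requests

# Roughly the same >2 MB payload used in the failing async invocation.
data = ["My name is Clara and I am" * 1000, "My name is Clara and I am", "I"] * 100
body = json.dumps({"inputs": data}).encode("utf-8")
print(f"payload size: {len(body) / 1_000_000:.1f} MB")  # expect roughly 2.4 MB, matching the failing request

# POST straight to the locally running TEI container (no SageMaker in front).
resp = requests.post(
    "http://localhost:8080/embed",
    data=body,
    headers={"Content-Type": "application/json"},
)
# A 413 here would point at the TEI router itself; a 200 would point at the
# SageMaker serving layer in front of it.
print(resp.status_code)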

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

import json

import boto3
import sagemaker
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# get_huggingface_llm_image_uri("huggingface-tei") resolves to
# ...tei:2.0.1-tei1.4.0-gpu-py310-cu122-ubuntu22.04

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'BAAI/bge-m3',
    "DTYPE": "float16",
    "MAX_BATCH_TOKENS": "163840000",
    "MAX_CLIENT_BATCH_SIZE": "320000",
    "PAYLOAD_LIMIT": "200000000",
    "MAX_BATCH_REQUESTS": "30",
    'MMS_MAX_REQUEST_SIZE': '2000000000',
    'MMS_MAX_RESPONSE_SIZE': '2000000000',
    'MMS_DEFAULT_RESPONSE_TIMEOUT': '900',
    "TS_MAX_REQUEST_SIZE": "1000000000",  # true max is 1gb
    "TS_MAX_RESPONSE_SIZE": "1000000000",  # true max is 1gb
    "SAGEMAKER_TS_RESPONSE_TIMEOUT": "3600",
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-tei"),
    env=hub,
    role=role, 
)

async_config = AsyncInferenceConfig(
    output_path='s3://<BUCKET>/TEI_embedding/output',
    max_concurrent_invocations_per_instance=1,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    # tags=tags,  # optional resource tags (not defined in this snippet)
    endpoint_name="experimental-tei-endpoint",
    async_inference_config=async_config,
)
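Not part of the original repro, but a hedged sanity check we find useful before invoking: confirm the environment overrides actually reached the container definition SageMaker created. This assumes huggingface_model.name is populated after deploy(), which is the SDK's usual behavior.

# Hedged sanity check: verify the env overrides are on the created Model.
sm = boto3.client("sagemaker")
container = sm.describe_model(ModelName=huggingface_model.name)["PrimaryContainer"]
print(container["Environment"].get("PAYLOAD_LIMIT"))          # expected "200000000"
print(container["Environment"].get("MAX_CLIENT_BATCH_SIZE"))  # expected "320000"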
import json
import boto3

def upload_to_s3(data, bucket, key):
    s3 = boto3.resource('s3')
    s3_object = s3.Object(bucket, key)
    s3_object.put(Body=json.dumps(data))
    return f"s3://{bucket}/{key}"

def invoke_async_endpoint(endpoint_name, payload):
    sagemaker_runtime = boto3.client('sagemaker-runtime')
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=payload,
        ContentType='application/json'
    )
    return response['OutputLocation']

data = ["My name is Clara and I am" * 1000, "My name is Clara and I am", "I"] * 100
inputs = {
    "inputs": data,
}

bucket = '<S3_BUCKET>'
key = 'TEI_embedding/input/data.json'

payload_s3_path = upload_to_s3(inputs, bucket, key)
print(f"Uploaded payload to: {payload_s3_path}")

# Invoke the asynchronous endpoint
endpoint_name = 'experimental-tei-endpoint'
output_location = invoke_async_endpoint(endpoint_name, payload_s3_path)
print(f"Asynchronous inference initiated. Output location: {output_location}")

Expected behavior

Payloads up to 200 MB in size should be supported, in line with PAYLOAD_LIMIT and with existing async endpoints (which, per the SageMaker documentation, support payloads up to 1 GB).

ma3842 (Author) commented Nov 7, 2024

cc: @philschmid follow up from our email thread. Thanks again!

Jamesargy6 commented

I am also having the same issue with my deployment. Setting PAYLOAD_LIMIT as an environment variable seems to have no impact on the behavior of my endpoint. The other variables I am setting (such as MAX_CLIENT_BATCH_SIZE) are working as expected.

It appears this was identified as an issue before, resolved (#298), and released with version v1.3.0. Perhaps there was a regression?
