You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Create an inference endpoint with a TEI container for reranking with a reranking model:
# Create an Inference Endpoint running the TEI container for reranking.
# NOTE: the original snippet only imported `create_inference_endpoint` but also
# called `huggingface_hub.list_inference_endpoints()` / `get_inference_endpoint()`
# on the bare module, which raises NameError — import the module as well.
import huggingface_hub
from huggingface_hub import create_inference_endpoint

repository = "BAAI/bge-reranker-base"  # "BAAI/bge-reranker-large-base"
endpoint_name = "bge-reranker-large-base-05"
namespace = "MoritzLaurer"  # your user or organization name

# Check if an endpoint with this name already exists from previous tests.
available_endpoints_names = [
    endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()
]
endpoint_exists = endpoint_name in available_endpoints_names
print("Does the endpoint already exist?", endpoint_exists)

if not endpoint_exists:
    # Create a new endpoint.
    endpoint = create_inference_endpoint(
        endpoint_name,
        repository=repository,
        namespace=namespace,
        framework="pytorch",
        task="sentence-ranking",
        # see the available hardware options here:
        # https://huggingface.co/docs/inference-endpoints/pricing#pricing
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=0,
        max_replica=1,
        type="protected",
        custom_image={
            "health_route": "/health",
            "env": {
                "MAX_BATCH_TOKENS": "16384",
                "MAX_CONCURRENT_REQUESTS": "512",
                "MAX_BATCH_REQUESTS": "160",
                "MODEL_ID": "/repository",
            },
            "url": "ghcr.io/huggingface/text-embeddings-inference:latest",
        },
    )
    print("Waiting for endpoint to be created")
    endpoint.wait()
    print("Endpoint ready")
else:
    # An endpoint with this name already exists: reuse it and make sure it is running.
    endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
    if endpoint.status in ["paused", "scaledToZero"]:
        print("Resuming endpoint")
        endpoint.resume()
    print("Waiting for endpoint to start")
    endpoint.wait()
    print("Endpoint ready")
Send request both with /rerank path appended to endpoint.url or without:
# Query the reranking endpoint over plain HTTP.
# NOTE: `huggingface_hub` was used (get_token) without being imported — fixed.
import requests
import huggingface_hub

HEADERS = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}
API_URL = endpoint.url + "/rerank"


def query(payload=None, api_url=None):
    """POST *payload* as JSON to *api_url* and return the decoded JSON response."""
    response = requests.post(api_url, headers=HEADERS, json=payload)
    # Surface HTTP errors explicitly instead of a confusing JSONDecodeError
    # when the server returns a non-JSON error body.
    response.raise_for_status()
    return response.json()


output = query(
    payload={
        "query": "What is Deep Learning?",
        "texts": ["Deep Learning is not...", "Deep learning is...", "testtest"],
    },
    api_url=API_URL,
)
print(output)
In both cases I get the same and correct reranking output.
On the other hand, when I create this endpoint for sentence-similarity with an embedding model:
# Create an Inference Endpoint running the TEI container for sentence-similarity.
# NOTE: same fix as the reranking snippet — the bare `huggingface_hub` module is
# used below, so it must be imported in addition to `create_inference_endpoint`.
import huggingface_hub
from huggingface_hub import create_inference_endpoint

repository = "thenlper/gte-large"  # "BAAI/bge-reranker-large-base"
endpoint_name = "gte-large-001"
namespace = "MoritzLaurer"  # your user or organization name

# Check if an endpoint with this name already exists from previous tests.
available_endpoints_names = [
    endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()
]
endpoint_exists = endpoint_name in available_endpoints_names
print("Does the endpoint already exist?", endpoint_exists)

if not endpoint_exists:
    # Create a new endpoint.
    endpoint = create_inference_endpoint(
        endpoint_name,
        repository=repository,
        namespace=namespace,
        framework="pytorch",
        task="sentence-similarity",
        # see the available hardware options here:
        # https://huggingface.co/docs/inference-endpoints/pricing#pricing
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=2,
        max_replica=4,
        type="protected",
        custom_image={
            "health_route": "/health",
            "env": {
                "MAX_BATCH_TOKENS": "16384",
                "MAX_CONCURRENT_REQUESTS": "512",
                "MAX_BATCH_REQUESTS": "124",
                "MODEL_ID": "/repository",
            },
            "url": "ghcr.io/huggingface/text-embeddings-inference:latest",
        },
    )
    print("Waiting for endpoint to be created")
    endpoint.wait()
    print("Endpoint ready")
else:
    # An endpoint with this name already exists: reuse it and make sure it is running.
    endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
    if endpoint.status in ["paused", "scaledToZero"]:
        print("Resuming endpoint")
        endpoint.resume()
    print("Waiting for endpoint to start")
    endpoint.wait()
    print("Endpoint ready")
Then I need to append the /similarity route at the end of the URL to get correct outputs.
# Query the sentence-similarity endpoint over plain HTTP.
# NOTE: `huggingface_hub` was used (get_token) without being imported — fixed.
import requests
import huggingface_hub

# The /similarity route must be appended manually; without it the server
# returns a non-JSON body (this is the inconsistency reported in this issue).
API_URL = endpoint.url + "/similarity"  # "https://c5hhcabur7dqwyj7.us-east-1.aws.endpoints.huggingface.cloud" + "/similarity"
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json",
}


def query(payload):
    """POST *payload* as JSON to the similarity route and return the decoded JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({
    "inputs": {
        "sentences": [
            "That is a happy dog",
            "That is a very happy person",
            "Today is a sunny day",
        ],
        "source_sentence": "That is a happy person",
        "parameters": {},
    }
})
output  # [0.91960955, 0.98106885, 0.8241128]
If I don't manually append /similarity to the URL, I get the following error:
---------------------------------------------------------------------------JSONDecodeErrorTraceback (mostrecentcalllast)
File~/miniconda/lib/python3.9/site-packages/requests/models.py:974, inResponse.json(self, **kwargs)
973try:
-->974returncomplexjson.loads(self.text, **kwargs)
975exceptJSONDecodeErrorase:
976# Catch JSON-related errors and raise as requests.JSONDecodeError977# This aliases json.JSONDecodeError and simplejson.JSONDecodeErrorFile~/miniconda/lib/python3.9/json/__init__.py:346, inloads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343if (clsisNoneandobject_hookisNoneand344parse_intisNoneandparse_floatisNoneand345parse_constantisNoneandobject_pairs_hookisNoneandnotkw):
-->346return_default_decoder.decode(s)
347ifclsisNone:
File~/miniconda/lib/python3.9/json/decoder.py:337, inJSONDecoder.decode(self, s, _w)
333"""Return the Python representation of ``s`` (a ``str`` instance 334 containing a JSON document). 335 336 """-->337obj, end=self.raw_decode(s, idx=_w(s, 0).end())
338end=_w(s, end).end()
File~/miniconda/lib/python3.9/json/decoder.py:355, inJSONDecoder.raw_decode(self, s, idx)
354exceptStopIterationaserr:
-->355raiseJSONDecodeError("Expecting value", s, err.value) fromNone356returnobj, endJSONDecodeError: Expectingvalue: line1column1 (char0)
Duringhandlingoftheaboveexception, anotherexceptionoccurred:
JSONDecodeErrorTraceback (mostrecentcalllast)
CellIn[17], line1411response=requests.post(API_URL, headers=headers, json=payload)
12returnresponse.json()
--->14output=query({
15"inputs": {"sentences": [
16"That is a happy dog",
17"That is a very happy person",
18"Today is a sunny day"19 ],
20"source_sentence": "That is a happy person",
21"parameters": {}}
22 })
24outputCellIn[17], line12, inquery(payload)
10defquery(payload):
11response=requests.post(API_URL, headers=headers, json=payload)
--->12returnresponse.json()
File~/miniconda/lib/python3.9/site-packages/requests/models.py:978, inResponse.json(self, **kwargs)
974returncomplexjson.loads(self.text, **kwargs)
975exceptJSONDecodeErrorase:
976# Catch JSON-related errors and raise as requests.JSONDecodeError977# This aliases json.JSONDecodeError and simplejson.JSONDecodeError-->978raiseRequestsJSONDecodeError(e.msg, e.doc, e.pos)
JSONDecodeError: Expectingvalue: line1column1 (char0)
Expected behavior
Either consistently force to append the correct path, or not.
System Info
Inference endpoints
TEI version 1.5
Information
Tasks
Reproduction
Create an inference endpoint with a TEI container for reranking with a reranking model:
Send request both with
/rerank
path appended to endpoint.url or without. In both cases I get the same and correct reranking output.
On the other hand, when I create this endpoint for sentence-similarity with an embedding model:
Then I need to append the
/similarity
route at the end of the URL to get correct outputs. If I don't manually append
/similarity
to the URL, I get the following error:Expected behavior
Either consistently force to append the correct path, or not.
See this internal thread for context
The text was updated successfully, but these errors were encountered: