Predict for Python-Backend #450

Open · wants to merge 2 commits into base: main
Conversation

michaelfeil (Contributor) commented Dec 13, 2024

What does this PR do?

This PR enables classifier models to be run with the Python backend.

At a high level, the ModelType is passed through to the Python API.
The gRPC protocol gets an extension for the Predict (repeated Score) interface.
The Python server then runs either AutoModel / FlashBert or AutoModelForSequenceClassification.

This is particularly useful because it enables models with architectures such as DebertaV2, e.g. mixedbread-ai/mxbai-rerank-xsmall-v1.
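The dispatch described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function name and return values are hypothetical, but the idea matches the description: classifier architectures go to AutoModelForSequenceClassification, everything else to the embedding path.

```python
# Hedged sketch (names are illustrative, not the PR's identifiers):
# choosing a model class from a Hugging Face config.json.
def select_model_class(config: dict) -> str:
    """Return the transformers Auto class name to load for this model."""
    archs = config.get("architectures", [])
    if any(a.endswith("ForSequenceClassification") for a in archs):
        # Classifier / reranker path served via the Predict interface.
        return "AutoModelForSequenceClassification"
    # Default embedding path (FlashBert fast paths omitted for brevity).
    return "AutoModel"

# mxbai-rerank-xsmall-v1's config.json declares DebertaV2ForSequenceClassification,
# so it would take the classifier path.
print(select_model_class({"architectures": ["DebertaV2ForSequenceClassification"]}))
```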

Closes #386, #357.
Closes #449. @kaixuanliu I partially picked up your stale PR; let me know if you want to be a commit co-author.
Fixed: Makefile command issue.
This PR has been formatted with cargo fmt and Python black.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

I added a make run-reranker-dev target. Example run:

make run-reranker-dev
2024-12-13T02:39:30.494103Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "mix*******-**/*****-******-*****l-v1", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "michaelfeil-dev-pod-h100-0", port: 3000, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-12-13T02:39:30.494244Z  INFO hf_hub: /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"    
2024-12-13T02:39:30.578633Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2024-12-13T02:39:30.578653Z  INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading `1_Pooling/config.json`
2024-12-13T02:39:30.975772Z  WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/1_Pooling/config.json)
2024-12-13T02:39:32.063030Z  INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading `config_sentence_transformers.json`
2024-12-13T02:39:32.222611Z  WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:36: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/config_sentence_transformers.json)
2024-12-13T02:39:32.222630Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading `config.json`
2024-12-13T02:39:32.222665Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading `tokenizer.json`
2024-12-13T02:39:32.222677Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 1.644047184s
2024-12-13T02:39:32.377947Z  WARN tokenizers::tokenizer::serialization: /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '[MASK]' was expected to have ID '128000' but was given ID 'None'    
2024-12-13T02:39:32.378421Z  WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2024-12-13T02:39:32.378431Z  INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 512
2024-12-13T02:39:32.378440Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 208 tokenization workers
2024-12-13T02:39:45.335673Z  INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2024-12-13T02:39:45.335724Z  INFO text_embeddings_backend: backends/src/lib.rs:360: Downloading `model.safetensors`
2024-12-13T02:39:45.335788Z  INFO text_embeddings_backend: backends/src/lib.rs:244: Model weights downloaded in 64.629µs
2024-12-13T02:39:45.335855Z ERROR text_embeddings_backend: backends/src/lib.rs:255: Could not start Candle backend: Could not start backend: Model is not supported

Caused by:
    unknown variant `deberta-v2`, expected one of `bert`, `xlm-roberta`, `camembert`, `roberta`, `distilbert`, `nomic_bert`, `mistral`, `new`, `qwen2`, `mpnet` at line 21 column 28
2024-12-13T02:39:45.336213Z  INFO text_embeddings_backend_python::management: backends/python/src/management.rs:79: Starting Python backend
2024-12-13T02:39:48.582641Z  WARN python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:39: Could not import Flash Attention enabled models: No module named 'dropout_layer_norm'

2024-12-13T02:39:50.145457Z  INFO python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:37: Server started at unix:///tmp/text-embeddings-inference-server

2024-12-13T02:39:50.145903Z  INFO text_embeddings_backend_python::management: backends/python/src/management.rs:140: Python backend ready in 4.376466659s
2024-12-13T02:39:50.573917Z  INFO text_embeddings_router: router/src/lib.rs:248: Warming up model
2024-12-13T02:39:50.824605Z  WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2024-12-13T02:39:50.825655Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:3000
2024-12-13T02:39:50.825667Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
2024-12-13T02:41:33.254410Z  INFO rerank{total_time="13.560855ms" tokenization_time="802.831µs" queue_time="769.966µs" inference_time="11.89108ms"}: text_embeddings_router::http::server: router/src/http/server.rs:459: Success
2024-12-13T02:41:40.415004Z  INFO rerank{total_time="12.697461ms" tokenization_time="664.931µs" queue_time="745.23µs" inference_time="11.198792ms"}: text_embeddings_router::http::server: router/src/http/server.rs:459: Success
Example request body:
{
  "query": "What is Deep Learning?",
  "raw_scores": false,
  "return_text": true,
  "texts": [
    "Deep learning is..", "Deep Learning is part of ML ", "Paris is the capital of France"
  ],
  "truncate": false,
  "truncation_direction": "Right"
}
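A request body like the one above can be posted to the router's rerank route (the endpoint path is inferred from the rerank span in the log output, and the host/port from the Args line; treat both as assumptions). A minimal client sketch using only the standard library:

```python
# Hedged client sketch: build a POST request for the router's /rerank
# endpoint. Endpoint path and port are assumptions taken from the logs.
import json
import urllib.request

def build_rerank_request(query, texts, host="http://localhost:3000"):
    """Construct an HTTP request mirroring the example body above."""
    body = json.dumps({
        "query": query,
        "texts": texts,
        "raw_scores": False,
        "return_text": True,
        "truncate": False,
        "truncation_direction": "Right",
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_rerank_request(
    "What is Deep Learning?",
    ["Deep learning is..", "Deep Learning is part of ML ",
     "Paris is the capital of France"],
)
# Requires the server from the log above to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```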

200 | Response body:

[
  {
    "index": 1,
    "text": "Deep Learning is part of ML ",
    "score": 0.7718435
  },
  {
    "index": 0,
    "text": "Deep learning is..",
    "score": 0.68825895
  },
  {
    "index": 2,
    "text": "Paris is the capital of France",
    "score": 0.034815196
  }
]
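The response rows are sorted descending by score. As a rough sketch of what the server does per pair (an assumption: the exact normalization depends on the model head; a sigmoid over a single per-pair logit is a common choice for rerankers when raw_scores is false):

```python
# Hedged sketch: turn per-pair logits into sorted index/text/score rows
# like the response above. The sigmoid normalization is an assumption.
import math

def rank_scores(logits, texts):
    """Sigmoid each logit, pair with its input index/text, sort descending."""
    rows = [
        {"index": i, "text": t, "score": 1.0 / (1.0 + math.exp(-s))}
        for i, (t, s) in enumerate(zip(texts, logits))
    ]
    return sorted(rows, key=lambda r: r["score"], reverse=True)

ranked = rank_scores(
    [0.8, 1.2, -3.3],  # hypothetical logits, chosen to mimic the example
    ["Deep learning is..", "Deep Learning is part of ML ",
     "Paris is the capital of France"],
)
```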

Who can review?

@OlivierDehaene @Narsil

Successfully merging this pull request may close these issues.

Feature addition: Python backend / grcp backend for ClassifierEngine