Showing 10 changed files with 1,764 additions and 0 deletions.
@@ -0,0 +1,20 @@
name: Build api-inference documentation

on:
  push:
    paths:
      - "docs/api-inference/**"
    branches:
      - main

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: hub-docs
      package_name: api-inference
      path_to_docs: hub-docs/docs/api-inference/
      additional_args: --not_python_module
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
.github/workflows/api_inference_build_pr_documentation.yml (21 changes: 21 additions & 0 deletions)
@@ -0,0 +1,21 @@
name: Build api-inference PR Documentation

on:
  pull_request:
    paths:
      - "docs/api-inference/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: hub-docs
      package_name: api-inference
      path_to_docs: hub-docs/docs/api-inference/
      additional_args: --not_python_module
@@ -0,0 +1,15 @@
name: Delete api-inference dev documentation

on:
  pull_request:
    types: [ closed ]


jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
    with:
      pr_number: ${{ github.event.number }}
      package: hub-docs
      package_name: api-inference
@@ -0,0 +1,14 @@
- sections:
  - local: index
    title: 🤗 Accelerated Inference API
  - local: quicktour
    title: Overview
  - local: detailed_parameters
    title: Detailed parameters
  - local: parallelism
    title: Parallelism and batch jobs
  - local: usage
    title: Detailed usage and pinned models
  - local: faq
    title: More information about the API
  title: Getting started
Large diffs are not rendered by default.
@@ -0,0 +1,32 @@
# More information about the API

## Rate limits

The free Inference API may be rate limited for heavy use cases. We try to balance the load evenly across all our available resources, favoring steady flows of requests. If your account suddenly sends 10k requests, you are likely to receive 503 errors saying models are loading. To prevent that, ramp your queries up smoothly from 0 to 10k over the course of a few minutes.
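As a rough illustration, the sketch below spreads a large batch of requests over a few minutes instead of sending them all at once; the model, token, and pacing interval are placeholder assumptions, not part of the API:

```python
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"  # example model
headers = {"Authorization": "Bearer hf_xxxxx"}                 # placeholder token

inputs = [f"Example input {i}" for i in range(10_000)]
for text in inputs:
    requests.post(API_URL, headers=headers, json={"inputs": text})
    time.sleep(0.02)  # ~50 requests/s, so 10k requests take a few minutes
```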
## Running private models

You can run private models by default! If you don't see them on your [Hugging Face](https://huggingface.co) page, please make sure you are logged in. Within the API, make sure you include your token, otherwise your model will be reported as non-existent.

## Running a public model that I do not own

You can. Please check the model card for any licensing issues that might arise; most public models are published by researchers and are usable within commercial products, but please double-check.

## Finetuning a public model

To automatically finetune a model on your data, please try [AutoTrain](https://huggingface.co/autotrain). It's a hands-free solution for automatically training and deploying a model; all you have to do is upload your data!

## Running the inference on my infrastructure

To run on-premise inference on your own infrastructure, please contact our team to [request a demo](https://huggingface.co/platform#form) for more information about our [Private Hub](https://huggingface.co/platform).
@@ -0,0 +1,54 @@
<!-- DISABLE-FRONTMATTER-SECTIONS -->

# 🤗 Hosted Inference API

Test and evaluate, for free, over 80,000 publicly accessible machine learning models, or your own private models, via simple HTTP requests, with fast inference hosted on Hugging Face shared infrastructure.

<Tip>

The Inference API is free to use, and rate limited. If you need an inference solution for production, check out our [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) service. With Inference Endpoints, you can easily deploy any machine learning model on dedicated and fully managed infrastructure. Select the cloud, region, compute instance, autoscaling range and security level to match your model, latency, throughput, and compliance needs.

</Tip>

## Main features:

- Get predictions from **80,000+ Transformers models** (T5, Blenderbot, Bart, GPT-2, Pegasus...)
- Switch from one model to the next by just switching the model ID
- Use built-in integrations with **over 20 open-source libraries** (spaCy, SpeechBrain, etc.)
- Upload, manage and serve your **own models privately**
- Run Classification, Image Segmentation, Automatic Speech Recognition, NER, Conversational, Summarization, Translation, Question-Answering, and Embeddings Extraction tasks
- Out-of-the-box accelerated inference on **CPU** powered by Intel Xeon Ice Lake

## Third-party library models:

- The [Hub](https://huggingface.co) now supports many new libraries:

  - [spaCy](https://spacy.io/), [AllenNLP](https://allennlp.org/),
  - [SpeechBrain](https://speechbrain.github.io/),
  - [Timm](https://pypi.org/project/timm/) and [many others](https://huggingface.co/docs/hub/libraries)...

- These models are enabled on the API thanks to the Docker integration in [api-inference-community](https://github.com/huggingface/huggingface_hub/tree/main/api-inference-community).

<Tip warning>

Please note, however, that these models will not allow you ([tracking issue](https://github.com/huggingface/huggingface_hub/issues/85)):

- To get full optimization
- To run private models
- To get access to GPU inference

</Tip>

## If you are looking for custom support from the Hugging Face team

<a target="_blank" href="https://huggingface.co/support">
    <img alt="HuggingFace Expert Acceleration Program" src="https://cdn-media.huggingface.co/marketing/transformers/new-support-improved.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>

## Hugging Face is trusted in production by over 10,000 companies

<img class="block dark:hidden !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-light.png" width="600">
<img class="hidden dark:block !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-dark.png" width="600">
@@ -0,0 +1,119 @@
# Parallelism and batch jobs

In order to get your answers as quickly as possible, you probably want to run some kind of parallelism on your jobs.

There are two options at your disposal for this:

- The streaming option
- The dataset option

## Streaming

To maximize inference speed, instead of running many HTTP requests it is more efficient to stream your data to the API. This requires the use of websockets on your end. Internally we use a queue system from which machines pull work, seamlessly applying parallelism for you. **This is meant as a batching mechanism, and a single stream should be open at any given time.** If multiple streams are open, requests will go to either one without any guarantee. This is intentional, as it allows recovering from a cut stream: simply reinitializing the stream will recover results, since everything that was sent is still being processed even if you are no longer connected. So make sure you don't send items multiple times, otherwise you will be billed multiple times.

Here is a small example:

<inferencesnippet>
<python>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "python",
"start-after": "START python_parallelism",
"end-before": "END python_parallelism",
"dedent": 8}
</literalinclude>
</python>
<js>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "js",
"start-after": "START node_parallelism",
"end-before": "END node_parallelism",
"dedent": 8}
</literalinclude>
</js>
<curl>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "bash",
"start-after": "START curl_parallelism",
"end-before": "END curl_parallelism",
"dedent": 8}
</literalinclude>
</curl>
</inferencesnippet>
Every message you send must contain an `inputs` key.

Optionally, you can specify an `id` key that will be sent back with the result. We try to return results in the order they were sent, but it's better to be sure; the `id` key is there for that.

Optionally, you can specify a `parameters` key that corresponds to the `detailed_parameters` of the pipeline you are using.

The received messages will _always_ contain a `type` key:

- status message: these messages contain a `message` key that informs you about the current status of the job.
- results message: these messages contain the actual results of the computation. `id` contains the id you sent (or one generated automatically), and `outputs` contains the result as it would be returned by the HTTP endpoint. See `detailed_parameters` for more information.
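To make this format concrete, here is a minimal sketch of the payloads exchanged over the stream; the field values are hypothetical, only the `inputs`, `id`, `parameters`, `type`, `message`, and `outputs` keys come from the description above:

```python
import json

# Message sent over the websocket: "inputs" is required,
# "id" and "parameters" are optional (values here are made up).
request_message = json.dumps({
    "inputs": "The answer to the universe is",
    "id": "example-1",                     # echoed back with the result
    "parameters": {"max_new_tokens": 20},  # pipeline-specific, see detailed_parameters
})

# Messages received always carry a "type" key.
status_message = {"type": "status", "message": "Model is loading"}
results_message = {
    "type": "results",
    "id": "example-1",
    "outputs": [{"generated_text": "The answer to the universe is 42."}],
}
```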
## Dataset

If you run regularly against the same dataset to check for differences between models or for drift, we recommend using a [dataset](https://huggingface.co/docs/datasets/), or any of the 2,000 datasets available [here](https://huggingface.co/datasets).

The outputs of this method will automatically create a private dataset on your account, and use git mechanisms to store versions of the various outputs.

<inferencesnippet>
<python>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "python",
"start-after": "START python_parallelism_datasets",
"end-before": "END python_parallelism_datasets",
"dedent": 8}
</literalinclude>
</python>
<js>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "node",
"start-after": "START node_parallelism_datasets",
"end-before": "END node_parallelism_datasets",
"dedent": 8}
</literalinclude>
</js>
<curl>
<literalinclude>
{"path": "../../tests/documentation/test_parallelism.py",
"language": "bash",
"start-after": "START curl_parallelism_datasets",
"end-before": "END curl_parallelism_datasets",
"dedent": 8}
</literalinclude>
</curl>
</inferencesnippet>
@@ -0,0 +1,107 @@
# Overview

Let's have a quick look at the 🤗 Hosted Inference API.

## Main features:

- Leverage **80,000+ Transformer models** (T5, Blenderbot, Bart, GPT-2, Pegasus...)
- Upload, manage and serve your **own models privately**
- Run Classification, NER, Conversational, Summarization, Translation, Question-Answering, and Embeddings Extraction tasks
- Get up to **10x inference speedup** to reduce user latency
- Accelerated inference for a number of supported models on CPU
- Run **large models** that are challenging to deploy in production
- Scale up to 1,000 requests per second with **automatic scaling** built-in
- **Ship new NLP, CV, Audio, or RL features faster** as new models become available
- Build your business on a platform powered by the reference open source project in ML

## Get your API Token

To get started you need to:

- [Register](https://huggingface.co/join) or [Login](https://huggingface.co/login).
- Get a User Access or API token [in your Hugging Face profile settings](https://huggingface.co/settings/tokens).

You should see a token `hf_xxxxx` (old tokens are `api_XXXXXXXX` or `api_org_XXXXXXX`).

If you do not submit your API token when sending requests to the API, you will not be able to run inference on your private models.

## Running Inference with API Requests

The first step is to choose which model you are going to run. Go to the [Model Hub](https://huggingface.co/models) and select the model you want to use. If you are unsure where to start, make sure to check the [recommended models for each ML task](https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html#detailed-parameters), or the [Tasks](https://huggingface.co/tasks) overview.

```
ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>
```

Let's use [gpt2](https://huggingface.co/gpt2) as an example. To run inference, simply use this code:

<inferencesnippet>
<python>
<literalinclude>
{"path": "../../tests/documentation/test_inference.py",
"language": "python",
"start-after": "START simple_inference",
"end-before": "END simple_inference",
"dedent": 8}
</literalinclude>
</python>
<js>
<literalinclude>
{"path": "../../tests/documentation/test_inference.py",
"language": "node",
"start-after": "START node_simple_inference",
"end-before": "END node_simple_inference",
"dedent": 8}
</literalinclude>
</js>
<curl>
<literalinclude>
{"path": "../../tests/documentation/test_inference.py",
"language": "bash",
"start-after": "START curl_simple_inference",
"end-before": "END curl_simple_inference",
"dedent": 8}
</literalinclude>
</curl>
</inferencesnippet>
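As a concrete sketch of what such a request looks like in Python with `requests` (the token value is a placeholder; the snippets shipped with the docs live in the test file referenced above):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer hf_xxxxx"}  # placeholder: use your own token

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# gpt2 is a text-generation model, so "inputs" is a string prompt
output = query({"inputs": "Can you please let us know more details about your "})
print(output)  # e.g. [{"generated_text": "..."}]
```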
## API Options and Parameters

Depending on the task (aka pipeline) the model is configured for, the request will accept specific parameters. When sending requests to run any model, API options also let you specify the caching and model loading behavior. All API options and parameters are detailed in [`detailed_parameters`](detailed_parameters).
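For instance, a request body can combine task-specific `parameters` with API `options`; the sketch below assumes a text-generation model, with `use_cache` and `wait_for_model` as the caching and model-loading options carried under the `options` key:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer hf_xxxxx"}  # placeholder token

payload = {
    "inputs": "The Hosted Inference API ",
    # Task-specific parameters (text generation in this example)
    "parameters": {"max_new_tokens": 30},
    # API options controlling caching and model loading behavior
    "options": {"use_cache": False, "wait_for_model": True},
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```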
## Using CPU-Accelerated Inference

As an API customer, your API token automatically enables CPU-Accelerated inference on your requests if the model type is supported. For instance, if you compare gpt2 inference through our API with CPU-Acceleration against running the model out of the box on a local setup, you should measure a **~10x speedup**. The specific performance boost depends on the model and input payload (and your local hardware).

To verify you are using the CPU-Accelerated version of a model, you can check the `x-compute-type` header of your responses, which should be `cpu+optimized`. If you do not see it, it simply means not all optimizations are turned on. This can be due to various factors: the model might have been added recently to transformers, or the model can be optimized in several different ways and the best one depends on your use case.
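You can inspect that header directly on the HTTP response, for example:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer hf_xxxxx"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello, I am"})
# "cpu+optimized" means the CPU-Accelerated path was used; any other value
# means not all optimizations are enabled for this model or request.
print(response.headers.get("x-compute-type"))
```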
If you contact us at [email protected], we'll be able to increase the inference speed for you, depending on your actual use case.

## Model Loading and latency

The Hosted Inference API can serve predictions on-demand from over 100,000 models deployed on the Hugging Face Hub, dynamically loaded on shared infrastructure. If the requested model is not loaded in memory, the Hosted Inference API will start by loading the model into memory and returning a 503 response, before it can respond with the prediction.
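One simple way to handle this loading phase on the client side is to retry on 503 responses until the prediction comes back; a minimal sketch (the retry interval and limit are arbitrary choices):

```python
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer hf_xxxxx"}  # placeholder token

def query_with_retry(payload, max_retries=10, delay_seconds=10):
    """Retry while the model is being loaded (503), then return the prediction."""
    for _ in range(max_retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code != 503:
            return response.json()
        time.sleep(delay_seconds)  # model is still loading; wait and try again
    raise RuntimeError("Model did not become available in time")

print(query_with_retry({"inputs": "Paris is the capital of"}))
```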
If your use case requires large volume or predictable latencies, you can use our paid solution [Inference Endpoints](https://huggingface.co/inference-endpoints) to easily deploy your models on dedicated, fully-managed infrastructure. With Inference Endpoints you can quickly create endpoints on the cloud, region, CPU or GPU compute instance of your choice.