EIS Unified ChatCompletions Integration #118871

Open · wants to merge 48 commits into base: main
Conversation

@jaybcee (Member) commented Dec 17, 2024:

Parent PR: #118301

We need to call EIS via Elasticsearch. This PR implements the functionality.

Testing

Run via:

1. `./gradlew localDistro`
2. `cd build/distribution/local/elasticsearch-9.0.0-SNAPSHOT`
3. `./bin/elasticsearch -E xpack.inference.elastic.url=https://localhost:8443 -E xpack.inference.elastic.http.ssl.verification_mode=none -E xpack.security.enabled=false -E xpack.security.enrollment.enabled=false`
4. Create an endpoint via:

curl --location --request PUT 'http://localhost:9200/_inference/completion/test' \
--header 'Content-Type: application/json' \
--data '{
    "service": "elastic",
    "service_settings": {
        "model_id": "claude-3.5-sonnet"
    }
}' -k

Notes:

1. We eventually expect to have a default endpoint.
2. The model name is a bit of a placeholder; for now it's unclear to me what we expose. In any case it's trivial: we have an external-to-internal mapping.

It returns

{
    "inference_id": "test",
    "task_type": "completion",
    "service": "elastic",
    "service_settings": {
        "model_id": "claude-3.5-sonnet",
        "rate_limit": {
            "requests_per_minute": 1000
        }
    }
}

Then we perform inference via

curl --location 'http://localhost:9200/_inference/completion/test/_unified' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "In only two digits and nothing else, what is the meaning of life?"
        }
    ],
    "model" : "claude-3.5-sonnet",
    "temperature": 0.7,
    "max_completion_tokens": 300
}' -k 

Returns

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{"role":"assistant"},"index":0}],"model":"claude-3.5-sonnet","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{"content":"42"},"index":0}],"model":"claude-3.5-sonnet","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"index":0}],"model":"claude-3.5-sonnet","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"finish_reason":"stop","index":0}],"model":"claude-3.5-sonnet","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"index":0}],"model":"claude-3.5-sonnet","object":"chat.completion.chunk","usage":{"completion_tokens":4,"prompt_tokens":22,"total_tokens":26}}

event: message
data: [DONE]

@jaybcee (Member Author) left a comment:

Fortunately this worked mostly out of the box. I had to change EIS a bit to reflect the SSE format.

https://github.com/elastic/eis-gateway/pull/207

It sends the response with a `data:` prefix.

Do we want to implement more tests?

return new URI(elasticInferenceServiceComponents().elasticInferenceServiceUrl() + "/api/v1/chat/completions");
}

// TODO create the Configuration class?
@jaybcee (Member Author):

@jonathan-buttner

Can you explain why you had this TODO? I'm not sure what it brings.

Contributor:

Just a follow up, I think we can address this after we merge this PR. Maybe create an issue so we don't forget it.


public static final String NAME = "elastic_inference_service_completion_service_settings";

// TODO what value do we put here?
@jaybcee (Member Author):

@timgrein, do you have any suggestions? I'm not up to speed on the state of rate limiting.

Contributor:

Good question, I guess we could use the default from bedrock for now?

@jaybcee (Member Author) commented Dec 19, 2024:

It depends on the environment and the quota set... We should leave it as is for now unless there are any objections. Is it OK to leave the TODO? I'll drop a note in the ES integration issue.

@jaybcee (Member Author):

I put it at 240 for now, but a customer's quota and our shared quota can be different. In any case, rate limiting is mildly opaque to me; this is a good enough number for now.
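For context, a minimal sketch of what that default could look like in the service settings class (a sketch only: RateLimitSettings is the inference plugin's existing type, but the constant name is an assumption):

// Hypothetical default of 240 requests per minute, per the discussion above.
private static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(240);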

public static ElasticInferenceServiceCompletionServiceSettings fromMap(Map<String, Object> map, ConfigurationParseContext context) {
    ValidationException validationException = new ValidationException();

// TODO does EIS have this?
@jaybcee (Member Author):

@timgrein, same thing: do we want a limit per model at all?

Contributor:

Do you mean rate limit grouping per model? Not yet, I think we'll group on project ids first. When ELSER is available on EIS we can additionally group by model.

@jaybcee (Member Author):

I was not clear. I meant in the context of ES. Or did you mean we should rate limit on project id within ES?

private static final String ROLE = "user";
private static final String USER = "a_user";

// TODO remove if EIS doesn't use the model and user fields
@jaybcee (Member Author):

@maxhniebergall, we need the model. The user field is a bit ambiguous. Do we set it and ignore it or should we stop sending it?

Member:

Let's discuss at the inference sync tomorrow

@jaybcee (Member Author):

Looks like we'll get rid of it for now. It's available for some Bedrock models, but it has to be passed in an odd way. I'll remove the references to it in the code as well.

As for its usage, I don't think we use it in a meaningful way. My brief Googling shows that it's useful for the provider to identify one of your users who is "jailbreaking" the LLM, should you get suspended.

@jaybcee jaybcee marked this pull request as ready for review December 19, 2024 02:26
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Dec 19, 2024
@jaybcee jaybcee added the :SearchOrg/Inference Label for the Search Inference team label Dec 19, 2024
@elasticsearchmachine elasticsearchmachine added Team:SearchOrg Meta label for the Search Org (Enterprise Search) Team:Search - Inference labels Dec 19, 2024
@elasticsearchmachine (Collaborator): Pinging @elastic/search-inference-team (Team:Search - Inference)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Dec 19, 2024
@elasticsearchmachine (Collaborator): Pinging @elastic/search-eng (Team:SearchOrg)

@jaybcee jaybcee changed the title [WIP] EIS Unified ChatCompletions Integration EIS Unified ChatCompletions Integration Dec 19, 2024
@timgrein (Contributor) left a comment:
Nice 👏

Left some comments. Requested changes because of the usage of TransportVersions.V_8_16_0 in getMinimalSupportedVersion; that's something we must change, I think.



@Override
public TransportVersion getMinimalSupportedVersion() {
    return TransportVersions.V_8_16_0;
Contributor:

I think we need a new TransportVersion here, right? (requesting changes, I guess this could break things otherwise)

@jaybcee (Member Author):

What is the correct version? 8.18?

@timgrein (Contributor) commented Dec 19, 2024:

I think you need to add a specific one to TransportVersions; collapsing transport versions happens after a release, AFAIU.

Something like:

    public static final TransportVersion ELASTIC_INFERENCE_SERVICE_UNIFIED_CHAT_COMPLETIONS_INTEGRATION = def(8_XXX_00_0);
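For completeness, a sketch of how the new constant would then be referenced (assuming the definition above is added to TransportVersions; the exact id stays as the elided 8_XXX_00_0):

@Override
public TransportVersion getMinimalSupportedVersion() {
    return TransportVersions.ELASTIC_INFERENCE_SERVICE_UNIFIED_CHAT_COMPLETIONS_INTEGRATION;
}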

@timgrein (Contributor) left a comment:

Removing my requested changes to unblock this PR while I'm on PTO. We chatted about the TransportVersion change, which needs to be addressed.

@timgrein timgrein dismissed their stale review December 19, 2024 17:24

Unblocking the PR while I'm on PTO; we've discussed the changes, which need to be addressed.

@jonathan-buttner (Contributor) left a comment:

Looking good, I left a few questions/suggestions

@Override
public InferenceServiceResults parseResult(Request request, Flow.Publisher<HttpResult> flow) {
    var serverSentEventProcessor = new ServerSentEventProcessor(new ServerSentEventParser());
    var openAiProcessor = new OpenAiUnifiedStreamingProcessor(); // EIS uses the unified API spec
Contributor:

Note to myself to move and rename that class: #119085
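For readers, a minimal sketch of how this kind of parseResult pipeline is typically completed (the subscribe chaining and the results wrapper below are assumptions based on the snippet above, not necessarily the PR's exact code):

@Override
public InferenceServiceResults parseResult(Request request, Flow.Publisher<HttpResult> flow) {
    // Raw HTTP chunks -> server-sent events -> unified chat completion chunks.
    var serverSentEventProcessor = new ServerSentEventProcessor(new ServerSentEventParser());
    var openAiProcessor = new OpenAiUnifiedStreamingProcessor();
    flow.subscribe(serverSentEventProcessor);
    serverSentEventProcessor.subscribe(openAiProcessor);
    return new StreamingUnifiedChatCompletionResults(openAiProcessor);
}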

@@ -97,7 +127,7 @@ protected void doInfer(
        TimeValue timeout,
        ActionListener<InferenceServiceResults> listener
    ) {
-       if (model instanceof ElasticInferenceServiceModel == false) {
+       if (model instanceof ElasticInferenceServiceExecutableActionModel == false) {
Contributor:

Just a note for reviewers: the reason I did this is that the completion model doesn't adhere to the visitor pattern like the sparse embedding model does. This could get weird if we eventually support the completion task type in the non-unified API. If that happens, I suppose we could either create a new model class or undo this class hierarchy and add the visitor pattern.

) {
    this.unifiedChatInput = Objects.requireNonNull(unifiedChatInput);
    this.model = Objects.requireNonNull(model);
    this.uri = model.uri();
Contributor:

nit: Since we're keeping a reference to the model we can probably just do this.model.uri() when we need it.

@jaybcee (Member Author):

Nice catch ;)
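(Illustratively, the nit amounts to reading the URI off the retained model reference when building the request; the HttpPost usage below is an assumption:)

// Drop the cached uri field and read it from the model reference instead.
var httpPost = new HttpPost(model.uri());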

builder.field(MAX_COMPLETION_TOKENS_FIELD, unifiedRequest.maxCompletionTokens());
}

// Underlying providers except OpenAI only return 1 possible choice.
Contributor:

Suggested change:
- // Underlying providers except OpenAI only return 1 possible choice.
+ // Underlying providers expect OpenAI to only return 1 possible choice.

try {
    this.uri = createUri();
} catch (URISyntaxException e) {
    throw new RuntimeException(e);
Contributor:

nit: I know right now this is a setting that we set, but long term are we going to hard code this or is it injected? I'm just thinking maybe we should go ahead and return an ElasticsearchStatusException since I wouldn't expect to see this error very often 🤔

@jaybcee (Member Author) commented Dec 23, 2024:

Hmm... You bring up an interesting point. I would argue that the minute the setting is loaded, if it's not a valid URI, then it should error. Our hardcoded suffix should never be wrong.

The way it currently works (I think) is that we take the raw string verbatim and then pass the buck, so the error would be caught here. I'll make it return the error you suggested and leave a TODO to revisit this. I think it can be improved, but I don't want to creep out of the scope of this PR.
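A minimal sketch of that change (the exact message and status code are assumptions):

try {
    this.uri = createUri();
} catch (URISyntaxException e) {
    // TODO: validate the URL when the setting is loaded instead of failing here.
    throw new ElasticsearchStatusException(
        "Failed to create URI for the Elastic Inference Service: " + e.getMessage(),
        RestStatus.BAD_REQUEST,
        e
    );
}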

public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject();
    unifiedRequestEntity.toXContent(builder, params);
    builder.field(MODEL_FIELD, modelId);
Contributor:

Reminder for myself that we can probably merge this and the openai class back together since they're sending the same stuff.


import org.apache.http.client.methods.HttpPost;
import org.elasticsearch.tasks.Task;

public interface TraceContextAware {
Contributor:

What do you think about making this an abstract class?

@jaybcee (Member Author) commented Dec 23, 2024:

Sure, I prefer not to have inheritance, and this was simpler. I will use composition and just make it a concrete member. Lmk if that's OK 😄.
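A rough sketch of the composition approach described here (class and field names are illustrative assumptions, not the PR's exact code):

// Composition over inheritance: instead of implementing a TraceContextAware
// interface, the request simply holds the trace context as a member.
public record TraceContext(String traceparent, String tracestate) {}

public class ElasticInferenceServiceUnifiedChatCompletionRequest {
    private final TraceContext traceContext;

    public ElasticInferenceServiceUnifiedChatCompletionRequest(TraceContext traceContext) {
        this.traceContext = traceContext;
        // ... other fields elided ...
    }
}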

"stream_options": {
"include_usage": true
},
"user": "a_user"
Contributor:

I think we can remove this since we're not sending user anymore.

maxhniebergall and others added 7 commits December 23, 2024 13:15
# Conflicts:
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/external/request/openai/OpenAiUnifiedChatCompletionRequestEntity.java
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/external/request/openai/OpenAiUnifiedChatCompletionRequestEntityTests.java
Labels
>enhancement
Feature:GenAI (Features around GenAI)
:SearchOrg/Inference (Label for the Search Inference team)
Team:Search - Inference
Team:SearchOrg (Meta label for the Search Org (Enterprise Search))
v9.0.0