
Gradio web UI inference and Python inference produce inconsistent outputs #2705

Open
m00nLi opened this issue Dec 25, 2024 · 3 comments

m00nLi commented Dec 25, 2024

System Info

Linux: Ubuntu 22.04 LTS
GPU: H100
NVIDIA-SMI: 560.35.05
Driver Version: 560.35.05
CUDA Version: 12.6
python: 3.10.15
gradio: 5.9.1
transformers: 4.47.0

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

xinference: 1.1.0
xinference-client: 0.13.3

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

1. Gradio web UI inference

  1. Start xinference: xinference-local --host 0.0.0.0 --port 9997
  2. At http://localhost:9997/ui/#/launch_model/llm, launch the InternVL2 series models (8B, 26B, 40B); Qwen2-vl-instruct-7B and 72B were also tested
  3. At http://localhost:9997/internvl2/?, upload an image, enter a prompt, and click Send to get the model output

2. Python inference

Adapted from the LLM inference example in the xinference documentation (https://inference.readthedocs.io/zh-cn/latest/index.html), with "max_tokens": 512, "temperature": 1.0 set to match the web UI's default parameters:

from xinference.client import Client

client = Client("http://localhost:9997")
model = client.get_model("internvl2")

query_text = "your query"
image_file = "/path/to/image.jpg"

# Chat with the VL model
result = model.chat(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query_text},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_file,
                    },
                },
            ],
        }
    ],
    generate_config={"max_tokens": 512, "temperature": 1.0},
)
msg = result["choices"][0]["message"]["content"]
print(msg)

Running inference on the same image and prompt multiple times in the web UI, the output is very stable.
But with the same image and prompt, the web UI output and the Python output differ drastically; sometimes the two answers reach completely opposite conclusions.
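
To make the comparison airtight, sampling randomness can be ruled out first; below is a minimal sketch reusing the names from the snippet above (whether temperature=0.0 is honored as pure greedy decoding depends on the serving backend):

# Reuse the same messages as above, but drop the temperature to (near) zero so
# sampling randomness cannot explain the difference between the two paths.
# Assumption: whether temperature=0.0 maps to pure greedy decoding depends on
# the backend engine serving the model.
deterministic = model.chat(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query_text},
                {"type": "image_url", "image_url": {"url": image_file}},
            ],
        }
    ],
    generate_config={"max_tokens": 512, "temperature": 0.0},
)
print(deterministic["choices"][0]["message"]["content"])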

3. So I then tried inference with gradio_client

from gradio_client import Client, handle_file

client = Client("http://localhost:9997/internvl2")

query_text = "your query"
image_file = "/path/to/image.jpg"

# Add the text and image to the chat history
internvl2 = client.predict(
    bot=[],
    text=query_text,
    image=handle_file(image_file),
    video=None,
    api_name="/add_text",
)

# Run generation on the accumulated history
result = client.predict(
    bot=internvl2[0],
    max_tokens=512,
    temperature=1.0,
    stream=False,
    api_name="/predict",
)

client.predict(api_name="/clear_history")

But the result doesn't match the web UI inference either. That makes no sense; isn't gradio_client supposed to replicate exactly what the web UI does?
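
One way to narrow this down is to look at what /add_text actually returned before it is fed into /predict, since any difference between this history and the one the web UI maintains would mean the model sees different messages. A minimal inspection sketch, making no assumptions about the exact return structure:

# Dump the raw return value of /add_text before passing it on, to check
# whether the stored history matches what the web UI sends to the model.
import json
print(json.dumps(internvl2, indent=2, ensure_ascii=False, default=str))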

Expected behavior

All three inference paths should produce consistent outputs.

XprobeBot added the gpu label Dec 25, 2024
XprobeBot added this to the v1.x milestone Dec 25, 2024
qinxuye (Contributor) commented Dec 26, 2024

The Gradio UI just calls the regular API under the hood; isn't some inconsistency in large-model inference normal?

m00nLi (Author) commented Dec 27, 2024

> The Gradio UI just calls the regular API under the hood; isn't some inconsistency in large-model inference normal?

Does the Gradio UI call the Python inference interface client.get_model("Model_UID").chat(...)? If so, shouldn't the outputs be the same?
With the same image and prompt, 10 runs through the Gradio UI give essentially identical outputs, and 10 runs through Python also give essentially identical outputs;
but comparing the two sets against each other, they don't match.
For example, when I ask how many people are in an image, the Gradio UI says three people in all 10 runs, while Python says only one person in all 10 runs. Is that still normal?
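
For reference, a minimal sketch of the repeated run on the Python side (the question and image path below are placeholders):

# Side-by-side check (sketch): run the same question through the REST client
# ten times and collect the answers; the 10 web UI runs can then be compared
# against this list.
from xinference.client import Client

client = Client("http://localhost:9997")
model = client.get_model("internvl2")

answers = []
for _ in range(10):
    result = model.chat(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "How many people are in this picture?"},
                    {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
                ],
            }
        ],
        generate_config={"max_tokens": 512, "temperature": 1.0},
    )
    answers.append(result["choices"][0]["message"]["content"])

print("\n---\n".join(answers))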

qinxuye (Contributor) commented Dec 27, 2024

def build_chat_vl_interface(
    self,
) -> "gr.Blocks":
    def predict(history, bot, max_tokens, temperature, stream):
        from ..client import RESTfulClient

        client = RESTfulClient(self.endpoint)
        client._set_token(self._access_token)
        model = client.get_model(self.model_uid)
        assert isinstance(model, RESTfulChatModelHandle)
        if stream:
            response_content = ""
            for chunk in model.chat(
                messages=history,
                generate_config={
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                    "stream": stream,
                },
            ):
                assert isinstance(chunk, dict)
                delta = chunk["choices"][0]["delta"]
                if "content" not in delta:
                    continue
                else:
                    response_content += delta["content"]
                    bot[-1][1] = response_content
                    yield history, bot
            history.append(
                {
                    "content": response_content,
                    "role": "assistant",
                }
            )
            bot[-1][1] = response_content
            yield history, bot
        else:
            response = model.chat(
                messages=history,
                generate_config={
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                    "stream": stream,
                },
            )
            history.append(response["choices"][0]["message"])
            bot[-1][1] = history[-1]["content"]
            yield history, bot

This is the code the Gradio UI calls.
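
Going by this handler, the Gradio path is itself just a RESTfulClient.chat call, so the only inputs that can differ from the standalone script are the messages in history (in particular, how the UI encoded the uploaded image) and the generate_config. A minimal sketch that replays the same call outside Gradio; captured_history is hypothetical and stands for the history list the UI actually built:

# Sketch: replay the exact call the Gradio handler makes, outside Gradio.
from xinference.client import RESTfulClient  # same class the handler above uses

client = RESTfulClient("http://localhost:9997")
model = client.get_model("internvl2")

# Hypothetical placeholder: replace with the `history` list actually built by
# the UI (e.g. logged from inside the predict() handler above). If outputs
# still differ with identical messages, the discrepancy lies elsewhere.
captured_history = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "your query"},
            {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
        ],
    }
]

response = model.chat(
    messages=captured_history,
    generate_config={"max_tokens": 512, "temperature": 1.0, "stream": False},
)
print(response["choices"][0]["message"]["content"])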
