
Gradio web UI inference and Python inference produce inconsistent outputs #2705

Open
m00nLi opened this issue Dec 25, 2024 · 3 comments

m00nLi commented Dec 25, 2024

System Info

Linux: Ubuntu 22.04 LTS
GPU: H100
NVIDIA-SMI: 560.35.05
Driver Version: 560.35.05
CUDA Version: 12.6
python: 3.10.15
gradio: 5.9.1
transformers: 4.47.0

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

xinference: 1.1.0
xinference-client: 0.13.3

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

1. Gradio web UI inference

  1. Start xinference: xinference-local --host 0.0.0.0 --port 9997
  2. At http://localhost:9997/ui/#/launch_model/llm, launch the InternVL2 series models (8B, 26B, 40B); Qwen2-vl-instruct-7B and 72B were also tested
  3. At http://localhost:9997/internvl2/?, upload an image, enter a prompt, and click Send to get the model output

2. Python inference

Adapted from the LLM inference example in the xinference documentation (https://inference.readthedocs.io/zh-cn/latest/index.html), with "max_tokens": 512, "temperature": 1.0 set to match the web UI's default parameters:

from xinference.client import Client

client = Client("http://localhost:9997")
model = client.get_model("internvl2")

query_text = "your query"
image_file = "/path/to/image.jpg"

# Chat with the VL model
result = model.chat(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query_text},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_file,
                    },
                },
            ],
        }
    ],
    generate_config={"max_tokens": 512, "temperature": 1.0},
)
msg = result["choices"][0]["message"]["content"]
print(msg)

Running inference on the same image and prompt multiple times in the web UI, the output is very stable.
But with the same image and prompt, the web UI output and the Python output differ drastically; sometimes the two answers reach completely opposite conclusions.
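
To make the comparison airtight, sampling randomness can be ruled out first; below is a minimal sketch reusing the names from the snippet above (whether temperature=0.0 is honored as pure greedy decoding depends on the serving backend):

# Reuse the same messages as above, but drop the temperature to (near) zero so
# sampling randomness cannot explain the difference between the two paths.
# Assumption: whether temperature=0.0 maps to pure greedy decoding depends on
# the backend engine serving the model.
deterministic = model.chat(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query_text},
                {"type": "image_url", "image_url": {"url": image_file}},
            ],
        }
    ],
    generate_config={"max_tokens": 512, "temperature": 0.0},
)
print(deterministic["choices"][0]["message"]["content"])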

3. So I then tried inference with gradio_client

from gradio_client import Client, handle_file

client = Client("http://localhost:9997/internvl2")

query_text = "your query"
image_file = "/path/to/image.jpg"

# Add the text and image to the chat history
internvl2 = client.predict(
    bot=[],
    text=query_text,
    image=handle_file(image_file),
    video=None,
    api_name="/add_text",
)

# Run generation on the accumulated history
result = client.predict(
    bot=internvl2[0],
    max_tokens=512,
    temperature=1.0,
    stream=False,
    api_name="/predict",
)

client.predict(api_name="/clear_history")

But the result doesn't match the web UI inference either. That makes no sense; isn't gradio_client supposed to replicate exactly what the web UI does?
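
One way to narrow this down is to look at what /add_text actually returned before it is fed into /predict, since any difference between this history and the one the web UI maintains would mean the model sees different messages. A minimal inspection sketch, making no assumptions about the exact return structure:

# Dump the raw return value of /add_text before passing it on, to check
# whether the stored history matches what the web UI sends to the model.
import json
print(json.dumps(internvl2, indent=2, ensure_ascii=False, default=str))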

Expected behavior

All three inference paths should produce consistent outputs.

XprobeBot added the gpu label Dec 25, 2024
XprobeBot added this to the v1.x milestone Dec 25, 2024
qinxuye (Contributor) commented Dec 26, 2024

The Gradio UI just calls the regular API under the hood; isn't some inconsistency in large-model inference normal?

m00nLi (Author) commented Dec 27, 2024

> The Gradio UI just calls the regular API under the hood; isn't some inconsistency in large-model inference normal?

Does the Gradio UI call the Python inference interface client.get_model("Model_UID").chat(...)? If so, shouldn't the outputs be the same?
With the same image and prompt, 10 runs through the Gradio UI give essentially identical outputs, and 10 runs through Python also give essentially identical outputs;
but comparing the two sets against each other, they don't match.
For example, when I ask how many people are in an image, the Gradio UI says three people in all 10 runs, while Python says only one person in all 10 runs. Is that still normal?
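
For reference, a minimal sketch of the repeated run on the Python side (the question and image path below are placeholders):

# Side-by-side check (sketch): run the same question through the REST client
# ten times and collect the answers; the 10 web UI runs can then be compared
# against this list.
from xinference.client import Client

client = Client("http://localhost:9997")
model = client.get_model("internvl2")

answers = []
for _ in range(10):
    result = model.chat(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "How many people are in this picture?"},
                    {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
                ],
            }
        ],
        generate_config={"max_tokens": 512, "temperature": 1.0},
    )
    answers.append(result["choices"][0]["message"]["content"])

print("\n---\n".join(answers))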

qinxuye (Contributor) commented Dec 27, 2024

def build_chat_vl_interface(
    self,
) -> "gr.Blocks":
    def predict(history, bot, max_tokens, temperature, stream):
        from ..client import RESTfulClient

        client = RESTfulClient(self.endpoint)
        client._set_token(self._access_token)
        model = client.get_model(self.model_uid)
        assert isinstance(model, RESTfulChatModelHandle)
        if stream:
            response_content = ""
            for chunk in model.chat(
                messages=history,
                generate_config={
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                    "stream": stream,
                },
            ):
                assert isinstance(chunk, dict)
                delta = chunk["choices"][0]["delta"]
                if "content" not in delta:
                    continue
                else:
                    response_content += delta["content"]
                    bot[-1][1] = response_content
                    yield history, bot
            history.append(
                {
                    "content": response_content,
                    "role": "assistant",
                }
            )
            bot[-1][1] = response_content
            yield history, bot
        else:
            response = model.chat(
                messages=history,
                generate_config={
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                    "stream": stream,
                },
            )
            history.append(response["choices"][0]["message"])
            bot[-1][1] = history[-1]["content"]
            yield history, bot

This is the code the Gradio UI calls.
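
Going by this handler, the Gradio path is itself just a RESTfulClient.chat call, so the only inputs that can differ from the standalone script are the messages in history (in particular, how the UI encoded the uploaded image) and the generate_config. A minimal sketch that replays the same call outside Gradio; captured_history is hypothetical and stands for the history list the UI actually built:

# Sketch: replay the exact call the Gradio handler makes, outside Gradio.
from xinference.client import RESTfulClient  # same class the handler above uses

client = RESTfulClient("http://localhost:9997")
model = client.get_model("internvl2")

# Hypothetical placeholder: replace with the `history` list actually built by
# the UI (e.g. logged from inside the predict() handler above). If outputs
# still differ with identical messages, the discrepancy lies elsewhere.
captured_history = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "your query"},
            {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
        ],
    }
]

response = model.chat(
    messages=captured_history,
    generate_config={"max_tokens": 512, "temperature": 1.0, "stream": False},
)
print(response["choices"][0]["message"]["content"])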
