RPC works with LLama but not Qwen. #10943

ehtom · 2024-12-22T13:39:17Z

ehtom
Dec 22, 2024

Hello!

I have been experimenting using the following machine configuration:

Machine1: 4090, WSL Linux on Win11, 2.5GbE wired. This machine runs rpc-server.
Machine2: 2*2080Ti, Ubuntu Linux, 1GbE wired. This machine runs llama-cli or llama-server.

I have been attempting to run/test the following models. I had to comment out

llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp

Line 467 in ebdee94

GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor");

to run these quantized models:

Tulu3 70b Q4_K_M (bartowski quant) - WORKS 7tok/s all layers offloaded.
Llama 3.3 70b Q4_K_M (bartowski quant) - WORKS 7tok/s all layers offloaded.
Qwen2.5 72b Q4_K_M (official and bartowski quant) - RUNS BUT EMITS GARBAGE when all layers offloaded. WORKS but very slowly with about 40 layers offloaded.

Based on the line which I commented out, I suspect that this is because Qwen2.5-72b has a intermediate_size of 29568, which is not divisible by 512?

If this is the reason, is it possible to get Qwen2.5 working over RPC by implementing cuda-like padding of 512 in ggml-rpc.cpp?

I think this RPC functionality is extremely cool and its a lot more lightweight and configurable for enthusiasts than other options in other engines, which seem geared towards setting up production inference clusters as they all rely on docker + ray combo it seems.

ehtom · 2024-12-29T01:22:06Z

ehtom
Dec 29, 2024
Author

I have been trying to understand the RPC architecture and I've come up with this:

The ggml_backend_rpc functions implement a sort of 'interface' backend which just calls an rpc server hosting the real backend. Some of the functions don't do what you might initially expect like the init function does not actually initialize anything on the server and the vast majority of action happens when doing graph computations.
The rpc_server takes requests and passes everything to a CUDA backend.
CUDA backends pad everything to 512.

It looks like RPC does not keep track of this padding which is actually going on, so it ends up not mirroring what the (assuming CUDA) backend is actually doing. In addition it will be passing wrong tensor information to whatever code is coordinating all the memory and tensor splitting. I guess this is the source of why this doesn't work for Qwen? Is this correct or am I barking up the wrong tree here?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RPC works with LLama but not Qwen. #10943

{{title}}

Replies: 1 comment

{{title}}

Select a reply

RPC works with LLama but not Qwen. #10943

ehtom Dec 22, 2024

Replies: 1 comment

ehtom Dec 29, 2024 Author

ehtom
Dec 22, 2024

ehtom
Dec 29, 2024
Author