-
Hi! I'm interested in using ComfyUI with multiple GPUs for both training and inference.
Any guidance or insights on this matter would be extremely helpful.
-
I've been having the same problem and I'm willing to pay!!
-
SwarmUI (https://github.com/mcmonkeyprojects/SwarmUI) provides a UI that can manage multiple ComfyUI instances as backends at once. Currently, ComfyUI does not provide a way to execute workflows in parallel. If you are a developer and want to implement multi-GPU inference, I think modifying the KSampler would be the most effective place to start. If I had a multi-GPU environment I would experiment with this myself, but I'm not sure how well PyTorch handles this scenario in practice. Also note that several custom nodes hijack the sampling function, so your modifications might break those nodes.
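To make the idea concrete, here is a minimal data-parallel sampling sketch in plain PyTorch: the latent batch is split across the visible GPUs and each shard runs its own denoising loop. This does not use ComfyUI's actual KSampler API; `ToyDenoiser`, the step rule, and the shapes are made up purely for illustration.

```python
import torch


class ToyDenoiser(torch.nn.Module):
    """Hypothetical stand-in for a denoising model; not ComfyUI's real UNet wrapper."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, latents, timestep):
        return self.net(latents)


@torch.no_grad()
def sample_data_parallel(latents, steps=20):
    """Split a latent batch across the visible GPUs and run the loop on each shard."""
    if torch.cuda.device_count() >= 2:
        devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    else:
        devices = [torch.device("cpu")]  # fallback so the sketch still runs without GPUs

    # One model replica per device; a real integration would load the weights once and copy them.
    replicas = [ToyDenoiser().to(d) for d in devices]
    shards = [s.to(d) for s, d in zip(latents.chunk(len(devices)), devices)]

    for t in range(steps):
        # CUDA launches are asynchronous, so shards on different devices overlap in time.
        shards = [lat - 0.1 * model(lat, t) for model, lat in zip(replicas, shards)]

    return torch.cat([s.cpu() for s in shards])


if __name__ == "__main__":
    print(sample_data_parallel(torch.randn(8, 4, 64, 64)).shape)
```

The caveat from the comment applies: custom nodes that patch the sampling path would not see a change like this, so results could diverge between nodes.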
-
PyTorch does provide multi-GPU handling; at the very least you can choose which device each tensor and model lives on, so this should be doable.
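For what it's worth, the device-selection primitives fit in a few lines (a trivial sketch, not ComfyUI code; the shapes are arbitrary):

```python
import torch

print("visible GPUs:", torch.cuda.device_count())

# Pick devices explicitly; fall back to CPU if fewer than two GPUs are present.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() > 0 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

a = torch.randn(1024, 1024, device=dev0)
b = torch.randn(1024, 1024, device=dev1)

# Work placed on different devices can run concurrently; results have to be moved
# onto a common device before they are combined.
result = (a @ a) + (b @ b).to(dev0)
print(result.device)
```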
-
To be honest I have no idea, but I have the same problem because I need more VRAM and haven't bought an extra GPU yet. I did find a custom node package in the ComfyUI Node Manager, though: ComfyUI_NetDist. Its description says: "Run ComfyUI workflows on multiple local GPUs/networked machines. Nodes: Remote images, Local Remote control". Let me know if it works, because then I'll buy an extra GPU! Best of luck :)
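Whichever node pack you end up with, the usual prerequisite is several ComfyUI server instances pinned to different GPUs. Here is a rough sketch of launching two of them from Python, assuming a checkout at ~/ComfyUI and ComfyUI's --cuda-device/--port/--listen command-line flags; the paths, GPU indices, and ports are placeholders to adjust for your setup.

```python
import subprocess
from pathlib import Path

COMFY_DIR = Path.home() / "ComfyUI"  # assumed install location, adjust as needed


def launch_instance(gpu_index: int, port: int) -> subprocess.Popen:
    """Start one ComfyUI server pinned to a single GPU on its own port."""
    return subprocess.Popen(
        [
            "python", "main.py",
            "--cuda-device", str(gpu_index),  # restrict this process to one GPU
            "--port", str(port),
            "--listen", "127.0.0.1",
        ],
        cwd=COMFY_DIR,
    )


if __name__ == "__main__":
    servers = [launch_instance(0, 8188), launch_instance(1, 8189)]
    for proc in servers:
        proc.wait()
```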
-
Hmm... splitting model execution across GPUs is the more complicated part. A good start would be updating the queue so the next available job is sent to the next available local GPU and/or remote server URL, to accelerate cranking through queued jobs. Probably add some UI to select which GPUs to queue on, in case you have GPUs with mixed memory sizes and some are too small for a given model. Model splitting is useful too, but I think batching across more cards is the lower-hanging fruit that would be useful in the near term. You'd also want an asynchronous completion node to put in front of the image save node, so the job knows how to route results back instead of writing the file out locally on the remote server. EasyDiffusion is/was fantastic at this kind of thing, but it's in maintenance mode and was recently broken by Hugging Face changing their URLs around.

As AI accelerates there will be a glut of older cards on the market that are considered too slow for miners but are still useful. Hooking up a ton of cards in a mining rack with PCIe 1x extenders limits you to a single PCIe lane per card, but for a lot of workloads that mostly affects model load times: when you're rendering batches of images on the same model, the model is already loaded on the card, so after the first image the link speed doesn't matter much. You can attach on the order of 28 GPUs to a single system with those extenders, a couple of mining racks, and server power supplies. For example, I picked up a used mining rig with six 6GB NVIDIA 1060s for super cheap, a pile of 24GB Tesla M40s for under $100 each, and some Tesla K80s for $30 each. The performance is on the slow side, and eventually they'll drop off the CUDA support list for the latest compute capabilities, but those old 24GB and dual-12GB cards have a lot of life left in them (assuming you either DIY the coolers or have a proper server chassis). They have enough memory to run the bigger models easily, and I don't really care if I have to wait overnight to generate a few thousand images; running eight of them in parallel isn't that far off the speed of a newer card if batching jobs across cards is easy. You can also use full-speed 16x riser cables: a bunch of motherboards that physically don't let you populate every PCIe slot suddenly let you use all the ports when you use risers, and ATX power supplies are a joke compared to server breakout-board setups. That's if you don't mind your rig looking like Lain's bedroom from a 90s anime; it's not like you wouldn't stick the thing in your garage or basement anyway.

I'm trying to play with NetDist and StableSwarmUI, and it does fire up across cards, but it's also pretty janky. StableSwarmUI handles launching a bunch of server instances easily enough, but ComfyUI needs a way to submit the jobs easily, and I think the best place to start is the queueing.
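Until something like that lands in core, a crude external dispatcher isn't hard to sketch: each ComfyUI instance exposes a /prompt HTTP endpoint that accepts a workflow exported in API format, so queued jobs can be round-robined across backends. A minimal sketch, assuming the servers from the earlier example are running and that workflow.json was saved via "Save (API Format)"; the URLs and file name are placeholders.

```python
import itertools
import json
import urllib.request

# Backend ComfyUI servers, e.g. one per local GPU or remote machine (placeholders).
BACKENDS = ["http://127.0.0.1:8188", "http://127.0.0.1:8189"]


def submit(workflow: dict, backend: str) -> dict:
    """POST one API-format workflow to a ComfyUI server's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{backend}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    with open("workflow.json") as f:  # exported with "Save (API Format)"
        workflow = json.load(f)

    # Round-robin a batch of identical jobs across the available backends.
    for job, backend in zip([workflow] * 8, itertools.cycle(BACKENDS)):
        print(backend, submit(job, backend))
```

Each server still writes its outputs to its own local output folder, which is exactly the result-routing gap described above.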
-
Is it possible to use this? https://github.com/mit-han-lab/distrifuser
-
xDiT provides a ComfyUI wrapper for their multi-GPU solution.