The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.

My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and run them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected over a 10 GbE network. I can get a 10 GbE network running for this project.

Does Ollama or any other local AI project support this? Getting a server motherboard and CPU is going to get expensive very quickly, so this would be a great alternative.

Thanks

  • Overspark@feddit.nl · 3 days ago

    A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that’ll fit is a much better option than distributing them over multiple PCs.

    • litchralee@sh.itjust.works · 3 days ago

      I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: PCIe Gen 1 in an x1 slot only runs at 2.5 GT/s, which with 8b/10b encoding translates to about 2 Gbit/s, making it about 5x slower than a 10 Gbps line-rate network.

      I sincerely hope OP is not running modern AI work on a mobo with only Gen 1…
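
      As a rough sanity check on those numbers (a sketch; real-world throughput is lower once protocol and TCP overhead are counted):

      ```python
      # Per-lane PCIe bandwidth by generation vs. a 10 Gbps Ethernet link.
      # Gen 1/2 use 8b/10b encoding, gen 3/4 use 128b/130b.
      GEN = {1: (2.5, 8 / 10), 2: (5.0, 8 / 10), 3: (8.0, 128 / 130), 4: (16.0, 128 / 130)}
      TEN_GBE = 10.0  # Gbit/s line rate

      for gen, (gts, eff) in GEN.items():
          for lanes in (1, 4, 16):
              gbit = gts * eff * lanes
              print(f"PCIe gen{gen} x{lanes:<2}: {gbit:6.1f} Gbit/s "
                    f"({gbit / TEN_GBE:4.1f}x a 10 GbE link)")
      ```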

      • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

        Thanks for the comment. I don’t want to use a networked distributed cluster for AI if I can help it. I’m looking at other options; maybe I’ll find something.

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

      Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes without costing much, but I couldn’t find anything good for EPYC. I’m now looking at used Supermicro motherboards and maybe I’ll find something I like. I don’t want to do networking for this project either, but it was the only idea I could come up with a few hours back.

  • Xanza@lemm.ee · 3 days ago

    > consumer motherboards don’t have that many PCIe slots

    The number of PCIe slots isn’t the most limiting factor when it comes to consumer motherboards. It’s the number of PCIe lanes that your CPU supports and that the motherboard has access to.

    It’s difficult to find non-server-focused hardware that can do something like this, because you need a significant number of PCIe lanes from your CPU to run several GPUs at full speed. Using an M.2 SSD? Even more difficult.

    Your one-GPU-per-machine plan is a decent approach. Using a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster and installing the GPU drivers and device plugin on each node, which exposes the device to the system. Then, when you create your Ollama container, ensure in the prestart hook that your GPUs are exposed to the container for usage.
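
    As a rough sketch of what that looks like with the kubernetes Python client once the NVIDIA device plugin is running on each node (the pod name here is just a placeholder):

    ```python
    # Schedule an Ollama pod that requests one GPU. The NVIDIA device plugin
    # advertises "nvidia.com/gpu" on each node, so the scheduler places the
    # pod on a node with a free GPU and the runtime exposes it to the container.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="ollama-gpu"),  # placeholder name
        spec=client.V1PodSpec(containers=[
            client.V1Container(
                name="ollama",
                image="ollama/ollama",
                ports=[client.V1ContainerPort(container_port=11434)],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # one GPU per pod/node
                ),
            )
        ]),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
    ```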

    The issue with doing this is that 10 GbE is very slow compared to your GPU’s PCIe link. You’re networking all these GPUs to do some cool stuff, but then you’re severely bottlenecking yourself with your network. All in all, it’s not a very good plan.

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 2 days ago

      I agree with your assessment. I was indeed going to run k8s, just hadn’t figured out what you told me. Thanks for that.

      And yes, I realised that 10 GbE is just not enough for this stuff. But another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea to look for older Intel CPU + motherboard combos. Maybe I’ll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.

  • BombOmOm@lemmy.world · 3 days ago

    Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.

    Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.
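
    For a rough sense of the lane budget across platforms (approximate numbers from memory; check the exact CPU/board spec before buying anything):

    ```python
    # How many x8 GPUs (plus one x4 NVMe drive) fit on the CPU lanes of a
    # few platforms. Lane counts are approximate and vary by SKU and board.
    PLATFORMS = {
        "AM5 (Ryzen 7000/9000)": 24,   # usable lanes; ~4 more feed the chipset
        "Xeon E5-2600 v4":       40,   # per socket
        "Threadripper (TRX40)":  64,
        "EPYC (SP3)":            128,
    }
    GPU_LANES, NVME_LANES = 8, 4

    for name, lanes in PLATFORMS.items():
        gpus = (lanes - NVME_LANES) // GPU_LANES
        print(f"{name:24s}: {lanes:3d} lanes -> ~{gpus} GPUs at x8 + one x4 NVMe")
    ```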

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

      Aren’t Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.

      Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.

      • litchralee@sh.itjust.works · 3 days ago

        Prior-gen Epyc boards show up on eBay from time to time, often as CPU+mobo bundles from Chinese datacenters that are upgrading to latest gen. These can be had for a deal, if they’re still available, and would provide PCIe lanes for days.

        • BombOmOm@lemmy.world · 3 days ago

          Yeah, adding to your post, Threadripper also has lots of PCIe lanes. Here is one that has 4 x16 slots. And note, I am not endorsing that specific listing; I did very minimal research on it and am just using it as an example.

          Edit: Marauding_gibberish, if you need/want AM5: X670E motherboards have a good number of PCIe lanes and can be bought used now (X870E is the newest-gen AM5 chipset with lots of lanes as well, but both pale compared to what you can get with EPYC or Threadripper).

            • BombOmOm@lemmy.world · 3 days ago

              All I did for that one was search “Threadripper” and look at the pictures for ones with 4x x16 slots that were not hella expensive. There are technically filters for that, but I don’t trust people to list their things correctly.

              For which chipsets, etc. to look for, check out this page. If you click on Learn More next to AM5, for example, it tells you how many PCIe lanes are on each chipset type, which can give you some initial search criteria to look for. (That is what made me point out X670E, as it has the most lanes but is not the newest gen, so you can find used versions.)

      • reptar@lemmy.world · 3 days ago

        Hey, I built a micro-ATX EPYC system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack). I can find the details tomorrow if you’d like. Just let me know!

        E: well, it looks like I remembered wrong and it was ATX, not micro-ATX. I think it is the ASRock Rack ROMED8-2T, which has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don’t think it’s sold anymore, other than at really high prices on eBay.

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Thank you, and that highlights the problem: I don’t see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than purchasing Frankenstein boards from AliExpress. Which isn’t going to be a thing for much longer with tariffs, so I’m looking elsewhere.

      • just_another_person@lemmy.world · 3 days ago

        Wow, so you want to use inefficient models super cheap. I guarantee nobody has ever thought of this before. Good move coming to Lemmy for tips on how to do so. I bet you’re the next Sam Altman 🤣

  • False@lemmy.world · 3 days ago

    You’re entering the realm of enterprise AI horizontal scaling, which is $$$$

  • MalReynolds@slrpnk.net · 2 days ago

    Sure, this works fine for inference with tensor parallelism; USB4 / Thunderbolt 4/5 (40 Gbit+ and already there) is a better bet than Ethernet (see distributed-llama). It’s trash for training / fine-tuning, though, which needs higher inter-GPU speed, or, better yet, more VRAM on a single GPU.

  • happydoors@lemm.ee · 2 days ago

    I know nothing technical to help you, but this guy’s YouTube video goes over random shit about using different computers. I believe he uses Thunderbolt 4 to connect the systems, though. Plenty of other material on YouTube as well.

      • Natanox@discuss.tchncs.de · 3 days ago

        Maybe take a look at systems with the newer AMD SoCs first. They utilize the system’s RAM and come with a proper NPU; once Ollama or mistral.rs support those, they might give you sufficient performance for your needs at way lower cost (including power consumption). Depending on how NPU support gets implemented, it might even become possible to use the NPU and GPU in tandem, which would probably enable pretty powerful models to run on consumer-grade hardware at reasonable speed.

          • Natanox@discuss.tchncs.de · 3 days ago

            Depends on which GPU you compare it with, what model you use, what kind of RAM it has to work with, et cetera. NPUs are purpose-built chips, after all. Unfortunately the whole tech is still very young, so we’ll have to wait for stuff like Ollama to introduce native support before we get an apples-to-apples comparison. The raw numbers do look promising, however.

        • just_another_person@lemmy.world · 3 days ago

          It wouldn’t even matter. OP doesn’t understand how any of this works, and is instead just running rampant calling everything bullshit 😂

      • TootGuitar@sh.itjust.works · 3 days ago

        This is false: Mistral Small 24B at q4_K_M quantization is 15 GB; q8 is 26 GB. A 3090/4090/5090 with 24 GB, or two cards with 16 GB each (I recommend the 4060 Ti 16 GB), will run this model fine, and will work in a single computer. Like others have said, 10 GbE will be a huge bottleneck; plus, it’s simply not necessary to distribute a 24B model across multiple machines.
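
        The back-of-the-envelope math behind those sizes (a sketch; the effective bits-per-weight for each quant are approximate, and real GGUF files add some overhead plus KV cache at runtime):

        ```python
        # Rough quantized model size: parameters * effective bits per weight / 8.
        PARAMS = 24e9  # Mistral Small, ~24B parameters

        for quant, bpw in [("q4_K_M", 4.8), ("q8_0", 8.5)]:  # approx. bits/weight
            gb = PARAMS * bpw / 8 / 1e9
            print(f"{quant}: ~{gb:.1f} GB of weights")
        ```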

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Thank you, but which consumer motherboard + CPU combo is giving me 32 lanes of PCIe Gen 4 neatly divided into 2 x16 slots for me to put 2 GPUs in? I only asked this question because I was going to buy used computers and stuff a GPU in each.

          Your point about networking is valid, and I’m hesitant to invest in 25 GbE right now.

          • CondorWonder@lemmy.ca · 3 days ago

            You don’t need the cards to have full bandwidth; the only time it will matter is when you’re loading the model onto the card. You need a motherboard with x16 slots, but even x4 connections would be good enough. Running the model doesn’t need a lot of bandwidth. Remember, you only load the model once and then reuse it.

            An x4 PCIe Gen 4 slot has a ~7.8 GB/s theoretical transfer rate (after encoding overhead), and an x16 slot has ~31.5 GB/s, so disk I/O is likely your limit even for an x4 slot.
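
            As a quick illustration, here’s roughly how long a one-time load of a ~15 GB model (the q4 quant mentioned earlier in the thread) would take over each link (theoretical figures; real transfers are slower):

            ```python
            # Seconds to push ~15 GB of weights over each link at its rough
            # theoretical bandwidth. The SSD is often the real ceiling.
            MODEL_GB = 15.0
            LINKS_GBPS = {            # GB/s
                "10 GbE":          1.25,
                "PCIe gen4 x4":    7.9,
                "PCIe gen4 x16":   31.5,
                "NVMe SSD (gen4)": 7.0,
            }

            for name, bw in LINKS_GBPS.items():
                print(f"{name:16s}: ~{MODEL_GB / bw:5.1f} s to load {MODEL_GB:.0f} GB")
            ```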

      • just_another_person@lemmy.world · 3 days ago

        I assume you’re talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I’m mostly pointing out how pointless it is for you to do this. What a waste of time and money.

        Edit: others are also pointing this out, but I’m still being downvoted. Mkay.

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Used 3090s go for $800. I was planning to wait for the Arc B580s to come down in price and buy a few. The reason for the networked setup is that I didn’t find enough PCIe lanes in any of the used computers I was looking at. If there’s either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with a lot of PCIe lanes under $200, then I’ll gladly drop the idea of distributed AI. I just need lots of VRAM, and this is the only way I could think of to get it.

          Thanks

          • just_another_person@lemmy.world · 3 days ago

            PLEASE look back at the crypto mining rush of a decade ago. I implore you.

            You’re buying into something that doesn’t exist.