The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.

My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and run them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected over a 10 GbE network. I can get a 10 GbE network running for this project.

Does Ollama or any other local AI project support this? Getting a server motherboard and CPU is going to get expensive very quickly, so this would be a great alternative.

Thanks

  • Overspark@feddit.nl · 3 days ago

    A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that’ll fit is a much better option than distributing them over multiple PCs.

    • litchralee@sh.itjust.works · 3 days ago

      I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: PCIe Gen 1 in an x1 slot only runs at 2.5 GT/s, which with 8b/10b encoding translates to about 2 Gbit/s, making it about 5x slower than a 10 Gbps line-rate network.

      I sincerely hope OP is not running modern AI work on a mobo with only Gen 1…
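
      As a rough sanity check on those numbers (a sketch; real-world throughput is lower once protocol and TCP overhead are counted):

      ```python
      # Per-lane PCIe bandwidth by generation vs. a 10 Gbps Ethernet link.
      # Gen 1/2 use 8b/10b encoding, gen 3/4 use 128b/130b.
      GEN = {1: (2.5, 8 / 10), 2: (5.0, 8 / 10), 3: (8.0, 128 / 130), 4: (16.0, 128 / 130)}
      TEN_GBE = 10.0  # Gbit/s line rate

      for gen, (gts, eff) in GEN.items():
          for lanes in (1, 4, 16):
              gbit = gts * eff * lanes
              print(f"PCIe gen{gen} x{lanes:<2}: {gbit:6.1f} Gbit/s "
                    f"({gbit / TEN_GBE:4.1f}x a 10 GbE link)")
      ```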

      • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

        Thanks for the comment. I don’t want to use a networked distributed cluster for AI if I can help it. I’m looking at other options; maybe I’ll find something.

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

      Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes without costing much, but I couldn’t find anything good for EPYC. I’m now looking at used Supermicro motherboards and maybe I’ll find something I like. I don’t want to do networking for this project either, but it was the only idea I could come up with a few hours back.

  • Xanza@lemm.ee · 3 days ago

    > consumer motherboards don’t have that many PCIe slots

    The number of PCIe slots isn’t the most limiting factor when it comes to consumer motherboards. It’s the number of PCIe lanes that your CPU supports and that the motherboard has access to.

    It’s difficult to find non-server-focused hardware that can do something like this, because you need a significant number of PCIe lanes from your CPU to run several GPUs at full speed. Using an M.2 SSD? Even more difficult.

    Your one-GPU-per-machine plan is a decent approach. Using a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster and installing the GPU drivers and device plugin on each node, which exposes the device to the system. Then, when you create your Ollama container, ensure in the prestart hook that your GPUs are exposed to the container for usage.
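
    As a rough sketch of what that looks like with the kubernetes Python client once the NVIDIA device plugin is running on each node (the pod name here is just a placeholder):

    ```python
    # Schedule an Ollama pod that requests one GPU. The NVIDIA device plugin
    # advertises "nvidia.com/gpu" on each node, so the scheduler places the
    # pod on a node with a free GPU and the runtime exposes it to the container.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="ollama-gpu"),  # placeholder name
        spec=client.V1PodSpec(containers=[
            client.V1Container(
                name="ollama",
                image="ollama/ollama",
                ports=[client.V1ContainerPort(container_port=11434)],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # one GPU per pod/node
                ),
            )
        ]),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
    ```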

    The issue with doing this is that 10 GbE is very slow compared to your GPU’s PCIe link. You’re networking all these GPUs to do some cool stuff, but then you’re severely bottlenecking yourself with your network. All in all, it’s not a very good plan.

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 2 days ago

      I agree with your assessment. I was indeed going to run k8s, just hadn’t figured out what you told me. Thanks for that.

      And yes, I realised that 10 GbE is just not enough for this stuff. But another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea to look for older Intel CPU + motherboard combos. Maybe I’ll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.

  • BombOmOm@lemmy.world · 3 days ago

    Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.

    Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.
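
    For a rough sense of the lane budget across platforms (approximate numbers from memory; check the exact CPU/board spec before buying anything):

    ```python
    # How many x8 GPUs (plus one x4 NVMe drive) fit on the CPU lanes of a
    # few platforms. Lane counts are approximate and vary by SKU and board.
    PLATFORMS = {
        "AM5 (Ryzen 7000/9000)": 24,   # usable lanes; ~4 more feed the chipset
        "Xeon E5-2600 v4":       40,   # per socket
        "Threadripper (TRX40)":  64,
        "EPYC (SP3)":            128,
    }
    GPU_LANES, NVME_LANES = 8, 4

    for name, lanes in PLATFORMS.items():
        gpus = (lanes - NVME_LANES) // GPU_LANES
        print(f"{name:24s}: {lanes:3d} lanes -> ~{gpus} GPUs at x8 + one x4 NVMe")
    ```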

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

      Aren’t Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.

      Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.

      • litchralee@sh.itjust.works · 3 days ago

        Prior-gen Epyc boards show up on eBay from time to time, often as CPU+mobo bundles from Chinese datacenters that are upgrading to latest gen. These can be had for a deal, if they’re still available, and would provide PCIe lanes for days.

        • BombOmOm@lemmy.world · 3 days ago

          Yeah, adding to your post, Threadripper also has lots of PCIe lanes. Here is one that has 4 x16 slots. And note, I am not endorsing that specific listing; I did very minimal research on it and am just using it as an example.

          Edit: Marauding_gibberish, if you need/want AM5: X670E motherboards have a good number of PCIe lanes and can be bought used now (X870E is the newest-gen AM5 chipset with lots of lanes as well, but both pale compared to what you can get with EPYC or Threadripper).

            • BombOmOm@lemmy.world · 3 days ago

              All I did for that one was search “Threadripper” and look at the pictures for ones with 4x x16 slots that were not hella expensive. There are technically filters for that, but I don’t trust people to list their things correctly.

              For which chipsets, etc. to look for, check out this page. If you click on Learn More next to AM5, for example, it tells you how many PCIe lanes are on each chipset type, which can give you some initial search criteria to look for. (That is what made me point out X670E, as it has the most lanes but is not the newest gen, so you can find used versions.)

      • reptar@lemmy.world · 3 days ago

        Hey, I built a micro-ATX EPYC system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack). I can find the details tomorrow if you’d like. Just let me know!

        E: well, it looks like I remembered wrong and it was ATX, not micro-ATX. I think it is the ASRock Rack ROMED8-2T, which has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don’t think it’s sold anymore, other than at really high prices on eBay.

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Thank you, and that highlights the problem: I don’t see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than purchasing Frankenstein boards from AliExpress. Which isn’t going to be a thing for much longer with tariffs, so I’m looking elsewhere.

      • just_another_person@lemmy.world · 3 days ago

        Wow, so you want to use inefficient models super cheap. I guarantee nobody has ever thought of this before. Good move coming to Lemmy for tips on how to do so. I bet you’re the next Sam Altman 🤣

  • False@lemmy.world · 3 days ago

    You’re entering the realm of enterprise AI horizontal scaling, which is $$$$

  • MalReynolds@slrpnk.net · 2 days ago

    Sure, this works fine for inference with tensor parallelism; USB4 / Thunderbolt 4/5 (40 Gbit+ and already there) is a better bet than Ethernet (see distributed-llama). It’s trash for training / fine-tuning, though, which needs higher inter-GPU speed, or, better yet, more VRAM on a single GPU.

  • happydoors@lemm.ee · 2 days ago

    I know nothing technical to help you, but this guy’s YouTube video goes over random shit about using different computers. I believe he uses Thunderbolt 4 to connect the systems, though. Plenty of other material on YouTube as well.

      • Natanox@discuss.tchncs.de · 3 days ago

        Maybe take a look at systems with the newer AMD SoCs first. They utilize the system’s RAM and come with a proper NPU; once Ollama or mistral.rs support those, they might give you sufficient performance for your needs at way lower cost (including power consumption). Depending on how NPU support gets implemented, it might even become possible to use the NPU and GPU in tandem, which would probably enable pretty powerful models to run on consumer-grade hardware at reasonable speed.

          • Natanox@discuss.tchncs.de · 3 days ago

            Depends on which GPU you compare it with, what model you use, what kind of RAM it has to work with, et cetera. NPUs are purpose-built chips, after all. Unfortunately the whole tech is still very young, so we’ll have to wait for stuff like Ollama to introduce native support before we get an apples-to-apples comparison. The raw numbers do look promising, however.

        • just_another_person@lemmy.world · 3 days ago

          It wouldn’t even matter. OP doesn’t understand how any of this works, and is instead just running rampant calling everything bullshit 😂

      • TootGuitar@sh.itjust.works · 3 days ago

        This is false: Mistral Small 24B at q4_K_M quantization is 15 GB; q8 is 26 GB. A 3090/4090/5090 with 24 GB, or two cards with 16 GB each (I recommend the 4060 Ti 16 GB), will run this model fine, and will work in a single computer. Like others have said, 10 GbE will be a huge bottleneck; plus, it’s simply not necessary to distribute a 24B model across multiple machines.
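
        The back-of-the-envelope math behind those sizes (a sketch; the effective bits-per-weight for each quant are approximate, and real GGUF files add some overhead plus KV cache at runtime):

        ```python
        # Rough quantized model size: parameters * effective bits per weight / 8.
        PARAMS = 24e9  # Mistral Small, ~24B parameters

        for quant, bpw in [("q4_K_M", 4.8), ("q8_0", 8.5)]:  # approx. bits/weight
            gb = PARAMS * bpw / 8 / 1e9
            print(f"{quant}: ~{gb:.1f} GB of weights")
        ```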

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Thank you, but which consumer motherboard + CPU combo is giving me 32 lanes of PCIe Gen 4 neatly divided into 2 x16 slots for me to put 2 GPUs in? I only asked this question because I was going to buy used computers and stuff a GPU in each.

          Your point about networking is valid, and I’m hesitant to invest in 25 GbE right now.

          • CondorWonder@lemmy.ca · 3 days ago

            You don’t need the cards to have full bandwidth; the only time it will matter is when you’re loading the model onto the card. You need a motherboard with x16 slots, but even x4 connections would be good enough. Running the model doesn’t need a lot of bandwidth. Remember, you only load the model once and then reuse it.

            An x4 PCIe Gen 4 slot has a ~7.8 GB/s theoretical transfer rate (after encoding overhead), and an x16 slot has ~31.5 GB/s, so disk I/O is likely your limit even for an x4 slot.
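
            As a quick illustration, here’s roughly how long a one-time load of a ~15 GB model (the q4 quant mentioned earlier in the thread) would take over each link (theoretical figures; real transfers are slower):

            ```python
            # Seconds to push ~15 GB of weights over each link at its rough
            # theoretical bandwidth. The SSD is often the real ceiling.
            MODEL_GB = 15.0
            LINKS_GBPS = {            # GB/s
                "10 GbE":          1.25,
                "PCIe gen4 x4":    7.9,
                "PCIe gen4 x16":   31.5,
                "NVMe SSD (gen4)": 7.0,
            }

            for name, bw in LINKS_GBPS.items():
                print(f"{name:16s}: ~{MODEL_GB / bw:5.1f} s to load {MODEL_GB:.0f} GB")
            ```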

      • just_another_person@lemmy.world · 3 days ago

        I assume you’re talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I’m mostly pointing out how pointless it is for you to do this. What a waste of time and money.

        Edit: others are also pointing this out, but I’m still being downvoted. Mkay.

        • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 days ago

          Used 3090s go for $800. I was planning to wait for the Arc B580s to come down in price and buy a few. The reason for the networked setup is that I didn’t find enough PCIe lanes in any of the used computers I was looking at. If there’s either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with a lot of PCIe lanes under $200, then I’ll gladly drop the idea of distributed AI. I just need lots of VRAM, and this is the only way I could think of to get it.

          Thanks

          • just_another_person@lemmy.world · 3 days ago

            PLEASE look back at the crypto mining rush of a decade ago. I implore you.

            You’re buying into something that doesn’t exist.