Hey everybody, brand new to running local LLMs, so I’m learning as I go. Also brand new to lemmy.

I have a card with 16 GB of VRAM, and I was running models that overflow 16 GB by offloading some of the layers to CPU+RAM. It worked, but it was very slow, even with only a few layers on the CPU.

Well, I noticed llama.cpp has an rpc-server feature, so I tried it. It was very easy to use. I'm on Linux here, but it's probably similar on Windows or Mac. I had an older gaming rig sitting around with a GTX 1080 in it. It's much slower than my 4080, but using it to run a few layers is still FAR faster than using the CPU. Night and day, almost.

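In case it helps anyone getting started, here's the rough shape of what I ran. Treat it as a sketch: the IP address, model path, and port are placeholders, not my exact setup, and both machines need llama.cpp built with RPC support (-DGGML_RPC=ON).

```
# On the old rig (the GTX 1080 box): build with CUDA + RPC, then start the RPC worker
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server -H 0.0.0.0 -p 50052    # listen for the main machine on port 50052

# On the main machine (the 4080 box, also built with -DGGML_RPC=ON):
# point llama.cpp at the remote worker and offload all layers to GPU
./build/bin/llama-server -m ./models/your-model.gguf \
    --rpc 192.168.1.50:50052 \
    -ngl 99
```
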
The main drawbacks I’ve experienced so far are:

  • By default it tries to split the model evenly across the machines. That’s fine if you have the same card in all of them, but I wanted to put as much of the model as possible on the fastest card. You can do that with the --tensor-split parameter (see the sketch after this list), but it takes some experimenting to get the ratio right.

  • It loads the RPC machine’s part of the model across the network every time you start the server, which can be slow on a 1 gigabit network (1 Gb/s tops out around 125 MB/s, so several GB of layers takes a while). I didn’t see any way to tell rpc-server to load its part of the model from a local copy. It takes my startups from 1-2 seconds up to more like 30-50 seconds.

  • Q8-quantized KV cache works, but Q4 does not (see the flag example below).
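
For the tensor split, it's something along these lines. The values are proportions, not gigabytes, and 13,3 is just a made-up example that skews most of the model onto the faster card. The order follows the device order llama.cpp prints at startup, so check the log to see which position is which card.

```
# Rough sketch: push most layers onto the 4080, only a few onto the remote 1080
./build/bin/llama-server -m ./models/your-model.gguf \
    --rpc 192.168.1.50:50052 \
    -ngl 99 \
    --tensor-split 13,3    # proportions per device, tune for your cards
```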

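And these are the KV cache flags I mean. Same caveat as above: paths and the address are placeholders, and depending on your build, quantizing the V cache may also need flash attention turned on.

```
# Q8 KV cache works for me over RPC; swapping q8_0 for q4_0 did not
./build/bin/llama-server -m ./models/your-model.gguf \
    --rpc 192.168.1.50:50052 \
    -ngl 99 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0    # V cache quantization may also require flash attention (-fa)
```
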
Lots of people may not be able to run 2 or 3 GPUs in one PC, but might have another PC they can add over the network. Worth a try, I’d say, if you want more VRAM space.