Good morning, Nostr. Who's running local LLMs? What are the best models that can run at home for coding on a beefy PC? In 2026, I want to dig into local LLMs more and stop leaning on Claude and Gemini as much. I know I can use Maple for more private AI, but I prefer running my own model. I also like that there are no restrictions on models run locally. I know hardware is the bottleneck here; hopefully these things become more efficient.
Replies (14)
24 GB VRAM (3090, 7900-class cards): the latest Mistral 24B, Qwen3 32B, and Qwen3 30B-A3B (MoE).
48 GB: 70B-size models at decent quants; Mistral dev large at lobotomized quants. Mistral dev large is the main one in this bracket, though there might be other good 70Bs released lately.
96 GB: gpt-oss-120b.
This is to fit everything in VRAM. With MoEs (Qwen3 30B-A3B, gpt-oss) you can get by with VRAM + system RAM without ruining your speed, depending on how fast your RAM is.
But it's usually a speed hit, so I don't use anything that doesn't fit in VRAM.
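For reference, a minimal sketch of that VRAM + RAM split using the llama-cpp-python bindings; the model path, layer count, and context size are placeholder assumptions you would tune for your own card, not anyone's actual config.

# Sketch: partially offloading a GGUF model so some layers sit in VRAM and the
# rest spills to system RAM. Assumes llama-cpp-python built with GPU support;
# the model path and the numbers below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-30b-a3b-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=32,  # layers kept in VRAM; -1 offloads everything, 0 is CPU-only
    n_ctx=8192,       # context window; a bigger window means a bigger KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])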
Yup, and about 2 TB of storage just to fit all those models.
Thank you 🙏. Current cards I own: a 3090 and a 7900 XTX. Someone suggested dual 3090s. I'll check out the models, thanks.
@Kajoozie Maflingo uses local LLMs for his image generation, not coding. But he may have some insight.
Local LLMs are a blessing, free and loyal to you 😛
My setup is an i9-13900K + 64 GB at 6400 MT/s + RTX 4090.
The absolute best all-around AI is Llama 3.3, but it is a bit outdated and slow. Newer MoE models like Llama 4 and gpt-oss are flashier and faster, but they are mostly experts at hallucinating.
People will also suggest DeepSeek, but generally speaking 24 GB of VRAM is just too small for "reasoning models" to actually be an improvement. I haven't tried some of the more recent developments, but I have some hope.
If someone were to train a Llama 3.3-like model but focus it on tool use, like reading the source code and documentation for the libraries you have installed, then I think it could be very good.
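To make that concrete, here is a rough sketch of handing a local model a read-the-source tool over an OpenAI-compatible chat endpoint (llama-server and Ollama both expose one); the port, model name, and the read_file tool are made up for illustration, not an existing setup.

# Sketch: offering a hypothetical read_file tool to a local model via an
# OpenAI-compatible /v1/chat/completions endpoint. The URL, model name, and
# tool schema are assumptions for illustration.
import json
import requests

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool
        "description": "Return the text of a source file from the local project.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    # The tool itself: just read whatever file the model asked for.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder local server
    json={
        "model": "local-model",  # placeholder model name
        "messages": [{"role": "user", "content": "How does utils.py parse the config file?"}],
        "tools": TOOLS,
    },
).json()

# If the model chose to call the tool, run it and show a bit of the result.
for call in resp["choices"][0]["message"].get("tool_calls") or []:
    args = json.loads(call["function"]["arguments"])
    print(read_file(args["path"])[:500])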
I don't think you can really run anything unless you have a card with a minimum of 16 GB of VRAM. Even then, the model you can run would be maybe a quarter of Sonnet's performance. You need something like four 24 GB cards to get close.
As I understand it, the rough rule of thumb is about 1B parameters per 1 GB of RAM at 8-bit quantization.
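As a back-of-the-envelope check (my own numbers, not from the thread): weight memory is roughly parameters × bits-per-weight / 8, plus some overhead for context and the KV cache.

# Rough weight-memory estimate at a given quantization. Pure arithmetic; the
# quant choices and the 20% overhead fudge factor are assumptions.
def approx_mem_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~ 1 GB
    return weights_gb * (1 + overhead)

for name, params, bits in [("24B @ ~Q6", 24, 6.5), ("32B @ ~Q4", 32, 4.5), ("70B @ ~Q4", 70, 4.5)]:
    print(f"{name}: ~{approx_mem_gb(params, bits):.0f} GB")  # ~23, ~22, ~47 GB

Which lines up with the 24 GB and 48 GB brackets mentioned above.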
I use local LLMs exclusively, mostly for coding: two used 24 GB 3090s, which gives 48 GB of VRAM. It runs models up to 70B with very fast performance.
When inferring or training:
1. It uses a lot of power, peaking around 800 W
2. It spins up the fans pretty loudly
I don't think it's necessary to go local for open-source coding, though. Maple (mostly gpt-oss-120b) is great for that. I do think it's necessary to go local for uncensored models, training with your own data, and discussing things that don't fit mainstream bullshit narratives.
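If it helps, here is roughly what that dual-3090 layout looks like with llama-cpp-python's tensor_split option; the model file and the even 50/50 split are placeholders, not the actual config from this reply.

# Sketch: splitting a 70B-class GGUF across two 24 GB GPUs. The model path and
# the even split are assumptions; in practice the ratio gets tuned per card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.3-70b-instruct-q4_k_m.gguf",  # placeholder file
    n_gpu_layers=-1,          # offload every layer; a Q4 70B fits in 2 x 24 GB
    tensor_split=[0.5, 0.5],  # fraction of the model placed on GPU 0 and GPU 1
    n_ctx=8192,
)
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Refactor this recursive function to be iterative."}],
    max_tokens=256,
)
print(reply["choices"][0]["message"]["content"])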
I got a 3090; might pick up another. Thank you!
Dual 3090s are (somehow) still the sweet spot for local.
Played with Ollama before, thanks!
