So, I have a GPU with 16GB of VRAM (4070 Ti Super) and 32GB of DDR4 RAM. The RAM is slow af, so I tend to run models fully on the GPU.
I can comfortably run models up to ~21B at Q4, sometimes at a high Q3.
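For anyone curious where that ceiling comes from, here's my back-of-envelope math. The bits-per-weight (bpw) figures are rough llama.cpp averages (quant blocks carry scale metadata, so Q4_K_M sits near ~4.8 bpw, not 4.0), and the flat overhead allowance for KV cache and buffers is a guess:

```python
# Rough VRAM math for a fully GPU-offloaded GGUF model.
# bpw values are approximate llama.cpp averages, not exact for any file.

def vram_gb(params_b: float, bpw: float, overhead_gb: float = 1.5) -> float:
    """Approximate VRAM in GB: quantized weights plus a flat overhead
    allowance for KV cache and compute buffers (assumed, not measured)."""
    weights_gb = params_b * bpw / 8  # params in billions -> GB directly
    return weights_gb + overhead_gb

for name, params_b, bpw in [
    ("21B @ Q4_K_M", 21, 4.8),
    ("21B @ Q3_K_M", 21, 3.9),
]:
    print(f"{name}: ~{vram_gb(params_b, bpw):.1f} GB")
```

A 21B at Q4_K_M lands around ~14 GB, which leaves just a little headroom on a 16GB card, and that matches what I see in practice.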
I've been testing various models out there, but I was wondering if you guys have any recommendations.
I'm also really interested in understanding how much quantization actually degrades model quality. For example, which would be better: a 12B at Q6 (like Gemma 3 12B), a 32B at Q2_K_L (such as QwQ 32B), or a 27B at Q3_XS (such as Gemma 3 27B)?
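To make the comparison concrete, here's the same weight-size math for those three options. I'm reading Q3_XS as llama.cpp's IQ3_XS (~3.3 bpw), and all the bpw values are rough averages that vary a bit per model:

```python
# Weight footprint for the three candidate model/quant combos.
# bpw values are approximate llama.cpp averages (assumptions, not measured).

options = [
    ("Gemma 3 12B @ Q6_K",   12, 6.6),
    ("QwQ 32B @ Q2_K_L",     32, 2.8),
    ("Gemma 3 27B @ IQ3_XS", 27, 3.3),
]
for name, params_b, bpw in options:
    weights_gb = params_b * bpw / 8  # params in billions -> GB directly
    print(f"{name}: ~{weights_gb:.1f} GB of weights")
```

By this math all three land around 10-11 GB of weights, so they're roughly interchangeable VRAM-wise and the question really comes down to quality.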