So, I have a GPU with 16GB of VRAM (4070 Ti Super) and 32GB of DDR4 RAM. The RAM is slow af, so I tend to run models fully on the GPU.
I can comfortably run models up to ~21B at Q4, sometimes at a high Q3.
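For anyone curious where that ceiling comes from, here's my back-of-envelope math. The bits-per-weight (bpw) figures are rough llama.cpp averages (quant blocks carry scale metadata, so Q4_K_M sits near ~4.8 bpw, not 4.0), and the flat overhead allowance for KV cache and buffers is a guess:

```python
# Rough VRAM math for a fully GPU-offloaded GGUF model.
# bpw values are approximate llama.cpp averages, not exact for any file.

def vram_gb(params_b: float, bpw: float, overhead_gb: float = 1.5) -> float:
    """Approximate VRAM in GB: quantized weights plus a flat overhead
    allowance for KV cache and compute buffers (assumed, not measured)."""
    weights_gb = params_b * bpw / 8  # params in billions -> GB directly
    return weights_gb + overhead_gb

for name, params_b, bpw in [
    ("21B @ Q4_K_M", 21, 4.8),
    ("21B @ Q3_K_M", 21, 3.9),
]:
    print(f"{name}: ~{vram_gb(params_b, bpw):.1f} GB")
```

A 21B at Q4_K_M lands around ~14 GB, which leaves just a little headroom on a 16GB card, and that matches what I see in practice.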
I've been testing various models out there, but I was wondering if you guys have any recommendations.
I'm also really interested in understanding how much quantization actually degrades model quality. For example, which would be better: a 12B at Q6 (like Gemma 3 12B), a 32B at Q2_K_L (such as QwQ 32B), or a 27B at Q3_XS (such as Gemma 3 27B)?
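To make the comparison concrete, here's the same weight-size math for those three options. I'm reading Q3_XS as llama.cpp's IQ3_XS (~3.3 bpw), and all the bpw values are rough averages that vary a bit per model:

```python
# Weight footprint for the three candidate model/quant combos.
# bpw values are approximate llama.cpp averages (assumptions, not measured).

options = [
    ("Gemma 3 12B @ Q6_K",   12, 6.6),
    ("QwQ 32B @ Q2_K_L",     32, 2.8),
    ("Gemma 3 27B @ IQ3_XS", 27, 3.3),
]
for name, params_b, bpw in options:
    weights_gb = params_b * bpw / 8  # params in billions -> GB directly
    print(f"{name}: ~{weights_gb:.1f} GB of weights")
```

By this math all three land around 10-11 GB of weights, so they're roughly interchangeable VRAM-wise and the question really comes down to quality.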