Big day for people who run AI locally. According to the benchmarks, this is a big step forward for free, small LLMs.

  • brucethemoose@lemmy.world
    4 months ago

    A 3090.

    But it should be fine on a 3060, with zero offloading.

    Dump ollama if you want long context. Grab a 5-6bpw exl2 quantization and load it with a Q4 or Q6 KV cache, depending on how much context you want. I personally use EXUI, but text-gen-webui and tabbyapi (with some other frontend) will also load them.
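For anyone wondering how the bpw and cache settings trade off against context, here is a rough back-of-the-envelope sketch in Python. Everything model-specific in it is an assumption for illustration only: the 14B parameter count, 48 layers, 8 KV heads and head dimension of 128 are hypothetical placeholders, not the specs of any model in this thread, and the function names are made up. It also ignores quantization scale overhead, activations and framework overhead, so treat the numbers as a lower bound.

```python
GIB = 2**30


def weights_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of quantized weights: parameters * bits-per-weight."""
    return params_billion * 1e9 * bpw / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, cache_bits: float) -> float:
    """Approximate KV cache size.

    The factor 2 covers one K and one V tensor per layer. Overhead from
    quantization scales is ignored, so real usage will be somewhat higher.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * cache_bits / 8 / GIB


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    for bpw in (5, 6):
        for cache_bits in (4, 6, 16):
            total = weights_gib(14, bpw) + kv_cache_gib(
                context_tokens=32768, cache_bits=cache_bits, **shape)
            print(f"{bpw} bpw weights + Q{cache_bits} cache @ 32k: {total:.2f} GiB")
```

Plug in the real layer count, KV head count and head dimension from the model's config.json to get an estimate for your own card. The point is just that the cache grows linearly with context, so dropping it from 16-bit to Q4 frees up a lot of room for longer prompts.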