Posts: 23 · Comments: 107 · Joined: 3 yr. ago

  • Global sustainability rules???

  • I don't follow the discussions on this topic very closely, but as I understand it, there are different ways to achieve the goal, and all of them impact quality to some extent. Heretic is discussed as one of the SOTA methods. The README posted above states the following, so it seems that Heretic is some sort of next-gen abliteration.

    It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
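
    For context, a minimal sketch of what a TPE-based parameter search with Optuna looks like in general (the knob names and the toy objective below are made up for illustration; they are not Heretic's actual parameters):

    ```python
    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical knobs standing in for abliteration parameters;
        # Heretic's real objective reportedly trades refusal rate
        # against damage to model quality.
        strength = trial.suggest_float("ablation_strength", 0.0, 1.5)
        layer_frac = trial.suggest_float("layer_fraction", 0.1, 1.0)
        # Toy loss surface as a placeholder for the real evaluation.
        return (strength - 0.8) ** 2 + (layer_frac - 0.5) ** 2

    # TPE (Tree-structured Parzen Estimator) is Optuna's default sampler.
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(seed=42),
    )
    study.optimize(objective, n_trials=100)
    print(study.best_params)
    ```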

  • Yeah I enjoy it as well. Just in case you missed it: a fix was merged into llama.cpp two days ago that is said to improve quality.

    Edit: I stand corrected - the fix for the issue you're experiencing has not yet been merged.

  • LocalLLaMA @sh.itjust.works

    Qwen3-Coder-Next

    huggingface.co/Qwen/Qwen3-Coder-Next
  • Exactly this. Since it does not seem to be federated, you're still forced to give your data to a third party you can't choose. And this makes the open source aspect a rather marginal benefit, at least for the privacy-concerned end user. Still, I appreciate the effort.

  • Given that Google generated more than 250 billion U.S. dollars in ad revenue in 2024, I'd say they must be pretty effective.

    Source

  • I see. When I run the inference engine containerized, will the container be able to run its own version of CUDA or use the host's version?

  • Thank you for taking the time to respond.

    I've used vLLM for hosting a smaller model which could fit in two of my GPUs; it was very performant, especially for multiple requests at the same time. The major drawback for my setup was that it only supports tensor parallelism for 2, 4, 8, etc. GPUs, and data parallelism slowed inference down considerably, at least for my cards. exllamav3 is the only engine I'm aware of which supports 3-way TP.

    But I'm fully with you that vLLM seems to be the most recommended and battle-tested solution.

    I might take a look at how I can safely upgrade the driver; once I can afford a fourth card, I'll switch back to vLLM.
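
    In case it helps anyone reading along, this is roughly what the tensor-parallel setup looks like via vLLM's Python API (the model name is just an example, and per the constraint above tensor_parallel_size has to be one of the supported GPU counts, so 2 rather than 3 here):

    ```python
    from vllm import LLM, SamplingParams

    # vLLM shards the model across GPUs via tensor_parallel_size;
    # as noted above, odd counts like 3 are not supported, so a
    # 3-GPU box is limited to TP=2 (with one card left over).
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)

    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Hello, how are you?"], params)
    print(outputs[0].outputs[0].text)
    ```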

  • I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are a lot newer ones.

    I use 3x RTX 2000E Ada. It's a rather new, quite power-efficient GPU manufactured by PNY.

    As inference engine I use exllamav3 with tabbyAPI. I like it very much because it supports 3-way tensor parallelism, making it a lot faster for me than llama.cpp.

  • LocalLLaMA @sh.itjust.works

    Relevance of GPU driver version for inference performance

  • That brian typo really gave me a chuckle. Hope you found the movie you were looking for.

  • Wikipedia states the UI layer is proprietary; is that true?

  • The country's official app for COVID immunity certificates or whatever they were called was available on F-Droid at the time.

  • Too bad they've only been dropping dense models recently. Also kind of interesting, since with Mixtral back in the day they were way ahead of their time.

  • LocalLLaMA @sh.itjust.works

    Magistral-Small-2509 by Mistral has been released

    huggingface.co/mistralai/Magistral-Small-2509
  • I'd add that memory bandwidth is still a relevant factor, so the faster the RAM, the faster the inference will be. I think this model would be a perfect fit for the Strix Halo or a >= 64GB Apple Silicon machine when aiming for CPU-only inference. But mind that llama.cpp does not yet support the qwen3-next architecture.
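
    As a rough back-of-the-envelope sketch (all numbers assumed, not measured): with ~3B active parameters per token (the "a3b") at 8-bit quantization, each generated token has to stream about 3 GB of weights, so memory bandwidth caps throughput:

    ```python
    # Upper-bound estimate for bandwidth-limited token generation.
    # Assumptions: 3e9 active params ("a3b"), 1 byte/param (Q8),
    # and ~256 GB/s for a Strix Halo-class machine (approximate).
    active_params = 3e9
    bytes_per_param = 1.0
    bandwidth = 256e9  # bytes per second

    bytes_per_token = active_params * bytes_per_param
    tokens_per_sec = bandwidth / bytes_per_token
    print(f"~{tokens_per_sec:.0f} tokens/s theoretical ceiling")
    # ~85 tokens/s; real-world throughput will be noticeably lower.
    ```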

  • LocalLLaMA @sh.itjust.works

    Qwen3-Next with 80b-a3b parameters is out

    huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
  • One reason could be that the audience on Lemmy has a left-ish bias and there's a political component to the Spotify exodus.

    Edit: don't get me wrong, I love seeing content and engagement on here.

  • LocalLLaMA @sh.itjust.works

    ExLlamaV3 adds tensor parallelism support

    github.com/turboderp-org/exllamav3/releases/tag/v0.0.6
  • LocalLLaMA @sh.itjust.works

    New, promising MoE model "Hunyuan" by Tencent

    huggingface.co/tencent/Hunyuan-A13B-Instruct
  • SFTPGo is such an awesome project, never had any problems with it.

  • LocalLLaMA @sh.itjust.works

    Do you quantize models yourself?

  • Linux @lemmy.ml

    Well, that's offending

  • Selfhosted @lemmy.world

    Any experience with Pangolin?

  • Technology @lemmy.world

    More than 140 Kenya Facebook moderators diagnosed with severe PTSD

    www.theguardian.com/media/2024/dec/18/kenya-facebook-moderators-sue-after-diagnoses-of-severe-ptsd
  • Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ @lemmy.dbzer0.com

    Don't forget to ...

  • Selfhosted @lemmy.world

    Chaining routers and GUA IPv6 addresses

  • Lemmy Shitpost @lemmy.world

    USA to be renamed to XXX

  • Mildly Infuriating @lemmy.world

    Modern online banking

  • Selfhosted @lemmy.world

    Any of you have a self-hosted AI "hub"? (e.g. for LLM, stable-diffusion, ...)

  • Funny @sh.itjust.works

    I wonder how much storage comes with this driving school

  • Selfhosted @lemmy.world

    Migrated my self-hosted Nextcloud to AIO and I absolutely love it