Posts: 23 · Comments: 107 · Joined: 3 yr. ago

  • Global sustainability rules???

  • I don't follow the discussions on this topic very closely, but as I understand it, there are different ways to achieve the goal, and all of them impact quality to some extent. Heretic is discussed as one of the SOTA methods. The README posted above states the following, so it seems that Heretic is some sort of next-gen abliteration.

    It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
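
    For context, a minimal sketch of what a TPE-based parameter search with Optuna looks like in general (the knob names and the toy objective below are made up for illustration; they are not Heretic's actual parameters):

    ```python
    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical knobs standing in for abliteration parameters;
        # Heretic's real objective reportedly trades refusal rate
        # against damage to model quality.
        strength = trial.suggest_float("ablation_strength", 0.0, 1.5)
        layer_frac = trial.suggest_float("layer_fraction", 0.1, 1.0)
        # Toy loss surface as a placeholder for the real evaluation.
        return (strength - 0.8) ** 2 + (layer_frac - 0.5) ** 2

    # TPE (Tree-structured Parzen Estimator) is Optuna's default sampler.
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(seed=42),
    )
    study.optimize(objective, n_trials=100)
    print(study.best_params)
    ```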

  • Yeah I enjoy it as well. Just in case you missed it: a fix was merged into llama.cpp two days ago that is said to improve quality.

    Edit: I stand corrected - the fix for the issue you're experiencing has not yet been merged.

  • LocalLLaMA @sh.itjust.works

    Qwen3-Coder-Next

    huggingface.co/Qwen/Qwen3-Coder-Next
  • Exactly this. Since it does not seem to be federated, you're still forced to give your data to a third party you can't choose. And this makes the open source aspect a rather marginal benefit, at least for the privacy-concerned end user. Still, I appreciate the effort.

  • Given that Google generated more than 250 billion U.S. dollars in ad revenue in 2024, I'd say they must be pretty effective.

    Source

  • I see. When I run the inference engine containerized, will the container be able to run its own version of CUDA or use the host's version?

  • Thank you for taking the time to respond.

    I've used vLLM for hosting a smaller model which could fit in two of my GPUs; it was very performant, especially for multiple requests at the same time. The major drawback for my setup was that it only supports tensor parallelism for 2, 4, 8, etc. GPUs, and data parallelism slowed inference down considerably, at least for my cards. exllamav3 is the only engine I'm aware of which supports 3-way TP.

    But I'm fully with you that vLLM seems to be the most recommended and battle-tested solution.

    I might take a look at how I can safely upgrade the driver; once I can afford a fourth card, I'll switch back to vLLM.
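
    In case it helps anyone reading along, this is roughly what the tensor-parallel setup looks like via vLLM's Python API (the model name is just an example, and per the constraint above tensor_parallel_size has to be one of the supported GPU counts, so 2 rather than 3 here):

    ```python
    from vllm import LLM, SamplingParams

    # vLLM shards the model across GPUs via tensor_parallel_size;
    # as noted above, odd counts like 3 are not supported, so a
    # 3-GPU box is limited to TP=2 (with one card left over).
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)

    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Hello, how are you?"], params)
    print(outputs[0].outputs[0].text)
    ```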

  • I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are a lot newer ones.

    I use 3x RTX 2000E Ada. It's a rather new, quite power-efficient GPU manufactured by PNY.

    As inference engine I use exllamav3 with tabbyAPI. I like it very much because it supports 3-way tensor parallelism, making it a lot faster for me than llama.cpp.

  • LocalLLaMA @sh.itjust.works

    Relevance of GPU driver version for inference performance

  • That brian typo really gave me a chuckle. Hope you found the movie you were looking for.

  • Wikipedia states the UI layer is proprietary; is that true?

  • The country's official app for COVID immunity certificates or whatever they were called was available on F-Droid at the time.

  • Too bad they've only been dropping dense models recently. Also kind of interesting, since with Mixtral back in the day they were way ahead of their time.

  • LocalLLaMA @sh.itjust.works

    Magistral-Small-2509 by Mistral has been released

    huggingface.co/mistralai/Magistral-Small-2509
  • I'd add that memory bandwidth is still a relevant factor, so the faster the RAM, the faster the inference will be. I think this model would be a perfect fit for the Strix Halo or a >= 64GB Apple Silicon machine when aiming for CPU-only inference. But mind that llama.cpp does not yet support the qwen3-next architecture.
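
    As a rough back-of-the-envelope sketch (all numbers assumed, not measured): with ~3B active parameters per token (the "a3b") at 8-bit quantization, each generated token has to stream about 3 GB of weights, so memory bandwidth caps throughput:

    ```python
    # Upper-bound estimate for bandwidth-limited token generation.
    # Assumptions: 3e9 active params ("a3b"), 1 byte/param (Q8),
    # and ~256 GB/s for a Strix Halo-class machine (approximate).
    active_params = 3e9
    bytes_per_param = 1.0
    bandwidth = 256e9  # bytes per second

    bytes_per_token = active_params * bytes_per_param
    tokens_per_sec = bandwidth / bytes_per_token
    print(f"~{tokens_per_sec:.0f} tokens/s theoretical ceiling")
    # ~85 tokens/s; real-world throughput will be noticeably lower.
    ```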

  • LocalLLaMA @sh.itjust.works

    Qwen3-Next with 80b-a3b parameters is out

    huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
  • One reason could be that the audience on Lemmy has a left-ish bias and there's a political component to the Spotify exodus.

    Edit: don't get me wrong, I love seeing content and engagement on here.

  • LocalLLaMA @sh.itjust.works

    ExLlamaV3 adds tensor parallelism support

    github.com/turboderp-org/exllamav3/releases/tag/v0.0.6
  • LocalLLaMA @sh.itjust.works

    New, promising MoE model "Hunyuan" by Tencent

    huggingface.co/tencent/Hunyuan-A13B-Instruct
  • SFTPGo is such an awesome project, never had any problems with it.

  • LocalLLaMA @sh.itjust.works

    Do you quantize models yourself?

  • Linux @lemmy.ml

    Well, that's offending

  • Selfhosted @lemmy.world

    Any experience with Pangolin?

  • Technology @lemmy.world

    More than 140 Kenya Facebook moderators diagnosed with severe PTSD

    www.theguardian.com/media/2024/dec/18/kenya-facebook-moderators-sue-after-diagnoses-of-severe-ptsd
  • Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ @lemmy.dbzer0.com

    Don't forget to ...

  • Selfhosted @lemmy.world

    Chaining routers and GUA IPv6 addresses

  • Lemmy Shitpost @lemmy.world

    USA to be renamed to XXX

  • Mildly Infuriating @lemmy.world

    Modern online banking

  • Selfhosted @lemmy.world

    Any of you have a self-hosted AI "hub"? (e.g. for LLM, stable-diffusion, ...)

  • Funny @sh.itjust.works

    I wonder how much storage comes with this driving school

  • Selfhosted @lemmy.world

    Migrated my self-hosted Nextcloud to AIO and I absolutely love it