Incorrect output in Gemma 4: seeking a solution to the problem
Hello! Could you please help me figure out how to solve an output issue with Gemma 4?
Every Gemma 4 build — regardless of quantization or version — produces characteristic artifacts (they vary, but these appear most frequently):
lilt’S a lilt’s a lilt…
a lS… de lS… l l S…
la la la a la l l l l l l l l l
lS lS lL lL lLL
Backend: Oobabooga Text Generation WebUI (v4.6.2, 9dcf574, with Gemma 4 support and the latest version of llama.cpp);
Frontend: SillyTavern (1.16.0).
I’ve looked through the settings, disabled speculative decoding, and tried different templates — but to no avail.
Could the issue be related to decoding? Has anyone else encountered this problem? Or perhaps this error is recognizable — could someone suggest a way to deal with it? I would be incredibly grateful for any hint on how to use the model correctly in ST.
I’d like to apologize in advance for my English — it’s not my native language.
I have reported this issue multiple times. Nobody cares. Even people working with vLLM are trying to figure it out, yet they can't see that the issue lies in the original model's weights (as I have reported multiple times, with screenshots from Google AI Studio). I stopped reporting it and am just waiting for more people to do the same. It happens in any language and at various context lengths. It is almost certainly an issue with the original weights. Google just needs to update the weights with a new checkpoint. I hope it will be fixed in Gemini 4.1, but I doubt that will ever happen: Google DeepMind doesn't release "middle stage" checkpoints.
P.s.: Don't get me wrong, Gemma4-31B is an ASTONISHING model. But it's impossible to use it in production because of this bug (((
P.p.s.: Today's experience:
- The user is talking about the " la l’L'C" (the la l’L'C) " la l’L'C" (the la l’L'C) " la l’L'C" (the la l’L'C) " la l’L'C" (the la l’L'C) " la l’L'C" (the la l’L'C) " la l’L'C" (the la l’L'C)
and so on....
If Google pays attention to this, I will gladly report my guesses; I have been collecting observations since the release date.
Hi all,
Thanks for raising this issue. To help us find the root cause, could you please clarify:
- Does it fail immediately, or start correctly and then collapse?
- Token Count: Roughly how many tokens have been consumed when the loop begins?
- Prompt Type: Does this happen with short questions, or only with long prompts (e.g., document uploads)?
- Reproduction Prompt: If possible, could you please share the exact prompt that triggered it?
1
It doesn't start immediately. Yet, when I first tested it (on the day of release) in Google AI Studio, I asked the model to translate Fibonacci's Liber Abaci, and the effect was immediate.
But in most subsequent tests it happens on long context, though not that long: starting from maybe 20-40K tokens. So the short answer is YES, on longer context.
2
See the answer to question 1. But it reliably happens after 50K tokens of context.
3
In my case the model is trying to use the token "la" or just "l": "à la carte", "l'objet", etc.
4
There is no exact prompt. It happens in the course of the dialog: the model uses the "la" token once, then starts to say "la la la" about some subject, and eventually it turns into a "la la la" nightmare.
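A blunt local workaround, in case it helps anyone reading this: since the collapse starts with these specific tokens, they can be biased away at sampling time. Here is a minimal sketch using llama-cpp-python (assuming a version that supports logit_bias; the GGUF path is a placeholder, and the token IDs have to be looked up with the model's own tokenizer). It is deliberately crude: banning "la"/"l" also breaks legitimate French or Spanish text.

```python
# Minimal sketch: bias the loop-trigger tokens away at sampling time.
# Assumes llama-cpp-python; the GGUF path below is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-31b.Q4_K_M.gguf", n_ctx=8192)

# Collect the token IDs for the pieces that start the loop ("la", " la", ...).
banned = set()
for piece in (b"la", b" la", b"l", b" l"):
    banned.update(llm.tokenize(piece, add_bos=False))

out = llm.create_completion(
    "Translate the opening of Fibonacci's Liber Abaci into English.",
    max_tokens=512,
    repeat_penalty=1.15,  # mild extra guard against short loops
    logit_bias={tok: -100.0 for tok in banned},  # effectively bans these tokens
)
print(out["choices"][0]["text"])
```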
There is an opinion that this is the model's reaction to the safety filter: that all these tokens are actually safety tokens. The model keeps them in some hidden layer, and they surface in the answer when the model thinks the content is unsafe.
Ok, I understand that you want to make the model safe. But what's wrong with studying Fibonacci and other medieval texts? I suppose it's a false-positive trigger of the safety filter.
If you need more detail, I can provide it. But don't expect an exact prompt; there is NO such prompt. It happens unexpectedly.
As for the FP8 cache... yes, it does matter, but it only makes the issue surface faster. The original weights have the same issue. If you want, I can upload screenshots from Google AI Studio here, or my API calls.
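For what it's worth, the FP8-cache question is easy to A/B in vLLM. A minimal sketch, assuming the kv_cache_dtype engine argument; the model name is a placeholder for whatever checkpoint you run, and in practice the two runs should be separate processes so two engines don't compete for GPU memory:

```python
# Minimal sketch: A/B-test the KV-cache precision in vLLM and watch how
# quickly the "la la la" collapse appears under each setting.
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.7, max_tokens=512)
prompt = "Summarize the opening of Fibonacci's Liber Abaci."

# Run each cache dtype in its own process in practice; shown inline for brevity.
for cache_dtype in ("fp8", "auto"):  # "auto" keeps the full-precision cache
    llm = LLM(model="google/gemma-4-31b", kv_cache_dtype=cache_dtype)
    out = llm.generate([prompt], params)
    print(cache_dtype, "->", out[0].outputs[0].text[:200])
```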
Thank you for responding! This is very important for anyone who has encountered this problem.
Does it fail immediately, or start correctly and then collapse?
The failure is rarely immediate. In most cases, the model starts the generation correctly, maintaining the persona and following the prompt instructions for the first few sentences. The collapse typically occurs after approximately 40–50 words. The model then enters a repetitive loop (e.g., "la la la", "l l l l", or "a bit a bit"). Occasionally, the model seems to recognize the glitch, generating phrases like "Wait, let's get back to the answer," but immediately falls back into the repetitive loop.
Token Count: Roughly how many tokens have been consumed when the loop begins?
The "healthy" generation window is quite narrow. Based on my observations, the collapse usually happens within the range of 50 to 250 tokens of the assistant's response. Once this threshold is crossed, the probability of a "hard-lock" repetition cycle increases significantly.
Prompt Type: Does this happen with short questions, or only with long prompts?
The issue appears to be independent of prompt length or complexity. It occurs during simple greetings in a fresh chat (empty context). There is no clear correlation with the input size; the model collapses even when the total context is well under 1k tokens.
Reproduction Prompt:
The issue is persistent across almost any input. A simple introductory prompt triggers it consistently.
Additional Technical Context:
- Backends tested: llama.cpp (via Oobabooga Text Generation WebUI), vLLM.
- Quantizations: the issue does not depend on the quantization or the model version.
- The problem exists both in the clean official release and in community-modified versions.
- The model often fails to respect stop sequences (like <turn|>) once the repetition starts, leading to infinite "la la la" sequences until the max-token limit is reached.
It is important to note that this is not exclusively a local inference issue. Many users are reporting identical "la la la" artifacts and repetition collapses even when using the official Google API. This suggests that the root cause is not related to local hardware limitations, specific quantizations (like GGUF), or third-party backends.
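Since the stop sequences are ignored once the loop starts, a client-side guard can at least cap the damage. Here is a minimal sketch against an OpenAI-compatible streaming endpoint (both text-generation-webui and vLLM can expose one); the base URL, model name, and the crude "la"-counting threshold are placeholders/assumptions:

```python
# Minimal sketch: enforce the stop sequence and a repetition cap client-side,
# since server-side stop handling stops working once the loop begins.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="unused")

STOP = "<turn|>"  # the stop sequence the backend fails to honor
text = ""
stream = client.chat.completions.create(
    model="gemma-4",
    messages=[{"role": "user", "content": "Hello! Please introduce yourself."}],
    stream=True,
    max_tokens=1024,
)
for chunk in stream:
    if not chunk.choices:
        continue
    text += chunk.choices[0].delta.content or ""
    if STOP in text:                   # enforce the stop string ourselves
        text = text.split(STOP)[0]
        break
    if text[-60:].count(" la") > 10:   # crude "la la la" collapse check
        break
print(text)
```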
I think this CoT is exactly what you're searching for. It's about the headphones I've just purchased. There is no conspiracy or other "marked" cases; I'm just trying to compare budget isodynamic headphones with my Sennheiser HD 650.
The user is once again correcting me. He explicitly stated that the Takstar HF-580 did handle complex tracks like Radiohead's "Creep" and Nirvana, resolving the instruments and voices ("sorting the instruments and voices onto their shelves," as he put it in Russian).
The user is frustrated that I keep assuming they "fail" in complex genres (metal/rock) just because I'm applying a general "budget planar" stereotype. He's telling me that these specific headphones (with the fabric pads) actually handled it.
User's experience: HF-580 (with fabric pads) successfully resolve complex mixes (Radiohead, Nirvana), something I assumed would be a weakness.
My error: I keep projecting the typical limitations of cheap planars onto this specific experience. I'm ignoring the user's direct evidence that these specific units did a good job.
Admit the failure: I'm failing to process the "successes" and defaulting to "stereotypes" of budget gear.
Acknowledge the " la la la " (Separation): If the HF-580 can separate voices and instruments in Creep (a very dynamic and dense track), then we are talking about a level of resolution that is genuinely surprising for the price.
The "Big Picture": The user's point is that the " la la la " ( la la la ) in these headphones is far better than expected, and they aren't just "good for the money," but actually competitive with legends like HD-650 across the board.
Sincere apology. No more "but" or "however."
Directly acknowledge: "Yes, they handled the metal/rock/grunge/alt’s density."
Recognize that my " la la la la " (separation) theory was wrong for this specific model/setup.
Express genuine surprise at the driver's performance.
Pivote back to the " la la la la " ( la la la l).
User said: "S- la la la la l" (separation of instruments) was a win for Takstar.
My previous response: I said "they lack separation." → This is where I "failed" the context.
Correcting now: "They separated the instruments and voices in Nirvana and Radiohead."
Tone: Humble, a bit shocked, fully accepting the user's data.
Key takeaway: The HF-580 is a genuine anomaly/beast.
Wait, looking back at the dialogue, did the user say " la la la la la l" ( la la la l) was a failure?
User's previous message (originally in Russian): "In some tracks the sound is completely indistinguishable from the HD-650. In others it differs radically... in some compositions everything mixes into a pile."
Aha! The user did say "in some compositions everything mixes into a pile."
But then he later clarified: " la la la la la la la de la la la la la la la l la la l la l l l la la l la la la de l l la l l l l la la la l la l la l la la la l l l l l la l l l la la l la l l la la la l la l l la l la la l l l la l la l de la l l la l l l la la la l la l l la l la la l l la la la la la la la l …
P.s.: As you can see... the "la la la" started unexpectedly. And once it appeared, it never stopped...
--
Here is my attempt to regenerate the response, which this time completed correctly:
The user is correcting me again.
* What I said: I assumed Takstar failed/became "mushy" in metal/rock (Radiohead, Nirvana) and only succeeded in "airy" tracks (Jarre).
* What the user actually said: He tested them on tracks that normally sound like a "mush" (Radiohead - Creep, Nirvana), and the Takstars managed to separate the instruments and voices. They "pulled" (потянули) it.
* User's Point: The HF-580 (with fabric pads) are performing far better across the board than I'm giving them credit for. They aren't just "okay for the price"; they are actively performing at a level that surprises him, even in complex/dense mixes.
Failure point: I fell back into the stereotype of "budget planars = bad separation in complex tracks." I ignored the user's explicit statement: "S-core and Nirvana, Takstar also pulls it."
The "Aha" moment: These headphones (specifically this driver in this configuration) are genuinely high-performance. They are effectively "punching" way above their weight class, rivaling the HD-650 not just in "air" but in resolution and separation.
Acknowledge the mistake clearly: "I'm failing to listen to the actual results you're giving me."
Correct the narrative: The HF-580 (with fabric pads) are actually performing at a reference level of separation, even in "messy" tracks like Creep.
Synthesize the final picture:
- HD-650: The legendary balanced reference.
- HF-580: A surprising, high-resolution monster that can separate instruments in dense mixes and handle the most delicate atmospheric tracks.
Admit that the "Chi-Fi" hack has reached a level where the "legend" (HD-650) is no longer the only answer, but a peer.
Admit total failure in interpreting the specific "mush" context.
Celebrate the discovery: "They actually separated the instruments in Creep!"
Final realization: These headphones are a genuine anomaly/discovery.
Keep it raw, a bit self-deprecating, and very focused on the user's findings.
The experiment above shows that the model is capable of generating correct output on the 2nd attempt, which means the length of the content isn't the issue. The issue appears when the model tries to generate a specific token ("l", "la") and goes into an infinite loop for whatever reason.
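To make that claim testable, here is a minimal, backend-agnostic sketch of a detector that flags exactly this failure signature (a short n-gram such as "l" or "la" cycling in the output stream). The function name and thresholds are my own choices, not anything from Gemma's tooling:

```python
# Minimal sketch: flag a token stream once a short n-gram starts cycling,
# which is the failure signature described above ("l", "la", "la la la").
from collections import deque

def is_looping(tokens, max_ngram=4, min_repeats=8):
    """True if the tail of `tokens` is one 1..max_ngram-token pattern
    repeated at least `min_repeats` times in a row."""
    for n in range(1, max_ngram + 1):
        window = n * min_repeats
        if len(tokens) < window:
            continue
        tail = tokens[-window:]
        pattern = tail[:n]
        if all(tail[i] == pattern[i % n] for i in range(window)):
            return True
    return False

# Usage: feed tokens in as they stream and abort generation on detection.
history = deque(maxlen=64)
for tok in ["The", "user", "said", "la", "la", "la", "la", "la", "la", "la", "la"]:
    history.append(tok)
    if is_looping(list(history)):
        print("repetition collapse detected, aborting generation")
        break
```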