
Eval bug: Llama-3_1-Nemotron-51B ggufs generates incorrect answers/gibberish when prompt near or exceed 4K tokens #11002

Open
ymcki opened this issue Dec 28, 2024 · 0 comments


@ymcki
Contributor

ymcki commented Dec 28, 2024

Name and Version

b4380

Operating systems

Linux

GGML backends

CUDA

Hardware

single 3090 + i7 4930K

Models

Llama-3_1-Nemotron-51B IQ3_S, IQ3_M, IQ4_XS, Q4_K_M from
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/

Problem description & steps to reproduce

Providing a prompt that is close to 4K tokens or longer can cause the model to generate wrong output or gibberish. The same input given to Qwen-2.5-Coder-32B.Q4_K_M.gguf produced correct answers. Prompts shorter than 4K tokens seem to work fine for me.

A sample command to reproduce the problem:
./build/bin/llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct-GGUF/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf -p 'You are a helpful AI assistant.' -f prompt.txt -c 15156 -cnv -ngl 70
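To reproduce without my exact document, a prompt.txt of roughly the right size can be generated with a short script. This is a minimal sketch, assuming the common rough heuristic of about 4 characters per token; the exact token count depends on the model's tokenizer, so treat the target as approximate:

```python
# Build a prompt.txt that is roughly 4K tokens long.
# CHARS_PER_TOKEN = 4 is a heuristic assumption, not an exact tokenizer count.
TARGET_TOKENS = 4096
CHARS_PER_TOKEN = 4

filler = "The quick brown fox jumps over the lazy dog. "
# Repeat the filler until we have at least TARGET_TOKENS * CHARS_PER_TOKEN
# characters, then trim to exactly that length.
text = filler * (TARGET_TOKENS * CHARS_PER_TOKEN // len(filler) + 1)
text = text[: TARGET_TOKENS * CHARS_PER_TOKEN]

with open("prompt.txt", "w") as f:
    f.write("Summarize the repeated text below in one sentence:\n" + text)
```

Passing the resulting file via `-f prompt.txt` as in the command above should put the prompt near the 4K-token boundary where the bad output starts.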

First Bad Commit

It happens at least as of b4380; I have not bisected to a first bad commit. Does anyone know the usual causes of this kind of failure, so that I can try to fix the bug myself?

Relevant log output

This is a typical bad reply from llama-cli when asked to list the top 10 most interesting LLM papers based on their titles:
---------
I ranked the papers based on how interesting their titles and abstracts sound. Here are the top ten most interesting sounding papers:

1. **A Survey on Model Compression for Large Language Models**
2. **A Survey on Transformer Compression**
3. **Survey on Transformer Compression**
4. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
5. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
6. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
7. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
8. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
9. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
10. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
11. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
12. **The Cost of Compression: Investigating the Impact of Compression on Parametric