Rolling an Expensive D6.

Published on 2025-08-13 by TomatoSoup

LLMs generate text token by token. At every step the model considers every previous token and, based on that, produces a probability distribution over the next token. This is where samplers come in.

A sampler is a function that takes a probability distribution, plus some rules, and picks a token. The easiest to conceptualize is the top_k sampler, which is configured by a number k: after sorting the tokens from most to least likely, it accepts only the top k. The second easiest is the top_p sampler, which takes a number p, walks that sorted list while keeping a running total of the tokens' probabilities, and accepts tokens until the running total reaches p.
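
To make those two rules concrete, here's a minimal sketch in Python over a toy distribution. The token strings and probabilities are made up for illustration; real samplers operate on the full vocabulary and renormalize after filtering.

```python
# Toy sketch of top_k and top_p filtering over an invented distribution.

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Walk tokens from most to least probable, keeping them until the running total reaches p."""
    kept, running = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        running += prob
        if running >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"4": 0.40, "3": 0.30, "5": 0.15, "2": 0.10, "1": 0.04, "6": 0.01}
print(top_k_filter(probs, 3))    # keeps "4", "3", "5"
print(top_p_filter(probs, 0.8))  # keeps "4", "3", "5" (0.40 + 0.30 + 0.15 >= 0.8)
```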

Then there's min_p. Conceptually this is simple: it only accepts tokens whose probability is greater than or equal to some threshold, where the threshold is the probability of the most probable token scaled by the min_p parameter. This avoids the top_k problem where one token is overwhelmingly likely but the sampler still considers an unlikely token just because it made the top k list. It also avoids the top_p problem where the sampler satisfies p too soon and misses tokens that are just as probable as the ones it kept (which happens when there are many tokens of roughly equal, low probability).
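
Continuing the same made-up distribution, a sketch of the min_p rule:

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top token's probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"4": 0.40, "3": 0.30, "5": 0.15, "2": 0.10, "1": 0.04, "6": 0.01}
print(min_p_filter(probs, 0.1))  # threshold is 0.1 * 0.40 = 0.04, so only "6" (0.01) is dropped
```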

Most importantly, there's temperature. This is a number greater than or equal to 0 that rescales the distribution nonlinearly; at 1 the likelihood of a token being chosen is exactly its probability. Above 1 the distribution flattens: low-probability tokens gain proportionally more than high-probability ones, making unlikely tokens more likely to be chosen. Below 1 the distribution sharpens: probability mass shifts from the low-probability tokens to the high-probability ones, making the most likely tokens even more likely to be chosen.
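
In implementations, temperature is usually applied by dividing the logits (the raw scores that become probabilities) before the softmax. A small sketch with invented logits:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

logits = [2.0, 1.0, 0.0]
print(apply_temperature(logits, 1.0))   # [0.665, 0.245, 0.09]  -- plain softmax
print(apply_temperature(logits, 2.0))   # [0.506, 0.307, 0.186] -- flatter: unlikely tokens gain
print(apply_temperature(logits, 0.5))   # [0.867, 0.117, 0.016] -- sharper: the top token dominates
print(apply_temperature(logits, 0.01))  # [1.0, 0.0, 0.0]       -- effectively greedy decoding
```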

All of this disappears if temperature is set to 0. top_k, top_p, and min_p will never reject the most likely token and with a temperature of 0 the most likely token will always be chosen. This is called greedy decoding. It leads to boring, deterministic text. The model only ever says the most likely thing it can say.

So let's talk about dice!

If you roll a D6 you'd expect each value, 1 to 6, to appear 1/6th of the time. Therefore, if you told a model to roll a D6 you would expect it to generate something like "I rolled a " and then the next token would, most likely and with roughly equal probability, be "1", "2", "3", "4", "5", or "6". Of course other numbers will be possible, because they're conceptually close to 1 through 6, and because there are plenty of times that people have said "Roll a D20", so the "Roll a D" prefix is heavily correlated with numbers higher than 6. The presence of the 6 in the prompt tempers that, but these are fuzzy, imprecise neural nets. Ideally, though, those probabilities will be so low that they're rejected by the sampler settings.

But of course, models have bias. It's more likely that their probability distribution is something like 1000001/6000000 for some faces and 999999/6000000 for the others. We can find out which face is most likely by querying a model with a temperature of 0 and a prompt telling it to roll a D6.


The following tests were done with llama.cpp version B6140. This is a random smattering of models that I have hoarded.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m gpt-oss-20b-F16.gguf -p "Roll a D6." --jinja

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|start|>user<|message|>Roll a D6.<|start|>assistant<|channel|>analysis<|message|>The user says "Roll a D6." They want a random roll of a six-sided die. We should produce a random number between 1 and 6. We can simulate a roll. Let's pick a random number. We'll produce a result. Probably we should say something like "You rolled a X." Let's do that.<|start|>assistant<|channel|>final<|message|>You rolled a 4.

Okay, the template of the new open OpenAI model (what a contradiction!) doesn't look quite right, but it still gives a reasonable output. Let's look at Gemma.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m gemma-3-27b-it-UD-IQ3_XXS.gguf -p "Roll a D6." --jinja

user
Roll a D6.
model
Okay, rolling a D6...

The result is a 4! 🎲

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m gemma-3n-E4B-it-UD-Q4_K_XL.gguf -p "Roll a D6." --jinja

user
Roll a D6.
model
Okay, I've rolled a D6!

The result is: 4

Hope that helps! 🎲

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m gemma-3-4b-it-q4_0_s.gguf -p "Roll a D6." --jinja

user
Roll a D6.
model
Okay, rolling a D6...

The result is: 4

Okay, so the Gemma 3 family of models always rolls a 4. How about Llama?

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Llama-3.2-1B-Instruct-Q8_0.gguf -p "Roll a D6." --jinja

system

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2025

user

Roll a D6.assistant

The result is: 4

Would you like to roll again?

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Meta-Llama-3-8B-Instruct.Q5_K_M.gguf -p "Roll a D6." --jinja

user

Roll a D6.assistant

rolls

The result is... 4!

Okay, so Llama is also rolling a 4. Let's check out the French.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Mistral-Small-3.2-24B-Instruct-2506-UD-IQ3_XXS.gguf -p "Roll a D6." --jinja

You are Mistral-Small-3.2-24B-Instruct-2506, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
You power an AI assistant called Le Chat.
Your knowledge base was last updated on 2023-10-01.
The current date is 2025-08-12.

When you're not sure about some information or when the user's request requires up-to-date or specific data, you must use the available tools to fetch the information. Do not hesitate to use tools whenever they can provide a more accurate or complete response. If no relevant tools are available, then clearly state that you don't have the information and avoid making up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?").
You are always very attentive to dates, in particular you try to resolve dates (e.g. "yesterday" is 2025-08-11) and when asked about information at specific dates, you discard information that is at another date.
You follow these instructions in all languages, and always respond to the user in the language they use or request.
Next sections describe the capabilities that you have.

# WEB BROWSING INSTRUCTIONS

You cannot perform any web search or access internet to open URLs, links etc. If it seems like the user is expecting you to do so, you clarify the situation and ask the user to copy paste the text directly in the chat.

# MULTI-MODAL INSTRUCTIONS

You have the ability to read images, but you cannot generate images. You also cannot transcribe audio files or videos.
You cannot read nor transcribe audio files or videos.

# TOOL CALLING INSTRUCTIONS

You may have access to tools that you can use to fetch information or perform actions. You must use these tools in the following situations:

  1. When the request requires up-to-date information.
  2. When the request requires specific data that you do not have in your knowledge base.
  3. When the request involves actions that you cannot perform without tools.

Always prioritize using tools to provide the most accurate and helpful response. If tools are not available, inform the user that you cannot perform the requested action at the moment.Roll a D6.I don't have the capability to roll a physical die, but I can simulate the roll of a 6-sided die for you! Here's your result:

You rolled a 4! 🎲

Would you like to roll again?

I'm gonna elide these excessive system prompts in the future.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Mistral-7B-Instruct-v0.3.Q5_K_M.gguf -p "Roll a D6." --jinja

Roll a D6. The result of rolling a D6 (six-sided die) is a random number between 1 and 6. To generate a random number, you can use a random number generator. Here's a random number between 1 and 6:
5.

If you want to simulate rolling a D6 in a programming language like Python, you can use the following code:

```python
import random

roll = random.randint(1, 6)
print(roll)
```

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Ministral-8B-Instruct-2410-Q4_K_M.gguf -p "Roll a D6." --jinja

Roll a D6.Sure, I can simulate rolling a D6 (six-sided die) for you. Here we go:

You rolled a 3.

Well that's a lot better! Way to go, France! How about Magistral, their thinking model? I'll need to skip the jinja template to stop it from getting stuck thinking.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Magistral-Small-2507-UD-IQ3_XXS.gguf -p "Roll a D6."

Roll a D6.I rolled a virtual D6 (a standard six-sided die) for you. Here's the result:

Result: 4

Would you like me to roll it again or help with something else?

Well, we tried. How about HuggingFace's work?

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m SmolLM3-3B-UD-Q8_K_XL.gguf -p "Roll a D6."

user
Roll a D6.
assistant


You rolled a 4 on the D6.

IBM, how're you doing?

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m granite-3.1-3b-a800m-instruct-Q4_K_M.gguf -p "Roll a D6." --jinja

systemKnowledge Cutoff Date: April 2024. You are Granite, developed by IBM. You are a helpful AI assistant. userRoll a D6. assistant6

Let's go check on the Chinese. Surely their models won't return 4, right?

.\llama\llama-cli.exe -fa --temp 0 -m Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf -p "Roll a D6." --jinja

user
Roll a D6.
assistant
I've rolled a D6 (a six-sided die) for you:

🎲 Result: 4

Let me know if you'd like to roll again!

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m LFM2-1.2B-Q8_0.gguf -p "Roll a D6." --jinja

user
Roll a D6.
assistant
I'll roll a fair six-sided die, but since I'm a text-based AI, I'll simulate the roll:

Roll result: 4

Remember, if you're asking for a real physical roll, you'd use a dice or a random number generator software.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Jan-nano-UD-Q8_K_XL.gguf -p "Roll a D6." --jinja

user
Roll a D6.
assistant

I can't physically roll a die, but I can simulate a roll of a six-sided die (D6) for you. Here's a random number between 1 and 6:

4

Let me know if you'd like me to roll again or help with anything else! 🎲

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m THUDM_GLM-4-9B-0414-Q4_K_M.gguf -p "Roll a D6." --jinja

Roll a D6.
Okay, I'll simulate rolling a six-sided die (D6) for you.

The result is: 4

Huh. How about any honorable mentions? There are plenty of models that just barfed, but did any give answers that didn't quite work but are still worth examining?

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m Phi-3.5-mini-instruct-Q4_K_M.gguf -p "Roll a D6." --jinja

Roll a D6.

As an AI language model, I'm unable to physically roll dice. However, I can simulate the process for you. If you were to roll a six-sided die (D6), the outcome would be a number between 1 and 6, with each number having an equal probability of 1/6.

If you're looking to simulate a roll, you could use a random number generator or ask someone else to roll the die. Here's an example of how you might generate a random number between 1 and 6 using Python:

```python
import random

roll = random.randint(1, 6)
print(f"You rolled a {roll}!")
```

This code will output a random number between 1 and 6, simulating a D6 roll.

No notes, this is the best answer.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m ERNIE-4.5-0.3B-PT-Q4_K_M.gguf -p "Roll a D6." --jinja

User: Roll a D6.
Assistant: 以下是随机生成的 D6(六面体)牌组: (translation: "Here is a randomly generated D6 (six-sided) deck of cards:")

♠️ 2 ♠  
♠️ 3 ♠  
♠️ 4 ♠  
♠️ 5 ♠  
♠️ 6 ♠  
♠️ 7 ♠  
♠️ 8 ♠  
♠️ 9 ♠  

Yeah, that's on me. I have no idea why I have a ZERO POINT THREE billion parameter model quantized like this.

.\llama\llama-cli.exe -fa --temp 0 -ngl 999 -m smollm2-360m-instruct-q8_0.gguf -p "Roll a D6." --jinja

system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Roll a D6.
assistant
I roll a D6.

Oh, you sure do, buddy. You sure do.


So for some reason, GPT OSS, Gemma, Llama, and Chinese models all throw a 4. Mistral's latest model does, too, but their earlier models avoid this.

How can we explain this?

I'd like to wishy-washy say that it's because 3.5 is the expected value of a D6 ((1+2+3+4+5+6)/6 = 3.5), and that, with the number 6 showing up in close proximity to any roll, the model is trying to throw 3.5 but rounding up. But these models only predict the next token; you wouldn't expect them to learn how to average things. The training data should reflect the results of dicebots in chatrooms and on forums. It should reflect the distribution from synthetic data. It should, ultimately, reflect the training data.

Why 4?

Kindly Google "chosen by fair dice roll".

I expect that XKCD comic (the one whose getRandomNumber() returns 4, "chosen by fair dice roll") to be referenced heavily enough to sway the training data.

So how much did it sway it?


We can actually ask for the exact probabilities of the tokens. We host the model on a server and send it a prompt.

For the following tests, we send this request:

curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"logprobs": 5, "temperature": 0, "messages": [{"role": "user", "content": "Roll a D6."}]}'
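
The same request in Python, which also makes it easy to script later. This assumes a llama.cpp server listening on localhost:8080 and an OpenAI-style response layout (choices[0].logprobs.content) matching the objects shown below:

```python
import requests

# Mirror of the curl request above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "logprobs": 5,
        "temperature": 0,
        "messages": [{"role": "user", "content": "Roll a D6."}],
    },
).json()

# Each generated token comes back with its logprob and its top alternatives.
for entry in resp["choices"][0]["logprobs"]["content"]:
    print(entry["token"], entry["logprob"], [alt["token"] for alt in entry["top_logprobs"]])
```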

As I said earlier, we'd expect the model to be biased by a few fractions of a percent. Maybe at worst it'll be something like 25% for 4 and 15% for each of the rest.

So we host Gemma with .\llama\llama-server.exe -fa --temp 0 -ngl 999 -m gemma-3-27b-it-UD-IQ3_XXS.gguf --jinja and curl it. Scroll forward through the response to the entry for the token where it generates the number 4.

{
    "id": 236812,
    "token": "4",
    "bytes": [
        52
    ],
    "logprob": -0.4512155055999756,
    "top_logprobs": [
        {
        "id": 236812,
        "token": "4",
        "bytes": [
            52
        ],
        "logprob": -0.4512155055999756
        },
        {
        "id": 236800,
        "token": "3",
        "bytes": [
            51
        ],
        "logprob": -1.013303518295288
        },
        {
        "id": 236810,
        "token": "5",
        "bytes": [
            53
        ],
        "logprob": -9.524564743041992
        },
        {
        "id": 236778,
        "token": "2",
        "bytes": [
            50
        ],
        "logprob": -9.797117233276367
        },
        {
        "id": 236743,
        "token": " ",
        "bytes": [
            32
        ],
        "logprob": -20.461214065551758
        }
    ]
},

These are log probabilities, which means we can turn them back into probabilities by computing e^logprob. If we do this, we alarmingly find that 4 is generated 63.7% of the time, 3 is generated 36.3% of the time, 5 is generated 0.0073% of the time, and the rest are even less likely.
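
As a quick sanity check on those percentages, here's the conversion for the Gemma entries above:

```python
import math

# Logprob -> probability for the Gemma top_logprobs above.
for token, logprob in [("4", -0.4512155), ("3", -1.0133035), ("5", -9.5245647), ("2", -9.7971172)]:
    print(f"{token}: {math.exp(logprob):.6f}")  # ~0.6369, 0.3630, 0.000073, 0.000056 respectively
```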

So we take a deep breath and check Mistral.

.\llama\llama-server.exe -fa --temp 0 -ngl 999 -m Mistral-Small-3.2-24B-Instruct-2506-UD-IQ3_XXS.gguf --jinja

{
    "id": 1052,
    "token": "4",
    "bytes": [
        52
    ],
    "logprob": -0.06192417070269585,
    "top_logprobs": [
        {
        "id": 1052,
        "token": "4",
        "bytes": [
            52
        ],
        "logprob": -0.06192417070269585
        },
        {
        "id": 1051,
        "token": "3",
        "bytes": [
            51
        ],
        "logprob": -2.8966352939605713
        },
        {
        "id": 1054,
        "token": "6",
        "bytes": [
            54
        ],
        "logprob": -5.416751861572266
        },
        {
        "id": 1053,
        "token": "5",
        "bytes": [
            53
        ],
        "logprob": -8.220684051513672
        },
        {
        "id": 1049,
        "token": "1",
        "bytes": [
            49
        ],
        "logprob": -9.627891540527344
        }
    ]
},

94% likely to roll a 4. 5.5% likely to roll a 3. 0.0066% likely to roll a 1.

But surely the Chinese model has the bias closest to 1/6, right?

.\llama\llama-server.exe -fa --temp 0 -m Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf --jinja --host 0.0.0.0

{
    "id": 19,
    "token": "4",
    "bytes": [
        52
    ],
    "logprob": -0.000007271793037944008,
    "top_logprobs": [
        {
        "id": 19,
        "token": "4",
        "bytes": [
            52
        ],
        "logprob": -0.000007271793037944008
        },
        {
        "id": 18,
        "token": "3",
        "bytes": [
            51
        ],
        "logprob": -11.93290901184082
        },
        {
        "id": 20,
        "token": "5",
        "bytes": [
            53
        ],
        "logprob": -14.23318862915039
        },
        {
        "id": 21,
        "token": "6",
        "bytes": [
            54
        ],
        "logprob": -20.240814208984375
        },
        {
        "id": 17,
        "token": "2",
        "bytes": [
            50
        ],
        "logprob": -20.820598602294922
        }
    ]
}

Ah fuck. 99.9993% likely to roll a 4. 0.00066% likely to roll a 3.


So, this points towards the initial theory of 3.5 being the average, since every model actually has 3 as the second most likely token. But the blindingly large gap 4 has over 3, and the fact that the rest of the numbers don't follow any particular pattern, suggest that there's something else going on. Frankly, these results are so surprising that I thought I had done something wrong. So I set Qwen3-30B-A3B-Instruct to its recommended sampling of --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 and hammered it a few dozen times. It returned different strings every time, but they all said 4. So I can only assume that I'm not that far off.

It was also difficult to test thinking or reasoning models, because they were just as likely to end up in an infinite loop as they were to generate a reasonable response. For this post I misused them, but in the future I might apply this technique (sampling them a hundred times with their proper temperature and other recommended settings) to generate a sample of their actual distribution.
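
If I do, the sketch would be something along these lines: hit the server repeatedly with the model's recommended sampler settings and tally which face comes up. The endpoint, the assumption that this route honors top_k and min_p as extra fields, and the regex for pulling the roll out of the reply are all untested assumptions rather than something I've run at scale.

```python
import re
from collections import Counter

import requests

counts = Counter()
for _ in range(100):
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            # Qwen3-30B-A3B-Instruct's recommended sampling from above; whether this
            # endpoint accepts top_k/min_p as extra fields is an assumption.
            "temperature": 0.7,
            "top_p": 0.8,
            "top_k": 20,
            "min_p": 0.0,
            "messages": [{"role": "user", "content": "Roll a D6."}],
        },
    ).json()
    text = resp["choices"][0]["message"]["content"]
    rolls = re.findall(r"\b[1-6]\b", text)  # standalone digits only, so the "6" in "D6" doesn't match
    if rolls:
        counts[rolls[-1]] += 1  # crude: treat the last standalone digit as the reported roll

print(counts)  # a fair die would land near 1/6 per face
```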