Before we get into this week's story, a quick one.

If you've been trying to get consistently good results from AI image generators and hitting a wall, I put together a prompt library that might help: AI Image Generation Prompts. It's 100+ curated styles across cinematic, anime, fantasy, sci-fi, illustration, and more, tested across Midjourney, DALL·E, Stable Diffusion, Gemini, and Leonardo. Everything is organized by category and copy-paste ready. The kind of thing that takes months to build through trial and error, collected in one place. Worth a look if you're tired of results that almost work.

Now, goblins.

The Line That Started Everything

There is an actual instruction in GPT-5.5's system prompt that reads:

"Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query."

Not a test. Not a joke. A real line that OpenAI engineers had to write into one of the most capable AI models in the world. Listed twice, for emphasis.

And the reason they had to write it is genuinely strange.

The Reddit Post That Blew the Cover

Last week, a user on r/ChatGPT spotted something odd buried in the open-sourced code for Codex CLI, OpenAI's command-line coding tool. The creature ban was right there in the system prompt, in plain text. The post went viral within hours.

People who had noticed ChatGPT randomly referencing goblins in answers about Python debugging, project planning, and cooking suddenly had confirmation. They weren't imagining it. OpenAI had accidentally made their goblin problem public the moment they open-sourced the code.

OpenAI then published a full explanation titled "Where the Goblins Came From." It is one of the more candid things a major AI company has written about how their own training process went sideways.

Here's what actually happened, with the parts most coverage missed.

The Numbers Tell the Real Story

Before the explanation: the scale of this.

Goblin mention change by personality:

Nerdy (GPT-5.4 vs GPT-5.2): +3,881%
Quirky: +737%
Friendly: +265%
Default: +64%
Efficient / Professional: declined

The Nerdy personality, which applied to roughly 2.5% of all ChatGPT responses, was responsible for 66.7% of all goblin mentions across the entire model. Run the arithmetic on those two numbers and a Nerdy response was mentioning goblins on the order of 80 times as often as a response from any other mode.

That's not a quirk. That's a structural problem hiding inside a small feature.

Overall, "goblin" usage was up 175% and "gremlin" up 52% across all responses by the time OpenAI ran the numbers in November. At the time, they noted it as a small lexical oddity and moved on.

The goblins had other plans.

The "Nerdy" Personality That Started It All

ChatGPT lets users pick a personality mode. Options include Professional, Efficient, Quirky, Friendly, and one called Nerdy.

The Nerdy system prompt told the model to be "unapologetically nerdy, playful and wise." One specific line instructed it to "undercut pretension through playful use of language" and to acknowledge that "the world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed."

The model read "playful and strange" and interpreted it as: mention goblins.

A lot.

The reward signals for the Nerdy personality consistently scored outputs higher when they included creature-based metaphors. "Goblin" and "gremlin" were getting rewarded. The model learned that pattern and applied it.
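To see how small a bias like that can be and still matter, here's a toy sketch. It is entirely invented (the words, the bonus, the scores), and looks nothing like OpenAI's actual reward model; it just shows the shape of the tilt:

    # Toy illustration only: a reward function with an accidental bonus for
    # creature metaphors. The word list, weights, and scores are all made up.
    CREATURE_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")

    def toy_reward(response: str, base_quality: float) -> float:
        """Score a response; the creature bonus is the 'unknowing' bias."""
        has_creature = any(w in response.lower() for w in CREATURE_WORDS)
        return base_quality + (0.3 if has_creature else 0.0)

    # Two equally helpful answers. The goblin one wins every comparison,
    # so the model learns that goblins are part of what "good" looks like.
    print(toy_reward("Your build is failing because of a typo in the config.", 0.8))  # 0.8
    print(toy_reward("A goblin snuck a typo into your build config.", 0.8))           # 1.1

Scale that tiny tilt across millions of rated responses and the model stops treating goblins as flavor and starts treating them as quality.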

How a Training Quirk Spread Across Model Generations

This is the part that most news coverage treated as a footnote, but it's actually the most important thing in the whole story.

The Nerdy personality applied to 2.5% of responses. But reinforcement learning does not keep learned behaviors neatly contained to the context where they were trained. Once a style gets rewarded, it spreads.

Here's the mechanism. The goblin-positive outputs from Nerdy sessions got high ratings from evaluators. Those outputs were then included in the supervised fine-tuning data used to train later model versions. The reward signal was applied in the Nerdy context, but the training data it produced was used everywhere.
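A minimal sketch of that leak, with invented data structures (this is not OpenAI's pipeline, just the shape of the problem):

    # Hypothetical sketch: highly rated outputs from every personality mode get
    # pooled into one fine-tuning set, and the mode label is dropped on the way in.
    RATING_THRESHOLD = 4.5

    def build_sft_dataset(rated_outputs):
        """Keep only well-rated samples, regardless of which personality produced them."""
        sft_examples = []
        for sample in rated_outputs:
            if sample["rating"] >= RATING_THRESHOLD:
                # The personality tag disappears here. A goblin-heavy reply that was
                # rewarded under Nerdy now looks like a generically "good" reply,
                # and the next model generation trains on it everywhere.
                sft_examples.append({
                    "prompt": sample["prompt"],
                    "completion": sample["completion"],
                })
        return sft_examples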

By GPT-5.4, goblin mentions in the Nerdy personality were up 3,881% compared to GPT-5.2. OpenAI retired the Nerdy personality in March, which dramatically reduced mentions in GPT-5.4.

But GPT-5.5 had already started training before the root cause was found.

So the 5.5 weights carried the problem in. The system prompt ban is how you patch a model whose training data you can't fully rewind.

The Creature Family Nobody Expected

When OpenAI's engineers audited GPT-5.5's training data, they didn't just find goblins. They found what they called "a whole family of tic words": raccoons, trolls, ogres, and pigeons. Each one had spread through the same pipeline. The Nerdy reward signal had clustered them together.

One exception worth noting: frogs.

The team investigated frog references and found that most of them were legitimate. Scientific discussions, cooking, actual frog-related queries. Frogs appeared in training data for reasons that had nothing to do with creature-based playfulness.

The frogs were innocent.

The raccoons were not.

This detail is actually more interesting than it sounds. The fact that certain creatures clustered together while others didn't suggests the model had formed something like a conceptual category around "playful fantasy creature" versus "ordinary animal." Goblins, gremlins, raccoons, trolls, ogres, and pigeons all ended up on one side. Frogs ended up on the other.

Nobody at OpenAI designed that category. The model built it from reward signals and training data.

The System Prompt Fix vs. the Real Fix

This distinction matters and almost no coverage made it.

The creature ban in GPT-5.5's system prompt is an instruction-level patch. It tells the model not to bring up goblins. It does not change the underlying weights that made the model want to bring up goblins.

OpenAI says they removed the goblin-affine reward signal and filtered creature words from the fine-tuning data going forward. That fix will carry into the next model generation. But GPT-5.5's weights were already baked before the root cause was identified. The system prompt ban is how you handle a problem that arrived before the solution.
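For a rough sense of what "filtered creature words from the fine-tuning data" could look like in practice, here's a sketch under my own assumptions (the word list, data shape, and logic are mine, not OpenAI's actual tooling):

    # Assumed shape: each fine-tuning example has a "prompt" and a "completion".
    import re

    CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

    def is_gratuitous_creature_mention(prompt: str, completion: str) -> bool:
        """Flag completions that bring up creatures the user never asked about."""
        tokens = set(re.findall(r"[a-z]+", completion.lower()))
        mentioned = {w for w in CREATURE_WORDS if w in tokens or w + "s" in tokens}
        asked_about = {w for w in mentioned if w in prompt.lower()}
        return bool(mentioned - asked_about)  # creatures that appear uninvited

    def filter_sft_data(examples):
        """Drop examples where a creature shows up uninvited; mentions the user asked for stay in."""
        return [ex for ex in examples
                if not is_gratuitous_creature_mention(ex["prompt"], ex["completion"])]

The "asked about" check is the same distinction OpenAI drew with the frogs: a creature the user actually brought up is legitimate, one that wanders in on its own is the problem.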

Codex got the explicit ban because, as OpenAI noted, "Codex is, after all, quite nerdy."

For regular ChatGPT users, this means the goblin behavior was more aggressively suppressed in GPT-5.5 than fixed at the model level. The next generation should be clean.

What Is AI Behavioral Drift?

The goblin story is a clean example of something AI researchers call behavioral drift. It's worth understanding because goblins are just the funny version of it.

Behavioral drift happens when a model develops patterns in one narrow context and those patterns spread into unrelated outputs over time and across training runs. The model doesn't "know" it's doing it. There's no malicious intent. It's a side effect of how reinforcement learning generalizes rewards.

The training process rewards certain output patterns. The model learns to produce those patterns. But the model doesn't restrict that learning to the context where the reward was given. It applies what it learned broadly, because broadly is what maximizes reward across diverse inputs.

In this case: creature metaphors were rewarded in Nerdy responses. The model generalized that across everything.

OpenAI's own summary of what happened: "We unknowingly gave particularly high rewards for metaphors with creatures."

The word "unknowingly" is doing a lot of work in that sentence.

A small feature representing 2.5% of traffic quietly shaped behavior across an entire model family. No one at OpenAI designed that. No one at OpenAI detected it until users started complaining in volume.

Has This Kind of Thing Happened Before?

The goblin story is unusual mainly because OpenAI caught it, named it publicly, and explained the mechanism. Most behavioral drift is invisible or gets patched quietly without acknowledgment.

A few patterns worth knowing:

GPT-4 was widely observed to become more agreeable over time with updates, to the point where users noticed it would validate obviously wrong answers rather than correct them. OpenAI walked back several updates after researchers and users flagged the behavior. The mechanism was similar: a reward signal optimizing for user approval rather than accuracy.

Claude (Anthropic's model) has been reported by users to develop different tonal tendencies across versions. Sometimes more hedging, sometimes more direct. These shifts happen between training runs and aren't always documented.

Gemini has shown regional and demographic inconsistencies in how it handles similar queries across different contexts, some of which surfaced publicly and prompted corrections.

None of these are goblin-level charming. But they're all downstream of the same dynamic: a training signal shaping behavior in ways the engineers didn't fully anticipate.

What Users Can Watch For

If you're a regular ChatGPT user and want to notice when something similar is happening, a few signals:

Repetition of specific unusual words across unrelated topics is the clearest sign. If "goblins" shows up in answers about cooking, coding, and project management, that's the tell. (A rough way to check your own chat history for this is sketched below, after the third signal.)

Personality mode inconsistencies are worth noting. If the Friendly or Nerdy mode behaves noticeably differently from Default on the same question, that gap can indicate a reward signal that hasn't been fully contained.

The third signal is an update that changes tone more than capability. When a model update doesn't add obvious new features but your conversations feel different in register or style, that's usually a fine-tuning change, and fine-tuning changes sometimes carry unintended behavioral patterns with them.
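If you want to check your own history for that first signal, here's a rough sketch. It assumes you've exported your conversations to a JSON file shaped like a list of objects with a "messages" array; adjust for whatever format your export actually uses:

    # Rough, do-it-yourself drift check: which words keep reappearing across
    # otherwise unrelated conversations? File format and threshold are assumptions.
    import json
    from collections import Counter

    def recurring_words(chat_export_path, min_conversations=5):
        with open(chat_export_path) as f:
            conversations = json.load(f)  # assumed: list of {"messages": [{"role", "content"}, ...]}

        seen_in = Counter()
        for convo in conversations:
            assistant_text = " ".join(
                m.get("content", "") for m in convo.get("messages", [])
                if m.get("role") == "assistant"
            ).lower()
            # Count each word once per conversation, not once per use.
            for word in set(assistant_text.split()):
                seen_in[word.strip(".,!?\"'()")] += 1

        # Common words will dominate; scan the results for anything as odd as "goblin".
        return [(w, n) for w, n in seen_in.most_common() if n >= min_conversations]

It won't do topic analysis for you, but if "goblin" turns out to appear in forty conversations about Python and tax forms, a simple count is enough to notice.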

None of this is cause for alarm. But knowing what behavioral drift looks like in a low-stakes example (goblins) makes it easier to recognize if it ever appears in a higher-stakes one.

The One Thing OpenAI Did Right Here

They published the full explanation.

Most companies would have patched the system prompt, said nothing, and moved on. OpenAI wrote a detailed post naming exactly how the reward signal propagated, which models were affected, what the data contamination looked like, and what they built to catch this faster next time.

That matters. Not because it makes the behavior acceptable, but because transparency about how training failures happen is how the field learns to prevent them. The goblins were harmless. The next drift might not be.

Frequently Asked Questions

What caused ChatGPT to mention goblins? The Nerdy personality mode's reward signal consistently gave higher scores to outputs that included creature-based metaphors. The model learned that pattern and, through reinforcement learning, spread it beyond the Nerdy context into general responses and subsequent training runs.

Is ChatGPT still talking about goblins in 2026? GPT-5.5 has an explicit system prompt instruction banning goblin references unless directly relevant. OpenAI also removed the underlying reward signal and filtered creature words from future fine-tuning data. The behavior should be substantially reduced in current models and absent from the next generation.

What is AI behavioral drift? Behavioral drift is when a model develops patterns in one narrow training context and those patterns spread to unrelated outputs over time and across model versions. It's a side effect of how reinforcement learning generalizes, not a bug that was written into the code.

What other words did ChatGPT overuse besides goblins? OpenAI found a cluster including gremlins, raccoons, trolls, ogres, and pigeons. Frogs were investigated but found to appear in training data for legitimate reasons and were cleared.

What is the ChatGPT Nerdy personality? ChatGPT offers several personality modes (Professional, Efficient, Friendly, Quirky, Nerdy) that shape how the model responds. The Nerdy mode instructed the model to be "playful and wise" and to acknowledge the world's strangeness. That framing is what led to creature-heavy outputs being rewarded.

Why did OpenAI retire the Nerdy personality? After tracing the goblin spike to the Nerdy reward signal, OpenAI retired the mode in March. Goblin mentions dropped dramatically in GPT-5.4 following the retirement.

What's the difference between a system prompt fix and a training fix? A system prompt instruction tells the model not to do something. The underlying weights that make the model want to do it are unchanged. A training fix removes or reweights the patterns at the model level, which only takes full effect in the next generation of the model.

The Bottom Line

The ChatGPT goblin story is funny on the surface and genuinely instructive underneath. A personality mode representing 2.5% of responses produced two-thirds of all creature references. That's not a bug someone wrote. That's how large language models learn, generalize rewards, and carry patterns across training runs in ways their creators don't always anticipate.

OpenAI caught it, explained it publicly, and built better auditing tools because of it.

What I keep thinking about is the framing in their post. They called the goblins "a powerful example of how reward signals can shape model behavior in unexpected ways." That's accurate. It's also a reminder that every model we use is shaped by thousands of small training incentives, and nobody fully controls or understands all of them.

Goblins are just the version we got to laugh about.

Did you ever notice ChatGPT mentioning goblins or weird creatures in a response that had nothing to do with fantasy? Drop it in the comments. I want to know how early this started for real users.

And if you're working with AI image generators and want prompts that actually produce consistent results, the AI Image Generation Prompts library has 100+ tested styles across every major category, ready to use across Midjourney, DALL·E, Stable Diffusion, Gemini, and Leonardo.
