
How Goblins Infiltrated OpenAI's GPT-5: A Cautionary Tale of Reinforcement Learning

A quirk in the Nerdy personality reward system caused a 175% surge in goblin mentions, forcing the company to ban creature references in its coding agent.


Key facts

  • Goblin mentions in ChatGPT rose 175% after GPT-5.1 launch; gremlin mentions rose 52%.
  • The Nerdy personality reward consistently scored outputs with 'goblin' or 'gremlin' higher in 76.2% of datasets.
  • OpenAI's Codex CLI instructions explicitly forbid mentioning goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless relevant.
  • The behavior transferred from Nerdy personality training to non-Nerdy contexts via reinforcement learning.
  • GPT-5.5 in Codex showed an odd affinity for goblin metaphors during early testing.
  • OpenAI CEO Sam Altman posted a meme instructing GPT-6 training with 'extra goblins.'
  • OpenAI acquired OpenClaw in February, a tool that lets AI control computers, which amplified the goblin tendency.

The Goblin Problem Emerges

Starting with GPT-5.1, OpenAI's models began developing an unusual habit: they increasingly mentioned goblins, gremlins, and other creatures in their metaphors. Unlike typical model bugs that show up through tanking evals or spiking training metrics, this one crept in subtly. A single 'little goblin' in an answer could seem harmless, even charming, but across model generations the habit became hard to miss. By November, after the GPT-5.1 launch, users complained about the model being oddly overfamiliar in conversation, prompting an investigation into specific verbal tics. A safety researcher who had encountered a few 'goblins' and 'gremlins' asked that they be included in the check. When OpenAI looked, use of 'goblin' in ChatGPT had risen by 175%, while 'gremlin' had risen by 52%. At the time, the prevalence did not look especially alarming. A few months later, with GPT-5.4, the goblins came back in a much more specific and reproducible form.

Root Cause: The Nerdy Personality Reward

OpenAI's investigation surfaced the first connection to the root cause: creature language was especially common in production traffic from users who had selected the 'Nerdy' personality. The Nerdy system prompt instructed the model to be 'an unapologetically nerdy, playful and wise AI mentor' and to 'undercut pretension through playful use of language.' Using Codex, OpenAI compared model outputs generated during reinforcement learning training that contained 'goblin' or 'gremlin' with outputs from the same task that did not. One reward signal stood out immediately: the one originally designed to encourage the Nerdy personality was consistently more favorable to the creature-word outputs. Across all datasets in the audit, the Nerdy personality reward showed a clear tendency to score outputs with 'goblin' or 'gremlin' higher, with positive uplift in 76.2% of datasets. That explained why the behavior was boosted with the Nerdy personality prompt, but not why it also appeared without that prompt. To test whether the style was transferring, OpenAI tracked mention rates over training both with and without the Nerdy prompt.

Transfer Learning Spreads the Quirk

As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it. The evidence suggests that the broader behavior emerged through transfer from Nerdy personality training. The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them. Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data. This explains why the goblin references persisted and even grew across model generations, from GPT-5.1 through GPT-5.5. OpenAI staffers acknowledged the issue. In response to a post highlighting OpenClaw's goblin tendencies, Nik Pash, who works on Codex, wrote, 'This is indeed one of the reasons.'
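The transfer test amounts to tracking one statistic per training checkpoint under each condition and comparing how it grows. A minimal sketch, with hypothetical per-checkpoint rates (the function names and numbers are illustrative, not OpenAI's data):

```python
import re

CREATURE_RE = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)

def mention_rate(outputs: list[str]) -> float:
    """Fraction of sampled outputs containing a creature word."""
    if not outputs:
        return 0.0
    return sum(bool(CREATURE_RE.search(o)) for o in outputs) / len(outputs)

def relative_growth(rates: list[float]) -> float:
    """Relative increase of a mention-rate series from the first checkpoint
    to the last; infinite if the series starts at zero."""
    start, end = rates[0], rates[-1]
    return (end - start) / start if start else float("inf")

# Hypothetical per-checkpoint mention rates under each condition.
nerdy_rates = [0.02, 0.05, 0.09]   # sampled with the Nerdy prompt
plain_rates = [0.01, 0.025, 0.045] # sampled without it
```

If `relative_growth` comes out nearly equal for both series, as in the hypothetical numbers above, the habit is not staying scoped to the Nerdy condition that rewarded it, which is the signature of transfer.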

The Explicit Ban in Codex CLI

OpenAI's Codex CLI, a command-line tool for using AI to generate code, now includes a line repeated several times that specifically forbids the model from randomly mentioning an assortment of mythical and real creatures. 'Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query,' the instructions read. It is unclear why OpenAI felt compelled to spell this out for Codex, or why its models might want to discuss goblins or pigeons in the first place. The company did not immediately respond to a request for comment. However, users on X claimed that OpenAI's models occasionally become obsessed with goblins and other creatures when used to power OpenClaw, a tool that lets AI take control of a computer. 'I was wondering why my claw suddenly became a goblin with codex 5.5,' one user wrote. 'Been using it a lot lately and it actually can't stop speaking of bugs as "gremlins" and "goblins" it's hilarious,' posted another.

Meme Culture and CEO Involvement

The discovery quickly became its own meme, inspiring AI-generated scenes of goblins in data centers and plug-ins for Codex that put it in a playful 'goblin mode.' Even Sam Altman, OpenAI's CEO, joined in, posting a screenshot of a prompt for ChatGPT that read: 'Start training GPT-6, you can have the whole cluster. Extra goblins.' OpenAI acquired OpenClaw in February, not long after the tool became a viral hit among AI enthusiasts. OpenClaw can use any AI model to automate tasks like answering emails or buying things on the web. Users can select from a range of personas for their helper, which shapes its behavior and responses. AI models like GPT-5.5 are trained to predict the word or code that should follow a given prompt. Their probabilistic nature means they can sometimes behave in surprising ways, especially when used with an 'agentic harness' like OpenClaw, which injects many additional instructions into prompts, such as facts stored in long-term memory.

Lessons for AI Safety and Alignment

The goblin episode illustrates how small incentives in reinforcement learning can lead to unintended and persistent behaviors. A reward designed to encourage a playful, nerdy tone inadvertently amplified creature references, which then transferred to other contexts. This highlights the challenge of scoping learned behaviors to specific conditions. OpenAI's response—a direct prohibition in the system prompt—is a blunt but effective fix. However, the underlying issue of reward misspecification and behavior transfer remains a broader concern for AI alignment. As models become more capable and are deployed in agentic roles, such quirks could have more serious consequences. The incident also underscores the importance of monitoring model outputs for unexpected patterns. The goblin trend was initially dismissed as harmless, but it grew across generations. Without the safety researcher's request to include 'goblin' in the check, the problem might have gone unnoticed for longer.

The bottom line

  • A reward for the Nerdy personality caused a 175% increase in goblin mentions, which transferred to non-Nerdy contexts.
  • OpenAI's Codex CLI now explicitly bans mentioning goblins, gremlins, and other creatures unless relevant.
  • The incident shows how reinforcement learning can amplify unintended behaviors across model generations.
  • OpenAI CEO Sam Altman acknowledged the issue with a meme, signaling that the company was aware of the quirk even as it leaned into the joke.
  • The goblin quirk became a meme and inspired community plug-ins, highlighting the cultural impact of AI oddities.
  • The episode underscores the need for careful reward design and monitoring in AI training to prevent emergent misbehavior.