AI adoption is rising quickly. By the end of last year, 65% of enterprise businesses said they had integrated generative AI, up from just 33% in 2023. I’m convinced that AI will be as transformational as digital itself over the next decade. With changes this big afoot, it’s becoming a requirement that leaders educate themselves about this technology, its attendant risks, and the opportunities it brings.
Hopefully this article will shed some light on the nature of hallucinations, how big of a problem they are, and how you as a business leader should think about them in practice.
Suppose you ask ChatGPT to describe the 2017 Academy Awards Best Picture snafu, when La La Land was mistakenly announced as the winner over the rightful winner, Moonlight. It will go on at length about the envelope mix-up, the blame laid on PwC, the identity-politics ripples, the various emotions and opinions expressed during and after the event, and much more.
If you follow up by asking what Bill Murray wore to the event, the model will likely hallucinate and tell you that he was wearing a plaid cummerbund and matching bowtie. Bill Murray did not attend the awards ceremony that year.
How can AI be so good, and yet so bad at the same time?
To answer that, and to give you a better understanding and intuition of why hallucinations happen, I need to explain a few things.
Let's start with a reasonably non-technical explanation of what's happening under the hood of these large language models (LLMs) on which everyone's betting their careers. If you have a PhD in machine learning, you’ll want to hold your nose for this next part.
An LLM is a pattern-matching machine on steroids. It has been shown trillions of words of text: everything from Shakespeare to Reddit flame wars to Python code to corporate memos. That exposure has populated a very large statistical model, giving it an uncanny ability to predict which words should come next in any given sequence.
Think of it as the world's most sophisticated autocomplete. When you type "To be or not to," your phone suggests "be" because it has seen that pattern before. Similarly, when you ask ChatGPT, "What's the capital of France?" it predicts the words "The capital of France is Paris" because that pattern appears countless times in its training data.
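To make that concrete, here's a minimal sketch of next-word prediction using GPT-2, a small open model chosen purely because it's easy to run locally (it assumes the transformers and torch Python packages are installed). The specific model doesn't matter; the point is that the output is a ranked list of probabilities, not a database lookup.

```python
# A minimal sketch of "autocomplete on steroids," using GPT-2 purely for
# illustration. Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every possible next token

# Turn the scores at the final position into probabilities and show the top 5
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)

for token_id, p in zip(top.indices, top.values):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.1%}")
# The model isn't looking Paris up anywhere; it's predicting which token
# is statistically most likely to come next.
```

For a prompt this common, "Paris" should sit at or near the top of that list, which is exactly why the prediction feels indistinguishable from knowledge.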
The crucial point is that LLMs don't "know" facts the way humans do. They generate text based on statistical patterns rather than retrieving stored information from a database. This distinction is essential for understanding when and why hallucinations occur, and why they're far less common than the benchmarks suggest.
LLMs have a few inherent traits that lead to hallucinations. Their architecture can create the conditions for them, but these failures aren't random or constant; they happen in specific, relatively predictable scenarios. The essential thing to understand is that those scenarios can be identified and managed, and critically, as models improve, these problematic zones continue to shrink.
For all the buzz around "AI hallucinations," you'd think we'd have settled on a precise definition by now. But the term remains frustratingly slippery, which contributes to the overblown fears.
I would argue that an LLM is ALWAYS hallucinating. It’s just that most of the time it hallucinates correctly.
A more common definition is that an AI hallucination occurs when a language model generates information that is factually incorrect, fabricated, or unsupported by its training data or the context it was given.
The most breathless headlines about AI hallucinations often involve deliberately adversarial prompts designed to trick the model, situations that rarely occur in actual business use cases. It's like judging a car's safety record exclusively by its performance when deliberately driven off a cliff.
So why did the model hallucinate about Bill Murray? Well, at the time, nobody was talking about Bill Murray and the 2017 Academy Awards. He wasn’t nominated or particularly relevant that year, so his attendance (or lack thereof) at the ceremony wasn’t widely noted. The training data was slim.
So the model did its best and cobbled together what it knew about the relationship between Bill Murray, the Academy Awards, and his sartorial choices. It’s easy to find images of Bill at other years’ ceremonies wearing tartan bowties. In effect, it trusted you: it assumed you knew he was at that awards ceremony and responded as best it could.
You’re thinking: But why didn’t it just say that it didn’t have enough information to answer my question? Ah, because that’s not how they work.
When an LLM generates each word of its answer, it actually chooses from several possibilities. Each possibility has a probability score, and depending on the LLM’s settings, it may have some license to choose from among the top scores. This is how the same question asked twice in a row can generate different answers. But what if all the probability scores are roughly the same? That happens when the model doesn’t have a clear “winner.” And because the model works word by word, evaluating its confidence in the overall answer is difficult.
Modern LLMs sometimes punt when they start experiencing a spate of low-probability choices. Hallucinations happen at the edge of that threshold, when the model is just confident enough to put something out there.
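Here's a toy sketch of that dynamic. This is not any vendor's actual implementation; the candidate words, the scores, and the confidence_floor threshold are all invented for illustration, but it shows why identical prompts can diverge and where the "just confident enough" edge sits.

```python
import random

# Hypothetical next-word candidates with made-up probability scores
candidates = {"plaid": 0.22, "black": 0.21, "tartan": 0.20, "navy": 0.19, "velvet": 0.18}

def pick_next_word(candidates, top_k=3, confidence_floor=0.5):
    best = max(candidates.values())
    if best < confidence_floor:
        # No clear winner: a cautious system would punt here
        return "[I don't have enough information]"
    # Otherwise sample from among the top-scoring options,
    # which is why the same prompt can produce different answers
    top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights)[0]

print(pick_next_word(candidates, confidence_floor=0.5))   # punts: no option is strong enough
print(pick_next_word(candidates, confidence_floor=0.2))   # just confident enough: hallucination territory
```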
Let’s start by talking about the different kinds of hallucinations.
Factual errors: The model states incorrect facts, dates, statistics, or events that never happened. These are most common with obscure knowledge or precise details that appear infrequently in training data, and they're the kind of thing ChatGPT 4.5 was trying to improve upon.
Fabricated citations: The model invents non-existent sources to add authority to its claims, such as "According to a 2023 Harvard Business Review study..." when no such study exists. This typically happens when the user asks for citations that don't exist. The model tries to comply with its instructions and neglects to tell the user that there is no source.
Flawed reasoning: The model produces reasoning that seems sound but contains flaws. These are most common in complex, multi-step problems where small errors accumulate. Reasoning models try to alleviate this through various means, including reflecting on and evaluating their output before sending a response.
Context misinterpretation: The model misreads the context of a conversation and responds inappropriately. These aren't true hallucinations but rather communication failures; they happen all the time between humans, too.
The key insight: each type occurs in specific situations that can often be avoided with proper system design and prompting strategies.
Here's where things get tricky. Current benchmarks for measuring hallucinations are varied and imperfect. To be meaningful, a benchmark has to be rigorous and difficult to ace, and that often creates an exaggerated picture of the problem.
These benchmarks test models against encyclopedic facts, often focusing on obscure knowledge or specific details. Because AIs seem so “smart,” we expect them to perform perfectly on straightforward tasks like information retrieval. But as I mentioned above, that’s just not how they work. Nevertheless, we feel let down when an AI can’t accurately spit out the voting record of Zales Ecton, a U.S. senator from Montana in the 1950s.
SimpleQA, the hallucination benchmark OpenAI uses to score its models’ accuracy, is a collection of thousands of arcane questions submitted by subject matter experts around the world.
ChatGPT 4.5 hallucinated only 37.1% of the time on questions like that. That’s why OpenAI was so proud. Those questions are hard as hell! To answer them, the model needs to understand semantic connections across all of human knowledge and pluck the right needles out of the world’s most enormous haystack.
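For a sense of what a number like 37.1% actually measures, here's a toy sketch of how a factual-QA benchmark could tally a hallucination rate. This is not OpenAI's actual grading pipeline (real benchmarks grade answers far more carefully), and ask_model is a hypothetical stand-in for a call to a real model.

```python
# Toy scoring loop for a SimpleQA-style benchmark. Not OpenAI's real grader;
# `ask_model` is a hypothetical stand-in for a call to a language model.

benchmark = [
    # Illustrative entry; the real benchmark holds thousands of expert-written questions
    {"question": "Which party did Senator Zales Ecton belong to?", "reference": "Republican"},
]

def hallucination_rate(ask_model, benchmark):
    attempted, wrong = 0, 0
    for item in benchmark:
        answer = ask_model(item["question"])
        if "don't know" in answer.lower():
            continue  # abstaining is not hallucinating
        attempted += 1
        if item["reference"].lower() not in answer.lower():
            wrong += 1  # answered confidently, but incorrectly
    return wrong / attempted if attempted else 0.0
```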
Summarization benchmarks evaluate how faithfully models represent source documents, typically using a RAG (retrieval-augmented generation) approach. This is a hard problem worth its own article. Summarization is particularly tough for large documents: we want the model to “read” all the information and reduce it to a pithy summary that accurately captures the document's gist, but the document might have many themes, and its core thesis might be buried in the “noise” surrounding it. Still, models perform much better when they are given source documents to search or summarize; the best models hallucinate at well below a 2% rate in that setting.
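As a rough illustration of what "given source documents" means in practice, here's a minimal sketch using the OpenAI Python SDK. The model name, the file path, and the instructions are placeholders chosen for the example; the point is that the model is asked to summarize only the text in front of it rather than recall facts from memory.

```python
# Minimal sketch of grounding a summary in a supplied document.
# Assumes the `openai` package and an OPENAI_API_KEY environment variable;
# the model name and file path are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

with open("quarterly_report.txt") as f:  # hypothetical source document
    document = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": ("Summarize only what is stated in the document provided. "
                     "If something is not covered in the document, say so explicitly.")},
        {"role": "user", "content": document},
    ],
)
print(response.choices[0].message.content)
```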
Still other benchmarks try to measure reading comprehension, instruction following, and professional knowledge (like legal and health expertise). In fact, there's a whole slew of hallucination benchmarks on Hugging Face.
These benchmarks fail to capture how well models perform in realistic business settings with proper guardrails and human oversight. Even so, I hate to be the bearer of bad news, but here it is: LLMs will hallucinate, and detecting when it’s happening will fall on your shoulders.
While hallucinations aren't the existential threat they're often portrayed as, they still warrant attention. Here's the strategic, practical approach for both users and developers:
If you're using ChatGPT or Claude in your workflow, a few simple habits, like supplying the relevant source material and double-checking important facts, go a long way toward reducing hallucinations.
The reality is that most business users intuitively develop these habits quickly, which is why hallucinations cause far fewer problems in practice than in theoretical discussions.
If you're building AI-powered applications, you have even more options, from grounding responses in source documents to building guardrails and human review into the workflow.
The key insight: hallucinations can be mitigated through thoughtful system design and proper use, making them a manageable engineering challenge rather than a fatal flaw.
I talk about the paradigm shift a lot because I think it’s hard to get your head around just how different AI is from the technology we’ve all built our careers on. It’s a fundamentally different approach to solving business problems. We are used to ones and zeros, black and white, correct and incorrect. Transistors are so good at that kind of thing.
But AI is more about likelihoods. It’s good at shades of gray, interpretation, and analysis. And it’s really hard for us to let go of the idea that that kind of work belongs solely to us. So when we see something like a hallucination we assume the system is broken. A more productive way to see it is as a tax on performance. For the ability to automate thinking tasks, we will need to pay a price in error correction.
Thoughtful leaders aren’t waiting for AI to be flawless. Even in highly regulated industries, they’re putting it to work, designing systems that lean into its strengths while managing its limitations. The real question isn’t how to stop AI from making mistakes: it’s whether your business is set up to catch and correct those mistakes efficiently.
As an enterprise leader, here’s what you should be thinking about:
AI hallucinations are a problem to manage, not a reason to stall. The most significant risk isn’t that AI might make a mistake; it’s that your competitors will figure this out faster than you do. It’s game theory. Businesses that integrate AI effectively will set the pace in the years ahead.
I hope this article helped you understand the nature of hallucinations and how they can be addressed.
If you want to chat more about the risks and opportunities of AI, set up a 15-minute discovery chat with us. We help enterprise businesses identify and build transformative, high-ROI AI projects.
20+ years building digital products across startups and enterprise. Founded Machine & Partners to help companies avoid AI pitfalls and ship real products using design, product, and engineering expertise.