Can LLMs Predict the Future or Are They Just Remembering?
LLMs are increasingly used to forecast stock prices and economic outcomes. But are they genuinely reasoning, or secretly recalling answers from their training data? We develop a simple statistical test to find out.
The Lookahead Bias Problem
When an LLM seems to predict the future, it might just be peeking at the answers.
LLMs Learn from the Internet
Large language models (LLMs) like ChatGPT are trained on billions of web pages, news articles, and reports. This training data includes financial news, stock market results, and company earnings — all mixed together without time boundaries.
Time Gets Scrambled
When researchers ask an LLM to "predict" stock returns from a 2020 headline, the LLM may have already seen articles about what actually happened next. It's like asking someone to predict a game when they already watched the replay.
Like Seeing the Answer Key
Imagine a student who gets perfect scores on every test — but only because they memorized the answer key beforehand. That's essentially what happens when an LLM "predicts" outcomes it has already seen in its training data.
Why It Matters
If we can't tell whether an LLM is reasoning or remembering, we might trust its predictions too much. Investors, researchers, and policymakers could make bad decisions based on LLM forecasts that look impressive but are actually just recall.
A Real-World Example
Here's how lookahead bias works in practice, using a real stock market event.
A researcher asks the LLM: "Based on this headline, will Kodak's stock go up or down?"
The LLM's training data likely includes both the original headline and the next-day coverage of the stock surge. So when asked to "predict," it may just be remembering.
The tempting conclusion: "The LLM correctly predicted Kodak's stock would surge! LLMs are great at forecasting."
What may actually have happened: "The LLM remembered that Kodak surged 318% because it read the next-day article during training."
This is lookahead bias: the LLM appears to predict the future, but it's really looking ahead into information it shouldn't have access to.
Our Solution: The LAP Test
A simple, low-cost statistical test that detects whether LLM forecasts are driven by memory or genuine reasoning.
Score How Familiar the Text Is
For each news headline or transcript, we compute a Lookahead Propensity (LAP) score. This measures how likely it is that the LLM saw this text (or very similar text) during training. We focus on rare, unusual words — because common words like "the" always get high scores, but rare words reveal whether the LLM has seen the specific content before.
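One way to write the score down, reading it off the code snippet further down this page: the 20% cutoff and the exponentiated mean of log-probabilities both come from that snippet, while the symbols below are illustrative notation and the paper's formal definition may differ in detail.

```latex
% x = (x_1, ..., x_T): the tokens of one headline or transcript
% p_\theta(x_i \mid x_{<i}): the probability the LLM assigns to token x_i given the preceding tokens
% B(x): the 20% of token positions with the lowest log-probability
\mathrm{LAP}(x) \;=\; \exp\!\left( \frac{1}{|B(x)|} \sum_{i \in B(x)} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

Because the log-probabilities are averaged before exponentiating, this is a geometric mean of the token probabilities over the least predictable tokens, so the score always lies between 0 and 1.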
Ask the LLM to Make Predictions
We prompt the LLM with tens of thousands of real news headlines and earnings call transcripts, asking it to predict stock returns and capital expenditures. This mirrors how researchers typically use LLMs for economic forecasting.
Check If Familiarity Drives Accuracy
Here's the key test: if the LLM is genuinely reasoning, its accuracy shouldn't depend on whether it saw the text before. But if it's remembering, predictions should be more accurate for high-LAP (familiar) texts. A positive relationship between LAP and accuracy is therefore evidence of lookahead bias.
Compute LAP Yourself
Getting LAP scores is simple. Deploy any open-source LLM with vLLM, then add one parameter to your API call.
Deploy your model with vLLM
vLLM is an open-source library for fast LLM inference. Install it and launch an OpenAI-compatible API server in one command:
# Install vLLM
pip install vllm
# Launch Llama-3.3-70B as an API server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000
Adjust --tensor-parallel-size to match the number of GPUs available.
Add one parameter to get LAP
The secret is prompt_logprobs=1. This tells vLLM to return the log-probability of every token in your prompt — exactly what you need to compute LAP.
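from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    prompt="Your news headline or transcript here...",
    max_tokens=1,
    extra_body={"prompt_logprobs": 1},  # <-- this is all you need
)

# Extract the log-prob of each prompt token.
# Assumed response layout (check your vLLM version): prompt_logprobs sits on the
# choice, with one dict per prompt token and None for the first token. Each dict
# may hold the actual token plus the top-1 candidate, so the lowest logprob in
# the dict belongs to the actual token.
prompt_lps = response.choices[0].prompt_logprobs
logprobs = [min(c["logprob"] for c in t.values()) for t in prompt_lps if t is not None]

# LAP = geometric-mean probability of the least likely 20% of tokens
k = max(1, int(len(logprobs) * 0.20))  # keep at least one token for short prompts
bottom_k = sorted(logprobs)[:k]
lap_score = np.exp(np.mean(bottom_k))
print(f"LAP score: {lap_score:.4f}")  # Higher = more likely memorized

That's it. No model retraining, no special libraries, just one extra parameter in your existing inference pipeline.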
Run the regression test
Once you have LAP scores for all your prompts, run a simple regression with the interaction term LLM Prediction × LAP. If the coefficient is positive and significant, your LLM forecasts are contaminated by lookahead bias.
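A minimal sketch of such a regression, run on synthetic data, might look like the following. Everything in it is an assumption made for illustration: the variable names, the simulated numbers, the choice to standardize LAP, and the plain OLS standard errors. The study's actual specification (controls, fixed effects, clustering) is not spelled out on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins, NOT data from the study.
pred = rng.choice([-1.0, 1.0], size=n)             # LLM call: -1 = bad news, +1 = good news
lap_raw = rng.uniform(0.0, 0.2, size=n)            # hypothetical raw LAP scores
lap = (lap_raw - lap_raw.mean()) / lap_raw.std()   # standardize: 1 unit = 1 standard deviation

# Simulated outcome with a built-in lookahead effect (true interaction = 0.3).
ret = 1.0 * pred + 0.3 * pred * lap + rng.normal(0.0, 1.0, size=n)

# OLS: ret ~ const + pred + lap + pred * lap
X = np.column_stack([np.ones(n), pred, lap, pred * lap])
beta, *_ = np.linalg.lstsq(X, ret, rcond=None)

# Classical (homoskedastic) standard errors.
resid = ret - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_inter = beta[3] / se[3]
print(f"prediction coef:  {beta[1]:.3f}")
print(f"interaction coef: {beta[3]:.3f} (t = {t_inter:.1f})")
print(f"amplification per 1 SD of LAP: {beta[3] / beta[1]:.0%}")
```

With LAP standardized, the ratio of the interaction coefficient to the prediction coefficient is one way to read a statement like "a one-standard-deviation increase in LAP amplifies the predictive effect by X%". Whether the reported figures are computed exactly this way is an assumption.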
What We Found
We tested LLM forecasts on two real-world financial tasks. Both show significant evidence of lookahead bias.
News Headlines & Stock Returns
91,361 Bloomberg headlines · 1,587 firms · 2012–2023 · Llama-3.3
A one-standard-deviation increase in LAP amplifies the LLM's predictive effect by ~37%. In other words, more than a third of the LLM's apparent predictive power comes from memorization, not genuine reasoning.
Small-cap stocks show the strongest lookahead bias. They get less routine media coverage, so when news about them does appear in training data, it creates a more memorable, unique pattern the LLM can recall.
Earnings Calls & Capital Expenditure
74,338 earnings call transcripts · 3,897 firms · 2006–2020 · Llama-3.3
A one-standard-deviation increase in LAP amplifies the LLM's predictive effect on capital expenditure by ~19%. The LLM's apparent ability to forecast corporate investment partly reflects memory of past reports.
A bootstrap-based placebo test using genuinely out-of-sample data confirms that the in-sample interaction between LAP and forecast accuracy is statistically distinguishable from the out-of-sample distribution.
How do we know it's real?
When we test on data the LLM has never seen (after its release date), the LAP effect disappears completely — exactly as our theory predicts. No memorization means no bias.
The results hold even after controlling for the LLM's own confidence level and response probability. LAP captures something distinct from model confidence.
A 10,000-repetition bootstrap test shows the in-sample bias estimate is clearly separated from the out-of-sample distribution (one-sided p = 0.033).
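The bootstrap comparison can be sketched roughly as follows. This is one plausible reading of the procedure, not the authors' exact method: the page does not describe the resampling scheme, so the code assumes out-of-sample observations are resampled with replacement, the interaction coefficient is re-estimated each time, and the one-sided p-value is the share of bootstrap estimates at or above the in-sample estimate. The helper names and all data are hypothetical.

```python
import numpy as np

def interaction_coef(pred, lap, ret):
    """OLS coefficient on pred * lap in ret ~ const + pred + lap + pred * lap."""
    X = np.column_stack([np.ones(len(ret)), pred, lap, pred * lap])
    beta, *_ = np.linalg.lstsq(X, ret, rcond=None)
    return beta[3]

def simulate(n, b_inter, rng):
    """Synthetic sample (NOT study data) with a chosen interaction effect."""
    pred = rng.choice([-1.0, 1.0], size=n)
    lap = rng.normal(size=n)                      # already standardized
    ret = pred + b_inter * pred * lap + rng.normal(size=n)
    return pred, lap, ret

rng = np.random.default_rng(42)
in_sample = simulate(4000, 0.3, rng)    # texts the model could have memorized
out_sample = simulate(1000, 0.0, rng)   # texts after the model's release: no effect

b_in = interaction_coef(*in_sample)

# Bootstrap the out-of-sample interaction coefficient.
n_boot = 10_000
pred_o, lap_o, ret_o = out_sample
m = len(ret_o)
boot = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, m, size=m)    # resample rows with replacement
    boot[b] = interaction_coef(pred_o[idx], lap_o[idx], ret_o[idx])

# One-sided p-value: how often does the placebo distribution reach the in-sample estimate?
p_one_sided = np.mean(boot >= b_in)
print(f"in-sample interaction: {b_in:.3f}")
print(f"one-sided p-value:     {p_one_sided:.4f}")
```

A small p-value here means the in-sample interaction is unlikely to arise from the out-of-sample distribution, which is the pattern the page describes.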
The Team
Frequently Asked Questions
Common questions about our research, explained simply.


