Can AI Predict the Future—or Is It Just Remembering?
AI models are increasingly used to forecast stock prices and economic outcomes. But are they genuinely reasoning, or secretly recalling answers from their training data? We develop a simple statistical test to find out.
The Lookahead Bias Problem
When AI seems to predict the future, it might just be peeking at the answers.
AI Learns from the Internet
Large language models (LLMs) like ChatGPT are trained on billions of web pages, news articles, and reports. This training data includes financial news, stock market results, and company earnings — all mixed together without time boundaries.
Time Gets Scrambled
When researchers ask AI to "predict" stock returns from a 2020 headline, the AI may have already seen articles about what actually happened next. It's like asking someone to predict a game when they already watched the replay.
Like Seeing the Answer Key
Imagine a student who gets perfect scores on every test — but only because they memorized the answer key beforehand. That's essentially what happens when an AI "predicts" outcomes it has already seen in its training data.
Why It Matters
If we can't tell whether AI is reasoning or remembering, we might trust its predictions too much. Investors, researchers, and policymakers could make bad decisions based on AI forecasts that look impressive but are actually just recall.
A Real-World Example
Here's how lookahead bias works in practice, using a real stock market event.
A researcher asks the AI: "Based on this headline, will Kodak's stock go up or down?"
The AI's training data likely includes both the original headline and the next-day coverage of the stock surge. So when asked to "predict," it may just be remembering.
"The AI correctly predicted Kodak's stock would surge! AI is great at forecasting."
"The AI remembered that Kodak surged 318% because it read the next-day article during training."
This is lookahead bias: the AI appears to predict the future, but it's really looking ahead into information it shouldn't have access to.
Our Solution: The LAP Test
A simple, low-cost statistical test that detects whether AI forecasts are driven by memory or genuine reasoning.
Score How Familiar the Text Is
For each news headline or transcript, we compute a Lookahead Propensity (LAP) score. This measures how likely it is that the AI saw this text (or very similar text) during training. We focus on rare, unusual words — because common words like "the" always get high scores, but rare words reveal whether the AI has seen the specific content before.
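The scoring step can be sketched as a small function: given per-token log-probabilities from any open-source LLM, keep the least-likely 20% of tokens (the rare words) and average their probability. The toy log-prob values below are illustrative, not real model output.

```python
import numpy as np

def lap_score(token_logprobs, bottom_frac=0.20):
    """Lookahead Propensity: mean probability of the least-likely
    bottom_frac of a text's tokens (its rare, 'surprising' words)."""
    lps = np.sort(np.asarray(token_logprobs, dtype=float))
    k = max(1, int(len(lps) * bottom_frac))          # bottom 20% of tokens
    return float(np.exp(lps[:k].mean()))             # geometric-mean probability

# Illustrative: a memorized text gets high probability even on rare tokens
seen_text = lap_score([-0.1, -0.2, -0.3, -0.5, -0.8])
# An unseen text is genuinely surprised by its rare tokens
unseen_text = lap_score([-0.1, -0.2, -0.3, -4.0, -9.0])
```

A higher score means the model found even its rarest tokens unsurprising, consistent with having seen the text (or something very close to it) during training.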
Ask the AI to Make Predictions
We prompt the AI with tens of thousands of real news headlines and earnings call transcripts, asking it to predict stock returns and capital expenditures. This mirrors how researchers typically use AI for economic forecasting.
Check If Familiarity Drives Accuracy
Here's the key test: if the AI is genuinely reasoning, its accuracy shouldn't depend on whether it saw the text before. But if it's remembering, predictions should be more accurate for high-LAP (familiar) texts. A positive correlation between LAP and accuracy = evidence of lookahead bias.
Compute LAP Yourself
Getting LAP scores is simple. Deploy any open-source LLM with vLLM, then add one parameter to your API call.
Deploy your model with vLLM
vLLM is an open-source library for fast LLM inference. Install it and launch an OpenAI-compatible API server in one command:
# Install vLLM
pip install vllm
# Launch Llama-3.3-70B as an API server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000

Adjust --tensor-parallel-size to match the number of GPUs available.
Add one parameter to get LAP
The secret is prompt_logprobs=1. This tells vLLM to return the log-probability of every token in your prompt — exactly what you need to compute LAP.
from openai import OpenAI
import numpy as np
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
prompt="Your news headline or transcript here...",
max_tokens=1,
extra_body={"prompt_logprobs": 1}, # <-- this is all you need
)
# Extract the log-prob of each prompt token. vLLM returns them on the
# first choice; the first entry is None (the first token has no context),
# and each later entry maps token-id -> logprob info. With
# prompt_logprobs=1 the dict holds the top-1 candidate plus the prompt
# token itself, so the prompt token is the lowest-probability entry.
entries = response.choices[0].prompt_logprobs
logprobs = [
    min(info["logprob"] for info in d.values())
    for d in entries
    if d is not None
]
# LAP = average probability of the bottom 20% of tokens
k = max(1, int(len(logprobs) * 0.20))
bottom_k = sorted(logprobs)[:k]
lap_score = np.exp(np.mean(bottom_k))
print(f"LAP score: {lap_score:.4f}")  # Higher = more likely memorized

That's it. No model retraining, no special libraries — just one extra parameter in your existing inference pipeline.
Run the regression test
Once you have LAP scores for all your prompts, run a simple regression with the interaction term AI Prediction × LAP. If the coefficient is positive and significant, your AI forecasts are contaminated by lookahead bias.
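On synthetic data, that regression looks roughly like this (a sketch using statsmodels; the column names `ret`, `pred`, and `lap` are placeholders for your realized outcomes, AI predictions, and LAP scores). Lookahead bias is simulated directly here, so the interaction comes out positive by construction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000
lap = rng.uniform(0, 1, n)        # familiarity score per text
pred = rng.normal(size=n)         # the AI's prediction
# Simulated lookahead bias: predictions track outcomes more when LAP is high
ret = 0.1 * pred + 0.4 * pred * lap + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"ret": ret, "pred": pred, "lap": lap})
# "pred * lap" expands to pred + lap + pred:lap; pred:lap is the key term
fit = smf.ols("ret ~ pred * lap", data=df).fit()
print(fit.params["pred:lap"], fit.pvalues["pred:lap"])
```

A positive, significant `pred:lap` coefficient says the AI's predictions line up with outcomes more closely exactly when the text looks familiar, which is the signature of memorization rather than reasoning.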
What We Found
We tested AI forecasts on two real-world financial tasks. Both show significant evidence of lookahead bias.
News Headlines & Stock Returns
91,361 Bloomberg headlines · 1,587 firms · 2012–2023 · Llama-3.3
A one-standard-deviation increase in LAP amplifies the AI's predictive effect by ~37%. In other words, more than a third of the AI's apparent predictive power comes from memorization, not genuine reasoning.
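One common way to read a figure like this: the amplification is the interaction coefficient, scaled by one standard deviation of LAP, relative to the baseline prediction coefficient. The coefficients below are made-up numbers chosen purely to illustrate the arithmetic, not the paper's estimates.

```python
# Hypothetical regression coefficients (illustrative only, not the paper's values)
beta_pred = 0.50          # baseline effect of the AI's prediction on returns
beta_interaction = 0.185  # effect of the Prediction x LAP interaction term
sd_lap = 1.0              # LAP standardized to unit variance

amplification = beta_interaction * sd_lap / beta_pred
print(f"{amplification:.0%}")  # a ~37% boost in apparent predictive power
```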
Small-cap stocks show the strongest lookahead bias. They get less routine media coverage, so when news about them does appear in training data, it creates a more memorable, unique pattern the AI can recall.
Earnings Calls & Capital Expenditure
74,338 earnings call transcripts · 3,897 firms · 2006–2020 · Llama-3.3
A one-standard-deviation increase in LAP amplifies the AI's predictive effect on capital expenditure by ~19%. The AI's apparent ability to forecast corporate investment partly reflects memory of past reports.
A bootstrap-based placebo test using genuinely out-of-sample data confirms that the in-sample interaction between LAP and forecast accuracy is statistically distinguishable from the out-of-sample distribution.
How do we know it's real?
When we test on data the AI has never seen (after its release date), the LAP effect disappears completely — exactly as our theory predicts. No memorization means no bias.
The results hold even after controlling for the AI's own confidence level and response probability. LAP captures something distinct from model confidence.
A 10,000-repetition bootstrap test shows the in-sample bias estimate is clearly separated from the out-of-sample distribution (one-sided p = 0.033).
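The logic of that placebo comparison can be sketched in a few lines. Here the 10,000 out-of-sample estimates are simulated from a normal distribution centered at zero, and the in-sample value is a made-up illustrative number; in the real test both come from actual regressions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for 10,000 bootstrap interaction estimates on post-release
# (never-seen) data, where no memorization is possible:
oos_estimates = rng.normal(loc=0.0, scale=0.10, size=10_000)

in_sample_estimate = 0.19  # illustrative in-sample interaction coefficient

# One-sided p-value: share of out-of-sample draws at least as extreme
p_one_sided = (oos_estimates >= in_sample_estimate).mean()
```

A small p-value means the in-sample estimate sits far in the right tail of what no-memorization data can produce.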