ACADEMIC RESEARCH PAPER

Can AI Predict the Future—or Is It Just Remembering?

AI models are increasingly used to forecast stock prices and economic outcomes. But are they genuinely reasoning, or secretly recalling answers from their training data? We develop a simple statistical test to find out.

[Figure: lookahead bias, explained — the core question] The AI's training data (the internet) contains both the Jul 28 headline "Kodak gets $765M loan for Covid drugs" and the Jul 29 follow-up "Kodak stock surges 318%". When a researcher asks the AI, "Kodak gets $765M loan for Covid drugs. Is this good or bad for the stock?", the answer could reflect genuine reasoning (analyzing the loan's impact on business fundamentals) or memorization (recalling "stock surged 318%" from training data). Our LAP test tells these apart.

The Lookahead Bias Problem

When AI seems to predict the future, it might just be peeking at the answers.

AI Learns from the Internet

Large language models (LLMs) like ChatGPT are trained on billions of web pages, news articles, and reports. This training data includes financial news, stock market results, and company earnings — all mixed together without time boundaries.

Time Gets Scrambled

When researchers ask AI to "predict" stock returns from a 2020 headline, the AI may have already seen articles about what actually happened next. It's like asking someone to predict a game when they already watched the replay.

Like Seeing the Answer Key

Imagine a student who gets perfect scores on every test — but only because they memorized the answer key beforehand. That's essentially what happens when an AI "predicts" outcomes it has already seen in its training data.

Why It Matters

If we can't tell whether AI is reasoning or remembering, we might trust its predictions too much. Investors, researchers, and policymakers could make bad decisions based on AI forecasts that look impressive but are actually just recall.

A Real-World Example

Here's how lookahead bias works in practice, using a real stock market event.

1

July 28, 2020 — The headline
"Kodak Triples on Loan to Make Covid-19 Drug Ingredients"

A researcher asks the AI: "Based on this headline, will Kodak's stock go up or down?"

2

July 29, 2020 — In AI's training data
"Kodak's stock rose so fast it tripped 20 circuit breakers in a single day"
"Following a more than 200% jump in Tuesday trading, the rally continued on Wednesday and the shares ended up 318%."

The AI's training data likely includes both the original headline and the next-day coverage of the stock surge. So when asked to "predict," it may just be remembering.

3

The result — prediction or recall?
What it looks like

"The AI correctly predicted Kodak's stock would surge! AI is great at forecasting."

What may actually be happening

"The AI remembered that Kodak surged 318% because it read the next-day article during training."

This is lookahead bias: the AI appears to predict the future, but it's really looking ahead into information it shouldn't have access to.

Our Solution: The LAP Test

A simple, low-cost statistical test that detects whether AI forecasts are driven by memory or genuine reasoning.

1

Score How Familiar the Text Is

For each news headline or transcript, we compute a Lookahead Propensity (LAP) score. This measures how likely it is that the AI saw this text (or very similar text) during training. We focus on rare, unusual words — because common words like "the" always get high scores, but rare words reveal whether the AI has seen the specific content before.
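To make the scoring concrete, here is a minimal sketch of the bottom-20% aggregation on toy per-token log-probabilities (the numbers are invented for illustration; Step 2 of "Compute LAP Yourself" below shows how to obtain real ones from a model):

```python
import numpy as np

def lap_score(logprobs, frac=0.20):
    """Geometric-mean probability of the hardest `frac` of prompt tokens."""
    k = max(1, int(len(logprobs) * frac))  # size of the bottom slice
    bottom_k = sorted(logprobs)[:k]        # least-predictable tokens
    return float(np.exp(np.mean(bottom_k)))

# Toy per-token log-probs: a "familiar" text where even the rare tokens
# are well predicted, vs. an "unseen" text where rare tokens surprise
# the model. Common tokens score high in both, which is why LAP focuses
# on the bottom slice.
familiar = [-0.1, -0.2, -0.1, -0.3, -0.5, -0.4, -0.2, -0.6, -0.3, -0.2]
unseen   = [-0.1, -0.2, -0.1, -6.0, -0.5, -7.5, -0.2, -5.0, -0.3, -0.2]

print(lap_score(familiar))  # high: even the hardest tokens are expected
print(lap_score(unseen))    # low: rare tokens genuinely surprise the model
```

Only the hardest tokens enter the score, so a text the model has memorized stands out sharply from one it merely finds fluent.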

2

Ask the AI to Make Predictions

We prompt the AI with tens of thousands of real news headlines and earnings call transcripts, asking it to predict stock returns and capital expenditures. This mirrors how researchers typically use AI for economic forecasting.

3

Check If Familiarity Drives Accuracy

Here's the key test: if the AI is genuinely reasoning, its accuracy shouldn't depend on whether it saw the text before. But if it's remembering, predictions should be more accurate for high-LAP (familiar) texts. A positive correlation between LAP and accuracy = evidence of lookahead bias.

[Figure: LAP test, intuition — the key insight] A standard regression (what researchers usually do), Stock Return = AI Prediction + noise, can't tell whether the prediction comes from reasoning or memory. Our LAP test adds two terms: Stock Return = AI Prediction + LAP + AI Prediction × LAP + noise. If the interaction term is positive and significant, lookahead bias is detected. No model retraining is needed, no access to proprietary training data is required, and the test works with any open-source LLM.

Compute LAP Yourself

Getting LAP scores is simple. Deploy any open-source LLM with vLLM, then add one parameter to your API call.

1

Deploy your model with vLLM

vLLM is an open-source library for fast LLM inference. Install it and launch an OpenAI-compatible API server in one command:

terminal
# Install vLLM
pip install vllm

# Launch Llama-3.3-70B as an API server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000

Adjust --tensor-parallel-size to match the number of GPUs available.

2

Add one parameter to get LAP

The secret is prompt_logprobs=1. This tells vLLM to return the log-probability of every token in your prompt — exactly what you need to compute LAP.

python
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    prompt="Your news headline or transcript here...",
    max_tokens=1,
    extra_body={"prompt_logprobs": 1},  # <-- this is all you need
)

# vLLM attaches prompt_logprobs to the choice: None for the first token,
# then one dict per prompt token mapping token id -> info. Each dict
# always contains the actual prompt token (plus the model's top-1
# candidate when it differs), so the minimum logprob in the dict is the
# actual token's log-probability.
entries = response.choices[0].prompt_logprobs
logprobs = [
    min(d["logprob"] for d in entry.values())
    for entry in entries
    if entry is not None
]

# LAP = geometric-mean probability of the bottom 20% (hardest) tokens
k = max(1, int(len(logprobs) * 0.20))
bottom_k = sorted(logprobs)[:k]
lap_score = np.exp(np.mean(bottom_k))

print(f"LAP score: {lap_score:.4f}")  # Higher = more likely memorized

That's it. No model retraining, no special libraries — just one extra parameter in your existing inference pipeline.

3

Run the regression test

Once you have LAP scores for all your prompts, run a simple regression with the interaction term AI Prediction × LAP. If the coefficient is positive and significant, your AI forecasts are contaminated by lookahead bias.
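As a sketch of that step (with simulated data standing in for real predictions, LAP scores, and returns, since the paper's dataset isn't reproduced here), the interaction regression can be run with plain OLS:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated stand-ins: the AI's prediction, the LAP score of each
# prompt, and the realized return. We bake in a memorization effect --
# predictions track returns more closely when LAP is high -- which is
# exactly what the interaction term should pick up.
lap = rng.uniform(0, 1, n)
pred = rng.normal(0, 1, n)
ret = 0.2 * pred + 0.5 * pred * lap + rng.normal(0, 1, n)

# Design matrix: intercept, AI Prediction, LAP, and AI Prediction x LAP
X = np.column_stack([np.ones(n), pred, lap, pred * lap])
beta, *_ = np.linalg.lstsq(X, ret, rcond=None)

# Classical OLS standard errors and the t-statistic for the interaction
resid = ret - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = beta[3] / se[3]

print(f"interaction coef = {beta[3]:.3f}, t = {t_interaction:.1f}")
```

A positive, significant interaction coefficient is the lookahead-bias signal; on real data you would cluster standard errors by firm and date rather than use the classical ones shown here.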

What We Found

We tested AI forecasts on two real-world financial tasks. Both show significant evidence of lookahead bias.

1

News Headlines & Stock Returns

91,361 Bloomberg headlines · 1,587 firms · 2012–2023 · Llama-3.3

37%
Amplification from memorization

A one-standard-deviation increase in LAP amplifies the AI's predictive effect by ~37%. In other words, more than a third of the AI's apparent predictive power comes from memorization, not genuine reasoning.

Small caps
Most affected by lookahead bias

Small-cap stocks show the strongest lookahead bias. They get less routine media coverage, so when news about them does appear in training data, it creates a more memorable, unique pattern the AI can recall.

2

Earnings Calls & Capital Expenditure

74,338 earnings call transcripts · 3,897 firms · 2006–2020 · Llama-3.3

19%
Amplification from memorization

A one-standard-deviation increase in LAP amplifies the AI's predictive effect on capital expenditure by ~19%. The AI's apparent ability to forecast corporate investment partly reflects memory of past reports.

p = 0.033
Bootstrap placebo test

A bootstrap-based placebo test using genuinely out-of-sample data confirms that the in-sample interaction between LAP and forecast accuracy is statistically distinguishable from the out-of-sample distribution.

How do we know it's real?

Out-of-sample placebo

When we test on data the AI has never seen (after its release date), the LAP effect disappears completely — exactly as our theory predicts. No memorization means no bias.

Controls for confidence

The results hold even after controlling for the AI's own confidence level and response probability. LAP captures something distinct from model confidence.

Bootstrap validation

A 10,000-repetition bootstrap test shows the in-sample bias estimate is clearly separated from the out-of-sample distribution (one-sided p = 0.033).
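A minimal sketch of how such a one-sided bootstrap p-value is computed (the in-sample estimate and the out-of-sample bootstrap draws below are simulated placeholders, not the paper's actual numbers):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins: one in-sample interaction estimate, and a 10,000-draw
# bootstrap distribution of the same estimate built from out-of-sample
# data, where memorization is impossible and the effect should center
# near zero.
in_sample_estimate = 0.45
out_of_sample_boot = rng.normal(loc=0.0, scale=0.25, size=10_000)

# One-sided p-value: the share of the out-of-sample distribution that
# reaches or exceeds the in-sample estimate
p_one_sided = np.mean(out_of_sample_boot >= in_sample_estimate)
print(f"one-sided p = {p_one_sided:.3f}")
```

A small p-value means the in-sample interaction is too large to be a fluke of the out-of-sample (memorization-free) distribution.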


Frequently Asked Questions

Common questions about our research, explained simply.

What is lookahead bias?
Lookahead bias happens when a prediction model accidentally uses information from the future. In the context of AI, large language models are trained on massive datasets that include news articles, financial reports, and their outcomes — all jumbled together. When you ask an AI to "predict" something that already happened, it might just be remembering the answer rather than genuinely figuring it out.

What is the Lookahead Propensity (LAP) score?
LAP is our measure of how likely it is that the AI has seen a particular piece of text during training. We compute it by looking at the probabilities the AI assigns to each word in a prompt, focusing on the rare, unusual words. If the AI assigns surprisingly high probabilities to words that should be hard to predict, it suggests the AI has seen this text (or very similar text) before.

Does this mean AI forecasts are useless?
Not at all! Our findings don't mean AI forecasts are useless. They mean that some portion of the apparent predictive power comes from memorization rather than reasoning — and the extent varies by task. The LAP test helps researchers quantify how much of an AI's performance is genuine and how much might be contaminated by lookahead bias.

Why does this matter?
AI-powered financial tools are increasingly used to make investment decisions, assess economic policies, and guide business strategy. If we can't distinguish AI reasoning from AI recall, we might rely too heavily on predictions that aren't actually based on new analysis. Our test helps ensure that AI-generated forecasts are trustworthy.

Which AI models did you use?
We primarily used Llama-3.3, an open-source model released by Meta in December 2024. We chose it because it provides access to the internal token probabilities needed to compute LAP — something proprietary models like ChatGPT don't allow. We also used Llama-2 for out-of-sample validation.

Can I use the LAP test in my own research?
Yes! That's the whole point. The LAP test is designed to be a practical, cost-efficient diagnostic tool. It doesn't require retraining the model or accessing proprietary training data. Any researcher using an open-source LLM for forecasting can apply our test to check whether their results might be affected by lookahead bias.

Why not just evaluate on out-of-sample data?
Using out-of-sample data (after the AI's knowledge cutoff) is ideal but often impractical. The available window is usually short, limiting statistical power. Retraining models year-by-year is prohibitively expensive. Our LAP test works on in-sample data, giving researchers a way to assess bias even when clean out-of-sample evaluation isn't feasible.