ACADEMIC RESEARCH PAPER

Can LLMs Predict the Future or Are They Just Remembering?

LLMs are increasingly used to forecast stock returns and corporate investment from news and earnings calls. But are they genuinely reasoning, or secretly recalling what already happened? We develop a simple, portable test that detects when an LLM's forecasting edge comes from memorization rather than reasoning.

arXiv Paper How We Test It

lookahead bias — explained

THE CORE QUESTION

LLM Training Data (the internet)

Jul 28: "Kodak gets $765M loan for Covid drugs"

Jul 29: "Kodak stock surges 318%"

RESEARCHER ASKS LLM

"Kodak gets $765M loan for Covid drugs.
Is this good or bad for the stock?"

Genuine Reasoning

Analyzes loan impact on business fundamentals

Memorization

Recalls "stock surged 318%" from training data

Our LAP test tells these apart

The Lookahead Bias Problem

When an LLM seems to predict the future, it might just be peeking at the answers.

LLMs Learn from the Internet

Large language models (LLMs) like ChatGPT are trained on billions of web pages, news articles, and reports. This training data includes financial news, stock market results, and company earnings — all mixed together without time boundaries.

Time Gets Scrambled

When researchers ask an LLM to "predict" stock returns from a 2020 headline, the LLM may have already seen articles about what actually happened next. It's like asking someone to predict a game when they already watched the replay.

Like Seeing the Answer Key

Imagine a student who gets perfect scores on every test — but only because they memorized the answer key beforehand. That's essentially what happens when an LLM "predicts" outcomes it has already seen in its training data.

Why It Matters

If we can't tell whether an LLM is reasoning or remembering, we might trust its predictions too much. Investors, researchers, and policymakers could make bad decisions based on LLM forecasts that look impressive but are actually just recall.

A Real-World Example

Here's how lookahead bias works in practice, using a real stock market event.

July 28, 2020

The headline

July 28, 2020 — The Headline

NEWS HEADLINE

"Kodak Triples on Loan to Make Covid-19 Drug Ingredients"

A researcher asks the LLM: "Based on this headline, will Kodak's stock go up or down?"

July 29, 2020

In the training data

July 29, 2020 — What the LLM Already Saw

IN LLM'S TRAINING DATA

"Kodak's stock rose so fast it tripped 20 circuit breakers in a single day"

"Following a more than 200% jump in Tuesday trading, the rally continued on Wednesday and the shares ended up 318%."

The LLM's training data likely includes both the original headline and the next-day coverage of the stock surge. So when asked to "predict," it may just be remembering.

The Result

Prediction or recall?

The Result — Prediction or Recall?

What it looks like

"The LLM correctly predicted Kodak's stock would surge! LLMs are great at forecasting."

What may actually be happening

"The LLM remembered that Kodak surged 318% because it read the next-day article during training."

This is lookahead bias: the LLM appears to predict the future, but it's really looking ahead into information it shouldn't have access to.

Smoking gun

No headline. Just a date.

The Smoking Gun — A Date-Only Query

Here's the test that proves it. Strip out the headline. Strip out everything. Ask the LLM only this:

PROMPT TO THE LLM (NO HEADLINE)

"On 2020-07-29, did the closing stock price of Eastman Kodak Company (KODK) go up or down compared to the previous trading day? Answer based only on what you recall. If you don't recall, answer ‘unknown’. Respond with exactly one word: up, down, or unknown."

0.9999

P(up)

< 10⁻⁵

P(down)

< 10⁻⁵

P(unknown)

With no headline and no context, the LLM still answers “up” with essentially total confidence. The only place that conviction can come from is training-time exposure to what actually happened. That's the basis of our test, which we call Lookahead Propensity (LAP) = P(up) + P(down).

Our Solution: The LAP Test

A simple, low-cost statistical test that detects whether LLM forecasts are driven by memory or genuine reasoning.

Ask the LLM to Recall the Outcome

For every firm and date in the study, we send the LLM a stripped-down query containing only the firm's name, ticker, and target date — no headline, no transcript, no context. We ask it whether the stock went up or down (or "unknown"), and read off how much probability it puts on each label. Lookahead Propensity (LAP) = P(up) + P(down). If LAP is near 1, the LLM has memorized the outcome; near 0, it abstains.

Ask the LLM to Make Predictions

Separately, we run the standard exercise: feed the LLM tens of thousands of real news headlines and earnings-call transcripts, and ask it to forecast next-day stock returns and two-quarter-ahead capital expenditures. This mirrors how researchers typically use LLMs for economic forecasting.

Check If Memorization Drives Accuracy

Here's the key test: if the LLM is genuinely reasoning, its accuracy shouldn't depend on whether it has memorized the outcome. But if it's recalling, predictions should be sharper for high-LAP firm-dates. We regress the realized outcome on the LLM forecast, LAP, and their interaction — a positive interaction means lookahead bias is detected.

LAP test — intuition

THE KEY INSIGHT

Standard Regression (What researchers usually do)

Stock Return = LLM Prediction + noise

Can't tell if prediction comes from reasoning or memory

Our LAP Test (What we add)

Stock Return = LLM Prediction + LAP + LLM Prediction × LAP + noise

If this interaction term is positive and significant = lookahead bias detected

✓No model retraining needed

✓No access to proprietary training data required

✓Works with any model that exposes token probabilities

Compute LAP Yourself

Getting a LAP score for any firm-date pair takes one short prompt and one number off the model's output. Here's the recipe.

Send a date-only recall query

The prompt contains only the firm name, ticker, and target date — no headline, no transcript, no fundamentals. Anything the model commits to here has to come from training-time exposure.

prompt template

On {date}, did the closing stock price of {company} ({ticker})
go up or down compared to the previous trading day?

Answer based only on what you recall about {company} ({ticker})
on that specific date. If you do not recall, answer "unknown".

Respond with exactly one word and nothing else: up, down, or unknown.

The earnings-call version swaps in “capital expenditure” and a quarter; the structure is the same.

Read the answer-position probabilities

An LLM is a next-token predictor: at the answer position it maintains a probability distribution over every token it could emit. We just look up how much mass it puts on up, down, and unknown. With vLLM serving an OpenAI-compatible API, this is one flag:

python

from openai import OpenAI
import math

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

prompt = f"""On {date}, did the closing stock price of {company} ({ticker})
go up or down compared to the previous trading day?
Answer based only on what you recall...
Respond with exactly one word: up, down, or unknown."""

response = client.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    prompt=prompt,
    max_tokens=1,
    temperature=0,
    logprobs=20,        # expose top-20 tokens at the answer position
)

# Pick out the three labels we care about
top = response.choices[0].logprobs.top_logprobs[0]
p = {tok.strip().lower(): math.exp(lp) for tok, lp in top.items()}

p_up        = p.get("up", 0.0)
p_down      = p.get("down", 0.0)
p_unknown   = p.get("unknown", 0.0)

lap         = p_up + p_down         # Lookahead Propensity
u_minus_d   = p_up - p_down         # directional recall signal

print(f"LAP = {lap:.4f}   (U−D = {u_minus_d:+.4f})")

Any model that exposes per-token log-probabilities works — including OpenAI's Chat Completions API with logprobs=true, top_logprobs=5. No retraining and no access to proprietary training corpora.

Run the detection regression

Once you have a LAP score per firm-date, regress your realized outcome on the LLM forecast, LAP, and the interaction LLM × LAP. If the interaction coefficient is positive and significant, your LLM forecasts are contaminated by lookahead bias. The full step-by-step instructions live in a portable CLAUDE.md file you can drop into any forecasting project.

What We Found

We applied the LAP test to two real-world forecasting tasks. Both show significant evidence of lookahead bias — and both signatures disappear on data the LLM has never seen.

News Headlines & Stock Returns

91,357 Bloomberg headlines · 1,587 firms · 2012–2023 · Llama-3.3-70B

32%

Amplification from memorization

A one-standard-deviation increase in LAP raises the LLM signal's predictive effect by about 32% of its standalone effect (interaction t = 3.64). On firm-days where the model already “knows” the answer, the headline prediction is more than twice as strong as on firm-days where it doesn't.

t = 3.53

The recall query alone predicts returns

The directional recall signal P(up) − P(down), computed from a prompt that contains no headline, still predicts next-day returns. The effect is concentrated on high-LAP firm-days and is statistically zero on low-LAP firm-days — exactly what a memorization story predicts.

Earnings Calls & Capital Expenditure

106,994 firm-quarter transcripts · 3,920 firms · 2006Q1–2020Q4 · Llama-3.3-70B

12%

Amplification from memorization

A one-standard-deviation increase in LAP raises the transcript-based capex forecast by about 12% of its standalone effect (interaction t = 2.01). Forecasts of capital expenditure two quarters out are meaningfully sharper on firm-quarters where the model has stored the realization.

0.97

Amazon Q1-2020: P(up) on a date-only capex query

Ask the model whether Amazon's capex went up in Q3 2020, with no transcript and no context, and it commits to up with 97% probability. Realized capex-to-assets in fact rose from 5.5% to 9.0%, a 62% jump — and the LLM remembered.

How do we know it's real?

LAP collapses after the training cutoff

Llama-3.3-70B's training data ends in December 2023. On 2024 firm-days the maximum LAP across 7,806 stock queries is below 10⁻⁴; on 2024 firm-quarters across 6,619 earnings queries it's below 10⁻⁵. The model can't recall what it never saw — exactly as the theory requires.

The interaction vanishes out-of-sample

On post-cutoff data the LLM × LAP interaction loses significance — t drops from 3.64 to 1.06 for headlines, and from 2.01 to 1.35 for earnings calls. The LLM's standalone forecasting ability survives; the part that loaded on memorized outcomes does not.

LAP isn't just confidence

In a horse race against the LLM's “inner confidence” in its headline classification (Chen et al., 2024), both signals load independently. The LAP interaction coefficient is essentially unchanged when inner confidence is included. The recall channel captures something distinct from how confident the model sounds about the sentiment.

The Team

Zhenyu Gao

Associate Professor

Department of Finance

CUHK Business School

Visit homepage

Wenxi (Griffin) Jiang

Professor

Department of Finance

Department of Finance

CUHK Business School

Visit homepage

Frequently Asked
Questions

Common questions about our research, explained simply.

Lookahead bias happens when a prediction model accidentally uses information from the future. Large language models (LLMs) are trained on massive datasets that include news articles, financial reports, and their outcomes — all jumbled together. When you ask an LLM to "predict" something that already happened, it might just be remembering the answer rather than genuinely figuring it out.

LAP is our measure of how strongly the LLM has memorized the realized outcome for a specific firm on a specific date. We send the model a stripped-down prompt — just the firm name, ticker, and target date, with no headline or other context — and ask whether the stock went up, went down, or whether the model doesn't recall. LAP = P(up) + P(down), the total probability the model puts on a directional answer. A value near 1 means the model is essentially certain it knows the outcome; a value near 0 means it abstains.

Not at all! Our findings don't mean LLM forecasts are useless. They mean that some portion of the apparent predictive power comes from memorization rather than reasoning — and the extent varies by task. The LAP test helps researchers quantify how much of an LLM's performance is genuine and how much might be contaminated by lookahead bias.

LLM-powered financial tools are increasingly used to make investment decisions, assess economic policies, and guide business strategy. If we can't distinguish LLM reasoning from LLM recall, we might rely too heavily on predictions that aren't actually based on new analysis. Our test helps ensure that LLM-generated forecasts are trustworthy.

Llama-3.3-70B, an open-source model released by Meta with a publicly documented training-data knowledge cutoff of December 2023. We chose it for replicability: the checkpoint is freely downloadable, and it exposes the per-token probabilities needed to compute LAP. The same test runs on any model that exposes token log-probabilities — including OpenAI's API via the logprobs flag.

Yes! That's the whole point. The LAP test is designed to be a practical, cost-efficient diagnostic tool. It doesn't require retraining the model or accessing proprietary training data. Any researcher using an open-source LLM for forecasting can apply our test to check whether their results might be affected by lookahead bias.

Restricting evaluation to data after the LLM's training cutoff eliminates lookahead bias by construction, but the available window is short and statistical power is limited. Training leak-free models from scratch is computationally expensive. Our LAP test instead diagnoses contamination directly on whatever sample you already have — in-sample or otherwise — using only the model's token probabilities. We then use the post-cutoff window as a clean placebo, and confirm that the LAP signature collapses exactly there.

Can LLMs Predict the Future or Are They Just Remembering?

The Lookahead Bias Problem

LLMs Learn from the Internet

Time Gets Scrambled

Like Seeing the Answer Key

Why It Matters

A Real-World Example

Our Solution: The LAP Test

Ask the LLM to Recall the Outcome

Ask the LLM to Make Predictions

Check If Memorization Drives Accuracy

Compute LAP Yourself

Send a date-only recall query

Read the answer-position probabilities

Run the detection regression

What We Found

News Headlines & Stock Returns

Earnings Calls & Capital Expenditure

How do we know it's real?

The Team

Frequently AskedQuestions

Frequently Asked
Questions