---
title: "Study sketch — Temporal consistency of LLM-agent runtime recommendations"
type: synthesis
tags: [study-sketch, apm, ppm, temporal-consistency, evaluation, riess]
sources:
  - "[[sources/2023-riess-temporal-loss-remaining-cycle-time]]"
  - "[[sources/2026-calvanese-agentic-bpm-manifesto]]"
  - "[[sources/2022-kubrak-prescriptive-ppm-slr]]"
  - "[[sources/2014-groger-prescriptive-analytics-bpo]]"
  - "[[sources/2017-tax-lstm-process-prediction]]"
  - "[[sources/2021-bukhsh-processtransformer]]"
  - "[[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm]]"
  - "[[sources/2025-fournier-agentic-ai-process-observability]]"
  - "[[sources/2023-anjum-rocca-phi403-lecture-18-risky-predictions]]"
created: 2026-04-20
updated: 2026-04-20
---

# Study sketch — Temporal consistency of LLM-agent runtime recommendations

## Motivation and gap

[[sources/2023-riess-temporal-loss-remaining-cycle-time|Riess 2023]] introduced **Temporal Consistency (TC)** as a third axis of PPM evaluation alongside accuracy and earliness: a model whose predictions oscillate across prefix lengths is operationally unusable regardless of mean accuracy. The [[sources/2026-calvanese-agentic-bpm-manifesto|APM Manifesto]] elevates runtime [[concepts/conversational-actionability|actionability]] as a core capability — agents are expected to *recommend* interventions during execution — and [[sources/2022-kubrak-prescriptive-ppm-slr|Kubrak 2022]] catalogues "intervention policy" as the sixth (and least-developed) dimension of [[concepts/prescriptive-process-monitoring|PrPM]]. Yet no existing work evaluates whether LLM-agent recommendations are temporally consistent. If an agent flips its recommendation mid-case, downstream queue prioritisation, staffing, and customer-communication decisions destabilise — the same operational pathology Riess named for LSTM predictions, now in a higher-stakes setting. [[sources/2023-anjum-rocca-phi403-lecture-18-risky-predictions|PHI403 L18]]'s Popperian framing sharpens this: recommendations that can't commit to stable falsifiable predictions aren't scientific outputs.

## Research questions

- **RQ1 (measurement).** How do LLM-agents compare to conventional PPM models ([[sources/2017-tax-lstm-process-prediction|Tax LSTM]], [[sources/2021-bukhsh-processtransformer|ProcessTransformer]]) on the three-axis evaluation (accuracy, earliness, TC) when making runtime recommendations on identical prefix sequences?
- **RQ2 (mechanism).** Is recommendation-flipping driven primarily by [[concepts/aleatoric-vs-epistemic-uncertainty|epistemic uncertainty]] (longer reasoning traces, low prompt-signal) or by [[concepts/aleatoric-vs-epistemic-uncertainty|aleatoric uncertainty]] (genuine ambiguity induced by process variability)?
- **RQ3 (intervention).** Do specific prompt-engineering techniques (chain-of-thought, explicit prior-commitment instructions, confidence-calibration requests) improve TC without degrading accuracy or earliness?

## Hypotheses

- **H1.** LLM-agents exhibit significantly more recommendation-flips per prefix than calibrated LSTM/Transformer baselines on the same cases — the "3-axis generalisation" hypothesis.
- **H2.** Recommendation-flipping is positively correlated with reasoning-trace length (a proxy for epistemic uncertainty per [[sources/2025-fournier-agentic-ai-process-observability|Fournier's observability framing]]) and only weakly correlated with local transition entropy (the aleatoric proxy).
- **H3.** Chain-of-thought prompting *worsens* TC despite improving accuracy (an explicit-cost trade-off); explicit "commit to prior unless strong evidence" instructions improve TC at modest accuracy cost.
- **H4.** Across BPIC logs with varying regularity, TC degradation is larger on [[concepts/lasagna-spaghetti-processes|spaghetti]] logs than lasagna logs — generalising Riess 2023's log-dependency finding to agentic recommendations.

## Method

**Datasets.** Direct reuse of Riess 2023's public logs (Sepsis, Helpdesk, BPIC Traffic Fines, Hospital Billing) to preserve comparability + one private [[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm|BRAGE-adjacent]] Telenor customer-service log for industry validity.

**Recommendation targets.** Three canonical PPM outputs per [[syntheses/ppm-landscape|the landscape synthesis]]: next activity, remaining time, binary outcome. Each target evaluated independently.

**Models compared.**
- **LLM-agents (3):** Claude, GPT-4-class, Gemma2 — the BRAGE setup extended to per-prefix recommendation.
- **PPM baselines (3):** unweighted-L1 LSTM (Tax), temporally-weighted-L1 LSTM (Riess 2023 exponential variant), ProcessTransformer.
- **Rule-based control:** most-likely-next-activity from transition matrix.
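
The rule-based control can be read as a first-order transition-count model. The following is a minimal sketch under that reading, assuming each trace is a list of activity labels; `fit_transitions` and `most_likely_next` are hypothetical names, and the alphabetical tie-break is an assumption added for determinism.

```python
from collections import Counter, defaultdict

def fit_transitions(traces):
    """Count first-order transitions a -> b over all training traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    return counts

def most_likely_next(counts, prefix):
    """Recommend the most frequent successor of the prefix's last activity.

    Returns None for an empty prefix or an activity never seen with a
    successor. Ties are broken alphabetically (an assumption of this sketch).
    """
    if not prefix or prefix[-1] not in counts:
        return None
    successors = counts[prefix[-1]]
    return min(successors, key=lambda a: (-successors[a], a))

# Toy training log (illustrative only, not one of the study's logs).
train = [
    ["register", "triage", "treat", "discharge"],
    ["register", "triage", "discharge"],
    ["register", "triage", "treat", "treat", "discharge"],
]
counts = fit_transitions(train)
print(most_likely_next(counts, ["register", "triage"]))
```

Because the control conditions only on the last activity, its recommendation can change only when the current activity changes, which makes it a useful floor for flip-rate comparisons.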

**TC instrument.** Adapted from Riess 2023 — for remaining-time: monotonicity violations per prefix. For next-activity / outcome: *recommendation-flip rate* = count of recommendation changes across consecutive prefixes of the same case, normalised by prefix count.
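
Both instruments reduce to counting over consecutive prefixes of one case. A minimal sketch, assuming a "monotonicity violation" means the predicted remaining time rises from one prefix to the next; this is one reading of the term and may differ from Riess 2023's exact definition. The function names and the `tolerance` parameter are hypothetical.

```python
def flip_rate(recommendations):
    """Recommendation changes between consecutive prefixes, divided by prefix count.

    Dividing by the number of prefixes follows the definition in the text;
    dividing by the number of consecutive pairs (n - 1) would be an alternative.
    """
    n = len(recommendations)
    if n == 0:
        return 0.0
    flips = sum(
        1 for prev, cur in zip(recommendations, recommendations[1:]) if prev != cur
    )
    return flips / n

def monotonicity_violations(remaining_time_preds, tolerance=0.0):
    """Count consecutive prefix pairs where predicted remaining time goes up."""
    return sum(
        1
        for prev, cur in zip(remaining_time_preds, remaining_time_preds[1:])
        if cur > prev + tolerance
    )

# Toy case: two flips (approve -> reject -> approve) over five prefixes.
case_recs = ["approve", "approve", "reject", "approve", "approve"]
print(flip_rate(case_recs))

# Toy case: one increase in predicted remaining time (8.0 -> 9.5).
case_times = [10.0, 8.0, 9.5, 6.0, 6.0]
print(monotonicity_violations(case_times))
```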

**Uncertainty instrumentation.** Per recommendation, collect (a) LLM reasoning-trace length (epistemic proxy), (b) local transition-entropy at the current process state (aleatoric proxy), (c) self-reported confidence.
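
The aleatoric proxy (b) could be computed as the Shannon entropy of the empirical next-activity distribution at the current activity. This is a sketch under that assumption: treating "process state" as the last observed activity is a simplification, and `transition_counts` and `local_transition_entropy` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def transition_counts(traces):
    """Count first-order transitions a -> b over a list of traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return counts

def local_transition_entropy(counts, activity):
    """Shannon entropy (in bits) of the next-activity distribution after `activity`.

    Returns 0.0 when the activity has no observed successor.
    """
    successors = counts.get(activity)
    if not successors:
        return 0.0
    total = sum(successors.values())
    return sum(-(c / total) * math.log2(c / total) for c in successors.values())

# Toy log (illustrative only): A branches evenly to B or C; B always leads to D.
toy_log = [["A", "B", "D"], ["A", "C", "D"], ["A", "B", "D"], ["A", "C", "D"]]
toy_counts = transition_counts(toy_log)
print(local_transition_entropy(toy_counts, "A"))
print(local_transition_entropy(toy_counts, "B"))
```

An even two-way branch yields 1 bit and a deterministic step yields 0 bits, so higher values mark states where a flip is more plausibly driven by genuine process ambiguity.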

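**Experimental conditions.** LLM arm: 3 prompt strategies × 3 LLMs × 5 logs × 3 recommendation targets. Baseline arm: 3 PPM baselines × 5 logs × 3 recommendation targets, with no prompt factor, since prompt strategy applies only to the LLMs. The rule-based control is run on the next-activity target only. Per-case bootstrap for confidence intervals.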
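
A per-case bootstrap resamples whole cases rather than individual prefixes, so that the correlated prefixes of one case stay together. The sketch below is one way to do this, assuming one summary value per case (for example its flip rate) and a percentile interval for the mean; `bootstrap_ci` and its defaults are hypothetical choices, not settings fixed by this sketch.

```python
import random
import statistics

def bootstrap_ci(per_case_values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-case values.

    Each resample draws len(per_case_values) cases with replacement.
    """
    rng = random.Random(seed)
    n = len(per_case_values)
    means = sorted(
        statistics.fmean(rng.choices(per_case_values, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy per-case flip rates (illustrative values only).
flip_rates = [0.0, 0.2, 0.4, 0.1, 0.3, 0.5, 0.0, 0.2]
low, high = bootstrap_ci(flip_rates)
print(low, high)
```

Fixing the seed keeps the intervals reproducible across reruns of the harness.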

**Analysis.** Three-axis tables per Riess 2023 convention + regression of flip-rate on uncertainty proxies (H2) + ANOVA with prompt strategy × model interaction (H3).

## Validity threats

- **LLM version drift** during the study: pin model versions and publish the exact model identifiers and API endpoints used.
- **Prompt-underfitting** for non-CoT control: use standardised prompt templates published in an appendix.
- **Construct of "flipping"**: some flips are *corrections* on new evidence — mitigated by separating *Bayes-coherent updates* (flip toward higher-likelihood option given new events) from incoherent oscillation.
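
The separation of coherent updates from oscillation could be operationalised by checking each flip against a reference likelihood. The sketch below assumes a per-prefix probability table from some reference model is available; which model supplies it is left open, and `classify_flips` and `reference_probs` are hypothetical names.

```python
def classify_flips(recommendations, reference_probs):
    """Label each recommendation change as 'coherent' or 'incoherent'.

    recommendations[k] is the agent's recommendation at prefix k.
    reference_probs[k] maps each option to its probability under a reference
    model after observing prefix k (the reference model is an assumption).
    A flip is 'coherent' if the new option is more likely than the old one
    given the events seen so far.
    """
    labels = []
    for k in range(1, len(recommendations)):
        prev, cur = recommendations[k - 1], recommendations[k]
        if prev == cur:
            continue  # no flip at this prefix
        probs = reference_probs[k]
        coherent = probs.get(cur, 0.0) > probs.get(prev, 0.0)
        labels.append((k, "coherent" if coherent else "incoherent"))
    return labels

# Toy case (illustrative values only): the first flip follows the evidence,
# the second moves back toward the less likely option.
recs = ["approve", "reject", "approve"]
ref = [
    {"approve": 0.6, "reject": 0.4},
    {"approve": 0.3, "reject": 0.7},
    {"approve": 0.2, "reject": 0.8},
]
print(classify_flips(recs, ref))
```

Reporting the flip rate separately for each label would let the TC instrument penalise only the incoherent share.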

## Deliverables and venues

- **Short paper.** Target: **Nordic Machine Intelligence** (direct continuity with Riess 2023 that introduced TC) or **ICPM 2027**. A crisp extension paper, 10–12 pages.
- **Benchmark harness.** Public code + evaluation scripts on GitHub. Consumers of Riess 2023 can immediately run TC on their own agents.

## Connections

Extends [[concepts/remaining-time-prediction|the three-axis evaluation]] beyond remaining-time to full PrPM recommendation space. Feeds [[concepts/prescriptive-process-monitoring|PrPM]] with a missing evaluation lens. Connects [[concepts/conversational-actionability]] (APM capability) to a measurable operational property. Empirically grounds the operational framing in [[syntheses/riess-research-arc|Riess 2023's commitment #5]].