---
title: "Study sketch — Temporal consistency of LLM-agent runtime recommendations"
type: synthesis
tags: [study-sketch, apm, ppm, temporal-consistency, evaluation, riess]
sources:
  - "[[sources/2023-riess-temporal-loss-remaining-cycle-time]]"
  - "[[sources/2026-calvanese-agentic-bpm-manifesto]]"
  - "[[sources/2022-kubrak-prescriptive-ppm-slr]]"
  - "[[sources/2014-groger-prescriptive-analytics-bpo]]"
  - "[[sources/2017-tax-lstm-process-prediction]]"
  - "[[sources/2021-bukhsh-processtransformer]]"
  - "[[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm]]"
  - "[[sources/2025-fournier-agentic-ai-process-observability]]"
  - "[[sources/2023-anjum-rocca-phi403-lecture-18-risky-predictions]]"
created: 2026-04-20
updated: 2026-04-20
---

# Study sketch — Temporal consistency of LLM-agent runtime recommendations

## Motivation and gap

[[sources/2023-riess-temporal-loss-remaining-cycle-time|Riess 2023]] introduced **Temporal Consistency (TC)** as a third axis of PPM evaluation alongside accuracy and earliness: a model whose predictions oscillate across prefix lengths is operationally unusable regardless of mean accuracy. The [[sources/2026-calvanese-agentic-bpm-manifesto|APM Manifesto]] elevates runtime [[concepts/conversational-actionability|actionability]] as a core capability — agents are expected to *recommend* interventions during execution — and [[sources/2022-kubrak-prescriptive-ppm-slr|Kubrak 2022]] catalogues "intervention policy" as the sixth (and least-developed) dimension of [[concepts/prescriptive-process-monitoring|PrPM]]. Yet no existing work evaluates whether LLM-agent recommendations are temporally consistent. If an agent flips its recommendation mid-case, downstream queue prioritisation, staffing, and customer-communication decisions destabilise — the same operational pathology Riess named for LSTM predictions, now in a higher-stakes setting.
[[sources/2023-anjum-rocca-phi403-lecture-18-risky-predictions|PHI403 L18]]'s Popperian framing sharpens this: recommendations that can't commit to stable falsifiable predictions aren't scientific outputs.

## Research questions

- **RQ1 (measurement).** How do LLM-agents compare to conventional PPM models ([[sources/2017-tax-lstm-process-prediction|Tax LSTM]], [[sources/2021-bukhsh-processtransformer|ProcessTransformer]]) on the three-axis evaluation (accuracy, earliness, TC) when making runtime recommendations on identical prefix sequences?
- **RQ2 (mechanism).** Is recommendation-flipping driven primarily by [[concepts/aleatoric-vs-epistemic-uncertainty|epistemic uncertainty]] (longer reasoning traces, low prompt-signal) or [[concepts/aleatoric-vs-epistemic-uncertainty|aleatoric]] (process-variability-induced genuine ambiguity)?
- **RQ3 (intervention).** Do specific prompt-engineering techniques (chain-of-thought, explicit prior-commitment instructions, confidence-calibration requests) improve TC without degrading accuracy or earliness?

## Hypotheses

- **H1.** LLM-agents exhibit significantly more recommendation-flips per prefix than calibrated LSTM/Transformer baselines on the same cases — the "3-axis generalisation" hypothesis.
- **H2.** Recommendation-flipping is positively correlated with reasoning-trace length (a proxy for epistemic uncertainty per [[sources/2025-fournier-agentic-ai-process-observability|Fournier's observability framing]]) and weakly correlated with process entropy (aleatoric).
- **H3.** Chain-of-thought prompting *worsens* TC despite improving accuracy (an explicit-cost trade-off); explicit "commit to prior unless strong evidence" instructions improve TC at modest accuracy cost.
- **H4.** Across BPIC logs with varying regularity, TC degradation is larger on [[concepts/lasagna-spaghetti-processes|spaghetti]] logs than lasagna logs — generalising Riess 2023's log-dependency finding to agentic recommendations.

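
The process-entropy proxy that H2 relies on can be illustrated with a small sketch. It assumes a first-order transition model estimated from the log and takes the Shannon entropy (in bits) of the next-activity distribution at the current activity. The function names `transition_counts` and `local_transition_entropy` and the toy traces are hypothetical, and the study may well settle on a richer state abstraction than "last activity".

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence


def transition_counts(traces: Iterable[Sequence[str]]) -> Dict[str, Counter]:
    """Count first-order activity transitions (current -> next) over all traces."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    return counts


def local_transition_entropy(counts: Dict[str, Counter], state: str) -> float:
    """Shannon entropy (bits) of the observed next-activity distribution at `state`.

    Assumption: the process state is approximated by the last observed activity.
    States with no observed successor (e.g. end activities) get entropy 0.0.
    """
    successors = counts.get(state)
    if not successors:
        return 0.0
    total = sum(successors.values())
    return sum((c / total) * math.log2(total / c) for c in successors.values())


# Toy log, purely illustrative (not taken from any of the study's datasets).
toy_traces = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"]]
toy_counts = transition_counts(toy_traces)
entropy_at_b = local_transition_entropy(toy_counts, "B")  # B -> C or D, equally often
entropy_at_c = local_transition_entropy(toy_counts, "C")  # C -> D only
```

In this toy log, the state `B` is maximally ambiguous between two successors while `C` is deterministic, which is the contrast the aleatoric proxy is meant to capture.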
## Method

**Datasets.** Direct reuse of Riess 2023's public logs (Sepsis, Helpdesk, BPIC Traffic Fines, Hospital Billing) to preserve comparability + one private [[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm|BRAGE-adjacent]] Telenor customer-service log for industry validity.

**Recommendation targets.** Three canonical PPM outputs per [[syntheses/ppm-landscape|the landscape synthesis]]: next activity, remaining time, binary outcome. Each target evaluated independently.

**Models compared.**

- **LLM-agents (3):** Claude, GPT-4-class, Gemma2 — the BRAGE setup extended to per-prefix recommendation.
- **PPM baselines (3):** unweighted-L1 LSTM (Tax), temporally-weighted-L1 LSTM (Riess 2023 exponential variant), ProcessTransformer.
- **Rule-based control:** most-likely-next-activity from transition matrix.

**TC instrument.** Adapted from Riess 2023 — for remaining-time: monotonicity violations per prefix. For next-activity / outcome: *recommendation-flip rate* = count of recommendation changes across consecutive prefixes of the same case, normalised by prefix count.

**Uncertainty instrumentation.** Per recommendation, collect (a) LLM reasoning-trace length (epistemic proxy), (b) local transition-entropy at the current process state (aleatoric proxy), (c) self-reported confidence.

**Experimental conditions.** 3 prompt strategies × 3 LLMs × 3 PPM-baseline models × 5 logs × 3 recommendation targets. Per-case bootstrap for confidence intervals.

**Analysis.** Three-axis tables per Riess 2023 convention + regression of flip-rate on uncertainty proxies (H2) + ANOVA with prompt strategy × model interaction (H3).

## Validity threats

- **LLM version drift** during the study — pin model versions; publish exact APIs used.
- **Prompt-underfitting** for non-CoT control: use standardised prompt templates published in an appendix.
- **Construct of "flipping"**: some flips are *corrections* on new evidence — mitigated by separating *Bayes-coherent updates* (flip toward higher-likelihood option given new events) from incoherent oscillation.

## Deliverables and venues

- **Short paper.** Target: **Nordic Machine Intelligence** (direct continuity with Riess 2023 that introduced TC) or **ICPM 2027**. A crisp extension paper, 10–12 pages.
- **Benchmark harness.** Public code + evaluation scripts on GitHub. Consumers of Riess 2023 can immediately run TC on their own agents.

## Connections

Extends [[concepts/remaining-time-prediction|the three-axis evaluation]] beyond remaining-time to full PrPM recommendation space. Feeds [[concepts/prescriptive-process-monitoring|PrPM]] with a missing evaluation lens. Connects [[concepts/conversational-actionability]] (APM capability) to a measurable operational property. Empirically grounds the operational framing in [[syntheses/riess-research-arc|Riess 2023's commitment #5]].
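
As a concrete reading of the TC instrument described under Method, the following is a minimal sketch of the two per-case measures. It rests on stated assumptions: the flip rate divides by the number of prefixes (as the Method text says), and a "monotonicity violation" is taken to be any consecutive prefix pair where the predicted remaining time increases, which may differ from the exact definition in Riess 2023. The function names, the `tolerance` parameter, and the example values are hypothetical.

```python
from typing import Hashable, Sequence


def flip_rate(recommendations: Sequence[Hashable]) -> float:
    """Recommendation changes across consecutive prefixes of one case,
    normalised by the number of prefixes."""
    n = len(recommendations)
    if n == 0:
        return 0.0
    flips = sum(
        1 for prev, cur in zip(recommendations, recommendations[1:]) if prev != cur
    )
    return flips / n


def monotonicity_violation_rate(
    remaining_time: Sequence[float], tolerance: float = 0.0
) -> float:
    """Share of prefixes at which predicted remaining time rises versus the
    previous prefix.

    Assumption: a violation is any increase larger than `tolerance`; this is an
    illustrative definition, not necessarily the one used in Riess 2023.
    """
    n = len(remaining_time)
    if n == 0:
        return 0.0
    violations = sum(
        1
        for prev, cur in zip(remaining_time, remaining_time[1:])
        if cur > prev + tolerance
    )
    return violations / n


# Illustrative per-prefix outputs for a single case (made-up values).
case_recommendations = ["escalate", "escalate", "wait", "escalate", "escalate"]
case_remaining_hours = [10.0, 8.0, 9.0, 5.0]

case_flip_rate = flip_rate(case_recommendations)  # 2 flips over 5 prefixes
case_violation_rate = monotonicity_violation_rate(case_remaining_hours)  # 1 rise over 4 prefixes
```

Neither function distinguishes Bayes-coherent corrections from incoherent oscillation; separating the two, as the validity-threats section proposes, would need the likelihood of each option given the new events and is left out of this sketch.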