---
title: "Riess Research Arc — Evaluation Rigour as Research Programme"
type: synthesis
tags: [ppm, concept-drift, simulation, llm, riess, evaluation-validity]
sources:
  - "[[sources/2022-riess-metaheuristics-concept-drift-survey]]"
  - "[[sources/2023-riess-temporal-loss-remaining-cycle-time]]"
  - "[[sources/2023-riess-phd-thesis-ppm]]"
  - "[[sources/2024-riess-synbps-simulation-framework]]"
  - "[[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm]]"
created: 2026-04-20
updated: 2026-04-20
---

# The Riess Research Arc — Evaluation Rigour as Research Programme

Critical synthesis of the five Mike Riess works in this wiki (2022–2025). The arc traces a coherent research programme anchored by a single methodological commitment: **evaluation validity** as the hinge on which PPM research either compounds or spins its wheels.

## The arc, chronologically

**2022 · Drift & model maintenance** ([[sources/2022-riess-metaheuristics-concept-drift-survey]]).
Survey of metaheuristics for concept-drift adaptation across fields. The substantive finding is that population-based methods (GA, PSO) dominate, with an evolution from single-task AutoML to Full Model Selection. It matters less than the meta-finding: the literature is methodologically uneven. Class distributions go unreported, drift characteristics go undocumented, and no study compares population-based metaheuristics head-to-head on the same drift problem. Riess's recommendation foreshadows all later work: *report drift characteristics alongside performance; evaluate metaheuristics as models themselves.*

**2023 · PPM earliness and the third axis** ([[sources/2023-riess-temporal-loss-remaining-cycle-time]]).
Riess tackles remaining-cycle-time prediction with the established [[sources/2017-tax-lstm-process-prediction|Tax]]/[[sources/2017-navarin-lstm-data-aware-remaining-time|Navarin]] LSTM stack. The move is to interrogate the loss function itself: three temporal-decay L1 variants (exponential, power, moderate) are compared to unweighted MAE. The substantive result is modest (exponential decay helps on 2 of 4 logs). The paper's durable contribution is instead the **axiomatic addition of Temporal Consistency (TC)**: the requirement, previously ignored by the field, that a case's remaining-time predictions decrease monotonically as it progresses. Models that are accurate and early can still flip direction, confusing operational decision support. The Riess proposal for remaining-time PPM is therefore a **three-axis evaluation**: accuracy, earliness, temporal consistency.
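
A minimal sketch of how a temporal-decay L1 loss and a temporal-consistency score could be computed for one case. The weight formulas, the `rate` parameter, and the TC definition (share of consecutive prefixes where the prediction does not rise) are illustrative assumptions, not the paper's exact formulations; the paper's "moderate" variant is left out because its form is not given here.

```python
import numpy as np

def decay_weights(n_steps, kind="exponential", rate=0.1):
    """Hypothetical per-prefix weights that emphasise early prefixes."""
    t = np.arange(n_steps, dtype=float)
    if kind == "uniform":
        w = np.ones(n_steps)              # reduces to plain MAE
    elif kind == "exponential":
        w = np.exp(-rate * t)             # assumed exponential decay
    elif kind == "power":
        w = (t + 1.0) ** (-rate)          # assumed power-law decay
    else:
        raise ValueError(f"unknown kind: {kind}")
    return w / w.sum()                    # normalise so weights sum to 1

def weighted_mae(y_true, y_pred, weights):
    """Weighted L1 error over the prefixes of one case."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum(weights * np.abs(y_true - y_pred)))

def temporal_consistency(y_pred, tol=0.0):
    """Share of consecutive prefixes where the prediction does not increase."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.size < 2:
        return 1.0
    return float(np.mean(np.diff(y_pred) <= tol))

# Toy case: true remaining time falls steadily, the prediction rises once.
y_true = [10, 8, 6, 4, 2]
y_pred = [9, 9.5, 5, 4.5, 1]
print(weighted_mae(y_true, y_pred, decay_weights(5, "uniform")))      # approximately 1.0
print(weighted_mae(y_true, y_pred, decay_weights(5, "exponential")))
print(temporal_consistency(y_pred))                                   # 0.75
```

The toy case shows why the third axis is separate from the other two: the prediction stays within 1.5 time units of the truth at every prefix, yet it rises once, so its consistency score is below 1.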

**2024 · Simulation for external validity** ([[sources/2024-riess-synbps-simulation-framework]]).
Here the arc turns. Riess names the evaluation problem explicitly: PPM benchmarks are N≈3–9 event logs, each from one organisation, with no ability to manipulate data-generating-process factors. Existing BPS tools optimise for *ecological* validity (calibrated from real logs) but cannot answer *internal*-validity questions about what drives model performance. **SynBPS** is a parametric Markov-chain-based framework with user-controlled levers (memory order, state-space size, transition entropy, activity-duration distributions, stability/drift). It is explicitly a tool for *hypothesis testing* rather than realism. The research-methodology move is almost Fisherian: introduce controlled synthetic design into a field addicted to benchmark competitions. Python, open-source, pip-installable.
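
To make the idea of a user-controlled lever concrete, here is a minimal sketch of a first-order Markov trace generator with a single `randomness` knob that moves transition entropy between zero and its maximum. It does not use the SynBPS API; all function names, the fixed trace length, and the exponential duration distribution are simplifying assumptions for illustration.

```python
import numpy as np

def make_transition_matrix(n_states, randomness, rng):
    """Blend a deterministic successor map with uniform noise.

    randomness=0.0 gives every activity exactly one successor;
    randomness=1.0 gives uniform transitions (maximum entropy).
    """
    det = np.zeros((n_states, n_states))
    det[np.arange(n_states), rng.permutation(n_states)] = 1.0
    uniform = np.full((n_states, n_states), 1.0 / n_states)
    return (1.0 - randomness) * det + randomness * uniform

def mean_row_entropy(P):
    """Average Shannon entropy (in bits) of the rows of P."""
    safe = np.where(P > 0, P, 1.0)        # log2(1) = 0, so zero entries add nothing
    return float(np.mean(-np.sum(P * np.log2(safe), axis=1)))

def simulate_trace(P, length, mean_duration, rng):
    """Sample one fixed-length trace of (activity, duration) pairs."""
    n_states = P.shape[0]
    state = rng.integers(n_states)
    trace = []
    for _ in range(length):
        trace.append((int(state), float(rng.exponential(mean_duration))))
        state = rng.choice(n_states, p=P[state])
    return trace

rng = np.random.default_rng(seed=0)
for randomness in (0.0, 0.5, 1.0):
    P = make_transition_matrix(4, randomness, rng)
    print(randomness, round(mean_row_entropy(P), 3))
print(simulate_trace(P, length=5, mean_duration=2.0, rng=rng))
```

Sweeping one knob while holding the rest fixed is the kind of controlled manipulation that a handful of real event logs cannot offer.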

**2025 · LLM zero-shot as a drift-dodge in customer service** ([[sources/2025-riess-jorgensen-brage-benchmark-norwegian-llm]]).
Now at [[entities/telenor|Telenor]], Riess applies the 2022 drift framing to a concrete industrial pain-point. Supervised call-topic classifiers drift as product portfolios and customer issues evolve; zero-shot LLMs, steered by the same codebook given to human annotators, are proposed as a lower-maintenance alternative. The **BRAGE benchmark** (300 Norwegian transcripts, 8 product categories) quantifies the gap. Key findings: instruction-tuning matters far more than Norwegian pre-training; Gemma2 English-instruction models beat dedicated Norwegian fine-tunes; BRAGE correlates with HellaSwag (commonsense reasoning) but not NorNE (NER), meaning the benchmark probes reasoning-over-long-instructions rather than surface Norwegian coverage. The roughly 60% peak accuracy is explicitly labelled *not production-ready*. Again, evaluation rigour is the message.
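
The codebook-steered zero-shot setup can be sketched as a prompt builder plus an accuracy loop around any model callable. Everything below is hypothetical: the category names, the prompt wording, the transcripts, and the stub model are placeholders, not the BRAGE codebook, data, or evaluation code.

```python
from typing import Callable, Dict, List, Tuple

def build_prompt(codebook: Dict[str, str], transcript: str) -> str:
    """Render the annotator codebook and one transcript as a zero-shot prompt."""
    lines = ["Classify the call transcript into exactly one category.", "", "Codebook:"]
    for label, definition in codebook.items():
        lines.append(f"- {label}: {definition}")
    lines += ["", "Transcript:", transcript, "", "Answer with the category label only."]
    return "\n".join(lines)

def parse_label(raw: str, codebook: Dict[str, str]) -> str:
    """Map raw model output to a known label, or 'unknown' if it matches none."""
    cleaned = raw.strip().lower()
    for label in codebook:
        if label.lower() == cleaned:
            return label
    return "unknown"

def accuracy(examples: List[Tuple[str, str]],
             codebook: Dict[str, str],
             model: Callable[[str], str]) -> float:
    """Share of transcripts whose parsed prediction equals the gold label."""
    correct = 0
    for transcript, gold in examples:
        prediction = parse_label(model(build_prompt(codebook, transcript)), codebook)
        correct += int(prediction == gold)
    return correct / len(examples)

# Hypothetical codebook and examples, for illustration only.
codebook = {
    "mobile": "Questions about mobile subscriptions or handsets.",
    "broadband": "Questions about fixed or wireless broadband.",
    "tv": "Questions about TV packages or decoders.",
}
examples = [
    ("My phone plan ran out of data.", "mobile"),
    ("The router keeps restarting.", "broadband"),
    ("I want to upgrade my handset.", "mobile"),
    ("The decoder shows no signal.", "tv"),
]

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call; always answers 'mobile' with stray whitespace."""
    return " Mobile \n"

print(accuracy(examples, codebook, stub_model))  # 0.5
```

Because the labels live in the codebook rather than in trained weights, adapting to a changed product portfolio means editing a dictionary entry instead of relabelling data and retraining, which is the maintenance argument in the paragraph above.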

## Recurring methodological commitments

1. **Evaluation rigour as the primary research problem.** Every single-method paper (2023 temporal loss, 2024 SynBPS, 2025 BRAGE) either critiques existing evaluation practice or proposes a new evaluation instrument. The 2022 survey does the same by meta-analysis.

2. **Control-variable manipulation via simulation.** SynBPS (2024) is the infrastructure; the 2023 loss-function paper anticipates it (*"future works might further study the relationship between the curvature of the loss and distribution characteristics of the training data, for instance via simulation"*).

3. **External validity via public benchmarks + scepticism about them.** Riess uses public event logs (Sepsis, Helpdesk, BPIC Traffic Fines, Hospital Billing) where the field expects them, but explicitly argues in 2024 that they are insufficient for understanding *why* a method works.

4. **Open-source, Python, community-facing.** SynBPS on PyPI; BRAGE code public (data private for business reasons); explicit acknowledgement that PPM research is "primarily open source and performed using the Python programming language" (2024).

5. **Operational framing.** Earliness matters because of queue prioritisation and resource planning; temporal consistency matters because of shift scheduling; LLM accuracy matters because of customer-service analytics pipelines. The research questions come from practice.

## Connections to the broader wiki

- **PPM canon**: Riess builds on [[sources/2017-tax-lstm-process-prediction|Tax et al. 2017]], [[sources/2017-navarin-lstm-data-aware-remaining-time|Navarin et al. 2017]], [[sources/2016-teinemaa-structured-unstructured-ppm|Teinemaa et al. 2016]], [[sources/2019-verenich-survey-ppm|Verenich et al. 2019]], [[sources/2020-rama-maneiro-deep-learning-ppm-review|Rama-Maneiro et al. 2020]]. See [[concepts/predictive-process-monitoring]] and [[concepts/remaining-time-prediction]].
- **Drift**: extends [[concepts/concept-drift]]; the 2022 survey is the reference.
- **Simulation**: establishes [[concepts/business-process-simulation]] as a PPM-methodology page complementing [[methods/process-simulation]].
- **LLM benchmarking**: bridges to [[concepts/ai-agent-benchmarks]] with a low-resource-language telecom twist.
- **Philosophy-of-science bridge**: the evaluation-validity programme maps cleanly onto [[concepts/rct-limitations]] (external validity, ecological fallacy) and [[concepts/interventionist-theory-of-causation]] (SynBPS as controlled-intervention instrument for PPM methods).

## Open research trajectories

Each commitment above opens clean forward paths. The arc's characteristic move — name the boundary precisely, publish with honesty about it — turns each "limit" into a well-scoped next study rather than a loose end.

- **The loss × consistency trade-off as a joint-optimisation target.** The 2023 paper carefully documents that temporal-decay weighting can improve accuracy + earliness while degrading [[concepts/remaining-time-prediction|temporal consistency]] at certain prefixes. Read as a research signal, this is a precisely-located design frontier: the field now has a concrete trade-off surface to explore rather than a vague "loss functions matter" intuition. Natural follow-ups include multi-objective loss design with TC as an explicit constraint, prefix-phase-aware loss switching (tighter decay early, stabilising late), or learnable per-prefix weighting. The same frontier generalises cleanly to agentic recommendations — see [[syntheses/study-sketch-temporal-consistency-agents]].

- **Prescriptive queueing, scoped to its valid operating regime.** Paper III of the thesis (customer-service queue prioritisation by predicted loyalty, with Scholderer) reports that under a 60-hour service level the prioritisation converges toward FCFS. Read constructively, this is a *precisely-scoped operational finding*: it tells practitioners and researchers where predicted-loyalty prioritisation can realistically add value as a function of the SLA. That makes the paper a strong anchor for the under-developed sixth dimension of [[sources/2022-kubrak-prescriptive-ppm-slr|Kubrak 2022]] — [[concepts/intervention-policy|intervention policy]]. A natural follow-up study would characterise the SLA × case-complexity regime in which predicted-loyalty prioritisation dominates FCFS, using SynBPS to vary SLA and arrival intensity as free parameters. Publishing Paper III in this sharpened framing — *"when does predicted-loyalty prioritisation matter, and when does it not?"* — would add an empirical foothold the PrPM literature presently lacks.

- **SynBPS next generation.** The 2024 paper is explicit about its scope boundaries: Markov backbone, no concurrency, resources absorbed into duration distributions. Clearly-named boundaries are launch points for follow-on work. Three forward paths are already visible:
  - ***SynBPS-ABM*** — agent-based backbone to support concurrent activities and resource-aware prescriptive use-cases (including a revisit of Paper III's queueing problem).
  - ***SynBPS-APM*** — [[concepts/framed-autonomy|frame]]-aware agent-hooks for evaluating APM-style systems under controlled process-characteristic perturbations; see the companion [[syntheses/study-sketch-synbps-apm|study sketch]].
  - **Calibration bridges** — systematic methodology for mapping between SynBPS's synthetic regimes and observed industry logs, engaging directly with the Cartwright-Hardie external-validity question raised in [[sources/2023-anjum-rocca-phi403-lecture-11-is-more-data-better|PHI403 L11]].

- **Drift from PPM models to agent trajectories.** The 2022 drift survey was scoped to process-oriented machine learning; the 2025 BRAGE paper moves toward LLM-based classification in a drift-prone industry setting. A natural next step joins these: a taxonomy and empirical characterisation of drift *in agent trajectories themselves* (model-version, prompt, tool, upstream-data), with [[concepts/mape-k-loop|MAPE-K]] adaptation policies drawn from Riess 2022's metaheuristics catalogue — sketched in [[syntheses/study-sketch-agent-trajectory-drift]].

- **From controlled instruments to a platform.** Across SynBPS (2024), BRAGE (2025), and the PhD's evaluation protocols, a through-line is the building of *instruments* — testbeds, benchmarks, loss functions, evaluation axes. A platform move would integrate them into a single evaluation stack that future PPM/PrPM/APM contributions can adopt wholesale, lowering the activation energy for the methodological standards the 2022 survey recommends.
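
As one possible reading of the first trajectory above (temporal consistency as an explicit term in the loss), the following sketch adds a penalty on upward moves in the predicted remaining time to a weighted L1 fit. The penalty form and the `lam` trade-off parameter are assumptions made for illustration; neither is taken from the 2023 paper, and a training setup would need the same expression written in an autodiff framework.

```python
import numpy as np

def tc_penalised_loss(y_true, y_pred, weights, lam):
    """Weighted L1 fit plus a penalty on increases between consecutive predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    fit = np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)
    rises = np.maximum(np.diff(y_pred), 0.0)      # only upward moves are penalised
    penalty = rises.mean() if rises.size else 0.0
    return float(fit + lam * penalty)

y_true = [10, 8, 6, 4, 2]
y_pred = [9, 9.5, 5, 4.5, 1]
uniform = np.ones(5)
print(tc_penalised_loss(y_true, y_pred, uniform, lam=0.0))  # 1.0
print(tc_penalised_loss(y_true, y_pred, uniform, lam=2.0))  # 1.25
```

Varying `lam` traces out the trade-off surface the bullet describes: at `lam=0` the objective is the plain weighted error, and larger values buy consistency at whatever cost in accuracy the data imposes.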

## One-line framing

*Across five works, Riess builds the evaluation infrastructure that predictive process monitoring needs to compound: controlled testbeds, explicit evaluation axes (earliness, temporal consistency, drift characterisation), and honest reporting of the operating regimes where each method works. It is the foundation a next decade of PPM / PrPM / APM research can build on.*