---
title: "Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs"
type: source
tags: [ppm, llm, in-context-learning, small-scale-data, beta-learner, semantic-hashing, prescriptive-future, foundational-llm-ppm, dumas, de-leoni]
authors: [Padella, Alessandro; de Leoni, Massimiliano; Dumas, Marlon]
year: 2026
venue: "arXiv:2601.11468v1 [cs.AI], 16 Jan 2026 — ACM submission (extends BPM Forum 2026 conference version)"
kind: paper
raw_path: "raw/Predictive process monitoring/2026-padella-llm-features-ppm.pdf"
arxiv: "2601.11468v1"
code_url: "https://github.com/Pado123/gui_xrecs_presc_analytics"
sources: []
key_claims:
  - "Gemini 2.5 Flash Thinking trained on only 100 traces (≤1.45% of available data) matches or surpasses CatBoost and PGTNet trained on the full event log across all three benchmark datasets (BPI12, Bac, Hospital) on both Total Time (MAE) and Activity Occurrence (F1)."
  - "Semantic hashing of activity/attribute names degrades LLM performance dramatically — Hospital MAE +1702%, Bac +71%, BPI12 +42% — empirically isolating the LLM's reliance on embodied domain knowledge from pure sequence correlations (Nemenyi post-hoc p < 0.01)."
  - "Manual coding of 150 LLM reasoning traces (50 per dataset × KPI) yields a catalogue of interpretable prediction patterns called β-learners (knn-act / knn-att / time-seq / path-pred for regression; Activity-/State-/Att-Based / Positive-Evidence for classification). LLM beats every individual β-learner by 6–80% in MAE/F1 → LLM does higher-order reasoning, not pattern replication."
  - "Good-Turing frequency estimation confirms 150 traces saturate β-learner pattern discovery — expected novel β-learners approach 0 after 100 new traces in every use case and KPI."
  - "A new trace-to-instance encoding ρ_seq designed for LLM context-length constraints — global attributes ⊕ (activity, duration) sequence ⊕ target — deliberately omits local attributes due to documented long-context degradation in LLMs."
  - "Modular 7-component prompt template (header, attribute description, output spec, running-trace format, domain background, examples, prediction request) is generic across logs/KPIs except for two analyst-specified sections."
  - "Future direction explicitly flagged: extend the framework to prescriptive process analytics — recommending actions, not only predictions."
created: 2026-05-11
updated: 2026-05-11
---

# Padella, de Leoni & Dumas 2026 — Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs

Extended journal-submission version of the authors' BPM Forum 2026 conference paper "Enhancing Predictive Process Monitoring on Small-Scale Event Logs Using LLMs" ([25] in the reference list). The paper runs 19 pages and covers three research questions, three real-life event logs (BPI12, Bac, Hospital), two KPIs (Total Time regression and Activity Occurrence classification), two state-of-the-art benchmarks (CatBoost, PGTNet), and a detailed methodological analysis of *what* LLMs actually do when generating PPM predictions.

This is the flagship empirical-and-interpretive source for **LLM-based PPM** in the wiki. Closes the open thread §6 of [[syntheses/ppm-landscape]] ("LLM-based PPM (post-2023) — still emerging; worth monitoring") and partially fills gap E.2.2 in [[syntheses/llm-bpm-reading-list]].

## What the paper does

The paper extends prior LLM-PPM work along three orthogonal axes:

1. **Generality (RQ1)** — broadens KPI scope from Total Time alone to Total Time + Activity Occurrence (regression + classification), evaluated statistically across three event logs.
2. **Semantic leverage (RQ2)** — introduces a semantic-hashing probe: every context-sensitive string (activity names, attribute names, attribute values, global attribute names) is replaced with a deterministic 4-character hash that preserves correlations while eliminating semantics. Comparing hashed-vs-non-hashed prediction quality isolates the contribution of the LLM's *embodied prior knowledge*.
3. **Reasoning anatomy (RQ3)** — distils 150 LLM reasoning traces per KPI into *β-learners*, a catalogue of reproducible reasoning patterns that can be re-implemented as standalone predictive models; then asks whether the LLM merely replicates them or performs higher-order reasoning.
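
The semantic-hashing probe in item 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of SHA-256, the hex alphabet, and the helper names `hash_token` and `hash_trace` are assumptions, and the note does not say how the paper handles hash collisions or numeric attribute values.

```python
import hashlib


def hash_token(token: str, length: int = 4) -> str:
    """Map a string to a deterministic `length`-character hex code.

    Assumption: SHA-256 truncated to `length` hex characters. Equal inputs
    always map to equal codes, so co-occurrence statistics survive while
    the human-readable meaning is removed.
    """
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:length]


def hash_trace(global_attrs: dict, events: list) -> tuple:
    """Hash attribute names, string attribute values, and activity names.

    `global_attrs` maps attribute name -> value; `events` is a list of
    (activity, duration) pairs. Durations and other non-string values are
    left untouched (an assumption made for this sketch).
    """
    hashed_globals = {
        hash_token(name): hash_token(value) if isinstance(value, str) else value
        for name, value in global_attrs.items()
    }
    hashed_events = [(hash_token(activity), duration) for activity, duration in events]
    return hashed_globals, hashed_events
```

With only four hex characters there are 65,536 possible codes, so a real probe would need to check for collisions within a log's vocabulary before use.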

## Key results

**RQ1 — prediction quality with 100 traces:**

| Use Case | Total Time MAE (lower better) | Activity Occurrence F1 (higher better) |
|---|---|---|
| **BPI12** | LLM 6508 ± 235 · CatBoost 6846 · PGTNet 3888 | LLM 0.77 ± 0.06 · CatBoost 0.80 |
| **Bac** | LLM **2265 ± 1072** · CatBoost 2647 · PGTNet 1245 (full-data benchmarks); when the benchmarks are retrained on 100 traces, the LLM outperforms both (4753 / 6393) | LLM **0.98 ± 0.04** · CatBoost 0.95 |
| **Hospital** | LLM **115 ± 34** · CatBoost 253 · PGTNet 97 (full-data benchmarks); when the benchmarks are retrained on 100 traces, the LLM outperforms both (259 / 132) | LLM 0.90 ± 0.08 · CatBoost 0.90 |

Statistical robustness: 20 repeated random samples of 100 traces; results reported with standard deviation.
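
The repeated-sampling protocol can be sketched as below. The `evaluate` callable is hypothetical: it stands in for whatever builds the prompt from a 100-trace sample and returns a score (MAE or F1) on the test set. Whether the paper reports the sample or the population standard deviation is not stated in this note; the sketch uses the sample version.

```python
import random
import statistics


def repeated_sample_eval(traces, evaluate, n_repeats=20, sample_size=100, seed=0):
    """Score `evaluate` on `n_repeats` random samples of `sample_size` traces.

    `evaluate` is a hypothetical callable taking a list of traces and
    returning a single float score. Returns (mean, sample standard deviation).
    """
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    scores = []
    for _ in range(n_repeats):
        sample = rng.sample(traces, sample_size)  # sampling without replacement
        scores.append(evaluate(sample))
    return statistics.mean(scores), statistics.stdev(scores)
```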

**RQ2 — semantic hashing:**

| Use Case | MAE non-hashed | MAE hashed | Degradation |
|---|---|---|---|
| BPI12 | 6508 ± 235 | 9246 ± 873 | **+42 %** \*\* |
| Bac | 2265 ± 1072 | 3880 ± 3254 | **+71 %** \*\*\* |
| Hospital | 115 ± 34 | 2077 ± 232 | **+1702 %** \*\* |

Activity Occurrence F1 degrades in the same direction but by a smaller margin (a drop of 2 % to 7 %). Nemenyi post-hoc tests reject the null of "no difference" in all six conditions (three logs × two KPIs). Conclusion: the LLM relies materially on the semantics encoded in activity names and attribute names, most strongly in the Hospital dataset, where activity strings like `LABORATORIO` and attributes like `Triage_Color` carry rich domain priors.
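
The Degradation column is the relative increase in MAE, which can be checked directly from the table. The sketch below recomputes it from the rounded means; the function name is illustrative. The BPI12 and Bac figures are reproduced exactly, while Hospital comes out near +1706 % rather than +1702 %, presumably because the paper computed it from unrounded MAE values (an assumption, not something this note can confirm).

```python
def relative_degradation(mae_plain: float, mae_hashed: float) -> float:
    """Relative MAE increase caused by hashing, in percent."""
    return 100.0 * (mae_hashed - mae_plain) / mae_plain


# (non-hashed MAE, hashed MAE), rounded means copied from the RQ2 table
reported = {
    "BPI12": (6508, 9246),
    "Bac": (2265, 3880),
    "Hospital": (115, 2077),
}

degradation = {
    name: relative_degradation(plain, hashed)
    for name, (plain, hashed) in reported.items()
}
```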

**RQ3 — β-learner distillation:**

Each β-learner is re-implemented as a standalone predictor and evaluated on the same test set. For Total Time, four families emerge: `knn-act` (k-NN on activity-based representations), `knn-att` (k-NN on attribute-based representations), `time-seq` (temporal sequence aggregation), `path-pred` (predicted-path estimation) — each with three aggregations (mean/median/mode), giving 12 distinct β-learners per dataset. For Activity Occurrence: Activity-Based, State-Based, Att-Based, Positive-Evidence (4 patterns).
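
To make the idea of a standalone β-learner concrete, here is a rough sketch of what a `knn-act` predictor with a selectable aggregation might look like. Everything beyond the names `knn-act` and mean/median/mode is an assumption: the bag-of-activities representation, the L1 distance, the value of `k`, and the function names are illustrative choices, not the paper's definitions.

```python
from collections import Counter
import statistics

# The three aggregations named in the note.
AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,  # only meaningful if targets are rounded or binned
}


def count_distance(a: Counter, b: Counter) -> int:
    """L1 distance between two activity-count vectors (assumed metric)."""
    return sum(abs(a[key] - b[key]) for key in set(a) | set(b))


def knn_act_predict(running, examples, k=3, aggregation="mean"):
    """Predict Total Time from the k most similar example traces.

    `running` is the list of activities observed so far; `examples` is a
    list of (activities, total_time) pairs. Similarity is computed on
    activity counts only, ignoring order and attributes.
    """
    query = Counter(running)
    ranked = sorted(examples, key=lambda ex: count_distance(query, Counter(ex[0])))
    targets = [total_time for _, total_time in ranked[:k]]
    return AGGREGATORS[aggregation](targets)
```

Under this reading, `knn-att` would swap the activity counts for an attribute-based representation, and each family yields three β-learners by changing only the `aggregation` argument.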

The LLM outperforms every individual β-learner, and Wilcoxon signed-rank tests find the differences significant. The magnitude of the advantage (Δ LLM) ranges from 6% (knn-act-mean) to 80% (knn-att-mean on Hospital). The paper concludes that the LLM performs *higher-order* aggregation across β-learners, not mere pattern replication.

**Good-Turing coverage:**

Frequency-of-frequencies analysis of the 50 coded reasoning traces per dataset × KPI gives an expected number of novel β-learners of ≤ 0.014 at m = 100 additional traces, so P(novel β-learner | new trace) ≈ 0 in every use case and KPI. This justifies the 150-trace sample size as sufficient for β-learner catalogue saturation.
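
The basic Good-Turing estimate behind this kind of analysis is simple: the probability that the next observation belongs to a category not seen so far is approximately N1/N, where N1 is the number of categories observed exactly once and N is the total number of observations. The sketch below implements only that estimator; it does not reproduce the paper's extrapolation to m new traces, and it assumes one β-learner label per coded trace. The label counts in the usage note are made up for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(observations) -> float:
    """Good-Turing estimate of the probability of a not-yet-seen category.

    Returns N1 / N, where N1 counts the categories observed exactly once
    and N is the total number of observations.
    """
    counts = Counter(observations)
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("at least one observation is required")
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_total
```

For example, 50 coded traces in which every pattern appears at least twice give an estimate of 0, whereas a single pattern seen only once among 50 traces gives 1/50 = 0.02.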

## Methodological contributions

- **Sequential trace-to-string encoding ρ_seq** — formal definition: ρ_seq(σ) = global(σ) ⊕ (activity(e₁), duration(e₁)) ⊕ … ⊕ (activity(eₙ), duration(eₙ)) ⊕ K(σ). Local attributes deliberately omitted; rationale grounded in long-context degradation literature ([20] BABILong, [22] Long-Context LLMs Struggle).
- **7-part modular prompt** — instruction header · attribute & encoding description · output & reasoning spec · running-trace format · domain-specific background · example data · running-trace input. Two sections (lines 8–9 and 37–38 in Listing 1) are analyst-specified; the rest is generic.
- **Semantic-hashing probe protocol** — a reusable methodology for any future LLM-on-event-log study to isolate embodied-knowledge contributions. Particularly relevant to [APM Manifesto C3 benchmark contamination](syntheses/llm-bpm-reading-list.md#e2-gaps-the-corpus-acknowledges).
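
The ρ_seq definition above amounts to string concatenation, which a short sketch can make concrete. The textual format (separator, `name=value` pairs, the `target=` label) and the function name are assumptions; the note does not reproduce the paper's exact serialization. In the usage note, the attribute value `green` and the activity `VISITA` are hypothetical.

```python
def rho_seq(global_attrs: dict, events: list, target=None, sep: str = " | ") -> str:
    """Serialize a trace as: global attributes, (activity, duration) pairs, target.

    `events` is a list of (activity, duration) pairs. Local (per-event)
    attributes are deliberately not part of the encoding. `target` is the
    KPI value K(σ); pass None for a running trace whose KPI is to be predicted.
    """
    parts = [f"{name}={value}" for name, value in global_attrs.items()]
    parts += [f"({activity}, {duration})" for activity, duration in events]
    if target is not None:
        parts.append(f"target={target}")
    return sep.join(parts)
```

Calling `rho_seq({"Triage_Color": "green"}, [("LABORATORIO", 30), ("VISITA", 12)], 95)` returns `Triage_Color=green | (LABORATORIO, 30) | (VISITA, 12) | target=95`. Dropping per-event attributes keeps the string length linear in the number of events, which matters when 100 example traces must fit in one prompt.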

## Limitations

- Single LLM tested (Gemini 2.5 Flash Thinking). Anthropic Claude, OpenAI GPT, Llama family untested.
- β-learner derivation is manual — no automated rule mining or LLM-aided distillation tested.
- BPM Forum 2026 conference version covered Total Time only; this extension adds Activity Occurrence but not other PPM targets (next activity, suffix, anomaly).
- No real-time deployment / latency analysis — LLM inference cost vs. CatBoost/PGTNet is not measured.
- Cross-LLM/cross-prompt sensitivity not explored.

## Connections

**Concepts:**
- [[concepts/llm-based-ppm]] — *anchor source*; first wiki entry on LLM-based PPM as a method family.
- [[concepts/beta-learner-distillation]] — *introduces*.
- [[concepts/semantic-hashing-probe]] — *introduces*.
- [[concepts/predictive-process-monitoring]] — LLM-PPM as new family.
- [[concepts/remaining-time-prediction]] — Total Time KPI; LLM as new baseline.
- [[concepts/outcome-prediction]] — Activity Occurrence as outcome variant.
- [[concepts/trace-encoding]] — ρ_seq encoding.
- [[concepts/explainability-apm]] — β-learner distillation as explainability mechanism.
- [[concepts/prescriptive-process-monitoring]] — future-work bridge.

**Sources cited that exist in the wiki:**
- [25] (prior conference version, BPM Forum 2026) — stub-worthy referenced source.
- [34] [[sources/2019-verenich-survey-ppm|Verenich et al. 2019]] — remaining-time-prediction baseline.
- [32] [[sources/2017-difrancescomarino-a-priori-ppm]] — outcome-oriented PPM (cited for outcome-PPM literature; not the exact Teinemaa 2019 ref).
- [10] PGTNet (Amiri Elyasi et al. 2024) — benchmark.
- [9] CatBoost (Dorogush et al. 2017) — benchmark.

**Entities:**
- [[entities/marlon-dumas]] — third author; new addition to his authorship list in the wiki.
- [[entities/alessandro-padella]] — first author, new entity.
- [[entities/massimiliano-de-leoni]] — second author, new entity.

**Syntheses:**
- [[syntheses/ppm-landscape]] — §6 "LLM-based PPM (post-2023) — still emerging; worth monitoring" closes with this source.
- [[syntheses/llm-bpm-reading-list]] — new entry in §A.3 (LLM in process-relevant domains) and partial fill of gap E.2.2 (LLM in PrPM).