---
title: Outcome (Goal) Prediction
type: concept
tags: [ppm, prediction, classification, outcome]
sources:
  - "[[sources/2026-padella-llm-features-ppm]]"
created: 2026-04-13
updated: 2026-05-11
---

# Outcome Prediction

A [[concepts/predictive-process-monitoring|PPM]] task: given a prefix of an event trace, predict a **case-level outcome**, typically a binary label or one of a small number of classes.

## Typical targets
- **SLA compliance** — will this case complete within the deadline?
- **Business outcome** — will this loan be repaid / approved / defaulted?
- **Positive / negative closure** — will this complaint be resolved favourably?
- **Constraint violation** — will a DECLARE rule be violated?
- **Activity occurrence** — will a specific (typically high-cost or high-rework) activity occur within the running case? Used as a classification KPI in [[sources/2026-padella-llm-features-ppm|Padella, de Leoni & Dumas 2026]] (`W_Nabellen incomplete dossiers` in BPI12, `Service Closure with BO Responsibility` in Bac, `LABORATORIO` in Hospital).

## Formulation
- **Input:** prefix of a running case.
- **Output:** categorical outcome `y ∈ {0, 1}` or `y ∈ {c₁, …, cₘ}`.
- **Training data:** extract prefixes from completed cases, labelled with the known ground-truth outcome.
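
The training-data step above can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: the names `extract_prefixes`, `label_fn` and `completed_cases` are hypothetical, a trace is assumed to be reduced to its ordered activity names, and only strict prefixes (shorter than the full trace) are emitted.

```python
from typing import Callable, Dict, List, Tuple

Trace = List[str]  # assumption: a trace is reduced to its ordered activity names


def extract_prefixes(
    cases: Dict[str, Trace],
    label_fn: Callable[[Trace], int],
    min_len: int = 1,
) -> List[Tuple[str, Trace, int]]:
    """Return (case_id, prefix, label) for every strict prefix of every completed case."""
    samples = []
    for case_id, trace in cases.items():
        # The label is computed on the full trace: it is known only
        # because the case has already completed.
        label = label_fn(trace)
        # Strict prefixes only (lengths min_len .. len(trace) - 1).
        for k in range(min_len, len(trace)):
            samples.append((case_id, trace[:k], label))
    return samples


# Toy log with a hypothetical activity-occurrence target: does "C" occur in the case?
completed_cases = {"c1": ["A", "B", "C"], "c2": ["A", "D"]}
samples = extract_prefixes(completed_cases, lambda trace: int("C" in trace))
for case_id, prefix, label in samples:
    print(case_id, prefix, label)
```

Every prefix inherits the label of its whole case, so long cases contribute more training rows than short ones. Whether to cap or subsample prefix lengths is a design choice this sketch leaves open.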

## Benchmark (Teinemaa et al. 2016+)
A systematic literature review and benchmark by Teinemaa, Dumas, La Rosa and Maggi established the field's evaluation baseline; see [[sources/2016-teinemaa-outcome-ppm-review]].

## Evaluation
- **AUC-ROC / AUC-PR** — the standard in imbalanced settings.
- **Accuracy / F1** — less informative under class imbalance.
- **Earliness-accuracy trade-off** — a prediction earlier in the case is more actionable but harder; often plotted as accuracy vs prefix length.
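
The metrics above can be illustrated with a small, dependency-free sketch. It assumes binary labels and real-valued scores; the function names and toy data are hypothetical, and AUC-ROC is computed with the pairwise-ranking (Mann-Whitney) formulation rather than any particular library's implementation.

```python
from collections import defaultdict


def roc_auc(y_true, scores):
    """AUC-ROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of cases where the thresholded score matches the label."""
    return sum(int(s >= threshold) == y for y, s in zip(y_true, scores)) / len(y_true)


def accuracy_by_prefix_length(records, threshold=0.5):
    """records: iterable of (prefix_length, y_true, score); returns {length: accuracy}."""
    buckets = defaultdict(list)
    for k, y, s in records:
        buckets[k].append(int(s >= threshold) == y)
    return {k: sum(hits) / len(hits) for k, hits in sorted(buckets.items())}


# Imbalanced toy data: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
constant = [0.1] * 10        # uninformative scorer
ranked = [0.1] * 9 + [0.4]   # ranks the positive highest, yet stays below the threshold

print(accuracy(y_true, constant), roc_auc(y_true, constant))
print(accuracy(y_true, ranked), roc_auc(y_true, ranked))

# Earliness view: accuracy grouped by prefix length.
records = [(1, 1, 0.4), (1, 0, 0.2), (2, 1, 0.8), (2, 0, 0.3)]
print(accuracy_by_prefix_length(records))
```

Both scorers have the same accuracy on this toy data, but only AUC-ROC separates the one that ranks the positive case first. This is the sense in which accuracy is less informative under class imbalance.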

## Philosophical caveats
An outcome prediction is a probability, but *which* kind? Read as a frequentist probability, it is the population-average rate of the outcome among cases with similar prefixes; applying that rate to an individual case commits the **[[concepts/rct-limitations|ecological fallacy]]**. Read as a credence, it expresses the model's epistemic uncertainty (see [[concepts/aleatoric-vs-epistemic-uncertainty]]). Read as a propensity, it claims an intrinsic tendency of the case itself ([[concepts/probabilistic-causation]]). The choice matters whenever a prediction drives an intervention; see [[sources/2023-anjum-rocca-phi403-lecture-19-what-rcts-do-not-show]] and [[sources/2023-anjum-rocca-phi403-lecture-18-risky-predictions]].

## LLM-based outcome prediction

[[sources/2026-padella-llm-features-ppm|Padella et al. 2026]] benchmark Gemini 2.5 Flash Thinking against a CatBoost classifier on the activity-occurrence variant with 100 training traces. On non-hashed logs, the LLM achieves F1 scores of 0.77 / 0.98 / 0.90 on BPI12 / Bac / Hospital, matching or exceeding CatBoost trained on the full event log. Performance degrades slightly under [[concepts/semantic-hashing-probe|semantic hashing]] (−2% to −7%), confirming partial reliance on embodied prior knowledge for classification too.

## Related
[[concepts/predictive-process-monitoring]] · [[concepts/next-activity-prediction]] · [[concepts/remaining-time-prediction]] · [[concepts/trace-encoding]] · [[concepts/probabilistic-causation]] · [[concepts/llm-based-ppm]]