---
title: Trace Encoding
type: concept
tags: [ppm, encoding, features, preprocessing]
sources:
  - "[[sources/2026-padella-llm-features-ppm]]"
created: 2026-04-13
updated: 2026-05-11
---

# Trace Encoding

Trace encoding converts a variable-length (and possibly multi-attribute) event trace prefix into a fixed-size feature representation that a classifier or regressor can consume. The choice of encoding has a large impact on [[concepts/predictive-process-monitoring|PPM]] performance.

## Encoding families (Teinemaa / Leontjeva / Verenich et al.)

### Last-state encoding
Uses only the attributes of the most recent event in the prefix. The simplest encoding and typically the weakest, since all earlier history is discarded.
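
A minimal sketch of last-state encoding, assuming each event is a plain dict and the prefix is a list in execution order. The event layout, the attribute names, and the function name `last_state_encode` are illustrative assumptions, not taken from the cited work.

```python
from typing import Any

# Hypothetical event format: one dict per event, in execution order.
Event = dict[str, Any]


def last_state_encode(prefix: list[Event], attributes: list[str]) -> list[Any]:
    """Return the chosen attributes of the most recent event only."""
    if not prefix:
        raise ValueError("prefix must contain at least one event")
    last = prefix[-1]
    # Missing attributes become None so the vector length stays fixed.
    return [last.get(name) for name in attributes]


prefix = [
    {"activity": "register", "resource": "alice", "cost": 0.0},
    {"activity": "check", "resource": "bob", "cost": 12.5},
]
encoded = last_state_encode(prefix, ["activity", "resource", "cost"])
print(encoded)
```

Categorical values such as the activity name would still need a numeric encoding (for example one-hot) before reaching most classifiers; that step is left out here.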

### Aggregate (boolean / frequency) encoding
Aggregate over all events in the prefix:
- **Boolean** — has activity `a` occurred?
- **Frequency** — how many times has `a` occurred?

Neither variant preserves event order.
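
The two aggregate variants can be sketched as follows, assuming a prefix is just a list of activity names and the activity vocabulary is fixed in advance. The function names and the sample activities are illustrative assumptions.

```python
from collections import Counter


def frequency_encode(prefix: list[str], vocabulary: list[str]) -> list[int]:
    """Count how many times each vocabulary activity occurs in the prefix."""
    counts = Counter(prefix)
    return [counts[activity] for activity in vocabulary]


def boolean_encode(prefix: list[str], vocabulary: list[str]) -> list[int]:
    """Mark with 1 each vocabulary activity that occurs at least once."""
    seen = set(prefix)
    return [int(activity in seen) for activity in vocabulary]


vocabulary = ["register", "check", "approve", "reject"]
prefix = ["register", "check", "check", "approve"]
reordered = ["check", "approve", "register", "check"]

freq = frequency_encode(prefix, vocabulary)
flags = boolean_encode(prefix, vocabulary)
# Same multiset of activities in a different order gives the same vector.
same_after_reordering = frequency_encode(reordered, vocabulary) == freq
```

The last line illustrates the loss of ordering: two prefixes that differ only in event order are indistinguishable after aggregation.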

### Index-based encoding
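Fixed-length padded representation: positions `1…k_max`, each carrying the activity (and possibly attributes) at that index. Preserves order, but the number of features grows linearly with `k_max` (multiplied by the number of features encoded per position), so long traces yield large and mostly padded vectors.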
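
A minimal sketch of index-based encoding for the control-flow perspective only, assuming one-hot slots per position, zero padding for short prefixes, and truncation of prefixes longer than `k_max`. The padding and truncation policies and the function name are assumptions made for illustration.

```python
def index_encode(prefix: list[str], vocabulary: list[str], k_max: int) -> list[int]:
    """One-hot encode the activity at each of the first k_max positions."""
    index = {activity: i for i, activity in enumerate(vocabulary)}
    width = len(vocabulary)
    features: list[int] = []
    for position in range(k_max):
        slot = [0] * width  # an all-zero slot doubles as padding
        if position < len(prefix):
            # Raises KeyError for an activity outside the vocabulary.
            slot[index[prefix[position]]] = 1
        features.extend(slot)
    return features


vocabulary = ["a", "b", "c"]
encoded = index_encode(["b", "a"], vocabulary, k_max=3)
swapped = index_encode(["a", "b"], vocabulary, k_max=3)
print(len(encoded), encoded != swapped)
```

The output length is always `k_max * len(vocabulary)`, which is the feature-space growth mentioned above, and swapping two events changes the vector, which is what order preservation means here.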

### Complex symbolic sequence encodings
HMM-based and n-gram-based encodings capture sub-sequence patterns without the sparsity of index-based encoding (Leontjeva et al. 2016).
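
To make the n-gram idea concrete, here is a simplified sketch that counts contiguous activity n-grams against a fixed n-gram vocabulary. It is an illustration of the general technique under assumed names and inputs, not a reproduction of the method in Leontjeva et al. 2016, and it does not cover the HMM-based variant.

```python
from collections import Counter


def ngram_encode(
    prefix: list[str],
    ngram_vocabulary: list[tuple[str, ...]],
    n: int = 2,
) -> list[int]:
    """Count occurrences of each vocabulary n-gram among contiguous n-grams."""
    grams = Counter(
        tuple(prefix[i : i + n]) for i in range(len(prefix) - n + 1)
    )
    return [grams[gram] for gram in ngram_vocabulary]


bigrams = [("a", "b"), ("b", "a"), ("b", "b")]
encoded = ngram_encode(["a", "b", "a", "b"], bigrams)
too_short = ngram_encode(["a"], bigrams)  # no bigram fits in one event
```

Unlike plain frequency encoding, the bigram counts distinguish `a, b` from `b, a`, so local ordering survives while the vector length stays independent of the prefix length.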

### Learned embeddings
Neural approaches learn dense representations end-to-end:
- **Activity embeddings** (like word2vec for activities).
- **LSTM hidden states** as implicit encoding ([[concepts/lstm-ppm|LSTM-PPM]]).
- **Transformer contextual embeddings** ([[concepts/transformer-ppm|Transformer-PPM]]).

### LLM string encoding (sequential, ρ_seq)
The newest family, designed for [[concepts/llm-based-ppm|LLM-based PPM]]: a trace is serialised as a human-readable string that a large language model can consume. [[sources/2026-padella-llm-features-ppm|Padella, de Leoni & Dumas 2026]] introduce **ρ_seq**, in which `⊕` denotes concatenation, `global(σ)` the global (trace-level) attributes, and `K(σ)` the target value:

`ρ_seq(σ) = global(σ) ⊕ (activity(e₁), duration(e₁)) ⊕ … ⊕ (activity(eₙ), duration(eₙ)) ⊕ K(σ)`

Design choices:
- **Global attributes preserved** — they carry domain knowledge the LLM can leverage.
- **Local attributes deliberately omitted** — to respect LLM context-length constraints and avoid documented long-context degradation (BABILong, Long-Context LLMs Struggle).
- **Activity name + duration** per event — minimal control-flow + temporal information.
- **Target value appended** — for in-context completed-trace examples; absent for the running query trace.
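
The structure of ρ_seq can be sketched as a small serialisation function. The concrete text layout (the `key=value` pairs, the ` | ` separator, the `target=` label), the duration unit, and the function name `rho_seq` are all assumptions chosen for illustration; the source only fixes which pieces are concatenated and in what order.

```python
from typing import Any, Optional


def rho_seq(
    global_attrs: dict[str, Any],
    events: list[tuple[str, float]],
    target: Optional[Any] = None,
) -> str:
    """Serialise a trace: global attributes, (activity, duration) pairs, target."""
    parts: list[str] = []
    if global_attrs:
        parts.append(", ".join(f"{k}={v}" for k, v in global_attrs.items()))
    # Only the activity name and duration are kept per event; local
    # attributes are deliberately left out.
    parts += [f"({activity}, {duration})" for activity, duration in events]
    # The target is appended for completed example traces only.
    if target is not None:
        parts.append(f"target={target}")
    return " | ".join(parts)


events = [("register", 0), ("check", 3600)]
example = rho_seq({"amount": 500}, events, target="approved")
query = rho_seq({"amount": 500}, events)
print(example)
print(query)
```

A prompt would then contain several `example` strings built from completed traces, followed by one `query` string for the running trace whose target the LLM is asked to predict.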

ρ_seq sits *outside* the classical encoding taxonomy: it is human-readable rather than fixed-size, semantically meaningful (activity *names* matter, not just IDs), and tightly coupled to the LLM's prompt design.

## Data-awareness
Encoding **event attributes** (the data perspective) alongside activities typically outperforms control-flow-only encodings ([[sources/2017-navarin-lstm-data-aware-remaining-time]]).

## Trade-offs
| Encoding | Order preserved | Feature size | Data-aware support |
|---|---|---|---|
| Last-state | – | Small | Partial |
| Aggregate | – | Medium | Yes |
| Index-based | ✔ | Large | Yes |
| Complex symbolic | Partial | Medium | Partial |
| Learned embeddings | ✔ | Dense, fixed | Yes |
| LLM string (ρ_seq) | ✔ | Variable string | Partial (global only) |

## Related
[[concepts/predictive-process-monitoring]] · [[concepts/lstm-ppm]] · [[concepts/transformer-ppm]] · [[concepts/llm-based-ppm]]