---
title: Trace Encoding
type: concept
tags: [ppm, encoding, features, preprocessing]
sources:
  - "[[sources/2026-padella-llm-features-ppm]]"
created: 2026-04-13
updated: 2026-05-11
---

# Trace Encoding

Converting a variable-length (and possibly multi-attribute) event trace prefix into a fixed-size feature representation consumable by a classifier or regressor. A design choice with large impact on [[concepts/predictive-process-monitoring|PPM]] performance.

## Encoding families (Teinemaa / Leontjeva / Verenich et al.)

### Last-state encoding

Only the attributes of the latest event are used. Simplest, weakest.

### Aggregate (boolean / frequency) encoding

Aggregate over all events in the prefix:

- **Boolean**: has activity `a` occurred?
- **Frequency**: how many times has `a` occurred?

Does not preserve order.

### Index-based encoding

Fixed-length padded representation: positions `1…k_max`, each carrying the activity (and possibly attributes) at that index. Preserves order but explodes feature space for long traces.

### Complex symbolic sequence encodings

HMM-based, N-gram-based: capture sub-sequence patterns without the sparsity of index encoding (Leontjeva et al. 2016).

### Learned embeddings

Neural approaches learn dense representations end-to-end:

- **Activity embeddings** (like word2vec for activities).
- **LSTM hidden states** as implicit encoding ([[concepts/lstm-ppm|LSTM-PPM]]).
- **Transformer contextual embeddings** ([[concepts/transformer-ppm|Transformer-PPM]]).

### LLM string encoding (sequential, ρ_seq)

The newest family, designed for [[concepts/llm-based-ppm|LLM-based PPM]]: a trace is serialised as a human-readable string consumable by a large language model. [[sources/2026-padella-llm-features-ppm|Padella, de Leoni & Dumas 2026]] introduce **ρ_seq**:

`ρ_seq(σ) = global(σ) ⊕ (activity(e₁), duration(e₁)) ⊕ … ⊕ (activity(eₙ), duration(eₙ)) ⊕ K(σ)`

Design choices:

- **Global attributes preserved**: they carry domain knowledge the LLM can leverage.
- **Local attributes deliberately omitted**: to respect LLM context-length constraints and avoid documented long-context degradation (BABILong, Long-Context LLMs Struggle).
- **Activity name + duration** per event: minimal control-flow + temporal information.
- **Target value appended**: for in-context completed-trace examples; absent for the running query trace.

ρ_seq sits *outside* the classical encoding taxonomy: it is human-readable rather than fixed-size, semantically meaningful (activity *names* matter, not just IDs), and tightly coupled to the LLM's prompt design.

## Data-awareness

Encoding of **event attributes** (data perspective) in addition to activities typically outperforms control-flow-only encodings ([[sources/2017-navarin-lstm-data-aware-remaining-time]]).

## Trade-offs

| Encoding | Order preserved | Feature size | Data-aware support |
|---|---|---|---|
| Last-state | – | Small | Partial |
| Aggregate | – | Medium | Yes |
| Index-based | ✔ | Large | Yes |
| Complex symbolic | Partial | Medium | Partial |
| Learned embeddings | ✔ | Dense, fixed | Yes |
| LLM string (ρ_seq) | ✔ | Variable string | Partial (global only) |

## Related

[[concepts/predictive-process-monitoring]] · [[concepts/lstm-ppm]] · [[concepts/transformer-ppm]] · [[concepts/llm-based-ppm]]
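
## Illustrative sketch

To make the encoding families above concrete, here is a minimal Python sketch of last-state, aggregate, index-based, and ρ_seq-style encoding applied to one toy prefix. Everything in it is an assumption for illustration: the toy trace, the activity alphabet `vocab`, the function names, the use of `0` as padding, and in particular the string layout of `rho_seq` (the `key=value` form, the `h` unit, the ` ; ` separator). The source paper only specifies ρ_seq as an abstract concatenation, so this is not its actual prompt format.

```python
from collections import Counter

# Hypothetical toy prefix: (activity, duration in hours) pairs, illustrative only.
prefix = [("register", 0.5), ("check", 2.0), ("check", 1.5), ("approve", 0.25)]
vocab = ["register", "check", "approve", "reject"]  # assumed activity alphabet


def last_state(prefix, vocab):
    """One-hot vector of the latest activity only."""
    last = prefix[-1][0]
    return [1 if a == last else 0 for a in vocab]


def aggregate(prefix, vocab, boolean=False):
    """Frequency (or boolean) vector over the whole prefix; order is lost."""
    counts = Counter(a for a, _ in prefix)
    return [int(counts[a] > 0) if boolean else counts[a] for a in vocab]


def index_based(prefix, vocab, k_max):
    """Activity id per position 1..k_max, with 0 as padding; order is kept."""
    ids = [vocab.index(a) + 1 for a, _ in prefix[:k_max]]
    return ids + [0] * (k_max - len(ids))


def rho_seq(global_attrs, prefix, target=None):
    """Assumed string layout: global attributes, (activity, duration) pairs,
    then the target value if the trace is a completed in-context example."""
    parts = [f"{k}={v}" for k, v in global_attrs.items()]
    parts += [f"({a}, {d}h)" for a, d in prefix]
    if target is not None:
        parts.append(f"target={target}")
    return " ; ".join(parts)


print(last_state(prefix, vocab))               # [0, 0, 1, 0]
print(aggregate(prefix, vocab))                # [1, 2, 1, 0]
print(aggregate(prefix, vocab, boolean=True))  # [1, 1, 1, 0]
print(index_based(prefix, vocab, k_max=6))     # [1, 2, 2, 3, 0, 0]
print(rho_seq({"amount": 1200}, prefix, target="accepted"))
```

The first three encoders return vectors whose length depends only on `vocab` or `k_max`, whereas the `rho_seq` string grows with the prefix, which mirrors the "Variable string" entry in the trade-offs table. Calling `rho_seq` without `target` corresponds to the running query trace.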