---
title: "Semantic hashing probe (for LLM embodied-knowledge isolation)"
type: concept
tags: [llm-ppm, evaluation-methodology, embodied-knowledge, benchmark-contamination, llm-probing, robustness-protocol]
sources:
  - "[[sources/2026-padella-llm-features-ppm]]"
created: 2026-05-11
updated: 2026-05-11
---

# Semantic hashing probe

A reusable evaluation protocol introduced by [[sources/2026-padella-llm-features-ppm|Padella, de Leoni & Dumas 2026]] for **isolating the contribution of an LLM's embodied prior knowledge from pure distributional pattern matching** on a given task. Applied here to LLM-based predictive process monitoring (PPM), the probe transfers naturally to any LLM-evaluation context where benchmark contamination, prior-knowledge leakage, or semantic-shortcut concerns matter.

## Procedure

Given a task input that contains semantically meaningful strings (activity names, attribute names, attribute values, entity labels, etc.):

1. **Identify the context-sensitive set** ℋ: all strings whose meaning the LLM could plausibly know from pre-training, i.e. activity names, attribute names, and the values of categorical attributes.
2. **Define a deterministic hash function** H : ℋ → Σ⁴ mapping each string to a unique 4-character identifier (Padella et al. used `c₁c₂c₃c₄` with cᵢ ∈ {A–Z, 0–9}). The hash preserves *correlations* (same input → same hash) while eliminating *semantics* (no shared substrings, no recognisable words, no prior associations).
3. **Generate two parallel test sets**: non-hashed (original) and hashed (every s ∈ ℋ replaced by H(s)).
4. **Run the LLM on both versions of the same input**, repeated to obtain statistical samples.
5. **Compare prediction quality** on hashed vs. non-hashed inputs. Significant degradation under hashing indicates semantic reliance.
6. **Confirm statistically** with post-hoc tests (Padella et al. used the Nemenyi post-hoc test with H₀: no difference, H₁: non-hashed superior).

## Empirical headline from Padella et al.

| Use Case | MAE non-hashed | MAE hashed | Degradation | p-value |
|---|---|---|---|---|
| BPI12 | 6508 ± 235 | 9246 ± 873 | **+42 %** | 0.002 ** |
| Bac | 2265 ± 1072 | 3880 ± 3254 | **+71 %** | <0.001 *** |
| Hospital | 115 ± 34 | 2077 ± 232 | **+1702 %** | 0.002 ** |

For Activity Occurrence (classification), F1 degradation is consistent but smaller in magnitude (−2 % to −7 %, statistically significant in all three use cases). Hospital shows the largest shift, consistent with it having the most semantically rich activity vocabulary (e.g., `LABORATORIO`, `Triage_Color`).

## Why this matters beyond Padella et al.

- **Disentangles two factors** that are often conflated in LLM benchmarking: (a) the model's reasoning capability, (b) the model's pre-training-derived priors about the test domain.
- **Operationalises a "C3 benchmark contamination" check**: the APM Manifesto flags benchmark leakage as an unresolved problem; semantic hashing offers a concrete probe.
- **Transferable to other BPM-AI evaluations**: process model evaluation (e.g., Rebmann et al. 2024 semantics-aware PM benchmarks), LLM-conformance checking, LLM-bot evaluations in process modelling.
- **Cheap to apply**: it only requires a deterministic hash function and re-running existing benchmarks; no model retraining is needed.

## Limitations

- **Hash collisions**: Padella et al.'s 4-character codes over a 36-symbol alphabet give 36⁴ ≈ 1.7M unique codes, more than ample for typical event logs. Larger vocabularies need longer hashes.
- **Symbol-level priors remain**: the probe removes word-level semantics but not lower-level priors (e.g. distributional knowledge about how categorical variables behave). It therefore cannot fully separate distributional from semantic reliance.
- **Pure-semantic shortcuts only**: the probe says nothing about whether the LLM is exploiting *positional* or *syntactic* shortcuts.
- **Operates on string content, not on model architecture**: the probe does not distinguish "pre-training leakage" from "in-context learning of semantically rich inputs" (e.g. via example shots that happen to contain meaningful activity names).

## Related

[[concepts/llm-based-ppm]] · [[concepts/beta-learner-distillation]] · [[concepts/predictive-process-monitoring]] · [[syntheses/llm-bpm-reading-list]]
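
## Illustrative sketch

Steps 1 to 3 of the procedure can be sketched roughly as follows. This is a minimal illustration, not the implementation used by Padella et al.: the source only specifies a deterministic mapping to unique 4-character codes over A–Z and 0–9, so the SHA-256 derivation, the re-salting on collision, the event-as-dict representation, and the names `build_hash_map` and `hash_event` are all assumptions made here.

```python
import hashlib
import string

ALPHABET = string.ascii_uppercase + string.digits  # 36 symbols: A-Z, 0-9
CODE_LEN = 4  # code length reported in the source


def _code(text: str, salt: int) -> str:
    """Derive a 4-character code from a SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}:{text}".encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    chars = []
    for _ in range(CODE_LEN):
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)


def build_hash_map(strings):
    """Map each distinct string to a unique code, re-salting on collision.

    Assumes far fewer than 36**4 distinct strings; otherwise the
    collision loop would not terminate.
    """
    mapping, used = {}, set()
    for s in sorted(set(strings)):  # sorted so the assignment is reproducible
        salt = 0
        code = _code(s, salt)
        while code in used:
            salt += 1
            code = _code(s, salt)
        mapping[s] = code
        used.add(code)
    return mapping


def hash_event(event, mapping):
    """Replace attribute names and string values; keep numeric values as-is."""
    return {
        mapping.get(k, k): (mapping.get(v, v) if isinstance(v, str) else v)
        for k, v in event.items()
    }


# Hypothetical two-event trace; the attribute layout is an assumption.
trace = [
    {"activity": "Triage", "Triage_Color": "red", "cost": 12.5},
    {"activity": "LABORATORIO", "Triage_Color": "red", "cost": 40.0},
]

# Step 1: collect the context-sensitive set (names plus categorical values).
vocab = set()
for event in trace:
    for name, value in event.items():
        vocab.add(name)
        if isinstance(value, str):
            vocab.add(value)

mapping = build_hash_map(vocab)                      # step 2
hashed = [hash_event(e, mapping) for e in trace]     # step 3: hashed test set
```

The original `trace` and the derived `hashed` list form the two parallel inputs for steps 4 to 6. Because the mapping is applied consistently, both events still share the same code for `red`, so co-occurrence structure survives while the surface strings do not.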