---
title: "Semantic hashing probe (for LLM embodied-knowledge isolation)"
type: concept
tags: [llm-ppm, evaluation-methodology, embodied-knowledge, benchmark-contamination, llm-probing, robustness-protocol]
sources:
  - "[[sources/2026-padella-llm-features-ppm]]"
created: 2026-05-11
updated: 2026-05-11
---

# Semantic hashing probe

A reusable evaluation protocol introduced by [[sources/2026-padella-llm-features-ppm|Padella, de Leoni & Dumas 2026]] for **separating the contribution of an LLM's embodied prior knowledge from pure distributional pattern matching** on a given task. The source applies it to LLM-based predictive process monitoring (PPM), but the probe transfers naturally to any LLM-evaluation context where benchmark contamination, prior-knowledge leakage, or semantic shortcuts are a concern.

## Procedure

Given a task input that contains semantically meaningful strings (activity names, attribute names, attribute values, entity labels, etc.):

1. **Identify the context-sensitive set** ℋ — all strings whose meaning the LLM could plausibly know from pre-training: activity names + attribute names + attribute values for categorical attributes.
2. **Define a deterministic hash function** H : ℋ → Σ⁴ mapping each string to a unique 4-character identifier (Padella et al. used `c₁c₂c₃c₄` with cᵢ ∈ {A–Z, 0–9}). The hash preserves *correlations* (the same input always maps to the same code) while eliminating *semantics* (no shared substrings, no recognisable words, no prior associations).
3. **Generate two parallel test sets** — non-hashed (original) and hashed (every s ∈ ℋ replaced by H(s)).
4. **Run the LLM on both versions of the same input**, repeating each run to obtain a sample of scores for statistical testing.
5. **Compare prediction quality** on hashed vs. non-hashed. Significant degradation under hashing indicates semantic reliance.
6. **Statistical confirmation** via post-hoc tests (Padella et al. used Nemenyi post-hoc with H₀: no difference, H₁: non-hashed superior).
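
Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by Padella et al.: the SHA-256 derivation, the salt-based collision handling, the flat-dictionary event representation, and the names `build_hash_map` and `hash_event` are all introduced here for illustration.

```python
import hashlib
import string

ALPHABET = string.ascii_uppercase + string.digits  # 36 symbols: A-Z, 0-9
CODE_LENGTH = 4


def _candidate_code(text, salt):
    """Derive a 4-character code from the SHA-256 digest of the salted string."""
    digest = hashlib.sha256(f"{salt}:{text}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    chars = []
    for _ in range(CODE_LENGTH):
        value, idx = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[idx])
    return "".join(chars)


def build_hash_map(vocabulary):
    """Map every string in the context-sensitive set to a unique code.

    Strings are processed in sorted order so the mapping is reproducible
    for a given vocabulary; a collision is resolved by re-hashing with an
    incremented salt (an assumption of this sketch).
    """
    unique_strings = sorted(set(vocabulary))
    if len(unique_strings) > len(ALPHABET) ** CODE_LENGTH:
        raise ValueError("vocabulary exceeds the available code space")
    mapping, used = {}, set()
    for text in unique_strings:
        salt = 0
        code = _candidate_code(text, salt)
        while code in used:
            salt += 1
            code = _candidate_code(text, salt)
        mapping[text] = code
        used.add(code)
    return mapping


def hash_event(event, mapping):
    """Replace attribute names and string values found in the mapping.

    Numeric values and strings outside the mapping are left untouched.
    """
    hashed = {}
    for key, value in event.items():
        new_key = mapping.get(key, key)
        new_value = mapping.get(value, value) if isinstance(value, str) else value
        hashed[new_key] = new_value
    return hashed


# Hypothetical vocabulary and event, for illustration only.
vocabulary = ["activity", "LABORATORIO", "Triage_Color", "RED"]
mapping = build_hash_map(vocabulary)
hashed_event = hash_event({"activity": "LABORATORIO", "duration": 12}, mapping)
```

Building the mapping once over the whole vocabulary and reusing it for every trace is what keeps co-occurrence structure intact: two events that shared an activity name before hashing still share a code afterwards.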

## Empirical headline from Padella et al.

| Use Case | MAE non-hashed | MAE hashed | Degradation | p-value |
|---|---|---|---|---|
| BPI12 | 6508 ± 235 | 9246 ± 873 | **+42 %** | 0.002 ** |
| Bac | 2265 ± 1072 | 3880 ± 3254 | **+71 %** | <0.001 *** |
| Hospital | 115 ± 34 | 2077 ± 232 | **+1702 %** | 0.002 ** |

For Activity Occurrence (classification), F1 degradation is consistent but smaller in magnitude (−2 % to −7 %, statistically significant in all three use cases). Hospital exhibits the most dramatic shift, consistent with its having the most semantically rich activity vocabulary (e.g., `LABORATORIO`, `Triage_Color`).
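
The degradation column is the relative change in mean MAE. The sketch below recomputes it from the rounded means in the table; the function name `relative_degradation` is introduced here for illustration. Because the inputs are rounded, the results can differ slightly from the reported figures: Hospital comes out at roughly +1706 % rather than +1702 %, which would be expected if the source computed from unrounded means (an assumption, not something the table confirms).

```python
def relative_degradation(mae_non_hashed, mae_hashed):
    """Percentage increase in MAE when moving from non-hashed to hashed input."""
    return 100.0 * (mae_hashed - mae_non_hashed) / mae_non_hashed


# Rounded mean MAE values copied from the table: (non-hashed, hashed).
reported_means = {
    "BPI12": (6508, 9246),
    "Bac": (2265, 3880),
    "Hospital": (115, 2077),
}

degradation = {
    use_case: round(relative_degradation(non_hashed, hashed), 1)
    for use_case, (non_hashed, hashed) in reported_means.items()
}

for use_case, pct in degradation.items():
    print(f"{use_case}: +{pct} %")
```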

## Why this matters beyond Padella et al.

- **Disentangles two factors** that are often conflated in LLM benchmarking: (a) the model's reasoning capability and (b) the model's pre-training-derived priors about the test domain.
- **Operationalises a "C3 benchmark contamination" check** — the APM Manifesto flags benchmark-leakage as an unresolved problem; semantic hashing offers a concrete probe.
- **Transferable to other BPM-AI evaluations** — process model evaluation (e.g., Rebmann et al. 2024 semantics-aware PM benchmarks), LLM-conformance checking, LLM-bot evaluations in process modelling.
- **Cheap to apply** — only requires a deterministic hash function and re-running existing benchmarks; no model retraining needed.

## Limitations

- **Hash collisions**: Padella et al.'s 4-character codes over a 36-symbol alphabet give 36⁴ ≈ 1.7M unique codes, more than ample for typical event logs. Larger vocabularies need longer codes.
- **Symbol-level priors remain**: the probe removes word-level semantics but not lower-level priors (e.g. distributional knowledge about how categorical variables behave), so it cannot fully isolate distributional from semantic reliance.
- **Pure-semantic shortcuts only**: the probe says nothing about whether the LLM is exploiting *positional* or *syntactic* shortcuts.
- **Operates on string content, not on model architecture**: the probe does not distinguish "pre-training leakage" from "in-context learning of semantically rich inputs" (e.g. via example shots that happen to contain meaningful activity names).
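
The capacity figure in the first limitation can be checked with a few lines of arithmetic. The sketch below also estimates, via the standard birthday-bound approximation, how likely at least one collision would be if codes were assigned by a raw truncated hash with no collision handling. That assignment scheme is an assumption made here for illustration; the source only states that identifiers are unique. The function names are hypothetical.

```python
import math


def code_space(alphabet_size=36, length=4):
    """Number of distinct codes of the given length over the alphabet."""
    return alphabet_size ** length


def collision_probability(n_strings, space):
    """Birthday-bound approximation of P(at least one collision),
    assuming codes are drawn independently and uniformly from the space."""
    return 1.0 - math.exp(-n_strings * (n_strings - 1) / (2.0 * space))


def min_code_length(n_strings, alphabet_size=36):
    """Shortest code length whose code space can hold n_strings distinct codes."""
    length = 1
    while alphabet_size ** length < n_strings:
        length += 1
    return length


space = code_space()                       # 36**4 = 1679616
p_1000 = collision_probability(1000, space)
```

Under these assumptions a vocabulary of 1,000 strings already has roughly a one-in-four chance of at least one collision, so ample capacity alone does not guarantee uniqueness: the mapping needs explicit collision handling or sequential code assignment.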

## Related

[[concepts/llm-based-ppm]] · [[concepts/beta-learner-distillation]] · [[concepts/predictive-process-monitoring]] · [[syntheses/llm-bpm-reading-list]]