---
title: "The BRAGE Benchmark: Evaluating Zero-shot Learning Capabilities of LLMs for Norwegian Customer Service Dialogues"
type: source
tags: [llm, benchmark, zero-shot, norwegian, customer-service, instruction-tuning, evaluation, concept-drift, telecom]
authors: ["Riess, Mike", "Jørgensen, Tollef Emil"]
year: 2025
venue: "Proceedings of NoDaLiDa / Baltic-HLT 2025, pp. 525–536, University of Tartu Library"
kind: paper
raw_path: "raw/Riess/Riess 2025.pdf"
sources: ["[[sources/2022-riess-metaheuristics-concept-drift-survey]]"]
key_claims:
  - "Introduces BRAGE, a private 300-call Norwegian customer-service benchmark for zero-shot LLM classification into 8 product categories, constructed from real (anonymised) Telenor call transcripts using the same codebook given to human annotators."
  - "Instruction-tuned LLMs dramatically outperform base models on BRAGE: base models score near the random-guess baseline (19.72% accuracy), while the best instruction-tuned models (Gemma2 9B IT and 27B IT) reach roughly 60–62% accuracy."
  - "English / multilingual instruction models (Gemma2 family) outperform Norwegian-specific pre-trained (PNB) and fine-tuned (FNB) models of similar parameter counts."
  - "The performance gap between base and instruction-tuned models is more pronounced on BRAGE than on other Norwegian generative benchmarks (NorNE, NoReC, NorQuAD, HellaSwag), indicating that BRAGE requires precise, generalisable instruction-tuning rather than surface-level language exposure."
  - "High HellaSwag (commonsense reasoning) score predicts high BRAGE score; high NorNE (NER) score does not — suggesting BRAGE probes reasoning over long instructions rather than surface Norwegian coverage."
  - "Motivation links to concept drift: call-topic distributions in customer service shift over time, so re-training classification models is costly; zero-shot LLM-based classification is proposed as a lower-maintenance alternative."
  - "Private benchmark due to business sensitivity; aggregated results are shared publicly via https://github.com/tnresearch/brage and a Nordic research collaboration channel."
  - "Sustainability reporting: 0.8394 kgCO2e over 29.2h on 4× RTX 3090 in Oslo."
created: 2026-04-20
updated: 2026-04-20
---

# Riess & Jørgensen 2025 — The BRAGE Benchmark

Co-authored NoDaLiDa/Baltic-HLT 2025 paper by Mike Riess ([[entities/telenor|Telenor Group]], Research and Innovation) and Tollef Emil Jørgensen ([[entities/ntnu|NTNU]], Department of Computer Science).

## Summary
Customer service in telecommunications produces large volumes of recorded calls, which after automatic speech recognition become transcribed dialogue data. Classification models over these transcripts drive analytics dashboards that let providers spot emerging issues in real time. Two operational problems recur: (1) building supervised classifiers is expertise-intensive, and (2) the input distribution drifts over time (new products, new issues), so classifiers must be retrained — which Riess's earlier work ([[sources/2022-riess-metaheuristics-concept-drift-survey|Riess 2022]]) explicitly framed as the [[concepts/concept-drift]] problem.

The paper proposes **zero-shot LLM classification** as a lower-maintenance alternative: instead of training a task-specific classifier, ask a general-purpose LLM to categorise a transcript using the same human-annotator codebook. This is evaluated via **BRAGE**, a private benchmark constructed from 300 transcribed Norwegian customer-service calls from a telecommunications provider (Telenor), annotated by a senior analyst with 25+ years of domain experience across eight product categories (Mobile, Services, Broadband, TV, Broadband-mobile, Email, Insurance, Other). Class distribution is skewed (Mobile ≈ 37%, Insurance ≈ 5.7%); random-guess and majority-class baselines are computed accordingly.
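
The two baselines mentioned above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, the labels are toy values rather than BRAGE data, and it is an assumption that "random guess" means guessing in proportion to class frequency (a uniform guess over the classes is shown for contrast).

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def stratified_random_accuracy(labels):
    """Expected accuracy when guessing each label with its empirical frequency.

    If class k has share p_k, a guess is correct with probability sum(p_k ** 2).
    """
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())


def uniform_random_accuracy(labels):
    """Expected accuracy when guessing uniformly over the observed classes."""
    return 1 / len(set(labels))


# Toy labels only (assumption): not the BRAGE class distribution.
toy_labels = ["mobile", "mobile", "mobile", "tv"]
print(majority_class_accuracy(toy_labels))     # 0.75
print(stratified_random_accuracy(toy_labels))  # 0.625
print(uniform_random_accuracy(toy_labels))     # 0.5
```

On a skewed distribution the frequency-weighted guess scores above the uniform one, which is why a baseline on eight skewed classes can sit well above 12.5%.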

Research questions:
- **RQ1** — how do open-weight Norwegian models compare on BRAGE?
- **RQ2** — how do BRAGE results align with existing Norwegian downstream benchmarks (ScandEval: NorNE NER, NoReC sentiment, NorQuAD QA, HellaSwag-no commonsense)?

Methodology: zero-shot inference with constrained output (Outlines + HuggingFace Transformers), temperature 0, a fixed seed, and 10 bootstrap iterations per run. The prompt format is adapted per model card (ChatML, Alpaca, or no format for base models), with truncation to the first 250 tokens. The model set spans pre-trained multilingual base (P), pre-trained Norwegian Bokmål base (PNB), instruction-tuned multilingual (IT), and Norwegian fine-tuned (FNB) models, including combinations of these. Metrics: Accuracy, Macro-F1, and Matthews Correlation Coefficient (MCC).
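
The three metrics and the bootstrap step can be sketched in plain Python as below. This is an illustrative sketch under assumptions, not the paper's evaluation code: the function names are hypothetical, MCC uses the standard multiclass generalisation, and it is assumed here that bootstrapping means resampling evaluation examples with replacement (the note does not say what is resampled).

```python
import math
import random
from collections import Counter
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or prediction."""
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mcc(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient; 0.0 when undefined."""
    s = len(y_true)                                      # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correct predictions
    t_k, p_k = Counter(y_true), Counter(y_pred)          # per-class true / predicted counts
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denom = math.sqrt((s * s - sum(v * v for v in t_k.values()))
                      * (s * s - sum(v * v for v in p_k.values())))
    return numerator / denom if denom else 0.0


def bootstrap(y_true, y_pred, metric, n_iter=10, seed=0):
    """Mean and standard deviation of a metric over resampled evaluation sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return mean(scores), stdev(scores)


# Toy labels only (assumption): not BRAGE data.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), mcc(y_true, y_pred))
print(bootstrap(y_true, y_pred, accuracy))
```

Reporting MCC next to accuracy is useful on a skewed class distribution, because a model that always predicts the majority class gets non-trivial accuracy but an MCC of 0.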

Key findings:
1. **Base models score near the random-guess baseline** (about 19.7% accuracy); instruction-tuned models vary widely, reaching up to ~62% accuracy (Gemma2 9B IT).
2. **English and multilingual models beat Norwegian-specific ones** at comparable parameter scales: Gemma2 2B IT (English-only) outperforms every dedicated Norwegian model tested.
3. **Norwegian-specific fine-tuning can hurt**: NorwAI and NORA.LLM Norwegian fine-tunes underperform their multilingual base versions; this is consistent with findings by Barnett et al. 2024 and Ghosh et al. 2024 that supervised fine-tuning (SFT) can degrade knowledge.
4. **BRAGE tracks HellaSwag in model ranking** but diverges from NorNE, indicating that the benchmark probes reasoning over long instructions rather than surface Norwegian linguistic competence.
5. **Not production-ready**: 60%+ accuracy is not sufficient for full deployment; the approach partially automates annotation rather than eliminating it.
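
The rank-correlation comparison in finding 4 can be illustrated with a small Spearman sketch. It is an assumption that Spearman's coefficient is the relevant statistic (the note only says "rank correlation"); the function names are hypothetical and the scores are made-up toy values, not results from the paper.

```python
import math
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-model scores on three benchmarks (made-up values, one entry per model).
scores_a = [0.20, 0.45, 0.62]
scores_b = [0.30, 0.50, 0.70]  # orders the models the same way as scores_a
scores_c = [0.80, 0.60, 0.70]  # orders the models differently

print(spearman(scores_a, scores_b))  # 1.0
print(spearman(scores_a, scores_c))  # -0.5
```

A coefficient near 1 means two benchmarks order the models the same way, while a low or negative value means a strong score on one says little about the other.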

Discussion notes several recommendations: knowledge distillation to smaller student models at compute-optimal token counts, RLHF alignment, instruction-formatting studies, open Norwegian instruction datasets. Sustainability and cost-vs-benefit concerns about using large LLMs for tasks solvable by smaller energy-efficient models are explicitly raised.

## Connections
- Extends the [[concepts/concept-drift|concept-drift]] motivation from [[sources/2022-riess-metaheuristics-concept-drift-survey]] — LLM zero-shot classification as a drift-robust alternative to re-training supervised models.
- Establishes [[concepts/llm-benchmarking|LLM benchmarking]] (*new concept page may be warranted*) in a low-resource-language / telecom customer-service setting.
- Connects to [[concepts/ai-agent-benchmarks]] as a domain-specific benchmark design.
- Author entity hub: [[entities/mike-riess]]; co-author affiliation: [[entities/ntnu]] (Tollef Emil Jørgensen); primary affiliation: [[entities/telenor]].
- Code: https://github.com/tnresearch/brage