Project

Find the right language model with MÖVE

MÖVE is Bundesdruckerei GmbH’s first holistic AI model comparison, developed specifically for the requirements of the public sector. It helps public authorities and public-sector organisations to make decisions on the responsible use of artificial intelligence.

Data and Facts at a Glance

Project name

Evaluating models for public administration (Modelle für die öffentliche Verwaltung evaluieren; abbr.: “MÖVE”)

Duration

Since 01/2025

Funding body

Bundesdruckerei’s in-house research and innovation project

Partners

  • Federal Office for Information Security (BSI)
  • Fraunhofer Institute for Applied and Integrated Security (AISEC)

Project objective

MÖVE systematically assesses large language models (LLMs) to provide guidance for the selection of suitable AI models. This enables public administrations and state institutions to use them responsibly, securely and effectively.

Thematic focus

  • Artificial intelligence 
  • AI governance 
  • Trustworthy AI 
  • Evaluation of large language models (LLMs)

Contributed areas of expertise

  • AI research and evaluation 
  • Development of evaluation and governance frameworks 
  • Public sector expertise 
  • Benchmarking 
  • Regulatory context (e.g. the EU AI Act) 
  • Innovation development in the public sector 
  • Knowledge transfer

Uncontrolled growth of AI poses a challenge for public administration

New language models (LLMs) are released every week. Each one claims to be more powerful, safer or more efficient than its competitors. AI tools have enormous potential to make public administration more citizen-centric and future-proof. However, the pace of these releases creates a new problem for decision-makers in the public sector: the AI landscape is evolving faster than robust evaluation criteria are being developed. The key question is therefore not which model is generally considered the “best”, but rather which model suits a public authority’s specific requirements.

This is precisely where many AI model comparisons reach their limits, as they measure capabilities such as English text comprehension, mathematical tasks or general world knowledge. By contrast, they rarely take into account what matters much more in day-to-day public authority work, such as whether a language model hallucinates when responding to citizens’ enquiries or whether the provider transparently documents what the model was trained on.

MÖVE – a benchmark for AI models in public administration

With MÖVE, Bundesdruckerei GmbH has created an evaluation framework that, for the first time, combines technical performance and governance requirements in a single system. This provides comparable and practice-oriented guidance for selecting suitable AI models.

Practice-oriented evaluation

All models are tested using proprietary German-language datasets from the public administration context. The tasks include, for example, summarising specialist texts or answering questions.

Holistic assessment

The assessment covers not only pure model performance but also governance aspects such as security, transparency, sustainability and compliance with democratic values.

Well-informed decisions

The results provide a reliable data basis for selecting suitable AI models and support the responsible use of artificial intelligence in the public sector.

“MÖVE is the first holistic AI benchmark for public administration and makes Germany a pioneer in the responsible use of language models in the public sector.”

Dr. Thilo Michael, MÖVE Technical Lead

Evaluation methodology based on practice-oriented datasets

Subject-matter experts developed nine test datasets that reflect real use cases from German public administration. Instead of abstract questions, the datasets draw on material such as legal texts, internal administrative documents and publications from federal ministries.

Several of these datasets were created manually in-house (gold standard), while others were curated from publicly available administrative sources (silver standard). No details about the data are published, so that the results cannot be distorted by models that have already seen the test data during training.

MÖVE evaluation criteria for AI models

With this data as a basis, each model undergoes an automated assessment against seven criteria.

Performance – what can the model do?

Criterion | What is assessed
Summarising | Quality of summaries of resolutions, judgements and specialist texts
Answering questions | Precise answers based on specified documents (RAG scenario)
Extracting topics | Categorising and tagging documents

Governance – how responsibly does the model behave?

Criterion | What is assessed
Hallucinations | How often the model invents content that is not in the source
Politics & values | Compatibility of the answers with democratic core values
Sustainability | Efficiency in the use of computing resources
Transparency | How openly the provider documents training data, architecture and terms of use

Further criteria are under development. These include: translation, social fairness and security (the latter is being developed jointly with the BSI).

What can be done with the results

All assessments feed into the interactive model comparison. The results of over 40 evaluated language models are available there. Individual criteria can be shown or hidden. The overall score is automatically recalculated based on the selection. This makes it quicker to see which model fits your requirements.
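
As a rough illustration of how such a recalculation could work, the following Python sketch averages only the criteria that are currently selected. The unweighted mean, the 0 to 1 scale, the criterion keys and the function name `overall_score` are assumptions made for this example; the page does not state how MÖVE actually aggregates its scores.

```python
from statistics import mean


def overall_score(scores: dict[str, float], selected: set[str]) -> float:
    """Average the scores of the selected criteria.

    Assumption: an unweighted mean. The real MÖVE aggregation is not
    described in the source and may differ.
    """
    chosen = [scores[name] for name in selected if name in scores]
    if not chosen:
        raise ValueError("select at least one criterion that has a score")
    return mean(chosen)


# Hypothetical scores for one model on an assumed 0-1 scale.
example_scores = {
    "summarising": 0.80,
    "answering_questions": 0.70,
    "hallucinations": 0.60,
    "transparency": 0.50,
}

all_criteria = set(example_scores)

# Overall score with every criterion shown.
print(round(overall_score(example_scores, all_criteria), 3))

# Hiding one criterion changes the overall score.
print(round(overall_score(example_scores, all_criteria - {"transparency"}), 3))
```

Hiding a criterion on which a model scores poorly raises its overall score in this sketch, which is why the selection should follow the authority's actual requirements rather than the desired ranking.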

Comparable results for good decisions

Those responsible for the use of AI in a public authority do not need a blanket model recommendation, but a robust basis for decision-making. MÖVE provides an independent and systematic assessment of relevant language models for this purpose. The assessment is based on criteria that have been defined for the public sector and its representatives.

These partners support MÖVE

Evaluating AI systems for governmental use requires the highest standards in security, methodology and regulatory compliance. That is why MÖVE is not a standalone initiative. As a research project of Bundesdruckerei GmbH, the evaluation framework is being developed in collaboration with leading German institutions and is subject to ongoing scientific refinement.

The collaboration with the Fraunhofer Institute for Applied and Integrated Security (AISEC) focuses on the ongoing scientific and methodological development of the security evaluation criteria. New findings are incorporated into the MÖVE evaluation framework step by step.

In cooperation with the Federal Office for Information Security (BSI), approaches to assessment in the areas of cybersecurity, robustness and factual accuracy are being further developed.

“Generative AI presents us with major challenges, but we will make it trustworthy! MÖVE is a holistic AI benchmarking tool for LLMs with a focus on the German language and is a very important building block in achieving this.”

Gerhard Wunder, Head of Department Cognitive Security Technologies | Fraunhofer Institute for Applied and Integrated Security (AISEC)

Current results from the MÖVE model comparison

MÖVE provides a quick overview of AI language models in direct comparison. The results are published on an ongoing basis on the MÖVE project website.

Tools

MÖVE framework available as open source

The MÖVE framework is available in the open-source repository for anyone interested in understanding the evaluation methodology.

FAQ: Frequently Asked Questions

The selection of evaluated language models follows clearly defined criteria. The aim is to provide a comparison that is as practice-oriented and relevant as possible for use in the public sector. 

These include, among other things: 

  1. Open-weight models with publicly available weights that can be run on-premises 
  2. Models that are already in use in public-authority environments (e.g. KIPITZ from ITZBund) 
  3. Small language models with fewer than around 12 billion parameters, for resource-efficient local execution 
  4. German-language-optimised or fine-tuned models such as SauerkrautLM or Teuken 
  5. Proprietary reference models such as GPT-4o or GPT-4o-mini as a technological benchmark 

The list is continually being expanded. Suggestions are welcome.

The accuracy of the evaluation framework is analysed on the basis of the MÖVE results. Bootstrapped 95% confidence intervals are calculated for the scores of the individual models.
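
A percentile bootstrap is one common way to obtain such intervals. The sketch below resamples a model's per-task scores with replacement and reads the interval bounds off the sorted resampled means. The function name `bootstrap_ci`, the number of resamples and the example scores are hypothetical; the source does not specify which bootstrap variant MÖVE uses.

```python
import random
from statistics import mean


def bootstrap_ci(
    values: list[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap over per-task scores.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Mean of each resample, drawn with replacement and sorted ascending.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical per-task scores of one model on an assumed 0-1 scale.
task_scores = [0.62, 0.71, 0.58, 0.80, 0.66, 0.74, 0.69, 0.77]
low, high = bootstrap_ci(task_scores)
print(f"mean={mean(task_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals of two models overlap substantially, a difference in their scores should be read with caution.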

In addition, a multi-stage quality assurance analysis is performed: 

  • Internal consistency check 
    This checks whether the evaluation model produces stable and reproducible results over several runs.
  • Comparison with other evaluators 
    The results are compared with independent evaluation models in order to validate the assessments externally and to underpin them methodologically.
  • Check for systematic bias 
    The analysis also examines whether individual AI models tend to systematically favour their own phrasing or familiar response patterns. 
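
The internal consistency check can be pictured as measuring how much the evaluation model's score for the same answers varies across repeated runs. The following sketch is only an illustration under assumptions: the function names, the use of the population standard deviation and the tolerance of 0.02 are hypothetical and not taken from the MÖVE methodology.

```python
from statistics import mean, pstdev


def run_stability(run_scores: list[float]) -> dict[str, float]:
    """Summarise how much a score varies across repeated evaluation runs."""
    return {
        "mean": mean(run_scores),
        "spread": pstdev(run_scores),  # population standard deviation
        "range": max(run_scores) - min(run_scores),
    }


def is_stable(run_scores: list[float], max_spread: float = 0.02) -> bool:
    """True if the spread across runs stays within an assumed tolerance."""
    return pstdev(run_scores) <= max_spread


# Hypothetical scores from three runs of the evaluation model
# over the same set of answers.
stable_runs = [0.71, 0.72, 0.70]
unstable_runs = [0.50, 0.90, 0.70]

print(run_stability(stable_runs), is_stable(stable_runs))
print(run_stability(unstable_runs), is_stable(unstable_runs))
```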

In future, the calculated confidence intervals will be displayed transparently on the website and documented in a separate publication.

The test tasks in MÖVE are deliberately aligned with real requirements from day-to-day public administration. The focus is on activities that regularly occur in public authorities and where language models may be able to provide support in future. This includes, among other things, the precise summarisation of complex specialist texts such as resolutions, judgements or internal administrative documents. 

It also examines how reliably a model answers enquiries when it is only permitted to use information from specified sources such as statutory texts or guidelines. This makes it possible to assess how well a model works on a factual basis and whether the risk of hallucinations is reduced. In addition, MÖVE analyses how accurately documents are categorised and assigned to appropriate topics or keywords.

For a fair comparison, all language models are tested under comparable conditions. Each model runs with the officially recommended settings of the respective manufacturer and on standardised hardware.

The focus is not on a single best score or a technically optimised individual result. Rather, the typical performance that a model actually demonstrates in day-to-day practice is assessed. For this reason, additional technical interventions to artificially stabilise or reproduce answers are deliberately avoided.

The datasets are not published in order to rule out data contamination. If the data were public, models could be trained on it and the results would no longer be meaningful.

Submit your own model

Would you like to have your own language model evaluated? Please submit your model by email to: kontakt-kikc@bdr.de 

The same conditions apply to all models:

  • The results will be published, regardless of how the model performs 
  • Each model goes through the same evaluation process

Do you have questions or feedback on MÖVE? Simply contact us.

Camilla Dalerci
Deputy Head of the AI Competence Centre and Project Manager for MÖVE
Email: camilla.dalerci@bdr.de