Find the right language model with MÖVE
MÖVE is Bundesdruckerei GmbH’s first holistic AI model comparison, developed specifically for the requirements of the public sector. It helps public authorities and public-sector organisations to make decisions on the responsible use of artificial intelligence.
Data and Facts at a Glance
Project name
Evaluating models for public administration (Modelle für die öffentliche Verwaltung evaluieren; abbr.: “MÖVE”)
Duration
Since 01/2025
Funding body
Bundesdruckerei’s in-house research and innovation project
Partners
- Federal Office for Information Security (BSI)
- Fraunhofer Institute for Applied and Integrated Security (AISEC)
Project objective
MÖVE systematically assesses large language models (LLMs) to provide guidance for the selection of suitable AI models. This enables public administrations and state institutions to use them responsibly, securely and effectively.
Thematic focus
- Artificial intelligence
- AI governance
- Trustworthy AI
- Evaluation of large language models (LLMs)
Contributed areas of expertise
- AI research and evaluation
- Development of evaluation and governance frameworks
- Public sector expertise
- Benchmarking
- Regulatory context (e.g. the EU AI Act)
- Innovation development in the public sector
- Knowledge transfer
Uncontrolled growth of AI poses a challenge for public administration
New large language models (LLMs) are released every week. Each one claims to be more powerful, safer or more efficient than its competition. AI tools have enormous potential to make public administration more citizen-centric and future-proof. However, the pace of releases creates a new problem for decision-makers in the public sector: the AI landscape is evolving faster than robust evaluation criteria are being developed. The key question is therefore not which model is generally considered the “best”, but rather which model suits a public authority’s specific requirements.
This is precisely where many AI model comparisons reach their limits, as they measure capabilities such as English text comprehension, mathematical tasks or general world knowledge. By contrast, they rarely take into account what matters much more in day-to-day public authority work, such as whether a language model hallucinates when responding to citizens’ enquiries, or whether the provider offers transparent documentation on what the model was trained on.
MÖVE – a benchmark for AI models in public administration
With MÖVE, Bundesdruckerei GmbH has created an evaluation framework that, for the first time, combines technical performance and governance requirements in a single system. This provides comparable and practice-oriented guidance for selecting suitable AI models.
Evaluation methodology based on practice-oriented datasets
Subject-matter experts developed nine test datasets that reflect real use cases from German public administration. Instead of abstract questions, the datasets draw on material such as legal texts, internal administrative documents and publications from federal ministries.
Several of these datasets were created manually in-house (gold standard), while others were curated from publicly available administrative sources (silver standard). No details about the data used are published, so that models cannot be trained on it and the results are not distorted.
MÖVE evaluation criteria for AI models
With this data as a basis, each model undergoes an automated assessment against seven criteria.
Performance – what can the model do?
Governance – how responsibly does the model behave?
Further criteria are under development. These include: translation, social fairness and security (the latter is being developed jointly with the BSI).
What can be done with the results
All assessments feed into the interactive model comparison. The results of over 40 evaluated language models are available there. Individual criteria can be shown or hidden. The overall score is automatically recalculated based on the selection. This makes it quicker to see which model fits your requirements.
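The recalculation described above can be pictured with a minimal sketch. It assumes that the overall score is an unweighted mean of the criteria currently shown; the source does not state the actual aggregation or weighting. The criterion names, the 0 to 1 scale and the function `overall_score` are hypothetical.

```python
from statistics import mean


def overall_score(criterion_scores: dict[str, float], selected: set[str]) -> float:
    """Average only the criteria that are currently shown.

    Assumption: an unweighted mean. MÖVE's real aggregation is not documented here.
    """
    chosen = [score for name, score in criterion_scores.items() if name in selected]
    if not chosen:
        raise ValueError("select at least one criterion")
    return mean(chosen)


# Hypothetical scores for one model; names and values are purely illustrative.
scores = {"summarisation": 0.80, "grounded_answers": 0.60, "transparency": 0.40}

# All criteria shown.
all_criteria = overall_score(scores, set(scores))

# One criterion hidden: the score is recomputed from the remaining two.
two_criteria = overall_score(scores, {"summarisation", "grounded_answers"})
```

Hiding a criterion on which a model scores poorly raises its overall score in this sketch, which is why the ranking can change with the selection.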
Comparable results for good decisions
Those responsible for the use of AI in a public authority do not need a blanket model recommendation, but a robust basis for decision-making. MÖVE provides an independent and systematic assessment of relevant language models for this purpose. The assessment is based on criteria that have been defined for the public sector and its representatives.
These partners support MÖVE
Evaluating AI systems for governmental use requires the highest standards in security, methodology and regulatory compliance. That is why MÖVE is not a standalone initiative. As a research project of Bundesdruckerei GmbH, the evaluation framework is being developed in collaboration with leading German institutions and is subject to ongoing scientific refinement.
The collaboration with the Fraunhofer Institute for Applied and Integrated Security (AISEC) focuses on the ongoing scientific and methodological development of the security evaluation criteria. New findings are incorporated into the MÖVE evaluation framework step by step.
In cooperation with the Federal Office for Information Security (BSI), approaches to assessment in the areas of cybersecurity, robustness and factual accuracy are being further developed.
FAQ: Frequently Asked Questions
The selection of evaluated language models follows clearly defined criteria. The aim is to provide a comparison that is as practice-oriented and relevant as possible for use in the public sector.
These include, among other things:
- Open-weight models with publicly available weights that can be run on-premises
- Models that are already in use in public-authority environments (e.g. KIPITZ from ITZBund)
- Small language models with fewer than around 12 billion parameters, for resource-efficient local execution
- German-language-optimised or fine-tuned models such as SauerkrautLM or Teuken
- Proprietary reference models such as GPT-4o or GPT-4o-mini as a technological benchmark
The list is continually being expanded. Suggestions are welcome.
The results from MÖVE undergo an accuracy analysis: bootstrapped 95% confidence intervals are calculated for the scores of the individual models.
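As an illustration of what a bootstrapped 95% confidence interval involves, here is a minimal sketch using the percentile method. The source only states that bootstrapped intervals are calculated; the percentile variant, the number of resamples, the per-task score format and the function name `bootstrap_ci` are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a model's mean score.

    Assumption: one score per test item. The actual MÖVE procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(item_scores)
    # Resample the items with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(item_scores, k=n)) for _ in range(n_resamples)
    )
    # Cut off alpha/2 of the resampled means at each end.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-item results for one model (1 = correct, 0 = incorrect).
item_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
lower, upper = bootstrap_ci(item_scores)
```

If the intervals of two models overlap widely, the difference between their scores should be interpreted with caution.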
In addition, a multi-stage quality assurance analysis is performed:
- Internal consistency check: This checks whether the evaluation model produces stable and reproducible results over several runs.
- Comparison with other evaluators: The results are compared with independent evaluation models in order to validate the assessments externally and to underpin them methodologically.
- Check for systematic bias: The analysis also examines whether individual AI models tend to systematically favour their own phrasing or familiar response patterns.
In future, the calculated confidence intervals will be displayed transparently on the website and documented in a separate publication.
The test tasks in MÖVE are deliberately aligned with real requirements from day-to-day public administration. The focus is on activities that regularly occur in public authorities and where language models may be able to provide support in future. This includes, among other things, the precise summarisation of complex specialist texts such as resolutions, judgements or internal administrative documents.
MÖVE also examines how reliably a model answers enquiries when it is only permitted to use information from specified sources such as statutory texts or guidelines. This makes it possible to assess how well a model works on a factual basis and whether the risk of hallucinations is reduced. In addition, MÖVE analyses how accurately documents are categorised and assigned to appropriate topics or keywords.
For a fair comparison, all language models are tested under comparable conditions. Each model runs with the officially recommended settings of the respective manufacturer and on standardised hardware.
The focus is not on a single best score or a technically optimised individual result. Rather, the typical performance that a model actually demonstrates in day-to-day practice is assessed. For this reason, additional technical interventions to artificially stabilise or reproduce answers are deliberately avoided.
The test datasets are not published in order to rule out data contamination. If the data were public, models could be trained on it and the results would no longer be meaningful.