Technical Blog Vol. 3

Ilaria Luise, Savvas Melidonis, Sorcha Owens, Julius Polz

Traditional weather forecasting relies on numerical models with decades of refinement, evaluated using well-established tools designed for gridded data (such as GRIB or NetCDF) and fixed forecast horizons. But machine learning is changing the game, offering unstructured, probabilistic, and dynamically adapted weather predictions. Existing evaluation frameworks assume rigid data formats and workflows and often remain closed-source, which limits collaboration.

In contrast, the WeatherGenerator project is a multi-institutional initiative that operates on data fundamentally different from traditional gridded formats. It ingests and produces unstructured data and operates on irregular meshes. Furthermore, its machine learning development cycles require rapid experimentation, short feedback loops, and frequent model iteration. Evaluation frameworks that require extensive preprocessing, rigid data alignment, or complex operational dependencies are therefore poorly suited to this context.

This led to the recognition that existing tools were insufficiently flexible for the project’s objectives. In particular, the rigidity of traditional evaluation frameworks made them unsuitable for fast turnaround evaluation during early development phases, when models and data streams are frequently changing.

In response to these challenges, the WeatherGenerator consortium developed a new evaluation framework known as the WeatherGenerator FastEvaluation package. The primary objective of this framework is to provide a flexible, open-source, and extensible evaluation solution tailored to the needs of machine learning–based weather and climate modelling. Rather than attempting to replicate the full functionality of long-established operational verification systems, the tool focuses on enabling rapid, consistent, and reproducible evaluation across a wide range of data types and experimental configurations.

Key features

A central design principle of the WeatherGenerator FastEvaluation tool is the minimisation of assumptions about the structure and format of the data being evaluated. Unlike many existing tools, it does not assume that inputs are provided on a regular latitude–longitude grid, nor does it require fixed forecast lead times or a single deterministic output stream.

One of the main features of the tool is its ability to perform scoring and visualisation within a unified framework. Rather than treating metric computation and plotting as separate tasks, the framework allows users to compute skill scores, generate two-dimensional maps, produce histograms, and create summary visualisations in a consistent and coordinated manner. This integration simplifies the evaluation workflow and reduces the potential for inconsistencies between numerical results and visual diagnostics.

From a performance perspective, WeatherGenerator FastEvaluation supports parallel processing of data, allowing computationally intensive score calculations to be distributed across available resources. This capability is essential for evaluating large datasets or ensemble forecasts on high-performance computing systems.
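As a rough illustration of distributing score calculations, the sketch below scores each forecast step in a separate worker using Python's standard concurrent.futures module. The function names (rmse_for_step, score_steps_parallel) and the list-based data layout are hypothetical and are not the package's API. A thread pool is used only to keep the example self-contained; for CPU-bound scoring on HPC systems, a process pool or a distributed scheduler would be the more typical choice.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rmse_for_step(pair):
    """Root-mean-square error for one forecast step (flat lists of point values)."""
    forecast, target = pair
    squared = [(f - t) ** 2 for f, t in zip(forecast, target)]
    return math.sqrt(sum(squared) / len(squared))


def score_steps_parallel(forecasts, targets, max_workers=4):
    """Score every forecast step concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rmse_for_step, zip(forecasts, targets)))


# Toy data (assumed layout): two forecast steps, three unstructured points each.
forecasts = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
targets = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
scores = score_steps_parallel(forecasts, targets)
```

Because each forecast step is scored independently, the same pattern extends naturally to splitting work by variable or by ensemble member.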

Scoring

The fast evaluation package provides a performant framework for computing verification metrics, centered on weather forecast verification. It is easily extensible, allowing users to implement custom metrics as needed, while already providing a wide range of commonly used metrics out of the box. Where possible, it integrates existing packages (e.g. xskillscore) to leverage well-maintained and optimized community tools. The package includes standard error metrics (e.g. RMSE, MAE), skill-based measures (e.g. ETS, PSS), diagnostic metrics assessing statistical properties of the forecast (e.g. forecast rate of change, spatial variability), and categorical (threshold-based) metrics. It also comes with a set of probabilistic scores that measure skill and ensemble calibration in probabilistic forecasts (e.g. CRPS, SSR, rank histograms). The tools also support the computation of forecast anomalies, i.e. deviations from the climatological baseline, which are essential for assessing forecast skill beyond persistence (e.g. ACC (Anomaly Correlation Coefficient), FACT (Forecast Activity), and TACT (Target Activity)). A full list of the currently implemented metrics is provided in the documentation.
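To make a few of these metrics concrete, here is a minimal sketch of RMSE, MAE, an empirical ensemble CRPS, and an uncentred ACC over flat lists of point values. It uses textbook definitions in plain Python, without the area weighting or masking a real verification would normally apply. The function names and signatures are illustrative assumptions and do not reflect how the package or xskillscore implement them.

```python
import math


def rmse(forecast, target):
    """Root-mean-square error over a set of points."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, target)) / len(forecast))


def mae(forecast, target):
    """Mean absolute error over a set of points."""
    return sum(abs(f - t) for f, t in zip(forecast, target)) / len(forecast)


def crps_ensemble(members, observation):
    """Empirical CRPS at one point: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term_obs = sum(abs(x - observation) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


def acc(forecast, target, climatology):
    """Anomaly correlation coefficient (uncentred form) over a set of points."""
    fa = [f - c for f, c in zip(forecast, climatology)]  # forecast anomalies
    ta = [t - c for t, c in zip(target, climatology)]    # target anomalies
    num = sum(a * b for a, b in zip(fa, ta))
    den = math.sqrt(sum(a * a for a in fa) * sum(b * b for b in ta))
    return num / den


# Toy values, chosen only for illustration.
example_mae = mae([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
example_rmse = rmse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
example_crps = crps_ensemble([0.0, 2.0], 1.0)
example_acc = acc([11.0, 9.0], [12.0, 8.0], [10.0, 10.0])
```

Note that none of these functions assumes a regular grid: each operates on a flat collection of points, which is the natural shape for unstructured data.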

When evaluating and comparing several model runs and configurations, the package utilizes JSON-based caching of computed scores to avoid recomputation. It also supports systematic logging of evaluation results to MLflow for online visualisation.
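The idea behind JSON-based score caching can be sketched as follows. This is a simplified illustration under assumed names: the cache layout, the key format, the run identifier "run_a", and the function cached_score are all hypothetical and do not describe the package's actual cache files.

```python
import json
import tempfile
from pathlib import Path


def cached_score(cache_path, run_id, metric, compute):
    """Return a cached score if present; otherwise compute it and store it."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = f"{run_id}/{metric}"  # hypothetical key format
    if key not in cache:
        cache[key] = compute()
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]


calls = []


def expensive_rmse():
    calls.append(1)  # record that the computation really ran
    return 1.25


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scores.json"
    first = cached_score(path, "run_a", "rmse", expensive_rmse)   # computed
    second = cached_score(path, "run_a", "rmse", expensive_rmse)  # read from JSON
```

The second call never invokes the expensive function, which is the saving that matters when many runs and configurations are compared repeatedly.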

Visualization and Plotting

The evaluation package also includes several built-in visualization tools for both data exploration and score comparisons.

Data exploration

Geospatial visualisations are available in the form of maps, enabling inspection of the spatial structure of forecasts and errors. These can be generated for individual forecast lead times, either globally or over user-defined regions. In addition, temporal evolution can be visualised through animations (GIF or MP4), facilitating the analysis of how spatial patterns develop across forecast steps.

Score comparisons

Model performance across forecast lead times can be assessed using line charts, which illustrate the evolution of selected metrics. For ensemble forecasts, these plots can also represent ensemble spread, enabling probabilistic evaluation and uncertainty quantification.
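One way to derive the values behind such a line chart is sketched below: for each forecast step, the score is averaged over ensemble members and a band of plus or minus one standard deviation is computed, which a plotting library could then draw as a shaded region. This is an assumption about how spread might be represented, not a description of the package's plots, and spread_band is a hypothetical helper.

```python
import statistics


def spread_band(scores_by_step):
    """Per forecast step: ensemble-mean score and a +/- one-std band across members."""
    band = []
    for member_scores in scores_by_step:
        mean = statistics.mean(member_scores)
        std = statistics.pstdev(member_scores)  # population std across members
        band.append({"lower": mean - std, "mean": mean, "upper": mean + std})
    return band


# Toy RMSE values: two forecast steps, two ensemble members each.
band = spread_band([[1.0, 3.0], [2.0, 6.0]])
```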

For comparative analysis across different trained models, the package includes a set of additional tools specifically designed for this purpose. Score cards, bar plots, ratio plots and heat maps summarize relative performance changes of the models with respect to a chosen baseline across variables of interest.

Example of a heat map used to give an overview of forecast skill across forecast steps and variables for two different runs.
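The quantity typically shown in score cards and heat maps of this kind is a relative change with respect to a baseline. The sketch below computes a percent change per variable and forecast step. The dictionary layout, the variable names "2t" and "10u", and the function relative_change are illustrative assumptions rather than the package's data model.

```python
def relative_change(scores, baseline):
    """Percent change of each score relative to the baseline, per variable and step."""
    return {
        var: [100.0 * (s - b) / b for s, b in zip(steps, baseline[var])]
        for var, steps in scores.items()
    }


# Toy RMSE values per variable, one entry per forecast step.
baseline = {"2t": [1.0, 2.0], "10u": [2.0, 4.0]}
candidate = {"2t": [0.5, 3.0], "10u": [2.0, 3.0]}
change = relative_change(candidate, baseline)
```

For an error metric such as RMSE, negative values indicate an improvement over the baseline; for skill scores where higher is better, the sign convention is reversed.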

Typical workflow

A standard evaluation workflow begins with line plots of lead-time metrics to assess overall performance trends. This is followed by spatial maps and animations to analyze the distribution and propagation of errors. When ensemble outputs are available, line plots can additionally be used to visualize uncertainty through ensemble spread. For multi-model comparisons, bar plots and scorecards provide an efficient summary of relative performance across configurations.

Connection with other evaluation tools

The evaluation package also includes an export tool that converts WeatherGenerator output into the formats required by other evaluation tools, enabling more comprehensive analyses. Native WeatherGenerator outputs, which are saved in Zarr format during model inference, can be converted to widely accepted file formats such as NetCDF and GRIB. These files can then be read by external open-source tools as well as by the in-house tools of national meteorological services and partner institutions. Furthermore, the export tool provides the infrastructure for engaging with the model once it is released more widely to the public.

At the moment, the export tool supports conversion to CF-compliant NetCDF and GRIB formats suitable for the UK Met Office CSET evaluation package, the MET Norway verification and Diana tools, ECMWF’s Quaver, and the MeteoSwiss mlEval tools. A collaborative effort to extend the existing export tool also improved the performance of the existing code infrastructure, and it provides an example that other national meteorological services can draw upon to devise their own custom outputs.

The tool also supports regridding to other regular grids at conversion time through the ECMWF library earthkit-regrid. With this at the user’s disposal, WeatherGenerator results can be compared more easily with traditional NWP models such as IFS and recent ML models such as Microsoft’s Aurora, supporting both model development and evaluation. We will add more file formats as user requirements evolve and will continue to support efforts to evaluate the WeatherGenerator model using existing tools.
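To illustrate what regridding from unstructured points to a regular latitude–longitude grid involves, here is a deliberately simple nearest-neighbour sketch in plain Python. It is a stand-in for illustration only: it does not use earthkit-regrid, and production regridding relies on more sophisticated and far more efficient interpolation. The function names and the toy coordinates are assumptions.

```python
import math


def great_circle(lat1, lon1, lat2, lon2):
    """Great-circle distance on the unit sphere (haversine), inputs in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def regrid_nearest(points, values, target_lats, target_lons):
    """Fill a regular lat-lon grid with the value of the nearest source point."""
    grid = []
    for lat in target_lats:
        row = []
        for lon in target_lons:
            nearest = min(
                range(len(points)),
                key=lambda i: great_circle(lat, lon, points[i][0], points[i][1]),
            )
            row.append(values[nearest])
        grid.append(row)
    return grid


# Toy unstructured input: (lat, lon) pairs with one value each.
points = [(0.0, 0.0), (0.0, 90.0), (45.0, 0.0)]
values = [1.0, 2.0, 3.0]
grid = regrid_nearest(points, values, target_lats=[0.0, 40.0], target_lons=[0.0, 80.0])
```

The brute-force search is quadratic in the number of points, so it only serves to show the idea on small inputs.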

In summary, WeatherGenerator FastEvaluation is one of the first examples of a flexible, scalable, and open-source solution that bridges the gap between traditional verification tools and the needs of modern machine learning workflows, enabling efficient, reproducible, and comprehensive model evaluation for this new set of multi-resolution and multi-purpose models.