Forecast Verification for AI Models
by Britta Seegebrecht and Sabrina Wahl

Can we trust AI models?
This question could be from a sci-fi movie, but nowadays it has become more and more relevant: so-called artificial intelligence (AI) based on machine learning (ML) has made its way into a vast range of application fields, from marketing to customer services and from medical assistance to quality control – and weather forecasting is no exception.
For predictions of weather conditions, this question translates roughly to: ‘How accurate and reliable is the forecast, where, and under which conditions or for which situations?’ These questions are targeted by the field of forecast verification.
What is forecast verification and why is it important?
The rise of numerical weather prediction (NWP) models in the mid-20th century increased the need for objective measures to assess the quality of forecasts. Depending on the type of forecast (deterministic or probabilistic) and the nature of the predicted variable (continuous values or binary event occurrence), different metrics or skill scores can be calculated to quantify forecast performance objectively. In this way, the progress of a model during its development can be systematically monitored, and different models can be compared against each other. Moreover, ranking forecasts from different models according to their value or skill for different tasks constitutes a crucial basis for their interpretation and application. The best model for rainfall might not be the best for predicting upper-air winds, and performance can vary across regions.
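As a minimal sketch of such metrics (using entirely made-up toy numbers), the following Python snippet computes the root-mean-square error (RMSE) for a continuous, deterministic forecast, and the Brier score and Brier skill score for a probabilistic forecast of a binary event:

```python
import numpy as np

# Hypothetical toy data: 'obs' are observed values, 'fcst' a deterministic
# forecast, 'prob' a probabilistic forecast of a binary event (e.g. rain).
obs = np.array([2.1, 0.0, 5.3, 1.2, 0.0])   # observed rainfall (mm)
fcst = np.array([1.8, 0.4, 4.9, 1.5, 0.2])  # deterministic forecast (mm)

event = (obs > 1.0).astype(float)            # binary event: rain > 1 mm
prob = np.array([0.7, 0.2, 0.9, 0.6, 0.1])   # forecast event probabilities

# Root-mean-square error for the continuous, deterministic forecast.
rmse = np.sqrt(np.mean((fcst - obs) ** 2))

# Brier score for the probabilistic, binary forecast (lower is better).
brier = np.mean((prob - event) ** 2)

# A skill score compares a forecast against a reference, here climatology
# (the mean observed event frequency); 1 is perfect, 0 means no skill.
brier_ref = np.mean((event.mean() - event) ** 2)
bss = 1.0 - brier / brier_ref

print(f"RMSE: {rmse:.2f} mm, Brier score: {brier:.3f}, BSS: {bss:.3f}")
```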
Therefore, forecast verification is a complex task that must consider different perspectives and needs. The goal of model verification is to provide a comprehensive picture of model performance based on a series of meaningful scores. To this end, a wide range of verification methods has been developed over the past decades.
New insights and approaches in model development – e.g., increasing model resolution, the introduction of probabilistic models, and awareness of potential effects of climate change on the weather – have guided this path. But the influence is not just one-sided, it is mutual: weaknesses of forecasting models are identified and characterized by verification, allowing them to be addressed properly. The rise of AI weather prediction models introduces new questions about how to assess the quality and dependability of such models.
How do NWP and AIWP models differ?
Generally speaking, weather forecasting describes the mapping from a previous atmospheric state at time t0 to a future state at time t > t0. The definition and application of this mapping differ profoundly between AIWP and NWP models:
NWP models are based on our physical understanding of the Earth system. A set of physical equations describing spatial and temporal relationships between different variables constitutes the core of these models. The solution of this system can be approximated by numerical integration.
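As a rough illustration of this principle – not an actual NWP model – the following Python sketch integrates the Lorenz-63 system, a classic toy model of atmospheric convection, with a simple forward-Euler scheme:

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 equations, a classic
    low-dimensional toy model of atmospheric convection."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z), x * y - beta * z])

# Forward-Euler integration: repeatedly apply small time steps to advance
# the model state, approximating the solution of the physical equations.
state = np.array([1.0, 1.0, 1.0])   # initial condition at time t0
dt = 0.01                           # time step
for _ in range(1000):               # integrate forward to t = t0 + 10
    state = state + dt * lorenz63(state)

print("state after integration:", state)
```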
In contrast, the terms AI, ML and data-driven model are all used synonymously for models that “learn” statistical, spatio-temporal relationships from a huge number of past cases (called training data). If the training data is a representative sample of all possible atmospheric states, the trained model can “transfer” the learned relations to cases outside the training period and produce skillful forecasts. This is called generalization and is a crucial principle for AI models.
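A minimal sketch of this idea on purely synthetic data: a simple statistical model is fitted on the earlier part of a record and evaluated on a held-out later period; skill on these unseen cases is what generalization means here. All values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 'historical record': predict tomorrow's value from today's,
# with a synthetic persistence-plus-noise relationship.
n = 500
today = rng.normal(15.0, 5.0, n)                      # e.g. daily temperatures
tomorrow = 0.8 * today + 3.0 + rng.normal(0, 1.5, n)

# Fit a linear model on the earlier part of the record ("training period")...
split = 400
X_train = np.column_stack([today[:split], np.ones(split)])
coef, *_ = np.linalg.lstsq(X_train, tomorrow[:split], rcond=None)

# ...and evaluate it on the held-out later part (unseen cases).
X_test = np.column_stack([today[split:], np.ones(n - split)])
pred = X_test @ coef
rmse = np.sqrt(np.mean((pred - tomorrow[split:]) ** 2))
print(f"out-of-sample RMSE: {rmse:.2f}")
```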
The training of a large model with billions of parameters can take several weeks to months. Once sufficiently trained, however, AIWP models provide a very fast and economical way to map between atmospheric states. This is a big practical and economic advantage and drives the fast progress in this field.
Which new verification questions arise for AIWP models?
This new approach of learning relations inherent in historical data poses new verification questions, such as: Are AIWP models ‘producing’ physically realistic atmospheric states? Are physical realism and consistency a necessary prerequisite for good and reliable forecasts? Can these models predict extreme high-impact events that are rare in the training data or even exceed its range? Is an AIWP model trained on historical data robust in a changing climate?
Answering questions like these is of high scientific as well as political and socio-economic interest for assessing the potential and limitations of AI models, especially when it comes to system-relevant applications such as disaster control.
Within the RAINA project, we aim to contribute to this challenging task by developing an extended verification framework that includes well-known and potentially new methods suited to addressing some of these questions.