The provided simulations

Two ingredients are necessary to use unseen-awg simulations.

unseen-awg is a resampling-based weather generator, i.e., it simply combines time steps from an existing dataset in a new temporal order. Therefore, simulated time series are stored very compactly as “look-up tables” of the resampled dataset. Using the simulations requires downloading the “look-up tables” and the dataset of impact-relevant variables that is being resampled. Both ingredients can be downloaded from the World Data Center for Climate as described in an earlier section.
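As a minimal illustration of this look-up principle, the following sketch builds a tiny synthetic "reforecast" dataset and look-up table (all names, sizes, and values here are invented, not the real datasets) and uses xarray's vectorized label-based indexing to assemble the resampled series:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic "reforecast" cube: 2 ensemble members, 3 init times, 4 lead times.
init_times = pd.date_range("2020-01-01", periods=3, freq="7D")
lead_times = pd.to_timedelta(np.arange(1, 5), unit="D")
source = xr.Dataset(
    {
        "t2m": (
            ("ensemble_member", "init_time", "lead_time"),
            np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4),
        )
    },
    coords={
        "ensemble_member": [0, 1],
        "init_time": init_times,
        "lead_time": lead_times,
    },
)

# Tiny synthetic "look-up table": for each output day, the coordinates of the
# reforecast time step to copy.
lookup = xr.Dataset(
    {
        "ensemble_member": ("out_time", [1, 0, 1]),
        "init_time": ("out_time", init_times[[0, 2, 1]].values),
        "lead_time": ("out_time", lead_times[[3, 0, 2]].values),
    },
    coords={"out_time": pd.date_range("2021-01-01", periods=3)},
)

# Vectorized label-based indexing assembles the resampled series: one t2m value
# per out_time, taken from the indicated reforecast time step.
resampled = source.sel(lookup)
```

The look-up table itself stores only coordinate labels, which is why the simulations can be archived so compactly.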

To make use of the simulations, both datasets archived at the World Data Center for Climate are necessary: the simulated time series (“look-up tables”) and the dataset of impact-relevant variables.

As described earlier, all partial zip files of the dataset of impact-relevant variables must be extracted. If some of the partial files are missing, the resulting Zarr store may still open, but it will contain missing data.
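A hedged sketch of the extraction step, assuming each partial file is itself a complete zip archive and that all parts are extracted into the same target directory (the function name and paths are hypothetical):

```python
import zipfile
from pathlib import Path

def extract_parts(download_dir: Path, target_dir: Path) -> list[Path]:
    """Extract all partial zip files into one directory.

    Assumes (hypothetically) that each partial file is a standalone zip
    archive holding a different subset of the Zarr store's contents, so
    extracting all of them into the same directory reassembles the store.
    """
    parts = sorted(download_dir.glob("*.zip"))
    for part in parts:
        with zipfile.ZipFile(part) as zf:
            zf.extractall(target_dir)
    return parts
```

Comparing the number of extracted parts against the number of partial files listed at the World Data Center for Climate is a simple way to detect a missing part.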

Optionally, the simulations can also be used in combination with the reforecast dataset of large-scale atmospheric circulation fields: “Preprocessed atmospheric circulation variables - Extended ensemble forecast hindcast (ECMWF) for unseen-awg simulations”.

Opening the simulations using xarray

We use the xarray Python library, which makes working with Zarr stores and netCDF4 files straightforward:

import xarray

# path_timeseries: the downloaded netCDF4 file with the simulated time series
# path_ds_impact_relevant_variables: the extracted Zarr store
generated_time_series = xarray.open_dataset(path_timeseries)
ds_impact_relevant_variables = xarray.open_zarr(path_ds_impact_relevant_variables)

# smaller chunks for reduced memory demand
ds_impact_relevant_variables = ds_impact_relevant_variables.chunk(
    {"ensemble_member": 1, "init_time": 1}
)

For each out_time time step, the simulated time series (generated_time_series) hold the coordinate vectors that allow looking up the corresponding data within the large reforecast dataset, i.e. the ensemble_member, init_time, and lead_time of a reforecast time step:

generated_time_series
<xarray.Dataset> Size: 92MB
Dimensions:          (seed: 500, out_time: 7670)
Coordinates:
  * seed             (seed) int64 4kB 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * out_time         (out_time) datetime64[ns] 61kB 2003-01-01 ... 2023-12-31
    sigma            float64 8B ...
    blocksize        timedelta64[ns] 8B ...
Data variables:
    lead_time        (seed, out_time) timedelta64[ns] 31MB ...
    init_time        (seed, out_time) datetime64[ns] 31MB ...
    ensemble_member  (seed, out_time) float64 31MB ...
Attributes:
    Conventions:        CF-1.7
    Title:              Example simulations with unseen-awg v1.0
    Source:             Simulations were obtained using unseen-awg v1.0 and d...
    Creator:            Jonathan Wider (ORCID: 0000-0002-5185-5768)
    Institution:        Helmholtz Centre for Environmental Research – UFZ
    Creation_date:      2026-04-13 20:11:16
    License:            Creative Commons Attribution 4.0 International
    probability_model:  NoRestrictions

The provided dataset includes 500 generated daily weather time series, each 21 years long. To extract the actual weather data, the generated time series can be used directly as label-based indexers into the xarray dataset of impact-relevant variables:

generated_data = ds_impact_relevant_variables.sel(generated_time_series)

The dataset ds_impact_relevant_variables is a Zarr store. As a result, xarray computes lazily, loading into memory only the data that is currently required. In the simplest case, users can load just the desired subsets of the data into memory, e.g., the first 10 time steps of the first generated time series:

generated_data.sel(seed=0).isel(out_time=slice(0, 10)).load()
<xarray.Dataset> Size: 4MB
Dimensions:          (out_time: 10, latitude: 105, longitude: 125)
Coordinates:
  * out_time         (out_time) datetime64[ns] 80B 2003-01-01 ... 2003-01-10
    ensemble_member  (out_time) int64 80B 4 4 4 4 4 4 4 4 4 4
    init_time        (out_time) datetime64[ns] 80B 2016-12-07 ... 2016-12-07
    lead_time        (out_time) timedelta64[ns] 80B 22 days 23 days ... 31 days
  * latitude         (latitude) float64 840B 71.8 71.4 71.0 ... 31.0 30.6 30.2
  * longitude        (longitude) float64 1kB -9.8 -9.4 -9.0 ... 39.0 39.4 39.8
    seed             int64 8B 0
    sigma            float64 8B 2.5
    blocksize        timedelta64[ns] 8B 30 days
Data variables:
    mn2t             (out_time, latitude, longitude) float64 1MB -12.5 ... 3.413
    mx2t             (out_time, latitude, longitude) float64 1MB -4.464 ... 1...
    t2m              (out_time, latitude, longitude) float64 1MB -9.487 ... 1...
    tp               (out_time, latitude, longitude) float64 1MB 0.5798 ... 0.0
Attributes:
    Conventions:    CF-1.7
    Title:          Daily aggregate extended ensemble forecast hindcast impac...
    Source:         Contains modified “Extended ensemble forecast hindcast” d...
    Creator:        Jonathan Wider (ORCID: 0000-0002-5185-5768)
    Institution:    Helmholtz Centre for Environmental Research – UFZ
    Creation_date:  2026-04-13 20:11:23
    License:        Creative Commons Attribution 4.0 International

Dask can be used to parallelize computations. In some cases, rechunking the preprocessed reforecast dataset (e.g., along the spatial dimensions) may be desirable, as computations spanning many chunks can increase compute and memory demand.
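A minimal, self-contained sketch of this pattern on synthetic data (the sizes and variable names here are invented): chunking creates a lazy dask-backed array, and .compute() triggers the parallel reduction:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a chunked dataset; requires dask to be installed.
ds = xr.Dataset(
    {
        "t2m": (
            ("out_time", "latitude", "longitude"),
            np.random.default_rng(0).normal(size=(365, 10, 12)),
        )
    }
).chunk({"out_time": 30})

# Lazy: this only builds a dask task graph, nothing is computed yet.
annual_max = ds.t2m.max("out_time")

# Explicitly trigger the (parallel) computation.
result = annual_max.compute()
```

The same pattern applies to reductions over the much larger real datasets; the chunk sizes to use depend on the available memory and the access pattern of the computation.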

Only subsets of the generated data should be loaded into memory, as the full dataset created by “looking up” all generated time series is very large:

size_in_gb = generated_data.nbytes / (1024) ** 3
print(f"Dataset size: {size_in_gb:.2f} GB")
Dataset size: 1500.17 GB
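One way to stay within memory limits is to stream the data, e.g., processing one seed at a time. The sketch below demonstrates the pattern on a tiny synthetic dataset (shapes and values are made up):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the full (~1.5 TB) lookup result.
data = xr.Dataset(
    {
        "t2m": (
            ("seed", "out_time", "latitude", "longitude"),
            np.arange(3 * 5 * 4 * 4, dtype=float).reshape(3, 5, 4, 4),
        )
    },
    coords={"seed": [0, 1, 2]},
)

# Process one seed at a time: only a single seed's slice is in memory at once.
per_seed_max = []
for seed in data.seed.values:
    subset = data.t2m.sel(seed=seed).load()   # load just this seed
    per_seed_max.append(float(subset.max()))  # reduce, then let the slice go
```

Applied to the real generated_data, each iteration would load roughly 1/500 of the full dataset, and only the reduced statistics accumulate in memory.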