The provided simulations

Two ingredients are necessary to use unseen-awg simulations.

unseen-awg is a resampling-based weather generator, i.e., it simply combines time steps from an existing dataset in a new temporal order. Therefore, simulated time series are stored very compactly as “look-up tables” of the resampled dataset. Using the simulations requires downloading the “look-up tables” and the dataset of impact-relevant variables that is being resampled. Both ingredients can be downloaded from the World Data Center for Climate as described in an earlier section.
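As a minimal illustration of this look-up principle, the following sketch builds a tiny synthetic "reforecast" dataset and look-up table (all names, sizes, and values here are invented, not the real datasets) and uses xarray's vectorized label-based indexing to assemble the resampled series:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic "reforecast" cube: 2 ensemble members, 3 init times, 4 lead times.
init_times = pd.date_range("2020-01-01", periods=3, freq="7D")
lead_times = pd.to_timedelta(np.arange(1, 5), unit="D")
source = xr.Dataset(
    {
        "t2m": (
            ("ensemble_member", "init_time", "lead_time"),
            np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4),
        )
    },
    coords={
        "ensemble_member": [0, 1],
        "init_time": init_times,
        "lead_time": lead_times,
    },
)

# Tiny synthetic "look-up table": for each output day, the coordinates of the
# reforecast time step to copy.
lookup = xr.Dataset(
    {
        "ensemble_member": ("out_time", [1, 0, 1]),
        "init_time": ("out_time", init_times[[0, 2, 1]].values),
        "lead_time": ("out_time", lead_times[[3, 0, 2]].values),
    },
    coords={"out_time": pd.date_range("2021-01-01", periods=3)},
)

# Vectorized label-based indexing assembles the resampled series: one t2m value
# per out_time, taken from the indicated reforecast time step.
resampled = source.sel(lookup)
```

The look-up table itself stores only coordinate labels, which is why the simulations can be archived so compactly.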

To make use of the simulations, both datasets archived at the World Data Center for Climate are necessary: the simulated time series (“look-up tables”) and the dataset of impact-relevant variables.

As described earlier, all partial zip files of the dataset of impact-relevant variables must be extracted. If some of the partial files are missing, the resulting Zarr store may still open, but it will contain missing data.
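A hedged sketch of the extraction step, assuming each partial file is itself a complete zip archive and that all parts are extracted into the same target directory (the function name and paths are hypothetical):

```python
import zipfile
from pathlib import Path

def extract_parts(download_dir: Path, target_dir: Path) -> list[Path]:
    """Extract all partial zip files into one directory.

    Assumes (hypothetically) that each partial file is a standalone zip
    archive holding a different subset of the Zarr store's contents, so
    extracting all of them into the same directory reassembles the store.
    """
    parts = sorted(download_dir.glob("*.zip"))
    for part in parts:
        with zipfile.ZipFile(part) as zf:
            zf.extractall(target_dir)
    return parts
```

Comparing the number of extracted parts against the number of partial files listed at the World Data Center for Climate is a simple way to detect a missing part.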

Optionally, the simulations can also be used in combination with the reforecast dataset of large-scale atmospheric circulation fields: “Preprocessed atmospheric circulation variables - Extended ensemble forecast hindcast (ECMWF) for unseen-awg simulations”.

Opening the simulations using xarray

We use the xarray Python library, which makes working with Zarr stores and netCDF4 files straightforward:

import xarray

# path_timeseries: the downloaded netCDF4 file with the simulated time series
# path_ds_impact_relevant_variables: the extracted Zarr store
generated_time_series = xarray.open_dataset(path_timeseries)
ds_impact_relevant_variables = xarray.open_zarr(path_ds_impact_relevant_variables)

# smaller chunks for reduced memory demand
ds_impact_relevant_variables = ds_impact_relevant_variables.chunk(
    {"ensemble_member": 1, "init_time": 1}
)

For each out_time time step, the simulated time series (generated_time_series) hold the coordinate vectors that allow looking up the corresponding data within the large reforecast dataset, i.e. the ensemble_member, init_time, and lead_time of a reforecast time step:

generated_time_series
<xarray.Dataset> Size: 92MB
Dimensions:          (seed: 500, out_time: 7670)
Coordinates:
  * seed             (seed) int64 4kB 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * out_time         (out_time) datetime64[ns] 61kB 2003-01-01 ... 2023-12-31
    sigma            float64 8B ...
    blocksize        timedelta64[ns] 8B ...
Data variables:
    lead_time        (seed, out_time) timedelta64[ns] 31MB ...
    init_time        (seed, out_time) datetime64[ns] 31MB ...
    ensemble_member  (seed, out_time) float64 31MB ...
Attributes:
    Conventions:        CF-1.7
    Title:              Example simulations with unseen-awg v1.0
    Source:             Simulations were obtained using unseen-awg v1.0 and d...
    Creator:            Jonathan Wider (ORCID: 0000-0002-5185-5768)
    Institution:        Helmholtz Centre for Environmental Research – UFZ
    Creation_date:      2026-04-13 20:11:16
    License:            Creative Commons Attribution 4.0 International
    probability_model:  NoRestrictions

The provided dataset includes 500 generated daily weather time series, each 21 years long. To extract the actual weather data, the generated time series can be used directly as label-based indexers into the xarray dataset of impact-relevant variables:

generated_data = ds_impact_relevant_variables.sel(generated_time_series)

The dataset ds_impact_relevant_variables is a Zarr store. As a result, xarray computes lazily, loading into memory only the data that is currently required. In the simplest case, users can load just the desired subsets of the data into memory, e.g., the first 10 time steps of the first generated time series:

generated_data.sel(seed=0).isel(out_time=slice(0, 10)).load()
<xarray.Dataset> Size: 4MB
Dimensions:          (out_time: 10, latitude: 105, longitude: 125)
Coordinates:
  * out_time         (out_time) datetime64[ns] 80B 2003-01-01 ... 2003-01-10
    ensemble_member  (out_time) int64 80B 4 4 4 4 4 4 4 4 4 4
    init_time        (out_time) datetime64[ns] 80B 2016-12-07 ... 2016-12-07
    lead_time        (out_time) timedelta64[ns] 80B 22 days 23 days ... 31 days
  * latitude         (latitude) float64 840B 71.8 71.4 71.0 ... 31.0 30.6 30.2
  * longitude        (longitude) float64 1kB -9.8 -9.4 -9.0 ... 39.0 39.4 39.8
    seed             int64 8B 0
    sigma            float64 8B 2.5
    blocksize        timedelta64[ns] 8B 30 days
Data variables:
    mn2t             (out_time, latitude, longitude) float64 1MB -12.5 ... 3.413
    mx2t             (out_time, latitude, longitude) float64 1MB -4.464 ... 1...
    t2m              (out_time, latitude, longitude) float64 1MB -9.487 ... 1...
    tp               (out_time, latitude, longitude) float64 1MB 0.5798 ... 0.0
Attributes:
    Conventions:    CF-1.7
    Title:          Daily aggregate extended ensemble forecast hindcast impac...
    Source:         Contains modified “Extended ensemble forecast hindcast” d...
    Creator:        Jonathan Wider (ORCID: 0000-0002-5185-5768)
    Institution:    Helmholtz Centre for Environmental Research – UFZ
    Creation_date:  2026-04-13 20:11:23
    License:        Creative Commons Attribution 4.0 International

Dask can be used to parallelize computations. In some cases, rechunking the preprocessed reforecast dataset (e.g., along the spatial dimensions) may be desirable, as computations spanning many chunks can increase compute and memory demand.
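A minimal, self-contained sketch of this pattern on synthetic data (the sizes and variable names here are invented): chunking creates a lazy dask-backed array, and .compute() triggers the parallel reduction:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a chunked dataset; requires dask to be installed.
ds = xr.Dataset(
    {
        "t2m": (
            ("out_time", "latitude", "longitude"),
            np.random.default_rng(0).normal(size=(365, 10, 12)),
        )
    }
).chunk({"out_time": 30})

# Lazy: this only builds a dask task graph, nothing is computed yet.
annual_max = ds.t2m.max("out_time")

# Explicitly trigger the (parallel) computation.
result = annual_max.compute()
```

The same pattern applies to reductions over the much larger real datasets; the chunk sizes to use depend on the available memory and the access pattern of the computation.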

Only subsets of the generated data should be loaded into memory, as the full dataset created by “looking up” all generated time series is very large:

size_in_gb = generated_data.nbytes / (1024) ** 3
print(f"Dataset size: {size_in_gb:.2f} GB")
Dataset size: 1500.17 GB
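One way to stay within memory limits is to stream the data, e.g., processing one seed at a time. The sketch below demonstrates the pattern on a tiny synthetic dataset (shapes and values are made up):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the full (~1.5 TB) lookup result.
data = xr.Dataset(
    {
        "t2m": (
            ("seed", "out_time", "latitude", "longitude"),
            np.arange(3 * 5 * 4 * 4, dtype=float).reshape(3, 5, 4, 4),
        )
    },
    coords={"seed": [0, 1, 2]},
)

# Process one seed at a time: only a single seed's slice is in memory at once.
per_seed_max = []
for seed in data.seed.values:
    subset = data.t2m.sel(seed=seed).load()   # load just this seed
    per_seed_max.append(float(subset.max()))  # reduce, then let the slice go
```

Applied to the real generated_data, each iteration would load roughly 1/500 of the full dataset, and only the reduced statistics accumulate in memory.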