{ "cells": [ { "cell_type": "code", "execution_count": 3, "id": "d621dabd", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "import yaml\n", "\n", "from unseen_awg.data_classes import InitTimeLeadTimeMemberState\n", "from unseen_awg.probability_models import (\n", " NormalProbabilityKeepMinimalNDays,\n", " NormalProbabilityModel,\n", ")\n", "from unseen_awg.time_steppers import NoLeapYearStepper, StandardStepper\n", "from unseen_awg.weather_generator import WeatherGenerator" ] }, { "cell_type": "code", "execution_count": 4, "id": "078442e2", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "with open(\"../../configs/paths.yaml\") as file:\n", " paths = yaml.safe_load(file)[\"paths\"]\n", "dir_wg = os.path.join(paths[\"dir_wgs\"], \"wg_reforecasts_5e06172f_f40e9460_1e69bda9\")" ] }, { "cell_type": "markdown", "id": "82751d0a", "metadata": {}, "source": [ "# Creating new simulations with the provided *unseen-awg* weather generator" ] }, { "cell_type": "markdown", "id": "fe7c82e2", "metadata": {}, "source": [ "You can download and extract an instance of *unseen-awg* as described in {ref}`data-download`.\n", "\n", "Below, we first detail some technical aspects of *unseen-awg* and then explain how to load stored weather generators and how to use them to generate new simulations." 
] }, { "cell_type": "markdown", "id": "fe8400f4", "metadata": {}, "source": [ "## Storage format" ] }, { "cell_type": "markdown", "id": "7a93d225", "metadata": {}, "source": [ "Each *unseen-awg* weather generator gets saved as a directory that contains `params.yaml`, a file holding the parameters of the weather generator:" ] }, { "cell_type": "code", "execution_count": 5, "id": "905e7aa0", "metadata": {}, "outputs": [], "source": [ "with open(os.path.join(dir_wg, \"params.yaml\")) as file:\n", " config = yaml.safe_load(file)" ] }, { "cell_type": "code", "execution_count": 6, "id": "1f1bae73", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# overwrite here for privacy reasons before printing\n", "config[\"zarr_year_dayofyear\"] = (\n", " \"\"\n", ")\n", "config[\"dir_wg\"] = \"\"" ] }, { "cell_type": "code", "execution_count": 7, "id": "5e98aeb0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'weather_generator.similarity': 'mse_similarity',\n", " 'weather_generator.var': 'geopotential_height',\n", " 'weather_generator.window_size': 10,\n", " 'weather_generator.n_samples': 14,\n", " 'weather_generator.use_precomputed_similarities': True,\n", " 'zarr_year_dayofyear': '',\n", " 'dir_wg': ''}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config" ] }, { "cell_type": "markdown", "id": "2f88864e", "metadata": {}, "source": [ "For *unseen-awg* instances with precomputed similarities, the parameters `\"zarr_year_dayofyear\"` and `\"dir_wg\"` are mainly used within the `Snakemake` workflow and for some diagnostic plots. Consider adjusting them using the paths you defined in `configs/paths.yaml` ({ref}`configs-paths`). For new instances of *unseen-awg*, the paths will automatically be set correctly." 
] }, { "cell_type": "markdown", "id": "a30519f3", "metadata": {}, "source": [ "A core concept of *unseen-awg* weather generators is computing similarities between pairs of atmospheric circulation states. By default, the similarities are computed as the negative squared Euclidean distance.\n", "\n", "If similarities are precomputed, i.e., if the `weather_generator.use_precomputed_similarities` parameter was set to `True` during initialization of the weather generator, the weather generator's directory additionally includes a zarr store `similarities.zarr` holding similarities between pairs of atmospheric circulation states. Similarities can also be computed on-the-fly. While simulating time series with the weather generator, additional helper files `map__steps_transition.nc` are created in the weather generator's directory, `tau` indicates the block size used in the given simulation." ] }, { "cell_type": "markdown", "id": "e0a67de5", "metadata": {}, "source": [ "## Loading a Weather Generator\n", "\n", "A weather generator instance can be loaded from such a directory using:" ] }, { "cell_type": "code", "execution_count": 8, "id": "94e61b8d", "metadata": {}, "outputs": [], "source": [ "wg = WeatherGenerator.load(dir_wg)" ] }, { "cell_type": "markdown", "id": "7fb2a78b", "metadata": {}, "source": [ "The dataset of similarities can be accessed through:" ] }, { "cell_type": "code", "execution_count": 9, "id": "0b3bc184", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 704GB\n",
       "Dimensions:            (dayofyear: 366, year: 21, sample: 14,\n",
       "                        ensemble_member: 11, d_shift: 23, c_year: 21,\n",
       "                        c_sample: 14, c_ensemble_member: 11)\n",
       "Coordinates: (12/16)\n",
       "  * dayofyear          (dayofyear) int64 3kB 1 2 3 4 5 6 ... 362 363 364 365 366\n",
       "  * year               (year) int64 168B 2003 2004 2005 2006 ... 2021 2022 2023\n",
       "    valid_time         (dayofyear, year) datetime64[ns] 61kB dask.array<chunksize=(366, 21), meta=np.ndarray>\n",
       "  * sample             (sample) int64 112B 0 1 2 3 4 5 6 7 8 9 10 11 12 13\n",
       "    init_time          (dayofyear, year, sample) datetime64[ns] 861kB dask.array<chunksize=(1, 21, 14), meta=np.ndarray>\n",
       "    lead_time          (dayofyear, year, sample) timedelta64[ns] 861kB dask.array<chunksize=(183, 11, 14), meta=np.ndarray>\n",
       "    ...                 ...\n",
       "    c_valid_time       (dayofyear, d_shift, c_year) datetime64[ns] 1MB dask.array<chunksize=(183, 12, 21), meta=np.ndarray>\n",
       "    m_is_near          (dayofyear, year, d_shift, c_year) bool 4MB dask.array<chunksize=(183, 11, 12, 21), meta=np.ndarray>\n",
       "  * c_sample           (c_sample) int64 112B 0 1 2 3 4 5 6 7 8 9 10 11 12 13\n",
       "    c_init_time        (dayofyear, d_shift, c_year, c_sample) datetime64[ns] 20MB dask.array<chunksize=(92, 12, 11, 7), meta=np.ndarray>\n",
       "    c_lead_time        (dayofyear, d_shift, c_year, c_sample) timedelta64[ns] 20MB dask.array<chunksize=(92, 12, 11, 7), meta=np.ndarray>\n",
       "  * c_ensemble_member  (c_ensemble_member) int64 88B 0 1 2 3 4 5 6 7 8 9 10\n",
       "Data variables:\n",
       "    similarities       (dayofyear, year, sample, ensemble_member, d_shift, c_year, c_sample, c_ensemble_member) float64 704GB dask.array<chunksize=(1, 1, 14, 11, 23, 21, 14, 11), meta=np.ndarray>\n",
       "Attributes:\n",
       "    Conventions:    CF-1.7\n",
       "    Title:          Similarities between pairs of states in the underlying da...\n",
       "    Source:         Computed according to the selected similarity measure fro...\n",
       "    Creator:        Jonathan Wider (ORCID: 0000-0002-5185-5768)\n",
       "    Institution:    Helmholtz Centre for Environmental Research – UFZ\n",
       "    Creation_date:  2026-04-13 20:11:23\n",
       "    License:        Creative Commons Attribution 4.0 International
" ], "text/plain": [ " Size: 704GB\n", "Dimensions: (dayofyear: 366, year: 21, sample: 14,\n", " ensemble_member: 11, d_shift: 23, c_year: 21,\n", " c_sample: 14, c_ensemble_member: 11)\n", "Coordinates: (12/16)\n", " * dayofyear (dayofyear) int64 3kB 1 2 3 4 5 6 ... 362 363 364 365 366\n", " * year (year) int64 168B 2003 2004 2005 2006 ... 2021 2022 2023\n", " valid_time (dayofyear, year) datetime64[ns] 61kB dask.array\n", " * sample (sample) int64 112B 0 1 2 3 4 5 6 7 8 9 10 11 12 13\n", " init_time (dayofyear, year, sample) datetime64[ns] 861kB dask.array\n", " lead_time (dayofyear, year, sample) timedelta64[ns] 861kB dask.array\n", " ... ...\n", " c_valid_time (dayofyear, d_shift, c_year) datetime64[ns] 1MB dask.array\n", " m_is_near (dayofyear, year, d_shift, c_year) bool 4MB dask.array\n", " * c_sample (c_sample) int64 112B 0 1 2 3 4 5 6 7 8 9 10 11 12 13\n", " c_init_time (dayofyear, d_shift, c_year, c_sample) datetime64[ns] 20MB dask.array\n", " c_lead_time (dayofyear, d_shift, c_year, c_sample) timedelta64[ns] 20MB dask.array\n", " * c_ensemble_member (c_ensemble_member) int64 88B 0 1 2 3 4 5 6 7 8 9 10\n", "Data variables:\n", " similarities (dayofyear, year, sample, ensemble_member, d_shift, c_year, c_sample, c_ensemble_member) float64 704GB dask.array\n", "Attributes:\n", " Conventions: CF-1.7\n", " Title: Similarities between pairs of states in the underlying da...\n", " Source: Computed according to the selected similarity measure fro...\n", " Creator: Jonathan Wider (ORCID: 0000-0002-5185-5768)\n", " Institution: Helmholtz Centre for Environmental Research – UFZ\n", " Creation_date: 2026-04-13 20:11:23\n", " License: Creative Commons Attribution 4.0 International" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wg.ds_similarities" ] }, { "cell_type": "markdown", "id": "24c40f04", "metadata": {}, "source": [ "The array is set up in a way that it includes similarities between all \"valid\" pairs of 
large-scale atmospheric circulation states, i.e., states whose calendar dates are close enough. The conventions for the dimensions of the similarities dataset are as follows:" ] }, { "cell_type": "code", "execution_count": 10, "id": "22c804ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FrozenMappingWarningOnValuesAccess({'dayofyear': 366, 'year': 21, 'sample': 14, 'ensemble_member': 11, 'd_shift': 23, 'c_year': 21, 'c_sample': 14, 'c_ensemble_member': 11})" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wg.ds_similarities.dims" ] }, { "cell_type": "markdown", "id": "5248e831", "metadata": {}, "source": [ "Assume that we want to compute similarities between a fixed \"reference state\" and a set of \"candidate states\". Then the dimensions are:\n", "- `dayofyear`, `year`: Day of year and year of the reference state.\n", "- `sample`: \"Sample\" of the reference state. This is a dimension we create internally to differentiate between forecasts with the same `valid_time` (and therefore the same `year` and `dayofyear`) but different initialization dates (`init_time`).\n", "- `ensemble_member`: The reforecast dataset includes multiple ensemble members; `ensemble_member` identifies the ensemble member of the reference state.\n", "- `d_shift`: Shift between the day of year of the `valid_time` of the reference state and the corresponding day of year of the candidate state.\n", "- `c_year`, `c_sample`, and `c_ensemble_member` have the same meaning for the candidate state as `year`, `sample`, and `ensemble_member` for the reference state."
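, "\n", "\n", "As a small illustration of the `sample` convention (a sketch using the dataset loaded above), one can fix a reference date and inspect the `init_time` coordinate; each entry along `sample` then corresponds to a different initialization date of a forecast valid on that day:\n", "\n", "```python\n", "# init_time has dimensions (dayofyear, year, sample): for a fixed\n", "# valid_time, each `sample` maps to a different initialization date\n", "init_times = wg.ds_similarities.init_time.sel(dayofyear=1, year=2010).load()\n", "print(init_times.values)\n", "```"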
] }, { "cell_type": "markdown", "id": "66eb6d48", "metadata": {}, "source": [ "For example, to extract the similarity between \n", "- the large-scale atmospheric state projected for January 1st 2010 by the 5th ensemble member of a ensemble reforecast initialized on December 28th 2009 and \n", "- the large-scale atmospheric state of the 0th ensemble member of a reforecast started on January 1st 2010 itself,\n", "\n", "one could do the following:" ] }, { "cell_type": "code", "execution_count": 11, "id": "1da1cb9e", "metadata": {}, "outputs": [], "source": [ "reference_valid_time = np.datetime64(\"2010-01-01\")\n", "reference_init_time = np.datetime64(\"2009-12-28\")\n", "reference_ensemble_member = 5\n", "\n", "candidate_valid_time = np.datetime64(\"2010-01-01\")\n", "candidate_init_time = np.datetime64(\"2010-01-01\")\n", "candidate_ensemble_member = 0" ] }, { "cell_type": "code", "execution_count": 12, "id": "7c5a9560", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'similarities' (dayofyear: 1, year: 1, sample: 1, d_shift: 1,\n",
       "                                  c_year: 1, c_sample: 1)> Size: 8B\n",
       "array([[[[[[-10.66284565]]]]]])\n",
       "Coordinates: (12/16)\n",
       "  * dayofyear          (dayofyear) int64 8B 1\n",
       "  * year               (year) int64 8B 2010\n",
       "    valid_time         (dayofyear, year) datetime64[ns] 8B 2010-01-01\n",
       "  * sample             (sample) int64 8B 11\n",
       "    init_time          (dayofyear, year, sample) datetime64[ns] 8B 2009-12-28\n",
       "    lead_time          (dayofyear, year, sample) timedelta64[ns] 8B 4 days\n",
       "    ...                 ...\n",
       "    m_is_near          (dayofyear, year, d_shift, c_year) bool 1B True\n",
       "  * c_sample           (c_sample) int64 8B 12\n",
       "    c_init_time        (dayofyear, d_shift, c_year, c_sample) datetime64[ns] 8B ...\n",
       "    c_lead_time        (dayofyear, d_shift, c_year, c_sample) timedelta64[ns] 8B ...\n",
       "    c_ensemble_member  int64 8B 0\n",
       "    ensemble_member    int64 8B 5\n",
       "Attributes:\n",
       "    long_name:  similarity score between forecast and candidate analog\n",
       "    units:      1
" ], "text/plain": [ " Size: 8B\n", "array([[[[[[-10.66284565]]]]]])\n", "Coordinates: (12/16)\n", " * dayofyear (dayofyear) int64 8B 1\n", " * year (year) int64 8B 2010\n", " valid_time (dayofyear, year) datetime64[ns] 8B 2010-01-01\n", " * sample (sample) int64 8B 11\n", " init_time (dayofyear, year, sample) datetime64[ns] 8B 2009-12-28\n", " lead_time (dayofyear, year, sample) timedelta64[ns] 8B 4 days\n", " ... ...\n", " m_is_near (dayofyear, year, d_shift, c_year) bool 1B True\n", " * c_sample (c_sample) int64 8B 12\n", " c_init_time (dayofyear, d_shift, c_year, c_sample) datetime64[ns] 8B ...\n", " c_lead_time (dayofyear, d_shift, c_year, c_sample) timedelta64[ns] 8B ...\n", " c_ensemble_member int64 8B 0\n", " ensemble_member int64 8B 5\n", "Attributes:\n", " long_name: similarity score between forecast and candidate analog\n", " units: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wg.ds_similarities.where(\n", " np.logical_and(\n", " wg.ds_similarities.valid_time.load() == reference_valid_time,\n", " wg.ds_similarities.init_time.load() == reference_init_time,\n", " ),\n", " drop=True,\n", ").sel(\n", " ensemble_member=reference_ensemble_member,\n", " c_ensemble_member=candidate_ensemble_member,\n", ").where(\n", " np.logical_and(\n", " wg.ds_similarities.c_valid_time.load() == candidate_valid_time,\n", " wg.ds_similarities.c_init_time.load() == candidate_init_time,\n", " ),\n", " drop=True,\n", ")[\"similarities\"].load()" ] }, { "cell_type": "markdown", "id": "f752c428", "metadata": {}, "source": [ "In practice, this manual extraction of similarities is rarely necessary." ] }, { "cell_type": "markdown", "id": "44a33689", "metadata": {}, "source": [ "## Sampling a time series with the weather generator" ] }, { "cell_type": "markdown", "id": "adcb4b80", "metadata": {}, "source": [ "Even after a weather generator was configured, several parameters that can be adapted per simulation. 
These are:\n", "- `blocksize`: The sampling can be configured to generate time series in contiguous blocks of `blocksize` days.\n", "- `probability_model`: Specifies how probabilities are to be calculated from the similarities. This object allows restrictions to allowed candidate states, e.g., using an instance of the `NormalProbabilityNotLargerThanFixedDate` class forbids sampling analogs beyond a chosen date. With this restriction, *unseen-awg* can be used in a forecast-like setting.\n", "- `stepper_class`: Determines how to evolve the output time. This makes it possible to use different output calendars. E.g., the `NoLeapYearStepper` class can be chosen to sample a time series using a no-leap-year calendar.\n", "- `n_steps`: How many steps of length `blocksize` the resulting time series should include. \n", "- `rng`: A random number generator that allows making sampling reproducible.\n", "- `initialization`: The initial state to start the time series generation from. By default, a random state is chosen among the possible options in the dataset.\n", "- `start_by_taking_analog`: Whether the sampling of the time series starts by from an analog of the initial condition.\n", "- `show_progressbar`: Track sampling progress." 
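, "\n", "\n", "For instance, the effect of `rng` can be checked by drawing two trajectories with the same seed (a sketch reusing the generator and classes from above; it assumes that the random initialization is also drawn from `rng`, so both calls should yield identical results):\n", "\n", "```python\n", "kwargs = dict(\n", "    blocksize=5,\n", "    probability_model=NormalProbabilityModel(sigma=2.5),\n", "    stepper_class=StandardStepper,\n", "    n_steps=10,\n", "    start_by_taking_analog=False,\n", "    show_progressbar=False,\n", ")\n", "trajectory_a = wg.sample_trajectory(rng=np.random.default_rng(seed=0), **kwargs)\n", "trajectory_b = wg.sample_trajectory(rng=np.random.default_rng(seed=0), **kwargs)\n", "assert trajectory_a.identical(trajectory_b)\n", "```"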
] }, { "cell_type": "code", "execution_count": 13, "id": "4d11ae2b", "metadata": {}, "outputs": [], "source": [ "trajectory = wg.sample_trajectory(\n", " blocksize=5,\n", " probability_model=NormalProbabilityModel(sigma=2.5),\n", " stepper_class=StandardStepper,\n", " n_steps=100,\n", " rng=np.random.default_rng(seed=0),\n", " initialization=InitTimeLeadTimeMemberState(\n", " init_time=np.datetime64(\"2009-12-28\", \"ns\"),\n", " lead_time=np.timedelta64(4, \"D\"),\n", " ensemble_member=0,\n", " ),\n", " start_by_taking_analog=False,\n", " show_progressbar=False,\n", ")" ] }, { "cell_type": "markdown", "id": "3408849e", "metadata": {}, "source": [ "The resulting time series has a very compact format: it is an `xarray.Dataset` with the assigned output time `out_time` as the dimension and the `lead_time`, `init_time`, and `ensemble_member` of the sampled states as variables. That means, we don't store any meteorological data, only coordinates to look up the corresponding data for a given `out_time`:" ] }, { "cell_type": "code", "execution_count": 14, "id": "672b1965", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 16kB\n",
       "Dimensions:          (out_time: 500)\n",
       "Coordinates:\n",
       "  * out_time         (out_time) object 4kB 2010-01-01 00:00:00 ... 2011-05-15...\n",
       "Data variables:\n",
       "    lead_time        (out_time) timedelta64[ns] 4kB 4 days 5 days ... 13 days\n",
       "    init_time        (out_time) datetime64[ns] 4kB 2009-12-28 ... 2021-05-02\n",
       "    ensemble_member  (out_time) int64 4kB 0 0 0 0 0 0 0 0 0 ... 9 9 9 4 4 4 4 4
" ], "text/plain": [ " Size: 16kB\n", "Dimensions: (out_time: 500)\n", "Coordinates:\n", " * out_time (out_time) object 4kB 2010-01-01 00:00:00 ... 2011-05-15...\n", "Data variables:\n", " lead_time (out_time) timedelta64[ns] 4kB 4 days 5 days ... 13 days\n", " init_time (out_time) datetime64[ns] 4kB 2009-12-28 ... 2021-05-02\n", " ensemble_member (out_time) int64 4kB 0 0 0 0 0 0 0 0 0 ... 9 9 9 4 4 4 4 4" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trajectory" ] }, { "cell_type": "markdown", "id": "bf10756e", "metadata": {}, "source": [ "Using a different set up (`blocksize` 10 days, no-leap-year calendar, no transitions between states closer than 180 days, random initialization):" ] }, { "cell_type": "code", "execution_count": 15, "id": "7ae7ac09", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/gpfs1/schlecker/home/wider/Projects/unseen-awg/src/unseen_awg/weather_generator.py:876: FutureWarning: dropping variables using `drop` is deprecated; use drop_vars.\n", " .drop(\"datapoint\")\n" ] } ], "source": [ "trajectory_2 = wg.sample_trajectory(\n", " blocksize=10,\n", " probability_model=NormalProbabilityKeepMinimalNDays(sigma=2.5, n_days_min=180),\n", " stepper_class=NoLeapYearStepper,\n", " n_steps=100,\n", " rng=np.random.default_rng(seed=0),\n", " start_by_taking_analog=False,\n", " show_progressbar=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "id": "a4ff1ace", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 32kB\n",
       "Dimensions:          (out_time: 1000)\n",
       "Coordinates:\n",
       "  * out_time         (out_time) object 8kB 2016-11-07 00:00:00 ... 2019-08-03...\n",
       "Data variables:\n",
       "    lead_time        (out_time) timedelta64[ns] 8kB 12 days 13 days ... 22 days\n",
       "    init_time        (out_time) datetime64[ns] 8kB 2016-10-26 ... 2016-07-06\n",
       "    ensemble_member  (out_time) int64 8kB 7 7 7 7 7 7 7 7 7 ... 8 8 8 8 8 8 8 8
" ], "text/plain": [ " Size: 32kB\n", "Dimensions: (out_time: 1000)\n", "Coordinates:\n", " * out_time (out_time) object 8kB 2016-11-07 00:00:00 ... 2019-08-03...\n", "Data variables:\n", " lead_time (out_time) timedelta64[ns] 8kB 12 days 13 days ... 22 days\n", " init_time (out_time) datetime64[ns] 8kB 2016-10-26 ... 2016-07-06\n", " ensemble_member (out_time) int64 8kB 7 7 7 7 7 7 7 7 7 ... 8 8 8 8 8 8 8 8" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trajectory_2" ] }, { "cell_type": "markdown", "id": "94792256", "metadata": {}, "source": [ "The code for the different components is modular, so that creating new classes for different probability distributions and calendars is possible. Similarly, different methods for computing similarities can be implemented." ] } ], "metadata": { "kernelspec": { "display_name": "unseen-awg", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }