ecmwf_models

Readers and converters for ECMWF reanalysis (ERA5 and ERA5-Land) data. Written in Python.

Works great in combination with pytesmo.

Installation

This package has been tested on Linux, Windows and macOS with Python 3.10, 3.11 and 3.12. Ideally, use one of these supported versions; the package might still work with older Python versions.

Use pip to install the ecmwf_models package from PyPI together with all required Python dependencies.

pip install ecmwf_models

On Windows systems, it might be necessary to install required C-libraries via conda. For installation we recommend Miniconda:

conda install -c conda-forge pygrib netcdf4 pyresample pykdtree

Quick Start

Download image data from CDS (set up the API first) using the era5 download and era5land download console commands (see era5 download --help for all options) …

era5land download /tmp/era5/img -s 2024-04-01 -e 2024-04-05 -v swvl1,swvl2 --h_steps 0,12

… and convert them to time series (ideally for a longer period). See era5 reshuffle --help for all options.

era5land reshuffle /tmp/era5/img /tmp/era5/ts -s 2024-04-01 -e 2024-04-05 --land_points True

Finally, in Python, read the time series data for a location as a pandas DataFrame.

>> from ecmwf_models.interface import ERATs
>> ds = ERATs('/tmp/era5/ts')
>> ds.read(18, 48)  # (lon, lat)

                        swvl1     swvl2
2024-04-01 00:00:00  0.318054  0.329590
2024-04-01 12:00:00  0.310715  0.325958
2024-04-02 00:00:00  0.360229  0.323502
        ...             ...       ...
2024-04-04 12:00:00  0.343353  0.348755
2024-04-05 00:00:00  0.350266  0.346558
2024-04-05 12:00:00  0.343994  0.344498
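
The index of the DataFrame above follows from the -s, -e and --h_steps options used in the download and reshuffle commands. As a rough illustration (not part of the package; the helper name expected_timestamps is made up for this sketch), the expected time stamps can be listed with the standard library alone, assuming the end date is inclusive as the output above suggests:

```python
from datetime import date, datetime, timedelta


def expected_timestamps(start, end, h_steps):
    """List the time stamps from start to end (both inclusive) at the given hours."""
    stamps = []
    day = start
    while day <= end:
        for hour in sorted(h_steps):
            stamps.append(datetime(day.year, day.month, day.day, hour))
        day += timedelta(days=1)
    return stamps


# Same period and hours as in the commands above
stamps = expected_timestamps(date(2024, 4, 1), date(2024, 4, 5), h_steps=(0, 12))
print(len(stamps))            # 10 (5 days x 2 time stamps per day)
print(stamps[0], stamps[-1])  # 2024-04-01 00:00:00 2024-04-05 12:00:00
```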

More programs are available to keep an existing image and time series record up-to-date. Type era5 --help and era5land --help to see all available programs.

CDS API Setup

In order to download data from CDS, this package uses the CDS API (https://pypi.org/project/cdsapi/). You can either pass your credentials directly on the command line (which might be unsafe) or set up a .cdsapirc file in your home directory (recommended). Please see the description at https://cds.climate.copernicus.eu/how-to-api.

Supported Products

At the moment this package supports

ERA5
ERA5-Land

reanalysis data in grib and netcdf format (download, reading, time series creation) with a default spatial sampling of 0.25 degrees (ERA5), and 0.1 degrees (ERA5-Land). It should be easy to extend the package to support other ECMWF reanalysis products. This will be done as need arises.

Docker image

We provide a docker image for this package. This contains all pre-installed dependencies and can simply be pulled via

$ docker pull ghcr.io/tuw-geo/ecmwf_models:latest

Alternatively, to build the image locally using the provided Dockerfile, call from the package root

$ docker buildx build -t ecmwf_models:latest . 2>&1 | tee docker_build.log

Afterwards, you can execute the era5 and era5land commands directly in the container (after mounting some volumes to write data to). The easiest way to set the API credentials in this case is via the CDSAPI_KEY container variable or the --cds_token option as below.

$ docker run -v /data/era5/img:/container/path ecmwf_models:latest bash -c \
   'era5land update_img /container/path --cds_token xxxx-xxx-xxx-xx-xxxx'

You can use this together with a task scheduler to regularly pull new data.

Citation

If you use the software in a publication then please cite it using the Zenodo DOI. Be aware that this badge links to the latest package version.

Contribute

We are happy if you want to contribute. Please raise an issue explaining what is missing or if you find a bug. Please take a look at the developers guide.

Downloading ERA5 and ERA5-Land data

ERA5 (and ERA5-Land) data can be downloaded manually from the Copernicus Climate Data Store (CDS) or automatically via the CDS API, as done in the download modules (era5 download and era5land download). Before you can use this, you have to set up an account at the CDS and get your API key.

Then you can use the programs era5 download and era5land download to download ERA5 images between a passed start and end date. Passing --help will show additional information on using the commands.

For example, the following command in your terminal would download ERA5 images for all available layers of soil moisture in netcdf format, between January 1st and February 1st 2000, into /path/to/storage. The data will be stored in subfolders of the format YYYY/jjj. The temporal resolution of the images is 6 hours by default, but can be changed using the --h_steps option.

era5 download /path/to/storage -s 2000-01-01 -e 2000-02-01 \
    --variables swvl1,swvl2,swvl3,swvl4 --h_steps 0,6,12,18
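
To make the YYYY/jjj layout concrete, here is a minimal sketch of how the subfolder for a given time stamp could be derived, reading jjj as the zero-padded day of the year. The helper name image_subfolder is hypothetical and only serves this illustration; the package may build its paths differently.

```python
from datetime import datetime
from pathlib import PurePosixPath


def image_subfolder(root, timestamp):
    """Return root/YYYY/jjj, where jjj is the zero-padded day of the year."""
    return PurePosixPath(root) / timestamp.strftime("%Y") / timestamp.strftime("%j")


# February 1st is the 32nd day of the year
print(image_subfolder("/path/to/storage", datetime(2000, 2, 1, 6)))
# /path/to/storage/2000/032
```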

The variables to download can be given either by their long names or by their short names (as in the example). See the ERA5 variable table and ERA5-Land variable table to look up the right name for the CDS API.

By default, the command expects that you have set up your .cdsapirc file to identify with the data store as described above. Alternatively you can pass your token directly with the download command using the --cds_token option. Or you can set an environment variable CDSAPI_KEY that contains your token.
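
Before starting a long download it can help to check which of these credential sources is actually present. The following is only a sketch: the function name available_cds_credentials is invented here, and the check says nothing about which source the package prefers when several are set.

```python
import os
from pathlib import Path


def available_cds_credentials(home=None, environ=None):
    """Return the names of the credential sources that are currently set up."""
    home = Path.home() if home is None else Path(home)
    environ = os.environ if environ is None else environ
    sources = []
    if (home / ".cdsapirc").is_file():   # file in the home directory
        sources.append(".cdsapirc")
    if environ.get("CDSAPI_KEY"):        # non-empty environment variable
        sources.append("CDSAPI_KEY")
    return sources


found = available_cds_credentials()
print(found or "no credentials found; pass --cds_token instead")
```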

We recommend downloading data in netcdf format, however, using the --as_grib option, you can also download data in grib format.

For all other available options, type era5 download --help or era5land download --help, respectively.

Updating an existing record

After some time, new ERA data will become available. You can then use the program era5 update_img with the path to the existing record to download, with the same settings, all images that have become available since the record was last downloaded. You might even set up a cron job to check for new data at regular intervals to keep your copy up-to-date.

Reading data

To read the downloaded image data in Python we recommend standard libraries such as xarray or netCDF4.

You can also use the internal classes from this package, although their main purpose is to serve the time series conversion module.

For example, you can read the image for some variables at a specific date from a stack of downloaded image files (the chosen date must of course be available):

>> from datetime import datetime
>> from ecmwf_models.era5.reader import ERA5NcDs
>> root_path = "/path/to/netcdf_storage"
>> ds = ERA5NcDs(root_path, parameter=['swvl1'])
>> img = ds.read(datetime(2010, 1, 1, 0))

# To read the coordinates
>> img.lat   # also: img.lon
array([[ 90. ,  90. ,  90. , ...,  90. ,  90. ,  90. ],
       [ 89.9,  89.9,  89.9, ...,  89.9,  89.9,  89.9],
       [ 89.8,  89.8,  89.8, ...,  89.8,  89.8,  89.8],
       ...,
       [-89.8, -89.8, -89.8, ..., -89.8, -89.8, -89.8],
       [-89.9, -89.9, -89.9, ..., -89.9, -89.9, -89.9],
       [-90. , -90. , -90. , ..., -90. , -90. , -90. ]])

# To read the data variables
>> img.data['swvl1']
array([[   nan,    nan,    nan, ...,    nan,    nan,    nan],
       [   nan,    nan,    nan, ...,    nan,    nan,    nan],
       [   nan,    nan,    nan, ...,    nan,    nan,    nan],
       ...,
       [0.159 , 0.1589, 0.1588, ..., 0.1595, 0.1594, 0.1592],
       [0.1582, 0.1582, 0.1581, ..., 0.1588, 0.1587, 0.1584],
       [0.206 , 0.206 , 0.206 , ..., 0.206 , 0.206 , 0.206 ]])

The equivalent class to read grib files is called ERA5GrbDs.

Conversion to time series format

For many applications it is favorable to convert the image-based format into one that is optimized for fast time series retrieval, as is often needed for e.g. validation studies. There are many ways to do this, for example stacking the images into a single netCDF file with suitable chunk sizes. We have chosen to do it in the following way:

Store the time series as netCDF4 Climate and Forecast convention (CF) Orthogonal multidimensional array representation
Store the time series in 5x5 degree cells. This means there will be up to 2592 cell files (72 x 36 cells for the whole globe) and a file called grid.nc which contains the information about which grid point is stored in which file. This allows us to read a whole 5x5 degree area into memory and iterate over the time series quickly.
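
To illustrate how a location maps to one of these 5x5 degree cells, here is a minimal sketch. The numbering scheme (column by column, starting at the south-west corner) is an assumption modelled on what the pygeogrids package is believed to do; it is not stated in the text above, and in practice grid.nc holds the authoritative mapping. The function name cell_index is hypothetical.

```python
import math


def cell_index(lon, lat, cellsize=5.0):
    """Map a lon/lat pair to the index of the cellsize x cellsize degree cell containing it."""
    n_rows = int(180 / cellsize)   # cells per column of longitude (36 for 5 degrees)
    n_cols = int(360 / cellsize)   # cells per row of latitude (72 for 5 degrees)
    # Clip so that points exactly on the upper edges fall into the last cell
    col = min(math.floor((lon + 180.0) / cellsize), n_cols - 1)
    row = min(math.floor((lat + 90.0) / cellsize), n_rows - 1)
    return col * n_rows + row


print(cell_index(18, 48))   # 1431
```

All grid points that fall into the same cell end up in the same file, which is why reading neighbouring locations one after another is fast.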

This conversion can be performed using the era5 reshuffle (respectively era5land reshuffle) command line program. An example would be:

era5 reshuffle /path/to/img /out/ts/path 2000-01-01 2000-12-31 \
     -v swvl1,swvl2 --h_steps 0,12 --bbox -10 30 30 60 --land_points

This would take (previously downloaded) ERA5 images (at time stamps 0:00 and 12:00 UTC) stored in /path/to/img from January 1st 2000 to December 31st 2000 and store the data of the variables “swvl1” and “swvl2” at land points within the selected bounding box as time series in the folder /out/ts/path.
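
As a small illustration of the bounding box selection, the sketch below tests whether a point lies inside the box from the example. It assumes the four --bbox values are ordered min_lon, min_lat, max_lon, max_lat; check era5 reshuffle --help to confirm the order. The helper in_bbox is made up for this sketch and does not apply any land mask.

```python
def in_bbox(lon, lat, bbox):
    """True if the point lies inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


bbox = (-10, 30, 30, 60)        # values from the reshuffle example (assumed order)
print(in_bbox(18, 48, bbox))    # True
print(in_bbox(-60, 10, bbox))   # False
```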

The passed variable names (-v) have to correspond with the names in the downloaded file, i.e. use the variable short names here.

For all other options, see the output of era5 reshuffle --help and era5land reshuffle --help.

Conversion to time series is performed by the repurpose package in the background.

Append new image data to existing time series

Similar to the update_img program, we also provide programs to simplify updating an existing time series record with newly downloaded images via the era5 update_ts and era5land update_ts programs. This will use the settings file created during the initial time series conversion (with reshuffle) and look for new image data in the same path that is not yet available in the given time series record.

This option is ideally used together with the update_img program in, e.g. a cron job, to first download new images, and then append them to their time series counterpart.

era5 update_ts /existing/ts/record

Alternatively, you can also use the reshuffle command, with a target path that already contains time series. This will also append new data (but make sure you use the same settings as before).
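
If you prefer to drive the two update steps from Python rather than directly from a cron entry, they could be chained roughly as follows. This is a sketch under assumptions: the function names and the image path are hypothetical, only the path argument shown in this document is passed to each program, and credentials are expected to be set up already.

```python
import shlex
import subprocess


def build_update_commands(img_path, ts_path, program="era5"):
    """Return the commands that first fetch new images, then append them to the time series."""
    return [
        [program, "update_img", img_path],
        [program, "update_ts", ts_path],
    ]


def run_update(img_path, ts_path, program="era5", dry_run=True):
    """Run both steps in order and stop at the first failure; with dry_run, only print them."""
    for cmd in build_update_commands(img_path, ts_path, program):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Dry run only: prints the two commands without executing them
run_update("/existing/img/record", "/existing/ts/record")
```

Running the image update first matters, because update_ts can only append images that are already on disk.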

Reading converted time series data

To read the time series data that the era5 reshuffle and era5land reshuffle commands produce, use the class ERATs. It returns a time series of values for the chosen location.

Optional arguments that are forwarded to the parent class (OrthoMultiTs, as defined in pynetcf.time_series) can be passed as well:

>> from ecmwf_models import ERATs
# read_bulk reads full files into memory
# read takes either lon, lat coordinates to perform a nearest neighbour search
# or a grid point index (from the grid.nc file) and returns a pandas.DataFrame.
>> ds = ERATs(ts_path, ioclass_kws={'read_bulk': True})

>> ds.read(18, 48)  # (lon, lat)

                        swvl1     swvl2
2024-04-01 00:00:00  0.318054  0.329590
2024-04-01 12:00:00  0.310715  0.325958
2024-04-02 00:00:00  0.360229  0.323502
        ...             ...       ...
2024-04-04 12:00:00  0.343353  0.348755
2024-04-05 00:00:00  0.350266  0.346558
2024-04-05 12:00:00  0.343994  0.344498

Bulk reading speeds up reading multiple points from a cell file by keeping the file in memory for subsequent calls. Either longitude and latitude can be passed, to perform a nearest neighbour search on the data grid (grid.nc in the time series path), or the grid point index (GPI) can be passed directly.
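
The relation between a coordinate pair and a GPI can be pictured with a brute-force nearest neighbour search. This sketch is for illustration only: the GPI numbers and the function nearest_gpi are invented, and the package most likely relies on a faster spatial index (pykdtree is among the suggested dependencies) rather than a linear scan.

```python
import math


def nearest_gpi(lon, lat, grid_points):
    """Return the GPI of the grid point closest to (lon, lat).

    grid_points maps gpi -> (lon, lat) in degrees; distances are great-circle angles.
    """
    lon1, lat1 = math.radians(lon), math.radians(lat)

    def distance(point):
        lon2, lat2 = math.radians(point[0]), math.radians(point[1])
        # Haversine formula for the central angle between the two points
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * math.asin(math.sqrt(a))

    return min(grid_points, key=lambda gpi: distance(grid_points[gpi]))


# Hypothetical GPIs on a 0.1 degree grid around the example location
grid_points = {101: (17.9, 48.0), 102: (18.0, 48.0), 103: (18.1, 48.0), 201: (18.0, 48.1)}
print(nearest_gpi(18.02, 48.01, grid_points))   # 102
```

Passing the GPI directly skips this lookup, which is convenient when iterating over all points of a cell.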