Method#

Basic#

The MIdAS Python implementation aims to perform efficient bias adjustment of large data sets, such as multi-decadal climate model data over large domains.

Adjustments are performed for

single grid points, i.e. spatial dependency is not accounted for
single variables, i.e. dependency between variables is not accounted for
daily time steps
model data when model data and reference data are located on the same grid

Input#

The input data, both for reference and model, should come in the form of timeseries with a fixed timestep, e.g., daily.

We focus on the adjustment of model output that comes in the form of a three dimensional field with two horizontal dimensions and one time dimension, \(X(x, y, t)\). Often, the horizontal dimensions will be given by longitude and latitude, but particularly for regional models, other dimensions are common too, for example in projections such as the Lambert conformal projection.

These fields embody physical phenomena, e.g. precipitation and near-surface temperature and are referred to as variables. Many variables have standard variable names, for example, precipitations is usually called pr in files and near-surface temperature is often referred to as tas. In addition, meta-data like units and standard_name are often specified in additional standards, for example, the CMIP6 data request (https://cordex.org/experiment-guidelines/cordex-cmip6/requests-to-cmip6/) for global climate model output, or the corresponding CORDEX standard (https://cordex.org/experiment-guidelines/cordex-cmip5/cordex-core/cordex-core-simulation-framework/)for the output of regional climate models.

Technically, the input data is always in netCDF files that adhere to the CF conventions(https://cfconventions.org/) and follow the aforementioned applicable standards. Bringing such metadata to CF-convention conformity is the process Standardization.

Output template generation#

The output filename can be set by the -o command flag. When the -o flag is missing, MIdAS automatically generates an output template for the filename given by user, ie., outputfilename.tmp.nc. This works well for files with the same basic filename structure, where only the time stamp, variable and simulation components differ. For example, given a variable tas and two input files:

1:"/PATH/tas_{project}_historical_{run_model_version}_day_20060101-20101231.nc"
2:"/PATH/tas_{project}_scenario_{run_model_version}_day_200110101-20151231.nc"

The output template starts with the variable that is being computed, in this case tas. Thereafter, the parts that are similar between the files are kept. If the input files contains both historical and scenario simulations, these will be concatenated with a hyphen in the output template. The time period of the template will always represent the whole period as covered by all files. Thus, it will be reformatted to ensure hyphenated form, start-end, even in the case of a single timestamp, e.g, a single year. If the input files contains a frequency keyword, the keyword will be replaced by the output frequency. Otherwise, the output frequency will be added to the end of the output template. The input files from above would result in the following output template:

"tasAdjust_{project}_{run_model_version}_{frequency}"

Note

This works well with CMIP/CORDEX style filenames.

Main features#

Quantile mapping#

The core of bias adjustment in MIdAS is an empirical quantile mapping (EQM) method, where the cumulative distribution functions (CDFs) of the reference data set and of the model data set are used to bias-adjust the original value from model data.

We use superscripts to distinguish reference and model by writing \(X^{\text{ref}}_j\) and \(X^{\text{mod}}_j\), respectively, where j stands for each timestep. Since we are only interested in statistical properties of the timeseries, they may have differing time coordinates, which we refer to as \(t^{\text{ref}}_j\) and \(t^{\text{mod}}_j\) respectively.

The bias adjustment consists of two steps, the calibration and the adjustment. In the calibration step, a transfer function is constructed, which is later applied in the adjustment step.

The overall bias adjustment looks like this:

Use calibration \(C\) to compute transfer function(s)

(1)#\[ F_{j} := C_{j}(t^{\text{ref}}_j, X^{\text{ref}}_j, t^{\text{mod}}_j, X^{\text{mod}}_j)\]

Adjust model data using transfer function(s)

(2)#\[ \tilde{X}^{\text{mod}}_j = F_{j}(t^{\text{mod}}_j, X^{\text{mod}}_j)\]

The adjustment step is completely local and thus particularly well suited for parallelization. Note that the adjustment is never applied to the reference data.

We treat the timeseries for every point independently. This implies, that reference and model data must be located on the same spatial grid. When this is not the case, a remapping must be performed outside of MIdAS before the data can be used as input.

Note

We limit ourselves to daily data. Dissaggregation to sub-daily can easily be performed outside MIdAS.

Day-of-year#

The bias adjustment is performed separately for different times of the year. MIdAS uses a day-of-year (DOY = [1,365]) separation where the time series are split in smaller sections depending on the day of the year. Typically, a moving window of 15 days before and after is used (\(dD = 15\)).

We here assume a 365-year calendar, i.e. without leap days. Leap days are not explicitly handled here, and the DOY is defined from the first of January and counts the days until 365 is reached. Therefore,the 31 December on leap years is not included, but will be bias-adjusted with the same parameters as DOY = 365.

Linear smooth spline fitting#

We fit a linear smoothing spline (LSS) function to the Q–Q plot (similarly to the fitQmapSSPLIN function proposed by Gudmundsson et al., 2012,2016, using the routine splrep from Virtanen et al. (2020), which in turn is based on the FITPACK routines by Paul Dierckx (Dierckx, 1975, 1981, 1982, 1993,1995). This approach allows a good approximation of \(F_{j}\) with far fewer knots in the spline, while still guaranteeing a good representation of all data points.

A linear function is fitted to the 90% most central data points (e.g., the lowest and highest 5 % of data are removed) to avoid excessive impact from outliers in the tails. The weights to the spline function are then defined according to the standard deviation of every point, which is estimated as the distance of that point from the linear fit. To avoid excessive weights for individual points, a minimal standard deviation of 1 percent of the midpoint of all reference values are set to the threshold. All points that have an estimated standard deviation smaller than this get assigned the minimal standard deviation as their estimate. The smoothing factor to the spline function is set the number of data points.

Time-scale Separation#

The time-scale separation involves the computation of anomalies. This is commonly done in one of two ways, additive or multiplicative.

Additive Time-scale Separation#

We calculate the daily mean, \(X_{m}\), for a day, \(d\) with a given window size \(31\) days as:

(3)#\[ X_{m} = \left\langle X_{d'} \right\rangle_{d' \in [d-15, d+15]}\]

The daily anomalies ,\(X_{da}\), are calculated as:

(4)#\[ X_{da} = X_{d} - X_{m}\]

where, \(w\) refers to as a week.

This allows us to decompose the original data as:

(5)#\[ X_d = X_{m} + X_{da}\]

This decomposition facilitates the separate bias adjustment of the three components, giving the final bias adjusted data as:

(6)#\[ \tilde{X}_d = \tilde{X}_{m} + \tilde{X}_{da}\]

Multiplicative Time-scale Separation#

We calculate the daily mean, \(X_{m}\), for a day, \(d\) in the same way as for the additive case, i.e. as a running mean with a given window size: \(31\) days as:

(7)#\[ X_{m} = \left\langle X_{d'} \right\rangle_{d' \in [d-15, d+15]}\]

The daily anomalies are now calculated as:

(8)#\[ X_{da} = X_{d} / X_{m}\]

This allows us to decompose the original data as:

(9)#\[ X_d = X_{m} * X_{da}\]

This decomposition facilitates the separate bias adjustment of the three components, giving the final bias adjusted data as:

(10)#\[ \tilde{X}_d = \tilde{X}_{m} * \tilde{X}_{da}\]

Extrapolation#

A two-step solution is found work well for this issue: (i) excluding data from the tail data in the calibration period, the extrapolation feature is activated for all time periods, even the calibration period, and (ii) applying an outlier insensitive method for linear regression to find an extrapolation parametrisation for the tail (Berg et al., 2024).

The Theil-Sen (Theil, 1950; Sen, 1968) regression (outlier insensitive method) and the normal regression (outlier sensitive method) are implemented to be used to fit. The former fits median of slopes derived from each individual pair of points in the tail, wheras the latter fits the extreme values by using normal least square.

To active it, users need to determine the method to be used, the Theil-Sen (theilsen) or the normal regression (ols), and percentage of data to be excluded. The syntax is extrapolation_method_extrapolation_parameters. For example, theilsen_r01t05 implies that 1) the 1% of the data on the upper tail is excluded; 2) the theil-sen method is used for fitting and its parameters are calibrated on percentiles 94-99.

Indirect bias adjustment#

For some variables, bias-adjusting meteorological variables in a direct way (same type of variables) may introduce errors. Indirect bias adjustment is therefore needed.

hurs_inv#

Bias-adjustment for inverse relative humidity, hurs_inv, is implemented. It is useful in the region where relative humidity is usually high. Before bias-adjustment, the relative humidity is inverted as:

(11)#\[hurs\_\text{inv}= 100 - hurs\]

After bias-adjustment, bias-adjusted inverse relative humdity is then inverted to be relative humidity by:

(12)#\[\widetilde{hurs}^{\text{mod}}_j = 100 - \widetilde{{hurs\_\text{inv}}}^{\text{mod}}_j\]

A range check is always made at the end in order to ensure \(\widetilde{hurs}^{\text{mod}}_j = [0,100]\).

tasmax & tasmin#

Daily maximum temperature, tasmax, and daily minimum temperature, tasmin, are indirectly adjusted. Instead of an independent bias adjustment of tasmax and tasmin, we follow the ISIMIP3 method to bias-adjust these two variables from bias-adjusted tas, tasrange, and tasskew values.

tasrange describes daily temperature range, which is calculated as:

(13)#\[tasrange = tasmax − tasmin\]

tasskew describes the skewness of the daily temperature cycle with consideration of daily temperature range, daily mean temperature and daily minimum temperature. It is calculated as:

(14)#\[tasskew = (tas − tasmin)/tasrange\]

tasrange and tasskew are pre-processed outside MIdAS. These two variables are bias-adjusted by the same way as precipitation (operator, pr) without using -s.

Bias adjustment of tasmax and tasmin are finally achieved via post-process outside MIdAS:

(15)#\[\widetilde{tasmax}^{\text{mod}}_j=\widetilde{tasrange}^{\text{mod}}_j+\widetilde{tasmin}^{\text{mod}}_j\]

(16)#\[\widetilde{tasmin}^{\text{mod}}_j=\widetilde{tas}^{\text{mod}}_j - \widetilde{tasskew}^{\text{mod}}_j*\widetilde{tasrange}^{\text{mod}}_j\]