# Method

## Basic
The MIdAS Python implementation aims to perform efficient bias
adjustment of large data sets, such as multi-decadal climate model data
over large domains.

Adjustments are performed for
 - single grid points, i.e. spatial dependency is not accounted for
 - single variables, i.e. dependency between variables is not accounted for
 - daily time steps
 - model data when model data and reference data are located on the same grid


### Input

The input data, both for `reference` and `model`, should come in the form of timeseries with a fixed timestep, e.g., daily.

<!--```{note}
 By sub-daily we mean that the fixed timestep is a certain number of hours that
 divide the 24 hours of one day. Typical timesteps are 3, 6, or 12 hours.
```
-->
We focus on
the adjustment of model output that comes in the form of a three dimensional
field with two horizontal dimensions and one time dimension, {math}`X(x, y,
t)`. Often, the horizontal dimensions will be given by longitude and latitude,
but particularly for regional models, other dimensions are common too, for
example in projections such as the Lambert conformal projection.

These fields embody physical phenomena, e.g. precipitation and near-surface
temperature and are referred to as variables.
Many variables have standard variable names, for example, precipitations
is usually called `pr` in files and near-surface temperature is often referred
to as `tas`.
In addition, meta-data like `units` and `standard_name` are often specified in
additional standards, for example, the `CMIP6 data request` (https://cordex.org/experiment-guidelines/cordex-cmip6/requests-to-cmip6/) for global climate model
output, or the corresponding `CORDEX standard` (https://cordex.org/experiment-guidelines/cordex-cmip5/cordex-core/cordex-core-simulation-framework/)for the output of regional
climate models.

Technically, the input data is always in `netCDF` files that adhere to the
`CF conventions`(https://cfconventions.org/) and follow the aforementioned applicable standards. Bringing such metadata to CF-convention conformity is the process `Standardization`.


### Output template generation
The output filename can be set by the `-o` command flag. When the `-o` flag is missing, MIdAS automatically generates an output template for the filename given by user, ie., outputfilename.tmp.nc. This works well for files with the same basic filename structure, where only the time stamp, variable and simulation components differ. For example, given a variable `tas` and two input files:
:::{code}
1:"/PATH/tas_{project}_historical_{run_model_version}_day_20060101-20101231.nc"
2:"/PATH/tas_{project}_scenario_{run_model_version}_day_200110101-20151231.nc"
:::
The output template starts with the variable that is being computed, in this case `tas`. Thereafter, the parts that are similar between the files are kept. If the input files contains both historical and scenario simulations, these will be concatenated with a hyphen in the output template. The time period of the template will always represent the whole period as covered by all files. Thus, it will be reformatted to ensure hyphenated form, `start-end`, even in the case of a single timestamp, e.g, a single year. If the input files contains a frequency keyword, the keyword will be replaced by the output frequency. Otherwise, the output frequency will be added to the end of the output template. The input files from above would result in the following output template:
```{code-block} bash
"tasAdjust_{project}_{run_model_version}_{frequency}"
```
<!--
```{Note}
If MIdAS fails to construct the output template because the time period could not be determined, the template will be given as `{index}_{base}.nc`. If the construction fails due to any other problems the fallback template `{index}_{frequency}.nc` will be returned.
```
-->
```{Note}
This works well with CMIP/CORDEX style filenames.
```

## Main features
(QAM)=
### Quantile mapping

The core of bias adjustment in MIdAS is an empirical
quantile mapping (EQM) method, where the cumulative distribution functions
(CDFs) of the reference data set and of the model
data set are used to bias-adjust the original value from model data.

We use superscripts to distinguish `reference` and `model` by writing
{math}`X^{\text{ref}}_j` and {math}`X^{\text{mod}}_j`, respectively, where `j` stands for each timestep.
Since we are only interested in statistical properties of the timeseries, they
may have differing time coordinates, which we refer to as
{math}`t^{\text{ref}}_j` and {math}`t^{\text{mod}}_j` respectively.

The bias adjustment consists of two steps, the calibration and the adjustment.
In the calibration step, a transfer function is constructed, which is later applied in
the adjustment step.

The overall bias adjustment looks like this:

1. Use calibration {math}`C` to compute transfer function(s)

```{math}
:label: mymath
 F_{j} := C_{j}(t^{\text{ref}}_j, X^{\text{ref}}_j, t^{\text{mod}}_j, X^{\text{mod}}_j)
```


2. Adjust model data using transfer function(s)

```{math}
:label: mymath_xm
   \tilde{X}^{\text{mod}}_j = F_{j}(t^{\text{mod}}_j, X^{\text{mod}}_j)
```

The adjustment step is completely local and thus particularly
well suited for parallelization. Note that the adjustment is never applied to the reference data.

We treat the timeseries for every point independently.
This implies, that `reference` and `model` data must be located on the same spatial
grid. When this is not the case, a remapping must be performed outside of MIdAS
before the data can be used as input.

```{Note}
 We limit ourselves to daily data. Dissaggregation to sub-daily can
          easily be performed outside MIdAS.
```

### Day-of-year
The bias adjustment is performed separately for different times of the year.
MIdAS uses a day-of-year (DOY = [1,365]) separation where the time series are split in smaller
sections depending on the day of the year. Typically, a moving window of 15 days before and after is used ({math}`dD = 15`).
<!--, such that
{math}`j \in [DOY(j-dD), DOY(j+dD)]`.

For this purpose the day-of-year (DOY) definition is used:

```{math}
:label: doyeq
DOY(j) =
\left\{
  \begin{array}{ c l }
   j, & \textrm{if } 1 < DOY(j \pm 15) < 365 \\
   j - 365, & \text{if } DOY(j \pm 15) > 365 \\
	365 - j, & \text{if } DOY(j \pm 15) < 0
  \end{array}
\right.
```

{math}`t_j` then identifies the timesteps of the complete timeseries that fall within
a defined range of DOY.

```{Note}
-->
We here assume a 365-year calendar, i.e. without leap days. Leap days are
not explicitly handled here, and the DOY is defined from the first of
January and counts the days until 365 is reached. Therefore,the 31 December on leap years is not included, but will be
bias-adjusted with the same parameters as DOY = 365.

<!--```-->
(sf)=
### Linear smooth spline fitting

<!--To find a good approximation for {math}`F_{j}` that offers better resource efficiency and scalability, -->
We fit
a linear smoothing spline (LSS) function to the Q–Q plot (similarly to the `fitQmapSSPLIN` function proposed by [Gudmundsson
et al., 2012](REF),[2016](REF), using the routine `splrep` from [Virtanen et al. (2020)](REF), which in turn is based on the `FITPACK` routines by
Paul Dierckx ([Dierckx, 1975, 1981, 1982, 1993,1995](REF)). This approach allows a good approximation of {math}`F_{j}` with far fewer
knots in the spline, while still guaranteeing a good representation of all data points.

A linear function is fitted to the 90% most central data points (e.g., the lowest and highest 5 % of data are removed) to avoid excessive impact from outliers in the tails. The weights to the spline function are then defined according to the standard deviation of every point, which is estimated as the distance of that point from the linear fit. To avoid excessive weights for individual points, a minimal standard deviation of 1 percent of the midpoint of all reference values are set to the threshold. All points that have an estimated standard deviation smaller than this get assigned the minimal standard deviation as their estimate. The smoothing factor to the spline function is set the number of data points.


### Time-scale Separation

The time-scale separation involves the computation of anomalies.
This is commonly done in one of two ways, additive or multiplicative.

(addTS)=
#### Additive Time-scale Separation

We calculate the daily mean, {math}`X_{m}`, for a day, {math}`d` with a given window size {math}`31` days as:

```{math}
:label: xm
   X_{m} = \left\langle X_{d'} \right\rangle_{d' \in [d-15, d+15]}
```
The daily anomalies ,{math}`X_{da}`, are calculated as:

```{math}
:label: xda
   X_{da} = X_{d} - X_{m}
```
where,
{math}`w` refers to as a week.

This allows us to decompose the original data as:

```{math}
:label: xd
   X_d = X_{m} + X_{da}
```

This decomposition facilitates the separate bias adjustment of the three
components, giving the final bias adjusted data as:

```{math}
:label: mymath_xddecomp
   \tilde{X}_d = \tilde{X}_{m} + \tilde{X}_{da}
```


#### Multiplicative Time-scale Separation

We calculate the daily mean, {math}`X_{m}`, for a day, {math}`d` in the same way as for the additive case, i.e. as a running mean with a given window size:
{math}`31` days as:

```{math}
:label: mul_XM
   X_{m} = \left\langle X_{d'} \right\rangle_{d' \in [d-15, d+15]}
```
The daily anomalies are now calculated as:

```{math}
:label: mul_XDa
   X_{da} = X_{d} / X_{m}
```
This allows us to decompose the original data as:

```{math}
:label: mul_XD
   X_d = X_{m} * X_{da}
```

This decomposition facilitates the separate bias adjustment of the three
components, giving the final bias adjusted data as:

```{math}
:label: mymath_XDdec
   \tilde{X}_d = \tilde{X}_{m} * \tilde{X}_{da}
```

### Extrapolation

<!--
We explore different methods to handle data at the tails of the distribution while using quantile mapping method. Two precipitation indicators, the annual mean and the maximum daily precipitation amount per year, are used for evaluation.
-->


A two-step solution is found work well for this issue:
(i) excluding data from the tail data in the calibration period, the extrapolation feature is activated for all time periods, even the
calibration period, and (ii) applying an outlier insensitive method for linear regression to find an extrapolation
parametrisation for the tail ([Berg et al., 2024](REF)).

The Theil-Sen ([Theil, 1950](REF); [Sen, 1968](REF)) regression (outlier insensitive method) and the normal regression (outlier sensitive method) are implemented to be used to fit. The former fits median of slopes derived from each individual pair of points in the tail, wheras the latter fits the extreme values by using normal least square.

To active it, users need to determine the method to be used, the Theil-Sen (theilsen) or the normal regression (ols), and percentage of data to be excluded. The syntax is `extrapolation_method`_`extrapolation_parameters`. For example, `theilsen_r01t05` implies that 1) the 1% of the data on the upper tail is excluded; 2) the theil-sen method is used for fitting and its parameters are calibrated on percentiles 94-99.


(sp)=
## Indirect bias adjustment

For some variables, bias-adjusting meteorological variables in a direct way (same type of variables) may introduce errors. Indirect bias adjustment is therefore needed.
(hursinv)=
### hurs_inv
Bias-adjustment for inverse relative humidity, `hurs_inv`, is implemented. It is useful in the region where relative humidity is usually high.
Before bias-adjustment, the relative humidity is inverted as:
```{math}
:label: mymath_hursinv
hurs\_\text{inv}= 100 - hurs
```
After bias-adjustment, bias-adjusted inverse relative humdity is then inverted to be relative humidity by:
```{math}
:label: mymath_hursadj
\widetilde{hurs}^{\text{mod}}_j = 100 - \widetilde{{hurs\_\text{inv}}}^{\text{mod}}_j
```
A range check is always made at the end in order to ensure {math}`\widetilde{hurs}^{\text{mod}}_j = [0,100]`.

(tasmax_tasmin)=
### tasmax & tasmin
Daily maximum temperature, `tasmax`, and daily minimum temperature, `tasmin`, are indirectly adjusted.
Instead of an independent bias adjustment of `tasmax` and `tasmin`, we follow the ISIMIP3 method to bias-adjust these two variables from bias-adjusted `tas`, `tasrange`, and `tasskew` values.

`tasrange` describes daily temperature range, which is calculated as:
 ```{math}
:label: mymath_tasrange
tasrange = tasmax − tasmin
```
`tasskew` describes the skewness of the daily temperature cycle with consideration of daily temperature range, daily mean temperature and daily minimum temperature. It is calculated as:

```{math}
:label: mymath_tasskew
tasskew = (tas − tasmin)/tasrange
```
`tasrange` and `tasskew` are pre-processed outside MIdAS. These two variables are bias-adjusted by the same way as precipitation (operator, pr) without using `-s`.

Bias adjustment of  `tasmax` and  `tasmin` are finally achieved via post-process outside MIdAS:
```{math}
:label: mymath_tasmaxadj
\widetilde{tasmax}^{\text{mod}}_j=\widetilde{tasrange}^{\text{mod}}_j+\widetilde{tasmin}^{\text{mod}}_j
```

```{math}
:label: mymath_tasminadj
\widetilde{tasmin}^{\text{mod}}_j=\widetilde{tas}^{\text{mod}}_j - \widetilde{tasskew}^{\text{mod}}_j*\widetilde{tasrange}^{\text{mod}}_j
```