API Reference

This section provides comprehensive documentation for all public APIs.

Core Functions

Control the length of the time series data from a simulation run.

It starts by drawing initial_run_length observations (samples), calling the get_trajectory function in a loop until equilibration is reached or the warm-up period has passed.

Note

get_trajectory is a callback function with a specific signature of get_trajectory(nstep: int) -> 1darray if we only have one variable or get_trajectory(nstep: int) -> 2darray with the shape of (number_of_variables, nstep)

To use extra arguments in the get_trajectory, one can use the other specific signature of get_trajectory(nstep: int, args: dict) -> 1darray or get_trajectory(nstep: int, args: dict) -> 2darray with the shape of (number_of_variables, nstep)

where all the required variables can be passed through the args dictionary.

All the values returned from this function should be finite; otherwise the code stops with an error message explaining the issue.

Examples:

>>> import numpy as np
>>> rng = np.random.RandomState(12345)
>>> start = 0
>>> stop = 0
>>> def get_trajectory(step):
...     global start, stop
...     start = stop
...     if 100000 < start + step:
...         step = 100000 - start
...     stop += step
...     data = np.ones(step) * 10 + (rng.random_sample(step) - 0.5)
...     return data

or,

>>> targs = {'start': 0, 'stop': 0}
>>> def get_trajectory(step, targs):
...     targs['start'] = targs['stop']
...     if 100000 < targs['start'] + step:
...         step = 100000 - targs['start']
...     targs['stop'] += step
...     data = np.ones(step) * 10 + (rng.random_sample(step) - 0.5)
...     return data
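The finiteness requirement on the callback's return values can be illustrated with a small check. This is a minimal standard-library sketch, not the library's own validation: require_finite and the two-variable get_trajectory below are hypothetical names, and nested lists stand in for the 1darray and 2darray return types.

```python
import math


def require_finite(values):
    """Raise ValueError if any value is NaN or infinite; return values unchanged."""
    for row in values:
        # Accept a flat sequence (one variable) or a sequence of rows
        # with shape (number_of_variables, nstep).
        items = row if isinstance(row, (list, tuple)) else (row,)
        for item in items:
            if not math.isfinite(item):
                raise ValueError(f"non-finite value in trajectory: {item!r}")
    return values


def get_trajectory(nstep):
    # Hypothetical two-variable trajectory with shape (2, nstep).
    return [[10.0] * nstep, [float(i) for i in range(nstep)]]


checked = require_finite(get_trajectory(4))
```

Passing a trajectory that contains NaN or an infinity to require_finite raises ValueError, which mirrors the documented behavior of stopping with an explanatory error.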

Then it continues drawing observations until some pre-specified level of absolute or relative precision has been reached.

The precision is defined as the half-width of the estimator’s confidence interval (CI); the relative precision is that half-width divided by the sample mean.

At each checkpoint, an upper confidence limit (UCL) is approximated. The drawing of observations is terminated if the UCL is less than the pre-specified absolute precision, absolute_accuracy, or if the relative UCL (UCL divided by the computed sample mean) is less than a pre-specified value, relative_accuracy.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that is in the stationary region.

The relative accuracy is the confidence interval half-width or UCL divided by the sample mean. If the ratio is bigger than relative_accuracy, the length of the time series is deemed not long enough to estimate the mean with sufficient accuracy, which means the run should be extended.
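The termination test described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the samples are assumed to be independent, a normal quantile is used for the half-width, and half_width and is_converged are hypothetical helper names.

```python
import math
import statistics


def half_width(samples, confidence_coefficient=0.95):
    """Half-width of a two-sided CI for the mean, assuming independent samples."""
    n = len(samples)
    # Assumption of this sketch: a normal quantile is used for simplicity.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence_coefficient / 2.0)
    return z * statistics.stdev(samples) / math.sqrt(n)


def is_converged(samples, relative_accuracy=0.1, absolute_accuracy=0.1,
                 confidence_coefficient=0.95):
    """Apply the relative criterion, or the absolute one when it is unusable."""
    ucl = half_width(samples, confidence_coefficient)
    mean = statistics.fmean(samples)
    if relative_accuracy is None or mean == 0.0:
        return ucl < absolute_accuracy
    return ucl / abs(mean) < relative_accuracy


samples = [9.5, 10.5] * 50
relative_ok = is_converged(samples, relative_accuracy=0.1)
absolute_ok = is_converged(samples, relative_accuracy=None, absolute_accuracy=0.05)
```

With these 100 samples the half-width is roughly 0.098, so the relative criterion (about 1 % of the mean) is met while an absolute requirement of 0.05 is not.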

To limit the cost of sequential UCL evaluation, this calculation should not be repeated too frequently. Heidelberger and Welch (1981) [heidelberger1981] suggested increasing the run length by a factor run_length_factor > 1.5 each time, so that each estimate has the same, reasonably large, proportion of new data.
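A geometric checkpoint schedule of this kind can be sketched as below. The exact extension rule used by the library is not stated in this section, so the rule here (multiply the total run length by run_length_factor and cap it at maximum_run_length) is an assumption, and run_length_schedule is a hypothetical helper.

```python
def run_length_schedule(initial_run_length=2000, run_length_factor=1.5,
                        maximum_run_length=10000):
    """Total run length at each checkpoint, capped at maximum_run_length."""
    lengths = [min(initial_run_length, maximum_run_length)]
    while lengths[-1] < maximum_run_length:
        grown = int(lengths[-1] * run_length_factor)
        # Always advance by at least one step so the loop terminates,
        # even for a factor of 1.0.
        lengths.append(min(max(grown, lengths[-1] + 1), maximum_run_length))
    return lengths


schedule = run_length_schedule()
# schedule == [2000, 3000, 4500, 6750, 10000]
```

Each checkpoint after the first adds a third of the data seen so far, which is the "same, reasonably large, proportion of new data" that a factor of 1.5 provides.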

The accuracy parameter relative_accuracy specifies the maximum relative error that will be allowed in the mean value of the time-series data. In other words, it bounds the distance from the confidence limit(s) to the mean (also known as the precision, half-width, or margin of error), expressed as a fraction of the mean. A value of 0.01 is usually used to request two digits of accuracy, and so forth.

The parameter confidence_coefficient is the confidence coefficient; the value 0.95 is often used. The confidence coefficient can be interpreted as follows:

If thousands of samples of n items are drawn from a population using simple random sampling and a confidence interval is calculated for each sample, the proportion of those intervals that will include the true population mean is confidence_coefficient.
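This repeated-sampling interpretation can be checked numerically. The sketch below is only an illustration: it draws normally distributed samples with a known mean, builds a normal-quantile interval for each sample, and counts how often the interval contains the true mean. The function name coverage and all of its defaults are assumptions made for the example.

```python
import math
import random
import statistics


def coverage(confidence_coefficient=0.95, n=50, trials=2000, seed=12345):
    """Fraction of sample-based intervals that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence_coefficient / 2.0)
    true_mean = 10.0
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        half = z * statistics.stdev(sample) / math.sqrt(n)
        if abs(statistics.fmean(sample) - true_mean) <= half:
            hits += 1
    return hits / trials


fraction = coverage()
```

The returned fraction should land close to confidence_coefficient; it is typically slightly below it here because a normal quantile is used instead of a Student's t quantile.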

The maximum_run_length parameter places an upper bound on how long the simulation will run. If the specified accuracy cannot be achieved within this time, the simulation will terminate, and a warning message will appear in the report.

The maximum_equilibration_step parameter places an upper bound on how long the simulation will run to reach equilibration or pass the warm-up period. If the equilibration or warm-up period cannot be detected within this time, the simulation will terminate and a warning message will appear in the report.

Note

By default and if not specified on input, the maximum_equilibration_step is defined as half of the maximum_run_length.

Note

By default, the algorithm will use relative_accuracy as a termination criterion, and in case of failure, it switches to use the absolute_accuracy.

If using the absolute_accuracy is desired, one should set the relative_accuracy to None.

Examples:

>>> run_length_control(get_trajectory,
...                    number_of_variables=1,
...                    relative_accuracy=None,
...                    absolute_accuracy=0.1)

The algorithm converts relative_accuracy and absolute_accuracy floating-point numbers to arrays with the shape of (number_of_variables, ) when number_of_variables is bigger than one. By default, it uses relative_accuracy as the termination criterion for the corresponding variable, and in case of failure, it switches to absolute_accuracy.

If absolute_accuracy is desired for one or more variables, one should provide both relative_accuracy and absolute_accuracy as arrays, set the corresponding relative_accuracy entry to None, and set the desired absolute_accuracy at the same position in the array.

E.g.,

>>> run_length_control(get_trajectory,
...                    number_of_variables=3,
...                    relative_accuracy=[0.1, 0.05, None],
...                    absolute_accuracy=[0.1, 0.05, 0.1])

or,

>>> run_length_control(get_trajectory,
...                    number_of_variables=3,
...                    relative_accuracy=[None, 0.05, None],
...                    absolute_accuracy=[0.1,  0.05, 0.1])
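The per-variable selection rule can be sketched in a few lines. This is a hypothetical illustration of the documented convention (a scalar applies to every variable, and a None relative entry selects the absolute criterion); expand and criteria are not part of the library.

```python
def expand(value, number_of_variables):
    """Broadcast a scalar (or None) to a list with one entry per variable."""
    if isinstance(value, (list, tuple)):
        if len(value) != number_of_variables:
            raise ValueError("expected one entry per variable")
        return list(value)
    return [value] * number_of_variables


def criteria(relative_accuracy, absolute_accuracy, number_of_variables):
    """Return the (kind, target) pair used for each variable."""
    rel = expand(relative_accuracy, number_of_variables)
    ab = expand(absolute_accuracy, number_of_variables)
    # A None relative entry selects the absolute criterion for that variable.
    return [("absolute", a) if r is None else ("relative", r)
            for r, a in zip(rel, ab)]


selected = criteria([None, 0.05, None], [0.1, 0.05, 0.1], 3)
# selected == [("absolute", 0.1), ("relative", 0.05), ("absolute", 0.1)]
```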

Note

confidence_interval_approximation_method selects the method used to approximate the upper confidence limit of the mean.

The default (the uncorrelated_sample approach) uses the independent samples in the time-series data to approximate the confidence intervals for the mean. The other methods take different approaches.

E.g., the heidel_welch method requires no such independence assumption. In this spectral approach, the problem of dealing with dependent data is largely avoided by working in the frequency domain with the sample spectrum (periodogram) of the process.
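To make the term "sample spectrum (periodogram)" concrete, the sketch below evaluates it with a direct discrete Fourier sum. The normalization by N is one common convention and is an assumption here; the library's own scaling, averaging, and FFT-based evaluation may differ. periodogram is a hypothetical helper.

```python
import cmath
import math


def periodogram(x):
    """I(k/N) = |sum_j x_j exp(-2*pi*i*j*k/N)|**2 / N for k = 1 .. N//2."""
    n = len(x)
    out = {}
    for k in range(1, n // 2 + 1):
        s = sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        out[k] = abs(s) ** 2 / n
    return out


# A pure cosine with three cycles over 32 points puts all its power at k = 3.
series = [math.cos(2.0 * math.pi * 3 * j / 32) for j in range(32)]
spectrum = periodogram(series)
peak = max(spectrum, key=spectrum.get)
```

The direct sum costs O(N^2) and is only meant for small illustrations; for long series an FFT is the usual choice.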

Note

population_mean is the known (true) mean of a variable, i.e., the expected value under the null hypothesis. It is extra information for normally distributed data.

Note

For non-normally distributed data, and as an extra check on convergence, one should provide the population information using population_cdf, population_args, population_loc, and population_scale for a specific distribution.

Parameters:

get_trajectory (callback function) –
A callback function with a specific signature of get_trajectory(nstep: int) -> 1darray if we only have one variable or get_trajectory(nstep: int) -> 2darray with the shape of (number_of_variables, nstep)

Note

All the values returned from this function should be finite; otherwise the code stops with an error message explaining the issue.
get_trajectory_args (dict, optional) – Extra arguments passed to the get_trajectory function. (default: {}) To use this option, the dictionary may contain start and stop keywords as well as other keywords which are needed in the function. get_trajectory(nstep, get_trajectory_args) -> 1darray
number_of_variables (int, optional) – number of variables in the corresponding time-series data from get_trajectory callback function. (default: 1)
initial_run_length (int, optional) – initial run length. (default: 2000)
run_length_factor (float, optional) – run length increasing factor. (default: 1.0)
maximum_run_length (int, optional) – the maximum run length represents a cost constraint. (default: 1000000)
maximum_equilibration_step (int, optional) – the maximum number of steps as an equilibration hard limit. If the algorithm finds equilibration_step greater than this limit it will fail. For the default None, the function is using maximum_run_length // 2 as the maximum equilibration step. (default: None)
minimum_number_of_independent_samples (int, optional) – minimum number of independent samples. This is an extra parameter to terminate the run after the pre-specified level of absolute or relative precision has been reached and there are minimum number of independent samples available for further analysis. (default: None)
relative_accuracy (float, or 1darray, optional) – a relative half-width requirement or the accuracy parameter. Target value for the ratio of halfwidth to sample mean. If number_of_variables > 1, relative_accuracy can be a scalar to be used for all variables or a 1darray of values of size number_of_variables. (default: 0.1)
absolute_accuracy (float, or 1darray, optional) – a half-width requirement or the accuracy parameter. Target value for the half-width. If number_of_variables > 1, absolute_accuracy can be a scalar to be used for all variables or a 1darray of values of size number_of_variables. (default: 0.1)
population_mean (float, or 1darray, optional) –
variable known (true) mean. Expected value in null hypothesis. (default: None)
Note

For number_of_variables > 1, and if population_mean is provided, it should be a list or array of values. It should be set to None for variables which we do not intend to use this extra measure.

Examples:
```
>>> run_length_control(get_trajectory,
...                    number_of_variables=3,
...                    population_mean=[None, 297., None])
```
population_standard_deviation (float, or 1darray, optional) –
population standard deviation. (default: None)
Note

For number_of_variables > 1, and if population_standard_deviation is provided, it should be a list or array of values. It should be set to None for variables which we do not intend to use this extra measure.

Examples:
```
>>> run_length_control(
...     get_trajectory,
...     number_of_variables=3,
...     population_mean=[None, 297., None],
...     population_standard_deviation=[None, 10., None])
```

population_cdf (str, or 1darray, optional) –

The name of a distribution. (default: None)

Examples:

>>> run_length_control(
...     get_trajectory,
...     number_of_variables=2,
...     population_cdf=[None, 'gamma'],
...     population_args=[None, (1.99,)],
...     population_loc=[None, None],
...     population_scale=[None, None])

or,

>>> run_length_control(
...     get_trajectory,
...     number_of_variables=2,
...     population_mean=[297., None],
...     population_standard_deviation=[10., None],
...     population_cdf=[None, 'gamma'],
...     population_args=[None, (1.99,)],
...     population_loc=[None, None],
...     population_scale=[None, None])

population_args (tuple, or list of tuples, optional) – Distribution parameter. (default: None)
population_loc (float, or 1darray, or None) – location of the distribution. (default: None)
population_scale (float, or 1darray, or None) – scale of the distribution. (default: None)
confidence_coefficient (float, optional) – the confidence coefficient (or confidence level). It must be between 0.0 and 1.0 and represents the confidence used to estimate the relative half-widths. (default: 0.95)
confidence_interval_approximation_method (str, optional) – Method to use for approximating the upper confidence limit of the mean. One of the ucl_methods approaches. (default: ‘uncorrelated_sample’)
heidel_welch_number_points (int, optional) – the number of points in Heidelberger and Welch’s spectral method that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)
train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – a method to standardize a batched dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – if True, use batched data minus the scale method centering approach. (default: False)
with_scaling (bool, optional) – if True, scale the batched data to scale method scaling approach. (default: False)
ignore_end (int, or float, or None, optional) – if int, it is the last few (batch) points that should be ignored. If float, it should be in (0, 1) and it is the fraction of last (batch) points that should be ignored. If None, it is set to batch_size in the batch method and to one fourth of the total number of points elsewhere. (default: None)
number_of_cores (int, optional) – The maximum number of concurrently running jobs, such as the number of Python worker processes or the size of the thread-pool. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. (default: 1)
si (str, optional) – statistical inefficiency method. (default: ‘statistical_inefficiency’)
nskip (int, optional) – the number of data points to skip in estimating ucl. (default: 1)
minimum_correlation_time (int, optional) – The minimum amount of correlation function to compute in estimating ucl. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
equilibration_solver (str, optional) – offset-search strategy used when refining the equilibration length, one of "auto", "exhaustive", or "unimodal". "auto" uses the exact exhaustive scan for short series and switches to the fast unimodal (ternary) search for long series, where the exhaustive scan’s \(O(N^2 \log N)\) cost is prohibitive. See estimate_equilibration_length for details. (default: “auto”)
dump_trajectory (bool, optional) – if True, dump the final trajectory data to a file dump_trajectory_fp. (default: False)
dump_trajectory_fp (str, object with a write(string) method, optional) – a .write()-supporting file-like object or a name string to open a file. (default: ‘kim_convergence_trajectory.edn’)
fp (str, object with a write(string) method, optional) – if it is a str equal to 'return', the function returns a string of the analysis results on the length of the time series. Otherwise it must be an object with a write(string) method. If it is None, sys.stdout is used, which prints to the screen. (default: None)
fp_format (str) – one of the txt, json, or edn format. (default: ‘txt’)

Returns:

Union[str, bool]: True if the length of the time series is long enough to estimate the mean with sufficient accuracy or with enough requested sample size; False otherwise. If fp == 'return', a string containing the analysis results is returned instead.

UCL Methods

Upper Confidence Limit (UCL) module.

Upper Confidence Limit (UCL): The upper boundary (or limit) of a confidence interval of a parameter of interest such as the population mean.

A confidence interval is how much uncertainty there is with any particular statistic [nistdiv898]. Confidence limits for the mean are interval estimates. Interval estimates are often desirable because instead of a single estimate for the mean, a confidence interval generates a lower and upper limit. It indicates how much uncertainty there is in our estimation of the true mean. The narrower the gap, the more precise our estimate is. We use a confidence level to express confidence limits. Choosing the confidence level is somewhat arbitrary, but 90 %, 95 %, and 99 % intervals are standard, and 95 % is the most commonly used.

Note

One should note that a 95 % confidence interval does not mean a 95 % probability of containing the true mean. The interval computed from a sample either has the true mean, or it does not. The confidence level is simply the proportion of samples of a given size that may be expected to contain the true mean. For a 95 % confidence interval, if many samples are collected and the confidence interval computed, in the long run, about 95 % of these intervals would contain the true mean.

class kim_convergence.ucl.HeidelbergerWelch

Heidelberger and Welch algorithm.

Heidelberger and Welch (1981) [heidelberger1981] Object.

heidel_welch_set

Flag indicating if the Heidelberger and Welch constants are set.

Type:: bool

heidel_welch_k

The number of points that are used to obtain the polynomial fit in Heidelberger and Welch’s spectral method.

Type:: int

heidel_welch_n

The number of time series data points or number of batches in Heidelberger and Welch’s spectral method.

Type:: int

heidel_welch_p

Probability.

Type:: float

a_matrix

Auxiliary matrix.

Type:: ndarray

a_matrix_1_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the first degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:: ndarray

a_matrix_2_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the second degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:: ndarray

a_matrix_3_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the third degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:: ndarray

heidel_welch_c1_1

Heidelberger and Welch’s C1 constant for the first degree polynomial fit.

Type:: float

heidel_welch_c1_2

Heidelberger and Welch’s C1 constant for the second degree polynomial fit.

Type:: float

heidel_welch_c1_3

Heidelberger and Welch’s C1 constant for the third degree polynomial fit.

Type:: float

heidel_welch_c2_1

Heidelberger and Welch’s C2 constant for the first degree polynomial fit.

Type:: float

heidel_welch_c2_2

Heidelberger and Welch’s C2 constant for the second degree polynomial fit.

Type:: float

heidel_welch_c2_3

Heidelberger and Welch’s C2 constant for the third degree polynomial fit.

Type:: float

tm_1

t_distribution inverse cumulative distribution function for C2_1 degrees of freedom.

Type:: float

tm_2

t_distribution inverse cumulative distribution function for C2_2 degrees of freedom.

Type:: float

tm_3

t_distribution inverse cumulative distribution function for C2_3 degrees of freedom.

Type:: float

get_heidel_welch_auxilary_matrices() → tuple: Get the Heidelberger and Welch auxiliary matrices.

get_heidel_welch_c1() → tuple: Get the Heidelberger and Welch C1 constants.

get_heidel_welch_c2() → tuple: Get the Heidelberger and Welch C2 constants.

get_heidel_welch_constants() → tuple: Get the Heidelberger and Welch constants.

get_heidel_welch_knp() → tuple: Get the heidel_welch_number_points, n, and confidence_coefficient.

get_heidel_welch_tm() → tuple

Get the Heidelberger and Welch t_distribution ppf for C2 degrees of freedom.

is_heidel_welch_set() → bool: Return True if the flag is set to True.

set_heidel_welch_constants(*, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50)

Set Heidelberger and Welch constants globally.

Set the constants necessary for application of the Heidelberger and Welch’s [heidelberger1981] confidence interval generation method.

Parameters:

confidence_coefficient (float) – probability (or confidence interval) and must be between 0.0 and 1.0. (default: 0.95)
heidel_welch_number_points (int) – the number of points in Heidelberger and Welch’s spectral method that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)

unset_heidel_welch_constants(): Unset the Heidelberger and Welch flag.

class kim_convergence.ucl.MSER_m

MSER-m algorithm.

The MSER [white1997] and MSER-5 [spratt1998] rules determine the truncation point as the value of \(d\) that best balances the tradeoff between improved accuracy (elimination of bias) and decreased precision (reduction in the sample size) for the input series. They select a truncation point that minimizes the width of the marginal confidence interval about the truncated sample mean. The marginal confidence interval is a measure of the homogeneity of the truncated series. The optimal truncation point \(d(j)^*\) selected by MSER-m can be expressed as:

\[d(j)^* = \underset{n>d(j) \geq 0}{\text{argmin}} \left[ \frac{1}{(n(j)-d(j))^2} \sum_{i=d}^{n}{\left(X_i(j)- \bar{X}_{n,d}(j) \right )^2} \right]\]

MSER-m applies the equation to a series of batch averages instead of the raw series. The CI estimators can be computed from the truncated sequence of batch means.
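The argmin above can be sketched directly on a list of batch means. This is an illustrative standard-library version, not the library's mser_m: the function names are hypothetical, the truncation candidate d counts dropped batch means (0-based), and the default of ignoring the last quarter of the batches is an assumption of the sketch.

```python
def batch_means(data, batch_size):
    """Means of consecutive, non-overlapping batches; a trailing remainder is dropped."""
    n_batches = len(data) // batch_size
    return [sum(data[i * batch_size:(i + 1) * batch_size]) / batch_size
            for i in range(n_batches)]


def mser_m_sketch(data, batch_size=5, ignore_end=None):
    """Return the truncation index (in raw observations) that minimizes MSER-m."""
    z = batch_means(data, batch_size)
    n = len(z)
    if ignore_end is None:
        # Assumption: skip the last quarter of the batches, where the
        # statistic is computed from too few points to be reliable.
        ignore_end = max(1, n // 4)
    best_d, best_stat = 0, float("inf")
    for d in range(0, n - ignore_end):
        tail = z[d:]
        mean = sum(tail) / len(tail)
        stat = sum((v - mean) ** 2 for v in tail) / len(tail) ** 2
        if stat < best_stat:
            best_d, best_stat = d, stat
    return best_d * batch_size


# Ten biased warm-up observations followed by a stationary signal around zero.
data = [100.0] * 10 + [1.0, -1.0] * 45
truncation_point = mser_m_sketch(data, batch_size=5)
```

For this series the statistic is smallest once both warm-up batches are dropped, so the sketch reports a truncation point of 10 observations.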

estimate_equilibration_length(time_series_data: ndarray | list[float], *, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, ignore_end: int | float | None = None, number_of_cores: int = 1, si: str | float | int | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None) → tuple[bool, int]: Estimate the equilibration point in a time series data.

class kim_convergence.ucl.MSER_m_y

MSER_m_y algorithm.

MSER_m_y [yousefi2011] computes k batch means of size m to evaluate the MSER-m statistic as described in [spratt1998] and detect the truncation point. If the truncation is detected, the point estimator of the mean is the sample mean of all observations in the truncated data set.

To compute the UCL, the MSER_m_y applies the von Neumann randomness test [vonneumann1941], [vonneumann1941b] to the truncated data to find a new batch size \(m^*\) for which the new batch means are approximately independent. It checks the randomness test on successively larger batch sizes until the test is finally passed and the batch means are finally determined to be approximately independent of each other. It starts by setting the initial batch size m to 1 and calculating the number of batches k’ accordingly.
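The statistic behind the von Neumann randomness test is the ratio of the sum of squared successive differences to the sum of squared deviations from the mean; for independent data it is close to 2. The sketch below computes only this ratio. It does not reproduce the critical values or the batch-size search of MSER_m_y, and von_neumann_ratio is a hypothetical helper name.

```python
def von_neumann_ratio(x):
    """Sum of squared successive differences over sum of squared deviations."""
    n = len(x)
    mean = sum(x) / n
    denominator = sum((v - mean) ** 2 for v in x)
    numerator = sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1))
    return numerator / denominator


# Strong positive correlation (a trend) pushes the ratio toward 0,
# strong negative correlation (alternation) pushes it toward 4.
trend = von_neumann_ratio([float(i) for i in range(100)])
alternating = von_neumann_ratio([1.0, -1.0] * 50)
```

A ratio far below 2 indicates positively correlated batch means, which is the situation in which a larger batch size would be tried.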

significance_level

Significance level. A probability threshold below which the null hypothesis will be rejected.

Type:: float

class kim_convergence.ucl.N_SKART

N-Skart algorithm.

N-Skart [tafazzoli2011] is a nonsequential procedure designed to compute the half-width of the confidence_coefficient% probability interval (CI) (confidence interval, or credible interval) around the time-series mean.

Note

N-Skart is a variant of the method of batch means.

N-Skart makes some modifications to the confidence interval (CI). These modifications account for the skewness (non-normality), and autocorrelation of the batch means which affect the distribution of the underlying Student’s t-statistic.

k_number_batches

number of nonspaced (adjacent) batches of size batch_size.

Type:: int

kp_number_batches

number of nonspaced (adjacent) batches.

Type:: int

batch_size

batch size.

Type:: int

number_batches_per_spacer

number of batches per spacer.

Type:: int

maximum_number_batches_per_spacer

maximum number of batches per spacer.

Type:: int

significance_level

Significance level. A probability threshold below which the null hypothesis will be rejected.

Type:: float

randomness_test_counter

counter for applying the randomness test of von Neumann [vonneumann1941] [vonneumann1941b].

Type:: int

estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) → tuple[bool, int]

Estimate the equilibration point in a time series data using the N-Skart algorithm.

Parameters:

time_series_data (array_like, 1d) – time series data.

Returns:

tuple[bool, int]: truncated: True if truncation was applied. truncation_point: Index at which to truncate.

Note

If N-Skart does not detect the equilibration, it returns truncated as False and an equilibration index equal to the last index in the time series data.

Note

nskip, ignore_end, and number_of_cores are accepted for API compatibility but are not used by this method.

class kim_convergence.ucl.UCLBase

Upper Confidence Limit base class.

ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) → tuple[float, float]: Approximate the confidence interval of the mean.

estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) → tuple[bool, int]: Estimate the equilibration point in a time series data.

property indices: Get the indices.

property mean: Get the mean.

property name: Get the name.

relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) → float: Get the relative half width estimate.

requires_si_computation() → bool: Return True if this UCL method requires statistical inefficiency computation.

property sample_size: Get the sample_size.

Set the indices.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

set_si(time_series_data, *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) → None

Set the si (statistical inefficiency).

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
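The stopping rule in the minimum_correlation_time description can be illustrated with a direct (non-FFT) estimate of the statistical inefficiency, g = 1 + 2 * sum_t (1 - t/N) C(t), where C(t) is the normalized autocorrelation at lag t. This is a sketch under stated assumptions, not the library's estimator: the function name, the default of 3 for minimum_correlation_time, and the lower bound of 1 on the result are choices made for the example.

```python
import random


def statistical_inefficiency_sketch(x, minimum_correlation_time=3):
    """Estimate g by summing C(t) until it first goes non-positive."""
    n = len(x)
    mean = sum(x) / n
    dx = [v - mean for v in x]
    var = sum(d * d for d in dx) / n
    if var == 0.0:
        raise ValueError("constant series: statistical inefficiency is undefined")
    si = 1.0
    for t in range(1, n - 1):
        c = sum(dx[i] * dx[i + t] for i in range(n - t)) / ((n - t) * var)
        # Stop at the first non-positive C(t) beyond the minimum lag.
        if c <= 0.0 and t > minimum_correlation_time:
            break
        si += 2.0 * c * (1.0 - t / n)
    return max(si, 1.0)


# Blocks of repeated values are strongly correlated at short lags.
correlated = ([0.0] * 10 + [1.0] * 10) * 5
si_correlated = statistical_inefficiency_sketch(correlated)

rng = random.Random(0)
si_random = statistical_inefficiency_sketch([rng.gauss(0.0, 1.0) for _ in range(200)])
```

Dividing the number of observations by g gives a rough count of effectively independent samples, which is why a correlated series needs a longer run for the same accuracy.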

property si: Get the si.

property std: Get the std.

ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) → float: Approximate the upper confidence limit of the mean.

class kim_convergence.ucl.UncorrelatedSamples: UncorrelatedSamples algorithm.

kim_convergence.ucl.heidelberger_welch_ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) → tuple[float, float]

Approximate the confidence interval of the mean.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – probability (or confidence interval) and must be between 0.0 and 1.0, and represents the confidence for calculation of relative halfwidths estimation. (default: 0.95)
heidel_welch_number_points (int, optional) – the number of points that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)
fft (bool, optional) – Use FFT convolution for long series. (default: True)
test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)
train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)
obj (HeidelbergerWelch, optional) – instance of HeidelbergerWelch (default: None)

Returns:

tuple[float, float]: Lower and upper confidence limits for the mean.

kim_convergence.ucl.heidelberger_welch_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) → float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that is in the stationary region.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – probability (or confidence interval) and must be between 0.0 and 1.0, and represents the confidence for calculation of relative halfwidths estimation. (default: 0.95)
heidel_welch_number_points (int, optional) – the number of points that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)
train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)
obj (HeidelbergerWelch, optional) – instance of HeidelbergerWelch (default: None)

Returns:

float: Relative half width estimate.

kim_convergence.ucl.heidelberger_welch_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) → float: Approximate the upper confidence limit of the mean.

kim_convergence.ucl.mser_m(time_series_data: ndarray | list[float], *, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, ignore_end: int | float | None = None) → tuple[bool, int]

Determine the truncation point using marginal standard error rules.

Determine the truncation point using marginal standard error rules (MSER). The MSER [white1997] and MSER-5 [spratt1998] rules determine the truncation point as the value of \(d\) that best balances the tradeoff between improved accuracy (elimination of bias) and decreased precision (reduction in the sample size) for the input series. They select a truncation point that minimizes the width of the marginal confidence interval about the truncated sample mean. The marginal confidence interval is a measure of the homogeneity of the truncated series. The optimal truncation point \(d(j)^*\) selected by MSER-m can be expressed as:

\[d(j)^* = \underset{n>d(j) \geq 0}{\text{argmin}} \left[ \frac{1}{(n(j)-d(j))^2} \sum_{i=d}^{n}{\left(X_i(j)- \bar{X}_{n,d}(j) \right )^2} \right]\]

MSER-m applies the equation to a series of batch averages instead of the raw series.

Parameters:

time_series_data (array_like, 1d) – Time series data.
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)
ignore_end (int, or float, or None, optional) – if int, the number of trailing batch points to ignore. If float, it should be in (0, 1) and is the fraction of trailing batch points to ignore. If None, it is set to \(Min(batch_size, number_batches / 4)\). (default: None)

Returns:

tuple[bool, int]: truncated: True if truncation was applied. truncation_point: Index at which to truncate.

Note

MSER-m sometimes erroneously reports a truncation point at the end of the data series. This is because the method can be overly sensitive to observations at the end of the data series that are close in value. Here, we avoid this artifact, by not allowing the algorithm to consider the standard errors calculated from the last few data points.

Note

If the truncation point returned by MSER-m is greater than n/2, it is considered invalid and truncated is returned as False. This means the method has not been provided with enough data to produce a valid result, and more data is required.

Note

If the truncation obtained by MSER-m is the last index of the batched data, the MSER-m returns the time series data’s last index as the truncation point. This index can be used as a measure that the algorithm did not find any truncation point.
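The rule and the notes above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the name mser_m_sketch is hypothetical, standardization (scale, with_centering, with_scaling) is omitted, ignore_end is assumed to be an integer count of trailing batches, and truncated is simplified to report only whether the point lies within the first half of the data.

```python
from statistics import fmean


def mser_m_sketch(data, batch_size=5, ignore_end=None):
    """Return (truncated, truncation_point) for a 1-D series (illustration only)."""
    # Non-overlapping batch means; a trailing partial batch is dropped.
    n_batches = len(data) // batch_size
    batches = [
        fmean(data[i * batch_size:(i + 1) * batch_size]) for i in range(n_batches)
    ]
    if ignore_end is None:
        # Default described above: Min(batch_size, number_batches / 4).
        # Assumption of this sketch: at least one batch is always ignored.
        ignore_end = max(1, min(batch_size, n_batches // 4))
    best_d, best_value = 0, float("inf")
    # Candidates in the last `ignore_end` batches are skipped to avoid the
    # end-of-series artifact.
    for d in range(n_batches - ignore_end):
        tail = batches[d:]
        mean = fmean(tail)
        value = sum((x - mean) ** 2 for x in tail) / len(tail) ** 2
        if value < best_value:
            best_d, best_value = d, value
    truncation_point = best_d * batch_size
    # A truncation point beyond half of the data is treated as invalid.
    truncated = truncation_point <= len(data) // 2
    return truncated, truncation_point


# Ten transient observations followed by a series fluctuating around 1.0.
series = [5.0] * 10 + [0.9, 1.1] * 45
result = mser_m_sketch(series)
```

In this example the marginal standard error is smallest once both transient batches are discarded, so the sketch reports a truncation point at index 10.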

kim_convergence.ucl.mser_m_ci(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) → tuple[float, float]

Approximate the confidence interval of the mean [mokashi2010].

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)
obj (MSER_m, optional) – instance of MSER_m (default: None)

Returns:

tuple[float, float]: Lower and upper confidence limits for the mean.

kim_convergence.ucl.mser_m_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) → float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data, which is in the stationarity region.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)
obj (MSER_m, optional) – instance of MSER_m (default: None)

Returns:

float: Relative half width estimate.

kim_convergence.ucl.mser_m_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) → float: Approximate the upper confidence limit of the mean.

kim_convergence.ucl.mser_m_y_ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) → tuple[float, float]

Approximate the confidence interval of the mean [mokashi2010].

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)
obj (MSER_m_y, optional) – instance of MSER_m_y (default: None)

Returns:

tuple[float, float]: Lower and upper confidence limits for the mean.

kim_convergence.ucl.mser_m_y_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) → float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data, which is in the stationarity region.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
batch_size (int, optional) – batch size. (default: 5)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)
obj (MSER_m_y, optional) – instance of MSER_m_y (default: None)

Returns:

float: Relative half width estimate.

kim_convergence.ucl.mser_m_y_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) → float: Approximate the upper confidence limit of the mean.

kim_convergence.ucl.n_skart_ci(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) → tuple[float, float]

Approximate the confidence interval of the mean.

Parameters:

time_series_data (array_like, 1d) – time series data.
equilibration_length_estimate (int, optional) – an estimate for the equilibration length. (default: 0)
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
obj (N_SKART, optional) – instance of N_SKART (default: None)

Returns:

tuple[float, float]: Lower and upper confidence limits for the mean.

kim_convergence.ucl.n_skart_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) → float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data, which is in the stationarity region.

Parameters:

time_series_data (array_like, 1d) – time series data.
equilibration_length_estimate (int, optional) – an estimate for the equilibration length. (default: 0)
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
obj (N_SKART, optional) – instance of N_SKART (default: None)

Returns:

float: Relative half width estimate.

kim_convergence.ucl.n_skart_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) → float: Approximate the upper confidence limit of the mean.

Approximate the confidence interval of the mean.

If the population standard deviation is known, and population_standard_deviation is given,

\[UCL = t_{\alpha,d} \left(\frac{\text{population standard deviation}}{\sqrt{n}}\right)\]

If the population standard deviation is unknown, the sample standard deviation is estimated and used as sample_standard_deviation,

\[UCL = t_{\alpha,d} \left(\frac{\text{sample standard deviation}}{\sqrt{n}}\right)\]

In both cases, the Student's t distribution is used as the critical value. This value depends on the confidence_coefficient and the degrees of freedom, which is found by subtracting one from the number of observations.

Confidence limits for the mean are interval estimates. Interval estimates are often desirable because instead of a single estimate for the mean, a confidence interval generates a lower and upper limit. It indicates how much uncertainty there is in our estimation of the true mean. The narrower the gap, the more precise our estimate is.

Confidence limits are defined as \(\bar{Y} \pm UCL,\) where \(\bar{Y}\) is the sample mean, and \(UCL\) is the approximate upper confidence limit of the mean.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
population_standard_deviation (float, optional) – population standard deviation. (default: None)
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)
sample_method (str, optional) – sampling method, one of the uncorrelated, random, or block_averaged. (default: None)
obj (UncorrelatedSamples, optional) – instance of UncorrelatedSamples (default: None)

Returns:

tuple[float, float]: Lower and upper confidence limits for the mean. The approximately unbiased estimate of confidence limits for the mean.
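As a worked illustration of the confidence limits \(\bar{Y} \pm UCL\) described above, the following sketch computes the half-width from the sample standard deviation. It assumes the samples are already uncorrelated, and it takes the Student's t critical value as an input (2.262 is the tabulated two-sided 95% value for 9 degrees of freedom). The function name ucl_sketch is hypothetical and not part of the package.

```python
import math
from statistics import fmean, stdev


def ucl_sketch(samples, t_critical):
    """Half-width of the confidence interval: t * s / sqrt(n)."""
    n = len(samples)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return t_critical * stdev(samples) / math.sqrt(n)


samples = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
t_critical = 2.262  # two-sided 95% value for 10 - 1 = 9 degrees of freedom

mean = fmean(samples)
ucl = ucl_sketch(samples, t_critical)
lower, upper = mean - ucl, mean + ucl
# Relative half width: UCL divided by the sample mean.
relative_half_width = ucl / mean
```

For these ten observations the sample mean is 10.0 and the half-width is roughly 0.13, so the relative half width is about 1.3%.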

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data, which is in the stationarity region.

Parameters:

time_series_data (array_like, 1d) – time series data.
confidence_coefficient (float, optional) – confidence level, a probability between 0.0 and 1.0, used to compute the confidence interval half-width. (default: 0.95)
population_standard_deviation (float, optional) – population standard deviation. (default: None)
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)
sample_method (str, optional) – sampling method, one of the uncorrelated, random, or block_averaged. (default: None)
obj (UncorrelatedSamples, optional) – instance of UncorrelatedSamples (default: None)

Returns:

float: Relative half width estimate.

kim_convergence.ucl.uncorrelated_samples_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, population_standard_deviation: float | None = None, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None, obj: UncorrelatedSamples | None = None) → float: Approximate the upper confidence limit of the mean.

Statistical Functions

stats module.

class kim_convergence.stats.ZERO_RC(xlo: float, xhi: float, *, abs_tol: float = 1e-50, rel_tol: float = 1e-08)

Zero finding class by reverse communication.

zero(status: int, x: float, fx: float, xlo: float, xhi: float)

Perform the zero finding.

Parameters:

status (int) – Status. If 0, other parameters are ignored.
x (float) – Input value at which function f is evaluated.
fx (float) – Function value f(x).
xlo (float) – Lower interval bound.
xhi (float) – Upper interval bound.

Returns:

tuple[int, float, float, float]: status: 0 = finished, 1 = needs eval, -1 = error. x: updated candidate. xlo/xhi: refined bracketing interval.
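Reverse communication means the solver never calls the function itself: it hands back a point, the caller evaluates the function there, and the value is passed in on the next call. The sketch below illustrates that control flow with a plain bisection. The class BisectionRC and the helper find_root are hypothetical and are not the library's ZERO_RC; only the status convention (0 = finished, 1 = needs another evaluation) follows the description above, and the bracket is assumed to contain a sign change.

```python
class BisectionRC:
    """Hypothetical reverse-communication bisection (illustration only)."""

    def __init__(self, xlo, xhi, abs_tol=1e-12):
        self.xlo, self.xhi, self.abs_tol = xlo, xhi, abs_tol
        self.flo = None  # f(xlo), supplied by the caller on the first reply

    def zero(self, status, x, fx):
        if status == 0:
            # Start: ask the caller to evaluate f at the lower bound.
            self.flo = None
            return 1, self.xlo
        if self.flo is None:
            # The first reply carries f(xlo).
            self.flo = fx
        elif (fx > 0) == (self.flo > 0):
            # Same sign as f(xlo): the root lies to the right of x.
            self.xlo, self.flo = x, fx
        else:
            self.xhi = x
        midpoint = 0.5 * (self.xlo + self.xhi)
        if self.xhi - self.xlo <= self.abs_tol:
            return 0, midpoint  # finished
        return 1, midpoint  # needs f(midpoint)


def find_root(f, xlo, xhi):
    """Drive the solver: the caller, not the solver, evaluates f."""
    solver = BisectionRC(xlo, xhi)
    status, x = solver.zero(0, 0.0, 0.0)
    while status == 1:
        status, x = solver.zero(status, x, f(x))
    return x


root = find_root(lambda x: x * x - 2.0, 0.0, 2.0)
```

The design keeps the solver free of any knowledge about how the function is evaluated, which is convenient when each evaluation is expensive or happens in another part of the program.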

class kim_convergence.stats.ZERO_RC_BOUNDS(small: float, big: float, abs_step: float, rel_step: float, step_mul: float, *, abs_tol: float = 1e-50, rel_tol: float = 1e-08)

Bound zero finding class by reverse communication.

zero(status: int, x: float, fx: float)

Bounds the zero of the function.

Bounds the zero of the function and finds zero of the function by reverse communication.

f must be a monotone function, otherwise the results are undefined. If f is monotonically increasing, then the result is bounded by [f(x-tolerance(x)), f(x+tolerance(x))]. If f is monotonically decreasing, then the result is bounded by [f(x+tolerance(x)), f(x-tolerance(x))], where tolerance(x) = Maximum(abs_tol, rel_tol * |x|).

Parameters:

status (int) – Status. If 0, other parameters are ignored.
x (float) – Input value at which function f is evaluated.
fx (float) – Function value f(x).

Returns:

tuple[int, float]: status: 0 = finished without error, 1 = needs another evaluation. x: updated input value.

kim_convergence.stats.auto_correlate(x: ndarray | list[float], *, nlags: int | None = None, fft: bool = False) → ndarray

Calculate the auto-correlation function.

Calculate the auto-correlation function for nlags lags of the input array. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags for which to return the auto-correlation. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray: Calculated auto correlation function.

kim_convergence.stats.auto_covariance(x: ndarray | list[float], *, fft: bool = False) → ndarray

Calculate biased auto-covariance estimates.

Compute auto-covariance estimates for every lag for the input array. This estimator is biased.

\[\gamma_k = \frac{1}{N}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

Note

Some sources use the following formula for computing the autocovariance:

\[\gamma_k = \frac{1}{N-k}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

This definition has less bias than the one used here, but the \(\frac{1}{N}\) formulation has some desirable statistical properties and is the most commonly used in the statistics literature.

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray: Estimated autocovariances.

Raises:

CRError – If input validation fails.
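The biased \(\frac{1}{N}\) estimator can be written as a direct sum. The following is a minimal pure-Python sketch, not the library's implementation (which can use FFT convolution); the name auto_covariance_sketch is hypothetical.

```python
from statistics import fmean


def auto_covariance_sketch(x):
    """Biased auto-covariance estimates for lags 0..N-1 (direct O(N^2) sum)."""
    n = len(x)
    mean = fmean(x)
    centered = [value - mean for value in x]
    # Every lag is divided by N (not N - k), which makes the estimator biased.
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / n
        for k in range(n)
    ]


gamma = auto_covariance_sketch([1.0, 2.0, 3.0, 4.0])
# gamma[0] is the biased sample variance of the series.
```

For long series the direct sum becomes slow, which is the situation where the fft option is intended to help.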

kim_convergence.stats.beta(a: float, b: float) → float

Beta function.

Beta function [numrec2007] is defined as,

\[B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},\]

where \(\Gamma\) is the gamma function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.

Returns:

float: Beta function value.
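The definition above can be evaluated with the standard library alone. This sketch works in log space with math.lgamma to avoid overflow for large arguments; it assumes positive a and b, and the name beta_sketch is hypothetical.

```python
import math


def beta_sketch(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), evaluated in log space."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


# B(2, 3) = 1! * 2! / 4! = 1/12
value = beta_sketch(2.0, 3.0)
```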

kim_convergence.stats.betacf(a: float, b: float, x: float, *, eps: float = 1e-15, max_iteration: int = 200, _fpmin: float = 1e-30) → float

Continued fraction for incomplete beta function by modified Lentz’s method.

Evaluates continued fraction for incomplete beta function by modified Lentz’s method [numrec2007].

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real-valued such that it must be between 0.0 and 1.0.
eps (float, optional) – Machine precision epsilon. (default: 1e-15, the value of np.finfo(np.float64).resolution)
max_iteration (int, optional) – Maximum number of iterations. (default: 200)
_fpmin (float, optional) – Minimum floating point precision. (default: 1.0e-30)

Returns:

float: Continued fraction for incomplete beta function.

kim_convergence.stats.betai(a: float, b: float, x: float) → float

Incomplete beta function.

Incomplete beta function [numrec2007] is defined as,

\[I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x~t^{a-1}(1-t)^{b-1}~dt,\]

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real-valued such that it must be between 0.0 and 1.0.

Returns:

float: Incomplete beta function value.
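The following sketch shows how the continued fraction and the incomplete beta function fit together, following the well-known modified Lentz recurrence. It is an illustration under stated assumptions, not the library's code: the names betacf_sketch and betai_sketch are hypothetical, a and b are assumed positive, and the loop simply stops after max_iteration steps instead of reporting non-convergence.

```python
import math


def betacf_sketch(a, b, x, eps=1e-15, max_iteration=200, fpmin=1e-30):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) >= fpmin else fpmin)
    h = d
    for m in range(1, max_iteration + 1):
        m2 = 2 * m
        # The even and odd steps of the recurrence share the same update.
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) >= fpmin else fpmin)
            c = 1.0 + aa / c
            c = c if abs(c) >= fpmin else fpmin
            delta = d * c
            h *= delta
        if abs(delta - 1.0) <= eps:
            break
    return h


def betai_sketch(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0.0 and 1.0")
    if x in (0.0, 1.0):
        return x
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    # Use the continued fraction directly where it converges quickly,
    # otherwise use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
    if x < (a + 1.0) / (a + b + 2.0):
        return front * betacf_sketch(a, b, x) / a
    return 1.0 - front * betacf_sketch(b, a, 1.0 - x) / b


value = betai_sketch(2.0, 1.0, 0.5)  # I_x(a, 1) = x**a
```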

kim_convergence.stats.betai_cdf(a: float, b: float, x: float) → float

Calculate the cumulative distribution of the incomplete beta distribution.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration

Returns:

float: Cumulative incomplete beta distribution.

kim_convergence.stats.betai_cdf_ccdf(a: float, b: float, x: float) → tuple[float, float]

Calculate the cumulative distribution of the incomplete beta distribution.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration

Returns:

tuple[float, float]: Cumulative incomplete beta distribution, complement of the cumulative incomplete beta distribution.

kim_convergence.stats.check_population_cdf_args(population_cdf: str | None, population_args: tuple)

Check the input population_cdf and population_args for correctness.

Parameters:

population_cdf (str) – The name of a distribution.
population_args (tuple) – Distribution parameter.

kim_convergence.stats.chi_square_test(sample_var: float, sample_size: int, population_var: float, significance_level: float = 0.050000000000000044) → bool

Chi-square test for the variance.

Calculate the chi-square test for the variance. This is a two-sided test. The test statistic is \(T=(N-1)\frac{\text{var}}{\text{var}_0}\), where N is the sample size, var is the sample variance, and var0 is the target (population) variance. The more the ratio var/var0 deviates from 1, the more likely we are to reject the null hypothesis.

The null hypothesis is that the variance of a sample of independent observations x is equal to the given population variance, population_var.

Parameters:

sample_var (float) – Sample variance.
sample_size (int) – Number of samples.
population_var (float) – population variance.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool: True if the variance of a sample of independent observations x equals the given population variance population_var.
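The decision rule can be sketched as follows. To stay free of external dependencies, this illustration takes the two chi-square critical values as inputs rather than computing them from significance_level; the values used in the example (2.700 and 19.023) are the tabulated 2.5% and 97.5% quantiles for 9 degrees of freedom. The function name and its extra parameters are hypothetical.

```python
def chi_square_variance_sketch(
    sample_var, sample_size, population_var, lower_critical, upper_critical
):
    """Two-sided chi-square test for the variance with given critical values."""
    # Test statistic T = (N - 1) * var / var0.
    statistic = (sample_size - 1) * sample_var / population_var
    # The null hypothesis is kept when T lies between the two critical values.
    return lower_critical <= statistic <= upper_critical


# N = 10 observations, so 9 degrees of freedom at the 5% significance level.
accepted = chi_square_variance_sketch(1.2, 10, 1.0, 2.700, 19.023)
rejected = chi_square_variance_sketch(4.0, 10, 1.0, 2.700, 19.023)
```

In the first call T = 10.8 lies inside the acceptance region, while in the second call T = 36 exceeds the upper critical value.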

kim_convergence.stats.cross_correlate(x: ndarray | list[float], y: ndarray | list[float] | None, *, nlags: int | None = None, fft: bool = False) → ndarray

Calculate the cross-correlation function.

Calculate the cross-correlation function for nlags lags of the input arrays. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags for which to return the cross-correlation. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray: Calculated cross correlation.

kim_convergence.stats.cross_covariance(x: ndarray | list[float], y: ndarray | list[float] | None, *, fft: bool = False) → ndarray

Calculate the biased cross covariance estimate between two time series.

Calculate the cross covariance between two time series for every lag for the input arrays. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray: Calculated cross covariances.

Raises:

CRError – If input validation fails.

kim_convergence.stats.get_distribution_stats(population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None)

Get the distribution stats from its name.

The stats include the median, mean, variance, and standard deviation of the distribution.

Parameters:

population_cdf (str) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.

Returns:

tuple: median, mean, var, std

kim_convergence.stats.get_fft_optimal_size(input_size: int) → int

Find the optimal size for the FFT solver.

Get the next regular number greater than or equal to input_size [statsmodels]. Regular numbers are composites of the prime factors 2, 3, and 5. Also known as 5-smooth numbers or Hamming numbers, these are the optimal size for inputs to FFT solvers.

Parameters:

input_size (int) – Size of the input data on which the FFT solver will be used. The search starts from this length, which must be a positive integer.

Returns:

int: The first 5-smooth number greater than or equal to input_size.
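A minimal sketch of the search for the next 5-smooth number is shown below. It enumerates products of powers of 3 and 5 and lifts each one to the target with factors of 2; the name next_regular is hypothetical and the library may use a different algorithm.

```python
def next_regular(target):
    """Smallest 5-smooth number (2**a * 3**b * 5**c) that is >= target."""
    if target < 1:
        raise ValueError("target must be a positive integer")
    best = None
    p5 = 1
    while best is None or p5 < best:
        p35 = p5
        while best is None or p35 < best:
            # Multiply by 2 until the product reaches the target.
            candidate = p35
            while candidate < target:
                candidate *= 2
            if best is None or candidate < best:
                best = candidate
            p35 *= 3
        p5 *= 5
    return best


sizes = [next_regular(n) for n in (7, 11, 17, 1000, 1001)]
```

For these inputs the sketch yields 8, 12, 18, 1000, and 1024, each being the first number at or above the input whose only prime factors are 2, 3, and 5.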

kim_convergence.stats.int_power(x: ndarray | list[float], exponent: int) → ndarray

Array elements raised to the power exponent.

Parameters:

x (array_like, 1d) – The bases.
exponent (int) – The exponent

Returns:

1darray: Computed power array.

kim_convergence.stats.kruskal_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Kruskal-Wallis H-test for independent samples.

The Kruskal-Wallis H-test tests the null hypothesis that the median of the time series data is the same as the one from population_cdf.

It is a non-parametric version of ANOVA.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the median of the time-series data equals the median of the specified population distribution.

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> a = 1.99
>>> x = rng.gamma(a, 1, size=20)
>>> kruskal_test(x,
                 population_cdf='gamma',
                 population_args=(a,),
                 population_loc=0,
                 population_scale=1,
                 significance_level=0.05)
True

kim_convergence.stats.ks_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Kolmogorov-Smirnov test for goodness of fit.

Note

This test is only valid for continuous distributions.

It tests the distribution of an observed variable against a given distribution.

The null hypothesis is that the observed samples are drawn from the same continuous distribution as the given distribution with population_loc and population_scale if they are given.

Note

The alternative hypothesis is two-sided: the empirical cumulative distribution function of the observed variables is either less than or greater than the cumulative distribution function of the given distribution.

The probability density of the given population distribution is in the standardized form. To shift and/or scale the distribution, the population_loc and population_scale parameters are used. In these cases, the variable x is replaced by y = (x - loc) / scale.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the observed samples are drawn from the same continuous distribution as the given one (two-tailed p-value > significance_level).
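To illustrate the change of variable y = (x - loc) / scale and the two-sided comparison of the empirical and reference distribution functions, the sketch below computes only the Kolmogorov-Smirnov statistic. Converting the statistic to a p-value is omitted. The function name, its cdf callable argument, and the uniform reference distribution are assumptions made for this example.

```python
def ks_statistic_sketch(samples, cdf, loc=0.0, scale=1.0):
    """Two-sided Kolmogorov-Smirnov statistic against a standardized CDF."""
    # Standardize the observations before evaluating the reference CDF.
    ordered = sorted((x - loc) / scale for x in samples)
    n = len(ordered)
    # Largest gap where the empirical CDF lies above the reference CDF ...
    d_plus = max((i + 1) / n - cdf(y) for i, y in enumerate(ordered))
    # ... and where it lies below.
    d_minus = max(cdf(y) - i / n for i, y in enumerate(ordered))
    return max(d_plus, d_minus)


def uniform_cdf(y):
    """CDF of the standard uniform distribution on [0, 1]."""
    return min(max(y, 0.0), 1.0)


statistic = ks_statistic_sketch([0.1, 0.4, 0.7], uniform_cdf)
shifted = ks_statistic_sketch([1.2, 1.8, 2.4], uniform_cdf, loc=1.0, scale=2.0)
```

Both calls describe the same standardized data, so they give the same statistic of 0.3.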

kim_convergence.stats.levene_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Perform modified Levene test for equal variances.

The modified Levene test tests the null hypothesis that one sample input time_series_data is from population population_cdf with the same variance [nistdiv898b].

Note

This test always uses the ‘median’ variant of Levene’s test.

Although the optimal choice depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.

Robustness means the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal.

Power means the ability of the test to detect unequal variances when the variances are in fact unequal.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the sample variance equals the population variance (two-tailed p-value > significance_level).

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma, alpha
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(shape,),
                population_loc=0,
                population_scale=scale,
                significance_level=0.05)
True

>>> a = 1.99
>>> x = gamma.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
True

>>> x = alpha.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
False

The null hypothesis is rejected at the 5% significance level, concluding that the variance of time_series_data differs from that of the gamma distribution with shape parameter a.

Example:

>>> levene_test(x,
...             population_cdf='alpha',
...             population_args=(a,),
...             population_loc=0,
...             population_scale=1,
...             significance_level=0.05)
True

kim_convergence.stats.modified_periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) → ndarray

Compute a modified periodogram to estimate the power spectrum.

Estimate the power spectrum using a modified periodogram. A periodogram [heidelberger1981] is an estimate of the spectral density of a signal and it is defined as,

\[\left \{ I\left(\frac{k}{n}\right) \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor},\; I\left( \frac{k}{n} \right) = \left| \sum_{j=0}^{j=n-1} {x(j) e^{-2\pi i j k / n}} \right|^2 / n\]

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray: Computed modified periodogram array.

Note

This function does not return the array of sample frequencies. If needed, it can be computed as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

Raises:

CRError – If input validation fails.
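As a minimal sketch of the periodogram definition given above, the sum can be evaluated directly in plain Python. This is not the library's implementation (the modified periodogram may differ in detail); the helper name periodogram_from_definition is hypothetical, and the quadratic-cost loop is only meant to make the formula and the matching frequencies concrete.

```python
import cmath


def periodogram_from_definition(x):
    """Evaluate I(k/n) = |sum_j x(j) exp(-2*pi*i*j*k/n)|^2 / n for k = 1..n//2."""
    n = len(x)
    result = []
    for k in range(1, n // 2 + 1):
        # Direct (slow) discrete Fourier sum at frequency k/n.
        s = sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        result.append(abs(s) ** 2 / n)
    return result


# A pure cosine with period 4 samples, i.e. frequency 1/4.
x = [1.0, 0.0, -1.0, 0.0] * 4
values = periodogram_from_definition(x)
freqs = [k / len(x) for k in range(1, len(x) // 2 + 1)]

# The only non-negligible ordinate sits at the frequency of the cosine.
peak_frequency = freqs[max(range(len(values)), key=values.__getitem__)]
```

Here peak_frequency equals 0.25, and the list of frequencies has the same length as the list of periodogram ordinates.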

kim_convergence.stats.moment(x: ndarray | list[float], *, moment: int = 1) → float

Calculates the nth moment about the mean for a sample.

Parameters:

x (array_like, 1d) – Time series data.
moment (int, optional) – Order of central moment that is returned. (default: 1)

Returns:

float: n-th central moment.

Note

The k-th central moment of time series data is defined as,

\[m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k,\]

where \(n\) is the number of samples and \(\bar{x}\) is the mean.
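The definition can be checked by hand with a short sketch. The function name central_moment is hypothetical and the code only restates the formula above; it makes no claim about how the library validates input or handles edge cases.

```python
def central_moment(x, k=1):
    """Return (1/n) * sum((x_i - mean)^k), the k-th central moment."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** k for xi in x) / n


data = [1.0, 2.0, 3.0, 4.0]
m1 = central_moment(data, 1)  # first central moment is always zero
m2 = central_moment(data, 2)  # biased (population) variance
m3 = central_moment(data, 3)  # zero for symmetric data
```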

kim_convergence.stats.normal_interval(confidence_level: float, *, loc: float = 0.0, scale: float = 1.0) → tuple[float, float]

Compute the normal distribution confidence interval.

Compute the normal-distribution confidence interval with equal areas around the median.

Parameters:

confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

tuple[float, float]: Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the distribution.

Note

A confidence interval is a range of values that is likely to contain an unknown population parameter.
The confidence level is the percentage of such confidence intervals that contain the population parameter.
The significance level, or alpha, is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% - 90% = 10%.
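The relation between the confidence level, alpha, and the interval bounds can be sketched with the standard library. This is an illustration under the assumption that the interval places alpha/2 in each tail; normal_interval_sketch is a hypothetical name, and statistics.NormalDist stands in for the library's own inverse CDF.

```python
from statistics import NormalDist


def normal_interval_sketch(confidence_level, loc=0.0, scale=1.0):
    """Equal-tailed interval holding `confidence_level` of N(loc, scale^2)."""
    alpha = 1.0 - confidence_level  # significance level
    dist = NormalDist(mu=loc, sigma=scale)
    # Leave alpha / 2 of the probability mass in each tail.
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


lower, upper = normal_interval_sketch(0.95)
```

For a 95% confidence level this yields the familiar bounds of roughly -1.96 and 1.96 for the standard normal distribution.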

kim_convergence.stats.normal_inv_cdf(p: float, *, loc=0.0, scale: float = 1.0) → float

Compute the normal distribution inverse cumulative distribution function.

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) → ndarray

Compute a periodogram to estimate the power spectrum.

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray: Computed power spectrum array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor + 1}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

kim_convergence.stats.randomness_test(x: ndarray | list[float], significance_level: float) → bool

Testing for independence of observations.

The von Neumann ratio test is designed to check the independence of successive observations.

The null hypothesis is that the data are independent and normally distributed.

Parameters:

x (array_like, 1d) – Time series data.
significance_level (float) – Probability threshold below which the null hypothesis is rejected.

Returns:

bool: True if the observations are independent.

Note

Given a series \(x\) of \(n\) data points, the von Neumann test [vonneumann1941] [vonneumann1941b] statistic is

\[v = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Under the null hypothesis of independence, the mean \(\bar{v} = 2\) and the variance \(\sigma^2_v = \frac{4 (n - 2)}{(n^2-1)}\) (see [williams1941], and [madansky1988] for a simple derivation).
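A minimal sketch of the statistic and of one possible decision rule follows. The ratio is computed exactly as in the formula above; the decision step assumes a normal approximation with the stated mean and variance and a two-sided p-value, which is an assumption made for illustration and not necessarily the rule the library applies. Both function names are hypothetical.

```python
import math
from statistics import NormalDist


def von_neumann_ratio(x):
    """v = sum_{i>=2} (x_i - x_{i-1})^2 / sum_i (x_i - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    numerator = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((xi - mean) ** 2 for xi in x)
    return numerator / denominator


def looks_independent(x, significance_level=0.05):
    """Two-sided test of v against its null mean 2 (normal approximation, assumed)."""
    n = len(x)
    v = von_neumann_ratio(x)
    variance = 4.0 * (n - 2) / (n * n - 1)
    z = (v - 2.0) / math.sqrt(variance)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value > significance_level


ratio = von_neumann_ratio([1.0, 2.0, 1.0, 2.0])
ramp = [float(i) for i in range(100)]  # strongly trending, hence dependent
ramp_is_independent = looks_independent(ramp)
```

A steadily increasing ramp has a ratio far below 2, so the sketch rejects independence for it.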

kim_convergence.stats.s_normal_inv_cdf(p: float) → float

Compute the standard normal distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for standard normal distribution [pythonstats], [wichura1988].

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.skew(x: ndarray | list[float], *, bias: bool = False) → float

Compute the time series data set skewness [zwillinger2000].

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Parameters:

x (array_like, 1d) – Time series data.
bias (bool, optional) – If False, then the calculations are corrected for statistical bias. (default: False)

Returns:

float: The skewness.

Note

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

The sample skewness is computed as the Fisher-Pearson coefficient of skewness \(g_1 = \frac{m_3}{m_2^{3/2}}\), where \(m_i\) is the biased sample \(i\texttt{th}\) central moment. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.

\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}.\]
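The two coefficients can be compared with a short sketch that follows the formulas above term by term. The name skewness_sketch is hypothetical; the code assumes more than two samples and a non-zero variance, and does no input validation.

```python
import math


def skewness_sketch(x, bias=False):
    """Fisher-Pearson skewness g1, or the adjusted G1 when bias is False."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n  # biased 2nd central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # biased 3rd central moment
    g1 = m3 / m2 ** 1.5
    if bias:
        return g1
    # Bias correction factor sqrt(N (N - 1)) / (N - 2).
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


data = [1.0, 2.0, 3.0, 10.0]  # long right tail, so skewness is positive
g1 = skewness_sketch(data, bias=True)
G1 = skewness_sketch(data, bias=False)
```

For this small sample the correction is substantial: g1 is about 1.018 while G1 is about 1.764.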

kim_convergence.stats.t_cdf(t: float, df: float) → float

Compute the cumulative distribution of the t-distribution.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.
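Written out, the symmetry of the t-distribution about zero, \(F(-t) = 1 - F(t)\), gives for \(t < 0\),

\[\int_{-\infty}^t f(u)\,du = 1 - \left[ 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right) \right] = \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

with the same \(x(t)\), since it depends on \(t\) only through \(t^2\). At \(t = 0\), \(x(0) = 1\) and \(I_1 = 1\), so both expressions give \(1/2\).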

Parameters:

t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.

Returns:

float: Cumulative t-distribution.

kim_convergence.stats.t_cdf_ccdf(t: float, df: float) → tuple[float, float]

Compute the cumulative distribution of the t-distribution and its complement.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:

t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.

Returns:

tuple[float, float]: cdf: cumulative t-distribution value. ccdf: complement of the cumulative t-distribution (1 - cdf).

kim_convergence.stats.t_interval(confidence_level: float, df: float, *, loc: float = 0.0, scale: float = 1.0) → tuple[float, float]

Compute the t-distribution confidence interval.

Compute the t-distribution confidence interval with equal areas around the median.

Parameters:

confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).
df (float) – Degrees of freedom, must be > 0.
loc (float, optional) – location parameter (default: 0.0)
scale (float, optional) – scale parameter (default: 1.0)

Returns:

tuple[float, float]: Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the t-distribution.

Note

A confidence interval is a range of values that is likely to contain an unknown population parameter.
The confidence level is the percentage of such confidence intervals that contain the population parameter.
The significance level, or alpha, is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% - 90% = 10%.

kim_convergence.stats.t_inv_cdf(p: float, df: float, *, loc: float = 0.0, scale: float = 1.0, _tol: float = 1e-08, _atol: float = 1e-50, _rtinf: float = 1e+100) → float

Compute the t-distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for t-distributions with df degrees of freedom. Inverse cumulative distribution function finds the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability.

Parameters:

p (float) – Probability (must be between 0.0 and 1.0)
df (float) – Degrees of freedom, must be > 1.
loc (float, optional) – location parameter (default: 0.0)
scale (float, optional) – scale parameter (default: 1.0)

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.t_test(sample_mean: float, sample_std: float, sample_size: int, population_mean: float, significance_level: float = 0.050000000000000044) → bool

T-test for the mean.

Calculate the T-test for the mean. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations x is equal to the given population mean, population_mean.

Parameters:

sample_mean (float) – Sample mean.
sample_std (float) – Sample standard deviation.
sample_size (int) – Number of samples.
population_mean (float) – Expected value in the null hypothesis.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool: True if the expected value (mean) of a sample of independent observations x equals the given population mean population_mean.
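For orientation, the standard one-sample t statistic behind such a test can be sketched as follows. The helper t_statistic is hypothetical, and the critical value is taken from a t-table for 15 degrees of freedom rather than computed; how the library derives its own threshold from significance_level is not shown here.

```python
import math


def t_statistic(sample_mean, sample_std, sample_size, population_mean):
    """t = (sample_mean - population_mean) / (sample_std / sqrt(n))."""
    standard_error = sample_std / math.sqrt(sample_size)
    return (sample_mean - population_mean) / standard_error


t = t_statistic(10.5, 2.0, 16, 10.0)
degrees_of_freedom = 16 - 1

# Tabulated two-sided critical value for a 5% significance level and 15
# degrees of freedom (an input to this sketch, not computed here).
critical_value = 2.131
mean_matches = abs(t) <= critical_value
```

With these numbers the statistic is 1.0, well inside the critical region boundary, so the null hypothesis of equal means is not rejected.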

kim_convergence.stats.wilcoxon_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Calculate the Wilcoxon signed-rank test.

Here it is used as a non-parametric test to determine whether an unknown population mean is different from a specific value.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the sample is drawn from the specified population distribution.

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> wilcoxon_test(x,
...               population_cdf='gamma',
...               population_args=(shape,),
...               population_loc=0,
...               population_scale=scale,
...               significance_level=0.05)
True

>>> wilcoxon_test(x,
...               population_cdf='gamma',
...               population_args=(shape,),
...               population_loc=0,
...               population_scale=1,
...               significance_level=0.05)
False

Time Series Functions

Time series module.

kim_convergence.timeseries.estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, solver: str = 'auto', batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) → tuple[int, float]

Estimate the equilibration point in time series data.

Estimate the equilibration point in time series data using the statistical inefficiencies [chodera2016], [geyer1992], [geyer2011].

The equilibration point is the offset t that maximizes the effective sample size \(N_{eff}(t) = (N - t) / g(x[t:])\), where \(g\) is the statistical inefficiency. Two solvers find this maximum:

"exhaustive" evaluates every candidate offset t in range(0, upper_bound, nskip). This is exact but costs one statistical-inefficiency computation per offset; because each computation is itself an \(O(N \log N)\) FFT, the total cost is \(O(N^2 \log N)\) and becomes intractable for long series (millions of points).
"unimodal" performs a ternary search, exploiting that \(N_{eff}(t)\) is (to a good approximation) unimodal in t. It evaluates only \(O(\log N)\) offsets, for a total cost of \(O(N \log^2 N)\).

Note

The "unimodal" solver is an approximate maximizer. Because \(N_{eff}(t)\) is locally jagged near its (typically flat) peak – the statistical inefficiency fluctuates from offset to offset – the ternary search may return an equilibration index a few offsets away from the exhaustive argmax. In practice the returned statistical inefficiency and effective sample size are within a fraction of a percent of the exhaustive optimum, and the difference in discarded samples is statistically negligible. Use solver="exhaustive" if the exact argmax is required and the series is short enough to afford the \(O(N^2 \log N)\) cost.
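The difference between the two search strategies can be illustrated on a toy objective. This is a sketch only: the objective below is a smooth, strictly unimodal stand-in for the effective sample size, the function names are hypothetical, and the library's solvers additionally deal with nskip, the upper bound, and the jaggedness discussed above.

```python
def exhaustive_argmax(objective, lo, hi):
    """Evaluate every offset in range(lo, hi) and return the best one."""
    return max(range(lo, hi), key=objective)


def ternary_argmax(objective, lo, hi):
    """Ternary search over integers in range(lo, hi) for a unimodal objective."""
    hi -= 1  # work with an inclusive upper bound
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if objective(m1) < objective(m2):
            lo = m1 + 1  # the peak lies to the right of m1
        else:
            hi = m2 - 1  # the peak lies to the left of m2
    # At most three candidates remain; check them directly.
    return max(range(lo, hi + 1), key=objective)


calls = {"count": 0}


def toy_objective(t):
    """Strictly unimodal stand-in for N_eff(t), peaked at t = 37."""
    calls["count"] += 1
    return -(t - 37) ** 2


exact = exhaustive_argmax(toy_objective, 0, 100)
exhaustive_calls = calls["count"]

calls["count"] = 0
approximate = ternary_argmax(toy_objective, 0, 100)
ternary_calls = calls["count"]
```

On this smooth objective both strategies return offset 37, while the ternary search needs far fewer objective evaluations than the 100 used by the exhaustive scan. On a jagged objective the two answers may differ by a few offsets, as the note above explains.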

Parameters:

time_series_data (array_like, 1d) – Time-series data.
si (Optional[str], optional) – Statistical-inefficiency method. (default: None)
nskip (Optional[int], optional) – Number of data points to skip. (default: 1)
fft (bool, optional) – Use FFT convolution for long series. (default: True)
minimum_correlation_time (Optional[int], optional) – Minimum correlation-time window; algorithm stops when correlation first goes negative. (default: None)
ignore_end (Optional[Union[int, float]], optional) – If int, last points to ignore; if float in (0, 1), fraction to ignore; if None, uses one fourth of data. (default: None)
number_of_cores (int, optional) – The maximum number of concurrently running jobs, such as the number of Python worker processes or the size of the thread-pool. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all. For number_of_cores below -1, (n_cpus + 1 + number_of_cores) are used. Thus for number_of_cores = -2, all CPUs but one are used. (default: 1)
solver (str, optional) – Offset-search strategy, one of "auto", "exhaustive", or "unimodal". "auto" uses the exhaustive scan when the number of candidate offsets is small (cheap and exact) and switches to the unimodal ternary search for large series, where the exhaustive scan’s \(O(N^2 \log N)\) cost is prohibitive. (default: "auto")

Returns:

tuple[int, float]: equilibration_index: index where the equilibrated region starts. statistical_inefficiency: statistical inefficiency estimate of the time series at the estimated equilibration index.

Note

batch_size, scale, with_centering, and with_scaling are accepted for API compatibility but are not used by this method.

kim_convergence.timeseries.geyer_r_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) → float

Compute the statistical inefficiency.

Compute the statistical inefficiency using the Geyer’s [geyer1992], [geyer2011] initial monotone sequence criterion.

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Note

The effective sample size is computed by:

\[\begin{split}\hat{N}_{eff} &= \frac{N}{si} \\ si &= -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\end{split}\]

where \(N\) is the number of data points. \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\), where \(\hat{\rho}_{t'}\) is the estimated auto-correlation at lag \(t'\), and \(m\) is the last integer for which \(\hat{P}_{t'}\) is still positive (largest \(m\) such that \(\hat{P}_{t'} > 0,~t'=1,\cdots,m\)). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

The current implementation is similar to Stan [mcstan], which uses Geyer’s initial monotone sequence criterion (Geyer, 1992 [geyer1992]; Geyer, 2011 [geyer2011]).
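The truncation and monotonicity steps described in this note can be sketched on a given list of auto-correlation estimates. The function geyer_si_sketch is hypothetical and takes the auto-correlations as input instead of estimating them; clamping the result at 1 is an assumption made here to match the documented \(si \geq 1\).

```python
def geyer_si_sketch(rho):
    """si = -1 + 2 * sum of the initial monotone sequence of pair sums.

    `rho` holds auto-correlation estimates rho[0], rho[1], ... with rho[0] = 1.
    """
    pair_sums = []
    for t in range(len(rho) // 2):
        p = rho[2 * t] + rho[2 * t + 1]
        if p <= 0.0:
            break  # initial positive sequence ends here
        if pair_sums:
            p = min(p, pair_sums[-1])  # enforce a non-increasing sequence
        pair_sums.append(p)
    return max(1.0, -1.0 + 2.0 * sum(pair_sums))


# Pair sums 1.5, 0.375, then a negative pair that stops the sum.
si_decaying = geyer_si_sketch([1.0, 0.5, 0.25, 0.125, -0.1, 0.05])

# Pair sums 1.2, 0.2, 0.5; the last one is reduced to 0.2 by monotonicity.
si_monotone = geyer_si_sketch([1.0, 0.2, 0.1, 0.1, 0.3, 0.2])

# Uncorrelated data: only rho[0] = 1 contributes.
si_white = geyer_si_sketch([1.0, 0.0, 0.0, 0.0])
```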

Parameters:

x (array_like, 1d) – time series data. Using this method, statistical inefficiency cannot be estimated with fewer than four data points.
y (array_like, 1d, optional) – time series data. If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float: estimated statistical inefficiency. \(si \geq 1\) is the estimated statistical inefficiency (equal to \(si = -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\), where \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\))

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

kim_convergence.timeseries.geyer_split_r_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) → float

Compute the statistical inefficiency.

Compute the statistical inefficiency using the split-r method of Geyer’s [geyer1992], [geyer2011] initial monotone sequence criterion.

Note

The effective sample size is computed by:

\[\begin{split}\hat{N}_{eff} &= \frac{N}{si} \\ si &= -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\end{split}\]

where \(N\) is the number of data points. \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\), where \(\hat{\rho}_{t'}\) is the estimated auto-correlation at lag \(t'\), and \(m\) is the last integer for which \(\hat{P}_{t'}\) is still positive (largest \(m\) such that \(\hat{P}_{t'} > 0,~t'=1,\cdots,m\)). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

The current implementation is similar to Stan [mcstan], which uses Geyer’s initial monotone sequence criterion (Geyer, 1992 [geyer1992]; Geyer, 2011 [geyer2011]).

Parameters:

x (array_like, 1d) – time series data. Using this method, statistical inefficiency cannot be estimated with fewer than eight data points.
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float: estimated statistical inefficiency. \(si \geq 1\) is the estimated statistical inefficiency (equal to \(si = -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\), where \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\))

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

kim_convergence.timeseries.geyer_split_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) → float

Compute the statistical inefficiency.

Computes the effective sample size. The value returned is the minimum of effective sample size and the data size times log10(data size).

Note

The effective sample size cannot be estimated with fewer than four samples.

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Parameters:

x (array_like, 1d) – time series data.
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float: estimated statistical inefficiency. \(si \geq 1\) is the estimated statistical inefficiency.

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

Estimate the integrated auto-correlation time.

The statistical inefficiency \(si\) of the observable \(x\) of a time series \(\left \{X\right \}_{t=0}^n\) is formally defined as, \(si \equiv 1 + 2\tau\), where \(\tau\) denotes the integrated auto-correlation time.

Parameters:

x (array_like, 1d) – time series data.
y (array_like, 1d, optional) – time series data. (default: None) If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x.
si (float, or str, optional) – estimated statistical inefficiency, or a method of computing the statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float: integrated auto-correlation time. estimated \(\tau\) (the integrated auto-correlation time)

kim_convergence.timeseries.statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) → float

Compute the statistical inefficiency.

The statistical inefficiency \(si\) of the observable \(x\) of a time series \(\{X\}_{t=0}^n\) is formally defined as,

\[\begin{split}si &\equiv 1 + 2\tau \\ \tau &\equiv \sum_{t=1}^n {\left( 1 - \frac{t}{n} \right) C\left(t\right)} \\ C\left(t\right) &\equiv \frac{\langle x(X_{t_0})x(X_{t_0+t}) \rangle - {\langle x \rangle}^2}{\langle x^2 \rangle-{\langle x \rangle}^2}\end{split}\]

where \(\tau\) denotes the integrated auto-correlation time and \(C\left(t\right)\) is the normalized fluctuation auto-correlation function of the observable \(x\). The sum starts at lag \(t = 1\) because \(C(0) = 1\) by construction, so that uncorrelated data give \(si = 1\).

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Parameters:

x (array_like, 1d) – time series data.
y (array_like, 1d, optional) – time series data. If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float: estimated statistical inefficiency. \(si \geq 1\) is the estimated statistical inefficiency (equal to \(1 + 2\tau\), where \(\tau\) denotes the integrated auto-correlation time).

Helper method to compute or return the statistical inefficiency value.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float: estimated statistical inefficiency value. \(si \geq 1\) is the estimated statistical inefficiency.

Return average value for each block after blocking the data.

First, the time series data are divided into a series of blocks, where each block contains si successive data points. If si (statistical inefficiency) is not provided, it is computed. Then the average value of each block is determined. This coarse-graining approach is commonly used for thermodynamic properties.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray: uncorrelated_sample of the time series data. average value for each block after blocking the time series data.
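The blocking step described above can be sketched in a few lines. The sketch assumes that the block length is si rounded up to an integer and that a trailing incomplete block is dropped; both choices, as well as the name block_averages, are assumptions made for illustration rather than the library's documented behavior.

```python
import math


def block_averages(data, si):
    """Average successive blocks of ceil(si) points; drop a partial last block."""
    block = max(1, math.ceil(si))
    n_blocks = len(data) // block
    return [
        sum(data[i * block:(i + 1) * block]) / block
        for i in range(n_blocks)
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
averaged = block_averages(series, si=2.5)  # blocks of 3 points
```

With si = 2.5 the seven points form two complete blocks of three, giving the block averages 2.0 and 5.0; the seventh point is discarded.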

Return random data for each block after blocking the data.

First, the time series data are divided into a series of blocks, where each block contains si successive data points. If si (statistical inefficiency) is not provided, it is computed. Then a single value is taken at random from each block.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray: uncorrelated_sample of the time series data. random data for each block after blocking the time series data.

Return time series data at uncorrelated sample indices.

Subsample a correlated timeseries to extract an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided it will be computed.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency.
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray: uncorrelated_sample of the time series data. time series data at uncorrelated sample indices.

Return indices of uncorrelated subsamples of the time series data.

Return indices of the uncorrelated sample of the time series data. Subsample a correlated timeseries to extract an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided it will be computed.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency. (default: None)
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

1darray: indices array. Indices of uncorrelated subsamples of the time series data.
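One simple way to pick such indices is to step through the series in strides of si and round each position up to the next integer. This rounding rule and the name uncorrelated_indices_sketch are assumptions made for this sketch; the library may select the indices differently.

```python
import math


def uncorrelated_indices_sketch(n, si):
    """Indices ceil(k * si) for k = 0, 1, ... that fall inside range(n)."""
    indices = []
    k = 0
    while True:
        index = math.ceil(k * si)
        if index >= n:
            break
        indices.append(index)
        k += 1
    return indices


indices = uncorrelated_indices_sketch(10, si=2.5)
series = [float(i) ** 2 for i in range(10)]
subsample = [series[i] for i in indices]
```

Because si is at least 1, successive indices are strictly increasing, so no data point is selected twice.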

Get time series data at the sample_method uncorrelated_sample indices.

Subsample a correlated timeseries to extract an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided it will be computed.

Parameters:

time_series_data (array_like, 1d) – time series data.
si (float, or str, optional) – estimated statistical inefficiency.
fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)
minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)
uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)
sample_method (str, optional) – sampling method, one of uncorrelated, random, or block_averaged. (default: None)

Returns:

1darray: uncorrelated_sample of the time series data. time series data at uncorrelated sample indices.

Utility Functions

batch

kim_convergence.batch(time_series_data: ~numpy.ndarray | list, *, batch_size: int = 5, func: ~typing.Callable[[...], ~numpy.ndarray] = <function mean>, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) → ndarray

Batch the time series data.

Parameters:

time_series_data (array_like, 1d) – Time series data.
batch_size (int, optional) – batch size. (default: 5)
func (callable, optional) – Reduction function capable of receiving a single axis argument. It is called with time_series_data as first argument. (default: np.mean)
scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)
with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)
with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

Returns:

1darray: Batched (and optionally rescaled) data.

Note

This function truncates the trailing data points that remain after dividing the number of data points by batch_size.

Note

By default, this function uses np.mean to compute the arithmetic mean of each batch.

Example:

>>> import numpy as np
>>> rng = np.random.RandomState(12345)
>>> x = np.ones(100) * 10 + (rng.random_sample(100) - 0.5)
>>> x_batch = batch(x, batch_size=5)
>>> x_batch.size
20
>>> print(x.mean(), x_batch.mean())
10.054804081191616 10.054804081191616

>>> x_batch_scaled = batch(x, batch_size=5,
...                        scale='translate_scale',
...                        with_scaling=True)
>>> x_batch_scaled.size
20
>>> print(x.mean(), x_batch_scaled.mean())
10.054804081191616 1.0

outlier_test

kim_convergence.outlier_test(x: ndarray | list[float], outlier_method: str = 'iqr') → ndarray | None

Test to detect outliers in the data.

The intuitive definition for the concept of an outlier in the data is a point that significantly deviates from its expected value. Therefore, given a time series (or a random sample from a population), a point can be declared an outlier if the distance to its expected value is higher than a predefined threshold (\(|x_i - E(x)| > \tau\)), where \(x_i\) is the observed data point, and \(E(x)\) is its expected value.

The methods based on this strategy are the most common approaches in the literature. These methods intend to detect outliers, but it is up to the analyst to decide if the detected points are real outliers. Thus it is necessary to characterize standard data points before removing any outliers detected by these approaches.

Parameters:

x (array_like, 1d) – Time series data.
outlier_method (str, optional) – Method for outlier detection. (default: ‘iqr’)

Returns:

Optional[ndarray]: Indices of outliers; None if no outliers found.
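As an illustration of the threshold idea, the sketch below flags points that lie outside the conventional interquartile-range fences. The 1.5 factor, the quantile convention, and the helper name iqr_outlier_indices are assumptions made for this example; the library's 'iqr' method may use different choices.

```python
from statistics import quantiles


def iqr_outlier_indices(x, factor=1.5):
    """Return indices outside [Q1 - factor*IQR, Q3 + factor*IQR], or None.

    Hypothetical helper: the factor and the 'inclusive' quantile
    convention are assumptions, not the library's documented behavior.
    """
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    indices = [i for i, v in enumerate(x) if v < lower or v > upper]
    # Mirror the documented return convention: None when nothing is flagged.
    return indices or None


x = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0, 10.05]
print(iqr_outlier_indices(x))          # [5]
print(iqr_outlier_indices([1.0, 2.0, 3.0, 4.0, 5.0]))  # None
```

Whether a flagged point is a genuine outlier remains the analyst's decision, as noted above.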

Scaler classes

class kim_convergence.MinMaxScale(*, feature_range: tuple[float, float] = (0, 1))

Standardize/Transform a dataset by scaling it to a given range.

This estimator scales and translates a dataset such that it is in the given range, e.g. between zero and one.

The transformation is given by:

\[\begin{split}x_{\text{std}} = \frac{x - \min(x)}{\max(x) - \min(x)} \\ \text{scaled}_x = x_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}\end{split}\]

where min, max = feature_range.

Parameters:

feature_range (tuple, optional) – tuple (min, max). Desired range of transformed data. (default: (0, 1))

Examples:

>>> from kim_convergence import MinMaxScale, minmax_scale
>>> data = [-1., 3.]
>>> mms = MinMaxScale()
>>> scaled_x = mms.scale(data)
>>> print(scaled_x)
[0. 1.]

>>> x = mms.inverse(scaled_x)
>>> print(x)
[-1.  3.]

>>> data = [-1., 3., 100.]
>>> scaled_x = minmax_scale(data)
>>> print(scaled_x)
[0.         0.03960396 1.        ]

>>> mms = MinMaxScale()
>>> scaled_x = mms.scale(data)
>>> x = mms.inverse(scaled_x)
>>> print(x)
[ -1.   3. 100.]
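The effect of a non-default feature_range can be seen by applying the two-step formula above directly. This is a minimal pure-Python sketch; minmax_scale_sketch is a hypothetical helper, not the library's code, and it assumes the data are not constant.

```python
def minmax_scale_sketch(x, feature_range=(0.0, 1.0)):
    """Apply the two-step min-max formula to a list of floats."""
    new_min, new_max = feature_range
    x_min, x_max = min(x), max(x)
    span = x_max - x_min  # assumed non-zero; constant data is not handled
    x_std = [(v - x_min) / span for v in x]
    return [s * (new_max - new_min) + new_min for s in x_std]


data = [0.0, 5.0, 10.0]
unit = minmax_scale_sketch(data)
symmetric = minmax_scale_sketch(data, feature_range=(-1.0, 1.0))
print(unit)       # [0.0, 0.5, 1.0]
print(symmetric)  # [-1.0, 0.0, 1.0]
```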

inverse(x: ndarray) → ndarray

Undo the scaling and restore the dataset to its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Transformed data.

scale(x: ndarray | list) → ndarray

Standardize a dataset by scaling it to a given range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Scaled dataset to a given range.

class kim_convergence.TranslateScale(*, with_centering: bool = True, with_scaling: bool = True)

Standardize a dataset.

Standardize a dataset by translating it so that \(x[0]=0\) and rescaling it by its overall average so that the numbers are of O(1) with a good spread.

The translate and scale of a sample x is calculated as:

\[z = \frac{(x - x_0)}{u}\]

where \(x_0\) is \(x[0]\) or \(0\) if with_centering=False, and \(u\) is the mean of the (translated) samples or \(1\) if with_scaling=False.

Parameters:

with_centering (bool, optional) – If True, use x minus its first element. (default: True)
with_scaling (bool, optional) – If True, scale the data to overall averages so that the numbers are of O(1) with a good spread. (default: True)

Examples:

>>> from kim_convergence import TranslateScale
>>> data = [1., 2., 2., 2., 3.]
>>> tsc = TranslateScale()
>>> scaled_x = tsc.scale(data)
>>> print(scaled_x)
[0. 1. 1. 1. 2.]

>>> x = tsc.inverse(scaled_x)
>>> print(x)
[1. 2. 2. 2. 3.]
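The example output above is reproduced by the following pure-Python sketch, assuming that u is computed as the mean of the samples after the translation by x[0]. That reading is an assumption inferred from the example, and translate_scale_sketch and translate_inverse_sketch are hypothetical helpers rather than the library's implementation.

```python
from statistics import fmean


def translate_scale_sketch(x, with_centering=True, with_scaling=True):
    """Return (z, x0, u) with z = (x - x0) / u.

    Assumption: u is the mean of the translated samples.
    A zero mean is not handled in this sketch.
    """
    x0 = x[0] if with_centering else 0.0
    shifted = [v - x0 for v in x]
    u = fmean(shifted) if with_scaling else 1.0
    return [v / u for v in shifted], x0, u


def translate_inverse_sketch(z, x0, u):
    """Undo the transformation: x = z * u + x0."""
    return [v * u + x0 for v in z]


data = [1.0, 2.0, 2.0, 2.0, 3.0]
scaled, x0, u = translate_scale_sketch(data)
restored = translate_inverse_sketch(scaled, x0, u)
print(scaled)    # [0.0, 1.0, 1.0, 1.0, 2.0]
print(restored)  # [1.0, 2.0, 2.0, 2.0, 3.0]
```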

inverse(x: ndarray) → ndarray

Undo the scaling and restore the dataset to its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Transformed data.

scale(x: ndarray | list) → ndarray

Standardize a dataset by translating and scaling it.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Scaled dataset.

class kim_convergence.StandardScale(*, with_centering: bool = True, with_scaling: bool = True)

Standardize a dataset.

Standardize a dataset by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

\[z = \frac{(x - u)}{s}\]

where \(u\) is the mean of the samples or \(0\) if with_centering=False, and \(s\) is the standard deviation of the samples or \(1\) if with_scaling=False.

Centering and scaling happen independently.

Parameters:

with_centering (bool, optional) – If True, use x minus its mean, or center the data before scaling. (default: True)
with_scaling (bool, optional) – If True, scale the data to unit variance (or equivalently, unit standard deviation). (default: True)

Note

If with_centering=False is set explicitly, only variance scaling is performed on x. A biased estimator is used for the standard deviation.

Examples:

>>> from kim_convergence import StandardScale
>>> data = [-0.5, 6]
>>> ssc = StandardScale()
>>> scaled_x = ssc.scale(data)
>>> print(scaled_x)
[-1.  1.]

>>> x = ssc.inverse(scaled_x)
>>> print(x)
[-0.5  6. ]
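The role of the biased estimator mentioned in the note can be checked with a short pure-Python sketch. The helper name standard_scale_sketch is hypothetical; the sketch uses statistics.pstdev (the population, i.e. biased, standard deviation) and is an illustration, not the library's code.

```python
from statistics import fmean, pstdev, stdev


def standard_scale_sketch(x, with_centering=True, with_scaling=True):
    """Compute z = (x - u) / s with the biased standard deviation."""
    u = fmean(x) if with_centering else 0.0
    s = pstdev(x) if with_scaling else 1.0  # divides by n, not n - 1
    return [(v - u) / s for v in x]


data = [-0.5, 6.0]
z = standard_scale_sketch(data)
print(z)  # [-1.0, 1.0]

# The unbiased estimator (dividing by n - 1) is larger, so using it
# would not map this two-point dataset to exactly -1 and 1.
print(pstdev(data), stdev(data))
```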

inverse(x: ndarray) → ndarray

Undo the scaling and restore the dataset to its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Transformed data.

scale(x: ndarray | list) → ndarray

Standardize a dataset.

Parameters:

x (array_like, 1d) – The data to center and scale.

Returns:

1darray: Scaled and/or Centered dataset.

class kim_convergence.RobustScale(*, with_centering: bool = True, with_scaling: bool = True, quantile_range: tuple[float, float] = (25.0, 75.0))

Standardize a dataset.

Standardize a dataset by centering it on the median and scaling it according to the interquartile range. These statistics are robust to outliers.

This removes the median and scales the data according to the quantile range. The interquartile range (IQR) is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently.

Parameters:

with_centering (bool, optional) – If True, center the data before scaling. (default: True)
with_scaling (bool, optional) – If True, scale the data. (default: True)
quantile_range (tuple or list, optional) – (q_min, q_max), 0.0 < q_min < q_max < 100.0 (default: (25.0, 75.0) = (1st quartile, 3rd quartile))

Examples:

>>> from kim_convergence import RobustScale
>>> data = [ 4.,  1., -2.]
>>> rsc = RobustScale()
>>> scaled_x = rsc.scale(data)
>>> print(scaled_x)
[ 1.22474487  0.         -1.22474487]

>>> x = rsc.inverse(scaled_x)
>>> print(x)
[ 4.  1. -2.]

inverse(x: ndarray) → ndarray

Undo the scaling and restore the dataset to its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Transformed data.

scale(x: ndarray | list) → ndarray

Standardize a dataset using median and quantile range.

Parameters:

x (array_like, 1d) – The data to center and scale.

Returns:

1darray: Scaled dataset.

class kim_convergence.MaxAbsScale

Standardize a dataset to the [-1, 1] range.

Standardize a dataset to the [-1, 1] range such that the maximal absolute value in the data set will be 1.0.

Examples:

>>> from kim_convergence import MaxAbsScale
>>> data = [ 4.,  1., -9.]
>>> mas = MaxAbsScale()
>>> scaled_x = mas.scale(data)
>>> print(scaled_x)
[ 0.44444444  0.11111111 -1.        ]

>>> x = mas.inverse(scaled_x)
>>> print(x)
[ 4.  1. -9.]
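A minimal pure-Python sketch of the same idea, dividing every value by the largest absolute value, is given below. maxabs_scale_sketch is a hypothetical helper for illustration only, and it assumes that the data contain at least one non-zero value.

```python
def maxabs_scale_sketch(x):
    """Return (scaled, m) where m = max(|x_i|) and scaled = x / m."""
    m = max(abs(v) for v in x)  # assumed non-zero in this sketch
    return [v / m for v in x], m


data = [4.0, 1.0, -9.0]
scaled, m = maxabs_scale_sketch(data)
restored = [v * m for v in scaled]  # the inverse multiplies by m again
print(m)  # 9.0
print(scaled)
print(restored)
```

Because the sign of each value is preserved, the result always lies in the [-1, 1] range and the entry with the largest magnitude maps to -1 or 1.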

inverse(x: ndarray) → ndarray

Undo the scaling and restore the dataset to its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray: Transformed data.

scale(x: ndarray | list) → ndarray

Compute the maximal absolute value of x and scale x by it.

All of x is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples or because x is read from a continuous stream.

Parameters:

x (array_like, 1d) – The data to scale.

Returns:

1darray: Scaled dataset.

Convenience functions

minmax_scale

kim_convergence.minmax_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True, feature_range: tuple[float, float] = (0.0, 1.0)) → ndarray

Standardize/Transform a dataset by scaling it to a given range.

This estimator scales and translates a dataset such that it is in the given range, e.g. between zero and one.

The transformation is given by:

\[\begin{split}x_{\text{std}} = \frac{x - \min(x)}{\max(x) - \min(x)} \\ \text{scaled}_x = x_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}\end{split}\]

where min, max = feature_range.

Parameters:

x (array_like, 1d) – Time series data.
feature_range (tuple, optional) – tuple (min, max). Desired range of transformed data. (default: (0, 1))

Returns:

1darray: Scaled dataset to a given range.

Note

with_centering and with_scaling are accepted for API compatibility but are not used by this function.

translate_scale

kim_convergence.translate_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) → ndarray

Standardize a dataset.

Standardize a dataset by translating it so that \(x[0]=0\) and rescaling it by its overall average so that the numbers are of O(1) with a good spread.

The translate and scale of a sample x is calculated as:

\[z = \frac{(x - x_0)}{u}\]

where \(x_0\) is \(x[0]\) or \(0\) if with_centering=False, and \(u\) is the mean of the (translated) samples or \(1\) if with_scaling=False.

Parameters:

x (array_like, 1d) – The data to center and scale.
with_centering (bool, optional) – If True, use x minus its first element. (default: True)
with_scaling (bool, optional) – If True, scale the data to overall averages so that the numbers are of O(1) with a good spread. (default: True)

Returns:

1darray: Scaled dataset.

standard_scale

kim_convergence.standard_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) → ndarray

Standardize a dataset.

Standardize a dataset by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

\[z = \frac{(x - u)}{s}\]

where \(u\) is the mean of the samples or \(0\) if with_centering=False, and \(s\) is the standard deviation of the samples or \(1\) if with_scaling=False.

Parameters:

x (array_like, 1d) – The data to center and scale.
with_centering (bool, optional) – If True, use x minus its mean, or center the data before scaling. (default: True)
with_scaling (bool, optional) – If True, scale the data to unit variance (or equivalently, unit standard deviation). (default: True)

Returns:

1darray: Scaled dataset.

Note

If with_centering=False is set explicitly, only variance scaling is performed on x. A biased estimator is used for the standard deviation.

robust_scale

kim_convergence.robust_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True, quantile_range: tuple[float, float] = (25.0, 75.0)) → ndarray

Standardize a dataset.

Standardize a dataset by centering it on the median and scaling it according to the interquartile range.

Parameters:

x (array_like, 1d) – The data to center and scale.
with_centering (bool, optional) – If True, center the data before scaling. (default: True)
with_scaling (bool, optional) – If True, scale the data. (default: True)
quantile_range (tuple or list, optional) – (q_min, q_max), 0.0 < q_min < q_max < 100.0 (default: (25.0, 75.0) = (1st quartile, 3rd quartile))

Returns:

1darray: Scaled dataset.

maxabs_scale

kim_convergence.maxabs_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) → ndarray

Standardize a dataset to the [-1, 1] range.

Standardize a dataset to the [-1, 1] range such that the maximal absolute value in the data set will be 1.0.

Parameters:

x (array_like, 1d) – The data to scale.

Returns:

1darray: Scaled dataset.

Note

with_centering and with_scaling are accepted for API compatibility but are not used by this function.

validate_split

Validate test/train sizes.

Helper function that checks whether the test/train sizes are meaningful with respect to the size of the data (n_samples).

Parameters:

n_samples (int) – total number of sample points
train_size (int, float, or None) – train size
test_size (int, float, or None) – test size
default_test_size (int, float, or None, optional) – default test size. (default: None)

Returns:

tuple[int, int]: (n_train, n_test), where n_train is the number of train points and n_test is the number of test points.

Raises:

CRError – If any size is invalid or inconsistent.
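One plausible way to turn the int, float, or None sizes into counts is sketched below. It is a sketch under stated assumptions, not the library's logic: the rounding rules (ceil for the test fraction, floor for the train fraction), the exact validity checks, and the use of ValueError in place of CRError are all choices made for this example, and validate_split_sketch is a hypothetical name.

```python
import math


def validate_split_sketch(n_samples, train_size, test_size,
                          default_test_size=None):
    """Return (n_train, n_test) or raise ValueError (stand-in for CRError)."""
    if train_size is None and test_size is None:
        test_size = default_test_size
    if train_size is None and test_size is None:
        raise ValueError("train_size and test_size cannot both be None.")

    def to_count(size, rounder):
        # Floats are fractions of n_samples, ints are absolute counts.
        if size is None:
            return None
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError(f"fraction {size} is not in (0, 1).")
            return int(rounder(size * n_samples))
        if not 0 < size < n_samples:
            raise ValueError(f"count {size} is not in (0, {n_samples}).")
        return int(size)

    n_test = to_count(test_size, math.ceil)
    n_train = to_count(train_size, math.floor)
    # A missing size is the complement of the other one.
    if n_train is None:
        n_train = n_samples - n_test
    elif n_test is None:
        n_test = n_samples - n_train
    if n_train < 1 or n_test < 1 or n_train + n_test > n_samples:
        raise ValueError("train and test sizes are inconsistent.")
    return n_train, n_test


print(validate_split_sketch(100, None, 0.25))  # (75, 25)
print(validate_split_sketch(100, 60, None))    # (60, 40)
```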

train_test_split

Split time_series_data into random train and test indices.

Parameters:

time_series_data (array_like) – time series data, array-like of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
test_size (int, float, or None, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to default_test_size. (default: 0.1)
train_size (int, float, or None, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. (default: None)
seed (None, int, or np.random.RandomState, optional) – random number seed. (default: None)
default_test_size (float, optional) – Default test size. (default: 0.1)

Returns:

tuple[np.ndarray, np.ndarray]: (ind_train, ind_test), where ind_train holds the training indices and ind_test holds the testing indices.

Raises:

CRError – If any size is invalid or inconsistent, or if seed has an illegal type.
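The idea of splitting by indices rather than by values can be sketched with the standard library alone. This is an assumption-laden illustration: split_indices_sketch is a hypothetical helper, it takes already validated counts, it uses random.Random instead of np.random.RandomState, and it returns lists rather than NumPy arrays.

```python
import random


def split_indices_sketch(n_samples, n_train, n_test, seed=None):
    """Shuffle range(n_samples) and cut it into train and test indices."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    ind_test = indices[:n_test]
    ind_train = indices[n_test:n_test + n_train]
    return ind_train, ind_test


ind_train, ind_test = split_indices_sketch(20, n_train=18, n_test=2, seed=7)
print(len(ind_train), len(ind_test))  # 18 2

# The indices can then be used to select rows from the time series data.
series = [10.0 + 0.1 * i for i in range(20)]
test_values = [series[i] for i in ind_test]
```

Passing the same seed again gives the same split, which is useful when results must be reproducible.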

Error Classes

CRError

exception kim_convergence.err.CRError(msg)

Raise an exception.

The exception is raised with the error message it receives.

Parameters:

msg (str) – Human-readable error message.

CRSampleSizeError

exception kim_convergence.err.CRSampleSizeError(msg)

Raise an exception if there are not enough samples.

Optional Integrations

ASE (Atomic Simulation Environment) integration is documented on its own page: see ASE Integration Module for the ASESampler sampler, the run_ase_equilibration driver, and the built-in property extractors.