API Reference

This section provides comprehensive documentation for all public APIs.

Core Functions

kim_convergence.run_length_control(get_trajectory: Callable, get_trajectory_args: dict | None = None, *, number_of_variables: int = 1, initial_run_length: int = 10000, run_length_factor: float = 1.0, maximum_run_length: int = 1000000, maximum_equilibration_step: int | None = None, minimum_number_of_independent_samples: int | None = None, relative_accuracy: float | list[float | None] | ndarray | None = 0.1, absolute_accuracy: float | list[float | None] | ndarray | None = 0.1, population_mean: float | list[float | None] | ndarray | None = None, population_standard_deviation: float | list[float | None] | ndarray | None = None, population_cdf: str | list[str | None] | None = None, population_args: tuple | list[tuple | None] | None = None, population_loc: float | list[float | None] | ndarray | None = None, population_scale: float | list[float | None] | ndarray | None = None, confidence_coefficient: float = 0.95, confidence_interval_approximation_method: str = 'uncorrelated_sample', heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, ignore_end: int | float | None = None, number_of_cores: int = 1, si: str = None, nskip: int | None = 1, minimum_correlation_time: int | None = None, dump_trajectory: bool = False, dump_trajectory_fp: str = 'kim_convergence_trajectory.edn', fp: Any = None, fp_format: str = 'txt') str | bool

Control the length of the time series data from a simulation run.

It starts by drawing initial_run_length observations (samples), calling the get_trajectory function in a loop until the data reach equilibration or pass the warm-up period.

Note

get_trajectory is a callback function with the signature get_trajectory(nstep: int) -> 1darray if there is only one variable, or get_trajectory(nstep: int) -> 2darray with the shape (number_of_variables, nstep) otherwise.

To pass extra arguments to get_trajectory, use the alternative signature get_trajectory(nstep: int, args: dict) -> 1darray or get_trajectory(nstep: int, args: dict) -> 2darray with the shape (number_of_variables, nstep),

where all the required variables can be passed through the args dictionary.

All values returned from this function must be finite; otherwise, the code stops with an error message explaining the issue.

Examples:

>>> import numpy as np
>>> rng = np.random.RandomState(12345)
>>> start = 0
>>> stop = 0
>>> def get_trajectory(step):
...     global start, stop
...     start = stop
...     if 100000 < start + step:
...         step = 100000 - start
...     stop += step
...     data = np.ones(step) * 10 + (rng.random_sample(step) - 0.5)
...     return data

or,

>>> targs = {'start': 0, 'stop': 0}
>>> def get_trajectory(step, targs):
...     targs['start'] = targs['stop']
...     if 100000 < targs['start'] + step:
...         step = 100000 - targs['start']
...     targs['stop'] += step
...     data = np.ones(step) * 10 + (rng.random_sample(step) - 0.5)
...     return data

Then it continues drawing observations until some pre-specified level of absolute or relative precision has been reached.

The precision is defined as the half-width of the estimator’s confidence interval (CI); the relative precision is that half-width divided by the sample mean.

At each checkpoint, an upper confidence limit (UCL) is approximated. Drawing of observations terminates if the UCL is less than the pre-specified absolute precision, absolute_accuracy, or if the relative UCL (the UCL divided by the computed sample mean) is less than the pre-specified value, relative_accuracy.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that is in the stationary region.

The relative accuracy is the confidence interval half-width (UCL) divided by the sample mean. If this ratio is larger than relative_accuracy, the time series is deemed too short to estimate the mean with sufficient accuracy, and the run should be extended.
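As a minimal sketch of the stopping rule described above (not the library's implementation), the check at a single checkpoint could look as follows. The function name accuracy_reached is hypothetical, ucl stands for the half-width estimate, and a criterion set to None is assumed to be disabled.

```python
def accuracy_reached(ucl, sample_mean,
                     relative_accuracy=0.1, absolute_accuracy=0.1):
    """Return True if either accuracy criterion is met (hypothetical helper)."""
    # Absolute criterion: the half-width itself is small enough.
    if absolute_accuracy is not None and ucl < absolute_accuracy:
        return True
    # Relative criterion: half-width divided by |mean|.
    # It is undefined for a zero mean, so it is skipped in that case.
    if relative_accuracy is not None and sample_mean != 0.0:
        return ucl / abs(sample_mean) < relative_accuracy
    return False


# A half-width of 0.5 around a mean of 10 is a 5 % relative half-width.
print(accuracy_reached(0.5, 10.0))
```

If neither criterion is met, the run would be extended and the check repeated at the next checkpoint.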

To avoid the cost of evaluating the UCL sequentially, this calculation should not be repeated too frequently. Heidelberger and Welch (1981) [heidelberger1981] suggested increasing the run length by a factor run_length_factor > 1.5 each time, so that each estimate contains the same, reasonably large, proportion of new data.
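The effect of a geometric growth factor can be illustrated with a short sketch. It assumes that the total run length is multiplied by the factor at every checkpoint, which is one reading of the suggestion above and not necessarily how run_length_control schedules its extensions; the function name checkpoints is hypothetical.

```python
import math


def checkpoints(initial_run_length, run_length_factor, maximum_run_length):
    """Total run lengths at which the UCL would be re-evaluated (sketch)."""
    if run_length_factor <= 1.0:
        raise ValueError("this sketch needs run_length_factor > 1")
    total = min(initial_run_length, maximum_run_length)
    totals = [total]
    while total < maximum_run_length:
        # Grow the total length geometrically, capped by the cost constraint.
        total = min(math.ceil(total * run_length_factor), maximum_run_length)
        totals.append(total)
    return totals


totals = checkpoints(1000, 1.5, 5000)
print(totals)
# Fraction of new data at each checkpoint (the last one is cut by the cap).
print([(b - a) / b for a, b in zip(totals, totals[1:])])
```

Under this assumption, each uncapped checkpoint contains a fraction 1 - 1/run_length_factor of new observations, for example one third for a factor of 1.5.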

The accuracy parameter relative_accuracy specifies the maximum relative error allowed in the mean value of the time-series data, that is, the distance from the confidence limit(s) to the mean (also known as the precision, half-width, or margin of error) relative to the mean. A value of 0.01 is usually used to request two digits of accuracy, and so forth.

The parameter confidence_coefficient is the confidence coefficient; the value 0.95 is often used. The confidence coefficient can be interpreted as follows:

If thousands of samples of n items are drawn from a population using simple random sampling and a confidence interval is calculated for each sample, the proportion of those intervals that will include the true population mean is confidence_coefficient.
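This interpretation can be checked numerically. The sketch below is an illustration only: it draws independent normal samples with an arbitrarily chosen true mean and standard deviation, builds a normal-approximation confidence interval for each sample, and counts how often the interval contains the true mean. All names in it are hypothetical.

```python
import random
import statistics


def coverage(confidence_coefficient=0.95, n_items=50, n_samples=2000, seed=0):
    """Fraction of confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    true_mean, true_std = 10.0, 2.0  # arbitrary population for the demo
    # Two-sided normal quantile for the requested confidence coefficient.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence_coefficient / 2.0)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, true_std) for _ in range(n_items)]
        mean = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / n_items ** 0.5
        if abs(mean - true_mean) <= half_width:
            hits += 1
    return hits / n_samples


print(coverage())
```

The printed proportion should be close to, though usually not exactly, the requested confidence coefficient, because the number of samples is finite and the normal quantile is an approximation for small samples.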

The maximum_run_length parameter places an upper bound on how long the simulation will run. If the specified accuracy cannot be achieved within this time, the simulation will terminate, and a warning message will appear in the report.

The maximum_equilibration_step parameter places an upper bound on how long the simulation will run to reach equilibration or pass the warm-up period. If the equilibration or warm-up period cannot be detected within this time, the simulation will terminate and a warning message will appear in the report.

Note

By default and if not specified on input, the maximum_equilibration_step is defined as half of the maximum_run_length.

Note

By default, the algorithm will use relative_accuracy as a termination criterion, and in case of failure, it switches to use the absolute_accuracy.

If using the absolute_accuracy is desired, one should set the relative_accuracy to None.

Examples:

>>> run_length_control(get_trajectory,
...                    number_of_variables=1,
...                    relative_accuracy=None,
...                    absolute_accuracy=0.1)

The algorithm converts the relative_accuracy and absolute_accuracy floating-point numbers to arrays with the shape (number_of_variables,) when number_of_variables is greater than one. By default, it uses relative_accuracy as the termination criterion for the corresponding variable, and in case of failure, it switches to absolute_accuracy.

If absolute_accuracy is desired for one or more variables, one should provide both relative_accuracy and absolute_accuracy as arrays, setting the corresponding relative_accuracy entry to None and the desired absolute_accuracy at the same position in the array.

E.g.,

>>> run_length_control(get_trajectory,
...                    number_of_variables=3,
...                    relative_accuracy=[0.1, 0.05, None],
...                    absolute_accuracy=[0.1, 0.05, 0.1])

or,

>>> run_length_control(get_trajectory,
...                    number_of_variables=3,
...                    relative_accuracy=[None, 0.05, None],
...                    absolute_accuracy=[0.1,  0.05, 0.1])
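The per-variable convention in the examples above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper name per_variable_criteria is hypothetical, scalars are broadcast to one entry per variable, and the result lists the criterion that would be tried first for each variable.

```python
def per_variable_criteria(number_of_variables,
                          relative_accuracy=0.1, absolute_accuracy=0.1):
    """List the first-choice (kind, value) criterion for every variable."""

    def expand(value, name):
        # Broadcast a scalar (or None) to one entry per variable.
        if isinstance(value, (list, tuple)):
            if len(value) != number_of_variables:
                raise ValueError(f"{name} must have one entry per variable")
            return list(value)
        return [value] * number_of_variables

    relative = expand(relative_accuracy, "relative_accuracy")
    absolute = expand(absolute_accuracy, "absolute_accuracy")

    criteria = []
    for rel, abs_ in zip(relative, absolute):
        if rel is not None:
            criteria.append(("relative", rel))
        elif abs_ is not None:
            criteria.append(("absolute", abs_))
        else:
            raise ValueError("each variable needs at least one criterion")
    return criteria


print(per_variable_criteria(3,
                            relative_accuracy=[None, 0.05, None],
                            absolute_accuracy=[0.1, 0.05, 0.1]))
```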

Note

confidence_interval_approximation_method selects the method used to approximate the upper confidence limit of the mean.

By default, the uncorrelated_sample approach uses the independent samples in the time-series data to approximate the confidence intervals for the mean. The other methods take different approaches.

For example, the heidel_welch method requires no such independence assumption. In this spectral approach, the problem of dealing with dependent data is largely avoided by working in the frequency domain with the sample spectrum (periodogram) of the process.
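To make the term periodogram concrete, here is a minimal sketch that computes it with a direct discrete Fourier transform. It is not the Heidelberger and Welch estimator itself, which goes on to fit a polynomial to the periodogram near zero frequency; the function name and the normalization by the series length are choices made for this illustration.

```python
import cmath
import math


def periodogram(data):
    """Periodogram ordinates I(k/n), k = 1, ..., n - 1, of the centered data."""
    n = len(data)
    mean = sum(data) / n
    centered = [x - mean for x in data]
    ordinates = []
    for k in range(1, n):
        # Discrete Fourier transform at frequency k / n (direct O(n^2) sum).
        dft = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates


series = [math.sin(0.3 * t) + 0.1 * t for t in range(64)]
ordinates = periodogram(series)
# By Parseval's identity the average ordinate equals the sample variance.
print(sum(ordinates) / len(ordinates))
```

With this normalization, the ordinates average to the sample variance, which is why the spectrum near zero frequency carries the information needed to estimate the variance of the sample mean for correlated data.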

Note

population_mean is the known (true) mean of the variable, that is, the expected value under the null hypothesis. It is extra information for normally distributed data.

Note

For non-normally distributed data, and as an extra check on convergence, one should provide the population information using population_cdf, population_args, population_loc, and population_scale for a specific distribution.

Parameters:
  • get_trajectory (callback function) –

    A callback function with the signature get_trajectory(nstep: int) -> 1darray if there is only one variable, or get_trajectory(nstep: int) -> 2darray with the shape (number_of_variables, nstep) otherwise.

    Note

    All values returned from this function must be finite; otherwise, the code stops with an error message explaining the issue.

  • get_trajectory_args (dict, optional) – extra arguments passed to the get_trajectory function. The dictionary may contain start and stop keywords as well as any other keywords needed by the function, which is then called as get_trajectory(nstep, get_trajectory_args) -> 1darray. (default: None)

  • number_of_variables (int, optional) – number of variables in the corresponding time-series data from get_trajectory callback function. (default: 1)

  • initial_run_length (int, optional) – initial run length. (default: 10000)

  • run_length_factor (float, optional) – run length increasing factor. (default: 1.0)

  • maximum_run_length (int, optional) – the maximum run length represents a cost constraint. (default: 1000000)

  • maximum_equilibration_step (int, optional) – the maximum number of steps as an equilibration hard limit. If the algorithm finds equilibration_step greater than this limit it will fail. For the default None, the function is using maximum_run_length // 2 as the maximum equilibration step. (default: None)

  • minimum_number_of_independent_samples (int, optional) – minimum number of independent samples. This is an extra requirement: the run terminates only after the pre-specified level of absolute or relative precision has been reached and at least this many independent samples are available for further analysis. (default: None)

  • relative_accuracy (float, or 1darray, optional) – a relative half-width requirement or the accuracy parameter. Target value for the ratio of halfwidth to sample mean. If number_of_variables > 1, relative_accuracy can be a scalar to be used for all variables or a 1darray of values of size number_of_variables. (default: 0.1)

  • absolute_accuracy (float, or 1darray, optional) – an absolute half-width requirement or the accuracy parameter. Target value for the half-width. If number_of_variables > 1, absolute_accuracy can be a scalar to be used for all variables or a 1darray of values of size number_of_variables. (default: 0.1)

  • population_mean (float, or 1darray, optional) –

    known (true) mean of the variable; the expected value under the null hypothesis. (default: None)

    Note

    For number_of_variables > 1, and if population_mean is provided, it should be a list or array of values. It should be set to None for variables for which this extra measure is not intended.

    Examples:

    >>> run_length_control(get_trajectory,
    ...                    number_of_variables=3,
    ...                    population_mean=[None, 297., None])
    

  • population_standard_deviation (float, or 1darray, optional) –

    population standard deviation. (default: None)

    Note

    For number_of_variables > 1, and if population_standard_deviation is provided, it should be a list or array of values. It should be set to None for variables for which this extra measure is not intended.

    Examples:

    >>> run_length_control(
    ...     get_trajectory,
    ...     number_of_variables=3,
    ...     population_mean=[None, 297., None],
    ...     population_standard_deviation=[None, 10., None])
    

  • population_cdf (str, or 1darray, optional) –

    The name of a distribution. (default: None)

    Examples:

    >>> run_length_control(
    ...     get_trajectory,
    ...     number_of_variables=2,
    ...     population_cdf=[None, 'gamma'],
    ...     population_args=[None, (1.99,)],
    ...     population_loc=[None, None],
    ...     population_scale=[None, None])

    or,

    >>> run_length_control(
    ...     get_trajectory,
    ...     number_of_variables=2,
    ...     population_mean=[297., None],
    ...     population_standard_deviation=[10., None],
    ...     population_cdf=[None, 'gamma'],
    ...     population_args=[None, (1.99,)],
    ...     population_loc=[None, None],
    ...     population_scale=[None, None])
    

  • population_args (tuple, or list of tuples, optional) – Distribution parameter. (default: None)

  • population_loc (float, or 1darray, or None) – location of the distribution. (default: None)

  • population_scale (float, or 1darray, or None) – scale of the distribution. (default: None)

  • confidence_coefficient (float, optional) – the confidence coefficient (or confidence level); must be between 0.0 and 1.0. It is the confidence level used to estimate the relative half-width. (default: 0.95)

  • confidence_interval_approximation_method (str, optional) – method used to approximate the upper confidence limit of the mean. One of the ucl_methods approaches. (default: ‘uncorrelated_sample’)

  • heidel_welch_number_points (int, optional) – the number of points in Heidelberger and Welch’s spectral method that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)

  • train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – a method to standardize a batched dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – if True, center the batched data using the centering approach of the scale method. (default: False)

  • with_scaling (bool, optional) – if True, scale the batched data using the scaling approach of the scale method. (default: False)

  • ignore_end (int, or float, or None, optional) – if int, it is the number of last (batch) points that should be ignored. If float, it should be in (0, 1) and is the fraction of last (batch) points that should be ignored. If None, it is set to batch_size in the batch method and to one fourth of the total number of points elsewhere. (default: None)

  • number_of_cores (int, optional) – the maximum number of concurrently running jobs, such as the number of Python worker processes or the size of the thread pool. If -1, all CPUs are used. If 1, no parallel computing code is used at all. For number_of_cores below -1, (n_cpus + 1 + number_of_cores) are used. Thus, for number_of_cores = -2, all CPUs but one are used. (default: 1)

  • si (str, optional) – statistical inefficiency method. (default: ‘statistical_inefficiency’)

  • nskip (int, optional) – the number of data points to skip in estimating ucl. (default: 1)

  • minimum_correlation_time (int, optional) – The minimum amount of correlation function to compute in estimating ucl. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • dump_trajectory (bool, optional) – if True, dump the final trajectory data to a file dump_trajectory_fp. (default: False)

  • dump_trajectory_fp (str, object with a write(string) method, optional) – a .write()-supporting file-like object or a name string to open a file. (default: ‘kim_convergence_trajectory.edn’)

  • fp (str, object with a write(string) method, optional) – if a str equal to 'return', the function returns a string containing the analysis results on the length of the time series. Otherwise, it must be an object with a write(string) method. If it is None, sys.stdout is used, which prints the results on the screen. (default: None)

  • fp_format (str) – output format; one of txt, json, or edn. (default: ‘txt’)

Returns:

Union[str, bool]

True if the length of the time series is long enough to estimate the mean with sufficient accuracy or with enough requested sample size; False otherwise. If fp == 'return', a string containing the analysis results is returned instead.
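Among the parameters above, the number_of_cores convention is easy to misread, so here is a small sketch of it. The helper name effective_workers and its n_cpus argument are hypothetical and serve only to turn the rule quoted in the parameter description into code.

```python
import os


def effective_workers(number_of_cores=1, n_cpus=None):
    """Translate a number_of_cores value into a worker count (sketch)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if number_of_cores == 0:
        raise ValueError("number_of_cores must not be 0")
    if number_of_cores < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on.
        return max(1, n_cpus + 1 + number_of_cores)
    return number_of_cores


# On a machine with 8 CPUs:
print(effective_workers(-1, n_cpus=8), effective_workers(-2, n_cpus=8))
```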

UCL Methods

Upper Confidence Limit (UCL) module.

Upper Confidence Limit (UCL): The upper boundary (or limit) of a confidence interval of a parameter of interest such as the population mean.

A confidence interval expresses how much uncertainty there is in a particular statistic [nistdiv898]. Confidence limits for the mean are interval estimates. Interval estimates are often desirable because, instead of a single estimate for the mean, a confidence interval provides a lower and an upper limit. It indicates how much uncertainty there is in our estimate of the true mean. The narrower the interval, the more precise our estimate is. We use a confidence level to express confidence limits. Choosing the confidence level is somewhat arbitrary, but 90 %, 95 %, and 99 % intervals are standard, and 95 % is the most commonly used.

Note

One should note that a 95 % confidence interval does not mean a 95 % probability of containing the true mean. The interval computed from a sample either has the true mean, or it does not. The confidence level is simply the proportion of samples of a given size that may be expected to contain the true mean. For a 95 % confidence interval, if many samples are collected and the confidence interval computed, in the long run, about 95 % of these intervals would contain the true mean.

class kim_convergence.ucl.HeidelbergerWelch

Heidelberger and Welch algorithm.

Heidelberger and Welch (1981) [heidelberger1981] Object.

heidel_welch_set

Flag indicating if the Heidelberger and Welch constants are set.

Type:

bool

heidel_welch_k

The number of points that are used to obtain the polynomial fit in Heidelberger and Welch’s spectral method.

Type:

int

heidel_welch_n

The number of time series data points or number of batches in Heidelberger and Welch’s spectral method.

Type:

int

heidel_welch_p

Probability.

Type:

float

a_matrix

Auxiliary matrix.

Type:

ndarray

a_matrix_1_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the first degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:

ndarray

a_matrix_2_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the second degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:

ndarray

a_matrix_3_inv

The (Moore-Penrose) pseudo-inverse of a matrix for the third degree polynomial fit in Heidelberger and Welch’s spectral method.

Type:

ndarray

heidel_welch_c1_1

Heidelberger and Welch’s C1 constant for the first degree polynomial fit.

Type:

float

heidel_welch_c1_2

Heidelberger and Welch’s C1 constant for the second degree polynomial fit.

Type:

float

heidel_welch_c1_3

Heidelberger and Welch’s C1 constant for the third degree polynomial fit.

Type:

float

heidel_welch_c2_1

Heidelberger and Welch’s C2 constant for the first degree polynomial fit.

Type:

float

heidel_welch_c2_2

Heidelberger and Welch’s C2 constant for the second degree polynomial fit.

Type:

float

heidel_welch_c2_3

Heidelberger and Welch’s C2 constant for the third degree polynomial fit.

Type:

float

tm_1

t_distribution inverse cumulative distribution function for C2_1 degrees of freedom.

Type:

float

tm_2

t_distribution inverse cumulative distribution function for C2_2 degrees of freedom.

Type:

float

tm_3

t_distribution inverse cumulative distribution function for C2_3 degrees of freedom.

Type:

float

get_heidel_welch_auxilary_matrices() tuple

Get the Heidelberger and Welch auxiliary matrices.

get_heidel_welch_c1() tuple

Get the Heidelberger and Welch C1 constants.

get_heidel_welch_c2() tuple

Get the Heidelberger and Welch C2 constants.

get_heidel_welch_constants() tuple

Get the Heidelberger and Welch constants.

get_heidel_welch_knp() tuple

Get the heidel_welch_number_points, n, and confidence_coefficient.

get_heidel_welch_tm() tuple

Get the Heidelberger and Welch t_distribution ppf.

Get the Heidelberger and Welch t_distribution ppf for C2 degrees of freedom.

is_heidel_welch_set() bool

Return True if the flag is set to True.

set_heidel_welch_constants(*, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50)

Set Heidelberger and Welch constants globally.

Set the constants necessary for applying Heidelberger and Welch’s [heidelberger1981] confidence interval generation method.

Parameters:
  • confidence_coefficient (float) – the confidence coefficient (or confidence level); must be between 0.0 and 1.0. (default: 0.95)

  • heidel_welch_number_points (int) – the number of points in Heidelberger and Welch’s spectral method that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)

unset_heidel_welch_constants()

Unset the Heidelberger and Welch flag.

class kim_convergence.ucl.MSER_m

MSER-m algorithm.

The MSER [white1997] and MSER-5 [spratt1998] rules determine the truncation point as the value of \(d\) that best balances the tradeoff between improved accuracy (elimination of bias) and decreased precision (reduction in the sample size) for the input series. They select a truncation point that minimizes the width of the marginal confidence interval about the truncated sample mean. The marginal confidence interval is a measure of the homogeneity of the truncated series. The optimal truncation point \(d(j)^*\) selected by MSER-m can be expressed as:

\[d(j)^* = \underset{n>d(j) \geq 0}{\text{argmin}} \left[ \frac{1}{(n(j)-d(j))^2} \sum_{i=d}^{n}{\left(X_i(j)- \bar{X}_{n,d}(j) \right )^2} \right]\]

MSER-m applies the equation to a series of batch averages instead of the raw series. The CI estimators can be computed from the truncated sequence of batch means.
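The rule can be sketched in a few lines of plain Python. This is an illustration of the formula applied to batch means, not the library's implementation: the function name is hypothetical, the last ignore_end batch means are excluded from the search as an assumption (otherwise a single remaining point would trivially give a zero statistic), and the direct search is quadratic in the number of batches.

```python
def mser_m_truncation(data, batch_size=5, ignore_end=None):
    """Return the observation index at which MSER-m would truncate (sketch)."""
    n_batches = len(data) // batch_size
    batch_means = [
        sum(data[i * batch_size:(i + 1) * batch_size]) / batch_size
        for i in range(n_batches)
    ]
    if ignore_end is None:
        ignore_end = batch_size  # assumption made for this sketch
    last = n_batches - ignore_end
    if last < 1:
        raise ValueError("not enough batches")

    best_d, best_stat = 0, float("inf")
    for d in range(last):
        kept = batch_means[d:]  # batch means kept after dropping the first d
        mean = sum(kept) / len(kept)
        # MSER statistic: sum of squared deviations over (n - d)^2.
        stat = sum((x - mean) ** 2 for x in kept) / len(kept) ** 2
        if stat < best_stat:
            best_d, best_stat = d, stat
    return best_d * batch_size


# A transient of 50 points at level 5 followed by small oscillations.
series = [5.0] * 50 + [0.1 if i % 2 == 0 else -0.1 for i in range(200)]
print(mser_m_truncation(series))  # -> 50
```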

estimate_equilibration_length(time_series_data: ndarray | list[float], *, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, ignore_end: int | float | None = None, number_of_cores: int = 1, si: str | float | int | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None) tuple[bool, int]

Estimate the equilibration point in a time series data.

class kim_convergence.ucl.MSER_m_y

MSER_m_y algorithm.

MSER_m_y [yousefi2011] computes k batch means of size m to evaluate the MSER-m statistic as described in [spratt1998] and detect the truncation point. If the truncation is detected, the point estimator of the mean is the sample mean of all observations in the truncated data set.

To compute the UCL, MSER_m_y applies the von Neumann randomness test [vonneumann1941], [vonneumann1941b] to the truncated data to find a new batch size \(m^*\) for which the new batch means are approximately independent. It applies the randomness test to successively larger batch sizes until the test passes, at which point the batch means are deemed approximately independent of each other. It starts by setting the initial batch size m to 1 and calculates the number of batches k’ accordingly.
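The batch size search described above can be sketched as follows. The sketch rests on several assumptions that are not taken from the library: the test uses the usual large-sample normal approximation of the von Neumann ratio, the batch size grows by one at each step, and at least min_batches batches must remain. All function and parameter names other than significance_level are hypothetical.

```python
import math
import random
import statistics


def von_neumann_z(values):
    """Normal-approximation z-score of the von Neumann ratio statistic."""
    n = len(values)
    mean = statistics.fmean(values)
    ss = sum((x - mean) ** 2 for x in values)
    if ss == 0.0:
        raise ValueError("constant data")
    diffs = sum((a - b) ** 2 for a, b in zip(values, values[1:]))
    c = 1.0 - diffs / (2.0 * ss)  # close to 0 for independent data
    return c / math.sqrt((n - 2) / (n * n - 1))


def find_batch_size(data, significance_level=0.05, min_batches=10):
    """Smallest batch size whose batch means pass the test, or None."""
    z_crit = statistics.NormalDist().inv_cdf(1.0 - significance_level / 2.0)
    m = 1
    while len(data) // m >= min_batches:
        k = len(data) // m
        means = [statistics.fmean(data[i * m:(i + 1) * m]) for i in range(k)]
        if abs(von_neumann_z(means)) < z_crit:
            return m
        m += 1  # assumption: try the next larger batch size
    return None


# Strongly autocorrelated AR(1) series as a demo input.
rng = random.Random(1)
x, series = 0.0, []
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    series.append(x)
print(find_batch_size(series))
```

For a correlated series like this one, a batch size of 1 is rejected, and the search either returns a larger batch size or None if no tested size passes.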

significance_level

Significance level. A probability threshold below which the null hypothesis will be rejected.

Type:

float

class kim_convergence.ucl.N_SKART

N-Skart algorithm.

N-Skart [tafazzoli2011] is a nonsequential procedure designed to compute the half-width of the confidence_coefficient% confidence interval (CI) around the time-series mean.

Note

N-Skart is a variant of the method of batch means.

N-Skart makes some modifications to the confidence interval (CI). These modifications account for the skewness (non-normality), and autocorrelation of the batch means which affect the distribution of the underlying Student’s t-statistic.
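For readers unfamiliar with the underlying idea, the sketch below computes a classical non-overlapping batch-means half-width. It is not N-Skart: it omits the skewness and autocorrelation adjustments mentioned above and, for simplicity, uses a normal quantile in place of Student's t. The function name is hypothetical.

```python
import math
import statistics


def batch_means_half_width(data, batch_size, confidence_coefficient=0.95):
    """Return (grand mean, CI half-width) from non-overlapping batch means."""
    k = len(data) // batch_size
    if k < 2:
        raise ValueError("need at least two batches")
    means = [
        statistics.fmean(data[i * batch_size:(i + 1) * batch_size])
        for i in range(k)
    ]
    grand_mean = statistics.fmean(means)
    # Standard error of the grand mean, treating batch means as independent.
    std_err = statistics.stdev(means) / math.sqrt(k)
    # Simplification: normal quantile instead of Student's t with k - 1 dof.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence_coefficient / 2.0)
    return grand_mean, z * std_err


series = [0.0, 0.0, 2.0, 2.0] * 10
print(batch_means_half_width(series, batch_size=2))
```

Larger batches reduce the correlation between batch means at the price of fewer batches, which is the trade-off that batch-means procedures have to manage.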

k_number_batches

number of nonspaced (adjacent) batches of size batch_size.

Type:

int

kp_number_batches

number of nonspaced (adjacent) batches.

Type:

int

batch_size

batch size.

Type:

int

number_batches_per_spacer

number of batches per spacer.

Type:

int

maximum_number_batches_per_spacer

maximum number of batches per spacer.

Type:

int

significance_level

Significance level. A probability threshold below which the null hypothesis will be rejected.

Type:

float

randomness_test_counter

counter for applying the randomness test of von Neumann [vonneumann1941] [vonneumann1941b].

Type:

int

estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) tuple[bool, int]

Estimate the equilibration point in a time series data.

Estimate the equilibration point in a time series data using the N-Skart algorithm.

Parameters:

time_series_data (array_like, 1d) – time series data.

Returns:

tuple[bool, int]

truncated: True if truncation was applied. truncation_point: Index at which to truncate.

Note

If N-Skart does not detect the equilibration, it returns truncated as False and an equilibration index equal to the last index in the time series data.

Note

nskip, ignore_end, and number_of_cores are accepted for API compatibility but are not used by this method.

class kim_convergence.ucl.UCLBase

Upper Confidence Limit base class.

ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) tuple[float, float]

Approximate the confidence interval of the mean.

estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) tuple[bool, int]

Estimate the equilibration point in a time series data.

property indices

Get the indices.

property mean

Get the mean.

property name

Get the name.

relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) float

Get the relative half width estimate.

requires_si_computation() bool

Return True if this UCL method requires statistical inefficiency computation.

property sample_size

Get the sample_size.

set_indices(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) None

Set the indices.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

set_si(time_series_data, *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) None

Set the si (statistical inefficiency).

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

property si

Get the si.

property std

Get the std.

ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, equilibration_length_estimate: int = 0, heidel_welch_number_points: int = 50, batch_size: int = 5, fft: bool = True, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, test_size: int | float | None = None, train_size: int | float | None = None, population_standard_deviation: float | None = None, si: str | float | int | None = None, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) float

Approximate the upper confidence limit of the mean.

class kim_convergence.ucl.UncorrelatedSamples

UncorrelatedSamples algorithm.

kim_convergence.ucl.heidelberger_welch_ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) tuple[float, float]

Approximate the confidence interval of the mean.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – the confidence coefficient (or confidence level); must be between 0.0 and 1.0. It is the confidence level used to estimate the relative half-width. (default: 0.95)

  • heidel_welch_number_points (int, optional) – the number of points that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)

  • fft (bool, optional) – Use FFT convolution for long series. (default: True)

  • test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)

  • train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)

  • obj (HeidelbergerWelch, optional) – instance of HeidelbergerWelch (default: None)

Returns:

tuple[float, float]

Lower and upper confidence limits for the mean.

kim_convergence.ucl.heidelberger_welch_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that lies in the stationary region.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • heidel_welch_number_points (int, optional) – the number of points that are used to obtain the polynomial fit. The parameter heidel_welch_number_points determines the frequency range over which the fit is made. (default: 50)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • test_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the test split. If int, represents the absolute number of test samples. (default: None)

  • train_size (int, float, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the periodogram dataset to include in the train split. If int, represents the absolute number of train samples. (default: None)

  • obj (HeidelbergerWelch, optional) – instance of HeidelbergerWelch (default: None)

Returns:

float

Relative half width estimate.
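The ratio described above can be illustrated with a short sketch. It is not the library's implementation: the helper name relative_half_width_sketch is hypothetical, and it assumes the confidence interval is symmetric about the sample mean, so that the mean and the half-width can be recovered from the two limits. Taking the magnitude of the mean is also an assumption made here to keep the ratio non-negative.

```python
def relative_half_width_sketch(lower: float, upper: float) -> float:
    """Relative half-width from symmetric confidence limits (hypothetical helper)."""
    # Assumption: the limits are mean - UCL and mean + UCL.
    mean = 0.5 * (lower + upper)
    half_width = 0.5 * (upper - lower)
    if mean == 0.0:
        raise ValueError("relative half-width is undefined for a zero mean")
    # Assumption: use the magnitude of the mean so the ratio is non-negative.
    return half_width / abs(mean)


# Limits of 9.0 and 11.0 give a mean of 10.0 and a half-width of 1.0.
ratio = relative_half_width_sketch(9.0, 11.0)
```

A ratio at or below the requested relative accuracy would indicate that the mean has been estimated precisely enough.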

kim_convergence.ucl.heidelberger_welch_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, heidel_welch_number_points: int = 50, fft: bool = True, test_size: int | float | None = None, train_size: int | float | None = None, obj: HeidelbergerWelch | None = None) float

Approximate the upper confidence limit of the mean.

kim_convergence.ucl.mser_m(time_series_data: ndarray | list[float], *, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, ignore_end: int | float | None = None) tuple[bool, int]

Determine the truncation point using marginal standard error rules.

Determine the truncation point using marginal standard error rules (MSER). The MSER [white1997] and MSER-5 [spratt1998] rules determine the truncation point as the value of \(d\) that best balances the tradeoff between improved accuracy (elimination of bias) and decreased precision (reduction in the sample size) for the input series. They select a truncation point that minimizes the width of the marginal confidence interval about the truncated sample mean. The marginal confidence interval is a measure of the homogeneity of the truncated series. The optimal truncation point \(d(j)^*\) selected by MSER-m can be expressed as:

\[d(j)^* = \underset{n>d(j) \geq 0}{\text{argmin}} \left[ \frac{1}{(n(j)-d(j))^2} \sum_{i=d}^{n}{\left(X_i(j)- \bar{X}_{n,d}(j) \right )^2} \right]\]

MSER-m applies the equation to a series of batch averages instead of the raw series.

Parameters:
  • time_series_data (array_like, 1d) – Time series data.

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

  • ignore_end (int, float, or None, optional) – if int, the number of last batch points to ignore. If float, it should be in (0, 1) and is the fraction of last batch points to ignore. If None, it is set to \(Min(batch_size, number_batches / 4)\). (default: None)

Returns:

tuple[bool, int]

truncated: True if truncation was applied. truncation_point: Index at which to truncate.

Note

MSER-m sometimes erroneously reports a truncation point at the end of the data series. This is because the method can be overly sensitive to observations at the end of the data series that are close in value. Here, we avoid this artifact by not allowing the algorithm to consider the standard errors calculated from the last few data points.

Note

If the truncation point returned by MSER-m is greater than n/2, it is considered invalid and truncated is returned as False. This means the method has not been given enough data to produce a valid result, and more data is required.

Note

If the truncation obtained by MSER-m is the last index of the batched data, the MSER-m returns the time series data’s last index as the truncation point. This index can be used as a measure that the algorithm did not find any truncation point.
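The rule described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function name mser_truncation_sketch is hypothetical, no scaling or centering is applied, ignore_end is accepted only as an integer or None, and trailing observations that do not fill a whole batch are dropped.

```python
from statistics import fmean


def mser_truncation_sketch(data, batch_size=5, ignore_end=None):
    """Simplified MSER-m truncation point search (hypothetical helper)."""
    n_batches = len(data) // batch_size
    if n_batches == 0:
        raise ValueError("not enough data for a single batch")
    # MSER-m works on batch averages rather than on the raw series.
    z = [fmean(data[i * batch_size:(i + 1) * batch_size])
         for i in range(n_batches)]
    if ignore_end is None:
        ignore_end = min(batch_size, n_batches // 4)

    best_d, best_value = 0, float("inf")
    # Skip the last ignore_end candidates to avoid the end-of-series artifact.
    for d in range(n_batches - ignore_end):
        tail = z[d:]
        mean = fmean(tail)
        # Sum of squared deviations divided by the squared tail length.
        value = sum((v - mean) ** 2 for v in tail) / len(tail) ** 2
        if value < best_value:
            best_d, best_value = d, value

    truncation_point = best_d * batch_size
    # A truncation point beyond half of the data is treated as invalid.
    truncated = truncation_point <= len(data) / 2
    return truncated, truncation_point


# 20 transient observations followed by 80 stationary ones.
series = [10.0] * 20 + [1.0, -1.0] * 40
result = mser_truncation_sketch(series)
```

In this example the first four batches carry the transient, so the statistic is smallest when exactly those four batches are discarded.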

kim_convergence.ucl.mser_m_ci(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) tuple[float, float]

Approximate the confidence interval of the mean [mokashi2010].

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

  • obj (MSER_m, optional) – instance of MSER_m (default: None)

Returns:

tuple[float, float]

Lower and upper confidence limits for the mean.

kim_convergence.ucl.mser_m_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that lies in the stationary region.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

  • obj (MSER_m, optional) – instance of MSER_m (default: None)

Returns:

float

Relative half width estimate.

kim_convergence.ucl.mser_m_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m | None = None) float

Approximate the upper confidence limit of the mean.

kim_convergence.ucl.mser_m_y_ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) tuple[float, float]

Approximate the confidence interval of the mean [mokashi2010].

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

  • obj (MSER_m_y, optional) – instance of MSER_m_y (default: None)

Returns:

tuple[float, float]

Lower and upper confidence limits for the mean.

kim_convergence.ucl.mser_m_y_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that lies in the stationary region.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • batch_size (int, optional) – batch size. (default: 5)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, use time_series_data minus the scale method centering approach. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scale method scaling approach. (default: False)

  • obj (MSER_m_y, optional) – instance of MSER_m_y (default: None)

Returns:

float

Relative half width estimate.

kim_convergence.ucl.mser_m_y_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False, obj: MSER_m_y | None = None) float

Approximate the upper confidence limit of the mean.

kim_convergence.ucl.n_skart_ci(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) tuple[float, float]

Approximate the confidence interval of the mean.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • equilibration_length_estimate (int, optional) – an estimate for the equilibration length. (default: 0)

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • obj (N_SKART, optional) – instance of N_SKART (default: None)

Returns:

tuple[float, float]

Lower and upper confidence limits for the mean.

kim_convergence.ucl.n_skart_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that lies in the stationary region.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • equilibration_length_estimate (int, optional) – an estimate for the equilibration length. (default: 0)

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • obj (N_SKART, optional) – instance of N_SKART (default: None)

Returns:

float

Relative half width estimate.

kim_convergence.ucl.n_skart_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient=0.95, equilibration_length_estimate: int = 0, fft: bool = True, obj: N_SKART | None = None) float

Approximate the upper confidence limit of the mean.

kim_convergence.ucl.uncorrelated_samples_ci(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, population_standard_deviation: float | None = None, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None, obj: UncorrelatedSamples | None = None) tuple[float, float]

Approximate the confidence interval of the mean.

  • If the population standard deviation is known and population_standard_deviation is given,

    \[UCL = t_{\alpha,d} \left(\frac{\text{population standard deviation}}{\sqrt{n}}\right)\]

  • If the population standard deviation is unknown, the sample standard deviation is estimated and used instead,

    \[UCL = t_{\alpha,d} \left(\frac{\text{sample standard deviation}}{\sqrt{n}}\right)\]

In both cases, the critical value is taken from the Student's t distribution. This value depends on the confidence_coefficient and the degrees of freedom, which is found by subtracting one from the number of observations.

Confidence limits for the mean are interval estimates. Interval estimates are often desirable because instead of a single estimate for the mean, a confidence interval generates a lower and upper limit. It indicates how much uncertainty there is in our estimation of the true mean. The narrower the gap, the more precise our estimate is.

Confidence limits are defined as \(\bar{Y} \pm UCL,\) where \(\bar{Y}\) is the sample mean, and \(UCL\) is the approximate upper confidence limit of the mean.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • population_standard_deviation (float, optional) – population standard deviation. (default: None)

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)

  • sample_method (str, optional) – sampling method, one of uncorrelated, random, or block_averaged. (default: None)

  • obj (UncorrelatedSamples, optional) – instance of UncorrelatedSamples (default: None)

Returns:

tuple[float, float]

Lower and upper confidence limits for the mean. These are an approximately unbiased estimate of the confidence limits.
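The formulas above can be sketched as follows. This is an illustration only, not the library's implementation: the helper name confidence_limits_sketch is hypothetical, the data are assumed to be already uncorrelated, and the critical value is passed in by the caller (for example from a t table) instead of being computed.

```python
import math
from statistics import fmean, stdev


def confidence_limits_sketch(data, critical_value,
                             population_standard_deviation=None):
    """Lower and upper limits as mean -/+ UCL (hypothetical helper)."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = fmean(data)
    if population_standard_deviation is None:
        # Unknown population value: fall back to the sample estimate.
        standard_deviation = stdev(data)
    else:
        standard_deviation = population_standard_deviation
    ucl = critical_value * standard_deviation / math.sqrt(n)
    return mean - ucl, mean + ucl


# Ten observations give nine degrees of freedom; 2.262 is the two-sided
# 95% critical value of the Student's t distribution for 9 degrees of freedom.
lower, upper = confidence_limits_sketch(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], 2.262)
```

Passing population_standard_deviation switches the sketch to the first formula, where the known value replaces the sample estimate.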

kim_convergence.ucl.uncorrelated_samples_relative_half_width_estimate(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, population_standard_deviation: float | None = None, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None, obj: UncorrelatedSamples | None = None) float

Get the relative half width estimate.

The relative half width estimate is the confidence interval half-width or upper confidence limit (UCL) divided by the sample mean.

The UCL is calculated as a confidence_coefficient% confidence interval for the mean, using the portion of the time series data that lies in the stationary region.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • confidence_coefficient (float, optional) – confidence level of the interval, a probability between 0.0 and 1.0, used when estimating the relative half-width. (default: 0.95)

  • population_standard_deviation (float, optional) – population standard deviation. (default: None)

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)

  • sample_method (str, optional) – sampling method, one of uncorrelated, random, or block_averaged. (default: None)

  • obj (UncorrelatedSamples, optional) – instance of UncorrelatedSamples (default: None)

Returns:

float

Relative half width estimate.

kim_convergence.ucl.uncorrelated_samples_ucl(time_series_data: ndarray | list[float], *, confidence_coefficient: float = 0.95, population_standard_deviation: float | None = None, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None, obj: UncorrelatedSamples | None = None) float

Approximate the upper confidence limit of the mean.

Statistical Functions

stats module.

class kim_convergence.stats.ZERO_RC(xlo: float, xhi: float, *, abs_tol: float = 1e-50, rel_tol: float = 1e-08)

Zero finding class by reverse communication.

zero(status: int, x: float, fx: float, xlo: float, xhi: float)

Perform the zero finding.

Parameters:
  • status (int) – Status. If 0, other parameters are ignored.

  • x (float) – Input value at which function f is evaluated.

  • fx (float) – Function value f(x).

  • xlo (float) – Lower interval bound.

  • xhi (float) – Upper interval bound.

Returns:

tuple[int, float, float, float]

status: 0 = finished, 1 = needs eval, -1 = error. x: updated candidate. xlo/xhi: refined bracketing interval.

class kim_convergence.stats.ZERO_RC_BOUNDS(small: float, big: float, abs_step: float, rel_step: float, step_mul: float, *, abs_tol: float = 1e-50, rel_tol: float = 1e-08)

Bound zero finding class by reverse communication.

zero(status: int, x: float, fx: float)

Bounds the zero of the function.

Bounds the zero of the function and finds zero of the function by reverse communication.

f must be a monotone function; otherwise the results are undefined. If f is monotonically increasing, the result is bounded by [f(x-tolerance(x)), f(x+tolerance(x))]. If f is monotonically decreasing, the result is bounded by [f(x+tolerance(x)), f(x-tolerance(x))]. Here tolerance(x) = Maximum(abs_tol, rel_tol * |x|).

Parameters:
  • status (int) – Status. If 0, other parameters are ignored.

  • x (float) – Input value at which function f is evaluated.

  • fx (float) – Function value f(x).

Returns:

tuple[int, float]

status: 0 = finished without error, 1 = needs another evaluation. x: updated input value.
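Reverse communication means the solver never calls f itself: it hands a candidate x back to the caller, who evaluates f(x) and passes the value in on the next call. The sketch below illustrates that calling pattern with a plain bisection. It is not the algorithm used by ZERO_RC or ZERO_RC_BOUNDS; the class BisectionRCSketch and the driver find_zero_sketch are hypothetical, and only the status convention (0 to start or finish, 1 to request an evaluation, -1 for an error) and the tolerance formula follow the descriptions above.

```python
class BisectionRCSketch:
    """Bisection driven by reverse communication (hypothetical class)."""

    def __init__(self, xlo, xhi, *, abs_tol=1e-50, rel_tol=1e-8):
        self.xlo, self.xhi = xlo, xhi
        self.abs_tol, self.rel_tol = abs_tol, rel_tol
        self._stage = "start"
        self._flo = 0.0

    def zero(self, status, x, fx):
        if status == 0:
            # First call: x and fx are ignored, request f(xlo).
            self._stage = "lo"
            return 1, self.xlo
        if self._stage == "lo":
            self._flo = fx
            self._stage = "hi"
            return 1, self.xhi
        if self._stage == "hi":
            if self._flo * fx > 0.0:
                return -1, x  # no sign change, the interval has no bracket
            self._stage = "mid"
        elif self._flo * fx > 0.0:
            # Same sign as the lower end: the zero lies in the upper half.
            self.xlo, self._flo = x, fx
        else:
            self.xhi = x
        mid = 0.5 * (self.xlo + self.xhi)
        tolerance = max(self.abs_tol, self.rel_tol * abs(mid))
        if self.xhi - self.xlo <= 2.0 * tolerance:
            return 0, mid
        return 1, mid


def find_zero_sketch(f, xlo, xhi):
    """Caller-side loop: evaluate f whenever the solver asks for it."""
    solver = BisectionRCSketch(xlo, xhi)
    status, x = solver.zero(0, 0.0, 0.0)
    while status == 1:
        status, x = solver.zero(status, x, f(x))
    if status != 0:
        raise ValueError("f does not change sign on the interval")
    return x


root = find_zero_sketch(lambda v: v * v - 2.0, 0.0, 2.0)
```

The benefit of this design is that the caller keeps full control over how f is evaluated, which matters when each evaluation is expensive or has side effects.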

kim_convergence.stats.auto_correlate(x: ndarray | list[float], *, nlags: int | None = None, fft: bool = False) ndarray

Calculate the auto-correlation function.

Calculate the auto-correlation function of the input array for nlags lags. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • nlags (int > 0 or None, optional) – Number of lags for which to return the auto-correlation. (default: None)

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray

Calculated auto correlation function.

kim_convergence.stats.auto_covariance(x: ndarray | list[float], *, fft: bool = False) ndarray

Calculate biased auto-covariance estimates.

Compute auto-covariance estimates for every lag for the input array. This estimator is biased.

\[\gamma_k = \frac{1}{N}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

Note

Some sources use the following formula for computing the autocovariance:

\[\gamma_k = \frac{1}{N-k}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

This definition has less bias than the one used here, but the \(\frac{1}{N}\) formulation has some desirable statistical properties and is the most commonly used in the statistics literature.

Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray

Estimated autocovariances.

Raises:

CRError – If input validation fails.
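The two normalizations discussed in the note can be compared with a direct, non-FFT sketch. The helper names are hypothetical and the code is not the library's implementation; the auto-correlation is obtained here by dividing the biased auto-covariances by the lag-zero value, which is an assumption about the normalization.

```python
from statistics import fmean


def autocovariance_sketch(x, *, less_biased=False):
    """Auto-covariance for every lag by direct summation (hypothetical helper)."""
    n = len(x)
    mean = fmean(x)
    deviations = [value - mean for value in x]
    result = []
    for k in range(n):
        total = sum(deviations[t] * deviations[t + k] for t in range(n - k))
        # 1/N is the biased form; 1/(N-k) is the alternative from the note.
        result.append(total / ((n - k) if less_biased else n))
    return result


def autocorrelation_sketch(x):
    """Biased auto-covariances normalized by the lag-zero value."""
    gamma = autocovariance_sketch(x)
    return [g / gamma[0] for g in gamma]


biased = autocovariance_sketch([1.0, 2.0, 3.0, 4.0])
alternative = autocovariance_sketch([1.0, 2.0, 3.0, 4.0], less_biased=True)
```

The direct sum costs O(N^2) operations, which is why an FFT-based convolution is preferable for long series.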

kim_convergence.stats.beta(a: float, b: float) float

Beta function.

Beta function [numrec2007] is defined as,

\[B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},\]

where \(\Gamma\) is the gamma function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

Returns:

float

Beta function value.
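The definition above can be evaluated with the standard library alone. The sketch below is not the library's code: the name beta_sketch is hypothetical, and working with the logarithm of the gamma function is a choice made here to avoid overflow for large arguments.

```python
import math


def beta_sketch(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b) (hypothetical helper)."""
    # Sum the log-gamma terms first, then exponentiate once.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


# B(2, 3) = 1! * 2! / 4! = 1/12
value = beta_sketch(2.0, 3.0)
```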

kim_convergence.stats.betacf(a: float, b: float, x: float, *, eps: float = 1e-15, max_iteration: int = 200, _fpmin: float = 1e-30) float

Continued fraction for incomplete beta function by modified Lentz’s method.

Evaluates continued fraction for incomplete beta function by modified Lentz’s method [numrec2007].

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Real-valued such that it must be between 0.0 and 1.0.

  • eps (float, optional) – Machine precision epsilon. (default: 1e-15, i.e. np.finfo(np.float64).resolution)

  • max_iteration (int, optional) – Maximum number of iterations. (default: 200)

  • _fpmin (float, optional) – Minimum floating point precision. (default: 1.0e-30)

Returns:

float

Continued fraction for incomplete beta function.

kim_convergence.stats.betai(a: float, b: float, x: float) float

Incomplete beta function.

Incomplete beta function [numrec2007] is defined as,

\[I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x~t^{a-1}(1-t)^{b-1}~dt,\]

where \(\Gamma\) is the gamma function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Real-valued such that it must be between 0.0 and 1.0.

Returns:

float

Incomplete beta function value.
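A common way to evaluate this integral, following the textbook continued-fraction approach with the modified Lentz's method, is sketched below. It is an illustration and not the library's code: the function names are hypothetical, the tolerance of 1e-14 is an assumption, and the switch to the symmetric form for larger x is the usual textbook device for faster convergence.

```python
import math


def betacf_sketch(a, b, x, eps=1e-14, max_iteration=200, fpmin=1e-30):
    """Continued fraction for the incomplete beta function (hypothetical helper)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) >= fpmin else fpmin)
    h = d
    for m in range(1, max_iteration + 1):
        m2 = 2 * m
        # Even step and odd step of the recurrence, in that order.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) >= fpmin else fpmin)
            c = 1.0 + aa / c
            c = c if abs(c) >= fpmin else fpmin
            delta = d * c
            h *= delta
        if abs(delta - 1.0) <= eps:
            return h
    raise RuntimeError("continued fraction did not converge")


def betai_sketch(a, b, x):
    """Regularized incomplete beta function I_x(a, b) (hypothetical helper)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0.0 and 1.0")
    if x in (0.0, 1.0):
        return x
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * betacf_sketch(a, b, x) / a
    # Symmetry relation I_x(a, b) = 1 - I_{1-x}(b, a).
    return 1.0 - front * betacf_sketch(b, a, 1.0 - x) / b


# For b = 1 the function reduces to x**a, so I_0.5(2, 1) equals 0.25.
value = betai_sketch(2.0, 1.0, 0.5)
```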

kim_convergence.stats.betai_cdf(a: float, b: float, x: float) float

Calculate the cumulative distribution of the incomplete beta distribution.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Upper limit of integration

Returns:

float

Cumulative incomplete beta distribution.

kim_convergence.stats.betai_cdf_ccdf(a: float, b: float, x: float) tuple[float, float]

Calculate the cumulative distribution of the incomplete beta distribution and its complement.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Upper limit of integration

Returns:

tuple[float, float]

Cumulative incomplete beta distribution, complement of the cumulative incomplete beta distribution.

kim_convergence.stats.check_population_cdf_args(population_cdf: str | None, population_args: tuple)

Check the input population_cdf and population_args for correctness.

Parameters:
  • population_cdf (str) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

kim_convergence.stats.chi_square_test(sample_var: float, sample_size: int, population_var: float, significance_level: float = 0.050000000000000044) bool

Chi-square test for the variance.

Calculate the chi-square test for the variance. This is a two-sided test. The test statistic is \(T=(N-1)\frac{\text{var}}{\text{var}_0}\), where N is the sample size, var is the sample variance, and var0 is the target (population) variance. The ratio var/var0 compares the sample variance to the target variance. The more this ratio deviates from 1, the more likely we are to reject the null hypothesis.

The null hypothesis is that the variance of a sample of independent observations x is equal to the given population variance, population_var.

Parameters:
  • sample_var (float) – Sample variance.

  • sample_size (int) – Number of samples.

  • population_var (float) – population variance.

  • significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool

True if the variance of a sample of independent observations x equals the given population variance population_var.
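The decision rule can be sketched as below. This is not the library's implementation: the helper name is hypothetical, and the two chi-square critical values are supplied by the caller (for example from a table) instead of being derived from significance_level.

```python
def chi_square_variance_sketch(sample_var, sample_size, population_var,
                               lower_critical, upper_critical):
    """Two-sided chi-square test for the variance (hypothetical helper)."""
    if population_var <= 0.0 or sample_size < 2:
        raise ValueError("need a positive population variance and N >= 2")
    # Test statistic T = (N - 1) * var / var0.
    statistic = (sample_size - 1) * sample_var / population_var
    # The null hypothesis is kept when T falls between the critical values.
    return lower_critical <= statistic <= upper_critical


# With N = 10 there are 9 degrees of freedom; at a 0.05 significance level
# the tabulated two-sided critical values are about 2.700 and 19.023.
same_variance = chi_square_variance_sketch(1.2, 10, 1.0, 2.700, 19.023)
```

Here T = 9 * 1.2 / 1.0 = 10.8 lies between the two critical values, so the null hypothesis of equal variances is not rejected.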

kim_convergence.stats.cross_correlate(x: ndarray | list[float], y: ndarray | list[float] | None, *, nlags: int | None = None, fft: bool = False) ndarray

Calculate the cross-correlation function.

Calculate the cross-correlation function of the input arrays for nlags lags. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • y (array_like, 1d) – Time series data.

  • nlags (int > 0 or None, optional) – Number of lags for which to return the cross-correlation. (default: None)

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray

Calculated cross correlation.

kim_convergence.stats.cross_covariance(x: ndarray | list[float], y: ndarray | list[float] | None, *, fft: bool = False) ndarray

Calculate the biased cross covariance estimate between two time series.

Calculate the cross covariance between two time series for every lag for the input arrays. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • y (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray

Calculated cross covariances.

Raises:

CRError – If input validation fails.

kim_convergence.stats.get_distribution_stats(population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None)

Get the distribution stats from its name.

The stats include the median, mean, variance, and standard deviation of the distribution.

Parameters:
  • population_cdf (str) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

Returns:

tuple

median, mean, var, std

kim_convergence.stats.get_fft_optimal_size(input_size: int) int

Find the optimal size for the FFT solver.

Get the next regular number greater than or equal to input_size [statsmodels]. Regular numbers are composites of the prime factors 2, 3, and 5. Also known as 5-smooth numbers or Hamming numbers, these are the optimal size for inputs to FFT solvers.

Parameters:

input_size (int) – Size of the input data for the FFT solver. This positive integer is the length from which the search starts.

Returns:

int

The first 5-smooth number greater than or equal to input_size.
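A brute-force version of this search is easy to write and is shown below as a sketch. It is not the library's implementation (which follows [statsmodels]); the name next_regular_sketch is hypothetical, and the linear scan is chosen for clarity over speed.

```python
def next_regular_sketch(input_size: int) -> int:
    """Smallest 5-smooth number >= input_size (hypothetical helper)."""
    if input_size < 1:
        raise ValueError("input_size must be a positive integer")
    candidate = input_size
    while True:
        remainder = candidate
        # Strip every factor of 2, 3, and 5; a regular number reduces to 1.
        for prime in (2, 3, 5):
            while remainder % prime == 0:
                remainder //= prime
        if remainder == 1:
            return candidate
        candidate += 1


# 97 is prime, 98 = 2 * 7**2 and 99 = 3**2 * 11, so the next regular number is 100.
size = next_regular_sketch(97)
```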

kim_convergence.stats.int_power(x: ndarray | list[float], exponent: int) ndarray

Array elements raised to the power exponent.

Parameters:
  • x (array_like, 1d) – The bases.

  • exponent (int) – The exponent

Returns:

1darray

Computed power array.

kim_convergence.stats.kruskal_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Kruskal-Wallis H-test for independent samples.

The Kruskal-Wallis H-test tests the null hypothesis that the median of the time series data is the same as the one from population_cdf.

It is a non-parametric version of ANOVA.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the median of the time-series data equals the median of the specified population distribution.

Examples:

>>> import numpy as np
>>> from kim_convergence.stats import kruskal_test
>>> rng = np.random.RandomState(12345)
>>> shape = 1.99
>>> x = rng.gamma(shape, 1, size=20)
>>> kruskal_test(x,
...              population_cdf='gamma',
...              population_args=(shape,),
...              population_loc=0,
...              population_scale=1,
...              significance_level=0.05)
True

kim_convergence.stats.ks_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Kolmogorov-Smirnov test for goodness of fit.

Note

This test is only valid for continuous distributions.

It tests the distribution of an observed variable against a given distribution.

The null hypothesis is that the observed samples are drawn from the same continuous distribution as the given distribution with population_loc and population_scale if they are given.

Note

The alternative hypothesis is two-sided: the empirical cumulative distribution function of the observed variables is either less or greater than the cumulative distribution function of the given distribution.

The probability density of the given population distribution is in standardized form. To shift and/or scale the distribution, use the population_loc and population_scale parameters. In these cases, the variable is changed as y <- x, where y = (x - loc) / scale.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the observed samples are drawn from the same continuous distribution as the given one (two-tailed p-value > significance_level).

kim_convergence.stats.levene_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Perform modified Levene test for equal variances.

The modified Levene test tests the null hypothesis that the sample time_series_data comes from the population population_cdf with the same variance [nistdiv898b].

Note

This test is fixed to use the ‘median’ variant of Levene’s test.

Although the optimal choice depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.

Robustness means the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal.

Power means the ability of the test to detect unequal variances when the variances are in fact unequal.
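The median-based variant can be sketched as follows. This is a minimal illustration under stated assumptions: brown_forsythe_statistic is a hypothetical helper that compares two or more samples, whereas how the package builds the population side of the comparison from population_cdf is not shown in this reference and is not reproduced here.

```python
from statistics import mean, median


def brown_forsythe_statistic(*groups):
    """Median-based Levene (Brown-Forsythe) W statistic for k >= 2 groups.

    Hypothetical helper for illustration only.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group median.
    z = []
    for g in groups:
        center = median(g)
        z.append([abs(v - center) for v in g])
    z_group_means = [mean(zi) for zi in z]
    z_grand_mean = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations.
    between = sum(
        len(zi) * (zm - z_grand_mean) ** 2 for zi, zm in zip(z, z_group_means)
    )
    within = sum(
        (v - zm) ** 2 for zi, zm in zip(z, z_group_means) for v in zi
    )
    return (n_total - k) / (k - 1) * between / within


narrow = [1.0, 2.0, 3.0, 4.0, 5.0]
wide = [-10.0, -5.0, 0.0, 5.0, 10.0]
w_same = brown_forsythe_statistic(narrow, narrow)
w_diff = brown_forsythe_statistic(narrow, wide)
```

Under the null hypothesis W follows approximately an F distribution with (k - 1, N - k) degrees of freedom, so a large W relative to that distribution indicates unequal variances.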

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the sample variance equals the population variance (two-tailed p-value > significance_level).

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma, alpha
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(shape,),
                population_loc=0,
                population_scale=scale,
                significance_level=0.05)
True
>>> a = 1.99
>>> x = gamma.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
True
>>> x = alpha.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
False

Reject the null hypothesis at a significance level of 5%, concluding that the variance of time_series_data differs from that of the gamma distribution with shape parameter a.

Example:

>>> levene_test(x,
                population_cdf='alpha',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
True
kim_convergence.stats.modified_periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray

Compute a modified periodogram to estimate the power spectrum.

Estimate the power spectrum using a modified periodogram. A periodogram [heidelberger1981] is an estimate of the spectral density of a signal and it is defined as,

\[\left \{ I\left(\frac{k}{n}\right) \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor},\; I\left( \frac{k}{n} \right) = \left| \sum_{j=0}^{j=n-1} {x(j) e^{-2\pi i j k / n}} \right|^2 / n\]
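The definition above can be evaluated directly with a discrete Fourier sum. The following is a minimal pure-Python sketch of that formula only; periodogram_sketch is a hypothetical name, and any modification, mean removal, or FFT acceleration performed by the package is not reproduced.

```python
import cmath
import math


def periodogram_sketch(x):
    """Evaluate I(k/n) = |sum_j x(j) exp(-2 pi i j k / n)|^2 / n for k = 1..n//2.

    Hypothetical helper for illustration only (O(n^2) direct sum).
    """
    n = len(x)
    values = []
    for k in range(1, n // 2 + 1):
        s = sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        values.append(abs(s) ** 2 / n)
    return values


# A pure cosine with one cycle over n = 8 points puts all power at k = 1.
n = 8
x = [math.cos(2 * math.pi * j / n) for j in range(n)]
spectrum = periodogram_sketch(x)
```

For long series the direct sum is slow, which is why an FFT-based evaluation is preferable in practice.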
Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

  • with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray

Computed modified periodogram array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size
Raises:

CRError – If input validation fails.

kim_convergence.stats.moment(x: ndarray | list[float], *, moment: int = 1) float

Calculate the k-th moment about the mean for a sample.

Parameters:
  • x (array_like, 1d) – Time series data.

  • moment (int, optional) – Order of central moment that is returned. (default: 1)

Returns:

float

k-th central moment.

Note

The k-th central moment of time series data is

\[m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k,\]

where \(n\) is the number of samples and \(\bar{x}\) is the mean.
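The formula translates almost line by line into code. A minimal sketch follows; central_moment is a hypothetical name used only for illustration.

```python
def central_moment(x, k=1):
    """Return m_k = (1/n) * sum_i (x_i - mean)^k.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** k for v in x) / n


data = [1.0, 2.0, 3.0, 4.0]
m1 = central_moment(data, 1)  # the first central moment is always zero
m2 = central_moment(data, 2)  # the biased (1/n) sample variance
```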

kim_convergence.stats.normal_interval(confidence_level: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]

Compute the normal distribution confidence interval.

Compute the normal-distribution confidence interval with equal areas around the median.

Parameters:
  • confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).

  • loc (float, optional) – Location parameter. (default: 0.0)

  • scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

tuple[float, float]

Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the distribution.

Note

  • Confidence interval is a range of values that is likely to contain an unknown population parameter.

  • Confidence level is the percentage of the confidence intervals that contain the population parameter.

  • The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.
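The relation between the confidence level, alpha, and the interval bounds can be sketched with the standard library. This is an illustration only; normal_interval_sketch is a hypothetical name and does not claim to match the package's internals.

```python
from statistics import NormalDist


def normal_interval_sketch(confidence_level, loc=0.0, scale=1.0):
    """Equal-tailed interval holding confidence_level of a normal distribution.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0.0 and 1.0")
    alpha = 1.0 - confidence_level  # significance level
    dist = NormalDist(mu=loc, sigma=scale)
    # Leave alpha / 2 of the probability mass in each tail.
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


lower, upper = normal_interval_sketch(0.95)
```

For a 95% confidence level the bounds are the familiar values of roughly -1.96 and 1.96 standard deviations around the mean.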

kim_convergence.stats.normal_inv_cdf(p: float, *, loc=0.0, scale: float = 1.0) float

Compute the normal distribution inverse cumulative distribution function.

Parameters:
  • p (float) – Probability (must be between 0.0 and 1.0).

  • loc (float, optional) – Location parameter. (default: 0.0)

  • scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray

Compute a periodogram to estimate the power spectrum.

Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

  • with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray

Computed power spectrum array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor + 1}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size
kim_convergence.stats.randomness_test(x: ndarray | list[float], significance_level: float) bool

Test for independence of observations.

The von Neumann ratio test checks the independence of successive observations.

The null hypothesis is that the data are independent and normally distributed.

Parameters:
  • x (array_like, 1d) – Time series data.

  • significance_level (float) – Probability threshold below which the null hypothesis is rejected.

Returns:

bool

True if the observations are independent.

Note

Given a series \(x\) of \(n\) data points, the von Neumann test [vonneumann1941] [vonneumann1941b] statistic is

\[v = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Under the null hypothesis of independence, the mean \(\bar{v} = 2\) and the variance \(\sigma^2_v = \frac{4 (n - 2)}{(n^2-1)}\) (see [williams1941], and [madansky1988] for a simple derivation).
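The statistic and its null moments can be combined into a simple decision rule. The sketch below is an assumption-laden illustration: it uses a normal approximation for the distribution of \(v\), and both function names are hypothetical; the package may compute the p-value differently.

```python
import math
from statistics import NormalDist


def von_neumann_ratio(x):
    """Return v = sum (x_i - x_{i-1})^2 / sum (x_i - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    numerator = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((v - mean) ** 2 for v in x)
    return numerator / denominator


def randomness_test_sketch(x, significance_level=0.05):
    """Return True if independence is not rejected (normal approximation).

    Hypothetical helper for illustration only.
    """
    n = len(x)
    v = von_neumann_ratio(x)
    # Null mean is 2 and null variance is 4 (n - 2) / (n^2 - 1).
    sigma = math.sqrt(4.0 * (n - 2) / (n * n - 1))
    z = (v - 2.0) / sigma
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value > significance_level


v_ramp = von_neumann_ratio([1.0, 2.0, 3.0, 4.0, 5.0])
ramp_is_random = randomness_test_sketch([float(i) for i in range(20)])
```

A steadily increasing series has successive differences that are small relative to its overall spread, so \(v\) falls far below 2 and independence is rejected.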

kim_convergence.stats.s_normal_inv_cdf(p: float) float

Compute the standard normal distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for standard normal distribution [pythonstats], [wichura1988].

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.skew(x: ndarray | list[float], *, bias: bool = False) float

Compute the time series data set skewness [zwillinger2000].

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Parameters:
  • x (array_like, 1d) – Time series data.

  • bias (bool, optional) – If False, then the calculations are corrected for statistical bias. (default: False)

Returns:

float

The skewness.

Note

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

The sample skewness is computed as the Fisher-Pearson coefficient of skewness \(g_1 = \frac{m_3}{m_2^{3/2}}\), where \(m_i\) is the biased sample \(i\texttt{th}\) central moment. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.

\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}.\]
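Both coefficients can be computed from the biased central moments. A minimal pure-Python sketch follows; skewness_sketch is a hypothetical name, and its bias argument mirrors the documented parameter only for illustration.

```python
import math


def skewness_sketch(x, bias=False):
    """Return g1 if bias is True, otherwise the adjusted coefficient G1.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n  # biased second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # biased third central moment
    g1 = m3 / m2 ** 1.5
    if bias:
        return g1
    # Bias correction factor sqrt(N (N - 1)) / (N - 2); requires N > 2.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


data = [1.0, 2.0, 3.0, 10.0]
g1 = skewness_sketch(data, bias=True)
big_g1 = skewness_sketch(data, bias=False)
```

The single large value on the right makes both coefficients positive, and the correction factor enlarges the estimate for small samples.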
kim_convergence.stats.t_cdf(t: float, df: float) float

Compute the cumulative distribution of the t-distribution.

The cumulative distribution of the t-distribution for t > 0, can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:
  • t (float) – Upper limit of the integration.

  • df (float) – Degrees of freedom, must be a positive number.

Returns:

float

Cumulative t-distribution.

kim_convergence.stats.t_cdf_ccdf(t: float, df: float) tuple[float, float]

Compute the cumulative distribution of the t-distribution and its complement.

The cumulative distribution of the t-distribution for t > 0, can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:
  • t (float) – Upper limit of the integration.

  • df (float) – Degrees of freedom, must be a positive number.

Returns:

tuple[float, float]

cdf: cumulative t-distribution value. ccdf: complement of the cumulative t-distribution (1 - cdf).

kim_convergence.stats.t_interval(confidence_level: float, df: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]

Compute the t-distribution confidence interval.

Compute the t-distribution confidence interval with equal areas around the median.

Parameters:
  • confidence_level (float) – Confidence level (or confidence coefficient); must be between 0.0 and 1.0.

  • df (float) – Degrees of freedom, must be > 0.

  • loc (float, optional) – location parameter (default: 0.0)

  • scale (float, optional) – scale parameter (default: 1.0)

Returns:

tuple[float, float]

Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the t-distribution.

Note

  • Confidence interval is a range of values that is likely to contain an unknown population parameter.

  • Confidence level is the percentage of the confidence intervals that contain the population parameter.

  • The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.

kim_convergence.stats.t_inv_cdf(p: float, df: float, *, loc: float = 0.0, scale: float = 1.0, _tol: float = 1e-08, _atol: float = 1e-50, _rtinf: float = 1e+100) float

Compute the t-distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for t-distributions with df degrees of freedom. Inverse cumulative distribution function finds the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability.

Parameters:
  • p (float) – Probability (must be between 0.0 and 1.0)

  • df (float) – Degrees of freedom, must be > 1.

  • loc (float, optional) – location parameter (default: 0.0)

  • scale (float, optional) – scale parameter (default: 1.0)

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.t_test(sample_mean: float, sample_std: float, sample_size: int, population_mean: float, significance_level: float = 0.050000000000000044) bool

T-test for the mean.

Calculate the T-test for the mean. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations x is equal to the given population mean, population_mean.

Parameters:
  • sample_mean (float) – Sample mean.

  • sample_std (float) – Sample standard deviation.

  • sample_size (int) – Number of samples.

  • population_mean (float) – Expected value in the null hypothesis.

  • significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool

True if the expected value (mean) of a sample of independent observations x equals the given population mean population_mean.

kim_convergence.stats.wilcoxon_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Calculate the Wilcoxon signed-rank test.

Here it is used as a non-parametric test to determine whether an unknown population mean is different from a specific value.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the sample is drawn from the specified population distribution.

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> wilcoxon_test(x,
                  population_cdf='gamma',
                  population_args=(shape,),
                  population_loc=0,
                  population_scale=scale,
                  significance_level=0.05)
True
>>> wilcoxon_test(x,
                  population_cdf='gamma',
                  population_args=(shape,),
                  population_loc=0,
                  population_scale=1,
                  significance_level=0.05)
False

Time Series Functions

Time series module.

kim_convergence.timeseries.estimate_equilibration_length(time_series_data: ndarray | list[float], *, si: str | None = None, nskip: int | None = 1, fft: bool = True, minimum_correlation_time: int | None = None, ignore_end: int | float | None = None, number_of_cores: int = 1, batch_size: int = 5, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) tuple[int, float]

Estimate the equilibration point in time series data.

Estimate the equilibration point in time series data using the statistical inefficiencies [chodera2016], [geyer1992], [geyer2011].

Parameters:
  • time_series_data (array_like, 1d) – Time-series data.

  • si (Optional[str], optional) – Statistical-inefficiency method. (default: None)

  • nskip (Optional[int], optional) – Number of data points to skip. (default: 1)

  • fft (bool, optional) – Use FFT convolution for long series. (default: True)

  • minimum_correlation_time (Optional[int], optional) – Minimum correlation-time window; algorithm stops when correlation first goes negative. (default: None)

  • ignore_end (Optional[Union[int, float]], optional) – If int, last points to ignore; if float in (0, 1), fraction to ignore; if None, uses one fourth of data. (default: None)

  • number_of_cores (int, optional) – The maximum number of concurrently running jobs, such as the number of Python worker processes or the size of the thread-pool. If -1, all CPUs are used. If 1, no parallel computing code is used at all. For values below -1, (n_cpus + 1 + number_of_cores) CPUs are used; thus for number_of_cores = -2, all CPUs but one are used. (default: 1)

Returns:

tuple[int, float]

equilibration_index: index where the equilibrated region starts. statistical_inefficiency: statistical inefficiency estimate of the time series at the equilibration index.

Note

batch_size, scale, with_centering, and with_scaling are accepted for API compatibility but are not used by this method.
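One common way to locate the equilibration point from statistical inefficiencies is to pick the starting index that maximizes the number of effectively uncorrelated samples in the remaining data. The sketch below illustrates that idea only and is an assumption about the approach, not the package's code: both helpers are hypothetical, the autocorrelation sum is truncated at the first non-positive value, and the default of ignoring the last quarter of the data follows the ignore_end description above.

```python
def si_sketch(x):
    """Truncated-sum estimate of the statistical inefficiency (hypothetical)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    if c0 == 0.0:
        return float(n)  # constant data: treat as fully correlated
    si = 1.0
    for t in range(1, n):
        c_t = sum(d[i] * d[i + t] for i in range(n - t)) / c0
        if c_t <= 0.0:
            break  # stop at the first non-positive correlation
        si += 2.0 * (1.0 - t / n) * c_t
    return max(1.0, si)


def equilibration_sketch(x, nskip=1, ignore_end=None):
    """Return (index, si) maximizing the effective sample size of x[index:]."""
    n = len(x)
    if ignore_end is None:
        ignore_end = max(1, n // 4)
    best_index, best_si, best_n_eff = 0, 1.0, 0.0
    for t0 in range(0, n - ignore_end, nskip):
        si = si_sketch(x[t0:])
        n_eff = (n - t0) / si
        if n_eff > best_n_eff:
            best_index, best_si, best_n_eff = t0, si, n_eff
    return best_index, best_si


# A decaying transient followed by small fluctuations around zero.
series = [5.0, 4.0, 3.0, 2.0, 1.0] + [0.1, -0.1] * 10
index, si = equilibration_sketch(series)
```

Discarding the transient removes the long-lived correlation it introduces, so the remaining data yield more effective samples even though fewer points are kept.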

kim_convergence.timeseries.geyer_r_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) float

Compute the statistical inefficiency.

Compute the statistical inefficiency using the Geyer’s [geyer1992], [geyer2011] initial monotone sequence criterion.

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Note

The effective sample size is computed by:

\[\begin{split}\hat{N}_{eff} &= \frac{N}{si} \\ si &= -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\end{split}\]

where \(N\) is the number of data points. \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\), where \(\hat{\rho}_{t'}\) is the estimated auto-correlation at lag \(t'\), and \(m\) is the last integer for which \(\hat{P}_{t'}\) is still positive (largest \(m\) such that \(\hat{P}_{t'} > 0,~t'=1,\cdots,m\)). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

The current implementation is similar to Stan [mcstan], which uses Geyer’s initial monotone sequence criterion (Geyer, 1992 [geyer1992]; Geyer, 2011 [geyer2011]).
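The initial monotone sequence estimator described above can be sketched in pure Python. This is an illustration under stated assumptions, not the package's code: geyer_si_sketch is a hypothetical name, the auto-correlation uses the biased (1/N) estimator, constant data return the series length as described in the note above, and the result is clamped to be at least 1.

```python
import math


def geyer_si_sketch(x):
    """Statistical inefficiency from Geyer's initial monotone sequence.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n < 4:
        raise ValueError("at least four data points are required")
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    if c0 == 0.0:
        return float(n)  # constant series

    def rho(lag):
        # Biased auto-correlation estimate at the given lag.
        return sum(d[i] * d[i + lag] for i in range(n - lag)) / c0

    total = 0.0
    previous = math.inf
    for t in range(n // 2):
        pair = rho(2 * t) + rho(2 * t + 1)  # P_t = rho_{2t} + rho_{2t+1}
        if pair <= 0.0:
            break  # initial positive sequence ends here
        pair = min(pair, previous)  # enforce a monotone non-increasing sequence
        total += pair
        previous = pair
    return max(1.0, -1.0 + 2.0 * total)


si_alternating = geyer_si_sketch([1.0, -1.0] * 4)
si_ramp = geyer_si_sketch([float(i) for i in range(16)])
n_eff_ramp = 16 / si_ramp  # effective sample size N / si
```

An anti-correlated series yields the minimum value of 1, while a slowly varying series yields a larger value and therefore fewer effective samples.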

Parameters:
  • x (array_like, 1d) – time series data. Using this method, statistical inefficiency cannot be estimated with fewer than four data points.

  • y (array_like, 1d, optional) – time series data. If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float

Estimated statistical inefficiency \(si \geq 1\), equal to \(si = -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\), where \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\).

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

kim_convergence.timeseries.geyer_split_r_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) float

Compute the statistical inefficiency.

Compute the statistical inefficiency using the split-r method of Geyer’s [geyer1992], [geyer2011] initial monotone sequence criterion.

Note

The effective sample size is computed by:

\[\begin{split}\hat{N}_{eff} &= \frac{N}{si} \\ si &= -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\end{split}\]

where \(N\) is the number of data points. \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\), where \(\hat{\rho}_{t'}\) is the estimated auto-correlation at lag \(t'\), and \(m\) is the last integer for which \(\hat{P}_{t'}\) is still positive (largest \(m\) such that \(\hat{P}_{t'} > 0,~t'=1,\cdots,m\)). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

The current implementation is similar to Stan [mcstan], which uses Geyer’s initial monotone sequence criterion (Geyer, 1992 [geyer1992]; Geyer, 2011 [geyer2011]).

Parameters:
  • x (array_like, 1d) – time series data. Using this method, statistical inefficiency cannot be estimated with fewer than eight data points.

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float

Estimated statistical inefficiency \(si \geq 1\), equal to \(si = -1 + 2 \sum_{t'=0}^m \hat{P}_{t'}\), where \(\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}\).

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

kim_convergence.timeseries.geyer_split_statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) float

Compute the statistical inefficiency.

Computes the effective sample size. The value returned is the minimum of effective sample size and the data size times log10(data size).

Note

The effective sample size cannot be estimated with fewer than four samples.

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Parameters:
  • x (array_like, 1d) – time series data.

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

Returns:

float

Estimated statistical inefficiency \(si \geq 1\).

Note

minimum_correlation_time is accepted for API compatibility but is not used by this method.

kim_convergence.timeseries.integrated_auto_correlation_time(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) float

Estimate the integrated auto-correlation time.

The statistical inefficiency \(si\) of the observable \(x\) of a time series \(\left \{X\right \}_{t=0}^n\) is formally defined as, \(si \equiv 1 + 2\tau\), where \(\tau\) denotes the integrated auto-correlation time.

Parameters:
  • x (array_like, 1d) – time series data.

  • y (array_like, 1d, optional) – time series data. (default: None) If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x.

  • si (float, or str, optional) – estimated statistical inefficiency, or a method of computing the statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float

Estimated integrated auto-correlation time \(\tau\).

kim_convergence.timeseries.statistical_inefficiency(x: ndarray | list[float], y: ndarray | list[float] | None = None, *, fft: bool = True, minimum_correlation_time: int | None = None) float

Compute the statistical inefficiency.

The statistical inefficiency \(si\) of the observable \(x\) of a time series \(\{X\}_{t=0}^n\) is formally defined as,

\[\begin{split}si &\equiv 1 + 2\tau \\ \tau &\equiv \sum_{t=0}^n {\left( 1 - \frac{t}{n} \right) C\left(t\right)} \\ C\left(t\right) &\equiv \frac{<x(X_{t_0})x(X_{t_0+t})> - {<x>}^2}{<x^2>-{<x>}^2}\end{split}\]

where \(\tau\) denotes the integrated auto-correlation time and \(C\left(t\right)\) is the normalized fluctuation auto-correlation function of the observable \(x\)
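A common truncated estimator of these quantities is sketched below. It is an illustration under stated assumptions, not the package's code: statistical_inefficiency_sketch is a hypothetical name, the lag sum starts at \(t = 1\) and stops at the first non-positive \(C(t)\), and constant data return the series length as described in the note below.

```python
def statistical_inefficiency_sketch(x):
    """Estimate si = 1 + 2 tau from the normalized auto-correlation C(t).

    Hypothetical helper for illustration only.
    """
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    if c0 == 0.0:
        return float(n)  # constant series
    tau = 0.0
    for t in range(1, n):
        # Normalized fluctuation auto-correlation at lag t.
        c_t = sum(d[i] * d[i + t] for i in range(n - t)) / c0
        if c_t <= 0.0:
            break  # assumed stopping rule: first non-positive C(t)
        tau += (1.0 - t / n) * c_t
    return max(1.0, 1.0 + 2.0 * tau)


si_alternating = statistical_inefficiency_sketch([1.0, -1.0] * 8)
si_ramp = statistical_inefficiency_sketch([float(i) for i in range(16)])
tau_ramp = (si_ramp - 1.0) / 2.0  # integrated auto-correlation time
```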

Note

The behavior is updated. Suppose the time series data is an array of (constant) numbers with standard deviation close to zero within abs_tol=1e-18, where abs(a) <= max(1e-9 * abs(a), abs_tol). In that case, this function returns the statistical inefficiency as the size of the time series data array.

Parameters:
  • x (array_like, 1d) – time series data.

  • y (array_like, 1d, optional) – time series data. If it is passed to this function, the cross-correlation of timeseries x and y will be estimated instead of the auto-correlation of timeseries x. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float

Estimated statistical inefficiency \(si \geq 1\), equal to \(1 + 2\tau\), where \(\tau\) denotes the integrated auto-correlation time.

kim_convergence.timeseries.time_series_data_si(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) float

Helper method to compute or return the statistical inefficiency value.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

float

Estimated statistical inefficiency value \(si \geq 1\).

kim_convergence.timeseries.time_series_data_uncorrelated_block_averaged_samples(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None) ndarray

Return average value for each block after blocking the data.

First, the time series data is broken down into a series of blocks, where each block contains si successive data points. If si (statistical inefficiency) is not provided, it is computed. Then the average value of each block is determined. This coarse-graining approach is commonly used for thermodynamic properties.
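The blocking step can be sketched in a few lines. This is an illustration only: block_averages is a hypothetical name, and both rounding si up to an integer block length and dropping an incomplete trailing block are assumptions, since the package's exact handling is not stated here.

```python
import math


def block_averages(data, si):
    """Average consecutive blocks of ceil(si) points (hypothetical helper)."""
    block = max(1, math.ceil(si))  # assumed rounding of si to a block length
    n_blocks = len(data) // block  # assumed: incomplete last block is dropped
    return [
        sum(data[i * block:(i + 1) * block]) / block for i in range(n_blocks)
    ]


averages = block_averages([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], si=2.5)
```

With si = 2.5 the block length becomes 3, giving two complete blocks and leaving the seventh point unused.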

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray

Uncorrelated sample of the time series data: the average value of each block after blocking the time series data.

kim_convergence.timeseries.time_series_data_uncorrelated_random_samples(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None) ndarray

Return one randomly chosen value from each block after blocking the data.

First, the time series data is broken down into a series of blocks, where each block contains si successive data points. If si (statistical inefficiency) is not provided, it is computed. Then a single value is taken at random from each block.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray

Uncorrelated sample of the time series data: one random value from each block after blocking the time series data.

kim_convergence.timeseries.time_series_data_uncorrelated_samples(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None) ndarray

Return time series data at uncorrelated sample indices.

Subsample a correlated timeseries to extract an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided it will be computed.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency.

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. must be monotonically increasing. If None they are computed automatically. (default: None)

Returns:

1darray

Uncorrelated sample of the time series data: the time series data at the uncorrelated sample indices.

kim_convergence.timeseries.uncorrelated_time_series_data_sample_indices(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None) ndarray

Return indices of uncorrelated subsamples of the time series data.

Subsample a correlated time series to extract the indices of an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided, it is computed.
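A simple way to picture the subsampling is to keep one index per correlation length. The sketch below is an assumption, not the package's algorithm: uncorrelated_indices_sketch is a hypothetical name, and using a fixed stride of ceil(si) is one possible rounding choice.

```python
import math


def uncorrelated_indices_sketch(n, si):
    """Indices spaced ceil(si) apart over a series of length n (hypothetical)."""
    stride = max(1, math.ceil(si))  # assumed rounding of si to an index stride
    return list(range(0, n, stride))


indices = uncorrelated_indices_sketch(7, si=2.5)
data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
subsample = [data[i] for i in indices]
```

The resulting indices are monotonically increasing, which matches the requirement stated for uncorrelated_sample_indices elsewhere in this section.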

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

Returns:

1darray

Indices of uncorrelated subsamples of the time series data.
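
The subsampling idea can be illustrated without the library. The sketch below assumes a statistical inefficiency si >= 1 is already known and picks indices spaced about si steps apart. stride_sample_indices is a hypothetical helper, and the rounding rule (ceiling) is an assumption, not necessarily what kim_convergence does internally.

```python
import math


def stride_sample_indices(n_samples, si):
    """Hypothetical helper: indices spaced roughly `si` steps apart.

    Assumes si >= 1 so that the returned indices are strictly increasing.
    """
    if si < 1:
        raise ValueError("si must be at least 1")
    indices = []
    k = 0
    while True:
        # Round each multiple of si up to the next integer index (an assumption).
        index = math.ceil(k * si)
        if index >= n_samples:
            break
        indices.append(index)
        k += 1
    return indices


print(stride_sample_indices(10, 2.5))
```

With si = 1 every index is kept; larger values of si keep proportionally fewer samples.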

kim_convergence.timeseries.uncorrelated_time_series_data_samples(time_series_data: ndarray | list[float], *, si: str | float | int | None = None, fft: bool = True, minimum_correlation_time: int | None = None, uncorrelated_sample_indices: ndarray | list[int] | None = None, sample_method: str | None = None) ndarray

Return time series data sampled at the uncorrelated sample indices, using the given sample_method.

Subsample a correlated timeseries to extract an effectively uncorrelated dataset. If si (statistical inefficiency) is not provided it will be computed.

Parameters:
  • time_series_data (array_like, 1d) – time series data.

  • si (float, or str, optional) – estimated statistical inefficiency. (default: None)

  • fft (bool, optional) – if True, use FFT convolution. FFT should be preferred for long time series. (default: True)

  • minimum_correlation_time (int, optional) – minimum amount of correlation function to compute. The algorithm terminates after computing the correlation time out to minimum_correlation_time when the correlation function first goes negative. (default: None)

  • uncorrelated_sample_indices (array_like, 1d, optional) – indices of uncorrelated subsamples of the time series data. (default: None)

  • sample_method (str, optional) – sampling method, one of uncorrelated, random, or block_averaged. (default: None)

Returns:

1darray

Time series data sampled at the uncorrelated sample indices.
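
One plausible reading of the three sample_method names is sketched below in plain Python. It is an illustration only: sample_blocks is a hypothetical helper, and the way blocks are delimited (each block runs from one index up to the next, with the last block running to the end of the data) is an assumption rather than the documented behavior of the library.

```python
import random
from statistics import fmean


def sample_blocks(data, indices, sample_method="uncorrelated", seed=None):
    """Hypothetical helper: draw one value per block of `data`.

    `indices` must be strictly increasing and smaller than len(data).
    """
    # Block i covers data[indices[i]:indices[i + 1]]; the last block runs to the end.
    bounds = list(indices) + [len(data)]
    blocks = [data[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    if sample_method == "uncorrelated":
        # Take the value sitting exactly at each index.
        return [block[0] for block in blocks]
    if sample_method == "random":
        # Take one randomly chosen value from each block.
        rng = random.Random(seed)
        return [rng.choice(block) for block in blocks]
    if sample_method == "block_averaged":
        # Replace each block by its arithmetic mean.
        return [fmean(block) for block in blocks]
    raise ValueError(f"unknown sample_method: {sample_method!r}")


data = [float(i) for i in range(10)]
print(sample_blocks(data, [0, 3, 5, 8], "block_averaged"))
```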

Utility Functions

batch

kim_convergence.batch(time_series_data: ~numpy.ndarray | list, *, batch_size: int = 5, func: ~typing.Callable[[...], ~numpy.ndarray] = <function mean>, scale: str = 'translate_scale', with_centering: bool = False, with_scaling: bool = False) ndarray

Batch the time series data.

Parameters:
  • time_series_data (array_like, 1d) – Time series data.

  • batch_size (int, optional) – batch size. (default: 5)

  • func (callable, optional) – Reduction function capable of receiving a single axis argument. It is called with time_series_data as first argument. (default: np.mean)

  • scale (str, optional) – A method to standardize a dataset. (default: ‘translate_scale’)

  • with_centering (bool, optional) – If True, center time_series_data using the centering approach of the scale method. (default: False)

  • with_scaling (bool, optional) – If True, scale the data using the scaling approach of the scale method. (default: False)

Returns:

1darray

Batched (and optionally rescaled) data.

Note

This function truncates the data: the trailing data points left over after dividing the number of data points by batch_size are discarded.

Note

By default, this function uses np.mean and computes the arithmetic mean of each batch.

Example:

>>> import numpy as np
>>> from kim_convergence import batch
>>> rng = np.random.RandomState(12345)
>>> x = np.ones(100) * 10 + (rng.random_sample(100) - 0.5)
>>> x_batch = batch(x, batch_size=5)
>>> x_batch.size
20
>>> print(x.mean(), x_batch.mean())
10.054804081191616 10.054804081191616
>>> x_batch_scaled = batch(x, batch_size=5,
...                        scale='translate_scale',
...                        with_scaling=True)
>>> x_batch_scaled.size
20
>>> print(x.mean(), x_batch_scaled.mean())
10.054804081191616 1.0
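
The truncation described in the note can be seen in a small pure-Python sketch. batch_means is a hypothetical stand-in that only mimics the default reduction (the arithmetic mean) and ignores the scaling options. It is not the library implementation.

```python
from statistics import fmean


def batch_means(data, batch_size=5):
    """Hypothetical helper: mean of each full batch, dropping the remainder."""
    n_batches = len(data) // batch_size
    if n_batches == 0:
        raise ValueError("not enough data points for a single batch")
    # Only the first n_batches * batch_size points are used.
    return [
        fmean(data[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]


# Twelve points with batch_size=5 give two batches; the last two points are dropped.
print(batch_means(list(range(1, 13)), batch_size=5))
```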

outlier_test

kim_convergence.outlier_test(x: ndarray | list[float], outlier_method: str = 'iqr') ndarray | None

Detect outliers in the data.

The intuitive definition for the concept of an outlier in the data is a point that significantly deviates from its expected value. Therefore, given a time series (or a random sample from a population), a point can be declared an outlier if the distance to its expected value is higher than a predefined threshold (\(|x_i - E(x)| > \tau\)), where \(x_i\) is the observed data point, and \(E(x)\) is its expected value.

The methods based on this strategy are the most common approaches in the literature. These methods intend to detect outliers, but it is up to the analyst to decide if the detected points are real outliers. Thus it is necessary to characterize standard data points before removing any outliers detected by these approaches.

Parameters:
  • x (array_like, 1d) – Time series data.

  • outlier_method (str, optional) – Method for outlier detection. (default: ‘iqr’)

Returns:

Optional[ndarray]

Indices of outliers; None if no outliers found.
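
As an illustration of a threshold-based test, the sketch below applies the conventional interquartile-range rule (points farther than 1.5 × IQR outside the quartiles). iqr_outlier_indices and the whisker parameter are hypothetical; the exact thresholds used by outlier_method='iqr' are not stated in this reference, so the numbers here are an assumption.

```python
from statistics import quantiles


def iqr_outlier_indices(x, whisker=1.5):
    """Hypothetical helper: indices of points outside the IQR fences, or None."""
    # 'inclusive' quartiles use linear interpolation between data points.
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    outliers = [i for i, value in enumerate(x) if value < lower or value > upper]
    # Mirror the documented return convention: None when nothing is flagged.
    return outliers or None


print(iqr_outlier_indices([10, 12, 11, 13, 12, 11, 50, 12]))
```

As the text above notes, a flagged point is only a candidate; whether it is a real outlier remains the analyst's decision.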

Scaler classes

class kim_convergence.MinMaxScale(*, feature_range: tuple[float, float] = (0, 1))

Standardize/Transform a dataset by scaling it to a given range.

This estimator scales and translates a dataset such that it is in the given range, e.g. between zero and one.

The transformation is given by:

\[\begin{split}x_{\text{std}} = \frac{x - \min(x)}{\max(x) - \min(x)} \\ \text{scaled}_x = x_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}\end{split}\]

where min, max = feature_range.
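
The two-step transformation can be checked by hand with a short pure-Python sketch. minmax_scale_sketch and minmax_inverse_sketch are hypothetical names used only for illustration; they follow the formula above and are not the library code.

```python
def minmax_scale_sketch(x, feature_range=(0.0, 1.0)):
    """Hypothetical helper: scale x into feature_range with the formula above."""
    lo, hi = feature_range
    x_min, x_max = min(x), max(x)
    span = x_max - x_min
    if span == 0:
        raise ValueError("constant data cannot be min-max scaled")
    # x_std = (x - min(x)) / (max(x) - min(x)), then stretch into [lo, hi].
    return [(value - x_min) / span * (hi - lo) + lo for value in x]


def minmax_inverse_sketch(scaled, x_min, x_max, feature_range=(0.0, 1.0)):
    """Hypothetical helper: undo minmax_scale_sketch given the original min and max."""
    lo, hi = feature_range
    return [(value - lo) / (hi - lo) * (x_max - x_min) + x_min for value in scaled]


scaled = minmax_scale_sketch([-1.0, 3.0, 100.0])
print(scaled)
print(minmax_inverse_sketch(scaled, -1.0, 100.0))
```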

Parameters:

feature_range (tuple, optional) – tuple (min, max). Desired range of transformed data. (default: (0, 1))

Examples:

>>> from kim_convergence import MinMaxScale, minmax_scale
>>> data = [-1., 3.]
>>> mms = MinMaxScale()
>>> scaled_x = mms.scale(data)
>>> print(scaled_x)
[0. 1.]
>>> x = mms.inverse(scaled_x)
>>> print(x)
[-1.  3.]
>>> data = [-1., 3., 100.]
>>> scaled_x = minmax_scale(data)
>>> print(scaled_x)
[0.         0.03960396 1.        ]
>>> mms = MinMaxScale()
>>> scaled_x = mms.scale(data)
>>> x = mms.inverse(scaled_x)
>>> print(x)
[ -1.   3. 100.]

inverse(x: ndarray) ndarray

Undo the scaling of a dataset, restoring its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Transformed data.

scale(x: ndarray | list) ndarray

Standardize a dataset by scaling it to a given range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Scaled dataset to a given range.

class kim_convergence.TranslateScale(*, with_centering: bool = True, with_scaling: bool = True)

Standardize a dataset.

Standardize a dataset by translating it so that \(x[0]=0\) and rescaling it by its overall average so that the values are of O(1) with a good spread.

The translate and scale of a sample x is calculated as:

\[z = \frac{(x - x_0)}{u}\]

where \(x_0\) is \(x[0]\) or \(0\) if with_centering=False, and u is the mean of the samples or \(1\) if with_scaling=False.
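
A pure-Python sketch of this formula follows. It assumes that u is the mean of the samples after translation, because that reading reproduces the class example for data = [1., 2., 2., 2., 3.], which prints [0. 1. 1. 1. 2.]. translate_scale_sketch is a hypothetical name, and the choice of u is an inference from that example rather than a statement about the library internals.

```python
from statistics import fmean


def translate_scale_sketch(x, with_centering=True, with_scaling=True):
    """Hypothetical helper: z = (x - x0) / u, with u taken after translation."""
    x0 = x[0] if with_centering else 0.0
    shifted = [value - x0 for value in x]
    # Assumption: u is the mean of the translated samples.
    u = fmean(shifted) if with_scaling else 1.0
    if u == 0:
        raise ValueError("cannot scale by a zero mean")
    return [value / u for value in shifted]


print(translate_scale_sketch([1.0, 2.0, 2.0, 2.0, 3.0]))
```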

Parameters:
  • with_centering (bool, optional) – If True, use x minus its first element. (default: True)

  • with_scaling (bool, optional) – If True, scale the data to overall averages so that the numbers are of O(1) with a good spread. (default: True)

Examples:

>>> from kim_convergence import TranslateScale
>>> data = [1., 2., 2., 2., 3.]
>>> tsc = TranslateScale()
>>> scaled_x = tsc.scale(data)
>>> print(scaled_x)
[0. 1. 1. 1. 2.]
>>> x = tsc.inverse(scaled_x)
>>> print(x)
[1. 2. 2. 2. 3.]

inverse(x: ndarray) ndarray

Undo the scaling of a dataset, restoring its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Transformed data.

scale(x: ndarray | list) ndarray

Standardize a dataset by translating and scaling it.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Translated and/or scaled dataset.

class kim_convergence.StandardScale(*, with_centering: bool = True, with_scaling: bool = True)

Standardize a dataset.

Standardize a dataset by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

\[z = \frac{(x - u)}{s}\]

where u is the mean of the samples or \(0\) if with_centering=False, and s is the standard deviation of the samples or \(1\) if with_scaling=False.

Centering and scaling happen independently.

Parameters:
  • with_centering (bool, optional) – If True, use x minus its mean, or center the data before scaling. (default: True)

  • with_scaling (bool, optional) – If True, scale the data to unit variance (or equivalently, unit standard deviation). (default: True)

Note

If with_centering=False is set explicitly, only variance scaling is performed on x. A biased estimator is used for the standard deviation.
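
The standard score with a biased (population) standard deviation can be sketched in pure Python as below. standard_scale_sketch is a hypothetical name; statistics.pstdev divides by n, which matches the biased estimator mentioned in the note.

```python
from statistics import fmean, pstdev


def standard_scale_sketch(x, with_centering=True, with_scaling=True):
    """Hypothetical helper: z = (x - u) / s with a biased standard deviation."""
    u = fmean(x) if with_centering else 0.0
    # pstdev is the population (biased) standard deviation, dividing by n.
    s = pstdev(x) if with_scaling else 1.0
    if s == 0:
        raise ValueError("cannot scale data with zero variance")
    return [(value - u) / s for value in x]


print(standard_scale_sketch([-0.5, 6.0]))
```

For data = [-0.5, 6] the mean is 2.75 and the biased standard deviation is 3.25, so the scaled values are -1 and 1.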

Examples:

>>> from kim_convergence import StandardScale
>>> data = [-0.5, 6]
>>> ssc = StandardScale()
>>> scaled_x = ssc.scale(data)
>>> print(scaled_x)
[-1.  1.]
>>> x = ssc.inverse(scaled_x)
>>> print(x)
[-0.5  6. ]

inverse(x: ndarray) ndarray

Undo the scaling of a dataset, restoring its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Transformed data.

scale(x: ndarray | list) ndarray

Standardize a dataset.

Parameters:

x (array_like, 1d) – The data to center and scale.

Returns:

1darray

Scaled and/or Centered dataset.

class kim_convergence.RobustScale(*, with_centering: bool = True, with_scaling: bool = True, quantile_range: tuple[float, float] = (25.0, 75.0))

Standardize a dataset.

Standardize a dataset by centering it on the median and scaling it according to the interquartile range. These statistics are robust to outliers.

This removes the median and scales the data according to the quantile range. The interquartile range (IQR) is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently.

Parameters:
  • with_centering (bool, optional) – If True, center the data before scaling. (default: True)

  • with_scaling (bool, optional) – If True, scale the data. (default: True)

  • quantile_range (tuple, or list, optional) – (q_min, q_max), 0.0 < q_min < q_max < 100.0 (default: (25.0, 75.0) = (1st quartile, 3rd quartile))

Examples:

>>> from kim_convergence import RobustScale
>>> data = [ 4.,  1., -2.]
>>> rsc = RobustScale()
>>> scaled_x = rsc.scale(data)
>>> print(scaled_x)
[ 1.22474487  0.         -1.22474487]
>>> x = rsc.inverse(scaled_x)
>>> print(x)
[ 4.  1. -2.]

inverse(x: ndarray) ndarray

Undo the scaling of a dataset, restoring its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Transformed data.

scale(x: ndarray | list) ndarray

Standardize a dataset using median and quantile range.

Parameters:

x (array_like, 1d) – The data to center and scale.

Returns:

1darray

Scaled dataset.

class kim_convergence.MaxAbsScale

Standardize a dataset to the [-1, 1] range.

Standardize a dataset to the [-1, 1] range such that the maximal absolute value in the data set will be 1.0.
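
In plain Python, the operation amounts to dividing every value by the largest absolute value. maxabs_scale_sketch is a hypothetical name used only for this illustration.

```python
def maxabs_scale_sketch(x):
    """Hypothetical helper: divide x by its maximum absolute value."""
    max_abs = max(abs(value) for value in x)
    if max_abs == 0:
        raise ValueError("cannot scale data whose values are all zero")
    # Signs are preserved, so the result lies in [-1, 1].
    return [value / max_abs for value in x]


print(maxabs_scale_sketch([4.0, 1.0, -9.0]))
```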

Examples:

>>> from kim_convergence import MaxAbsScale
>>> data = [ 4.,  1., -9.]
>>> mas = MaxAbsScale()
>>> scaled_x = mas.scale(data)
>>> print(scaled_x)
[ 0.44444444  0.11111111 -1.        ]
>>> x = mas.inverse(scaled_x)
>>> print(x)
[ 4.  1. -9.]

inverse(x: ndarray) ndarray

Undo the scaling of a dataset, restoring its original range.

Parameters:

x (array_like, 1d) – Time series data.

Returns:

1darray

Transformed data.

scale(x: ndarray | list) ndarray

Scale x by its maximum absolute value.

The maximum absolute value of x is computed online for later scaling, and all of x is processed as a single batch. This is intended for cases when fit() is not feasible because the number of samples is very large or because x is read from a continuous stream.

Parameters:

x (array_like, 1d) – The data to scale.

Returns:

1darray

Scaled dataset.

Convenience functions

minmax_scale

kim_convergence.minmax_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True, feature_range: tuple[float, float] = (0.0, 1.0)) ndarray

Standardize/Transform a dataset by scaling it to a given range.

This estimator scales and translates a dataset such that it is in the given range, e.g. between zero and one.

The transformation is given by:

\[\begin{split}x_{\text{std}} = \frac{x - \min(x)}{\max(x) - \min(x)} \\ \text{scaled}_x = x_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}\end{split}\]

where min, max = feature_range.

Parameters:
  • x (array_like, 1d) – Time series data.

  • feature_range (tuple, optional) – tuple (min, max). Desired range of transformed data. (default: (0, 1))

Returns:

1darray

Scaled dataset to a given range.

Note

with_centering and with_scaling are accepted for API compatibility but are not used by this function.

translate_scale

kim_convergence.translate_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) ndarray

Standardize a dataset.

Standardize a dataset by translating it so that \(x[0]=0\) and rescaling it by its overall average so that the values are of O(1) with a good spread.

The translate and scale of a sample x is calculated as:

\[z = \frac{(x - x_0)}{u}\]

where \(x_0\) is \(x[0]\) or \(0\) if with_centering=False, and u is the mean of the samples or \(1\) if with_scaling=False.

Parameters:
  • x (array_like, 1d) – The data to center and scale.

  • with_centering (bool, optional) – If True, use x minus its first element. (default: True)

  • with_scaling (bool, optional) – If True, scale the data to overall averages so that the numbers are of O(1) with a good spread. (default: True)

Returns:

1darray

Scaled dataset.

standard_scale

kim_convergence.standard_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) ndarray

Standardize a dataset.

Standardize a dataset by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

\[z = \frac{(x - u)}{s}\]

where u is the mean of the samples or \(0\) if with_centering=False, and s is the standard deviation of the samples or \(1\) if with_scaling=False.

Parameters:
  • x (array_like, 1d) – The data to center and scale.

  • with_centering (bool, optional) – If True, use x minus its mean, or center the data before scaling. (default: True)

  • with_scaling (bool, optional) – If True, scale the data to unit variance (or equivalently, unit standard deviation). (default: True)

Returns:

1darray

Scaled dataset.

Note

If with_centering=False is set explicitly, only variance scaling is performed on x. A biased estimator is used for the standard deviation.

robust_scale

kim_convergence.robust_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True, quantile_range: tuple[float, float] = (25.0, 75.0)) ndarray

Standardize a dataset.

Standardize a dataset by centering it on the median and scaling it according to the interquartile range.

Parameters:
  • x (array_like, 1d) – The data to center and scale.

  • with_centering (bool, optional) – If True, center the data before scaling. (default: True)

  • with_scaling (bool, optional) – If True, scale the data. (default: True)

  • quantile_range (tuple, or list, optional) – (q_min, q_max), 0.0 < q_min < q_max < 100.0 (default: (25.0, 75.0) = (1st quartile, 3rd quartile))

Returns:

1darray

Scaled dataset.

maxabs_scale

kim_convergence.maxabs_scale(x: ndarray, *, with_centering: bool = True, with_scaling: bool = True) ndarray

Standardize a dataset to the [-1, 1] range.

Standardize a dataset to the [-1, 1] range such that the maximal absolute value in the data set will be 1.0.

Parameters:

x (array_like, 1d) – The data to center and scale.

Returns:

1darray

Scaled dataset.

Note

with_centering and with_scaling are accepted for API compatibility but are not used by this function.

validate_split

kim_convergence.validate_split(*, n_samples: int, train_size: int | float | None, test_size: int | float | None, default_test_size: int | float | None = None) tuple[int, int]

Validate test/train sizes.

Helper function that checks that the test/train sizes are meaningful with regard to the size of the data (n_samples).

Parameters:
  • n_samples (int) – total number of sample points

  • train_size (int, float, or None) – train size

  • test_size (int, float, or None) – test size

  • default_test_size (int, float, or None, optional) – default test size. (default: None)

Returns:

tuple[int, int]

n_train: number of train points. n_test: number of test points.

Raises:

CRError – If any size is invalid or inconsistent.
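
The kind of bookkeeping involved can be sketched as follows. resolve_split_sizes is a hypothetical helper written for illustration: the rounding rule (test size rounded up, train size rounded down) and the use of ValueError in place of CRError are assumptions, not the library's documented behavior.

```python
import math


def resolve_split_sizes(n_samples, train_size=None, test_size=None,
                        default_test_size=0.1):
    """Hypothetical helper: turn size requests into (n_train, n_test) counts."""
    if train_size is None and test_size is None:
        if default_test_size is None:
            raise ValueError("no size given and no default test size")
        test_size = default_test_size

    def to_count(size, round_up):
        if size is None:
            return None
        if isinstance(size, float):
            # A float is read as a proportion of the data set.
            if not 0.0 < size < 1.0:
                raise ValueError("a float size must lie strictly between 0 and 1")
            scaled = size * n_samples
            return math.ceil(scaled) if round_up else math.floor(scaled)
        # An int is read as an absolute number of samples.
        return int(size)

    n_test = to_count(test_size, round_up=True)
    n_train = to_count(train_size, round_up=False)
    # A missing size is the complement of the other one.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n_samples:
        raise ValueError("inconsistent train/test sizes")
    return n_train, n_test


print(resolve_split_sizes(100))
print(resolve_split_sizes(100, train_size=0.75))
```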

train_test_split

kim_convergence.train_test_split(time_series_data: ndarray | list[float], *, train_size: int | float | None = None, test_size: int | float | None = None, seed: int | RandomState | None = None, default_test_size: int | float | None = 0.1) tuple[ndarray, ndarray]

Split time_series_data into random train and test indices.

Parameters:
  • time_series_data (array_like) – time series data, array-like of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

  • test_size (int, float, or None, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to default_test_size. (default: None)

  • train_size (int, float, or None, optional) – if float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. (default: None)

  • seed (None, int or np.random.RandomState(), optional) – random number seed. (default: None)

  • default_test_size (float, optional) – Default test size. (default: 0.1)

Returns:

tuple[np.ndarray, np.ndarray]

ind_train: training indices. ind_test: testing indices.

Raises:

CRError – If any size is invalid or inconsistent, or if seed has an illegal type.
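
A random index split of this kind can be sketched with the standard library alone. random_split_indices is a hypothetical helper: it takes an already-resolved number of test points, uses random.Random instead of np.random.RandomState, and returns sorted index lists, all of which are illustrative choices rather than the library's behavior.

```python
import random


def random_split_indices(n_samples, n_test, seed=None):
    """Hypothetical helper: disjoint random train and test index lists."""
    if not 0 < n_test < n_samples:
        raise ValueError("n_test must be between 1 and n_samples - 1")
    rng = random.Random(seed)
    permutation = list(range(n_samples))
    rng.shuffle(permutation)
    # The first n_test shuffled indices form the test set, the rest the train set.
    ind_test = sorted(permutation[:n_test])
    ind_train = sorted(permutation[n_test:])
    return ind_train, ind_test


ind_train, ind_test = random_split_indices(20, 2, seed=42)
print(len(ind_train), len(ind_test))
```

Passing the same seed twice yields the same split, which is useful when a run needs to be reproduced.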

Error Classes

CRError

exception kim_convergence.err.CRError(msg)

Raise an exception.

The exception is raised with the error message it receives.

Parameters:

msg (str) – Human-readable error message.

CRSampleSizeError

exception kim_convergence.err.CRSampleSizeError(msg)

Raise an exception if there are not enough samples.