Statistics Module

Overview

The statistics module provides tools for statistical analysis, hypothesis testing, and time series analysis. It’s designed for use in convergence analysis and statistical quality control applications.

Contents

Distribution Functions
Hypothesis Tests
Time Series Analysis
Randomness Tests
Common Usage Patterns
Performance Considerations

Distribution Functions

Beta Distribution

Beta distribution module.

kim_convergence.stats.beta_dist.beta(a: float, b: float) → float

Beta function.

Beta function [numrec2007] is defined as,

\[B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},\]

where \(\Gamma\) is the gamma function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.

Returns:

float: Beta function value.
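The definition above can be sketched with the standard library alone. This is a minimal illustration, assuming positive a and b; the name beta_sketch is hypothetical and is not the package's implementation.

```python
import math


def beta_sketch(a: float, b: float) -> float:
    """Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).

    Hypothetical helper for illustration only; assumes a > 0 and b > 0.
    """
    # Work in log space so that large arguments do not overflow Gamma().
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


# Analytically, B(2, 3) = 1! * 2! / 4! = 1/12 and B(1/2, 1/2) = pi.
value = beta_sketch(2.0, 3.0)
```

Using log-gamma rather than gamma directly is a common design choice because the individual gamma values overflow long before their ratio does.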

kim_convergence.stats.beta_dist.betacf(a: float, b: float, x: float, *, eps: float = 1e-15, max_iteration: int = 200, _fpmin: float = 1e-30) → float

Continued fraction for incomplete beta function by modified Lentz’s method.

Evaluates continued fraction for incomplete beta function by modified Lentz’s method [numrec2007].

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real value between 0.0 and 1.0.
eps (float, optional) – Machine precision epsilon. (default: 1e-15, the value of np.finfo(np.float64).resolution)
max_iteration (int, optional) – Maximum number of iterations. (default: 200)
_fpmin (float, optional) – Minimum floating point precision. (default: 1.0e-30)

Returns:

float: Continued fraction for incomplete beta function.

kim_convergence.stats.beta_dist.betai(a: float, b: float, x: float) → float

Incomplete beta function.

Incomplete beta function [numrec2007] is defined as,

\[I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x~t^{a-1}(1-t)^{b-1}~dt.\]

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real value between 0.0 and 1.0.

Returns:

float: Incomplete beta function value.
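As a rough sketch of how the continued fraction and the incomplete beta function fit together, the following follows the textbook modified Lentz scheme. The names betacf_sketch and betai_sketch are hypothetical, the defaults mirror the documented signature, and the package's own code may differ in details such as error handling.

```python
import math


def betacf_sketch(a, b, x, eps=1e-15, max_iteration=200, fpmin=1e-30):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < fpmin:
        d = fpmin
    d = 1.0 / d
    h = d
    for m in range(1, max_iteration + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) <= eps:
            break  # converged
    return h


def betai_sketch(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for 0 <= x <= 1."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0.0 and 1.0")
    if x == 0.0 or x == 1.0:
        return x
    # Prefactor x^a (1 - x)^b / B(a, b), evaluated in log space.
    bt = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log(1.0 - x)
    )
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where convergence is faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf_sketch(a, b, x) / a
    return 1.0 - bt * betacf_sketch(b, a, 1.0 - x) / b
```

Two closed forms make convenient sanity checks: I_x(1, 1) = x and I_x(2, 1) = x^2.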

kim_convergence.stats.beta_dist.betai_cdf(a: float, b: float, x: float) → float

Calculate the cumulative distribution of the incomplete beta distribution.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration

Returns:

float: Cumulative incomplete beta distribution.

kim_convergence.stats.beta_dist.betai_cdf_ccdf(a: float, b: float, x: float) → tuple[float, float]

Calculate the cumulative distribution of the incomplete beta distribution and its complement.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]

where, \(Beta(a,b)\) is the beta function.

Parameters:

a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration

Returns:

tuple[float, float]: Cumulative incomplete beta distribution and the complement of the cumulative incomplete beta distribution.

The Beta function implementation follows the algorithms described in Numerical Recipes [numrec2007].

Normal Distribution

Normal distribution module.

s_normal_inv_cdf code is adapted from python statistics module [pythonstats] by Yaser Afshar.

kim_convergence.stats.normal_dist.normal_interval(confidence_level: float, *, loc: float = 0.0, scale: float = 1.0) → tuple[float, float]

Compute the normal distribution confidence interval.

Compute the normal-distribution confidence interval with equal areas around the median.

Parameters:

confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

tuple[float, float]: Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the distribution.

Note

Confidence interval is a range of values that is likely to contain an unknown population parameter.
Confidence level is the percentage of the confidence intervals which will hold the population parameter.
The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.
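The relation between confidence level, alpha, and an equal-tailed interval can be sketched with Python's statistics.NormalDist. The function name normal_interval_sketch is hypothetical; it only illustrates the idea and is not the package's implementation.

```python
from statistics import NormalDist


def normal_interval_sketch(confidence_level, loc=0.0, scale=1.0):
    """Equal-tailed interval holding `confidence_level` of N(loc, scale**2)."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0.0 and 1.0")
    alpha = 1.0 - confidence_level  # significance level
    dist = NormalDist(mu=loc, sigma=scale)
    # Leave alpha / 2 of the probability mass in each tail.
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


lower, upper = normal_interval_sketch(0.95)
```

For a 95% confidence level, alpha is 0.05 and the bounds are the familiar values of about -1.96 and +1.96 for the standard normal distribution.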

kim_convergence.stats.normal_dist.normal_inv_cdf(p: float, *, loc=0.0, scale: float = 1.0) → float

Compute the normal distribution inverse cumulative distribution function.

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.normal_dist.s_normal_inv_cdf(p: float) → float

Compute the standard normal distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for standard normal distribution [pythonstats], [wichura1988].

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

The inverse CDF computation uses the algorithm by Wichura [wichura1988].

t-Distribution

t-distribution module.

This module is specialized for the kim-convergence code and is not a general function to be used for other purposes.

kim_convergence.stats.t_dist.t_cdf(t: float, df: float) → float

Compute the cumulative distribution of the t-distribution.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:

t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.

Returns:

float: Cumulative t-distribution.
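The formula above can be checked by hand for one degree of freedom, where the regularized incomplete beta function has the closed form I_x(1/2, 1/2) = (2/π) arcsin(√x) and the t-distribution reduces to the Cauchy distribution. The sketch below is only this special case, and the function name is hypothetical.

```python
import math


def t_cdf_one_dof(t: float) -> float:
    """CDF of the t-distribution with nu = 1, via the incomplete beta formula."""
    if t < 0.0:
        # Negative arguments follow from symmetry about zero.
        return 1.0 - t_cdf_one_dof(-t)
    nu = 1.0
    x = nu / (t * t + nu)
    # Closed form of I_x(1/2, 1/2), valid only for nu = 1.
    i_x = 2.0 / math.pi * math.asin(math.sqrt(x))
    return 1.0 - 0.5 * i_x


def cauchy_cdf(t: float) -> float:
    """Reference: the standard Cauchy CDF, 1/2 + arctan(t) / pi."""
    return 0.5 + math.atan(t) / math.pi
```

Both functions agree to rounding error, which confirms that the incomplete beta expression and the symmetry rule reproduce the known Cauchy result.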

kim_convergence.stats.t_dist.t_cdf_ccdf(t: float, df: float) → tuple[float, float]

Compute the cumulative distribution of the t-distribution and its complement.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:

t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.

Returns:

tuple[float, float]: cdf: cumulative t-distribution value. ccdf: complement of the cumulative t-distribution (1 - cdf).

kim_convergence.stats.t_dist.t_interval(confidence_level: float, df: float, *, loc: float = 0.0, scale: float = 1.0) → tuple[float, float]

Compute the t-distribution confidence interval.

Compute the t-distribution confidence interval with equal areas around the median.

Parameters:

confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).
df (float) – Degrees of freedom, must be > 0.
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

tuple[float, float]: Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the t-distribution.

Note

Confidence interval is a range of values that is likely to contain an unknown population parameter.
Confidence level is the percentage of the confidence intervals which will hold the population parameter.
The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.

kim_convergence.stats.t_dist.t_inv_cdf(p: float, df: float, *, loc: float = 0.0, scale: float = 1.0, _tol: float = 1e-08, _atol: float = 1e-50, _rtinf: float = 1e+100) → float

Compute the t-distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for t-distributions with df degrees of freedom. Inverse cumulative distribution function finds the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability.

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).
df (float) – Degrees of freedom, must be > 1.
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

float: Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

The t-distribution functions are implemented using the regularized incomplete beta function as described in standard statistical references.

Hypothesis Tests

Tests for Normally Distributed Data

Test module for normally distributed data.

Note

The tests in this module are modified and fixed for the kim-convergence package use.

kim_convergence.stats.normal_test.chi_square_test(sample_var: float, sample_size: int, population_var: float, significance_level: float = 0.050000000000000044) → bool

Chi-square test for the variance.

Calculate the chi-square test for the variance. This is a two-sided test. The test statistic is \(T=(N-1)\frac{\text{var}}{\text{var}_0}\), where N is the sample size, var is the sample variance, and var0 is the target (population) variance. The more the ratio var/var0 deviates from 1, the more likely we are to reject the null hypothesis.

The null hypothesis is that the variance of a sample of independent observations x is equal to the given population variance, population_var.

Parameters:

sample_var (float) – Sample variance.
sample_size (int) – Number of samples.
population_var (float) – Population variance.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool: True if the variance of a sample of independent observations x equals the given population variance population_var.

kim_convergence.stats.normal_test.t_test(sample_mean: float, sample_std: float, sample_size: int, population_mean: float, significance_level: float = 0.050000000000000044) → bool

T-test for the mean.

Calculate the T-test for the mean. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations x is equal to the given population mean, population_mean.

Parameters:

sample_mean (float) – Sample mean.
sample_std (float) – Sample standard deviation.
sample_size (int) – Number of samples.
population_mean (float) – Expected value in the null hypothesis.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool: True if the expected value (mean) of a sample of independent observations x equals the given population mean population_mean.
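The two test statistics behind these functions are simple to compute by hand. The sketch below evaluates only the statistics: the chi-square statistic as given above, and the usual one-sample t statistic, which is assumed here rather than taken from the package. The accept/reject decision additionally needs chi-square or t quantiles, which are omitted; both function names are hypothetical.

```python
import math


def chi_square_statistic(sample_var, sample_size, population_var):
    """T = (N - 1) * var / var0, compared against chi-square with N - 1 dof."""
    return (sample_size - 1) * sample_var / population_var


def t_statistic(sample_mean, sample_std, sample_size, population_mean):
    """t = (mean - mu0) / (s / sqrt(N)), compared against t with N - 1 dof."""
    standard_error = sample_std / math.sqrt(sample_size)
    return (sample_mean - population_mean) / standard_error


# Ten samples with variance 2.0 tested against a target variance of 1.0.
T = chi_square_statistic(sample_var=2.0, sample_size=10, population_var=1.0)

# Twenty-five samples with mean 5.2 and std 0.5 tested against a mean of 5.0.
t = t_statistic(sample_mean=5.2, sample_std=0.5, sample_size=25,
                population_mean=5.0)
```

A ratio var/var0 of 2 with N = 10 gives T = 18, well above the N - 1 = 9 expected under the null hypothesis.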

Tests for Non-Normally Distributed Data

Test module for non-normally distributed data.

Note

The tests in this module are modified and fixed for the kim-convergence package use.

kim_convergence.stats.nonnormal_test.kruskal_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Kruskal-Wallis H-test for independent samples.

The Kruskal-Wallis H-test tests the null hypothesis that the median of the time series data is the same as the one from population_cdf.

It is a non-parametric version of ANOVA.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the median of the time-series data equals the median of the specified population distribution.

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> a = 1.99
>>> x = rng.gamma(a, 1, size=20)
>>> kruskal_test(x,
                 population_cdf='gamma',
                 population_args=(a,),
                 population_loc=0,
                 population_scale=1,
                 significance_level=0.05)
True

kim_convergence.stats.nonnormal_test.ks_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Kolmogorov-Smirnov test for goodness of fit.

Note

This test is only valid for continuous distributions.

It compares the distribution of an observed variable against a given distribution.

The null hypothesis is that the observed samples are drawn from the same continuous distribution as the given distribution with population_loc and population_scale if they are given.

Note

The alternative hypothesis is two-sided: the empirical cumulative distribution function of the observed variables is either less than or greater than the cumulative distribution function of the given distribution.

The probability density of the given population distribution is in the standardized form. To shift and/or scale the distribution, use the population_loc and population_scale parameters. In these cases, the variable x is replaced by y = (x - loc) / scale.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the observed samples are drawn from the same continuous distribution as the given one (two-tailed p-value > significance_level).
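To make the loc/scale convention concrete, here is a small sketch of the two-sided Kolmogorov-Smirnov statistic, the largest gap between the empirical CDF and a standardized reference CDF after the change of variable y = (x - loc) / scale. The names ks_statistic and uniform_cdf are hypothetical, and the p-value computation that the real test needs is omitted.

```python
def ks_statistic(data, cdf, loc=0.0, scale=1.0):
    """Two-sided KS statistic of `data` against a standardized `cdf`."""
    # Standardize the observations, then sort them for the empirical CDF.
    y = sorted((value - loc) / scale for value in data)
    n = len(y)
    # Largest amount by which the empirical CDF exceeds the reference CDF.
    d_plus = max((i + 1) / n - cdf(v) for i, v in enumerate(y))
    # Largest amount by which the reference CDF exceeds the empirical CDF.
    d_minus = max(cdf(v) - i / n for i, v in enumerate(y))
    return max(d_plus, d_minus)


def uniform_cdf(u):
    """CDF of the standard uniform distribution on [0, 1]."""
    return min(max(u, 0.0), 1.0)


# The same three points, once in standardized form and once shifted and scaled.
d_standard = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
d_shifted = ks_statistic([1.2, 1.8, 2.4], uniform_cdf, loc=1.0, scale=2.0)
```

Both calls yield the same statistic because the second data set is the first one scaled by 2 and shifted by 1.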

kim_convergence.stats.nonnormal_test.levene_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Perform modified Levene test for equal variances.

The modified Levene test tests the null hypothesis that the input sample time_series_data has the same variance as the population population_cdf [nistdiv898b].

Note

This test is fixed to use the ‘median’ variation of Levene’s test.

Although the optimal choice depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.

Robustness means the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal.

Power means the ability of the test to detect unequal variances when the variances are in fact unequal.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the sample variance equals the population variance (two-tailed p-value > significance_level).

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma, alpha
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(shape,),
                population_loc=0,
                population_scale=scale,
                significance_level=0.05)
True

>>> a = 1.99
>>> x = gamma.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
True

>>> x = alpha.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
                population_cdf='gamma',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
False

Reject the null hypothesis at a significance level of 5%, concluding that the variance of time_series_data differs from that of the gamma distribution with shape parameter a.

Example:

>>> levene_test(x,
                population_cdf='alpha',
                population_args=(a,),
                population_loc=0,
                population_scale=1,
                significance_level=0.05)
True

kim_convergence.stats.nonnormal_test.wilcoxon_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) → bool

Calculate the Wilcoxon signed-rank test.

Here it is used as a non-parametric test to determine whether an unknown population mean is different from a specific value.

Parameters:

time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool: True if the sample is drawn from the specified population distribution.

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> wilcoxon_test(x,
                  population_cdf='gamma',
                  population_args=(shape,),
                  population_loc=0,
                  population_scale=scale,
                  significance_level=0.05)
True

>>> wilcoxon_test(x,
                  population_cdf='gamma',
                  population_args=(shape,),
                  population_loc=0,
                  population_scale=1,
                  significance_level=0.05)
False

The non-parametric tests in this module rely on distributions from SciPy [scipystats]. Available non-parametric tests include:

Kolmogorov-Smirnov test (ks_test): Tests if samples come from a given distribution
Levene’s test (levene_test): Tests for equal variances [nistdiv898b]
Wilcoxon signed-rank test (wilcoxon_test): Tests if median differs from a value
Kruskal-Wallis H-test (kruskal_test): Non-parametric version of ANOVA

Time Series Analysis Tools

Tools module.

Helper functions for time series analysis.

Environment Variables

KIM_CONV_FORCE_SUBPROC

If set (to any value), forces correlation and periodogram computations to run in isolated subprocesses using multiprocessing with the “spawn” start method.

This is primarily intended to avoid threading conflicts when kim-convergence is used inside heavily multi-threaded simulation codes (e.g., LAMMPS with OpenMP). It prevents nested parallelism issues that can cause deadlocks or severe performance degradation.

Performance warning:

In production simulations with large datasets: moderate overhead (typically 10-30% slower).
In unit tests, small datasets, or frequent calls: extremely high overhead (can be 1000x or more slower, especially on macOS) due to repeated process spawning and data copying.

Never set this variable when running unit tests or during development. It is intended only as an escape hatch for real simulation runs that exhibit threading deadlocks.

Example usage (only when needed):

export KIM_CONV_FORCE_SUBPROC=1
mpirun -np 8 lmp -in in.my_simulation   # or similar

This flag is optional and should remain unset in nearly all cases.

kim_convergence.stats.tools.auto_correlate(x: ndarray | list[float], *, nlags: int | None = None, fft: bool = False) → ndarray

Calculate the auto-correlation function.

Calculate the auto-correlation function of the input array for nlags lags. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags to return the auto-correlation for. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray: Calculated auto correlation function.

kim_convergence.stats.tools.auto_covariance(x: ndarray | list[float], *, fft: bool = False) → ndarray

Calculate biased auto-covariance estimates.

Compute auto-covariance estimates for every lag for the input array. This estimator is biased.

\[\gamma_k = \frac{1}{N}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

Note

Some sources use the following formula for computing the autocovariance:

\[\gamma_k = \frac{1}{N-k}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

This definition has less bias than the one used here. However, the \(\frac{1}{N}\) formulation has some desirable statistical properties and is the most commonly used in the statistics literature.

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray: Estimated autocovariances.

Raises:

CRError – If input validation fails.
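The biased estimator can be written out directly from its definition. The sketch below is a plain-Python, O(N^2) illustration with hypothetical function names; the package's implementation, including its FFT path, is not shown here.

```python
def auto_covariance_direct(x):
    """Biased autocovariance gamma_k for every lag k = 0, ..., N - 1."""
    n = len(x)
    mean = sum(x) / n
    dev = [value - mean for value in x]
    # Divide by N (not N - k): this is what makes the estimator biased.
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        for k in range(n)
    ]


def auto_correlate_direct(x):
    """Autocorrelation: the autocovariance normalized by its lag-0 value."""
    gamma = auto_covariance_direct(x)
    return [g / gamma[0] for g in gamma]


gamma = auto_covariance_direct([1.0, 2.0, 3.0, 4.0])
rho = auto_correlate_direct([1.0, 2.0, 3.0, 4.0])
```

For the series 1, 2, 3, 4 the deviations from the mean are -1.5, -0.5, 0.5, 1.5, so the lag-0 value is 5/4 and the lag-1 value is 1.25/4. The result has one entry per lag, N in total.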

kim_convergence.stats.tools.cross_correlate(x: ndarray | list[float], y: ndarray | list[float] | None, *, nlags: int | None = None, fft: bool = False) → ndarray

Calculate the cross-correlation function.

Calculate the cross-correlation function of the input arrays for nlags lags. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags to return the cross-correlation for. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray: Calculated cross correlation.

kim_convergence.stats.tools.cross_covariance(x: ndarray | list[float], y: ndarray | list[float] | None, *, fft: bool = False) → ndarray

Calculate the biased cross covariance estimate between two time series.

Calculate the cross covariance between two time series for every lag for the input arrays. This estimator is biased.

Parameters:

x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray: Calculated cross covariances.

Raises:

CRError – If input validation fails.

kim_convergence.stats.tools.int_power(x: ndarray | list[float], exponent: int) → ndarray

Array elements raised to the power exponent.

Parameters:

x (array_like, 1d) – The bases.
exponent (int) – The exponent

Returns:

1darray: Computed power array.

kim_convergence.stats.tools.modified_periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) → ndarray

Compute a modified periodogram to estimate the power spectrum.

Estimate the power spectrum using a modified periodogram. A periodogram [heidelberger1981] is an estimate of the spectral density of a signal and it is defined as,

\[\left \{ I\left(\frac{k}{n}\right) \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor},\; I\left( \frac{k}{n} \right) = \left| \sum_{j=0}^{j=n-1} {x(j) e^{-2\pi i j k / n}} \right|^2 / n\]

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray: Computed modified periodogram array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

Raises:

CRError – If input validation fails.
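The periodogram definition can be evaluated term by term with a direct discrete Fourier sum. This sketch implements only the plain definition given above for k = 1, ..., floor(n/2). It does not reproduce whatever modification the library applies, and the function names are hypothetical.

```python
import cmath


def periodogram_direct(x):
    """I(k/n) = |sum_j x(j) exp(-2 pi i j k / n)|^2 / n for k = 1..n//2."""
    n = len(x)
    result = []
    for k in range(1, n // 2 + 1):
        # Discrete Fourier sum at frequency k / n.
        s = sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        result.append(abs(s) ** 2 / n)
    return result


def sample_frequencies(n):
    """Frequencies k / n matching the entries of periodogram_direct."""
    return [k / n for k in range(1, n // 2 + 1)]


# An alternating series has all of its power at the highest frequency, 1/2.
power = periodogram_direct([1.0, -1.0, 1.0, -1.0])
freqs = sample_frequencies(4)
```

The direct sum costs O(n^2) operations, which is why an FFT is preferred for long series.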

kim_convergence.stats.tools.moment(x: ndarray | list[float], *, moment: int = 1) → float

Calculates the nth moment about the mean for a sample.

Parameters:

x (array_like, 1d) – Time series data.
moment (int, optional) – Order of central moment that is returned. (default: 1)

Returns:

float: n-th central moment.

Note

The k-th central moment of the time series data is

\[m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k,\]

where \(n\) is the number of samples and \(\bar{x}\) is the mean.

kim_convergence.stats.tools.periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) → ndarray

Compute a periodogram to estimate the power spectrum.

Parameters:

x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray: Computed power spectrum array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

kim_convergence.stats.tools.skew(x: ndarray | list[float], *, bias: bool = False) → float

Compute the time series data set skewness [zwillinger2000].

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Parameters:

x (array_like, 1d) – Time series data.
bias (bool, optional) – If False, then the calculations are corrected for statistical bias. (default: False)

Returns:

float: The skewness

Note

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

The sample skewness is computed as the Fisher-Pearson coefficient of skewness \(g_1 = \frac{m_3}{m_2^{3/2}}\), where \(m_i\) is the biased sample \(i\texttt{th}\) central moment. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.

\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}.\]
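The central-moment and skewness formulas translate almost literally into code. The following is a minimal sketch with hypothetical names (central_moment, skew_sketch). It assumes at least three samples and non-zero variance, and it is not the package's implementation.

```python
import math


def central_moment(x, k):
    """Biased k-th central moment m_k = (1/n) * sum((x_i - mean)**k)."""
    n = len(x)
    mean = sum(x) / n
    return sum((value - mean) ** k for value in x) / n


def skew_sketch(x, bias=False):
    """Fisher-Pearson skewness g1, or the adjusted G1 when bias is False."""
    n = len(x)
    g1 = central_moment(x, 3) / central_moment(x, 2) ** 1.5
    if bias:
        return g1
    # Bias correction factor sqrt(N (N - 1)) / (N - 2); requires N >= 3.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


symmetric = skew_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = skew_sketch([1.0, 2.0, 3.0, 10.0])
```

The symmetric sample has zero skewness, while the sample with one large value has positive skewness, matching the remark above about weight in the right tail.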

Randomness Test

Independence test module.

kim_convergence.stats.randomness_test.randomness_test(x: ndarray | list[float], significance_level: float) → bool

Testing for independence of observations.

The von-Neumann ratio test of independence of variables is a test designed for checking the independence of subsequent observations.

The null hypothesis is that the data are independent and normally distributed.

Parameters:

x (array_like, 1d) – Time series data.
significance_level (float) – Probability threshold below which the null hypothesis is rejected.

Returns:

bool: True if the observations are independent.

Note

Given a series \(x\) of \(n\) data points, the Von-Neumann test [vonneumann1941] [vonneumann1941b] statistic is

\[v = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Under the null hypothesis of independence, the mean \(\bar{v} = 2\) and the variance \(\sigma^2_v = \frac{4 (n - 2)}{(n^2-1)}\) (see [williams1941], and [madansky1988] for a simple derivation).
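A short sketch of the von Neumann ratio and a decision rule built from the mean and variance quoted above follows. The two-sided normal approximation used for the p-value is an assumption made for illustration; the package may use a different reference distribution. Both function names are hypothetical.

```python
import math
from statistics import NormalDist


def von_neumann_ratio(x):
    """v = sum((x_i - x_{i-1})**2) / sum((x_i - mean)**2)."""
    n = len(x)
    mean = sum(x) / n
    numerator = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((value - mean) ** 2 for value in x)
    return numerator / denominator


def looks_independent(x, significance_level=0.05):
    """True if independence is not rejected (normal approximation, assumed)."""
    n = len(x)
    v = von_neumann_ratio(x)
    # Under the null hypothesis: mean 2 and variance 4 (n - 2) / (n**2 - 1).
    sigma_v = math.sqrt(4.0 * (n - 2) / (n * n - 1))
    z = (v - 2.0) / sigma_v
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value > significance_level


# A steadily increasing series is strongly serially correlated.
trend = [float(i) for i in range(100)]
trend_is_independent = looks_independent(trend)
```

Ratios far below 2 indicate positive serial correlation, as in the trending series; ratios far above 2 indicate alternation.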

Common Usage Patterns

Testing whether the sample mean matches a population mean:

import numpy as np
from kim_convergence.stats import t_test

# Generate sample data
data = np.random.normal(loc=0, scale=1, size=100)

# Perform t-test against population mean of 0
result = t_test(
    sample_mean=np.mean(data),
    sample_std=np.std(data, ddof=1),  # sample (n - 1) standard deviation
    sample_size=len(data),
    population_mean=0,
    significance_level=0.05
)

print(f"Null hypothesis not rejected: {result}")

Checking time series randomness:

from kim_convergence.stats.randomness_test import randomness_test

# Check if time series exhibits independence
is_random = randomness_test(time_series_data, significance_level=0.05)

if is_random:
    print("Time series appears independent")
else:
    print("Time series shows serial correlation")

Testing against a specific distribution:

from kim_convergence.stats.nonnormal_test import ks_test

# Test if data comes from a gamma distribution
is_gamma = ks_test(
    time_series_data,
    population_cdf='gamma',
    population_args=(2.0,),  # shape parameter
    population_loc=0,
    population_scale=1.0,
    significance_level=0.05
)

Computing autocorrelation:

from kim_convergence.stats.tools import auto_correlate

# Compute autocorrelation with FFT optimization
autocorr = auto_correlate(time_series_data, nlags=50, fft=True)

# First few lags (excluding lag 0 which is always 1.0)
print(f"Autocorrelation at lag 1: {autocorr[1]:.3f}")
print(f"Autocorrelation at lag 2: {autocorr[2]:.3f}")

Performance Considerations

FFT Optimization

For long time series, prefer fft=True in the auto- and cross-correlation functions:

from kim_convergence.stats.tools import auto_correlate, cross_correlate

# For time series with > 1000 points
autocorr = auto_correlate(large_time_series, fft=True)
crosscorr = cross_correlate(x, y, fft=True)

The get_fft_optimal_size() function finds optimal sizes for FFT computations by returning the smallest 5-smooth number (factors 2, 3, 5 only) greater than or equal to the input size [statsmodels].
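The idea of a 5-smooth FFT size can be illustrated with a brute-force search. This is a sketch with a hypothetical function name, not the code of get_fft_optimal_size(), which may use a faster method.

```python
def next_5_smooth(n: int) -> int:
    """Smallest integer >= n whose only prime factors are 2, 3, and 5."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    candidate = n
    while True:
        remainder = candidate
        # Strip all factors of 2, 3, and 5; a 5-smooth number reduces to 1.
        for prime in (2, 3, 5):
            while remainder % prime == 0:
                remainder //= prime
        if remainder == 1:
            return candidate
        candidate += 1


sizes = [next_5_smooth(n) for n in (7, 13, 1000, 1001)]
```

Padding a series of length 1001 to 1024 keeps the FFT on a size that factors entirely into small primes, which is where FFT libraries are typically fastest.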

Sample Size Requirements

Non-parametric tests: Require at least 5 data points
Randomness test: Requires at least 3 data points
t-distribution functions: Degrees of freedom must be > 1
KS test: Most effective with moderate to large sample sizes (>30)

Numerical Stability

Use bias=False in the skew() function for bias-corrected estimation
Distribution functions handle edge cases (e.g., p=0, p=1) appropriately

Memory Usage

FFT-based functions create temporary arrays of optimal FFT size
Auto/cross-covariance functions return arrays of length N (not 2N-1)
Consider using nlags parameter to limit output size for long series

Error Handling

All functions raise appropriate exceptions:

CRError: For general errors and invalid inputs
CRSampleSizeError: For insufficient sample sizes
Value errors for out-of-range parameters (e.g., p ∉ [0,1])