Statistics Module

Overview

The statistics module provides tools for statistical analysis, hypothesis testing, and time series analysis. It’s designed for use in convergence analysis and statistical quality control applications.

Contents

  1. Distribution Functions

  2. Hypothesis Tests

  3. Time Series Analysis Tools

  4. Randomness Test

  5. Common Usage Patterns

  6. Performance Considerations

Distribution Functions

Beta Distribution

Beta distribution module.

kim_convergence.stats.beta_dist.beta(a: float, b: float) float

Beta function.

Beta function [numrec2007] is defined as,

\[B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},\]

where \(\Gamma\) is the gamma function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

Returns:

float

Beta function value.
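As a minimal sketch of the definition above (an illustration, not the package's implementation), the beta function can be evaluated through log-gamma values, which avoids overflow for large arguments. The helper name beta_sketch is hypothetical.

```python
import math


def beta_sketch(a: float, b: float) -> float:
    """Evaluate B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b) for a, b > 0."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("a and b must be positive")
    # Work in log space so that large arguments do not overflow math.gamma.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


# B(2, 3) = Gamma(2) * Gamma(3) / Gamma(5) = 1 * 2 / 24 = 1/12
print(beta_sketch(2.0, 3.0))
```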

kim_convergence.stats.beta_dist.betacf(a: float, b: float, x: float, *, eps: float = 1e-15, max_iteration: int = 200, _fpmin: float = 1e-30) float

Continued fraction for incomplete beta function by modified Lentz’s method.

Evaluates continued fraction for incomplete beta function by modified Lentz’s method [numrec2007].

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Real value between 0.0 and 1.0.

  • eps (float, optional) – Machine precision epsilon. (default: 1e-15, the value of np.finfo(np.float64).resolution)

  • max_iteration (int, optional) – Maximum number of iterations. (default: 200)

  • _fpmin (float, optional) – Minimum floating point precision. (default: 1.0e-30)

Returns:

float

Continued fraction for incomplete beta function.

kim_convergence.stats.beta_dist.betai(a: float, b: float, x: float) float

Incomplete beta function.

Incomplete beta function [numrec2007] is defined as,

\[I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x~t^{a-1}(1-t)^{b-1}~dt.\]

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Real value between 0.0 and 1.0.

Returns:

float

Incomplete beta function value.
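The following is a sketch of how a modified-Lentz continued fraction can feed the regularized incomplete beta function, following the widely published Numerical Recipes formulation. It is written from that reference, not from the kim_convergence source, and the names betacf_sketch and betai_sketch are hypothetical.

```python
import math


def betacf_sketch(a, b, x, eps=1e-15, max_iteration=200, fpmin=1e-30):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < fpmin:
        d = fpmin
    d = 1.0 / d
    h = d
    for m in range(1, max_iteration + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) <= eps:
            return h
    raise RuntimeError("continued fraction did not converge")


def betai_sketch(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0.0 and 1.0")
    if x == 0.0 or x == 1.0:
        return x
    # Prefactor x^a (1 - x)^b / B(a, b), computed in log space.
    log_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
              + a * math.log(x) + b * math.log1p(-x))
    bt = math.exp(log_bt)
    # Use the continued fraction directly where it converges quickly,
    # otherwise use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf_sketch(a, b, x) / a
    return 1.0 - bt * betacf_sketch(b, a, 1.0 - x) / b


# Closed forms for comparison: I_x(1, 1) = x and I_x(2, 1) = x^2.
print(betai_sketch(1.0, 1.0, 0.3), betai_sketch(2.0, 1.0, 0.3))
```

The switch at x = (a + 1) / (a + b + 2) keeps the continued fraction in the region where it converges in few iterations.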

kim_convergence.stats.beta_dist.betai_cdf(a: float, b: float, x: float) float

Calculate the cumulative distribution of the incomplete beta distribution.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{B(a,b)}~dt,\]

where \(B(a,b)\) is the beta function.

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Upper limit of integration

Returns:

float

Cumulative incomplete beta distribution.

kim_convergence.stats.beta_dist.betai_cdf_ccdf(a: float, b: float, x: float) tuple[float, float]

Calculate the cumulative distribution of the incomplete beta distribution and its complement.

Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,

\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{B(a,b)}~dt,\]

where \(B(a,b)\) is the beta function, together with its complement (one minus the cumulative distribution).

Parameters:
  • a (float) – First parameter of the beta distribution.

  • b (float) – Second parameter of the beta distribution.

  • x (float) – Upper limit of integration

Returns:

tuple[float, float]

Cumulative incomplete beta distribution, complement of the cumulative incomplete beta distribution.

The Beta function implementation follows the algorithms described in Numerical Recipes [numrec2007].

Normal Distribution

Normal distribution module.

s_normal_inv_cdf code is adapted from python statistics module [pythonstats] by Yaser Afshar.

kim_convergence.stats.normal_dist.normal_interval(confidence_level: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]

Compute the normal distribution confidence interval.

Compute the normal-distribution confidence interval with equal areas around the median.

Parameters:
  • confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).

  • loc (float, optional) – Location parameter. (default: 0.0)

  • scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

tuple[float, float]

Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the distribution.

Note

  • Confidence interval is a range of values that is likely to contain an unknown population parameter.

  • Confidence level is the percentage of such confidence intervals that would contain the population parameter under repeated sampling.

  • The significance level, or alpha, is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.
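The relationship between the confidence level, alpha, and the interval bounds can be sketched with the Python standard library alone. This is an illustration assuming an equal-tailed interval, not the package's implementation; normal_interval_sketch is a hypothetical name.

```python
from statistics import NormalDist


def normal_interval_sketch(confidence_level, loc=0.0, scale=1.0):
    """Equal-tailed interval holding `confidence_level` of N(loc, scale)."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0.0 and 1.0")
    alpha = 1.0 - confidence_level  # significance level
    dist = NormalDist(mu=loc, sigma=scale)
    lower = dist.inv_cdf(alpha / 2.0)        # alpha/2 lies below this bound
    upper = dist.inv_cdf(1.0 - alpha / 2.0)  # alpha/2 lies above this bound
    return lower, upper


lower, upper = normal_interval_sketch(0.95)
print(lower, upper)  # roughly -1.96 and 1.96 for the standard normal
```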

kim_convergence.stats.normal_dist.normal_inv_cdf(p: float, *, loc=0.0, scale: float = 1.0) float

Compute the normal distribution inverse cumulative distribution function.

Parameters:
  • p (float) – Probability (must be between 0.0 and 1.0).

  • loc (float, optional) – Location parameter. (default: 0.0)

  • scale (float, optional) – Scale parameter. (default: 1.0)

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

kim_convergence.stats.normal_dist.s_normal_inv_cdf(p: float) float

Compute the standard normal distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for standard normal distribution [pythonstats], [wichura1988].

Parameters:

p (float) – Probability (must be between 0.0 and 1.0).

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

The inverse CDF computation uses the algorithm by Wichura [wichura1988].

t-Distribution

T distribution module.

This module is specialized for the kim-convergence code and is not intended as a general-purpose implementation for other uses.

kim_convergence.stats.t_dist.t_cdf(t: float, df: float) float

Compute the cumulative distribution of the t-distribution.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.
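As a worked check of the identity above (an illustration, not part of the package), take \(\nu = 2\). For \(a = 1\) the regularized incomplete beta function has the closed form \(I_x(1, b) = 1 - (1 - x)^b\), so with \(x(t) = 2 / (t^2 + 2)\) and \(t > 0\),

\[I_{x(t)}\left(1, \frac{1}{2}\right) = 1 - \sqrt{1 - \frac{2}{t^2+2}} = 1 - \frac{t}{\sqrt{t^2+2}},\]

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2}\left(1 - \frac{t}{\sqrt{t^2+2}}\right) = \frac{1}{2} + \frac{t}{2\sqrt{t^2+2}},\]

which is the known closed-form cumulative distribution of the t-distribution with two degrees of freedom. At \(t = 1\) this gives \(\frac{1}{2} + \frac{1}{2\sqrt{3}} \approx 0.7887\).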

Parameters:
  • t (float) – Upper limit of the integration.

  • df (float) – Degrees of freedom, must be a positive number.

Returns:

float

Cumulative t-distribution.

kim_convergence.stats.t_dist.t_cdf_ccdf(t: float, df: float) tuple[float, float]

Compute the cumulative distribution of the t-distribution and its complement.

The cumulative distribution of the t-distribution for t > 0 can be written in terms of the regularized incomplete beta function as,

\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]

where,

\[x(t) = \frac{\nu}{{t^2+\nu}}.\]

Other t values would be obtained by symmetry.

Parameters:
  • t (float) – Upper limit of the integration.

  • df (float) – Degrees of freedom, must be a positive number.

Returns:

tuple[float, float]

cdf: cumulative t-distribution value. ccdf: complement of the cumulative t-distribution (1 - cdf).

kim_convergence.stats.t_dist.t_interval(confidence_level: float, df: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]

Compute the t-distribution confidence interval.

Compute the t-distribution confidence interval with equal areas around the median.

Parameters:
  • confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).

  • df (float) – Degrees of freedom, must be > 0.

  • loc (float, optional) – location parameter (default: 0.0)

  • scale (float, optional) – scale parameter (default: 1.0)

Returns:

tuple[float, float]

Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the t-distribution.

Note

  • Confidence interval is a range of values that is likely to contain an unknown population parameter.

  • Confidence level is the percentage of such confidence intervals that would contain the population parameter under repeated sampling.

  • The significance level, or alpha, is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.

kim_convergence.stats.t_dist.t_inv_cdf(p: float, df: float, *, loc: float = 0.0, scale: float = 1.0, _tol: float = 1e-08, _atol: float = 1e-50, _rtinf: float = 1e+100) float

Compute the t-distribution inverse cumulative distribution function.

Compute the inverse cumulative distribution function (percent point function or quantile function) for t-distributions with df degrees of freedom. Inverse cumulative distribution function finds the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability.

Parameters:
  • p (float) – Probability (must be between 0.0 and 1.0)

  • df (float) – Degrees of freedom, must be > 1.

  • loc (float, optional) – location parameter (default: 0.0)

  • scale (float, optional) – scale parameter (default: 1.0)

Returns:

float

Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).

The t-distribution functions are implemented using the regularized incomplete beta function as described in standard statistical references.

Hypothesis Tests

Tests for Normally Distributed Data

Test module for normal distributed data.

Note

The tests in this module are adapted and specialized for use in the kim-convergence package.

kim_convergence.stats.normal_test.chi_square_test(sample_var: float, sample_size: int, population_var: float, significance_level: float = 0.050000000000000044) bool

Chi-square test for the variance.

Calculate the chi-square test for the variance. This is a two-sided test. The test statistic is \(T = (N-1)\frac{\text{var}}{\text{var}_0}\), where \(N\) is the sample size, var is the sample variance, and var0 is the target (population) variance. The ratio var/var0 compares the sample variance to the target variance: the more this ratio deviates from 1, the more likely we are to reject the null hypothesis.

The null hypothesis is that the variance of a sample of independent observations x is equal to the given population variance, population_var.
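As a small illustration of the test statistic only (the accept/reject decision additionally needs chi-square quantiles with N − 1 degrees of freedom, which this sketch does not compute), the statistic can be evaluated as follows. The function name chi_square_statistic is hypothetical.

```python
def chi_square_statistic(sample_var, sample_size, population_var):
    """Return T = (N - 1) * var / var0 for the chi-square variance test."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    if population_var <= 0.0:
        raise ValueError("population_var must be positive")
    return (sample_size - 1) * sample_var / population_var


# A sample variance 20% above the target with N = 25 gives T = 24 * 1.2.
print(chi_square_statistic(1.2, 25, 1.0))
```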

Parameters:
  • sample_var (float) – Sample variance.

  • sample_size (int) – Number of samples.

  • population_var (float) – population variance.

  • significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool

True if the variance of a sample of independent observations x equals the given population variance population_var.

kim_convergence.stats.normal_test.t_test(sample_mean: float, sample_std: float, sample_size: int, population_mean: float, significance_level: float = 0.050000000000000044) bool

T-test for the mean.

Calculate the T-test for the mean. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations x is equal to the given population mean, population_mean.

Parameters:
  • sample_mean (float) – Sample mean.

  • sample_std (float) – Sample standard deviation.

  • sample_size (int) – Number of samples.

  • population_mean (float) – Expected value in the null hypothesis.

  • significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)

Returns:

bool

True if the expected value (mean) of a sample of independent observations x equals the given population mean population_mean.

Tests for Non-Normally Distributed Data

Test module for non-normally distributed data.

Note

The tests in this module are adapted and specialized for use in the kim-convergence package.

kim_convergence.stats.nonnormal_test.kruskal_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Kruskal-Wallis H-test for independent samples.

The Kruskal-Wallis H-test tests the null hypothesis that the median of the time series data is the same as the one from population_cdf.

It is a non-parametric version of ANOVA.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the median of the time-series data equals the median of the specified population distribution.

Examples:

>>> import numpy as np
>>> from kim_convergence.stats.nonnormal_test import kruskal_test
>>> rng = np.random.RandomState(12345)
>>> a = 1.99
>>> x = rng.gamma(a, 1, size=20)
>>> kruskal_test(x,
...              population_cdf='gamma',
...              population_args=(a,),
...              population_loc=0,
...              population_scale=1,
...              significance_level=0.05)
True

kim_convergence.stats.nonnormal_test.ks_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Kolmogorov-Smirnov test for goodness of fit.

Note

This test is only valid for continuous distributions.

It compares the distribution of an observed variable against a given distribution.

The null hypothesis is that the observed samples are drawn from the same continuous distribution as the given distribution with population_loc and population_scale if they are given.

Note

The alternative hypothesis is two-sided: the empirical cumulative distribution function of the observed variables is either less than or greater than the cumulative distribution function of the given distribution.

The probability density of the given population distribution is in the standardized form. To shift and/or scale the distribution, use the population_loc and population_scale parameters; the distribution is then evaluated at the transformed variable y = (x - loc) / scale.
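The change of variable can be illustrated with a minimal two-sided Kolmogorov-Smirnov distance written against a standardized CDF. This is a sketch using the standard normal from the Python standard library as the reference distribution; it is not the package's implementation, and ks_statistic_sketch is a hypothetical name.

```python
from statistics import NormalDist


def ks_statistic_sketch(x, cdf, loc=0.0, scale=1.0):
    """Two-sided KS distance between a sample and a standardized CDF."""
    # Evaluate the standardized CDF at y = (x - loc) / scale.
    y = sorted((xi - loc) / scale for xi in x)
    n = len(y)
    d = 0.0
    for i, yi in enumerate(y, start=1):
        f = cdf(yi)
        # The empirical CDF jumps from (i - 1)/n to i/n at the i-th ordered value.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


standard_normal_cdf = NormalDist().cdf
z = [-1.2, -0.4, 0.1, 0.7, 1.5]
shifted = [3.0 + 2.0 * zi for zi in z]

# Supplying loc and scale undoes the shift, so both distances agree.
d_standard = ks_statistic_sketch(z, standard_normal_cdf)
d_shifted = ks_statistic_sketch(shifted, standard_normal_cdf, loc=3.0, scale=2.0)
print(d_standard, d_shifted)
```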

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the observed samples are drawn from the same continuous distribution as the given one (two-tailed p-value > significance_level).

kim_convergence.stats.nonnormal_test.levene_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Perform modified Levene test for equal variances.

The modified Levene test tests the null hypothesis that the sample time_series_data and the population population_cdf have the same variance [nistdiv898b].

Note

This test is fixed to use the ‘median’ variant of Levene’s test.

Although the optimal choice depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.

Robustness means the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal.

Power means the ability of the test to detect unequal variances when the variances are in fact unequal.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the sample variance equals the population variance (two-tailed p-value > significance_level).

Examples:

>>> import numpy as np
>>> from scipy.stats import gamma, alpha
>>> from kim_convergence.stats.nonnormal_test import levene_test
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> levene_test(x,
...             population_cdf='gamma',
...             population_args=(shape,),
...             population_loc=0,
...             population_scale=scale,
...             significance_level=0.05)
True
>>> a = 1.99
>>> x = gamma.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
...             population_cdf='gamma',
...             population_args=(a,),
...             population_loc=0,
...             population_scale=1,
...             significance_level=0.05)
True
>>> x = alpha.rvs(a, size=1000, random_state=rng)
>>> levene_test(x,
...             population_cdf='gamma',
...             population_args=(a,),
...             population_loc=0,
...             population_scale=1,
...             significance_level=0.05)
False

Reject the null hypothesis at a significance level of 5%, concluding that the variance of time_series_data differs from that of the gamma distribution with shape parameter a.

Example:

>>> levene_test(x,
...             population_cdf='alpha',
...             population_args=(a,),
...             population_loc=0,
...             population_scale=1,
...             significance_level=0.05)
True

kim_convergence.stats.nonnormal_test.wilcoxon_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool

Calculate the Wilcoxon signed-rank test.

Here it is used as a non-parametric test to determine whether an unknown population mean is different from a specific value.

Parameters:
  • time_series_data (np.ndarray) – time series data.

  • population_cdf (Optional[str]) – The name of a distribution.

  • population_args (tuple) – Distribution parameter.

  • population_loc (Optional[float]) – location of the distribution.

  • population_scale (Optional[float]) – scale of the distribution.

  • significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)

Returns:

bool

True if the sample is drawn from the specified population distribution.

Examples:

>>> import numpy as np
>>> from kim_convergence.stats.nonnormal_test import wilcoxon_test
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> wilcoxon_test(x,
...               population_cdf='gamma',
...               population_args=(shape,),
...               population_loc=0,
...               population_scale=scale,
...               significance_level=0.05)
True
>>> wilcoxon_test(x,
...               population_cdf='gamma',
...               population_args=(shape,),
...               population_loc=0,
...               population_scale=1,
...               significance_level=0.05)
False

The non-parametric tests in this module rely on distributions from SciPy [scipystats]. Available non-parametric tests include:

  • Kolmogorov-Smirnov test (ks_test): Tests if samples come from a given distribution

  • Levene’s test (levene_test): Tests for equal variances [nistdiv898b]

  • Wilcoxon signed-rank test (wilcoxon_test): Tests if median differs from a value

  • Kruskal-Wallis H-test (kruskal_test): Non-parametric version of ANOVA

Time Series Analysis Tools

Tools module.

Helper functions for time series analysis.

Environment Variables

KIM_CONV_FORCE_SUBPROC

If set (to any value), forces correlation and periodogram computations to run in isolated subprocesses using multiprocessing with the “spawn” start method.

This is primarily intended to avoid threading conflicts when kim-convergence is used inside heavily multi-threaded simulation codes (e.g., LAMMPS with OpenMP). It prevents nested parallelism issues that can cause deadlocks or severe performance degradation.

Performance warning:

  • In production simulations with large datasets: moderate overhead (typically 10-30% slower).

  • In unit tests, small datasets, or frequent calls: extremely high overhead (can be 1000x or more slower, especially on macOS) due to repeated process spawning and data copying.

Never set this variable when running unit tests or during development. It is intended only as an escape hatch for real simulation runs that exhibit threading deadlocks.

Example usage (only when needed):

export KIM_CONV_FORCE_SUBPROC=1
mpirun -np 8 lmp -in in.my_simulation   # or similar

This flag is optional and should remain unset in nearly all cases.

kim_convergence.stats.tools.auto_correlate(x: ndarray | list[float], *, nlags: int | None = None, fft: bool = False) ndarray

Calculate the auto-correlation function.

Calculate the auto-correlation function of the input array for nlags lags. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • nlags (int > 0 or None, optional) – Number of lags for which to return the auto-correlation. (default: None)

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray

Calculated auto correlation function.

kim_convergence.stats.tools.auto_covariance(x: ndarray | list[float], *, fft: bool = False) ndarray

Calculate biased auto-covariance estimates.

Compute auto-covariance estimates for every lag for the input array. This estimator is biased.

\[\gamma_k = \frac{1}{N}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

Note

Some sources use the following formula for computing the autocovariance:

\[\gamma_k = \frac{1}{N-k}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]

This definition has less bias than the one used here, but the \(\frac{1}{N}\) formulation has some desirable statistical properties and is the most commonly used in the statistics literature.
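A direct, pure-Python sketch of the biased \(\frac{1}{N}\) estimator is shown here for illustration. It is an O(N²) loop written from the formula, not the package's implementation, and auto_covariance_sketch is a hypothetical name.

```python
def auto_covariance_sketch(x):
    """Biased auto-covariance estimates for lags 0 .. N-1 (divides by N)."""
    n = len(x)
    if n == 0:
        raise ValueError("x must not be empty")
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    # For lag k, sum the N - k overlapping products and divide by N (not N - k).
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n for k in range(n)]


gamma_k = auto_covariance_sketch([1.0, 2.0, 3.0, 4.0])
print(gamma_k)  # [1.25, 0.3125, -0.375, -0.5625]

# Dividing by the lag-0 value gives the auto-correlation.
rho_k = [g / gamma_k[0] for g in gamma_k]
print(rho_k[1])  # 0.25
```

Dividing by N rather than N − k shrinks the estimates at large lags, where few products are available.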

Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray

Estimated autocovariances.

Raises:

CRError – If input validation fails.

kim_convergence.stats.tools.cross_correlate(x: ndarray | list[float], y: ndarray | list[float] | None, *, nlags: int | None = None, fft: bool = False) ndarray

Calculate the cross-correlation function.

Calculate the cross-correlation function of the input arrays for nlags lags. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • y (array_like, 1d) – Time series data.

  • nlags (int > 0 or None, optional) – Number of lags for which to return the cross-correlation. (default: None)

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

ndarray

Calculated cross correlation.

kim_convergence.stats.tools.cross_covariance(x: ndarray | list[float], y: ndarray | list[float] | None, *, fft: bool = False) ndarray

Calculate the biased cross covariance estimate between two time series.

Calculate the cross covariance between two time series for every lag for the input arrays. This estimator is biased.

Parameters:
  • x (array_like, 1d) – Time series data.

  • y (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

Returns:

1darray

Calculated cross covariances.

Raises:

CRError – If input validation fails.

kim_convergence.stats.tools.int_power(x: ndarray | list[float], exponent: int) ndarray

Array elements raised to the power exponent.

Parameters:
  • x (array_like, 1d) – The bases.

  • exponent (int) – The exponent

Returns:

1darray

Computed power array.

kim_convergence.stats.tools.modified_periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray

Compute a modified periodogram to estimate the power spectrum.

Estimate the power spectrum using a modified periodogram. A periodogram [heidelberger1981] is an estimate of the spectral density of a signal and it is defined as,

\[\left \{ I\left(\frac{k}{n}\right) \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor},\; I\left( \frac{k}{n} \right) = \left| \sum_{j=0}^{n-1} {x(j) e^{-2\pi i j k / n}} \right|^2 / n\]

Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

  • with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray

Computed modified periodogram array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

Raises:

CRError – If input validation fails.
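The periodogram definition and its sample frequencies can be sketched with a direct discrete Fourier sum. This follows the displayed formula only; any additional treatment applied by modified_periodogram (for example through with_mean) is not reproduced, and both function names are hypothetical.

```python
import cmath


def periodogram_sketch(x):
    """I(k/n) = |sum_j x(j) exp(-2 pi i j k / n)|^2 / n for k = 1 .. n // 2."""
    n = len(x)
    values = []
    for k in range(1, n // 2 + 1):
        s = sum(xj * cmath.exp(-2j * cmath.pi * j * k / n)
                for j, xj in enumerate(x))
        values.append(abs(s) ** 2 / n)
    return values


def sample_frequencies_sketch(n):
    """Frequencies k / n matching the entries of periodogram_sketch."""
    return [k / n for k in range(1, n // 2 + 1)]


# An alternating series has all of its power at the highest frequency, 1/2.
x = [1.0, -1.0, 1.0, -1.0]
power = periodogram_sketch(x)
freqs = sample_frequencies_sketch(len(x))
print(list(zip(freqs, power)))
```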

kim_convergence.stats.tools.moment(x: ndarray | list[float], *, moment: int = 1) float

Calculates the nth moment about the mean for a sample.

Parameters:
  • x (array_like, 1d) – Time series data.

  • moment (int, optional) – Order of central moment that is returned. (default: 1)

Returns:

float

n-th central moment.

Note

The k-th central moment of a time series is

\[m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k,\]

where \(n\) is the number of samples and \(\bar{x}\) is the mean.
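The formula translates directly into a few lines of Python. This is a sketch for illustration, and central_moment_sketch is a hypothetical name rather than the package's function.

```python
def central_moment_sketch(x, k=1):
    """k-th central moment: the mean of (x_i - mean(x)) ** k."""
    n = len(x)
    if n == 0:
        raise ValueError("x must not be empty")
    mean = sum(x) / n
    return sum((xi - mean) ** k for xi in x) / n


data = [1.0, 2.0, 3.0, 4.0]
print(central_moment_sketch(data, 1))  # 0.0, the first central moment vanishes
print(central_moment_sketch(data, 2))  # 1.25, the biased variance
```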

kim_convergence.stats.tools.periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray

Compute a periodogram to estimate the power spectrum.

Parameters:
  • x (array_like, 1d) – Time series data.

  • fft (bool, optional) – Use FFT convolution for long series. (default: False)

  • with_mean (bool, optional) – If True, use x minus its mean. (default: False)

Returns:

1darray

Computed power spectrum array.

Note

This function does not return the array of sample frequencies. In case of need, one can compute it as,

\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]

or

>>> f = np.arange(1., x.size//2 + 1) / x.size

kim_convergence.stats.tools.skew(x: ndarray | list[float], *, bias: bool = False) float

Compute the time series data set skewness [zwillinger2000].

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Parameters:
  • x (array_like, 1d) – Time series data.

  • bias (bool, optional) – If False, then the calculations are corrected for statistical bias. (default: False)

Returns:

float

The skewness

Note

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

The sample skewness is computed as the Fisher-Pearson coefficient of skewness \(g_1 = \frac{m_3}{m_2^{3/2}}\), where \(m_i\) is the biased sample \(i\texttt{th}\) central moment. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.

\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}.\]
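Both coefficients can be sketched from the biased central moments. The code below is an illustration of the two formulas, not the package's implementation; skew_sketch is a hypothetical name, and requiring at least three observations is an assumption made so that the N − 2 denominator is non-zero.

```python
import math


def skew_sketch(x, bias=False):
    """Fisher-Pearson skewness g1, or the adjusted G1 when bias is False."""
    n = len(x)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(x) / n
    m2 = sum((xi - mean) ** 2 for xi in x) / n  # biased second central moment
    m3 = sum((xi - mean) ** 3 for xi in x) / n  # biased third central moment
    if m2 == 0.0:
        raise ValueError("skewness is undefined for constant data")
    g1 = m3 / m2 ** 1.5
    if bias:
        return g1
    # Adjusted Fisher-Pearson standardized moment coefficient.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


right_skewed = [1.0, 2.0, 3.0, 10.0]
print(skew_sketch(right_skewed, bias=True))  # positive: heavier right tail
print(skew_sketch(right_skewed))             # bias-corrected value
```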

Randomness Test

Independence test module.

kim_convergence.stats.randomness_test.randomness_test(x: ndarray | list[float], significance_level: float) bool

Testing for independence of observations.

The von-Neumann ratio test of independence of variables is a test designed for checking the independence of subsequent observations.

The null hypothesis is that the data are independent and normally distributed.

Parameters:
  • x (array_like, 1d) – Time series data.

  • significance_level (float) – Probability threshold below which the null hypothesis is rejected.

Returns:

bool

True if the observations are independent.

Note

Given a series \(x\) of \(n\) data points, the Von-Neumann test [vonneumann1941] [vonneumann1941b] statistic is

\[v = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Under the null hypothesis of independence, the mean \(\bar{v} = 2\) and the variance \(\sigma^2_v = \frac{4 (n - 2)}{(n^2-1)}\) (see [williams1941], and [madansky1988] for a simple derivation).
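The statistic and its null moments can be combined into a simple decision rule. The sketch below assumes a normal approximation for the distribution of \(v\), which may differ from what randomness_test does internally; the name von_neumann_sketch and the default significance level are hypothetical.

```python
import math
from statistics import NormalDist


def von_neumann_sketch(x, significance_level=0.05):
    """Return (v, p_value, independent) using a normal approximation for v."""
    n = len(x)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(x) / n
    numerator = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((xi - mean) ** 2 for xi in x)
    if denominator == 0.0:
        raise ValueError("the ratio is undefined for constant data")
    v = numerator / denominator
    # Null moments: mean 2 and variance 4 (n - 2) / (n^2 - 1).
    var_v = 4.0 * (n - 2) / (n * n - 1)
    z = (v - 2.0) / math.sqrt(var_v)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided
    return v, p_value, p_value > significance_level


# A steadily increasing series is strongly serially correlated, so v is far below 2.
v, p_value, independent = von_neumann_sketch([float(i) for i in range(1, 11)])
print(v, p_value, independent)
```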

Common Usage Patterns

Testing whether the sample mean matches a population mean:

import numpy as np
from kim_convergence.stats import t_test

# Generate sample data
data = np.random.normal(loc=0, scale=1, size=100)

# Perform t-test against population mean of 0
result = t_test(
    sample_mean=np.mean(data),
    sample_std=np.std(data, ddof=1),  # sample (not population) standard deviation
    sample_size=len(data),
    population_mean=0,
    significance_level=0.05
)

print(f"Null hypothesis not rejected: {result}")

Checking time series randomness:

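import numpy as np
from kim_convergence.stats.randomness_test import randomness_test

# Illustrative data; replace with your own 1-D time series
time_series_data = np.random.normal(loc=0, scale=1, size=100)

# Check if time series exhibits independence
is_random = randomness_test(time_series_data, significance_level=0.05)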

if is_random:
    print("Time series appears independent")
else:
    print("Time series shows serial correlation")

Testing against a specific distribution:

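import numpy as np
from kim_convergence.stats.nonnormal_test import ks_test

# Illustrative data; replace with your own 1-D time series
time_series_data = np.random.gamma(2.0, 1.0, size=500)

# Test if data comes from a gamma distribution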
is_gamma = ks_test(
    time_series_data,
    population_cdf='gamma',
    population_args=(2.0,),  # shape parameter
    population_loc=0,
    population_scale=1.0,
    significance_level=0.05
)

Computing autocorrelation:

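import numpy as np
from kim_convergence.stats.tools import auto_correlate

# Illustrative data; replace with your own 1-D time series
time_series_data = np.random.normal(loc=0, scale=1, size=1000)

# Compute autocorrelation with FFT optimization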
autocorr = auto_correlate(time_series_data, nlags=50, fft=True)

# First few lags (excluding lag 0 which is always 1.0)
print(f"Autocorrelation at lag 1: {autocorr[1]:.3f}")
print(f"Autocorrelation at lag 2: {autocorr[2]:.3f}")

Performance Considerations

FFT Optimization

For long time series, always use fft=True in autocorrelation functions:

# For time series with > 1000 points
autocorr = auto_correlate(large_time_series, fft=True)
crosscorr = cross_correlate(x, y, fft=True)

The get_fft_optimal_size() function finds optimal sizes for FFT computations by returning the smallest 5-smooth number (factors 2, 3, 5 only) greater than or equal to the input size [statsmodels].
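The idea of a 5-smooth target size can be illustrated with a simple linear search. This is a sketch of the concept, not the algorithm used by get_fft_optimal_size(); the name next_5_smooth is hypothetical.

```python
def next_5_smooth(n):
    """Smallest integer >= n whose only prime factors are 2, 3, and 5."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    candidate = n
    while True:
        m = candidate
        # Strip every factor of 2, 3, and 5; a 5-smooth number reduces to 1.
        for p in (2, 3, 5):
            while m % p == 0:
                m //= p
        if m == 1:
            return candidate
        candidate += 1


print(next_5_smooth(1000))  # 1000 = 2^3 * 5^3 is already 5-smooth
print(next_5_smooth(1001))  # 1024 = 2^10
```

Padding a series to such a length keeps the FFT on small-radix code paths, which is typically much faster than transforming a length with a large prime factor.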

Sample Size Requirements

  • Non-parametric tests: Require at least 5 data points

  • Randomness test: Requires at least 3 data points

  • t-distribution functions: Degrees of freedom must be > 1

  • KS test: Most effective with moderate to large sample sizes (>30)

Numerical Stability

  • Use bias=False in skew() function for unbiased estimation

  • Distribution functions handle edge cases (e.g., p=0, p=1) appropriately

Memory Usage

  • FFT-based functions create temporary arrays of optimal FFT size

  • Auto/cross-covariance functions return arrays of length N (not 2N-1)

  • Consider using nlags parameter to limit output size for long series

Error Handling

All functions raise appropriate exceptions:

  • CRError: For general errors and invalid inputs

  • CRSampleSizeError: For insufficient sample sizes

  • Value errors for out-of-range parameters (e.g., p ∉ [0,1])