Statistics Module
Overview
The statistics module provides tools for statistical analysis, hypothesis testing, and time series analysis. It’s designed for use in convergence analysis and statistical quality control applications.
Contents
Distribution Functions
Beta Distribution
Beta distribution module.
- kim_convergence.stats.beta_dist.beta(a: float, b: float) float
Beta function.
Beta function [numrec2007] is defined as,
\[B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},\]where \(\Gamma\) is the gamma function.
- Parameters:
a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
- Returns:
- float
Beta function value.
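The definition above can be evaluated directly with the standard library. The following is a minimal sketch, not the package's implementation; the name beta_function is hypothetical, and the use of log-gamma (to avoid overflow for large arguments) is an assumption about a reasonable approach.

```python
import math


def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), evaluated in log space."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("a and b must be positive")
    # lgamma avoids overflow of Gamma for large arguments.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


print(beta_function(2.0, 3.0))  # about 1/12, since Gamma(2)Gamma(3)/Gamma(5) = 2/24
```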
- kim_convergence.stats.beta_dist.betacf(a: float, b: float, x: float, *, eps: float = 1e-15, max_iteration: int = 200, _fpmin: float = 1e-30) float
Continued fraction for incomplete beta function by modified Lentz’s method.
Evaluates continued fraction for incomplete beta function by modified Lentz’s method [numrec2007].
- Parameters:
a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real-valued such that it must be between 0.0 and 1.0.
eps (float, optional) – Machine precision epsilon. (default: 1e-15, the value of np.finfo(np.float64).resolution)
max_iteration (int, optional) – Maximum number of iterations. (default: 200)
_fpmin (float, optional) – Minimum floating point precision. (default: 1.0e-30)
- Returns:
- float
Continued fraction for incomplete beta function.
- kim_convergence.stats.beta_dist.betai(a: float, b: float, x: float) float
Incomplete beta function.
Incomplete beta function [numrec2007] is defined as,
\[I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x~t^{a-1}(1-t)^{b-1}~dt.\]
- Parameters:
a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Real-valued such that it must be between 0.0 and 1.0.
- Returns:
- float
Incomplete beta function value.
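As an illustration of how the continued fraction and the incomplete beta function fit together, here is a sketch that follows the textbook scheme from Numerical Recipes. It is not the package's code: the names betacf_sketch and betai_sketch are hypothetical, and the switch to the symmetry relation I_x(a, b) = 1 - I_{1-x}(b, a) for large x is the usual textbook choice, assumed here.

```python
import math


def betacf_sketch(a, b, x, eps=1e-15, max_iteration=200, fpmin=1e-30):
    """Continued fraction for I_x(a, b) by the modified Lentz method."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < fpmin:
        d = fpmin
    d = 1.0 / d
    h = d
    for m in range(1, max_iteration + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < fpmin:
            d = fpmin
        c = 1.0 + aa / c
        if abs(c) < fpmin:
            c = fpmin
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) <= eps:
            return h
    raise RuntimeError("continued fraction did not converge")


def betai_sketch(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    if x == 0.0 or x == 1.0:
        return x
    # Prefactor x^a (1 - x)^b / B(a, b), computed in log space.
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * betacf_sketch(a, b, x) / a
    # The continued fraction converges faster on the mirrored problem.
    return 1.0 - front * betacf_sketch(b, a, 1.0 - x) / b


print(betai_sketch(2.0, 3.0, 0.4))  # about 0.5248
```

For integer parameters the result can be checked against the binomial sum, for example I_{0.4}(2, 3) = 0.3456 + 0.1536 + 0.0256 = 0.5248.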
- kim_convergence.stats.beta_dist.betai_cdf(a: float, b: float, x: float) float
Calculate the cumulative distribution of the incomplete beta distribution.
Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b as,
\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]where, \(Beta(a,b)\) is the beta function.
- Parameters:
a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration
- Returns:
- float
Cumulative incomplete beta distribution.
- kim_convergence.stats.beta_dist.betai_cdf_ccdf(a: float, b: float, x: float) tuple[float, float]
Calculate the cumulative distribution of the incomplete beta distribution and its complement.
Calculate the cumulative distribution of the incomplete beta distribution with parameters a and b, together with its complement. The cumulative distribution is
\[\int_0^x \frac{t^{a-1}~(1-t)^{b-1}}{Beta(a,b)}~dt,\]where \(Beta(a,b)\) is the beta function.
- Parameters:
a (float) – First parameter of the beta distribution.
b (float) – Second parameter of the beta distribution.
x (float) – Upper limit of integration
- Returns:
- tuple[float, float]
Cumulative incomplete beta distribution, complement of the cumulative incomplete beta distribution.
The Beta function implementation follows the algorithms described in Numerical Recipes [numrec2007].
Normal Distribution
Normal distribution module.
The s_normal_inv_cdf code was adapted from the Python statistics module [pythonstats] by Yaser Afshar.
- kim_convergence.stats.normal_dist.normal_interval(confidence_level: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]
Compute the normal distribution confidence interval.
Compute the normal-distribution confidence interval with equal areas around the median.
- Parameters:
confidence_level (float) – Confidence coefficient (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)
- Returns:
- tuple[float, float]
Lower and upper bounds of the confidence interval that contains \(100~\text{confidence_level}\%\) of the distribution.
Note
A confidence interval is a range of values that is likely to contain an unknown population parameter.
The confidence level is the percentage of the confidence intervals which will hold the population parameter.
The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.
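The relation between the confidence level, alpha, and the interval bounds can be sketched with the standard library's statistics.NormalDist. This is an illustration only; normal_interval_sketch is a hypothetical name and the package's own inverse CDF is not used here.

```python
from statistics import NormalDist


def normal_interval_sketch(confidence_level, loc=0.0, scale=1.0):
    """Equal-tailed interval holding `confidence_level` of N(loc, scale^2)."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    alpha = 1.0 - confidence_level  # significance level
    dist = NormalDist(mu=loc, sigma=scale)
    # Put alpha / 2 of the probability mass in each tail.
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


lower, upper = normal_interval_sketch(0.95)
print(lower, upper)  # roughly -1.96 and 1.96
```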
- kim_convergence.stats.normal_dist.normal_inv_cdf(p: float, *, loc=0.0, scale: float = 1.0) float
Compute the normal distribution inverse cumulative distribution function.
- Parameters:
p (float) – Probability (must be between 0.0 and 1.0).
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)
- Returns:
- float
Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).
- kim_convergence.stats.normal_dist.s_normal_inv_cdf(p: float) float
Compute the standard normal distribution inverse cumulative distribution function.
Compute the inverse cumulative distribution function (percent point function or quantile function) for standard normal distribution [pythonstats], [wichura1988].
- Parameters:
p (float) – Probability (must be between 0.0 and 1.0).
- Returns:
- float
Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).
The inverse CDF computation uses the algorithm by Wichura [wichura1988].
t-Distribution
t-distribution module.
This module is specialized for the kim-convergence code and is not intended as a general-purpose implementation.
- kim_convergence.stats.t_dist.t_cdf(t: float, df: float) float
Compute the cumulative distribution of the t-distribution.
The cumulative distribution of the t-distribution for t > 0, can be written in terms of the regularized incomplete beta function as,
\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]where,
\[x(t) = \frac{\nu}{{t^2+\nu}}.\]Other t values would be obtained by symmetry.
- Parameters:
t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.
- Returns:
- float
Cumulative t-distribution.
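The formula above can be checked by hand for two degrees of freedom where the regularized incomplete beta function has a closed form: I_x(1/2, 1/2) = (2/pi) arcsin(sqrt(x)) for df = 1, and I_x(1, 1/2) = 1 - sqrt(1 - x) for df = 2. The sketch below uses only these special cases; the function names are hypothetical and this is not how the package evaluates the general case.

```python
import math


def reg_inc_beta_half(a, x):
    """Closed forms of I_x(a, 1/2) for a = 1/2 and a = 1 only."""
    if a == 0.5:
        return 2.0 / math.pi * math.asin(math.sqrt(x))
    if a == 1.0:
        return 1.0 - math.sqrt(1.0 - x)
    raise ValueError("only a = 0.5 (df = 1) and a = 1.0 (df = 2) are supported")


def t_cdf_special(t, df):
    """t-distribution CDF for df in {1, 2} via the incomplete beta formula."""
    x = df / (t * t + df)
    upper = 1.0 - 0.5 * reg_inc_beta_half(df / 2.0, x)  # valid for t >= 0
    # Negative t follows from the symmetry of the distribution.
    return upper if t > 0.0 else 1.0 - upper


# df = 1 is the Cauchy distribution, whose CDF is 0.5 + atan(t) / pi.
print(t_cdf_special(1.0, 1), 0.5 + math.atan(1.0) / math.pi)
```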
- kim_convergence.stats.t_dist.t_cdf_ccdf(t: float, df: float) tuple[float, float]
Compute the cumulative distribution of the t-distribution.
The cumulative distribution of the t-distribution for t > 0, can be written in terms of the regularized incomplete beta function as,
\[\int_{-\infty}^t f(u)\,du = 1 - \frac{1}{2} I_{x(t)}\left(\frac{\nu}{2}, \frac{1}{2}\right),\]where,
\[x(t) = \frac{\nu}{{t^2+\nu}}.\]Other t values would be obtained by symmetry.
- Parameters:
t (float) – Upper limit of the integration.
df (float) – Degrees of freedom, must be a positive number.
- Returns:
- tuple[float, float]
cdf: cumulative t-distribution value. ccdf: complement of the cumulative t-distribution (1 - cdf).
- kim_convergence.stats.t_dist.t_interval(confidence_level: float, df: float, *, loc: float = 0.0, scale: float = 1.0) tuple[float, float]
Compute the t-distribution confidence interval.
Compute the t-distribution confidence interval with equal areas around the median.
- Parameters:
confidence_level (float) – Confidence level (or confidence coefficient); must be between 0.0 and 1.0.
df (float) – Degrees of freedom, must be > 0.
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)
- Returns:
- tuple[float, float]
Lower and upper bounds of the confidence interval that contains \(100 \cdot \text{confidence_level}\%\) of the t-distribution.
Note
A confidence interval is a range of values that is likely to contain an unknown population parameter.
The confidence level is the percentage of the confidence intervals which will hold the population parameter.
The significance level or alpha is the probability of rejecting the null hypothesis when it is true. To find alpha, subtract the confidence level from 100%. E.g., the significance level for a 90% confidence level is 100% – 90% = 10%.
- kim_convergence.stats.t_dist.t_inv_cdf(p: float, df: float, *, loc: float = 0.0, scale: float = 1.0, _tol: float = 1e-08, _atol: float = 1e-50, _rtinf: float = 1e+100) float
Compute the t-distribution inverse cumulative distribution function.
Compute the inverse cumulative distribution function (percent point function or quantile function) for t-distributions with df degrees of freedom. Inverse cumulative distribution function finds the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability.
- Parameters:
p (float) – Probability (must be between 0.0 and 1.0).
df (float) – Degrees of freedom, must be > 1.
loc (float, optional) – Location parameter. (default: 0.0)
scale (float, optional) – Scale parameter. (default: 1.0)
- Returns:
- float
Inverse cumulative distribution function: value \(x\) such that \(P(X \le x) = p\).
The t-distribution functions are implemented using the regularized incomplete beta function as described in standard statistical references.
Hypothesis Tests
Tests for Normally Distributed Data
Test module for normally distributed data.
Note
The tests in this module are modified and fixed for the kim-convergence package use.
- kim_convergence.stats.normal_test.chi_square_test(sample_var: float, sample_size: int, population_var: float, significance_level: float = 0.050000000000000044) bool
Chi-square test for the variance.
Calculate the chi-square test for the variance. This is a two-sided test. The test statistic is \(T=(N-1)\frac{\text{var}}{\text{var}_0}\), where N is the sample size, var is the sample variance, and var0 is the target (population) variance. The ratio var/var0 compares the sample variance to the target variance. The more this ratio deviates from 1, the more likely we are to reject the null hypothesis.
The null hypothesis is that the variance of a sample of independent observations x is equal to the given population variance, population_var.
- Parameters:
sample_var (float) – Sample variance.
sample_size (int) – Number of samples.
population_var (float) – population variance.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)
- Returns:
- bool
True if the variance of a sample of independent observations x equals the given population variance population_var.
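The test statistic itself is simple arithmetic. The sketch below computes only T; the accept/reject decision additionally needs chi-square quantiles with N - 1 degrees of freedom, which are not part of the Python standard library and are left out. The function name is hypothetical.

```python
def chi_square_statistic(sample_var, sample_size, population_var):
    """T = (N - 1) * var / var0 for the chi-square variance test."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    if population_var <= 0.0:
        raise ValueError("population_var must be positive")
    return (sample_size - 1) * sample_var / population_var


# A sample variance 20% above the target, from 25 observations.
statistic = chi_square_statistic(sample_var=1.2, sample_size=25, population_var=1.0)
print(statistic)  # close to 28.8, to be compared against 24 degrees of freedom
```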
- kim_convergence.stats.normal_test.t_test(sample_mean: float, sample_std: float, sample_size: int, population_mean: float, significance_level: float = 0.050000000000000044) bool
T-test for the mean.
Calculate the T-test for the mean. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations x is equal to the given population mean, population_mean.
- Parameters:
sample_mean (float) – Sample mean.
sample_std (float) – Sample standard deviation.
sample_size (int) – Number of samples.
population_mean (float) – Expected value in the null hypothesis.
significance_level (float) – Significance level. A probability threshold below which the null hypothesis will be rejected. (default: 0.05)
- Returns:
- bool
True if the expected value (mean) of a sample of independent observations x equals the given population mean population_mean.
Tests for Non-Normally Distributed Data
Test module for non-normally distributed data.
Note
The tests in this module are modified and fixed for the kim-convergence package use.
- kim_convergence.stats.nonnormal_test.kruskal_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool
Kruskal-Wallis H-test for independent samples.
The Kruskal-Wallis H-test tests the null hypothesis that the median of the time series data is the same as the one from population_cdf.
It is a non-parametric version of ANOVA.
- Parameters:
time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)
- Returns:
- bool
True if the median of the time-series data equals the median of the specified population distribution.
Examples:
>>> import numpy as np
>>> from scipy.stats import gamma
>>> from kim_convergence.stats.nonnormal_test import kruskal_test
>>> rng = np.random.RandomState(12345)
>>> a = 1.99
>>> x = rng.gamma(a, 1, size=20)
>>> kruskal_test(x, population_cdf='gamma', population_args=(a,), population_loc=0, population_scale=1, significance_level=0.05)
True
- kim_convergence.stats.nonnormal_test.ks_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool
Kolmogorov-Smirnov test for goodness of fit.
Note
This test is only valid for continuous distributions.
It compares the distribution of an observed variable against a given distribution.
The null hypothesis is that the observed samples are drawn from the same continuous distribution as the given distribution with population_loc and population_scale if they are given.
Note
The alternative hypothesis is two-sided: the empirical cumulative distribution function of the observed variables is either less than or greater than the cumulative distribution function of the given distribution.
The probability density of the given population distribution is in the standardized form. To shift and/or scale the distribution, use the population_loc and population_scale parameters. In these cases, the variable x is replaced by y, where y = (x - loc) / scale.
- Parameters:
time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)
- Returns:
- bool
True if the observed samples are drawn from the same continuous distribution as the given one (two-tailed p-value > significance_level).
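To make the two-sided comparison and the variable change y = (x - loc) / scale concrete, the following sketch computes the Kolmogorov-Smirnov statistic, the largest distance between the empirical CDF and a reference CDF. It stops at the statistic and does not compute a p-value; ks_statistic and uniform_cdf are hypothetical names, and the standard uniform distribution is used only because its CDF is trivial.

```python
def ks_statistic(data, cdf, loc=0.0, scale=1.0):
    """Two-sided KS statistic of `data` against a standardized `cdf`."""
    if not data:
        raise ValueError("data must not be empty")
    # Standardize the observations before evaluating the reference CDF.
    y = sorted((value - loc) / scale for value in data)
    n = len(y)
    distance = 0.0
    for i, value in enumerate(y, start=1):
        f = cdf(value)
        # The empirical CDF jumps from (i - 1) / n to i / n at each point.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def uniform_cdf(value):
    """CDF of the standard uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, value))


print(ks_statistic([0.7, 0.1, 0.4], uniform_cdf))  # about 0.3
```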
- kim_convergence.stats.nonnormal_test.levene_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool
Perform modified Levene test for equal variances.
The modified Levene test tests the null hypothesis that one sample input time_series_data is from population population_cdf with the same variance [nistdiv898b].
Note
This test always uses the median-based variant of Levene’s test.
Although the optimal choice depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.
Robustness means the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal.
Power means the ability of the test to detect unequal variances when the variances are in fact unequal.
- Parameters:
time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)
- Returns:
- bool
True if the sample variance equals the population variance (two-tailed p-value > significance_level).
Examples:
>>> import numpy as np
>>> from scipy.stats import gamma, alpha
>>> from kim_convergence.stats.nonnormal_test import levene_test
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> levene_test(x, population_cdf='gamma', population_args=(shape,), population_loc=0, population_scale=scale, significance_level=0.05)
True
>>> a = 1.99
>>> x = gamma.rvs(a, size=1000, random_state=rng)
>>> levene_test(x, population_cdf='gamma', population_args=(a,), population_loc=0, population_scale=1, significance_level=0.05)
True
>>> x = alpha.rvs(a, size=1000, random_state=rng)
>>> levene_test(x, population_cdf='gamma', population_args=(a,), population_loc=0, population_scale=1, significance_level=0.05)
False
Reject the null hypothesis at a significance level of 5%, concluding that there is a difference between the variance of time_series_data and that of the gamma distribution with shape parameter a.
Example:
>>> levene_test(x, population_cdf='alpha', population_args=(a,), population_loc=0, population_scale=1, significance_level=0.05)
True
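For reference, the median-based (Brown-Forsythe) form of Levene's statistic can be written in a few lines. The sketch below compares two explicit samples; how the package obtains a second sample from population_cdf is not described here, so that step is left out. The function name is hypothetical.

```python
from statistics import fmean, median


def levene_median_statistic(*groups):
    """Levene's W using absolute deviations from each group's median."""
    k = len(groups)
    if k < 2:
        raise ValueError("at least two groups are required")
    # Z_ij = |Y_ij - median_i|
    z = []
    for group in groups:
        center = median(group)
        z.append([abs(value - center) for value in group])
    n_total = sum(len(zi) for zi in z)
    z_means = [fmean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((value - zm) ** 2 for zi, zm in zip(z, z_means) for value in zi)
    return (n_total - k) / (k - 1) * between / within


w = levene_median_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(w)  # about 2.057
```

Under the null hypothesis of equal variances, W is compared against an F distribution with k - 1 and N - k degrees of freedom. The sketch raises ZeroDivisionError when every group has identical deviations from its median.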
- kim_convergence.stats.nonnormal_test.wilcoxon_test(time_series_data: ndarray | list[float], population_cdf: str | None, population_args: tuple, population_loc: float | None, population_scale: float | None, significance_level: float = 0.050000000000000044) bool
Calculate the Wilcoxon signed-rank test.
Here it is used as a non-parametric test to determine whether an unknown population mean is different from a specific value.
- Parameters:
time_series_data (np.ndarray) – time series data.
population_cdf (Optional[str]) – The name of a distribution.
population_args (tuple) – Distribution parameter.
population_loc (Optional[float]) – location of the distribution.
population_scale (Optional[float]) – scale of the distribution.
significance_level (float, optional) – Probability threshold below which the null hypothesis is rejected. (default: 0.05)
- Returns:
- bool
True if the sample is drawn from the specified population distribution.
Examples:
>>> import numpy as np
>>> from scipy.stats import gamma
>>> rng = np.random.RandomState(12345)
>>> shape, scale = 2., 2.
>>> x = rng.gamma(shape, scale, size=1000)
>>> wilcoxon_test(x, population_cdf='gamma', population_args=(shape,), population_loc=0, population_scale=scale, significance_level=0.05)
True
>>> wilcoxon_test(x, population_cdf='gamma', population_args=(shape,), population_loc=0, population_scale=1, significance_level=0.05)
False
The non-parametric tests in this module rely on distributions from SciPy [scipystats]. Available non-parametric tests include:
Kolmogorov-Smirnov test (ks_test): Tests if samples come from a given distribution
Levene’s test (levene_test): Tests for equal variances [nistdiv898b]
Wilcoxon signed-rank test (wilcoxon_test): Tests if median differs from a value
Kruskal-Wallis H-test (kruskal_test): Non-parametric version of ANOVA
Time Series Analysis Tools
Tools module.
Helper functions for time series analysis.
Environment Variables
KIM_CONV_FORCE_SUBPROC
If set (to any value), forces correlation and periodogram computations to run in isolated subprocesses using multiprocessing with the “spawn” start method.
This is primarily intended to avoid threading conflicts when kim-convergence is used inside heavily multi-threaded simulation codes (e.g., LAMMPS with OpenMP). It prevents nested parallelism issues that can cause deadlocks or severe performance degradation.
Performance warning:
In production simulations with large datasets: moderate overhead (typically 10-30% slower).
In unit tests, small datasets, or frequent calls: extremely high overhead (can be 1000x or more slower, especially on macOS) due to repeated process spawning and data copying.
Never set this variable when running unit tests or during development. It is intended only as an escape hatch for real simulation runs that exhibit threading deadlocks.
Example usage (only when needed):
export KIM_CONV_FORCE_SUBPROC=1
mpirun -np 8 lmp -in in.my_simulation  # or similar
This flag is optional and should remain unset in nearly all cases.
- kim_convergence.stats.tools.auto_correlate(x: ndarray | list[float], *, nlags: int | None = None, fft: bool = False) ndarray
Calculate the auto-correlation function.
Calculate the auto-correlation function of the input array for nlags lags. This estimator is biased.
- Parameters:
x (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags for which to return the auto-correlation. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)
- Returns:
- ndarray
Calculated auto correlation function.
- kim_convergence.stats.tools.auto_covariance(x: ndarray | list[float], *, fft: bool = False) ndarray
Calculate biased auto-covariance estimates.
Compute auto-covariance estimates for every lag for the input array. This estimator is biased.
\[\gamma_k = \frac{1}{N}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]
Note
Some sources use the following formula for computing the autocovariance:
\[\gamma_k = \frac{1}{N-k}\sum\limits_{t=1}^{N-k}(x_t-\bar{x})(x_{t+k}-\bar{x})\]This definition has less bias than the one used here. However, the \(\frac{1}{N}\) formulation has some desirable statistical properties and is the most commonly used in the statistics literature.
- Parameters:
x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
- Returns:
- 1darray
Estimated autocovariances.
- Raises:
CRError – If input validation fails.
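The biased estimator with the 1/N normalization can be written out directly. This is a plain-Python sketch for small inputs, not the package's (optionally FFT-based) implementation; the function names are hypothetical. Dividing by the lag-0 value gives the corresponding auto-correlation.

```python
def auto_covariance_sketch(x):
    """Biased auto-covariance estimates for lags 0 .. N-1 (1/N normalization)."""
    n = len(x)
    if n == 0:
        raise ValueError("x must not be empty")
    mean = sum(x) / n
    dev = [value - mean for value in x]
    # Lag k sums the N - k overlapping products but still divides by N.
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n for k in range(n)]


def auto_correlation_sketch(x):
    """Auto-correlation as auto-covariance normalized by the lag-0 value."""
    gamma = auto_covariance_sketch(x)
    if gamma[0] == 0.0:
        raise ValueError("x has zero variance")
    return [g / gamma[0] for g in gamma]


print(auto_covariance_sketch([1.0, 2.0, 3.0, 4.0, 5.0]))  # lag 0 is 2.0, lag 1 is 0.8
```

Note that the output has length N, one value per non-negative lag.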
- kim_convergence.stats.tools.cross_correlate(x: ndarray | list[float], y: ndarray | list[float] | None, *, nlags: int | None = None, fft: bool = False) ndarray
Calculate the cross-correlation function.
Calculate the cross-correlation function of the input arrays for nlags lags. This estimator is biased.
- Parameters:
x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
nlags (int > 0 or None, optional) – Number of lags for which to return the cross-correlation. (default: None)
fft (bool, optional) – Use FFT convolution for long series. (default: False)
- Returns:
- ndarray
Calculated cross correlation.
- kim_convergence.stats.tools.cross_covariance(x: ndarray | list[float], y: ndarray | list[float] | None, *, fft: bool = False) ndarray
Calculate the biased cross covariance estimate between two time series.
Calculate the cross covariance between two time series for every lag for the input arrays. This estimator is biased.
- Parameters:
x (array_like, 1d) – Time series data.
y (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
- Returns:
- 1darray
Calculated cross covariances.
- Raises:
CRError – If input validation fails.
- kim_convergence.stats.tools.int_power(x: ndarray | list[float], exponent: int) ndarray
Array elements raised to the power exponent.
- Parameters:
x (array_like, 1d) – The bases.
exponent (int) – The exponent
- Returns:
- 1darray
Computed power array.
- kim_convergence.stats.tools.modified_periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray
Compute a modified periodogram to estimate the power spectrum.
Estimate the power spectrum using a modified periodogram. A periodogram [heidelberger1981] is an estimate of the spectral density of a signal and it is defined as,
\[\left \{ I\left(\frac{k}{n}\right) \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor},\; I\left( \frac{k}{n} \right) = \left| \sum_{j=0}^{n-1} {x(j) e^{-2\pi i j k / n}} \right|^2 / n\]
- Parameters:
x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)
- Returns:
- 1darray
Computed modified periodogram array.
Note
This function does not return the array of sample frequencies. In case of need, one can compute it as,
\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]or
>>> f = np.arange(1., x.size//2 + 1) / x.size
- Raises:
CRError – If input validation fails.
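The periodogram definition above can be evaluated with a direct discrete Fourier sum. The sketch below implements only that plain definition for k = 1, ..., floor(n/2); whatever modification the package applies is not reproduced, and the function names are hypothetical. It costs O(n^2), so it is meant for checking small cases.

```python
import cmath
import math


def periodogram_sketch(x, with_mean=False):
    """I(k/n) = |sum_j x(j) exp(-2 pi i j k / n)|^2 / n for k = 1..n//2."""
    n = len(x)
    if n < 2:
        raise ValueError("x must contain at least two points")
    if with_mean:
        # Mirror the documented flag: work with x minus its mean.
        mean = sum(x) / n
        x = [value - mean for value in x]
    result = []
    for k in range(1, n // 2 + 1):
        s = sum(value * cmath.exp(-2j * math.pi * j * k / n)
                for j, value in enumerate(x))
        result.append(abs(s) ** 2 / n)
    return result


def sample_frequencies(n):
    """Frequencies k/n matching the entries of periodogram_sketch."""
    return [k / n for k in range(1, n // 2 + 1)]


# One full cosine cycle over four samples: all power sits at frequency 1/4.
print(sample_frequencies(4), periodogram_sketch([1.0, 0.0, -1.0, 0.0]))
```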
- kim_convergence.stats.tools.moment(x: ndarray | list[float], *, moment: int = 1) float
Calculates the nth moment about the mean for a sample.
- Parameters:
x (array_like, 1d) – Time series data.
moment (int, optional) – Order of central moment that is returned. (default: 1)
- Returns:
- float
n-th central moment.
Note
The k-th central moment of a time series data,
\[m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k,\]where \(n\) is the number of samples and \(\bar{x}\) is the mean.
- kim_convergence.stats.tools.periodogram(x: ndarray | list[float], *, fft: bool = False, with_mean: bool = False) ndarray
Compute a periodogram to estimate the power spectrum.
- Parameters:
x (array_like, 1d) – Time series data.
fft (bool, optional) – Use FFT convolution for long series. (default: False)
with_mean (bool, optional) – If True, use x minus its mean. (default: False)
- Returns:
- 1darray
Computed power spectrum array.
Note
This function does not return the array of sample frequencies. In case of need, one can compute it as,
\[f = \left \{ \frac{k}{n} \right \}_{k = 1, \cdots, \left \lfloor \frac{n}{2} \right \rfloor}\]or
>>> f = np.arange(1., x.size//2 + 1) / x.size
- kim_convergence.stats.tools.skew(x: ndarray | list[float], *, bias: bool = False) float
Compute the time series data set skewness [zwillinger2000].
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
- Parameters:
x (array_like, 1d) – Time series data.
bias (bool, optional) – If False, then the calculations are corrected for statistical bias. (default: False)
- Returns:
- float
The skewness
Note
For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.
The sample skewness is computed as the Fisher-Pearson coefficient of skewness \(g_1 = \frac{m_3}{m_2^{3/2}}\), where \(m_i\) is the biased sample \(i\)-th central moment. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.
\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}.\]
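Both coefficients follow directly from the biased central moments. The sketch below is a plain-Python illustration of the two formulas, with hypothetical function names; it is not the package's implementation.

```python
import math


def central_moment(x, k):
    """Biased k-th central moment: (1/n) * sum((x_i - mean)^k)."""
    n = len(x)
    mean = sum(x) / n
    return sum((value - mean) ** k for value in x) / n


def skew_sketch(x, bias=False):
    """Fisher-Pearson skewness g1, or the adjusted G1 when bias is False."""
    n = len(x)
    m2 = central_moment(x, 2)
    m3 = central_moment(x, 3)
    if m2 == 0.0:
        raise ValueError("x has zero variance")
    g1 = m3 / m2 ** 1.5
    if bias:
        return g1
    if n < 3:
        raise ValueError("the bias correction needs at least 3 points")
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


data = [1.0, 2.0, 3.0, 4.0, 10.0]  # one large value gives a long right tail
print(skew_sketch(data, bias=True), skew_sketch(data, bias=False))
```

For this data m_2 = 10 and m_3 = 36, so g_1 = 36 / 10^{3/2} (about 1.138) and G_1 = 1.2 * sqrt(2) (about 1.697).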
Randomness Test
Independence test module.
- kim_convergence.stats.randomness_test.randomness_test(x: ndarray | list[float], significance_level: float) bool
Testing for independence of observations.
The von-Neumann ratio test of independence of variables is a test designed for checking the independence of subsequent observations.
The null hypothesis is that the data are independent and normally distributed.
- Parameters:
x (array_like, 1d) – Time series data.
significance_level (float) – Probability threshold below which the null hypothesis is rejected.
- Returns:
- bool
True if the observations are independent.
Note
Given a series \(x\) of \(n\) data points, the Von-Neumann test [vonneumann1941] [vonneumann1941b] statistic is
\[v = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]Under the null hypothesis of independence, the mean \(\bar{v} = 2\) and the variance \(\sigma^2_v = \frac{4 (n - 2)}{(n^2-1)}\) (see [williams1941], and [madansky1988] for a simple derivation).
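The statistic and its null moments are enough to sketch a complete test. The version below assumes a two-sided normal approximation for v, which is an assumption made for this illustration and not necessarily how the package derives its decision; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def von_neumann_ratio(x):
    """v = sum of squared successive differences / sum of squared deviations."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 data points are required")
    mean = sum(x) / n
    numerator = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((value - mean) ** 2 for value in x)
    if denominator == 0.0:
        raise ValueError("x has zero variance")
    return numerator / denominator


def randomness_test_sketch(x, significance_level):
    """True if independence is not rejected (normal approximation assumed)."""
    n = len(x)
    v = von_neumann_ratio(x)
    sigma = math.sqrt(4.0 * (n - 2) / (n * n - 1.0))
    z = (v - 2.0) / sigma  # standardize with the null mean of 2
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return p_value > significance_level


print(von_neumann_ratio([1, 2, 3, 4, 5]))             # 0.4, far below 2 for a trend
print(randomness_test_sketch([1, 2, 3, 4, 5], 0.05))  # False: strong serial correlation
```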
Common Usage Patterns
Testing the mean of normally distributed data:
import numpy as np
from kim_convergence.stats.normal_test import t_test
# Generate sample data
data = np.random.normal(loc=0, scale=1, size=100)
# Perform a two-sided t-test against a population mean of 0
result = t_test(
    sample_mean=np.mean(data),
    sample_std=np.std(data, ddof=1),  # sample standard deviation
    sample_size=len(data),
    population_mean=0,
    significance_level=0.05
)
print(f"Null hypothesis not rejected: {result}")
Checking time series randomness:
import numpy as np
from kim_convergence.stats.randomness_test import randomness_test
# Placeholder data; replace with the time series to be tested
time_series_data = np.random.normal(size=1000)
# Check if time series exhibits independence
is_random = randomness_test(time_series_data, significance_level=0.05)
if is_random:
    print("Time series appears independent")
else:
    print("Time series shows serial correlation")
Testing against a specific distribution:
from kim_convergence.stats.nonnormal_test import ks_test
# Test if data comes from a gamma distribution
is_gamma = ks_test(
time_series_data,
population_cdf='gamma',
population_args=(2.0,), # shape parameter
population_loc=0,
population_scale=1.0,
significance_level=0.05
)
Computing autocorrelation:
from kim_convergence.stats.tools import auto_correlate
# Compute autocorrelation with FFT optimization
autocorr = auto_correlate(time_series_data, nlags=50, fft=True)
# First few lags (excluding lag 0 which is always 1.0)
print(f"Autocorrelation at lag 1: {autocorr[1]:.3f}")
print(f"Autocorrelation at lag 2: {autocorr[2]:.3f}")
Performance Considerations
FFT Optimization
For long time series, always use fft=True in autocorrelation functions:
# For time series with > 1000 points
autocorr = auto_correlate(large_time_series, fft=True)
crosscorr = cross_correlate(x, y, fft=True)
The get_fft_optimal_size() function finds optimal sizes for FFT computations
by returning the smallest 5-smooth number (factors 2, 3, 5 only) greater than
or equal to the input size [statsmodels].
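The idea of a 5-smooth target size can be illustrated with a brute-force search. This sketch is not the get_fft_optimal_size() implementation; the function names are hypothetical, and a simple upward scan is used for clarity rather than speed.

```python
def is_five_smooth(n):
    """True if n has no prime factors other than 2, 3, and 5."""
    for prime in (2, 3, 5):
        while n % prime == 0:
            n //= prime
    return n == 1


def next_five_smooth(n):
    """Smallest 5-smooth number greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    while not is_five_smooth(n):
        n += 1
    return n


print(next_five_smooth(1001))  # 1024, since no number in 1001..1023 is 5-smooth
```

Padding a series of 1001 points to 1024 keeps the FFT length free of large prime factors, which is where FFT implementations tend to be slowest.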
Sample Size Requirements
Non-parametric tests: Require at least 5 data points
Randomness test: Requires at least 3 data points
t-distribution functions: Degrees of freedom must be > 1
KS test: Most effective with moderate to large sample sizes (>30)
Numerical Stability
Use bias=False in the skew() function for bias-corrected estimation
Distribution functions handle edge cases (e.g., p=0, p=1) appropriately
Memory Usage
FFT-based functions create temporary arrays of optimal FFT size
Auto/cross-covariance functions return arrays of length N (not 2N-1)
Consider using the nlags parameter to limit output size for long series
Error Handling
All functions raise appropriate exceptions:
CRError: For general errors and invalid inputs
CRSampleSizeError: For insufficient sample sizes
Value errors for out-of-range parameters (e.g., p ∉ [0,1])