Other Utilities

Support tools for data preprocessing, error handling, and quality control.

Batch Means

Compute batch means from time series data. The batch function:

  • Divides data into non-overlapping batches of size batch_size

  • Applies a reduction function (default: np.mean) to each batch

  • Supports optional scaling (centering, standardization) after batching

  • Truncates remainder data points that don’t fit into complete batches

  • Returns a view (not a copy) when no scaling is requested
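The batching steps listed above can be sketched in plain NumPy. This is a minimal illustration only, not the library's implementation: the name batch_means_sketch and its func parameter are hypothetical, and scaling is left out.

```python
import numpy as np

def batch_means_sketch(x, batch_size, func=np.mean):
    """Illustrative only: reduce non-overlapping batches of a 1-D series."""
    x = np.asarray(x, dtype=float)
    n_batches = x.size // batch_size
    if n_batches == 0:
        raise ValueError("batch_size is larger than the number of data points")
    # Drop the trailing points that do not fill a complete batch.
    trimmed = x[: n_batches * batch_size]
    # One row per batch, then apply the reduction along each row.
    return func(trimmed.reshape(n_batches, batch_size), axis=1)

series = np.arange(10.0)  # 0, 1, ..., 9
means = batch_means_sketch(series, batch_size=3)
print(means)  # batches [0, 1, 2], [3, 4, 5], [6, 7, 8]; the last point is dropped
```

Any reduction that accepts an axis argument (for example np.median) can be passed as func in this sketch.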

Scaling Methods

Five scaling methods are provided:

  • minmax_scale – Scale to a specified feature range (default: [0, 1])

  • translate_scale – Translate so the first element is zero, then scale by the mean

  • standard_scale – Remove mean and scale to unit variance

  • robust_scale – Center to median, scale by interquartile range

  • maxabs_scale – Scale to [-1, 1] range by maximum absolute value

Each method is available both as a function and as a class with scale() and inverse() methods.
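To make the function/class pairing concrete, here is a rough sketch of two of the methods written directly in NumPy. The names StandardScaleSketch and robust_scale_sketch are hypothetical stand-ins, and details such as the handling of zero spread are assumptions rather than the library's documented behavior.

```python
import numpy as np

class StandardScaleSketch:
    """Hypothetical stand-in: remove the mean and scale to unit variance."""

    def scale(self, x):
        x = np.asarray(x, dtype=float)
        self.mean_ = x.mean()
        self.std_ = x.std()
        if self.std_ == 0.0:
            # Assumption: leave constant data unscaled instead of dividing by zero.
            self.std_ = 1.0
        return (x - self.mean_) / self.std_

    def inverse(self, z):
        # Undo scale() using the statistics stored by the last call.
        return np.asarray(z, dtype=float) * self.std_ + self.mean_

def robust_scale_sketch(x):
    """Hypothetical stand-in: center to the median, scale by the IQR."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    return (x - med) / (iqr if iqr > 0 else 1.0)

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaler = StandardScaleSketch()
z = scaler.scale(x)
restored = scaler.inverse(z)   # recovers x up to rounding
r = robust_scale_sketch(x)     # the median maps to 0; the outlier stays large
```

The class form is useful when the original units must be recovered later, since the fitted statistics are kept on the instance.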

Error Handling

Custom exception classes and validation utilities:

  • CRError – Base exception with caller identification

  • CRSampleSizeError – Raised for insufficient data samples

  • cr_warning() – Print warning messages with caller context

  • cr_check() – Validate variable types and bounds

  • Decorators: _check_ndim, _check_isfinite for input validation
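The decorator-based validation mentioned above might look roughly like the following. This is a sketch under stated assumptions: SketchError, check_ndim_sketch, and check_isfinite_sketch are hypothetical names, and the real _check_ndim and _check_isfinite decorators may have different signatures and messages.

```python
import functools
import numpy as np

class SketchError(Exception):
    """Hypothetical stand-in for a library base exception."""

def check_ndim_sketch(ndim=1):
    """Reject inputs whose number of dimensions is not `ndim`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(x, *args, **kwargs):
            x = np.asarray(x)
            if x.ndim != ndim:
                # Include the wrapped function's name to identify the caller.
                raise SketchError(
                    f"{func.__name__}: expected a {ndim}-D array, got {x.ndim}-D"
                )
            return func(x, *args, **kwargs)
        return wrapper
    return decorator

def check_isfinite_sketch(func):
    """Reject inputs that contain NaN or infinite values."""
    @functools.wraps(func)
    def wrapper(x, *args, **kwargs):
        x = np.asarray(x, dtype=float)
        if not np.all(np.isfinite(x)):
            raise SketchError(f"{func.__name__}: input contains NaN or infinity")
        return func(x, *args, **kwargs)
    return wrapper

@check_ndim_sketch(ndim=1)
@check_isfinite_sketch
def series_mean(x):
    return float(np.mean(x))

print(series_mean([1.0, 2.0, 3.0]))
```

Stacking the decorators keeps the validation out of the function body, so each numerical routine can assume a finite 1-D array.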

Outlier Detection

Seven method names are accepted, covering four tests (names joined by a slash are aliases):

  • iqr / boxplot – Points more than 1.5 × IQR below the first quartile or above the third quartile

  • extreme_iqr / extreme_boxplot – Same test with a 3 × IQR margin

  • z_score / standard_score – \(|Z| > 3\), with Z computed from the mean and standard deviation

  • modified_z_score – Robust version using median and MAD (\(|Z|\) > 3.5)

outlier_test returns a 1-D NumPy array of outlier indices, or None if no outliers are found.
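As a rough illustration of two of these tests, the sketch below implements the IQR rule and the modified z-score in NumPy. The function names are hypothetical, and the 0.6745 constant (the usual Iglewicz-Hoaglin normalization) and the zero-MAD handling are assumptions, not confirmed details of the library.

```python
import numpy as np

def iqr_outliers_sketch(x, factor=1.5):
    """Indices of points beyond factor * IQR from the quartiles, or None."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - factor * iqr) | (x > q3 + factor * iqr)
    idx = np.flatnonzero(mask)
    return idx if idx.size else None

def modified_z_score_outliers_sketch(x, threshold=3.5):
    """Indices where the median/MAD based score exceeds threshold, or None."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0.0:
        # Assumption: with zero MAD the score is undefined, so report nothing.
        return None
    score = 0.6745 * (x - med) / mad
    idx = np.flatnonzero(np.abs(score) > threshold)
    return idx if idx.size else None

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
print(iqr_outliers_sketch(x))               # flags the last point
print(modified_z_score_outliers_sketch(x))  # flags the last point
```

Passing factor=3.0 to the first function corresponds to the 3 × IQR variant.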

Data Splitting

train_test_split randomly partitions indices for training and testing:

  • Supports absolute counts or fractions for train/test sizes

  • Validates that splits are feasible given data size

  • Accepts an optional seed for reproducible splits

  • Uses NumPy’s random number generator internally

  • Returns two NumPy index arrays: (train_idx, test_idx)
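The index-splitting behavior described above can be approximated as follows. This is a simplified sketch: split_indices_sketch is a hypothetical name, it handles only a test size, and the use of np.random.default_rng is an assumption about which NumPy generator is involved.

```python
import numpy as np

def split_indices_sketch(n, test_size=0.2, seed=None):
    """Return (train_idx, test_idx) for n data points."""
    if isinstance(test_size, float):
        # A float is read as a fraction of the data.
        if not 0.0 < test_size < 1.0:
            raise ValueError("fractional test_size must be in (0, 1)")
        n_test = int(round(test_size * n))
    else:
        # An integer is read as an absolute count.
        n_test = int(test_size)
    if not 0 < n_test < n:
        raise ValueError("split leaves an empty train or test set")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)  # shuffled indices 0 .. n-1
    return perm[n_test:], perm[:n_test]

train_idx, test_idx = split_indices_sketch(100, test_size=0.2, seed=42)
print(train_idx.size, test_idx.size)
```

Because only indices are returned, the same split can be applied to several aligned arrays.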

Quick Example

import numpy as np
from kim_convergence import batch, standard_scale, outlier_test, train_test_split

data = np.random.randn(1000)

# Batch the data
batched = batch(data, batch_size=10)

# Scale to zero mean, unit variance
scaled = standard_scale(batched)

# Check for outliers (an array of indices, or None if there are none)
outliers = outlier_test(scaled, outlier_method='iqr')

# Split for validation
train_idx, test_idx = train_test_split(data, test_size=0.2, seed=42)

Usage Hints

  • Batch means: Use for variance estimation in correlated data

  • Scaling: Apply robust_scale when outliers are present

  • Outlier detection: modified_z_score works best for small datasets

  • Data splitting: Set seed for reproducible cross-validation
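To illustrate the first hint, the sketch below compares a naive standard error of the mean with a batch-means estimate on synthetic correlated data. It uses plain NumPy rather than the library, and the AR(1) model, its coefficient, and the batch size of 100 are all assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(1) series with strong positive correlation (phi = 0.9).
n, phi = 10_000, 0.9
noise = rng.standard_normal(n)
x = np.empty(n)
x[0] = noise[0]
for i in range(1, n):
    x[i] = phi * x[i - 1] + noise[i]

# Naive standard error: treats every point as independent.
se_naive = x.std(ddof=1) / np.sqrt(n)

# Batch-means standard error: batch means are roughly independent when
# the batches are long compared with the correlation time.
batch_size = 100
n_batches = n // batch_size
means = x[: n_batches * batch_size].reshape(n_batches, batch_size).mean(axis=1)
se_batch = means.std(ddof=1) / np.sqrt(n_batches)

print(f"naive SE: {se_naive:.4f}, batch-means SE: {se_batch:.4f}")
```

For positively correlated data the naive estimate is too small, which is why the batch-means value is the one to report.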